Skip to content

feat(log): add --log-request-stats per-request statistics logging#512

Open
LorrinWWW wants to merge 10 commits into
mainfrom
jue/req-stats
Open

feat(log): add --log-request-stats per-request statistics logging#512
LorrinWWW wants to merge 10 commits into
mainfrom
jue/req-stats

Conversation

@LorrinWWW

Copy link
Copy Markdown
Contributor

Summary

Adds an opt-in --log-request-stats flag that emits one structured RequestStats(...) line per request when it finishes or aborts, for latency/throughput debugging. Each line reports the request id, status/finish-reason, token counts (prompt/cache/output) and cache-hit rate, a host-side timing breakdown (queue, prefill, TTFT, total, preemption time/count), decode throughput, and spec-decode acceptance length/rate. All values are derived from timestamps and counters already available on the host — scheduled time is stamped pre-forward in the event loop and the rest post-forward in the output processor — so no GPU sync is introduced and there is no engine slowdown; when the flag is off, requests carry a shared no-op tracker so the hot path stays guard-free. The stats data model (tracker + summary) lives in a dedicated request_stats.py module, and the feature is covered by unit tests plus an end-to-end check.

Test Plan

Adds an opt-in per-request performance log emitted when a request finishes or
aborts, for latency/throughput debugging. Replaces the bare
"Req: <rid> Finish! Accept_num_tokens_avg" line with a Python-object repr:

  RequestStats(rid=..., status='finished', reason='stop', prompt_tokens=...,
    cache_tokens=..., output_tokens=..., cache_hit_rate=..., queue_ms=...,
    prefill_ms=..., ttft_ms=..., total_ms=..., preempt_ms=..., preempt_count=...,
    decode_tps=..., acc_len=..., acc_rate=..., recv_ts=..., commit_ts=...,
    finish_ts=...)

Every value is derived from host-side timestamps/counters already available in
the scheduler process (received via the request's created_time; scheduled/
prefill-done/first-token/finish stamped in post_process_forward_op; counts and
spec acceptance from RequestState). No new GPU sync is introduced, so it adds
no engine slowdown. Preemption (count + time a request's decode was delayed by
others' prefill) is tracked host-side, best-effort, from forward_op.

Gated by --log-request-stats (default off). http_status_code is intentionally
omitted: it is only known at the HTTP/SMG gateway, not in the engine.

Tests cover the disabled (zero-overhead) path, the derived-field formatting,
the aborted + spec-decode case, and end-to-end timestamp recording through
post_process_forward_op. Docs: configuration/server.md Observability section.

Signed-off-by: Jue Wang <zjuwangjue@gmail.com>
@LorrinWWW LorrinWWW requested a review from gongwei-130 June 24, 2026 22:51
@jaywme jaywme marked this pull request as ready for review June 27, 2026 02:43
@jaywme jaywme requested a review from a team as a code owner June 27, 2026 02:43
@jaywme jaywme self-requested a review June 27, 2026 02:44

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a3cb8ed467

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +417 to +418
if self.log_request_stats:
state.stats = RequestStatsTracker()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ensure admission-time aborts get stats trackers

When --log-request-stats is enabled, a real tracker is only installed in register(), but _process_new_requests() sends invalid-session or grammar-aborted requests through publish_finished_at_admission() without calling register() (event_loop.py:1021). Those requests therefore keep NOOP_STATS and the admission-finish path never logs a RequestStats(...) line, so the new flag silently omits these aborts despite promising one line per finished/aborted request.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b53855920

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +181 to +183
acc_len = rs.accept_draft_tokens or 0.0
acc_rate = (
round(max(0.0, acc_len - 1.0) / spec_num_tokens, 4)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct spec-decode acceptance math

When speculative decoding is enabled, rs.accept_draft_tokens is the extras-per-verify value; the repo documents this in test/runtime/models/test_mm25_perf.py as needing + 1 to get true acceptance length. This code logs that value directly as acc_len and then subtracts one more for acc_rate, so a request averaging 3 accepted drafts with draft width 4 is reported as acc_len=3.0, acc_rate=0.5 instead of an acceptance length of 4.0 and a draft acceptance rate of 0.75, making the new per-request stats misleading for spec-decode tuning.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants