feat(log): add --log-request-stats per-request statistics logging#512
feat(log): add --log-request-stats per-request statistics logging#512LorrinWWW wants to merge 10 commits into
Conversation
Adds an opt-in per-request performance log emitted when a request finishes or
aborts, for latency/throughput debugging. Replaces the bare
"Req: <rid> Finish! Accept_num_tokens_avg" line with a Python-object repr:
RequestStats(rid=..., status='finished', reason='stop', prompt_tokens=...,
cache_tokens=..., output_tokens=..., cache_hit_rate=..., queue_ms=...,
prefill_ms=..., ttft_ms=..., total_ms=..., preempt_ms=..., preempt_count=...,
decode_tps=..., acc_len=..., acc_rate=..., recv_ts=..., commit_ts=...,
finish_ts=...)
Every value is derived from host-side timestamps/counters already available in
the scheduler process (received via the request's created_time; scheduled/
prefill-done/first-token/finish stamped in post_process_forward_op; counts and
spec acceptance from RequestState). No new GPU sync is introduced, so it adds
no engine slowdown. Preemption (count + time a request's decode was delayed by
others' prefill) is tracked host-side, best-effort, from forward_op.
Gated by --log-request-stats (default off). http_status_code is intentionally
omitted: it is only known at the HTTP/SMG gateway, not in the engine.
Tests cover the disabled (zero-overhead) path, the derived-field formatting,
the aborted + spec-decode case, and end-to-end timestamp recording through
post_process_forward_op. Docs: configuration/server.md Observability section.
Signed-off-by: Jue Wang <zjuwangjue@gmail.com>
# Conflicts: # test/runtime/test_generation_output_processor.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3cb8ed467
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self.log_request_stats: | ||
| state.stats = RequestStatsTracker() |
There was a problem hiding this comment.
Ensure admission-time aborts get stats trackers
When --log-request-stats is enabled, a real tracker is only installed in register(), but _process_new_requests() sends invalid-session or grammar-aborted requests through publish_finished_at_admission() without calling register() (event_loop.py:1021). Those requests therefore keep NOOP_STATS and the admission-finish path never logs a RequestStats(...) line, so the new flag silently omits these aborts despite promising one line per finished/aborted request.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4b53855920
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| acc_len = rs.accept_draft_tokens or 0.0 | ||
| acc_rate = ( | ||
| round(max(0.0, acc_len - 1.0) / spec_num_tokens, 4) |
There was a problem hiding this comment.
Correct spec-decode acceptance math
When speculative decoding is enabled, rs.accept_draft_tokens is the extras-per-verify value; the repo documents this in test/runtime/models/test_mm25_perf.py as needing + 1 to get true acceptance length. This code logs that value directly as acc_len and then subtracts one more for acc_rate, so a request averaging 3 accepted drafts with draft width 4 is reported as acc_len=3.0, acc_rate=0.5 instead of an acceptance length of 4.0 and a draft acceptance rate of 0.75, making the new per-request stats misleading for spec-decode tuning.
Useful? React with 👍 / 👎.
Summary
Adds an opt-in --log-request-stats flag that emits one structured RequestStats(...) line per request when it finishes or aborts, for latency/throughput debugging. Each line reports the request id, status/finish-reason, token counts (prompt/cache/output) and cache-hit rate, a host-side timing breakdown (queue, prefill, TTFT, total, preemption time/count), decode throughput, and spec-decode acceptance length/rate. All values are derived from timestamps and counters already available on the host — scheduled time is stamped pre-forward in the event loop and the rest post-forward in the output processor — so no GPU sync is introduced and there is no engine slowdown; when the flag is off, requests carry a shared no-op tracker so the hot path stays guard-free. The stats data model (tracker + summary) lives in a dedicated request_stats.py module, and the feature is covered by unit tests plus an end-to-end check.
Test Plan