Add TensorRT-LLM support for inference backend #121

Open
chungen04 wants to merge 28 commits into lightseekorg:main from chungen04:chungen/trtllm

Conversation

@chungen04 chungen04 commented Jun 16, 2026

Contributor

Summary

Integrates TensorRT-LLM (v1.3.0rc18) as an inference backend for hidden-state capture, alongside the existing vLLM and SGLang backends. Validated end-to-end against SGLang on Qwen3-8B (EAGLE-3 / DFlash).

Approach

TRT-LLM already ships the EAGLE3 capture machinery natively (SaveHiddenStatesDecodingConfig / SaveHiddenStatesResourceManager), so this integration is thin. It builds on that mode and adds a small patch that redirects captured aux + final hidden states to Mooncake instead of writing .pt files to disk.

Changes

  • patches/trtllm/v1.3.0rc18/trtllm.patch: env-gated
    (TORCHSPEC_TRTLLM_MOONCAKE) Mooncake redirect in save_hidden_state.py.
    • Adds batched-prefill capture (upstream only stored a single request per forward) by de-interleaving the packed capture buffer per request, walking context_requests in native order to match how the executor flattens tokens.
    • Keys the store on the client request id + DP rank.
    • Forwards Mooncake/TorchSpec env to MPI workers.
    • Relaxes the max_batch_size=1 cap for hidden-state capture.
    • Disables KV block reuse, overlap scheduler, and chunked prefill (all required for correct, full-prefill capture).
  • torchspec/inference/engine/trtllm_engine.py:
    • TrtllmEngine: Ray actor wrapping the PyTorch-backend LLM
    • Maps RequestOutput.request_id to Mooncake key using the same sanitizer as the patch.
  • factory.py / config:
    • Includes "trtllm" dispatch, TrtllmConfig, trtllm_ flatten prefix.
    • Example config in configs/trtllm_qwen3_8b.yaml.
  • docker/trtllm/v1.3.0rc18/Dockerfile + justfile branch + [trtllm] extra
    • patches the installed package at build time with an assertion guard, and includes the CUDA-13 Mooncake wheel fix.
  • configs/trtllm_qwen3_8b.yaml + README entry + qwen3-8b-single-node example wired to the shared launcher.
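
A minimal sketch of the per-request de-interleaving described in the patch bullet above, assuming the packed capture buffer holds one row per token and that the requests are listed in the same order the forward pass packed them. `split_packed` and the `(request_id, num_tokens)` tuples are hypothetical stand-ins, not the patch's actual code:

```python
def split_packed(packed_rows, requests):
    """Slice a packed token buffer into per-request chunks.

    packed_rows: per-token rows, in the order the forward pass packed them.
    requests: (request_id, num_tokens) pairs in that same packing order.
    """
    out, offset = {}, 0
    for request_id, num_tokens in requests:
        out[request_id] = packed_rows[offset:offset + num_tokens]
        offset += num_tokens
    if offset != len(packed_rows):
        raise ValueError(f"expected {offset} rows, got {len(packed_rows)}")
    return out


# Two requests packed back to back: "a" has 2 tokens, "b" has 3.
packed = ["a0", "a1", "b0", "b1", "b2"]
chunks = split_packed(packed, [("a", 2), ("b", 3)])

# Walking the requests in any other order assigns rows to the wrong request.
wrong = split_packed(packed, [("b", 3), ("a", 2)])
```

The second call shows why the iteration order matters: the slice lengths are still valid, so nothing fails, but request "b" silently receives rows that belong to "a".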

Performance

Setting: Qwen3-8B on a single node with 4×B300 (2 GPUs for inference at TP=2, 2 GPUs for FSDP training), the open-perfectblend dataset (not re-generated), global batch size 8, 10k steps, and 512 samples held out for eval. Each backend evaluates on its own captured eval cache.

Quality is verified equivalent (eval accuracy matches to within 0.001 at the checkpoints below):

| Step | SGLang | TRT-LLM |
|---|---|---|
| 2000 | 0.496 | 0.495 |
| 6000 | 0.620 | 0.619 |
| 10000 | 0.650 | 0.650 |

End to end, TRT-LLM is ~3.6× faster (training-loop wall-clock):

| Backend | Wall-clock |
|---|---|
| SGLang | 18886 s (5h15m) |
| TRT-LLM | 5297 s (1h28m) |
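
The ~3.6× figure follows directly from the two wall-clock times above. A quick check (the `hm` helper is only for this illustration):

```python
sglang_s = 18886   # SGLang training-loop wall-clock, seconds
trtllm_s = 5297    # TRT-LLM training-loop wall-clock, seconds

speedup = sglang_s / trtllm_s


def hm(seconds):
    """Format seconds as XhYYm, rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


print(f"{speedup:.2f}x", hm(sglang_s), hm(trtllm_s))  # 3.57x 5h15m 1h28m
```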

SGLang is inference-bound. With TRT-LLM, inference is fast enough that the pipeline becomes training-bound under this setting.

A DFlash example is also added, mirroring SGLang's example (for SGLang training, see #126 for a hang encountered there). The training results show that quality remains equivalent and wall-clock time is comparable, since the pipeline was training-bound with both engines.

E2E speed for DFlash:

| Metric | SGLang DFlash | TRT-LLM DFlash |
|---|---|---|
| Wall-clock | 4173.5 s (69.6 min) | 4320.4 s (72.0 min) |
| Step rate | 2.91 step/s | 2.83 step/s |
| Avg training throughput | 19.2 entries/s | 18.5 entries/s |
| Avg inference throughput | 19.3 entries/s | 18.6 entries/s |

Training results for DFlash:

| Step | SGL loss | TRT loss | SGL acc | TRT acc | SGL sim_acc_len | TRT sim_acc_len |
|---|---|---|---|---|---|---|
| 1000 | 4.9995 | 4.9752 | 0.1381 | 0.1396 | 0.60 | 0.62 |
| 2000 | 4.5755 | 4.5726 | 0.1638 | 0.1630 | 0.75 | 0.75 |
| 3000 | 4.3092 | 4.2956 | 0.1783 | 0.1811 | 0.84 | 0.85 |
| 4000 | 4.1083 | 4.0983 | 0.1951 | 0.1967 | 0.93 | 0.94 |
| 5000 | 3.9469 | 3.9286 | 0.2091 | 0.2108 | 1.01 | 1.02 |
| 6000 | 3.8038 | 3.7964 | 0.2187 | 0.2204 | 1.07 | 1.08 |
| 7000 | 3.6954 | 3.6888 | 0.2305 | 0.2322 | 1.13 | 1.13 |
| 8000 | 3.6041 | 3.5976 | 0.2391 | 0.2405 | 1.17 | 1.18 |
| 9000 | 3.5430 | 3.5406 | 0.2453 | 0.2462 | 1.21 | 1.21 |
| 10000 | 3.5059 | 3.5020 | 0.2491 | 0.2505 | 1.23 | 1.23 |

Test

just BACKEND=trtllm build
deploy/run_compare.sh trtllm \
  dataset.train_data_path=<train.jsonl> \
  dataset.eval_data_path=<eval.jsonl> dataset.eval_interval=1000 \
  training.micro_batch_size=4 training.num_train_steps=10000

Note

Multimodal input is not yet supported. I plan to split multimodal input support and base TRT-LLM support into two PRs.

@chungen04 chungen04 marked this pull request as draft June 16, 2026 00:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac2cd6db34

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

assert pp_size == 1, f"trtllm_pp_size must be 1, got {pp_size}"

if self.base_gpu_id is not None:
    self.local_gpu_id = self.setup_gpu(self.base_gpu_id)

P1 Badge Set CUDA visibility before initializing CUDA

When inference_num_gpus_per_engine > 1 (including the added sample config), this calls setup_gpu(), which runs torch.cuda.set_device() while Ray still exposes only the actor's single scheduled GPU. _init_engine() expands CUDA_VISIBLE_DEVICES later, but CUDA has already been initialized, so TensorRT-LLM's multi-GPU constructor/device-count check can still see only one GPU and fail or start TP workers without the full assigned set. Set the engine's full GPU visibility before touching torch.cuda.

Useful? React with 👍 / 👎.
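
A minimal sketch of the ordering this comment asks for, assuming the engine is assigned a contiguous block of GPUs starting at `base_gpu_id`. `visible_device_list` and `pin_visibility` are hypothetical helpers, not functions from this PR:

```python
import os


def visible_device_list(base_gpu_id, num_gpus):
    """Comma-separated ordinals for a contiguous block of GPUs."""
    return ",".join(str(base_gpu_id + i) for i in range(num_gpus))


def pin_visibility(base_gpu_id, num_gpus, env=os.environ):
    """Set CUDA_VISIBLE_DEVICES; must run before anything initializes CUDA."""
    env["CUDA_VISIBLE_DEVICES"] = visible_device_list(base_gpu_id, num_gpus)
    return env["CUDA_VISIBLE_DEVICES"]


# Pin first, and only afterwards import torch / tensorrt_llm or call
# torch.cuda.set_device(): CUDA reads the variable once, at initialization.
fake_env = {}
pin_visibility(base_gpu_id=2, num_gpus=2, env=fake_env)
```

Passing a plain dict as `env` keeps the example from touching the real process environment; in an engine the default `os.environ` would be used.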

Comment on lines +302 to +308
self._engine = LLM(
    model=self.args.target_model_path,
    backend="pytorch",
    tensor_parallel_size=tp_size,
    trust_remote_code=getattr(self.args, "trust_remote_code", True),
    speculative_config=spec_config,
    **engine_kwargs,

P2 Badge Protect trust_remote_code from extra_args

If a user puts trust_remote_code in inference.trtllm.extra_args (a normal LLM kwarg and common when porting backend configs), it is not removed by _PROTECTED_ENGINE_KEYS, so this call passes trust_remote_code both explicitly and via **engine_kwargs and raises TypeError: got multiple values for keyword argument 'trust_remote_code' during engine initialization. Either add it to the protected set or move the explicit value into engine_kwargs before applying extras.

Useful? React with 👍 / 👎.
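
The failure mode and the protected-set fix can be reproduced without TensorRT-LLM. In this sketch `fake_llm` is a stand-in for the engine constructor, and the contents of `PROTECTED_KEYS` are an assumption for illustration (the real set is `_PROTECTED_ENGINE_KEYS` in the PR):

```python
PROTECTED_KEYS = {"model", "backend", "tensor_parallel_size", "trust_remote_code"}


def fake_llm(model, backend, trust_remote_code=True, **kwargs):
    """Stand-in for an engine constructor; returns what it received."""
    return {"model": model, "backend": backend,
            "trust_remote_code": trust_remote_code, **kwargs}


def filter_extra_args(extra_args, protected=PROTECTED_KEYS):
    """Drop user-supplied keys that the engine sets explicitly."""
    return {k: v for k, v in extra_args.items() if k not in protected}


extra_args = {"trust_remote_code": False, "max_num_tokens": 16384}

# Unfiltered: the same keyword arrives twice and Python raises TypeError.
try:
    fake_llm(model="m", backend="pytorch", trust_remote_code=True, **extra_args)
    duplicate_error = None
except TypeError as exc:
    duplicate_error = str(exc)

# Filtered: the explicit value wins and unrelated extras pass through.
engine = fake_llm(model="m", backend="pytorch", trust_remote_code=True,
                  **filter_extra_args(extra_args))
```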

@chungen04 chungen04 force-pushed the chungen/trtllm branch 3 times, most recently from d8bc0d5 to 9342b09 Compare June 23, 2026 22:40
echo "Extra args: $*"
echo "=============================================="

# TODO: unify tp_size config across sglang/vllm backends

Contributor Author

Previously, vLLM hardcoded this setting in its YAML config (configs/vllm_qwen3_8b.yaml).

Comment thread README.md
**vLLM**

```bash
./examples/qwen3-8b-single-node/run.sh --config configs/vllm_qwen3_8b.yaml

Contributor Author

run.sh has no --config flag; see the run.sh docstring.

TorchSpec explicitly uses are listed; any other ``LLM`` kwarg can be passed
via ``extra_args``.

Single-node tensor parallelism only (nnodes must be 1); multi-node TP is

Contributor Author

This will be checked in factory.py.

@chungen04 chungen04 marked this pull request as ready for review June 25, 2026 22:57

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 38bb53c003

Comment on lines +50 to +51
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig, SaveHiddenStatesDecodingConfig

P2 Badge Defer TensorRT-LLM imports until backend selection

When tensorrt-llm is installed in an environment where the NVIDIA driver/libcuda is not available, importing this module at package import time can raise a non-ImportError load failure; torchspec.inference.engine.__init__ imports it opportunistically for every backend and only catches ImportError, so unrelated HF/vLLM/SGLang imports can fail before TRT-LLM is selected. Move these imports into TrtllmEngine initialization or catch the CUDA loader failure in the optional import path.

Useful? React with 👍 / 👎.
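
Both options named in this comment can be sketched with the standard library alone. `LazyBackend` and `try_optional_import` are hypothetical names, and `json` is used only as a stand-in for a heavy backend module:

```python
import importlib


class LazyBackend:
    """Holds a module name and imports it only on first use."""

    def __init__(self, module_name):
        self.module_name = module_name
        self._module = None

    def load(self):
        if self._module is None:
            self._module = importlib.import_module(self.module_name)
        return self._module


def try_optional_import(module_name):
    """Optional-import path that tolerates non-ImportError loader failures."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Broader than ImportError: a missing shared library can surface
        # as OSError or another exception type during import.
        return None


backend = LazyBackend("json")           # nothing is imported yet
missing = try_optional_import("definitely_not_a_module_xyz")
```

Constructing `LazyBackend` has no import side effects, so a package `__init__` that only builds such objects stays importable on hosts where the backend cannot load.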

Comment thread examples/qwen3-8b-single-node/run.sh Outdated
fi

# Derive the tp_size override block from the config's engine type ("sgl" -> "sglang").
ENGINE_TYPE=$(grep -oE "inference_engine_type:[[:space:]]*[a-zA-Z]+" "$CONFIG_FILE" | awk '{print $2}')

P2 Badge Make engine-type probing non-fatal

With set -euo pipefail, any custom config that omits a literal unquoted inference_engine_type in the YAML, or supplies it via the extra CLI args, makes grep return 1 and exits the launcher before the ${ENGINE_TYPE:-sglang} fallback can run. This regresses the script's [CONFIG_FILE] [EXTRA_ARGS...] path for otherwise usable configs; make the probe tolerate no match or derive the value from the resolved config.

Useful? React with 👍 / 👎.
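
The launcher does this in bash; the same probe-with-fallback logic is shown here in Python purely for illustration. `probe_engine_block` is a hypothetical name, and the "sgl" to "sglang" mapping follows the comment in the snippet above:

```python
import re

# "sgl" selects the "sglang" config block; other engine types map 1:1.
BLOCK_FOR_ENGINE = {"sgl": "sglang"}


def probe_engine_block(config_text, default="sglang"):
    """Return the config block for the engine type, or a default on no match."""
    match = re.search(r"inference_engine_type:\s*([a-zA-Z]+)", config_text)
    if match is None:
        return default  # a missing key is not an error
    engine_type = match.group(1)
    return BLOCK_FOR_ENGINE.get(engine_type, engine_type)
```

The point is that "no match" is an expected outcome with a defined result, rather than a failure that aborts the caller.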

Comment thread patches/trtllm/v1.3.0rc18/trtllm.patch Outdated
+ key=key,
+ hidden_states=aux_hidden_states,
+ input_ids=input_ids,
+ last_hidden_states=last_hidden_states,

P2 Badge Honor disabled last-hidden-state storage

When inference.store_last_hidden_states=false, TrtllmEngine._get_tensor_shapes() omits last_hidden_states, so the data fetcher later removes only _hs and _ids; however this patched resource manager still always passes last_hidden_states to store.put() here. In that configuration every TRT-LLM sample leaves its _lhs object behind in Mooncake and long runs can fill the segment, so either skip storing last_hidden_states in this mode or include it in the returned metadata.

Useful? React with 👍 / 👎.
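
A minimal sketch of the leak and of the gated write, using a plain dict in place of the Mooncake store. The `_hs`, `_ids`, and `_lhs` suffixes come from the comment above; `put_sample` and `delete_sample` are hypothetical helpers:

```python
def put_sample(store, key, hidden_states, input_ids, last_hidden_states,
               store_last_hidden):
    """Write one sample; skip the _lhs object when it will not be consumed."""
    store[f"{key}_hs"] = hidden_states
    store[f"{key}_ids"] = input_ids
    if store_last_hidden:
        store[f"{key}_lhs"] = last_hidden_states


def delete_sample(store, key, store_last_hidden):
    """Mirror of put_sample: remove exactly the objects the reader expects."""
    suffixes = ["_hs", "_ids"] + (["_lhs"] if store_last_hidden else [])
    for suffix in suffixes:
        store.pop(f"{key}{suffix}", None)


# Writer and reader agree on the flag: nothing is left behind.
clean = {}
put_sample(clean, "req1", [0.1], [7], [0.2], store_last_hidden=False)
delete_sample(clean, "req1", store_last_hidden=False)

# Writer always stores _lhs but the reader does not expect it: one orphan.
leaky = {}
put_sample(leaky, "req1", [0.1], [7], [0.2], store_last_hidden=True)
delete_sample(leaky, "req1", store_last_hidden=False)
```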

@chungen04 chungen04 marked this pull request as draft June 26, 2026 04:37
chungen04 and others added 20 commits June 26, 2026 23:08
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
tensorrt_llm's package init loads libcuda.so.1, which is absent during the
image build. Apply the patch in the fixed python3.12 dist-packages path and
verify it landed by grepping the patched file instead of importing the module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
TRT-LLM's MpiPoolSession._start_mpi_pool only forwards env vars prefixed
TRTLLM/TLLM to its spawned MPI workers. The SaveHiddenStates Mooncake
redirect gates on TORCHSPEC_TRTLLM_MOONCAKE and reads the MOONCAKE_*
connection params, none of which reached the workers, so hidden-state
capture silently fell back to disk mode and the trainer timed out waiting
for keys ("batch_get_buffer missing keys").

Add mpi_session.patch forwarding TORCHSPEC*/MOONCAKE*/MC_* prefixes
(applied by the existing patch loop) plus a build-time validation grep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
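
A sketch of prefix-based environment forwarding as this commit message describes it, assuming forwarding is a simple name-prefix filter. `env_for_workers` is a hypothetical helper and `MOONCAKE_EXAMPLE_PARAM` is a made-up variable name:

```python
FORWARDED_PREFIXES = ("TRTLLM", "TLLM", "TORCHSPEC", "MOONCAKE", "MC_")


def env_for_workers(env, prefixes=FORWARDED_PREFIXES):
    """Select the variables whose names start with a forwarded prefix."""
    return {name: value for name, value in env.items()
            if name.startswith(prefixes)}


parent_env = {
    "TORCHSPEC_TRTLLM_MOONCAKE": "1",
    "MOONCAKE_EXAMPLE_PARAM": "value",  # hypothetical variable name
    "TLLM_LOG_LEVEL": "info",
    "HOME": "/root",
}
worker_env = env_for_workers(parent_env)
```

With only the first two prefixes, `TORCHSPEC_TRTLLM_MOONCAKE` would be filtered out and the workers would never see the gate, which matches the silent fallback described above.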
The patch stored hidden states under request.py_request_id, the backend
runtime request id which is bumped by JIT-warmup and internal requests. But
the TorchSpec engine reconstructs Mooncake keys from
RequestOutput.request_id == GenerationRequest.id, which TRT-LLM sets to the
client_id assigned by the proxy (_get_next_client_id). The two id spaces are
mapped via _client_id_to_request_id and differ, so the stored keys were
systematically offset from the keys the trainer requested, surfacing as
"Size mismatch for hidden_states" (data shifted between keys).

Key on request.py_client_id, falling back to py_request_id if unset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
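
The fallback rule in this commit message can be sketched as follows. The attribute names come from the message; `key_id` and the `SimpleNamespace` request objects are illustrative only:

```python
from types import SimpleNamespace


def key_id(request):
    """Prefer the client-assigned id; fall back to the runtime request id."""
    client_id = getattr(request, "py_client_id", None)
    return client_id if client_id is not None else request.py_request_id


with_client = SimpleNamespace(py_client_id=5, py_request_id=9)
unset_client = SimpleNamespace(py_client_id=None, py_request_id=9)
zero_client = SimpleNamespace(py_client_id=0, py_request_id=9)
```

Comparing against `None` rather than testing truthiness keeps a client id of 0 from falling through to the runtime id.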
SaveHiddenStates only captures hidden states for tokens that actually run a
forward pass. With enable_block_reuse on (the TRT-LLM default), prompts that
share a prefix reuse cached KV blocks and skip the forward for the shared
tokens, so fewer tokens are captured than the prompt length.

Always construct KvCacheConfig with enable_block_reuse=False and
enable_partial_reuse=False (previously the config was only built when a mem
fraction was set, leaving reuse on otherwise).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
SaveHiddenStates runs each prompt as a single prefill (chunked prefill is
disabled), so a prompt longer than max_num_tokens is rejected outright
("sum of prompt length ... should not exceed max_num_tokens") and the
sample is dropped. With max_num_tokens=8192 but max_seq_length=16384, samples
in that range were silently lost during training. Raise to 16384.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
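
To make the lost range concrete, here is a small check of which prompts are rejected when each prompt must fit in one unchunked prefill. The prompt lengths are invented for illustration and `rejected_prompts` is a hypothetical helper:

```python
def rejected_prompts(prompt_lengths, max_num_tokens):
    """Prompts that cannot run as a single unchunked prefill."""
    return [n for n in prompt_lengths if n > max_num_tokens]


lengths = [512, 8192, 8193, 16384]   # illustrative prompt lengths, in tokens
dropped_before = rejected_prompts(lengths, max_num_tokens=8192)
dropped_after = rejected_prompts(lengths, max_num_tokens=16384)
```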
Collapse a NotImplementedError call onto one line (fits in the 100-char
limit) so `ruff format --check` passes. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Merge the standalone mpi_session.patch (forwarding TORCHSPEC*/MOONCAKE*/MC_*
env to MPI workers) into the main trtllm.patch and drop the separate file.

Also relax the upstream max_batch_size=1 cap that TRT-LLM forces whenever
SaveHiddenStatesDecodingConfig is active: when TORCHSPEC_TRTLLM_MOONCAKE is
set, keep the configured max_batch_size so batched prefill capture works.
The Mooncake resource manager de-interleaves the packed multi-request
buffer, so the offset-0 single-request constraint no longer applies. Overlap
scheduler and CUDA graphs stay disabled (shared capture buffer is
overwritten every forward and must be read first).

Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Derive the per-backend config block from inference_engine_type instead of
hardcoding inference.sglang.tp_size. "sgl" maps to the "sglang" block;
vllm/trtllm map 1:1. The same launcher now drives sglang, vllm, and trtllm
with an identical inference layout for fair side-by-side runs.

Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Switch the inference layout to 1 GPU / tp=1, drop the eval dataset, shrink
the Mooncake global_segment_size to 16GB, and namespace cache_dir per
example. Point the usage comment at the unified run.sh launcher.

Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Adopt the positional run.sh launch form in the README (vLLM + TRT-LLM
examples) and drop the patch-verify grep steps from the trtllm Dockerfile,
matching origin/chungen/trtllm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
SaveHiddenStates' Mooncake path sliced per-request hidden states in
py_batch_idx (= py_seq_slot, an arbitrary KV slot) order, but the forward
packs tokens in scheduled_requests.context_requests native order. Under
multi-request batching the orders differ, so per-request slices read the
wrong hidden states and corrupt a subset of captured samples. Iterate
context_requests in native order so the slices align.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
TRT-LLM's MPI tp-workers bind GPU ordinals 0..tp_size-1 regardless of the
Ray placement group, so under the default training_first they collide with
training on the low GPUs and OOM. inference_first assigns inference those
low indices. SGLang honors base_gpu_id and is unaffected, so the override
is TRT-config-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
trust_remote_code is passed explicitly to LLM() but was missing from
_PROTECTED_ENGINE_KEYS. A user setting it in inference.trtllm.extra_args
(a standard LLM kwarg) would have it survive the filter and be passed
twice, raising "TypeError: got multiple values for keyword argument
'trust_remote_code'" at engine init. Add it to the protected set so the
explicit config-sourced value wins, matching model/backend/tp_size.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
chungen04 and others added 6 commits June 26, 2026 23:08
Tighten TRT-only comments to the project's lean style and align with the
sglang reference where applicable. No functional change.

- Dockerfile: drop the known-issue CUDA-13 wheel comment; condense the
  patch-step comment to the one non-obvious gotcha (don't import
  tensorrt_llm at build time -- libcuda absent until runtime).
- run.sh: collapse the tp_size-block derivation comment to one line.
- inference_config.py: drop the redundant init_timeout comment (sglang/
  vllm declare the same field without one).
- trtllm_qwen3_8b.yaml: add the commented eval_data_path/eval_interval
  placeholders to match the sglang config; consolidate the max_num_tokens
  note to one line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
When store_last_hidden_states=false (e.g. the DFlash configs), the engine
omits last_hidden_states from the returned tensor metadata, so the fetcher
never fetches or deletes the {key}_lhs object -- but the patched resource
manager still always wrote it, orphaning one _lhs per sample in Mooncake
and filling the segment over long runs. Gate storage on a new
TORCHSPEC_TRTLLM_STORE_LAST_HIDDEN env (exported by the engine before LLM
construction, propagated to MPI workers) so _lhs is written only when the
flag is set. Default path (true) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
tensorrt_llm's package __init__ loads libcuda.so.1 on import. With the LLM
/SamplingParams/KvCacheConfig/SaveHiddenStatesDecodingConfig imports at
module scope, importing torchspec.inference.engine on a host without a CUDA
driver raised a non-ImportError loader failure; engine/__init__.py only
catches ImportError, so the whole engine package (HF/vLLM/SGLang included)
failed before TRT-LLM was ever selected. Move the imports into the methods
that use them, matching vllm_engine.py's lazy `from vllm import LLM`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
The grep that derives the tp_size override block returns 1 when a config
has no literal inference_engine_type line (e.g. set via CLI extra args);
under set -euo pipefail that exited the launcher before the ${ENGINE_TYPE
:-sglang} fallback could run, regressing the [CONFIG_FILE] [EXTRA_ARGS...]
path. Append `|| true` so a no-match yields empty and falls back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
…ent + per-engine keys)

Enables N independent single-GPU (tp=1) TRT-LLM inference engines on distinct
GPUs, needed for data-parallel inference (e.g. DFlash's 4 tp=1 replicas).
Two co-dependent pieces:

- GPU placement: single-GPU engines drop the NOSET override so Ray scopes
  CUDA_VISIBLE_DEVICES per actor to its own GPU; the engine no longer manually
  pins CVD for tp=1 (TRT's MPI worker maps cuda:0 to the scoped GPU). CVD is
  set before any CUDA/tensorrt_llm init (also fixes premature-CUDA-init). TP>1
  engines keep the prior NOSET + manual contiguous-block pin.
- Mooncake keys: each engine has its own request_id counter, so keys collided
  in the shared store across data-parallel engines. Prefix keys per engine
  (e{rank}_) on both the write side (patch) and the read side (engine).

Without these, 4 tp=1 engines all bound GPU 0 (OOM) and clobbered each other's
Mooncake keys (shape mismatch). Validated: DFlash trtllm 10k steps clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
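
A sketch of the per-engine key prefix, assuming each engine numbers its requests independently from the same starting value. Only the `e{rank}_` prefix comes from the commit message; the rest of the key format and the `make_key` name are assumptions:

```python
def make_key(engine_rank, request_id):
    """Build a store key that is unique across data-parallel engines."""
    return f"e{engine_rank}_{request_id}"


# Four engines, three requests each. Bare ids overlap across engines;
# the rank prefix keeps the stored objects apart.
bare = {rid for _rank in range(4) for rid in range(3)}
prefixed = {make_key(rank, rid) for rank in range(4) for rid in range(3)}
```

The write side and the read side need to build keys the same way, so sharing one helper between them is the simplest way to keep the two in step.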
Relax the DFlash engine guard in train_entry.py from sgl-only to
{'sgl', 'trtllm'}, and add configs/trtllm_qwen3_8b_dflash.yaml (mirrors the
sglang DFlash config; trtllm inference block + inference_first placement).
DFlash's aux-layer wiring and the trainer are engine-agnostic, so no other
changes are needed for the trtllm path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
Drop the stale sgl-only guard NOTE from the trtllm dflash config (the guard
now permits trtllm), remove a redundant CVD-scoping comment in trtllm_engine,
and rewrite the factory GPU-assignment comment to explain the tp==1 vs tp>1
split without the dangling "Approach 1" reference that assumed prior context.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: chungen04 <cho322@gatech.edu>
@chungen04 chungen04 marked this pull request as ready for review June 26, 2026 23:46

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d29cd75573

Comment on lines +115 to +116
+    if store is None:
+        return

P2 Badge Fail when Mooncake storage is unavailable

When the patched resource manager cannot create a Mooncake store (for example, bad Mooncake env, an incompatible wheel, or a setup failure), these lines return without storing any tensors, but TrtllmEngine.generate() still reconstructs and returns Mooncake keys from the request id. The training fetcher later tries to read missing _hs/_ids/_lhs objects, so the actual failure is delayed until after bogus samples have been queued; surface the storage error here instead of returning success.

Useful? React with 👍 / 👎.
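
One way to surface the error at the write site, sketched with a dict in place of the real store. `StoreUnavailableError` and `store_or_fail` are hypothetical names, not part of the patch:

```python
class StoreUnavailableError(RuntimeError):
    """Raised when hidden states cannot be written to the store."""


def store_or_fail(store, key, tensors):
    """Write tensors under key, raising instead of silently skipping."""
    if store is None:
        raise StoreUnavailableError(f"no store available; cannot write {key!r}")
    store[key] = tensors


ok_store = {}
store_or_fail(ok_store, "e0_1_hs", [0.1, 0.2])

try:
    store_or_fail(None, "e0_1_hs", [0.1, 0.2])
    failure = None
except StoreUnavailableError as exc:
    failure = str(exc)
```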

Contributor Author

Mooncake is required for this framework.

Comment thread configs/trtllm_qwen3_8b.yaml Outdated
# Installation:
# Use the TensorRT-LLM docker image (docker/trtllm/v1.3.0rc18/Dockerfile),
# which ships tensorrt_llm patched for Mooncake hidden-state capture.
# For a local install: pip install -e ".[trtllm]"

P2 Badge Do not document an unpatched local install

When users follow this local install path, pip install -e ".[trtllm]" only installs the TensorRT-LLM dependency; the Mooncake redirect patch under patches/trtllm/v1.3.0rc18/ is only applied by the Dockerfile. With an unpatched local TensorRT-LLM, TORCHSPEC_TRTLLM_MOONCAKE is ignored and hidden states are written to disk while TrtllmEngine returns Mooncake keys that were never stored, so either make the local install apply the patch or remove this instruction.

Useful? React with 👍 / 👎.

Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
@chungen04

Contributor Author

@yubofredwang This PR integrates the TensorRT-LLM engine as an inference backend. Please review; I look forward to suggestions, plans for splitting the PR, etc. I have not included tests yet. Please let me know whether you would prefer tests in this PR or in a separate one. Thank you!

@yubofredwang

Collaborator

@yubofredwang This PR integrates the TensorRT-LLM engine as an inference backend. Please review; I look forward to suggestions, plans for splitting the PR, etc. I have not included tests yet. Please let me know whether you would prefer tests in this PR or in a separate one. Thank you!

thanks for the great work! I will run it end to end to verify
