
Conversation

Contributor

@gcanlin gcanlin commented Dec 2, 2025

Purpose

Related to #165. Add Qwen2.5-Omni model test with OmniRunner.

Test Plan

pytest -sv tests/omni/test_qwen_omni.py
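
For context, the added test is structured roughly as follows (a sketch assembled from the diff excerpts discussed below; the body shown here is illustrative only, not the actual implementation):

# Sketch only: decorators and signature mirror the diff under review;
# the omni_runner fixture comes from the test suite's conftest.
import pytest

models = ["Qwen/Qwen2.5-Omni-7B"]  # the review below asks to switch to the lighter 3B checkpoint


@pytest.mark.core_model
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize("max_tokens", [2048])
def test_mixed_modalities_to_audio(omni_runner, model: str, max_tokens: int) -> None:
    # Drives the thinker -> talker -> code2wav stages end-to-end via OmniRunner
    # and checks that audio output is produced (details elided here).
    ...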

Test Result

Pass.



@gcanlin gcanlin marked this pull request as ready for review December 3, 2025 03:54
Collaborator

@Gaohan123 Gaohan123 left a comment


Please coordinate with PR #174 to unify the environment setup.

"--strict-markers",
"--strict-config",
"--cov=vllm_omni",
"--cov-report=term-missing",
Collaborator


What is the removal for?

Contributor Author

@gcanlin gcanlin Dec 4, 2025


Sorry, it was removed by mistake. It has been restored now.

from vllm.assets.video import VideoAsset
from vllm.multimodal.image import convert_image_mode

models = ["Qwen/Qwen2.5-Omni-7B"]
Collaborator


Please change it to the 3B model to make the test lighter.
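
For example, a one-line sketch of the change (assuming the Hugging Face 3B checkpoint name):

models = ["Qwen/Qwen2.5-Omni-3B"]  # lighter checkpoint for CI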

Contributor Author


Yes, I'll do that later, thanks. I'm still trying to fix the OOM bug on NPU when running this test.

@gcanlin gcanlin mentioned this pull request Dec 4, 2025
Contributor Author

gcanlin commented Dec 4, 2025

It's ready on GPU. Let me fix the OOM error on NPU.

(EngineCore_DP0 pid=2101976) /home/guocanlin/vllm-omni/vllm_omni/worker/gpu_model_runner.py:207: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=2101976)   info_dict[k] = torch.from_numpy(arr)
[Stage-1] Generate done: batch=1, req_ids=[0], gen_ms=39939.4
[Stage-2] Max batch size: 1
--------------------------------
[Stage-2] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=2102395) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
[Stage-2] Generate done: batch=1, req_ids=[0], gen_ms=21030.0
INFO:vllm_omni.entrypoints.omni_llm:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 133066.19429588318, 'e2e_sum_time_ms': 133065.80710411072, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 133065.80710411072, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 133066.19429588318, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 77, 'total_time_ms': 71301.9208908081, 'avg_time_per_request_ms': 71301.9208908081, 'avg_tokens_per_s': 1.0799148050712117}, {'stage_id': 1, 'requests': 1, 'tokens': 2048, 'total_time_ms': 40275.97904205322, 'avg_time_per_request_ms': 40275.97904205322, 'avg_tokens_per_s': 50.84916738738067}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 21053.220510482788, 'avg_time_per_request_ms': 21053.220510482788, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 80126028, 'total_time_ms': 251.47128105163574, 'tx_mbps': 2549.0315288463453, 'rx_samples': 1, 'rx_total_bytes': 80126028, 'rx_total_time_ms': 273.35214614868164, 'rx_mbps': 2344.9906394784366, 'total_samples': 1, 'total_transfer_time_ms': 526.7231464385986, 'total_mbps': 1216.973714434484}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 5979, 'total_time_ms': 0.3604888916015625, 'tx_mbps': 132.68647415873016, 'rx_samples': 1, 'rx_total_bytes': 5979, 'rx_total_time_ms': 0.05078315734863281, 'rx_mbps': 941.8870841690141, 'total_samples': 1, 'total_transfer_time_ms': 1.384735107421875, 'total_mbps': 34.542346578512394}]}
[rank0]:[W1204 02:26:54.720720142 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1204 02:26:54.737824625 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1204 02:26:54.746801971 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
PASSED

Contributor Author

gcanlin commented Dec 4, 2025

NPU works now:

(EngineCore_DP0 pid=292337)   info_dict[k] = torch.from_numpy(arr)
[Stage-1] Generate done: batch=1, req_ids=[0], gen_ms=31229.9
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
[Stage-2] Max batch size: 1
--------------------------------
[Stage-2] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=293377) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
...[Stage-2] Generate done: batch=1, req_ids=[0], gen_ms=204816.4
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
INFO:vllm_omni.entrypoints.omni_llm:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 261041.13006591797, 'e2e_sum_time_ms': 261040.785074234, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 261040.785074234, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 261041.13006591797, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 79, 'total_time_ms': 24824.2928981781, 'avg_time_per_request_ms': 24824.2928981781, 'avg_tokens_per_s': 3.1823665763224196}, {'stage_id': 1, 'requests': 1, 'tokens': 1213, 'total_time_ms': 31276.91650390625, 'avg_time_per_request_ms': 31276.91650390625, 'avg_tokens_per_s': 38.78259545977}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 204829.79822158813, 'avg_time_per_request_ms': 204829.79822158813, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 45824576, 'total_time_ms': 48.127174377441406, 'tx_mbps': 7617.247693058714, 'rx_samples': 1, 'rx_total_bytes': 45824576, 'rx_total_time_ms': 24.182558059692383, 'rx_mbps': 15159.546276911258, 'total_samples': 1, 'total_transfer_time_ms': 73.17924499511719, 'total_mbps': 5009.57078778909}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 3647, 'total_time_ms': 0.3380775451660156, 'tx_mbps': 86.2997274358251, 'rx_samples': 1, 'rx_total_bytes': 3647, 'rx_total_time_ms': 0.1347064971923828, 'rx_mbps': 216.5894044318584, 'total_samples': 1, 'total_transfer_time_ms': 1.455545425415039, 'total_mbps': 20.044719656674857}]}
PASSED

@gcanlin gcanlin requested a review from Gaohan123 December 4, 2025 06:30
Member

Isotr0py commented Dec 4, 2025

BTW, I think we need a different stage YAML config for the GPU CI's e2e test, because the default CI machine has an L4 24 GB GPU.

Here is the YAML I used to run examples/offline_inference/qwen2_5_omni/end2end.py with Qwen2.5-Omni-3B on an RTX 3090 24 GB GPU:

Qwen2_5_omni_3b_RTX3090.yaml
# stage config for running qwen2.5-omni with architecture of OmniLLM.

# The following config has been verified on 1x 24GB RTX3090 GPU.
stage_args:
  - stage_id: 0
    runtime:
      process: true            # Run this stage in a separate process
      devices: "0"            # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device)
      max_batch_size: 1
    engine_args:
      model_stage: thinker
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
      scheduler_cls: vllm_omni.core.sched.scheduler.OmniScheduler
      max_model_len: 8192
      max_num_seqs: 2
      gpu_memory_utilization: 0.55
      enforce_eager: true  # Now we only support eager mode
      trust_remote_code: true
      engine_output_type: latent
      enable_prefix_caching: false
    is_comprehension: true
    final_output: true
    final_output_type: text
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.1
  - stage_id: 1
    runtime:
      process: true
      devices: "0"
      max_batch_size: 1
    engine_args:
      model_stage: talker
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
      scheduler_cls: vllm_omni.core.sched.scheduler.OmniScheduler
      max_model_len: 8192
      max_num_seqs: 2
      gpu_memory_utilization: 0.32
      enforce_eager: true
      trust_remote_code: true
      enable_prefix_caching: false
      engine_output_type: latent
    engine_input_source: [0]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker
    default_sampling_params:
      temperature: 0.9
      top_p: 0.8
      top_k: 40
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
      stop_token_ids: [8294]
  - stage_id: 2
    runtime:
      process: true
      devices: "0"            # Example: use a different GPU than the previous stage; use "0" if single GPU
      max_batch_size: 1
    engine_args:
      model_stage: code2wav
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_cls: vllm_omni.worker.gpu_diffusion_worker.GPUDiffusionWorker
      scheduler_cls: vllm_omni.core.sched.diffusion_scheduler.DiffusionScheduler
      gpu_memory_utilization: 0.125
      enforce_eager: true
      trust_remote_code: true
      enable_prefix_caching: false
      engine_output_type: audio
    engine_input_source: [1]
    final_output: true
    final_output_type: audio
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.1

# Top-level runtime config (concise): default windows and stage edges
runtime:
  enabled: true
  defaults:
    window_size: -1             # Simplified: trigger downstream only after full upstream completion
    max_inflight: 1             # Simplified: process serially within each stage
  edges:
    - from: 0                   # thinker → talker: trigger only after receiving full input (-1)
      to: 1
      window_size: -1
    - from: 1                   # talker → code2wav: trigger only after receiving full input (-1)
      to: 2
      window_size: -1
INFO:vllm_omni.entrypoints.omni_llm:[Orchestrator] Stage-2 reported ready
INFO:vllm_omni.entrypoints.omni_llm:[Orchestrator] All stages initialized successfully
[Stage-0] Max batch size: 1
--------------------------------
[Stage-0] Received batch size=1, request_ids=[0]
--------------------------------
[Stage-0] Generate done: batch=1, req_ids=[0], gen_ms=7460.5
[Stage-1] Max batch size: 1
--------------------------------
[Stage-1] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=1251620) /home/mozf/develop-projects/vllm-omni/vllm_omni/worker/gpu_model_runner.py:207: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=1251620)   info_dict[k] = torch.from_numpy(arr)
[Stage-1] Generate done: batch=1, req_ids=[0], gen_ms=30644.4
[Stage-2] Max batch size: 1
--------------------------------
[Stage-2] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=1252655) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
[Stage-2] Generate done: batch=1, req_ids=[0], gen_ms=34116.7
INFO:vllm_omni.entrypoints.omni_llm:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 72805.70530891418, 'e2e_sum_time_ms': 72805.37819862366, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 72805.37819862366, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 72805.70530891418, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 137, 'total_time_ms': 7645.807504653931, 'avg_time_per_request_ms': 7645.807504653931, 'avg_tokens_per_s': 17.918316661334906}, {'stage_id': 1, 'requests': 1, 'tokens': 2048, 'total_time_ms': 30854.575634002686, 'avg_time_per_request_ms': 30854.575634002686, 'avg_tokens_per_s': 66.37589264857823}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 34130.319118499756, 'avg_time_per_request_ms': 34130.319118499756, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 46321940, 'total_time_ms': 98.6323356628418, 'tx_mbps': 3757.1402675361014, 'rx_samples': 1, 'rx_total_bytes': 46321940, 'rx_total_time_ms': 129.7628879547119, 'rx_mbps': 2855.7897087596666, 'total_samples': 1, 'total_transfer_time_ms': 229.16722297668457, 'total_mbps': 1617.0528890935782}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 6090, 'total_time_ms': 0.42247772216796875, 'tx_mbps': 115.31969011286682, 'rx_samples': 1, 'rx_total_bytes': 6090, 'rx_total_time_ms': 0.07271766662597656, 'rx_mbps': 669.9884946885246, 'total_samples': 1, 'total_transfer_time_ms': 1.4905929565429688, 'total_mbps': 32.68497934740883}]}
[rank0]:[W1204 22:04:18.559406999 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1204 22:04:18.568555755 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1204 22:04:18.569440808 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Request ID: 0, Text saved to output_audio/00000.txt
Request ID: 0, Saved audio to output_audio/output_0.wav
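
One note on the memory budget in the config above: all three stages share GPU 0, so their gpu_memory_utilization fractions (0.55 + 0.32 + 0.125) are chosen to sum to just under 1.0. A quick sanity check that could be run over such a stage config (a sketch; assumes PyYAML and the attachment's file name above):

# Sketch: verify the per-stage GPU memory fractions leave headroom on a single device.
import yaml

with open("Qwen2_5_omni_3b_RTX3090.yaml") as f:
    cfg = yaml.safe_load(f)

total = sum(
    stage["engine_args"].get("gpu_memory_utilization", 0.0)
    for stage in cfg["stage_args"]
)
print(f"Total gpu_memory_utilization across stages: {total:.3f}")
assert total < 1.0, "stages sharing one GPU must keep the sum below 1.0"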

Contributor Author

gcanlin commented Dec 4, 2025

BTW, I think we need a different stage YAML config for the GPU CI's e2e test, because the default CI machine has an L4 24 GB GPU.

Here is the YAML I used to run examples/offline_inference/qwen2_5_omni/end2end.py with Qwen2.5-Omni-3B on an RTX 3090 24 GB GPU:

Thanks. Added the CI stage config now.

@pytest.mark.core_model
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize("max_tokens", [2048])
def test_mixed_modalities_to_audio(omni_runner, model: str, max_tokens: int) -> None:
Member


Suggested change
def test_mixed_modalities_to_audio(omni_runner, model: str, max_tokens: int) -> None:
def test_mixed_modalities_to_audio(omni_runner: type[OmniRunner], model: str, max_tokens: int) -> None:

Member

@Isotr0py Isotr0py left a comment


Confirmed that the test passes on an RTX 3090 as well:

(EngineCore_DP0 pid=1413214) INFO 12-05 00:08:10 [__init__.py:381] Cudagraph is disabled under eager mode
INFO:vllm_omni.entrypoints.omni_llm:[Orchestrator] Stage-2 reported ready
INFO:vllm_omni.entrypoints.omni_llm:[Orchestrator] All stages initialized successfully
[Stage-0] Max batch size: 1
--------------------------------
[Stage-0] Received batch size=1, request_ids=[0]
--------------------------------
[Stage-0] Generate done: batch=1, req_ids=[0], gen_ms=6246.4
[Stage-1] Max batch size: 1
--------------------------------
[Stage-1] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=1412210) /home/mozf/develop-projects/vllm-omni/vllm_omni/worker/gpu_model_runner.py:207: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=1412210)   info_dict[k] = torch.from_numpy(arr)
[Stage-1] Generate done: batch=1, req_ids=[0], gen_ms=17460.2
[Stage-2] Max batch size: 1
--------------------------------
[Stage-2] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=1413214) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
[Stage-2] Generate done: batch=1, req_ids=[0], gen_ms=15025.2
INFO:vllm_omni.entrypoints.omni_llm:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 39295.241832733154, 'e2e_sum_time_ms': 39294.82388496399, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 39294.82388496399, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 39295.241832733154, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 72, 'total_time_ms': 6459.305763244629, 'avg_time_per_request_ms': 6459.305763244629, 'avg_tokens_per_s': 11.146708739149865}, {'stage_id': 1, 'requests': 1, 'tokens': 1154, 'total_time_ms': 17631.80184364319, 'avg_time_per_request_ms': 17631.80184364319, 'avg_tokens_per_s': 65.4499188587497}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 15033.496141433716, 'avg_time_per_request_ms': 15033.496141433716, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 45764674, 'total_time_ms': 97.99075126647949, 'tx_mbps': 3736.2443625354754, 'rx_samples': 1, 'rx_total_bytes': 45764674, 'rx_total_time_ms': 124.15766716003418, 'rx_mbps': 2948.810173181569, 'total_samples': 1, 'total_transfer_time_ms': 223.24252128601074, 'total_mbps': 1639.9984639617237}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 3486, 'total_time_ms': 0.5724430084228516, 'tx_mbps': 48.717513516034984, 'rx_samples': 1, 'rx_total_bytes': 3486, 'rx_total_time_ms': 0.07987022399902344, 'rx_mbps': 349.1664177671642, 'total_samples': 1, 'total_transfer_time_ms': 1.8804073333740234, 'total_mbps': 14.830829206542411}]}
[rank0]:[W1205 00:08:51.904617626 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1205 00:08:51.907913413 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1205 00:08:51.911772229 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
PASSED

Contributor Author

gcanlin commented Dec 5, 2025

OOM still happens. Should we keep reducing gpu_memory_utilization?

ERROR 12-04 10:24:52 [core.py:708] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 113.62 MiB is free. Process 60 has 184.00 MiB memory in use. Process 161 has 184.00 MiB memory in use. Process 198 has 13.57 GiB memory in use. Including non-PyTorch memory, this process has 7.79 GiB memory in use. Process 344 has 184.00 MiB memory in use. Of the allocated memory 6.79 GiB is allocated by PyTorch, and 778.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Member

Isotr0py commented Dec 5, 2025

OOM still happens. Should we keep reducing gpu_memory_utilization?

The OOM is happening during multimodal (mm) profiling in the profile run; let's skip mm profiling for the single-GPU test for now.
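
As a complementary safeguard (distinct from skipping multimodal profiling itself, which is handled in the engine/stage config), such tests can also be gated on available GPU memory in CI; a minimal pytest sketch, with the 20 GiB threshold chosen purely for illustration:

# Sketch: skip the e2e test on GPUs that are too small for the three-stage pipeline.
import pytest
import torch


def _gpu_mem_gib() -> float:
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


requires_large_gpu = pytest.mark.skipif(
    _gpu_mem_gib() < 20, reason="needs a GPU with at least ~20 GiB of memory"
)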

Member

Isotr0py commented Dec 5, 2025

The omni e2e test finally passes now. 😭

Collaborator

@Gaohan123 Gaohan123 left a comment


LGTM. It is great that it can run with such limited resources!

@Gaohan123 Gaohan123 enabled auto-merge (squash) December 5, 2025 09:52
auto-merge was automatically disabled December 5, 2025 12:50

Head branch was pushed to by a user without write access

Contributor Author

gcanlin commented Dec 5, 2025

Updated the class names. Hopefully this will be the final commit. 🥹

@Isotr0py Isotr0py enabled auto-merge (squash) December 5, 2025 16:18
@gcanlin gcanlin mentioned this pull request Dec 5, 2025