[E2E] Add Qwen2.5-Omni model test with OmniRunner #168
base: main
Conversation
Gaohan123 left a comment
Please discuss with PR #174 to unify the env setup.
| "--strict-markers", | ||
| "--strict-config", | ||
| "--cov=vllm_omni", | ||
| "--cov-report=term-missing", |
What is the removal for?
Sorry, it was removed by mistake. Recovered now.
tests/omni/test_qwen_omni.py
Outdated
from vllm.assets.video import VideoAsset
from vllm.multimodal.image import convert_image_mode


models = ["Qwen/Qwen2.5-Omni-7B"]
Please change it to the 3B model to make the test lighter.
Yeah. It will be done later. Thanks. I'm still trying to fix the OOM bug on NPU when running this test.
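For reference, the eventual switch amounts to a one-line change to the parametrized model list (a sketch; the 3B repository id is assumed to follow the same Hugging Face naming pattern as the 7B checkpoint):

```python
# Lighter checkpoint for CI; "Qwen/Qwen2.5-Omni-3B" is the assumed 3B repo id
models = ["Qwen/Qwen2.5-Omni-3B"]
```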
It's ready on GPU. Let me fix the OOM error on NPU.
NPU works now:
BTW, I think we need a different stage YAML config for the GPU CI's e2e test, because the default CI machine is an L4 24GB GPU. Here is the YAML I used to run it, Qwen2_5_omni_3b_RTX3090.yaml:

# stage config for running qwen2.5-omni with architecture of OmniLLM.
# The following config has been verified on 1x 24GB RTX3090 GPU.
stage_args:
  - stage_id: 0
    runtime:
      process: true # Run this stage in a separate process
      devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device)
      max_batch_size: 1
    engine_args:
      model_stage: thinker
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
      scheduler_cls: vllm_omni.core.sched.scheduler.OmniScheduler
      max_model_len: 8192
      max_num_seqs: 2
      gpu_memory_utilization: 0.55
      enforce_eager: true # Now we only support eager mode
      trust_remote_code: true
      engine_output_type: latent
      enable_prefix_caching: false
      is_comprehension: true
      final_output: true
      final_output_type: text
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.1
  - stage_id: 1
    runtime:
      process: true
      devices: "0"
      max_batch_size: 1
    engine_args:
      model_stage: talker
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
      scheduler_cls: vllm_omni.core.sched.scheduler.OmniScheduler
      max_model_len: 8192
      max_num_seqs: 2
      gpu_memory_utilization: 0.32
      enforce_eager: true
      trust_remote_code: true
      enable_prefix_caching: false
      engine_output_type: latent
      engine_input_source: [0]
      custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker
    default_sampling_params:
      temperature: 0.9
      top_p: 0.8
      top_k: 40
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
      stop_token_ids: [8294]
  - stage_id: 2
    runtime:
      process: true
      devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU
      max_batch_size: 1
    engine_args:
      model_stage: code2wav
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_cls: vllm_omni.worker.gpu_diffusion_worker.GPUDiffusionWorker
      scheduler_cls: vllm_omni.core.sched.diffusion_scheduler.DiffusionScheduler
      gpu_memory_utilization: 0.125
      enforce_eager: true
      trust_remote_code: true
      enable_prefix_caching: false
      engine_output_type: audio
      engine_input_source: [1]
      final_output: true
      final_output_type: audio
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.1

# Top-level runtime config (concise): default windows and stage edges
runtime:
  enabled: true
  defaults:
    window_size: -1 # Simplified: trigger downstream only after full upstream completion
    max_inflight: 1 # Simplified: process serially within each stage
  edges:
    - from: 0 # thinker → talker: trigger only after receiving full input (-1)
      to: 1
      window_size: -1
    - from: 1 # talker → code2wav: trigger only after receiving full input (-1)
      to: 2
      window_size: -1
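As a quick sanity check on the single-GPU budget implied by the config above (illustrative arithmetic only, not project tooling), the three stages nearly saturate the 24 GB card:

```python
# gpu_memory_utilization fractions taken from the stage config above
stage_fractions = {"thinker": 0.55, "talker": 0.32, "code2wav": 0.125}
total = sum(stage_fractions.values())
print(f"total budget: {total:.3f}")  # 0.995 of the visible device
assert total < 1.0, "stages would oversubscribe the single GPU"
```

This leaves only a thin margin for activations and the profile run, which is consistent with the OOM discussion below.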
Thanks. Added the CI stage config now.
tests/omni/test_qwen_omni.py
Outdated
@pytest.mark.core_model
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize("max_tokens", [2048])
def test_mixed_modalities_to_audio(omni_runner, model: str, max_tokens: int) -> None:
Suggested change:
- def test_mixed_modalities_to_audio(omni_runner, model: str, max_tokens: int) -> None:
+ def test_mixed_modalities_to_audio(omni_runner: type[OmniRunner], model: str, max_tokens: int) -> None:
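For context, annotating the fixture as `type[OmniRunner]` matches the pattern where the fixture yields the runner class and each test instantiates it itself. A minimal sketch of that usage, with the constructor arguments and context-manager behaviour treated as assumptions rather than the project's actual API:

```python
# Sketch only: OmniRunner's real constructor and methods may differ from this.
def test_mixed_modalities_to_audio(omni_runner: "type[OmniRunner]", model: str, max_tokens: int) -> None:
    with omni_runner(model) as runner:  # fixture yields the class; the test builds its own instance
        ...  # feed mixed image/video/audio prompts and assert on the generated audio/text
```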
Isotr0py left a comment
Confirmed the test can pass on RTX 3090 as well:
(EngineCore_DP0 pid=1413214) INFO 12-05 00:08:10 [__init__.py:381] Cudagraph is disabled under eager mode
INFO:vllm_omni.entrypoints.omni_llm:[Orchestrator] Stage-2 reported ready
INFO:vllm_omni.entrypoints.omni_llm:[Orchestrator] All stages initialized successfully
[Stage-0] Max batch size: 1
--------------------------------
[Stage-0] Received batch size=1, request_ids=[0]
--------------------------------
[Stage-0] Generate done: batch=1, req_ids=[0], gen_ms=6246.4
[Stage-1] Max batch size: 1
--------------------------------
[Stage-1] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=1412210) /home/mozf/develop-projects/vllm-omni/vllm_omni/worker/gpu_model_runner.py:207: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=1412210) info_dict[k] = torch.from_numpy(arr)
[Stage-1] Generate done: batch=1, req_ids=[0], gen_ms=17460.2
[Stage-2] Max batch size: 1
--------------------------------
[Stage-2] Received batch size=1, request_ids=[0]
--------------------------------
(EngineCore_DP0 pid=1413214) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
[Stage-2] Generate done: batch=1, req_ids=[0], gen_ms=15025.2
INFO:vllm_omni.entrypoints.omni_llm:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 39295.241832733154, 'e2e_sum_time_ms': 39294.82388496399, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 39294.82388496399, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 39295.241832733154, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 72, 'total_time_ms': 6459.305763244629, 'avg_time_per_request_ms': 6459.305763244629, 'avg_tokens_per_s': 11.146708739149865}, {'stage_id': 1, 'requests': 1, 'tokens': 1154, 'total_time_ms': 17631.80184364319, 'avg_time_per_request_ms': 17631.80184364319, 'avg_tokens_per_s': 65.4499188587497}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 15033.496141433716, 'avg_time_per_request_ms': 15033.496141433716, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 45764674, 'total_time_ms': 97.99075126647949, 'tx_mbps': 3736.2443625354754, 'rx_samples': 1, 'rx_total_bytes': 45764674, 'rx_total_time_ms': 124.15766716003418, 'rx_mbps': 2948.810173181569, 'total_samples': 1, 'total_transfer_time_ms': 223.24252128601074, 'total_mbps': 1639.9984639617237}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 3486, 'total_time_ms': 0.5724430084228516, 'tx_mbps': 48.717513516034984, 'rx_samples': 1, 'rx_total_bytes': 3486, 'rx_total_time_ms': 0.07987022399902344, 'rx_mbps': 349.1664177671642, 'total_samples': 1, 'total_transfer_time_ms': 1.8804073333740234, 'total_mbps': 14.830829206542411}]}
[rank0]:[W1205 00:08:51.904617626 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1205 00:08:51.907913413 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1205 00:08:51.911772229 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
PASSED
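Reading the per-stage numbers out of the [Summary] line above, the three stage times account for almost all of the end-to-end wall time; the small remainder is inter-stage transfer and orchestration overhead:

```python
# Per-stage generation times from the summary log above (milliseconds)
stage_ms = {"thinker": 6459.3, "talker": 17631.8, "code2wav": 15033.5}
print(sum(stage_ms.values()))  # ~39124.6 ms vs. ~39295.2 ms e2e wall time
```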
OOM happens. Should we continue to reduce gpu_memory_utilization?
The OOM is happening at mm profiling during the profile run; let's skip mm profiling for the single-GPU test for now.
The omni e2e test finally passes now. 😭
Gaohan123 left a comment
LGTM. It's great that it can run with such limited resources!
Updated the class names. Hope this will be the final commit. 🥹
Purpose
Related to #165. Add Qwen2.5-Omni model test with OmniRunner.
Test Plan
pytest -sv tests/omni/test_qwen_omni.py
Test Result
Pass.
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)