.buildkite/lm-eval-harness/configs/Qwen3-235B-A22B-Thinking-2507-FP8.yaml
@@ -0,0 +1,10 @@
model_name: "Qwen/Qwen3-235B-A22B-Thinking-2507-FP8"
tasks:
- name: "mmlu_pro"
metrics:
- name: "exact_match,custom-extract"
value: 0.77
Collaborator
How did you generate these reference values? Are they similar to what's reported?

Contributor Author @hl475 (Oct 21, 2025)
The reported value is 84.4 (from https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8). Here we are using some additional parameters (num_fewshot, limit, max_model_len, gen_kwargs), which is why the reference value differs.

num_fewshot: 5
limit: 250 # will run on 250 * 14 subjects = 3500 samples
max_model_len: 8096
gen_kwargs: "top_p=1,top_k=0,max_gen_toks=1536"
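The reply above explains why the stored reference (0.77) sits well below the reported 84.4: the CI run uses a reduced sample limit and constrained generation settings. As a hedged sketch (not the actual harness code; the tolerance value, function name, and result-dict shape are all assumptions for illustration), the reference values in these YAML configs might be checked against a fresh run like this:

```python
# Illustrative sketch only: compare measured lm-eval metrics against the
# reference values stored in a config like the YAML above. RTOL and the
# shape of `measured` are assumptions, not taken from the vLLM test code.
eval_config = {
    "model_name": "Qwen/Qwen3-235B-A22B-Thinking-2507-FP8",
    "tasks": [
        {
            "name": "mmlu_pro",
            "metrics": [
                {"name": "exact_match,custom-extract", "value": 0.77},
            ],
        },
    ],
}

RTOL = 0.08  # assumed relative tolerance


def within_tolerance(measured: dict) -> bool:
    """Return True if every measured metric is within RTOL of its reference."""
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ref = metric["value"]
            got = measured[task["name"]][metric["name"]]
            if abs(got - ref) > RTOL * ref:
                return False
    return True


# A run scoring 0.75 would pass against the 0.77 reference at this tolerance.
print(within_tolerance({"mmlu_pro": {"exact_match,custom-extract": 0.75}}))
```

A relative (rather than absolute) tolerance keeps the check meaningful across tasks whose reference scores differ in magnitude.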
10 changes: 10 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen3-8B.yaml
@@ -0,0 +1,10 @@
model_name: "Qwen/Qwen3-8B"
tasks:
- name: "mmlu_pro"
metrics:
- name: "exact_match,custom-extract"
value: 0.60
num_fewshot: 5
limit: 250 # will run on 250 * 14 subjects = 3500 samples
max_model_len: 8096
gen_kwargs: "top_p=1,top_k=0,max_gen_toks=1536"
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large-h100.txt
@@ -1 +1,2 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml
Qwen3-235B-A22B-Thinking-2507-FP8.yaml
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,6 @@
Qwen2.5-1.5B-Instruct.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
Qwen1.5-MoE-W4A16-compressed-tensors.yaml
Qwen3-8B.yaml
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -40,6 +40,7 @@ def launch_lm_eval(eval_config, tp_size):
# existing text models in CI, so only apply it for mm.
apply_chat_template=backend == "vllm-vlm",
batch_size=batch_size,
gen_kwargs=eval_config.get("gen_kwargs", None),
)
return results
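The one-line change above threads an optional gen_kwargs string from the YAML config through to lm-eval, defaulting to None when a config omits it. As an illustrative sketch (the helper name is hypothetical, not from the harness), a comma-separated key=value string like the one in these configs could be parsed into keyword arguments as follows:

```python
# Hypothetical helper for illustration: turn a gen_kwargs string such as
# "top_p=1,top_k=0,max_gen_toks=1536" into a dict, coercing numeric values.
def parse_gen_kwargs(s):
    if s is None:
        return {}  # config omitted gen_kwargs; use harness defaults
    out = {}
    for pair in s.split(","):
        key, _, raw = pair.partition("=")
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw  # leave non-numeric values as strings
        out[key.strip()] = value
    return out


print(parse_gen_kwargs("top_p=1,top_k=0,max_gen_toks=1536"))
# → {'top_p': 1, 'top_k': 0, 'max_gen_toks': 1536}
```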

13 changes: 12 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -1089,7 +1089,7 @@ steps:
- tests/weight_loading
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt

- label: NixlConnector PD accuracy tests (Distributed) # 30min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/tests"
@@ -1145,6 +1145,17 @@ steps:
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048

- label: LM Eval Large Models (H200) # optional
gpu: h200
optional: true
num_gpus: 4
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
commands:
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large-h100.txt --tp-size=4
Collaborator

nit: models-large-hopper.txt

Contributor Author @hl475

The txt file is named models-large-h100.txt. Do you suggest we rename it to models-large-hopper.txt instead?


##### B200 test #####
- label: Distributed Tests (B200) # optional
gpu: b200