Automated Triton W8A8 block FP8 kernel tuning tool for vLLM. It automatically detects the model architecture and tunes kernel configurations for maximum performance.
- **Model Auto-Detection**: Automatically extracts weight shapes from HuggingFace model configs
- **Multi-GPU Support**: Parallel tuning across multiple GPUs for faster results
- **Flexible Configuration**: Supports different TP sizes, block sizes, and batch sizes
- **Preset Scripts**: Quick-start scripts for popular models (Qwen3, DeepSeek-V3)
- **Environment Check**: Pre-flight checks to ensure environment readiness
- **Multiple Quantization Methods**: Supports FP8 (W8A8), INT8 (W8A8), and AWQ (W4A16)
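"Block FP8" here means the weight matrix is quantized in fixed `[block_n, block_k]` tiles, each with its own scale, rather than one scale for the whole tensor. A minimal PyTorch sketch of that idea (an illustration only, not this tool's or vLLM's actual implementation; it assumes the dimensions divide evenly by the block sizes):

```python
# Sketch: per-block FP8 (E4M3) weight quantization with one scale per
# [block_n, block_k] tile. Illustration only, not vLLM's kernel path.
import torch

def quantize_block_fp8(w: torch.Tensor, block_n: int = 128, block_k: int = 128):
    """Quantize a 2-D weight [N, K] to float8_e4m3fn with per-block scales.

    Assumes N % block_n == 0 and K % block_k == 0.
    """
    n, k = w.shape
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    # View the weight as a grid of (block_n x block_k) tiles.
    tiles = w.reshape(n // block_n, block_n, k // block_k, block_k)
    # One scale per tile so the tile's max magnitude maps to fp8_max.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scales = amax / fp8_max
    w_fp8 = (tiles / scales).to(torch.float8_e4m3fn)
    return w_fp8.reshape(n, k), scales.squeeze(1).squeeze(-1)

w = torch.randn(256, 512)
w_fp8, scales = quantize_block_fp8(w)
print(w_fp8.dtype, scales.shape)  # torch.float8_e4m3fn, one scale per 128x128 tile
```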
Check your environment first:

```bash
bash scripts/environment_check.sh
```

Tune with a preset script:

```bash
# Qwen3-Coder-30B-A3B-Instruct-FP8 (optimized preset)
bash scripts/tune_qwen3_coder.sh

# Qwen3 models (auto-detects shapes)
bash scripts/tune_qwen3.sh Qwen/Qwen3-MoE-A14.5B-Chat 4

# DeepSeek-V3 (uses default shapes)
bash scripts/tune_deepseek_v3.sh 8

# Custom model (auto-detects shapes)
bash scripts/tune_custom.sh your-model-name 2
```

Or call the tuning script directly:

```bash
# Qwen3-Coder-30B-A3B-Instruct-FP8 (with auto-detection)
python benchmark_w8a8_block_fp8.py --model Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --tp-size 4 --input-type fp8 --trust-remote-code

# Qwen3 models (auto-detects shapes)
python benchmark_w8a8_block_fp8.py --model Qwen/Qwen3-MoE-A14.5B-Chat --tp-size 4 --input-type fp8

# DeepSeek-V3 (default shapes)
python benchmark_w8a8_block_fp8.py --tp-size 8 --input-type fp8
```

| Argument | Description | Default |
|---|---|---|
| `--model` | HuggingFace model identifier (auto-detects shapes) | None (uses DeepSeek-V3) |
| `--tp-size` | Tensor parallelism size | 8 |
| `--input-type` | Input quantization type | fp8 |
| `--out-dtype` | Output data type | float16 |
| `--block-n` | Block size N for quantization | 128 |
| `--block-k` | Block size K for quantization | 128 |
| `--batch-size` | Single batch size to test (default: test all) | None |
| `--save-path` | Directory to save tuned configs | ./tuned_configs |
| `--trust-remote-code` | Trust remote code when loading the model | False |
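The same flags can be driven from Python when sweeping several settings; a small sketch using only arguments from the table above:

```python
# Sketch: scripting a tuning sweep over TP sizes with subprocess.
# Flag names are taken from the argument table above.
import subprocess

for tp in (2, 4, 8):
    subprocess.run(
        [
            "python", "benchmark_w8a8_block_fp8.py",
            "--model", "Qwen/Qwen3-MoE-A14.5B-Chat",
            "--tp-size", str(tp),
            "--input-type", "fp8",
            "--save-path", f"./tuned_configs/tp{tp}",
        ],
        check=True,  # stop the sweep if a run fails
    )
```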
The tool automatically detects weight shapes for:
- Qwen3 Series: Qwen3-MoE, Qwen3-Next, Qwen3-Coder, Qwen3-VL, Qwen3-Omni models
- DeepSeek-V3: Default shapes built-in
- Generic Transformers: Auto-detected from `hidden_size` and `intermediate_size`
For unsupported models, the tool falls back to DeepSeek-V3 default shapes.
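Conceptually, the detection boils down to reading a few fields from the HuggingFace config. A minimal sketch, assuming a hypothetical `get_weight_shapes` helper and MoE-style fused gate/up and down projections (the attribute names and shape math are illustrative assumptions, not this tool's actual logic):

```python
# Sketch: deriving per-GPU GEMM shapes from a HuggingFace config.
# `get_weight_shapes` is a hypothetical helper, not this tool's API,
# and the attribute names / shape math are illustrative assumptions.
from transformers import AutoConfig

def get_weight_shapes(model: str, tp_size: int, trust_remote_code: bool = False):
    cfg = AutoConfig.from_pretrained(model, trust_remote_code=trust_remote_code)
    hidden = cfg.hidden_size
    # MoE configs often expose a per-expert size; fall back to the dense one.
    inter = getattr(cfg, "moe_intermediate_size", None) or cfg.intermediate_size
    return [
        (2 * inter // tp_size, hidden),  # fused gate/up projection (N, K)
        (hidden, inter // tp_size),      # down projection (N, K)
    ]

print(get_weight_shapes("Qwen/Qwen3-MoE-A14.5B-Chat", tp_size=4))
```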
**FP8 (W8A8)**

- Quantization: FP8 weights and activations (W8A8)
- Use Case: High-performance inference on FP8-capable GPUs (Ada/Hopper)
- Requirements: GPU compute capability >= 8.9

**INT8 (W8A8)**

- Quantization: INT8 weights and activations (W8A8)
- Use Case: Memory-efficient inference on GPUs with compute capability >= 7.5
- Requirements: NVIDIA GPU (Turing/Ampere/Ada/Hopper)

**AWQ (W4A16)**

- Quantization: 4-bit weights, 16-bit activations (W4A16)
- Use Case: Maximum memory savings with good performance
- Requirements: GPU compute capability >= 7.5
- Documentation: See README_AWQ.md for detailed usage
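To check which methods a given GPU supports, query its compute capability with PyTorch (thresholds taken from the requirements above):

```python
# Check the local GPU's compute capability against the thresholds above.
import torch

cap = torch.cuda.get_device_capability(0)  # e.g. (9, 0) on Hopper
print(f"Compute capability: {cap[0]}.{cap[1]}")
print("FP8 (W8A8) supported:", cap >= (8, 9))         # Ada / Hopper
print("INT8 (W8A8) / AWQ supported:", cap >= (7, 5))  # Turing and newer
```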
Model-specific optimized scripts:

- `benchmark_w8a8_block_fp8_qwen3_30b.py`: Optimized for Qwen3-30B-A3B-Instruct-2507-Int8-W8A16
- `benchmark_w8a8_block_fp8_qwen3omni_talker.py`: Optimized for Qwen3-Omni-30B-A3B-Instruct (talker config)
Tuned configurations are saved as JSON files named:

```
N={N},K={K},device_name={device_name},dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json
```
Each file contains optimal configurations for different batch sizes:
```json
{
  "1": { "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, ... },
  "64": { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, ... },
  ...
}
```
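At runtime the consumer picks the entry whose batch-size key is closest to the actual M dimension. A standalone sketch of that lookup (the file name is illustrative, and this mirrors rather than reproduces vLLM's selection logic):

```python
# Sketch: selecting a tuned kernel config for a given batch size (M).
# Nearest-key selection is an assumption about how these files are used;
# the file name below is illustrative, not a real tuned artifact.
import json

def pick_config(path: str, m: int) -> dict:
    with open(path) as f:
        configs = json.load(f)
    # Keys are batch sizes stored as strings; choose the closest one.
    nearest = min(configs, key=lambda k: abs(int(k) - m))
    return configs[nearest]

cfg = pick_config(
    "tuned_configs/N=4096,K=512,device_name=NVIDIA_H100_80GB_HBM3,"
    "dtype=fp8_w8a8,block_shape=[128,128].json",
    m=48,
)
print(cfg["BLOCK_SIZE_M"], cfg["BLOCK_SIZE_N"])
```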
After tuning, copy the generated configs to vLLM:

```bash
cp tuned_configs/*.json /path/to/vllm/model_executor/layers/quantization/utils/configs/
```

Example preset invocations:

```bash
bash scripts/tune_qwen3_coder.sh Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 4
bash scripts/tune_qwen3.sh Qwen/Qwen3-MoE-A14.5B-Chat 4
bash examples/tune_qwen3_models.sh
```

Custom tuning with explicit block sizes:

```bash
python benchmark_w8a8_block_fp8.py \
    --model your-model \
    --tp-size 2 \
    --block-n 64 \
    --block-k 128 \
    --input-type fp8
```

```bash
# Basic AWQ tuning
python benchmark_awq_w4a16.py \
    --tp-size 1 \
    --group-size 128 \
    --split-k-iters 1 \
    --save-path ./awq_configs
# See README_AWQ.md for detailed AWQ usage
```

```
vllm_benchmark_block_fp8/
├── README.md                                     # English documentation
├── README_zh.md                                  # Chinese documentation
├── README_AWQ.md                                 # AWQ W4A16 tuning documentation
├── benchmark_w8a8_block_fp8.py                   # W8A8 Block FP8 tuning script
├── benchmark_w8a8_block_int8.py                  # W8A8 Block INT8 tuning script
├── benchmark_awq_w4a16.py                        # AWQ W4A16 tuning script
├── benchmark_w8a8_block_fp8_qwen3_30b.py         # Qwen3-30B optimized script
├── benchmark_w8a8_block_fp8_qwen3omni_talker.py  # Qwen3-Omni Talker optimized script
├── scripts/                                      # Helper scripts
│   ├── environment_check.sh                      # Environment check
│   ├── tune_qwen3_coder.sh                       # Qwen3-Coder preset (optimized)
│   ├── tune_qwen3.sh                             # Qwen3 preset
│   ├── tune_deepseek_v3.sh                       # DeepSeek-V3 preset
│   └── tune_custom.sh                            # Custom model preset
├── configs/                                      # Configuration files
│   └── model_shapes.json                         # Model shape references
└── examples/                                     # Example scripts
    └── tune_qwen3_models.sh                      # Batch tuning example
```
- Python 3.8+
- PyTorch with CUDA support
- vLLM installed (must be in Python path)
- CUDA-compatible GPU(s)
- Required vLLM modules:
  `fp8_utils`, `triton_utils`, `transformers_utils`
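A quick way to verify these imports resolve in your environment (the module paths below are assumptions based on vLLM's layout and can shift between versions):

```python
# Pre-flight sanity check that the required vLLM modules import cleanly.
# Module paths are assumptions and may differ across vLLM versions.
for mod in (
    "vllm.model_executor.layers.quantization.utils.fp8_utils",
    "vllm.triton_utils",
    "vllm.transformers_utils",
):
    try:
        __import__(mod)
        print(f"OK      {mod}")
    except ImportError as e:
        print(f"MISSING {mod}: {e}")
```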
This project is licensed under the Apache-2.0 License.
Contributions are welcome! Please feel free to submit Issues and Pull Requests.
If this project helps you, please give it a Star!