vLLM Block FP8 Kernel Tuning Tool

🌐 English | δΈ­ζ–‡

Automated Triton w8a8 block FP8 kernel tuning tool for vLLM. Automatically detects model architecture and optimizes kernel configurations for maximum performance.

Features

  • 🎯 Model Auto-Detection: Automatically extracts weight shapes from HuggingFace model configs
  • πŸ”„ Multi-GPU Support: Parallel tuning across multiple GPUs for faster results (see the sketch after this list)
  • πŸ“Š Flexible Configuration: Support for different TP sizes, block sizes, and batch sizes
  • πŸš€ Preset Scripts: Quick-start scripts for popular models (Qwen3, DeepSeek-V3)
  • βœ… Environment Check: Pre-flight checks to ensure environment readiness
  • πŸ”’ Multiple Quantization Methods: Support for FP8 (W8A8), INT8 (W8A8), and AWQ (W4A16)
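
The tool parallelizes tuning across GPUs on its own; a complementary pattern, if you want to tune several models at once, is to pin one tuning process per GPU via CUDA_VISIBLE_DEVICES. A minimal sketch (the job list and TP size are illustrative, not part of this repo):

import os
import subprocess

# Illustrative job list: (model, GPU index to pin the process to)
jobs = [
    ("Qwen/Qwen3-MoE-A14.5B-Chat", "0"),
    ("Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8", "1"),
]

procs = []
for model, gpu in jobs:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)  # one GPU per process
    procs.append(subprocess.Popen(
        ["python", "benchmark_w8a8_block_fp8.py",
         "--model", model, "--tp-size", "1", "--input-type", "fp8"],
        env=env,
    ))

for p in procs:
    p.wait()  # block until all tuning runs finish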

Quick Start

1. Environment Check

bash scripts/environment_check.sh
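
If you prefer to verify the environment by hand, a minimal check along the same lines (assumes PyTorch and vLLM are already installed):

import torch

assert torch.cuda.is_available(), "no CUDA device visible"
print("GPUs:", torch.cuda.device_count())
print("Compute capability:", torch.cuda.get_device_capability(0))

import vllm  # raises ImportError if vLLM is not on the Python path
print("vLLM version:", vllm.__version__)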

2. Single Model Tuning

# Qwen3-Coder-30B-A3B-Instruct-FP8 (optimized preset)
bash scripts/tune_qwen3_coder.sh

# Qwen3 models (auto-detects shapes)
bash scripts/tune_qwen3.sh Qwen/Qwen3-MoE-A14.5B-Chat 4

# DeepSeek-V3 (uses default shapes)
bash scripts/tune_deepseek_v3.sh 8

# Custom model (auto-detects shapes)
bash scripts/tune_custom.sh your-model-name 2

3. Use Python Directly

# Qwen3-Coder-30B-A3B-Instruct-FP8 (with auto-detection)
python benchmark_w8a8_block_fp8.py --model Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --tp-size 4 --input-type fp8 --trust-remote-code

# Qwen3 models (auto-detects shapes)
python benchmark_w8a8_block_fp8.py --model Qwen/Qwen3-MoE-A14.5B-Chat --tp-size 4 --input-type fp8

# DeepSeek-V3 (default shapes)
python benchmark_w8a8_block_fp8.py --tp-size 8 --input-type fp8

Configuration

Command Line Arguments

| Argument | Description | Default |
|---|---|---|
| --model | HuggingFace model identifier (auto-detects shapes) | None (uses DeepSeek-V3) |
| --tp-size | Tensor parallelism size | 8 |
| --input-type | Input quantization type | fp8 |
| --out-dtype | Output data type | float16 |
| --block-n | Block size N for quantization | 128 |
| --block-k | Block size K for quantization | 128 |
| --batch-size | Single batch size to test (default: test all) | None |
| --save-path | Directory to save tuned configs | ./tuned_configs |
| --trust-remote-code | Trust remote code when loading model | False |

Supported Models

The tool automatically detects weight shapes for:

  • Qwen3 Series: Qwen3-MoE, Qwen3-Next, Qwen3-Coder, Qwen3-VL, Qwen3-Omni models
  • DeepSeek-V3: Default shapes built-in
  • Generic Transformers: Auto-detects from hidden_size and intermediate_size (see the sketch below)

For unsupported models, the tool falls back to DeepSeek-V3 default shapes.
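
For the generic case, detection is conceptually similar to the sketch below, which reads hidden_size and intermediate_size from the HuggingFace config. The shape formulas are an illustrative assumption (typical fused gate+up and down projections under tensor parallelism); the tool's exact derivation may differ:

from transformers import AutoConfig

def guess_gemm_shapes(model_id: str, tp_size: int):
    """Illustrative only: derive (N, K) GEMM shapes from a HF config."""
    cfg = AutoConfig.from_pretrained(model_id)
    hidden, inter = cfg.hidden_size, cfg.intermediate_size
    return [
        (2 * inter // tp_size, hidden),  # fused gate+up projection weight
        (hidden, inter // tp_size),      # down projection weight
    ]

print(guess_gemm_shapes("your-model-name", tp_size=2))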

Available Tuning Scripts

1. W8A8 Block FP8 (benchmark_w8a8_block_fp8.py)

  • Quantization: FP8 weights and activations (W8A8)
  • Use Case: High-performance inference with FP8-capable GPUs (Ada/Hopper)
  • Requirements: GPU compute capability >= 8.9 (see the capability check below)

2. W8A8 Block INT8 (benchmark_w8a8_block_int8.py)

  • Quantization: INT8 weights and activations (W8A8)
  • Use Case: Memory-efficient inference on GPUs with compute capability >= 7.5
  • Requirements: NVIDIA GPU (Turing/Ampere/Ada/Hopper)

3. AWQ W4A16 (benchmark_awq_w4a16.py)

  • Quantization: 4-bit weights, 16-bit activations (W4A16)
  • Use Case: Maximum memory savings with good performance
  • Requirements: GPU compute capability >= 7.5
  • Documentation: See README_AWQ.md for detailed usage
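
A quick way to see which of these requirements your GPU meets (PyTorch reports compute capability as a (major, minor) tuple):

import torch

major, minor = torch.cuda.get_device_capability(0)
cap = major + minor / 10
print(f"Compute capability: {major}.{minor}")
print(f"FP8 W8A8  (needs >= 8.9): {'yes' if cap >= 8.9 else 'no'}")
print(f"INT8 W8A8 (needs >= 7.5): {'yes' if cap >= 7.5 else 'no'}")
print(f"AWQ W4A16 (needs >= 7.5): {'yes' if cap >= 7.5 else 'no'}")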

Model-Specific Scripts

  • benchmark_w8a8_block_fp8_qwen3_30b.py: Optimized for Qwen3-30B-A3B-Instruct-2507-Int8-W8A16
  • benchmark_w8a8_block_fp8_qwen3omni_talker.py: Optimized for Qwen3-Omni-30B-A3B-Instruct (talker config)

Output

Tuned configurations are saved as JSON files in the format:

N={N},K={K},device_name={device_name},dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json
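
The device_name component is typically the CUDA device name with spaces replaced by underscores, matching how vLLM derives it when looking up configs. A sketch of how one filename is assembled (the N and K values are illustrative):

import torch

N, K = 4096, 7168  # illustrative weight shape
block_n, block_k = 128, 128
device_name = torch.cuda.get_device_name(0).replace(" ", "_")
print(f"N={N},K={K},device_name={device_name},"
      f"dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json")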

Each file contains optimal configurations for different batch sizes:

{
  "1": { "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, ... },
  "64": { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, ... },
  ...
}
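
The keys are batch sizes (M). At runtime, the entry whose batch-size key is nearest to the actual M can be selected; a minimal helper (the nearest-key rule mirrors vLLM's config lookup, stated here as an assumption):

import json

def pick_config(path: str, m: int) -> dict:
    """Return the tuned kernel config whose batch-size key is nearest to m."""
    with open(path) as f:
        configs = {int(k): v for k, v in json.load(f).items()}
    return configs[min(configs, key=lambda k: abs(k - m))]

# e.g. pick_config("tuned_configs/N=...,K=...,device_name=....json", m=37)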

Copying Configs to vLLM

After tuning, copy the generated configs to vLLM:

cp tuned_configs/*.json /path/to/vllm/model_executor/layers/quantization/utils/configs/
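
If vLLM is installed as a package, the target directory can be located from the installed module; a sketch (the path layout matches recent vLLM releases but may change between versions):

import os
import vllm

configs_dir = os.path.join(
    os.path.dirname(vllm.__file__),
    "model_executor", "layers", "quantization", "utils", "configs",
)
print(configs_dir)  # copy tuned_configs/*.json here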

Examples

Example 1: Tune Qwen3-Coder-30B-A3B-Instruct-FP8 with TP=4

bash scripts/tune_qwen3_coder.sh Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 4

Example 1b: Tune Qwen3-MoE with TP=4

bash scripts/tune_qwen3.sh Qwen/Qwen3-MoE-A14.5B-Chat 4

Example 2: Batch Tune Multiple Models

bash examples/tune_qwen3_models.sh
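
The batch script loops over a list of models; the idea, roughly (the model list and flags below are placeholders, not the script's actual contents):

import subprocess

models = [
    "Qwen/Qwen3-MoE-A14.5B-Chat",  # placeholders: substitute your own
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
]

for model in models:
    subprocess.run(
        ["python", "benchmark_w8a8_block_fp8.py",
         "--model", model, "--tp-size", "4", "--input-type", "fp8"],
        check=True,
    )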

Example 3: Custom Block Sizes

python benchmark_w8a8_block_fp8.py \
    --model your-model \
    --tp-size 2 \
    --block-n 64 \
    --block-k 128 \
    --input-type fp8

Example 4: AWQ W4A16 Tuning

# Basic AWQ tuning
python benchmark_awq_w4a16.py \
    --tp-size 1 \
    --group-size 128 \
    --split-k-iters 1 \
    --save-path ./awq_configs

# See README_AWQ.md for detailed AWQ usage

Project Structure

vllm_benchmark_block_fp8/
β”œβ”€β”€ README.md                                     # English documentation
β”œβ”€β”€ README_zh.md                                  # Chinese documentation
β”œβ”€β”€ README_AWQ.md                                 # AWQ W4A16 tuning documentation
β”œβ”€β”€ benchmark_w8a8_block_fp8.py                   # W8A8 Block FP8 tuning script
β”œβ”€β”€ benchmark_w8a8_block_int8.py                  # W8A8 Block INT8 tuning script
β”œβ”€β”€ benchmark_awq_w4a16.py                        # AWQ W4A16 tuning script
β”œβ”€β”€ benchmark_w8a8_block_fp8_qwen3_30b.py         # Qwen3-30B optimized script
β”œβ”€β”€ benchmark_w8a8_block_fp8_qwen3omni_talker.py  # Qwen3-Omni Talker optimized script
β”œβ”€β”€ scripts/                                      # Helper scripts
β”‚   β”œβ”€β”€ environment_check.sh                      # Environment check
β”‚   β”œβ”€β”€ tune_qwen3_coder.sh                       # Qwen3-Coder preset (optimized)
β”‚   β”œβ”€β”€ tune_qwen3.sh                             # Qwen3 preset
β”‚   β”œβ”€β”€ tune_deepseek_v3.sh                       # DeepSeek-V3 preset
β”‚   └── tune_custom.sh                            # Custom model preset
β”œβ”€β”€ configs/                                      # Configuration files
β”‚   └── model_shapes.json                         # Model shape references
└── examples/                                     # Example scripts
    └── tune_qwen3_models.sh                      # Batch tuning example

Prerequisites

  • Python 3.8+
  • PyTorch with CUDA support
  • vLLM installed (importable on the Python path)
  • CUDA-compatible GPU(s)
  • Required vLLM modules: fp8_utils, triton_utils, transformers_utils

License

Apache-2.0 License

Contributing

Contributions are welcome! Please feel free to submit Issues and Pull Requests.


⭐ If this project helps you, please give us a Star!
