Releases · intel/llm-scaler
llm-scaler-vllm PV release 1.2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.2
- Offline Installer: offline-installer:25.45.5.4
Ingredients
| Ingredients | Version |
|---|---|
| Host OS | Ubuntu 25.04 Desktop/Server |
| vLLM | 0.10.2 |
| PyTorch | 2.8.0 |
| oneAPI | 2025.1.3-7 |
| oneCCL | 15.6.2 |
| UMD Driver | 25.40.35563.7 |
| KMD Driver | 6.14.0-1008-intel |
| GuC Firmware | 70.45.2 |
| XPU Manager | 1.3.3 |
| Offline Installer | 25.45.5.4 |
What’s new
- vLLM:
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 support
- Enable Whisper transcription models
- Fix MiniCPM-V-4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for the decoding phase across all workloads (>10% end-to-end throughput improvement on 10+ models across all input/output sequence lengths)
- gpt-oss 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations: 2.6x end-to-end output throughput improvement for Qwen3-30B-A3B; 1.5x for DeepSeek-V2-Lite
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output (see the sketch after this list)
- fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable improvement at small batch sizes
- Bug fixes
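Below is a minimal sketch of exercising the new structured output support through the OpenAI-compatible server. The endpoint URL, API key, and model name are illustrative assumptions, not values from this release.

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server
# (base URL, API key, and model name below are illustrative assumptions).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Return the release version and date as JSON."}],
    # Constrain decoding to valid JSON via the structured output feature.
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```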
Known issues
- Crash during initialization with 2DP x 4TP configuration.
- Status: Scheduled to be fixed in release b7.
- Abnormal output (excessive "!!!") observed during JMeter stress testing.
- Status: Scheduled to be fixed in release b7.
- UR_ERROR_DEVICE_LOST occurs due to excessive preemption under high load.
- Description: Requests exceeding server capacity trigger frequent preemption, eventually leading to device loss.
- Workaround: Temporarily mitigate by increasing the number of GPU blocks (set a higher gpu_memory_utilization) or adjusting (typically lowering) the --max-num-seqs parameter; see the vllm.LLM sketch after this list.
- An abnormal value for gpu_blocks_num causes performance degradation when running large batches with gpt-oss-120b.
- Description: The vLLM profile_run() logic causes the KV cache's gpu_blocks_num to decrease, leading to a performance drop.
- Workaround (hotfix): Temporarily update the profile_run() logic to allow a larger and more efficient KV cache:
File: /usr/local/lib/python3.12/dist-packages/vllm-0.10.3.dev0+g01efc7ef7.d20251125.xpu-py3.12-linux-x86_64.egg/vllm/v1/worker/xpu_worker.py (line 107)
```python
# From:
used_memory = torch.xpu.memory_reserved()
# To:
used_memory = torch.xpu.memory_allocated()
```
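As referenced in the preemption issue above, here is a minimal sketch of that workaround using the offline vllm.LLM API; the model name and values are illustrative assumptions, and the equivalent server-side flags are --gpu-memory-utilization and --max-num-seqs.

```python
from vllm import LLM

# Give the KV cache more GPU memory and cap the number of in-flight
# sequences so heavy load does not trigger constant preemption.
# Model name and values are illustrative assumptions.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    gpu_memory_utilization=0.95,  # more GPU blocks for the KV cache
    max_num_seqs=64,              # limit concurrent sequences
)
```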
llm-scaler-omni beta release 0.1.0-b4
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b4
What’s new
- omni:
- Added SGLang Diffusion support; 10% performance improvement for ComfyUI in the single-card scenario
- Added ComfyUI workflows for Hunyuan-Video-1.5 (T2V, I2V, multi-B60) and Z-Image
llm-scaler-vllm beta release 0.10.2-b6
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.2-b6
What’s new
- vLLM:
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 support
- Enable Whisper transcription models
- Fix MiniCPM-V-4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for the decoding phase across all workloads (>10% end-to-end throughput improvement on 10+ models across all input/output sequence lengths)
- gpt-oss 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations: 2.6x end-to-end output throughput improvement for Qwen3-30B-A3B; 1.5x for DeepSeek-V2-Lite
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
- fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable improvement at small batch sizes
- Bug fixes
llm-scaler-omni beta release 0.1.0-b3
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b3
What’s new
- omni:
- More workflows support:
- Hunyuan 3D 2.1
- ControlNet on SD3.5, FLUX.1, etc.
- Multi-XPU support for Wan 2.2 I2V 14B Rapid AIO
- AnimateDiff Lightning
- Add Windows installation
llm-scaler-vllm beta release 0.10.2-b5
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.2-b5
What’s new
- vLLM:
- Enable Qwen3-VL series models
- Enable Qwen3-Omni series models
- Add gpt-oss models support
llm-scaler-omni beta release 0.1.0-b2
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b2
What’s new
- omni:
- Fix issues
- Fix ComfyUI interpolate issue.
- Fix Xinference XPU index selection issue.
- Support more workflows
- ComfyUI
- Wan2.2-Animate-14B basic workflow
- Qwen-Image-Edit 2509 workflow
- VoxCPM workflow
- Xinference
- Kokoro-82M-v1.1-zh
llm-scaler-vllm beta release 1.1-preview
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.1-preview
(functionally equivalent to intel/llm-scaler-vllm:0.10.0-b2)
What’s new
- vLLM:
- Bug fix for sym_int4 online quantization on Multi-modal models
llm-scaler-omni beta release 0.1.0-b1
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b1
What’s new
- omni:
- Integrated ComfyUI on XPU and provided sample workflows for:
- Wan2.2 TI2V 5B
- Wan2.2 T2V 14B (multi-XPU supported)
- FLUX.1 dev
- FLUX.1 Kontext dev
- Stable Diffusion 3.5 large
- Qwen Image, Qwen Image Edit, etc.
- Added support for xDiT, Yunchang, and Raylight on XPU.
- Integrated Xinference with OpenAI-compatible APIs (see the sketch after this list) to provide:
- TTS: Kokoro-82M
- STT: Whisper Large v3
- T2I: Stable Diffusion 3.5 Medium
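A minimal sketch of calling the STT model through the OpenAI-compatible API, assuming Xinference is serving Whisper Large v3 on its default port 9997; the endpoint, model name, and audio file are illustrative assumptions.

```python
from openai import OpenAI

# Xinference exposes an OpenAI-compatible API; the endpoint and model name
# below are illustrative assumptions.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

with open("sample.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )
print(transcript.text)
```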
llm-scaler-vllm beta release 0.10.0-b4
llm-scaler-vllm beta release 0.10.0-b3
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.0-b3
What’s new
- vLLM:
- Support Seed-OSS models
- Add MinerU support
- Enable MiniCPM-V-4_5
- Fix internvl_3_5 and deepseek-v2-lite errors