Releases: intel/llm-scaler

llm-scaler-vllm PV release 1.2

11 Dec 08:24
426cf68

Resources

Ingredients          Version
Host OS              Ubuntu 25.04 Desktop/Server
vLLM                 0.10.2
PyTorch              2.8.0
oneAPI               2025.1.3-7
oneCCL               15.6.2
UMD Driver           25.40.35563.7
KMD Driver           6.14.0-1008-intel
GuC Firmware         70.45.2
XPU Manager          1.3.3
Offline Installer    25.45.5.4

What’s new

  • vLLM:
    • MoE-Int4 support for Qwen3-30B-A3B
    • bpe-qwen tokenizer support
    • Enable Qwen3-VL Dense/MoE models
    • Enable Qwen3-Omni models
    • MinerU 2.5 support
    • Enable Whisper transcription models (see the transcription example after this list)
    • Fix MiniCPM-V-4.5 OOM issue and output errors
    • Enable ERNIE-4.5-vl models
    • Enable Glyph-based GLM-4.1V-9B-Base
    • Attention kernel optimizations for the decode phase across all workloads (>10% e2e throughput gain on 10+ models across all input/output sequence lengths)
    • gpt-oss 20B and 120B support in MXFP4 with optimized performance
    • MoE model optimizations: 2.6x e2e output-throughput improvement for Qwen3-30B-A3B and 1.5x for DeepSeek-V2-Lite
    • New models: added 8 multi-modal models with image/video support
    • vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
    • fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable gains at small batch sizes
    • Bug fixes
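
For the Whisper transcription support above, a minimal sketch of calling the server's OpenAI-compatible transcription endpoint (the port, API key, model name, and audio file are assumptions, not values from this release):

    # Assumes llm-scaler-vllm is serving a Whisper model locally, e.g.:
    #   vllm serve openai/whisper-large-v3 --port 8000
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    with open("sample.wav", "rb") as audio:
        result = client.audio.transcriptions.create(
            model="openai/whisper-large-v3",  # must match the served model
            file=audio,
        )
    print(result.text)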

Known issues

  • Crash during initialization with 2DP x 4TP configuration.
    • Status: Scheduled to be fixed in release b7.
  • Abnormal output (excessive "!!!") observed during JMeter stress testing.
    • Status: Scheduled to be fixed in release b7.
  • UR_ERROR_DEVICE_LOST occurs due to excessive preemption under high load.
    • Description: Requests exceeding server capacity trigger frequent preemption, eventually leading to device loss.
    • Workaround: Temporarily mitigate by increasing the number of GPU blocks (set a higher gpu_memory_utilization) or adjusting the --max-num-seqs parameter; see the example after this list.
  • An abnormal gpu_blocks_num value causes performance degradation when running large batches with gpt-oss-120b.
    • Description: The vLLM profile_run() logic causes the KV cache's gpu_blocks_num to decrease, leading to a performance drop.
    • Workaround (hotfix): Temporarily update the profile_run() logic to allow a larger, more efficient KV cache:
      File: /usr/local/lib/python3.12/dist-packages/vllm-0.10.3.dev0+g01efc7ef7.d20251125.xpu-py3.12-linux-x86_64.egg/vllm/v1/worker/xpu_worker.py
      Line: 107

      # From:
      used_memory = torch.xpu.memory_reserved()
      # To:
      used_memory = torch.xpu.memory_allocated()
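
For the preemption workaround above, a minimal sketch of raising the KV-cache budget and capping concurrency via vLLM's offline API (the model name and values are placeholders, not tuned recommendations; the same parameters are exposed as --gpu-memory-utilization and --max-num-seqs on the serve CLI):

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",   # placeholder model
        gpu_memory_utilization=0.95,  # more GPU blocks for the KV cache
        max_num_seqs=128,             # cap concurrent sequences to reduce preemption
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    print(out[0].outputs[0].text)

On the hotfix above: torch.xpu.memory_reserved() also counts memory the caching allocator holds but is not currently using, while torch.xpu.memory_allocated() counts only live tensors, so the patched profile_run() reports less used memory and sizes a larger KV cache.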

llm-scaler-omni beta release 0.1.0-b4

10 Dec 01:15
f0019a1

Pre-release

What’s new

  • omni:

    • Added SGLang Diffusion support; 10% performance improvement for ComfyUI in the single-card scenario
    • Added ComfyUI workflows for Hunyuan-Video-1.5 (T2V, I2V, Multi-B60) and Z-Image

llm-scaler-vllm beta release 0.10.2-b6

26 Nov 07:22

Pre-release

What’s new

  • vLLM:

    • MoE-Int4 support for Qwen3-30B-A3B
    • bpe-qwen tokenizer support
    • Enable Qwen3-VL Dense/MoE models
    • Enable Qwen3-Omni models
    • MinerU 2.5 support
    • Enable Whisper transcription models
    • Fix MiniCPM-V-4.5 OOM issue and output errors
    • Enable ERNIE-4.5-vl models
    • Enable Glyph-based GLM-4.1V-9B-Base
    • Attention kernel optimizations for the decode phase across all workloads (>10% e2e throughput gain on 10+ models across all input/output sequence lengths)
    • gpt-oss 20B and 120B support in MXFP4 with optimized performance
    • MoE model optimizations: 2.6x e2e output-throughput improvement for Qwen3-30B-A3B and 1.5x for DeepSeek-V2-Lite
    • New models: added 8 multi-modal models with image/video support
    • vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output (see the structured-output example after this list)
    • fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable gains at small batch sizes
    • Bug fixes
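
For the structured-output support above, a minimal sketch using vLLM's OpenAI-compatible extension for JSON-schema-constrained generation (host, port, and model name are assumptions):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    schema = {
        "type": "object",
        "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
        "required": ["city", "population"],
    }
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",  # placeholder; must match the served model
        messages=[{"role": "user", "content": "Name a city and its population as JSON."}],
        extra_body={"guided_json": schema},  # vLLM structured-output extension
    )
    print(resp.choices[0].message.content)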

llm-scaler-omni beta release 0.1.0-b3

19 Nov 02:47
5696df8

Pre-release

What’s new

  • omni:

    • Support for more workflows:
      • Hunyuan 3D 2.1
      • ControlNet on SD3.5, FLUX.1, etc.
      • Multi-XPU support for Wan 2.2 I2V 14B Rapid AIO
      • AnimateDiff Lightning
    • Added Windows installation support

llm-scaler-vllm beta release 0.10.2-b5

04 Nov 08:07
e1b6411

Pre-release

What’s new

  • vLLM:

    • Enable Qwen3-VL series models
    • Enable Qwen3-Omni series models
    • Add gpt-oss model support

llm-scaler-omni beta release 0.1.0-b2

21 Oct 01:31
690807f

Pre-release

What’s new

  • omni:

    • Fix issues
      • Fix ComfyUI interpolate issue.
      • Fix Xinference XPU index selection issue.
    • Support more workflows
      • ComfyUI
        • Wan2.2-Animate-14B basic workflow
        • Qwen-Image-Edit 2509 workflow
        • VoxCPM workflow
      • Xinference
        • Kokoro-82M-v1.1-zh

llm-scaler-vllm beta release 1.1-preview

29 Sep 07:53
1006351

What’s new

  • vLLM:
    • Bug fix for sym_int4 online quantization on multi-modal models

llm-scaler-omni beta release 0.1.0-b1

29 Sep 05:22
1006351

Pre-release

What’s new

  • omni:

    • Integrated ComfyUI on XPU and provided sample workflows for:
      • Wan2.2 TI2V 5B
      • Wan2.2 T2V 14B (multi-XPU supported)
      • FLUX.1 dev
      • FLUX.1 Kontext dev
      • Stable Diffusion 3.5 large
      • Qwen Image, Qwen Image Edit, etc.
    • Added support for xDiT, Yunchang, and Raylight on XPU.
    • Integrated Xinference with OpenAI-compatible APIs to provide:
      • TTS: kokoro 82M
      • STT: Whisper Large v3
      • T2I: Stable Diffusion 3.5 Medium
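
For the Xinference OpenAI-compatible endpoints above, a minimal sketch of a TTS request (the port is Xinference's default 9997; the model name, voice, and output path are assumptions):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:9997/v1", api_key="none")
    speech = client.audio.speech.create(
        model="Kokoro-82M",               # assumed model UID for the kokoro 82M TTS
        voice="default",                  # assumed voice name
        input="Hello from an Intel XPU.",
    )
    speech.write_to_file("hello.mp3")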

llm-scaler-vllm beta release 0.10.0-b4

23 Sep 08:00
2548330

Pre-release

What’s new

  • vLLM:

    • Bug fixes

llm-scaler-vllm beta release 0.10.0-b3

23 Sep 07:06
927de0e

Pre-release

What’s new

  • vLLM:

    • Support Seed-OSS model
    • Add MinerU support
    • Enable MiniCPM-V-4_5
    • Fix InternVL-3.5 and DeepSeek-V2-Lite errors