---
title: "Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang"
author: "Novita AI"
date: "January 21, 2026"
previewImg: /images/blog/novita-glm4/novita-glm4-preview.png
---

## TL;DR
Novita AI has developed a suite of production-tested, high-impact optimizations for deploying GLM4-MoE models on SGLang.
We introduce an **end-to-end performance optimization strategy** that addresses bottlenecks across the entire inference pipeline — from kernel execution efficiency to cross-node data transfer scheduling.
By integrating **Shared Experts Fusion** and **Suffix Decoding**, we observe substantial gains in key production metrics under agentic coding workloads, including:

- **up to 65% reduction in Time-to-First-Token (TTFT)**
- **22% improvement in Time-Per-Output-Token (TPOT)**

All results were validated on **H200 clusters under TP8 and FP8 configurations**, providing a battle-tested blueprint for achieving both optimal throughput and low latency in demanding production environments.

## How We Implemented Core Production Optimizations for GLM4-MoE

### 1. Shared Experts Fusion

- [SGLang PR #13873: Shared Experts Fusion](https://github.com/sgl-project/sglang/pull/13873)

![Shared Experts Fusion](/images/blog/novita-glm4/shared-experts-fusion.png)

Full credit for this optimization belongs to the original work on the DeepSeek models. As illustrated in the figure above, MoE models such as GLM-4.7 route all input tokens through a shared expert, while each token is also routed to its own set of top-k routed experts as selected by the model's router. The outputs from all experts are then weighted and aggregated. GLM-4.7, for instance, employs 160 routed experts alongside a single shared expert, selecting the top 8 routed experts per token. In earlier implementations, these two components were handled separately. Since they share identical tensor shapes and computational procedures, it is natural to unify them by merging the shared expert into the routed MoE structure: select the top 9 out of the total 161 experts, with the shared expert consistently assigned the 9th position, as sketched below.
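To make the routing change concrete, here is a minimal PyTorch sketch (the names, shapes, and the fixed unit weight for the shared expert are illustrative assumptions, not the actual SGLang implementation):

```python
import torch

NUM_ROUTED = 160          # routed experts in GLM-4.7
SHARED_IDX = NUM_ROUTED   # shared expert appended as expert index 160
TOP_K = 8                 # routed experts selected per token

def fused_routing(router_logits: torch.Tensor, shared_weight: float = 1.0):
    """Select the top-8 routed experts, then append the shared expert as a
    fixed 9th selection so one grouped MoE kernel computes all 161 experts."""
    scores = router_logits.softmax(dim=-1)                   # [tokens, 160]
    topk_scores, topk_ids = torch.topk(scores, TOP_K, dim=-1)
    n = router_logits.shape[0]
    shared_ids = torch.full((n, 1), SHARED_IDX,
                            dtype=topk_ids.dtype, device=topk_ids.device)
    shared_scores = torch.full((n, 1), shared_weight,        # assumed weight 1.0
                               dtype=topk_scores.dtype, device=topk_scores.device)
    # The shared expert always occupies the 9th slot.
    return (torch.cat([topk_ids, shared_ids], dim=-1),       # [tokens, 9]
            torch.cat([topk_scores, shared_scores], dim=-1))
```

A single grouped dispatch over 9 experts then replaces the two separate computations.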

As documented in the PR, this optimization achieves performance gains of up to 23.7% in TTFT and 20.8% in ITL. These gains are expected: under TP8 and FP8 configurations, where the intermediate size is only 192, which is relatively small for H200 hardware, the fusion substantially boosts Streaming Multiprocessor (SM) utilization and significantly reduces memory I/O overhead.

### 2. Qknorm Fusion

- [SGLang PR #15141: Qknorm Fusion](https://github.com/sgl-project/sglang/pull/15141)
- [SGLang PR #15305: Qknorm Fusion Fix](https://github.com/sgl-project/sglang/pull/15305)

![Qknorm Fusion](/images/blog/novita-glm4/qknorm-fusion.png)

This migration builds upon the corresponding optimization in Qwen-MoE. The underlying idea is straightforward: since both operators perform head-wise computations, it is natural to fuse them into a single kernel. Our contribution lies in adapting the fused kernel to the GLM4-MoE-specific case, where only half of the dimensions within each head are rotated. A reference computation is sketched below.
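The following unfused PyTorch reference shows what the fused kernel computes (a sketch under assumed tensor shapes, not the CUDA kernel itself): per-head RMSNorm on Q and K, followed by rotary embedding applied only to the first `rotary_dim` dimensions of each head.

```python
import torch

def qknorm_partial_rope_reference(q, k, w_q, w_k, cos, sin,
                                  rotary_dim: int, eps: float = 1e-6):
    """Unfused reference. Assumed shapes: q, k: [tokens, heads, head_dim];
    w_q, w_k: [head_dim]; cos, sin: [tokens, 1, rotary_dim // 2].
    Only the first `rotary_dim` dims of each head are rotated."""
    def rmsnorm(x, w):
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(var + eps) * w

    def partial_rope(x):
        rot, keep = x[..., :rotary_dim], x[..., rotary_dim:]
        x1, x2 = rot.chunk(2, dim=-1)
        rotated = torch.cat([x1 * cos - x2 * sin,
                             x2 * cos + x1 * sin], dim=-1)
        return torch.cat([rotated, keep], dim=-1)  # untouched half passes through

    return partial_rope(rmsnorm(q, w_q)), partial_rope(rmsnorm(k, w_k))
```

The fused kernel performs both steps in one pass over Q and K, avoiding an extra round trip through global memory.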

### 3. Async Transfer

- [SGLang PR #14782: Async Transfer](https://github.com/sgl-project/sglang/pull/14782)

![Async Transfer Schedule](/images/blog/novita-glm4/async-transfer-schedule.png)

In scenarios where PD disaggregation with an overlapping schedule is applied, throughput gains about 10%, but TTFT degrades significantly. We observed that in the current prefill implementation, the data transfer is delayed until after the kernel launches for the next batch. For a model like GLM-4.7, which consists of 92 layers, kernel launch without CUDA Graph can be time-consuming, often taking hundreds of milliseconds and sometimes more than one second.

To address this, our modification advances the transfer step, scheduling it right after its corresponding GPU operations complete. Additionally, the transfer runs in a separate thread; by carefully guarding against potential data races, it proceeds without blocking the main thread.
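A simplified sketch of this scheduling (with placeholder helpers `launch_prefill_kernels` and `send_kv_to_decode`; the real implementation lives in the PR above): a CUDA event is recorded as soon as the batch's GPU work is enqueued, and a dedicated worker thread waits on that event and performs the transfer, so the main thread immediately starts launching the next batch's kernels.

```python
import queue
import threading
import torch

transfer_queue: queue.Queue = queue.Queue()

def transfer_worker():
    """Off the main thread: wait until the GPU has produced a batch's KV
    data, then ship it to the decode node without blocking kernel launches."""
    while True:
        done_event, kv_handle = transfer_queue.get()
        done_event.synchronize()       # blocks this thread only
        send_kv_to_decode(kv_handle)   # placeholder for the actual transfer

threading.Thread(target=transfer_worker, daemon=True).start()

def prefill_step(batch):
    kv_handle = launch_prefill_kernels(batch)  # placeholder: enqueue GPU work
    done = torch.cuda.Event()
    done.record()  # marks completion of this batch's GPU operations
    transfer_queue.put((done, kv_handle))
    # Return immediately: the main thread launches the next batch's kernels
    # while the worker waits and transfers in parallel.
```

The key point is that the transfer is enqueued as soon as the producing operations are recorded, rather than after the next batch's multi-hundred-millisecond launch sequence.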

The gain is substantial for models with many kernel launches: under heavy workloads, this optimization saves up to one second of TTFT, as shown below.

![Async Transfer TTFT Gains](/images/blog/novita-glm4/async-transfer-ttft.png)

## Production Benchmark Results

After implementing the approaches described above, we observed significant performance improvements for GLM4-MoE models, as the benchmark results below demonstrate.

### Benchmark configuration

- Input length: **4096**
- Output length: **1000**
- Request rate: **14 req/s**
- Model: **GLM-4.7 FP8 (TP8)**

### Results

![Benchmark TTFT](/images/blog/novita-glm4/benchmark-ttft.png)
![Benchmark TPOT](/images/blog/novita-glm4/benchmark-tpot.png)

> These optimizations are not just experimental — they have already been deployed and validated in [Novita AI's](https://novita.ai/?utm_source=sglang&utm_medium=article&utm_campaign=sglang-glm-optimization) production inference service.

## Suffix Decoding

Agentic coding scenarios (like Cursor and Claude Code) exhibit a high volume of reusable code patterns, allowing for targeted performance optimizations such as Suffix Decoding.

### Background: The Inference Bottleneck in Agentic Coding

LLM Agents excel at code generation tasks, but latency remains a significant challenge. Traditional Speculative Decoding accelerates inference by predicting multiple tokens in advance, but common approaches require training additional draft models, introducing engineering complexity.

### How Suffix Decoding Works

![How Suffix Decoding Works](/images/blog/novita-glm4/suffix-decoding.png)

Suffix Decoding takes a fundamentally different approach: it is completely model-free.

- No dependency on additional model weights
- Leverages patterns from previously generated output sequences to predict upcoming tokens
- When the current request's suffix matches a historical pattern, it continues along that historical sequence for speculation, as the toy sketch below illustrates
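Here is a deliberately simplified illustration of the matching step (the production version maintains suffix trees over per-request and global history, as described in the SuffixDecoding paper; all names here are illustrative):

```python
def propose_draft(current: list[int], history: list[int],
                  max_draft: int = 8, min_match: int = 2) -> list[int]:
    """Model-free speculation: find the longest suffix of `current` that
    occurs in `history`, and propose the tokens that followed it there."""
    longest = min(len(current), 32)  # cap the suffix length we search for
    for match_len in range(longest, min_match - 1, -1):
        suffix = current[-match_len:]
        for start in range(len(history) - match_len):
            if history[start:start + match_len] == suffix:
                end = start + match_len
                return history[end:end + max_draft]  # draft tokens to verify
    return []  # no match: fall back to normal decoding

# A pattern repeated across turns lets us draft its continuation.
history = [10, 11, 12, 13, 14, 15, 16]
current = [99, 12, 13, 14]
assert propose_draft(current, history) == [15, 16]
```

The target model then verifies the drafted tokens in a single forward pass, exactly as with a learned draft model, but with zero additional weights.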

### Data Validation: Output Pattern Repetition Analysis

By analyzing 22 Claude Code sessions (17,487 conversation turns), we discovered:

- 39.3% output pattern repetition: High frequency of similar tool calls and response patterns
- Highly structured agentic behaviors: Fixed phrases like "Let me...", "Now let me..." appear frequently

To support further research, we have open-sourced the evaluation dataset: [Agentic Code Dataset on Hugging Face](https://huggingface.co/datasets/novita/agentic_code_dataset_22)

### Performance Comparison

Compared with the built-in MTP acceleration, Suffix Decoding further reduces TPOT by 22% (from 25.13 ms to 19.63 ms):

| Metric | MTP | Suffix Decoding | Change |
| :--- | :--- | :--- | :--- |
| Mean TPOT | 25.13 ms | 19.63 ms | -21.90% |
| Median TPOT | 25.95 ms | 20.05 ms | -22.70% |

## Conclusion

The combination of these optimizations provides comprehensive performance improvements for SGLang deployments:

| Optimization | Impact/Benefit |
| :--- | :--- |
| Shared Experts Fusion | Addresses compute efficiency in MoE models |
| QK-Norm-RoPE Fusion | Reduces kernel launch overhead |
| Async Transfer | Optimizes data movement in disaggregated deployments |
| Suffix Decoding | Leverages pattern repetition for speculative decoding in agentic coding workloads |

Most components are already merged upstream or undergoing integration; feel free to check them out on the SGLang repo.

## How to Reproduce

Only the key performance-relevant parameters are shown here.
Full launch scripts (baseline vs. optimized), the benchmark harness, and profiling traces are published on GitHub: [novitalabs/sglang (glm_suffix branch)](https://github.com/novitalabs/sglang/tree/glm_suffix).

### Core Optimization Flags (SGLang Runtime)

```bash
# Parallelism and precision
--tp-size 8
--kv-cache-dtype fp8_e4m3

# Attention backend and prefill chunking
--attention-backend fa3
--chunked-prefill-size 16384

# Fusion and transfer optimizations described above
--enable-flashinfer-allreduce-fusion
--enable-fused-qk-norm-rope
--enable-shared-experts-fusion
--disaggregation-async-transfer
```
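For context, these flags are passed to SGLang's standard server launcher. A hypothetical invocation (the model path and port are placeholders) might look like:

```bash
python -m sglang.launch_server \
  --model-path <path-to-glm-4.7-fp8> \
  --port 30000 \
  --tp-size 8 --kv-cache-dtype fp8_e4m3 --attention-backend fa3 \
  --chunked-prefill-size 16384 \
  --enable-flashinfer-allreduce-fusion --enable-fused-qk-norm-rope \
  --enable-shared-experts-fusion --disaggregation-async-transfer
```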

### Speculative decoding configuration (agentic coding workload)

```bash
# Built-in MTP (NextN) draft configuration
--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
```

### Suffix Decoding configuration (optional)

```bash
# Model-free suffix speculation (used in place of the NEXTN setup above)
--speculative-algorithm SUFFIX
--speculative-suffix-cache-max-depth 64
--speculative-suffix-max-spec-factor 1.0
--speculative-suffix-min-token-prob 0.1
```

## References

1. [SGLang PR #13873: Shared Experts Optimization](https://github.com/sgl-project/sglang/pull/13873)
2. [Snowflake Engineering Blog: SuffixDecoding at Production Scale](https://www.snowflake.com/en/engineering-blog/suffixdecoding-arctic-inference-vllm/)
3. [NeurIPS Paper: SuffixDecoding](https://arxiv.org/abs/2411.04975)
4. [Arctic Inference Repository](https://github.com/snowflakedb/ArcticInference)