Add EarlyTom video token compression for Qwen2.5-VL. by zhangj1an · Pull Request #1 · zhangj1an/vllm

zhangj1an · 2026-06-09T23:18:02Z

Port of EarlyTom (arXiv:2605.30010, github.com/viridisGreen/EarlyTom)
from LLaVA-OneVision/SigLIP to vLLM's Qwen2.5-VL, riding the EVS
pruning infrastructure (--video-pruning-rate) so the processor-side
placeholder count contract is preserved exactly.
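The placeholder count contract can be illustrated with a small sketch. The processor reserves a fixed number of video placeholders derived from the pruning rate, so the encoder-side compression has to keep exactly that many tokens. The helper names below (`retained_token_budget`, `allocate_exact`) are hypothetical, and the budget formula (floor of the surviving fraction, with at least one frame's worth of tokens kept) is an assumption, not code taken from this PR or from vLLM.

```python
def retained_token_budget(t, h, w, merge_size, pruning_rate):
    """Hypothetical: tokens kept for one video of t frames on an h x w patch grid."""
    tokens_per_frame = (h // merge_size) * (w // merge_size)
    total = t * tokens_per_frame
    # Assumption: never keep fewer tokens than a single frame holds.
    return max(tokens_per_frame, int(total * (1.0 - pruning_rate)))


def allocate_exact(weights, budget, capacity):
    """Split `budget` across frames in proportion to non-negative `weights`.

    Each frame receives at most `capacity` tokens and the shares sum to
    exactly `budget` (largest-remainder rounding).
    """
    n = len(weights)
    if not 0 <= budget <= n * capacity:
        raise ValueError("budget cannot be met with the given capacity")
    total_w = sum(weights)
    if total_w <= 0:
        # Fall back to a uniform split when no frame carries any weight.
        weights, total_w = [1.0] * n, float(n)
    ideal = [budget * w / total_w for w in weights]
    alloc = [min(capacity, int(x)) for x in ideal]
    remaining = budget - sum(alloc)
    # Hand out the leftover tokens one at a time, largest shortfall first,
    # skipping frames that are already full.
    order = sorted(range(n), key=lambda i: ideal[i] - alloc[i], reverse=True)
    while remaining > 0:
        for i in order:
            if remaining == 0:
                break
            if alloc[i] < capacity:
                alloc[i] += 1
                remaining -= 1
    return alloc


if __name__ == "__main__":
    budget = retained_token_budget(t=8, h=28, w=28, merge_size=2, pruning_rate=0.75)
    per_frame = allocate_exact([3, 1, 1, 1, 1, 1, 1, 1], budget, capacity=196)
    print(budget, per_frame, sum(per_frame) == budget)
```

Whatever scoring is used to pick tokens, the per-frame shares must sum to the budget so the retention mask selects exactly as many tokens as there are placeholders.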

- In-encoder temporal frame merging (EMA cosine segmentation + frame mixing) at configurable ViT layers, with full rebuild of cu_seqlens, window cu_seqlens, rope tables and merger reverse indices
- Attention saliency recomputed from QK^T at the last full-attention layer (flash-attention safe, query-chunked, TP all-reduced)
- Outer compression: global attention top-k dominant tokens + DPC-KNN contextual merging for segment-boundary frames; local-window top-k for static frames; exact-budget allocator + retention mask drives EVS mrope recomputation
- Inner (FastV-style) LLM compression intentionally not ported: paged KV cache assumes fixed prompt length
- Enabled via VLLM_EARLYTOM=1 (+ VLLM_EARLYTOM_* hyperparams)
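As a rough illustration of the temporal frame merging step, the sketch below segments frames by comparing each frame's feature vector against an exponential moving average (EMA) of the current segment, then mixes each segment into a single representative. The PR does not spell out the exact rule, so the threshold, the momentum, the mean-based mixing and the function names are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def segment_frames(frame_feats, threshold=0.8, momentum=0.5):
    """Group consecutive frames into segments of similar content.

    A frame opens a new segment when its cosine similarity to the running
    EMA of the current segment drops below `threshold` (hypothetical value).
    """
    if not frame_feats:
        return []
    segments = [[0]]
    ema = list(frame_feats[0])
    for i in range(1, len(frame_feats)):
        feat = frame_feats[i]
        if cosine(ema, feat) < threshold:
            # Scene change: start a new segment and reset the EMA.
            segments.append([i])
            ema = list(feat)
        else:
            segments[-1].append(i)
            ema = [momentum * e + (1.0 - momentum) * f for e, f in zip(ema, feat)]
    return segments


def mix_segment(frame_feats, segment):
    """Assumed mixing rule: average the features of the frames in a segment."""
    dim = len(frame_feats[segment[0]])
    return [sum(frame_feats[i][d] for i in segment) / len(segment) for d in range(dim)]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    segs = segment_frames(feats)
    print(segs)
    print([mix_segment(feats, s) for s in segs])
```

Merging frames changes the sequence length inside the encoder, which is why the PR rebuilds cu_seqlens, the window cu_seqlens, the rope tables and the merger reverse indices after this step.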



Signed-off-by: Zhang Jian <jianmusings@gmail.com>

zhangj1an added 2 commits June 9, 2026 22:28

refactor to clean up code

c10ef53

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

zhangj1an wants to merge 2 commits into main from jian/earlytom
