Add EarlyTom video token compression for Qwen2.5-VL. #1

Open: zhangj1an wants to merge 2 commits into main from jian/earlytom

Conversation

@zhangj1an

Owner

Port of EarlyTom (arXiv:2605.30010, github.com/viridisGreen/EarlyTom)
from LLaVA-OneVision/SigLIP to vLLM's Qwen2.5-VL. It builds on the EVS
pruning infrastructure (--video-pruning-rate) so that the processor-side
placeholder count contract is preserved exactly.
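
Preserving the placeholder count contract means the number of video tokens kept after compression has to equal the count the processor already reserved. The following is a minimal sketch of one way to hit such a budget exactly, not the code in this PR: `allocate_budget` and `retention_mask` are hypothetical helpers, the largest-remainder rounding is an assumed technique, and scores are assumed to be non-negative.

```python
def allocate_budget(scores, total_budget, tokens_per_frame):
    """Split total_budget tokens across frames in proportion to scores.

    Hypothetical helper: largest-remainder rounding makes the per-frame
    quotas sum to exactly total_budget without exceeding tokens_per_frame.
    """
    n = len(scores)
    if not 0 <= total_budget <= n * tokens_per_frame:
        raise ValueError("budget must fit inside the available tokens")
    total = sum(scores)
    if total <= 0:
        ideal = [total_budget / n] * n
    else:
        ideal = [total_budget * s / total for s in scores]
    # Start from the floor of each ideal share, capped at the frame size.
    quotas = [min(tokens_per_frame, int(x)) for x in ideal]
    leftover = total_budget - sum(quotas)
    # Hand out what is left by largest remainder, skipping full frames.
    order = sorted(range(n), key=lambda i: ideal[i] - quotas[i], reverse=True)
    while leftover > 0:
        for i in order:
            if leftover == 0:
                break
            if quotas[i] < tokens_per_frame:
                quotas[i] += 1
                leftover -= 1
    return quotas


def retention_mask(saliency, quotas):
    """Keep the quota highest-saliency tokens in each frame."""
    mask = []
    for frame_scores, k in zip(saliency, quotas):
        ranked = sorted(range(len(frame_scores)),
                        key=lambda j: frame_scores[j], reverse=True)
        keep = set(ranked[:k])
        mask.append([j in keep for j in range(len(frame_scores))])
    return mask


quotas = allocate_budget([3.0, 1.0, 0.0], total_budget=5, tokens_per_frame=4)
mask = retention_mask(
    [[0.1, 0.9, 0.2, 0.3], [0.5, 0.4, 0.1, 0.0], [0.2, 0.2, 0.2, 0.2]],
    quotas,
)
# The number of True entries in the mask equals the budget.
assert sum(sum(row) for row in mask) == 5
```

Whatever the per-frame scores look like, the mask always retains exactly `total_budget` tokens, which is the property a fixed placeholder count depends on.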

  • In-encoder temporal frame merging (EMA cosine segmentation + frame mixing) at configurable ViT layers, with full rebuild of cu_seqlens, window cu_seqlens, rope tables and merger reverse indices
  • Attention saliency recomputed from QK^T at the last full-attention layer (flash-attention safe, query-chunked, TP all-reduced)
  • Outer compression: global-attention top-k dominant tokens plus DPC-KNN contextual merging for segment-boundary frames, and local-window top-k for static frames. An exact-budget allocator and the resulting retention mask drive the EVS mrope recomputation
  • Inner (FastV-style) LLM compression intentionally not ported: paged KV cache assumes fixed prompt length
  • Enabled via VLLM_EARLYTOM=1 (+ VLLM_EARLYTOM_* hyperparams)
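
To make the first bullet concrete, here is a toy sketch of EMA cosine segmentation followed by frame mixing. It is one plausible reading of that bullet, not the PR's implementation: each frame is reduced to a single feature vector, a new segment starts when cosine similarity to the running EMA drops below a threshold, and mixing is a plain mean. The names `segment_frames` and `mix_segments` and the values of `threshold` and `alpha` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_frames(frames, threshold=0.8, alpha=0.5):
    """Group consecutive frames into segments of similar content.

    A frame opens a new segment when its cosine similarity to the
    exponential moving average (EMA) of the current segment is below
    threshold; otherwise it joins the segment and updates the EMA.
    """
    segments = [[0]]
    ema = list(frames[0])
    for t in range(1, len(frames)):
        if cosine(ema, frames[t]) < threshold:
            segments.append([t])
            ema = list(frames[t])  # restart the EMA at the boundary frame
        else:
            segments[-1].append(t)
            ema = [alpha * e + (1 - alpha) * f for e, f in zip(ema, frames[t])]
    return segments


def mix_segments(frames, segments):
    """Replace each segment with the mean of its frames (assumed mixing rule)."""
    dim = len(frames[0])
    return [
        [sum(frames[i][d] for i in seg) / len(seg) for d in range(dim)]
        for seg in segments
    ]


frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = segment_frames(frames)
merged = mix_segments(frames, segments)
assert segments == [[0, 1], [2, 3]]
assert len(merged) == 2
```

In the encoder itself, shrinking the frame count this way changes every sequence-length-dependent structure, which is why the bullet calls for rebuilding cu_seqlens, window cu_seqlens, rope tables and merger reverse indices after each merge.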

Purpose

Test Plan

Test Result


zhangj1an added 2 commits June 9, 2026 22:28
Signed-off-by: Zhang Jian <jianmusings@gmail.com>
Signed-off-by: Zhang Jian <jianmusings@gmail.com>