@Sunny-bot1
Collaborator

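Changes in this PR

  1. Fuse get_max_len and get_kv_max_len (3.6us + 1.8us -> 3.6us)
  2. Fuse the device-to-host copies of max_len_tensor_cpu and max_len_kv_cpu by storing max_len_kv_cpu in max_len_tensor_cpu[8] (36us -> 18us)
  3. Optimize the split_q_block kernel (21us -> 3us)
  4. Remove some redundant branches and memsets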
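
As a rough illustration of item 2, the host-only sketch below packs the KV max length into slot 8 of the same buffer that holds the other max-length values, so a single copy moves everything to the CPU. This is not the code from this PR: `pack_max_lens`, `CountingCopier`, the assumption that slots 0..7 hold the existing values, and the use of `std::memcpy` as a stand-in for a device-to-host copy are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed layout: slots 0..7 carry the existing max-length values,
// slot 8 carries the KV max length (the index named in the PR description).
constexpr std::size_t kNumMaxLenSlots = 8;
constexpr std::size_t kKvMaxLenSlot = 8;
constexpr std::size_t kPackedSize = kNumMaxLenSlots + 1;

// Hypothetical stand-in for a device-to-host copy; it only counts calls
// so the effect of packing (one transfer instead of two) is visible.
struct CountingCopier {
  int calls = 0;
  void copy(int* dst, const int* src, std::size_t n) {
    assert(n <= kPackedSize);
    std::memcpy(dst, src, n * sizeof(int));
    ++calls;
  }
};

// Builds the packed buffer that would live on the device: the existing
// max-length values followed by the maximum over all KV lengths.
std::array<int, kPackedSize> pack_max_lens(
    const std::array<int, kNumMaxLenSlots>& max_lens,
    const std::vector<int>& kv_lens) {
  std::array<int, kPackedSize> packed{};
  std::copy(max_lens.begin(), max_lens.end(), packed.begin());
  packed[kKvMaxLenSlot] =
      kv_lens.empty() ? 0 : *std::max_element(kv_lens.begin(), kv_lens.end());
  return packed;
}
```

With this layout the host reads the KV max length from `host[kKvMaxLenSlot]` after one `copy` call, instead of issuing a second transfer for a separate one-element buffer.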

TODO

  1. Run inside a CUDA graph
  2. Further fuse kernels and DtoH copies
  3. Optimize the MLA preprocessing kernels

@paddle-bot

paddle-bot bot commented Sep 29, 2025

Thanks for your contribution!

@Sunny-bot1 Sunny-bot1 closed this Oct 14, 2025