
fix: apply QK norm in MultiheadAttention hook methods #85

Open
orrzohar wants to merge 1 commit into Cerebras:main from orrzohar:fix/apply-qk-norm-in-attention-hooks

Conversation


@orrzohar orrzohar commented Apr 15, 2026


Bug

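When attention_qk_norm_layer is set in the config, MultiheadAttention.__init__ instantiates self.q_norm and self.k_norm, but neither module is ever called in the forward pass. As a result, QK normalization is silently a no-op: the config is accepted and the modules are created, yet the attention logits are computed from unnormalized Q and K.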

Fix

Apply self.q_norm / self.k_norm inside process_q_before_logits_calc and process_k_before_logits_calc, so Q and K are normalized before the logits matmul when configured.
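To illustrate the intended behavior, here is a minimal, dependency-free sketch of the hook pattern. It is not the repository's code: ToyAttention, rms_norm, the logit helper, and the hook signatures are all hypothetical stand-ins, and only the names process_q_before_logits_calc, process_k_before_logits_calc, q_norm, and k_norm are taken from this PR. The sketch assumes an RMS-style norm with no learned gain.

```python
import math
from typing import Callable, List, Optional

Vector = List[float]
Norm = Callable[[Vector], Vector]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Scale x to (approximately) unit root-mean-square. No learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


class ToyAttention:
    """Hypothetical stand-in for MultiheadAttention (single query/key pair)."""

    def __init__(self, qk_norm: Optional[Norm] = None):
        # Assumption: both norms are created together from one config entry,
        # loosely mirroring attention_qk_norm_layer. None means "not configured".
        self.q_norm = qk_norm
        self.k_norm = qk_norm

    def process_q_before_logits_calc(self, q: Vector) -> Vector:
        # The fix: actually invoke q_norm when it is configured.
        return self.q_norm(q) if self.q_norm is not None else q

    def process_k_before_logits_calc(self, k: Vector) -> Vector:
        # Same for k_norm; identity when QK norm is not configured.
        return self.k_norm(k) if self.k_norm is not None else k

    def logit(self, q: Vector, k: Vector) -> float:
        # Hooks run before the scaled dot product that produces the logit.
        q = self.process_q_before_logits_calc(q)
        k = self.process_k_before_logits_calc(k)
        return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
big_q = [10.0 * v for v in q]

plain = ToyAttention()                  # QK norm not configured
normed = ToyAttention(qk_norm=rms_norm)  # QK norm configured

# Without QK norm the logit grows with the scale of q; with it, the
# logit is (up to eps) unchanged when q is rescaled.
print(plain.logit(big_q, k) / plain.logit(q, k))
print(normed.logit(big_q, k) - normed.logit(q, k))
```

A scale-invariance check like the one at the end could serve as a regression test for this fix: if the norms are instantiated but never called, rescaling Q changes the logits even though QK norm is configured.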

q_norm and k_norm modules were instantiated from attention_qk_norm_layer
config but never invoked in the forward pass. Apply them in
process_q_before_logits_calc and process_k_before_logits_calc so that
QK normalization actually affects attention logits when configured.

Made-with: Cursor
@orrzohar

Author

@bhargav-cerebras
