
Update documentation of input_embedding_scale parameter in TransformerDecoderV1Config #91

Merged
curufinwe merged 3 commits into main from moritz-trafo-docs-scale on Apr 17, 2026

Conversation

NeoLegends (Contributor) commented Dec 22, 2025

Mohammad did not use the scale in a pure LM setup.

NeoLegends self-assigned this Dec 22, 2025

albertz (Member) commented Dec 22, 2025

> I think in a pure LM setup you don't use the scale.

I don't think this is true in general. I have seen both variants. Also for LMs. For example, Gemma3:

# (Hugging Face transformers, Gemma3 text model; the scaled embedding module
# multiplies the token embeddings by embed_scale in its forward pass)
self.embed_tokens = Gemma3TextScaledWordEmbedding(
    config.vocab_size, config.hidden_size, self.padding_idx, embed_scale=self.config.hidden_size**0.5
)

Note, there are some other things to consider:

If you don't apply the scale in the forward pass, people instead apply the scale during init, or make the random init very large. E.g. nanochat:

# (nanochat weight init: embeddings get std=1.0 instead of a small std
# like 0.02, effectively baking the scale into the weights)
elif isinstance(module, nn.Embedding):
    torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)

I also saw that some people use custom (much larger) LRs for embeddings, which again might compensate for not using a scale. E.g. see nanochat.
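
For illustration, such a setup could look like this in PyTorch (the toy model, the name-based parameter split, and the LR values are all made up for the example; they are not nanochat's actual settings):

import torch
from torch import nn

# toy model standing in for a Transformer LM
model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),
    "out": nn.Linear(64, 1000),
})

# embeddings get their own, much larger learning rate
embed_params = [p for n, p in model.named_parameters() if "embed" in n]
other_params = [p for n, p in model.named_parameters() if "embed" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": 3e-4},
    {"params": embed_params, "lr": 3e-3},  # e.g. 10x the base LR
])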

If you share the embedding weights with the LM head, this might affect whether you want such a scale or not (I'm not sure in what way, though...). Most LMs do this.
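
For reference, the usual weight-tying pattern looks like this (a generic sketch, not code from any particular model):

from torch import nn

vocab_size, model_dim = 32000, 512
embed_tokens = nn.Embedding(vocab_size, model_dim)
lm_head = nn.Linear(model_dim, vocab_size, bias=False)
# one shared parameter: nn.Linear stores its weight as (out_features, in_features),
# i.e. (vocab_size, model_dim), the same shape as the embedding matrix
lm_head.weight = embed_tokens.weight

With tying, the same matrix both looks up inputs and produces output logits, which is why an input-side scale can interact with the output side.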

input_dropout: Dropout applied to the input embedding.
input_embedding_scale: Scale applied to the input embedding.
-    Set to `None` to apply a (tuned) default.
+    Set to `None` to apply a default that is suitable for ASR AED decoder models.

Member:

I would not mention any specific model at all here. I think this is just confusing. I would instead just say what default you use.
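
For context, the parameter is just a multiplicative factor on the token embeddings; a minimal sketch of the usual pattern (the names are illustrative, and sqrt(d_model) is the common Transformer convention, not necessarily this repository's actual default):

import torch
from torch import nn

input_embedding_scale = None  # the config value under discussion
embedding = nn.Embedding(1000, 512)
input_dropout = nn.Dropout(p=0.1)

tokens = torch.randint(0, 1000, (2, 7))  # [batch, time]
scale = input_embedding_scale
if scale is None:
    scale = embedding.embedding_dim ** 0.5  # assumed default: sqrt(d_model)
x = input_dropout(embedding(tokens) * scale)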

curufinwe changed the title from "Transformer Decoder: extend docs on input embedding scale" to "Update documentation of input_embedding_scale parameter in TransformerDecoderV1Config" on Apr 17, 2026
curufinwe merged commit f347906 into main on Apr 17, 2026
2 checks passed