Add Qwen3 Bidirectional encoder (voyage-4-nano support)#66

Open
agsuy wants to merge 1 commit into
Blaizzy:mainfrom
agsuy:qwen3-bidirectional-encoder
@agsuy agsuy commented May 20, 2026

Add Qwen3 Bidirectional encoder (voyage-4-nano support)

Summary

Adds a new qwen3_bidirec.py model module that supports Qwen3-based bidirectional encoder embedding models — most notably voyageai/voyage-4-nano, Voyage AI's first open-weights release (Apache 2.0, January 2026, 340M params, 2048d Matryoshka head, 8K context).

Mirrors the existing llama_bidirec.py pattern: a separate module file alongside qwen3.py, dispatched via model_type in the HF config. Same idea, Qwen3 internals.

What this is solving

The upstream voyage-4-nano config declares:

{
  "model_type": "qwen3",
  "use_bidirectional_attention": true,
  "num_labels": 2048
}

The current qwen3.py Model class is decoder-only — causal mask, last-token pooling, no projection head. Trying to load voyage-4-nano (either via mlx_embeddings.convert or load) fails with:

ValueError: Received 1 parameters not in model:
model.linear.weight.

…because the upstream HF module Qwen3BidirectionalModel stores its Matryoshka projection at self.linear (top-level, alongside self.model), and the existing qwen3.Model has no slot for it.
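For reviewers less familiar with the decoder-only vs. bidirectional distinction mentioned above, here is a minimal plain-Python sketch of the two mask shapes and the two pooling strategies. It is illustrative only: the function names are hypothetical, it uses nested lists rather than MLX arrays, it assumes right padding, and it is not the code in qwen3.py or qwen3_bidirec.py.

```python
# Illustrative stand-ins only; names are hypothetical and not part of mlx_embeddings.

def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(attention_mask):
    """Every query may attend to every non-padding key, regardless of position."""
    n = len(attention_mask)
    return [[bool(attention_mask[j]) for j in range(n)] for _ in range(n)]


def last_token_pool(hidden, attention_mask):
    """Hidden state of the last non-padding token (assumes right padding)."""
    last = sum(attention_mask) - 1
    return hidden[last]


def mean_pool(hidden, attention_mask):
    """Average of the hidden states over non-padding tokens only."""
    count = sum(attention_mask)
    dim = len(hidden[0])
    return [
        sum(h[d] * m for h, m in zip(hidden, attention_mask)) / count
        for d in range(dim)
    ]


# Three positions, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(causal_mask(3))
print(bidirectional_mask(mask))
print(last_token_pool(hidden, mask))
print(mean_pool(hidden, mask))
```

With a causal mask, only the final token has seen the whole sequence, which is why decoder-style embedders pool the last token. With full attention every token has seen every other token, so averaging over the non-padding positions is the natural choice.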

What changed

| File | Change |
| --- | --- |
| mlx_embeddings/models/qwen3_bidirec.py | New. Bidirectional Qwen3 with optional Matryoshka projection. Reuses Qwen3DecoderLayer from qwen3.py and mean_pooling from pooling.py. |
| mlx_embeddings/utils.py (_get_model_arch) | When model_type == "qwen3" AND config has use_bidirectional_attention=True, route to qwen3_bidirec. Otherwise route as before. |
| README.md | Adds the new architecture to the supported-models list. |

No new runtime dependencies. qwen3.py is untouched — existing Qwen3 / Qwen3-Embedding-* models that don't set the flag continue to load via the original module.
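The routing rule described above can be sketched roughly as follows. This is a simplified stand-in, not the actual body of _get_model_arch: the function name resolve_model_module and the remapping parameter are hypothetical, and the real function does more than return a module name.

```python
from typing import Optional


def resolve_model_module(config: dict, remapping: Optional[dict] = None) -> str:
    """Hypothetical sketch: pick the model module name for a parsed config.json."""
    model_type = config.get("model_type", "")

    # Two-condition check: Qwen3 AND the flag explicitly set to True.
    if model_type == "qwen3" and config.get("use_bidirectional_attention") is True:
        return "qwen3_bidirec"

    # Otherwise route as before: apply any static remapping, else use model_type.
    remapping = remapping or {}
    return remapping.get(model_type, model_type)


voyage_cfg = {
    "model_type": "qwen3",
    "use_bidirectional_attention": True,
    "num_labels": 2048,
}
print(resolve_model_module(voyage_cfg))
print(resolve_model_module({"model_type": "qwen3"}))
```

Configs that omit the flag, or set it to false, fall through to the existing path, which is what keeps the Qwen3-Embedding-* models on qwen3.py.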

Design notes

  • Subclasses qwen3.ModelArgs rather than redefining all fields — keeps the schema in one place and means Qwen3DecoderLayer works on either ModelArgs without changes.
  • The projection slot lives at self.linear (top-level, not nested under self.model) because that matches the upstream HF safetensors layout. sanitize() has a special case for linear.weight / linear.bias to prevent the existing model. prefix rule from wrongly rewriting it.
  • Routing decision: a static MODEL_REMAPPING entry would have been the smaller diff, but it can't condition on more than model_type. The two-condition check (model_type == "qwen3" AND use_bidirectional_attention=True) is the smallest extension that makes upstream voyage configs "just work" without users editing config.json.
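To make the sanitize() special case concrete, here is a minimal sketch. It assumes the existing prefix rule prepends "model." to any key that lacks it (inferred from the model.linear.weight error above, not copied from the source), and it uses a plain dict with placeholder values instead of real weight arrays.

```python
# Keys that must stay at the top level to match the upstream safetensors layout.
TOP_LEVEL_KEYS = {"linear.weight", "linear.bias"}


def sanitize(weights: dict) -> dict:
    """Hypothetical sketch of the key-renaming step, not the actual implementation."""
    sanitized = {}
    for key, value in weights.items():
        if key in TOP_LEVEL_KEYS:
            # Special case: the projection head lives at self.linear, so leave it alone.
            sanitized[key] = value
        elif key.startswith("model."):
            # Already nested under the backbone.
            sanitized[key] = value
        else:
            # Assumed prefix rule: nest everything else under "model.".
            sanitized[f"model.{key}"] = value
    return sanitized


example = {
    "linear.weight": "W_proj",
    "embed_tokens.weight": "W_embed",
    "model.norm.weight": "W_norm",
}
print(sanitize(example))
```

Without the first branch, linear.weight would be rewritten to model.linear.weight, which is the parameter name the loader rejects.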

Validation

1. Convert the upstream snapshot

python -m mlx_embeddings.convert \
    --hf-path voyageai/voyage-4-nano \
    --mlx-path ./voyage-4-nano-bf16 \
    --dtype bfloat16

2. Load + embed a sentence and verify routing and output shape

python <<'PY'
from mlx_embeddings.utils import load
model, tok = load("./voyage-4-nano-bf16")
assert type(model).__module__ == "mlx_embeddings.models.qwen3_bidirec"
assert model.linear is not None
assert model.linear.weight.shape == (2048, 1024)
ids = tok("hello world", return_tensors="mlx",
          padding=True, truncation=True)
out = model(ids["input_ids"], attention_mask=ids["attention_mask"])
assert out.text_embeds.shape == (1, 2048)
print("voyage path ok")
PY

3. Regression check on the existing Qwen3 decoder path

python <<'PY'
from mlx_embeddings.utils import load
model, _ = load("mlx-community/Qwen3-Embedding-0.6B-8bit")
assert type(model).__module__ == "mlx_embeddings.models.qwen3"
print("decoder Qwen3 path unchanged")
PY

New `qwen3_bidirec.py` module mirroring `llama_bidirec.py` for the
Qwen3 architecture: same transformer building blocks as `qwen3.py`
(reuses `Qwen3DecoderLayer`) but swaps the autoregressive causal
mask for full bidirectional attention, mean-pools rather than
last-token-pools, and optionally appends a
`nn.Linear(hidden_size, num_labels)` projection head before pooling
(used for Matryoshka outputs).

Motivating model: `voyageai/voyage-4-nano` (Voyage's first
open-weights embedding model, Apache 2.0): Qwen3 base, 340M params,
2048d Matryoshka head. Upstream config declares `"model_type":
"qwen3"` with `"use_bidirectional_attention": true` and
`"num_labels": 2048` — the existing `qwen3.py` rejects it because
`Model` has no slot for the projection weight (`linear.weight`).

Changes:

* `mlx_embeddings/models/qwen3_bidirec.py` (new): bidirectional Qwen3
  with optional Matryoshka projection. Reuses `Qwen3DecoderLayer`
  and `mean_pooling`; no new runtime dependencies.
* `mlx_embeddings/utils.py` `_get_model_arch`: when `model_type ==
  "qwen3"` and `use_bidirectional_attention=True`, route to the new
  module. The existing `qwen3.py` is untouched and continues to
  serve models that don't set the flag (e.g.
  `mlx-community/Qwen3-Embedding-0.6B-8bit`).
* `README.md`: list the new architecture.

See PR description for step-by-step validation and downstream
benchmark numbers.