Add Qwen3 Bidirectional encoder (voyage-4-nano support) by agsuy · Pull Request #66 · Blaizzy/mlx-embeddings

agsuy · 2026-05-20T12:44:43Z

Add Qwen3 Bidirectional encoder (voyage-4-nano support)

Summary

Adds a new qwen3_bidirec.py model module that supports Qwen3-based bidirectional encoder embedding models — most notably voyageai/voyage-4-nano, Voyage AI's first open-weights release (Apache 2.0, January 2026, 340M params, 2048d Matryoshka head, 8K context).

Mirrors the existing llama_bidirec.py pattern: a separate module file alongside qwen3.py, dispatched via model_type in the HF config. Same idea, Qwen3 internals.

What this is solving

The upstream voyage-4-nano config declares:

{
  "model_type": "qwen3",
  "use_bidirectional_attention": true,
  "num_labels": 2048
}

The current qwen3.py Model class is decoder-only — causal mask, last-token pooling, no projection head. Trying to load voyage-4-nano (either via mlx_embeddings.convert or load) fails with:

ValueError: Received 1 parameters not in model:
model.linear.weight.

…because the upstream HF module Qwen3BidirectionalModel stores its Matryoshka projection at self.linear (top-level, alongside self.model), and the existing qwen3.Model has no slot for it.
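To make the pooling difference concrete, here is a minimal pure-Python sketch contrasting last-token pooling (the decoder path) with masked mean pooling (the bidirectional path). It is an illustration only: `mean_pool` and `last_token_pool` are hypothetical helpers written for this example, not the `mean_pooling` function in `pooling.py`, and the toy input assumes right padding.

```python
from typing import List

Vector = List[float]


def mean_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Average the token vectors whose mask entry is 1; padding is skipped."""
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def last_token_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Return the vector of the last non-padding token (assumes right padding)."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(hidden[last])


# Toy input: 3 token vectors of width 2; the third token is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

mean_vec = mean_pool(hidden, mask)        # uses tokens 0 and 1
last_vec = last_token_pool(hidden, mask)  # uses token 1 only
```

With a causal mask only the final token has seen the whole sequence, which is why the decoder path pools there; with full bidirectional attention every token has, so averaging over all real tokens is the natural choice.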

What changed

| File | Change |
| --- | --- |
| `mlx_embeddings/models/qwen3_bidirec.py` | New. Bidirectional Qwen3 with optional Matryoshka projection. Reuses `Qwen3DecoderLayer` from `qwen3.py` and `mean_pooling` from `pooling.py`. |
| `mlx_embeddings/utils.py` (`_get_model_arch`) | When `model_type == "qwen3"` AND config has `use_bidirectional_attention=True`, route to `qwen3_bidirec`. Otherwise route as before. |
| `README.md` | Adds the new architecture to the supported-models list. |

No new runtime dependencies. qwen3.py is untouched — existing Qwen3 / Qwen3-Embedding-* models that don't set the flag continue to load via the original module.
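The routing rule can be sketched as a small standalone function. This is an assumption-laden illustration, not the actual body of `_get_model_arch`: the name `resolve_model_module` is hypothetical, and any remapping the real function applies to other model types is left out.

```python
def resolve_model_module(config: dict) -> str:
    """Pick the model module name for a parsed HF config (hypothetical helper).

    Only the two-condition check from this PR is modelled here; every other
    model_type is passed through unchanged.
    """
    model_type = config.get("model_type", "")
    # Both conditions must hold: a Qwen3 config AND the bidirectional flag set.
    if model_type == "qwen3" and config.get("use_bidirectional_attention") is True:
        return "qwen3_bidirec"
    return model_type


voyage_cfg = {
    "model_type": "qwen3",
    "use_bidirectional_attention": True,
    "num_labels": 2048,
}
decoder_cfg = {"model_type": "qwen3"}

voyage_module = resolve_model_module(voyage_cfg)    # bidirectional module
decoder_module = resolve_model_module(decoder_cfg)  # original module
```

Configs that omit the flag, or set it to `false`, fall through to the original module, which is what keeps existing Qwen3 checkpoints unaffected.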

Design notes

* Subclasses `qwen3.ModelArgs` rather than redefining all fields. This keeps the schema in one place and means `Qwen3DecoderLayer` works with either `ModelArgs` without changes.
* The projection slot lives at `self.linear` (top-level, not nested under `self.model`) because that matches the upstream HF safetensors layout. `sanitize()` has a special case for `linear.weight` / `linear.bias` so that the existing rule, which prepends the `model.` prefix to weight keys, does not wrongly rewrite them to `model.linear.*`.
* Routing decision: a static `MODEL_REMAPPING` entry would have been the smaller diff, but it can't condition on more than `model_type`. The two-condition check (`model_type == "qwen3"` AND `use_bidirectional_attention=True`) is the smallest extension that makes upstream voyage configs "just work" without users editing `config.json`.
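The `sanitize()` special case can be pictured with the sketch below. It is written as a free function over a plain dict and assumes the prefix rule is "prepend `model.` to any key that lacks it", which is consistent with the `model.linear.weight` error shown earlier but is not copied from the module; `TOP_LEVEL_KEYS` is a hypothetical name.

```python
# Keys that must stay at the top level to match the upstream safetensors layout.
TOP_LEVEL_KEYS = ("linear.weight", "linear.bias")


def sanitize(weights: dict) -> dict:
    """Rewrite checkpoint keys so they line up with the module's parameter tree."""
    sanitized = {}
    for key, value in weights.items():
        if key in TOP_LEVEL_KEYS:
            # Special case: the projection head lives beside `model`, not under it.
            sanitized[key] = value
        elif key.startswith("model."):
            sanitized[key] = value
        else:
            # Assumed pre-existing rule: nest everything else under `model.`.
            sanitized[f"model.{key}"] = value
    return sanitized


raw = {"linear.weight": "W", "norm.weight": "g", "model.embed_tokens.weight": "E"}
clean = sanitize(raw)
```

Without the first branch, `linear.weight` would fall through to the last one and come out as `model.linear.weight`, which is the key named in the load error.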

Validation

1. Convert the upstream snapshot

python -m mlx_embeddings.convert \
    --hf-path voyageai/voyage-4-nano \
    --mlx-path ./voyage-4-nano-bf16 \
    --dtype bfloat16

2. Load + embed a sentence and verify routing and output shape

python <<'PY'
from mlx_embeddings.utils import load
model, tok = load("./voyage-4-nano-bf16")
assert type(model).__module__ == "mlx_embeddings.models.qwen3_bidirec"
assert model.linear is not None
assert model.linear.weight.shape == (2048, 1024)
ids = tok("hello world", return_tensors="mlx",
          padding=True, truncation=True)
out = model(ids["input_ids"], attention_mask=ids["attention_mask"])
assert out.text_embeds.shape == (1, 2048)
print("voyage path ok")
PY
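Matryoshka-style embeddings are typically consumed by keeping a leading slice of the vector and re-normalizing it. The sketch below shows that step on a plain Python list; `truncate_and_normalize` is a hypothetical helper, and which truncated sizes voyage-4-nano was actually trained to support is not stated in this PR, so the size used here is purely illustrative.

```python
import math
from typing import List


def truncate_and_normalize(embedding: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be in 1..{len(embedding)}, got {dim}")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Toy 4-dimensional "embedding"; a real one would come from out.text_embeds.
full = [3.0, 4.0, 12.0, 84.0]
short = truncate_and_normalize(full, 2)
```

Re-normalizing after truncation matters when downstream code compares vectors with a dot product and expects unit-length inputs.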

3. Regression check on the existing Qwen3 decoder path

python <<'PY'
from mlx_embeddings.utils import load
model, _ = load("mlx-community/Qwen3-Embedding-0.6B-8bit")
assert type(model).__module__ == "mlx_embeddings.models.qwen3"
print("decoder Qwen3 path unchanged")
PY

New `qwen3_bidirec.py` module mirroring `llama_bidirec.py` for the Qwen3 architecture: same transformer building blocks as `qwen3.py` (reuses `Qwen3DecoderLayer`) but swaps the autoregressive causal mask for full bidirectional attention, mean-pools rather than last-token-pools, and optionally appends a `nn.Linear(hidden_size, num_labels)` projection head before pooling (used for Matryoshka outputs).

Motivating model: `voyageai/voyage-4-nano` (Voyage's first open-weights embedding model, Apache 2.0): Qwen3 base, 340M params, 2048d Matryoshka head. Upstream config declares `"model_type": "qwen3"` with `"use_bidirectional_attention": true` and `"num_labels": 2048`. The existing `qwen3.py` rejects it because `Model` has no slot for the projection weight (`linear.weight`).

Changes:

* `mlx_embeddings/models/qwen3_bidirec.py` (new): bidirectional Qwen3 with optional Matryoshka projection. Reuses `Qwen3DecoderLayer` and `mean_pooling`; no new runtime dependencies.
* `mlx_embeddings/utils.py` `_get_model_arch`: when `model_type == "qwen3"` and `use_bidirectional_attention=True`, route to the new module. The existing `qwen3.py` is untouched and continues to serve models that don't set the flag (e.g. `mlx-community/Qwen3-Embedding-0.6B-8bit`).
* `README.md`: list the new architecture.

See PR description for step-by-step validation and downstream benchmark numbers.

agsuy force-pushed the qwen3-bidirectional-encoder branch from 3e51938 to 6a32e1e · May 20, 2026 12:52

contrapuntal mentioned this pull request Jun 20, 2026

fix(qwen3): avoid out-of-memory crash on long-context embeddings #68

Open

agsuy wants to merge 1 commit into Blaizzy:main from agsuy:qwen3-bidirectional-encoder
