Releases: huggingface/text-embeddings-inference
v1.8.2
🔧 Fixed Intel MKL Support
Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the candle dependency. Neither static-linking nor dynamic-linking worked correctly, which caused models using Intel MKL on CPU to fail with errors such as:  "Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM".
Starting with v1.8.2, this issue has been resolved by fixing how the intel-mkl-src dependency is defined. Both features, static-linking and dynamic-linking (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.
This issue occurred in the following scenarios:
- Users installing `text-embeddings-router` via `cargo` with the `--features mkl` flag. Although `dynamic-linking` should have been used, it was not working as intended.
- Users relying on the CPU `Dockerfile` when running models without ONNX weights. In these cases, Safetensors weights were used with `candle` as the backend (with MKL optimizations), instead of `ort`.
The following table shows the affected versions and containers:
| Version | Image | 
|---|---|
| 1.7.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 | 
| 1.7.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1 | 
| 1.7.2 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 | 
| 1.7.3 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3 | 
| 1.7.4 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4 | 
| 1.8.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0 | 
| 1.8.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 | 
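The affected range in the table can also be checked programmatically, for example when auditing pinned image tags. The following is a minimal sketch, not part of TEI: the helper names `parse_version` and `is_mkl_affected` are hypothetical, and it assumes tags end in a full three-part version such as `cpu-1.7.3`. It only compares version numbers, so deciding whether a deployment actually uses the CPU image with the `candle` backend is left to the caller.

```python
def parse_version(tag: str) -> tuple[int, int, int]:
    """Parse 'cpu-1.7.3', 'v1.8.2' or '1.8.0' into (major, minor, patch).

    Hypothetical helper: assumes the tag ends in a full three-part version.
    """
    version = tag.rsplit("-", 1)[-1].lstrip("v")
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_mkl_affected(tag: str) -> bool:
    """Return True for versions from v1.7.0 up to, but not including, v1.8.2."""
    return (1, 7, 0) <= parse_version(tag) < (1, 8, 2)
```

For instance, `is_mkl_affected("cpu-1.8.1")` is true, while `is_mkl_affected("cpu-1.8.2")` is false.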
More details: PR #715
Full Changelog: v1.8.1...v1.8.2
v1.8.1
 
Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
- CPU:
```bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```
- CPU with ONNX Runtime:
```bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
```
- NVIDIA CUDA:
```bash
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```
Notable Changes
- Add support for Gemma3 (text-only) architecture
- Intel updates to Synapse 1.21.3 and IPEX 2.8
- Extend ONNX Runtime support in `OrtRuntime`
  - Support `position_ids` and `past_key_values` as inputs
  - Handle `padding_side` and `pad_token_id`
 
- Support 
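Once a container from one of the `docker run` commands for this release is up, it can be queried over HTTP. The snippet below is a minimal sketch using only the Python standard library. It assumes the port mapping `-p 8080:80` used in those commands and the `/embed` route with an `inputs` field; the function names `build_embed_request` and `embed` are hypothetical.

```python
import json
import urllib.request


def build_embed_request(text, base_url="http://localhost:8080"):
    """Build a POST request for the /embed route (assumed default local port)."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url="http://localhost:8080"):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_embed_request(text, base_url)) as response:
        return json.loads(response.read())
```

Calling `embed("What is Deep Learning?")` against a running container should return the decoded JSON response; the request construction is kept separate so it can be inspected without a server.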
What's Changed
- Adjust HPU warmup: use dummy inputs with shape more close to real scenario by @kaixuanliu in #689
- Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in #696
- Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in #697
- Disable Flash Attention with `USE_FLASH_ATTENTION` by @alvarobartt in #692
- Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in #700
- HPU upgrade to Synapse 1.21.3 by @kaixuanliu in #703
- Upgrade to IPEX 2.8 by @kaixuanliu in #702
- Parse `modules.json` to identify default `Dense` modules by @alvarobartt in #701
- Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in #705
- Update `docs/openapi.json` for v1.8.0 by @alvarobartt in #708
- Add Gemma3 architecture (text-only) by @alvarobartt in #711
- Update `version` to 1.8.1 by @alvarobartt in #712
Full Changelog: v1.8.0...v1.8.1
v1.8.0
 
Notable Changes
- Qwen3 support for 0.6B, 4B and 8B on CPU, MPS, and FlashQwen3 on CUDA and Intel HPUs
- NomicBert MoE support
- JinaAI Re-Rankers V1 support
- Matryoshka Representation Learning (MRL)
- Dense layer module support (after pooling)
Note
Some of the aforementioned changes were already released in the patch versions on top of v1.7.0, whilst Matryoshka Representation Learning (MRL) and Dense layer module support were added more recently and had not been released until now.
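For readers unfamiliar with MRL, the general idea is that a model trained this way produces embeddings whose leading dimensions remain useful on their own, so a vector can be shortened by keeping its first components. The sketch below illustrates that idea only; it is not TEI's implementation. The function name `truncate_embedding`, the `dimensions` parameter, and the re-normalization step are assumptions made for illustration.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale to unit L2 norm.

    Illustrative only: re-normalizing after truncation is an assumption here.
    """
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(value * value for value in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [value / norm for value in head]
```

Shorter vectors reduce storage and similarity-search cost, usually in exchange for some loss in retrieval quality.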
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
- Support MRL (Matryoshka Representation Learning) by @kozistr in #676
- Add `Dense` layer for `2_Dense/` modules by @alvarobartt in #660
- Update `version` to 1.8.0 by @alvarobartt in #686
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
- @randomm made their first contribution in #632
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.0...v1.8.0
v1.7.4
Noticeable Changes
Qwen3 did not work correctly on CPU / MPS when sending batched requests in FP16 precision. The FP32 minimum value was being downcast to FP16, which led to null values (it is now manually set to the FP16 minimum value instead), and a to_dtype call on the attention_bias was missing when working with batches.
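The downcast problem can be reproduced in isolation. The sketch below is an illustration in plain Python, not TEI's code: it emulates a cast to half precision with the standard `struct` module and shows one plausible way an FP32 minimum used as a mask value turns into invalid outputs. The helper names `to_f16` and `softmax` are hypothetical, and whether the model hit exactly this path (a fully masked attention row) is an assumption.

```python
import math
import struct

F32_MIN = -3.4028234663852886e38  # most negative finite float32
F16_MIN = -65504.0                # most negative finite float16


def to_f16(value):
    """Round to IEEE 754 half precision, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct refuses out-of-range values; a hardware cast yields infinity.
        return math.copysign(math.inf, value)


def softmax(scores):
    """Numerically stabilized softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Mask value after the downcast: FP32 minimum overflows, FP16 minimum survives.
bad_mask = to_f16(F32_MIN)   # becomes -inf
good_mask = to_f16(F16_MIN)  # stays finite

# A row where every position is masked: -inf minus -inf is NaN.
broken_row = softmax([bad_mask, bad_mask])
finite_row = softmax([good_mask, good_mask])
```

With the infinite mask, `broken_row` contains only NaN values, while the finite FP16 minimum keeps `finite_row` well defined.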
What's Changed
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
Full Changelog: v1.7.3...v1.7.4
v1.7.3
Noticeable Changes
Qwen3 support included for Intel HPU, and fixed for CPU / Metal / CUDA.
What's Changed
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
New Contributors
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.2...v1.7.3
v1.7.2
Notable change
- Added support for Qwen3 embeddings
What's Changed
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
New Contributors
- @randomm made their first contribution in #632
Full Changelog: v1.7.1...v1.7.2
v1.7.1
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
Full Changelog: v1.7.0...v1.7.1
v1.7.0
Notable changes
- Upgrade dependencies heavily (candle 0.5 -> 0.8 and related)
- Added ModernBert support by @kozistr!
What's Changed
- Moving cublaslt into TEI extension for easier upgrade of candle globally by @Narsil in #542
- Upgrade candle2 by @Narsil in #543
- Upgrade candle3 by @Narsil in #545
- Fixing the static-linking. by @Narsil in #547
- Fix linking bis by @Narsil in #549
- Make `sliding_window` for `Qwen2` optional by @alvarobartt in #546
- Optimize the performance of FlashBert on HPU by using fast mode softmax by @kaixuanliu in #555
- Fixing cudarc to the latest unified bindings. by @Narsil in #558
- Fix typos / formatting in CLI args in Markdown files by @alvarobartt in #552
- Use custom `serde` deserializer for JinaBERT models by @alvarobartt in #559
- Implement the `ModernBert` model by @kozistr in #459
- Fixing FlashAttention ModernBert. by @Narsil in #560
- Enable ModernBert on metal by @ivarflakstad in #562
- Fix `{Bert,DistilBert}SpladeHead` when loading from Safetensors by @alvarobartt in #564
- add related docs for intel cpu/xpu/hpu container by @kaixuanliu in #550
- Update the doc for submodule. by @Narsil in #567
- Update `docs/source/en/custom_container.md` by @alvarobartt in #568
- Preparing for release 1.7.0 (candle update + modernbert). by @Narsil in #570
New Contributors
- @ivarflakstad made their first contribution in #562
Full Changelog: v1.6.1...v1.7.0
v1.6.1
What's Changed
- Enable intel devices CPU/XPU/HPU for python backend by @yuanwu2017 in #245
- add reranker model support for python backend by @kaixuanliu in #386
- (FIX): CI Security Fix - branchname injection by @glegendre01 in #479
- Upgrade TEI. by @Narsil in #501
- Pin `cargo-chef` installation to 0.1.62 by @alvarobartt in #469
- add `TRUST_REMOTE_CODE` param to python backend. by @kaixuanliu in #485
- Enable splade embeddings for Python backend by @pi314ever in #493
- Hpu bucketing by @kaixuanliu in #489
- Optimize flash bert path for hpu device by @kaixuanliu in #509
- upgrade ipex to 2.6 version for cpu/xpu by @kaixuanliu in #510
- fix bug for `MaskedLanguageModel` class by @kaixuanliu in #513
- Fix double incrementing `te_request_count` metric by @kozistr in #486
- Add intel based images to the CI by @baptistecolle in #518
- Fix typo on intel docker image by @baptistecolle in #529
- chore: Upgrade to tokenizers 0.21.0 by @lightsofapollo in #512
- feat: add support for "model_type": "gte" by @anton-pt in #519
- Update `README.md` to include ONNX by @alvarobartt in #507
- Fusing both Gte Configs. by @Narsil in #530
- Add `HF_HUB_USER_AGENT_ORIGIN` by @alvarobartt in #534
- Use `--hf-token` instead of `--hf-api-token` by @alvarobartt in #535
- Fixing the tests. by @Narsil in #531
- Support classification head for DistilBERT by @kozistr in #487
- add CLI flag `disable-spans` to toggle span trace logging by @obloomfield in #481
- feat: support HF_ENDPOINT environment when downloading model by @StrayDragon in #505
- Small fixup. by @Narsil in #537
- Fix `VarBuilder` handling in GTE e.g. `gte-multilingual-reranker-base` by @Narsil in #538
- make a WA in case Bert model do not have `safetensor` file by @kaixuanliu in #515
- Add missing `match` on `onnx/model.onnx` download by @alvarobartt in #472
- Fixing the impure flake devShell to be able to run python code. by @Narsil in #539
- Prepare for release. by @Narsil in #540
New Contributors
- @yuanwu2017 made their first contribution in #245
- @kaixuanliu made their first contribution in #386
- @Narsil made their first contribution in #501
- @pi314ever made their first contribution in #493
- @baptistecolle made their first contribution in #518
- @lightsofapollo made their first contribution in #512
- @anton-pt made their first contribution in #519
- @obloomfield made their first contribution in #481
- @StrayDragon made their first contribution in #505
Full Changelog: v1.6.0...v1.6.1
v1.6.0
What's Changed
- feat: support multiple backends at the same time by @OlivierDehaene in #440
- feat: GTE classification head by @kozistr in #441
- feat: Implement GTE model to support the non-flash-attn version by @kozistr in #446
- feat: Implement MPNet model (#363) by @kozistr in #447
Full Changelog: v1.5.1...v1.6.0