
Fix Qwen3-Embedding batch vs single inference inconsistency #648


Merged

Conversation

lance-miles
Contributor

@lance-miles lance-miles commented Jun 18, 2025

What does this PR do?

Fix Qwen3-Embedding batch vs single inference inconsistency

Fixes #642 (a regression introduced in PR #646)

Problem

PR #646 introduced a regression where Qwen3-Embedding models produced inconsistent embeddings between batch and single-sequence inference for identical inputs. The test backends/candle/tests/test_qwen3.rs was failing with an assertion error on the line:
assert_eq!(embeddings_batch[0], embeddings_single[0]);

Root Cause

The issue stemmed from inconsistent attention bias handling between batch and single sequence processing:

  1. Batch processing: Applied padding and attention bias correctly with causal masking
  2. Single processing: Did not create an attention bias tensor, causing behavioral differences between SDPA and eager attention
  3. Padding inconsistency: The original implementation used right padding, but Qwen3-Embedding requires left padding for proper causal attention

Investigation

Further analysis revealed that Qwen3-Embedding models use causal attention by design (not bidirectional like BERT), requiring:

  • Left padding for batched sequences so their EOS tokens align (see the sketch below)
  • Consistent causal attention masking for both single and batch inference
  • Last-token indexing that accounts for the padding position
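
To make the padding requirement concrete, here is a minimal numpy sketch (hypothetical token IDs, not the actual TEI/candle code) showing why left padding keeps every sequence's EOS token at the same, trivially indexable position:

import numpy as np

PAD, EOS = 0, 2  # hypothetical IDs, for illustration only

# Two sequences of different lengths, each ending with EOS.
seqs = [[11, 12, 13, EOS], [21, 22, EOS]]
max_len = max(len(s) for s in seqs)

# Right padding: the EOS position differs per row, so last-token pooling
# needs a per-sequence index derived from the attention mask.
right = np.array([s + [PAD] * (max_len - len(s)) for s in seqs])

# Left padding: the last column is always EOS, so last-token pooling can
# simply read position max_len - 1 for every row.
left = np.array([[PAD] * (max_len - len(s)) + s for s in seqs])

assert (left[:, -1] == EOS).all()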

Fix

This fix ensures consistent behavior by:

  1. Left Padding Implementation:
    - Pad sequences at the beginning (left) rather than at the end (right)
    - Aligns with Qwen3-Embedding's causal attention requirements
  2. Consistent Attention Bias Creation:
    - Create a causal attention bias for both single and batch processing
    - Apply identical upper-triangular masking in both code paths (see the sketch after this list)
  3. Correct Last Token Indexing:
    - Account for left padding when extracting last-token embeddings
    - Ensure the EOS token is correctly identified for pooling
  4. Causal Attention Masking:
    - Maintain proper causal mask generation that prevents attention to future tokens
    - Consistent with Qwen3-Embedding's architectural requirements
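
The combined mask described in points 2 and 4 can be pictured with a small numpy sketch (an illustration under assumed conventions; the actual implementation operates on candle tensors in backends/candle/src/models/qwen3.rs):

import numpy as np

def attention_bias(seq_len: int, pad_len: int) -> np.ndarray:
    """Additive attention bias: 0.0 where attention is allowed, -inf elsewhere."""
    bias = np.zeros((seq_len, seq_len), dtype=np.float32)
    # Causal mask: position i must not attend to future positions j > i.
    bias[np.triu_indices(seq_len, k=1)] = -np.inf
    # Left padding: no position may attend to the padded prefix.
    bias[:, :pad_len] = -np.inf
    return bias

# A 5-token window whose first 2 positions are left padding.
print(attention_bias(seq_len=5, pad_len=2))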

Changes

  • backends/candle/src/models/qwen3.rs: Updated batch processing logic, attention bias handling, and last token extraction
  • Test snapshots: Updated to reflect correct implementation outputs

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@lance-miles lance-miles marked this pull request as ready for review June 18, 2025 17:16
Contributor

@kozistr kozistr left a comment


Thanks for catching this quickly :)

And I've also verified that the output is identical with Transformers!

Looks good to me!

@lance-miles
Contributor Author

@kozistr Thank you so much for the initial PR and initial fix!!

Tagging @Narsil or @alvarobartt for review

@lance-miles
Contributor Author

I've run the pre-commit hooks and fixed the formatting issues.

Collaborator

@Narsil Narsil left a comment


LGTM

@kozistr kozistr mentioned this pull request Jun 21, 2025
@lance-miles
Contributor Author

@alvarobartt and @Narsil,
Is there anything else you need me to do before merging this one in? Please let me know how I can help!

@alvarobartt
Member

alvarobartt commented Jun 24, 2025

Hey @lance-miles, apologies for the delay!

FYI float16 inference is now broken on CPU and MPS, but I already have a fix for that in a follow-up PR, so no need to worry about that!

But when comparing the outputs with the Sentence Transformers counterpart, I realized that they still don't match, even though both you and @kozistr got matching results. Could you please share the snippet you used? For reference, I deployed Text Embeddings Inference (TEI) from this branch plus the causal attention mask boolean flag from #650, ran it as cargo run --release --features candle-cuda,dynamic-linking,http --no-default-features -- --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 on an instance with a single NVIDIA L40 48GB, and then compared it against the Sentence Transformers output in Python with numpy as follows:

import numpy as np
import requests
from sentence_transformers import SentenceTransformer

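# Reference model: float16 + FlashAttention-2 on CUDA, with left padding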
model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": "float16",
        "device_map": "cuda",
    },
    tokenizer_kwargs={"padding_side": "left"},
)

out_py = model.encode(
    ["What is Deep Learning?", "Who is Walt Disney?"], normalize_embeddings=True
)
print(out_py)

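# Same inputs through the TEI /embed endpoint served from this branch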
response = requests.post(
    "http://localhost:3000/embed",
    json={
        "inputs": ["What is Deep Learning?", "Who is Walt Disney?"],
        "normalize": True,
    },
)

response.raise_for_status()
out = response.json()
out_http = np.array(out, dtype=np.float16)

np.testing.assert_allclose(out_py, out_http)

And that will fail with:

Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1883 / 2048 (91.9%)
Max absolute difference among violations: 0.0005035
Max relative difference among violations: 5.
 ACTUAL: array([[-0.01566 , -0.03204 , -0.010994, ..., -0.004463,  0.03345 ,
         0.001066],
       [ 0.00918 ,  0.03442 , -0.004593, ...,  0.03952 , -0.01244 ,
        -0.01973 ]], shape=(2, 1024), dtype=float16)
 DESIRED: array([[-0.01567 , -0.0318  , -0.010956, ..., -0.00437 ,  0.0334  ,
         0.001115],
       [ 0.00939 ,  0.0343  , -0.004593, ...,  0.03952 , -0.01247 ,
        -0.01971 ]], shape=(2, 1024), dtype=float16)

P.S. There's a mismatch on both single and batched inference; the cosine similarity value seems to be 1.0, but the allclose check fails, when AFAIK it should produce the exact same embedding.

Thanks again for the PR @lance-miles 🤗 And we can merge as soon as we clarify that, then I'll patch the float16 on both CPU and MPS.

@kozistr
Contributor

kozistr commented Jun 24, 2025

(Quoting @alvarobartt's comment above in full; the allclose check fails on both single and batch inference on my end as well.)

Hi! Personally, I use this script to validate the outputs: code (it needs to be modified to work with Qwen3-Embedding).

I might be wrong, but IMHO the difference between Sentence Transformers and TEI seems marginal, primarily due to precision and minor differences at some layers (e.g. activation, ...).
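
For what it's worth, here is a quick numpy sketch (synthetic vectors, not the actual embeddings) of how float16 noise on the order reported above (~5e-4 per element) breaks a zero-tolerance assert_allclose while leaving the cosine similarity at ~1.0; an atol around 1e-3 would be a more realistic bar for float16 outputs:

import numpy as np

rng = np.random.default_rng(0)

# A unit-norm "embedding", stored in float16 like the TEI output above.
a = rng.standard_normal(1024).astype(np.float32)
a = (a / np.linalg.norm(a)).astype(np.float16)

# Perturb each element by at most 5e-4, roughly the max absolute
# difference reported in the failing check, then renormalize.
b = a.astype(np.float32) + rng.uniform(-5e-4, 5e-4, a.shape)
b = (b / np.linalg.norm(b)).astype(np.float16)

print("cosine similarity:", float(a.astype(np.float32) @ b.astype(np.float32)))  # ~1.0

try:
    np.testing.assert_allclose(a, b)  # default rtol=1e-7, atol=0
except AssertionError:
    print("strict allclose fails, as in the report above")

np.testing.assert_allclose(a, b, atol=1e-3)  # float16-appropriate tolerance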

@alvarobartt alvarobartt merged commit f7aa35b into huggingface:main Jun 24, 2025
3 of 13 checks passed
@pocman

pocman commented Jun 27, 2025

Amazing work! Could we schedule a 1.7.3 release to ship this fix?

Development

Successfully merging this pull request may close these issues.

Qwen3-Embedding models: embeddings from TEI differ sharply from Sentence-Transformers reference