
Fix Qwen3 when --dtype float16 on CPU / MPS #653


Closed

alvarobartt wants to merge 2 commits from the fix-cpu-metal-f16-qwen3 branch

Conversation

alvarobartt
Member

What does this PR do?

This PR is a follow-up fix to #648: float16 inference for Qwen3 models was broken because the attention_bias tensor's dtype defaulted to float32. This PR adds a dtype field to Qwen3Model so that the requested dtype is inherited.

As a side fix, this PR also removes the to_device call after Tensor::zeros_like, since it's not required: the resulting tensor is placed on the same device as the input tensor by default.
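For illustration, here is a minimal sketch of both changes, assuming candle's Tensor API; the trimmed-down struct and the build_attention_bias helper are hypothetical stand-ins, not the actual patch:

```rust
use candle::{DType, Device, Result, Tensor};

// Trimmed-down stand-in for the real model struct: the fix is to store the
// requested dtype (e.g. F16) next to the device so that tensors built at
// inference time inherit it instead of defaulting to F32.
struct Qwen3Model {
    device: Device,
    dtype: DType, // field added by this PR
}

impl Qwen3Model {
    // Hypothetical helper: build the additive attention bias in the model
    // dtype from the start.
    fn build_attention_bias(&self, bs: usize, seq_len: usize) -> Result<Tensor> {
        let attention_bias = Tensor::zeros((bs, 1, 1, seq_len), self.dtype, &self.device)?;
        // zeros_like already matches the device *and* dtype of its input,
        // so no to_device (or to_dtype) round-trip is needed afterwards.
        let zeros = attention_bias.zeros_like()?;
        debug_assert_eq!(zeros.dtype(), self.dtype);
        attention_bias.broadcast_add(&zeros)
    }
}
```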

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil or @kozistr

@alvarobartt force-pushed the fix-cpu-metal-f16-qwen3 branch from 0856372 to 30f9d1f on June 24, 2025 at 11:43
@Narsil
Collaborator

Narsil commented Jun 24, 2025

```diff
index 989968d..cfe056c 100644
--- a/backends/candle/src/models/qwen3.rs
+++ b/backends/candle/src/models/qwen3.rs
@@ -454,13 +454,14 @@ impl Qwen3Model {
         let causal_mask = Tensor::from_slice(&mask, (seq_len, seq_len), &Device::Cpu)?;
         let causal_mask = causal_mask.expand(&[bs, dim, seq_len, seq_len])?;
+        let device = attention_bias.device();
         let negatives =
             Tensor::full(f32::MIN, attention_bias.shape(), &Device::Cpu)?.to_dtype(self.dtype)?;
-        let zeros = Tensor::zeros_like(&attention_bias)?.to_dtype(self.dtype)?;
+        let zeros = Tensor::zeros_like(&attention_bias)?.to_device(&Device::Cpu)?.to_dtype(self.dtype)?;
         let causal_mask = causal_mask
             .where_cond(&negatives, &zeros)?
-            .to_device(&self.device)?;
+            .to_device(device)?;
         attention_bias.broadcast_add(&causal_mask)
     }
```

This one is working for me. The issue here is a device mismatch: the zeros inherit the device of attention_bias (e.g., Metal), while the negatives and the causal mask are on the CPU, so where_cond mixes devices. The fix builds everything on the CPU and moves the finished mask to the target device afterwards.
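To make the device handling concrete, here is a self-contained sketch of the pattern in the diff above, under stated assumptions: Device::Cpu stands in for the accelerator device that attention_bias would live on in the real failure case, and the shapes and mask construction are made up for the example:

```rust
use candle::{DType, Device, Tensor};

fn main() -> candle::Result<()> {
    let (bs, dim, seq_len) = (1, 1, 4);
    // Stand-in for attention_bias on the target device (Metal/CUDA in the
    // real failure case; CPU here so the example runs anywhere).
    let device = Device::Cpu;
    let attention_bias = Tensor::zeros((bs, dim, seq_len, seq_len), DType::F16, &device)?;

    // Mask out future positions (1 above the diagonal), built on the CPU.
    let mask: Vec<u8> = (0..seq_len * seq_len)
        .map(|i| (i % seq_len > i / seq_len) as u8)
        .collect();
    let causal_mask = Tensor::from_slice(&mask, (seq_len, seq_len), &Device::Cpu)?
        .expand(&[bs, dim, seq_len, seq_len])?;

    // All three where_cond operands must share a device: negatives and zeros
    // are therefore created on (or moved to) the CPU as well.
    let negatives =
        Tensor::full(f32::MIN, attention_bias.shape(), &Device::Cpu)?.to_dtype(DType::F16)?;
    let zeros = attention_bias.zeros_like()?.to_device(&Device::Cpu)?.to_dtype(DType::F16)?;

    // Select, then move the finished mask to the target device in one hop.
    let causal_mask = causal_mask
        .where_cond(&negatives, &zeros)?
        .to_device(&device)?;
    let biased = attention_bias.broadcast_add(&causal_mask)?;
    println!("{biased}");
    Ok(())
}
```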

@kozistr
Contributor

kozistr commented Jun 25, 2025

Thanks for catching this! I missed that part, sorry about that.

@alvarobartt
Member Author

Closed in favor of #654.
