Questions about accuracy alignment between BF16 and FP8 #1419

Open
zigzagcai opened this issue Jan 22, 2025 · 2 comments
Labels: question (Further information is requested)

Comments


zigzagcai commented Jan 22, 2025

Hello developers,

Thanks for introducing such a great library that demonstrates the power of FP8 training.

However, when I tried to integrate FP8 training into my training framework, I found it hard to align the accuracy/loss between BF16 and FP8. In my experiments, the difference is not negligible; I think the ideal difference should be on the order of 1e-2 to 1e-3, which would be within negligible random error.

Also, I could not find any resources that show the accuracy alignment between BF16 and FP8.

Could anyone give me some hints about this?

timmoon10 added the question (Further information is requested) label on Jan 22, 2025
timmoon10 (Collaborator) commented Jan 22, 2025

The expected rounding error (machine epsilon) can be estimated based on the number of mantissa bits. TE uses two FP8 datatypes:

  • fp8e4m3: eps = 2^-4 = 6.25e-2
  • fp8e5m2: eps = 2^-3 = 1.25e-1

We just don't have many bits to work with, so the numerical errors can get quite large. This becomes trickier when training since such numerical errors can disrupt training and cause the loss curve to diverge. In general we don't expect good results from naively changing compute to FP8.
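
To see how large this rounding error is in practice, here is a minimal sketch (not TE code), assuming a PyTorch build (>= 2.1) that exposes the `torch.float8_e4m3fn` / `torch.float8_e5m2` dtypes and supports casting to and from them:

```python
import torch

x = torch.randn(1 << 16, dtype=torch.float32)

for fp8_dtype, eps in [(torch.float8_e4m3fn, 2.0**-4), (torch.float8_e5m2, 2.0**-3)]:
    # Round-trip through FP8: quantize, then dequantize back to FP32.
    x_rt = x.to(fp8_dtype).to(torch.float32)
    rel_err = ((x - x_rt).abs() / x.abs().clamp_min(1e-12)).mean().item()
    print(f"{fp8_dtype}: eps = {eps:.3e}, mean relative round-trip error = {rel_err:.3e}")
```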

TE avoids these problems with per-tensor scaling. By shifting tensor values before casting to FP8, we fully utilize the available dynamic range and reduce the number of values that underflow to zero. See the TE docs for a more thorough explanation. Using tricks like this, we've been able to do full-scale training runs in FP8 with minimal degradation relative to BF16.
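
As a conceptual illustration only (a simplified sketch, not TE's actual implementation), per-tensor scaling boils down to choosing one scale so the tensor's absolute maximum lands near the top of the FP8 range before casting:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_per_tensor(x: torch.Tensor):
    # One scale for the whole tensor, derived from its absolute maximum.
    amax = x.abs().max().clamp_min(1e-12)
    scale = FP8_E4M3_MAX / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) / scale

# Small values that would underflow to zero in a naive FP8 cast survive here:
x = torch.randn(1024) * 1e-4
x_fp8, scale = quantize_per_tensor(x)
print((dequantize(x_fp8, scale) - x).abs().max())
```

TE additionally tracks amax history across iterations (delayed scaling) rather than always recomputing it from the current tensor, but the effect on dynamic range is the same.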

The next step after per-tensor scaling is block scaling (see this and this for discussion of the "MX" data formats). DeepSeek also used a similar approach in their recent training run. This is an area under active development and research, so watch this space.
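
For contrast, here is a rough sketch of the block-scaling idea: one scale per small block of values instead of one per tensor. The MX formats and the DeepSeek recipe are more involved (e.g. shared power-of-two scales and specific block shapes), so treat this only as an illustration of the concept; the block size below is arbitrary.

```python
import torch

FP8_E4M3_MAX = 448.0
BLOCK = 128  # illustrative block size, not taken from any spec

def quantize_per_block(x: torch.Tensor):
    blocks = x.reshape(-1, BLOCK)  # assumes x.numel() is divisible by BLOCK
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scales = FP8_E4M3_MAX / amax   # one scale per block
    return (blocks * scales).to(torch.float8_e4m3fn), scales

def dequantize_per_block(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.to(torch.float32) / scales).reshape(shape)

x = torch.randn(4096)
q, scales = quantize_per_block(x)
print((dequantize_per_block(q, scales, x.shape) - x).abs().max())
```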

zigzagcai (Author) commented Jan 23, 2025

Hi @timmoon10,

Thanks for the reply!

This is really interesting! I will read through the details of the two listed papers on block-wise FP8.
