Questions about accuracy alignment between BF16 and FP8 #1419

Open
zigzagcai opened this issue Jan 22, 2025 · 2 comments
Labels: question (Further information is requested)

Comments


zigzagcai commented Jan 22, 2025

Hello developers,

Thanks for introducing such a great library that demonstrates the power of FP8 training.

However, when I tried to integrate FP8 training into my training framework, I found it hard to align the accuracy/loss between BF16 and FP8. In my experiments, the difference is not negligible; I think the ideal difference should be on the order of 1e-2 to 1e-3, which would be within negligible random error.

Also, I could not find any resources that show the accuracy alignment between BF16 and FP8.

Could anyone give me some hints about this?

timmoon10 added the question (Further information is requested) label on Jan 22, 2025
timmoon10 (Collaborator) commented Jan 22, 2025

The expected rounding error (machine epsilon) can be estimated based on the number of mantissa bits. TE uses two FP8 datatypes:

  • fp8e4m3: eps = 2^-4 = 6.25e-2
  • fp8e5m2: eps = 2^-3 = 1.25e-1

We just don't have many bits to work with, so the numerical errors can get quite large. This becomes trickier when training since such numerical errors can disrupt training and cause the loss curve to diverge. In general we don't expect good results from naively changing compute to FP8.
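
To see how large this rounding error is in practice, here is a minimal sketch (not TE code), assuming a PyTorch build (>= 2.1) that exposes the `torch.float8_e4m3fn` / `torch.float8_e5m2` dtypes and supports casting to and from them:

```python
import torch

x = torch.randn(1 << 16, dtype=torch.float32)

for fp8_dtype, eps in [(torch.float8_e4m3fn, 2.0**-4), (torch.float8_e5m2, 2.0**-3)]:
    # Round-trip through FP8: quantize, then dequantize back to FP32.
    x_rt = x.to(fp8_dtype).to(torch.float32)
    rel_err = ((x - x_rt).abs() / x.abs().clamp_min(1e-12)).mean().item()
    print(f"{fp8_dtype}: eps = {eps:.3e}, mean relative round-trip error = {rel_err:.3e}")
```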

TE avoids these problems with per-tensor scaling. By shifting tensor values before casting to FP8, we fully utilize the available dynamic range and reduce the number of values that underflow to zero. See the TE docs for a more thorough explanation. Using tricks like this, we've been able to do full-scale training runs in FP8 with minimal degradation relative to BF16.
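
As a conceptual illustration only (a simplified sketch, not TE's actual implementation), per-tensor scaling boils down to choosing one scale so the tensor's absolute maximum lands near the top of the FP8 range before casting:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_per_tensor(x: torch.Tensor):
    # One scale for the whole tensor, derived from its absolute maximum.
    amax = x.abs().max().clamp_min(1e-12)
    scale = FP8_E4M3_MAX / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) / scale

# Small values that would underflow to zero in a naive FP8 cast survive here:
x = torch.randn(1024) * 1e-4
x_fp8, scale = quantize_per_tensor(x)
print((dequantize(x_fp8, scale) - x).abs().max())
```

TE additionally tracks amax history across iterations (delayed scaling) rather than always recomputing it from the current tensor, but the effect on dynamic range is the same.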

The next step after per-tensor scaling is block scaling (see this and this for discussion of the "MX" data formats). DeepSeek also used a similar approach in their recent training run. This is an area under active development and research, so watch this space.
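
For contrast, here is a rough sketch of the block-scaling idea: one scale per small block of values instead of one per tensor. The MX formats and the DeepSeek recipe are more involved (e.g. shared power-of-two scales and specific block shapes), so treat this only as an illustration of the concept; the block size below is arbitrary.

```python
import torch

FP8_E4M3_MAX = 448.0
BLOCK = 128  # illustrative block size, not taken from any spec

def quantize_per_block(x: torch.Tensor):
    blocks = x.reshape(-1, BLOCK)  # assumes x.numel() is divisible by BLOCK
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scales = FP8_E4M3_MAX / amax   # one scale per block
    return (blocks * scales).to(torch.float8_e4m3fn), scales

def dequantize_per_block(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.to(torch.float32) / scales).reshape(shape)

x = torch.randn(4096)
q, scales = quantize_per_block(x)
print((dequantize_per_block(q, scales, x.shape) - x).abs().max())
```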

zigzagcai (Author) commented Jan 23, 2025

Hi @timmoon10,

Thanks for the reply!

This is really interesting! I will read through the details of the two listed papers on block-wise FP8.
