Overview

Our nllb models use the standard fp16 floating point format under the hood. This issue proposes exploring the benefits of using bf16 ("brain floating point"), as it may:
improve accuracy
reduce training time
reduce memory requirements
Proposal
understand what support nvidia GPUs have for bf16
add an experimental flag that allows enabling bf16 (current default behavior won't change)
for nllb, investigate what effect using bf16 instead of fp16 has in terms of
model accuracy
training time
GPU memory usage (see the measurement sketch after this list)
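For the memory comparison, a rough sketch of how peak GPU memory could be captured around a training run (assumes PyTorch; the exact hook point in silnlp is not shown here):

import torch

def report_peak_gpu_memory(label: str) -> None:
    # Peak memory allocated by tensors since the last reset, in MiB.
    peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"{label}: peak GPU memory = {peak_mib:.0f} MiB")

# Hypothetical usage:
# torch.cuda.reset_peak_memory_stats()
# ... run fp16 or bf16 training ...
# report_peak_gpu_memory("bf16 run")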
Investigation related to tf32 is out of scope.
In the future if bf16 proves to be stable and generally better, we may make that the default, but that's out of scope for this issue.
Current settings
Currently we use this logic to set the preferred floating point:
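Roughly, the relevant lines in _create_training_arguments look like this (see the diff in the comment below for the exact code):

merge_dict(
    args,
    {
        "fp16": self._mixed_precision and not self._is_t5,
        "bf16": self._mixed_precision and self._is_t5,
        "tf32": self._mixed_precision,
    },
)

(For context, the _is_t5 field is related to the google madlad model and will be False for nllb.)

For nllb, mixed precision is enabled by default, so usually the above will reduce to something like:

merge_dict(
    args,
    {
        "fp16": True,
        "bf16": False,
        "tf32": True,
    },
)

This means the models are currently using fp16.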
Notes on floating point formats

fp16

fp16 = "half precision floating point"
https://en.wikipedia.org/wiki/Half-precision_floating-point_format
It's the standard 16 bit floating point representation (see the IEEE 754 standard).
The 16 bits are used like so:

     1 5     10
fp16 s eeeee ffffffffff

s = sign bit
e = exponent bits
f = fractional bits

Because fp16 has only 5 exponent bits, the largest normal value that can be represented is 65504.

fp32

fp32 = "single precision floating point"
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
The big brother of fp16.

     1 8        23
fp32 s eeeeeeee fffffffffffffffffffffff

s = sign bit
e = exponent bits
f = fractional bits
fp16, fp32 and mixed precision
32 bit floating point values are more accurate, but they halve the number of values you can fit into a given amount of memory.
In the context of GPUs and model training, this can make training much slower.
That is why it can be helpful to use "mixed precision", where some values are stored and processed as 16 bit types like fp16.
There is a trade-off of accuracy for speed. Using fp16 also introduces problems because it can't represent very large numbers or very small positive numbers. These issues can lower the accuracy of the model and increase processing time and memory usage due to the conversions involved in loss scaling.
By default our model training uses mixed precision, but it can be disabled with --disable-mixed-precision.
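To make the fp16 limitations above concrete, here is a minimal numpy sketch (numpy is used purely for illustration; it is not how our training code handles precision):

import numpy as np

# The largest normal fp16 value; anything bigger overflows to infinity.
print(np.finfo(np.float16).max)   # 65504.0
print(np.float16(70000.0))        # inf  (overflow)

# Very small positive values underflow to zero, which is why fp16 training
# needs loss scaling in the first place.
print(np.float16(1e-8))           # 0.0  (underflow)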
bf16

bf16 = "brain floating point"
Created by the Google Brain AI research group.
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

     1 8        7
bf16 s eeeeeeee fffffff
       exp.     significand

s = sign bit
e = exponent bits
f = fractional bits
You can think of it as fp16 with 3 bits moved from the significand to the exponent. This means it can represent much larger numbers and much smaller positive numbers, but with fewer significant figures.
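The difference in range and precision is easy to see with torch.finfo (a quick sketch assuming PyTorch is available):

import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    # max  = largest finite value
    # tiny = smallest positive normal value
    # eps  = gap between 1.0 and the next representable value (precision)
    print(dtype, "max:", info.max, "tiny:", info.tiny, "eps:", info.eps)

# float16:  max 65504.0,   tiny ~6.1e-05,  eps ~9.8e-04
# bfloat16: max ~3.39e+38, tiny ~1.18e-38, eps ~7.8e-03
# float32:  max ~3.40e+38, tiny ~1.18e-38, eps ~1.2e-07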
Another way to understand it is in relation to fp32:
bf16 s eeeeeeee fffffff                 (7)
fp32 s eeeeeeee fffffffffffffffffffffff (23)
                       ^^^^^^^^^^^^^^^^
                       16 additional precision bits
It is really just fp32 with the last 16 precision bits "chopped off"; bf16 and fp32 have the same number of exponent bits.
Conversions between fp32 and bf16 are very efficient because of this bit layout:
fp32 -> bf16: chop off the last 16 bits
bf16 -> fp32: pad with 16 zero bits
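A minimal sketch of those two conversions using raw bit manipulation (numpy here, purely to illustrate the layout; frameworks and hardware do this natively):

import numpy as np

def fp32_to_bf16_bits(x: float) -> int:
    # Reinterpret the fp32 value as its 32 raw bits and keep only the top 16.
    # (Real conversions usually round to nearest; this is plain truncation.)
    bits32 = np.array(x, dtype=np.float32).view(np.uint32)
    return int(bits32) >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    # Pad with 16 zero bits on the right and reinterpret the result as fp32.
    bits32 = np.array(bits16 << 16, dtype=np.uint32).view(np.float32)
    return float(bits32)

x = 3.14159
b = fp32_to_bf16_bits(x)
print(hex(b))                # 0x4049
print(bf16_bits_to_fp32(b))  # 3.140625 -- x with the last 16 bits dropped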
There's a nice summary in this post of the potential benefits of bf16 over fp16.
tf32

tf32 = "tensor float 32"
https://en.wikipedia.org/wiki/TensorFloat-32
It only uses 19 bits in total:

bf16 s eeeeeeee fffffff                 (7)
fp32 s eeeeeeee fffffffffffffffffffffff (23)
tf32 s eeeeeeee ffffffffff              (10)
My impression is that it's intended as a way to "reinterpret" existing fp32 values at a lower precision to make processing them faster. I don't think it packs data into 19 bit chunks, so it's effectively a lazy, "backwards compatible" fp32.
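For reference, tf32 is something you opt into at the framework level; in PyTorch it looks like this (my understanding is that the "tf32" training argument above ultimately toggles these flags):

import torch

# On Ampere+ GPUs, allow matmuls and cuDNN convolutions to run in tf32:
# values are still stored as fp32, but the tensor cores use a 10 bit mantissa.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True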
Historical support
My impression is that bf16 was initially created by Google, with native support added to their custom TPUs.
This was back around 2018, but I haven't found an exact date.
At that point, the existing nvidia GPUs wouldn't have had native support for bf16.
However, since then other chip manufacturers have started adding native support, and I suspect that most nvidia hardware used by the silnlp team for local development and in our clearml infra would support it.
For example, these GPUs should support bf16:
3000 series
4000 series
A100
H100
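A quick way to check what a particular machine reports (a sketch assuming PyTorch; native bf16 needs compute capability 8.0+, i.e. Ampere or newer):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("device:", torch.cuda.get_device_name())
    print("compute capability:", f"{major}.{minor}")
    # Ampere (compute capability 8.0) and newer have native bf16 support.
    print("bf16 supported:", torch.cuda.is_bf16_supported())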
However it's not clear to me yet what effect it would have.
David volunteered to do the experimentation. For initial experimentation/POC, we don't even need a branch or experimental cli support - all David needs to do is hack changes directly to the method _create_training_arguments line 1225 of hugging_face_config.py:
 merge_dict(
     args,
     {
-        "fp16": self._mixed_precision and not self._is_t5,
-        "bf16": self._mixed_precision and self._is_t5,
+        "fp16": False,
+        "bf16": True,
         "tf32": self._mixed_precision,
     },
 )