
Experiment with bf16 on nvidia graphics cards #647

Open
rminsil opened this issue Feb 7, 2025 · 1 comment · Fixed by #648
Assignees
Labels
enhancement New feature or request

Comments

@rminsil
Collaborator

rminsil commented Feb 7, 2025

Overview

Our nllb models use the standard fp16 floating-point format under the hood. This issue proposes exploring the benefits of using bf16 ("brain floating point"), which may:

  • improve accuracy
  • reduce training time
  • reduce memory requirements

Proposal

  • understand what support nvidia GPUs have for bf16
  • add an experimental flag that allows enabling bf16 (current default behavior won't change)
  • for nllb, investigate what effect using bf16 instead of fp16 has in terms of
    • model accuracy
    • training time
    • GPU memory usage

Investigation related to tf32 is out of scope.

In the future, if bf16 proves to be stable and generally better, we may make it the default, but that's out of scope for this issue.

Current settings

Currently we use this logic to set the preferred floating-point settings:

def _create_training_arguments(self) -> Seq2SeqTrainingArguments:
    ...
    merge_dict(
        args,
        {
            "fp16": self._mixed_precision and not self._is_t5, # <------------
            "bf16": self._mixed_precision and self._is_t5,     # <------------
            "tf32": self._mixed_precision,
        },
    )
    ...

(For context, the _is_t5 field is related to the Google MADLAD model and will be False for nllb.)

For nllb, mixed precision is enabled by default, so usually the above will reduce to:

{
    "fp16": True,
    "bf16": False,
    "tf32": True,
}

This means the models are using fp16 currently.
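For reference, those three flags map directly onto Hugging Face's Seq2SeqTrainingArguments. A minimal sketch of the resolved values (illustrative only: "tmp_out" is a placeholder output directory, not our real config, and constructing this with fp16/tf32 enabled assumes a CUDA GPU, Ampere or newer for tf32):

from transformers import Seq2SeqTrainingArguments

# Illustrative only: placeholder output_dir, standalone construction outside our config code.
training_args = Seq2SeqTrainingArguments(
    output_dir="tmp_out",
    fp16=True,   # current default for nllb when mixed precision is enabled
    bf16=False,  # the flag this issue proposes experimenting with
    tf32=True,   # out of scope for this issue
)
print(training_args.fp16, training_args.bf16, training_args.tf32)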

Notes on floating point formats

fp16

fp16 = "half precision floating point"

https://en.wikipedia.org/wiki/Half-precision_floating-point_format

It's the standard 16 bit floating point representation (see IEEE 754 standard)

The 16 bits are used like so:

       1   5       10
       s eeeee ffffffffff
          exp. significand

s = sign bit
e = exponent bits
f = fractional bits

Because fp16 has only 5 exponent bits, the largest normal value that can be represented is 65504.
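A quick way to see those limits (a minimal sketch using numpy, not part of our codebase):

import numpy as np

info = np.finfo(np.float16)
print(info.max)             # 65504.0, the largest normal fp16 value
print(info.tiny)            # ~6.1e-05, the smallest positive normal fp16 value
print(np.float16(70000.0))  # too large for fp16, rounds to inf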

fp32

fp32 = "single precision floating point"

https://en.wikipedia.org/wiki/Single-precision_floating-point_format

The big brother of fp16

       1    8               23
fp32   s eeeeeeee fffffffffffffffffffffff

s = sign bit
e = exponent bits
f = fractional bits

fp16, fp32 and mixed precision

32-bit floats are more accurate, but they take twice the space, halving the number of values you can fit into a given area of memory.

In the context of GPUs and model training, this can make training much slower.

That is why it can be helpful to use "mixed precision", where you sometimes drop down to 16-bit values like fp16.

There is a trade-off between accuracy and speed. Using fp16 also introduces problems because it cannot represent very large numbers or very small positive numbers. These issues can lower the accuracy of the model, and the conversions involved in loss scaling add processing time and memory.

By default our model training uses mixed precision, but it can be disabled with --disable-mixed-precision.
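For illustration, mixed precision with loss scaling looks roughly like this in plain PyTorch (a generic sketch, not our training loop; the Hugging Face Trainer does the equivalent internally when fp16 is enabled, and this assumes a CUDA GPU):

import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so small fp16 gradients don't underflow

for step in range(10):
    batch = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    # Selected ops run in fp16 inside autocast; master weights stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()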

bf16

bf16 = "brain floating point"

Created by the Google Brain AI research group.

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

       1    8        7
       s eeeeeeee fffffff
          exp.    significand

s = sign bit
e = exponent bits
f = fractional bits

You can think of it as fp16 with 3 bits moved from the significand to the exponent. This means it can represent much larger numbers and much smaller positive numbers, but you get fewer significant figures.

Another way to understand it is in relation to fp32:

bf16   s eeeeeeee fffffff (7)
fp32   s eeeeeeee fffffffffffffffffffffff (23)
                         ^^^^^^^^^^^^^^^^
                              16
                       additional precision

It is really just fp32 with the last 16 precision bits "chopped off", and they have the same number of exponent bits.

Conversions between fp32 and bf16 are very efficient because of this bit layout (see the sketch after the list):

  • fp32 -> bf16: chop off the back 16 bits
  • bf16 -> fp32: pad the back with 16 zero bits
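A minimal Python sketch of those two conversions (illustrative only; real hardware typically rounds to nearest rather than simply truncating):

import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Reinterpret the float32 as 32 raw bits and drop the low 16 bits (truncation).
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    # Pad the back with 16 zero bits and reinterpret as a float32.
    (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return value

x = 3.14159265
b = fp32_to_bf16_bits(x)
print(hex(b), bf16_bits_to_fp32(b))  # 0x4049 -> 3.140625, only ~3 significant decimal digits survive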

There's a nice summary in this post of the potential benefits for bf16 over fp16.

tf32

tf32 = "tensor float 32"

https://en.wikipedia.org/wiki/TensorFloat-32

It only uses 19 bits in total:

       1    8        *
bf16   s eeeeeeee fffffff (7)
fp32   s eeeeeeee fffffffffffffffffffffff (23)
tf32   s eeeeeeee ffffffffff (10)

My impression is that it is intended as a way to "reinterpret" existing fp32 values at lower precision so they can be processed faster. I don't think it packs data into 19-bit chunks; values still occupy 32 bits of storage, so it is effectively a lazy, "backwards compatible" fp32.
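For context, this is how tf32 is typically toggled in PyTorch (illustration only, since tf32 investigation is out of scope for this issue):

import torch

# Allow matmuls and cuDNN convolutions to use TF32 tensor cores on Ampere+ GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True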

Historical support

My impression is that bf16 was initially created by Google, with native support added to their custom TPUs.
This was back around 2018, but I haven't found an exact date.

At that point, the existing nvidia GPUs wouldn't have had native support for bf16.

However, since then other chip manufacturers have started adding native support, and I suspect that most nvidia hardware
used by the silnlp team for local development and in our clearml infra would support it.

For example, these GPUs should support bf16:

  • 3000 series
  • 4000 series
  • A100
  • H100

However, it is not clear to me yet what effect enabling it would have.
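One quick way to check what a particular card reports is a small PyTorch snippet like this (a sketch to run on the machine or clearml agent in question):

import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    # Native bf16 support starts with Ampere (compute capability 8.0), e.g. 3000 series, A100.
    print(name, f"compute capability {major}.{minor}", "bf16 supported:", torch.cuda.is_bf16_supported())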

@rminsil rminsil linked a pull request Feb 7, 2025 that will close this issue
@rminsil rminsil reopened this Feb 7, 2025
@rminsil rminsil self-assigned this Feb 7, 2025
@davidbaines davidbaines added the enhancement New feature or request label Feb 7, 2025
@ddaspit ddaspit moved this from 🆕 New to 📋 Backlog in SIL-NLP Research Feb 7, 2025
@rminsil rminsil assigned davidbaines and unassigned rminsil Feb 9, 2025
@rminsil
Collaborator Author

rminsil commented Feb 9, 2025

David and I caught up to discuss this.

David volunteered to do the experimentation. For initial experimentation/POC, we don't even need a branch or experimental CLI support - all David needs to do is hack changes directly into the method _create_training_arguments at line 1225 of hugging_face_config.py:

         merge_dict(
             args,
             {
-                "fp16": self._mixed_precision and not self._is_t5,
-                "bf16": self._mixed_precision and self._is_t5,
+                "fp16": False,
+                "bf16": True,
                 "tf32": self._mixed_precision,
             },
         )
