
Cannot train HF-model with mixed precision when using A100 and L4 GPUs #4948

Open

bialczykk opened this issue Nov 14, 2024 · 1 comment

bialczykk commented Nov 14, 2024

Describe the current behavior
I just purchased Colab Pro+, as I need faster GPUs for fine-tuning an LLM for CausalLM with Hugging Face's Trainer API. Everything works fine on the T4 GPU with mixed precision (fp16 as a training argument, since it significantly speeds up training). However, when I run the same code on the A100 or L4 runtimes, the Trainer (with exactly the same config) starts, but the training and validation loss are not calculated at the evaluation steps, which works normally on the T4. When I remove the mixed-precision arguments, training runs normally on the A100 and L4, but without a significant speed-up, and is sometimes even slower than on the T4.
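
A minimal sketch of this kind of setup, for reference (placeholder model, data, and hyperparameters; the exact configuration from the report is not included in the issue):

```python
# Sketch only: a toy causal-LM fine-tuning run with fp16 mixed precision,
# matching the shape of the setup described above, not the reporter's code.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in; any HF causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=32)

ds = Dataset.from_dict({"text": ["hello world"] * 32}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="epoch",  # evaluate each epoch so train/val loss is reported
    logging_strategy="epoch",
    fp16=True,                    # mixed precision: fine on T4, reportedly no loss on A100/L4
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    eval_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```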

Describe the expected behavior
I want the Trainer API to run normally on the A100 runtime with mixed precision enabled, i.e. the training and validation loss should be calculated and reported over epochs.

What web browser you are using
Chrome

Additional context
This warning also appears when trainer.train() starts:
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[Screenshot attached: 2024-11-14 18:54]
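
A possible diagnostic, not part of the original report: disable the cuDNN SDPA backend that the warning points at and re-run, to see whether loss reporting recovers. Note that torch.backends.cuda.enable_cudnn_sdp only exists in recent PyTorch releases, hence the guard below.

```python
# Hedged diagnostic sketch (assumption, not from the issue): turn off the cuDNN
# scaled-dot-product-attention backend before training so PyTorch falls back to
# the flash / memory-efficient / math SDPA kernels.
import torch

if hasattr(torch.backends.cuda, "enable_cudnn_sdp"):  # only present in newer PyTorch builds
    torch.backends.cuda.enable_cudnn_sdp(False)

# ...then build the Trainer and call trainer.train() as usual.
```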

Link to a minimal, public, self-contained notebook that reproduces this issue.
To be added...

bialczykk added the bug label Nov 14, 2024
cperry-goog commented

I don't know that this is a Colab issue - I'd wager a lot has to do with NVIDIA support for the modules you're using. Do you have data to suggest this is not an issue with the hardware?
