
Cannot train HF-model with mixed precision when using A100 and L4 GPUs #4948

Open

bialczykk opened this issue Nov 14, 2024 · 1 comment

bialczykk commented Nov 14, 2024

Describe the current behavior
I just purchased Colab Pro+, as I need faster GPUs for fine-tuning an LLM for CausalLM with Hugging Face's Trainer API. Everything works fine on the T4 GPU with mixed precision (fp16 as a training argument, since it significantly speeds up training). However, when I run the same code on the A100 or L4 runtimes, the Trainer (with exactly the same config) starts, but the training and validation loss are not calculated at the evaluation steps, which works normally on the T4. When I remove the mixed-precision arguments, training runs normally on the A100 and L4, but without a significant speed-up, and is sometimes even slower than on the T4.
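
A minimal sketch of this kind of setup, for reference (placeholder model, data, and hyperparameters; the exact configuration from the report is not included in the issue):

```python
# Sketch only: a toy causal-LM fine-tuning run with fp16 mixed precision,
# matching the shape of the setup described above, not the reporter's code.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in; any HF causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=32)

ds = Dataset.from_dict({"text": ["hello world"] * 32}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="epoch",  # evaluate each epoch so train/val loss is reported
    logging_strategy="epoch",
    fp16=True,                    # mixed precision: fine on T4, reportedly no loss on A100/L4
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    eval_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```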

Describe the expected behavior
I want the Trainer API to run normally on the A100 runtime with mixed precision enabled, i.e. the training and validation loss should be calculated and reported over epochs.

What web browser you are using
Chrome

Additional context
This warning also appears when trainer.train() starts:
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[Screenshot attached: 2024-11-14 18:54]
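
A possible diagnostic, not part of the original report: disable the cuDNN SDPA backend that the warning points at and re-run, to see whether loss reporting recovers. Note that torch.backends.cuda.enable_cudnn_sdp only exists in recent PyTorch releases, hence the guard below.

```python
# Hedged diagnostic sketch (assumption, not from the issue): turn off the cuDNN
# scaled-dot-product-attention backend before training so PyTorch falls back to
# the flash / memory-efficient / math SDPA kernels.
import torch

if hasattr(torch.backends.cuda, "enable_cudnn_sdp"):  # only present in newer PyTorch builds
    torch.backends.cuda.enable_cudnn_sdp(False)

# ...then build the Trainer and call trainer.train() as usual.
```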

Link to a minimal, public, self-contained notebook that reproduces this issue.
To be added...

bialczykk added the bug label Nov 14, 2024
cperry-goog commented

I don't know that this is a Colab issue - I'd wager a lot has to do with NVIDIA support for the modules you're using. Do you have data to suggest this is not an issue with the hardware?
