[Bug] Unexpected performance drop with float8 training + compiling only nn.Linear layers + using selective per op AC #786
Comments
Thanks for flagging the issue. It looks like the memory usage is too high, which might cause throughput degradation. I think it's worth checking.
Also copying over what @awgu said offline, supporting point 1.
cc: @vkuzo, fyi
@tianyu-l I think we are running into another instance of this issue in #808 (see my comment here #808 (comment)). I went ahead and pulled some memory snapshots and analyzed them here #808 (comment). I'm still not sure of the root cause, but please feel free to add any input. If I'm correct that this issue is related, then the scope of this bug is bigger than we thought, and selective per op AC has some deeper issues that affect not just compiling only linear layers, but also fp8 row-wise quantization.
At the top of this issue, the peak memory usage in the problematic experiment is 92GB, which is too close to the max memory available on the H100 GPU.
^ is likely related to peak memory being close to the machine limit. I don't think there is conclusive evidence that #808 and this issue are related; they just both seem to be about selective per-op AC.
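For reference, memory snapshots like the ones mentioned above can be captured with PyTorch's allocator-history tooling. This is only a minimal sketch (not the exact capture setup used for #808), with a toy model standing in for the real training loop:

```python
import torch

# Start recording allocator events (PyTorch exposes these private helpers in recent versions).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# Toy stand-in for a few real training iterations.
model = torch.nn.Linear(4096, 4096).cuda()
for _ in range(3):
    x = torch.randn(32, 4096, device="cuda")
    model(x).sum().backward()

# Dump a snapshot that can be inspected at https://pytorch.org/memory_viz, then stop recording.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```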
Summary
I'm doing some benchmarking with torchtitan on H100s to compare the experimental feature in #778 with other training configurations.
One important comparison is:
I've run this comparison using:
However, specifically when using selective per op AC, there is a massive drop in performance with production float8 training when torch.compile is applied only to the nn.Linear layers, as compared to compiling the full model.
This does not occur when using no AC or full AC.
I would expect some performance degradation when compiling only nn.Linear instead of the full model, but the drop-off is massive (see the screenshot of benchmarks below): TFLOPS drops from 386.58 down to 125.88!
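To make the two configurations concrete, here is a minimal sketch (not torchtitan's actual code) of the difference between compiling the full model and compiling only the nn.Linear submodules; the toy model is just a stand-in:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block stack.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.SiLU(),
    nn.Linear(4096, 4096),
).cuda()

# (a) Full-model compile: dynamo traces the entire forward into compiled graphs.
model_full = torch.compile(model)

# (b) Linear-only compile: wrap each nn.Linear individually and leave the
#     surrounding ops (activations, norms, AC wrappers, ...) in eager mode.
for name, child in model.named_children():
    if isinstance(child, nn.Linear):
        setattr(model, name, torch.compile(child))
```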
I looked at the traces for these 2 runs, and found some surprising issues:
Only nn.Linears compiled (267ms):
(trace screenshot)
Full model compiled (71us):
(trace screenshot)
In the linear-only compiled run there is a long (~267ms) `FSDP::post_backward_reduce` call that does not appear in the fully compiled version (or rather, it is orders of magnitude faster there, at ~71us).
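For anyone reproducing this, a trace showing spans like `FSDP::post_backward_reduce` can be captured with the standard PyTorch profiler. A minimal sketch (torchtitan has its own profiling config, which is not shown here), using a toy step in place of the real FSDP training loop:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Toy stand-in for one forward/backward step of the real training loop.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
) as prof:
    model(x).sum().backward()

# Open the resulting JSON in chrome://tracing or Perfetto; in a real FSDP run,
# search for the FSDP::post_backward_reduce span.
prof.export_chrome_trace("trace.json")
```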
Steps to reproduce
1. Update `training_configs/llama3_8b.toml` to run prod float8 + fully compiled model + selective per op AC on H100s, then run: `NGPU=4 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh`
2. Update `training_configs/llama3_8b.toml` to run prod float8 + only linear layers compiled + selective per op AC on H100s (I don't think the # of GPUs matters), then run: `TORCHTITAN_COMPILE_LINEAR_ONLY=1 NGPU=4 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh`
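For context, "selective per op AC" refers to activation checkpointing with a per-operator save/recompute policy. A minimal sketch of that idea in plain PyTorch, assuming the `create_selective_checkpoint_contexts` API from recent PyTorch releases (torchtitan's actual policy and op list differ):

```python
from functools import partial

import torch
import torch.nn as nn
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# Save matmul outputs during forward; recompute everything else in backward.
def policy_fn(ctx, op, *args, **kwargs):
    if op in (torch.ops.aten.mm.default, torch.ops.aten.addmm.default):
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

block = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

out = checkpoint(
    block,
    x,
    use_reentrant=False,
    context_fn=partial(create_selective_checkpoint_contexts, policy_fn),
)
out.sum().backward()
```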
cc @vkuzo @soulitzer