[not for land yet] example of float8 with rowwise scaling

vkuzo · vkuzo · commit 3ebdf0592d5c · 2025-01-27T13:35:59.000-08:00
Summary:

This is an example of how to call float8 training with rowwise scaling
from torchao.

TODO: finalize the API in torchao, decide how to expose it in torchtitan,
and optimize performance.
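
For readers new to the term, here is a minimal sketch of how rowwise
scaling differs from tensorwise scaling. It is plain Python for illustration
only, not torchao's implementation. The helper names (`tensorwise_scale`,
`rowwise_scales`) and the `eps` guard are assumptions made for this example,
and it assumes the float8 e4m3 format, whose largest finite value is 448.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part of torchao.
E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def tensorwise_scale(matrix, eps=1e-12):
    """One scale for the whole tensor, derived from its global abs-max."""
    amax = max(abs(x) for row in matrix for x in row)
    return E4M3_MAX / max(amax, eps)


def rowwise_scales(matrix, eps=1e-12):
    """One scale per row, derived from each row's own abs-max."""
    return [E4M3_MAX / max(max(abs(x) for x in row), eps) for row in matrix]


# Two rows with very different magnitudes.
matrix = [
    [0.5, -1.0, 0.25],
    [100.0, -200.0, 50.0],
]

print("tensorwise:", tensorwise_scale(matrix))
print("rowwise:   ", rowwise_scales(matrix))
```

With a single tensorwise scale, the small-magnitude row is scaled by the
factor chosen for the large row and uses only a small part of the float8
range. With rowwise scales, each row is stretched to the full range
independently.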

```
// baseline (bf16 + compile)
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.compile
...
step: 20  loss:  8.4931  memory: 47.65GiB(50.16%)  tps: 5,760  mfu: 33.73%

// experiment (rowwise float8 + compile)
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.compile
...
step: 40  loss:  7.3818  memory: 66.81GiB(70.33%)  tps: 6,412  mfu: 37.55%

// for comparison, tensorwise float8 with float8 all-gather (on main branch)
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.compile --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp
...
step: 20  loss:  8.4258  memory: 47.32GiB(49.81%)  tps: 7,186  mfu: 42.08%

```
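
To make the comparison easier to read, the throughput and memory numbers
from the log lines above can be turned into changes relative to the bf16
baseline. This is a small sketch. The `runs` dictionary and the
`relative_change` helper are names introduced here for illustration. Note
that the rowwise line was logged at step 40 while the other two were logged
at step 20, so only the tps and memory columns are compared, not the loss.

```python
# Numbers copied from the log lines above; the structure and names are illustrative.
runs = {
    "bf16 baseline": {"tps": 5760, "memory_gib": 47.65},
    "rowwise float8": {"tps": 6412, "memory_gib": 66.81},
    "tensorwise float8 + float8 all-gather": {"tps": 7186, "memory_gib": 47.32},
}
baseline = runs["bf16 baseline"]


def relative_change(value, reference):
    """Percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


for name, run in runs.items():
    tps_delta = relative_change(run["tps"], baseline["tps"])
    mem_delta = relative_change(run["memory_gib"], baseline["memory_gib"])
    print(f"{name}: tps {tps_delta:+.1f}%, memory {mem_delta:+.1f}%")
```

On these numbers, rowwise float8 is roughly 11% faster than the baseline but
uses roughly 40% more memory, while the tensorwise configuration is roughly
25% faster at about the same memory.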

diff --git a/torchtitan/float8.py b/torchtitan/float8.py
@@ -42,6 +42,7 @@ def __init__(self, job_config: JobConfig, parallel_dims: ParallelDims):
             return
         try:
             from torchao.float8 import CastConfig, Float8LinearConfig, ScalingType
+            from torchao.float8.config import Float8LinearRecipeName, recipe_name_to_linear_config
         except ImportError as e:
             raise ImportError(
                 "torchao is not installed. Please install it to use float8 linear layers."
@@ -55,13 +56,22 @@ def __init__(self, job_config: JobConfig, parallel_dims: ParallelDims):
         scaling_type_input = ScalingType(float8_config.scaling_type_input)
         scaling_type_weight = ScalingType(float8_config.scaling_type_weight)
         scaling_type_grad_output = ScalingType(float8_config.scaling_type_grad_output)
+        # Note: this is overridden below
         self.config = Float8LinearConfig(
             enable_fsdp_float8_all_gather=enable_fsdp_float8_all_gather,
             cast_config_input=CastConfig(scaling_type=scaling_type_input),
             cast_config_weight=CastConfig(scaling_type=scaling_type_weight),
             cast_config_grad_output=CastConfig(scaling_type=scaling_type_grad_output),
+            # force_recompute_fp8_weight_in_bwd=True,
         )
 
+        # Note: the recipe lookup by name is currently a private API, we'll need
+        # to expose it publicly in torchao before a PR similar to this one can be
+        # landed in torchtitan
+        recipe = "all_axiswise"
+        recipe = Float8LinearRecipeName(recipe)
+        self.config = recipe_name_to_linear_config(recipe)
+
         self.enabled = True
 
         # for precompute_float8_dynamic_scale_for_fsdp
diff --git a/train.py b/train.py
@@ -114,6 +114,7 @@ def main(job_config: JobConfig):
     float8_handler = Float8Handler(job_config, parallel_dims)
     # swap to Float8Linear based on float8 configs
     float8_handler.convert_to_float8_training(model)
+    print(model)
 
     # log model size
     model_param_count = utils.get_num_params(model)