
Commit ab08612

[Flux] Enable FSDP for flux model training (#1074)
## Context
Enable FSDP for Flux model training.

## Test
An ablation study using the Flux-dev model (with FSDP enabled) on 8 H100 GPUs:

| Test id | Full AC? | Shard T5? | T5 on/off load? | Batch size | Result |
| -- | -- | -- | -- | -- | -- |
| 1 | Yes | No | Yes | 4 * 8 = 32 | ✅ |
| 2 | Yes | No | Yes | 8 * 8 = 64 | ❌ OOM |
| 3 | Yes | Yes | Yes | 4 * 8 = 32 | ✅ GPU memory 78.35GiB (82.48%) |
| 4 | Yes | Yes | No (T5 always on GPU) | 4 * 8 = 32 | ✅ GPU memory 78.41GiB (82.54%). See profiler analysis below. |
| 5 | Yes | Yes | No (T5 always on GPU) | 8 * 8 = 64 | ❌ OOM |
| 6 | Yes | No | No (T5 always on GPU) | 4 * 8 = 32 | ✅ GPU memory 84.98GiB (89.46%). See profiler analysis below. |
| 7 | Yes | No | No (T5 always on GPU) | 8 * 8 = 64 | ❌ OOM |

- T5 encoder on/off loading saves a small amount of GPU memory, but may cost extra time moving the encoder between GPU and CPU. **Thus we don't recommend enabling on/off loading for the T5 model.**
- For end-to-end training, if a user neither shards T5 with FSDP nor uses T5 on/off loading, the maximum batch size is 32.

Profiler observation of test No. 4 **[Recommended]**:
<img width="1738" alt="Screenshot 2025-04-09 at 1 51 10 PM" src="https://github.com/user-attachments/assets/3b836d7b-089d-4c41-9069-755b6c3ee0bf" />

Profiler observation of test No. 6:
<img width="1738" alt="Screenshot 2025-04-09 at 1 50 57 PM" src="https://github.com/user-attachments/assets/1a3fbde5-33b7-444d-878b-74d789ea53d8" />

- From the profiling comparison above, sharding T5 with FSDP did not hurt throughput (no bubbles in computation) and saves GPU memory. **So we recommend enabling FSDP sharding for T5 by default.**
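For reference, the "T5 on/off load" column above means moving the T5 encoder between CPU and GPU around each encoding step. A minimal sketch of the idea, with a hypothetical helper name and no FSDP involved (torchtitan's actual implementation differs):

```python
import torch
import torch.nn as nn


def encode_with_offload(t5_encoder: nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of T5 on/off loading: the encoder lives on CPU
    and visits the GPU only for the forward pass."""
    t5_encoder.to("cuda")  # on-load before encoding
    with torch.no_grad():
        out = t5_encoder(tokens.to("cuda"))
    t5_encoder.to("cpu")  # off-load afterwards to free GPU memory
    return out
```

As the ablation shows, the memory saved this way is small relative to the transfer cost, which is why sharding T5 with FSDP is the recommended default.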
1 parent 981552e commit ab08612

File tree

14 files changed: +339 −66 lines
torchtitan/experiments/flux/README.md

Lines changed: 10 additions & 3 deletions
@@ -1,23 +1,30 @@
 # FLUX model in torchtitan
 
 ## Overview
+This directory contains the implementation of the [FLUX](https://github.com/black-forest-labs/flux/tree/main) model in torchtitan. In torchtitan, we showcase the pre-training process of the text-to-image part of the FLUX model.
 
 ## Usage
 First, download the autoencoder model from HuggingFace with your own access token:
 ```bash
 python torchtitan/experiments/flux/scripts/download_autoencoder.py --repo_id black-forest-labs/FLUX.1-dev --ae_path ae.safetensors --hf_token <your_access_token>
 ```
+
 This step will download the autoencoder model from HuggingFace and save it to the `torchtitan/experiments/flux/assets/autoencoder/ae.safetensors` file.
 
 Run the following command to train the model on a single GPU:
 ```bash
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --nproc_per_node=1 torchtitan/experiments/flux/train.py --job.config_file torchtitan/experiments/flux/train_configs/debug_model.toml
+./torchtitan/experiments/flux/run_train.sh
+
 ```
 
+## Supported Features
+- Parallelism: The model supports FSDP and HSDP for training on multiple GPUs.
+- Activation checkpointing: The model uses activation checkpointing to reduce memory usage during training.
+
+
 ## TODO
-- [ ] Supporting for multiple GPUs is comming soon (FSDP, etc)
-- [ ] Implement test cases in CI for FLUX model. Adding more unit tests for FLUX model (eg, unit test for preprocessor, etc)
 - [ ] More parallelism support (Tensor Parallelism, Context Parallelism, etc)
 - [ ] Support for distributed checkpointing and loading
 - [ ] Implement init_weights() function to initialize the model weights
 - [ ] Implement the num_flops_per_token calculation in get_nparams_and_flops() function
+- [ ] Implement test cases in CI for the FLUX model, and add more unit tests (e.g., for the preprocessor)

torchtitan/experiments/flux/__init__.py

Lines changed: 5 additions & 4 deletions
@@ -6,6 +6,7 @@
 #
 # Copyright (c) Meta Platforms, Inc. All Rights Reserved.
 
+
 from torchtitan.components.lr_scheduler import build_lr_schedulers
 from torchtitan.components.optimizer import build_optimizers
 from torchtitan.experiments.flux.dataset.flux_dataset import build_flux_dataloader
@@ -29,7 +30,7 @@
     in_channels=64,
     out_channels=64,
     vec_in_dim=768,
-    context_in_dim=512,
+    context_in_dim=4096,
     hidden_size=3072,
     mlp_ratio=4.0,
     num_heads=24,
@@ -81,10 +82,10 @@
     in_channels=64,
     out_channels=64,
     vec_in_dim=768,
-    context_in_dim=512,
-    hidden_size=512,
+    context_in_dim=4096,
+    hidden_size=3072,
     mlp_ratio=4.0,
-    num_heads=4,
+    num_heads=24,
     depth=2,
     depth_single_blocks=2,
     axes_dim=(16, 56, 56),
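The `context_in_dim` bump from 512 to 4096 matches the hidden size of the T5-XXL text encoder whose output feeds the model's `txt_in` projection. A hedged sanity check, assuming the `google/t5-v1_1-xxl` checkpoint commonly used with FLUX:

```python
from transformers import AutoConfig

# Assumption: the T5 encoder is google/t5-v1_1-xxl; its d_model (4096)
# must equal context_in_dim so the txt_in Linear accepts its embeddings.
t5_cfg = AutoConfig.from_pretrained("google/t5-v1_1-xxl")
assert t5_cfg.d_model == 4096
```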

torchtitan/experiments/flux/dataset/flux_dataset.py

Lines changed: 2 additions & 2 deletions
@@ -56,8 +56,8 @@ def _process_cc12m_image(
 
     assert resized_img.size[0] == resized_img.size[1] == output_size
 
-    # Skip grayscale images
-    if resized_img.mode == "L":
+    # Skip grayscale, RGBA, and CMYK images
+    if resized_img.mode != "RGB":
         return None
 
     np_img = np.array(resized_img).transpose((2, 0, 1))
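The stricter mode check matters because the transpose below assumes a 3-channel HWC array. A small standalone illustration (not the dataset code):

```python
import numpy as np
from PIL import Image

# Grayscale ("L") images produce 2-D arrays, and RGBA/CMYK produce 4
# channels, so only "RGB" survives the (2, 0, 1) HWC -> CHW transpose
# with the expected (3, H, W) shape.
for mode in ("RGB", "L", "RGBA", "CMYK"):
    arr = np.array(Image.new(mode, (8, 8)))
    print(mode, arr.shape)  # RGB -> (8, 8, 3); L -> (8, 8); RGBA/CMYK -> (8, 8, 4)
```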

torchtitan/experiments/flux/flux_argparser.py

Lines changed: 5 additions & 0 deletions
@@ -40,3 +40,8 @@ def extend_parser(parser: argparse.ArgumentParser) -> None:
         default=512,
         help="Maximum length of the T5 encoding.",
     )
+    parser.add_argument(
+        "--encoder.offload_encoder",
+        action="store_true",
+        help="Whether to offload the text encoders to CPU when not in use",
+    )
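A note on the torchtitan-style dotted option name: argparse keeps the dot in the destination attribute, so the value is read back with `getattr` rather than plain attribute access. A standalone sketch (not torchtitan's config code):

```python
import argparse

# Dotted long options land in the Namespace under the literal name
# "encoder.offload_encoder", which is not a valid Python identifier.
parser = argparse.ArgumentParser()
parser.add_argument("--encoder.offload_encoder", action="store_true")
args = parser.parse_args(["--encoder.offload_encoder"])
print(getattr(args, "encoder.offload_encoder"))  # True
```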

torchtitan/experiments/flux/model/layers.py

Lines changed: 6 additions & 5 deletions
@@ -43,11 +43,12 @@ def timestep_embedding(t: Tensor, dim, max_period=10000, time_factor: float = 1000.0):
     """
     t = time_factor * t
     half = dim // 2
-    freqs = torch.exp(
-        -math.log(max_period)
-        * torch.arange(start=0, end=half, dtype=torch.float32)
-        / half
-    ).to(t.device)
+    with torch.device(t.device):
+        freqs = torch.exp(
+            -math.log(max_period)
+            * torch.arange(start=0, end=half, dtype=torch.float32)
+            / half
+        )
 
     args = t[:, None].float() * freqs[None]
     embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
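The change above swaps a CPU allocation followed by a `.to(t.device)` copy for direct allocation on the target device: inside a `torch.device` context, factory functions such as `torch.arange` create their output on that device. A minimal standalone sketch:

```python
import torch

# Inside a torch.device context, factory calls like torch.arange
# allocate directly on the chosen device, skipping the CPU-then-copy
# round trip of `.to(device)`.
device = "cuda" if torch.cuda.is_available() else "cpu"
with torch.device(device):
    freqs = torch.arange(8, dtype=torch.float32)
print(freqs.device)  # matches `device`
```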

torchtitan/experiments/flux/model/model.py

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ def __init__(self, model_args: FluxModelArgs):
         super().__init__()
 
         self.model_args = model_args
+
         self.in_channels = model_args.in_channels
         self.out_channels = model_args.out_channels
         if model_args.hidden_size % model_args.num_heads != 0:

torchtitan/experiments/flux/parallelize_flux.py

Lines changed: 136 additions & 5 deletions
@@ -4,16 +4,19 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
 
-# This file applies the PT-D parallelisms (except pipeline parallelism) and various
-# training techniques (e.g. activation checkpointing and compile) to the Llama model.
-
 
+import torch
 import torch.nn as nn
+from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
+    checkpoint_wrapper as ptd_checkpoint_wrapper,
+)
 
 from torch.distributed.device_mesh import DeviceMesh
+from torch.distributed.fsdp import CPUOffloadPolicy, fully_shard, MixedPrecisionPolicy
 
-from torchtitan.config_manager import JobConfig
+from torchtitan.config_manager import JobConfig, TORCH_DTYPE_MAP
 from torchtitan.distributed import ParallelDims
+from torchtitan.tools.logging import logger
 
 
 def parallelize_flux(
@@ -22,5 +25,133 @@ def parallelize_flux(
     parallel_dims: ParallelDims,
     job_config: JobConfig,
 ):
-    # TODO: Add model parallel strategy here
+    if job_config.activation_checkpoint.mode != "none":
+        apply_ac(model, job_config.activation_checkpoint)
+
+    if (
+        parallel_dims.dp_shard_enabled or parallel_dims.dp_replicate_enabled
+    ):  # apply FSDP or HSDP
+        if parallel_dims.dp_replicate_enabled:
+            dp_mesh_dim_names = ("dp_replicate", "dp_shard_cp")
+        else:
+            dp_mesh_dim_names = ("dp_shard_cp",)
+
+        apply_fsdp(
+            model,
+            world_mesh[tuple(dp_mesh_dim_names)],
+            param_dtype=TORCH_DTYPE_MAP[job_config.training.mixed_precision_param],
+            reduce_dtype=TORCH_DTYPE_MAP[job_config.training.mixed_precision_reduce],
+            cpu_offload=job_config.training.enable_cpu_offload,
+        )
+
+        if parallel_dims.dp_replicate_enabled:
+            logger.info("Applied HSDP to the model")
+        else:
+            logger.info("Applied FSDP to the model")
+
     return model
+
+
+def apply_fsdp(
+    model: nn.Module,
+    dp_mesh: DeviceMesh,
+    param_dtype: torch.dtype,
+    reduce_dtype: torch.dtype,
+    cpu_offload: bool = False,
+):
+    """
+    Apply data parallelism (via FSDP2) to the model.
+
+    Args:
+        model (nn.Module): The model to apply data parallelism to.
+        dp_mesh (DeviceMesh): The device mesh to use for data parallelism.
+        param_dtype (torch.dtype): The data type to use for model parameters.
+        reduce_dtype (torch.dtype): The data type to use for reduction operations.
+        cpu_offload (bool): Whether to offload model parameters to CPU. Defaults to False.
+    """
+    mp_policy = MixedPrecisionPolicy(param_dtype=param_dtype, reduce_dtype=reduce_dtype)
+    fsdp_config = {"mesh": dp_mesh, "mp_policy": mp_policy}
+    if cpu_offload:
+        fsdp_config["offload_policy"] = CPUOffloadPolicy()
+
+    linear_layers = [
+        model.img_in,
+        model.time_in,
+        model.guidance_in,
+        model.vector_in,
+        model.txt_in,
+    ]
+    for layer in linear_layers:
+        fully_shard(layer, **fsdp_config)
+
+    for block in model.double_blocks:
+        fully_shard(
+            block,
+            **fsdp_config,
+        )
+
+    for block in model.single_blocks:
+        fully_shard(
+            block,
+            **fsdp_config,
+        )
+    # apply FSDP to the last layer
+    fully_shard(model.final_layer, **fsdp_config)
+    # wrap the rest of the model
+    fully_shard(model, **fsdp_config)
+
+
+def apply_ac(model: nn.Module, ac_config):
+    """Apply activation checkpointing to the model."""
+
+    for layer_id, block in model.double_blocks.named_children():
+        block = ptd_checkpoint_wrapper(block, preserve_rng_state=False)
+        model.double_blocks.register_module(layer_id, block)
+
+    for layer_id, block in model.single_blocks.named_children():
+        block = ptd_checkpoint_wrapper(block, preserve_rng_state=False)
+        model.single_blocks.register_module(layer_id, block)
+
+    logger.info(f"Applied {ac_config.mode} activation checkpointing to the model")
+
+
+def parallelize_encoders(
+    t5_model: nn.Module,
+    clip_model: nn.Module,
+    world_mesh: DeviceMesh,
+    parallel_dims: ParallelDims,
+    job_config: JobConfig,
+):
+    if (
+        parallel_dims.dp_shard_enabled or parallel_dims.dp_replicate_enabled
+    ):  # apply FSDP or HSDP
+        if parallel_dims.dp_replicate_enabled:
+            dp_mesh_dim_names = ("dp_replicate", "dp_shard_cp")
+        else:
+            dp_mesh_dim_names = ("dp_shard_cp",)
+
+        mp_policy = MixedPrecisionPolicy(
+            param_dtype=TORCH_DTYPE_MAP[job_config.training.mixed_precision_param],
+            reduce_dtype=TORCH_DTYPE_MAP[job_config.training.mixed_precision_reduce],
+        )
+        fsdp_config = {
+            "mesh": world_mesh[tuple(dp_mesh_dim_names)],
+            "mp_policy": mp_policy,
+        }
+        if job_config.training.enable_cpu_offload:
+            fsdp_config["offload_policy"] = CPUOffloadPolicy()
+        # FSDP for encoder blocks
+        for block in clip_model.hf_module.text_model.encoder.layers:
+            fully_shard(block, **fsdp_config)
+        fully_shard(clip_model, **fsdp_config)
+
+        for block in t5_model.hf_module.encoder.block:
+            fully_shard(block, **fsdp_config)
+        fully_shard(t5_model.hf_module, **fsdp_config)
+
+        if parallel_dims.dp_replicate_enabled:
+            logger.info("Applied HSDP to the T5 and CLIP models")
+        else:
+            logger.info("Applied FSDP to the T5 and CLIP models")
+
+    return t5_model, clip_model
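The sharding order in `apply_fsdp` follows the usual FSDP2 pattern: wrap the inner blocks first so each becomes its own parameter/communication group, then wrap the root module to catch the remaining parameters. A toy sketch of the same pattern (assumes `torch.distributed` is already initialized, e.g. under torchrun; not the Flux model code):

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard


def shard_toy_model() -> nn.Module:
    # Toy stand-in for the Flux blocks; requires an initialized default
    # process group (e.g. launched via torchrun), or fully_shard will fail.
    model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])
    for layer in model:
        fully_shard(layer)  # each block gets its own FSDP param group
    fully_shard(model)  # root wrap covers any remaining parameters
    return model
```

Per-block wrapping is what lets FSDP overlap the all-gather of the next block's parameters with the current block's compute.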
Lines changed: 2 additions & 1 deletion
@@ -1,2 +1,3 @@
-transformers
+transformers>=4.51.1
 einops
+sentencepiece
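The `sentencepiece` dependency is needed by the T5 tokenizer in `transformers`; without it, loading the T5 text encoder's tokenizer fails.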
torchtitan/experiments/flux/run_train.sh

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+#!/usr/bin/bash
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+set -ex
+
+# use envs as local overrides for convenience
+# e.g.
+# LOG_RANK=0,1 NGPU=4 ./torchtitan/experiments/flux/run_train.sh
+NGPU=${NGPU:-"8"}
+export LOG_RANK=${LOG_RANK:-0}
+CONFIG_FILE=${CONFIG_FILE:-"./torchtitan/experiments/flux/train_configs/debug_model.toml"}
+
+overrides=""
+if [ $# -ne 0 ]; then
+    overrides="$*"
+fi
+
+
+PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
+torchrun --nproc_per_node=${NGPU} --rdzv_backend c10d --rdzv_endpoint="localhost:0" \
+    --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
+    -m torchtitan.experiments.flux.train --job.config_file ${CONFIG_FILE} $overrides

torchtitan/experiments/flux/train.py

Lines changed: 17 additions & 7 deletions
@@ -14,6 +14,7 @@
 from torchtitan.experiments.flux.model.autoencoder import load_ae
 from torchtitan.experiments.flux.model.hf_embedder import FluxEmbedder
 from torchtitan.experiments.flux.model.model import FluxModel
+from torchtitan.experiments.flux.parallelize_flux import parallelize_encoders
 from torchtitan.experiments.flux.utils import (
     create_position_encoding_for_latents,
     pack_latents,
@@ -29,24 +30,36 @@ def __init__(self, job_config: JobConfig):
         super().__init__(job_config)
 
         self.preprocess_fn = preprocess_flux_data
-        # self.dtype = job_config.encoder.dtype
+        # NOTE: self._dtype is the data type used for the encoders (image encoder, T5 text encoder, CLIP text encoder).
+        # We cast the encoders and their inputs/outputs to this dtype.
+        # For the Flux model itself, we use FSDP with mixed precision training.
         self._dtype = torch.bfloat16
         self._seed = job_config.training.seed
        self._guidance = job_config.training.guidance
 
         # load components
         model_config = self.train_spec.config[job_config.model.flavor]
+
         self.autoencoder = load_ae(
             job_config.encoder.auto_encoder_path,
             model_config.autoencoder_params,
-            device="cpu",
+            device=self.device,
             dtype=self._dtype,
         )
         self.clip_encoder = FluxEmbedder(version=job_config.encoder.clip_encoder).to(
-            dtype=self._dtype
+            device=self.device, dtype=self._dtype
         )
         self.t5_encoder = FluxEmbedder(version=job_config.encoder.t5_encoder).to(
-            dtype=self._dtype
+            device=self.device, dtype=self._dtype
+        )
+
+        # Apply FSDP to the T5 and CLIP models
+        self.t5_encoder, self.clip_encoder = parallelize_encoders(
+            t5_model=self.t5_encoder,
+            clip_model=self.clip_encoder,
+            world_mesh=self.world_mesh,
+            parallel_dims=self.parallel_dims,
+            job_config=job_config,
         )
 
     def _predict_noise(
@@ -120,7 +133,6 @@ def train_step(self, input_dict: dict[str, torch.Tensor], labels: torch.Tensor):
             clip_encoder=self.clip_encoder,
             t5_encoder=self.t5_encoder,
             batch=input_dict,
-            offload=True,
         )
         labels = input_dict["img_encodings"]
 
@@ -148,8 +160,6 @@ def train_step(self, input_dict: dict[str, torch.Tensor], labels: torch.Tensor):
         target = noise - labels
 
         assert len(model_parts) == 1
-        # TODO(jianiw): model_parts will be wrapped by FSDP, which will cacluate
-        model_parts[0] = model_parts[0].to(dtype=self._dtype)
 
         pred = self._predict_noise(
             model_parts[0],
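The deleted manual cast (`model_parts[0].to(dtype=self._dtype)`) is no longer needed because FSDP2's mixed precision policy handles dtype conversion: parameters stay sharded in fp32 and are cast to the compute dtype for each forward/backward. A minimal sketch of that policy (assumes an initialized process group; not the trainer's exact code):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy


def shard_with_bf16_compute(model: nn.Module) -> nn.Module:
    # Params are stored in fp32; FSDP casts them to bf16 for compute and
    # reduces gradients in fp32, so no manual model.to(torch.bfloat16).
    policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16, reduce_dtype=torch.float32
    )
    fully_shard(model, mp_policy=policy)
    return model
```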
