
FLUX with SP: image differences under parallel generation #262

Open
lixiang007666 opened this issue Sep 11, 2024 · 7 comments

@lixiang007666
Contributor

Problem description

I tested with a fixed seed. To confirm the seed was really fixed, I first re-ran the multi-GPU script repeatedly and verified the image was identical on every run.

Under that condition, the images generated with different GPU counts:

flux_result_dp1_cfg1_ulysses1_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses2_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses4_ringNone_tp1_pp1_patchNone_0
flux_result_dp1_cfg1_ulysses8_ringNone_tp1_pp1_patchNone_0

You can see the output is lossy at 1024, though with two GPUs the loss is fairly small (it also depends on the seed).

One observation: at 512 the loss is larger.
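To put a number on "lossy", a small helper can compare two of the saved images pixel by pixel. This is a hedged sketch, not part of the original repro: the tiny arrays stand in for real loaded PNGs, and the PIL loading call mentioned in the comment is how one would plug in the actual files.

```python
import numpy as np

# Hedged sketch: quantify the difference between two generated images as
# mean / max absolute pixel error. Real usage would load the saved PNGs,
# e.g. np.asarray(Image.open(path)); small arrays keep this self-contained.
def pixel_diff(a: np.ndarray, b: np.ndarray):
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    d = np.abs(a - b)
    return float(d.mean()), float(d.max())

base = np.zeros((4, 4, 3), dtype=np.uint8)  # stand-in for the 1-GPU image
other = base.copy()                         # stand-in for the 8-GPU image
other[0, 0, 0] = 8                          # one channel differs by 8/255
mae, mx = pixel_diff(base, other)
print(mae, mx)  # mae = 8/48 ≈ 0.167, max = 8.0
```

Reporting both mean and max error separates "a faint global drift" from "a few badly wrong pixels", which matters when comparing parallel degrees.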

Reproduction script:

set -x

# export NCCL_PXN_DISABLE=1
# # export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_DISABLE=0
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_IB_QPS_PER_CONNECTION=4
# export NCCL_IB_TC=160
# export NCCL_IB_TIMEOUT=22
# export NCCL_P2P=0
# export CUDA_DEVICE_MAX_CONNECTIONS=1

export PYTHONPATH=$PWD:$PYTHONPATH

# Select the model type
# The model is downloaded to a specified location on disk,
# or you can simply use the model's ID on Hugging Face,
# which will then be downloaded to the default Hugging Face cache path.

export MODEL_TYPE="Flux"
# Configuration for different model types
# script, model_id, inference_step
declare -A MODEL_CONFIGS=(
    ["Pixart-alpha"]="pixartalpha_example.py /mnt/models/SD/PixArt-XL-2-1024-MS 20"
    ["Pixart-sigma"]="pixartsigma_example.py /cfs/dit/PixArt-Sigma-XL-2-2K-MS 20"
    ["Sd3"]="sd3_example.py /cfs/dit/stable-diffusion-3-medium-diffusers 20"
    ["Flux"]="flux_example.py black-forest-labs/FLUX.1-dev 20"
    ["HunyuanDiT"]="hunyuandit_example.py /mnt/models/SD/HunyuanDiT-v1.2-Diffusers 50"
    ["CogVideoX"]="cogvideox_example.py /cfs/dit/CogVideoX-2b 1"
)

if [[ -v MODEL_CONFIGS[$MODEL_TYPE] ]]; then
    IFS=' ' read -r SCRIPT MODEL_ID INFERENCE_STEP <<< "${MODEL_CONFIGS[$MODEL_TYPE]}"
    export SCRIPT MODEL_ID INFERENCE_STEP
else
    echo "Invalid MODEL_TYPE: $MODEL_TYPE"
    exit 1
fi

mkdir -p ./results

for HEIGHT in 1024
do
for N_GPUS in 8;
do 


# task args
if [ "$MODEL_TYPE" = "CogVideoX" ]; then
  TASK_ARGS="--height 480 --width 720 --num_frames 9"
else
  TASK_ARGS="--height $HEIGHT --width $HEIGHT --no_use_resolution_binning"
fi

# Flux only supports SP, do not set the pipefusion degree
if [ "$MODEL_TYPE" = "Flux" ] || [ "$MODEL_TYPE" = "CogVideoX" ]; then
PARALLEL_ARGS="--ulysses_degree $N_GPUS"
export CFG_ARGS=""
elif [ "$MODEL_TYPE" = "HunyuanDiT" ]; then
# HunyuanDiT asserts sp_degree <= 2; otherwise the output will be incorrect.
PARALLEL_ARGS="--pipefusion_parallel_degree 1 --ulysses_degree 2 --ring_degree 1"
export CFG_ARGS="--use_cfg_parallel"
else
# On 8 gpus, pp=2, ulysses=2, ring=1, cfg_parallel=2 (split batch)
PARALLEL_ARGS="--pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1"
export CFG_ARGS="--use_cfg_parallel"
fi


# By default, num_pipeline_patch = pipefusion_degree, and you can tune this parameter to achieve optimal performance.
# PIPEFUSION_ARGS="--num_pipeline_patch 8 "

# For high-resolution images, we use the latent output type to avoid running the VAE module. Used for measuring speed.
# OUTPUT_ARGS="--output_type latent"

# PARALLEL_VAE="--use_parallel_vae"

# Another compile option is `--use_onediff` which will use onediff's compiler.
# COMPILE_FLAG="--use_torch_compile"

torchrun --nproc_per_node=$N_GPUS ./examples/$SCRIPT \
--model $MODEL_ID \
$PARALLEL_ARGS \
$TASK_ARGS \
$PIPEFUSION_ARGS \
$OUTPUT_ARGS \
--num_inference_steps $INFERENCE_STEP \
--seed 1 \
--warmup_steps 0 \
--prompt "a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres." \
$CFG_ARGS \
$PARALLEL_VAE \
$COMPILE_FLAG
# The seed is handled via manual_seed in flux_example.py.

done
done

@lixiang007666
Contributor Author

lixiang007666 commented Sep 11, 2024

From @Eigensystem:
The discrepancy may be an error introduced by kernel selection.

@feifeibear
Collaborator

From @Eigensystem: the discrepancy may be an error introduced by kernel selection.

cuDNN automatically selects the optimal algorithm based on the input shapes and dtypes. Could different parallel degrees cause different kernels to be used, and hence differences in the generated images?

I think we could:

  1. Make sure the cuDNN algorithm chosen on each run is deterministic. (May not help.)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

  2. Compare the results of different parallel degrees on CPU. This may require running xDiT with the gloo backend.
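Point 1 can be sketched as a quick self-check. This is a hedged illustration, not xDiT code: the tiny Conv2d stands in for a real model layer, and on CPU the run is deterministic anyway; the cuDNN flags matter on GPU.

```python
import torch

# Settings suggested in the thread: pin cuDNN to deterministic algorithm
# selection. Note this only removes kernel-choice nondeterminism within
# one process; it does not make different parallel degrees agree.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def run_once(seed: int) -> torch.Tensor:
    # Tiny stand-in forward pass; convolutions are where cuDNN
    # algorithm selection matters most.
    torch.manual_seed(seed)
    conv = torch.nn.Conv2d(3, 4, kernel_size=3)
    x = torch.randn(1, 3, 8, 8)
    return conv(x)

print(torch.equal(run_once(1), run_once(1)))  # True: same seed, same result
```

If two runs with the same seed and these flags still differ on GPU, the nondeterminism is coming from somewhere other than cuDNN kernel choice.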

@fengchuanBIG

So FLUX can't be used perfectly right now, and generated images will have flaws? Has the OP found a solution?

@feifeibear
Collaborator

So FLUX can't be used perfectly right now, and generated images will have flaws? Has the OP found a solution?

We looked into this; it isn't really a flaw in the generated image. Even for a single attention operator, the parallel and non-parallel results differ. A parallel computation performs its additions and multiplications in a different order, so the numbers cannot be exactly identical. It is therefore normal that images generated by FLUX with USP parallelism are not bitwise equivalent to the single-GPU output. From the images we have inspected, the parallel results are no worse than the original; both are correct.
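The reduction-order point can be seen even in plain Python floats. A hedged illustration, not xDiT code:

```python
# Floating-point addition is not associative, so changing the reduction
# order (as sequence parallelism does inside attention) perturbs the
# low-order bits of the result.
vals = [0.1] * 10

# Serial, left-to-right sum (the "single GPU" order).
serial = 0.0
for v in vals:
    serial += v

# Simulated 2-way parallel reduction: each "rank" sums its half first,
# then the partial sums are combined.
parallel = sum(vals[:5]) + sum(vals[5:])

print(serial == parallel)      # False: the two orders disagree
print(abs(serial - parallel))  # on the order of 1e-16
```

Scaled up to millions of accumulations inside attention, these last-bit differences grow into the visible pixel-level differences shown above, without either result being "wrong".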

@fengchuanBIG

fengchuanBIG commented Nov 5, 2024

Thanks for the reply. But it seems using LoRA models together with this isn't supported yet; can that be addressed? Without LoRA support I still can't use it.

@feifeibear
Collaborator

Thanks for the reply. But it seems using LoRA models together with this isn't supported yet; can that be addressed? Without LoRA support I still can't use it.

This is easy to support. We found that most users use LoRA through ComfyUI; see our ComfyUI demo, which has supported LoRA for a long time.

@fengchuanBIG

Thanks for the reply. But it seems using LoRA models together with this isn't supported yet; can that be addressed? Without LoRA support I still can't use it.

This is easy to support. We found that most users use LoRA through ComfyUI; see our ComfyUI demo, which has supported LoRA for a long time.

OK, thanks for the reply. I'll go try it right away.
