Raylight

Raylight uses Ray workers to manage a multi-GPU sampler setup, with XDiT (XFuser) and FSDP to implement parallelism.

"Why buy 5090 when you can buy 2x5070s"-Komikndr

UPDATE

  • GGUF support added thanks to City96; USP mode only, not FSDP.
  • Reworked the entire FSDP loader. Model loading should now be more stable and faster, as Raylight no longer kills active workers to reset the model state. Previously, this was necessary because Comfy could not remove FSDP models from VRAM, which caused memory leaks.
  • No need to install FlashAttn.
  • SageAttn is now supported.
  • Full FSDP support for Wan, Qwen, Flux, and Hunyuan Video.
  • Qwen USP cannot handle square dimensions (1280x1280 is the only square resolution that works), so pick dimensions that are not square.
  • Full LoRA support.
  • FSDP CPU offload, analogous to block swap/DisTorch.

What exactly is Raylight

Raylight is a parallelism node for ComfyUI, where the tensor of an image or video sequence is split among GPU ranks. Raylight, as its partial namesake, uses Ray to manage its GPU workers.
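As an illustration of the pattern only (these class and method names are hypothetical, not Raylight's actual code), a Ray actor pinned to each GPU looks roughly like this:

```python
# Minimal sketch of the Ray worker pattern (hypothetical, not Raylight's
# actual classes): one actor is reserved per GPU and can run work there.
import ray

ray.init()

@ray.remote(num_gpus=1)  # Ray reserves one GPU for each actor
class Worker:
    def device_name(self) -> str:
        import torch
        return torch.cuda.get_device_name(torch.cuda.current_device())

# Spawn one worker per GPU and ask each which device it landed on.
workers = [Worker.remote() for _ in range(2)]
print(ray.get([w.device_name.remote() for w in workers]))
```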


So how does it split work among the ranks? It uses Unified Sequence Parallelism (USP), provided by XDiT, a core library of Raylight, which splits and all-gathers tensors among GPU ranks.
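Conceptually, the split-and-allgather step looks like the toy sketch below. This is a stand-in for the idea, not xDiT's implementation; it runs on CPU with the gloo backend so it can be tried anywhere:

```python
# Toy sequence-parallel sketch: each rank processes one chunk of the
# sequence dimension, then all ranks all-gather the results.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Toy "latent": (batch, sequence, hidden). Same seed on every rank,
    # so the full tensor is identical everywhere before splitting.
    torch.manual_seed(0)
    latent = torch.randn(1, 8, 4)

    # Split along the sequence dimension; each rank keeps its shard.
    shard = latent.chunk(world_size, dim=1)[rank]

    # Stand-in for the attention/FFN work done on the local shard.
    shard = shard * 2.0

    # All-gather the processed shards and reassemble the full sequence.
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)
    full = torch.cat(gathered, dim=1)

    if rank == 0:
        print("reassembled sequence shape:", full.shape)  # (1, 8, 4)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```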

Unfortunately, although USP splits the work across GPUs, each GPU must still load the full model weights. And let's be honest, most of us do not have a 4090 or 5090; in my opinion, buying a second 4070 is monetarily less painful than buying a 5090. This is where FSDP comes in: its job is to split the model weights among GPUs.
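A minimal sketch of the underlying PyTorch mechanism (illustrative, not Raylight's loader): wrapping a module in FSDP shards its parameters across ranks, assuming a process group is already initialized as in the sketch above.

```python
# Sketch of weight sharding with PyTorch FSDP (illustrative, not
# Raylight's loader). Assumes torch.distributed is already initialized.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_model() -> FSDP:
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.GELU(),
        nn.Linear(4096, 1024),
    )
    # Each rank now stores only a shard of the parameters; full weights
    # are gathered layer by layer on the fly during forward/backward.
    return FSDP(model)
```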

TLDR: Raylight provides multi-GPU nodes for Comfy: USP splits the work, and FSDP splits the model weights.

Raylight vs MultiGPU vs ComfyUI Worksplit branch vs ComfyUI-Distributed

  • MultiGPU: Loads models selectively on specified GPUs without sharing the workload. Includes CPU RAM offloading, which also benefits single-GPU users.

  • ComfyUI Worksplit branch: Splits the workload at the CFG level, not at the tensor level. Since many workflows use CFG=1.0 (e.g., Wan with LoRA), this approach has limited use cases.

  • ComfyUI-Distributed: Distributes jobs among workers, running your workflow on multiple GPUs simultaneously with varied seeds. Easily connects to local, remote, or cloud workers such as RunPod.

  • Raylight: Provides both tensor splitting via sequence parallelism (USP) and model weight sharding (FSDP). All of your GPUs are used simultaneously, and in a technical sense their VRAM is combined. This enables efficient multi-GPU utilization and scales beyond single high-memory GPUs (e.g., RTX 4090/5090).

RTM and Known Issues

  • Scroll further down for the installation guide.
  • If NCCL communication fails before running (e.g., watchdog timeout), set the following environment variables:
    export NCCL_P2P_DISABLE=1
    export NCCL_SHM_DISABLE=1
  • Windows support is partially tested; switch to the dev branch to try it, and see the Windows section below for more information.
  • Non-DiT models are not supported (SDXL, SD1.5).
  • Example workflows: open the ComfyUI menu and browse the templates.
  • GPU topology is very important; not all PCIe slots on your motherboard are equal.

Operation

Mode

Sequence Parallel: This mode splits the sequence among GPUs; the full model is loaded onto each GPU. Use the XFuser KSampler and increase the Ulysses degree according to the number of your GPUs, keeping the Ring degree at 1 for small systems.


Data Parallel: The full sequence is processed independently on each GPU. Use the Data Parallel KSampler. There are two options: enable FSDP, or disable all options in Ray Init Actor (disabling them runs plain DP mode). Both FSDP and DP modes must have the Ulysses and Ring degrees set to 0.

FSDP will shard the weights, but each GPU still works independently; as the name suggests, it is Fully Sharded (weight) Data Parallel.


Sequence + FSDP: Activate FSDP and set the Ulysses degree to the number of GPUs. Use the XFuser KSampler. The degree rules for all three modes are summarized in the sketch below.
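A hypothetical validator for the degree rules above (the function and mode names here are illustrative, not Raylight's API):

```python
# Hypothetical check mirroring the mode rules above; names are
# illustrative, not Raylight's API.
def check_degrees(mode: str, n_gpus: int, ulysses: int, ring: int) -> None:
    if mode in ("sequence_parallel", "sequence_parallel_fsdp"):
        # USP: the Ulysses x Ring product must cover all GPUs.
        assert ulysses * ring == n_gpus, "Ulysses x Ring must equal GPU count"
    elif mode in ("data_parallel", "fsdp"):
        # DP and FSDP-only modes: both degrees are set to 0.
        assert ulysses == 0 and ring == 0, "DP/FSDP modes need both degrees at 0"

check_degrees("sequence_parallel", n_gpus=2, ulysses=2, ring=1)
```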


Side Notes

  • Rule of thumb: if you have enough VRAM, just use USP; if not, enable FSDP; and if that is still not enough, also enable FSDP CPU Offload.
  • FSDP CPU Offload is intended for systems with very low VRAM, though it comes with a performance hit, akin to DisTorch from MultiGPU (see the sketch below).
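For reference, PyTorch exposes this offload knob directly on its FSDP wrapper. A hedged sketch of the general mechanism, not necessarily how Raylight wires it up internally:

```python
# Sketch of FSDP CPU offload via PyTorch's own API (illustrative; not
# necessarily Raylight's exact configuration).
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

def shard_with_offload(model: nn.Module) -> FSDP:
    # offload_params=True keeps sharded parameters in CPU RAM between
    # uses, trading speed for a much smaller VRAM footprint.
    return FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```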

GPU Architectures

NVIDIA

  1. Ampere: There is an issue with NCCL broadcast and reduction in FSDP on PyTorch 2.8; please use the previous version instead. FSDP works successfully on Torch 2.7.1 CU128 on Ampere. Reference: pytorch/pytorch#162057 (comment)

  2. Turing: Not tested. Please use FlashAttn1 instead of FlashAttn2 or Torch Attn.

  3. Ada Lovelace: There is also an issue on Torch 2.8 where assigning device_id to torch.distributed.init_process_group() causes OOM. In the meantime, you may see the torch distributed module complain about device assignment, but other than that it should work fine.

  4. Blackwell: Expected to work just like Ada Lovelace.

AMD

  1. MI3XX: A user confirmed Raylight working on 8x MI300X using ROCm-compiled PyTorch and Flash Attention.

Supported Models

Wan

| Model | USP | FSDP |
| --- | --- | --- |
| Wan2.1 14B T2V | | |
| Wan2.1 14B I2V | | |
| Wan2.2 14B T2V | | |
| Wan2.2 14B I2V | | |
| Wan2.1 1.3B T2V | | |
| Wan2.2 5B TI2V | | |
| Wan2.1 Vace | | |

Flux

| Model | USP | FSDP |
| --- | --- | --- |
| Flux Dev | | |
| Flux Kontext | | |
| Flux Krea | | |
| Flux ControlNet | | |

Qwen

| Model | USP | FSDP |
| --- | --- | --- |
| Qwen Image/Edit | | |
| ControlNet | | |

Hunyuan Video

| Model | USP | FSDP |
| --- | --- | --- |
| Hunyuan Video | | |
| ControlNet | | |

Legend:

  • ✅ = Supported
  • ❌ = Not currently supported
  • ❓ = Maybe works?

Notes:

  • Non-standard Wan variants (Phantom, S2V, etc.) are not tested.

Scaled vs Non-Scaled Models

| Model | USP | FSDP |
| --- | --- | --- |
| Non-Scaled | | |
| Scaled | | ⚠️ |

Notes:

  • Scaled models use multiple dtypes inside their transformer blocks: typically FP32 for scale, FP16 for bias, and FP8 for weights.
  • Raylight FSDP can work with scaled models, but it really does not like them: FSDP shards must have a uniform dtype, and parameters that do not will not be sharded (see the sketch below).
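As an illustration of that constraint, a hypothetical uniformity check (not Raylight's code) that mixed-dtype "scaled" checkpoints would fail:

```python
# Hypothetical check for the uniform-dtype constraint described above;
# not Raylight's code. Mixed-dtype ("scaled") checkpoints fail this.
import torch.nn as nn

def has_uniform_dtype(module: nn.Module) -> bool:
    dtypes = {p.dtype for p in module.parameters()}
    return len(dtypes) <= 1
```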

Attention

| Attention Variant | Time (s) |
| --- | --- |
| sage_fp8 | 10.75 |
| sage_fp16_cuda | 11.00 |
| sage_fp16_triton | 11.17 |
| flash | 11.24 |
| torch | 11.36 |

Notes:

  • Tested on Wan 2.1 T2V 14B, 832x480, 33 frames, on 2x RTX 2000 Ada.

Wan T2V 1.3B


Wan T2V 14B on RTX 2000 Ada ≈ RTX 4060 Ti 16GB


Qwen Image 20B on RTX 2000 Ada ≈ RTX 4060 Ti 16GB, 4x playback speed-up


DEBUG Notes

Wan T2V 14B (fp8) — 1×RTX 2000 ADA 16G

Resolution: 480×832, 33 frames

| Setup | VRAM (GB) | Speed |
| --- | --- | --- |
| Normal | OOM | 22 it/s (before OOM) |

Wan T2V 14B (fp8) — 2×RTX 2000 ADA 16G

| Setup | VRAM (GB) / Device | Speed |
| --- | --- | --- |
| Ulysses | 15.8 (near OOM) | 11 it/s |
| FSDP2 | 12.8 | 19 it/s |
| Ulysses + FSDP2 | 10.25 | 12 it/s |

Notes

  • FSDP2 is now available and can do fp8 computation, but scalar tensors need to be converted into 1D tensors (see the sketch below).
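The conversion mentioned above can be sketched like this (a hypothetical helper, not Raylight's actual code):

```python
# Hypothetical sketch of the scalar -> 1D promotion mentioned above
# (0-dim tensors cannot be sharded); not Raylight's actual code.
import torch

def promote_scalars(state_dict: dict) -> dict:
    return {
        key: value.reshape(1)
        if torch.is_tensor(value) and value.ndim == 0
        else value
        for key, value in state_dict.items()
    }
```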

Installation

Manual

  1. Clone this repository under ComfyUI/custom_nodes.
  2. cd raylight
  3. Install dependencies using your ComfyUI Python environment: pip install -r requirements.txt
  4. (Optional, see UPDATE above) Install FlashAttention:
    • Option A (NOT recommended due to long build time): pip install flash-attn --no-build-isolation
    • Option B (recommended, use a prebuilt wheel), e.g. for Torch 2.7 / CUDA 12.8:
      wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu128torch2.7-cp311-cp311-linux_x86_64.whl -O flash_attn-2.8.2+cu128torch2.7-cp311-cp311-linux_x86_64.whl
      pip install flash_attn-2.8.2+cu128torch2.7-cp311-cp311-linux_x86_64.whl
      For other versions, check: https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/
  5. Restart ComfyUI.

ComfyUI Manager

  1. Search for raylight in the ComfyUI Manager and install it.

Windows

  1. Only works for PyTorch 2.7, because of pytorch/pytorch#150381

  2. POSIX and Win32 style paths can cause issues when importing raylight.

  3. Recommended steps:

    • Manually clone the Raylight repo
    • Switch to the dev branch for now
    • Inside the top-most Raylight folder (where pyproject.toml exists), run:
    ..\..\..\python_embeded\python.exe -m pip install -r .\requirements.txt
    ..\..\..\python_embeded\python.exe -m pip install -e .
  4. Highly experimental — please open an issue if you encounter errors.

  5. It is advisable to run in WSL: native Windows builds of PyTorch have no NCCL support, so Raylight falls back to GLOO, which is slower than NCCL. It may not even be worth running on Windows outside WSL (see the sketch below).
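The backend fallback amounts to something like the following single-process sketch (illustrative, not Raylight's exact logic):

```python
# Illustrative backend selection (not Raylight's exact logic): NCCL
# where available (Linux + CUDA), GLOO as the slower fallback that
# native Windows builds of PyTorch are limited to.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

backend = "nccl" if dist.is_nccl_available() else "gloo"
dist.init_process_group(backend=backend, rank=0, world_size=1)
print(f"Initialized single-process group with {backend}")
dist.destroy_process_group()
```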

Support

PayPal: Thanks for the support :) (I want to buy a 2nd GPU (5060 Ti) so I don't have to rent cloud GPUs.) RunPod
