Path to v1.2.0
Ascend NPU Support: Context parallelism, FLUX/Qwen-Image @DefTruth @gameofdimension feat: support ascend npu #651 feat: add abstract platform #653
Introduce accumulated_rel_l1_diff to reduce accumulated cache error, as used in the official TeaCache and EasyCache. feat: support step compute mask #444
Introduce a LeMiCa/EasyCache-style custom step compute mask, e.g., "111110100100000100000010000001", where 1 means full compute and 0 means dynamic/static cache (hybrid with an autotune function) @DefTruth feat: support step compute mask #444
Context Parallelism for any tokens (any resolution, any prompt tokens) @DefTruth UAA: ulysses anything attn w/ zero overhead #462
Support All Gather for any tokens (any resolution, any prompt tokens), for UAA @DefTruth feat: support unshard anything for UAA #465
Optimize the performance of UAA when using torch.compile (avoiding the graph break introduced by an if branch) feat: allow UAA in compiled graph #474
Parallelize VAE @DefTruth @tingkuanpei feat: support 🔥vae parallelism #645
Parallelize Text Encoder @gameofdimension @DefTruth feat: support TP for many text encoder #569
Manual compute and communication overlap (attention level or model level) for Ulysses and UAA, e.g., AsyncUlyssesQKVProj @tingkuanpei @DefTruth
Cache and Parallelism support for HunyuanVideo-1.5, FLUX.2, Z-Image @DefTruth @gameofdimension
Fused Per Tensor FP8 All2All via triton/cuda kernel @DefTruth @triple-mu feat: support per_token_quant_fp8 triton kernel #524
Support for any number of attention heads in Ulysses, e.g., Z-Image @DefTruth
More CIs @DefTruth
Official documentation on readthedocs.io
Performance benchmark: NVIDIA A800, L20, NPU, etc. @DefTruth docs: update nvidia gpu benchmark #684
GPU CIs: model tests ci: add basic gpu ci tests #688
mkdocs CIs: check mkdocs build --strict @DefTruth CI: add check-mkdocs ci #680
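
To illustrate how the step compute mask and accumulated_rel_l1_diff items above could interact, here is a minimal sketch in plain Python. It is not the cache-dit implementation: the names `rel_l1_diff` and `plan_steps`, the `threshold` parameter, and the use of plain float lists as per-step features are all assumptions made for illustration. The assumed rule is that a `1` in the mask forces a full compute, while a `0` reuses the cache only as long as the relative L1 difference accumulated since the last full compute stays below the threshold.

```python
from typing import List, Optional, Sequence


def rel_l1_diff(curr: Sequence[float], prev: Sequence[float], eps: float = 1e-8) -> float:
    """Relative L1 difference: sum|curr - prev| / (sum|prev| + eps)."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) + eps
    return num / den


def plan_steps(mask: str, features: List[Sequence[float]], threshold: float) -> List[str]:
    """Decide 'compute' or 'cache' for each step (hypothetical helper).

    mask[i] == '1': always run the full computation.
    mask[i] == '0': reuse the cache while the accumulated relative L1
    difference stays below `threshold`; otherwise compute and reset.
    """
    if len(mask) != len(features) or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a 0/1 string with one character per step")

    decisions: List[str] = []
    accumulated = 0.0
    prev: Optional[Sequence[float]] = None
    for flag, feat in zip(mask, features):
        if flag == "1" or prev is None:
            # Forced full compute (or first step): reset the accumulator.
            decisions.append("compute")
            accumulated = 0.0
        else:
            # Accumulate drift since the last full compute.
            accumulated += rel_l1_diff(feat, prev)
            if accumulated < threshold:
                decisions.append("cache")
            else:
                decisions.append("compute")
                accumulated = 0.0
        prev = feat
    return decisions


if __name__ == "__main__":
    feats = [[1.0], [1.1], [1.2], [1.3]]
    print(plan_steps("1000", feats, threshold=0.2))
```

Accumulating the difference, rather than comparing only adjacent steps, means several small changes in a row eventually trigger a recompute, which is the error-limiting behavior the roadmap item describes.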