Description
The accelerator collective module (which allocates host memory and moves the data onto the host in order to complete collective communications) has a higher priority than some collective modules that natively support CUDA/ROCm (such as UCC). This leads to terrible performance for most users unless they manually exclude the accelerator collective (via --mca coll ^accelerator).
This is definitely not user-friendly. We need to find a way to prevent the accelerator collective module from getting in the way of collective components that handle accelerator buffers natively.
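
For reference, here is a minimal sketch of the current workaround using the environment-variable form of the same MCA setting, which should behave the same as passing --mca coll ^accelerator to mpirun. The application name ./my_gpu_app and the process count are hypothetical placeholders, not taken from any real setup.

```shell
# Environment-variable form of the workaround; intended to match
# "--mca coll ^accelerator" on the mpirun command line.
export OMPI_MCA_coll='^accelerator'

# ./my_gpu_app is a hypothetical application name (assumption).
# The launch is skipped when mpirun or the binary is not available.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_gpu_app ]; then
    mpirun -np 2 ./my_gpu_app
fi
```

Either form only hides the problem for the user who sets it; the default priority ordering described above is unchanged.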