1 file changed: +10 -1 lines changed
@@ -470,7 +470,16 @@ class AllreduceOp
     int size = input.numel();
     size_t bufferSizeBytes = size * input.element_size();
 
-    // Manual tuning
+    // Using unregistered input buffers with NCCL_SYMMETRIC requires a memcpy.
+    // This memcpy is an overhead of using NCCL_SYMMETRIC instead of NCCL.
+    // Both the memcpy cost and the perf benefit of NCCL_SYMMETRIC scale linearly with the message size.
+    // But a local memcpy is cheaper than the remote operations, so the benefit is stronger for larger
+    // messages. Additionally, the perf benefit scales with the number of ranks, since multimem enables O(1)
+    // versus O(N) complexity. Hence we model this cutoff with a linear model. The numbers below were obtained
+    // on GB200 by scanning different message sizes and rank counts. For each number of ranks, the regression
+    // onset can be reduced to a single message size, and the formula below was obtained by fitting a linear
+    // model to these onsets. This empirical heuristic can be overridden with the TLLM_NCCL_MIN_REGISTRATION
+    // environment variable.
     double const a = -4986.43478503;
     double const b = 156716.52177552;
     int nRanks;
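The hunk ends before the code that consumes `a`, `b`, and `nRanks`, so the sketch below is only one way the linear cutoff described in the comment might be evaluated. It assumes the threshold is `a * nRanks + b` interpreted as bytes and clamped at zero, and that TLLM_NCCL_MIN_REGISTRATION holds a plain byte count. The helper names `linearCutoffBytes`, `minRegistrationBytes`, and `prefersSymmetric` are hypothetical and do not come from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper, not from the PR: evaluates the fitted linear model.
// Assumption: the result is a message size in bytes, clamped at zero.
inline std::size_t linearCutoffBytes(int nRanks)
{
    double const a = -4986.43478503;  // slope per rank, copied from the hunk
    double const b = 156716.52177552; // intercept, copied from the hunk
    double const cutoff = a * static_cast<double>(nRanks) + b;
    return static_cast<std::size_t>(std::max(cutoff, 0.0));
}

// Hypothetical helper: applies the environment override when it is set.
// Assumption: the variable holds a non-negative decimal byte count.
inline std::size_t minRegistrationBytes(int nRanks)
{
    if (char const* env = std::getenv("TLLM_NCCL_MIN_REGISTRATION"))
    {
        char* end = nullptr;
        unsigned long long const value = std::strtoull(env, &end, 10);
        if (end != env)
        {
            return static_cast<std::size_t>(value);
        }
    }
    return linearCutoffBytes(nRanks);
}

// Hypothetical decision: take the symmetric path (and pay for the memcpy)
// only when the message is at least as large as the given cutoff.
inline bool prefersSymmetric(std::size_t bufferSizeBytes, std::size_t cutoffBytes)
{
    return bufferSizeBytes >= cutoffBytes;
}
```

Under these assumptions the cutoff shrinks as the rank count grows and reaches zero between 31 and 32 ranks, which is consistent with the comment's claim that the benefit scales with the number of ranks.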