Commit 429dcaa

nv-lschneiderTabrizian authored and committed
adding empirical model explanation
Signed-off-by: Ludwig Schneider <[email protected]>
1 parent bd3f2a1 commit 429dcaa

File tree

1 file changed: 10 additions, 1 deletion


cpp/tensorrt_llm/thop/allreduceOp.cpp

Lines changed: 10 additions & 1 deletion
@@ -470,7 +470,16 @@ class AllreduceOp
     int size = input.numel();
     size_t bufferSizeBytes = size * input.element_size();

-    // Manual tuning
+    // Using unregistered input buffers with NCCL symmetric requires a memcpy.
+    // This is an overhead introduced by using NCCL_SYMMETRIC over NCCL.
+    // Both the memcpy and the perf benefit of NCCL_SYMMETRIC scale linearly with the message size.
+    // But a local memcpy is cheaper than the remote operations, so the benefit is stronger at larger
+    // message sizes. Additionally, the perf benefit scales with the number of ranks, since multimem
+    // enables O(1) versus O(N) complexity. Hence we model this cutoff with a linear model. The numbers
+    // below were obtained on GB200 by scanning different message sizes and ranks. For each number of
+    // ranks, the regression onset can be reduced to a single message size, and the formula below was
+    // obtained by fitting a linear model to those onsets. This empirical heuristic can be overridden
+    // with the TLLM_NCCL_MIN_REGISTRATION environment variable.
     double const a = -4986.43478503;
     double const b = 156716.52177552;
     int nRanks;
