
Conversation


@nujoug nujoug commented Feb 6, 2026

What does this PR do?

Reduces the peak memory footprint when using chunked computation in the loss function (when sequence packing is disabled).

Issues

Improve loss function memory usage

Details

Problem

The current approach stores each chunk's gradient in a list and calls torch.cat at the end to return the full gradient tensor. As a result, at least two copies of the gradient tensor exist at the memory peak; a minimal sketch of this pattern is shown below.
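For illustration only, a minimal sketch of the list-accumulate-and-concatenate pattern (not the project's actual code); `compute_chunk_grad` is a hypothetical stand-in for the per-chunk gradient computation:

```python
import torch

def backward_list_and_cat(chunks, compute_chunk_grad):
    # Accumulate every chunk's gradient before building the final tensor.
    grad_chunks = [compute_chunk_grad(chunk) for chunk in chunks]
    # torch.cat allocates a brand-new full-size tensor while all per-chunk
    # gradients are still alive in the list, so roughly two full copies of
    # the gradient coexist at the peak.
    return torch.cat(grad_chunks, dim=0)
```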

[Screenshot: memory profile of the current approach]

Modification

Preallocate the full gradient tensor and copy each chunk's gradient into it in place; a sketch of this pattern is shown below.
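A minimal sketch of the preallocate-and-copy pattern (again not the project's actual code; `compute_chunk_grad` is a hypothetical stand-in):

```python
import torch

def backward_preallocated(logits, chunk_size, compute_chunk_grad):
    # One full-size buffer is allocated up front.
    grad_input = torch.empty_like(logits)
    for start in range(0, logits.shape[0], chunk_size):
        end = min(start + chunk_size, logits.shape[0])
        # copy_ writes directly into the preallocated slice, so at any point
        # only one chunk-sized temporary exists on top of the full buffer.
        grad_input[start:end].copy_(compute_chunk_grad(logits[start:end]))
    return grad_input
```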

[Screenshot: memory profile with the preallocated gradient tensor]

Additional Gains

With the current approach, reducing the chunk size yields less peak-memory reduction than expected, because the lifetimes of some intermediate tensors overlap (they are only freed lazily).

[Screenshots: memory profiles with a smaller chunk size, current approach]

With explicit deallocation of the intermediate tensors (using del), the memory footprint reduction is more significant (-0.4 GiB to -0.6 GiB); a sketch is shown below.
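A sketch of the per-chunk loop with explicit deletion; the tensor names (softmax_output, is_chosen) mirror those mentioned in the review comment further down, but the exact arithmetic here is simplified for illustration:

```python
import torch

def backward_with_explicit_del(logits, is_chosen, chunk_size):
    grad_input = torch.empty_like(logits)
    for start in range(0, logits.shape[0], chunk_size):
        end = min(start + chunk_size, logits.shape[0])
        softmax_output = torch.softmax(logits[start:end].float(), dim=-1)
        grad_chunk = is_chosen[start:end].float().sub_(softmax_output)
        grad_input[start:end].copy_(grad_chunk)
        # Drop the references now so the caching allocator can reuse these
        # blocks for the next chunk, instead of keeping them alive until the
        # names are rebound on the next iteration.
        del softmax_output, grad_chunk
    return grad_input
```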

[Screenshots: memory profiles with a smaller chunk size and explicit del]

Caveats

The modification does not reduce peak memory when sequence packing is enabled. In that case SequencePackingLossWrapper is used and the default torch.autograd machinery handles the backward pass, which leads to this undesirable behavior.

[Screenshot: memory profile with sequence packing enabled]

This should be solvable once there is a custom torch.autograd.Function that handles sequence packing; a hypothetical skeleton is sketched below.
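Purely as a hypothetical sketch of that direction (the class name, signature, and math are illustrative and not an existing API in this repo), a custom autograd.Function could preallocate and fill the gradient buffer itself; the packed-sequence bookkeeping (e.g., cu_seqlens handling and chunking) is elided here:

```python
import torch

class PackedChunkedLogprob(torch.autograd.Function):
    @staticmethod
    def forward(ctx, packed_logits, packed_targets):
        log_probs = torch.log_softmax(packed_logits.float(), dim=-1)
        token_logprobs = log_probs.gather(-1, packed_targets.unsqueeze(-1)).squeeze(-1)
        ctx.save_for_backward(packed_logits, packed_targets)
        return token_logprobs

    @staticmethod
    def backward(ctx, grad_output):
        packed_logits, packed_targets = ctx.saved_tensors
        # Preallocate the full gradient and write into it, rather than letting
        # default autograd accumulate per-op buffers.
        grad_input = torch.empty_like(packed_logits)
        softmax_output = torch.softmax(packed_logits.float(), dim=-1)
        one_hot = torch.zeros_like(softmax_output).scatter_(
            -1, packed_targets.unsqueeze(-1), 1.0
        )
        # d(log p_t)/d(logit_j) = 1[j == t] - p_j, scaled by the incoming grad.
        grad_input.copy_((one_hot - softmax_output) * grad_output.unsqueeze(-1))
        del softmax_output, one_hot
        return grad_input, None
```

Usage in this sketch would be `PackedChunkedLogprob.apply(packed_logits, packed_targets)`.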

Summary by CodeRabbit

  • Performance Improvements
    • Optimized gradient computation during the backward pass to reduce memory usage and improve efficiency through better memory allocation and cleanup strategies.

@nujoug nujoug self-assigned this Feb 6, 2026
@nujoug nujoug force-pushed the mloh/loss_fn_memory_footprint branch from aaa5b0b to 3a6cd1a Compare February 6, 2026 18:31
@nujoug nujoug changed the title perf: Reduce memory footprint for ChunkedDistribuedLogProb Draft: perf: Reduce memory footprint for ChunkedDistribuedLogProb Feb 6, 2026
@nujoug nujoug linked an issue Feb 6, 2026 that may be closed by this pull request
@nujoug nujoug force-pushed the mloh/loss_fn_memory_footprint branch from 3a6cd1a to d1ce6b9 Compare February 6, 2026 19:47
@nujoug nujoug changed the title Draft: perf: Reduce memory footprint for ChunkedDistribuedLogProb perf: Reduce memory footprint for ChunkedDistribuedLogProb Feb 6, 2026
@nujoug nujoug marked this pull request as ready for review February 6, 2026 19:49
@nujoug nujoug requested a review from a team as a code owner February 6, 2026 19:49
Signed-off-by: mloh <mloh@nvidia.com>
@nujoug nujoug force-pushed the mloh/loss_fn_memory_footprint branch from d1ce6b9 to f10e4e4 Compare February 6, 2026 19:51
@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Modified the backward gradient computation in ChunkedDistributedLogprob to use preallocated gradient tensors with in-place chunk copying instead of accumulating gradients in a list and concatenating them. Added explicit in-loop deallocation of temporary tensors to optimize memory usage.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Backward Pass Optimization: nemo_rl/distributed/model_utils.py | Refactored ChunkedDistributedLogprob.backward to preallocate the grad_input tensor and copy chunk results into the corresponding slices, replacing list accumulation and final concatenation. Added in-loop deletion of temporary tensors (softmax_output, is_chosen, logits) for explicit memory deallocation. Updated the single-chunk path to use preallocated chunk views with copy_ operations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ⚠️ Warning | The PR lacks test results for numerical correctness and convergence despite demonstrating memory improvements. A dtype precision concern exists where bf16/fp16 preallocation conflicts with float32 gradient computations. | Add unit tests verifying gradient equivalence and small-scale convergence comparisons, and address the dtype precision issue by either adding dtype=torch.float32 to the preallocation or justifying the dtype change with test results. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly describes the main optimization: reducing the memory footprint of ChunkedDistributedLogProb, which is the primary objective of this performance-focused PR. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |




Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/model_utils.py`:
- Line 223: The preallocated grad_input created with torch.empty_like(vocab_parallel_logits) inherits bf16/fp16 and therefore silently truncates the float32 gradient arithmetic. Allocate grad_input explicitly as float32 (e.g., an empty tensor with the same shape and device but dtype=torch.float32) so that the subsequent operations (is_chosen.float().sub_(softmax_output), the copy_ into grad_input, and the mul_ call) run in float32 and preserve gradient precision. Update the grad_input creation site (near the vocab_parallel_logits usage) to allocate float32 while keeping the shape and device matched; a sketch of this suggestion follows.
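A minimal sketch of the suggested allocation change (the exact creation site in model_utils.py may differ):

```python
import torch

def preallocate_grad_fp32(vocab_parallel_logits: torch.Tensor) -> torch.Tensor:
    # Same shape and device as the logits, but forced to float32 so the
    # in-place gradient arithmetic is not truncated to bf16/fp16.
    return torch.empty(
        vocab_parallel_logits.shape,
        dtype=torch.float32,
        device=vocab_parallel_logits.device,
    )

# Illustrative usage: bf16 logits, float32 gradient buffer.
logits = torch.randn(2, 8, 128, dtype=torch.bfloat16)
grad_input = preallocate_grad_fp32(logits)
assert grad_input.dtype == torch.float32
```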
🧹 Nitpick comments (1)
nemo_rl/distributed/model_utils.py (1)

209-260: Consider applying the same preallocation pattern to ChunkedDistributedGatherLogprob.backward and ChunkedDistributedEntropy.backward.

Both sibling backward methods (lines 334–386 and 1041–1060) still use the list-accumulate + torch.cat pattern. They'd benefit from the same preallocation optimization for consistency and memory reduction.

Signed-off-by: mloh <mloh@nvidia.com>
@nujoug nujoug force-pushed the mloh/loss_fn_memory_footprint branch from 7725656 to ac29781 Compare February 6, 2026 21:21
@nujoug
Author

nujoug commented Feb 6, 2026

> Nitpick comments (1)
>
> nemo_rl/distributed/model_utils.py (1)
>
> 209-260: Consider applying the same preallocation pattern to ChunkedDistributedGatherLogprob.backward and ChunkedDistributedEntropy.backward.
>
> Both sibling backward methods (lines 334–386 and 1041–1060) still use the list-accumulate + torch.cat pattern. They'd benefit from the same preallocation optimization for consistency and memory reduction.

Should we also apply this to ChunkedDistributedGatherLogprob.backward and ChunkedDistributedEntropy.backward?

