Memory leak in ppisp_cuda kernel (~18MB/step) causes OOM during training #6
Description
When using PPISP in a training loop, GPU memory grows linearly at approximately 18MB per training step, causing OOM errors after ~2000 steps on a 46GB GPU. The leak persists regardless of:
- Detaching the PPISP output from the computation graph
- Disabling regularization loss
- Calling `torch.cuda.synchronize()`, `torch.cuda.empty_cache()`, and `gc.collect()` after each forward pass
The leak appears to be inside the `ppisp_cuda` CUDA kernel rather than in PyTorch's autograd system.
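One way to triage where the growth lives (a diagnostic sketch of my own, not part of PPISP; `classify_growth` and its threshold are hypothetical): compare the deltas of PyTorch's allocator counters against driver-reported usage. Growth in `torch.cuda.memory_allocated()` means live tensors the allocator can see; growth visible only to `nvidia-smi` while the allocator counters stay flat would point at raw `cudaMalloc` inside the extension.

```python
def classify_growth(alloc_delta, reserved_delta, driver_delta, tol=1 << 20):
    """Heuristic triage of per-step GPU memory growth (all deltas in bytes).

    alloc_delta    -- change in torch.cuda.memory_allocated()
    reserved_delta -- change in torch.cuda.memory_reserved()
    driver_delta   -- change in driver-reported used memory (nvidia-smi/NVML)
    tol            -- ignore growth below this threshold (default 1 MiB)
    """
    if alloc_delta > tol:
        return "live tensors"          # autograd graph / save_for_backward retention
    if reserved_delta > tol:
        return "allocator caching"     # caching-allocator growth, usually plateaus
    if driver_delta > tol:
        return "untracked allocation"  # raw cudaMalloc outside PyTorch's allocator
    return "stable"

# ~18 MB/step showing up in memory_allocated() would classify as live tensors
print(classify_growth(18 << 20, 18 << 20, 18 << 20))  # → live tensors
```

Notably, the reproduction below prints `torch.cuda.memory_allocated()` and shows it growing, which suggests the leaked memory is at least tracked by the caching allocator.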
Environment
- PPISP version: 1.0.0
- PyTorch version: 2.5.1
- CUDA version: 12.4
- GPU: NVIDIA A6000 (46GB)
- OS: Ubuntu Linux 6.8.0
Minimal Reproduction
```python
import torch
from ppisp import PPISP, PPISPConfig

# Initialize PPISP
ppisp = PPISP(num_cameras=1, num_frames=100, config=PPISPConfig(use_controller=False))
ppisp = ppisp.cuda()
optimizers = ppisp.create_optimizers()

# Simulate training loop
height, width = 540, 960
pixel_y, pixel_x = torch.meshgrid(
    torch.arange(height, device="cuda", dtype=torch.float32) + 0.5,
    torch.arange(width, device="cuda", dtype=torch.float32) + 0.5,
    indexing="ij",
)
pixel_coords = torch.stack([pixel_x, pixel_y], dim=-1)

for step in range(3000):
    # Simulate rendered RGB from Gaussian splatting
    rgb_in = torch.rand(1, height, width, 3, device="cuda", requires_grad=True)

    # Apply PPISP
    rgb_out = ppisp(
        rgb=rgb_in,
        pixel_coords=pixel_coords,
        resolution=(width, height),
        camera_idx=0,
        frame_idx=step % 100,
    )

    # Compute loss and backward
    loss = (rgb_out - torch.rand_like(rgb_out)).pow(2).mean()
    loss.backward()
    for opt in optimizers:
        opt.step()
        opt.zero_grad(set_to_none=True)

    # Memory debug
    if step % 100 == 0:
        alloc = torch.cuda.memory_allocated() / 1024**3
        print(f"Step {step}: {alloc:.2f} GB")

    # Cleanup
    del rgb_in, rgb_out, loss
```

Expected: Memory stable around 1-2 GB
Actual: Memory grows ~18MB/step, OOM around step 2000
Experimental Evidence
I ran 6 controlled experiments to isolate the leak:
| Experiment | Configuration | Result |
|---|---|---|
| No PPISP | `post_processing=None` | Stable at ~1.7 GB (8000 steps) |
| With PPISP | Default | OOM at step ~2000 |
| Detached output | `rgb.detach().requires_grad_(True)` after PPISP | OOM at step ~150 |
| No reg loss | Skip `get_regularization_loss()` | OOM at step ~2000 |
| Skip forward | PPISP module initialized but `forward()` not called | Stable at ~1.7 GB (8000 steps) |
| Aggressive cleanup | synchronize + empty_cache + gc.collect after each call | OOM at step ~2000 |
**Key finding:** Only the experiments that never call the PPISP forward pass show stable memory.
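To pin the growth on the forward call itself (rather than the surrounding loop), the retained memory per call can be measured in isolation. This is a sketch of my own; `retained_per_call` is a hypothetical helper, not a PPISP API:

```python
import torch

def retained_per_call(fn, *args, warmup=3, iters=20, **kwargs):
    """Mean bytes left allocated per call of `fn`, measured in isolation.

    Warmup iterations let lazy initialization and allocator caching settle,
    so the steady-state delta reflects genuine per-call retention.
    Returns 0.0 on machines without CUDA.
    """
    if not torch.cuda.is_available():
        return 0.0
    for _ in range(warmup):
        out = fn(*args, **kwargs)
        del out
    torch.cuda.synchronize()
    start = torch.cuda.memory_allocated()
    for _ in range(iters):
        out = fn(*args, **kwargs)
        del out
    torch.cuda.synchronize()
    return (torch.cuda.memory_allocated() - start) / iters
```

Running this once with a closure over the PPISP forward and once with a plain `torch.rand` allocation of the same size should show roughly 18 MB retained for the former and approximately zero for the latter if the leak is per-call.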
Memory Growth Pattern
```
Step 0:    0.93 GB
Step 50:  16.28 GB  (+15.35 GB)
Step 100: 28.96 GB  (+12.68 GB)
Step 150: 40.16 GB  (+11.20 GB)
Step 165: OOM (43+ GB)
```
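The growing allocations can also be attributed to specific call sites with PyTorch's allocator history recorder (a sketch; this uses the underscore-prefixed private `torch.cuda.memory._record_memory_history` API available in PyTorch 2.x, and the inner loop is a stand-in for the reproduction above):

```python
import torch

def dump_alloc_history(path="ppisp_leak.pickle", steps=50):
    """Record allocation stack traces, then dump a snapshot for inspection
    at https://pytorch.org/memory_viz. No-op (returns False) without CUDA.

    Leaked blocks appear in the snapshot tagged with the Python/C++ frames
    that allocated them, which should name the ppisp_cuda call site directly.
    """
    if not torch.cuda.is_available():
        return False
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    for _ in range(steps):
        # stand-in for the reproduction loop above
        x = torch.randn(256, 256, device="cuda")
        del x
    torch.cuda.memory._dump_snapshot(path)
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
    return True
```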
Analysis
The leak is NOT caused by:
- PyTorch autograd graph retention (detaching doesn't help)
- PPISP regularization loss computation
- Python garbage collection issues
The leak IS caused by something in `_PPISPFunction.forward()` or the underlying `_C.ppisp_forward()` CUDA kernel. The tensors stored via `ctx.save_for_backward()` should be released once backward runs, but something prevents this.
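For comparison, stock autograd does release `save_for_backward` tensors after backward, which can be verified on CPU with a weakref (my own minimal sketch, not PPISP code; `Square` is a hypothetical stand-in for the `_PPISPFunction` pattern):

```python
import gc
import weakref

import torch

class Square(torch.autograd.Function):
    """Minimal custom Function mirroring the save_for_backward pattern."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(4, requires_grad=True)
h = x * 3                      # non-leaf intermediate, saved by Square
ref = weakref.ref(h)
loss = Square.apply(h).sum()   # loss = sum((3x)^2), so dloss/dx = 18x
del h
loss.backward()                # retain_graph=False frees the saved tensors
del loss
gc.collect()

assert torch.allclose(x.grad, 18 * x)
assert ref() is None  # saved tensor was released after backward; if the
                      # extension pinned it (cached ctx, global buffer),
                      # this reference would still be alive
```

If an equivalent weakref on a PPISP input or intermediate stays alive across steps, that would identify exactly which object the extension is holding.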
Workaround
Currently disabling PPISP entirely as a workaround.