
Memory leak in ppisp_cuda kernel (~18MB/step) causes OOM during training #6

@ehatami65

Description


When using PPISP in a training loop, GPU memory grows linearly at approximately 18MB per training step, causing OOM errors after ~2000 steps on a 46GB GPU. The leak persists regardless of:

  • Detaching the PPISP output from the computation graph
  • Disabling regularization loss
  • Calling torch.cuda.synchronize(), torch.cuda.empty_cache(), and gc.collect() after each forward pass

The leak appears to be inside the ppisp_cuda CUDA kernel rather than in PyTorch's autograd system.
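A simple way to pin the growth to a specific call is to sample a counter (here `torch.cuda.memory_allocated`) before and after each candidate operation and record the deltas. This is a hypothetical probe, not part of PPISP; the class name and API are illustrative:

```python
# Hypothetical probe (not part of PPISP): record the step-to-step delta of
# any counter, e.g. lambda: torch.cuda.memory_allocated(), to confirm the
# growth is tied to the PPISP forward rather than to optimizer state or
# autograd history.
class DeltaTracker:
    """Samples a counter and stores the per-call differences."""

    def __init__(self, sample_fn):
        self.sample_fn = sample_fn
        self.last = sample_fn()   # baseline sample at construction
        self.deltas = []

    def step(self):
        now = self.sample_fn()
        self.deltas.append(now - self.last)
        self.last = now
        return self.deltas[-1]
```

Calling `tracker.step()` right after the PPISP forward and again after `backward()` shows a constant positive delta attached to the forward call only, which is what the experiments below confirm.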

Environment

  • PPISP version: 1.0.0
  • PyTorch version: 2.5.1
  • CUDA version: 12.4
  • GPU: NVIDIA A6000 (46GB)
  • OS: Ubuntu Linux 6.8.0

Minimal Reproduction

import torch
from ppisp import PPISP, PPISPConfig

# Initialize PPISP
ppisp = PPISP(num_cameras=1, num_frames=100, config=PPISPConfig(use_controller=False))
ppisp = ppisp.cuda()
optimizers = ppisp.create_optimizers()

# Simulate training loop
height, width = 540, 960
pixel_y, pixel_x = torch.meshgrid(
    torch.arange(height, device="cuda", dtype=torch.float32) + 0.5,
    torch.arange(width, device="cuda", dtype=torch.float32) + 0.5,
    indexing="ij",
)
pixel_coords = torch.stack([pixel_x, pixel_y], dim=-1)

for step in range(3000):
    # Simulate rendered RGB from Gaussian splatting
    rgb_in = torch.rand(1, height, width, 3, device="cuda", requires_grad=True)

    # Apply PPISP
    rgb_out = ppisp(
        rgb=rgb_in,
        pixel_coords=pixel_coords,
        resolution=(width, height),
        camera_idx=0,
        frame_idx=step % 100,
    )

    # Compute loss and backward
    loss = (rgb_out - torch.rand_like(rgb_out)).pow(2).mean()
    loss.backward()

    for opt in optimizers:
        opt.step()
        opt.zero_grad(set_to_none=True)

    # Memory debug
    if step % 100 == 0:
        alloc = torch.cuda.memory_allocated() / 1024**3
        print(f"Step {step}: {alloc:.2f} GB")

    # Cleanup
    del rgb_in, rgb_out, loss

Expected: Memory stable around 1-2 GB
Actual: Memory grows ~18MB/step, OOM around step 2000
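The reported rate and OOM step are mutually consistent; a back-of-envelope check using the ~1.7 GB stable baseline from the control experiments:

```python
# Sanity check: ~18 MB leaked per step on a 46 GB card,
# starting from a ~1.7 GB stable baseline.
leak_per_step_gb = 18 / 1024            # ~0.0176 GB per step
budget_gb = 46 - 1.7                    # capacity minus the stable baseline
steps_to_oom = budget_gb / leak_per_step_gb
# Comes out slightly above 2500 steps -- the same order of magnitude as the
# observed OOM around step 2000 (fragmentation and allocator overhead would
# move the real failure point earlier).
```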

Experimental Evidence

I ran 6 controlled experiments to isolate the leak:

| Experiment | Configuration | Result |
| --- | --- | --- |
| No PPISP | `post_processing=None` | Stable at ~1.7 GB (8000 steps) |
| With PPISP | Default | OOM at step ~2000 |
| Detached output | `rgb.detach().requires_grad_(True)` after PPISP | OOM at step ~150 |
| No reg loss | Skip `get_regularization_loss()` | OOM at step ~2000 |
| Skip forward | PPISP module initialized but `forward()` not called | Stable at ~1.7 GB (8000 steps) |
| Aggressive cleanup | `synchronize` + `empty_cache` + `gc.collect` after each call | OOM at step ~2000 |

Key finding: only the configurations that never call PPISP's forward() keep memory stable.

Memory Growth Pattern

Step 0:    0.93 GB
Step 50:   16.28 GB  (+15.35 GB)
Step 100:  28.96 GB  (+12.68 GB)
Step 150:  40.16 GB  (+11.20 GB)
Step 165:  OOM (43+ GB)

Analysis

The leak is NOT caused by:

  • PyTorch autograd graph retention (detaching doesn't help)
  • PPISP regularization loss computation
  • Python garbage collection issues

The leak IS caused by something in _PPISPFunction.forward() or the underlying _C.ppisp_forward() CUDA kernel. The ctx.save_for_backward() tensors should be released after backward, but something prevents this.

Workaround

Currently disabling PPISP entirely as a workaround.
