Memory leak in ppisp_cuda kernel (~18MB/step) causes OOM during training #6
Description
When using PPISP in a training loop, GPU memory grows linearly at approximately 18MB per training step, causing OOM errors after ~2000 steps on a 46GB GPU. The leak persists regardless of:
- Detaching the PPISP output from the computation graph
- Disabling regularization loss
- Calling `torch.cuda.synchronize()`, `torch.cuda.empty_cache()`, and `gc.collect()` after each forward pass
The leak appears to be inside the `ppisp_cuda` CUDA kernel rather than in PyTorch's autograd system.
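One way to triage where the growth lives (a diagnostic sketch of my own, not part of PPISP; `classify_growth` and its threshold are hypothetical): compare the deltas of PyTorch's allocator counters against driver-reported usage. Growth in `torch.cuda.memory_allocated()` means live tensors the allocator can see; growth visible only to `nvidia-smi` while the allocator counters stay flat would point at raw `cudaMalloc` inside the extension.

```python
def classify_growth(alloc_delta, reserved_delta, driver_delta, tol=1 << 20):
    """Heuristic triage of per-step GPU memory growth (all deltas in bytes).

    alloc_delta    -- change in torch.cuda.memory_allocated()
    reserved_delta -- change in torch.cuda.memory_reserved()
    driver_delta   -- change in driver-reported used memory (nvidia-smi/NVML)
    tol            -- ignore growth below this threshold (default 1 MiB)
    """
    if alloc_delta > tol:
        return "live tensors"          # autograd graph / save_for_backward retention
    if reserved_delta > tol:
        return "allocator caching"     # caching-allocator growth, usually plateaus
    if driver_delta > tol:
        return "untracked allocation"  # raw cudaMalloc outside PyTorch's allocator
    return "stable"

# ~18 MB/step showing up in memory_allocated() would classify as live tensors
print(classify_growth(18 << 20, 18 << 20, 18 << 20))  # → live tensors
```

Notably, the reproduction below prints `torch.cuda.memory_allocated()` and shows it growing, which suggests the leaked memory is at least tracked by the caching allocator.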
Environment
- PPISP version: 1.0.0
- PyTorch version: 2.5.1
- CUDA version: 12.4
- GPU: NVIDIA A6000 (46GB)
- OS: Ubuntu Linux 6.8.0
Minimal Reproduction
```python
import torch
from ppisp import PPISP, PPISPConfig

# Initialize PPISP
ppisp = PPISP(num_cameras=1, num_frames=100, config=PPISPConfig(use_controller=False))
ppisp = ppisp.cuda()
optimizers = ppisp.create_optimizers()

# Simulate training loop
height, width = 540, 960
pixel_y, pixel_x = torch.meshgrid(
    torch.arange(height, device="cuda", dtype=torch.float32) + 0.5,
    torch.arange(width, device="cuda", dtype=torch.float32) + 0.5,
    indexing="ij",
)
pixel_coords = torch.stack([pixel_x, pixel_y], dim=-1)

for step in range(3000):
    # Simulate rendered RGB from Gaussian splatting
    rgb_in = torch.rand(1, height, width, 3, device="cuda", requires_grad=True)

    # Apply PPISP
    rgb_out = ppisp(
        rgb=rgb_in,
        pixel_coords=pixel_coords,
        resolution=(width, height),
        camera_idx=0,
        frame_idx=step % 100,
    )

    # Compute loss and backward
    loss = (rgb_out - torch.rand_like(rgb_out)).pow(2).mean()
    loss.backward()
    for opt in optimizers:
        opt.step()
        opt.zero_grad(set_to_none=True)

    # Memory debug
    if step % 100 == 0:
        alloc = torch.cuda.memory_allocated() / 1024**3
        print(f"Step {step}: {alloc:.2f} GB")

    # Cleanup
    del rgb_in, rgb_out, loss
```

Expected: Memory stable around 1-2 GB
Actual: Memory grows ~18MB/step, OOM around step 2000
Experimental Evidence
I ran 6 controlled experiments to isolate the leak:
| Experiment | Configuration | Result |
|---|---|---|
| No PPISP | `post_processing=None` | Stable at ~1.7 GB (8000 steps) |
| With PPISP | Default | OOM at step ~2000 |
| Detached output | `rgb.detach().requires_grad_(True)` after PPISP | OOM at step ~150 |
| No reg loss | Skip `get_regularization_loss()` | OOM at step ~2000 |
| Skip forward | PPISP module initialized but `forward()` not called | Stable at ~1.7 GB (8000 steps) |
| Aggressive cleanup | synchronize + empty_cache + gc.collect after each call | OOM at step ~2000 |
**Key finding:** Only the experiments that never call the PPISP forward pass show stable memory.
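To pin the growth on the forward call itself (rather than the surrounding loop), the retained memory per call can be measured in isolation. This is a sketch of my own; `retained_per_call` is a hypothetical helper, not a PPISP API:

```python
import torch

def retained_per_call(fn, *args, warmup=3, iters=20, **kwargs):
    """Mean bytes left allocated per call of `fn`, measured in isolation.

    Warmup iterations let lazy initialization and allocator caching settle,
    so the steady-state delta reflects genuine per-call retention.
    Returns 0.0 on machines without CUDA.
    """
    if not torch.cuda.is_available():
        return 0.0
    for _ in range(warmup):
        out = fn(*args, **kwargs)
        del out
    torch.cuda.synchronize()
    start = torch.cuda.memory_allocated()
    for _ in range(iters):
        out = fn(*args, **kwargs)
        del out
    torch.cuda.synchronize()
    return (torch.cuda.memory_allocated() - start) / iters
```

Running this once with a closure over the PPISP forward and once with a plain `torch.rand` allocation of the same size should show roughly 18 MB retained for the former and approximately zero for the latter if the leak is per-call.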
Memory Growth Pattern
```
Step 0:    0.93 GB
Step 50:  16.28 GB  (+15.35 GB)
Step 100: 28.96 GB  (+12.68 GB)
Step 150: 40.16 GB  (+11.20 GB)
Step 165: OOM (43+ GB)
```
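The growing allocations can also be attributed to specific call sites with PyTorch's allocator history recorder (a sketch; this uses the underscore-prefixed private `torch.cuda.memory._record_memory_history` API available in PyTorch 2.x, and the inner loop is a stand-in for the reproduction above):

```python
import torch

def dump_alloc_history(path="ppisp_leak.pickle", steps=50):
    """Record allocation stack traces, then dump a snapshot for inspection
    at https://pytorch.org/memory_viz. No-op (returns False) without CUDA.

    Leaked blocks appear in the snapshot tagged with the Python/C++ frames
    that allocated them, which should name the ppisp_cuda call site directly.
    """
    if not torch.cuda.is_available():
        return False
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    for _ in range(steps):
        # stand-in for the reproduction loop above
        x = torch.randn(256, 256, device="cuda")
        del x
    torch.cuda.memory._dump_snapshot(path)
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
    return True
```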
Analysis
The leak is NOT caused by:
- PyTorch autograd graph retention (detaching doesn't help)
- PPISP regularization loss computation
- Python garbage collection issues
The leak IS caused by something in `_PPISPFunction.forward()` or the underlying `_C.ppisp_forward()` CUDA kernel. The tensors stored via `ctx.save_for_backward()` should be released once backward runs, but something prevents this.
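For comparison, stock autograd does release `save_for_backward` tensors after backward, which can be verified on CPU with a weakref (my own minimal sketch, not PPISP code; `Square` is a hypothetical stand-in for the `_PPISPFunction` pattern):

```python
import gc
import weakref

import torch

class Square(torch.autograd.Function):
    """Minimal custom Function mirroring the save_for_backward pattern."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(4, requires_grad=True)
h = x * 3                      # non-leaf intermediate, saved by Square
ref = weakref.ref(h)
loss = Square.apply(h).sum()   # loss = sum((3x)^2), so dloss/dx = 18x
del h
loss.backward()                # retain_graph=False frees the saved tensors
del loss
gc.collect()

assert torch.allclose(x.grad, 18 * x)
assert ref() is None  # saved tensor was released after backward; if the
                      # extension pinned it (cached ctx, global buffer),
                      # this reference would still be alive
```

If an equivalent weakref on a PPISP input or intermediate stays alive across steps, that would identify exactly which object the extension is holding.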
Workaround
Currently disabling PPISP entirely as a workaround.