GPU Memory Leak (cuda) during Forward Pass #63
-
Training SNNs at the scale of Inception Net using this PyTorch backend approach is extremely expensive, and unfortunately it is one of those things the SNN research community is still trying to understand better. For every time step in the for-loop, a computational graph is constructed, with the gradients of all hidden states stored so that backprop through time (BPTT) works as intended. Another way to think about it: every time step effectively creates another Inception Net and stores it in memory. Potential solutions include performing a backward pass at every time step so the graph can be freed as you go. I am also currently working on several alternatives to BPTT that do not require storing the full computational graph and that approximate BPTT well enough; this might be useful for you. Though even if you were to get around the memory issue, I haven't seen any networks as deep as Inception Net successfully trained using backprop directly on the SNN itself. While vanishing gradients have been addressed in non-spiking networks thanks to good parameter initialization strategies, batch norm, etc., we still don't have equivalent methods that translate perfectly to SNNs. For deep nets, you might have more luck pre-training the ANN and converting it into an SNN: https://snntoolbox.readthedocs.io/en/latest/
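To make the "backward pass at every time step" idea concrete, here is a minimal toy sketch (not from the thread; the update rule, window size `K`, and variable names are placeholders, not snnTorch API) of truncated BPTT: call `backward()` every few steps and `detach()` the hidden state so the graph from earlier steps can be freed instead of accumulating:

```python
import torch

# Toy "membrane potential" update standing in for one SNN layer (hypothetical).
w = torch.randn(4, 4, requires_grad=True)
mem = torch.zeros(4)

losses = []
K = 5  # truncation window (arbitrary choice for illustration)
for t in range(20):
    mem = torch.tanh(mem @ w)          # each step extends the autograd graph
    losses.append(mem.pow(2).mean())
    if (t + 1) % K == 0:
        loss = torch.stack(losses).sum()
        loss.backward()                # backprop through this window only
        losses.clear()
        mem = mem.detach()             # cut the graph so it does not accumulate
```

After each window, `mem` carries the same values but no `grad_fn`, so autograd no longer holds references to the previous window's intermediate tensors and they can be garbage-collected. Gradients in `w.grad` accumulate across windows as usual.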
-
I am exploring the effects of spiking in an inception-based broadcast network. Without spiking, the network runs and trains normally. However, when I add spiking, memory usage drastically increases. I have narrowed the leak down to a loop in the forward pass, which executes the 3 inception modules and concatenates their outputs. Here is my code:
The output is:
Has anyone ever experienced this before? I have been stuck on this problem for many hours now.
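The original code and output are not shown above, but a small sketch of the pattern described (a time-step loop that runs several modules and concatenates their outputs; the `Linear` stand-ins and names here are hypothetical, not the actual network) shows why memory grows: storing each step's output in a list keeps every step's autograd graph alive, whereas detaching stores only the values:

```python
import torch

# Hypothetical stand-ins for the three inception modules described above.
modules = [torch.nn.Linear(8, 8) for _ in range(3)]
x = torch.randn(2, 8)

outs_keep, outs_detached = [], []
for t in range(10):
    y = torch.cat([m(x) for m in modules], dim=1)  # concat module outputs
    outs_keep.append(y)               # retains the whole graph for step t
    outs_detached.append(y.detach())  # values only; graph can be freed
```

Checking `grad_fn` on the stored tensors is a quick way to diagnose this: a non-`None` `grad_fn` means the tensor is still pinning its computational graph (and all activations inside it) in GPU memory.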