Train a model to predict
- a mask for the foreground object (Background Subtraction)
- a depth map for the image (Monocular Depth Estimation)

given 2 images:
- a background (bg) image, and
- the same background with an object inside it, i.e. a foreground (fg) overlaid on the background (fg_bg)
This can be used mainly for security checks and CCTV footage:
- the mask can be used to check whether a person is in the frame and to track their movement
- the depth map can be used to determine whether the person has entered a restricted zone

The same idea can be extended to a parking lot scenario to assist with/monitor parking.
To run in real time, the inference time has to be very low so the model can keep up with a high frame rate.
The objective of this project is to get a working model with a small number of parameters (and a small model size).
A custom dataset was curated/created for the task at hand.
Starting from 100 background images and 100 transparent foreground images (with an alpha channel), a dataset was created which contains
- 100 images of empty office spaces, office reception lounges and office kitchens as backgrounds (bg)
- 100 images of people as foregrounds (fg)
- 100 corresponding masks for the fg images
- 400k foreground-on-background (fg_bg) images
  - generated by overlaying the foreground images (and their flips) on top of the backgrounds (a sketch of this step is shown below)
- 400k corresponding masks for the fg_bg images
  - generated by overlaying the corresponding foreground mask at the same position on a black canvas of the bg shape
- 400k corresponding depth images
  - generated using the pretrained weights of a SOTA monocular depth estimation model
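A minimal sketch of the overlay step using PIL; the file paths, random placement and helper name are illustrative assumptions, not the exact generation script:

```python
import random
from PIL import Image

def overlay_fg_on_bg(bg_path, fg_path, fg_mask_path):
    """Paste a transparent foreground onto a background and build the matching mask."""
    bg = Image.open(bg_path).convert("RGB")
    fg = Image.open(fg_path).convert("RGBA")        # keeps the alpha channel
    fg_mask = Image.open(fg_mask_path).convert("L")

    # Random top-left position so the foreground stays inside the background
    x = random.randint(0, bg.width - fg.width)
    y = random.randint(0, bg.height - fg.height)

    # fg_bg image: the alpha channel of fg is used as the paste mask
    fg_bg = bg.copy()
    fg_bg.paste(fg, (x, y), fg)

    # fg_bg mask: the fg mask pasted at the same position on a black canvas of bg shape
    mask = Image.new("L", bg.size, 0)
    mask.paste(fg_mask, (x, y))
    return fg_bg, mask
```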
Complete details can be found here.
The dataset is stored on Google Drive and can be accessed via this link.
The link redirects to a Google Drive folder which has 2 folders:
- Dataset (incomplete, due to I/O timeouts in Colab)
- Compressed_Dataset (the one used here; size: 1.8G)
with the below tree structure:

- Dataset
  - inp_bg
  - inp_fg
  - inp_fg_masks
  - depth_maps
    - bg1
      - image1
      - image2
      - ...
    - bg2
      - image1
      - image2
      - ...
    - bg3
    - ...
  - fg_bg
    - bg1
      - image1
      - image2
      - ...
    - bg2
      - image1
      - image2
      - ...
    - bg3
    - ...
  - fg_bg_masks
    - bg1
      - image1
      - image2
      - ...
    - bg2
      - image1
      - image2
      - ...
    - bg3
    - ...
- Compressed_Dataset
  - depth_maps
    - bg1.zip
    - bg2.zip
    - ...
  - fg_bg
    - bg1.zip
    - bg2.zip
    - ...
  - fg_bg_masks
    - bg1.zip
    - bg2.zip
    - ...
Calculated the mean and std for each of the image sets.
The standard deviation is estimated by averaging the per-batch std over mini-batches. While very close to the true std, it is not exact, and the approximation can be leveraged when there are time/computation limitations. It also works well in production settings where new data is added on a daily basis.
```python
from torch.utils.data import DataLoader

def get_batchwise_avg_mean_std(dataset, batch_size=50):
    """Approximate per-channel mean/std by averaging statistics over mini-batches."""
    print(len(dataset))
    loader = DataLoader(dataset,
                        batch_size=batch_size,
                        shuffle=True)
    mean = 0.
    std = 0.
    nb_samples = 0.
    for data in loader:
        batch_samples = data.size(0)
        # Flatten the spatial dimensions: (N, C, H*W)
        data = data.view(batch_samples, data.size(1), -1)
        mean += data.mean(2).sum(0)
        std += data.std(2).sum(0)
        nb_samples += batch_samples
    mean /= nb_samples
    std /= nb_samples
    # return mean, std
    print(mean, std)
```
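A hypothetical call, assuming a `fg_bg_dataset` that yields plain image tensors (the name is illustrative):

```python
# Approximate per-channel mean/std for normalization
get_batchwise_avg_mean_std(fg_bg_dataset, batch_size=50)
```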
Complete details can be found at dataset creation
Worked with a minimal set of 12k images to test the waters before diving in.
- ##### Dataloader
Each of the folders fg_bg, depth_maps and fg_bg_masks has a separate zip file for every bg. Images are read directly from the zip files to save on disk space and extraction time.
The list of all image paths is built from fg_bg, and when the dataset is indexed by the dataloader, the corresponding images from the other folders/zip files are looked up.
```python
# Inside the Dataset's __init__: collect every image name from the fg_bg zip files
for file in os.listdir(fg_bg_dir):
    fname = os.path.join(fg_bg_dir, file)
    if zipfile.is_zipfile(fname):
        self.fg_bg += [x.filename for x in zipfile.ZipFile(fname).infolist()]
```

```python
def __getitem__(self, index):
    # The fg_bg file name starts with the background it belongs to, e.g. 'bg1_...'
    bg = self.fg_bg[index].split('_')[0]
    bg_file = Path(self.data_root).joinpath('bg', bg + '.jpg')
    bg_img = np.array(Image.open(str(bg_file)))
    # The matching fg_bg image, mask and depth map are read straight from their zip files
    fg_bg_img = self.read_img_from_zip(f'{self.data_root}/fg_bg/{bg}.zip', self.fg_bg[index])
    mask_img = self.read_img_from_zip(f'{self.data_root}/fg_bg_masks/{bg}.zip', self.fg_bg[index])
    depth_img = self.read_img_from_zip(f'{self.data_root}/depth_maps/{bg}.zip', self.fg_bg[index])
```
A helper function is added to the dataset class to read a file from a zip archive as a PIL image for regular transforms, or as a NumPy array for Albumentations transforms.
```python
def read_img_from_zip(self, zip_name, file_name, array=True):
    # Read the raw image bytes directly from the zip archive
    imgdata = zipfile.ZipFile(zip_name).read(file_name)
    img = Image.open(io.BytesIO(imgdata))
    # img = img.convert("RGB")
    if array:
        # NumPy array for Albumentations transforms
        return np.array(img)
    # PIL image for torchvision transforms
    return img
```
Data is split into train and test sets with an 80:20 split.

```python
tr_size = int(0.8 * len(dataset))
tst_size = len(dataset) - tr_size
# random_split returns two Dataset objects (wrapped in DataLoaders afterwards)
train_ds, test_ds = torch.utils.data.random_split(dataset, [tr_size, tst_size])
```
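The split datasets can then be wrapped in DataLoaders; the batch size and worker count below are illustrative assumptions:

```python
from torch.utils.data import DataLoader

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2)
test_dl = DataLoader(test_ds, batch_size=64, shuffle=False, num_workers=2)
```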
- ##### Basic transforms are set for each of the image sets
Transformations used: RandomCrop, HorizontalFlip, Resize (64x64), Normalization.
Used the Albumentations library for the transformations. Read about its advantages here and here.
Albumentations doesn't support loading PIL images directly and works with NumPy arrays, so the dataset class has to be modified accordingly. I found this notebook very useful as a guide.
Albumentations also supports passing masks/weights along with the original image so that the same spatial transforms are applied to them (example here), or we can create our own custom additional targets for the Compose, which is useful when we have multiple images that are not linked to each other but need the same transforms (a sketch follows below).
Examples of some of the transforms applied to a segmentation problem can be found in this notebook.
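A minimal sketch of such a pipeline with an extra depth target; the normalization values here are placeholders, not the computed dataset statistics:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Same spatial transforms applied to the fg_bg image, its mask and its depth map.
# 'depth' is registered as an extra mask-like target so it is warped identically.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Resize(64, 64),
        A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # placeholder stats
        ToTensorV2(),
    ],
    additional_targets={"depth": "mask"},
)

augmented = transform(image=fg_bg_img, mask=mask_img, depth=depth_img)
fg_bg_t, mask_t, depth_t = augmented["image"], augmented["mask"], augmented["depth"]
```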
- ##### Setup basic model
The model is expected to give us a mask and a depth map for the foreground given the 2 images.
Could we use a single fg_bg image and still predict both?
Would that change the problem scope? Instead of identifying the foreground object in a given background image, the model would be trying to find the mask and depth in any general setting.
Will experiment with that as well if time permits.
Using as inputs:
- only fg_bg
- both bg and fg_bg

Predicting:
- only the mask
- only the depth map
- both the mask and the depth map (see the sketch after this list)
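A minimal sketch of the "both bg and fg_bg in, mask and depth out" variant; the class name, channel widths and depth are illustrative assumptions, not the actual architecture used:

```python
import torch
import torch.nn as nn

class MaskDepthNet(nn.Module):
    """Toy encoder with two prediction heads; the two inputs are concatenated on channels."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.mask_head = nn.Conv2d(64, 1, 1)   # 1-channel mask logits
        self.depth_head = nn.Conv2d(64, 1, 1)  # 1-channel depth map

    def forward(self, bg, fg_bg):
        x = torch.cat([bg, fg_bg], dim=1)      # (N, 6, H, W)
        feats = self.encoder(x)
        return self.mask_head(feats), self.depth_head(feats)

# mask_logits, depth_pred = MaskDepthNet()(bg_batch, fg_bg_batch)
```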
After resizing to 64x64, the images become pixelated and sharp edges are no longer available. The RF for gradients/edges was set at 5 pixels.
Since the output also has to be the same size as the input, we either have to use transpose convolutions or maintain the size throughout, without any maxpool/stride-2 downsampling. Without any stride/maxpool, getting a receptive field of the image size in the final layer requires a lot of convolutional layers.
Used group convolutions with 2 groups of 3 channels each for the first few convolutions, the idea being that the network can initially learn low-level features from both images separately (a sketch of such a layer is shown below).
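A minimal sketch of what such a grouped first layer could look like (the output width is an assumption):

```python
import torch.nn as nn

# bg and fg_bg are concatenated into a 6-channel input; groups=2 keeps the first
# 3 channels (bg) and the last 3 channels (fg_bg) on separate sets of filters,
# so early low-level features are learned from each image independently.
first_conv = nn.Conv2d(in_channels=6, out_channels=32, kernel_size=3,
                       padding=1, groups=2)
```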
The initial network was created without using transpose conv/deconv. Accounting for the RF, below is a brief network summary; it is a heavy model with more parameters.
Final layer RF: 70

```text
Total params: 3,769,664
Trainable params: 3,769,664
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.09
Forward/backward pass size (MB): 462.03
Params size (MB): 14.38
Estimated Total Size (MB): 476.51
```
- Training took around 40 mins per epoch on a 16 GB Tesla P100
- ##### Setup TensorBoard on Colab
Used the below extension and magic function to access TensorBoard. This opens TensorBoard in the cell output.

```python
# Load the TensorBoard notebook extension
%load_ext tensorboard
logs_base_dir = 'logs'
%tensorboard --logdir {logs_base_dir}
```
To write to TensorBoard we can use `SummaryWriter` from `torch.utils.tensorboard`:

```python
# TensorBoard support
from torch.utils.tensorboard import SummaryWriter
```
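A brief sketch of how the writer could be used to log losses and prediction grids each epoch; the tag names and helper are illustrative assumptions:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='logs/run1')

def log_epoch(epoch, mask_loss, depth_loss, mask_logits, depth_pred):
    """Log scalar losses and prediction image grids for one epoch."""
    writer.add_scalar('loss/mask', mask_loss, epoch)
    writer.add_scalar('loss/depth', depth_loss, epoch)
    # add_images expects NCHW tensors with values in [0, 1]
    writer.add_images('pred/mask', torch.sigmoid(mask_logits), epoch)
    writer.add_images('pred/depth', depth_pred.clamp(0, 1), epoch)
```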
- ##### Make sure model is training
The model was run for 2 epochs without any issues.
Different model architectures involving transpose convolutions were tried out.
An architecture search showed that similar problems use one form or another of U-Net. The below two architectures were taken as starting points:
- U-Net: arXiv pdf; GitHub code; article
- Deep Residual U-Net: arXiv pdf; GitHub code (PyTorch); notebook (Keras)
Architectures experimented with:
- Transpose convolutions with a width of 2
ConvTranspose is a convolution with trainable kernels, while Upsample is a simple interpolation (bilinear, nearest, etc.): transpose convolutions have learnable parameters whereas upsampling has none. Upsampling can make inference and training faster since there are no weights to update or gradients to compute, but since the input images are already pixelated, transpose convolutions were used, at the cost of additional parameters and model size (a small comparison is sketched below).
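A small illustrative comparison of the two options (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# Learnable 2x upsampling: trainable kernels, extra parameters
deconv = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
print(deconv(x).shape, sum(p.numel() for p in deconv.parameters()))
# torch.Size([1, 64, 32, 32]) 16448

# Parameter-free 2x upsampling: plain interpolation
up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
print(up(x).shape)  # torch.Size([1, 64, 32, 32])
```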
Final layer RF: 124

```text
================================================================
Total params: 1,190,272
Trainable params: 1,190,272
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.09
Forward/backward pass size (MB): 139.16
Params size (MB): 4.54
Estimated Total Size (MB): 143.80
```
- Training took around 15 mins per epoch on a 16 GB Tesla P100
- Modified DeepResUNet architecture.
Have to revisit this architecture to reduce training memory.
```text
================================================================
Total params: 1,765,421
Trainable params: 1,765,421
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.09
Forward/backward pass size (MB): 43093.52
Params size (MB): 6.73
Estimated Total Size (MB): 43100.34
----------------------------------------------------------------
```
Proceeding with ConvTranspose as the results seem good with limited training.
All the above models were trained with BCE loss for both the mask and the depth maps. While the results look promising, the masks are not clean and the depth maps still miss out on details.
After learning about and experimenting with a few losses, soft Dice loss was chosen for the masks, and pixel-wise MSE loss was chosen for the depth maps over SSIM and MS-SSIM, since SSIM requires choosing a window width.
Below are the loss functions used:
- Dice-coefficient-based Soft Dice Loss for the mask. Read my post about the loss [here]
- MSELoss, preferred over Multi-Scale Structural Similarity loss, for the depth maps - implementation

Depth maps need more training and hence their loss is given more weight:

```python
# mask_loss  = soft Dice loss
# depth_loss = MSE loss
# Final loss function
loss = 1 * mask_loss + 2 * depth_loss
```
Dice loss implementation:

```python
import torch
import torch.nn as nn

def dice_coeff(pred, target):
    smooth = 1.
    num = pred.size(0)
    m1 = pred.view(num, -1)    # Flatten
    m2 = target.view(num, -1)  # Flatten
    intersection = (m1 * m2).sum()
    return (2. * intersection + smooth) / (m1.sum() + m2.sum() + smooth)

class SoftDiceLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        super(SoftDiceLoss, self).__init__()

    def forward(self, logits, targets):
        probs = torch.sigmoid(logits)  # F.sigmoid is deprecated
        # dice_coeff already aggregates over the whole batch,
        # so the loss is simply 1 minus the coefficient
        score = dice_coeff(probs, targets)
        return 1 - score
```
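A rough sketch of how the two losses could be combined, matching the 1:2 weighting above (the variable names are illustrative):

```python
import torch.nn as nn

mask_criterion = SoftDiceLoss()
depth_criterion = nn.MSELoss()

def compute_loss(mask_logits, depth_pred, mask_gt, depth_gt):
    """Weighted sum of soft Dice loss (mask) and pixel-wise MSE (depth)."""
    mask_loss = mask_criterion(mask_logits, mask_gt)
    depth_loss = depth_criterion(depth_pred, depth_gt)
    return 1 * mask_loss + 2 * depth_loss
```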
Pixel-wise accuracy was initially used as a metric, but since there are lots of black pixels it is easy to fool the metric into high values. Other options are IoU or the Dice coefficient; IoU suffers for a similar reason, since most pixels trivially match.
The Dice coefficient was chosen for the final runs, using the same `dice_coeff` implementation shown above.
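For reference, a short sketch of how the metric could be evaluated on a batch of predictions; thresholding the sigmoid output at 0.5 is an assumption:

```python
import torch

@torch.no_grad()
def batch_dice_score(mask_logits, mask_gt, threshold=0.5):
    """Dice score between thresholded predictions and ground-truth masks."""
    pred = (torch.sigmoid(mask_logits) > threshold).float()
    return dice_coeff(pred, mask_gt).item()
```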
Running the 64x64 images for the complete dataset for 10 epochs gave the below results:
- Mask Dice score: 0.87
- Depth map Dice score: 0.34

Logs and outputs are saved to tensorboard_logdir. Final output images:
Ground truth and predicted images:
- todo
Links: Saved models