Train a model to predict
- a mask for the foreground object (Background Subtraction)
- a depth map for the image (Monocular Depth Estimation)

given 2 images:
- a background (bg) image, and
- the same background with an object inside it, i.e. a foreground (fg) overlaid on the background (fg_bg)
This can be used mainly for security checks and CCTV footage:
- the mask can be used to check whether a person is in the frame and to track their movement
- the depth map can be used to determine whether the person has entered a restricted zone

The same idea can be extended to a parking lot scenario to assist with/monitor parking.
To run in real time, the inference time has to be very low so the model can keep up with a high frame rate.
The objective of this project is to get a working model with a small number of parameters (and a small model size).
A custom dataset was curated/created for the task at hand.
Starting from 100 background images and 100 transparent foreground images (with an alpha channel), a dataset was created which contains
- 100 images of empty office spaces, office reception lounges and office kitchens as backgrounds (bg)
- 100 images of people as foregrounds (fg)
- 100 corresponding masks for the fg images
- 400k foreground-on-background (fg_bg) images
  - generated by overlaying the foreground images (and their flips) on top of the backgrounds (a sketch of this step is shown below)
- 400k corresponding masks for the fg_bg images
  - generated by overlaying the corresponding foreground mask at the same position on a black canvas of the bg shape
- 400k corresponding depth images
  - generated using the pretrained weights of a SOTA monocular depth estimation model
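A minimal sketch of the overlay step using PIL; the file paths, random placement and helper name are illustrative assumptions, not the exact generation script:

```python
import random
from PIL import Image

def overlay_fg_on_bg(bg_path, fg_path, fg_mask_path):
    """Paste a transparent foreground onto a background and build the matching mask."""
    bg = Image.open(bg_path).convert("RGB")
    fg = Image.open(fg_path).convert("RGBA")        # keeps the alpha channel
    fg_mask = Image.open(fg_mask_path).convert("L")

    # Random top-left position so the foreground stays inside the background
    x = random.randint(0, bg.width - fg.width)
    y = random.randint(0, bg.height - fg.height)

    # fg_bg image: the alpha channel of fg is used as the paste mask
    fg_bg = bg.copy()
    fg_bg.paste(fg, (x, y), fg)

    # fg_bg mask: the fg mask pasted at the same position on a black canvas of bg shape
    mask = Image.new("L", bg.size, 0)
    mask.paste(fg_mask, (x, y))
    return fg_bg, mask
```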
Complete details can be found here.
The dataset is stored on Google Drive and can be accessed via this link.
The link redirects to a Google Drive folder which has 2 folders:
- Dataset (incomplete, due to I/O timeouts in Colab)
- Compressed_Dataset (the one used here; size: 1.8G)
with the below tree structure:

- Dataset
  - inp_bg
  - inp_fg
  - inp_fg_masks
  - depth_maps
    - bg1
      - image1
      - image2
      - ...
    - bg2
      - image1
      - image2
      - ...
    - bg3
    - ...
  - fg_bg
    - bg1
      - image1
      - image2
      - ...
    - bg2
      - image1
      - image2
      - ...
    - bg3
    - ...
  - fg_bg_masks
    - bg1
      - image1
      - image2
      - ...
    - bg2
      - image1
      - image2
      - ...
    - bg3
    - ...
- Compressed_Dataset
  - depth_maps
    - bg1.zip
    - bg2.zip
    - ...
  - fg_bg
    - bg1.zip
    - bg2.zip
    - ...
  - fg_bg_masks
    - bg1.zip
    - bg2.zip
    - ...
Calculated the mean and std for each of the image sets.
The standard deviation is estimated by averaging the per-batch std over mini-batches. While very close to the true std, it is not exact, and the approximation can be leveraged when there are time/computation limitations. It also works well in production settings where new data is added on a daily basis.
```python
from torch.utils.data import DataLoader

def get_batchwise_avg_mean_std(dataset, batch_size=50):
    """Approximate per-channel mean/std by averaging statistics over mini-batches."""
    print(len(dataset))
    loader = DataLoader(dataset,
                        batch_size=batch_size,
                        shuffle=True)
    mean = 0.
    std = 0.
    nb_samples = 0.
    for data in loader:
        batch_samples = data.size(0)
        # Flatten the spatial dimensions: (N, C, H*W)
        data = data.view(batch_samples, data.size(1), -1)
        mean += data.mean(2).sum(0)
        std += data.std(2).sum(0)
        nb_samples += batch_samples
    mean /= nb_samples
    std /= nb_samples
    # return mean, std
    print(mean, std)
```
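A hypothetical call, assuming a `fg_bg_dataset` that yields plain image tensors (the name is illustrative):

```python
# Approximate per-channel mean/std for normalization
get_batchwise_avg_mean_std(fg_bg_dataset, batch_size=50)
```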
Complete details can be found at dataset creation
Worked with a minimal set of 12k images to test the waters before diving in.
- ##### Dataloader
Each of the folders fg_bg, depth_maps and fg_bg_masks has a separate zip file for every bg. Images are read directly from the zip files to save on disk space and extraction time.
The list of all image paths is built from fg_bg, and when the dataset is indexed by the dataloader, the corresponding images from the other folders/zip files are looked up.
```python
# Inside the Dataset's __init__: collect every image name from the fg_bg zip files
for file in os.listdir(fg_bg_dir):
    fname = os.path.join(fg_bg_dir, file)
    if zipfile.is_zipfile(fname):
        self.fg_bg += [x.filename for x in zipfile.ZipFile(fname).infolist()]
```

```python
def __getitem__(self, index):
    # The fg_bg file name starts with the background it belongs to, e.g. 'bg1_...'
    bg = self.fg_bg[index].split('_')[0]
    bg_file = Path(self.data_root).joinpath('bg', bg + '.jpg')
    bg_img = np.array(Image.open(str(bg_file)))
    # The matching fg_bg image, mask and depth map are read straight from their zip files
    fg_bg_img = self.read_img_from_zip(f'{self.data_root}/fg_bg/{bg}.zip', self.fg_bg[index])
    mask_img = self.read_img_from_zip(f'{self.data_root}/fg_bg_masks/{bg}.zip', self.fg_bg[index])
    depth_img = self.read_img_from_zip(f'{self.data_root}/depth_maps/{bg}.zip', self.fg_bg[index])
```
A helper function is added to the dataset class to read a file from a zip archive as a PIL image for regular transforms, or as a NumPy array for Albumentations transforms.
```python
def read_img_from_zip(self, zip_name, file_name, array=True):
    # Read the raw image bytes directly from the zip archive
    imgdata = zipfile.ZipFile(zip_name).read(file_name)
    img = Image.open(io.BytesIO(imgdata))
    # img = img.convert("RGB")
    if array:
        # NumPy array for Albumentations transforms
        return np.array(img)
    # PIL image for torchvision transforms
    return img
```
Data is split into train and test sets with an 80:20 split.

```python
tr_size = int(0.8 * len(dataset))
tst_size = len(dataset) - tr_size
# random_split returns two Dataset objects (wrapped in DataLoaders afterwards)
train_ds, test_ds = torch.utils.data.random_split(dataset, [tr_size, tst_size])
```
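The split datasets can then be wrapped in DataLoaders; the batch size and worker count below are illustrative assumptions:

```python
from torch.utils.data import DataLoader

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2)
test_dl = DataLoader(test_ds, batch_size=64, shuffle=False, num_workers=2)
```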
- ##### Basic transforms are set for each of the image sets
Transformations used: RandomCrop, HorizontalFlip, Resize (64x64), Normalization.
Used the Albumentations library for the transformations. Read about its advantages here and here.
Albumentations doesn't support loading PIL images directly and works with NumPy arrays, so the dataset class has to be modified accordingly. I found this notebook very useful as a guide.
Albumentations also supports passing masks/weights along with the original image so that the same spatial transforms are applied to them (example here), or we can create our own custom additional targets for the Compose, which is useful when we have multiple images that are not linked to each other but need the same transforms (a sketch follows below).
Examples of some of the transforms applied to a segmentation problem can be found in this notebook.
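A minimal sketch of such a pipeline with an extra depth target; the normalization values here are placeholders, not the computed dataset statistics:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Same spatial transforms applied to the fg_bg image, its mask and its depth map.
# 'depth' is registered as an extra mask-like target so it is warped identically.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Resize(64, 64),
        A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # placeholder stats
        ToTensorV2(),
    ],
    additional_targets={"depth": "mask"},
)

augmented = transform(image=fg_bg_img, mask=mask_img, depth=depth_img)
fg_bg_t, mask_t, depth_t = augmented["image"], augmented["mask"], augmented["depth"]
```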
- ##### Setup basic model
The model is expected to give us a mask and a depth map for the foreground given the 2 images.
Could we use a single fg_bg image and still predict both?
Would that change the problem scope? Instead of identifying the foreground object in a given background image, the model would be trying to find the mask and depth in any general setting.
Will experiment with that as well if time permits.
Using as inputs:
- only fg_bg
- both bg and fg_bg

Predicting:
- only the mask
- only the depth map
- both the mask and the depth map (see the sketch after this list)
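A minimal sketch of the "both bg and fg_bg in, mask and depth out" variant; the class name, channel widths and depth are illustrative assumptions, not the actual architecture used:

```python
import torch
import torch.nn as nn

class MaskDepthNet(nn.Module):
    """Toy encoder with two prediction heads; the two inputs are concatenated on channels."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.mask_head = nn.Conv2d(64, 1, 1)   # 1-channel mask logits
        self.depth_head = nn.Conv2d(64, 1, 1)  # 1-channel depth map

    def forward(self, bg, fg_bg):
        x = torch.cat([bg, fg_bg], dim=1)      # (N, 6, H, W)
        feats = self.encoder(x)
        return self.mask_head(feats), self.depth_head(feats)

# mask_logits, depth_pred = MaskDepthNet()(bg_batch, fg_bg_batch)
```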
After resizing to 64x64, the images become pixelated and sharp edges are no longer available. The RF for gradients/edges was set at 5 pixels.
Since the output also has to be the same size as the input, we either have to use transpose convolutions or maintain the size throughout, without any maxpool/stride-2 downsampling. Without any stride/maxpool, getting a receptive field of the image size in the final layer requires a lot of convolutional layers.
Used group convolutions with 2 groups of 3 channels each for the first few convolutions, the idea being that the network can initially learn low-level features from both images separately (a sketch of such a layer is shown below).
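A minimal sketch of what such a grouped first layer could look like (the output width is an assumption):

```python
import torch.nn as nn

# bg and fg_bg are concatenated into a 6-channel input; groups=2 keeps the first
# 3 channels (bg) and the last 3 channels (fg_bg) on separate sets of filters,
# so early low-level features are learned from each image independently.
first_conv = nn.Conv2d(in_channels=6, out_channels=32, kernel_size=3,
                       padding=1, groups=2)
```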
The initial network was created without using transpose conv/deconv. Accounting for the RF, below is a brief network summary; it is a heavy model with more parameters.
Final layer RF: 70

```text
Total params: 3,769,664
Trainable params: 3,769,664
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.09
Forward/backward pass size (MB): 462.03
Params size (MB): 14.38
Estimated Total Size (MB): 476.51
```
- Training took around 40 mins per epoch on a 16 GB Tesla P100
- ##### Setup TensorBoard on Colab
Used the below extension and magic function to access TensorBoard. This opens TensorBoard in the cell output.

```python
# Load the TensorBoard notebook extension
%load_ext tensorboard
logs_base_dir = 'logs'
%tensorboard --logdir {logs_base_dir}
```
To write to TensorBoard we can use `SummaryWriter` from `torch.utils.tensorboard`:

```python
# TensorBoard support
from torch.utils.tensorboard import SummaryWriter
```
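A brief sketch of how the writer could be used to log losses and prediction grids each epoch; the tag names and helper are illustrative assumptions:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='logs/run1')

def log_epoch(epoch, mask_loss, depth_loss, mask_logits, depth_pred):
    """Log scalar losses and prediction image grids for one epoch."""
    writer.add_scalar('loss/mask', mask_loss, epoch)
    writer.add_scalar('loss/depth', depth_loss, epoch)
    # add_images expects NCHW tensors with values in [0, 1]
    writer.add_images('pred/mask', torch.sigmoid(mask_logits), epoch)
    writer.add_images('pred/depth', depth_pred.clamp(0, 1), epoch)
```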
- ##### Make sure model is training
The model was run for 2 epochs without any issues.
Different model architectures involving transpose convolutions were tried out.
An architecture search showed that similar problems use one form or another of U-Net. The below two architectures were taken as starting points:
- U-Net: arXiv pdf; GitHub code; article
- Deep Residual U-Net: arXiv pdf; GitHub code (PyTorch); notebook (Keras)
Architectures experimented with:
- Transpose convolutions with a width of 2
ConvTranspose is a convolution with trainable kernels, while Upsample is a simple interpolation (bilinear, nearest, etc.): transpose convolutions have learnable parameters whereas upsampling has none. Upsampling can make inference and training faster since there are no weights to update or gradients to compute, but since the input images are already pixelated, transpose convolutions were used, at the cost of additional parameters and model size (a small comparison is sketched below).
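A small illustrative comparison of the two options (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# Learnable 2x upsampling: trainable kernels, extra parameters
deconv = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
print(deconv(x).shape, sum(p.numel() for p in deconv.parameters()))
# torch.Size([1, 64, 32, 32]) 16448

# Parameter-free 2x upsampling: plain interpolation
up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
print(up(x).shape)  # torch.Size([1, 64, 32, 32])
```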
Final layer RF: 124

```text
================================================================
Total params: 1,190,272
Trainable params: 1,190,272
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.09
Forward/backward pass size (MB): 139.16
Params size (MB): 4.54
Estimated Total Size (MB): 143.80
```
- Training took around 15 mins per epoch on a 16 GB Tesla P100
- Modified DeepResUNet architecture.
Have to revisit this architecture to reduce training memory.
```text
================================================================
Total params: 1,765,421
Trainable params: 1,765,421
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.09
Forward/backward pass size (MB): 43093.52
Params size (MB): 6.73
Estimated Total Size (MB): 43100.34
----------------------------------------------------------------
```
Proceeding with ConvTranspose as the results seem good with limited training.
All the above models were trained with BCE loss for both the mask and the depth maps. While the results look promising, the masks are not clean and the depth maps still miss out on details.
After learning about and experimenting with a few losses, soft Dice loss was chosen for the masks, and pixel-wise MSE loss was chosen for the depth maps over SSIM and MS-SSIM, since SSIM requires choosing a window width.
Below are the loss functions used:
- Dice-coefficient-based Soft Dice Loss for the mask. Read my post about the loss [here]
- MSELoss, preferred over Multi-Scale Structural Similarity loss, for the depth maps - implementation

Depth maps need more training and hence their loss is given more weight:

```python
# mask_loss  = soft Dice loss
# depth_loss = MSE loss
# Final loss function
loss = 1 * mask_loss + 2 * depth_loss
```
Dice loss implementation:

```python
import torch
import torch.nn as nn

def dice_coeff(pred, target):
    smooth = 1.
    num = pred.size(0)
    m1 = pred.view(num, -1)    # Flatten
    m2 = target.view(num, -1)  # Flatten
    intersection = (m1 * m2).sum()
    return (2. * intersection + smooth) / (m1.sum() + m2.sum() + smooth)

class SoftDiceLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        super(SoftDiceLoss, self).__init__()

    def forward(self, logits, targets):
        probs = torch.sigmoid(logits)  # F.sigmoid is deprecated
        # dice_coeff already aggregates over the whole batch,
        # so the loss is simply 1 minus the coefficient
        score = dice_coeff(probs, targets)
        return 1 - score
```
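A rough sketch of how the two losses could be combined, matching the 1:2 weighting above (the variable names are illustrative):

```python
import torch.nn as nn

mask_criterion = SoftDiceLoss()
depth_criterion = nn.MSELoss()

def compute_loss(mask_logits, depth_pred, mask_gt, depth_gt):
    """Weighted sum of soft Dice loss (mask) and pixel-wise MSE (depth)."""
    mask_loss = mask_criterion(mask_logits, mask_gt)
    depth_loss = depth_criterion(depth_pred, depth_gt)
    return 1 * mask_loss + 2 * depth_loss
```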
Pixel-wise accuracy was initially used as a metric, but since there are lots of black pixels it is easy to fool the metric into high values. Other options are IoU or the Dice coefficient; IoU suffers for a similar reason, since most pixels trivially match.
The Dice coefficient was chosen for the final runs, using the same `dice_coeff` implementation shown above.
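For reference, a short sketch of how the metric could be evaluated on a batch of predictions; thresholding the sigmoid output at 0.5 is an assumption:

```python
import torch

@torch.no_grad()
def batch_dice_score(mask_logits, mask_gt, threshold=0.5):
    """Dice score between thresholded predictions and ground-truth masks."""
    pred = (torch.sigmoid(mask_logits) > threshold).float()
    return dice_coeff(pred, mask_gt).item()
```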
Running the 64x64 images for the complete dataset for 10 epochs gave the below results:
- Mask Dice score: 0.87
- Depth map Dice score: 0.34

Logs and outputs are saved to tensorboard_logdir. Final output images:
Ground truth and predicted images:
- todo
Links: Saved models