"A picture is worth a thousand words, but sometimes we actually want the words."
An Image Captioning Model using PyTorch
Above is a word cloud generated from all the captions in the dataset.
It has been trained on the Flickr30k dataset.
From: https://www.kaggle.com/datasets/adityajn105/flickr30k
- Encoder: Uses a pre-trained ResNet101 model (IMAGENET1K_V2 weights) from PyTorch to extract a feature vector from the image. It can also be fine-tuned on the dataset images.
- Decoder: Uses an LSTM network with 3 layers and a hidden size of 264 to generate the caption token by token (see the sketch after this list).
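Below is a minimal sketch of how such an encoder-decoder pairing can be wired up in PyTorch. The class names and the `embed_size`/`vocab_size` parameters are illustrative assumptions, not the repository's exact code; only the ResNet101 backbone, the IMAGENET1K_V2 weights, and the 3-layer LSTM with hidden size 264 come from the description above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pre-trained ResNet101 with its classifier head swapped for a linear projection."""
    def __init__(self, embed_size, fine_tune=False):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V2)
        for param in resnet.parameters():
            param.requires_grad = fine_tune   # optionally fine-tune on the dataset images
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final FC layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.backbone(images).flatten(1)   # (batch, 2048)
        return self.fc(features)                      # (batch, embed_size)

class DecoderLSTM(nn.Module):
    """3-layer LSTM that unrolls a caption conditioned on the image feature."""
    def __init__(self, embed_size, vocab_size, hidden_size=264, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: prepend the image feature as the first "token" of the sequence.
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                       # (batch, seq_len, vocab_size)
```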
Trained for 35 epochs on a 24 GB GPU, which took more than 2 hours.
All of the code, including data augmentation, training, and testing, can be found here: Notebook
- Average BLEU score: 3.56
- Training Loss: 2.2997
- Validation Loss: 2.6188
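For context on the BLEU score above and the first improvement listed below: one common way to score a generated caption against all of an image's human captions is NLTK's `corpus_bleu`. The sketch below is illustrative only and is not the notebook's actual evaluation code.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate_bleu(references_per_image, generated_captions):
    """Corpus-level BLEU-4, using every human caption of an image as a reference.

    references_per_image: list (one entry per image) of lists of tokenized captions,
        e.g. [[["a", "dog", "runs"], ["dog", "running", "outside"]], ...]
    generated_captions:   list of tokenized model outputs, one per image.
    """
    smoothing = SmoothingFunction().method1  # avoids zero scores when an n-gram order is unmatched
    return corpus_bleu(references_per_image, generated_captions,
                       smoothing_function=smoothing)
```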
- Improve the BLEU score calculation by considering all the captions of an image as references (as in the evaluation sketch above).
- Implement an attention mechanism between the encoder and decoder for better results.
- Run a hyperparameter sweep to find the best hyperparameters.
- Add a function to directly upload a raw image and generate a caption for it.
- Implement beam search while generating captions (a sketch follows this list).
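A minimal sketch of beam search decoding is shown below. The `decoder.init_state(feature)` and `decoder.step(token, states)` methods are hypothetical placeholders for whatever single-step interface the trained decoder exposes; the beam bookkeeping itself is the standard algorithm.

```python
import torch
import torch.nn.functional as F

def beam_search(decoder, feature, start_id, end_id, beam_width=5, max_len=30):
    """Keep the `beam_width` most probable partial captions at every step.

    `decoder.init_state` and `decoder.step` are assumed helper methods:
    init_state(feature) -> initial LSTM states conditioned on the image feature;
    step(token_ids, states) -> (vocab_logits, new_states).
    """
    # Each beam: (token sequence, cumulative log-prob, LSTM states).
    beams = [([start_id], 0.0, decoder.init_state(feature))]
    for _ in range(max_len):
        candidates = []
        for seq, score, states in beams:
            if seq[-1] == end_id:                  # finished beams carry over unchanged
                candidates.append((seq, score, states))
                continue
            logits, new_states = decoder.step(torch.tensor([seq[-1]]), states)
            log_probs = F.log_softmax(logits.squeeze(0), dim=-1)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [tok], score + lp, new_states))
        # Keep only the best `beam_width` hypotheses overall.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _, _ in beams):
            break
    return beams[0][0]                             # token ids of the highest-scoring caption
```

Compared to greedy decoding, this keeps several candidate captions alive at each step and so can recover from a locally suboptimal first word at the cost of `beam_width` times more decoder calls.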