This is an implementation for vision-and-language multimodal research, developed on top of Pythia.
We propose a few major improvements over Pythia, which is an implementation of the LoRRA model. The dynamic answer space is expanded by adding an instance segmentation module. We have also replaced the existing OCR with a spell-correcting OCR and added the spatial features of the OCR output. Finally, we have applied n-grams to the modified OCR output.
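As a rough illustration of the n-gram and spatial-feature ideas, here is a minimal sketch and not the code used in this repository. It assumes that n-grams are built from consecutive OCR tokens in reading order and that the spatial feature of each token is a bounding box `(x_min, y_min, x_max, y_max)`. The names `merge_boxes` and `ocr_ngrams` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed spatial feature: an axis-aligned box (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def merge_boxes(boxes: Sequence[Box]) -> Box:
    """Return the smallest box that encloses every box in `boxes`."""
    x_mins, y_mins, x_maxs, y_maxs = zip(*boxes)
    return (min(x_mins), min(y_mins), max(x_maxs), max(y_maxs))


def ocr_ngrams(
    tokens: Sequence[str], boxes: Sequence[Box], max_n: int = 2
) -> List[Tuple[str, Box]]:
    """Build all n-grams (n = 1..max_n) over consecutive OCR tokens.

    Each n-gram is paired with the box enclosing its member tokens, so a
    multi-word candidate answer keeps a spatial feature of its own.
    """
    if len(tokens) != len(boxes):
        raise ValueError("tokens and boxes must have the same length")
    candidates: List[Tuple[str, Box]] = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            text = " ".join(tokens[start:start + n])
            candidates.append((text, merge_boxes(boxes[start:start + n])))
    return candidates


# Hypothetical OCR output for a road sign.
example = ocr_ngrams(["stop", "sign"], [(0, 0, 10, 5), (12, 0, 20, 6)])
```

In this sketch the bigram "stop sign" becomes a single candidate with one enclosing box, which is one way a multi-word answer could enter the dynamic answer space.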
For testing, run our notebook.