Thank you for sharing your work!
I have a question: why are different datasets used to train each module? For example, the VTRN dataset is used to train the large-scale retrieval module, while the RSSDIVCS dataset is used to train the fine-grained matching module.
If the modules are trained on different datasets, how does the "shared encoder" work, given that images from the Large-Scale Retrieval module are passed on to the Fine-Grained Matching module?
Also, the dataset link you provided does not include the VTRN dataset. Are the ICRA2022 dataset and the VTRN dataset the same?
Thank you.