Big gap when reproducing the reported results

Thank you for your great work! I have encountered sizable gaps when reproducing the reported results of some baseline models.
For example, the reported avg QA GPT-acc of llava-1.5 is 17.18, but I only get 11.46 when I try to reproduce it.


Could you kindly release the evaluation prompts and scripts used for the baseline models?


Thanks for your time and effort.