Training code to optimize the reward model in 2 stages

Thanks for sharing the code!

However, I could not find any code for fine-tuning the multimodal LLM and the linear regression model in stage 1 and stage 2.

For stage 1, do the training configurations align with the standard CogVLM2-Video fine-tuning setup? Is the MLLM in this stage fully fine-tuned, or trained with LoRA only?