Thanks for sharing the code!
However, I could not find any code for fine-tuning the multimodal LLM or the linear regression model in stage 1 and stage 2.
For stage 1, do the training configurations align with the standard CogVLM2-Video fine-tuning? Is the MLLM in this stage fully fine-tuned, or trained with LoRA only?