[QUESTION] LLaVA model_type, pipeline parallel training #1078
Unanswered
KookHoiKim asked this question in Q&A
I'm trying to train LLaVA with pipeline parallelism (TP=1, PP=2). Although I followed the instructions, the run fails (TP=2, PP=1 works), and I found some points in the code that look odd to me.
In my understanding, the vision encoder / vision projector is an additional embedding component that is only used in the `pre_process` part. However, the LLaVA model is initialized with the `encoder_and_decoder` model_type. Why not `encoder_or_decoder`?

Furthermore, during pipeline-parallel communication, the send/recv tensor shape is set to `(num_image_token, B, hidden_size)`. It looks as if the shards exchange vision embeddings, not intermediate hidden states from the middle of the language model.
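To make the shapes concrete, here is a small sketch of what I observe versus what I expected; all sizes here are made up for illustration and are not taken from the actual config:

```python
# Hypothetical sizes, purely for illustration.
B, H = 2, 4096
num_image_tokens = 576   # e.g. a 24x24 patch grid
num_text_tokens = 128

# What I observe being communicated between pipeline stages,
# in Megatron's usual [sequence, batch, hidden] layout:
observed_shape = (num_image_tokens, B, H)

# What I expected: hidden states for the full combined sequence,
# i.e. text tokens plus the spliced-in image embeddings.
expected_shape = (num_text_tokens + num_image_tokens, B, H)

print(observed_shape)   # (576, 2, 4096)
print(expected_shape)   # (704, 2, 4096)
```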
P.S. Currently I do not use `encoder_pipeline_model_parallel_size` / a tensor-parallel size, because it triggers an error while initializing Megatron: the world size is not divisible, i.e. `world_size % total_model_size != 0`. So I forced `vision_config.pipeline_model_parallel_size` to 1.

I am not familiar with the Megatron code and would really appreciate some help with LLaVA training.
Thank you.