Feature Request: Support for Phi4MMForCausalLM Architecture #12117
Comments
Depends on #11292
Anything I can do to assist getting this merged? I'll pull the branch and play around with it. Still in development from the looks of it, and not ready for merge yet?
OK, so Phi-4-multimodal-instruct is a bit more messy. Traditional vision models are simple: just 2 separate transformers, one for the vision encoder and one for the language decoder. On Phi-4, however, embedding data from the vision/audio encoders must also be processed by a dedicated LoRA adapter applied on top of the language decoder (see the sketch after the second diagram).

Normal vision models:

```mermaid
flowchart TD
    image --> vision_transformer
    vision_transformer[[vision_transformer]] --> embd_input
    text_input --> embd_input
    embd_input --> text_transformer[[text_transformer]]
    text_transformer --> text_output
```
Phi-4 multimodal:

```mermaid
flowchart TD
    image --> vision_transformer[[vision_transformer]]
    vision_transformer --> embd_input
    audio --> audio_transformer[[audio_transformer]]
    audio_transformer --> embd_input
    text_input --> embd_input
    embd_input --> text_transformer
    subgraph text_transformer
        vision_LoRA[[vision_LoRA]]
        audio_LoRA[[audio_LoRA]]
        base_model[[base_model]]
    end
    text_transformer --> text_output
```
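To make the flow above concrete, here is a rough numpy sketch of the idea: encoder outputs are spliced into the text embedding sequence at placeholder positions, and the decoder applies a modality-specific LoRA delta on top of its base weights. All names, shapes, and the placeholder mechanism here are illustrative assumptions, not llama.cpp or Phi-4 code.

```python
# Illustrative sketch only: encoder embeddings replace placeholder slots in the
# text embedding sequence, and a per-modality LoRA delta (W + B @ A) is applied
# inside the decoder. Shapes and names are made up for the example.
import numpy as np

hidden = 64   # decoder hidden size (illustrative)
rank = 8      # LoRA rank (illustrative)

W_base = np.random.randn(hidden, hidden) * 0.02
lora = {
    "vision": (np.random.randn(hidden, rank) * 0.02, np.random.randn(rank, hidden) * 0.02),
    "audio":  (np.random.randn(hidden, rank) * 0.02, np.random.randn(rank, hidden) * 0.02),
}

def decoder_proj(x, modality=None):
    """Base projection, plus the modality LoRA delta when multimodal input is present."""
    y = x @ W_base
    if modality is not None:
        B, A = lora[modality]
        y = y + x @ B @ A
    return y

# Text embeddings with placeholder slots reserved for image tokens.
n_text, n_img = 10, 4
text_embd = np.random.randn(n_text, hidden)
image_embd = np.random.randn(n_img, hidden)   # output of the vision transformer + projector
placeholder_pos = [3, 4, 5, 6]                # positions of the <image> placeholder tokens

embd_input = text_embd.copy()
embd_input[placeholder_pos] = image_embd      # splice encoder output into the sequence

out = decoder_proj(embd_input, modality="vision")
print(out.shape)  # (10, 64)
```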
Diagram from the paper:

For now, I've only been able to convert the text/language part. It turns out it's just a simple Phi-4-mini-instruct under the hood, so nothing interesting. This is also mentioned in the paper:

While llama.cpp already has support for LoRA, implementing this in a way that it just works out of the box is quite complicated at the moment. So I think we should wait a bit more and keep an eye on:
Thanks, I was looking in the wrong places for the paper. Digging into the layers, I see two additional tensor types that may be missing as well, e.g. `model.embed_tokens_extend.audio_embed.encoder.encoder_embedding.global_mean`. It looks like ggml has `GGML_OP_MEAN`, but nothing for "invstd". I can only assume that is inverse standard deviation; I will need to read the paper. There are also a couple of other layers where I'm not certain I matched the tensors correctly. @ngxson, one quick question though: with the branch you are working on for vision, is there a reason you separate out tensor types for vision?
I don't know enough about audio processing to answer your question, unfortunately. In addition, I think the infrastructure needed to process audio input is not yet there in llama.cpp.

Re: why vision tensors are quantized differently, this is because (1) some ops in ggml only support f16 or f32 (IIRC I got some problems with `ggml_add`), and (2) most models I was working with have a small vision tower, which is very prone to error when quantized below q8_0.
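To make that second point concrete, a conversion script could apply a simple per-tensor rule like the sketch below. The `vision_model.` prefix and the helper itself are hypothetical and do not reflect llama.cpp's actual conversion logic.

```python
# Hypothetical per-tensor type selection for a GGUF-style conversion script.
# The name prefix and type strings are illustrative only.
import numpy as np

def choose_tensor_type(name: str, data: np.ndarray, default: str = "q4_k") -> str:
    # 1-D tensors (norms, biases) are typically kept in f32 anyway.
    if data.ndim == 1:
        return "f32"
    # Keep the (small) vision tower at q8_0 or better: some ggml ops want
    # f16/f32 inputs, and low-bit quantization hurts small towers badly.
    if name.startswith("vision_model."):
        return "q8_0"
    return default

print(choose_tensor_type("vision_model.blk.0.attn_q.weight", np.zeros((8, 8))))  # q8_0
print(choose_tensor_type("blk.0.ffn_up.weight", np.zeros((8, 8))))               # q4_k
```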
Small update: you should also look at the reference Python implementation. It seems like the mean and invstd for audio is simply:
Which is roughly equivalent to
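The reference snippets mentioned above are not reproduced here, so the following is only a minimal numpy sketch of the usual interpretation of `global_mean` / `global_invstd` as precomputed per-feature normalization statistics, not a copy of the reference code.

```python
# Sketch of per-feature global_mean / global_invstd normalization.
# This is an assumption about what the elided reference code does.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(1000, 80))  # e.g. (frames, mel bins)

# Precomputed dataset statistics, stored as constant tensors in the model.
global_mean = features.mean(axis=0)
global_invstd = 1.0 / features.std(axis=0)

# At inference time the encoder input is just an affine transform...
normalized = (features - global_mean) * global_invstd

# ...which is the familiar (x - mean) / std standardization.
assert np.allclose(normalized, (features - global_mean) / features.std(axis=0))
print(normalized.mean(), normalized.std())  # ~0 and ~1
```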
Prerequisites
Feature Description
Unable to convert Phi-4-multimodal-instruct
Motivation
Phi-4-multimodal-instruct is not supported
Possible Implementation
No response