Feature Request: Support for Phi4MMForCausalLM Architecture #12117

Open
ns3284 opened this issue Feb 28, 2025 · 6 comments
Labels: enhancement (New feature or request)

ns3284 commented Feb 28, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Unable to convert Phi-4-multimodal-instruct to GGUF; the Phi4MMForCausalLM architecture is not supported.

Motivation

Phi-4-multimodal-instruct is not supported

Possible Implementation

No response

ns3284 added the enhancement label Feb 28, 2025

ngxson (Collaborator) commented Feb 28, 2025

Depends on #11292

ns3284 (Author) commented Feb 28, 2025

> Depends on #11292

Is there anything I can do to assist in getting this merged?

I'll pull the branch and play around with it. It's still in development from the looks of it, and not ready for merge yet?

ngxson (Collaborator) commented Mar 1, 2025

OK, so Phi-4-multimodal-instruct is a bit messier.

Traditional vision models are simple: just 2 separate transformers, one for the vision encoder and one for the language decoder. On Phi-4, however, the embedding data from the vision/audio encoders must also be processed by a dedicated LoRA adapter applied on top of the language decoder (a rough sketch of the idea at the matmul level follows the diagrams below).

Normal vision models:

flowchart TD
  image --> vision_transformer
  vision_transformer[[vision_transformer]] --> embd_input
  text_input --> embd_input
  embd_input --> text_transformer[[text_transformer]]
  text_transformer --> text_output

Phi-4 multimodal:

flowchart TD
  image --> vision_transformer[[vision_transformer]]
  vision_transformer --> embd_input
  audio --> audio_transformer[[audio_transformer]]
  audio_transformer --> embd_input
  text_input --> embd_input
  embd_input --> text_transformer
  subgraph text_transformer
    vision_LoRA[[vision_LoRA]]
    audio_LoRA[[audio_LoRA]]
    base_model[[base_model]]
  end
  text_transformer --> text_output
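
To make that concrete, here is a minimal sketch of how a per-modality LoRA delta could be combined with the base weight at the matmul level in ggml. This is only an illustration, not llama.cpp's actual LoRA code: the helper, tensor names, and shapes are assumptions, and the exact ggml API may differ slightly between versions.

    #include "ggml.h"

    // Hypothetical helper: y = W*x + scale * (lora_b * (lora_a * x)),
    // i.e. the base decoder projection plus a modality-specific low-rank delta.
    static struct ggml_tensor * mul_mat_with_lora(
            struct ggml_context * ctx,
            struct ggml_tensor  * W,       // base weight   [n_in, n_out]
            struct ggml_tensor  * lora_a,  // LoRA A        [n_in, r]
            struct ggml_tensor  * lora_b,  // LoRA B        [r, n_out]
            struct ggml_tensor  * x,       // activations   [n_in, n_tokens]
            float                 scale) { // typically alpha / r
        // base projection: [n_out, n_tokens]
        struct ggml_tensor * base  = ggml_mul_mat(ctx, W, x);
        // low-rank path: A*x -> [r, n_tokens], then B*(A*x) -> [n_out, n_tokens]
        struct ggml_tensor * delta = ggml_mul_mat(ctx, lora_b, ggml_mul_mat(ctx, lora_a, x));
        // add the scaled delta on top of the base projection
        return ggml_add(ctx, base, ggml_scale(ctx, delta, scale));
    }

The math itself is simple; the tricky part is automatically selecting which adapter (vision, audio, or none) applies for a given batch, rather than having the user pass LoRA files manually.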

Diagram from the paper:

[image: architecture diagram from the paper]

For now, I've been able to convert only the text/language part. It turns out it's just a simple Phi-4-mini-instruct under the hood, so nothing interesting.

This is also mentioned in the paper:

[image: excerpt from the paper]

While llama.cpp already has support for LoRA, implementing this in a way that it just works out-of-the-box is quite complicated at the moment.

So I think we should wait a bit more and keep an eye on:

nisparks (Contributor) commented Mar 2, 2025

Thanks, I was looking in the wrong places for the paper. Digging into the layers, I see two additional tensor types that may be missing as well.

model.embed_tokens_extend.audio_embed.encoder.encoder_embedding.global_mean
model.embed_tokens_extend.audio_embed.encoder.encoder_embedding.global_invstd

Looks like ggml has GGML_OP_MEAN, but nothing for "invstd". I can only assume that is Inverse Standard Deviation; I will need to read the paper. There are also a couple of other layers where I'm not certain I matched the tensors correctly.
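
If that reading is right, global_invstd would just be a precomputed per-feature tensor (the reciprocal of the standard deviation), presumably along the lines of

$$\mathrm{invstd}_i = \frac{1}{\sigma_i}, \qquad y_i = (x_i - \mu_i)\cdot\mathrm{invstd}_i$$

so it should not need a dedicated ggml op, only an element-wise subtract and multiply.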

@ngxson one quick question though: with the branch you are working on for vision, is there a reason you separate out the tensor types for vision?

ngxson (Collaborator) commented Mar 2, 2025

I don't know enough about audio processing to answer your question, unfortunately. In addition, I think the infrastructure needed to process audio input is not yet there in llama.cpp.

Re: why vision tensors are quantized differently, this is because (1) some ops in ggml only support f16 or f32 (IIRC I ran into problems with ggml_add), and (2) most models I was working with have a small vision tower, which is very prone to error when quantized below q8_0.

ngxson (Collaborator) commented Mar 2, 2025

Small update: you should also look at the reference Python implementation.

So it seems like the mean and invstd for audio are simply:

    def forward(self, input_: Tensor) -> Tensor:
        """MeanVarianceNormLayer Forward
        Args:
            input_: torch.Tensor
                input tensor.
        """
        return (input_ - self.global_mean) * self.global_invstd

Which is roughly equivalent to ggml_mul(global_invstd, ggml_sub(input_, global_mean)) in ggml.
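
For illustration, here is a small self-contained ggml sketch of that expression with toy sizes and values. This is only a sketch, not llama.cpp code; the exact headers and compute API may differ between ggml versions.

    #include "ggml.h"
    #include "ggml-cpu.h" // recent ggml versions declare the CPU compute functions here
    #include <stdio.h>

    // Sketch: y = (x - global_mean) * global_invstd,
    // i.e. the MeanVarianceNormLayer forward expressed as a ggml graph.
    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16*1024*1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        const int n_feat = 4; // toy feature size

        struct ggml_tensor * x      = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_feat);
        struct ggml_tensor * mean   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_feat);
        struct ggml_tensor * invstd = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_feat);

        // toy values; in the real model global_mean/global_invstd come from the checkpoint
        for (int i = 0; i < n_feat; ++i) {
            ((float *) x->data)[i]      = (float) i;
            ((float *) mean->data)[i]   = 1.0f;
            ((float *) invstd->data)[i] = 0.5f;
        }

        // (x - mean) * invstd
        struct ggml_tensor * y = ggml_mul(ctx, ggml_sub(ctx, x, mean), invstd);

        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, y);
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

        for (int i = 0; i < n_feat; ++i) {
            printf("y[%d] = %f\n", i, ((float *) y->data)[i]);
        }

        ggml_free(ctx);
        return 0;
    }

So no new op should be needed, just a ggml_sub followed by a ggml_mul.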
