RFC: Improve developer experience by anchoring on multimodal use-case #8485

mergennachin · 2024-11-26T21:23:17Z

mergennachin
Nov 26, 2024
Collaborator

🚀 The feature, motivation and pitch

Let's build an example demo app, perhaps in pytorch-labs, which will become a forcing function to improve developer experience from a user perspective. A positive outcome of this demo app is to define and build new higher level abstractions (e.g., similar to Pipelines).

At a high level, here's the app we would like to build: an LLM with voice input and output. In terms of implementation, it's a three-step process:

Given a voice input, convert to text (e.g., Whisper)
Run text based LLM (e.g., Llama 1B)
Convert text output to voice (e.g., using T5)
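
As a rough illustration of the three steps above, here is a minimal Python sketch of what such a pipeline abstraction might look like. Everything in it (`VoicePipeline`, the stage type aliases, and the `fake_*` stand-ins) is hypothetical and is not an existing ExecuTorch or torchchat API; the stand-ins only exist so the sketch runs without model weights.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Each stage is a plain callable, so any model wrapper with the same
# signature can be dropped in. All names here are hypothetical.
SpeechToText = Callable[[bytes], str]
TextToText = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass(frozen=True)
class VoicePipeline:
    speech_to_text: SpeechToText  # step 1, e.g., a Whisper wrapper
    text_llm: TextToText          # step 2, e.g., a Llama 1B wrapper
    text_to_speech: TextToSpeech  # step 3, a text-to-speech wrapper

    def __call__(self, audio: bytes) -> bytes:
        prompt = self.speech_to_text(audio)
        reply = self.text_llm(prompt)
        return self.text_to_speech(reply)


# Stand-in stages: they only move bytes and strings around.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_asr, fake_llm, fake_tts)

# Swapping one stage leaves the other two untouched.
shouting = replace(pipeline, text_llm=lambda prompt: prompt.upper())
```

Calling `pipeline(b"hello")` runs all three stages in order, and `shouting` shows that replacing the middle model does not require touching the speech stages.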

Here are the requirements:

Be able to run on iOS, Android and desktop.
Be able to prototype the e2e flow in Python and HuggingFace first.
Be able to deploy easily on a laptop without a Python runtime, for testing purposes.
Be able to swap underlying models easily (e.g., Whisper -> Seamless, Llama 1B -> Qwen).
Easy to swap Sampler/Tokenizer/KVCache implementations in the LLM, perhaps using this issue.
Easy deployment process for mobile and desktop apps.
Everything in OSS.
Easy to optimize performance and debug (e.g., use mobile accelerators, quantization).
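
To make the "easy to swap Sampler" requirement a bit more concrete, here is a small Python sketch of a generation loop that takes the sampler as a parameter. The `Sampler` protocol, both sampler classes, `generate`, and `toy_step` are all hypothetical names invented for illustration; `toy_step` is a toy stand-in for a model forward pass, and the top-k sampler picks uniformly among the top k tokens, which is a simplification of real top-k sampling.

```python
import random
from typing import Callable, Protocol, Sequence


class Sampler(Protocol):
    """Hypothetical interface: pick the next token id from logits."""

    def sample(self, logits: Sequence[float]) -> int: ...


class GreedySampler:
    def sample(self, logits: Sequence[float]) -> int:
        # Index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])


class TopKSampler:
    def __init__(self, k: int, seed: int = 0) -> None:
        self.k = k
        self.rng = random.Random(seed)

    def sample(self, logits: Sequence[float]) -> int:
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        # Simplification: uniform choice among the k highest-scoring tokens.
        return self.rng.choice(ranked[: self.k])


def generate(
    step: Callable[[list[int]], Sequence[float]],
    sampler: Sampler,
    prompt: list[int],
    max_new_tokens: int,
) -> list[int]:
    """Append max_new_tokens tokens, delegating the choice to the sampler."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(sampler.sample(step(tokens)))
    return tokens


def toy_step(tokens: list[int]) -> list[float]:
    # Toy "model" over a 4-token vocabulary: favors (last token + 1) mod 4.
    target = (tokens[-1] + 1) % 4
    return [1.0 if i == target else 0.0 for i in range(4)]
```

The same `generate` call works with either sampler, e.g. `generate(toy_step, GreedySampler(), [0], 3)` or `generate(toy_step, TopKSampler(k=2), [0], 3)`; a tokenizer or KV cache could be made pluggable in the same way.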

Here's a positive outcome of this demo app:

Define and build new higher level abstractions to make these possible.
ExecuTorch and torchchat use this abstraction for text-based LLMs.
Llava and multimodal image models use this abstraction.
The community can build completely new apps using these new abstractions.

Alternatives

No response

Additional context

There is already another RFC, but it is specifically in the context of LLMs.

RFC (Optional)

No response

cc @cccclai @helunwencser @dvorjackz

shoumikhin · 2024-11-26T21:46:04Z

shoumikhin
Nov 26, 2024
Collaborator

Another success story for iOS specifically (and hopefully for Android too) should look roughly like this video, where clients could just add some executorch-llm Swift PM package, write a few lines to create a Pipeline using an exported .pte file, add a text-edit field and a button to the UI, and then just run LLaMA inference out of the box.

mergennachin · 2024-11-26T21:53:58Z

mergennachin
Nov 26, 2024
Collaborator Author

@shoumikhin

yeah, that's cool. two additional comments:

one thing is to think about not only LLMs but other models as well... Usually an AI application would be a combination of multiple models orchestrated together (voice, image, text, etc.).
users would like to experiment with Python first by combining multiple models... And when they're satisfied with the result, they can "just click a button" and deploy to iOS and Android.

iseeyuan · 2024-11-26T23:15:00Z

iseeyuan
Nov 26, 2024
Collaborator

It's great to have this kind of experience in general, but we may need to think more about how the framework can really help. For multimodal, we need to learn more about the common patterns. Note that different components may not work together directly out of the box due to:

different multimodal architectures (for example, with or without cross attention)
training dependencies: some encoders may be trained on a certain LLM foundation model, or different training steps may be required, like training the encoder with a frozen foundation model and then fine-tuning the foundation model with a frozen encoder.
As a framework, we may think about how we can help users hook their components into a working and robust pipeline.
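
One way a framework might help here is to have components declare the properties that matter and check them before wiring a pipeline together. The Python sketch below is only an assumption about what that could look like: `EncoderSpec`, `LLMSpec`, and `compatibility_issues` are hypothetical names, and the two checks simply mirror the two reasons listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EncoderSpec:
    name: str
    needs_cross_attention: bool
    # Foundation model the encoder was trained against, if any.
    trained_against: Optional[str] = None


@dataclass(frozen=True)
class LLMSpec:
    name: str
    has_cross_attention: bool


def compatibility_issues(encoder: EncoderSpec, llm: LLMSpec) -> list[str]:
    """Return human-readable reasons why the pair may not work together."""
    issues = []
    # Architecture mismatch: with vs. without cross attention.
    if encoder.needs_cross_attention and not llm.has_cross_attention:
        issues.append(
            f"{encoder.name} expects cross attention, which {llm.name} lacks"
        )
    # Training dependency: encoder tied to a specific foundation model.
    if encoder.trained_against not in (None, llm.name):
        issues.append(
            f"{encoder.name} was trained against {encoder.trained_against}, "
            f"not {llm.name}"
        )
    return issues
```

A pipeline builder could call `compatibility_issues(encoder, llm)` and refuse to assemble (or warn) when the returned list is not empty, so that users find out about a mismatch before export rather than at inference time.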

kimishpatel · 2024-12-02T00:49:33Z

kimishpatel
Dec 2, 2024
Collaborator

I think it would be great if the issues/pain points, solution space and potential ways to validate this actually came from users a level higher than framework devs. My fear is that we will make incremental improvements that aesthetically please us based on our own experience. Even if such issues and/or the solution space are not driven by other users or product engineers, such personas have to be closely involved in iterating over the solution space. Maybe this would be people from the Paris hackathon.

shoumikhin · 2024-12-02T00:53:28Z

shoumikhin
Dec 2, 2024
Collaborator

@kimishpatel I imagine if for users it looks similar to HF transformers or OpenAI API, that should be good enough?

kimishpatel · 2024-12-02T01:50:04Z

kimishpatel
Dec 2, 2024
Collaborator

@kimishpatel I imagine if for users it looks similar to HF transformers or OpenAI API, that should be good enough?

What is the OpenAI API?

Why do you believe that users of something similar to HF transformers cover the span of users that interact with torchchat/ET? Any examples?

shoumikhin · 2024-12-02T02:14:18Z

shoumikhin
Dec 2, 2024
Collaborator

HF transformers and the OpenAI API are sorta de facto standards for how devs and clients interact with LLMs these days. I guess if TC provides a similar interface, it at least wouldn't be worse.
The real question is whether we can do better, and what that better would look like. Agree that's something the researchers or consumers can help us define, along with our own iterations.

kimishpatel · 2024-12-02T03:50:39Z

kimishpatel
Dec 2, 2024
Collaborator

HF transformers or OpenAI API are sorta de-facto standards how devs and clients interact with LLMs these days. I guess if TC provides a similar interface it at least wouldn't be worse. The real question is can we do better and what such better would be? Agree that's something the researchers or consumers can help us to define, along with our own iterations.

@shoumikhin The OpenAI API, AFAIU, is about endpoint APIs, whereas building a pipeline of components, with customizability across aspects such as the tokenizer, KV cache management, long-context management, etc., might be different. I don't know enough about HF in this space, but I assume it would align more closely with some of the objectives here.

And generally, @mergennachin, I would also want to understand how different end users envision deploying models. The requirements listed here make sense, but I can't place them in a larger context or see where they are coming from. @shoumikhin's comment regarding HF users does make sense, but does the same exist for on-device use cases?
