This PR brings a major overhaul of the WebLLM runtime.
- Modular package that can be reused by the community, independent of the UI.
- Published on npm.
- All components rewritten in TypeScript.
- Overhauls the TVM Unity side dependencies to be cleaner.
- The WebLLM package can now depend on and reuse artifacts generated by MLC LLM without having to set up the build for package development.
This project brings language model chats directly onto web browsers. **Everything runs inside the browser with no server support and accelerated with WebGPU.** We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
6
+
WebLLM is a modular, customizable JavaScript package that
7
+
brings language model chats directly onto web browsers with hardware acceleration.
8
+
**Everything runs inside the browser with no server support and accelerated with WebGPU.**
9
+
We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
4
10
5
11
**[Check out our demo webpage to try it out!](https://mlc.ai/web-llm/)**
12
+
This project is a companion project of [MLC LLM](https://github.com/mlc-ai/mlc-llm),
13
+
which runs LLMs natively on iPhone and other native local environments.
6
14
7
-
You might also be interested in [MLC LLM](https://github.com/mlc-ai/mlc-llm), our companion project that runs LLMs natively on iPhone and other native local environments.
8
15
9
16
<img src="site/img/fig/demo.gif">
10
17
11
-
We have been seeing amazing progress in generative AI and LLMs recently. Thanks to open-source efforts like LLaMA, Alpaca, Vicuna, and Dolly, we can now see an exciting future of building our own open-source language models and personal AI assistants.
18
+
## Get Started
12
19
13
-
These models are usually big and compute-heavy. To build a chat service, we need a large cluster to run an inference server, while clients send requests to the servers and retrieve the inference output. We also usually have to run on specific types of GPUs where popular deep-learning frameworks are readily available.
20
+
WebLLM offers a minimalist and modular interface to access the chatbot in the browser.
21
+
The following code demonstrates the basic usage.
14
22
15
-
This project is our step to bring more diversity to the ecosystem. Specifically, can we simply bake LLMs directly into the client side and directly run them inside a browser? If that can be realized, we could offer support for client personal AI models with the benefit of cost reduction, enhancement for personalization, and privacy protection. The client side is getting pretty powerful.
23
+
```typescript
24
+
import { ChatModule } from "@mlc-ai/web-llm";
16
25
17
-
Won’t it be even more amazing if we can simply open up a browser and directly bring AI natively to your browser tab? There is some level of readiness in the ecosystem. WebGPU has just shipped and enables native GPU executions on the browser.
We also do not want to only do it for just one model. Instead, we would like to present a repeatable and hackable workflow that enables anyone to easily develop and optimize these models in a productive Python-first approach, and deploy them universally, including on the web.
120
+
In many cases we only want to supply the model weight variant, but
121
+
not necessarily a new model. In such cases, we can reuse the model lib.
122
+
We can then just pass in the `model_list` field and skip the model lib,
123
+
and make sure the `mlc-chat-config.json` at the model URL has a model lib
124
+
that points to a prebuilt version. Right now the prebuilt lib includes
26
125
27
-
Besides supporting WebGPU, this project also provides the harness for other kinds of GPU backends that TVM supports (such as CUDA, OpenCL, and Vulkan) and really enables accessible deployment of LLM models.
1. Install TVM Unity. See [mlc.ai wheels](https://mlc.ai/wheels) for more versions.
130
+
## Build WebLLM Package From Source
32
131
33
-
```shell
34
-
pip3 install -r requirements.txt
35
-
```
132
+
The WebLLM package is a web runtime designed for [MLC LLM](https://github.com/mlc-ai/mlc-llm).
36
133
37
-
2. Install all the prerequisites for web deployment:
134
+
1. Install all the prerequisites for web deployment:
38
135
1. [emscripten](https://emscripten.org). It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.
39
136
- Follow the [installation instructions](https://emscripten.org/docs/getting_started/downloads.html#installation-instructions-using-the-emsdk-recommended) to install the latest emsdk.
40
137
- Source `emsdk_env.sh` by `source path/to/emsdk_env.sh`, so that `emcc` is reachable from PATH and the command `emcc` works.
3. [`wasm-pack`](https://rustwasm.github.io/wasm-pack/installer/). It helps build Rust-generated WebAssembly, which is used for the tokenizer in our case.
43
138
4. Install jekyll by following the [official guides](https://jekyllrb.com/docs/installation/). It is the package we use for the website.
44
139
5. Install jekyll-remote-theme with the command below. Try the [gem mirror](https://gems.ruby-china.com/) if the install is blocked.
45
140
```shell
46
141
gem install jekyll-remote-theme
47
142
```
48
-
6. Install [Chrome](https://www.google.com/chrome/) version 113 or later. WebGPU shipped in Chrome 113.
49
-
50
-
We can verify a successful installation by running `emcc`, `jekyll` and `wasm-pack` in a terminal.
51
-
52
-
3. Import, optimize and build the LLM model:
53
-
* Get Model Weight
54
-
55
-
Currently we support LLaMA, Vicuna, and RedPajama. To get the Vicuna model weights, follow the instructions below:
56
-
57
-
1. Get the original LLaMA weights in the huggingface format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
58
-
2. Follow the instructions [here](https://github.com/lm-sys/FastChat#vicuna-weights) to get the Vicuna weights.
59
-
3. Create a soft link to the model path under mlc-llm/dist/models.
Note: build.py for Vicuna-v1-7B requires 16GB of memory on Mac, and about 30GB of CPU memory on other operating systems. We are continuously working to reduce the build memory requirement so that more people can try it out locally.
143
+
We can verify a successful installation by running `emcc` and `jekyll` in a terminal.
78
144
79
-
4. Deploy the model on web with WebGPU runtime
145
+
2. Set up the necessary environment
80
146
81
147
Prepare all the necessary dependencies for web build:
82
148
```shell
83
149
./scripts/prep_deps.sh
84
150
```
85
151
86
-
The last thing to do is to set up the site with:
87
-
```shell
88
-
./scripts/local_deploy_site.sh
89
-
```
152
+
3. Build WebLLM Package
90
153
91
-
With the site set up, you can go to `localhost:8888/web-llm/` in Chrome to try out the demo on your local machine. You will need around 6GB of GPU memory to run the Vicuna model, or 3GB of GPU memory to run the RedPajama model. You can use
to launch Chrome from the command line to turn off the robustness check from Chrome and enable better performance.
157
+
158
+
4. Validate some of the sub packages
159
+
160
+
You can then go to the subfolders in [examples] to validate some of the sub packages.
161
+
We use Parcel v2 for bundling. However, Parcel is sometimes not very good at tracking parent directory
162
+
changes. When you make a change in the WebLLM package, try to edit the `package.json`
163
+
of the subfolder and save it, which will trigger Parcel to rebuild.
96
164
97
165
98
166
## How
@@ -121,22 +189,6 @@ One key characteristic of LLM models is the dynamic nature of the model. As the
121
189
We also leveraged the integration of tensor expressions to quickly express partial-tensor computations such as rotary embedding directly without materializing them into full-tensor matrix computations.
122
190
123
191
124
-
## Comparison to Native GPU Runtime, Limitations and Opportunities
125
-
126
-
Besides the WebGPU runtime, we also provide options for native deployment with a local GPU runtime. They can be used both as a tool to deploy in native environments and as a reference point to compare native GPU driver performance with WebGPU.
127
-
128
-
WebGPU works by translating WGSL shaders to native shaders. We observed that there are opportunities to reach zero gap between the WebGPU runtime and native environment.
129
-
130
-
Some of the current gaps arise because Chrome's WebGPU implementation inserts bound clips for all array index accesses, such that `a[i]` becomes `a[min(i, a.size)]`. This can be optimized out as the WebGPU support continues to mature.
131
-
132
-
You can get around this by using a special flag to launch Chrome (thanks to Dawn developers for providing the pointers): exit Chrome completely, then in the command line, type:
Then you will find that the execution speed is as fast as in a native GPU environment. We anticipate this problem will get resolved as WebGPU matures. WebGPU just shipped and we are excited to see the opportunities it can unblock. There are also a lot of exciting upcoming features we can leverage to further improve things, such as fp16 extensions.
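
The bound clipping described above can be illustrated with a small TypeScript sketch. This is only an analogy for what the shader translation does, not code from Chrome, Dawn, or WebLLM: `clampedRead` is a hypothetical helper, and it assumes a non-empty array and clamps to the last valid index (`a.length - 1`).

```typescript
// Illustrative analogy only: mimics the index clamping that a WebGPU
// implementation may insert around every array access in a shader.
// `clampedRead` is a hypothetical helper, not part of any real API.
function clampedRead(a: Float32Array, i: number): number {
  // Assumes `a` is non-empty. The index is forced into [0, a.length - 1],
  // so the read can never touch memory outside the array.
  const safeIndex = Math.min(Math.max(i, 0), a.length - 1);
  return a[safeIndex];
}

const data = new Float32Array([1, 2, 3]);
const inRange = clampedRead(data, 1); // index is already valid, read as-is
const outOfRange = clampedRead(data, 10); // index is clamped to the last element
```

The extra `min`/`max` work on every access is what costs performance relative to an unchecked native read, which is why turning the check off narrows the gap.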
139
-
140
192
## Links
141
193
142
194
- [Demo page](https://mlc.ai/web-llm/)
@@ -147,24 +199,4 @@ Then you will find that the execution speed is as fast as native GPU environment
147
199
148
200
This project is initiated by members from CMU catalyst, UW SAMPL, SJTU, OctoML and the MLC community. We would love to continue developing and supporting the open-source ML community.
149
201
150
-
<a href="https://www.scs.cmu.edu">
151
-
<img src="site/img/logo/cmuscs.png" alt="CMU School of Computer Science" height="60"/>
This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. The open-source ML community members made these models publicly available, and the PyTorch and Hugging Face communities make these models accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, and Alpaca. We also would like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to Dawn and WebGPU developers.