Commit 8ce1d8d

Major overhaul and flow standardization (#113)
This PR brings a major overhaul of the WebLLM runtime:

- A modular package that can be reused by the community, independent of the UI.
- Published on npm.
- All components rewritten in TypeScript.
- The TVM Unity side dependencies overhauled to be cleaner.
- The WebLLM package can now depend on and reuse artifacts generated by MLC LLM without having to set up the build flow for package development.

1 parent 13d51df commit 8ce1d8d

47 files changed: +9375 -1088 lines

.eslintignore (+6)

@@ -0,0 +1,6 @@
+dist
+debug
+lib
+build
+node_modules
+.eslintrc.cjs

.eslintrc.cjs (+9)

@@ -0,0 +1,9 @@
+module.exports = {
+  extends: ['eslint:recommended', 'plugin:@typescript-eslint/recommended'],
+  parser: '@typescript-eslint/parser',
+  plugins: ['@typescript-eslint'],
+  root: true,
+  rules: {
+    "@typescript-eslint/no-explicit-any": "off"
+  }
+};

.gitignore (+4)

@@ -318,3 +318,7 @@ gallery/how_to/work_with_microtvm/micro_tvmc.py
 
 3rdparty
 dist
+tvm_home
+node_modules
+lib
+.parcel-cache

.gitmodules (-6)

@@ -1,6 +0,0 @@
-[submodule "mlc-llm"]
-	path = mlc-llm
-	url = https://github.com/mlc-ai/mlc-llm
-[submodule "3rdparty/tokenizers-cpp"]
-	path = 3rdparty/tokenizers-cpp
-	url = https://github.com/mlc-ai/tokenizers-cpp

3rdparty/.gitkeep

Whitespace-only changes.

3rdparty/tokenizers-cpp (-1)

This file was deleted.

README.md (+126 -94)

@@ -1,98 +1,166 @@
+[discord-url]: https://discord.gg/9Xpy2HGBuD
+
 # Web LLM
+| [NPM Package](https://www.npmjs.com/package/@mlc-ai/web-llm) | [Get Started](#get-started) | [MLC LLM](https://github.com/mlc-ai/mlc-llm) | [Discord][discord-url]

-This project brings language model chats directly onto web browsers. **Everything runs inside the browser with no server support and accelerated with WebGPU.** We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
+WebLLM is a modular, customizable JavaScript package that brings
+language model chats directly onto web browsers with hardware acceleration.
+**Everything runs inside the browser with no server support and is accelerated with WebGPU.**
+This opens up a lot of fun opportunities to build AI assistants for everyone and to enable privacy while enjoying GPU acceleration.

 **[Check out our demo webpage to try out!](https://mlc.ai/web-llm/)**
+WebLLM is a companion project of [MLC LLM](https://github.com/mlc-ai/mlc-llm),
+which runs LLMs natively on iPhone and other native local environments.

-You might also be interested in [MLC LLM](https://github.com/mlc-ai/mlc-llm), our companion project that runs LLMs natively on iphone and other native local environments.

 <img src="site/img/fig/demo.gif">

-We have been seeing amazing progress in generative AI and LLM recently. Thanks to the open-source efforts like LLaMA, Alpaca, Vicuna, and Dolly, we can now see an exciting future of building our own open-source language models and personal AI assistant.
+## Get Started

-These models are usually big and compute-heavy. To build a chat service, we will need a large cluster to run an inference server, while clients send requests to servers and retrieve the inference output. We also usually have to run on a specific type of GPUs where popular deep-learning frameworks are readily available.
+WebLLM offers a minimalist and modular interface to access the chatbot in the browser.
+The following code demonstrates the basic usage.

-This project is our step to bring more diversity to the ecosystem. Specifically, can we simply bake LLMs directly into the client side and directly run them inside a browser? If that can be realized, we could offer support for client personal AI models with the benefit of cost reduction, enhancement for personalization, and privacy protection. The client side is getting pretty powerful.
+```typescript
+import { ChatModule } from "@mlc-ai/web-llm";

-Won’t it be even more amazing if we can simply open up a browser and directly bring AI natively to your browser tab? There is some level of readiness in the ecosystem. WebGPU has just shipped and enables native GPU executions on the browser.
+async function main() {
+  const chat = new ChatModule();
+  // load a prebuilt model
+  await chat.reload("RedPajama-INCITE-Chat-3B-v1-q4f32_0");
+  // generate a reply based on the input
+  const prompt = "What is the capital of Canada?";
+  const reply = await chat.generate(prompt);
+  console.log(reply);
+}
+```

-Still, there are big hurdles to cross, to name a few:
+The WebLLM package itself does not come with a UI, and is designed in a
+modular way so that it can be hooked into any UI component. The following code snippet
+contains part of a program that generates a streaming response on a webpage.
+You can check out [examples/get-started](examples/get-started/) to see the complete example.
+
+```typescript
+async function main() {
+  // create a ChatModule
+  const chat = new ChatModule();
+  // This callback allows us to report initialization progress
+  chat.setInitProgressCallback((report: InitProgressReport) => {
+    setLabel("init-label", report.text);
+  });
+  // pick a model, here we use red-pajama
+  const localId = "RedPajama-INCITE-Chat-3B-v1-q4f32_0";
+  await chat.reload(localId);
+
+  // callback to refresh the streaming response
+  const generateProgressCallback = (_step: number, message: string) => {
+    setLabel("generate-label", message);
+  };
+  const prompt0 = "What is the capital of Canada?";
+  // generate a response
+  const reply0 = await chat.generate(prompt0, generateProgressCallback);
+  console.log(reply0);
+
+  const prompt1 = "How about France?";
+  const reply1 = await chat.generate(prompt1, generateProgressCallback);
+  console.log(reply1);
+
+  // We can print out the runtime stats
+  console.log(await chat.runtimeStatsText());
+}
+```

-- We need to bring the models somewhere without the relevant GPU-accelerated Python frameworks.
-- Most of the AI frameworks rely heavily on optimized computed libraries that are maintained by hardware vendors. We need to start from scratch.
-- Careful planning of memory usage, and aggressive compression of weights so that we can fit the models into memory.
+You can also find a complete chat app in [examples/simple-chat](examples/simple-chat/).
+
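The `setLabel` helper used in the streaming snippet above is not part of the `@mlc-ai/web-llm` API; it is a small page-side utility defined by the example app. A minimal sketch, assuming the page declares `<label>` elements with the corresponding ids (as the get-started page later in this commit does), could look like:

```typescript
// Page-side helper (not part of @mlc-ai/web-llm): writes text into a label
// element by id. Assumes elements such as <label id="init-label"> and
// <label id="generate-label"> exist in the HTML, as in examples/get-started.
function setLabel(id: string, text: string) {
  const label = document.getElementById(id);
  if (label == null) {
    throw Error("Cannot find label " + id);
  }
  label.innerText = text;
}
```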
+## Customized Model Weights
+
+WebLLM works as a companion project of [MLC LLM](https://github.com/mlc-ai/mlc-llm).
+It reuses the model artifacts and build flow of MLC LLM. Please check out the MLC LLM documentation
+on how to build new model weights and libraries (the MLC LLM documentation will come in the upcoming weeks).
+To generate the wasm needed by WebLLM, pass `--target webgpu` to the MLC LLM build.
+There are two elements of the WebLLM package that enable new models and weight variants:
+
+- model_url: contains a URL to model artifacts, such as weights and meta-data.
+- model_lib: the WebAssembly library that contains the executables to accelerate the model computations.
+
+Both are customizable in WebLLM.
+
+```typescript
+async function main() {
+  const myLlamaUrl = "/url/to/my/llama";
+  const appConfig = {
+    "model_list": [
+      {
+        "model_url": myLlamaUrl,
+        "local_id": "MyLlama-3b-v1-q4f32_0"
+      }
+    ],
+    "model_lib_map": {
+      "llama-v1-3b-q4f32_0": "/url/to/myllama3b.wasm",
+    }
+  };
+  // override the default chat options
+  const chatOpts = {
+    "repetition_penalty": 1.01
+  };
+
+  const chat = new ChatModule();
+  // Load a prebuilt model with a chat option override and app config.
+  // Under the hood, it will load the model from myLlamaUrl
+  // and cache it in the browser cache.
+  //
+  // Let us assume that myLlamaUrl/mlc-chat-config.json contains a model_lib
+  // field that points to "llama-v1-3b-q4f32_0";
+  // the chat module will then initialize with this information.
+  await chat.reload("MyLlama-3b-v1-q4f32_0", chatOpts, appConfig);
+}
+```

-We also do not want to only do it for just one model. Instead, we would like to present a repeatable and hackable workflow that enables anyone to easily develop and optimize these models in a productive Python-first approach, and deploy them universally, including on the web.
+In many cases we only want to supply a model weight variant, but
+not necessarily a new model. In such cases, we can reuse the model lib:
+pass in only the `model_list` field, skip the `model_lib_map`,
+and make sure the `mlc-chat-config.json` in the model URL has a model_lib
+that points to a prebuilt version (a sketch after the list below shows such a configuration). Right now the prebuilt libs include:

-Besides supporting WebGPU, this project also provides the harness for other kinds of GPU backends that TVM supports (such as CUDA, OpenCL, and Vulkan) and really enables accessible deployment of LLM models.
+- `vicuna-v1-7b-q4f32_0`: llama-7b models.
+- `RedPajama-INCITE-Chat-3B-v1-q4f32_0`: RedPajama-3B variant.
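As a concrete illustration of the weight-variant-only flow described above, here is a minimal sketch. The URL and `local_id` are hypothetical, and it assumes the `mlc-chat-config.json` served under that URL sets its model_lib to the prebuilt `vicuna-v1-7b-q4f32_0` library:

```typescript
import { ChatModule } from "@mlc-ai/web-llm";

async function main() {
  // Hypothetical URL and local_id, for illustration only.
  // The mlc-chat-config.json hosted under myWeightsUrl is assumed to set its
  // model_lib field to the prebuilt "vicuna-v1-7b-q4f32_0" library,
  // so no model_lib_map entry is supplied here.
  const myWeightsUrl = "/url/to/my/vicuna-weight-variant";
  const appConfig = {
    "model_list": [
      {
        "model_url": myWeightsUrl,
        "local_id": "MyVicunaVariant-7b-q4f32_0"
      }
    ]
  };
  const chat = new ChatModule();
  // no chat option override; pass the custom app config as the third argument
  await chat.reload("MyVicunaVariant-7b-q4f32_0", undefined, appConfig);
}
```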

-## Instructions for local deployment

-1. Install TVM Unity. Open [mlc.ai wheels](https://mlc.ai/wheels) for more version.
+## Build WebLLM Package From Source

-```shell
-pip3 install -r requirements.txt
-```
+The WebLLM package is a web runtime designed for [MLC LLM](https://github.com/mlc-ai/mlc-llm).

-2. Install all the prerequisite for web deployment:
+1. Install all the prerequisites for web deployment:
 1. [emscripten](https://emscripten.org). It is an LLVM-based compiler which compiles C/C++ source code to WebAssembly.
 - Follow the [installation instruction](https://emscripten.org/docs/getting_started/downloads.html#installation-instructions-using-the-emsdk-recommended) to install the latest emsdk.
 - Source `emsdk_env.sh` by `source path/to/emsdk_env.sh`, so that `emcc` is reachable from PATH and the command `emcc` works.
-2. [Rust](https://www.rust-lang.org/tools/install).
-3. [`wasm-pack`](https://rustwasm.github.io/wasm-pack/installer/). It helps build Rust-generated WebAssembly, which used for tokenizer in our case here.
 4. Install jekyll by following the [official guides](https://jekyllrb.com/docs/installation/). It is the package we use for website.
 5. Install jekyll-remote-theme by command. Try [gem mirror](https://gems.ruby-china.com/) if install blocked.
 ```shell
 gem install jekyll-remote-theme
 ```
-6. Install [Chrome](https://www.google.com/chrome/) with version at least 113. WebGPU has shipped to Chrome in version 113.
-
-We can verify the success installation by trying out `emcc`, `jekyll` and `wasm-pack` in terminal respectively.
-
-3. Import, optimize and build the LLM model:
-* Get Model Weight
-
-Currently we support LLaMA and Vicuna and RedPajama. To get the Vicuna model weights, follow the instructions below:
-
-1. Get the original LLaMA weights in the huggingface format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
-2. Use instructions [here](https://github.com/lm-sys/FastChat#vicuna-weights) to get vicuna weights
-3. Create a soft link to the model path under mlc-llm/dist/models.
-```shell
-mkdir -p mlc-llm/dist/models
-ln -s your_model_path mlc-llm/dist/models/model_name
-
-# For example:
-# ln -s path/to/vicuna-v1-7b mlc-llm/dist/models/vicuna-v1-7b
-```
-
-If you want to use your own mlc-llm branch, set `MLC_LLM_HOME` to that path and link weights under `$MLC_LLM_HOME/dist/models/model_name`
-
-You can download the RedPajama weights from the HuggingFace repo [here](https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1).
-
-* Optimize and build the models to WebGPU backend and export the executable to disk in the WebAssembly file format.
-```shell
-./build.sh --model=vicuna-v1-7b --quantization q4f32_0
-./build.sh --model=RedPajama-INCITE-Chat-3B-v1 --quantization q4f32_0
-```
-Note: build.py for Vicuna-v1-7B requires 16GB of memory for Mac, and about 30GB CPU memory for other OS. We are continuously optimizing for reducing build memory requirement to enable more people to try out locally.
+We can verify the successful installation by trying out `emcc` and `jekyll` in the terminal respectively.

-4. Deploy the model on web with WebGPU runtime
+2. Set up the necessary environment

 Prepare all the necessary dependencies for web build:
 ```shell
 ./scripts/prep_deps.sh
 ```

-The last thing to do is setting up the site with:
-```shell
-./scripts/local_deploy_site.sh
-```
+3. Build the WebLLM package

-With the site set up, you can go to `localhost:8888/web-llm/` in Chrome to try out the demo on your local machine. You will need around 6GB GPU memory to run the Vicuna model, or 3GB GPU memory to run the RedPajama model. You can use
 ```shell
-/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --enable-dawn-features=disable_robustness
+npm run build
 ```
-to launch Chrome from the command line to turn off the robustness check from Chrome and enable better performance.
+
+4. Validate some of the sub-packages
+
+You can then go to the subfolders in [examples](examples) to validate some of the sub-packages.
+We use Parcel v2 for bundling. Parcel is sometimes not very good at tracking parent-directory
+changes. When you make a change in the WebLLM package, try editing the `package.json`
+of the subfolder and saving it, which will trigger Parcel to rebuild.


 ## How
@@ -121,22 +189,6 @@ One key characteristic of LLM models is the dynamic nature of the model. As the
 We also leveraged the integration of tensor expressions to quickly express partial-tensor computations such as rotary embedding directly without materializing them into full-tensor matrix computations.


-## Comparison to Native GPU Runtime, Limitations and Opportunities
-
-Besides the WebGPU runtime, we also provide options for native deployment with local GPU runtime. So they can be used both as a tool to deploy on native environment as well as a reference point to compare native GPU driver performance and WebGPU.
-
-WebGPU works by translating WGSL shaders to native shaders. We observed that there are opportunities to reach zero gap between the WebGPU runtime and native environment.
-
-Some of the current gaps are caused by Chrome's WebGPU implementation inserts bound clips for all array index access, such that `a[i]` becomes `a[min(i, a.size)]`. This can be optimized out as the WebGPU support continues to mature.
-
-You can get around this by using a special flag to launch Chrome (thanks to Dawn developers for providing the pointers), by exiting Chrome completely, then in command line, type:
-
-```
-/path/to/Chrome --enable-dawn-features=disable_robustness
-```
-
-Then you will find that the execution speed is as fast as native GPU environment. We anticipate this problem will get resolved as WebGPU matures. WebGPU just shipped and we are excited to see opportunities it can unblock. There are also a lot of exciting upcoming features we can leverage to further improve things such as fp16 extensions.
-
 ## Links

 - [Demo page](https://mlc.ai/web-llm/)
@@ -147,24 +199,4 @@ Then you will find that the execution speed is as fast as native GPU environment

 This project is initiated by members from CMU catalyst, UW SAMPL, SJTU, OctoML and the MLC community. We would love to continue developing and supporting the open-source ML community.

-<a href="https://www.scs.cmu.edu">
-<img src="site/img/logo/cmuscs.png" alt="CMU School of Computer Science" height="60"/>
-</a>
-<a href="https://catalyst.cs.cmu.edu">
-<img src="site/img/logo/catalyst.svg" alt="Catalyst" height="60"/>
-</a>
-<a href="https://mlc.ai">
-<img src="site/img/logo/mlc-logo-with-text-landscape.svg" alt="MLC" height="60"/>
-</a>
-</br>
-<a href="https://octoml.ai">
-<img src="site/img/logo/octoml.png" alt="OctoML" height="60"/>
-</a>
-<a href="https://www.cs.washington.edu/">
-<img src="site/img/logo/uw.jpg" alt="UW" height="60"/>
-</a>
-<a href="https://en.sjtu.edu.cn/">
-<img src="site/img/logo/sjtu.png" alt="SJTU" height="60"/>
-</a>
-
 This project is only possible thanks to the shoulders open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. The open-source ML community members made these models publicly available. PyTorch and Hugging Face communities that make these models accessible. We would like to thank the teams behind vicuna, SentencePiece, LLaMA, Alpaca. We also would like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to Dawn and WebGPU developers.

build.sh (-10)

This file was deleted.

examples/.gitignore (+1)

@@ -0,0 +1 @@
+package-lock.json

examples/README.md (+10)

@@ -0,0 +1,10 @@
+# Awesome WebLLM
+
+This page contains a curated list of examples, tutorials, and blogs about WebLLM use cases.
+Please send a pull request if you find things that belong here.
+
+## Tutorial Examples
+
+- [get-started](get-started): a minimal get-started example.
+- [simple-chat](simple-chat): a minimal and complete chat app.
+

examples/get-started/README.md (+21)

@@ -0,0 +1,21 @@
+# WebLLM Get Started App
+
+This folder provides a minimal demo to show the WebLLM API in a webapp setting.
+To try it out, you can do the following steps:
+
+- Modify [package.json](package.json) to make sure either
+  - `@mlc-ai/web-llm` points to a valid npm version, e.g.
+    ```js
+    "dependencies": {
+      "@mlc-ai/web-llm": "^0.1.0"
+    }
+    ```
+    Try this option if you would like to use WebLLM without building it yourself.
+  - or keep the dependency as `"file:../.."`, and follow the build-from-source
+    instructions in the project to build WebLLM locally. This option is more useful
+    for developers who would like to hack on the WebLLM core package.
+- Run the following commands
+  ```bash
+  npm install
+  npm start
+  ```

examples/get-started/package.json (+17)

@@ -0,0 +1,17 @@
+{
+  "name": "get-started",
+  "version": "0.1.0",
+  "private": true,
+  "scripts": {
+    "start": "parcel src/get_started.html --port 8888",
+    "build": "parcel build src/get_started.html --dist-dir lib"
+  },
+  "devDependencies": {
+    "parcel": "^2.8.3",
+    "typescript": "^4.9.5",
+    "tslib": "^2.3.1"
+  },
+  "dependencies": {
+    "@mlc-ai/web-llm": "file:../.."
+  }
+}

examples/get-started/src/get_started.html (+22)

@@ -0,0 +1,22 @@
+<!DOCTYPE html>
+<html>
+<script>
+webLLMGlobal = {}
+</script>
+<body>
+<h2>WebLLM Test Page</h2>
+Open console to see output
+</br>
+</br>
+<label id="init-label"> </label>
+
+<h3>Prompt</h3>
+<label id="prompt-label"> </label>
+
+<h3>Response</h3>
+<label id="generate-label"> </label>
+</br>
+<label id="stats-label"> </label>
+
+<script type="module" src="./get_started.ts"></script>
+</html>
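The page above loads `./get_started.ts`, which is not part of this diff. A minimal sketch of what such a module might look like, assembled from the README snippets earlier in this commit and the hypothetical `setLabel` helper sketched above (the actual file in the repository may differ):

```typescript
import { ChatModule, InitProgressReport } from "@mlc-ai/web-llm";

// Hypothetical page-side helper, mirroring the earlier sketch: writes text
// into one of the <label> placeholders declared in get_started.html.
function setLabel(id: string, text: string) {
  const label = document.getElementById(id);
  if (label == null) {
    throw Error("Cannot find label " + id);
  }
  label.innerText = text;
}

async function main() {
  const chat = new ChatModule();
  // report weight fetching / initialization progress to the page
  chat.setInitProgressCallback((report: InitProgressReport) => {
    setLabel("init-label", report.text);
  });
  await chat.reload("RedPajama-INCITE-Chat-3B-v1-q4f32_0");

  const prompt = "What is the capital of Canada?";
  setLabel("prompt-label", prompt);
  // stream partial generations into the response label
  const reply = await chat.generate(prompt, (_step: number, message: string) => {
    setLabel("generate-label", message);
  });
  console.log(reply);
  setLabel("stats-label", await chat.runtimeStatsText());
}

main();
```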
