Changes from all commits
82 commits
080a53a
fix Rotation matri form of RoPE (#25)
Andy1314Chen Jun 13, 2025
cd54459
add back installation check script
skyzh Jun 14, 2025
55066c3
bump mlx to latest version (#33)
skyzh Jul 23, 2025
7fc05dc
test:add a test case to cover week_1_day_3_task3 (#31)
Phoenix500526 Jul 23, 2025
13295fc
fix tokp implementation
skyzh Jul 27, 2025
4e9101c
more precision tweaks
skyzh Jul 27, 2025
ae06a7f
fix bugs in continuous batching
skyzh Jul 27, 2025
cfbc43e
fix mask tests
skyzh Jul 27, 2025
5863d96
reshape in kvcache
skyzh Jul 27, 2025
55e7b0c
fix flash attention
skyzh Aug 2, 2025
d954fb5
add back causal mask to gqa
skyzh Aug 2, 2025
e21a583
flash attention works for the first token, maybe some mem init issue
skyzh Aug 2, 2025
9eacc3b
try debug flashattention on multi test run
skyzh Aug 3, 2025
657c0b3
small fixes of flash attention
skyzh Aug 3, 2025
8dfe61c
finally fully fix flash attention
skyzh Aug 3, 2025
0ca2bf1
feat(kv-cache): add KV cache imports and week 2 day 1 tests (#35)
magic3007 Aug 3, 2025
3cd7d84
refactor continuous batching
skyzh Aug 3, 2025
5930135
chunked prefill only in continuous batching
skyzh Aug 3, 2025
8b4d9a7
update readme and roadmap
skyzh Aug 3, 2025
024d528
update the vllm-RoPE code link in the reading (#39)
58191554 Aug 8, 2025
00ea990
bfloat16 support for matmul
skyzh Aug 9, 2025
850dd6c
model shortcut and dispatcher
skyzh Aug 9, 2025
ffbd15d
qwen3 support
skyzh Aug 9, 2025
cd87116
update readme
skyzh Aug 9, 2025
4e1cced
fix: resolve f-string syntax error in batch.py (#44)
minatoaquaMK2 Aug 12, 2025
1d7572f
remove offset in week 1, not used
skyzh Aug 17, 2025
0b82b7f
add week2day1 kv cache contents
skyzh Aug 17, 2025
45cff24
update benches
skyzh Aug 17, 2025
30b68a9
small fix about dim
skyzh Aug 18, 2025
042acf5
clearify variables
skyzh Aug 18, 2025
4cddec2
fix: Add flash attention option and fix token offset (#46)
touale Aug 19, 2025
25068d4
Fix: remove offset parameter to Qwen2MultiHeadAttention.__call__ meth…
58191554 Aug 19, 2025
4a4c752
Fix broken url links of MultiHeadAttention in week1-01-attention.md (…
jiengup Aug 19, 2025
4ae2ad1
Fix MLX Metal API usage and Primitive interface for Axpby; restore su…
58191554 Aug 21, 2025
efe008a
s/consequtive/consecutive (#52)
Plypy Aug 21, 2025
a55a92f
docs: fix some typos (#53)
KKKZOZ Aug 21, 2025
e051790
fix typo in week2-01 (#54)
58191554 Aug 21, 2025
b4c14ed
ci: add spell check workflow (#55)
KKKZOZ Aug 24, 2025
bf3383d
fix: Use non-traditional RoPE in Qwen2 test case. (#56)
jiengup Sep 7, 2025
1c9369a
fix: mlx-llm Qwen2 RMSNorm url link (#57)
jiengup Sep 7, 2025
04149a3
add test for week 1 day 5 test 1: Qwen2TransformerBlock (#59)
jiengup Sep 7, 2025
81b917d
Possible typo in week1-01-attention (#60)
ekzhang Sep 8, 2025
fa8b08e
Revert "fix: Use non-traditional RoPE in Qwen2 test case. (#56)" (#62)
jiengup Sep 10, 2025
919a3e5
format and warn on different test files
skyzh Sep 13, 2025
34fb3fe
mention that we have quantized weight now
skyzh Sep 13, 2025
1449816
add chunked prefill and continuous batching writeup (#64)
skyzh Sep 13, 2025
1fc0752
fix simple kv cache decoding (#65)
skyzh Sep 13, 2025
26aa2ff
update writeup progress
skyzh Sep 13, 2025
1f2ab12
Bump mlx to >=0.27 and fix build-ext from week 1, day 7 (#66)
ekzhang Sep 14, 2025
308388e
CI workflow for pdm setup, build and testing refsol (#67)
ekzhang Sep 14, 2025
136ad7f
Day 6, task 1 tests - RoPE with multiple offsets (#68)
ekzhang Sep 17, 2025
6635e4a
Add tests for week 2, day 6 - continuous batching (#69)
ekzhang Sep 19, 2025
b6a3b00
update dev-tools.py to fix --force in copy-test (#70)
linuxholic Sep 21, 2025
ad6d976
add speculative decoding (#71)
skyzh Sep 26, 2025
ff5d7d0
ensure user solution can run
skyzh Sep 26, 2025
cf6910a
add definition hint for model args
Connor1996 Oct 6, 2025
a30f9c2
add more info
Connor1996 Oct 10, 2025
83762c8
rename
Connor1996 Oct 10, 2025
8eebd4a
Merge pull request #73 from Connor1996/model-args
Connor1996 Oct 10, 2025
f1f4f98
fix: fix link to Qwen2.5 blog in week1 (#72)
YangchenYe323 Oct 11, 2025
cea8926
docs: add instruction to download Qwen2-1.5B model (#75)
jinhuix Oct 12, 2025
5dc71b8
perform pdm sync before running (#76)
Connor1996 Nov 2, 2025
ace6e45
Fix f-string syntax (#81)
chasingegg Dec 18, 2025
5b6fdc3
fix: draft-generate offset (#83)
KKKZOZ Dec 18, 2025
16f55c7
fix mx.logsumexp with the right dim (#80)
linuxholic Dec 18, 2025
685caf5
feat: implement quantized_matmul with typed CPU implementation (#77)
Elubrazione Dec 18, 2025
c9f05de
book: remove deprecated mdbook multilingual key (#86)
Connor1996 Feb 8, 2026
e34dc7e
ci: update mdbook preprocessors for 0.5 pipeline (#87)
Connor1996 Feb 8, 2026
0c95267
add AGENTS.md (#85)
Connor1996 Feb 10, 2026
b2393a2
docs: add Week 2 Day 2-3 Quantized Matmul chapter CPU part (#88)
Connor1996 Feb 15, 2026
0688e96
docs: add Week 2 Day 2-3 Quantized Matmul chapter GPU part (#89)
Connor1996 Feb 15, 2026
1cd513b
doc: add tokenizer definition reference (#90)
Connor1996 Feb 16, 2026
e9b90bd
docs: mark week 2.3 tiny_llm status as complete (#91)
Connor1996 Feb 16, 2026
f4dc967
Add bench-main command and week2 benchmark instructions (#93)
Connor1996 Feb 17, 2026
ed8ac9e
fix(ref): correct attention weight shape asserts (#92)
jhsong233 Feb 17, 2026
9b40133
bugfix: way to get_kernel from library
fuyufjh Feb 20, 2026
4373368
Merge pull request #94 from fuyufjh/fix_metal_get_kernel
Connor1996 Feb 22, 2026
5b2f184
book: replace huggingface-cli with hf
you06 Feb 22, 2026
ce64300
Merge pull request #95 from you06/doc/update-huggingface-cli
Connor1996 Feb 22, 2026
2ace66c
tests: parametrize flash attention mask coverage (#96)
Connor1996 Feb 22, 2026
bb1e902
docs: add week2 flash-attention CPU part (#97)
Connor1996 Feb 23, 2026
ddfcaed
docs: add week2 entries to glossary (#98)
Connor1996 Feb 23, 2026
38 changes: 38 additions & 0 deletions .cspell.json
@@ -0,0 +1,38 @@
{
  "version": "0.2",
  "language": "en",
  "words": [
    "skyzh",
    "numpy",
    "Connor",
    "CUDA",
    "matmul",
    "qwen",
    "huggingface",
    "dequantize",
    "freqs",
    "torchtune",
    "Jinyi",
    "logits",
    "argmax",
    "logprobs",
    "softmax",
    "feedforward",
    "Convolutional",
    "Roformer",
    "bfloat",
    "multihead",
    "vllm",
    "silu",
    "GFLOPS",
    "TFLOPS",
    "dequantized",
    "dequantization",
    "dequantizes",
    "dtype",
    "threadgroups",
  ],
  "ignoreRegExpList": [
    "`[^`]*`",
  ]
}
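For reference, the `ignoreRegExpList` entry above excludes backtick-delimited inline code spans from spell checking. A minimal Python sketch of the same filtering (illustrative only; cspell's actual engine works differently and this helper is not part of the repo):

```python
import re

# Mirrors the ignoreRegExpList pattern from .cspell.json:
# backtick-delimited inline code spans are not spell checked.
INLINE_CODE = re.compile(r"`[^`]*`")

def strip_inline_code(text: str) -> str:
    """Drop inline code spans before handing text to a spell checker."""
    return INLINE_CODE.sub("", text)

print(strip_inline_code("run `pdm instal` before testing"))
# → run  before testing
```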
45 changes: 45 additions & 0 deletions .github/workflows/macos.yml
@@ -0,0 +1,45 @@
# Build and test the reference solution automatically on M1 runners.
# This helps prevent breakage of the dev setup.
name: macOS

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  test-refsol:
    name: Test reference solution
    runs-on: macos-15 # ARM64
    steps:
      - uses: actions/checkout@v5

      - name: Install HuggingFace weights
        run: |
          brew install huggingface-cli
          hf download Qwen/Qwen2-0.5B-Instruct-MLX

      - uses: pdm-project/setup-pdm@v4
        with:
          python-version: 3.12
          cache: true

      - run: pdm install

      - run: pdm run check-installation

      # Without this, future build steps fail in CMake.
      - name: Add nanobind to CMake
        run: |
          nanobind_dir=$(pdm run python -c 'import nanobind, os; print(os.path.join(nanobind.__path__[0], "cmake"))')
          echo "nanobind_DIR=${nanobind_dir}" >> $GITHUB_ENV

      - name: Try building extensions
        run: |
          pdm run build-ext
          pdm run build-ext-test

      - run: pdm run build-ext-ref

      - run: pdm run test-refsol
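The "Add nanobind to CMake" step above locates nanobind's bundled CMake config, which ships inside the installed Python package, and exports it as `nanobind_DIR` so CMake can find it. A small sketch of the same lookup (the helper name and path are illustrative; the workflow inlines this as a `python -c` one-liner):

```python
import os

# nanobind ships its CMake config files in a "cmake" directory inside the
# installed package; CMake discovers it via the nanobind_DIR variable.
def nanobind_cmake_dir(package_path: str) -> str:
    return os.path.join(package_path, "cmake")

print(nanobind_cmake_dir("/site-packages/nanobind"))
# → /site-packages/nanobind/cmake
```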
3 changes: 2 additions & 1 deletion .github/workflows/main.yml
@@ -16,7 +16,8 @@ jobs:
       - name: setup rust toolchain
         run: rustup update && rustup toolchain install
       - uses: dtolnay/rust-toolchain@stable
-      - run: cargo install mdbook-katex
+      - run: cargo install mdbook-toc
+      - run: cargo install mdbook-katex --version 0.10.0-alpha
       - uses: taiki-e/install-action@mdbook
       - name: patch for gh-pages build
         run: mv book/theme/head.hbs._ book/theme/head.hbs
28 changes: 28 additions & 0 deletions .github/workflows/spell-check.yml
@@ -0,0 +1,28 @@
name: Spell Check

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]
  workflow_dispatch:

jobs:
  spell-check:
    name: Run cspell
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install cspell globally
        run: npm install -g cspell

      - name: Run spell check on Markdown files
        run: cspell "book/**/*.md"
81 changes: 81 additions & 0 deletions AGENTS.md
@@ -0,0 +1,81 @@
# AGENTS.md

## Scope

- This file applies to the entire repository.
- Use this as the default test-running policy for coding agents.

## Objective

- Run and verify tests in a way that matches the book workflow (`book/src/*.md`).
- Prefer `pdm` entrypoints defined in `pyproject.toml`.

## Environment Requirements

- macOS on Apple Silicon is expected by the project.
- Install dependencies first:

```bash
pdm install -v
pdm run check-installation
```

- Optional baseline check from the setup chapter (reference solution, Week 1):

```bash
pdm run test-refsol -- -- -k week_1
```

## Agent Test Workflow

1. Start with the smallest relevant scope (`--week` + `--day`).
2. Use pytest filters via `-- -k ...` to isolate failing tasks.
3. Run broader suites only after targeted tests pass.
4. If extension code changed, rebuild extensions before testing.

## Canonical Commands

Run all tests:

```bash
pdm run test
```

Run a specific chapter/day:

```bash
pdm run test --week <WEEK> --day <DAY>
```

Run with pytest filters:

```bash
pdm run test --week 1 --day 3 -- -k task_2
pdm run test --week 2 --day 2 -- -k cpu
pdm run test --week 2 --day 2 -- -k gpu
```

Run reference-solution tests:

```bash
pdm run test-refsol
pdm run test-refsol --week 2 --day 2 -- -k cpu
```

## Extension Rebuild Rule

Rebuild before tests if these changed:

- `src/extensions/src/*`

Commands:

```bash
pdm run build-ext
```

## Guardrails

- Use `--` before pytest args (`-k`, `-q`, `--collect-only`, etc.).
- `pdm run test --week X --day Y` auto-copies `tests_refsol/test_week_X_day_Y.py` into `tests/`.
- Model-dependent tests (0.5B/1.5B/7B) skip when models are not downloaded locally.
24 changes: 13 additions & 11 deletions README.md
@@ -8,6 +8,8 @@ can build the model serving infrastructure from scratch and dig into the optimiz
 
 The goal is to learn the techniques behind efficiently serving a large language model (e.g., Qwen2 models).
 
+In week 1, you will implement the necessary components in Python (only Python!) to use the Qwen2 model to generate responses (e.g., attention, RoPE, etc). In week 2, you will implement the inference system which is similar to but a much simpler version of vLLM (e.g., KV cache, continuous batching, flash attention, etc). In week 3, we will cover more advanced topics and how the model interacts with the outside world.
+
 Why MLX: nowadays it's easier to get a macOS-based local development environment than setting up an NVIDIA GPU.
 
 Why Qwen2: this was the first LLM I've interacted with -- it's the go-to example in the vllm documentation. I spent some time looking at the vllm source code and built some knowledge around it.
@@ -35,19 +37,19 @@ Week 1 is complete. Week 2 is in progress.
 | 1.5 | Load the Model | ✅ | ✅ | ✅ |
 | 1.6 | Generate Responses (aka Decoding) | ✅ | ✅ | ✅ |
 | 1.7 | Sampling | ✅ | ✅ | ✅ |
-| 2.1 | Key-Value Cache | ✅ | 🚧 | 🚧 |
-| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
-| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
-| 2.4 | Flash Attention 2 - CPU | ✅ | 🚧 | 🚧 |
-| 2.5 | Flash Attention 2 - GPU | ✅ | 🚧 | 🚧 |
-| 2.6 | Continuous Batching | ✅ | 🚧 | 🚧 |
-| 2.7 | Chunked Prefill | ✅ | 🚧 | 🚧 |
+| 2.1 | Key-Value Cache | ✅ | | |
+| 2.2 | Quantized Matmul and Linear - CPU | ✅ | | |
+| 2.3 | Quantized Matmul and Linear - GPU | ✅ | | |
+| 2.4 | Flash Attention 2 - CPU | ✅ | | |
+| 2.5 | Flash Attention 2 - GPU | ✅ | | 🚧 |
+| 2.6 | Continuous Batching | ✅ | | |
+| 2.7 | Chunked Prefill | ✅ | | |
 | 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
 | 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
 | 3.3 | MoE (Mixture of Experts) | 🚧 | 🚧 | 🚧 |
-| 3.4 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
-| 3.5 | Prefill-Decode Separation (requires two Macintosh devices) | 🚧 | 🚧 | 🚧 |
-| 3.6 | Parallelism | 🚧 | 🚧 | 🚧 |
-| 3.7 | AI Agent / Tool Calling | 🚧 | 🚧 | 🚧 |
+| 3.4 | Speculative Decoding | 🚧 | | 🚧 |
+| 3.5 | RAG Pipeline | 🚧 | 🚧 | 🚧 |
+| 3.6 | AI Agent / Tool Calling | 🚧 | 🚧 | 🚧 |
+| 3.7 | Long Context | 🚧 | 🚧 | 🚧 |
 
 Other topics not covered: quantized/compressed kv cache, prefix/prompt cache; sampling, fine tuning; smaller kernels (softmax, silu, etc)
Other topics not covered: quantized/compressed kv cache, prefix/prompt cache; sampling, fine tuning; smaller kernels (softmax, silu, etc)
18 changes: 14 additions & 4 deletions batch-main.py
@@ -4,7 +4,7 @@
 import random
 
 parser = argparse.ArgumentParser()
-parser.add_argument("--model", type=str, default="Qwen/Qwen2-7B-Instruct-MLX")
+parser.add_argument("--model", type=str, default="qwen2-0.5b")
 
 shanghai_wikipedia = """
 Shanghai[a] is a direct-administered municipality and the most populous urban area in China. The city is located on the Chinese shoreline on the southern estuary of the Yangtze River, with the Huangpu River flowing through it. The population of the city proper is the second largest in the world after Chongqing, with around 24.87 million inhabitants in 2023, while the urban area is the most populous in China, with 29.87 million residents. As of 2022, the Greater Shanghai metropolitan area was estimated to produce a gross metropolitan product (nominal) of nearly 13 trillion RMB ($1.9 trillion).[13] Shanghai is one of the world's major centers for finance, business and economics, research, science and technology, manufacturing, transportation, tourism, and culture. The Port of Shanghai is the world's busiest container port.
@@ -38,23 +38,31 @@
 parser.add_argument("--device", type=str, default="gpu")
 parser.add_argument("--batch-size", type=int, default=5)
 parser.add_argument("--prefill-step", type=int, default=128)
+parser.add_argument("--enable-flash-attn", action="store_true")
+parser.add_argument("--enable-thinking", action="store_true")
 args = parser.parse_args()
 
 if args.solution == "tiny_llm":
     print("Using your tiny_llm solution")
-    from tiny_llm import Qwen2ModelWeek2, batch_generate
+    from tiny_llm import models, batch_generate
 
 elif args.solution == "tiny_llm_ref" or args.solution == "ref":
     print("Using tiny_llm_ref solution")
-    from tiny_llm_ref import Qwen2ModelWeek2, batch_generate
+    from tiny_llm_ref import models, batch_generate
 
 else:
     raise ValueError(f"Solution {args.solution} not supported")
 
+args.model = models.shortcut_name_to_full_name(args.model)
 mlx_model, tokenizer = load(args.model)
 
 with mx.stream(mx.gpu if args.device == "gpu" else mx.cpu):
-    tiny_llm_model = Qwen2ModelWeek2(mlx_model)
+    print(
+        f"Using week2 loader with flash_attn={args.enable_flash_attn} thinking={args.enable_thinking} for {args.model}"
+    )
+    tiny_llm_model = models.dispatch_model(
+        args.model, mlx_model, week=2, enable_flash_attn=args.enable_flash_attn
+    )
     encoded_prompts = []
     for idx, prompt in enumerate(prompts):
         print(f"Prompt {idx}: {prompt}")
@@ -66,6 +74,7 @@
             messages,
             tokenize=False,
             add_generation_prompt=True,
+            enable_thinking=args.enable_thinking,
         )
         encoded_prompts.append(prompt)
     result = batch_generate(
@@ -76,5 +85,6 @@
         prefill_step=args.prefill_step,
     )
     for prompt_idx, text in result:
+        print(f"--- {prompt_idx} ---")
         print(f"Q: {prompts[prompt_idx]}")
         print(f"A: {text}")
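The new `models.shortcut_name_to_full_name` call above resolves short model names such as `qwen2-0.5b` (the new default) to full HuggingFace repository names before loading. A hypothetical sketch of that mapping; the real table lives in the tiny_llm package and likely covers more models:

```python
# Hypothetical shortcut table; the 0.5B and 7B entries are inferred from the
# names that appear elsewhere in this PR (CI download and the old default).
SHORTCUTS = {
    "qwen2-0.5b": "Qwen/Qwen2-0.5B-Instruct-MLX",
    "qwen2-7b": "Qwen/Qwen2-7B-Instruct-MLX",
}

def shortcut_name_to_full_name(name: str) -> str:
    # Unknown names pass through unchanged, so full HF repo names still work.
    return SHORTCUTS.get(name, name)

print(shortcut_name_to_full_name("qwen2-0.5b"))
# → Qwen/Qwen2-0.5B-Instruct-MLX
```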