<h1 align="center">vLLM + HAT</h1>

<p align="center">
🤗 <a href="https://huggingface.co/Aleph-Alpha">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2501.10322">HAT ICLR25 Paper</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 Upcoming Research Paper
</p>

This branch provides a batched inference implementation of HAT (Hierarchical Autoregressive Transformer). The fork integrates HAT into vLLM v1, so you can run or serve HAT models with the same low-latency engine you know from vLLM. HAT wraps a standard Llama-style word-level transformer (referred to as the backbone) with two small byte-level modules: an encoder and a decoder. A forthcoming research paper will give a comprehensive architectural and training deep-dive, including a closer look at each component discussed below, as well as the challenges behind batched inference for such a model.

The encoder processes the input text as raw UTF-8 bytes and produces a sequence of activations of the same length. The splitter then divides this byte sequence into words or semantically meaningful chunks. In the encoder connector layer, a learned latent vector attends, for each word, to the encoder activations of the bytes that compose that word. The backbone then processes this word-level sequence to produce word-level representations that guide decoding: to generate bytes autoregressively, the decoder uses the encoder activations of the current word and the word-level representation of the previous word.
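
As a toy illustration of this byte-to-word flow (not the model's actual code: the function names below are invented, whitespace splitting stands in for the real splitter, and simple averaging stands in for the learned latent-vector attention):

```python
def split_words(data: bytes) -> list[bytes]:
    """Toy splitter: chunk a UTF-8 byte stream at spaces.

    The real HAT splitter produces words or semantically
    meaningful chunks; whitespace splitting stands in for it here.
    """
    words, current = [], bytearray()
    for b in data:
        current.append(b)
        if b == 0x20:  # close the chunk after a space byte
            words.append(bytes(current))
            current = bytearray()
    if current:
        words.append(bytes(current))
    return words


def encode_bytes(data: bytes) -> list[float]:
    """Stand-in byte encoder: one activation per input byte."""
    return [b / 255.0 for b in data]


def pool_word(activations: list[float]) -> float:
    """Stand-in for the encoder connector layer: the real model lets a
    learned latent vector attend over the word's byte activations;
    here we simply average them."""
    return sum(activations) / len(activations)


text = "hierarchical autoregressive".encode("utf-8")
words = split_words(text)
# One vector per word enters the backbone transformer, whose outputs
# then condition the byte-level decoder.
word_reprs = [pool_word(encode_bytes(w)) for w in words]
```

Note that the chunks cover every input byte exactly once, so the byte-level decoder can always line its outputs back up with the original stream.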

Next Steps:

- Currently, our CUDA graph implementation for HAT is still based on the vLLM v0 approach. Once [PR 20059](https://github.com/vllm-project/vllm/pull/20059) is merged, we will update our implementation and open an upstream PR to vLLM.

---

# Environment Setup

### 1. Prerequisites

* **GPU**: NVIDIA GPU
* **Python**: 3.12

### 2. Clone and install

```bash
git clone <this-repository> vllm-hat
cd vllm-hat

# Create and activate a Python 3.12 virtual environment
uv venv -p 3.12
source .venv/bin/activate

# Tell vLLM to skip local compilation and use prebuilt CUDA wheels
export VLLM_USE_PRECOMPILED=1

# Finally, install in editable mode
uv pip install -e .
```

---

# Using HAT

Points to keep in mind:

- If you want to test the 70B model, make sure to specify a tensor parallel size. When testing on GPUs with 80 GB VRAM, we recommend a tensor parallel size of 4.
- Currently, HAT only works with Flash Attention 2. If testing this model on Hopper architecture or newer, make sure to export the environment variable `VLLM_FLASH_ATTN_VERSION=2`.
- Additionally, running the 70B model on H100 or newer currently does not work.
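
For example, before launching HAT on a Hopper-generation machine:

```shell
# HAT currently requires Flash Attention 2, so pin the attention
# backend version before starting vLLM on Hopper-or-newer GPUs.
export VLLM_FLASH_ATTN_VERSION=2
```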

The supported HAT models are the following:

- `Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo`
- `Aleph-Alpha/llama-3_1-8b-tfree-hat-sft`
- `Aleph-Alpha/llama-3_1-8b-tfree-hat-base`
- `Aleph-Alpha/llama-3_1-70b-tfree-hat-sft`

---

## Offline Inference

We have included an example script to run offline inference.

```bash
python hat_scripts/hat_offline_inference.py [OPTIONS]
```

**Optional Parameters:**

- `--model` - Path to the HAT model (default: `Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo`)
- `--batch-size` - Batch size for inference (default: 16)
- `--max-bytes-per-req` - Maximum number of output bytes per request (default: 1000)
- `--tensor-parallel-size` - Tensor parallelism size (default: 1)

---

## Serving Scenario (OpenAI-compatible API)

### Starting the server

```bash
vllm serve [MODEL] [OPTIONS]
```

**Example:**

```bash
vllm serve "Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo" \
    --trust-remote-code \
    --dtype bfloat16 \
    --compilation-config '{"full_cuda_graph": true, "level": 0}' \
    --max-num-batched-tokens 100000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9
```

**Required Options:**

- `--trust-remote-code` - Required for HAT models
- `--dtype bfloat16` - Required data type for HAT models
- `--compilation-config '{"full_cuda_graph": true, "level": 0}'` - Required compilation settings for HAT models

**Optional Parameters:**

- `--max-num-batched-tokens` - Maximum number of batched tokens (default: varies)
- `--tensor-parallel-size` - Tensor parallelism size (default: 1)
- `--gpu-memory-utilization` - GPU memory utilization fraction (default: 0.9)

### Sending requests

Any OpenAI-compatible client works (curl, python, etc.). For convenience, we include a script that asynchronously sends multiple requests to the server:

```bash
python hat_scripts/send_async_prompts.py [OPTIONS]
```

**Optional Parameters:**

- `--api-url` - URL of the OpenAI-compatible chat completions API endpoint (default: `http://localhost:8000/v1/chat/completions`)
- `--num-concurrent-requests` - Number of concurrent requests to send (default: 16)
- `--max-bytes-per-req` - Maximum number of output bytes per request (default: 1000)
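
For reference, a minimal hand-rolled client can hit the same endpoint. This is only a sketch, not part of the repo: `build_payload` and `send_prompt` are invented names, and `max_tokens` is assumed to cap the response length in the usual OpenAI-compatible way.

```python
import json
import urllib.request

# Default endpoint of the vLLM server started above (an assumption;
# adjust host/port to match your deployment).
API_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(prompt: str, model: str, max_tokens: int = 1000) -> dict:
    """Assemble an OpenAI-compatible chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_prompt(prompt: str,
                model: str = "Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo") -> str:
    """POST one chat request to the server and return the reply text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```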

---