
Commit 3bce68b

Merge branch 'ChrisDryden-script_to_download_tokenized_dataset'
2 parents 483f675 + b29478e

2 files changed: +89 -8

README.md (+14 -8)

@@ -13,28 +13,34 @@ debugging tip: when you run the `make` command to build the binary, modify it by
If you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were "checkpointed" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:

```bash
-pip install -r requirements.txt
-python dev/data/tinyshakespeare.py
-python train_gpt2.py
+chmod u+x ./dev/download_starter_pack.sh
+./dev/download_starter_pack.sh
make train_gpt2fp32cu
./train_gpt2fp32cu
```

-The above lines (1) download the [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset and tokenize it with the GPT-2 Tokenizer, (2) download and save the GPT-2 (124M) weights, (3) init from them in C/CUDA and train for one epoch on tinyshakespeare with AdamW (using batch size 4, context length 1024, total of 74 steps), evaluate validation loss, and sample some text.
+The download_starter_pack.sh script is a quick and easy way to get started: it downloads a handful of .bin files that get you off the ground. These contain (1) the GPT-2 124M model weights saved in fp32 and in bfloat16, (2) a "debug state" used in unit testing (a small batch of data with target activations and gradients), (3) the GPT-2 tokenizer, and (4) the tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset. Alternatively, instead of running the .sh script, you can re-create these artifacts manually as follows:
+
+```bash
+pip install -r requirements.txt
+python dev/data/tinyshakespeare.py
+python train_gpt2.py
+```
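
After running the starter pack script, the downloaded artifacts can be checked with something like the following, run from the repo root. This is only a sketch; the filenames and locations come from the FILES array and save directories in the dev/download_starter_pack.sh script shown further down.

```bash
# Model, tokenizer, and debug-state binaries land in the repo root;
# the tokenized tinyshakespeare shards land under dev/data/tinyshakespeare/.
ls -lh gpt2_124M.bin gpt2_124M_bf16.bin gpt2_124M_debug_state.bin gpt2_tokenizer.bin
ls -lh dev/data/tinyshakespeare/tiny_shakespeare_*.bin
```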

## quick start (CPU)

The "I am so GPU poor that I don't even have one GPU" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. For example, instead of training from scratch, you can finetune a GPT-2 small (124M) to output Shakespeare-like text:

```bash
-pip install -r requirements.txt
-python dev/data/tinyshakespeare.py
-python train_gpt2.py
+chmod u+x ./dev/download_starter_pack.sh
+./dev/download_starter_pack.sh
make train_gpt2
OMP_NUM_THREADS=8 ./train_gpt2
```
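
The `OMP_NUM_THREADS=8` in the snippet is just a starting point; with more cores available you can crank it up. A hedged sketch (assumes `nproc` from GNU coreutils; on macOS use `sysctl -n hw.logicalcpu` instead):

```bash
# Use all available logical cores for the OpenMP thread pool.
OMP_NUM_THREADS=$(nproc) ./train_gpt2
```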

-The above lines (1) download the [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset and tokenize it with the GPT-2 Tokenizer, (2) download and save the GPT-2 (124M) weights, (3) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):
+If you'd prefer to avoid running the starter pack script, then, as mentioned in the previous section, you can reproduce the exact same .bin files and artifacts by running `python dev/data/tinyshakespeare.py` and then `python train_gpt2.py`.
+
+The above lines (1) download an already tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset along with the GPT-2 (124M) weights, and (2) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):

```
[GPT-2]

dev/download_starter_pack.sh (new file, +75)

```bash
#!/bin/bash

# Get the directory of the script
SCRIPT_DIR=$(dirname "$(realpath "$0")")

# Base URL
BASE_URL="https://huggingface.co/datasets/chrisdryden/llmcDatasets/resolve/main/"

# Directory paths based on script location
SAVE_DIR_PARENT="$SCRIPT_DIR/.."
SAVE_DIR_TINY="$SCRIPT_DIR/data/tinyshakespeare"

# Create the directories if they don't exist
mkdir -p "$SAVE_DIR_TINY"

# Files to download
FILES=(
    "gpt2_124M.bin"
    "gpt2_124M_bf16.bin"
    "gpt2_124M_debug_state.bin"
    "gpt2_tokenizer.bin"
    "tiny_shakespeare_train.bin"
    "tiny_shakespeare_val.bin"
)

# Function to download files to the appropriate directory
download_file() {
    local FILE_NAME=$1
    local FILE_URL="${BASE_URL}${FILE_NAME}?download=true"
    local FILE_PATH

    # Determine the save directory based on the file name
    if [[ "$FILE_NAME" == tiny_shakespeare* ]]; then
        FILE_PATH="${SAVE_DIR_TINY}/${FILE_NAME}"
    else
        FILE_PATH="${SAVE_DIR_PARENT}/${FILE_NAME}"
    fi

    # Download the file
    curl -s -L -o "$FILE_PATH" "$FILE_URL"
    echo "Downloaded $FILE_NAME to $FILE_PATH"
}

# Export the function so it's available in subshells
export -f download_file

# Generate download commands
download_commands=()
for FILE in "${FILES[@]}"; do
    download_commands+=("download_file \"$FILE\"")
done

# Function to manage parallel jobs in increments of a given size
run_in_parallel() {
    local batch_size=$1
    shift
    local i=0
    local command

    for command; do
        eval "$command" &
        ((i = (i + 1) % batch_size))
        if [ "$i" -eq 0 ]; then
            wait
        fi
    done

    # Wait for any remaining jobs to finish
    wait
}

# Run the download commands in parallel in batches of 6
run_in_parallel 6 "${download_commands[@]}"

echo "All files downloaded and saved in their respective directories"
```
