wodehouse-gpt

A GPT-style transformer built from the ground up: no Hugging Face transformers library and no pre-trained base model, just raw PyTorch. The model learns to write like Wodehouse by predicting the next character in his books.

Quick Start

git clone git@github.com:LostWarrior/wodehouse-gpt.git
cd wodehouse-gpt
./setup
./jeeves "Jeeves entered the room"

A pre-trained model (model.pt) is included in the repo. ./setup just installs dependencies.

./jeeves                                    # interactive mode
./jeeves "Jeeves entered the room"          # single prompt
./jeeves "It was a" --chars 300 --temp 0.5  # with options
./jeeves --help
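
The --temp option controls how adventurous the sampling is. As a rough illustration of temperature sampling in general (not the code in generate.py, and written in plain Python rather than PyTorch so it runs without dependencies), the sketch below divides the scores by the temperature before the softmax. The function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
cool = softmax_with_temperature(logits, 0.5)  # more mass on the top score
warm = softmax_with_temperature(logits, 1.0)
next_id = sample_index(logits, temperature=0.5)
```

With a low temperature such as 0.5 the most likely character dominates, which tends to give safer but more repetitive text; values near 1.0 sample closer to the model's raw distribution.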

To use jeeves from anywhere, either symlink it (writing to /usr/local/bin may require sudo):

ln -s "$(pwd)/jeeves" /usr/local/bin/jeeves

Or add the project to your PATH in ~/.zshrc (or ~/.bashrc):

export PATH="$HOME/path/to/wodehouse-gpt:$PATH"

Sample Output (v4, 17.9M parameters)

> Jeeves
Jeeves . ."

Jill thoughtfully. In she had never done with a feeling that she
fended that there was a bit on the street and subserved on and
all that sort of life had been leaned for a splendid composer of
the leader Certain Cambeth to her that the house on the mouth.

> Hello Bertie
Hello Bertie, where he was feeling with the Reggie, but in which
an absolute creature of the netting of the local dinner was still
bunitoring.

"Here's a bit. I suppose I went to the house as well. He waited
here when I see somewhom repried in Amar to me found through one
again and refuge to the little tree and correct when I have
expended the sort of potted speech fate in the necklace

Not Shakespeare (or Wodehouse), except maybe after a few drinks.

Training Data

30 P.G. Wodehouse novels (11.2 million characters) sourced from Project Gutenberg via the edwardjross/wodehouse dataset. Includes Jeeves & Wooster stories, Psmith novels, and more.

Architecture

Decoder-only transformer, same family as GPT. Every component built from scratch.

Input text
  |
  v
Tokenizer              character -> integer ID (76 unique characters)
  |
  v
Character Embedding    ID -> vector of 64-384 learned features
  +
Position Embedding     position -> vector (learned, not sinusoidal)
  |
  v
Transformer Block x N
  |-- LayerNorm -> Multi-Head Self-Attention (causal mask) -> + residual
  |-- LayerNorm -> Feed-Forward (expand 4x, ReLU, compress) -> + residual
  |
  v
Final LayerNorm -> Linear -> 76 scores (one per character)
  |
  v
Next character prediction
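
The first step of the diagram, character -> integer ID, can be sketched in a few lines. The function names build_vocab, encode and decode match the ones listed for tokenizer.py, but the signatures and return values here are assumptions, not the repo's actual API.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID (sorted so IDs are deterministic)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    """Characters -> IDs. Raises KeyError for a character that is not in the vocabulary."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """IDs -> characters."""
    return "".join(itos[i] for i in ids)

corpus = "What ho, Jeeves!"  # stand-in for the real books
stoi, itos = build_vocab(corpus)
ids = encode("Jeeves", stoi)
roundtrip = decode(ids, itos)
```

Run over the full corpus, a vocabulary built this way would contain one entry per distinct character, which is where a figure like 76 comes from.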

Tokenization: Character-level (76 unique characters, no subword/BPE)
Attention: Multi-head self-attention with causal mask (each position only sees past characters)
Feed-Forward: Expand to 4x embed_dim, ReLU, compress back
Normalization: Pre-norm (LayerNorm before each sublayer, GPT-2 style)
Device: Apple MPS, CUDA, or CPU
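
To make the causal mask concrete, here is a minimal single-head sketch in plain Python. It is an illustration only: it skips the learned query/key/value projections and multiple heads that a real implementation has, and it is not taken from model.py.

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats). Position t attends
    only to positions 0..t, so nothing leaks in from the future.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Score the current position against itself and earlier positions only.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the visible value vectors.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)

# Changing the last position must not affect the earlier outputs.
x_changed = x[:2] + [[5.0, -3.0]]
out_changed = causal_attention(x_changed, x_changed, x_changed)
```

The first position can only see itself, so its output equals its own value vector, and out_changed[:2] matches out[:2] because later characters are masked out.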

Model Versions

Version  embed_dim  Layers  Heads  Params  Val Loss  Quality
v1       64         4       4      226K    1.86      Gibberish
v2       128        6       4      1.2M    1.42      English-ish
v3       256        8       8      6.4M    1.27      Recognizable Wodehouse
v4       384        10      8      17.9M   1.25      Wodehouse-ish prose
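
The Params column can be roughly reproduced from the other columns. The sketch below is a back-of-envelope count, not something read from model.py, and it rests on several assumptions: biases on every linear layer, LayerNorm with scale and shift, an output layer that does not share weights with the embedding, and a 256-entry learned position table for every version.

```python
def count_params(vocab_size, embed_dim, num_layers, max_seq_len):
    """Estimate parameters of a GPT-style decoder under the assumptions stated above."""
    d = embed_dim
    tok_emb = vocab_size * d
    pos_emb = max_seq_len * d
    attn = 4 * (d * d + d)                       # query, key, value and output projections
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand 4x, then compress back
    norms = 2 * (2 * d)                          # two LayerNorms per block
    block = attn + ffn + norms
    final_norm = 2 * d
    head = d * vocab_size + vocab_size           # linear layer to per-character scores
    return tok_emb + pos_emb + num_layers * block + final_norm + head

# (embed_dim, layers) pairs from the table; vocab of 76, context of 256 assumed throughout.
estimates = {
    "v1": count_params(76, 64, 4, 256),
    "v2": count_params(76, 128, 6, 256),
    "v3": count_params(76, 256, 8, 256),
    "v4": count_params(76, 384, 10, 256),
}
```

Under these assumptions the estimates round to the table's figures (about 226K, 1.2M, 6.4M and 17.9M). Almost all of the count comes from the 12 x embed_dim^2 weights in each block, which is why widening the model is so much more expensive than adding characters to the vocabulary.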

Project Structure

model.py           # complete transformer (MultiHeadAttention, FeedForward, TransformerBlock, WodehouseGPT)
tokenizer.py       # character-level tokenizer (build_vocab, encode, decode)
generate.py        # text generation (temperature sampling, interactive/CLI modes)
jeeves             # CLI wrapper - run ./jeeves "prompt" from the repo root, or from anywhere once it is on your PATH
config.py          # all model and training settings
demos/             # step-by-step learning demos
  attention.py           # single-head self-attention
  multihead_attention.py # multi-head attention
  feedforward.py         # expand/ReLU/compress
  embedding_demo.py      # token IDs -> learned vectors
  positional_demo.py     # adding position information
  layernorm_demo.py      # normalization + residual connections
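
As a taste of what the normalization and residual demos cover, here is a dependency-free sketch of the pre-norm wiring used in each block: x + sublayer(LayerNorm(x)). It is an illustration rather than the contents of layernorm_demo.py; the learned scale and shift are left out, and relu_sublayer is a hypothetical stand-in for attention or the feed-forward layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to roughly zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_residual(x, sublayer):
    """Pre-norm wiring: normalize, apply the sublayer, then add the original input back."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def relu_sublayer(v):
    # Hypothetical stand-in for attention or feed-forward.
    return [max(0.0, u) for u in v]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
out = pre_norm_residual(x, relu_sublayer)
```

Because the input is added back unchanged, a sublayer that outputs zeros leaves x untouched, which is part of why deep stacks of these blocks remain trainable.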

For Developers

Training

./setup-dev                 # installs dependencies + downloads books
python3 train.py            # start fresh (~30-60 min on Apple MPS)
python3 train.py --resume   # pick up from last checkpoint

Training saves model.pt and vocab.json.

Configuration

Edit config.py; both train.py and generate.py read from it:

embed_dim = 384
num_heads = 8
num_layers = 10
max_seq_len = 256
batch_size = 16
learning_rate = 3e-4
max_steps = 10000

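A larger embed_dim and more layers generally give better output but slower training, with diminishing returns: in the Model Versions table, going from v3 to v4 nearly tripled the parameter count for a 0.02 drop in validation loss.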
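
A few quantities follow directly from these settings. The sketch below wraps the values in a dataclass purely for illustration (config.py itself may just be plain variables) and assumes each training step consumes one batch of full-length sequences with no gradient accumulation. The Config class, its methods and CORPUS_CHARS are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Config:
    embed_dim: int = 384
    num_heads: int = 8
    num_layers: int = 10
    max_seq_len: int = 256
    batch_size: int = 16
    learning_rate: float = 3e-4
    max_steps: int = 10000

    def head_dim(self):
        # Each attention head gets an equal slice of the embedding.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads

    def chars_per_step(self):
        return self.batch_size * self.max_seq_len

    def total_chars_seen(self):
        return self.chars_per_step() * self.max_steps

cfg = Config()
CORPUS_CHARS = 11_200_000  # corpus size quoted under Training Data
passes = cfg.total_chars_seen() / CORPUS_CHARS
```

With the values above, each head works on 48 dimensions and a step covers 4,096 characters, so 10,000 steps see about 41 million characters, or roughly 3.7 passes over the corpus if all of it were used for training.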
