bellechat

A conversational LLM trained from scratch on text published before 1914, so it has no knowledge of the world wars, nuclear physics, antibiotics, the Soviet Union, or anything else that came later. You can chat with it, and it will answer as a knowledgeable, articulate person from 1913.

Built as a fork of Karpathy's nanochat.

Model

Parameters: 1.4B (d24 transformer, 24 layers, 1536 dim, 12 heads)
Training data: 13.3B tokens from pre-1914 text (Project Gutenberg, 1911 Encyclopaedia Britannica, Chronicling America newspapers, Internet Archive)
SFT: 6,200 synthetic conversations for persona and conversational ability
Training time: ~3.5 hours pretraining + SFT on 8xH100 SXM with FP8

Model weights: david-fish/bellechat

Note: This is a 1.4B-parameter model, so expect some incoherence. The SFT is also likely overfit. I plan to scale up significantly and improve the SFT as I gain access to more compute.

Quick start

Requirements: Python 3.10+ and Git.

git clone https://github.com/davidfish-g/bellechat.git
cd bellechat
python run.py

This installs dependencies, downloads the model, starts the server, and opens the chat interface at http://localhost:8000 in your browser.
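If the launcher fails early, it can help to confirm the two stated requirements first. The snippet below is a minimal sketch of such a check and is not part of run.py; the function name `check_requirements` and its parameters are hypothetical and exist only for this illustration.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # from the stated requirement "Python 3.10+"


def check_requirements(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    # Compare only (major, minor); default to the running interpreter.
    version = tuple((version_info or sys.version_info)[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if which("git") is None:
        problems.append("git not found on PATH")
    return problems


print(check_requirements() or "ready")
```

The `version_info` and `which` parameters are injectable only so the check can be exercised without changing the local environment.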

Data sources

Source	Files	Tokens	Description
Project Gutenberg	20,729	2.68B	Pre-1914 books filtered by author death date
1911 Britannica	36,221	0.07B	Complete 11th edition (5x upweighted in training)
Chronicling America	1.3M	6.7B	LOC newspaper OCR, pages below 80% word validity rejected
Internet Archive	21,935	2.3B	Pre-1914 texts via IA search API
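To see what the 5x Britannica upweighting does to the training mix, the arithmetic can be sketched as below. This assumes that "upweighted" means each Britannica token is simply seen five times, which the table does not confirm. The total from this sketch is not meant to reproduce the 13.3B figure listed under Model, which may be counted differently.

```python
# Token counts in billions, copied from the table above.
SOURCE_TOKENS_B = {
    "Project Gutenberg": 2.68,
    "1911 Britannica": 0.07,
    "Chronicling America": 6.7,
    "Internet Archive": 2.3,
}

# Assumption: "5x upweighted" means each Britannica token is seen five times.
UPWEIGHT = {"1911 Britannica": 5}


def effective_mix(tokens, upweight):
    """Return (effective token counts, share of the mix) per source."""
    effective = {
        name: count * upweight.get(name, 1) for name, count in tokens.items()
    }
    total = sum(effective.values())
    shares = {name: count / total for name, count in effective.items()}
    return effective, shares


effective, shares = effective_mix(SOURCE_TOKENS_B, UPWEIGHT)
for name in SOURCE_TOKENS_B:
    print(f"{name:<20} {effective[name]:5.2f}B  {shares[name]:6.1%}")
```

Under that assumption, Britannica grows from roughly 0.6% of the raw tokens to roughly 3% of the effective mix, while newspapers remain the largest source by a wide margin.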

Data quality improvements

Date filtering: Gutenberg books filtered by author death date. Newspaper pages filtered by publication date in file path. IA texts filtered by date metadata.
OCR quality: Newspaper pages below 80% word validity ratio are rejected.
Boilerplate removal: Modern Project Gutenberg and Internet Archive headers/footers are stripped at shard time.
Anachronism detection: Corpus scanned for post-1914 terms. Contamination rate: <0.01%.
SFT conversations: Generated with explicit constraints against modern knowledge leakage.
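The filters listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this repository: the function names, the tiny `VOCABULARY` set, and the `ANACHRONISMS` list are hypothetical placeholders, and the death-date rule assumes "filtered by author death date" means the author died before 1914.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")

# Hypothetical stand-ins: a real run would use a full period dictionary
# and a much longer list of post-1914 terms.
VOCABULARY = {
    "the", "of", "and", "a", "in", "to", "was",
    "ship", "arrived", "port", "yesterday",
}
ANACHRONISMS = ("world war", "soviet union", "penicillin", "television")


def word_validity_ratio(text, vocabulary=VOCABULARY):
    """Fraction of alphabetic tokens that appear in the vocabulary."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


def passes_ocr_filter(text, threshold=0.80, vocabulary=VOCABULARY):
    """Keep a page only if its word validity ratio reaches the threshold."""
    return word_validity_ratio(text, vocabulary) >= threshold


def find_anachronisms(text, terms=ANACHRONISMS):
    """Return the listed terms that occur in the text as whole phrases."""
    lowered = text.lower()
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", lowered)]


def author_died_before(death_year, cutoff=1914):
    """Date filter for books: unknown death years are rejected, not guessed."""
    return death_year is not None and death_year < cutoff


clean_page = "The ship arrived in port yesterday"
noisy_page = "Tbe sbip arrivcd in port yestcrday"
print(passes_ocr_filter(clean_page), passes_ocr_filter(noisy_page))
```

Rejecting books with an unknown death year is a conservative choice made for this sketch; it trades some recall for a lower risk of letting later text into the corpus.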
