anuj0456/OpenArch

OpenArch

Python implementations of modern open-source LLM architectures — written from scratch, one model at a time.

This repository contains hand-written PyTorch implementations of the model architectures cataloged in Sebastian Raschka's LLM Architecture Gallery. Each model is implemented to the best of my knowledge from the original papers, technical reports, reference config.json files, and the excellent writeups by Sebastian Raschka and Machine Learning Mastery.

The goal is not to compete with Hugging Face transformers or other production libraries. The goal is clarity and learning: a single readable file per architecture, with the structural choices (attention type, normalization, layer mix, MoE routing, positional encoding) made explicit and easy to compare side by side.

Why this repo?

Modern LLM architectures share a common skeleton but differ in dozens of small, important choices:

  • Attention: MHA, GQA, MQA, MLA, sliding-window, linear/DeltaNet hybrids
  • Normalization: pre-norm, post-norm, QK-Norm, sandwich norm, RMSNorm
  • Positional encodings: RoPE, NoPE, partial RoPE, YaRN
  • Decoder type: dense vs sparse MoE (with or without shared experts), hybrid Mamba/attention
  • Training-time tricks: multi-token prediction, latent experts, gated attention
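
As a small illustration of how closely related some of these choices are, the sketch below shows that MHA, GQA, and MQA differ only in how many key/value heads the query heads share. It is a framework-free sketch, not code from this repository; the function names `kv_head_for_query_head` and `attention_kind` are hypothetical and chosen for this example only.

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head that query head `q_head` reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads  # number of query heads sharing one KV head
    return q_head // group_size


def attention_kind(n_heads: int, n_kv_heads: int) -> str:
    """Name the attention variant implied by the head counts."""
    if n_kv_heads == n_heads:
        return "MHA"  # every query head has its own KV head
    if n_kv_heads == 1:
        return "MQA"  # all query heads share a single KV head
    return "GQA"      # query heads are split into groups, one KV head per group


# Same 8 query heads, three different KV-head counts.
for n_kv in (8, 2, 1):
    mapping = [kv_head_for_query_head(q, 8, n_kv) for q in range(8)]
    print(attention_kind(8, n_kv), mapping)
```

With 8 query heads and 2 KV heads, query heads 0 to 3 read from KV head 0 and heads 4 to 7 read from KV head 1, which is the grouping that shrinks the KV cache relative to MHA.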

Reading the official model code can be hard because production repos optimize for speed, sharding, and backward compatibility. This repo optimizes for reading.

What's implemented (so far)

Implementations marked ✅ are usable for forward passes; those marked 🚧 are under construction.

| Model | Status | Model Size | Normalization | Positional Encoding | Attention | Mixture of Experts |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 XL | | 1.5B | - | Absolute | Multihead Attention | No |
| Llama 2 | | 7B | RMS Norm | RoPE | Multihead Attention | No |
| Llama 3 | | 8B | RMS Norm | RoPE | Grouped Query Attention | No |
| OLMo 2 | | 7B | RMS Norm & QK-Norm | RoPE | Multihead Attention | No |
| DeepSeek R1 | | 671B | RMS Norm & QK-Norm | RoPE | Multihead Latent Attention | Yes |
| Gemma 3 | | 27B | RMS Norm & QK-Norm | RoPE | Grouped Query Attention with Sliding Window | No |
| Mistral 3 | | 24B | RMS Norm | RoPE | Grouped Query Attention with Sliding Window | No |
| Llama 4 Maverick | | 400B | RMS Norm | RoPE | Grouped Query Attention | Yes |
| Qwen 3 | | 4B | RMS Norm & QK-Norm | RoPE | Grouped Query Attention | No |
| Qwen 3 | | 30B-A3B | RMS Norm & QK-Norm | RoPE | Grouped Query Attention | Yes |
| Kimi K2 | | 1T | RMS Norm | RoPE | Multihead Latent Attention | Yes |
| GLM 4.5 | | 355B | RMS Norm & QK-Norm | RoPE | Grouped Query Attention & Multi-Token Prediction | Yes |
| GPT-OSS | | 20B | RMS Norm | RoPE | Grouped Query Attention with Sliding Window | Yes |
| Grok-2.5 | 🚧 | 270B | RMS Norm | RoPE | Grouped Query Attention | Yes |
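
One way to make the side-by-side comparison concrete is to treat each table row as a small record and diff two records. The sketch below is only an illustration: `ArchSpec` and `diff_specs` are hypothetical names that do not exist in this repository, and the values are copied from three rows of the table above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ArchSpec:
    """One row of the comparison table (hypothetical helper type)."""
    name: str
    size: str
    normalization: str
    positional_encoding: str
    attention: str
    moe: bool


# Values copied from the table; not read from any config.json.
SPECS = [
    ArchSpec("Llama 2", "7B", "RMS Norm", "RoPE", "Multihead Attention", False),
    ArchSpec("Llama 3", "8B", "RMS Norm", "RoPE", "Grouped Query Attention", False),
    ArchSpec("GPT-OSS", "20B", "RMS Norm", "RoPE",
             "Grouped Query Attention with Sliding Window", True),
]


def diff_specs(a: ArchSpec, b: ArchSpec) -> dict:
    """Map each field that differs to its (a, b) pair of values."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Structurally, Llama 2 7B and Llama 3 8B differ only in the attention column.
print(diff_specs(SPECS[0], SPECS[1]))
```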

The full target list mirrors the 72 architectures in the Architecture Gallery. Contributions toward any of them are welcome.

Repository layout

OpenArch/
├── gpt2/
│   ├── model.py
│   └── README.md
├── llama3/
├── qwen3/
├── deepseek_v3/
├── README.md
└── requirements.txt

Each model lives in its own folder, with its own model.py and a short README.md describing the architectural choices and references used.
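
The layout convention above can be checked mechanically. The following is a minimal sketch, assuming every top-level non-hidden directory is a model folder; `missing_files` is a hypothetical helper and is not part of this repository.

```python
from pathlib import Path

# Files each model folder is expected to contain, per the layout above.
REQUIRED_FILES = ("model.py", "README.md")


def missing_files(repo_root: Path) -> dict[str, list[str]]:
    """Map each model folder to the required files it lacks."""
    problems = {}
    for folder in sorted(p for p in repo_root.iterdir() if p.is_dir()):
        if folder.name.startswith("."):
            continue  # skip .git and similar hidden directories
        missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
        if missing:
            problems[folder.name] = missing
    return problems
```

Calling `missing_files(Path("."))` from the repository root returns an empty dict when every folder follows the convention.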

Contributing

I am actively looking for contributors. If you enjoy reading model papers, comparing config.json files, or just want to deepen your understanding of how modern LLMs are built, this is a friendly place to start.

Good first contributions:

  • Pick an unimplemented model from the gallery and add a model.py for it
  • Add a README.md for an existing model documenting its architectural choices
  • Add a forward-pass test that loads the official weights and matches outputs on a few tokens
  • Fix bugs, improve docstrings, or refactor shared components
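
For the forward-pass test idea in the list above, the core check is that the logits from a from-scratch model stay within a small tolerance of the reference logits. The sketch below shows that comparison on plain Python sequences so it runs without any dependencies; the helper names and the `atol=1e-4` default are assumptions for illustration. With PyTorch tensors, `torch.allclose` or `torch.testing.assert_close` would be the usual choice.

```python
def max_abs_diff(ours, reference):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(ours) != len(reference):
        raise ValueError("length mismatch between our logits and the reference")
    return max((abs(a - b) for a, b in zip(ours, reference)), default=0.0)


def logits_match(ours, reference, atol=1e-4):
    """True if every logit is within `atol` of the reference value."""
    return max_abs_diff(ours, reference) <= atol


# Toy values standing in for one token's logits; not real model outputs.
ours = [0.10, -1.25, 3.40]
reference = [0.10, -1.25, 3.40005]
print(logits_match(ours, reference))
```

A suitable tolerance depends on dtype and on which kernels the reference implementation uses, so it is worth stating it explicitly in each test.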

Please open an issue before starting a large piece of work so we can avoid duplicating effort. Implementations should prioritize readability over performance — this is a learning resource first.

See CONTRIBUTING.md for more details.

Acknowledgements

This repository would not exist without the two educational sources credited above: Sebastian Raschka's LLM Architecture Gallery and writeups, and Machine Learning Mastery.

Any errors in the implementations here are entirely my own.

License

This project is licensed under the Apache License 2.0 — see LICENSE for details. Individual model implementations follow the licenses of the original models where applicable; see each model's folder for specifics.

Disclaimer

These implementations are written to the best of my knowledge based on publicly available papers, technical reports, configuration files, and educational material. They are intended as a learning resource and are not affiliated with or endorsed by the original model authors. For production use, please use the official implementations or Hugging Face transformers.
