This project explores generative text models, focusing on Recurrent Neural Networks (RNNs) and Transformer-based language models. It involves implementing and experimenting with architectures and attention mechanisms such as bidirectional RNNs, Transformers, Sliding Window Attention, Rotary Positional Embeddings (RoPE), and Grouped Query Attention (GQA). The goal is to understand how different generative techniques affect language modeling quality and computational efficiency.
├── requirements.txt
├── input.txt
├── chargpt.py
├── mingpt/
│   ├── model.py
│   ├── trainer.py
│   └── utils.py
├── test_model.py
└── README.md
To set up the environment locally, follow these steps:
- Install Python dependencies:
pip install torch einops
- Run the main training script:
python chargpt.py
The RNN language model uses an Elman network whose hidden state is updated as follows:

$$h_t = \mathrm{slide}(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$$

$$y_t = \mathrm{slide}(W_{yh} h_t + b_y)$$

where $\mathrm{slide}(a) = \min(1, \max(0, a))$ ensures values stay within a fixed range. The model processes sequential text data, capturing dependencies over time.
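As a rough illustration of this update rule (a minimal sketch with assumed tensor shapes and names, not the repository's implementation), a single Elman step in PyTorch could look like:

```python
import torch

def slide(a):
    # slide(a) = min(1, max(0, a)): clip activations to [0, 1].
    return torch.clamp(a, 0.0, 1.0)

def elman_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One Elman RNN step (assumed shapes: x_t (d_in,), h_prev (d_h,),
    W_hx (d_h, d_in), W_hh (d_h, d_h), b_h (d_h,), W_yh (d_out, d_h), b_y (d_out,))."""
    h_t = slide(W_hh @ h_prev + W_hx @ x_t + b_h)   # hidden-state update
    y_t = slide(W_yh @ h_t + b_y)                   # per-step output
    return h_t, y_t
```

Unrolling this step over the token sequence gives the model's forward pass.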
- Implemented a simple RNN-based language model.
- Explored bidirectional RNNs and their inability to serve as autoregressive models.
- Compared how different architectures handle sequential context.
The Transformer model is based on scaled dot-product attention:

$$s_{t,j} = \frac{k_j^\top q_t}{\sqrt{d_k}}, \qquad a_t = \mathrm{softmax}(s_t)$$

where $d_k$ is the key dimension and the queries, keys, and values are computed as:

$$q_j = W_q x_j, \qquad k_j = W_k x_j, \qquad v_j = W_v x_j$$
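For reference, a minimal single-head version of this computation (with an optional causal mask; shapes and names here are assumptions, not the course starter code) is:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=True):
    """q, k, v: (T, d_k) tensors for one head; returns (T, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # s[t, j] = k_j . q_t / sqrt(d_k)
    if causal:
        T = q.size(0)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float('-inf'))   # block attention to future tokens
    attn = F.softmax(scores, dim=-1)                         # a_t = softmax(s_t)
    return attn @ v                                          # weighted sum of values
```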
We also analyzed alternative attention mechanisms, such as multiplicative and additive attention.
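For comparison, the standard score functions for these two variants are shown below; the projection matrices $W$, $W_1$, $W_2$ and the vector $v_a$ are learned parameters introduced here purely for illustration:

$$s_{t,j} = q_t^\top W k_j \quad \text{(multiplicative)}, \qquad s_{t,j} = v_a^\top \tanh(W_1 q_t + W_2 k_j) \quad \text{(additive)}$$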
- Implemented scaled dot-product attention.
- Explored multiplicative attention and its impact on model expressiveness.
- Analyzed self-attention properties, including conditions for symmetry.
Sliding Window Attention improves efficiency by restricting each query's context to a fixed window of size $w$, rather than attending to the entire sequence (a masking sketch follows the list below).
- Defined causal masks for attention computation.
- Reduced time complexity from $O(N^2)$ to $O(Nw)$.
- Reduced space complexity from $O(N^2)$ to $O(N + w)$.
- Implemented optimized Sliding Window Attention.
- Evaluated computational efficiency against naive matrix multiplication.
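The masking sketch referenced above (assumed shapes; the assignment's optimized version avoids materializing the full $T \times T$ matrix) builds a mask that lets position $t$ attend only to the previous $w$ tokens, itself included:

```python
import torch

def sliding_window_causal_mask(T, w):
    """Boolean (T, T) mask: True where query t may attend to key j,
    i.e. t - w < j <= t (causal and restricted to a window of size w)."""
    idx = torch.arange(T)
    rel = idx[:, None] - idx[None, :]   # rel[t, j] = t - j
    return (rel >= 0) & (rel < w)

# Example: with T=5 and w=2, each row has at most two True entries.
print(sliding_window_causal_mask(5, 2))
```

Applying this mask inside a naive attention implementation still costs $O(N^2)$; the complexity gains above come from computing only the $w$ scores each query actually needs.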
RoPE encodes relative positional information directly into the attention mechanism, replacing absolute position embeddings.
- Implemented RoPE in the `RotaryPositionalEmbeddings` class.
- Modified the `CausalSelfAttention` class to integrate RoPE embeddings.
- Compared text samples generated with and without RoPE.
- Evaluated training loss across different training schedules.
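A compact sketch of the rotation RoPE applies (an assumed standalone function, not necessarily the interface of the `RotaryPositionalEmbeddings` class) pairs up feature channels and rotates each pair by a position-dependent angle:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate x of shape (T, d), d even: channel pair (2i, 2i+1) at position t
    is rotated by angle t * theta_i, where theta_i = base ** (-2i / d)."""
    T, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)        # (d/2,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * theta[None, :]  # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The rotation is applied to queries and keys (not values) before the dot product, so the resulting attention scores depend only on relative positions.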
GQA reduces memory usage by sharing key-value pairs across query groups, balancing efficiency and performance.
- Implemented `GroupedQueryAttention` in `model.py`.
- Modified the attention mechanism to support grouped query heads.
- Measured attention computation time across different numbers of key heads.
- Compared training loss between standard multi-head attention and GQA.
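The idea can be sketched as follows (assumed head layout and shapes; the repository's `GroupedQueryAttention` may organize tensors differently): each group of query heads shares a single key/value head, which is repeated before the usual attention computation:

```python
import math
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d), with n_q_heads % n_kv_heads == 0."""
    n_q, T, d = q.shape
    group = n_q // k.shape[0]
    k = k.repeat_interleave(group, dim=0)   # share each KV head across `group` query heads
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)              # (n_q_heads, T, T)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float('-inf'))           # causal mask
    return F.softmax(scores, dim=-1) @ v                         # (n_q_heads, T, d)
```

Fewer key/value heads mean a smaller KV cache and fewer key/value projections, which is where the memory savings come from.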
- Implemented and tested various attention mechanisms.
- Trained models using Shakespeare’s works as a dataset.
- Logged results with Weights & Biases (wandb) for analysis.
- Train the language model:
python chargpt.py --trainer.max_iters=600 --model.rope=True
- Run unit tests for verification:
python test_model.py
- View experiment logs with Weights & Biases.
- RNNs struggle with long-term dependencies; Transformers improve contextual modeling.
- RoPE enhances positional encoding in attention layers.
- Sliding Window Attention and GQA improve efficiency without major performance losses.
This project is part of 10-623 Generative AI at Carnegie Mellon University, with datasets and starter code provided by the course instructors.