A demonstration of transfer learning and generative AI applied to quantum physics domain. This project fine-tunes OpenAI's GPT-2 language model on quantum physics text to generate coherent, domain-specific passages. It bridges domain expertise in quantum information and computation with modern deep learning practices.
Author: Thiago Girao - PhD candidate in Physics, researching quantum information and quantum computing
This project demonstrates how domain expertise in physics can be leveraged with state-of-the-art machine learning techniques. Rather than training a language model from scratch (computationally expensive and data-hungry), we use transfer learning to adapt a pre-trained model to specialized domain language. The quantum physics text generator serves as a practical example of:
- Transfer learning: Leveraging pre-trained models for downstream tasks
- Domain adaptation: Specializing general models to physics terminology and concepts
- Modern deep learning workflows: Proper training loops, learning rate scheduling, and evaluation metrics
-
quantum_text_generator.py: Main training and generation script- Custom PyTorch Dataset class for efficient data loading
- Training loop with gradient accumulation and learning rate scheduling
- Text generation with various sampling strategies
- Original quantum physics training data
-
evaluate_model.py: Model evaluation and comparison- Perplexity computation on held-out test data
- Comparison between base GPT-2 and fine-tuned model
- Multiple sample generation with qualitative analysis
-
requirements.txt: Python dependencies -
.gitignore: Git ignore rules -
README.md: This file
GPT-2 is a transformer-based language model consisting of a decoder stack (left-to-right self-attention):
Input Text
↓
Token Embeddings + Positional Embeddings
↓
[Transformer Decoder Block × 12] (for GPT-2 Small)
├─ Multi-head Self-Attention (12 heads)
├─ Feed-forward Network (4H hidden units)
├─ Layer Normalization
└─ Residual Connections
↓
Output Layer Norm
↓
Linear Layer + Softmax
↓
Next Token Distribution
Key properties:
- Causal masking: Attention only to previous tokens (left-to-right language modeling)
- Parameterized: ~124M parameters for GPT-2 Small
- Pre-trained: On 40GB of diverse internet text (WebText)
- Transfer learning: Fine-tune on downstream tasks with modest computational resources
We leverage domain-specific transfer learning with these practices:
- Warm-start: Load pre-trained weights from OpenAI's GPT-2
- Lower learning rate: 5e-5 (smaller steps than initial training)
- Gradient accumulation: Simulate larger batch sizes without OOM errors
- Gradient clipping: Stabilize training with max norm = 1.0
- Learning rate scheduling: Linear warmup then decay
- Warmup for 10% of training steps (avoids sudden weight changes)
- Linear decay to small learning rate
Optimizer: AdamW (weight decay = 0.01 for regularization)
Learning rate: 5e-5
Batch size: 2 (with gradient accumulation × 4)
Gradient accumulation steps: 4 (effective batch = 8)
Epochs: 3
Max sequence length: 256 tokens- AdamW: Decoupled weight decay regularization prevents overfitting
- Low learning rate: Preserves pre-trained knowledge while adapting to quantum physics domain
- Gradient accumulation: Allows larger effective batch size on limited VRAM
- Gradient clipping: Prevents exploding gradients in transformer models
- Short training: 3 epochs sufficient for fine-tuning (full retraining would be many more)
The project demonstrates three important text generation methods:
- Restrict sampling to K most likely next tokens
- Reduces low-probability nonsense
- Default: K=50
- Sample from smallest set of tokens with cumulative probability ≥ P
- More flexible than top-K (uses variable number of tokens)
- Default: P=0.95
- Control randomness of distribution
- Temperature = 1.0: unchanged distribution
- Temperature > 1.0: more random/diverse
- Temperature < 1.0: more conservative/focused
- Default: 0.8 (slightly conservative to maintain coherence)
These techniques together prevent both:
- Mode collapse (only generating one response)
- Incoherent gibberish (allowing too much randomness)
The training data consists of 20 original quantum physics paragraphs covering:
- Quantum entanglement and Bell inequalities
- Variational quantum algorithms (VQE, QAOA)
- Many-body localization and thermalization
- Quantum error correction and topological codes
- Quantum simulation and condensed matter systems
- Quantum machine learning applications
- Quantum chaos and out-of-time-order correlators
- Adiabatic quantum computing
- Quantum phase transitions
- Quantum metrology and sensing
- Quantum key distribution and cryptography
All text is original, written to demonstrate authentic quantum physics domain knowledge and to maintain focus on the ML techniques rather than copyright issues.
Definition: Perplexity = exp(average cross-entropy loss)
- Lower perplexity indicates better model predictions
- Measures how surprised the model is on unseen data
- Comparison: Fine-tuned model vs. base GPT-2 on held-out quantum physics text
Interpretation:
- Base GPT-2: Untrained on quantum physics → higher perplexity
- Fine-tuned: Adapted to quantum domain → lower perplexity
- Improvement: Quantifies benefit of domain-specific fine-tuning
Generated samples are evaluated for:
- Domain relevance: Does output use quantum physics terminology?
- Coherence: Are sentences grammatically correct and fluent?
- Factual plausibility: Could this appear in a physics paper?
- Diversity: Do multiple samples show variety or repetition?
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt# Fine-tune GPT-2 on quantum physics text
python quantum_text_generator.pyThis will:
- Load pre-trained GPT-2 and tokenizer
- Prepare quantum physics training data
- Train for 3 epochs with proper learning rate scheduling
- Save fine-tuned model to
./quantum-gpt2/ - Generate example outputs
Expected runtime: ~5-15 minutes (depending on hardware; GPU recommended)
# Compare models and generate samples
python evaluate_model.pyThis will:
- Load base GPT-2 and fine-tuned model
- Compute perplexity on held-out test data
- Generate multiple samples from various prompts
- Print detailed evaluation report
| Component | Technology | Purpose |
|---|---|---|
| Model | GPT-2 (Transformer) | Decoder-only language model |
| Framework | PyTorch | Deep learning framework |
| NLP Library | Hugging Face Transformers | Pre-trained models and training utilities |
| Optimization | AdamW + LR Scheduling | Training algorithm and scheduling |
| Tokenization | GPT-2 BPE Tokenizer | 50K subword vocabulary |
| Hardware | CUDA (GPU) or CPU | Hardware acceleration (optional) |
class QuantumPhysicsDataset(Dataset):
"""Efficient tokenization and batching for language model fine-tuning."""Handles:
- Tokenization with padding and truncation
- Efficient batch loading
- Proper label handling for causal language modeling
- Gradient accumulation (memory efficiency)
- Gradient clipping (stability)
- Learning rate scheduling (convergence)
- Progress tracking with tqdm
- Proper device handling (GPU/CPU)def generate_text(..., temperature, top_k, top_p, ...):
"""Configurable sampling strategies for diverse outputs."""Supports multiple sampling strategies and fine-grained control over generation quality.
Example outputs (fine-tuned model):
Prompt: "Quantum entanglement is"
Generated: "Quantum entanglement is a fundamental resource in quantum information processing
where the quantum state of a composite system cannot be described as a product of independent
states. This correlation structure enables protocols like quantum teleportation and distributed
quantum computing across separated nodes."
Prompt: "The variational quantum eigensolver"
Generated: "The variational quantum eigensolver combines quantum circuits with classical
optimization to find ground state energies efficiently. By variationally preparing ansatz states
and evaluating expectations on near-term devices, VQE enables chemistry simulations on
current quantum hardware."
Fine-tuned perplexity typically improves by 15-30% on quantum physics text compared to base GPT-2.
- Larger training dataset: Collect more quantum physics abstracts/papers
- Domain-specific tokenizer: Train BPE tokenizer on physics vocabulary
- Longer context: Increase max_length for multi-paragraph generation
- Conditional generation: Generate abstracts given titles
- Prompt engineering: Develop better prompts for specific physics tasks
- Quantization: Compress model for deployment
- Evaluation metrics: BLEU, ROUGE for comparison with reference texts
- Comparison models: Fine-tune GPT-3.5, LLaMA for baseline comparison
-
Attention Is All You Need: Vaswani et al., 2017
- Introduced the Transformer architecture underlying GPT-2
- arXiv:1706.03762
-
Language Models are Unsupervised Multitask Learners: Radford et al., 2019 (GPT-2 Paper)
- Demonstrated generative pre-training at scale
- OpenAI Blog
-
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: Devlin et al., 2019
- Complementary approach (bidirectional) to causal language modeling
- arXiv:1810.04805
- Hugging Face Documentation: huggingface.co/docs
- PyTorch Documentation: pytorch.org/docs
- Neural Networks & Deep Learning (Goodfellow et al., 2016): Deep learning fundamentals
Author: Thiago Girao Email: [thiagorgs@id.uff.br] Research Focus: Quantum information theory, quantum algorithms, quantum machine learning PhD Program: Physics (Quantum Information & Quantum Computing)
This project is provided as-is for educational and portfolio purposes.