DeepSeek V3.2 Sparse Attention Research

Experimental implementation and analysis of DeepSeek V3.2's sparse attention mechanisms. This repository contains systematic experiments comparing sparse vs dense attention across different architectures and sequence lengths.

📖 Blog Post: DeepSeek Sparse Attention

🎯 Overview

This repository implements and evaluates DeepSeek V3.2's sparse attention innovations (a minimal sketch of the indexer and top-k step appears after this list):

  • Lightning Indexer: Token relevance scoring mechanism
  • Top-K Selection: Dynamic sparse attention patterns
  • Multi-Head Latent Attention (MHLA): Efficient KV compression
  • Mixture of Experts Integration: MoE with sparse attention
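
For orientation, here is a minimal sketch of how the first two pieces fit together: a lightweight indexer scores key relevance per query, and a top-k step keeps only the highest-scoring keys. Function names and dimensions below are illustrative assumptions, not the repo's actual components.py.

```python
# Illustrative sketch only (assumed names/dims, not the repo's components.py):
# a small indexer scores key relevance per query, then top-k keeps the best keys.
import torch

def lightning_indexer_scores(q_idx: torch.Tensor, k_idx: torch.Tensor) -> torch.Tensor:
    """Cheap relevance scores from low-dimensional query/key projections.
    q_idx, k_idx: [batch, seq, d_index] -> scores: [batch, seq, seq]."""
    return q_idx @ k_idx.transpose(-1, -2)

def top_k_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean mask that keeps the k highest-scoring keys for each query."""
    k = min(k, scores.size(-1))
    top_idx = scores.topk(k, dim=-1).indices             # [batch, seq, k]
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))
    return mask

# Example: 256-token sequences, each query attends to its 64 best-scored keys.
q_idx, k_idx = torch.randn(2, 256, 32), torch.randn(2, 256, 32)
mask = top_k_mask(lightning_indexer_scores(q_idx, k_idx), k=64)
print(mask.shape, int(mask.sum(-1)[0, 0]))               # torch.Size([2, 256, 256]) 64
```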

🔬 Experiments

Experiment 1: Sparse vs Classic Attention

Location: experiments/exp1_sparse_vs_classic_attention/

Compares DeepSeek sparse attention against standard dense attention.

Key Findings:

  • Sparse attention dramatically outperforms classic attention (139-302% better loss)
  • Benefits increase with sequence length (256 tokens: 302% improvement)
  • Same training speed, better regularization effect

Results: Sparse achieves 68.4% accuracy vs 7.6% for classic at 256 tokens.
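
To make the comparison concrete, the sketch below contrasts a dense ("classic") attention call with a top-k-masked sparse variant of the kind evaluated here. It is an illustrative PyTorch sketch with assumed shapes, not the experiment's training code; for brevity it scores keys with the full q·k product rather than a separate indexer.

```python
# Illustrative PyTorch sketch (assumed shapes, not the experiment's code):
# "classic" dense attention vs a top-k-masked sparse variant.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, D, K = 2, 4, 256, 64, 64                 # batch, heads, seq len, head dim, keys kept
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))

# Classic attention: every query attends to all L keys.
dense_out = F.scaled_dot_product_attention(q, k, v)

# Sparse attention: keep only the K best-scoring keys per query.
scores = (q @ k.transpose(-1, -2)) / D ** 0.5     # [B, H, L, L]
top_idx = scores.topk(K, dim=-1).indices
keep = torch.zeros_like(scores, dtype=torch.bool)
keep.scatter_(-1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))
sparse_out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)

print(dense_out.shape, sparse_out.shape)          # both torch.Size([2, 4, 256, 64])
```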

Experiment 2: MHLA + Sparse Comparison

Location: experiments/exp2_mhla_sparse_comparison/

Tests whether sparse selection improves DeepSeek's already-efficient MHLA.

Key Findings:

  • Mixed results: Sparse helps short sequences (12% better at 64 tokens)
  • Hurts long sequences: 41% worse at 1024 tokens than baseline MHLA
  • MHLA alone is optimal: Latent compression already provides sparsity benefits

Results: Baseline MHLA achieves 32.2% accuracy vs sparse's 10.7% at 1024 tokens.
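
The finding that latent compression alone already behaves like a form of sparsity is easier to see from the shape of the computation. The sketch below shows the latent KV idea behind MHLA: hidden states are down-projected to a small cached latent, and keys/values are rebuilt from it per head. The class name and all dimensions are assumptions for illustration, not the repo's moe_llm.py.

```python
# Illustrative sketch of latent KV compression (assumed names/dims, not moe_llm.py):
# only a small latent per token is cached; keys/values are rebuilt from it.
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    def __init__(self, d_model=512, d_latent=64, n_heads=8, d_head=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # output is what gets cached
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                          # h: [batch, seq, d_model]
        latent = self.down(h)                      # [batch, seq, d_latent], the KV cache entry
        B, L, _ = latent.shape
        k = self.up_k(latent).view(B, L, self.n_heads, self.d_head)
        v = self.up_v(latent).view(B, L, self.n_heads, self.d_head)
        return latent, k, v

latent, k, v = LatentKVCompression()(torch.randn(2, 1024, 512))
# Cached floats per token: 2 * 8 * 64 = 1024 for full K/V vs 64 for the latent.
print(latent.shape, k.shape)                       # [2, 1024, 64] and [2, 1024, 8, 64]
```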

🚀 Quick Start

git clone https://github.com/Open-Superintelligence-Lab/deepseek-sparse-attention-research.git
cd deepseek-sparse-attention-research
pip install -r requirements.txt

# Run Experiment 1 (Sparse vs Classic)
cd experiments/exp1_sparse_vs_classic_attention
python run_experiment.py

# Run Experiment 2 (MHLA + Sparse)  
cd ../exp2_mhla_sparse_comparison
python run_experiment.py

🏗️ Repository Structure

├── models/                    # Core implementations
│   ├── components.py         # Sparse attention components
│   ├── layers.py            # Standard attention layers
│   └── moe_llm.py          # MoE + sparse attention models
├── experiments/              # Research experiments
│   ├── exp1_sparse_vs_classic_attention/
│   └── exp2_mhla_sparse_comparison/
├── training/                 # Training utilities
├── data/                     # Data processing
└── configs/                  # Configuration files

📊 Key Results Summary

  • Exp 1 (Standard Attention, 256 tokens): 302% better loss with sparse. Sparse dramatically improves standard attention.
  • Exp 2 (MHLA, 1024 tokens): 41% worse loss with sparse. MHLA alone is more effective than MHLA + sparse.

🔑 Research Insights

  1. Sparse attention is not just about speed: it provides superior learning through forced selectivity
  2. MHLA's latent compression already captures most benefits of token-level sparsity
  3. Double compression (latent + sparse) can be too aggressive for long contexts (see the back-of-the-envelope sketch below)
  4. Architecture matters: Sparse helps standard attention but may hurt already-optimized MHLA
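
A back-of-the-envelope calculation makes insight 3 concrete. With illustrative (assumed) sizes, each compression stage alone shrinks the attention workload by roughly an order of magnitude, so stacking both leaves each query with a small fraction of the original signal.

```python
# Back-of-the-envelope sketch with assumed sizes (not measured values):
# how much each stage shrinks the attention workload at 1024 tokens.
seq_len, n_heads, d_head, d_latent, top_k = 1024, 8, 64, 64, 64

kv_floats_dense  = seq_len * 2 * n_heads * d_head   # full K and V cached per layer
kv_floats_latent = seq_len * d_latent               # MHLA caches only the latent

keys_per_query_dense  = seq_len                     # classic attention sees every token
keys_per_query_sparse = top_k                       # top-k selection sees only k tokens

print(f"KV cache: {kv_floats_dense} -> {kv_floats_latent} floats "
      f"({kv_floats_dense // kv_floats_latent}x smaller from latent compression)")
print(f"Keys per query: {keys_per_query_dense} -> {keys_per_query_sparse} "
      f"({keys_per_query_dense // keys_per_query_sparse}x fewer from top-k selection)")
# Stacking both gives each query a 16x smaller cache AND 1/16 of the tokens,
# which Exp 2 suggests discards too much signal at 1024 tokens.
```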

🤝 Contributing

We welcome contributions in:

  • Novel sparse attention patterns
  • Hardware-specific optimizations
  • Theoretical analysis
  • Domain-specific applications

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • DeepSeek Team for V3.2 architecture innovations
  • Open Superintelligence Lab for research collaboration

Ready to explore sparse attention? Start with Experiment 1 or read our detailed blog post.

Happy Researching! 🚀🧠
