A collection of deep learning research experiments focused on transformer model internals, weight analysis, and layer-level properties.
T-1: Logit Lens — Project the residual stream at each layer through the LM head to track how predictions evolve across depth. Evaluated on 50 prompts (4,094 completion tokens). Reveals a four-phase architecture: representation building (L0–12), early semantics (L13–21), prediction formation (L22–28), and refinement (L29–35). Mean crystallization at layer 25.4, with a 2.4-layer gap to first top-1. Model: Qwen3-4B-Instruct-2507.
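As a rough illustration of the projection step, the toy sketch below applies a made-up unembedding matrix to made-up per-layer hidden states and reads off the top-1 token at each layer. `UNEMBED`, `HIDDEN`, and the definition of crystallization used here (the earliest layer from which top-1 stays on the final answer) are assumptions for illustration, not the experiment's code or its exact metric. Many logit-lens implementations also apply the model's final norm before the LM head; the toy omits it.

```python
# Toy logit-lens sketch: all sizes, weights, and hidden states are made up.

def project(hidden, unembed):
    """Map one hidden-state vector to vocabulary logits (hidden @ unembed.T)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def top1(logits):
    """Index of the highest-scoring vocabulary entry."""
    return max(range(len(logits)), key=logits.__getitem__)

def logit_lens(per_layer_hidden, unembed):
    """Return (top-1 token per layer, first layer matching final, crystallization layer)."""
    preds = [top1(project(h, unembed)) for h in per_layer_hidden]
    final = preds[-1]
    first_top1 = preds.index(final)
    # Assumed definition: earliest layer from which top-1 never leaves the final answer.
    crystallized = len(preds) - 1
    while crystallized > 0 and preds[crystallized - 1] == final:
        crystallized -= 1
    return preds, first_top1, crystallized

# Hypothetical 3-token vocabulary with a 2-dimensional residual stream.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# Hypothetical residual-stream states for one position at 5 successive layers.
HIDDEN = [[-0.5, -0.4], [0.2, 0.1], [0.1, 0.3], [0.4, 0.2], [0.9, 0.1]]

preds, first_top1, crystallized = logit_lens(HIDDEN, UNEMBED)
print(preds, first_top1, crystallized)  # [2, 0, 1, 0, 0] 1 3
```

In this toy the final token first reaches top-1 at layer 1 but only stays there from layer 3 onward, which is the kind of gap between first top-1 and crystallization that the summary reports.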
T-2: Layer Knockout — Skip each layer and measure loss on completions. Every layer is critical (min 2.4x loss increase). Layer 0 is catastrophic (101x), Layer 6 is the critical hub (23x, appears in 4/5 top synergistic pairs). Includes activation patching for causal bottleneck analysis. Model: Qwen3-4B-Instruct-2507.
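The knockout idea can be sketched on a toy residual stack: skipping a layer means the residual stream passes through unchanged, and the effect is reported as a ratio against the unmodified loss. The scalar "layers", the squared-error loss, and the target value below are hypothetical stand-ins; the experiment itself measures language-model loss on completion tokens.

```python
# Toy layer-knockout sketch on a scalar residual stream (all values made up).

def run_stack(x, layers, skip=None):
    """Apply residual layers in order; the skipped layer acts as identity."""
    for i, layer in enumerate(layers):
        if i != skip:
            x = x + layer(x)  # residual update: stream plus layer output
    return x

def squared_error(pred, target):
    return (pred - target) ** 2

# Hypothetical layers: each maps the current stream value to an update.
layers = [
    lambda x: 1.0,       # layer 0: constant write into the stream
    lambda x: 0.5 * x,   # layer 1: scales the stream
    lambda x: 0.25 * x,  # layer 2: smaller scaling
]

x0, target = 1.0, 4.0
baseline = squared_error(run_stack(x0, layers), target)

# Loss ratio for knocking out each layer in turn.
ratios = [
    squared_error(run_stack(x0, layers, skip=i), target) / baseline
    for i in range(len(layers))
]
print(ratios)  # [72.25, 36.0, 16.0]
```

Reporting a ratio rather than a raw loss difference makes the per-layer numbers comparable across models and evaluation sets with different baseline losses.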
T-3: Layer Swap Cost — Swap every pair of layers and measure loss. Zero interchangeable pairs when evaluated on completions. Same 3-zone clustering (early/middle/late) but no layer is freely relocatable. Layer 2 is the most position-sensitive. Model: Qwen3-4B-Instruct-2507.
T-4: Residual Stream Geometry — Track hidden-state geometry across all 36 layers via participation ratio, isotropy, norms, and category clustering. Reveals bimodal dimensionality collapse (PR=1.6 at layer 16), superlinear norm growth (1→568), and that all anisotropy is mean-direction only (centered cosine ≈ 0 everywhere). Final layer acts as a "de-anisotropifier" — norms drop, isotropy spikes, category separation jumps to 1.03. Model: Qwen3-4B-Instruct-2507.
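For readers unfamiliar with the participation ratio, a common definition is PR = (Σλ)² / Σλ², where λ are the eigenvalues of the covariance of the hidden states. The sketch below assumes that definition (the experiment's exact estimator is documented in its own README) and uses the identities Σλ = tr(C) and Σλ² = ‖C‖_F² to avoid an eigendecomposition. The sample point sets are made up.

```python
# Participation ratio of a point cloud, assuming PR = (sum λ)^2 / sum λ^2.

def participation_ratio(vectors):
    """Effective number of dimensions the (centered) vectors spread across."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Covariance matrix C (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / n for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[a][a] for a in range(d))          # equals sum of eigenvalues
    frob_sq = sum(x * x for row in cov for x in row)  # equals sum of squared eigenvalues
    return trace * trace / frob_sq

# All variance along one axis: PR is 1.
line = [[1, 0, 0], [-1, 0, 0], [2, 0, 0], [-2, 0, 0]]
# Equal variance along three axes: PR is approximately 3.
iso = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]

pr_line = participation_ratio(line)
pr_iso = participation_ratio(iso)
print(pr_line, pr_iso)
```

Under this definition, a value such as PR=1.6 means the hidden states at that layer vary along fewer than two effective directions, regardless of the nominal hidden size.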
T-7: Layer Linearization Gap — Measure how nonlinear each layer's computation is on real inputs via JVP-based perturbation analysis. Reveals a U-shaped nonlinearity profile: middle layers (6–18) are most linear (gap ~0.13, 54% less nonlinear than early layers), with higher nonlinearity at both ends. Layer 0 is qualitatively different (15x transform magnitude, spectral norm 5.4). Attention and MLP nonlinearity nearly identical on average (0.129 vs 0.127). Sub-quadratic nonlinearity everywhere (order 0.6–0.8). Model: Qwen3-4B-Instruct-2507.
T-9: Weight Spectral Structure — SVD analysis of all 252 weight matrices (7 types x 36 layers). Q/K routing matrices are dramatically lower-rank (0.25–0.38 effective rank ratio) than V/O value processing (0.52) and MLP (0.50–0.68), confirming "where to attend" is simpler than "what to extract." Q_proj rank jumps 36.7% at layer 24→25 (discrete transition in routing complexity). Layer 1 MLP is degenerate (gate/down eff rank ~0.12–0.13). Late-layer MLP compression in layers 34–35. Model: Qwen3-4B-Instruct-2507.
T-11: Quantization Sensitivity — Per-layer and per-matrix quantization sensitivity analysis using RTN simulation (2–8 bit) plus full-model method comparison (bitsandbytes NF4, torchao INT4, GPTQ). Early layers (L0–3) are catastrophically sensitive (L2 at 2-bit: +3,828 PPL), while mid-layers (L8–20) absorb 2-bit with <1 PPL impact. SwiGLU gate_proj is 50x more sensitive than Q/K projections. Sensitivity correlates with linearity gap (ρ=0.68). Spectral-informed mixed-precision fails; simple uniform 4-bit is near-lossless. Model: Qwen3-4B-Instruct-2507.
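A minimal sketch of round-to-nearest (RTN) simulation, assuming symmetric per-tensor quantization with a single scale. The granularity (per-tensor, per-channel, or grouped) used in the experiment is not stated here, so treat `rtn_quantize` and the sample weights as hypothetical.

```python
# Simulated RTN quantization: quantize to a signed integer grid, then dequantize.

def rtn_quantize(weights, bits):
    """Return weights snapped to a symmetric `bits`-bit grid (fake-quantized floats)."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 1 for 2-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)                  # all-zero tensor: nothing to quantize
    out = []
    for w in weights:
        q = round(w / scale)                  # Python rounds exact ties to even
        q = max(-qmax, min(qmax, q))          # clamp to the representable range
        out.append(q * scale)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.9, -0.35, 0.1, 0.02, -0.7]       # made-up weight values
errs = {bits: mse(weights, rtn_quantize(weights, bits)) for bits in (2, 4, 8)}
for bits, err in errs.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

In a per-layer sensitivity sweep, one would apply such a function to a single layer's matrices at a time and record the change in perplexity, leaving the rest of the model at full precision.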
T-17: Contrastive Completion Trajectories — Force-decode semantically related completions (synonyms, antonyms, style variants, unrelated) and compare hidden-state trajectories layer-by-layer. Discovers a meaning-vs-form crossover at layer ~18: synonyms are closer than antonyms in early layers (shared meaning) but diverge more in late layers (different surface forms). Context dominates token identity in the residual stream (antonym cosine > 0.72 across layers 2–34). Layer 35 universally destroys inter-completion similarity. KL divergence follows a U-shape for all types except antonyms. 50 hand-crafted contrastive groups, 4 relationship types. Model: Qwen3-4B-Instruct-2507.
Layer Shuffle Recovery — Shuffle all 28 layers of Qwen3-1.7B and test 13 recovery methods. Best pipeline achieves perfect recovery (100% accuracy) in ~19 seconds.
Fish Speech S2 Pro — Architecture investigation of the Fish Speech S2 Pro TTS model.
ACE-Step v1.5 — Architecture investigation of the ACE-Step 1.5 music generation model.
See TODO.md for the full research agenda:
- T-5, T-6, T-8: Architecture surgery — cross-model layer transplant, layer doubling/iteration, thinking vs answer token routing.
- T-10a/b: Attention architecture survey & kernel benchmarks — comparative study of MHA, GQA, MLA, DeltaNet, Mamba2, RWKV-7, sparse attention, and hybrid designs; GPU kernel microbenchmarks (FA2/3/4, FlashInfer, Triton, SageAttention3).
- T-12 to T-14: Inference & systems — CUDA graphs & torch.compile, KV-cache optimization, NIXL & disaggregated inference.
- T-16: Component analysis — activation function survey & ablation.
- D-1 to D-6: Diffusion-inspired experiments — depth-as-denoising, noise injection/recovery, iterative refinement, flow matching, AR as discrete denoiser, textual diffusion from scratch.
- VL-1 to VL-8: Vision-language model experiments (Qwen3-VL-2B-Instruct) — modality gap, visual token redundancy, hallucination localization, bottleneck analysis, representation decoding, modality-specific criticality, cross-modal interference, VLM layer shuffle.
Experiments T-1 through T-4, T-7, T-9, and T-11 use pre-generated greedy completions as evaluation data. T-17 uses hand-crafted contrastive pairs (data/text_completions/contrastive_pairs.json).
- Prompts: 50 question/instruction-format prompts across 7 categories (factual, reasoning, linguistic, code, world knowledge, technical, rare)
- Completions generated via vLLM (temp=0, max 2048 tokens) with a system message to prevent echo
- Loss computed only on completion tokens, not prompt/template tokens
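The completion-only loss in the last bullet amounts to masking prompt and template positions before averaging. A minimal sketch, assuming per-token log-probabilities and a 0/1 completion mask are already available; the function name and sample values are hypothetical.

```python
# Mean negative log-likelihood restricted to completion tokens.

def completion_loss(token_logprobs, completion_mask):
    """Average -log p over positions where completion_mask is truthy."""
    selected = [-lp for lp, keep in zip(token_logprobs, completion_mask) if keep]
    if not selected:
        raise ValueError("no completion tokens to score")
    return sum(selected) / len(selected)

# Made-up log-probs for 2 prompt/template tokens followed by 2 completion tokens.
logprobs = [-3.0, -2.0, -0.5, -1.5]
mask = [0, 0, 1, 1]

masked = completion_loss(logprobs, mask)          # 1.0
unmasked = completion_loss(logprobs, [1] * 4)     # 1.75
print(masked, unmasked)
```

With Hugging Face Transformers, a common way to get the same effect is to set the label of every prompt and template position to -100, which the built-in cross-entropy loss ignores.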
Generate completions for a new model:
    poetry run python data/text_completions/generate_completions.py --model Qwen/Qwen3-4B-Instruct-2507

Install dependencies:

    poetry install --no-root

Requirements:
- Python 3.11-3.12
- CUDA-capable GPU (tested on 2x NVIDIA B200, 183GB each)
- Poetry for dependency management
- PyTorch 2.10+ (CUDA 12.8)
- Transformers 5.3.x
- vLLM 0.18.x
- scipy, scikit-learn, numpy, datasets
    experiments/                      # Each experiment in its own subfolder
      <experiment>/
        run.py                        # Main entry point
        README.md                     # Full write-up: motivation, methods, results, conclusions
        results/                      # Outputs (JSON, plots, logs)
        *.py                          # Supporting scripts
    data/
      text_completions/               # Greedy completions for text experiments
        prompts.json                  # Shared evaluation prompts (50 prompts, 7 categories)
        generate_completions.py       # vLLM generator CLI
        <model-slug>/
          completions.json            # Prompt+completion pairs with token counts
    configs/                          # Shared configuration files
    models/                           # Model checkpoints and saved weights
    notebooks/                        # Jupyter notebooks for exploration
    utils/                            # Shared utility functions
Each experiment README is the authoritative record of the investigation — research question, setup, methods, quantitative results, and conclusions.