@@ -0,0 +1,34 @@
# Non-record: Three Approaches — Lessons Learned

**Best legal result: 1.1188 BPB** (Approach B, s_0 TTT score only)

## Context

Previous PR #991 was closed because its TTT re-scored tokens after training on them. This submission reports only the legal s_0 score: the cumulative first-pass BPB, in which each token is scored before being used for training. All GPTQ calibration runs within the 600s training budget.

## Results

| Approach | Base | TTT? | val_bpb | Artifact | Status |
|----------|------|------|---------|----------|--------|
| **A** | #569 (VRL+LeakyReLU²+GPTQ) int5 | No | 1.1317 | <16MB | int5 penalty too high on d=512 |
| **B base** | #576 (d=576, 33.6M) int5 | No | 1.1249 | 15.3MB | Strong base, no TTT |
| **B + TTT** | #576 (d=576, 33.6M) int5 | s_0 only | **1.1188** | 15.3MB | Legal score-first, no re-eval |
| **C** | #505 (GEPA) int5 | s_0 only | N/A | 16.3MB | Artifact over limit |

## Key Lessons

1. **TTT re-scoring is illegal**: score→train→re-score reports s_1, which benefits from having trained on the eval tokens. Only s_0 (cumulative first-pass) is legal.
2. **int5 penalty on d=512**: switching #569 from int6 to int5 costs +0.014 BPB; that architecture was optimized for int6 precision.
3. **Legal s_0 TTT gives ~0.006 BPB**: B's base 1.1249 → s_0 1.1188, a 0.0061 BPB improvement from backward-looking TTT.
4. **GEPA doesn't fit at int5**: 33.6M params at int5 + 3% prune + LZMA = 16.3MB. It would need 6%+ pruning or a smaller model.
5. **GPTQ calibration timing matters**: calibration must complete within the 600s training budget; our script reserves 10-45s of that budget for it.
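The score-first discipline from lesson 1 can be sketched as a minimal loop (helper callables here are illustrative, not the submission's actual code): each chunk is scored before the model ever trains on it, and the reported number is the running average over everything scored so far.

```python
import math

def s0_ttt_score(chunks, evaluate_nll, train_step):
    """Cumulative first-pass (s_0) TTT scoring.

    Each chunk is scored BEFORE being trained on, so no token's score
    ever benefits from test-time training on that same token.
    evaluate_nll(chunk) -> (total NLL in nats, byte count); train_step
    performs one TTT update on the chunk.
    """
    total_bits = 0.0
    total_bytes = 0
    for chunk in chunks:
        nll_nats, n_bytes = evaluate_nll(chunk)   # 1) score first (no update yet)
        total_bits += nll_nats / math.log(2)      # nats -> bits
        total_bytes += n_bytes
        train_step(chunk)                         # 2) only then train on it
    return total_bits / total_bytes               # cumulative s_0 BPB
```

The illegal s_1 variant differs only in ordering: it trains on the chunk first (or re-evaluates afterward), so every scored token has already been seen by the optimizer.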

## Rule Compliance

- All GPTQ calibration within training budget (assert in code)
- All artifacts asserted < 16MB
- All eval times asserted < 600s
- TTT reports s_0 only — no second eval pass
- No val tokens in artifact

Based on PRs #569 (@gowtham0992), #576 (@cmcdnd), #505 (@JoeProAI).
@@ -0,0 +1,14 @@
{
"author": "ibarrajo",
"github_id": "ibarrajo",
"name": "Non-record: Three approaches — VRL+GPTQ base, d=576 int5 + legal TTT, GEPA int5",
"blurb": "Three approaches tested: (A) Fork #569 VRL+LeakyReLU²+GPTQ int5 no-TTT = 1.1317, (B) Fork #576 d=576 33.6M int5 + legal score-first TTT (s_0 only) = 1.1188, (C) GEPA int5 + TTT — artifact over 16MB. Lessons: int5 penalty on d=512 arch is ~0.014; legal s_0-only TTT gives ~0.006 BPB; GEPA doesn't fit at int5 without more aggressive pruning.",
"date": "2026-03-28",
"val_bpb": 1.1188,
"results": {
"approach_a_int5_no_ttt": {"val_bpb": 1.1317, "artifact_bytes": "under_16MB", "notes": "int5 penalty too high on d=512"},
"approach_b_no_ttt": {"val_bpb": 1.1249, "artifact_bytes": 15288826},
"approach_b_s0_ttt": {"val_bpb": 1.1188, "artifact_bytes": 15288826, "notes": "s_0 only, no re-scoring"},
"approach_c_gepa": {"val_bpb": "N/A", "notes": "artifact 16.3MB over limit at int5+3% prune+lzma"}
}
}
@@ -0,0 +1,97 @@
W0328 01:20:03.015000 63430 torch/distributed/run.py:803]
W0328 01:20:03.015000 63430 torch/distributed/run.py:803] *****************************************
W0328 01:20:03.015000 63430 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0328 01:20:03.015000 63430 torch/distributed/run.py:803] *****************************************
logs/8b0afcee-19bf-4314-aa5f-52f4058d6a77.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:33580124
XSA:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ws:8 gqa:8/8
lr:embed=0.035 matrix=0.025 scalar=0.025 batch:786432 wall:590s seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9324 train_time:153ms step_avg:153.21ms
step:2/20000 train_loss:8.6549 train_time:244ms step_avg:122.20ms
step:3/20000 train_loss:7.7194 train_time:341ms step_avg:113.55ms
step:4/20000 train_loss:7.3036 train_time:436ms step_avg:108.93ms
step:5/20000 train_loss:7.0307 train_time:531ms step_avg:106.26ms
step:6/20000 train_loss:6.8386 train_time:627ms step_avg:104.49ms
step:7/20000 train_loss:6.8010 train_time:722ms step_avg:103.18ms
step:8/20000 train_loss:6.7276 train_time:818ms step_avg:102.20ms
step:9/20000 train_loss:6.4170 train_time:913ms step_avg:101.43ms
step:10/20000 train_loss:6.0697 train_time:1009ms step_avg:100.94ms
step:500/20000 train_loss:2.3653 train_time:48829ms step_avg:97.66ms
step:1000/20000 train_loss:2.2473 train_time:97823ms step_avg:97.82ms
step:1500/20000 train_loss:2.1909 train_time:146773ms step_avg:97.85ms
step:2000/20000 train_loss:2.0329 train_time:195677ms step_avg:97.84ms
step:2500/20000 train_loss:2.1356 train_time:244518ms step_avg:97.81ms
step:3000/20000 train_loss:2.1157 train_time:293343ms step_avg:97.78ms
step:3500/20000 train_loss:2.1230 train_time:342142ms step_avg:97.75ms
step:4000/20000 train_loss:1.9162 train_time:390936ms step_avg:97.73ms
step:4000/20000 val_loss:2.0032 val_bpb:1.1864 train_time:390941ms step_avg:97.74ms
late_qat:enabled step:4288 scale:0.4999
step:4500/20000 train_loss:2.0615 train_time:439721ms step_avg:97.72ms
step:5000/20000 train_loss:2.0343 train_time:488571ms step_avg:97.71ms
swa:start step:5350
step:5500/20000 train_loss:1.9452 train_time:537530ms step_avg:97.73ms
step:6000/20000 train_loss:1.8735 train_time:586678ms step_avg:97.78ms
step:6034/20000 val_loss:1.9031 val_bpb:1.1272 train_time:590032ms step_avg:97.78ms
stopping_early: wallclock_cap train_time:590032ms step:6034/20000
peak memory allocated: 26200 MiB reserved: 26368 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9016 val_bpb:1.1263 eval_time:2362ms
swa:applying SWA weights (count=14)
DIAGNOSTIC post_swa val_loss:1.9033 val_bpb:1.1273 eval_time:2362ms
best_averaging:ema val_bpb:1.1263
Serialized model: 130957195 bytes
Code size: 77742 bytes
pruning:3.0% magnitude pruning applied
gptq:calibrating with training data...
gptq:calibrated 68 layers in 3.8s (total train+gptq: 593.9s / 600s)
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
Serialized model int6+zstd: 15211084 bytes
Total submission size int6+zstd: 15288826 bytes
artifact_headroom: 711174 bytes remaining
final_int6_sliding_window val_loss:1.8993 val_bpb:1.1249 stride:64 eval_time:119291ms
final_int6_sliding_window_exact val_loss:1.89926807 val_bpb:1.12485651
TTT: epochs=3 lr=0.0001 freeze_first=2 chunk=131072 opt=adamw
ttt:start chunks=474 chunk_tokens=131072 windows=969088 stride=64 lr=0.0001 epochs=3 opt=adamw freeze_first=2
ttt:params unfrozen=5780500 frozen=27799624
ttt_chunk [1/474] bpb=1.204317 time=0.8s
ttt_chunk [101/474] bpb=1.125849 time=63.2s
ttt_chunk [201/474] bpb=1.126739 time=125.7s
ttt_chunk [301/474] bpb=1.122655 time=188.1s
ttt_chunk [401/474] bpb=1.119282 time=250.5s
ttt_chunk [474/474] bpb=1.118810 time=295.6s
ttt:done val_loss=1.887708 val_bpb=1.118010 elapsed=295.6s
final_ttt_T1.0 val_loss:1.8877 val_bpb:1.1180 stride:64 eval_time:296140ms
final_ttt_T0.98 val_loss:1.8823 val_bpb:1.1148 eval_time:82117ms
final_ttt_T0.98_exact val_loss:1.88227386 val_bpb:1.11479156
total_eval_time:497.5s