diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/ARTIFACT_MAP.md b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/ARTIFACT_MAP.md new file mode 100644 index 0000000000..22979a66cc --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/ARTIFACT_MAP.md @@ -0,0 +1,66 @@ +# Artifact Map + +This folder is designed to stand on its own as a non-record submission package. + +## Packaged Review Artifacts + +- report: [REPORT.md](REPORT.md) +- overview: [README.md](README.md) +- metadata: [submission.json](submission.json) +- runtime manifest: [requirements.txt](requirements.txt) +- packaged canonical script: [train_gpt.py](train_gpt.py) + +## Packaged Canonical D Evidence + +- machine-readable canonical summary: + [d_submission_summary.tsv](d_submission_summary.tsv) +- canonical train logs: + [train_seed0.log](train_seed0.log) + [train_seed42.log](train_seed42.log) + [train_seed1234.log](train_seed1234.log) + [train_seed1337.log](train_seed1337.log) + [train_seed2025.log](train_seed2025.log) + +## Packaged R-Series Evidence + +- best measured single-seed follow-up eval log: + [r1_e_baseline.log](r1_e_baseline.log) +- machine-readable R-series summary: + [r_series_combined_summary.tsv](r_series_combined_summary.tsv) + +## Script Provenance + +- packaged script SHA256: + `4f2ab2ca43105e94ea1b09924a7580a5446c72be47c2ff1d580c9c604fba69dd` +- package-local script role: + single-file consolidation of the archived seed-0 `train_gpt.py` plus its archived helper chain, produced to keep counted code inside `train_gpt.py` +- archived source paths used to create the package-local script: + `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/train_gpt.py` + `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/ngram_tilt.py` + `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/fused_expert_kernel.cpp` +- archived source component SHA256s:
+ - `train_gpt.py`: `db19d2a078354bd861e425965badbdb41ad644a2aec9c1c9a4f6984fca4c7019` + - `ngram_tilt.py`: `065ced48efcd5ae633f4307d254a0d3e475641878a0dc580f8e677b6e56aa379` + - `fused_expert_kernel.cpp`: `6b11646609508a84f7c2d9ddd9cdb4c133c2474ec83a50b78313d96664984056` +- additional archived canonical `D` script paths: + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed42/pr1413_combo_s42/train_gpt.py` + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed1234/pr1413_combo_s1234/train_gpt.py` + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed1337/pr1413_combo_s1337/train_gpt.py` + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed2025/pr1413_combo_s2025/train_gpt.py` + +## External Archive Roots Used To Build This Package + +These paths are provenance references and are not required to review the packaged folder itself. + +- canonical `D` archive root: + `artifacts/runpod_pull/pr1413_archive_20260407_213205/` +- R-series archive root: + `artifacts/runpod_pull/runpod_r_experiments_20260409_182045/` + +## Important Note + +Do not treat the mutable working-tree file at +`records/track_10min_16mb/2026-04-07_SP8192_QK5_LegalTTT_ParallelResid7_TiltPrep/train_gpt.py` +as the canonical measured artifact for this package. + +That path is a mutable prep surface. The packaged `train_gpt.py` is the self-contained review copy, and the archived seed-0 path above is the provenance anchor for that copy. diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/README.md b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/README.md new file mode 100644 index 0000000000..3b88bc74c8 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/README.md @@ -0,0 +1,84 @@ +# SP8192 D/R-Series Evidence Package + +This folder is a non-record submission package for the SP8192 `D` branch and the 2026-04-09 R-series sweep. 
+ +It packages the strongest evidence line from this branch in a reviewer-friendly shape: + +- canonical base: the 5-seed `D` bundle built from the SP8192 `#1413` family +- canonical result: `1.08128837` score-first TTT BPB (5-seed mean, `sigma = 0.00058943`) +- best measured single-seed follow-up: `R1_e_baseline = 1.08078562` +- important caveat: `R1_e_baseline` ran at `605s`, so it is not a clean lead submission number +- primary contribution: a non-record evidence package documenting A/B/C/D/E, R1-R9, the fixed-Brotli rate-distortion finding, and 12+ negative results + +## What This Package Claims + +- `D` is the canonical, best-supported base from this branch because it is backed by a clean 5-seed RunPod bundle. +- `R1_e_baseline` is a real measured single-seed follow-up signal on top of `D`, but only as a follow-up signal. +- OWC/CDQuant create a fixed-Brotli compression-entropy penalty on this stack that overwhelms their raw BPB gain under the 16 MB cap. +- The negative-results inventory is substantial enough to be useful to nearby Track A efforts. +- Pegasus validation attempts are operational context only. The main evidence claim in this package is already anchored to the RunPod `D` 5-seed bundle. + +## What This Package Does Not Claim + +- It does not claim a record or a submission-valid lead result. +- It does not claim multi-seed confirmation for `R1_e_baseline`. +- It does not claim Pegasus produced completion-valid validation evidence. +- It does not claim grouped OWC, CDQuant salvage, or other follow-up ideas are already demonstrated on this stack. +- It does not treat missing Pegasus reruns as a gap in the main evidence line. 
+ +## Primary Documents + +- Longform report: [REPORT.md](REPORT.md) +- Artifact inventory: [ARTIFACT_MAP.md](ARTIFACT_MAP.md) +- Submission metadata: [submission.json](submission.json) + +## Included Files + +- `README.md` +- `REPORT.md` +- `submission.json` +- `requirements.txt` +- packaged canonical `train_gpt.py` +- canonical `D` train logs: + `train_seed0.log`, `train_seed42.log`, `train_seed1234.log`, `train_seed1337.log`, `train_seed2025.log` +- machine-readable summaries: + `d_submission_summary.tsv`, `r_series_combined_summary.tsv` +- best follow-up eval log: + `r1_e_baseline.log` +- `ARTIFACT_MAP.md` + +## Canonical Packaged Script + +- packaged script: [train_gpt.py](train_gpt.py) +- SHA256: + `4f2ab2ca43105e94ea1b09924a7580a5446c72be47c2ff1d580c9c604fba69dd` +- archived source paths used to create the package-local script: + `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/train_gpt.py` + `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/ngram_tilt.py` + `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/fused_expert_kernel.cpp` +- identity note: + the package-local `train_gpt.py` is a single-file consolidation of the archived seed-0 `train_gpt.py` plus its archived helper chain; this preserves the submission-shaped requirement that counted code live in `train_gpt.py` + +## Runtime Notes + +- A minimal dependency list is provided in [requirements.txt](requirements.txt). +- The package-local `train_gpt.py` inlines the archived n-gram helper chain instead of shipping separate local code files. +- The measured runs used the official Parameter Golf / RunPod CUDA environment. +- The packaged script expects a FlashAttention runtime exposing `flash_attn_interface`; use the challenge image or an equivalent Hopper-compatible install. 
+- The inlined n-gram helper writes `fused_expert_kernel.cpp` to the records folder at runtime if it is absent, then compiles `libfused_ngram.so` with `g++`. + +## Reviewer Snapshot + +- [x] Canonical `D` evidence is packaged with train logs for the five reported seeds +- [x] `R1_e_baseline` evidence is packaged only as a follow-up eval log, not as the canonical submission basis +- [x] The package-local `train_gpt.py` consolidates the archived seed-0 script and helper chain into a single counted code file +- [x] The package-local `train_gpt.py` is checksum-verified as the packaged submission-shaped review artifact +- [x] Non-record framing remains explicit; this folder does not claim a clean lead result or a grouped-OWC success + +## Provenance Note + +Do not treat the mutable working-tree file at +`records/track_10min_16mb/2026-04-07_SP8192_QK5_LegalTTT_ParallelResid7_TiltPrep/train_gpt.py` +as the canonical measured artifact for this package. + +That path is a mutable prep surface. The provenance anchors for the packaged script are the archived seed-0 source paths listed above, while the package-local `train_gpt.py` is the self-contained single-file review artifact derived from them. diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/REPORT.md b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/REPORT.md new file mode 100644 index 0000000000..46182bf3d6 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/REPORT.md @@ -0,0 +1,410 @@ +# [Non-Record Submission] SP8192 Stack Ablation: A/B/C/D/E Base Construction, R1–R9 Eval-Time Sweep, OWC Compression-Entropy Analysis, and 12+ Negative Results + +## Summary + +This is a **non-record submission** documenting a systematic ablation campaign on the SP8192 architecture family, built on top of `#1394` and `#1413`. 
+ +**What this PR contains:** + +- A 5-factor base-construction ablation (A/B/C/D/E) identifying `D` (parallel residual + loop adjustment) as the canonical base, validated across 5 seeds with mean score-first TTT BPB of **1.08129** (σ = 0.00059) +- A 9-run eval-time and export-time sweep (R1–R9) testing TTT optimizer variants, training modifications, and post-training quantization strategies +- A measured stack-specific finding: **OWC and CDQuant improve raw BPB but create a compression-entropy penalty under Brotli that makes them incompatible with the 16 MB artifact cap** on this stack +- 12+ cleanly measured negative results (RMSDecay TTT, Cautious Muon, freeze-4, attention-only OWC salvage, CDQuant stacking) +- Legality-conscious evaluation design aligned with Issue `#1017` (causal, score-before-update, single-pass; GPTQ calibration uses training data only; n-gram tilt operates on the validation stream using strict-prefix statistics) + +**Why this is non-record:** + +- The canonical `D` 5-seed mean (1.08129 BPB) is tied with the current merged SOTA (`#1493`: 1.0810) but ~0.007 BPB behind the clean open legal frontier (~1.074 BPB as of 2026-04-13) +- The best single-seed follow-up (`R1_e_baseline`: 1.08079 BPB) ran in 605 seconds, exceeding the 600-second eval time limit — not submission-valid as a lead number +- Multi-seed validation on Pegasus `8×H100` failed across 3 independent submission cycles (container dependencies → checkpoint config mismatch → OOM), producing no submission-valid evidence beyond the RunPod measurements; these failed reruns are operational context only, not a missing primary evidence requirement for this package +- The OWC/CDQuant path that improves raw BPB by ~0.003 exceeds the artifact cap by >1 MB and cannot be salvaged by scope narrowing + +**What was learned:** + +The primary contribution is the measured finding that post-training weight quantization (OWC/CDQuant) creates a compression-entropy penalty under Brotli that dominates any 
BPB gain when the artifact must fit under a hard byte cap. This has direct implications for any stack that uses GPTQ + Brotli as its export path. + +--- + +## Rule Compliance Snapshot + +- [x] **Non-record framing is explicit.** This package does not claim a new SOTA or a submission-valid lead result. +- [x] **Canonical evidence base is stable.** The primary evidence is the RunPod `D` 5-seed bundle (mean **1.08129**, `σ = 0.00059`), not Pegasus partials and not `R1`. +- [x] **Canonical `D` fits the core hard limits.** All five canonical `D` artifacts are under `16,000,000` bytes, canonical training runs are under `600s`, and canonical TTT eval runs are under `600s`. +- [x] **`R1_e_baseline` is correctly caveated.** It is presented only as the best measured single-seed follow-up and remains explicitly marked as `605s`, over the eval limit. +- [x] **Legality claims stay inside Issue `#1017`.** No pre-quant validation TTT, no custom tokenizer, no multi-pass rescoring, no grouped OWC claim, and no Pegasus result is treated as validation-grade evidence. +- [x] **Submission-style package contents now exist.** The records folder contains `README.md`, `REPORT.md`, `submission.json`, packaged canonical `train_gpt.py`, five canonical `D` train logs, `d_submission_summary.tsv`, `r1_e_baseline.log`, and `r_series_combined_summary.tsv`. +- [x] **Provenance is anchored to the immutable archive.** The package-local `train_gpt.py` is a checksum-verified single-file consolidation derived from archived seed-0 `train_gpt.py` plus the archived helper chain. + +This report is grounded in the packaged logs and summaries plus preserved run archives; it is not a fresh end-to-end `8×H100` rerun. 
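The checksum claims in the snapshot above can be re-run directly. A minimal sketch using only the Python standard library (the expected digest is the one listed under Script Provenance; the path assumes the package folder is the working directory):

```python
import hashlib

# Digest recorded for the packaged review script in ARTIFACT_MAP.md / README.md
EXPECTED = "4f2ab2ca43105e94ea1b09924a7580a5446c72be47c2ff1d580c9c604fba69dd"

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA256 so large artifacts never load whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Run from inside the package folder:
# sha256_of("train_gpt.py") == EXPECTED
```

The same helper applied to the archived seed-0 components should reproduce the three component digests listed in `ARTIFACT_MAP.md`.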
## Audit Checks + +- [x] Cross-checked canonical `D` seed metrics against `d_submission_summary.tsv` +- [x] Cross-checked the R-series table against `r_series_combined_summary.tsv` +- [x] Verified the package-local `train_gpt.py` against its recorded SHA256 and its archived seed-0 source components +- [x] Verified package metadata is consistent across this report, `README.md`, `ARTIFACT_MAP.md`, and `submission.json` +- [x] Verified the mutable `ParallelResid7_TiltPrep/train_gpt.py` is no longer used as the canonical measured artifact + +## What This PR Claims + +- `D` is the canonical, best-supported base from this branch because it is backed by a clean 5-seed RunPod bundle. +- `R1_e_baseline` is a real measured single-seed follow-up signal on top of `D`, but only as a follow-up signal. +- OWC/CDQuant create a fixed-Brotli compression-entropy penalty on this stack that overwhelms their raw BPB gain under the 16 MB cap. +- The negative-results inventory is substantial enough to be useful to other Track A efforts working on nearby stacks. +- The failed Pegasus reruns are documented as operational context, not as a missing validation prerequisite for the package's main claim. + +## What This PR Does Not Claim + +- It does not claim a record, an "almost-record," or a submission-valid lead result. +- It does not claim multi-seed confirmation for `R1_e_baseline`. +- It does not claim Pegasus produced completion-valid validation evidence. +- It does not claim grouped OWC, CDQuant salvage, APM, or any other follow-up idea is already demonstrated on this stack. +- It does not treat the missing Pegasus reruns as a gap in the main evidence line, because the package's primary evidence is the RunPod `D` 5-seed bundle. + +--- + +## 1. 
Base Construction: A/B/C/D/E Ablation + +### Design + +Five configurations were tested on RunPod `8×H100 SXM`, all sharing the SP8192 architecture from `#1413` with `QK_GAIN_INIT=5.0` and legal score-first TTT: + +| Run | Configuration | Key Env Overrides | +|-----|--------------|-------------------| +| **A** | Faithful `#1413` mirror | none | +| **B** | Parallel residual from layer 7 | `PARALLEL_RESIDUAL_START=7` | +| **C** | Loop adjustment (layers 3–5) | `LOOP_START=3 LOOP_END=5` | +| **D** | B + C combined | `PARALLEL_RESIDUAL_START=7 LOOP_START=3 LOOP_END=5` | +| **E** | D + n-gram tilt (eval-only) | D env + `SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1` | + +### Why `D` became the canonical base + +`D` combines the two strongest independent modifications (parallel residual start and loop adjustment) and produced the best training-side quality across seeds. The combination is additive — neither B nor C alone matches D's quality. + +Run `E` is eval-only: it reuses D's trained checkpoint and applies an n-gram tilt layer at evaluation time. It represents the best measured single-seed eval-time enhancement on top of the D base, subject to the 605s wallclock caveat documented below. + +### Canonical D results (5 seeds) + +| Seed | Sliding s64 BPB | Score-First TTT BPB | Artifact Bytes | +|------|-----------------|---------------------|----------------| +| 0 | 1.08261 | 1.08093 | 15,992,638 | +| 42 | 1.08401 | 1.08114 | 15,990,501 | +| 1234 | 1.08248 | 1.08092 | 15,990,023 | +| 1337 | 1.08259 | 1.08112 | 15,989,185 | +| 2025 | 1.08379 | 1.08233 | 15,989,883 | + +**5-seed mean TTT BPB: 1.08129** (σ = 0.00059) +**Max artifact: 15,992,638 bytes** (under 16,000,000 cap) + +Additional sixth seed (7): TTT BPB 1.08168, 15,994,511 bytes. All-6 mean: 1.08135. 
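As a consistency check, the headline statistics can be re-derived from the table above with the standard library. The packaged `d_submission_summary.tsv` presumably carries more precision than the five-decimal table values, so the recomputed σ (0.0005915) matches the reported 0.00058943 only to the published rounding:

```python
import statistics

# Score-first TTT BPB per seed, copied from the canonical D table above
ttt_bpb = [1.08093, 1.08114, 1.08092, 1.08112, 1.08233]

mean = statistics.mean(ttt_bpb)      # 5-seed mean
sigma = statistics.stdev(ttt_bpb)    # sample standard deviation

print(f"mean={mean:.5f} sigma={sigma:.5f}")  # mean=1.08129 sigma=0.00059
```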
### R1 as best single-seed follow-up + +`R1_e_baseline` applied the `E`-style eval (n-gram tilt, `SKIP_TRAINING=1`) on the D seed-0 checkpoint: + +- **BPB: 1.08079** (best measured single-seed BPB from the entire campaign; wallclock caveat below) +- **Wall time: 605 seconds** (exceeds 600s eval limit — not submission-valid as a clean lead number) +- Artifact bytes: reuses D seed-0 artifact (15,992,638) + +This establishes the eval-time tilt as a real signal (0.00014 BPB better than D seed-0's TTT number of 1.08093), but the wallclock overshoot means it cannot be presented as a clean submission number without freeze-4 or similar latency reduction. + +--- + +## 2. What Changed During This Campaign + +This campaign improved because it updated its own assumptions, not because it stacked more tricks. Three corrections reshaped the experimental plan midstream: + +**Corrected parameter count.** Early analysis assumed the SP8192 model was ~5M parameters. The actual count is ~47.5M. This was not a rounding error — it changed how the literature on mixed-precision quantization applied to this architecture. At 47.5M parameters, the model sits near the crossover where int6 quantization stops being conservative and starts being constrained. The literature discount for quantization-aware BPB recovery dropped from ~3× to ~1.5–2×, which ruled out several proposed techniques. + +**Corrected eval-time budget.** The R1 n-gram tilt run hit 605 seconds — over the 600-second eval time limit. The campaign had assumed eval compute was underutilized. In fact, the legal TTT pipeline (score-before-update, single-pass, chunk-wise cosine LR decay) already saturates the time budget. This killed all proposals that would add eval-time compute (APM cascade, multi-pass refinement) without first freeing budget through freeze-4 or equivalent latency reduction. + +**Corrected legality description.** The n-gram tilt was initially described as using "training-data-only calibration." 
Audit revealed the hints and betas are precomputed from the validation stream (causally, from the strict prefix at each position). This is consistent with Issue `#1017` but requires precise language. The correction forced all legality claims in this PR to be grounded in the actual code path rather than inherited assumptions. + +These corrections explain why the branch stopped where it did. The experimental plan was designed to be updated by evidence, and the evidence said to stop. + +--- + +## 3. R-Series Eval-Time and Export-Time Sweep + +Nine experiments (R1–R9) were run on the D seed-0 checkpoint to test eval-time TTT modifications, training variants, and post-training quantization strategies. All runs used RunPod `8×H100 SXM`. + +### Results summary + +| Run | Category | BPB | Bytes | Wall (s) | Status | Interpretation | +|-----|----------|-----|-------|----------|--------|----------------| +| **R1** `e_baseline` | eval-time tilt | **1.08079** | (D ckpt) | 605 | over 600s limit | best measured BPB; not submission-valid due to 605s wallclock | +| **R2** `e_rmsdecay_low` | TTT optimizer | 1.59781 | (D ckpt) | 493 | legal | catastrophic; RMSDecay decay=0.001 | +| **R3** `e_rmsdecay_high` | TTT optimizer | 1.47583 | (D ckpt) | 493 | legal | catastrophic; RMSDecay decay=0.005 | +| **R4** `e_freeze4` | eval-time TTT | 1.08093 | (D ckpt) | 478 | legal | neutral vs R1 (Δ = +0.00014) | +| **R5** `e_combo` | TTT optimizer | 1.47494 | (D ckpt) | 495 | legal | catastrophic; RMSDecay + freeze combo | +| **R6** `d_cautious_muon` | training variant | 1.08170 | 15,993,951 | 484¹ | legal | negative vs D baseline (Δ = +0.00077) | +| **R7** `d_owc` | export quant | **1.07832** | **17,166,438** | 480¹ | **over cap** | best raw BPB; +1.17 MB over cap | +| **R8** `d_cdquant_owc` | export quant | 1.07840 | 17,156,305 | 474¹ | **over cap** | no gain over R7 alone | +| **R9** `d_full_stack` | combined | 1.07916 | 17,202,081 | 476¹ | **over cap** | worse than R7; stacking is negative | 
+ +¹ Wall time shown is for the eval pass only; training variants (R6, R7, R8, R9) also ran ~1200s training passes. + +Additionally, a salvage attempt was run: + +| Run | Category | BPB | Bytes | Status | Interpretation | +|-----|----------|-----|-------|--------|----------------| +| `requant_export_s0` | attention-only OWC | 1.08158 | 16,275,138 | **over cap** | scope narrowing fails: still over cap, worse BPB than R1 | + +### Key findings from the R-series + +1. **RMSDecay TTT is catastrophically bad** on this stack. All three variants (R2, R3, R5) degraded BPB by 0.39–0.52. This optimizer is incompatible with the SP8192 TTT configuration. + +2. **Freeze-4 is neutral.** Freezing 4 of 8 TTT blocks (R4) produced BPB within 0.00014 of R1 while reducing wall time from 605s to 478s. This is a viable latency lever but does not improve quality. + +3. **Cautious Muon training is negative.** R6 degraded BPB by 0.00077 relative to D baseline — a small but consistent penalty. + +4. **OWC produces the best raw BPB but is illegal under the byte cap.** R7 achieves 1.07832, a 0.00247 improvement over R1 — but the artifact is 17.17 MB, exceeding the 16 MB cap by 1.17 MB. + +5. **CDQuant adds nothing on top of OWC.** R8 (CDQuant + OWC) is within 0.00008 of R7 alone, suggesting the two techniques target the same quantization error surface on this architecture. + +6. **Full stacking is negative.** R9 (Cautious Muon + OWC + CDQuant) is worse than R7 alone by 0.00084 BPB and produces an even larger artifact. + +7. **Scope narrowing does not salvage OWC.** The attention-only requant attempt (`requant_export_s0`) still exceeds the cap by 275 KB and produces worse BPB than R1. + +--- + +## 4. Rate-Distortion Under Fixed Brotli + +### The finding + +The competition fixes the compressor (Brotli quality 11) and the artifact cap (16,000,000 bytes). This means the real optimization objective is not raw quantization quality — it is **BPB per compressed byte**. 
Better quantization can be strictly worse once Brotli is fixed and the artifact is already sitting at ~15.99 MB with only ~7 KB of headroom. + +OWC (Optimal Weight Clipping) and CDQuant (Cross-Domain Quantization) demonstrate this directly: they improve raw quantization quality but **increase the entropy of the quantized weight representation**, making the compressed artifact too large to submit. + +### Measured evidence + +| Configuration | TTT BPB | Compressed Bytes | Δ BPB vs D | Δ Bytes vs D | +|--------------|---------|-----------------|------------|--------------| +| D baseline (no OWC) | 1.08093 | 15,992,638 | — | — | +| R7 (OWC) | 1.07832 | 17,166,438 | −0.00261 | **+1,173,800** | +| R8 (CDQuant + OWC) | 1.07840 | 17,156,305 | −0.00253 | **+1,163,667** | +| requant attn-only | 1.08158 | 16,275,138 | +0.00065 | **+282,500** | + +### Mechanism + +Standard GPTQ with fixed `clip_sigmas` produces weight distributions that are highly compressible — the clipping creates strong statistical regularity that Brotli exploits. OWC optimizes the clipping range per-tensor to minimize quantization error, which is the right objective for BPB but scatters the weight values more uniformly across the quantized range, destroying the compressibility structure. + +The tradeoff is: +- **OWC wins on raw quantization quality** (lower reconstruction error → lower BPB) +- **OWC loses on compressed artifact size** (higher entropy → more bytes under Brotli) + +Under a hard byte cap like the challenge's 16,000,000 bytes, the compression penalty dominates. The D baseline uses ~15.99 MB, leaving only ~7 KB of headroom. OWC adds ~1.17 MB. No realistic scope narrowing can bridge this gap. + +### Implication + +The competition cannot use custom entropy coders (ANS, arithmetic coding) — the compressor is fixed. This means classical rate-distortion theory provides guidance but not drop-in solutions. 
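The mechanism is easy to reproduce in miniature. The toy sketch below quantizes synthetic Gaussian "weights" two ways and compares reconstruction error against compressed size. It uses `zlib` as a stand-in for Brotli (both combine LZ77 with entropy coding, and only the direction of the effect matters here), so the numbers are illustrative, not measurements from this stack:

```python
import random
import zlib

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # stand-in weight tensor

def quantize(vals, clip, bits=6):
    """Symmetric uniform quantization; returns packed codes and reconstruction MSE."""
    qmax = 2 ** (bits - 1) - 1          # 31 levels per side for int6
    scale = clip / qmax
    codes = bytearray()
    sq_err = 0.0
    for v in vals:
        q = round(max(-clip, min(clip, v)) / scale)
        codes.append(q & 0xFF)
        sq_err += (v - q * scale) ** 2
    return bytes(codes), sq_err / len(vals)

# Loose clip: codes cluster near zero, a regular, highly compressible distribution
loose_codes, loose_mse = quantize(weights, clip=6.0)
# Error-optimal tighter clip: lower reconstruction error, but codes fill the range
tight_codes, tight_mse = quantize(weights, clip=3.0)

loose_size = len(zlib.compress(loose_codes, 9))
tight_size = len(zlib.compress(tight_codes, 9))
# tight_mse < loose_mse, yet tight_size > loose_size
```

The error-optimal tighter clip wins on MSE and loses on compressed bytes, which is the OWC tradeoff in miniature.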
The practical constraint is: **any quantization strategy that scatters weight values more uniformly across the quantized range will hurt Brotli even if it helps reconstruction error.** + +For stacks near the byte cap, the correct strategy is to use fixed, tuned `clip_sigmas` that preserve the statistical regularity Brotli exploits — or to find enough byte headroom elsewhere (e.g., through depth-sharing or architecture changes) to afford the entropy cost. On this stack, with ~7 KB headroom vs. +1.17 MB OWC penalty, there is no realistic scope narrowing that bridges the gap. + +--- + +## 5. Negative Results Inventory + +The following modifications were measured and found to be negative or neutral on this stack. All measurements are on RunPod `8×H100 SXM` using the D seed-0 checkpoint unless otherwise noted. + +### Training-side negatives + +| Modification | Result | Notes | +|-------------|--------|-------| +| **Cautious Muon** (`CAUTIOUS_MUON=1`) | +0.00077 BPB vs D | R6; consistent penalty across train+eval | +| **RMSDecay TTT** (decay=0.001) | +0.517 BPB vs D | R2; catastrophic degradation | +| **RMSDecay TTT** (decay=0.005) | +0.395 BPB vs D | R3; still catastrophic | +| **RMSDecay + freeze combo** | +0.394 BPB vs D | R5; freeze-4 does not rescue RMSDecay | + +### Eval-time negatives/neutrals + +| Modification | Result | Notes | +|-------------|--------|-------| +| **Freeze-4 TTT** (`TTT_FREEZE_BLOCKS=4`) | +0.00014 BPB vs R1 | R4; neutral quality, 21% faster (478s vs 605s) | + +### Export-side negatives + +| Modification | Result | Notes | +|-------------|--------|-------| +| **OWC full-scope** | −0.00261 BPB but +1.17 MB | R7; quality wins, size loses; illegal | +| **CDQuant + OWC** | −0.00253 BPB but +1.16 MB | R8; no gain over OWC alone | +| **Full stack** (Cautious Muon + OWC + CDQuant) | −0.00177 BPB but +1.21 MB | R9; worse than OWC alone | +| **Attention-only OWC requant** | +0.00065 BPB and +0.28 MB | Scope narrowing fails on both axes | + +### Earlier 
campaign negatives (pre-R-series) + +| Modification | Stack | Result | Notes | +|-------------|-------|--------|-------| +| **QK_GAIN=5.0** | 07c1 | negative | Falsified on Pegasus `8×H100` | +| **MLP_MULT=3.08** | 07c1 | negative | Falsified on Pegasus `8×H100` | +| **MLP_MULT=3.5** | 07c1 | quality-positive but over cap | 17.18 MB artifact | +| **GPTQ percentile clip search** | 05c-plus | negative | Destroyed zstd compressibility; artifact exceeded 16 MB | +| **Full Hessian GPTQ** | 05c-plus | negative | 7 ablations, all worse than legacy row-max | +| **LeakyReLU² + GPTQ** | 05c-plus | neutral | Activation change not root cause of GPTQ failure | +| **FA3 on NGC 25.02** | 07c1 | negative | 11.44× kernel speedup negated by pip torch downgrade | +| **SWA (Stochastic Weight Averaging)** | 05c-plus | dead code | Collected but never applied in #1019 / #634; use EMA only | + +--- + +## 6. Why This Branch Stops Here + +This branch became an evidence branch rather than a frontier branch because three independent constraints converged: + +1. **Eval budget is saturated.** R1 ran in 605 seconds, 5 seconds over the 600-second limit. The legal TTT pipeline — score-before-update, chunk-wise cosine LR decay, 8-GPU distributed scoring — leaves essentially no headroom for additional eval-time computation. Freeze-4 (R4) saved 127 seconds by eliminating backward passes through 4 of 8 TTT blocks, but at near-zero quality gain (Δ = +0.00014 BPB). This is a clean systems result: the eval time budget is the binding constraint on eval-time improvements, not the quality of the underlying TTT strategy. + +2. **Export budget is saturated.** The D baseline produces artifacts at ~15.99 MB, leaving ~7 KB of headroom under the 16 MB cap. OWC wins 0.003 BPB in raw quality but adds 1.17 MB. Attention-only scope narrowing still exceeds the cap by 275 KB. There is no realistic export-side improvement that fits in the remaining byte budget without architectural changes that free headroom first. 
+ +3. **Training quality is competitive but not frontier.** The canonical D 5-seed mean (1.08129 BPB) is tied with the merged SOTA (`#1493`: 1.0810) but 0.007 BPB behind the clean open frontier (~1.074). Closing that gap requires architectural or training-side changes, not more export tuning on the same checkpoint. + +The combination means further work on this specific checkpoint and export path has diminishing returns. The measured evidence is complete enough to publish. + +--- + +## 7. Directions Explicitly Ruled Out + +The campaign explicitly evaluated and rejected several directions. These are documented kills, not unexplored options: + +| Direction | Kill Reason | Evidence | +|-----------|------------|---------| +| **Pre-quant validation TTT** | Legality-sensitive; flagged in `#1517`/`#1550`; no maintainer ruling | Issue `#1017` | +| **Casefold tokenizer** | Legality-sensitive; actively disputed in `#1578`/`#1585`; README flags custom tokenizers for scrutiny | Challenge README | +| **Pruning / Lottery Ticket** | Wrong match for Brotli-capped export; zeroed weights don't compress well under Brotli's LZ77 | Literature review | +| **RFN / attribution graph** | All five kill criteria met: no clean transformer bridge, closest work is by other researchers, time-to-result exceeds 1 week, expected BPB improvement < 0.001, cannot be honestly framed as thesis-derived | EV analysis | +| **Architecture switch** | Not a near-term, challenge-shaped path; would require full retraining and new baseline | Campaign scope | +| **OWC salvage (any scope)** | Irreconcilable size penalty under Brotli; 4 variants tested, all over cap | R7, R8, R9, requant | +| **RMSDecay TTT optimizer** | Catastrophic degradation: +0.39 to +0.52 BPB across 3 variants | R2, R3, R5 | + +These decisions were made before the evidence was complete and updated as measurements came in. The campaign treated each direction as a hypothesis with a pre-defined kill criterion, not as an open-ended exploration. 
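Several of the kill decisions above presuppose the causal, score-before-update evaluation protocol. As a minimal illustration of the strict-prefix discipline (`score_fn` and `update_fn` are hypothetical names for this sketch, not the packaged implementation):

```python
from collections import defaultdict

def causal_eval(val_tokens, score_fn, update_fn, n=3):
    """Single-pass, causal scoring: position p is scored before any
    state derived from val_tokens[p] exists anywhere."""
    tables = defaultdict(lambda: defaultdict(int))  # strict-prefix n-gram counts
    total_nll = 0.0
    for p, tok in enumerate(val_tokens):
        ctx = tuple(val_tokens[max(0, p - (n - 1)):p])
        counts = tables[ctx]
        hint = max(counts, key=counts.get) if counts else None  # prefix-only hint
        total_nll += score_fn(ctx, hint, tok)  # score position p first ...
        counts[tok] += 1                       # ... only then add val_tokens[p]
        update_fn(tok)                         # any TTT update runs after scoring
    return total_nll / max(1, len(val_tokens))
```

The two ordering constraints — the hint is looked up before `val_tokens[p]` enters the tables, and the model update runs only after position `p` is scored — are what keep an eval-time augmentation causal and single-pass.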
+ +--- + +## 8. Legality and Evaluation Design + +All evaluation in this campaign follows the causal, score-before-update TTT protocol aligned with Issue `#1017`: + +- **Causal scoring:** Each token is scored using only preceding context. The TTT update for position `t` is applied *after* scoring position `t`. +- **Single-pass evaluation:** No multi-pass or iterative refinement on the validation data. +- **Training-data-only GPTQ calibration:** GPTQ Hessian collection uses training-split data only. No validation or eval tokens are consumed before quantization. +- **No pre-quant validation TTT:** The model is quantized directly from training checkpoints. TTT runs on the quantized model at eval time. +- **No custom tokenizers:** Standard SentencePiece tokenizer from the challenge repository. No casefold, normalization, or vocabulary modifications. + +The n-gram tilt layer (used in R1 and freeze-4) operates on the **validation stream** using strict-prefix statistics only: for each position `p`, the n-gram hash tables contain counts accumulated from `val_tokens[0..p-1]` only, the hint is looked up, scoring occurs, and only then is `val_tokens[p]` added to the tables. This is causal and single-pass, but the data source is the validation stream itself (not training data). The tilt adjusts per-token NLL using an exponential-family mixture of n-gram expert predictions, applied after the base model scores each position. + +**Caveat on "token-only" framing:** The default configuration sets `NGRAM_WITHIN_BETA=0.0` and `NGRAM_WORD_BETA=0.0`, which zeroes the beta weight for within-word and word-boundary experts. However, those experts still emit candidates, and any matching hint receives `agree_bonus` even if the originating expert had beta zero. This means the tilt is not purely single-expert even at zero beta — the agree-bonus pathway can still be influenced by experts whose direct contribution is zeroed. 
We do not call this configuration "token-only" without this qualification. + +This PR does not rely on any technique whose legality is currently disputed (pre-quant validation TTT per `#1517`/`#1550`, casefold tokenizer per `#1578`/`#1585`). + +--- + +## 9. Pegasus Multi-Seed Validation Attempts + +Three independent attempts to validate `R1_e_baseline` on the Pegasus `8×H100` Slurm cluster all failed, each with a distinct root cause. + +### Cycle 1: Container dependency failure +- **Jobs:** `2771724`, `2771725` +- **Failure:** Stock NGC PyTorch 26.03 container missing `flash_attn_interface`, `sentencepiece`, and `brotli` +- **Resolution:** Cancelled; switched to saved FA3 container at `/netscratch/$USER/containers/pytorch_25.02_fa3.sqsh` + +### Cycle 2: Checkpoint config mismatch +- **Jobs:** `2771763`, `2771764` +- **Failure:** `skip_weights` shape `[8, 512]` vs model `[7, 512]`; `skip_gates` same mismatch +- **Root cause:** Eval wrapper reused `pr1413_combo` checkpoints without replaying the archived D env (`QK_GAIN_INIT=5.0`, `LOOP_START=3`, `LOOP_END=5`, `PARALLEL_RESIDUAL_START=7`) +- **Resolution:** Added `run_meta.env` staging and sbatch-time env import + +### Cycle 3: Out of memory +- **Jobs:** `2773187`, `2773188` +- **Failure:** Slurm `OUT_OF_MEMORY` after ~3h45m runtime at `--mem=64G` +- **Partial results before the OOM kill:** seed 42 sliding_window `1.08402`, seed 1337 sliding_window `1.08259` +- **Root cause:** N-gram tilt state allocates ~120 MB/rank (`hints_cpu` + `betas_cpu`); combined with the model load, this exceeded the 64G Slurm allocation +- **Resolution:** Would require `--mem=128G` or higher; not resubmitted + +### Status + +No submission-valid Pegasus evidence was produced. The partial cycle-3 results (seed 42: 1.08402, seed 1337: 1.08259 sliding window) are consistent with the RunPod 5-seed D bundle but are not completion-valid because neither job finished the legal TTT pass.
+ +The RunPod 5-seed D measurement (mean 1.08129, σ = 0.00059) remains the primary evidence base. + +These Pegasus failures are reported as operational context only. They do not create a missing primary-evidence gap for this package because the package's claim is anchored to the completed RunPod `D` 5-seed bundle, while `R1_e_baseline` is already presented only as an over-limit single-seed follow-up signal. + +--- + +## 10. Contribution Summary + +This PR contributes: + +1. **Systematic ablation study:** 5 base configurations × 5+ seeds, plus 9 eval/export variants — all on the same SP8192 architecture family with controlled single-variable changes. + +2. **OWC compression-entropy finding:** Measured evidence that optimal per-tensor weight clipping trades Brotli compressibility for quantization quality, creating an irreconcilable size penalty under hard byte caps. This is a measured GPTQ+Brotli rate-distortion finding on this stack, with likely relevance to nearby export pipelines; it is not a new quantization method. + +3. **Negative results inventory:** 12+ cleanly measured negatives covering TTT optimizers (RMSDecay), training modifications (Cautious Muon), eval-time strategies (freeze-4), and export techniques (OWC, CDQuant, attention-only requant, GPTQ clip search, Full Hessian GPTQ). + +4. **Legality-aligned evaluation:** All measurements follow the causal score-before-update protocol from Issue `#1017`, with no reliance on disputed techniques (pre-quant validation TTT, casefold tokenizers). + +5. **Operational lessons:** Three distinct Pegasus failure modes documented (container deps, checkpoint config replay, memory allocation for n-gram tilt), useful for anyone running similar eval-time augmentation on Slurm clusters. + +--- + +## 11. 
Position Relative To The Frontier + +The public frontier as of 2026-04-13 is increasingly split between two fundamentally different approaches: + +- **Track A (Neural Optimization):** Optimized transformers with systems-level tuning — fused kernels, improved parallel residuals, attention variants, better TTT. Merged SOTA is `#1493` at 1.0810 BPB. Clean open frontier reaches ~1.074 BPB (`#1518`, `#1560`). This track improves the neural model itself. + +- **Track B (Bayesian Compression):** Posterior mixing, Dirichlet priors, n-gram backoff hierarchies. These treat the neural LM as one component of a larger statistical system. Unverified claims reach below 1.02 BPB. The gap between Track A and Track B is not explained by architecture differences — it is explained by the scoring paradigm itself. + +This PR is a **Track A contribution**: a clean, auditable transformer evidence package built on the SP8192 architecture family. It does not attempt to compete with Track B approaches. Its value lies in the measured ablation sweep, the rate-distortion finding under fixed Brotli, and the negative-results inventory — all of which are directly useful to other Track A efforts. + +The canonical D 5-seed mean (1.08129 BPB) is effectively tied with the merged SOTA (within 0.0003 BPB of `#1493`). It is ~0.007 BPB behind the clean open frontier. This gap is real but not large enough to dismiss the evidence as irrelevant — the ablation structure and export-path findings apply to any stack within ~0.01 BPB of this baseline. + +--- + +## 12. Next Clean Hypothesis + +The only remaining high-upside eval-time extension on this architecture family is **APM (Adaptive Posterior Mixing)** — logistic-domain mixing of the base model's predictions with lightweight online estimators. APM is structurally different from the n-gram tilt tested in this campaign: it operates in logistic space rather than NLL space and uses a proper posterior update rather than a fixed-beta exponential mixture.
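As a sketch of the family APM belongs to (not the planned implementation, which would operate in logistic space rather than probability space), a two-expert causal posterior mixture scores each token before updating its weight; all names below are hypothetical.

```python
import math

def posterior_mix_nll(base_probs, aux_probs, floor=1e-4, prior=0.9):
    """Hypothetical adaptive posterior mix of a base LM and an online
    auxiliary estimator. Inputs are the probabilities each expert assigned
    to the realized token; the weight is updated by Bayes' rule *after*
    scoring, keeping the pass causal and single-pass."""
    w = prior                                   # posterior mass on the base model
    total_nll = 0.0
    for p_base, p_aux in zip(base_probs, aux_probs):
        p_mix = w * p_base + (1.0 - w) * p_aux
        total_nll -= math.log(p_mix)            # score first
        w = w * p_base / p_mix                  # then update the posterior
        w = min(max(w, floor), 1.0 - floor)     # keep both experts alive
    return total_nll / len(base_probs)

# When the auxiliary expert is consistently sharper, the mix beats the
# base model alone and weight shifts toward the auxiliary side.
print(posterior_mix_nll([0.5] * 4, [0.9] * 4) < -math.log(0.5))  # → True
```

This is the structural contrast with the fixed-beta tilt: the mixing weight is itself updated from evidence at every position rather than held at a tuned constant.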
+ +The prerequisite is restored eval budget: R1 already hits 605 seconds, so any eval-time addition requires freeze-4 (saves 127 seconds) or equivalent latency reduction first. This is a separate branch hypothesis, not a continuation of the D/R1 evidence line. + +--- + +## 13. Reproducibility + +### Evidence bundles + +This records folder packages the review-critical evidence directly: + +| Artifact | Location | +|----------|----------| +| D 5-seed canonical summary | `d_submission_summary.tsv` | +| D canonical train logs | `train_seed0.log`, `train_seed42.log`, `train_seed1234.log`, `train_seed1337.log`, `train_seed2025.log` | +| R-series summary | `r_series_combined_summary.tsv` | +| Best measured single-seed follow-up log | `r1_e_baseline.log` | +| Packaged canonical script | `train_gpt.py` | +| Minimal runtime manifest | `requirements.txt` | + +### Scripts + +| Script | Purpose | +|--------|---------| +| `train_gpt.py` | Package-local single-file D training + eval script (24,711 code bytes; consolidates archived seed-0 `train_gpt.py`, `ngram_tilt.py`, and `fused_expert_kernel.cpp` into one counted code file; packaged SHA256 `4f2ab2ca43105e94ea1b09924a7580a5446c72be47c2ff1d580c9c604fba69dd`) | + +### External provenance references + +These archive paths were used to build and verify the packaged evidence: + +- canonical `D` archive root: + `artifacts/runpod_pull/pr1413_archive_20260407_213205/` +- R-series archive root: + `artifacts/runpod_pull/runpod_r_experiments_20260409_182045/` +- package-local single-file script sources: + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/train_gpt.py` + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/ngram_tilt.py` + - `artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/fused_expert_kernel.cpp` + +### Hardware + +All foreground measurements: RunPod `8×H100 SXM`, 80 GB per GPU. +Pegasus attempts: DFKI Pegasus cluster, `8×H100 SXM` via Slurm. 
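The headline mean and σ can be re-derived from the packaged TSV. A minimal check, with `ttt_bpb` values copied from `d_submission_summary.tsv` for the five seeds whose train logs are packaged (the TSV also carries a seed-7 row, which is not part of the canonical 5-seed set), reproduces `submission.json`'s `val_bpb` and `val_bpb_std`:

```python
import statistics

# ttt_bpb per canonical seed, copied from d_submission_summary.tsv.
canonical_bpb = {
    0: 1.08093485,
    42: 1.08113936,
    1234: 1.08091630,
    1337: 1.08112499,
    2025: 1.08232635,
}

mean = statistics.mean(canonical_bpb.values())
std = statistics.stdev(canonical_bpb.values())  # sample std (n - 1)
print(f"mean={mean:.8f} std={std:.8f}")  # → mean=1.08128837 std=0.00058943
```

Note that the reported σ is the sample standard deviation (n − 1 denominator), which is what `statistics.stdev` computes.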
+ +### Canonical result for this PR + +**D 5-seed mean score-first TTT BPB: 1.08129** (σ = 0.00059, max artifact 15,992,638 bytes, all under 16 MB cap). diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/d_submission_summary.tsv b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/d_submission_summary.tsv new file mode 100644 index 0000000000..440ccbb5e5 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/d_submission_summary.tsv @@ -0,0 +1,7 @@ +seed run_id ttt_bpb total_bytes train_ms ttt_eval_ms +0 pr1413_combo_s0 1.08093485 15992638 588034 366951 +7 pr1413_combo_s7 1.08167555 15994511 588138 321694 +42 pr1413_combo_s42 1.08113936 15990501 588101 323468 +1234 pr1413_combo_s1234 1.08091630 15990023 588011 322884 +1337 pr1413_combo_s1337 1.08112499 15989185 588117 321721 +2025 pr1413_combo_s2025 1.08232635 15989883 588070 320728 diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r1_e_baseline.log b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r1_e_baseline.log new file mode 100644 index 0000000000..c270c219be --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r1_e_baseline.log @@ -0,0 +1,293 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + cautious_muon: False + cdquant_enabled: False + cdquant_iters: 3 + compressor: brotli + data_dir: /workspace/parameter-golf/data + datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + 
is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/R1_e_baseline.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + ngram_agree_bonus: 0.1 + ngram_base_beta: 2.0 + ngram_open_table_bits: 26 + ngram_order_stride: 2 + ngram_tilt_enabled: True + ngram_within_beta: 0.0 + ngram_within_threshold: 0.25 + ngram_word_beta: 0.0 + ngram_word_threshold: 0.8 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + owc_enabled: False + owc_gamma_steps: 10 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: R1_e_baseline + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + skip_training: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_decay: 0.0 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + ttt_optimizer: sgd + val_batch_tokens: 524288 + val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 
+==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Thu Apr 9 12:26:03 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 33C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 32C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 30C P0 115W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 34C P0 116W / 700W | 1521MiB / 81559MiB | 3% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 35C P0 121W / 700W | 1521MiB / 81559MiB | 3% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 31C P0 120W / 700W | 1521MiB / 81559MiB | 2% 
Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 32C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 32C P0 121W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +skip_training:reusing existing quantized artifact +quantized val_loss:2.83976191 val_bpb:1.09936033 eval_time:58874ms +quantized_sliding_window val_loss:2.79649863 val_bpb:1.08261177 eval_time:119504ms +ngram_tilt:precompute n_tok=40540161 hints=9560451 (23.58%) elapsed=33.8s base_beta=2.0 within_beta=0.0 agree_bonus=0.1 +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=35944536 frozen=0 +ttt_sliding:optimizer=SGD momentum=0.9 + ttt_chunk [1/1238] bpb=1.114728 time=33.6s + ttt_chunk [11/1238] bpb=1.069869 time=38.1s + ttt_chunk [21/1238] bpb=1.107648 time=40.7s + ttt_chunk [31/1238] bpb=1.101950 time=43.2s + ttt_chunk [41/1238] bpb=1.095218 time=45.8s + ttt_chunk [51/1238] bpb=1.088913 time=48.3s + ttt_chunk [61/1238] bpb=1.080485 time=50.8s + ttt_chunk 
[71/1238] bpb=1.087445 time=53.3s + ttt_chunk [81/1238] bpb=1.080865 time=55.9s + ttt_chunk [91/1238] bpb=1.077478 time=58.4s + ttt_chunk [101/1238] bpb=1.077272 time=60.9s + ttt_chunk [111/1238] bpb=1.075645 time=63.5s + ttt_chunk [121/1238] bpb=1.078526 time=66.0s + ttt_chunk [131/1238] bpb=1.082213 time=68.5s + ttt_chunk [141/1238] bpb=1.082776 time=71.0s + ttt_chunk [151/1238] bpb=1.082623 time=73.6s + ttt_chunk [161/1238] bpb=1.083080 time=76.1s + ttt_chunk [171/1238] bpb=1.082931 time=78.6s + ttt_chunk [181/1238] bpb=1.081450 time=81.2s + ttt_chunk [191/1238] bpb=1.081194 time=83.7s + ttt_chunk [201/1238] bpb=1.078741 time=86.2s + ttt_chunk [211/1238] bpb=1.083174 time=88.8s + ttt_chunk [221/1238] bpb=1.083481 time=91.3s + ttt_chunk [231/1238] bpb=1.085042 time=93.8s + ttt_chunk [241/1238] bpb=1.083292 time=96.3s + ttt_chunk [251/1238] bpb=1.083326 time=98.8s + ttt_chunk [261/1238] bpb=1.084411 time=101.3s + ttt_chunk [271/1238] bpb=1.084962 time=103.9s + ttt_chunk [281/1238] bpb=1.084204 time=106.5s + ttt_chunk [291/1238] bpb=1.085378 time=109.0s + ttt_chunk [301/1238] bpb=1.085549 time=111.5s + ttt_chunk [311/1238] bpb=1.084459 time=114.7s + ttt_chunk [321/1238] bpb=1.084339 time=117.2s + ttt_chunk [331/1238] bpb=1.084660 time=119.7s + ttt_chunk [341/1238] bpb=1.083821 time=122.3s + ttt_chunk [351/1238] bpb=1.084556 time=124.8s + ttt_chunk [361/1238] bpb=1.083530 time=127.4s + ttt_chunk [371/1238] bpb=1.082007 time=129.9s + ttt_chunk [381/1238] bpb=1.082401 time=132.4s + ttt_chunk [391/1238] bpb=1.082071 time=134.9s + ttt_chunk [401/1238] bpb=1.082146 time=137.5s + ttt_chunk [411/1238] bpb=1.082762 time=141.3s + ttt_chunk [421/1238] bpb=1.082204 time=143.8s + ttt_chunk [431/1238] bpb=1.082400 time=146.3s + ttt_chunk [441/1238] bpb=1.082478 time=149.4s + ttt_chunk [451/1238] bpb=1.083677 time=152.0s + ttt_chunk [461/1238] bpb=1.081931 time=154.5s + ttt_chunk [471/1238] bpb=1.081932 time=157.1s + ttt_chunk [481/1238] bpb=1.082061 time=159.6s + ttt_chunk 
[491/1238] bpb=1.082537 time=162.1s + ttt_chunk [501/1238] bpb=1.082157 time=165.3s + ttt_chunk [511/1238] bpb=1.081839 time=168.4s + ttt_chunk [521/1238] bpb=1.081381 time=170.9s + ttt_chunk [531/1238] bpb=1.081384 time=173.5s + ttt_chunk [541/1238] bpb=1.081494 time=176.0s + ttt_chunk [551/1238] bpb=1.081043 time=178.5s + ttt_chunk [561/1238] bpb=1.080284 time=181.0s + ttt_chunk [571/1238] bpb=1.079700 time=183.5s + ttt_chunk [581/1238] bpb=1.080067 time=186.1s + ttt_chunk [591/1238] bpb=1.080292 time=188.6s + ttt_chunk [601/1238] bpb=1.080210 time=191.1s + ttt_chunk [611/1238] bpb=1.080755 time=193.7s + ttt_chunk [621/1238] bpb=1.081603 time=196.2s + ttt_chunk [631/1238] bpb=1.081674 time=198.7s + ttt_chunk [641/1238] bpb=1.082135 time=201.3s + ttt_chunk [651/1238] bpb=1.082476 time=203.8s + ttt_chunk [661/1238] bpb=1.081799 time=206.3s + ttt_chunk [671/1238] bpb=1.081544 time=208.9s + ttt_chunk [681/1238] bpb=1.082865 time=211.4s + ttt_chunk [691/1238] bpb=1.083057 time=214.0s + ttt_chunk [701/1238] bpb=1.082876 time=216.5s + ttt_chunk [711/1238] bpb=1.083579 time=219.1s + ttt_chunk [721/1238] bpb=1.083899 time=221.6s + ttt_chunk [731/1238] bpb=1.083250 time=224.1s + ttt_chunk [741/1238] bpb=1.082966 time=226.6s + ttt_chunk [751/1238] bpb=1.082062 time=229.2s + ttt_chunk [761/1238] bpb=1.081465 time=231.7s + ttt_chunk [771/1238] bpb=1.080454 time=234.2s + ttt_chunk [781/1238] bpb=1.080412 time=236.8s + ttt_chunk [791/1238] bpb=1.080748 time=239.3s + ttt_chunk [801/1238] bpb=1.081039 time=241.8s + ttt_chunk [811/1238] bpb=1.080531 time=244.4s + ttt_chunk [821/1238] bpb=1.079343 time=246.9s + ttt_chunk [831/1238] bpb=1.079022 time=249.4s + ttt_chunk [841/1238] bpb=1.078578 time=252.0s + ttt_chunk [851/1238] bpb=1.078292 time=254.6s + ttt_chunk [861/1238] bpb=1.077975 time=257.1s + ttt_chunk [871/1238] bpb=1.077875 time=259.7s + ttt_chunk [881/1238] bpb=1.077413 time=262.2s + ttt_chunk [891/1238] bpb=1.076893 time=264.7s + ttt_chunk [901/1238] bpb=1.077296 
time=267.2s + ttt_chunk [911/1238] bpb=1.076993 time=269.8s + ttt_chunk [921/1238] bpb=1.077266 time=272.3s + ttt_chunk [931/1238] bpb=1.077962 time=274.9s + ttt_chunk [941/1238] bpb=1.078359 time=277.4s + ttt_chunk [951/1238] bpb=1.078289 time=279.9s + ttt_chunk [961/1238] bpb=1.079124 time=282.5s + ttt_chunk [971/1238] bpb=1.079535 time=285.0s + ttt_chunk [981/1238] bpb=1.079908 time=287.5s + ttt_chunk [991/1238] bpb=1.079692 time=290.1s + ttt_chunk [1001/1238] bpb=1.079727 time=292.6s + ttt_chunk [1011/1238] bpb=1.080059 time=295.2s + ttt_chunk [1021/1238] bpb=1.080777 time=297.7s + ttt_chunk [1031/1238] bpb=1.081256 time=300.2s + ttt_chunk [1041/1238] bpb=1.081743 time=302.7s + ttt_chunk [1051/1238] bpb=1.081677 time=305.3s + ttt_chunk [1061/1238] bpb=1.081663 time=307.8s + ttt_chunk [1071/1238] bpb=1.081814 time=310.4s + ttt_chunk [1081/1238] bpb=1.081700 time=312.9s + ttt_chunk [1091/1238] bpb=1.081913 time=315.4s + ttt_chunk [1101/1238] bpb=1.082465 time=318.0s + ttt_chunk [1111/1238] bpb=1.082760 time=320.5s + ttt_chunk [1121/1238] bpb=1.082925 time=323.0s + ttt_chunk [1131/1238] bpb=1.082595 time=325.6s + ttt_chunk [1141/1238] bpb=1.082274 time=328.1s + ttt_chunk [1151/1238] bpb=1.082313 time=330.7s + ttt_chunk [1161/1238] bpb=1.082453 time=333.2s + ttt_chunk [1171/1238] bpb=1.082227 time=335.8s + ttt_chunk [1181/1238] bpb=1.081761 time=338.3s + ttt_chunk [1191/1238] bpb=1.081924 time=340.8s + ttt_chunk [1201/1238] bpb=1.081971 time=343.3s + ttt_chunk [1211/1238] bpb=1.081659 time=345.9s + ttt_chunk [1221/1238] bpb=1.081195 time=348.4s + ttt_chunk [1231/1238] bpb=1.080833 time=350.9s + ttt_chunk [1238/1238] bpb=1.080837 time=369.0s +ttt_sliding:done val_loss=2.791782 val_bpb=1.080786 elapsed=369.0s +legal_ttt_exact val_loss:2.79178150 val_bpb:1.08078562 eval_time:369171ms diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r_series_combined_summary.tsv 
b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r_series_combined_summary.tsv new file mode 100644 index 0000000000..26974e986d --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r_series_combined_summary.tsv @@ -0,0 +1,16 @@ +batch_name timestamp run_name run_kind status bpb wall_seconds quant_bpb sliding_bpb ttt_bpb total_bytes archive_ckpt_bytes env_vars +optimization_batch_20260409_131357 2026-04-09T14:11:20+00:00 R7_d_owc_eval tier2_eval OK 1.07831680 480 1.09635004 1.07965450 1.07831680 17147928 OWC_ENABLED=1 + best_eval(SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1) +optimization_batch_20260409_131357 2026-04-09T14:39:19+00:00 R8_d_cdquant_owc_eval tier2_eval OK 1.07839806 474 1.09629982 1.07968718 1.07839806 17137795 CDQuant+OWC + best_eval(SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1) +optimization_batch_20260409_131357 2026-04-09T14:03:20+00:00 R7_d_owc_train tier2_train OK 1.07846152 1206 1.09635004 1.07965450 1.07846152 17166438 17147928 OWC_ENABLED=1 OWC_GAMMA_STEPS=10 +optimization_batch_20260409_131357 2026-04-09T14:31:25+00:00 R8_d_cdquant_owc_train tier2_train OK 1.07854093 1204 1.09629982 1.07968718 1.07854093 17156305 17137795 CDQUANT_ENABLED=1 CDQUANT_ITERS=3 OWC_ENABLED=1 OWC_GAMMA_STEPS=10 +optimization_batch_20260409_131357 2026-04-09T15:07:21+00:00 R9_d_full_stack_eval tier2_eval OK 1.07916167 476 1.09704872 1.08048663 1.07916167 17183571 full_stack + best_eval(SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1) +optimization_batch_20260409_131357 2026-04-09T14:59:25+00:00 R9_d_full_stack_train tier2_train OK 1.07930032 1206 1.09704872 1.08048663 1.07930032 17202081 17183571 CAUTIOUS_MUON=1 OWC_ENABLED=1 OWC_GAMMA_STEPS=10 CDQUANT_ENABLED=1 CDQUANT_ITERS=3 +optimization_batch_20260409_122546 2026-04-09T12:35:54+00:00 R1_e_baseline tier1_eval OK 1.08078562 605 1.09936033 1.08261177 1.08078562 15975248 SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1 +optimization_batch_20260409_122546 2026-04-09T13:00:18+00:00 R4_e_freeze4 
tier1_eval OK 1.08093112 478 1.09936033 1.08261177 1.08093112 15975248 SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1 TTT_FREEZE_BLOCKS=4 +candidate_verify_requant_20260409_155823 requant_export_s0 unknown OK 1.08158498 543 16275138 REQUANT_ONLY=1 OWC_ENABLED=1 OWC_GAMMA_STEPS=10 OWC_SCOPE=attn +optimization_batch_20260409_131357 2026-04-09T13:43:14+00:00 R6_d_cautious_muon_eval tier2_eval OK 1.08169712 484 1.10018913 1.08357810 1.08169712 15975441 CAUTIOUS_MUON=1 + best_eval(SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1) +optimization_batch_20260409_131357 2026-04-09T13:35:10+00:00 R6_d_cautious_muon_train tier2_train OK 1.08185431 1273 1.10018913 1.08357810 1.08185431 15993951 15975441 CAUTIOUS_MUON=1 +optimization_batch_20260409_122546 2026-04-09T13:08:33+00:00 R5_e_combo tier1_eval OK 1.47494162 495 1.09936033 1.08261177 1.47494162 15975248 SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1 TTT_OPTIMIZER=rmsdecay TTT_DECAY=0.005 +optimization_batch_20260409_122546 2026-04-09T12:52:20+00:00 R3_e_rmsdecay_high tier1_eval OK 1.47582751 493 1.09936033 1.08261177 1.47582751 15975248 SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1 TTT_OPTIMIZER=rmsdecay TTT_DECAY=0.005 +optimization_batch_20260409_122546 2026-04-09T12:44:07+00:00 R2_e_rmsdecay_low tier1_eval OK 1.59780678 493 1.09936033 1.08261177 1.59780678 15975248 SKIP_TRAINING=1 NGRAM_TILT_ENABLED=1 TTT_OPTIMIZER=rmsdecay TTT_DECAY=0.001 +candidate_verify_requant_20260409_155823 requant_eval_s0 unknown FAILED FAILED 421 REQUANT OWC_SCOPE=attn gamma=10 + NGRAM_TILT_ENABLED=1 diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/requirements.txt b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/requirements.txt new file mode 100644 index 0000000000..f37033b343 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/requirements.txt @@ -0,0 +1,13 @@ +# Minimal Python dependencies for the packaged SP8192 script. 
+numpy +sentencepiece +torch +brotli + +# Runtime notes: +# - `train_gpt.py` imports `flash_attn_interface`, which was provided by the +# Hopper-targeted FlashAttention runtime in the official challenge / RunPod +# environment used for the measured runs. +# - The package-local `train_gpt.py` inlines the archived n-gram helper chain +# and writes `fused_expert_kernel.cpp` at runtime if `libfused_ngram.so` is +# not already present. diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/submission.json b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/submission.json new file mode 100644 index 0000000000..33d851869d --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/submission.json @@ -0,0 +1,66 @@ +{ + "author": "Ammer Ayach", + "github_id": "amrayach", + "name": "Non-record: SP8192 D/R-Series Evidence Package", + "blurb": "Submission-style evidence package for the SP8192 D base and R-series sweep. Anchors the canonical result to the 5-seed RunPod D bundle, packages a checksum-verified canonical train_gpt.py copy, and documents the fixed-Brotli rate-distortion finding plus measured negative results.", + "date": "2026-04-13T16:30:20+02:00", + "track": "non_record_16mb", + "status": "submission_candidate", + "description": "SP8192 D/R-series evidence package documenting A/B/C/D/E base construction, R1-R9 eval/export sweep, fixed-Brotli rate-distortion findings, and negative results.", + "val_bpb": 1.08128837, + "val_bpb_method": "canonical_5_seed_mean_score_first_ttt_bpb", + "val_bpb_std": 0.00058943, + "canonical_max_artifact_bytes": 15992638, + "canonical_max_train_wallclock_seconds": 588.138, + "canonical_max_eval_wallclock_seconds": 366.951, + "gpu": "8x NVIDIA H100 80GB SXM", + "canonical_base": "D", + "best_measured_single_seed_followup": { + "name": "R1_e_baseline", + "val_bpb": 1.08078562, + "eval_wallclock_seconds": 605, + "status": "over_600s_eval_limit" + }, + 
"canonical_script_source": "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_gpt.py", + "canonical_script_archive_source": "artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/train_gpt.py", + "canonical_script_sha256": "4f2ab2ca43105e94ea1b09924a7580a5446c72be47c2ff1d580c9c604fba69dd", + "canonical_script_identity_note": "Package-local canonical script is a single-file consolidation of archived seed-0 train_gpt.py plus its archived helper chain, produced to keep counted code inside train_gpt.py.", + "canonical_script_derivation_sources": [ + { + "path": "artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/train_gpt.py", + "sha256": "db19d2a078354bd861e425965badbdb41ad644a2aec9c1c9a4f6984fca4c7019" + }, + { + "path": "artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/ngram_tilt.py", + "sha256": "065ced48efcd5ae633f4307d254a0d3e475641878a0dc580f8e677b6e56aa379" + }, + { + "path": "artifacts/runpod_pull/pr1413_archive_20260407_213205/seed0/pr1413_combo_s0/fused_expert_kernel.cpp", + "sha256": "6b11646609508a84f7c2d9ddd9cdb4c133c2474ec83a50b78313d96664984056" + } + ], + "artifact_map": "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/ARTIFACT_MAP.md", + "readme": "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/README.md", + "report": "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/REPORT.md", + "requirements": "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/requirements.txt", + "included_logs": [ + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed0.log", + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed42.log", + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1234.log", + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1337.log", + 
"records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed2025.log", + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r1_e_baseline.log" + ], + "included_tables": [ + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/d_submission_summary.tsv", + "records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/r_series_combined_summary.tsv" + ], + "notes": [ + "Prepared as a non-record submission package for PR review.", + "The canonical evidence base is the 5-seed RunPod D bundle, not Pegasus partials and not R1.", + "The mutable working-tree ParallelResid7_TiltPrep script is not the canonical measured artifact.", + "Use the archived seed-0 source paths for provenance and the package-local single-file train_gpt.py for submission-shaped review.", + "The package-local train_gpt.py inlines the archived ngram helper chain, but reproducing the measured runs still assumes a matching CUDA/Hopper environment with flash_attn_interface available.", + "R1_e_baseline is included only as the best measured single-seed follow-up and remains over the 600-second eval limit." 
+ ] +} diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_gpt.py b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_gpt.py new file mode 100644 index 0000000000..5a88a2354a --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_gpt.py @@ -0,0 +1,2 @@ +import lzma as L,base64 as B +exec(L.decompress(B.b85decode(";V+v^(_H{Gn@VT6Qap3bt~@<3h>ok~)Km^%c^ys%R{D_%yAk9-_tV7^coUOo3$w>`(`ci)t`2F7>r>Ltx>>S2CRw|7ov>Wn1e~_!RLQ=%V9g?)G3yPsu%SB!BQ{8W-(tbYVtD?$Frdp&47mynt8Z##(i@kr1L+LuxuveEIU&wZ2tEA^VoEx02@{YCVjFOVoX$tX;X23v^N_S7GU4im}G`jcYr<4f47Ffl)lvd(tW;S^NlLc8Ry)b>`j-UsnfUMqxa9R+nHztdyy_&7hKrWX8N_X&FFLHvO%#Jo5$fYOc7wA#*p^_D9D`uRrWB)ET|vw?Uo_OAku-84%p#n{Ezyafm6Nnefkj1(Y)KDf3LC6NWg48FAawErs@+=U{;PL#=xzHcX+e+(b0-E?}M5nHX(>j3#Ey1mBT}ip4R-)ydb1Yq*8Njk5Eo$9BW+kAA@(#aY)8M0&$kbr-r_11}do8jEWheIYAANO;b0z3Z^6>smQP(KAf@z{pQWT7iATGC;V-~EP>|6BH%#Y&E3H3&|2^~PP>~C&r+UW`aesZmc?qhLy=fZuW}WR<^nb~7OvWq0c`w)We`ixDXN42ZG#it2O=fg_%e~~c@*3e~gX@0;nLDs>MjRv0ul@9A*2vmW3N&q)+Cl#nWCJGA5S5MGo20f!G*r~6)VhD4#7T@h@OvM=H62JM$KnQ;fbz?Q<&9%XW|G@-5P+1UBM^^(w+!2@No}XLA>Do$az=rKG36a;q06rKVZ+fA!Sfc{$V@ntz@V#9Gq2Z>$_hI+wnstpXmhWI~KAjo5{2#G^@zrYKh~X0seq3qa2a}vkhQpbx>_gmGnR_mNZyF{#Dp$%ifmOP&A$)?zr5<4gQp;=95heD4$&XfVr=MgpBu;Oi+lV2mjPeY+&@}_CZB6!;im3-tE>$Z%qQePV$Pc<@)c`)uT#>DVyW#v{5O!3(}%MsrzC4X)#TB7btJd!~@ANj6|H?at;82k;N7wADaBD9>yXW9|7~u?+{<$iph{V`gRJT=E`>8`6&{zbA&^!uB?kc>&oWC4$?KZs{O4%qY$)eqwOoGci|J7)akdI3IZrTf#dO4%NCDNh*Rvlb$8m4T_g(pM4yG}up|8&Q3e7F7E#%C<3so&l1K;{+_PsE;4N9+iKm+@gs}xO*zL@wzfO0M&D4DILYWtr91%Ff#AW3q{f7GqZy`s)2p*~OgEHrMtOKu7Rj`ujX)y!iEb@~Au_Hua?Sd7{59jM<+eOfUPtMGXGup7(nC8kmF2%zl$e@rtPQ6@#zL$Eg$la`1@UcZj#LI+<#r`!pRpqhK7qrS>9bb#=gnYv=m`HiJ70z|gyp7GSW@9oCPGpO37~HZ077F!nMshSn>ks0D`H20l%gnz>WdX(I+i(YH~2%*KLbxqMVvFMANLChSCt*YWNZwqw=WbZ}*rCJwj0GHYajvWLyk^natfDDAV}zXnXmkuHHXZAGQpB1%H#&kS@3o2{}Obl0y@){IUNUWQGI-S8O^H3_2ahPFeFe5nv)?&G~eI;6lGrLdL>j=+hAAgOGW&fFynmYs)Iao|vjhn*{a_5{O<^Ib)1hTt=~;08|<{l;TS=3;XNHFQ6vnO0Zwn-l5?3Y1Z
_^qWp@Fln{;S~rne{uCrtdE-LeXK|?wPMM4Al;9n!_t;HOZXflB%56-lM-A<0sJwrH-yjJ$V)JVK*L^@(QLFI=kuW9$ry5(L->gKDB@J>gY2W)Hkf&sL^R86=!VmeM*@Jl}HsXPx9uz#FVi$QjgNeZZu(LoeBBSyWb^y7yQDsEL+{j>|E{!XNz%5G_M$(@KHQp&vKED$moS&jHI7PKp%>aZ3)k0nl`JvJr*fWsH-Fp;*aWGN~a43NQ7&cCX|kd>hC+Oy(rLZTx;6&g;r_A@RodTFop19MLQCoWao>dPM=JDIfrH{>JiSf6scTWsE5a36@V{IVob&&&mhY5+;2f{S?7NOVhg|d|N8)F-ByHS4j58q!VH?Xrs1QDp|8|$=;33g2f{px=($!yap{Rgz1FYus8z~w(2cFsXzfWr7XhBcj?YhU2T*@^UHC!y=%Y`M#snjHsJQyQAm3io9gJb(0D+@QL&OzNcHFhNY7BTuM6t8nJh2=3g=4R$>Q6{?H^fnIqVzkK8l`j6wNfEhC#^=Cy5MPO7RNF?k-+%s4kjGre&#!Cfbw(ncCrWOy6j)DYh;nM?ArOxKb=&WH5=h2C;`BqD*PncLE_vX#X6Up>VD4p2+WR;r2ajeGT7EbE*3U~dEcVCjCyV^`^o|9W^M2$SC-MqJnww1+j~Q{ZE-y&L6?+%!pvlQ;vjFenP(Ye!?PSktr-<#^_Uj%XEf6M63olO-p3142?#KyuzebJq$j%V_>8_BY8|s5d3$At0;UXLD)qZSQ0I
  • =+Sd{_MqF#;5`OcjZgxo^YW#3wsva_rH#EJVEO+C(YjcXa*fFPr}i%Ll@1?1Za8#ISS!oE)9bI*OYTZge#B$@G)be^W*OvS&4J87@O!;;3F-HcCIia6f`T5#p&eG{Gftsf)br>R=F0luPUu$skytp4&1{6p$-EFaNniUdE2uvKEbV9GduPxzM(acx`H}_2m&UfUu0~757xzN~vO_W%0P~bCZP(4sYVdhbCz;+yjrxGw!aBjqJhtbvRyaHunUK&Y(cBf7-%&@Ifoa1=OH2-vnDZh%cU1IjA-u`%d1QGO0fF$WkZp`j@SD>e}9bQFQcWH>Jqm#%tN{4ihw_Emip}?jbNoBuCEvU)7;eEYE|iE05v^-khTs1WIP5na!GuGTO+S2Nz3{L0xYPsa0P2;U*wuS!hlTULi*jowInn_q{W7;%;?X05%RaDH?9H%U;t8xRibC>|85ddg(&U&_l}rRQhLMzs*2wG%fErV^rSprVZ%$t{6aR()%SAnpox4l)86HR<*2y2#Ga&12h(?n%7Ul_nl7kd&ntAb8#21k6*8>?lPu6sKK{o(B^t`C|}F`<^#6$MQ~k%IEu@A^a^YnB4&UadPsDLVI1iyX;eIB82+p!pX5uLOLUJHFh95U?4TNk{yf2mL{Nok3~6StTSET}7x_F{Qv+rK*=i>`GI-`0lpaGihfc>?`T9O&g(HX)ke;H*T?Sj)uPt;$z7$M3#PNF=yDSXFi!ZI!Dh6ws5R$vLRMN#8VrLBc*7?m4xw02Jh!qgb(9k0Fyqr$vR4CX}M}JWDTWXNh0k306o>YEdtfl;klF6@r^8x+C+J9|#PvEsWoJTb(0#Ptz{@0qLOoikxO6C|RG6bH96_fcF6lK*{X*47;pc@%S^jb5D!%a9U3yaE{1CUp;Y8>RY|9sbr7z?eE9yIBfmLiGcdpU?aB_N&6ux5G`;rFUc3_{Y1(5cF$J4jM)AbvnT#Axf0>0(y3u{P3-25cgP|f%R7EPW23{29=9q#Ygf`-!Yz0NDIdMtL+!~cLJ2B*b1w_`uQl#*622X9I>^uBK1>Mfnx(6Z#NStAhYdz;*I4ox0pi?i_8_%g#yk>jsK;PWt&FX^AMa!WlUMLFN`#usNI5)=S8k~C+MP9^*~PfGIDJ$T?G^csGY(Fuj2iu>Dh}f9u+$#kItA~6bSjNE3q~0sg34pwPNcD->OiBj&s^HW(x6|rHkcXAMR#shx%gjzE_E5%-m#ZxU@$)lEco6mX77c*ksS?*1yg=Z%=wHWbQiEN!}M#gzjwAz3B6j*H#i*mM)(KVxn-Um0lALsBs2=*Io99X+G^BVZkVW^PaN{v^I;>&@!q>(G=G%Ctr$r|sgxxg8U8uLO$Ohpdb``9#v+xK=w@`L-)gF+Tsv3k@jc_UN%EqT_L=mW$xPgbt!=;EBXFfp!*nvO(|8XjXD~^~G_7RFYqT_HK3A6o6uf!a_+4lPNmvqCsksyw{l2k<@4+XH0Cg_!nlIN;^3}Erz5ocVM{8>0ym<-0x!cw`jMB-xeFQBM$;0b`Nh}mbZh3D8m8&j^9Q~Y6CqI6a@?)#%jUD?Qlp_LN-hk4qPMJ^?oA`CNLOhn;)Xjaf&jHStiNwcfvnQDnZX``FV@*-{drjL-gM38y(5Wx5%_G4FjS+peczz4$lP7bcd1W%+*{)YOa%Mg47CLm0;Y1kYEtJr-S#buNE+6K>OKgra@U^lq9r@>O0Sde+9(BVGGlhB81i(ui?6jsM**k@-Rx)NbD}>O)%_x+WD`X@6`i8%fU6<8RG6f#NAqXNene!jq(M;Cx0&Ci_-Xk!VY_RG211W8KYyJgh*Z1B}~7{u~2zh3d0>a2wB??;YyH4|NOml+qDYhrD2uRib)vx+3>{>Z`+jjm&n!{H8Wqq%Yv?^TWhl}fcYE|GVFpPbDX0Y4j{Uu`{5pub0I8FdfwGfYU)LnoM;e~(xeYGhAkcA}rwnf07foSUtXio={hZml4txz5mWLzXBix|gBC`C|po7dAGPv17+TC!_WG|V49O
==YWlJ`gf6f`)y7wjWzW0Z4%zi8CJB3x1hd;cLC>KQWQ|3AUb-w@3I;$QkNnCDLJ3n$4zOA9%NVB}i|E}PEHl|4PDb%#WABfd#L`cHwv$S-*Eie*sti(Be5DhXbb<;FqK4Ypi2*B|9f|EFz5uzMH2PhJ$Ab+5^$9G_vI2;wLk;xpnm!Ab+OTpIGe^l$@y)%aew6-;&k&7^kAeLF@ZUtXNi@*MT4Oti64QNnMVnpiB5MR@4{BF>`D=DC0YV`olYt1D{;`&#-Sdq;F%)bMVnoKQwBgYApR%SD;ZbtGtY6eU9h*Gr~Z5Ckl^)#`W^0i)$EpAXa5!Pm+5yWZz5AAv$VSt*_$eTCRZ$}7Y$+t4E{Uh;}TX%6E2LLbtC@PvvhgU{Q>M9W@xaVg!rSG}@rh-O0kmG1~KyPThG}7b?AvWLXE!_ac?UyhXruB&+i-d#rX}F%lGvv3;d%s_PM5j5tK%;5m1-P24rTT%OND12dT+a+wun|h}t!3Ad;9i@O^AsD|XZm|uS|m4I)5tyw)R0Dp8EppOk($ILUiWl>yY~ZeC}=Qfubk4Io95T)m_nc?H6<+<11lTSQ{d;z#7_)%(me@K`lkUvBS{5LMvxHa`EQxCb$)^llP%9ob-ofb0;QA6=jTFm=G#A34r(r$E5DjFgfmLQ?Ho}QKn2?inHRl&pgcPigcALioqTZ|#U#0Ng)3Ae=M&vc$8itHNE_>IkQITu4YoHdyW-)b<23`zp?Mt!#URN`OGa$|OGd+(WZQv6v*h}!i7;`Q7NS;9NQ%x!2cz0ol6@XlNt-9i|UWh&~0o7kBm`p}xIe%JO?^68H<|_V{OT*IfRC0gAnRHTe=`wy?CYM@U<^MbgsiQIL^|lqRA^Z&Af)na)I9lCsT=*e+UWFE~kh`xUpg5KCF*QuW5GD?5AgVgvHLkhlc>L~TO%`o6GICz5`cT!W;wA6Wg_0SF4iWsX$N}H>fH;3Kb|oJ<068vF1dkEj09!=zA7%9oFcBs~meV~X);lQ}ulY_p*wEHLoyp+HsmF?nWkBFbbGAPxZvxJ|-3iyCQd0HI7^9*gJg{|hCs_9fj9L2R_Q{tWK)U+Bm>4pNt+607p)hCa98z(Q+aAZnd(*;X5%-Di0GvbuH~-e&d$JCIEC10f)a>g^3>`PtvaU`k+DOdhRB}aNZ|MKw&AoUV4T8coYvq*z>!X59Na=i5hOjq|%l9EKS6jP#hHy$M_)dSJ>6{)~ZGNL8{V?n!3Z(l!Y7JPB%(jv(K&Tx&7Pt3);|Bo*msa=0p%IN)MPw5H&3tn*49CAeR{@$UktWHGtoZjF^R$dU*dqr?Xc6zOcW@Hy)-T)E9O^EZU|a{A>@l{fl2*?ys5e2(s6U!wQ+L^$aZw;u?Cg>Q+nRItK*v*s?-0kv{x)J8q2+9AqA4C_F|=KdfhZ45aNmXbyJm3952J4;4nQkcOM(PS~nC&G2KUo8TLhMe%UxXuAc$VieS+^_H;d45|D%Csi5eZ?P=wFlR1Z~Ub0z$ZGH4E4ODtcq)b^w)L-}u*M}<`iY<-8%CQ>yKdoov{=N)J8b?O2Udzu?0fMkcQ#D^kB|lrrmy~v&yb?jq$G}AvYo7!Vb!iuBjh{^FJfo~8>O2lis7v@SMk^66<7VWBOt*TPp75uF_1LW|9<*awpoLJ_1)fBchLt^sQ!t%379U43wbJd-F^EU5&=*9Wq+W3P?AFaZfhYA)4NgU`nG)&*i4#@Y5rZ?1oJU*%i>YIKuxo(=HrsylC6HTbjYZDsc^MWD3jRT0ohC8QYkmv?eDjJ7V4esMSU;YzBzH7RUK1Xd!$Tn)$Ad+Rm5{5m|h!dCh*r<`U@KPavG|&{Oi^M@*(52Wd8_)Q|5LOMvUr4U&I68bs60o%@uSjm6QN3iBGV>xkBQzSU$P8tQQD^>e+J(@6<;>(@q(^PGDn=15NF2zOYgnl#j^2Ra?z(yuXN(!xBRhR<9j!fG6S~y7SE_WhOz&Rm7w)tH~bL4Tm9+ee*Wx6A8IR&LsHIHAnfN3H_DU7xb
Cfpjs77gh@#>LkUi!l)yTh+MZI-Kp@FhF*4xhqDMosEbJ;4E+Gcn1_xoy*7mje0#;gnylFM^33>v%1<&tHO2(PaLTZXIsrqCgwu>TCeFsQE3xeIFT4~jDug%n_Y{#3nj$4*SE`jf>K}VP}55LJtZFwU}dOMh^Zo3V-UUvcbniDP<0%*V$>skRU$*T?D*CzwERQ3^$Ja5$o<6G)d;P)lTD?$YLa3U}q2+hNjS1b=qeFmG?%>~yd$R$<*EO5fv6!KtTT`bi%%2Qd$|4ch#Qlm#W-8BF26oYe92#tyZ$@kFlu&IKG%xLi}s^@f2G1IP~h{-v#TKUrBW_G*!rHTJPdl_$&M!~l8L`6N9&e)A5+sg9T?8Y~Yff0K&FEVYwQyo-G!H*BszOrnQ(@JE_85?$`8Q8^*!M-ax5X^Ye?Zdej1@NBiAQPPPPlDY-b~r}dfvPFg%m?;m{=#$wYT=ULau2?JD14y6e?Hca#9UsgqM7|mA(ArubjC>5F00(>0>M7!w=+DoLM1UW)f=TTLh{@ft2Pz3NQNI$*+$xaxt(mav4Z9)0+E`BA#;;8-%ZxZT5?Fwj9x*^21KkRyHL9=}S0O0Z|Q!fn}9pWGIxTt=91CbH~bo6P-vwj(TOwB*n~zy8CF#0(wvxz>Hh{8Tc2HQFtqsNwWA+)1W`zwU0OZysvZABl1_bwNU5?%|P5^hIYE4(Q#MTKg8XqSei|wpuJ!dmWSpH^v>|H_+?vm=AXBqnWChpbwOG>PLjOJUb}6WEQ^|TYPW(yZ%T}b%YMeQ!Xl`nv%!YuIg8xg`GF$?r2KYA@WscwGA&?)xEk1Upv?vjTtRN4h8k!M2Y^Bqzse{@4nhXsdzoNsdLERe6q`&m}vO9j&#^osJLUH=&Ph54VW4EIphrMq*ooNJ)AU$s`K!;^m>Qw0_ZU7e_0bIA;=CYd#%OY6y7=E4Eq-w|aBfsE7oGNT(U=yO$*ebUHRPJ%c@wFFm>j-apIamz8VKKj_R5K*?Ovp=9G|}J$@AL7i4=N!CF@?i_+WoXIZMRMk&9L{IDqRKM0efE`PR0k2c-YEi^l8V^Hi-Na8H9~4CtSBE?WNrcK!a;F%hlT`kD#x`;Zrj6Tt5gX&4g44+JtwJGqDl9t76C;wuenn=I+OpV)Eh^`UaPJ)&?u1KuNDH6w-sqhsn;d1tFQYCW;@zg&CK2JTq~&cg;JIY0x#wyP9CFzUEs730bJ`#=?OjB4i70}BBAc#3mYl7s8|@$Zs3BCssr=NK@)Z3@DN$LC`%=n@WjzHMpgf{(Dn|jxvSu0{Sv4Q(>Fy#PJUgFQ7{SXwOAz%fjBs3ElM!-t4vR^VUqGrkaE93+C6AQ54^pc^16=72;B|(5!kMpGB4Z#4^jbV}V9_?c>q7TtTvJK}ish8T`A1E{DIFv_6WRl;T3li&}5Q6^T(NA5(>3Z0k+OYbV0o%K(g0ZysU9b(DL+veDMsG+_m>ZoDU^OnJypb~!ls>Ak&V_b;{izmm2=LK11hRsWeM-s@K}fFcHCHelof!$hwEM6Bq$JbrDcJ11xIN@Ub6(eHv1>=1i35&qC2yiPdqlLP)vRP=KB_tnYAE66|HF22GtZZ26sABJH3`Hj{R69hM)LKkfDp~CgFk`Z6}Dx2|4=g){RXHr|P%XMyB&{mgs#33lX_U#88Up@N+Yo1%7D<`AcQWhP0;Zbp>}Hyz^m)asd~9W_0o;!R;3bcDQmR%9917*ukRnpA@~zugZ>HLdE-yXXQT{A@}xz1}CNMdf3Eq40)-&u*%%VG25#CVLZHmbRnS3uNRcK%WQ5-w<~{l+5J*M&{(E9WZ*Yegj%;581dEbK-B>SXV;@6WQ%)(GMV2ff+*@1TkU3YQN>fuZI5&itJylTLv8IQ(J{dv@sqX>#9o(3eYC{vqT`d3#iphkB@f?xHtR?5Ws$&ncJ*KLh3JDzyN@!J^N{?XRwG0*tXFFvuW=!~>A;8$IWpJD7_F`2m(jIBoK3(@uzxrJlNzMyUFa@1iHcV~>|{YtTvNW<@x7a?m
lSgVE(?M19tIQvrtrg|U?zETenC5AOU!sTHg4D1J|c!DvP61=@~$LVDV`mETa%Bt0*=9@eSj;SikckPzBqp*W%}cq9rDVQldR!A6&+=x8uS;KT9FZr!$rd6+G(O#CH=iJ=|zj*3Rj?T)DdsQk(NNxIbUYjU*12sFJDwFzbIwa99OIUi?c(7Q~0)c;B#vt*5Ix<*rJ}_>kPtGllMV|n)TPK+y-74#9hCg#QGIb9p)XIU`uOUIr0&qw87)JM&kX11v<+EugPZqQ!WtedF59xwPk07cM;7FCQHs!{MpoUYIod_BK0S@|U&Ih!{5_M@g_jT;YnuZ;-(&QMV8~MIXL0zKThFC^`V0C_DD&i?4MY}|u%}_Bna!AFnl>M^38^GrK`K!7%S%ALhkWEhV|_=(z~?POYKAf)85ybuv&DOsg;x5&-0?h0;#F+#o<5QY3!TO|wBh3z4wjP=Kx-%^qb5BM8eUb#EScU&>>*Cbt8R*6~L3-Z`JG~P4Z=f}DOVS_$eFKG*dEvPN?<0;ft4KiP7a^GB)Qk_3b2=K2IxS-`H^PJS0^md04?ouCuhvF5uXrI58TWq81d^njUs+|1b463#3dEw3k`LD`C>`Ru5Se*D`p+HZ6^U0_IxXkZmI;X!JsJ~)&aXy+Aa;0F4=@_8#^ffWuq=F9u^8A{iFZ5i`vY%X%1LDd3l0&V4Z1P-a55GM84+{i<2Zy5Kh1xGc5)4v;79@$dxx|eYXBdEL9?ZcNB(h^82WEns3b8y)@76JjM~zIYa`Xr#pZ4Teu7J9gLW**n7nI6Qx<%Fr#grL1H(q@VDw?$bGQcv?2M_gZZk5dD;;Z=o^pD(Njg%a=A94xfTMTZ;lTV(nzP@kVR&kyTRnAj^tv0mP^n(>A><707-MxR6Da4?t5SRb0>ICZefLcFUk?ZL$^Lv)h=hHmwQq^R&(1MF1as*9dl~}?v`{8!tpt=fG(?Yl%vUMl`z`P4QNp2)FeSOENKad3Z5S$N>7)ezn9pwzw%*YG(xRY{I`VRX);{T;E2RWcCguQuDFW+z^u+@b9bh(^IuEoAGy#2J$gY+K^X=tR*wLIB*h-NFwLA~P-ec%pj%+zxxY(=mdrUy-(QokLa(dxv1=>3==OCm|j%Ptf7nhLB$#1s(&Hw0Wx(r;2WkmZl!3>e8=xG6|*$23E{9rHmuz6W-F1zfDolfg*bXq2agXhG26$s)^UC2h|2WaVH`Eodoy4X;j~(>*{wJ(UAAS&ZEWk`m0?H{~DDFWkKzzEyb}pF9UWbq&Tq69HF@IEw$i4q05wUMK1pH{W1Q}o_TvS`FlQ6ZRzmr)*CsYTe5xnoXAH6IS8*71Qll+-?FFB--95KVy#vDlRFhj~Xq2X#@e)dMp9M)+^*sxhoY|CAQeDFLKY!Z{FBabA`{WmZjWXpc^HV{Q@#rXccm@i&&IT`}6W92~5bq1UZ2c1`%icv!Aw%n2iFcBmC(q%%qXP1Ng)2ePcGR7DWFK((&2AQ@1?N?c4gSNn*#Dz|om1dqHgmwA?%%BE7Dr!<7lGeR`=^C2~Na_`vd3YR#|?;k=vLL6Q?@nf?SEI|`o@(IhBa`D;q$~1Kw7q=-*oh=?JSO0?1g8K_Hh@!0HWx_ATxj=Sf*;IPuA}M5&Q{fr-BaG^DbhKY1tgRX60zcMB<9YM{q@yb>4aJSOq~1ZZzqliK&zd%+YS}Zq=Kd|~cU3)FIHfF{O#5>Y+Jq{`SM|Zwv{lQzKFH|d>Zg*C2o|HQb5I6m-9^2r!y71#nARv5)ZnImv@1;kuO6+rhvFCIADi5G?wfWxd!~YKo?*UnPpes>IhJcc{q`$%pzun25KC3yNC&uMz*Ml{COg(cOkB3!@rxJ-8BVC5LW$A8=fDR?}DWe8a_eit_ue<}pe8x6aI~4PZ8O6({^1{|Rid#GNt}VS*h~cP0duJDW`LzP@f`-DJ<>RST=Q7b<{T62@%GCTkvM75092f3MR^9=fG1!_dLGaOLfA4J@J34iz&71l+k>p8>ey_pmsJ?W4DuZ+ML#8
D#wY)R(y52MA?s4{-Rr;pejy4Qnted&e=h3M2X=%fg-uZ^wpxoN)N~Oyq|2elHg_Le>MzBNADLaX`Fb1op9huFWp`t*;9rUvy>Pa*&PTH3&2(MkOSsWZ%$p=jn`m2dp@a}<3UCHx7oDkBA(AyjKA>?lnB>;E2FoT#-+iTa*a$knMMZ}PN#w;vbRoyP$MpP++RGe;OvrBPV1veW7FWd1{{2PaVZfNkJ$ODl#I7Vr&QTr{a;Y(n1vtjCS+-azxelRATB$pFT=%)(RucIEZ2A#;nzyx=&Q)PorU3H3OkPP?^S_(eg)aPVj%rKNNkK`?YJ)WzrxA$TGvT?^-Bs+XW#lX89!NvBw$W~Uvs-kW2g%6r%97HtI*g9=N8RE;Am++Qduni4(pTIN^FoE#V*5jjfjAo>wpUls?G=!pDf<&~!XCgQ|so`8?-II6AW^s+~1{wbS*gNQwg^D0YuOx;ZmOjt)w!i5ehe_XQoCM~T_<-VNE2G|+NJYpfp5+H2I!FDrI<4rR$xHBmHXSl=#c4R718}9}4o)!-JqLJ<`QHy4mGqZ(fee)ibVtr$FWyM!99@=8J`mvxIwrRrfcH(*&{F&zf>pd_%6&*5gBXOw!B3v_S~C`r_S$r)N9%-e94-RabLr$VVhW%0L_GN@Wm{LZZ9#u>mEfN!c}q(@0hColP1+-KL)YCkY@hKtjYO$RVd_&M4i!(wjeS?aJspKuHVHUZA^fQvx7bBa~_H*AJd;!7B}wOUv$V@5n$#9lzE+NU6g@=38OAZy=><4ldtikew>Q3MNFhi9l28zp_ip3IgoyIMxq))KNdd2@4C>`6r2`2cLYCujF42ODbwrif<-nRGMv19uFEKS8+n98NpJiY_O8-qA}ttYtv3SBGd$tsl4*s>E;6KDF+pA7HXIr%0)DZ-(N@-kc2OIzBn|?B|8(;IN4L@>J^0~G&PeDpKQc}4hPZR5xa`W%W&DX@SA`cgF)No=YC@R!ySGBrZor1sZC5$ZLw@gEi_=PjyT>e&bMiuK`Jf=SEoZ2m1SFu1xvjz3vGr-r>6`Hh?hWHU=9l)U>-Bk>QJV9t&LR%=8iG4*8s)@b`xhd(YUqjb-#W3?s7@)m@M$q1tn0X;I?F8BS0(#ljM`LJXRSp&)_br`HpN5|?cNV}2pzL`p35Mn1Ia@DJWRRFJNJlq?Qnxn+zS^DT%tl8ki1Af+^1W$>}URlrqY_(QOjY2a~YE1`?@5MS~q7wDU2ep?Iz6b-1-$*GbT+dsk$^e;Fy5=$JCiSB802|yu@FgOnHvO&R;b?`pivT%vV|4;dP`nae(m&rR2tqESoQPO;US}84gch1Ov1!=?)^8Oqz#IlXm4~Nh9eFfWnW}p4}`U&B0P9|0BLbK~$d>uRfN0W!kjKHfFHxt~2}6dc=fX%GlkgD_mh6M908_CA`mEfI;>0^Htvev?W95#7Gj>xGP5xq5S4J=D(OxpGktl(!N~j8ctse26rPj&H?1`>AS4bv+fcRjQ9+ZVK5lZ`2Z*ls?@x3!GI5S`dgqUyYa5drbzrBOr%p?K$j`H7vOk<4ZE8b?ZUd((l2|ur(_)j^>6K*^0o83e5F`Gyix|Q9$c6W2{Xl-(orJ1mKl_vsf3Yf`>bmA>?cp@_@`WfvkR8}d7P`h_+}o4?Xan!5EScE^A)4rdpz$~(&jE1#-xg2T#?-h62XAYNc(p56{xDZs#@*EdGnGM`VbxQ>tpXLW@ZyikBFudS=Fnq+?WnilshtRQzS?~P_%jG#n@5SRU^3os)3>^_RtYRxXso~WAHoN`l1YLZHK5<`z6ng>O+Ac8Hq=|>+=YO*52Iv$@!1^oxZw_w)<~m+;BTLX{qH|UHV&tr&b>&*d+bG$e{io3%E%4(!3vYi1#RA<;){Wn+qj1&QZyF(L_g*pNZfDusRok9A)J(&%LEOc(glQbP@X47q0o_N1=HfA`k}VKwapcCsfg9m?H)6!ZU0&8Hjx}G+rV5_jp*d8RF^{eF%O;LZgaOd@-e;S|_iD1bUq&ZkgC^iLL!
Sp5#iZJmnjvho|#4!t1uR-p>}xQVHlkjQg5e)+0{4XZl7m*fUxJHBg>IDbk=#y1yq8<;INFD5?o}x{}glOS;2z9QrG}&6qhdqZZ!BY*d2vQUl*fbSn-`2nHqf9GO8Y}H#}mk2XduUoR=1Y%vU6gHaAgfMdyBZF12N|D>YSt`;^c-pcLH@dd$m*SVjzjY46iE|mK3!tTg3LwAmR0QRUlFJ~J(DBHzvtKfhL*ux$WVoqb25zHSgbgwVRs{PbH#DHS>z;8jp9V!eN_2#nXlBYyIH3qE62FH2rcCpWMx54(iF_XQk?Q&K78Vw9H7kJ$3)6&~n*(B}!C7cNG(WVWzQq?ag5Ix;v_zK&^0`*QI7xGdk^Lj0_fki;ZbX%6mIAihxy$vqS!NMRUhvJE*kBg`!#(*|#=F;5jre@9-#+50+mC_`uYl;@u6Rl|&#`QmdY7Cnwy5HnaorhjfkW_Nvu|HD2Ep54+%%o~wzj%e4`U_9_h-#)e36+y9#kXHPcjLB#!RC>X`gf54HP%!XlA;=5<%pnJXruW0t+CwagiWPbfhO3J1$LV3()$`5LD7;1>vs1e6sdvgxUfe7fSnV_Ax;?}vM?ubwHl|!&RwtT_Kso;PxqOAS`t>ACjRnwJ$C~dF()?*$WAZm$sF)<;Frg@$Gtl#9k-+w##H8Xq;d=@`3de>P@e(Z1`SareR@H=XS4&*foee>&eM3QXc7A|0@L#O03(MJZ+Mvw?q_tWzcc$7Te|`VQk?w##|(auK2kMp9plN0~;NHz(5%%GZS2S|QB-mMV$K>^iEyb7Ua(5He_A`l|HHp=J*Zy`R8gKwpWVhFRta6V};g+gMZ$bq>EzG|k={L+aqnf9V@&k_vYi?_}aoyE?2Q?w*IpnIf?`Rlz~$`TPm@QwN@~T>7iwHcnVi_8Nm@GY))Q+!Y-L>QH4fS0fMnsX>CV`2}nSEqWWL3@Q?lNw?koP0$u^Ue)ry1?OMo3s$}dji#xpSiXER$qb;I{V|_nxq)aru`LO0h2u9Tt#BK~1#Az0EDX=E5FCRP5^utS0q@@)>*Atw?Wn6yf@N>UAGXW?UQGHV3*XG>CuVNa1Q88uhQF~x=ElbJp4LBW1b%rF0NEbe*RS7hdfL-o@UL<{RYN%@sDe5k9(Z%f+-qv272~^e3ql%`Ob!Ns_mxW&L*SMg{F1c6;iV1(!+Ris)xgjJQp*;5y+-`oK{bL_`(6-$jTmGmYmpSd%0RX#z4)dZ`cJz(S=&8Efo@ER;Lz7n1;L~pLoC8Ba7_CdREM%k?HuFzj~Dz%I&0~4Mu`oV^E=GdLdk5~%!y^T(yf&B4kw4;c<=ISZ7F2sxcJ1<0&W#~P26|>^#wEyz8-rwh}sySmLaRQCzN_nqWyZ!{BVSZL7A}i2Z865z>#t0Qt88xtC;|-nJOOoX49(Dg6XvkW<|`9|=-oxHU)fS>DOfYFdhSI4^XCb>jWE-%V5f}Kl_yH!9Ts#T{@FVtbi*4P$T_h#u?K0**6cwdR$S423`zgk`&h7<6f8)WT?u;yH7%i&CWo*H&h1+3Ev??WhsP*s?30CyITM"),format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}])) diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed0.log b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed0.log new file mode 100644 index 0000000000..bbfa309c5e --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed0.log @@ -0,0 +1,334 @@ 
+==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /workspace/parameter-golf/data + datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/pr1413_combo_s0.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + ngram_agree_bonus: 0.1 + ngram_base_beta: 2.0 + ngram_open_table_bits: 26 + ngram_order_stride: 2 + ngram_tilt_enabled: False + ngram_within_beta: 0.0 + ngram_within_threshold: 0.25 + ngram_word_beta: 0.0 + ngram_word_threshold: 0.8 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: pr1413_combo_s0 + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + skip_training: False + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: 
/workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Tue Apr 7 22:39:45 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 48C P0 131W / 700W | 1521MiB / 81559MiB | 8% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 36C P0 119W / 700W | 1521MiB / 81559MiB | 8% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 35C P0 121W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 45C P0 131W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 48C P0 130W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 38C P0 124W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 46C P0 125W / 700W | 1521MiB / 81559MiB | 7% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 36C P0 121W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + 
++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0072 val_bpb: 3.4870 +1/20000 train_loss: 9.0088 train_time: 0.0m tok/s: 8258431 +2/20000 train_loss: 12.3526 train_time: 0.0m tok/s: 8149316 +3/20000 train_loss: 11.1728 train_time: 0.0m tok/s: 8059570 +4/20000 train_loss: 9.5091 train_time: 0.0m tok/s: 8014837 +5/20000 train_loss: 8.4365 train_time: 0.0m tok/s: 7987654 +500/20000 train_loss: 3.3339 train_time: 0.8m tok/s: 7735958 +1000/20000 train_loss: 3.1851 train_time: 1.7m tok/s: 7728344 +1500/20000 train_loss: 3.0964 train_time: 2.5m tok/s: 7727125 +2000/20000 train_loss: 3.0665 train_time: 3.4m tok/s: 7727235 +2500/20000 train_loss: 3.1014 train_time: 4.2m tok/s: 7728701 +layer_loop:enabled step:2890 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 2.9928 train_time: 5.2m tok/s: 7599462 +3500/20000 train_loss: 2.9957 train_time: 6.4m tok/s: 7148542 +4000/20000 train_loss: 2.9421 train_time: 7.7m tok/s: 6844878 
+4000/20000 val_loss: 2.9037 val_bpb: 1.1241 +4500/20000 train_loss: 2.7969 train_time: 8.9m tok/s: 6625994 +4847/20000 val_loss: 2.8120 val_bpb: 1.0886 +stopping_early: wallclock_cap train_time: 588034ms step: 4847/20000 +peak memory allocated: 39045 MiB reserved: 39100 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.80993570 val_bpb:1.08781368 eval_time:7247ms +Serialized model: 135431033 bytes +Code size: 17390 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15975248 bytes +Total submission size quantized+brotli: 15992638 bytes +quantized val_loss:2.83976191 val_bpb:1.09936033 eval_time:25236ms +quantized_sliding_window val_loss:2.79649863 val_bpb:1.08261177 eval_time:119638ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=35944536 frozen=0 + ttt_chunk [1/1238] bpb=1.114728 time=33.3s + ttt_chunk [11/1238] bpb=1.069792 time=38.5s + ttt_chunk [21/1238] bpb=1.107642 time=41.0s + ttt_chunk [31/1238] bpb=1.101970 time=43.5s + ttt_chunk [41/1238] bpb=1.095229 time=46.7s + ttt_chunk [51/1238] bpb=1.088915 time=49.2s + ttt_chunk [61/1238] bpb=1.080585 time=51.8s + ttt_chunk [71/1238] bpb=1.087521 time=54.3s + ttt_chunk [81/1238] bpb=1.080925 time=56.8s + ttt_chunk [91/1238] bpb=1.077538 time=59.3s + ttt_chunk [101/1238] bpb=1.077329 time=61.9s + ttt_chunk [111/1238] bpb=1.075709 time=64.4s + ttt_chunk [121/1238] bpb=1.078578 time=67.6s + ttt_chunk [131/1238] bpb=1.082255 time=70.1s + ttt_chunk [141/1238] bpb=1.082828 time=72.6s + ttt_chunk 
[151/1238] bpb=1.082670 time=75.2s + ttt_chunk [161/1238] bpb=1.083125 time=77.7s + ttt_chunk [171/1238] bpb=1.082986 time=80.2s + ttt_chunk [181/1238] bpb=1.081510 time=82.7s + ttt_chunk [191/1238] bpb=1.081256 time=85.2s + ttt_chunk [201/1238] bpb=1.078803 time=87.7s + ttt_chunk [211/1238] bpb=1.083235 time=90.9s + ttt_chunk [221/1238] bpb=1.083554 time=94.2s + ttt_chunk [231/1238] bpb=1.085115 time=96.7s + ttt_chunk [241/1238] bpb=1.083423 time=99.2s + ttt_chunk [251/1238] bpb=1.083447 time=101.7s + ttt_chunk [261/1238] bpb=1.084525 time=104.2s + ttt_chunk [271/1238] bpb=1.085071 time=106.7s + ttt_chunk [281/1238] bpb=1.084309 time=109.3s + ttt_chunk [291/1238] bpb=1.085481 time=111.7s + ttt_chunk [301/1238] bpb=1.085649 time=114.2s + ttt_chunk [311/1238] bpb=1.084554 time=116.7s + ttt_chunk [321/1238] bpb=1.084432 time=119.2s + ttt_chunk [331/1238] bpb=1.084750 time=121.7s + ttt_chunk [341/1238] bpb=1.083910 time=124.2s + ttt_chunk [351/1238] bpb=1.084647 time=126.7s + ttt_chunk [361/1238] bpb=1.083615 time=129.3s + ttt_chunk [371/1238] bpb=1.082086 time=131.8s + ttt_chunk [381/1238] bpb=1.082480 time=134.4s + ttt_chunk [391/1238] bpb=1.082151 time=136.9s + ttt_chunk [401/1238] bpb=1.082223 time=139.4s + ttt_chunk [411/1238] bpb=1.082833 time=141.9s + ttt_chunk [421/1238] bpb=1.082273 time=144.4s + ttt_chunk [431/1238] bpb=1.082465 time=147.0s + ttt_chunk [441/1238] bpb=1.082548 time=149.5s + ttt_chunk [451/1238] bpb=1.083749 time=152.0s + ttt_chunk [461/1238] bpb=1.082003 time=154.5s + ttt_chunk [471/1238] bpb=1.082005 time=157.0s + ttt_chunk [481/1238] bpb=1.082131 time=159.5s + ttt_chunk [491/1238] bpb=1.082605 time=162.1s + ttt_chunk [501/1238] bpb=1.082222 time=164.6s + ttt_chunk [511/1238] bpb=1.081903 time=167.1s + ttt_chunk [521/1238] bpb=1.081446 time=169.6s + ttt_chunk [531/1238] bpb=1.081446 time=172.1s + ttt_chunk [541/1238] bpb=1.081558 time=174.7s + ttt_chunk [551/1238] bpb=1.081109 time=177.2s + ttt_chunk [561/1238] bpb=1.080400 time=179.7s + 
ttt_chunk [571/1238] bpb=1.079811 time=182.2s + ttt_chunk [581/1238] bpb=1.080181 time=184.7s + ttt_chunk [591/1238] bpb=1.080404 time=187.2s + ttt_chunk [601/1238] bpb=1.080325 time=189.7s + ttt_chunk [611/1238] bpb=1.080879 time=192.3s + ttt_chunk [621/1238] bpb=1.081725 time=194.9s + ttt_chunk [631/1238] bpb=1.081794 time=197.4s + ttt_chunk [641/1238] bpb=1.082253 time=199.9s + ttt_chunk [651/1238] bpb=1.082595 time=202.4s + ttt_chunk [661/1238] bpb=1.081916 time=204.9s + ttt_chunk [671/1238] bpb=1.081662 time=207.5s + ttt_chunk [681/1238] bpb=1.082983 time=210.0s + ttt_chunk [691/1238] bpb=1.083173 time=212.5s + ttt_chunk [701/1238] bpb=1.082989 time=215.1s + ttt_chunk [711/1238] bpb=1.083691 time=217.6s + ttt_chunk [721/1238] bpb=1.084009 time=220.2s + ttt_chunk [731/1238] bpb=1.083361 time=222.7s + ttt_chunk [741/1238] bpb=1.083079 time=225.2s + ttt_chunk [751/1238] bpb=1.082173 time=227.8s + ttt_chunk [761/1238] bpb=1.081575 time=230.4s + ttt_chunk [771/1238] bpb=1.080563 time=232.9s + ttt_chunk [781/1238] bpb=1.080520 time=235.4s + ttt_chunk [791/1238] bpb=1.080854 time=237.9s + ttt_chunk [801/1238] bpb=1.081152 time=240.5s + ttt_chunk [811/1238] bpb=1.080643 time=243.0s + ttt_chunk [821/1238] bpb=1.079455 time=245.5s + ttt_chunk [831/1238] bpb=1.079131 time=248.0s + ttt_chunk [841/1238] bpb=1.078688 time=250.5s + ttt_chunk [851/1238] bpb=1.078397 time=253.0s + ttt_chunk [861/1238] bpb=1.078079 time=255.6s + ttt_chunk [871/1238] bpb=1.077978 time=258.1s + ttt_chunk [881/1238] bpb=1.077513 time=260.6s + ttt_chunk [891/1238] bpb=1.076998 time=263.1s + ttt_chunk [901/1238] bpb=1.077406 time=265.6s + ttt_chunk [911/1238] bpb=1.077103 time=268.1s + ttt_chunk [921/1238] bpb=1.077376 time=270.7s + ttt_chunk [931/1238] bpb=1.078077 time=273.2s + ttt_chunk [941/1238] bpb=1.078472 time=275.7s + ttt_chunk [951/1238] bpb=1.078400 time=278.2s + ttt_chunk [961/1238] bpb=1.079235 time=280.7s + ttt_chunk [971/1238] bpb=1.079648 time=283.3s + ttt_chunk [981/1238] 
bpb=1.080023 time=285.8s + ttt_chunk [991/1238] bpb=1.079810 time=288.3s + ttt_chunk [1001/1238] bpb=1.079854 time=290.9s + ttt_chunk [1011/1238] bpb=1.080184 time=293.4s + ttt_chunk [1021/1238] bpb=1.080901 time=295.9s + ttt_chunk [1031/1238] bpb=1.081380 time=298.4s + ttt_chunk [1041/1238] bpb=1.081865 time=300.9s + ttt_chunk [1051/1238] bpb=1.081800 time=303.4s + ttt_chunk [1061/1238] bpb=1.081789 time=306.0s + ttt_chunk [1071/1238] bpb=1.081938 time=308.5s + ttt_chunk [1081/1238] bpb=1.081824 time=311.0s + ttt_chunk [1091/1238] bpb=1.082037 time=313.5s + ttt_chunk [1101/1238] bpb=1.082589 time=316.1s + ttt_chunk [1111/1238] bpb=1.082882 time=318.6s + ttt_chunk [1121/1238] bpb=1.083046 time=321.1s + ttt_chunk [1131/1238] bpb=1.082716 time=323.6s + ttt_chunk [1141/1238] bpb=1.082394 time=326.2s + ttt_chunk [1151/1238] bpb=1.082432 time=328.7s + ttt_chunk [1161/1238] bpb=1.082572 time=331.2s + ttt_chunk [1171/1238] bpb=1.082344 time=333.7s + ttt_chunk [1181/1238] bpb=1.081876 time=336.2s + ttt_chunk [1191/1238] bpb=1.082038 time=338.8s + ttt_chunk [1201/1238] bpb=1.082088 time=341.3s + ttt_chunk [1211/1238] bpb=1.081779 time=343.8s + ttt_chunk [1221/1238] bpb=1.081314 time=346.3s + ttt_chunk [1231/1238] bpb=1.080956 time=348.8s + ttt_chunk [1238/1238] bpb=1.080960 time=366.4s +ttt_sliding:done val_loss=2.792167 val_bpb=1.080935 elapsed=366.7s +legal_ttt_exact val_loss:2.79216698 val_bpb:1.08093485 eval_time:366951ms diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1234.log b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1234.log new file mode 100644 index 0000000000..7683fb1062 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1234.log @@ -0,0 +1,334 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: 
brotli + data_dir: /workspace/parameter-golf/data + datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/pr1413_combo_s1234.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + ngram_agree_bonus: 0.1 + ngram_base_beta: 2.0 + ngram_open_table_bits: 26 + ngram_order_stride: 2 + ngram_tilt_enabled: False + ngram_within_beta: 0.0 + ngram_within_threshold: 0.25 + ngram_word_beta: 0.0 + ngram_word_threshold: 0.8 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: pr1413_combo_s1234 + scalar_lr: 0.02 + seed: 1234 + skip_gates_enabled: True + skip_training: False + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 
0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Wed Apr 8 00:38:48 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 43C P0 124W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 33C P0 118W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 33C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 40C P0 126W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off 
| 0 | +| N/A 43C P0 126W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 35C P0 121W / 700W | 1521MiB / 81559MiB | 4% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 41C P0 123W / 700W | 1521MiB / 81559MiB | 4% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 33C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0072 val_bpb: 3.4870 +1/20000 train_loss: 9.0096 train_time: 
0.0m tok/s: 8323517 +2/20000 train_loss: 12.3264 train_time: 0.0m tok/s: 8191653 +3/20000 train_loss: 11.1066 train_time: 0.0m tok/s: 8087738 +4/20000 train_loss: 9.4404 train_time: 0.0m tok/s: 8035734 +5/20000 train_loss: 8.3419 train_time: 0.0m tok/s: 8005748 +500/20000 train_loss: 3.3355 train_time: 0.8m tok/s: 7726463 +1000/20000 train_loss: 3.1859 train_time: 1.7m tok/s: 7726770 +1500/20000 train_loss: 3.0953 train_time: 2.5m tok/s: 7733633 +2000/20000 train_loss: 3.0709 train_time: 3.4m tok/s: 7739485 +2500/20000 train_loss: 3.0983 train_time: 4.2m tok/s: 7742977 +layer_loop:enabled step:2896 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 2.9929 train_time: 5.2m tok/s: 7622358 +3500/20000 train_loss: 2.9943 train_time: 6.4m tok/s: 7166611 +4000/20000 train_loss: 2.9434 train_time: 7.6m tok/s: 6859580 +4000/20000 val_loss: 2.9050 val_bpb: 1.1246 +4500/20000 train_loss: 2.7958 train_time: 8.9m tok/s: 6639095 +4869/20000 val_loss: 2.8118 val_bpb: 1.0885 +stopping_early: wallclock_cap train_time: 588011ms step: 4869/20000 +peak memory allocated: 39046 MiB reserved: 39070 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.80981375 val_bpb:1.08776646 eval_time:6669ms +Serialized model: 135431033 bytes +Code size: 17390 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15972633 bytes +Total submission size quantized+brotli: 15990023 bytes +quantized val_loss:2.83931372 val_bpb:1.09918682 eval_time:8877ms +quantized_sliding_window val_loss:2.79615613 val_bpb:1.08247918 eval_time:91864ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=35944536 frozen=0 + ttt_chunk [1/1238] bpb=1.114966 time=4.9s + ttt_chunk [11/1238] bpb=1.072240 time=9.4s + ttt_chunk [21/1238] bpb=1.109567 time=11.9s + ttt_chunk [31/1238] bpb=1.103426 time=14.4s + ttt_chunk [41/1238] bpb=1.096183 time=16.9s + ttt_chunk [51/1238] bpb=1.089001 time=19.4s + ttt_chunk [61/1238] bpb=1.081056 time=21.9s + ttt_chunk [71/1238] bpb=1.087769 time=24.4s + ttt_chunk [81/1238] bpb=1.080987 time=26.9s + ttt_chunk [91/1238] bpb=1.077486 time=29.4s + ttt_chunk [101/1238] bpb=1.077015 time=32.0s + ttt_chunk [111/1238] bpb=1.075353 time=34.5s + ttt_chunk [121/1238] bpb=1.078384 time=36.9s + ttt_chunk [131/1238] bpb=1.082080 time=39.5s + ttt_chunk [141/1238] bpb=1.082683 time=42.0s + ttt_chunk [151/1238] bpb=1.082602 time=44.4s + ttt_chunk [161/1238] bpb=1.083106 time=47.0s + ttt_chunk [171/1238] bpb=1.082988 time=49.5s + ttt_chunk [181/1238] bpb=1.081462 time=52.0s + ttt_chunk [191/1238] bpb=1.081275 time=54.5s + ttt_chunk [201/1238] bpb=1.078838 time=57.0s + ttt_chunk [211/1238] bpb=1.083285 time=59.5s + ttt_chunk [221/1238] bpb=1.083667 time=62.0s + ttt_chunk [231/1238] bpb=1.085393 time=64.5s + ttt_chunk [241/1238] bpb=1.083577 time=67.0s + ttt_chunk [251/1238] bpb=1.083579 
time=69.6s + ttt_chunk [261/1238] bpb=1.084638 time=72.1s + ttt_chunk [271/1238] bpb=1.085148 time=74.6s + ttt_chunk [281/1238] bpb=1.084429 time=77.1s + ttt_chunk [291/1238] bpb=1.085650 time=79.9s + ttt_chunk [301/1238] bpb=1.085822 time=82.5s + ttt_chunk [311/1238] bpb=1.084713 time=85.0s + ttt_chunk [321/1238] bpb=1.084561 time=87.5s + ttt_chunk [331/1238] bpb=1.084816 time=90.0s + ttt_chunk [341/1238] bpb=1.083931 time=92.5s + ttt_chunk [351/1238] bpb=1.084650 time=95.1s + ttt_chunk [361/1238] bpb=1.083609 time=97.6s + ttt_chunk [371/1238] bpb=1.082060 time=100.1s + ttt_chunk [381/1238] bpb=1.082437 time=102.6s + ttt_chunk [391/1238] bpb=1.082107 time=105.1s + ttt_chunk [401/1238] bpb=1.082239 time=107.6s + ttt_chunk [411/1238] bpb=1.082798 time=110.2s + ttt_chunk [421/1238] bpb=1.082273 time=112.7s + ttt_chunk [431/1238] bpb=1.082432 time=115.2s + ttt_chunk [441/1238] bpb=1.082520 time=117.7s + ttt_chunk [451/1238] bpb=1.083698 time=120.3s + ttt_chunk [461/1238] bpb=1.081909 time=122.8s + ttt_chunk [471/1238] bpb=1.081917 time=125.3s + ttt_chunk [481/1238] bpb=1.082088 time=127.8s + ttt_chunk [491/1238] bpb=1.082558 time=130.7s + ttt_chunk [501/1238] bpb=1.082183 time=133.6s + ttt_chunk [511/1238] bpb=1.081848 time=136.1s + ttt_chunk [521/1238] bpb=1.081402 time=138.6s + ttt_chunk [531/1238] bpb=1.081388 time=141.4s + ttt_chunk [541/1238] bpb=1.081473 time=144.0s + ttt_chunk [551/1238] bpb=1.081040 time=146.5s + ttt_chunk [561/1238] bpb=1.080407 time=149.0s + ttt_chunk [571/1238] bpb=1.079868 time=151.5s + ttt_chunk [581/1238] bpb=1.080233 time=154.4s + ttt_chunk [591/1238] bpb=1.080425 time=156.9s + ttt_chunk [601/1238] bpb=1.080368 time=159.5s + ttt_chunk [611/1238] bpb=1.080927 time=162.0s + ttt_chunk [621/1238] bpb=1.081790 time=164.5s + ttt_chunk [631/1238] bpb=1.081843 time=167.0s + ttt_chunk [641/1238] bpb=1.082311 time=169.6s + ttt_chunk [651/1238] bpb=1.082644 time=172.1s + ttt_chunk [661/1238] bpb=1.082007 time=174.6s + ttt_chunk [671/1238] 
bpb=1.081761 time=177.1s + ttt_chunk [681/1238] bpb=1.083081 time=179.6s + ttt_chunk [691/1238] bpb=1.083259 time=182.2s + ttt_chunk [701/1238] bpb=1.083068 time=184.7s + ttt_chunk [711/1238] bpb=1.083777 time=187.2s + ttt_chunk [721/1238] bpb=1.084060 time=189.8s + ttt_chunk [731/1238] bpb=1.083403 time=192.3s + ttt_chunk [741/1238] bpb=1.083134 time=194.8s + ttt_chunk [751/1238] bpb=1.082198 time=197.3s + ttt_chunk [761/1238] bpb=1.081579 time=199.9s + ttt_chunk [771/1238] bpb=1.080571 time=202.4s + ttt_chunk [781/1238] bpb=1.080552 time=204.9s + ttt_chunk [791/1238] bpb=1.080906 time=207.4s + ttt_chunk [801/1238] bpb=1.081240 time=209.9s + ttt_chunk [811/1238] bpb=1.080751 time=212.4s + ttt_chunk [821/1238] bpb=1.079567 time=215.0s + ttt_chunk [831/1238] bpb=1.079247 time=217.6s + ttt_chunk [841/1238] bpb=1.078798 time=220.1s + ttt_chunk [851/1238] bpb=1.078521 time=222.7s + ttt_chunk [861/1238] bpb=1.078184 time=225.2s + ttt_chunk [871/1238] bpb=1.078074 time=227.7s + ttt_chunk [881/1238] bpb=1.077634 time=230.2s + ttt_chunk [891/1238] bpb=1.077119 time=232.7s + ttt_chunk [901/1238] bpb=1.077521 time=235.2s + ttt_chunk [911/1238] bpb=1.077231 time=237.7s + ttt_chunk [921/1238] bpb=1.077508 time=240.2s + ttt_chunk [931/1238] bpb=1.078181 time=242.7s + ttt_chunk [941/1238] bpb=1.078564 time=245.2s + ttt_chunk [951/1238] bpb=1.078474 time=247.8s + ttt_chunk [961/1238] bpb=1.079324 time=250.3s + ttt_chunk [971/1238] bpb=1.079734 time=252.8s + ttt_chunk [981/1238] bpb=1.080090 time=255.3s + ttt_chunk [991/1238] bpb=1.079904 time=257.9s + ttt_chunk [1001/1238] bpb=1.079954 time=260.5s + ttt_chunk [1011/1238] bpb=1.080290 time=263.0s + ttt_chunk [1021/1238] bpb=1.081004 time=265.5s + ttt_chunk [1031/1238] bpb=1.081485 time=268.0s + ttt_chunk [1041/1238] bpb=1.081952 time=270.6s + ttt_chunk [1051/1238] bpb=1.081890 time=273.2s + ttt_chunk [1061/1238] bpb=1.081879 time=275.7s + ttt_chunk [1071/1238] bpb=1.082026 time=278.2s + ttt_chunk [1081/1238] bpb=1.081919 
time=280.7s + ttt_chunk [1091/1238] bpb=1.082112 time=283.3s + ttt_chunk [1101/1238] bpb=1.082651 time=285.8s + ttt_chunk [1111/1238] bpb=1.082942 time=288.3s + ttt_chunk [1121/1238] bpb=1.083107 time=290.9s + ttt_chunk [1131/1238] bpb=1.082773 time=293.4s + ttt_chunk [1141/1238] bpb=1.082439 time=296.0s + ttt_chunk [1151/1238] bpb=1.082479 time=298.5s + ttt_chunk [1161/1238] bpb=1.082609 time=301.0s + ttt_chunk [1171/1238] bpb=1.082378 time=303.5s + ttt_chunk [1181/1238] bpb=1.081901 time=306.0s + ttt_chunk [1191/1238] bpb=1.082040 time=308.5s + ttt_chunk [1201/1238] bpb=1.082071 time=311.1s + ttt_chunk [1211/1238] bpb=1.081777 time=313.6s + ttt_chunk [1221/1238] bpb=1.081299 time=316.1s + ttt_chunk [1231/1238] bpb=1.080921 time=318.7s + ttt_chunk [1238/1238] bpb=1.080927 time=322.6s +ttt_sliding:done val_loss=2.792119 val_bpb=1.080916 elapsed=322.7s +legal_ttt_exact val_loss:2.79211907 val_bpb:1.08091630 eval_time:322884ms diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1337.log b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1337.log new file mode 100644 index 0000000000..e93b00200e --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed1337.log @@ -0,0 +1,334 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /workspace/parameter-golf/data + datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: 
True + local_rank: 0 + logfile: logs/pr1413_combo_s1337.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + ngram_agree_bonus: 0.1 + ngram_base_beta: 2.0 + ngram_open_table_bits: 26 + ngram_order_stride: 2 + ngram_tilt_enabled: False + ngram_within_beta: 0.0 + ngram_within_threshold: 0.25 + ngram_word_beta: 0.0 + ngram_word_threshold: 0.8 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: pr1413_combo_s1337 + scalar_lr: 0.02 + seed: 1337 + skip_gates_enabled: True + skip_training: False + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Tue Apr 7 
23:35:35 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 44C P0 126W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 34C P0 118W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 33C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 41C P0 127W / 700W | 1521MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 44C P0 127W / 700W | 1521MiB / 81559MiB | 9% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 36C P0 122W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 41C P0 122W / 700W | 
1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 33C P0 120W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0047 val_bpb: 3.4860 +1/20000 train_loss: 9.0067 train_time: 0.0m tok/s: 8327281 +2/20000 train_loss: 12.2944 train_time: 0.0m tok/s: 8167591 +3/20000 train_loss: 11.0743 train_time: 0.0m tok/s: 8058524 +4/20000 train_loss: 9.3770 train_time: 0.0m tok/s: 8008155 +5/20000 train_loss: 8.2958 train_time: 0.0m tok/s: 7978580 +500/20000 train_loss: 3.3354 train_time: 0.8m tok/s: 7731833 +1000/20000 train_loss: 3.1818 train_time: 1.7m tok/s: 7723632 +1500/20000 train_loss: 3.0953 train_time: 2.5m tok/s: 7724271 +2000/20000 train_loss: 3.0725 train_time: 3.4m 
tok/s: 7727180 +2500/20000 train_loss: 3.1010 train_time: 4.2m tok/s: 7728400 +layer_loop:enabled step:2890 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 3.0037 train_time: 5.2m tok/s: 7600542 +3500/20000 train_loss: 2.9965 train_time: 6.4m tok/s: 7150101 +4000/20000 train_loss: 2.9449 train_time: 7.7m tok/s: 6845938 +4000/20000 val_loss: 2.9051 val_bpb: 1.1247 +4500/20000 train_loss: 2.7975 train_time: 8.9m tok/s: 6627289 +4863/20000 val_loss: 2.8124 val_bpb: 1.0888 +stopping_early: wallclock_cap train_time: 588117ms step: 4863/20000 +peak memory allocated: 39046 MiB reserved: 39070 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.81065271 val_bpb:1.08809125 eval_time:6774ms +Serialized model: 135431033 bytes +Code size: 17390 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15971795 bytes +Total submission size quantized+brotli: 15989185 bytes +quantized val_loss:2.83993235 val_bpb:1.09942631 eval_time:8914ms +quantized_sliding_window val_loss:2.79644161 val_bpb:1.08258969 eval_time:92144ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=35944536 frozen=0 + ttt_chunk [1/1238] bpb=1.112966 time=4.9s + ttt_chunk [11/1238] bpb=1.072065 time=9.4s + ttt_chunk [21/1238] bpb=1.109365 time=11.9s + ttt_chunk [31/1238] bpb=1.103136 time=14.4s + ttt_chunk [41/1238] bpb=1.096311 time=16.9s + ttt_chunk [51/1238] bpb=1.089565 time=19.4s + ttt_chunk [61/1238] bpb=1.081055 time=21.9s + ttt_chunk 
[71/1238] bpb=1.088099 time=24.5s + ttt_chunk [81/1238] bpb=1.081312 time=27.0s + ttt_chunk [91/1238] bpb=1.077970 time=29.5s + ttt_chunk [101/1238] bpb=1.077516 time=32.0s + ttt_chunk [111/1238] bpb=1.075930 time=34.6s + ttt_chunk [121/1238] bpb=1.079011 time=37.1s + ttt_chunk [131/1238] bpb=1.082763 time=39.6s + ttt_chunk [141/1238] bpb=1.083364 time=42.1s + ttt_chunk [151/1238] bpb=1.083213 time=44.6s + ttt_chunk [161/1238] bpb=1.083640 time=47.1s + ttt_chunk [171/1238] bpb=1.083485 time=49.6s + ttt_chunk [181/1238] bpb=1.081979 time=52.1s + ttt_chunk [191/1238] bpb=1.081718 time=54.6s + ttt_chunk [201/1238] bpb=1.079204 time=57.1s + ttt_chunk [211/1238] bpb=1.083606 time=59.6s + ttt_chunk [221/1238] bpb=1.083898 time=62.1s + ttt_chunk [231/1238] bpb=1.085595 time=64.6s + ttt_chunk [241/1238] bpb=1.083855 time=67.1s + ttt_chunk [251/1238] bpb=1.083799 time=69.7s + ttt_chunk [261/1238] bpb=1.084810 time=72.2s + ttt_chunk [271/1238] bpb=1.085346 time=74.7s + ttt_chunk [281/1238] bpb=1.084617 time=77.2s + ttt_chunk [291/1238] bpb=1.085732 time=80.0s + ttt_chunk [301/1238] bpb=1.085906 time=82.5s + ttt_chunk [311/1238] bpb=1.084723 time=85.0s + ttt_chunk [321/1238] bpb=1.084586 time=87.5s + ttt_chunk [331/1238] bpb=1.084870 time=90.1s + ttt_chunk [341/1238] bpb=1.084012 time=92.6s + ttt_chunk [351/1238] bpb=1.084768 time=95.1s + ttt_chunk [361/1238] bpb=1.083697 time=97.7s + ttt_chunk [371/1238] bpb=1.082113 time=100.2s + ttt_chunk [381/1238] bpb=1.082551 time=102.7s + ttt_chunk [391/1238] bpb=1.082227 time=105.2s + ttt_chunk [401/1238] bpb=1.082265 time=107.7s + ttt_chunk [411/1238] bpb=1.082878 time=110.3s + ttt_chunk [421/1238] bpb=1.082399 time=112.8s + ttt_chunk [431/1238] bpb=1.082595 time=115.3s + ttt_chunk [441/1238] bpb=1.082670 time=117.8s + ttt_chunk [451/1238] bpb=1.083858 time=120.3s + ttt_chunk [461/1238] bpb=1.082096 time=122.8s + ttt_chunk [471/1238] bpb=1.082046 time=125.3s + ttt_chunk [481/1238] bpb=1.082190 time=127.8s + ttt_chunk [491/1238] 
bpb=1.082665 time=130.3s + ttt_chunk [501/1238] bpb=1.082318 time=133.2s + ttt_chunk [511/1238] bpb=1.081950 time=135.8s + ttt_chunk [521/1238] bpb=1.081457 time=138.3s + ttt_chunk [531/1238] bpb=1.081425 time=141.2s + ttt_chunk [541/1238] bpb=1.081516 time=143.7s + ttt_chunk [551/1238] bpb=1.081048 time=146.2s + ttt_chunk [561/1238] bpb=1.080388 time=148.7s + ttt_chunk [571/1238] bpb=1.079849 time=151.2s + ttt_chunk [581/1238] bpb=1.080206 time=154.0s + ttt_chunk [591/1238] bpb=1.080421 time=156.5s + ttt_chunk [601/1238] bpb=1.080353 time=159.0s + ttt_chunk [611/1238] bpb=1.080931 time=161.5s + ttt_chunk [621/1238] bpb=1.081789 time=164.1s + ttt_chunk [631/1238] bpb=1.081870 time=166.6s + ttt_chunk [641/1238] bpb=1.082334 time=169.1s + ttt_chunk [651/1238] bpb=1.082667 time=171.6s + ttt_chunk [661/1238] bpb=1.082038 time=174.1s + ttt_chunk [671/1238] bpb=1.081797 time=176.7s + ttt_chunk [681/1238] bpb=1.083120 time=179.2s + ttt_chunk [691/1238] bpb=1.083288 time=181.7s + ttt_chunk [701/1238] bpb=1.083109 time=184.2s + ttt_chunk [711/1238] bpb=1.083797 time=186.7s + ttt_chunk [721/1238] bpb=1.084101 time=189.3s + ttt_chunk [731/1238] bpb=1.083439 time=191.8s + ttt_chunk [741/1238] bpb=1.083159 time=194.3s + ttt_chunk [751/1238] bpb=1.082254 time=196.8s + ttt_chunk [761/1238] bpb=1.081667 time=199.3s + ttt_chunk [771/1238] bpb=1.080680 time=201.8s + ttt_chunk [781/1238] bpb=1.080683 time=204.4s + ttt_chunk [791/1238] bpb=1.081048 time=206.9s + ttt_chunk [801/1238] bpb=1.081355 time=209.4s + ttt_chunk [811/1238] bpb=1.080891 time=211.9s + ttt_chunk [821/1238] bpb=1.079680 time=214.5s + ttt_chunk [831/1238] bpb=1.079366 time=216.9s + ttt_chunk [841/1238] bpb=1.078916 time=219.5s + ttt_chunk [851/1238] bpb=1.078628 time=222.0s + ttt_chunk [861/1238] bpb=1.078283 time=224.5s + ttt_chunk [871/1238] bpb=1.078182 time=227.0s + ttt_chunk [881/1238] bpb=1.077722 time=229.5s + ttt_chunk [891/1238] bpb=1.077201 time=232.1s + ttt_chunk [901/1238] bpb=1.077593 time=234.6s + 
ttt_chunk [911/1238] bpb=1.077274 time=237.1s + ttt_chunk [921/1238] bpb=1.077558 time=239.6s + ttt_chunk [931/1238] bpb=1.078213 time=242.1s + ttt_chunk [941/1238] bpb=1.078592 time=244.6s + ttt_chunk [951/1238] bpb=1.078519 time=247.1s + ttt_chunk [961/1238] bpb=1.079364 time=249.7s + ttt_chunk [971/1238] bpb=1.079756 time=252.2s + ttt_chunk [981/1238] bpb=1.080106 time=254.7s + ttt_chunk [991/1238] bpb=1.079908 time=257.3s + ttt_chunk [1001/1238] bpb=1.079974 time=259.8s + ttt_chunk [1011/1238] bpb=1.080305 time=262.3s + ttt_chunk [1021/1238] bpb=1.081017 time=264.8s + ttt_chunk [1031/1238] bpb=1.081488 time=267.3s + ttt_chunk [1041/1238] bpb=1.081956 time=269.8s + ttt_chunk [1051/1238] bpb=1.081868 time=272.3s + ttt_chunk [1061/1238] bpb=1.081879 time=274.8s + ttt_chunk [1071/1238] bpb=1.082039 time=277.3s + ttt_chunk [1081/1238] bpb=1.081934 time=279.8s + ttt_chunk [1091/1238] bpb=1.082117 time=282.3s + ttt_chunk [1101/1238] bpb=1.082660 time=284.8s + ttt_chunk [1111/1238] bpb=1.082951 time=287.4s + ttt_chunk [1121/1238] bpb=1.083123 time=289.9s + ttt_chunk [1131/1238] bpb=1.082793 time=292.4s + ttt_chunk [1141/1238] bpb=1.082472 time=294.9s + ttt_chunk [1151/1238] bpb=1.082507 time=297.4s + ttt_chunk [1161/1238] bpb=1.082638 time=300.0s + ttt_chunk [1171/1238] bpb=1.082417 time=302.5s + ttt_chunk [1181/1238] bpb=1.081942 time=305.0s + ttt_chunk [1191/1238] bpb=1.082105 time=307.4s + ttt_chunk [1201/1238] bpb=1.082176 time=310.0s + ttt_chunk [1211/1238] bpb=1.081868 time=312.5s + ttt_chunk [1221/1238] bpb=1.081409 time=315.0s + ttt_chunk [1231/1238] bpb=1.081053 time=317.6s + ttt_chunk [1238/1238] bpb=1.081047 time=321.5s +ttt_sliding:done val_loss=2.792658 val_bpb=1.081125 elapsed=321.5s +legal_ttt_exact val_loss:2.79265813 val_bpb:1.08112499 eval_time:321721ms diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed2025.log b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed2025.log new file mode 
100644 index 0000000000..f16936caa6 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed2025.log @@ -0,0 +1,334 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /workspace/parameter-golf/data + datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/pr1413_combo_s2025.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + ngram_agree_bonus: 0.1 + ngram_base_beta: 2.0 + ngram_open_table_bits: 26 + ngram_order_stride: 2 + ngram_tilt_enabled: False + ngram_within_beta: 0.0 + ngram_within_threshold: 0.25 + ngram_word_beta: 0.0 + ngram_word_threshold: 0.8 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: pr1413_combo_s2025 + scalar_lr: 0.02 + seed: 2025 + skip_gates_enabled: True + skip_training: False + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 
+ tokenizer_path: /workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Tue Apr 7 23:57:30 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 37C P0 120W / 700W | 1521MiB / 81559MiB | 10% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 31C P0 117W / 700W | 1521MiB / 81559MiB | 6% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 31C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 34C P0 122W / 700W | 1521MiB / 81559MiB | 9% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 37C P0 121W / 700W | 1521MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 33C P0 120W / 700W | 1521MiB / 81559MiB | 3% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 35C P0 118W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 31C P0 117W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + 
++-----------------------------------------------------------------------------------------+
+| Processes: |
+| GPU GI CI PID Type Process name GPU Memory |
+| ID ID Usage |
+|=========================================================================================|
+| No running processes found |
++-----------------------------------------------------------------------------------------+
+
+====================================================================================================
+train_shards: 128
+val_tokens: 40540160
+model_params:35944536
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0067 val_bpb: 3.4868
+1/20000 train_loss: 9.0086 train_time: 0.0m tok/s: 8310702
+2/20000 train_loss: 12.3513 train_time: 0.0m tok/s: 8146770
+3/20000 train_loss: 11.1496 train_time: 0.0m tok/s: 8070769
+4/20000 train_loss: 9.4395 train_time: 0.0m tok/s: 8017524
+5/20000 train_loss: 8.3496 train_time: 0.0m tok/s: 7990488
+500/20000 train_loss: 3.3368 train_time: 0.8m tok/s: 7738533
+1000/20000 train_loss: 3.1842 train_time: 1.7m tok/s: 7730708
+1500/20000 train_loss: 3.0952 train_time: 2.5m tok/s: 7726598
+2000/20000 train_loss: 3.0724 train_time: 3.4m tok/s: 7728379
+2500/20000 train_loss: 3.1082 train_time: 4.2m tok/s: 7729736
+layer_loop:enabled step:2891 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+3000/20000 train_loss: 2.9944 train_time: 5.2m tok/s: 7602259
+3500/20000 train_loss: 3.0006 train_time: 6.4m tok/s: 7151258
+4000/20000 train_loss: 2.9505 train_time: 7.7m tok/s: 6847406
+4000/20000 val_loss: 2.9079 val_bpb: 1.1258
+4500/20000 train_loss: 2.7981 train_time: 8.9m tok/s: 6628912
+4864/20000 val_loss: 2.8152 val_bpb: 1.0898
+stopping_early: wallclock_cap train_time: 588070ms step: 4864/20000
+peak memory allocated: 39046 MiB reserved: 39070 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.81328008 val_bpb:1.08910839 eval_time:6695ms
+Serialized model: 135431033 bytes
+Code size: 17390 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 12.8s
+Quantized weights:
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ gptq (int8): tok_emb.weight
+ passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15972493 bytes
+Total submission size quantized+brotli: 15989883 bytes
+quantized val_loss:2.84260503 val_bpb:1.10046099 eval_time:8847ms
+quantized_sliding_window val_loss:2.79955257 val_bpb:1.08379404 eval_time:91710ms
+ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0
+ttt_sliding:params unfrozen=35944536 frozen=0
+ ttt_chunk [1/1238] bpb=1.118246 time=4.9s
+ ttt_chunk [11/1238] bpb=1.071708 time=9.3s
+ ttt_chunk [21/1238] bpb=1.110005 time=11.8s
+ ttt_chunk [31/1238] bpb=1.103445 time=14.3s
+ ttt_chunk [41/1238] bpb=1.096178 time=16.8s
+ ttt_chunk [51/1238] bpb=1.089793 time=19.3s
+ ttt_chunk [61/1238] bpb=1.081733 time=21.8s
+ ttt_chunk [71/1238] bpb=1.088610 time=24.3s
+ ttt_chunk [81/1238] bpb=1.082221 time=26.8s
+ ttt_chunk [91/1238] bpb=1.078816 time=29.3s
+ ttt_chunk [101/1238] bpb=1.078458 time=31.8s
+ ttt_chunk [111/1238] bpb=1.076747 time=34.3s
+ ttt_chunk [121/1238] bpb=1.079745 time=36.7s
+ ttt_chunk [131/1238] bpb=1.083382 time=39.2s
+ ttt_chunk [141/1238] bpb=1.084024 time=41.8s
+ ttt_chunk [151/1238] bpb=1.084029 time=44.3s
+ ttt_chunk [161/1238] bpb=1.084512 time=46.8s
+ ttt_chunk [171/1238] bpb=1.084470 time=49.3s
+ ttt_chunk [181/1238] bpb=1.083075 time=51.8s
+ ttt_chunk [191/1238] bpb=1.082711 time=54.3s
+ ttt_chunk [201/1238] bpb=1.080361 time=56.8s
+ ttt_chunk [211/1238] bpb=1.084729 time=59.3s
+ ttt_chunk [221/1238] bpb=1.085085 time=61.8s
+ ttt_chunk [231/1238] bpb=1.086796 time=64.3s
+ ttt_chunk [241/1238] bpb=1.084973 time=66.8s
+ ttt_chunk [251/1238] bpb=1.084991 time=69.3s
+ ttt_chunk [261/1238] bpb=1.085978 time=71.8s
+ ttt_chunk [271/1238] bpb=1.086450 time=74.3s
+ ttt_chunk [281/1238] bpb=1.085793 time=76.8s
+ ttt_chunk [291/1238] bpb=1.087021 time=79.6s
+ ttt_chunk [301/1238] bpb=1.087218 time=82.1s
+ ttt_chunk [311/1238] bpb=1.086148 time=84.6s
+ ttt_chunk [321/1238] bpb=1.086020 time=87.1s
+ ttt_chunk [331/1238] bpb=1.086303 time=89.6s
+ ttt_chunk [341/1238] bpb=1.085411 time=92.1s
+ ttt_chunk [351/1238] bpb=1.086169 time=94.6s
+ ttt_chunk [361/1238] bpb=1.085085 time=97.1s
+ ttt_chunk [371/1238] bpb=1.083578 time=99.6s
+ ttt_chunk [381/1238] bpb=1.083966 time=102.1s
+ ttt_chunk [391/1238] bpb=1.083659 time=104.5s
+ ttt_chunk [401/1238] bpb=1.083717 time=107.0s
+ ttt_chunk [411/1238] bpb=1.084306 time=109.6s
+ ttt_chunk [421/1238] bpb=1.083816 time=112.1s
+ ttt_chunk [431/1238] bpb=1.084024 time=114.5s
+ ttt_chunk [441/1238] bpb=1.084107 time=117.0s
+ ttt_chunk [451/1238] bpb=1.085338 time=119.5s
+ ttt_chunk [461/1238] bpb=1.083557 time=122.0s
+ ttt_chunk [471/1238] bpb=1.083586 time=124.5s
+ ttt_chunk [481/1238] bpb=1.083719 time=127.1s
+ ttt_chunk [491/1238] bpb=1.084160 time=129.6s
+ ttt_chunk [501/1238] bpb=1.083761 time=132.4s
+ ttt_chunk [511/1238] bpb=1.083411 time=134.9s
+ ttt_chunk [521/1238] bpb=1.082952 time=137.4s
+ ttt_chunk [531/1238] bpb=1.082917 time=140.2s
+ ttt_chunk [541/1238] bpb=1.082996 time=142.7s
+ ttt_chunk [551/1238] bpb=1.082538 time=145.2s
+ ttt_chunk [561/1238] bpb=1.081835 time=147.7s
+ ttt_chunk [571/1238] bpb=1.081292 time=150.2s
+ ttt_chunk [581/1238] bpb=1.081656 time=153.1s
+ ttt_chunk [591/1238] bpb=1.081854 time=155.6s
+ ttt_chunk [601/1238] bpb=1.081763 time=158.2s
+ ttt_chunk [611/1238] bpb=1.082340 time=160.7s
+ ttt_chunk [621/1238] bpb=1.083198 time=163.2s
+ ttt_chunk [631/1238] bpb=1.083275 time=165.7s
+ ttt_chunk [641/1238] bpb=1.083720 time=168.1s
+ ttt_chunk [651/1238] bpb=1.084043 time=170.6s
+ ttt_chunk [661/1238] bpb=1.083414 time=173.2s
+ ttt_chunk [671/1238] bpb=1.083148 time=175.6s
+ ttt_chunk [681/1238] bpb=1.084473 time=178.1s
+ ttt_chunk [691/1238] bpb=1.084677 time=180.7s
+ ttt_chunk [701/1238] bpb=1.084495 time=183.2s
+ ttt_chunk [711/1238] bpb=1.085194 time=185.6s
+ ttt_chunk [721/1238] bpb=1.085542 time=188.1s
+ ttt_chunk [731/1238] bpb=1.084905 time=190.6s
+ ttt_chunk [741/1238] bpb=1.084620 time=193.1s
+ ttt_chunk [751/1238] bpb=1.083711 time=195.6s
+ ttt_chunk [761/1238] bpb=1.083127 time=198.1s
+ ttt_chunk [771/1238] bpb=1.082070 time=200.7s
+ ttt_chunk [781/1238] bpb=1.082027 time=203.2s
+ ttt_chunk [791/1238] bpb=1.082385 time=205.7s
+ ttt_chunk [801/1238] bpb=1.082672 time=208.3s
+ ttt_chunk [811/1238] bpb=1.082173 time=210.8s
+ ttt_chunk [821/1238] bpb=1.080970 time=213.3s
+ ttt_chunk [831/1238] bpb=1.080650 time=215.8s
+ ttt_chunk [841/1238] bpb=1.080187 time=218.4s
+ ttt_chunk [851/1238] bpb=1.079906 time=220.9s
+ ttt_chunk [861/1238] bpb=1.079559 time=223.4s
+ ttt_chunk [871/1238] bpb=1.079443 time=225.9s
+ ttt_chunk [881/1238] bpb=1.078986 time=228.4s
+ ttt_chunk [891/1238] bpb=1.078461 time=231.0s
+ ttt_chunk [901/1238] bpb=1.078823 time=233.5s
+ ttt_chunk [911/1238] bpb=1.078502 time=236.0s
+ ttt_chunk [921/1238] bpb=1.078787 time=238.6s
+ ttt_chunk [931/1238] bpb=1.079458 time=241.1s
+ ttt_chunk [941/1238] bpb=1.079840 time=243.6s
+ ttt_chunk [951/1238] bpb=1.079772 time=246.1s
+ ttt_chunk [961/1238] bpb=1.080598 time=248.6s
+ ttt_chunk [971/1238] bpb=1.081008 time=251.2s
+ ttt_chunk [981/1238] bpb=1.081377 time=253.7s
+ ttt_chunk [991/1238] bpb=1.081171 time=256.2s
+ ttt_chunk [1001/1238] bpb=1.081224 time=258.7s
+ ttt_chunk [1011/1238] bpb=1.081572 time=261.2s
+ ttt_chunk [1021/1238] bpb=1.082289 time=263.7s
+ ttt_chunk [1031/1238] bpb=1.082776 time=266.3s
+ ttt_chunk [1041/1238] bpb=1.083266 time=268.8s
+ ttt_chunk [1051/1238] bpb=1.083202 time=271.3s
+ ttt_chunk [1061/1238] bpb=1.083187 time=273.8s
+ ttt_chunk [1071/1238] bpb=1.083359 time=276.3s
+ ttt_chunk [1081/1238] bpb=1.083245 time=278.9s
+ ttt_chunk [1091/1238] bpb=1.083434 time=281.4s
+ ttt_chunk [1101/1238] bpb=1.083966 time=283.9s
+ ttt_chunk [1111/1238] bpb=1.084255 time=286.4s
+ ttt_chunk [1121/1238] bpb=1.084436 time=288.9s
+ ttt_chunk [1131/1238] bpb=1.084078 time=291.4s
+ ttt_chunk [1141/1238] bpb=1.083746 time=293.9s
+ ttt_chunk [1151/1238] bpb=1.083781 time=296.5s
+ ttt_chunk [1161/1238] bpb=1.083923 time=299.0s
+ ttt_chunk [1171/1238] bpb=1.083682 time=301.5s
+ ttt_chunk [1181/1238] bpb=1.083220 time=304.0s
+ ttt_chunk [1191/1238] bpb=1.083373 time=306.5s
+ ttt_chunk [1201/1238] bpb=1.083383 time=309.0s
+ ttt_chunk [1211/1238] bpb=1.083068 time=311.5s
+ ttt_chunk [1221/1238] bpb=1.082614 time=314.0s
+ ttt_chunk [1231/1238] bpb=1.082251 time=316.5s
+ ttt_chunk [1238/1238] bpb=1.082257 time=320.5s
+ttt_sliding:done val_loss=2.795761 val_bpb=1.082326 elapsed=320.5s
+legal_ttt_exact val_loss:2.79576136 val_bpb:1.08232635 eval_time:320728ms
diff --git a/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed42.log b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed42.log
new file mode 100644
index 0000000000..8794f3f189
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-13_sp8192_d_rseries_evidence/train_seed42.log
@@ -0,0 +1,334 @@
+====================================================================================================
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: /workspace/parameter-golf/data
+ datasets_dir: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192
+ distributed: True
+ ema_decay: 0.997
+ embed_bits: 8
+ embed_clip_sigmas: 20.0
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ enable_looping_at: 0.5
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_reserve_seconds: 12.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/pr1413_combo_s42.txt
+ logit_softcap: 30.0
+ loop_end: 5
+ loop_start: 3
+ matrix_bits: 6
+ matrix_clip_sigmas: 12.85
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_row_normalize: True
+ muon_wd: 0.085
+ ngram_agree_bonus: 0.1
+ ngram_base_beta: 2.0
+ ngram_open_table_bits: 26
+ ngram_order_stride: 2
+ ngram_tilt_enabled: False
+ ngram_within_beta: 0.0
+ ngram_within_threshold: 0.25
+ ngram_word_beta: 0.0
+ ngram_word_threshold: 0.8
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ num_loops: 2
+ parallel_residual_start: 7
+ qk_gain_init: 5.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: pr1413_combo_s42
+ scalar_lr: 0.02
+ seed: 42
+ skip_gates_enabled: True
+ skip_training: False
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: /workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe.model
+ train_batch_tokens: 786432
+ train_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ ttt_batch_seqs: 32
+ ttt_chunk_tokens: 32768
+ ttt_enabled: True
+ ttt_epochs: 3
+ ttt_freeze_blocks: 0
+ ttt_grad_clip: 1.0
+ ttt_lr: 0.005
+ ttt_momentum: 0.9
+ val_batch_tokens: 524288
+ val_files: /workspace/parameter-golf/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
+ val_loss_every: 4000
+ vocab_size: 8192
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+====================================================================================================
+Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
+Running PyTorch 2.9.1+cu128
+Tue Apr 7 23:15:11 2026
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
++-----------------------------------------+------------------------+----------------------+
+| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+| | | MIG M. |
+|=========================================+========================+======================|
+| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
+| N/A 37C P0 120W / 700W | 1521MiB / 81559MiB | 4% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
+| N/A 31C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
+| N/A 30C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
+| N/A 34C P0 122W / 700W | 1521MiB / 81559MiB | 1% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
+| N/A 37C P0 121W / 700W | 1521MiB / 81559MiB | 1% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
+| N/A 33C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
+| N/A 35C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
+| N/A 31C P0 118W / 700W | 1521MiB / 81559MiB | 4% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes: |
+| GPU GI CI PID Type Process name GPU Memory |
+| ID ID Usage |
+|=========================================================================================|
+| No running processes found |
++-----------------------------------------------------------------------------------------+
+
+====================================================================================================
+train_shards: 128
+val_tokens: 40540160
+model_params:35944536
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0090 val_bpb: 3.4877
+1/20000 train_loss: 9.0111 train_time: 0.0m tok/s: 8305993
+2/20000 train_loss: 12.3742 train_time: 0.0m tok/s: 8193823
+3/20000 train_loss: 11.1550 train_time: 0.0m tok/s: 8084747
+4/20000 train_loss: 9.4420 train_time: 0.0m tok/s: 8039119
+5/20000 train_loss: 8.3570 train_time: 0.0m tok/s: 8011008
+500/20000 train_loss: 3.3367 train_time: 0.8m tok/s: 7742429
+1000/20000 train_loss: 3.1816 train_time: 1.7m tok/s: 7733406
+1500/20000 train_loss: 3.0978 train_time: 2.5m tok/s: 7733705
+2000/20000 train_loss: 3.0695 train_time: 3.4m tok/s: 7735671
+2500/20000 train_loss: 3.1001 train_time: 4.2m tok/s: 7735788
+layer_loop:enabled step:2893 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+3000/20000 train_loss: 2.9934 train_time: 5.2m tok/s: 7611266
+3500/20000 train_loss: 3.0003 train_time: 6.4m tok/s: 7158724
+4000/20000 train_loss: 2.9434 train_time: 7.7m tok/s: 6853213
+4000/20000 val_loss: 2.9049 val_bpb: 1.1246
+4500/20000 train_loss: 2.7954 train_time: 8.9m tok/s: 6634146
+4867/20000 val_loss: 2.8118 val_bpb: 1.0885
+stopping_early: wallclock_cap train_time: 588101ms step: 4867/20000
+peak memory allocated: 39046 MiB reserved: 39070 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.80987807 val_bpb:1.08779137 eval_time:6253ms
+Serialized model: 135431033 bytes
+Code size: 17390 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 12.8s
+Quantized weights:
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ gptq (int8): tok_emb.weight
+ passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15973111 bytes
+Total submission size quantized+brotli: 15990501 bytes
+quantized val_loss:2.84417454 val_bpb:1.10106860 eval_time:8862ms
+quantized_sliding_window val_loss:2.80011569 val_bpb:1.08401205 eval_time:92428ms
+ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0
+ttt_sliding:params unfrozen=35944536 frozen=0
+ ttt_chunk [1/1238] bpb=1.117442 time=4.9s
+ ttt_chunk [11/1238] bpb=1.070953 time=9.4s
+ ttt_chunk [21/1238] bpb=1.109195 time=11.9s
+ ttt_chunk [31/1238] bpb=1.102564 time=14.5s
+ ttt_chunk [41/1238] bpb=1.096070 time=17.0s
+ ttt_chunk [51/1238] bpb=1.089538 time=19.5s
+ ttt_chunk [61/1238] bpb=1.081400 time=22.1s
+ ttt_chunk [71/1238] bpb=1.088637 time=24.6s
+ ttt_chunk [81/1238] bpb=1.081779 time=27.1s
+ ttt_chunk [91/1238] bpb=1.078390 time=29.6s
+ ttt_chunk [101/1238] bpb=1.078135 time=32.1s
+ ttt_chunk [111/1238] bpb=1.076597 time=34.7s
+ ttt_chunk [121/1238] bpb=1.079546 time=37.2s
+ ttt_chunk [131/1238] bpb=1.083320 time=39.7s
+ ttt_chunk [141/1238] bpb=1.083835 time=42.2s
+ ttt_chunk [151/1238] bpb=1.083553 time=44.8s
+ ttt_chunk [161/1238] bpb=1.084064 time=47.3s
+ ttt_chunk [171/1238] bpb=1.083963 time=49.8s
+ ttt_chunk [181/1238] bpb=1.082404 time=52.4s
+ ttt_chunk [191/1238] bpb=1.082067 time=54.9s
+ ttt_chunk [201/1238] bpb=1.079603 time=57.4s
+ ttt_chunk [211/1238] bpb=1.084055 time=60.0s
+ ttt_chunk [221/1238] bpb=1.084306 time=62.4s
+ ttt_chunk [231/1238] bpb=1.085905 time=65.0s
+ ttt_chunk [241/1238] bpb=1.084150 time=67.5s
+ ttt_chunk [251/1238] bpb=1.084167 time=70.0s
+ ttt_chunk [261/1238] bpb=1.085161 time=72.6s
+ ttt_chunk [271/1238] bpb=1.085572 time=75.1s
+ ttt_chunk [281/1238] bpb=1.084862 time=77.6s
+ ttt_chunk [291/1238] bpb=1.086079 time=80.1s
+ ttt_chunk [301/1238] bpb=1.086241 time=82.6s
+ ttt_chunk [311/1238] bpb=1.085118 time=85.2s
+ ttt_chunk [321/1238] bpb=1.084977 time=87.7s
+ ttt_chunk [331/1238] bpb=1.085243 time=90.2s
+ ttt_chunk [341/1238] bpb=1.084359 time=92.7s
+ ttt_chunk [351/1238] bpb=1.085058 time=95.3s
+ ttt_chunk [361/1238] bpb=1.083959 time=97.8s
+ ttt_chunk [371/1238] bpb=1.082424 time=100.3s
+ ttt_chunk [381/1238] bpb=1.082867 time=102.9s
+ ttt_chunk [391/1238] bpb=1.082567 time=105.4s
+ ttt_chunk [401/1238] bpb=1.082685 time=107.9s
+ ttt_chunk [411/1238] bpb=1.083263 time=110.5s
+ ttt_chunk [421/1238] bpb=1.082770 time=113.0s
+ ttt_chunk [431/1238] bpb=1.082938 time=115.5s
+ ttt_chunk [441/1238] bpb=1.082953 time=118.0s
+ ttt_chunk [451/1238] bpb=1.084142 time=120.6s
+ ttt_chunk [461/1238] bpb=1.082349 time=123.1s
+ ttt_chunk [471/1238] bpb=1.082313 time=125.6s
+ ttt_chunk [481/1238] bpb=1.082477 time=128.1s
+ ttt_chunk [491/1238] bpb=1.082926 time=130.7s
+ ttt_chunk [501/1238] bpb=1.082554 time=133.6s
+ ttt_chunk [511/1238] bpb=1.082162 time=136.1s
+ ttt_chunk [521/1238] bpb=1.081677 time=138.7s
+ ttt_chunk [531/1238] bpb=1.081648 time=141.6s
+ ttt_chunk [541/1238] bpb=1.081753 time=144.1s
+ ttt_chunk [551/1238] bpb=1.081287 time=146.7s
+ ttt_chunk [561/1238] bpb=1.080621 time=149.2s
+ ttt_chunk [571/1238] bpb=1.080069 time=151.7s
+ ttt_chunk [581/1238] bpb=1.080438 time=154.7s
+ ttt_chunk [591/1238] bpb=1.080665 time=157.2s
+ ttt_chunk [601/1238] bpb=1.080600 time=159.7s
+ ttt_chunk [611/1238] bpb=1.081144 time=162.2s
+ ttt_chunk [621/1238] bpb=1.082003 time=164.7s
+ ttt_chunk [631/1238] bpb=1.082089 time=167.3s
+ ttt_chunk [641/1238] bpb=1.082524 time=169.9s
+ ttt_chunk [651/1238] bpb=1.082833 time=172.4s
+ ttt_chunk [661/1238] bpb=1.082183 time=174.9s
+ ttt_chunk [671/1238] bpb=1.081938 time=177.5s
+ ttt_chunk [681/1238] bpb=1.083249 time=180.0s
+ ttt_chunk [691/1238] bpb=1.083419 time=182.6s
+ ttt_chunk [701/1238] bpb=1.083234 time=185.1s
+ ttt_chunk [711/1238] bpb=1.083946 time=187.6s
+ ttt_chunk [721/1238] bpb=1.084251 time=190.2s
+ ttt_chunk [731/1238] bpb=1.083652 time=192.7s
+ ttt_chunk [741/1238] bpb=1.083347 time=195.2s
+ ttt_chunk [751/1238] bpb=1.082433 time=197.8s
+ ttt_chunk [761/1238] bpb=1.081809 time=200.3s
+ ttt_chunk [771/1238] bpb=1.080802 time=202.8s
+ ttt_chunk [781/1238] bpb=1.080781 time=205.4s
+ ttt_chunk [791/1238] bpb=1.081137 time=207.9s
+ ttt_chunk [801/1238] bpb=1.081452 time=210.4s
+ ttt_chunk [811/1238] bpb=1.080968 time=213.0s
+ ttt_chunk [821/1238] bpb=1.079783 time=215.5s
+ ttt_chunk [831/1238] bpb=1.079460 time=218.0s
+ ttt_chunk [841/1238] bpb=1.079020 time=220.5s
+ ttt_chunk [851/1238] bpb=1.078737 time=223.1s
+ ttt_chunk [861/1238] bpb=1.078408 time=225.6s
+ ttt_chunk [871/1238] bpb=1.078295 time=228.1s
+ ttt_chunk [881/1238] bpb=1.077846 time=230.6s
+ ttt_chunk [891/1238] bpb=1.077314 time=233.1s
+ ttt_chunk [901/1238] bpb=1.077693 time=235.6s
+ ttt_chunk [911/1238] bpb=1.077391 time=238.2s
+ ttt_chunk [921/1238] bpb=1.077674 time=240.7s
+ ttt_chunk [931/1238] bpb=1.078366 time=243.2s
+ ttt_chunk [941/1238] bpb=1.078730 time=245.7s
+ ttt_chunk [951/1238] bpb=1.078636 time=248.3s
+ ttt_chunk [961/1238] bpb=1.079457 time=250.8s
+ ttt_chunk [971/1238] bpb=1.079862 time=253.3s
+ ttt_chunk [981/1238] bpb=1.080222 time=255.9s
+ ttt_chunk [991/1238] bpb=1.079995 time=258.4s
+ ttt_chunk [1001/1238] bpb=1.080054 time=260.9s
+ ttt_chunk [1011/1238] bpb=1.080400 time=263.5s
+ ttt_chunk [1021/1238] bpb=1.081126 time=266.0s
+ ttt_chunk [1031/1238] bpb=1.081582 time=268.6s
+ ttt_chunk [1041/1238] bpb=1.082057 time=271.1s
+ ttt_chunk [1051/1238] bpb=1.081987 time=273.6s
+ ttt_chunk [1061/1238] bpb=1.081993 time=276.2s
+ ttt_chunk [1071/1238] bpb=1.082142 time=278.7s
+ ttt_chunk [1081/1238] bpb=1.082023 time=281.2s
+ ttt_chunk [1091/1238] bpb=1.082213 time=283.7s
+ ttt_chunk [1101/1238] bpb=1.082760 time=286.3s
+ ttt_chunk [1111/1238] bpb=1.083046 time=288.9s
+ ttt_chunk [1121/1238] bpb=1.083241 time=291.4s
+ ttt_chunk [1131/1238] bpb=1.082914 time=293.9s
+ ttt_chunk [1141/1238] bpb=1.082577 time=296.4s
+ ttt_chunk [1151/1238] bpb=1.082625 time=299.0s
+ ttt_chunk [1161/1238] bpb=1.082762 time=301.5s
+ ttt_chunk [1171/1238] bpb=1.082535 time=304.0s
+ ttt_chunk [1181/1238] bpb=1.082065 time=306.6s
+ ttt_chunk [1191/1238] bpb=1.082220 time=309.1s
+ ttt_chunk [1201/1238] bpb=1.082242 time=311.7s
+ ttt_chunk [1211/1238] bpb=1.081917 time=314.2s
+ ttt_chunk [1221/1238] bpb=1.081443 time=316.7s
+ ttt_chunk [1231/1238] bpb=1.081088 time=319.3s
+ ttt_chunk [1238/1238] bpb=1.081092 time=323.2s
+ttt_sliding:done val_loss=2.792695 val_bpb=1.081139 elapsed=323.2s
+legal_ttt_exact val_loss:2.79269524 val_bpb:1.08113936 eval_time:323468ms