diff --git a/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/README.md b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/README.md
new file mode 100644
index 0000000000..a416e24f83
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/README.md
@@ -0,0 +1,90 @@
+# Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + Legal TTT + Systems Optimization
+
+**val_bpb = 1.0801** (3-seed mean, std 0.0001) | **2.7899 nats** | **~15.99 MB** | 8xH100 SXM, 600s | Legal TTT
+
+This submission applies systems-level performance optimizations to the PR #1493 SOTA stack. The ML is unchanged; faster per-step throughput yields extra training steps in the same 600s budget.
+
+> **Submission series:** This PR is one of three related submissions applying the same systems optimizations to different base stacks:
+>
+> 1. On PR #1493 (current merged SOTA) -- **this PR**
+> 2. On PR #1529 (pending review)
+> 3. On PR #1578 (pending review)
+>
+> The optimizations are identical across all three -- the fused Muon kernel, batched EMA, and loader prealloc, plus the other systems changes itemized below. We submit against multiple bases so that a ready-to-merge option exists regardless of how the pending PRs are resolved. Judges should feel free to evaluate whichever base(s) they consider valid and disregard the rest.
+
+**Note on record criteria:** This submission improves speed through systems optimization without changing the ML. Per the official contest rules: *"For submissions that improve speed through systems optimization without changing the ML, this requirement [0.005 nats] is waived."* The changes (fused Muon kernel, batched EMA, superchunk eval, rank-0 serialize) are purely systems-level and do not alter model architecture, optimizer logic, loss function, or any hyperparameter.
+
+## 3-Seed Results
+
+| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | **TTT BPB** | Artifact (bytes) |
+|------|-------|---------|--------------|-------------|-------------|------------------|
+| 0 | 4,607 | 127.4 | 1.0866 | 1.0815 | **1.0799** | 15,993,737 |
+| 3141 | 4,622 | 127.0 | 1.0868 | 1.0817 | **1.0801** | 15,995,437 |
+| 42 | 4,619 | 127.1 | 1.0869 | 1.0815 | **1.0802** | 15,993,201 |
+| **Mean** | **4,616** | **127.2** | **1.0868** | **1.0816** | **1.0801** | **15,994,125** |
+| **Std** | | | | | **0.0001** | |
+
+Current merged SOTA (PR #1493): **1.0810 BPB**. Delta: **-0.0009 BPB**.
+
+## Systems Optimizations
+
+1. **Fused Muon transform** -- Single `@torch.compile` function combining the momentum update, Nesterov extrapolation, row normalization, and Newton-Schulz orthogonalization. (~0.43% step-time reduction on a 2xH100 benchmark)
+
+2. **EMA foreach** -- Replaces the per-tensor EMA loop with `torch._foreach_mul_` / `torch._foreach_add_`; see the first sketch below. (~0.08% step-time reduction)
+
+3. **Muon prealloc + foreach apply** -- Pre-allocated flat update buffer reused across steps; `torch._foreach_mul_`/`_foreach_add_` for the weight updates. (~0.07% step-time reduction)
+
+4. **Superchunk eval** -- Contiguous copy + `torch.as_strided` overlapping views for the sliding-window eval, replacing per-window data loading; see the second sketch below. (~2.65% eval-time reduction)
+
+5. **Rank-0 serialize** -- Only rank 0 performs GPTQ serialization; the other ranks skip it, eliminating redundant work on 7 of 8 GPUs.
+
+6. **Eval batch 128** -- Increased the sliding-window eval batch from 32 to 128 sequences.
+
+No model architecture or hyperparameter changes.
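+
+Two of these changes are simple enough to sketch inline. First, the batched EMA (item 2): a minimal sketch, assuming `ema_params` and `model_params` are parallel lists of same-shaped tensors -- the names are illustrative, not the exact code shipped in `train_gpt.py`.
+
+```python
+import torch
+
+@torch.no_grad()
+def ema_update(ema_params, model_params, decay=0.9965):
+    # ema <- decay * ema + (1 - decay) * param, batched across all tensors:
+    # two fused foreach kernels replace a Python-level mul_/add_ pair per tensor.
+    torch._foreach_mul_(ema_params, decay)
+    torch._foreach_add_(ema_params, model_params, alpha=1.0 - decay)
+```
+
+Second, the superchunk eval (item 4): after one contiguous copy of the eval tokens, `torch.as_strided` exposes every overlapping window as a view of the same buffer. A sketch under the same caveat, using this run's eval_seq_len=2048 and eval_stride=64:
+
+```python
+import torch
+
+def sliding_windows(tokens: torch.Tensor, seq_len: int = 2048, stride: int = 64) -> torch.Tensor:
+    # tokens: 1-D contiguous tensor. Row i of the result is
+    # tokens[i*stride : i*stride + seq_len]; no window is materialized as a copy.
+    n_windows = (tokens.numel() - seq_len) // stride + 1
+    return tokens.as_strided(size=(n_windows, seq_len), stride=(stride, 1))
+```
+
+Because each row is a view into the one buffer, batching 128 windows per eval step (item 6) adds no copies beyond the initial contiguous one.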
+
+## Architecture (from PR #1493)
+
+11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: loops layers 3-5 (activated at frac=0.35), yielding 17 virtual layers from 11 physical. Parallel residuals from layer 7: GPT-J style, attention and MLP read from the same input. Skip gates (sigmoid-gated U-Net connections).
+
+## Training
+
+Muon optimizer (flat-buffer all-reduce, Newton-Schulz 5 steps), AdamW for embeddings/scalars. ~4,616 steps in 588s. Warmdown frac=0.72, EMA decay=0.9965, WD=0.095. GPTQ reserve 12s.
+
+## Quantization
+
+Full-Hessian GPTQ with SDClip: int6 for attention/MLP matrices (k=12.85), int8 for token embeddings (k=20.0). Byte-shuffle + Brotli-11 compression.
+
+## TTT (Test-Time Training)
+
+Score-first chunk-based SGD: 32K-token chunks, 3 epochs per chunk, cosine LR decay (lr=0.005, momentum=0.9). Gradient clipping at 1.0.
+
+## Compliance
+
+- **Condition 1 (Causality):** Sliding-window eval is strictly causal.
+- **Condition 2 (Normalized distribution):** Standard softmax over the full vocab.
+- **Condition 3 (Score before update):** Each chunk is scored under `torch.no_grad()` before SGD.
+- **Condition 4 (Single pass):** Each token is scored exactly once.
+- No SLOT, no pre-quant TTT, no ETLB, no n-gram cache.
+
+## Reproducibility
+
+```bash
+pip install brotli sentencepiece flash_attn_3 huggingface_hub
+
+# Data:
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
+
+# Training (per seed):
+for SEED in 0 3141 42; do
+  SEED=$SEED TTT_ENABLED=1 TTT_LR=0.005 \
+  torchrun --standalone --nproc_per_node=8 train_gpt.py
+done
+```
+
+## Attribution
+
+- **PR #1493** (@bigbag): Full SOTA stack
+- **PR #1394** (@clarkkev): SP8192 tokenizer, GPTQ SDClip, depth recurrence base
+- **PR #1413** (@dexhunter): Legal TTT framework
+- **PR #1412** (@Robby955), **PR #1204** (@msisovic): Parallel residuals
+- **PR #1445** (@X-Abhishek-X): Hyperparameter tuning
diff --git a/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/submission.json b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/submission.json
new file mode 100644
index 0000000000..5e085db646
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/submission.json
@@ -0,0 +1,63 @@
+{
+  "author": "Benjamin Hadad",
+  "github_id": "codemath3000",
+  "name": "SP8192 + 3-Layer Recurrence + Parallel Residuals + Legal TTT + Systems Optimization",
+  "date": "2026-04-13",
+  "track": "10min_16mb",
+  "val_loss": 2.78992846,
+  "val_bpb": 1.08006825,
+  "val_bpb_std": 0.00014482,
+  "seeds": [0, 3141, 42],
+  "seed_results": {
+    "0": {
+      "steps": 4607,
+      "step_avg_ms": 127.4,
+      "val_loss": 2.78951654,
+      "val_bpb": 1.07990878,
+      "post_ema_val_bpb": 1.08663068,
+      "sliding_val_bpb": 1.08150228,
+      "artifact_bytes": 15993737
+    },
+    "3141": {
+      "steps": 4622,
+      "step_avg_ms": 127.0,
+      "val_loss": 2.79002185,
+      "val_bpb": 1.08010440,
+      "post_ema_val_bpb": 1.08681115,
+      "sliding_val_bpb": 1.08166940,
+      "artifact_bytes": 15995437
+    },
+    "42": {
+      "steps": 4619,
+      "step_avg_ms": 127.1,
+      "val_loss": 2.79024700,
+      "val_bpb": 1.08019157,
+      "post_ema_val_bpb": 1.08686427,
+      "sliding_val_bpb": 1.08152039,
+      "artifact_bytes": 15993201
+    }
+  },
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "technique_summary": "PR #1493 stack (SP8192 + 3-layer recurrence + parallel residuals + legal TTT) + systems optimization (fused Muon, EMA foreach, superchunk eval, rank-0 serialize). 
Identical ML; faster throughput yields extra training steps.", + "compliance": { + "train_under_600s": true, + "artifact_under_16mb": true, + "eval_under_600s": true, + "no_slot": true, + "no_pre_quant_ttt": true, + "no_etlb": true, + "no_ngram_cache": true, + "score_first_ttt": true, + "three_seeds": true + }, + "attribution": { + "base_stack": "@bigbag (PR #1493)", + "sp8192_gptq_sdclip": "@clarkkev (PR #1394)", + "depth_recurrence": "@dexhunter (PR #1331, #1437)", + "parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)", + "legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)", + "hyperparameter_tuning": "@X-Abhishek-X (PR #1445)", + "systems_optimization": "@codemath3000 (fused Muon, EMA foreach, superchunk eval, loader prealloc)" + } +} diff --git a/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_gpt.py b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_gpt.py new file mode 100644 index 0000000000..b9e9e40021 --- /dev/null +++ b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_gpt.py @@ -0,0 +1,2 @@ +import lzma as L,base64 as B +exec(L.decompress(B.b85decode(b';P$sTZCwB~n@VT6Qap3bt~@<3h>ok~)Km_aAcM1$ZA=RNsrI&uUw)pb_nMj0LFYxKp&KENV6ytRnz=z`3S+|QdX=`Fk`)|HY`R{~)~vD0kS^20C3`*JPC8u#o}V9vhZwf>0PphG7JR@bPgW!IRf{8uby^}M4K~o-R;F!5AA3`?f)sW(^?KrahdcpY9i>_0MvwHmAzHkFZGBP44u$iQ>M^$dRQ1--6^uS*z+XMynH`jpDbaXNHjlx{`Ol_Ze0D0#K$=f2+{IF0+0$eeW>d9g?dqU+ZuwhOqS*MaR|3T>j|8^-OQ86|NFZ4CtDKiH99X!t!r;?C-?7_2_p3bPV=HgWv<7VbpgqR*5+db2E~k3>n`WjiF*VTQ>ras#&N`8L;69J^XU*M(aw1{ki$t}%6YAnfh9jjJ>Xuc_i87wv1x%?_;R$^Jow}>H$erDMS&9EA^{M^@BLeoD-Fw{Z$4R(GW+$4=ryJj1elmeu^A=@x2V%4n$AlN$O;XFozO{53E4pxx;U~^cLYQg4K==+i$cL}s~Wi7uU(jW((n+W;1cwJmLP!7>ruF$tHVK*aeoQM8^cQSi}DP+b9^z=Fz(`Wh)=2Z#ogW>kIsAH#r^0DE%DzMXN;@BWIN#~tN)(l_Np4BEpV_sZ&v*G|WAG`q(1@Z^WG$e6BsjNY~SM>_Qx`wJ1e!=Xd?E3Vx&ah$D9IQS?d^R5cqR&UJ{!(XD&-j&~qNHM=8Z*j^ZE6Y7f;oNv(NFtv3Z#&;1Y~laZG>}I@>>NI=ScL{J{*AhHh`yt=TI#X?@zF7HqC*(3gJ}A^Eo3co6{1)`a{B|TL?@+`p(eGeCg6+Y4SL%q=i|j@4W|4zB|H({z|n}im(T|96PErbhek>cqGCuR>cxV6_})swJqit#ri=#2N27~zO}Mt%ilfk7y4?eAOvLyD=4S$0BVv=R=uJj@j-_2coynlUAzAhM|QJZq0gyncv{igB@-9hw%wEuJ&SKqp+Q`PldtcIn|APR1CyZ_jAC-I8^<0dzcbGjrp?_LEzV5Trg-YTB5`D4IV7||=aJouWj3dOeIC|mY>Vyr-{^4fQCy@~jbas+&u070`eo#tb3VCiYb**jIzN%jPYDOy#pfRmzn7<3PzyLg`h5s~=Z-AR*Zb;b28<{}^0y_w&EzPYDlf(70-j97wF`}4h^uOhdAQxopuvFHn3#5zUYn3z}H62akYX*K@d}ubgh&QZ@cx^_0D8<)|k|FG7M={D4;Ah{eUZC%#eJah?W%n&M?~}SN^8C^O)nZC`vBT1a;s2<80|RV_E-T<1)491sUSR!Kl(J)Gd?HHN%_Om7gmu+UY$8x5vd8H6aQOFh<1iagb3~i<}Xk8`Ok%)utw3P?T=|yav!Z)4`rlE_(VlF6=+)NwRQAFS(d;9YqUmAHt75ZInm%K$bDxO_QEaGbq8>WGG;|CS)ePIe+i3>tonPf4(N>CUNE3jc+iU@QM^9oLwoluw{!>p!_{01>5KiTBX*@Y;ozup63ciMmozxqV$FLSNYdtCfm|6Mb~?5DL;6z)-cgidDa43Kpr^p_OSLD3`tp6}dwjwo2)y@yG|^HJIDm;aY;ADRxkyprx`0y>R@bpuU?kW)nMXkc=0kqKqi_=qo;kil)tM1-P%*8e@I2^!i`@FkZ8=x*?Z_8BwZHqCLMcXgnR!9%{B;`fMgt#Z)A#=;&jjLNPZhAo#hskZFhuQk_-Ei$MhI5>~^eD^7D?Ooe9?)!w8EyCIEfcda{>q$fK8afwQp}7y)4lX{lkt*9Z)Vz=eQLIvLNk3$F~Qqw4Wu~Z?;M`Ifyh$o%t5%+{l%Bn)T0xZa~RKRMoF_C+e;R8tHhe^vsymZwGM5#>&`F>e(bar50oIStOoSHmBHbNqE}~gkR_r@0fpkM-zTtpdWZ>u2&Wq1-c#d+JvIahLEb;To1U8NN45?Nps|qzuxW)gShT0#eni-9RjBJ($1CYIBa!D9dOaaE4U4}tCb3<$WDL}E78vSjjvR0C*Rx(tf2s6HA2i*qaq^&&~nJKkdEw=4^HZ~1tPxElH6Oaevt!=a+t)@Xv$gp5Nk8Bu;-Gld#6AQw0YT6LV!a1H?5vfvP!l@8*zCo7c{q|y;?RS0Mu@Z;~dkcdzxd(PttUDET0$o+f(kXC^+3Z=(TBfJ3GLM$0ZH1qMBU|JM_#%(WS*QAU6^Qs|5z=U9*6n6Y$1sWyo5n`rnGPkc5}`Qs=^tDWgm4>0|*+l%Erwq^*wq_&TCmE?h(O^`5a1R~8D1{QJ^{Jy_99`ov9aB?Bq(FawQQmeQCkTa~GWGnk;QyM0G9XFbTo?QyMxlEGwX&ZFz?ePJsfK}ZQ$Yga#uz!YLa_VWWP-P5G_MX2Y0oD*4Wq$Fv+4JwT@bcre0p70kM@K!oytleF0{QOej>Oej!F#I^D
(OMUh9Sa(@|E0QWp}Sqs)9_n8BknAtU9=Rn_=tH%u*ZqTg0Awmh`^-($}*|2r87P=m@Pas*~iKCA3&xClY=c_>b464G>8zTN{8_i`iKYiExM@D=K{fB{}of}r0PJLBTRfU@Yn=#mBo;@fXBlJ)+-q&?eoWtz*FI0?5ZlTdJwNYPq;k^~G_HK8VkI>usebke)J4m4FkR`wO)vgvkTf?_3W!{qS!SNX>nZsJ+YWlJj>1=5l#I~{kB>PM?KWA+lzEQU+cD<+=u)HlBIRI#!+48!u>ViEYT}_w(h^eVZz}j(>vyHraL=RheW3~Ht{n9*5QeH7=Il!3!Mx5yjMFEXtJWsOFHTtXwSmZ7y`rbMjiZVv5N(v9TP)0#@yQ2@slIn7c<@L-b&toaqjJwPah8tdKN$QHYSc~pW@IQZ&`8O}pYi#@d6+!QQ`34P6(Eh{RHXj|+&$3LZhnxEDJF`OV&*LHe2&TO+(VC2}{=kVCi`}{J#XLWE-Xh4Pejmu;rind~!UX(b`SpQJT_=2%<{<~OxCJeo_;Kon`sWw=@uY_J^pKVCUt3%SQnYJGNJB4q7P{UaLke!3d$#T6jUEiF!&!S}u+A^kN#J)C{r@;Rad_=Yms0VeAxS7}HbLt&1-kaD*=YUS#cU?p*Qr);SSKwO1CH!vuj5c<^ZO||>utFtM(pAx+v|1QjctskcX(eH=#}}~`Jk@KAIrC4SE1^+UG}m!K4u&iUQMXQ$(U2&jtOGvKDWZQH&PIs~YY=v{#TCs+Qeom>)XUHU!Fam(yb?ks>AKq!J&yW^TxnkAXPN^HB>f?{v4NHy6n0;(}mz65H;tvPmc+c20}H&8l;vlEx?$Ut=!H|E=f+$sM@T6UI*efsaET9$BpzOnSaY!?yHpkw#C<4z!ptmE??%c^XGN@>gUtxJQ2u)!|@WQSm9XKcy<-@a41(8p<%66kOv!C1MDZP_H;uB{Ya)yQS+$?r+!~1k-wVt5IJ{*unKiAbJ5lpOuSS(w6k_U-MP^TKP-8hN`z5&(x%=ZODziRoM`bfj?1DaO^`GBuU3?r0aKL=*b71idF`>`F>~9uEB7X#uU0?!NV>X+n#XH*uVfTHu=thfB<@@3vq7r13Lbf_9TEYPyyh`8&c+i)^IQ@{x{ofV0uV&?O>8?(a61)I_4e1siw03J$!2QJ$`ncQ1nYdQy(No5rTfSXlkP$0)+%#4)$ud1(KRPb|+Vd!g5_8N95TaPBs+$%Vxg?PAb{PdmWv`8b+a2L`U!0Tn8hBF28evk3s6v2Vt)d;*x+cQPc%5$Or1zn*)#*^4$01FRa4gy)RjC3}opsu&4Ep@0o^~Q@a4w>+oK{O-fSZkQ@KKq*X`BeLe@ktLYMk(-h5Oo?pg{!#zPutTpGH)ecU98VS?CSUKp=?rCY0Q=LmD?aUZF>PPPY3P^bwml0^){F^Y@X3drGiccR;Z3WrSrRLltSyXjOQO)CEQvr$ziS#lvKx9jqW4}+C%Gd&YZ^+Ax7u5bJBX$~{X7v*c3c`7zGLp53MSbqeJAFD%-CmsS&1V3hDBv>J^Iyy>`u3|hpjtJ#8(}u8GiUv`%g9r!qhb0-4Z^sdj_Fknt4w`#odNDdUGV73b3c;M$yYuL-Jv}ZL@{3*j$-{?zB9VW4iT%^eWfu{zC_wM@a=>@}9D=PO#vGB+u}qeoqzwTkCU;nOXR*0k8xx1HkD%xnW}Yz)BbYrj-j#~CO5~F%@zVBJmX33`bIiIWj9M?=-)145{CxPl>Tsn*;S*ln8IjMgW>n6cHbsBC#LW181|M?+Q-};BW^vBj71XgcOnGFKTSIh6-1B%D0i^3MJKEw!_)yQ3NDL-A#@1ukR6CFC%_V7$A#Rq0B}VOVwU1{ioV@#tp3e2T$2WnH9(Ro!^|ngYDr;$au-v#IlWPeJ<}P;sWtzR4uOK44icC~*+<(u(#M8KiV2UucHh{Su&&?iP%9(fB2A!P!3x00POR;!`M(lG7_iZTwKZYW;3Wu{w23+&CGVEq5BVMmplT;1gDPQWl}Y1nW_=$y&z*TOf)cAzo8k}$xzCUU^;K{={10aepb)il+naK~r$^S905Kw2I)6my&+i@|jtV{Eg^47BpiR3r0Rn3l!gO!3A0O+FrKSPe&u5tE@|5Zzh`RUvoU82ModNkIOksI{I8tTu%8Pnm}{aUjm=~MJ8_f&L0HuO6%;cQ4u*qYk#Pv+AoiY^BV+FVLf{X;CQNhS`*%sMn)u&a0A$2_q`R?}eTPnGe5z&$N4X^zqkLr#|K~6=(QegjMi^o3#-;`Dh#DDru!)^qzY>b-Vgb*}?dLL5Pn|^K+Nm0joE!MPA^aTPy*F1XWY)co->IC0u_3cVXy?^H{1%zUa#RDe88>foN7D`c~7i$g04JWA8I6Rn2e0c||8Y(f7IwPf!lYnA>%A3j{nQv*C;>o&Z^zjIfLbLf^ng3P^7o`VikngY{2m6o*Utj`Haf;hsZftLK^ar~l=qyg+}uSEi9j>UOOymgol({PVv@Gnr3+Foo7R2eF)+-R1eK#ZehPM-`h}up48Mlw_#|$Rq4H`NqUrv226L%e(=D3@t7j;WC%$c9G$-aQQc=PEWb*+LN`yQNJWdKVi;OI3(2+NMKKxcf1^86nD6G=k1^Ih-%WIa3oOygxOk3jcSv|{I-YW~w#3JHL~|0wMgF^q2bNXx?TOd@0@Uo5h$@+FEC8HdQFUsWlg3Le5WZphOd#zQ&ESgz4#5S0PV7EJQF~Fun)qhIDGs_f&BTJl8S3J`y2T&Ttc7GmriCHYI(o;nOvon)TT{EbzF;s<&e=T*`5Ycv%YXckOP3|+rT5u45tS)^6dxe&QjN^C2$KrvVs3+k9?R#esJYeP?CaNxH!VhPIh%rPeIVZ4xkPpI!VVO*3K=n9PzEs|gYEQWm`f*CmiihayZF_;YQZ*8+|@m!{Y^uaXUmY3_#m5RUhO-Vkw!vt3RHdCi0LYE5j#+)M%R}u7hB9MMmFn_$v+$MT;%ZM#fScq>lFOV_NYOlDd5P<9zz7)upV@pk9TtZi;uyx)Q>(^|5j--iwyfqIuUr7!&@92dOaNx(I%`=-S~^WFYRGrDi$k@T-eHe{UP-GIg?M~rL%8tZ7F%iP^kiCnp6@pZ9IeJxsiQd*zzQBTwNZ~b3r-?DdB$wSR5St>3Ir(l99&;uG-u!ry~7E3jTo%d>?+5iwRTCGrBRfQqtdE-#8)kGZEtTmK^HPRGEjLuA!c2Hx9W?ChM?v7rwg{MT&A!WL`^%`MiU5gg0wmWW3>(Y(N>^zpr>7r-6(E_id}1lm@@hOOl!2fgqE@FCSKPqt*st`ATR$YhujssF?$_rBs$MfdB98m*S-3>`*g1M9lS8WK^Gd6%@A_g#Ls+186;Xia7Ckl@P>9>JOrxSwVDrnfsYRvIcZ%pgU72cJszB%$K^bP{ObW_2x62+_7!2NG;y?2N^IrgTf!Z3R&f>vCDSl}8D(lX%pj}+I#H(c=qh}}%_sZPBLxOm*{?D9TMNZU`8UOXFn>T#NY)QLCo=Ox0380vqfR4N85L-iw%gL%d=}<>|sEly(9SwyQ+!?BOrfW1^yoOk3u*SkD*k~KR!b@yTbI&9YrlUZ5PWmn
K;uo{+U2KYQlmj=wmQO3p&1^4zd{~c|GV)3_(D=Pl4T_@NDRa_)hIrN{Cud1O*RkG>YljtdJIQxjqtA&h@(eUD3)AV$znAxu%ymH~MU&jdM2?cK!tHhKZ;Z_{kej2Iru%Z2!WhatCss>m7p@=osxiP!6~qSnShf*+THFaCuoAsI%=_9L6*O)XH&`-HjxH}6pBNU16e0Jo+&#ff9wIUsTy(it@|t|}&p-*1pOk8E$;_QnKifA+>*;Lpz^^bE=`8#4~l_*US)&RUhnonA|~o9d&mzZXmif8-*E+qr72~F}Fi8bd(CL)2@1*qdNGHbzzM-yhBmW=#XJ!&(45&r2pRLj>_cm>K_XvgP(wmr3LOu9$v*L26LU4;_UAy$KLS@RfzGTeq!g(NP}+HXw&qrQ0xiF+4Cnxq!m*wo8g4FlYHldpB54Ynq3)N{F9Pcb`Z9V`Fyxqvf@tHg%QhgD!#Djx`E*WR10I7gN2!$M?0}?8!i`}OlM^Cv8+|r>#%oYocHDgXJV-4(;Bte>p}yu|8o>BZ*&v9sI+G$yZ?L`FveaA^|0(~M8c!HvE<04{jX^g|a;PADwpx-ly7>+)$Gz>|D1gBXRt5|D#)ExRFD%P8t05#fMSyDNc3^x-LXV=RpOG0aXrR~ge}@z=<43)utZ7R=Y99o)pFM%_p=|%C;}V{xgdg@|3CThAFIu&k@mA?&_B}an`@Wo~IXR!DP)Zpx8Pd8yBv2GaX-|Yomm4S6xKtQVE0yf@V{2N~bfz?l!m{2w17jo1|eYbQ~HSaKzCZs3vZ@myA&JQy*AGUYU)FN0_G~)%pbI2U@zb8N)|M0f8nBJLFT!l^N(~N16MEj?5*nKIIJ}QBAODo_3DY29oY8R;4jKIU?YWCF|ttO9y*ubgw}RThqk!ItcvCN`o7GaMkGXanLvW9d^cUc!MUX~gg|yc)$mrqgW5tq{3bmyarb&dlNrGP-eskAMD$NkfpgJsj?5$W(I|RNRTE%YAi8lZv-Ajup_}u*>w6g#44v`L8RVb;87pX)#pgSm^V#TeGYD}G*O7bVkGqZjQ%_zhhdwEJwI)nirXSa{0G1+p(<%)uaB--}!~-a$P;F0SWZ8sc12`)aQA;YkDYKt8C=W2R=cRnv2|Wb(ttp7{1obE}UbNoF_TSgz^ts{Zf~^UW;RgEA+cns?VZX*n8~B!%UXgHk&MDn=o;f-ov@{!}bU5tkHZzR@PpPjlvxdlzF=Ub#4d@J<_dPa}qCqmf!OGTeKchgg&{<1YI9uAnq;ij07&!rpX%eWGF@}`(NnI#}hTaZ%%63Jt_A8>)S``k`T({6EhJc1WxX0z)>iX^K{X1HPF#dBgQY5SNgMMsJ!?DMwck~lwdzuI~EYiB5^x0%tf83C4CR@}BI&7G^`pgo*u^_`qNgVHGKIx?q1Z{~4Ejx8|1d3ruNaLkiHpkM9>{QX4Z1r^B&(U+gJP9Y#bVaqRKQ?@pY^0}OwD5UlPq>zHl5&_ji!V&S0YmKyG{7m%e*(m>KXQ6RILht#*FJ1S7kaidJBQeZhHzuAOwBPY2U=4K+&j8Eh)P1G`&jxXA3O>^fyb1zo(?ujr3}uo^1c_wrGw?4`7zgkhN;&W}ws{h4%5)uv<_Zvm~A|D7+(w!yS%Y35JODqYk0!O9Ci`bm`21fZreDc22ZoH@F_>MJKNO&rbaK;=WJDqx-@ZO%*|jB2i0x`ZTZn~M#Y-QV=0uTMbKuVg=3AyC`uuN18)7f0UI7cH1rDNqV!7wwkatSOyn|FCB-6Py!enODdE^%=mk@zagWi-eDSUw7!I=KOcTDwXkKOO2}d@Q8PVjxKvn#I6j?{|Dt`F~RZe({t?kN+JTWc^xK47(Cxo&BNElm#g`15x)cGhMHv?6g7auA40xvkJRsmeG<4a{D_!PvEg7nmZwPS`b8f25VUwNOfg0JF*Ce(qx`*tVUV;mN_ai)7|0Q=FNbpXpQ1Muh2J;3zdO>o$`vo!MavVju*%k};d8IhSg**DYk58+Tg+?o@?XidmYH^|-|seKT?=?%8yIoPjypczO%U7aRb_(Cl*wV7&5ezD6vc*XOUuCFp{T@tBL8LW&V%Au$Fm+2w@JGEas=no4^sGkiC(u0@n)-Q!NaQ_CqA>!Wcdv;F_^eMgeuUGcXxlfQ60)tjwBYhw*1_;>=dP^#Mv@ro3+!bbE92Czr|gYPLRE*LOejY*z`#qG1sKe+?*K!SV!S-{%tWM9@S#{(&+ar;P`J+{6=RpvRS;C*Gbb^m~0@YymV;uQZQyzF_$1RuOg`1O0zlF#}$y>UgKd)a`@t!s2to0f#|bm3o2>qY6Muz(F?d>t=t2=bwQ(_=lBY|2kX)5^%7bopl6ea}8H{Pnj|TvG}GSlDw`RPL^yI*>IEN{FrRAQlZwB2`-dy7<8>JK#nD`^gW^me8Cq6r=E%VCNt0MAdA^d?q#iYKa#0TQy{(;tHx-ZR(7A|=;;W*K;Ja8ujSAJU>9sk+))5)$TBy|@9FOafGa}9kSHpyLhRe)f5F*Ph!jdk%)Bjq^%ZU14fKrTc3j%x9kIQvonX2U5d^URB5{Hw1=?S_!}&+{M=5ef94{EAXGQ2&AP4rQ1H~H@;8vIw!mkoA)x03bET5lOK0x)s~^kP?jW{>?KOL1`l4xjj+!1*sb|q^(=1KZgO6kkGrL13`J~*yVsKxM+26ncOx7sQiM?py(rMAD`{+(d;P+Oz3ppy)#uD0WNa*!R^5`i$F0iKee%c;yPWdg9{;_o1BQfpNm8`xC4c2MjB5RtTRTA>LxHmz1FWE2IvYVViK=+8<3wlx)4qDKVx#}>n;U(X1Zg3TP6)2@6cQn`s^Ksz?dHpP3GY%?*$pbv{@f@!>2{Z-r&R`Y0;KC{F|I_q-+KRM+Te?iwwF|FLpVR6l1D*zf@7ip@w|uYT8_p(X0$=}JB=5xd=DA?2u1|91Dy*}`JE_q8$2m0yr0700V6SBEmU)ML(>#rbQ^c@vz2_usR)xj6qp^fM|l#4MV2NGQ-docO4-OD8XGe)~=`8SkLD`32rsmuRg;z&u!K!TDh-0|qTjXOEcc+?^IIOI`J}qw?EbZLNMaKb+ajk@A#cIQnScQs?Iy#mOB3ms#iNwop!i_3rjCM0Y(JAB`@YG0Q@Aj&>YBp?|R^-{ht+HNF0Y4)8X!{?x^${Q>d5m|F^rt3EbS$MSnfbtaOzEQlmhN_J6ywvdN#)OobuTy1Owd-bk{Vq?7Oz#wKcg<9i!lgd-P1{R=kP-OJXEu2M^icfOk5I9<7(xBfSa6~*(*!)lOkEG_=MtG&u$DSRWOzLxtSSNWIJCFuW83KYT}`6On17?m1)KpL62*ER+nri@b8exB3`*lleY+74r`17)UjOIsars?O+;T%|MO%P3zw-s@r37?B4h{yXQ&7!&kozwB}o(Mg2@!ll_3G35qa~A7n>P?;B@+bV$`{TV;;9lJ|BF1X3J)7ypL{n1hZ=jJ-&j9wgpoh%U>LqtCBD2@ohBU&=y+oYdEJT5oWFVuP=gOAPh(;>(h&{HvJJYaqc*8{y}LDEm<)+ZbE6>r*&MR=r-0?*%IQ{`Bbk5TY($2?Crsdjf2fMGUvcD_HC9a@CRE}6R3Xi
kC;J_o?mZ~qw=AOn#5g>Hs3;wn9;wicFCTGnVrypJ09udK5R_zV4u)r0y{M!e`Q~-~;hGWjUEMty>&w+hECdIK<^{?BEBa(@00g#1T6yAy&NUXEQSMKA?V8cObRB8QwhbLNw1*9I=!uV$ic;SBjP#!t5U@M2(-2BWPmJWWkGHy%uZSt4x@oo#qppX{Nx*38}rOXm5^e!pQoz`5Yrv>HGcp$$>m03e1{)dMpkr^~=Z~QWtFx191K*qQ+iM52JIkEgf)!-OK(WJ5ITF?SsR*T*nv%>AtlRY&gZi5Frr%g^>>WbODOY`5zM^U6kuDM0WWr>nWkO!iJzcQG`+bi2nfGYFjm?DH08kT6JwIMc8_ZRi82oCJ|f|4~ShzWi-Sh7Cm*aA3*n_<=p21n{ib3JB(2$;}=Cy`E*A*V|#ffMuZ`5zw(1&dy3pd?*DM4ieG+c;5@T)EWiW^BX)glzCIy*BQ&nCYz-6#y^~XR>$X8<$tR0_f*EyB_xK1j@ymW=(zh=M?v2?#e<<+$7MpDjms%o6}KEW?lN6^An{$uz~GUv_NfqFL7cNQ_tWL;i2v7_9=y{g)C1O+r`mwmKr4b*u|hr{n$J;`uA8|mp38J@FlZizRNem_m_F$~$mKa_0ur!u)(@v1^ur2jKBr+Is~ce|v_GAe8WMj=BI$%2GSykOX}VW)!q^E92E*uu(UBtfR_w1PJ@(5ttM9x_;S4FpPWF*`fvcus<7nTTj!U|Hw%2erxMPA^_LV2vI_Za93LYiKvw`21bp{};tcu{g96||4TLInO?UqpCUP9VK=dJ%xy#emoH(?x5nCt{fIXK&L$7?lWwwjVHamVGKYEv69Rqob3fosAtem1qAfV^PQh1A+G~Df}I8&nko0(i4oA0w){*FE7g6RJXT_bGOA^^NBj5E=F;T4FX#I?9~=FYD)I%FHTYLoQkdL-d2qXuN)7s`^(;v%&ayQDjn0Wr7;4;~F}Aw1vwcb9b{JK0GzB9Say{ry84*|Ln8B9ii35RlUhw^K_GRiFnlt_P|#;v?F1u>}`|euv$dPx&9k*fX34M@q0NJRL9IGab>C1Lqpw;k|UZN3sG~uJaS|(g72$ro58!DJnoG)uRTSTv1}Mv&dsub7j!%3YqUNbbDQ)FshX_GeGQ0JNm*B;A`1d*T&Gzbn`1e=smc|zsTINhc7t@Ye7O(BS_mS$YOWSf9)8TF9q@`s)Dwpm}KKaE9DmBbOCF}%pOK!GqHtC#UYSLDD3cj=xeXeKu8xS&s+pr2tudAOj#-%|nF~i)(>a-XkH+o$6X9_ks=gMv_KTD0Aonsa5dr5jfm#~ROw8u5h|CuyR{Ir(%ef~Rkc_v8GB0@8g=dl)~|c26syQBwMJW?gywP|ctWlblZ?b)axJOZ17+BJ318st|zPe(R-QMh&2fyCoHEUZQbk7hE96I1bZEsPG8Q8`FKB=Ec0k2pvuSJ=4?cf#uuMLesynFk@WpX{Z`(OE%$DDU%Q%`O*e=mR?xksV-OtY7`oJ(Sal<0!@Ov^Mj#g53?lwV)6YS#51r5g_K+PTV9>~&MQ=wd$PBQNZTKF~F7^w!S2|~*X`K~4aN_si&lw>T>_)5n^nq%zD@MTgF^kbcNd-AIY<9k4ub)^!d9aKCxz(h#tM%%TVB=5&R)K#Vu+vaEjLWp#}cMR+Qpa_#%>^|*HNi;!w$GnXe9Kw*T$+N?@>+>LX&IQnlXH25?`^RenKo<)4W6g9g2X5yj1fHWlhK8C#_vW%IBf6<>dLsMsBl}*>&sPW5y5yx#(GnjPy%{Vkq*ap*Vw%lue2M$s+gmU}d>!7A+_M00'),format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}])) diff --git a/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed0.log b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed0.log new file mode 100644 index 0000000000..86b34cfd54 --- /dev/null +++ b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed0.log @@ -0,0 +1,147 @@ +W0412 04:33:07.570000 56418 torch/distributed/run.py:803] +W0412 04:33:07.570000 56418 torch/distributed/run.py:803] ***************************************** +W0412 04:33:07.570000 56418 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0412 04:33:07.570000 56418 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/fa987d91-4c03-4dfc-873a-c960e2ffc004.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: fa987d91-4c03-4dfc-873a-c960e2ffc004 + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 80 +val_tokens: 40540160 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0072 val_bpb: 3.4870 +1/20000 train_loss: 9.0086 train_time: 0.0m tok/s: 9225961 +2/20000 train_loss: 12.3279 train_time: 0.0m tok/s: 9362528 +3/20000 train_loss: 11.0101 train_time: 0.0m tok/s: 8796719 +4/20000 train_loss: 9.5067 train_time: 0.0m tok/s: 8542046 +5/20000 train_loss: 8.3986 train_time: 0.0m tok/s: 8412781 +500/20000 train_loss: 3.3744 train_time: 0.8m tok/s: 7804756 +1000/20000 train_loss: 3.2774 train_time: 1.7m tok/s: 7780751 +1500/20000 train_loss: 3.1802 train_time: 2.5m tok/s: 7785670 +2000/20000 train_loss: 3.0675 train_time: 3.4m tok/s: 7792306 +layer_loop:enabled step:2039 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.1190 train_time: 4.6m tok/s: 7164121 +3000/20000 train_loss: 2.9019 train_time: 5.8m tok/s: 6764013 +3500/20000 train_loss: 2.9412 
train_time: 7.1m tok/s: 6504175 +4000/20000 train_loss: 2.8224 train_time: 8.3m tok/s: 6315937 +4000/20000 val_loss: 2.8769 val_bpb: 1.1137 +4500/20000 train_loss: 2.8396 train_time: 9.5m tok/s: 6184255 +4607/20000 val_loss: 2.8101 val_bpb: 1.0879 +stopping_early: wallclock_cap train_time: 588170ms step: 4607/20000 +peak memory allocated: 39278 MiB reserved: 39416 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.80687991 val_bpb:1.08663068 eval_time:6875ms +Code size: 18185 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15975552 bytes +Total submission size quantized+brotli: 15993737 bytes +quantized val_loss:2.83696682 val_bpb:1.09827827 eval_time:8832ms +quantized_sliding_window val_loss:2.79363272 val_bpb:1.08150228 eval_time:87085ms +ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.78951654 val_bpb:1.07990878 eval_time:308087ms diff --git a/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed3141.log b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed3141.log new file mode 100644 index 0000000000..622c186536 --- /dev/null +++ b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed3141.log @@ -0,0 +1,147 @@ +W0412 06:29:36.635000 63192 torch/distributed/run.py:803] +W0412 06:29:36.635000 63192 torch/distributed/run.py:803] ***************************************** +W0412 06:29:36.635000 63192 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0412 06:29:36.635000 63192 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/78e1c274-1c75-46b6-ad05-589bea50402e.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 78e1c274-1c75-46b6-ad05-589bea50402e + scalar_lr: 0.02 + seed: 3141 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 80 +val_tokens: 40540160 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0092 val_bpb: 3.4877 +1/20000 train_loss: 9.0100 train_time: 0.0m tok/s: 9246251 +2/20000 train_loss: 12.3477 train_time: 0.0m tok/s: 9391537 +3/20000 train_loss: 11.0235 train_time: 0.0m tok/s: 8840673 +4/20000 train_loss: 9.4880 train_time: 0.0m tok/s: 8590099 +5/20000 train_loss: 8.3438 train_time: 0.0m tok/s: 8446006 +500/20000 train_loss: 3.3802 train_time: 0.8m tok/s: 7813922 +1000/20000 train_loss: 3.2841 train_time: 1.7m tok/s: 7808250 +1500/20000 train_loss: 3.1838 train_time: 2.5m tok/s: 7821604 +2000/20000 train_loss: 3.0711 train_time: 3.3m tok/s: 7831531 +layer_loop:enabled step:2050 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.1237 train_time: 4.5m tok/s: 7209477 +3000/20000 train_loss: 2.9000 train_time: 5.8m tok/s: 6800104 +3500/20000 train_loss: 
2.9452 train_time: 7.0m tok/s: 6536012 +4000/20000 train_loss: 2.8242 train_time: 8.3m tok/s: 6343559 +4000/20000 val_loss: 2.8790 val_bpb: 1.1145 +4500/20000 train_loss: 2.8441 train_time: 9.5m tok/s: 6208556 +4622/20000 val_loss: 2.8105 val_bpb: 1.0880 +stopping_early: wallclock_cap train_time: 588109ms step: 4622/20000 +peak memory allocated: 39278 MiB reserved: 39416 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.80734608 val_bpb:1.08681115 eval_time:6896ms +Code size: 18185 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15977252 bytes +Total submission size quantized+brotli: 15995437 bytes +quantized val_loss:2.83721177 val_bpb:1.09837309 eval_time:8847ms +quantized_sliding_window val_loss:2.79406439 val_bpb:1.08166940 eval_time:86618ms +ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.79002185 val_bpb:1.08010440 eval_time:307147ms diff --git a/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed42.log b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed42.log new file mode 100644 index 0000000000..8477221eff --- /dev/null +++ b/records/track_10min_16mb/2026-04-13_SystemsOpt_SP8192_LegalTTT/train_seed42.log @@ -0,0 +1,147 @@ +W0412 03:49:45.337000 53115 torch/distributed/run.py:803] +W0412 03:49:45.337000 53115 torch/distributed/run.py:803] ***************************************** +W0412 03:49:45.337000 53115 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0412 03:49:45.337000 53115 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/c4ff5b23-f15a-4325-85da-337d08a9590b.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: c4ff5b23-f15a-4325-85da-337d08a9590b + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 80 +val_tokens: 40540160 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0090 val_bpb: 3.4877 +1/20000 train_loss: 9.0104 train_time: 0.0m tok/s: 9191547 +2/20000 train_loss: 12.3645 train_time: 0.0m tok/s: 9416998 +3/20000 train_loss: 11.0075 train_time: 0.0m tok/s: 8883056 +4/20000 train_loss: 9.4552 train_time: 0.0m tok/s: 8634412 +5/20000 train_loss: 8.3278 train_time: 0.0m tok/s: 8498155 +500/20000 train_loss: 3.3791 train_time: 0.8m tok/s: 7832557 +1000/20000 train_loss: 3.2876 train_time: 1.7m tok/s: 7813307 +1500/20000 train_loss: 3.1843 train_time: 2.5m tok/s: 7816205 +2000/20000 train_loss: 3.0689 train_time: 3.4m tok/s: 7822690 +layer_loop:enabled step:2047 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.1243 train_time: 4.6m tok/s: 7199531 +3000/20000 train_loss: 2.9011 train_time: 5.8m tok/s: 6793033 +3500/20000 train_loss: 
2.9459 train_time: 7.0m tok/s: 6530046 +4000/20000 train_loss: 2.8242 train_time: 8.3m tok/s: 6338311 +4000/20000 val_loss: 2.8789 val_bpb: 1.1145 +4500/20000 train_loss: 2.8464 train_time: 9.5m tok/s: 6203787 +4619/20000 val_loss: 2.8106 val_bpb: 1.0881 +stopping_early: wallclock_cap train_time: 588109ms step: 4619/20000 +peak memory allocated: 39278 MiB reserved: 39416 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.80748328 val_bpb:1.08686427 eval_time:6920ms +Code size: 18185 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15975016 bytes +Total submission size quantized+brotli: 15993201 bytes +quantized val_loss:2.83658359 val_bpb:1.09812991 eval_time:8895ms +quantized_sliding_window val_loss:2.79367948 val_bpb:1.08152039 eval_time:86709ms +ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.79024700 val_bpb:1.08019157 eval_time:303873ms