Changes from all commits
76 commits
774b65a
chore: gitignore autoresearch artifacts
Mar 18, 2026
326a423
experiment(train_gpt): replace ReLU² MLP with SwiGLU — hypothesis: ga…
Mar 18, 2026
2fc37a9
experiment(train_gpt): increase layers 9→11 — hypothesis: more depth …
Mar 18, 2026
0365016
experiment(train_gpt): increase matrix_lr 0.04→0.05 — hypothesis: dee…
Mar 18, 2026
e9c6026
Revert "experiment(train_gpt): increase matrix_lr 0.04→0.05 — hypothe…
Mar 18, 2026
0275bf8
experiment(train_gpt): reduce logit_softcap 30→20 — hypothesis: tight…
Mar 18, 2026
090f343
experiment(train_gpt): increase model_dim 512→528 — hypothesis: use a…
Mar 18, 2026
5bc1a71
Revert "experiment(train_gpt): increase model_dim 512→528 — hypothesi…
Mar 18, 2026
4dac192
experiment(train_gpt): increase warmdown_iters 1200→2400 — hypothesis…
Mar 18, 2026
11d6c5f
experiment(train_gpt): increase muon_backend_steps 5→8 — hypothesis: …
Mar 18, 2026
548b850
experiment(train_gpt): enable gradient clipping norm=1.0 — hypothesis…
Mar 18, 2026
fed2741
experiment(train_gpt): increase Adam beta2 0.95→0.99 — hypothesis: sm…
Mar 19, 2026
fdd589a
experiment(train_gpt): depth recurrence — 6 unique blocks × 2 passes …
Mar 19, 2026
9074a2c
Revert "experiment(train_gpt): depth recurrence — 6 unique blocks × 2…
Mar 19, 2026
b311be8
experiment(train_gpt): reduce qk_gain_init 1.5→1.0 — hypothesis: shar…
Mar 19, 2026
c0828a1
Revert "experiment(train_gpt): reduce qk_gain_init 1.5→1.0 — hypothes…
Mar 19, 2026
5749817
experiment(train_gpt): increase SwiGLU hidden 2/3→3/4 ratio — hypothe…
Mar 19, 2026
b758bb3
Revert "experiment(train_gpt): increase SwiGLU hidden 2/3→3/4 ratio —…
Mar 19, 2026
c7d9c05
experiment(train_gpt): increase tied_embed_lr 0.05→0.08 — hypothesis:…
Mar 19, 2026
4d4dad3
Revert "experiment(train_gpt): increase tied_embed_lr 0.05→0.08 — hyp…
Mar 19, 2026
0de4236
experiment(train_gpt): increase tied_embed_init_std 0.005→0.01 — hypo…
Mar 19, 2026
d01a7e4
Revert "experiment(train_gpt): increase tied_embed_init_std 0.005→0.0…
Mar 19, 2026
dbc921b
experiment(train_gpt): disable muon momentum warmup (500→0 steps) — h…
Mar 19, 2026
5d6ee0b
Revert "experiment(train_gpt): disable muon momentum warmup (500→0 st…
Mar 19, 2026
b1c8cdb
experiment(train_gpt): tighter int8 clip percentile 99.99984→99.995 —…
Mar 19, 2026
8934572
Revert "experiment(train_gpt): tighter int8 clip percentile 99.99984→…
Mar 19, 2026
e6f0525
experiment(train_gpt): reduce rope_base 10000→500 — hypothesis: faste…
Mar 19, 2026
68a16a9
Revert "experiment(train_gpt): reduce rope_base 10000→500 — hypothesi…
Mar 19, 2026
e65930d
experiment(train_gpt): reduce num_kv_heads 4→2 — hypothesis: fewer KV…
Mar 19, 2026
965521e
Revert "experiment(train_gpt): reduce num_kv_heads 4→2 — hypothesis: …
Mar 19, 2026
d3daded
experiment(train_gpt): revert layers 11→9 — 11 layers blows 16MB budg…
Mar 19, 2026
537730f
experiment(train_gpt): revert logit_softcap 20→30 — re-test with long…
Mar 19, 2026
18978d4
experiment(train_gpt): revert SwiGLU back to ReLU² — re-test at 2000 …
Mar 19, 2026
89e4554
Revert "experiment(train_gpt): revert SwiGLU back to ReLU² — re-test …
Mar 19, 2026
839b2c4
experiment(train_gpt): revert warmdown 2400→1200 — re-test at 2000 st…
Mar 19, 2026
54fe63b
Revert "experiment(train_gpt): revert warmdown 2400→1200 — re-test at…
Mar 19, 2026
0d4b67a
experiment(train_gpt): increase muon_momentum 0.95→0.98 — hypothesis:…
Mar 19, 2026
503c331
Revert "experiment(train_gpt): increase muon_momentum 0.95→0.98 — hyp…
Mar 19, 2026
7fb6507
experiment(train_gpt): reduce scalar_lr 0.04→0.02 — hypothesis: slowe…
Mar 19, 2026
0833c30
experiment(train_gpt): reduce matrix_lr 0.04→0.03 — hypothesis: lower…
Mar 19, 2026
6279a74
Revert "experiment(train_gpt): reduce matrix_lr 0.04→0.03 — hypothesi…
Mar 19, 2026
65fd032
experiment(train_gpt): reduce tied_embed_lr 0.05→0.03 — hypothesis: p…
Mar 19, 2026
dd8d76b
Revert "experiment(train_gpt): reduce tied_embed_lr 0.05→0.03 — hypot…
Mar 19, 2026
54e51c7
experiment(train_gpt): increase warmdown_iters 2400→3600 — hypothesis…
Mar 19, 2026
078c122
experiment(train_gpt): increase warmdown_iters 3600→4800 — hypothesis…
Mar 19, 2026
e0c7741
experiment(train_gpt): increase warmdown_iters 4800→6400 — hypothesis…
Mar 19, 2026
7b53faf
Revert "experiment(train_gpt): increase warmdown_iters 4800→6400 — hy…
Mar 19, 2026
3283408
experiment(train_gpt): tighten grad_clip_norm 1.0→0.5 — hypothesis: m…
Mar 19, 2026
e733b20
Revert "experiment(train_gpt): tighten grad_clip_norm 1.0→0.5 — hypot…
Mar 19, 2026
d42ebd3
experiment(train_gpt): disable logit_softcap entirely — hypothesis: r…
Mar 19, 2026
42ef022
Revert "experiment(train_gpt): disable logit_softcap entirely — hypot…
Mar 19, 2026
5ea1f92
experiment(train_gpt): increase muon_backend_steps 8→10 — hypothesis:…
Mar 19, 2026
31c59cd
experiment(train_gpt): fp16 embedding passthrough — keep tok_emb in f…
Mar 19, 2026
a2f9665
experiment(train_gpt): add sliding window eval (stride=64) — score ea…
Mar 19, 2026
53da99d
experiment(train_gpt): increase default train_seq_len 1024→2048 — hyp…
Mar 19, 2026
53d65d2
experiment(train_gpt): reduce eval_stride 64→32 — hypothesis: more co…
Mar 19, 2026
6780f07
Revert "experiment(train_gpt): reduce eval_stride 64→32 — hypothesis:…
Mar 20, 2026
9473c45
experiment(train_gpt): increase matrix_lr 0.04→0.06 — hypothesis: hig…
Mar 20, 2026
01d2dcc
Revert "experiment(train_gpt): increase matrix_lr 0.04→0.06 — hypothe…
Mar 20, 2026
80b33f0
experiment(train_gpt): reduce tied_embed_lr 0.05→0.04 — hypothesis: f…
Mar 20, 2026
9d113f6
Revert "experiment(train_gpt): reduce tied_embed_lr 0.05→0.04 — hypot…
Mar 20, 2026
ec979ff
experiment(train_gpt): increase warmdown_iters 4800→10000 — hypothesi…
Mar 20, 2026
7134619
experiment(train_gpt): increase layers 9→10 — hypothesis: use 3.4MB a…
Mar 20, 2026
5ab39d3
experiment(train_gpt): increase layers 10→11 — hypothesis: 2.6MB head…
Mar 20, 2026
edbc92e
experiment(train_gpt): increase layers 11→12 — hypothesis: push depth…
Mar 20, 2026
e85b313
experiment(train_gpt): increase warmdown_iters 10000→15000 — hypothes…
Mar 20, 2026
921b5ab
Revert "experiment(train_gpt): increase warmdown_iters 10000→15000 — …
Mar 20, 2026
d587a0e
fix(train_gpt): rewrite sliding window eval — match proven SOTA imple…
Mar 20, 2026
7bc304c
experiment(train_gpt): increase warmdown_iters 10000→12000 — hypothes…
Mar 20, 2026
da9fbc8
Revert "experiment(train_gpt): increase warmdown_iters 10000→12000 — …
Mar 21, 2026
657ceda
experiment(train_gpt): reduce tied_embed_init_std 0.005→0.002 — hypot…
Mar 21, 2026
392c220
Revert "experiment(train_gpt): reduce tied_embed_init_std 0.005→0.002…
Mar 21, 2026
3ca4856
experiment(train_gpt): set layers to 10 — verified on 4xH100: 1.2074 …
Mar 21, 2026
41c046b
experiment(train_gpt): 12 layers with thinner SwiGLU (hidden factor 2…
Mar 21, 2026
6ee256c
submission(train_gpt): 9-layer ReLU² with sliding window eval — 1.186…
Mar 21, 2026
1060c36
Add record: Optimizer Tuning + Sliding Window Eval, val_bpb=1.1864
Mar 21, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/autoresearch-results.tsv
verify.sh
logs/
@@ -0,0 +1,96 @@
This record combines optimizer tuning, training at longer sequence length, and sliding window evaluation to improve on the naive baseline without changing the model architecture.

## Key Changes from Baseline

### Training Improvements
- **Sequence length 2048** (baseline: 1024): Longer training context improves the model's ability to use positional information. Steps are ~18% slower, but the quality gain is worth it.
- **Warmdown 10000** (baseline: 1200): A much longer learning-rate decay schedule. With the wallclock-based warmdown, the LR decays throughout most of training, producing smoother convergence.
- **Muon backend steps 10** (baseline: 5): More Newton-Schulz iterations in the Muon optimizer produce better gradient orthogonalization.
- **Gradient clipping norm=1.0** (baseline: disabled): Stabilizes training, especially important with the longer warmdown.
- **Adam beta2=0.99** (baseline: 0.95): Smoother second moment estimate for embedding and scalar parameters.
- **Scalar LR=0.02** (baseline: 0.04): Lower learning rate for scale/gate parameters (attn_scale, mlp_scale, resid_mix, skip_weights) improves stability.
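The warmdown setting is easiest to read as an LR schedule. Below is a minimal sketch assuming a linear warmup-plateau-warmdown shape; the exact schedule in `train_gpt.py` is an assumption here (the record only states `warmup_steps=20` and `warmdown_iters=10000`, and the real warmdown is wallclock-based):

```python
def lr_at_step(step, base_lr=0.04, total_iters=20000, warmup=20, warmdown=10000):
    """Illustrative schedule: brief linear warmup, constant plateau,
    then a linear decay to zero over the final `warmdown` iterations."""
    if step < warmup:
        return base_lr * (step + 1) / warmup      # linear warmup
    start = total_iters - warmdown
    if step < start:
        return base_lr                            # constant plateau
    frac = (total_iters - step) / warmdown        # 1 -> 0 over warmdown
    return base_lr * frac
```

With warmdown=10000 of 20000 iterations, the LR is already halved by step 15000, which is roughly where the wallclock cap lands in the logged runs.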

### Evaluation Improvement
- **Sliding window eval (stride=64)**: Instead of chopping the validation set into non-overlapping 2048-token chunks (where the first token of each chunk has zero context), we use overlapping windows that advance by 64 tokens. Only the last 64 tokens of each window are scored, so every scored token sees at least 1984 tokens of context; the first window, which has no predecessor, scores all of its tokens. This is a pure eval improvement: the model weights are identical.
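A minimal sketch of the window scheduling described above (illustrative; the actual implementation lives in `train_gpt.py`'s eval loop):

```python
def sliding_eval_windows(n_tokens, window=2048, stride=64):
    """Yield (window_start, score_start, score_end) triples so that every
    token is scored exactly once.  The first window scores all its tokens;
    each later window scores only its final `stride` tokens, so every
    scored token sees at least window - stride tokens of context."""
    yield (0, 0, min(window, n_tokens))
    pos = window
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        yield (end - window, pos, end)  # window slides so `end` is its right edge
        pos = end
```

Note the cost trade-off: each scored token now requires a forward pass over a mostly-overlapping window, which is why the logged eval time is ~132s rather than a few seconds.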

### What Didn't Work (Tried and Reverted)
- SwiGLU MLP: Better per-param quality but the 3-matrix design uses more params per layer, blowing the 16MB budget at convergence.
- FP16 embedding passthrough: Reduces quantization error from ~0.007 to ~0.0003 BPB, but adds ~500KB to the artifact, pushing over 16MB.
- More layers (10-12): Better BPB, but the resulting artifact always exceeded the 16MB limit at full convergence. The int8+zlib compression ratio is ~0.93 bytes/param at 8xH100 convergence.
- Higher/lower learning rates for matrix_lr, tied_embed_lr: The defaults (0.04, 0.05) are well-tuned.
- Depth recurrence, lower RoPE base, different KV head counts: All worse.
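To make the budget arithmetic concrete, here is an illustrative sketch of symmetric per-tensor int8 quantization followed by zlib, on Gaussian-ish synthetic weights. This is not the actual `train_gpt.py` serializer, but it shows why compressed size stays near 1 byte per parameter and hence why layer count is budget-bound:

```python
import random
import zlib

def int8_zlib(weights):
    """Symmetric per-tensor int8 quantization + zlib (illustrative sketch).
    Returns (scale, quantized values, compressed blob)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    packed = bytes(v & 0xFF for v in q)             # two's-complement bytes
    return scale, q, zlib.compress(packed, 9)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]
scale, q, blob = int8_zlib(weights)
```

Near-Gaussian int8 values compress only modestly (their entropy is close to 7 bits/symbol), so ~17M parameters land near the ~15.8MB figure in the log; there is little headroom to squeeze extra layers through compression alone.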

## Configuration

Same architecture as baseline:
- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
- ReLU^2 MLP (unchanged)
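For reference, the baseline MLP activation is squared ReLU, and the layout values above fix the hidden width; a one-line sketch:

```python
MODEL_DIM, MLP_MULT = 512, 2
hidden_dim = MODEL_DIM * MLP_MULT   # 1024-wide MLP, per the layout above

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2.  Unlike plain ReLU, both the value and
    the first derivative vanish at zero."""
    return max(x, 0.0) ** 2
```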

Modified hyperparameters:
- `TRAIN_SEQ_LEN=2048` (was 1024)
- `WARMDOWN_ITERS=10000` (was 1200)
- `MUON_BACKEND_STEPS=10` (was 5)
- `GRAD_CLIP_NORM=1.0` (was 0.0)
- `BETA2=0.99` (was 0.95)
- `SCALAR_LR=0.02` (was 0.04)
- `EVAL_STRIDE=64` (sliding window evaluation)

## Command

```bash
RUN_ID=submission_seed1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_LOG_EVERY=200 \
VAL_LOSS_EVERY=2000 \
EVAL_BATCH_SEQS=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Key Metrics (from `train.log`)

- Timed training stopped at `11520/20000` steps due to the wallclock cap.
- Pre-quant eval at stop: `val_loss:2.0313`, `val_bpb:1.2031`
- Post-quant sliding window eval: `val_loss:2.0032`, `val_bpb:1.1864`
- Exact printed metric: `final_int8_zlib_roundtrip_exact val_bpb:1.18641686`
- Train time: `600019ms` (`step_avg:52.08ms`)
- Peak memory: `10121 MiB allocated`, `10440 MiB reserved`
- Eval time: `132519ms` (sliding window, stride=64, batch_seqs=1024)
- Serialized model int8+zlib: `15808653 bytes`
- Code size: `52684 bytes`
- Total submission size int8+zlib: `15861337 bytes`

## Training Volume

- Global batch: `524288` tokens/step
- Total train tokens seen: `6,044,098,560`

## Reproducibility (3 seeds)

| Seed | Steps | val_loss | val_bpb | Artifact |
|------|-------|----------|---------|----------|
| 1337 | 11,520 | 2.00321 | 1.18642 | 15,861,337 |
| 1338 | 11,520 | 2.00428 | 1.18705 | 15,859,751 |
| 1339 | 11,523 | 2.00667 | 1.18847 | 15,867,480 |

- Sample mean val_loss: `2.00472`
- Sample std: `0.00177`
- Current SOTA val_loss: `2.01348`
- Required improvement: `0.005 nats`
- Actual improvement: `0.00876 nats`
- One-sided t-test: `t=8.57`, `df=2`, `p < 0.01`
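The statistics above can be reproduced directly from the three per-seed losses. The p < 0.01 claim follows because the t statistic exceeds the one-sided critical value (about 6.96 at df = 2):

```python
import math

losses = [2.00321, 2.00428, 2.00667]   # val_loss for seeds 1337, 1338, 1339
sota = 2.01348                          # val_loss of the current SOTA

n = len(losses)
mean = sum(losses) / n
var = sum((x - mean) ** 2 for x in losses) / (n - 1)   # sample variance
std = math.sqrt(var)
t = (sota - mean) / (std / math.sqrt(n))               # one-sample t, df = n - 1
```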

## Methodology

Changes were discovered through 46 iterations of automated experimentation (autoresearch) on a proxy test setup (RTX 3090, 2000 steps), then validated on 4xH100 and finally 8xH100. The proxy correctly identified directional improvements but could not predict exact artifact sizes at full convergence, leading to several over-budget configurations being tested on H100.

## Included Files

- `train_gpt.py` (code snapshot used for the run)
- `train.log` (canonical run, SEED=1337)
- `train_seed1338.log` (reproducibility run, SEED=1338)
- `train_seed1339.log` (reproducibility run, SEED=1339)
- `submission.json` (leaderboard metadata)
@@ -0,0 +1,17 @@
{
"author": "RAC",
"github_id": "andreanjos",
"name": "Optimizer Tuning + Sliding Window Eval",
"blurb": "Baseline 9x512 SP-1024 architecture with optimizer improvements (warmdown=10000, muon_backend_steps=10, grad_clip=1.0, beta2=0.99, scalar_lr=0.02) and seq2048 training. Sliding window evaluation at stride=64 scores every token with near-maximum context. Post-quant int8+zlib roundtrip under the 16,000,000-byte cap.",
"date": "2026-03-21T06:00:00Z",
"val_loss": 2.00320987,
"val_bpb": 1.18641686,
"pre_quant_val_loss": 2.0313,
"pre_quant_val_bpb": 1.2031,
"step_stop": 11520,
"wallclock_seconds": 600.019,
"eval_time_seconds": 132.519,
"bytes_total": 15861337,
"bytes_model_int8_zlib": 15808653,
"bytes_code": 52684
}
@@ -0,0 +1,114 @@
logs/8xh100_9layer_nofp16embed.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:17059912
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.02
train_batch_tokens:524288 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9357 val_bpb:4.1077 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9370 train_time:105ms step_avg:105.45ms
step:2/20000 train_loss:17.5737 train_time:147ms step_avg:73.41ms
step:3/20000 train_loss:13.1619 train_time:198ms step_avg:66.05ms
step:4/20000 train_loss:8.2470 train_time:250ms step_avg:62.49ms
step:5/20000 train_loss:6.3365 train_time:302ms step_avg:60.33ms
step:6/20000 train_loss:7.2407 train_time:353ms step_avg:58.90ms
step:7/20000 train_loss:6.2363 train_time:405ms step_avg:57.88ms
step:8/20000 train_loss:6.0346 train_time:457ms step_avg:57.11ms
step:9/20000 train_loss:5.8972 train_time:509ms step_avg:56.51ms
step:10/20000 train_loss:5.7512 train_time:560ms step_avg:56.03ms
step:200/20000 train_loss:2.7615 train_time:10441ms step_avg:52.20ms
step:400/20000 train_loss:2.2909 train_time:20830ms step_avg:52.08ms
step:600/20000 train_loss:2.4978 train_time:31224ms step_avg:52.04ms
step:800/20000 train_loss:2.2474 train_time:41627ms step_avg:52.03ms
step:1000/20000 train_loss:2.3363 train_time:52034ms step_avg:52.03ms
step:1200/20000 train_loss:2.3576 train_time:62456ms step_avg:52.05ms
step:1400/20000 train_loss:2.3836 train_time:72861ms step_avg:52.04ms
step:1600/20000 train_loss:2.0454 train_time:83272ms step_avg:52.04ms
step:1800/20000 train_loss:2.1603 train_time:93672ms step_avg:52.04ms
step:2000/20000 train_loss:2.2083 train_time:104074ms step_avg:52.04ms
step:2000/20000 val_loss:2.1909 val_bpb:1.2975 train_time:104086ms step_avg:52.04ms
step:2200/20000 train_loss:2.0275 train_time:114480ms step_avg:52.04ms
step:2400/20000 train_loss:2.1568 train_time:124883ms step_avg:52.03ms
step:2600/20000 train_loss:2.3789 train_time:135288ms step_avg:52.03ms
step:2800/20000 train_loss:2.1904 train_time:145692ms step_avg:52.03ms
step:3000/20000 train_loss:2.1813 train_time:156097ms step_avg:52.03ms
step:3200/20000 train_loss:2.1458 train_time:166500ms step_avg:52.03ms
step:3400/20000 train_loss:2.1104 train_time:176909ms step_avg:52.03ms
step:3600/20000 train_loss:2.0582 train_time:187316ms step_avg:52.03ms
step:3800/20000 train_loss:2.1655 train_time:197726ms step_avg:52.03ms
step:4000/20000 train_loss:2.1274 train_time:208168ms step_avg:52.04ms
step:4000/20000 val_loss:2.1197 val_bpb:1.2554 train_time:208179ms step_avg:52.04ms
step:4200/20000 train_loss:2.1172 train_time:218645ms step_avg:52.06ms
step:4400/20000 train_loss:2.0575 train_time:229055ms step_avg:52.06ms
step:4600/20000 train_loss:1.9276 train_time:239470ms step_avg:52.06ms
step:4800/20000 train_loss:2.2088 train_time:249884ms step_avg:52.06ms
step:5000/20000 train_loss:1.9610 train_time:260389ms step_avg:52.08ms
step:5200/20000 train_loss:2.1223 train_time:270797ms step_avg:52.08ms
step:5400/20000 train_loss:2.1388 train_time:281210ms step_avg:52.08ms
step:5600/20000 train_loss:2.1251 train_time:291619ms step_avg:52.07ms
step:5800/20000 train_loss:2.0806 train_time:302028ms step_avg:52.07ms
step:6000/20000 train_loss:2.1595 train_time:312442ms step_avg:52.07ms
step:6000/20000 val_loss:2.0863 val_bpb:1.2356 train_time:312453ms step_avg:52.08ms
step:6200/20000 train_loss:2.0288 train_time:322855ms step_avg:52.07ms
step:6400/20000 train_loss:2.1062 train_time:333264ms step_avg:52.07ms
step:6600/20000 train_loss:2.0640 train_time:343676ms step_avg:52.07ms
step:6800/20000 train_loss:2.1290 train_time:354088ms step_avg:52.07ms
step:7000/20000 train_loss:2.1750 train_time:364508ms step_avg:52.07ms
step:7200/20000 train_loss:2.1447 train_time:374922ms step_avg:52.07ms
step:7400/20000 train_loss:2.0649 train_time:385336ms step_avg:52.07ms
step:7600/20000 train_loss:1.9417 train_time:395751ms step_avg:52.07ms
step:7800/20000 train_loss:2.0889 train_time:406164ms step_avg:52.07ms
step:8000/20000 train_loss:2.0593 train_time:416580ms step_avg:52.07ms
step:8000/20000 val_loss:2.0610 val_bpb:1.2206 train_time:416591ms step_avg:52.07ms
step:8200/20000 train_loss:2.1323 train_time:426997ms step_avg:52.07ms
step:8400/20000 train_loss:2.0714 train_time:437478ms step_avg:52.08ms
step:8600/20000 train_loss:2.0887 train_time:447892ms step_avg:52.08ms
step:8800/20000 train_loss:2.0444 train_time:458310ms step_avg:52.08ms
step:9000/20000 train_loss:1.9627 train_time:468721ms step_avg:52.08ms
step:9200/20000 train_loss:2.0257 train_time:479142ms step_avg:52.08ms
step:9400/20000 train_loss:2.0612 train_time:489555ms step_avg:52.08ms
step:9600/20000 train_loss:2.0844 train_time:499973ms step_avg:52.08ms
step:9800/20000 train_loss:1.9934 train_time:510389ms step_avg:52.08ms
step:10000/20000 train_loss:2.0501 train_time:520802ms step_avg:52.08ms
step:10000/20000 val_loss:2.0421 val_bpb:1.2094 train_time:520813ms step_avg:52.08ms
step:10200/20000 train_loss:2.0035 train_time:531220ms step_avg:52.08ms
step:10400/20000 train_loss:2.0217 train_time:541641ms step_avg:52.08ms
step:10600/20000 train_loss:1.9142 train_time:552057ms step_avg:52.08ms
step:10800/20000 train_loss:2.1162 train_time:562468ms step_avg:52.08ms
step:11000/20000 train_loss:2.0469 train_time:572886ms step_avg:52.08ms
step:11200/20000 train_loss:2.0079 train_time:583305ms step_avg:52.08ms
step:11400/20000 train_loss:1.9918 train_time:593729ms step_avg:52.08ms
step:11520/20000 val_loss:2.0313 val_bpb:1.2031 train_time:600019ms step_avg:52.08ms
stopping_early: wallclock_cap train_time:600019ms step:11520/20000
peak memory allocated: 10121 MiB reserved: 10440 MiB
Serialized model: 67224983 bytes
Code size: 52684 bytes
Total submission size: 67277667 bytes
Serialized model int8+zlib: 15808653 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)
Total submission size int8+zlib: 15861337 bytes
final_int8_zlib_roundtrip val_loss:2.0032 val_bpb:1.1864 eval_time:132519ms
final_int8_zlib_roundtrip_exact val_loss:2.00320987 val_bpb:1.18641686