Record: Seed-Regenerated Random Model + Incremental N-gram Cache — val_bpb 0.0905#1095
Draft
vimeto wants to merge 1 commit into openai:main from
val_bpb = 0.0905 (1 seed, additional seeds pending H100 access) | 15.09 MB | 8xH100 SXM
Results (8xH100 80GB SXM, PyTorch 2.7.1)
Additional seeds pending H100 access.
Key Innovation: Zero-Cost Base Weights
All transformer weight matrices use frozen orthogonal random projections regenerated from 8-byte seeds at load time (0 bytes in the artifact). Only the rank-64 LoRA adapters are stored (3.9 MB). The remaining 11 MB holds an incrementally built INT16 n-gram cache (orders 2-7, 31B counts, synced via 8-GPU all-reduce).
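A minimal numpy sketch of the idea (helper names are hypothetical; the PR's actual code is not shown): the frozen base weight is regenerated deterministically from its seed via QR orthogonalization, and only the rank-64 LoRA factors would ever be serialized.

```python
import numpy as np

def base_weight_from_seed(seed: int, d_out: int, d_in: int) -> np.ndarray:
    """Regenerate a frozen orthogonal base matrix from a seed.

    Nothing is stored in the artifact: the same seed deterministically
    reproduces the same matrix at load time. Square case for simplicity.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((d_out, d_in))
    q, _ = np.linalg.qr(g)  # QR-orthogonalized random projection
    return q

def effective_weight(seed: int, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """W = W_base(seed) + B @ A, with only the rank-r factors A, B stored."""
    return base_weight_from_seed(seed, B.shape[0], A.shape[1]) + B @ A

# Rank-64 adapters are the only trained (and stored) parameters.
d, r = 512, 64
A = np.zeros((r, d))  # LoRA "down" factor; zero init means W == base at start
B = np.random.default_rng(1).standard_normal((d, r)) * 0.01
W = effective_weight(42, A, B)
```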
Why orthogonal: Prior work (PR #874) used Gaussian random bases but could not train past 5 layers. Our QR-decomposed orthogonal init preserves singular values = 1.0, enabling stable deep training.
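The singular-value claim can be checked directly. A sketch comparing a fan-in-scaled Gaussian matrix to a QR-orthogonalized one:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
g = rng.standard_normal((d, d)) / np.sqrt(d)       # Gaussian base, fan-in scaled
q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # QR-orthogonalized base

sv_g = np.linalg.svd(g, compute_uv=False)
sv_q = np.linalg.svd(q, compute_uv=False)
# Gaussian singular values spread over roughly (0, 2), so stacking layers
# amplifies some directions and crushes others; orthogonal ones are all
# exactly 1.0, keeping signal norms stable through depth.
```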
Adapter quantization: Simple per-row INT8 quantization gives a gap of only +0.003 BPB (vs. +0.006 for the baseline INT6 GPTQ).
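A sketch of per-row INT8 quantization under assumed conventions (symmetric round-to-nearest with one fp32 scale per output row; the PR's actual quantizer is not shown):

```python
import numpy as np

def quant_int8_per_row(w: np.ndarray):
    """Symmetric per-row INT8: one fp32 scale per row of the adapter matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequant(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a small adapter-shaped matrix and measure worst-case error.
w = np.random.default_rng(0).standard_normal((64, 512)).astype(np.float32) * 0.02
q, s = quant_int8_per_row(w)
err = np.abs(dequant(q, s) - w).max()  # bounded by half a quantization step per row
```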
Incremental N-gram Cache (Zero Overhead)
The cache is built during training by calling update_batch_fast() after each microstep (<1 ms overhead). After training, counts are all-reduced across the 8 GPUs and LZMA-compressed into the artifact. At eval the cache is frozen; there is no test-time training (TTT).
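A minimal sketch of what one incremental update could look like (hypothetical signature and dict-based storage; the real update_batch_fast(), its INT16 table layout, and the all-reduce are not shown):

```python
def update_batch_fast(cache: dict, tokens: list, orders=range(2, 8)) -> dict:
    """Count n-grams of orders 2-7 from one microbatch into a running cache.

    Counts saturate at the INT16 maximum so the finished table can be
    stored as INT16 before LZMA compression.
    """
    INT16_MAX = 32767
    for n in orders:
        for i in range(len(tokens) - n + 1):
            key = tuple(tokens[i:i + n])
            cache[key] = min(cache.get(key, 0) + 1, INT16_MAX)
    return cache

cache = {}
update_batch_fast(cache, [1, 2, 3, 1, 2, 3])  # called once per microstep
```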
We also tested pre-filling the cache from training shards at startup: roughly 10x worse (0.996 BPB), because pre-fill consumed 24-33% of the training budget.
Architecture
5L 512d, 8H/4KV, MLP 3.0, LeakyReLU(0.5) squared, rank-64 LoRA adapters, tied embeddings, vocab 1024
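One plausible reading of the "LeakyReLU(0.5) squared" activation, sketched in numpy (the post-squaring sign convention is an assumption; squaring as written maps negative inputs to positive outputs):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then elementwise square.

    Assumed reading of the architecture note; squaring discards sign,
    so f(-x) = slope**2 * x**2 for x > 0.
    """
    y = np.where(x > 0, x, slope * x)
    return y * y
```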
Credits