Trim train_gpt_mlx_kl.py to ≤1500 lines; fix orphaned clip_grad_tree (#8)
Conversation
- Fix orphaned `clip_grad_tree` function body by adding proper `def` line
- Remove verbose section separator comment blocks (17+ instances)
- Compact 26-line module docstring to 2-line summary
- Trim multi-line docstrings to single lines throughout
- Remove redundant inline comments that restate the code
- Remove unnecessary blank lines within function bodies
- Compact Hyperparameters class by removing section comment headers

All functionality, logic, algorithms, and class/function signatures preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
Merged commit 5c2050e into copilot/create-clean-submission-ready-pr
Reviewer's Guide

Fixes a latent NameError by properly defining `clip_grad_tree`, and reduces train_gpt_mlx_kl.py from ~1847 to 1493 lines via doc/comment pruning and minor docstring tightening without changing model/optimizer/eval behavior.

Sequence diagram for training step with clip_grad_tree

```mermaid
sequenceDiagram
    participant main
    participant model
    participant compiled_loss_and_grad
    participant clip_grad_tree
    participant optimizer
    main->>compiled_loss_and_grad: loss_and_grad_chunked(args, train_loader)
    compiled_loss_and_grad-->>main: loss, grads_tree
    main->>clip_grad_tree: clip_grad_tree(grads_tree, args.grad_clip_norm)
    alt max_norm <= 0
        clip_grad_tree-->>main: grads_tree (unchanged)
    else max_norm > 0
        clip_grad_tree->>clip_grad_tree: flat = dict(tree_flatten(grads_tree))
        clip_grad_tree->>clip_grad_tree: total_sq = sum((g**2).sum() for g in flat.values())
        clip_grad_tree->>clip_grad_tree: scale = max_norm / sqrt(total_sq + 1e-12)
        clip_grad_tree-->>main: tree_unflatten((k, g * scale))
    end
    main->>optimizer: opt.step(model, clipped_grads, step, lr_mul)
    optimizer-->>model: update(model.parameters())
    model-->>main: parameters_updated
```
Flow diagram for clip_grad_tree gradient clipping

```mermaid
flowchart TD
    A_start[Start clip_grad_tree] --> B_check_norm{max_norm <= 0}
    B_check_norm -->|yes| C_return_orig[Return grads_tree]
    B_check_norm -->|no| D_flatten["flat = dict(tree_flatten(grads_tree))"]
    D_flatten --> E_total_sq["total_sq = sum((g * g).sum() for g in flat.values())"]
    E_total_sq --> F_scale["scale = max_norm / (sqrt(total_sq) + 1e-12)"]
    F_scale --> G_scale_grads["scaled_items = (k, g * scale) for k, g in flat.items()"]
    G_scale_grads --> H_unflatten["clipped = tree_unflatten(scaled_items)"]
    H_unflatten --> I_return_clipped[Return clipped]
```
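The diagrammed logic can be sketched in plain Python. The real script operates on MLX arrays via `mlx.utils.tree_flatten`/`tree_unflatten`; here, simplified stand-ins for those helpers work on nested dicts whose leaves are lists of floats, so this is a structural sketch rather than the script's actual code:

```python
import math

def tree_flatten(tree, prefix=""):
    # Stand-in for mlx.utils.tree_flatten: yields (dotted_key, leaf) pairs.
    items = []
    for k, v in tree.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            items.extend(tree_flatten(v, key))
        else:
            items.append((key, v))
    return items

def tree_unflatten(items):
    # Stand-in for mlx.utils.tree_unflatten: rebuilds the nested dict.
    tree = {}
    for key, v in items:
        parts = key.split(".")
        node = tree
        for p in parts[:-1]:
            node = node.setdefault(p, {})
        node[parts[-1]] = v
    return tree

def clip_grad_tree(grads_tree, max_norm):
    """Rescale the gradient tree by global norm, as shown in the diagram."""
    if max_norm <= 0:
        return grads_tree
    flat = dict(tree_flatten(grads_tree))
    total_sq = sum(g * g for grads in flat.values() for g in grads)
    scale = max_norm / (math.sqrt(total_sq) + 1e-12)
    return tree_unflatten([(k, [g * scale for g in grads]) for k, grads in flat.items()])
```

Note that, as diagrammed, the scale is applied unconditionally, so a gradient whose norm is already below `max_norm` gets scaled up; that behavior is what the Sourcery review on this page flags.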
Hey - I've found 1 issue, and left some high level feedback:

- The new `clip_grad_tree` helper is currently defined inline between evaluation functions; consider moving it closer to the optimizer/gradient logic (e.g., near `SplitOptimizers` or training loop helpers) to keep related concerns grouped together.
- In `clip_grad_tree`, you repeatedly convert between tree and dict (`tree_flatten` → `dict` → `tree_unflatten`); if performance becomes an issue, you could operate directly on the flattened list or reuse the original structure to avoid extra allocations.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `clip_grad_tree` helper is currently defined inline between evaluation functions; consider moving it closer to the optimizer/gradient logic (e.g., near `SplitOptimizers` or training loop helpers) to keep related concerns grouped together.
- In `clip_grad_tree`, you repeatedly convert between tree and dict (`tree_flatten` → `dict` → `tree_unflatten`); if performance becomes an issue, you could operate directly on the flattened list or reuse the original structure to avoid extra allocations.
## Individual Comments
### Comment 1
<location path="train_gpt_mlx_kl.py" line_range="1159-1163" />
<code_context>
-
-
+def clip_grad_tree(grads_tree, max_norm):
+ """Clip gradient tree by global norm."""
if max_norm <= 0:
</code_context>
<issue_to_address>
**issue (bug_risk):** clip_grad_tree increases small gradients up to max_norm instead of only shrinking large ones
This rescales the gradient tree to have norm `max_norm` even when the original norm is already smaller, which deviates from standard clipping and unintentionally increases gradients. A typical implementation only rescales when the norm exceeds `max_norm`, e.g.
```python
if total_sq <= max_norm * max_norm:
return grads_tree
scale = max_norm / (math.sqrt(total_sq) + 1e-12)
...
```
so gradients within the bound are left unchanged and only oversized gradients are reduced.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
The inline comment is anchored to this snippet:

```python
def clip_grad_tree(grads_tree, max_norm):
    """Clip gradient tree by global norm."""
    if max_norm <= 0:
        return grads_tree
    flat = dict(tree_flatten(grads_tree))
```
Pull request overview
This PR trims train_gpt_mlx_kl.py to meet the ≤1500-line target for the Parameter Golf artifact-size constraints and fixes a runtime bug where clip_grad_tree was referenced in training but had no function definition.
Changes:
- Reduced script length primarily by removing/condensing docstrings, separators, and redundant comments/whitespace.
- Added a proper
def clip_grad_tree(grads_tree, max_norm): ...implementation so gradient clipping works when enabled. - Kept existing model/optimizer/eval features intact while reorganizing/condensing surrounding text.
```diff
-bigram_hash_size: int = int(os.environ.get("BIGRAM_HASH_SIZE", 16384))  # Task 2: fill budget (was 10240)
-qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))  # UNUSED — late_qat_threshold controls QAT
+bigram_hash_size: int = int(os.environ.get("BIGRAM_HASH_SIZE", 16384))
+qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))
```
`qat_start_frac` is defined but never referenced anywhere in the script (only appears in `Hyperparameters`). Since this repo is optimizing for minimal code bytes, consider removing it (and the `QAT_START_FRAC` env var) or wiring it into the QAT toggle logic so the config surface matches actual behavior.
train_gpt_mlx_kl.py:54 — `qat_start_frac` is read from `QAT_START_FRAC` but isn't referenced anywhere else, so that env var currently has no effect on when QAT turns on. Since the explanatory comment was removed in this trim, consider either wiring this parameter into the QAT toggle logic or removing it to avoid misleading configuration.
Severity: medium
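One way to wire the parameter in, as both reviewers suggest. This is a hypothetical sketch: `qat_active`, `total_steps`, and the step-fraction threshold are illustrative, not taken from the script:

```python
import os

def qat_active(step, total_steps, qat_start_frac=None):
    # Hypothetical toggle: enable QAT once training passes the given
    # fraction of total steps; falls back to the QAT_START_FRAC env var.
    if qat_start_frac is None:
        qat_start_frac = float(os.environ.get("QAT_START_FRAC", 0.15))
    return step >= int(qat_start_frac * total_steps)
```

With 100 total steps and the default fraction of 0.15, QAT would switch on at step 15; the actual script gates QAT via `late_qat_threshold` instead.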
```diff
 # TTT: gradient steps on LoRA params using context tokens (s=0..wlen-stride)
 s = 0 if ws == 0 else max(wlen - stride, 0)
-if s > 0:  # Only train if there are context tokens before the eval window
+if s > 0:
```
train_gpt_mlx_kl.py:1215 — Inside this TTT block, `lora_A`/`lora_B` never get updated (and `ttt_lr` is unused), so enabling TTT appears to be a no-op besides recomputing loss. If the intent is to actually adapt per-window weights, this likely needs an update step for the LoRA parameters.
Severity: medium
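A minimal sketch of the update step the reviewer says is missing. The names `lora_A`, `lora_B`, and `ttt_lr` mirror the script, but the gradient values and the plain-list SGD step are stand-ins, since the script's loss and MLX gradient machinery are not shown here:

```python
def sgd_step(params, grads, lr):
    # In-place SGD on a flat list of floats.
    for i in range(len(params)):
        params[i] -= lr * grads[i]

# Hypothetical TTT inner step: adapt only the LoRA factors on context tokens.
lora_A = [0.5, -0.5]
lora_B = [0.0, 0.0]
ttt_lr = 0.1
grads_A = [1.0, -1.0]  # stand-in gradients from the context-token loss
grads_B = [0.2, 0.2]
sgd_step(lora_A, grads_A, ttt_lr)
sgd_step(lora_B, grads_B, ttt_lr)
```

The point of the review comment is that without some step like this, computing a loss over the context tokens changes nothing: the base weights are frozen by design and the LoRA factors stay at their initial values.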
Script was 1847 lines, over the 1500-line target for the sub-1.0 BPB build. Code bytes count toward the 16MB artifact limit per challenge rules. Also found a bug: the `clip_grad_tree` body existed but had no `def` statement — would `NameError` when `grad_clip_norm > 0`.

Bug fix

Added missing function definition. Previously this was an orphaned code block (indented body with no `def`) following `eval_val_sliding_ngram`, unreachable as written but called in the training loop at line 1747.

Line reduction (1847 → 1493)
Removed `# ====...====` separator comment blocks (~17). All 17 classes, 25 functions, and every feature (EngramLite, BackoffNgramMixer, ComplementaryTraining, SkipGramHash, SmearGate, XSA, LoRA TTT, GPTQ-lite, sliding-window eval) preserved. Verified via `ast.parse` and AST name enumeration.

Summary by Sourcery
Trim and clean up the GPT training script while preserving functionality and add a proper gradient clipping helper to fix a missing definition bug.
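As a footnote, the `ast.parse` verification described in the PR body can be reproduced with the standard library alone. A minimal sketch; the two-definition `sample` string stands in for the real 1493-line script:

```python
import ast

def defined_names(source):
    # Collect every class and function name defined anywhere in the module,
    # so before/after trims can be compared for preserved definitions.
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

sample = (
    "def clip_grad_tree(grads_tree, max_norm):\n"
    "    return grads_tree\n"
    "\n"
    "class SplitOptimizers:\n"
    "    pass\n"
)
```

Running `defined_names` on the pre- and post-trim sources and diffing the two sets is enough to confirm that no class or function was dropped.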