
Trim train_gpt_mlx_kl.py to ≤1500 lines; fix orphaned clip_grad_tree#8

Merged
kailean merged 2 commits into copilot/create-clean-submission-ready-pr from copilot/verify-aub-1-0-bpb-code-length
Apr 4, 2026

Conversation


Copilot AI commented Apr 4, 2026

The script was 1847 lines, over the 1500-line target for the sub-1.0 BPB build; code bytes count toward the 16MB artifact limit per challenge rules. It also contained a bug: the body of clip_grad_tree existed, but there was no def statement, so any run with grad_clip_norm > 0 would raise a NameError.

Bug fix

Added missing function definition:

def clip_grad_tree(grads_tree, max_norm):
    """Clip gradient tree by global norm."""
    if max_norm <= 0:
        return grads_tree
    ...

Previously this was an orphaned code block (an indented body with no def) following eval_val_sliding_ngram. The body was dead code as written, yet the training loop still called clip_grad_tree at line 1747.
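As a pure-Python sketch of the restored helper: the `_flatten`/`_unflatten` helpers below are illustrative stand-ins for MLX's `tree_flatten`/`tree_unflatten` utilities, and, like the merged version, this rescales unconditionally rather than only when the norm exceeds the bound.

```python
import math

def _flatten(tree, prefix=""):
    # Illustrative stand-in for MLX's tree_flatten: nested dicts of
    # float-list leaves become a flat {dotted_key: leaf} dict.
    flat = {}
    for k, v in tree.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            flat.update(_flatten(v, key))
        else:
            flat[key] = v
    return flat

def _unflatten(flat):
    # Illustrative stand-in for tree_unflatten: rebuild the nested dict.
    tree = {}
    for key, leaf in flat.items():
        node = tree
        parts = key.split(".")
        for p in parts[:-1]:
            node = node.setdefault(p, {})
        node[parts[-1]] = leaf
    return tree

def clip_grad_tree(grads_tree, max_norm):
    """Clip gradient tree by global norm."""
    if max_norm <= 0:
        return grads_tree
    flat = _flatten(grads_tree)
    total_sq = sum(g * g for leaf in flat.values() for g in leaf)
    scale = max_norm / (math.sqrt(total_sq) + 1e-12)
    return _unflatten({k: [g * scale for g in leaf] for k, leaf in flat.items()})
```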

Line reduction (1847 → 1493)

  • Module docstring: 27-line innovation list → 2-line summary
  • Section separators: Removed all # ====...==== blocks (~17)
  • Docstrings: Multi-line → single-line where content was obvious
  • Blank lines: Collapsed consecutive blanks to ≤1
  • Comments: Removed inline comments restating the code

All 17 classes, 25 functions, and every feature (EngramLite, BackoffNgramMixer, ComplementaryTraining, SkipGramHash, SmearGate, XSA, LoRA TTT, GPTQ-lite, sliding-window eval) preserved. Verified via ast.parse and AST name enumeration.

Summary by Sourcery

Trim and clean up the GPT training script while preserving functionality and add a proper gradient clipping helper to fix a missing definition bug.

Bug Fixes:

  • Define the missing clip_grad_tree function and wire it into the training loop to safely apply global gradient norm clipping.

Enhancements:

  • Remove verbose comments, section banners, and redundant docstring text to reduce file size without changing behavior.

- Fix orphaned clip_grad_tree function body by adding proper def line
- Remove verbose section separator comment blocks (17+ instances)
- Compact 26-line module docstring to 2-line summary
- Trim multi-line docstrings to single lines throughout
- Remove redundant inline comments that restate the code
- Remove unnecessary blank lines within function bodies
- Compact Hyperparameters class by removing section comment headers

All functionality, logic, algorithms, and class/function signatures preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
Copilot AI changed the title [WIP] Check aub 1.0 bpb build code length against parameter golf rules Trim train_gpt_mlx_kl.py to ≤1500 lines; fix orphaned clip_grad_tree Apr 4, 2026
Copilot AI requested a review from kailean April 4, 2026 16:05
@kailean kailean marked this pull request as ready for review April 4, 2026 18:18
Copilot AI review requested due to automatic review settings April 4, 2026 18:18
@kailean kailean merged commit 5c2050e into copilot/create-clean-submission-ready-pr Apr 4, 2026
3 checks passed

sourcery-ai bot commented Apr 4, 2026

Reviewer's Guide

Fixes a latent NameError by properly defining clip_grad_tree, and reduces train_gpt_mlx_kl.py from ~1847 to 1493 lines via doc/comment pruning and minor docstring tightening without changing model/optimizer/eval behavior.

Sequence diagram for training step with clip_grad_tree

sequenceDiagram
    participant main
    participant model
    participant compiled_loss_and_grad
    participant clip_grad_tree
    participant optimizer

    main->>compiled_loss_and_grad: loss_and_grad_chunked(args, train_loader)
    compiled_loss_and_grad-->>main: loss, grads_tree

    main->>clip_grad_tree: clip_grad_tree(grads_tree, args.grad_clip_norm)
    alt max_norm <= 0
        clip_grad_tree-->>main: grads_tree (unchanged)
    else max_norm > 0
        clip_grad_tree->>clip_grad_tree: flat = dict(tree_flatten(grads_tree))
        clip_grad_tree->>clip_grad_tree: total_sq = sum((g**2).sum() for g in flat.values())
        clip_grad_tree->>clip_grad_tree: scale = max_norm / (sqrt(total_sq) + 1e-12)
        clip_grad_tree-->>main: tree_unflatten((k, g * scale))
    end

    main->>optimizer: opt.step(model, clipped_grads, step, lr_mul)
    optimizer-->>model: update(model.parameters())
    model-->>main: parameters_updated

Flow diagram for clip_grad_tree gradient clipping

flowchart TD
    A_start[Start clip_grad_tree] --> B_check_norm{max_norm <= 0}
    B_check_norm -->|yes| C_return_orig[Return grads_tree]
    B_check_norm -->|no| D_flatten["flat = dict(tree_flatten(grads_tree))"]
    D_flatten --> E_total_sq["total_sq = sum((g * g).sum() for g in flat.values())"]
    E_total_sq --> F_scale["scale = max_norm / (sqrt(total_sq) + 1e-12)"]
    F_scale --> G_scale_grads["scaled_items = (k, g * scale) for k, g in flat.items()"]
    G_scale_grads --> H_unflatten["clipped = tree_unflatten(scaled_items)"]
    H_unflatten --> I_return_clipped[Return clipped]

File-Level Changes

Change: Define clip_grad_tree helper and keep its use in the training loop intact. (train_gpt_mlx_kl.py)
  • Add a proper def clip_grad_tree(grads_tree, max_norm) wrapper around the existing clipping body
  • Ensure the function early-returns when max_norm <= 0 and otherwise computes a global-norm scale from flattened grads
  • Continue to call clip_grad_tree in the training loop before optimizer.step

Change: Aggressively trim comments, section separators, and docstrings to cut ~350 lines without functional changes. (train_gpt_mlx_kl.py)
  • Remove decorative section separator comments and redundant inline comments that restate code
  • Shorten multi-line docstrings to single-line summaries where behavior is obvious
  • Collapse consecutive blank lines and shorten the top-level module docstring

Change: Minor readability/consistency cleanups while preserving behavior of models, training, quantization, and eval features. (train_gpt_mlx_kl.py)
  • Normalize or remove some explanatory comments around GPT architecture features (SmearGate, XSA, EngramLite, SkipGram, complementary loss, BackoffNgramMixer, TTT) without touching logic
  • Tighten code in several helpers by removing superfluous temporary comments and spacing
  • Leave all public classes/functions, hyperparameters, and evaluation paths (standard/sliding/ngram/TTT) intact



@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new clip_grad_tree helper is currently defined inline between evaluation functions; consider moving it closer to the optimizer/gradient logic (e.g., near SplitOptimizers or training loop helpers) to keep related concerns grouped together.
  • In clip_grad_tree, you repeatedly convert between tree and dict (tree_flatten → dict → tree_unflatten); if performance becomes an issue, you could operate directly on the flattened list or reuse the original structure to avoid extra allocations.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `clip_grad_tree` helper is currently defined inline between evaluation functions; consider moving it closer to the optimizer/gradient logic (e.g., near `SplitOptimizers` or training loop helpers) to keep related concerns grouped together.
- In `clip_grad_tree`, you repeatedly convert between tree and dict (`tree_flatten` → `dict` → `tree_unflatten`); if performance becomes an issue, you could operate directly on the flattened list or reuse the original structure to avoid extra allocations.

## Individual Comments

### Comment 1
<location path="train_gpt_mlx_kl.py" line_range="1159-1163" />
<code_context>

-
-
+def clip_grad_tree(grads_tree, max_norm):
+    """Clip gradient tree by global norm."""
     if max_norm <= 0:
</code_context>
<issue_to_address>
**issue (bug_risk):** clip_grad_tree increases small gradients up to max_norm instead of only shrinking large ones

This rescales the gradient tree to have norm `max_norm` even when the original norm is already smaller, which deviates from standard clipping and unintentionally increases gradients. A typical implementation only rescales when the norm exceeds `max_norm`, e.g.

```python
if total_sq <= max_norm * max_norm:
    return grads_tree
scale = max_norm / (math.sqrt(total_sq) + 1e-12)
...
```

so gradients within the bound are left unchanged and only oversized gradients are reduced.
</issue_to_address>


Comment on lines +1159 to 1163
def clip_grad_tree(grads_tree, max_norm):
    """Clip gradient tree by global norm."""
    if max_norm <= 0:
        return grads_tree
    flat = dict(tree_flatten(grads_tree))

issue (bug_risk): clip_grad_tree increases small gradients up to max_norm instead of only shrinking large ones

This rescales the gradient tree to have norm max_norm even when the original norm is already smaller, which deviates from standard clipping and unintentionally increases gradients. A typical implementation only rescales when the norm exceeds max_norm, e.g.

if total_sq <= max_norm * max_norm:
    return grads_tree
scale = max_norm / (math.sqrt(total_sq) + 1e-12)
...

so gradients within the bound are left unchanged and only oversized gradients are reduced.
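As a self-contained illustration of the conventional behavior this review describes, operating on a flat {name: list-of-floats} dict (the function name and dict shape are illustrative, not the script's):

```python
import math

def clip_grads_by_global_norm(flat_grads, max_norm):
    # Conventional global-norm clipping: leave gradients untouched when
    # their norm is already within the bound, shrink only when exceeded.
    total_sq = sum(g * g for leaf in flat_grads.values() for g in leaf)
    if max_norm <= 0 or total_sq <= max_norm * max_norm:
        return flat_grads
    scale = max_norm / (math.sqrt(total_sq) + 1e-12)
    return {k: [g * scale for g in leaf] for k, leaf in flat_grads.items()}
```

With this shape, a gradient of norm 0.5 passes through unchanged under max_norm=1.0, while a gradient of norm 5 is scaled down to norm 1.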


Copilot AI left a comment


Pull request overview

This PR trims train_gpt_mlx_kl.py to meet the ≤1500-line target for the Parameter Golf artifact-size constraints and fixes a runtime bug where clip_grad_tree was referenced in training but had no function definition.

Changes:

  • Reduced script length primarily by removing/condensing docstrings, separators, and redundant comments/whitespace.
  • Added a proper def clip_grad_tree(grads_tree, max_norm): ... implementation so gradient clipping works when enabled.
  • Kept existing model/optimizer/eval features intact while reorganizing/condensing surrounding text.


- bigram_hash_size: int = int(os.environ.get("BIGRAM_HASH_SIZE", 16384))  # Task 2: fill budget (was 10240)
- qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))  # UNUSED — late_qat_threshold controls QAT
+ bigram_hash_size: int = int(os.environ.get("BIGRAM_HASH_SIZE", 16384))
+ qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))

Copilot AI Apr 4, 2026


qat_start_frac is defined but never referenced anywhere in the script (only appears in Hyperparameters). Since this repo is optimizing for minimal code bytes, consider removing it (and the QAT_START_FRAC env var) or wiring it into the QAT toggle logic so the config surface matches actual behavior.

Suggested change
qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))
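If the parameter were wired in rather than removed, one hypothetical shape for the toggle (qat_enabled, step, and total_steps are illustrative names, not identifiers from the script):

```python
def qat_enabled(step, total_steps, qat_start_frac):
    # Hypothetical wiring: quantization-aware training switches on once
    # training progress passes the qat_start_frac fraction of total steps.
    return step >= total_steps * qat_start_frac
```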


augmentcode bot commented Apr 6, 2026

🤖 Augment PR Summary

Summary: This PR trims train_gpt_mlx_kl.py to meet a ≤1500 line target for the Parameter Golf artifact-size constraints, and fixes a runtime bug in gradient clipping.

Changes:

  • Replaces the long module docstring and removes section separators, redundant comments, and extra blank lines.
  • Collapses several multi-line docstrings into shorter single-line versions while keeping the same APIs.
  • Adds a missing def clip_grad_tree(grads_tree, max_norm) wrapper around an existing gradient-clipping code block.
  • Keeps the training loop’s call to clip_grad_tree functional when grad_clip_norm > 0.

Technical Notes: The only intended behavioral change is the NameError fix for gradient clipping; the rest of the diff is code/comment compaction to reduce script size.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


- bigram_hash_size: int = int(os.environ.get("BIGRAM_HASH_SIZE", 16384))  # Task 2: fill budget (was 10240)
- qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))  # UNUSED — late_qat_threshold controls QAT
+ bigram_hash_size: int = int(os.environ.get("BIGRAM_HASH_SIZE", 16384))
+ qat_start_frac: float = float(os.environ.get("QAT_START_FRAC", 0.15))

train_gpt_mlx_kl.py:54 — qat_start_frac is read from QAT_START_FRAC but isn’t referenced anywhere else, so that env var currently has no effect on when QAT turns on. Since the explanatory comment was removed in this trim, consider either wiring this parameter into the QAT toggle logic or removing it to avoid misleading configuration.

Severity: medium


  # TTT: gradient steps on LoRA params using context tokens (s=0..wlen-stride)
  s = 0 if ws == 0 else max(wlen - stride, 0)
- if s > 0:  # Only train if there are context tokens before the eval window
+ if s > 0:

train_gpt_mlx_kl.py:1215 — Inside this TTT block, lora_A/lora_B never get updated (and ttt_lr is unused), so enabling TTT appears to be a no-op besides recomputing loss. If the intent is to actually adapt per-window weights, this likely needs an update step for the LoRA parameters.

Severity: medium

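If per-window adaptation is the intent, the missing piece would be an update applied to the LoRA leaves after gradients are computed on the context tokens. A toy SGD step under that assumption (function name and dict shapes are hypothetical, not from the script):

```python
def ttt_step(lora_params, lora_grads, ttt_lr):
    # One plain SGD step applied only to the LoRA parameters; the base
    # model weights stay frozen during test-time training.
    return {
        name: [p - ttt_lr * g for p, g in zip(params, lora_grads[name])]
        for name, params in lora_params.items()
    }
```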
