Conversation
Pull request overview
This pull request introduces a utility script for merging two nanoGPT model checkpoints with flexible normalization options. The script supports L2-normalized merging (the default behavior), simple averaging without normalization, and an option to skip final normalization for embedding and language model head weights.
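For orientation, the merge rule behind these modes is roughly the following. This is a minimal sketch, not the script's actual code: the function name `merge_pair`, the order of averaging and normalization, and the epsilon guard are all assumptions.

```python
import torch

def merge_pair(a: torch.Tensor, b: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    # Simple averaging mode: the element-wise mean of the two checkpoints' weights.
    merged = (a + b) / 2
    if normalize:
        # L2-normalized mode: rescale the merged tensor to unit L2 norm along
        # the last dimension (clamped to avoid division by zero).
        merged = merged / merged.norm(p=2, dim=-1, keepdim=True).clamp_min(1e-12)
    return merged
```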
Changes:
- Added `model_merge.py`, a utility script that merges two checkpoint files with configurable L2 normalization and averaging strategies
- Added `demos/model_merge_demo.sh`, demonstrating the three main usage patterns of the merge script
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| model_merge.py | New utility for merging two nanoGPT checkpoints with L2 normalization, simple averaging, and selective layer normalization skipping |
| demos/model_merge_demo.sh | Demo script illustrating typical merge operations with different normalization options |
| checkpoint_a.pop("optimizer", None) | ||
| checkpoint_a.pop("scheduler", None) |
The code mutates checkpoint_a directly, but lines 132-135 may reassign checkpoint_a to be just the state_dict (line 135); in that case these .pop() calls operate on a bare state_dict, which has no 'optimizer' or 'scheduler' keys. Consider handling this more explicitly by checking isinstance(checkpoint_a, dict) before attempting to pop these keys, similar to the check on line 139.
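A minimal sketch of that guard (the `"model"` key test is an assumption about the checkpoint layout, not code from the PR):

```python
# Only strip training state while checkpoint_a is still the full checkpoint
# dict; skip this once it has been reassigned to the bare state_dict.
if isinstance(checkpoint_a, dict) and "model" in checkpoint_a:
    checkpoint_a.pop("optimizer", None)
    checkpoint_a.pop("scheduler", None)
```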
```python
def l2_normalize(tensor: torch.Tensor, dim: int = L2_NORM_DIM) -> torch.Tensor:
    if tensor.ndim == 0:
        return tensor
    if tensor.ndim == 1:
        dim = 0
```
The l2_normalize function has special handling for scalar (0-dim) and vector (1-dim) tensors but lacks documentation explaining this behavior. Add a docstring documenting that: (1) scalar tensors are returned unchanged, (2) 1-dim tensors normalize along dim=0, and (3) higher-dim tensors use the provided dim parameter (default=-1).
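A docstring along those lines might look like the following. Only the ndim dispatch appears in the snippet above; the normalization body and the `clamp_min` epsilon are assumptions.

```python
import torch

L2_NORM_DIM = -1  # mirrors the script's module-level default (default=-1 per the review)

def l2_normalize(tensor: torch.Tensor, dim: int = L2_NORM_DIM) -> torch.Tensor:
    """L2-normalize a tensor along a dimension.

    Scalar (0-dim) tensors are returned unchanged; 1-dim tensors are
    normalized along dim=0; higher-dim tensors use the provided ``dim``
    (default ``L2_NORM_DIM``, i.e. -1).
    """
    if tensor.ndim == 0:
        return tensor
    if tensor.ndim == 1:
        dim = 0
    return tensor / tensor.norm(p=2, dim=dim, keepdim=True).clamp_min(1e-12)
```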
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a new utility script for merging two nanoGPT model checkpoints with flexible options for normalization and averaging. The main addition is the `model_merge.py` script, which supports L2-normalized merging, skipping normalization for specific layers, and a simple averaging mode. A demo shell script is also provided to illustrate usage.

New model merging functionality:
- Added `model_merge.py`, a utility script for merging two nanoGPT checkpoints with options for L2 normalization, skipping final normalization for `wte`/`lm_head` weights, and simple averaging without normalization. The script handles key mismatches, validates shapes, and preserves metadata.

Demo and usage examples:
- Added `demos/model_merge_demo.sh`, a shell script demonstrating typical usage patterns for `model_merge.py`, including an L2-normalized merge, skipping final normalization for specific layers, and simple averaging.