Add exploration to compare wte and scale #670
Pull Request Overview
This PR adds configurable normalization layers for word token embeddings (WTE) and absolute position embeddings (ABS), with support for HyperSphereNorm variants including gain and scale parameters. It refactors the embedding scale initialization and improves checkpoint saving behavior.
Key changes:
- Introduces norm_variant_wte and norm_variant_abs with configurable radius, scale, and gain parameters
- Refactors HyperSphereNorm to support a const_radius_factor scaling mechanism
- Adds embedding_scale_init configuration option for custom initialization
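As a rough illustration of the const_radius_factor and gain mechanism described above, here is a minimal sketch of the normalization arithmetic. It is not the PR's implementation: it uses plain Python lists instead of tensors so it runs without dependencies, the function name hypersphere_norm and its default values are assumptions, and the gain is treated as a plain float rather than a learned parameter.

```python
import math

def hypersphere_norm(x, radius_init_factor=1.0, const_radius_factor=1.0, gain=1.0):
    """Rescale vector x to L2 length (const_radius_factor * radius_init_factor) * gain.

    Hypothetical helper for illustration only. No epsilon is added to the
    denominator, so an all-zero input raises ZeroDivisionError.
    """
    # Effective radius: a constant scale factor applied to the initial radius.
    radius = const_radius_factor * radius_init_factor
    # L2 norm of the input vector (the analogue of the last-dimension norm).
    l2 = math.sqrt(sum(v * v for v in x))
    # Project onto the unit sphere, then scale by radius and gain.
    return [v / l2 * radius * gain for v in x]

out = hypersphere_norm([3.0, 4.0], radius_init_factor=2.0,
                       const_radius_factor=0.5, gain=3.0)
```

With these inputs the effective radius is 1.0 and the gain is 3.0, so the output keeps the direction of the input and has an L2 length of approximately 3.0.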
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| variations/norm_variations.py | Refactors HyperSphereNorm to support scale factor and updates forward to apply gain parameter |
| train_args.py | Adds CLI arguments for WTE/ABS norm configurations, embedding scale init, and hsnorm_scale parameter |
| train.py | Simplifies checkpoint saving logic by removing redundant never_save_checkpoint check |
| model.py | Moves post-embedding norm instantiation to transformer ModuleDict and reorders embedding operations |
| gpt_conf.py | Adds configuration fields for WTE/ABS norm parameters and embedding scale initialization |
| explorations/norm_wte_abs_sweep.yaml | Adds experiment configuration for normalization sweep experiments |
| explorations/norm_wte_abs_embd_scale.yaml | Adds experiment configuration with embedding scale variations |
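
The two exploration files describe parameter sweeps. The sketch below shows one common way a sweep grid expands into individual runs (a Cartesian product over the listed values). The YAML schema used by this repository is not shown in the PR text, so the grid layout, the expand_grid helper, and all values here are assumptions for illustration; only the parameter names come from the PR.

```python
from itertools import product

def expand_grid(grid):
    """Expand {name: [values...]} into a list of one-setting-per-name dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

# Hypothetical sweep values; the real YAML files may use different ones.
grid = {
    "norm_variant_wte": ["hyperspherenorm"],  # assumed variant name
    "norm_wte_radius": [1.0, 2.0],
    "norm_wte_gain": [True, False],
}

runs = expand_grid(grid)
for run in runs:
    print(run)
```

Each additional swept parameter multiplies the number of runs, which is worth keeping in mind when combining radius, scale, and gain for both WTE and ABS.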
    radius = self.const_radius_factor * self.radius_init_factor
    hypersphere_norm = x.norm(2, dim=-1, keepdim=True)
    return x / hypersphere_norm * self.radius
    return x / hypersphere_norm * radius * self.gain
Copilot
AI
Oct 29, 2025
There is an extra space after 'return' in the new line; it should read 'return x' with a single space.
Suggested change:
    return x / hypersphere_norm * radius * self.gain
Co-authored-by: Copilot <[email protected]>
This pull request introduces new configuration options and experimental YAML files to enable fine-grained control and experimentation with normalization strategies for word token embeddings (WTE) and absolute positional embeddings in the model. The changes allow for flexible application of HyperSphereNorm (HSNorm) to these embeddings, with tunable parameters such as radius, scale, and gain, and provide infrastructure for running large sweeps and ablation studies on these normalization settings.
The most important changes are:
1. Experimental configuration for normalization sweeps and ablations
- Added norm_wte_abs_sweep.yaml and norm_wte_abs_embd_scale.yaml, which define comprehensive parameter sweeps over normalization variants, radius, scale, and gain for WTE and absolute position embeddings, including baseline and ablation runs. These files facilitate systematic experimentation.

2. Extended model and config to support post-embedding normalization
- Updated GPTConfig in gpt_conf.py to include new attributes for WTE and absolute embedding normalization: norm_variant_wte, norm_wte_radius, norm_wte_scale, norm_wte_gain, norm_variant_abs, norm_abs_radius, norm_abs_scale, norm_abs_gain, and related parameters for HyperSphereNorm.
- Updated train_args.py to expose these new configuration options as command-line arguments, allowing them to be set via CLI or YAML.

3. Model logic for embedding normalization and scaling
- Updated model.py to apply the specified normalization (e.g., HSNorm) to WTE and absolute position embeddings, using a helper method to build normalization layers with the correct parameters. Also improved embedding scaling logic to allow explicit initialization.

4. HyperSphereNorm improvements
- Refactored HyperSphereNorm in norm_variations.py to support a scale factor (hsnorm_scale), and clarified the logic for whether the radius is learned or fixed, improving flexibility for experiments.

5. Minor fixes and usability improvements
These changes collectively enable more systematic research into the effects of normalization on embedding layers, with a flexible, configurable setup for large-scale experimentation.
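
To make the config-to-CLI wiring described in item 2 concrete, here is a minimal sketch of how the WTE options could be declared as dataclass fields and mirrored as command-line flags. The field names come from the PR description; the EmbeddingNormConfig class, the types, the defaults, and the "hyperspherenorm" variant string are assumptions, and the real GPTConfig and train_args.py may differ.

```python
import argparse
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingNormConfig:
    """Hypothetical stand-in for the WTE-related fields added to GPTConfig."""
    norm_variant_wte: Optional[str] = None    # assumed: None means no post-embedding norm
    norm_wte_radius: Optional[float] = None   # assumed type
    norm_wte_scale: Optional[float] = None    # assumed type
    norm_wte_gain: bool = False               # assumed to be an on/off flag

def build_parser():
    """Expose each config field as a CLI flag with the same name."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--norm_variant_wte", type=str, default=None)
    parser.add_argument("--norm_wte_radius", type=float, default=None)
    parser.add_argument("--norm_wte_scale", type=float, default=None)
    parser.add_argument("--norm_wte_gain", action="store_true")
    return parser

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--norm_variant_wte", "hyperspherenorm", "--norm_wte_radius", "2.0"]
)
config = EmbeddingNormConfig(**vars(args))
print(config)
```

Keeping flag names identical to the config field names lets the parsed namespace be passed straight into the config object, which is one simple way to keep CLI and YAML settings aligned.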