Conversation


@klei22 klei22 commented Oct 29, 2025

This pull request introduces new configuration options and experimental YAML files for fine-grained control over normalization of word token embeddings (WTE) and absolute positional embeddings. The changes allow HyperSphereNorm (HSNorm) to be applied flexibly to these embeddings, with tunable radius, scale, and gain parameters, and add infrastructure for running large sweeps and ablation studies on these normalization settings.

The most important changes are:

1. Experimental configuration for normalization sweeps and ablations

  • Added two new YAML files, norm_wte_abs_sweep.yaml and norm_wte_abs_embd_scale.yaml, which define comprehensive parameter sweeps over normalization variants, radius, scale, and gain for WTE and absolute position embeddings, including baseline and ablation runs. These files facilitate systematic experimentation.
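For illustration, one sweep group in such a file might look roughly like the fragment below. The field names and schema here are hypothetical and may not match the repository's actual sweep format; only the parameter names come from this PR.

```yaml
# Hypothetical sketch of a normalization sweep entry; the real schema may differ.
parameter_groups:
  - norm_variant_wte: ["hyperspherenorm", "none"]   # "none" serves as the baseline/ablation
    norm_wte_radius: [0.5, 1.0, 2.0]
    norm_wte_scale: [0.5, 1.0]
    norm_wte_gain: [1.0]
    norm_variant_abs: ["hyperspherenorm", "none"]
    norm_abs_radius: [0.5, 1.0, 2.0]
```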

2. Extended model and config to support post-embedding normalization

  • Updated GPTConfig in gpt_conf.py to include new attributes for WTE and absolute embedding normalization: norm_variant_wte, norm_wte_radius, norm_wte_scale, norm_wte_gain, norm_variant_abs, norm_abs_radius, norm_abs_scale, norm_abs_gain, and related parameters for HyperSphereNorm.
  • Updated argument parsing in train_args.py to expose these new configuration options as command-line arguments, allowing them to be set via CLI or YAML.
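Concretely, exposing such options via argparse might look like the sketch below. The flag names mirror the config attributes listed above, but the defaults, types, and help text are assumptions rather than the repository's actual code.

```python
import argparse

def add_embedding_norm_args(parser):
    # WTE (word token embedding) normalization options -- defaults are assumed.
    parser.add_argument("--norm_variant_wte", type=str, default="none",
                        help="normalization applied after the WTE lookup, e.g. 'hyperspherenorm'")
    parser.add_argument("--norm_wte_radius", type=float, default=1.0)
    parser.add_argument("--norm_wte_scale", type=float, default=1.0)
    parser.add_argument("--norm_wte_gain", type=float, default=1.0)
    # Absolute position embedding normalization options
    parser.add_argument("--norm_variant_abs", type=str, default="none",
                        help="normalization applied after the absolute position embedding lookup")
    parser.add_argument("--norm_abs_radius", type=float, default=1.0)
    parser.add_argument("--norm_abs_scale", type=float, default=1.0)
    parser.add_argument("--norm_abs_gain", type=float, default=1.0)
    return parser

parser = add_embedding_norm_args(argparse.ArgumentParser())
args = parser.parse_args(["--norm_variant_wte", "hyperspherenorm",
                          "--norm_wte_radius", "2.0"])
```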

3. Model logic for embedding normalization and scaling

  • Refactored model.py to apply the specified normalization (e.g., HSNorm) to WTE and absolute position embeddings, using a helper method to build normalization layers with the correct parameters. Also improved embedding scaling logic to allow explicit initialization.
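A minimal sketch of that flow is shown below, with an illustrative build_norm helper; the helper name, config keys, and class structure are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

def build_norm(variant, dim, radius=1.0, scale=1.0):
    # Illustrative factory: return an identity when no norm is requested.
    if variant == "none":
        return nn.Identity()
    if variant == "l2":
        # Stand-in for a hypersphere-style norm: L2-normalize, then rescale.
        class L2Norm(nn.Module):
            def forward(self, x):
                n = x.norm(2, dim=-1, keepdim=True).clamp_min(1e-8)
                return x / n * radius * scale
        return L2Norm()
    raise ValueError(f"unknown norm variant: {variant}")

class EmbeddingsSketch(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, cfg):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.wpe = nn.Embedding(block_size, n_embd)
        # Post-embedding norms built from the configured variants.
        self.norm_wte = build_norm(cfg["norm_variant_wte"], n_embd)
        self.norm_abs = build_norm(cfg["norm_variant_abs"], n_embd)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        tok_emb = self.norm_wte(self.wte(idx))   # normalize token embeddings
        pos_emb = self.norm_abs(self.wpe(pos))   # normalize absolute position embeddings
        return tok_emb + pos_emb
```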

4. HyperSphereNorm improvements

  • Enhanced HyperSphereNorm in norm_variations.py to support a scale factor (hsnorm_scale), and clarified the logic for whether the radius is learned or fixed, improving flexibility for experiments.
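The described behavior can be sketched as follows. This is a minimal illustration under the description above; the class body, the learned_radius switch, and the eps guard are assumptions rather than the repository's actual code.

```python
import torch
import torch.nn as nn

class HyperSphereNormSketch(nn.Module):
    """Project activations onto a hypersphere of a target radius.

    Sketch only: the radius may be a learned parameter or a fixed
    constant, and extra scale/gain factors multiply the result.
    """
    def __init__(self, radius=1.0, scale=1.0, gain=1.0,
                 learned_radius=True, eps=1e-8):
        super().__init__()
        if learned_radius:
            self.radius = nn.Parameter(torch.tensor(float(radius)))
        else:
            # Fixed radius: store as a buffer so it is not trained.
            self.register_buffer("radius", torch.tensor(float(radius)))
        self.scale = scale
        self.gain = gain
        self.eps = eps

    def forward(self, x):
        # L2-normalize along the last dim, then rescale to the target radius.
        norm = x.norm(2, dim=-1, keepdim=True).clamp_min(self.eps)
        return x / norm * self.radius * self.scale * self.gain
```

With a fixed radius, every output vector then has L2 norm equal to radius * scale * gain regardless of the input magnitude.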

5. Minor fixes and usability improvements

  • Improved checkpointing logic and argument help text for better clarity and control over training checkpoints.

These changes collectively enable more systematic research into the effects of normalization on embedding layers, with a flexible, configurable setup for large-scale experimentation.

@klei22 klei22 requested review from Copilot and gkielian October 29, 2025 15:37

Copilot AI left a comment


Pull Request Overview

This PR adds configurable normalization layers for word token embeddings (WTE) and absolute position embeddings (ABS), with support for HyperSphereNorm variants including gain and scale parameters. It refactors the embedding scale initialization and improves checkpoint saving behavior.

Key changes:

  • Introduces norm_variant_wte and norm_variant_abs with configurable radius, scale, and gain parameters
  • Refactors HyperSphereNorm to support a const_radius_factor scaling mechanism
  • Adds embedding_scale_init configuration option for custom initialization

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Summary per file:

  • variations/norm_variations.py: Refactors HyperSphereNorm to support a scale factor and updates forward to apply the gain parameter
  • train_args.py: Adds CLI arguments for WTE/ABS norm configurations, embedding scale init, and the hsnorm_scale parameter
  • train.py: Simplifies checkpoint saving logic by removing a redundant never_save_checkpoint check
  • model.py: Moves post-embedding norm instantiation into the transformer ModuleDict and reorders embedding operations
  • gpt_conf.py: Adds configuration fields for WTE/ABS norm parameters and embedding scale initialization
  • explorations/norm_wte_abs_sweep.yaml: Adds experiment configuration for normalization sweep experiments
  • explorations/norm_wte_abs_embd_scale.yaml: Adds experiment configuration with embedding scale variations


  radius = self.const_radius_factor * self.radius_init_factor
  hypersphere_norm = x.norm(2, dim=-1, keepdim=True)
- return x / hypersphere_norm * self.radius
+ return x  / hypersphere_norm * radius * self.gain

Copilot AI Oct 29, 2025


Extra space in 'return x  /' should be 'return x /' (single space).

Suggested change
- return x  / hypersphere_norm * radius * self.gain
+ return x / hypersphere_norm * radius * self.gain

