Add exploration to compare wte and scale #670

klei22 · 2025-10-29T15:37:22Z

This pull request introduces new configuration options and experimental YAML files to enable fine-grained control and experimentation with normalization strategies for word token embeddings (WTE) and absolute positional embeddings in the model. The changes allow for flexible application of HyperSphereNorm (HSNorm) to these embeddings, with tunable parameters such as radius, scale, and gain, and provide infrastructure for running large sweeps and ablation studies on these normalization settings.

The most important changes are:

1. Experimental configuration for normalization sweeps and ablations

Added two new YAML files, norm_wte_abs_sweep.yaml and norm_wte_abs_embd_scale.yaml, which define comprehensive parameter sweeps over normalization variants, radius, scale, and gain for WTE and absolute position embeddings, including baseline and ablation runs. These files facilitate systematic experimentation. [1] [2]

2. Extended model and config to support post-embedding normalization

Updated GPTConfig in gpt_conf.py to include new attributes for WTE and absolute embedding normalization: norm_variant_wte, norm_wte_radius, norm_wte_scale, norm_wte_gain, norm_variant_abs, norm_abs_radius, norm_abs_scale, norm_abs_gain, and related parameters for HyperSphereNorm. [1] [2]
Updated argument parsing in train_args.py to expose these new configuration options as command-line arguments, allowing them to be set via CLI or YAML. [1] [2] [3]

3. Model logic for embedding normalization and scaling

Refactored model.py to apply the specified normalization (e.g., HSNorm) to WTE and absolute position embeddings, using a helper method to build normalization layers with the correct parameters. Also improved embedding scaling logic to allow explicit initialization. [1] [2] [3] [4] [5] [6]

4. HyperSphereNorm improvements

Enhanced HyperSphereNorm in norm_variations.py to support a scale factor (hsnorm_scale), and clarified the logic for whether the radius is learned or fixed, improving flexibility for experiments. [1] [2]

5. Minor fixes and usability improvements

Improved checkpointing logic and argument help text for better clarity and control over training checkpoints. [1] [2]

These changes collectively enable more systematic research into the effects of normalization on embedding layers, with a flexible, configurable setup for large-scale experimentation.
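
As a rough illustration of item 2, the sketch below shows one way the new fields could be declared and exposed on the command line. The field names are taken from the description above. The class name `EmbeddingNormConfig`, the types, the defaults, and the variant string `"hyperspherenorm"` are assumptions made for the example and may not match `gpt_conf.py` or `train_args.py`.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EmbeddingNormConfig:
    # Hypothetical stand-in for the new GPTConfig fields.
    # Names follow the PR description; types and defaults are assumed.
    norm_variant_wte: Optional[str] = None
    norm_wte_radius: Optional[float] = None
    norm_wte_scale: Optional[float] = None
    norm_wte_gain: bool = False
    norm_variant_abs: Optional[str] = None
    norm_abs_radius: Optional[float] = None
    norm_abs_scale: Optional[float] = None
    norm_abs_gain: bool = False


def build_parser() -> argparse.ArgumentParser:
    """Register one flag per config field for both embedding types."""
    parser = argparse.ArgumentParser()
    for prefix in ("wte", "abs"):
        parser.add_argument(f"--norm_variant_{prefix}", type=str, default=None)
        parser.add_argument(f"--norm_{prefix}_radius", type=float, default=None)
        parser.add_argument(f"--norm_{prefix}_scale", type=float, default=None)
        # Assumed to be an on/off switch; the real option may be typed differently.
        parser.add_argument(
            f"--norm_{prefix}_gain",
            action=argparse.BooleanOptionalAction,
            default=False,
        )
    return parser


def config_from_args(argv: List[str]) -> EmbeddingNormConfig:
    """Parse CLI arguments and copy the matching values into the config."""
    args = build_parser().parse_args(argv)
    names = {f.name for f in fields(EmbeddingNormConfig)}
    return EmbeddingNormConfig(**{k: v for k, v in vars(args).items() if k in names})


cfg = config_from_args(
    ["--norm_variant_wte", "hyperspherenorm", "--norm_wte_radius", "2.0", "--norm_wte_gain"]
)
```

Leaving every option at `None` or `False` by default keeps the baseline run unchanged unless a normalization variant is requested explicitly.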

Copilot

Pull Request Overview

This PR adds configurable normalization layers for word token embeddings (WTE) and absolute position embeddings (ABS), with support for HyperSphereNorm variants including gain and scale parameters. It refactors the embedding scale initialization and improves checkpoint saving behavior.

Key changes:

Introduces norm_variant_wte and norm_variant_abs with configurable radius, scale, and gain parameters
Refactors HyperSphereNorm to support a const_radius_factor scaling mechanism
Adds embedding_scale_init configuration option for custom initialization
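
For orientation, the computation behind the refactored HyperSphereNorm forward pass can be sketched in plain Python on a single vector. This is a minimal sketch that mirrors the formula in the review diff (`x / norm * radius * gain`, with `radius = const_radius_factor * radius_init_factor`). The function signature and default values are assumptions, and how `hsnorm_scale` maps onto `const_radius_factor` is not shown in this excerpt.

```python
import math
from typing import List, Sequence


def hypersphere_norm(
    x: Sequence[float],
    radius_init_factor: float,
    const_radius_factor: float = 1.0,  # assumed default
    gain: float = 1.0,                 # assumed default
) -> List[float]:
    """Project x onto a sphere of radius const_radius_factor * radius_init_factor,
    then multiply by gain. No epsilon guard is applied, so an all-zero x
    raises ZeroDivisionError."""
    l2 = math.sqrt(sum(v * v for v in x))
    radius = const_radius_factor * radius_init_factor
    return [v / l2 * radius * gain for v in x]


# The vector (3, 4) has L2 norm 5, so it maps to (0.6, 0.8) on a unit sphere.
out = hypersphere_norm([3.0, 4.0], radius_init_factor=2.0, const_radius_factor=0.5)
out_norm = math.sqrt(sum(v * v for v in out))
```

After the call, the output has L2 norm `radius * gain` regardless of the input magnitude, which is why radius, scale, and gain are the natural parameters to sweep.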

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File	Description
variations/norm_variations.py	Refactors HyperSphereNorm to support scale factor and updates forward to apply gain parameter
train_args.py	Adds CLI arguments for WTE/ABS norm configurations, embedding scale init, and hsnorm_scale parameter
train.py	Simplifies checkpoint saving logic by removing redundant never_save_checkpoint check
model.py	Moves post-embedding norm instantiation to transformer ModuleDict and reorders embedding operations
gpt_conf.py	Adds configuration fields for WTE/ABS norm parameters and embedding scale initialization
explorations/norm_wte_abs_sweep.yaml	Adds experiment configuration for normalization sweep experiments
explorations/norm_wte_abs_embd_scale.yaml	Adds experiment configuration with embedding scale variations
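
The two exploration files describe parameter sweeps. As a hedged illustration of what such a sweep amounts to, the sketch below expands a small grid into individual run configurations. The grid values, the variant string, and the helper `expand_grid` are hypothetical, and the actual YAML schema used by the repository is not shown in this excerpt.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of the listed values (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Hypothetical values chosen only to show the shape of a sweep.
grid = {
    "norm_variant_wte": ["hyperspherenorm"],
    "norm_wte_radius": [1.0, 2.0],
    "norm_wte_scale": [0.5, 1.0, 2.0],
}
runs = expand_grid(grid)
```

Each entry in `runs` corresponds to one training run, so the number of runs grows multiplicatively with every swept parameter.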

model.py

Copilot · 2025-10-29T15:39:42Z

variations/norm_variations.py

+        radius = self.const_radius_factor * self.radius_init_factor
        hypersphere_norm = x.norm(2, dim=-1, keepdim=True)
-        return  x / hypersphere_norm * self.radius
+        return  x / hypersphere_norm * radius * self.gain


Extra space in 'return  x'; it should be 'return x' (single space).

Suggested change

-        return  x / hypersphere_norm * radius * self.gain
+        return x / hypersphere_norm * radius * self.gain

Co-authored-by: Copilot <[email protected]>

Add exploration to compare wte and scale

21ef961

klei22 requested review from Copilot and gkielian October 29, 2025 15:37

Copilot AI reviewed Oct 29, 2025

gkielian and others added 2 commits November 12, 2025 14:15

Apply suggestion from @Copilot

0c36681

Co-authored-by: Copilot <[email protected]>

Apply suggestion from @Copilot

68476ce

Co-authored-by: Copilot <[email protected]>
