
Conversation

@klei22 (Collaborator) commented Oct 19, 2025

The First MLP is Free

This pull request introduces a new experiment configuration file to explore the impact of varying the MLP size in the first transformer block when using identity attention. The configuration sets up a sweep over several first-layer MLP widths, while keeping the rest of the model parameters constant.

While wte and lm_head still share parameters during training, the longer-term aim is to replace the wte and the first MLP with a single lookup table at inference time.

This sweep is intended to scope whether adding an MLP (or any stateless module) in the first block provides a meaningful improvement for inference.
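Because the first block uses identity attention (no token mixing) and the MLP is stateless, the composition of the embedding and the first-block MLP is a fixed function of the token id, so it can in principle be precomputed offline. A minimal PyTorch sketch of that fold, assuming a plain residual MLP and ignoring positional embeddings and layer norm (module and size names here are illustrative, not the repo's actual API):

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real values come from the experiment config.
vocab_size, n_embd, first_mlp_size = 50304, 384, 1024

wte = nn.Embedding(vocab_size, n_embd)
first_mlp = nn.Sequential(          # stateless: no attention state, no KV cache
    nn.Linear(n_embd, first_mlp_size),
    nn.GELU(),
    nn.Linear(first_mlp_size, n_embd),
)

# Offline fold: precompute "embedding + MLP(embedding)" for every token id once.
with torch.no_grad():
    folded = wte.weight + first_mlp(wte.weight)
lookup = nn.Embedding.from_pretrained(folded, freeze=True)

# At inference, a single table lookup replaces wte plus the first-block MLP.
tokens = torch.randint(0, vocab_size, (1, 8))
expected = wte(tokens) + first_mlp(wte(tokens))
assert torch.allclose(lookup(tokens), expected, atol=1e-5)
```

In this sketch the folded table has the same vocab_size x n_embd shape as wte, so the saving at inference is the first MLP's per-token compute rather than parameter count.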

Experiment setup and parameter sweep:

  • Added explorations/identity_first_layer_mlp_sweep.yaml to define a sweep that varies the first-layer MLP size in a 4-layer transformer, with the first block using identity attention and subsequent blocks using causal attention.
  • Configured the sweep to test five different first-layer MLP sizes (512, 1024, 1536, 2048, 2560), while keeping other layers at the default size of 2048.
  • Set shared base hyperparameters for all runs, including block_size, n_layer, n_head, n_embd, dataset, device, dtype, and compilation settings.
  • Specified per-layer attention variants, with only the first block using "identity" and the rest using "causal" (a sketch of the resulting per-run settings follows this list).
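As a rough sketch of how the five runs relate to one another (key names are illustrative; the actual sweep definition lives in explorations/identity_first_layer_mlp_sweep.yaml):

```python
# Minimal sketch of the sweep expansion: only the first block's MLP width
# varies, everything else stays at the shared base settings.
N_LAYER = 4
DEFAULT_MLP_SIZE = 2048
FIRST_LAYER_MLP_SIZES = [512, 1024, 1536, 2048, 2560]

runs = []
for first_mlp in FIRST_LAYER_MLP_SIZES:
    runs.append({
        "n_layer": N_LAYER,
        # first block's MLP width is swept; the rest stay at the default size
        "mlp_size_per_layer": [first_mlp] + [DEFAULT_MLP_SIZE] * (N_LAYER - 1),
        # identity attention in block 0, causal attention in blocks 1-3
        "attn_variant_per_layer": ["identity"] + ["causal"] * (N_LAYER - 1),
    })

for cfg in runs:
    print(cfg["mlp_size_per_layer"], cfg["attn_variant_per_layer"])
```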

@klei22 requested review from Copilot and gkielian, October 19, 2025 05:46

Copilot AI left a comment

Pull Request Overview

This PR introduces an experimental configuration to evaluate the performance impact of varying MLP sizes in the first transformer block when using identity attention. The goal is to determine whether adding an MLP (or stateless module) provides meaningful improvements during inference, potentially enabling replacement of the word token embedding (wte) and first MLP with a lookup table.

  • Adds a YAML configuration sweep for first-layer MLP size variations (512 to 2560)
  • Configures a 4-layer transformer with identity attention only in the first block
  • Sets up shared hyperparameters across all sweep runs
