The Internal Tribunal - Phase 1: Calibration

Finding the Implicit Value Function direction in DeepSeek-R1's activation space.

Hypothesis

Reasoning models like DeepSeek-R1 maintain an internal "Implicit Value Function" — a direction in activation space that tracks whether the current Chain of Thought (CoT) is leading toward a correct answer.

Methodology

1. Generate CoT Traces: Run the model on GSM8K math problems with high temperature to get a mix of correct and incorrect reasoning trajectories.
2. Label Trajectories: Parse final answers and compare to ground truth to label each trace as "Correct" or "Incorrect".
3. Extract Activations: Collect residual stream activations from middle layers (12, 16, 20, 24).
4. Train Linear Probe: Fit logistic regression to distinguish correct vs incorrect trajectories.
5. Evaluate: If probe accuracy > 70%, we have evidence of an implicit value signal.
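
The labeling step can be sketched as follows. This is a minimal illustration, not the parser in calibration_phase.py: GSM8K reference solutions end with a line of the form "#### 72", but the "last number in the trace" heuristic and all function names here are assumptions. A real parser may prefer an explicitly marked final answer if the model emits one.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,000.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def gsm8k_reference(answer_field: str) -> float:
    """Read the number after the '####' marker that ends a GSM8K solution."""
    return float(answer_field.split("####")[-1].strip().replace(",", ""))

def extract_final_number(trace: str):
    """Assumed heuristic: treat the last number in the trace as the answer."""
    matches = NUMBER.findall(trace)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def label_trace(trace: str, answer_field: str) -> str:
    """Label a trace by comparing its final number with the reference."""
    predicted = extract_final_number(trace)
    if predicted is not None and abs(predicted - gsm8k_reference(answer_field)) < 1e-6:
        return "Correct"
    return "Incorrect"
```

A trace with no parseable number is labeled "Incorrect" in this sketch; dropping such traces instead is an equally defensible choice.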

Files

calibration_phase.py - Main experiment script
run_calibration.sbatch - SLURM job submission script for WashU cluster
setup_cluster.sh - One-time setup script for the cluster environment
requirements.txt - Python dependencies

Running Locally (if you have 4x GPUs)

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run calibration
python calibration_phase.py

Running on WashU Cluster

# SSH to cluster
ssh washu-cluster

# Navigate to your project area
cd /project/scratch01/compiling/a.a.baggio/

# Clone/copy project files
mkdir -p internal_tribunal
cd internal_tribunal
# (copy files here)

# Run setup (once)
bash setup_cluster.sh

# Submit job
sbatch run_calibration.sbatch

# Monitor
squeue -u $USER
tail -f logs/calibration-*.out

Outputs

After successful completion:

results/probe_results.png - Visualization of probe accuracy and logit separation
results/implicit_value_directions.json - The discovered direction vectors (the "Implicit Value Function")
checkpoints/ - Intermediate state for job resumption
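
How a direction vector falls out of a linear probe can be sketched as below. This assumes, without confirmation from the script, that each stored direction is the unit-normalized weight vector of a layer's logistic-regression probe. The data is synthetic, the gradient-descent fit stands in for whatever solver the script uses, and every name is hypothetical.

```python
import math
import random

def make_toy_activations(n_per_class=50, dim=4, seed=0):
    """Synthetic stand-in for pooled activations: classes differ only along axis 0."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, -1.5), (1, 1.5)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
            x[0] += center
            X.append(x)
            y.append(label)
    return X, y

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_probe(X, y, lr=0.5, epochs=200):
    """Logistic regression fitted by plain batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = max(-30.0, min(30.0, logit(x, w, b)))  # clamp so exp() stays finite
            err = 1.0 / (1.0 + math.exp(-z)) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def unit_direction(w):
    """Normalize the probe weights to a unit-length direction."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def accuracy(X, y, w, b):
    """Fraction of examples whose logit sign matches the label."""
    return sum((logit(x, w, b) > 0) == (t == 1) for x, t in zip(X, y)) / len(X)

X, y = make_toy_activations()
w, b = train_probe(X, y)
direction = unit_direction(w)
```

Projecting a new activation onto `direction` (a dot product) then gives a scalar score. Note that this toy scores the probe on its own training data; a real evaluation needs a held-out split.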

Checkpointing & Job Chaining

The script handles SLURM's 4-hour time limit through:

1. Signal handling: Catches SIGUSR1 sent 5 minutes before timeout
2. Checkpointing: Saves all state to disk
3. Self-resubmission: Sets a flag that triggers job resubmission
4. Resume: New job loads checkpoint and continues from where it left off
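
The signal-then-checkpoint pattern can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the code in calibration_phase.py: the class name, the JSON state file, and the "resubmit" return value are all hypothetical. The handler only sets a flag, and the main loop saves at a safe point.

```python
import json
import os
import signal
from pathlib import Path

class CheckpointOnSignal:
    """Sets a flag on SIGUSR1 and persists state as JSON (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self.stop_requested = False
        signal.signal(signal.SIGUSR1, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: just record that a stop was requested.
        self.stop_requested = True

    def save(self, state):
        # Write to a temporary file, then rename, so a kill mid-write
        # cannot leave a truncated checkpoint behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, self.path)

    def load(self, default):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return default

def run(ckpt, total_steps):
    """Resume from the checkpoint; stop early if a signal arrived."""
    state = ckpt.load({"step": 0})
    while state["step"] < total_steps:
        if ckpt.stop_requested:
            ckpt.save(state)
            return "resubmit"  # the caller would trigger resubmission here
        state["step"] += 1     # placeholder for one unit of real work
    ckpt.save(state)
    return "done"
```

On the SLURM side, a signal ahead of the time limit is typically requested with sbatch's --signal option (for example `--signal=B:USR1@300` for 300 seconds of warning); whether run_calibration.sbatch uses exactly this form is an assumption.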

Key Parameters

In calibration_phase.py:

TARGET_CORRECT = 50       # Number of correct traces to collect
TARGET_INCORRECT = 50     # Number of incorrect traces to collect
MAX_ATTEMPTS = 200        # Max problems to try
TEMPERATURE = 1.0         # High temp encourages mistakes
TARGET_LAYERS = [12, 16, 20, 24]  # Layers to probe
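
How the first three parameters might interact is sketched below. The loop structure and the `generate_and_label` callback are assumptions for illustration; the actual script may sample several traces per problem or stop differently.

```python
TARGET_CORRECT = 50
TARGET_INCORRECT = 50
MAX_ATTEMPTS = 200

def collect_traces(problems, generate_and_label):
    """Fill two buckets until both targets are met or attempts run out.

    generate_and_label is a hypothetical callback that returns
    (trace, is_correct) for one problem.
    """
    correct, incorrect = [], []
    for attempt, problem in enumerate(problems):
        if attempt >= MAX_ATTEMPTS:
            break
        if len(correct) >= TARGET_CORRECT and len(incorrect) >= TARGET_INCORRECT:
            break
        trace, is_correct = generate_and_label(problem)
        if is_correct:
            bucket, target = correct, TARGET_CORRECT
        else:
            bucket, target = incorrect, TARGET_INCORRECT
        if len(bucket) < target:  # drop surplus traces to keep classes balanced
            bucket.append(trace)
    return correct, incorrect
```

Under the one-trace-per-attempt assumption, filling both buckets within 200 attempts requires the model's error rate to fall between 25% and 75% (50 of 200). If the model is more accurate than that at the chosen temperature, the incorrect bucket will come up short.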

Expected Results

If the hypothesis is correct, we expect:

Probe accuracy significantly above chance (50%, given equal numbers of correct and incorrect traces)
Accuracy > 70% indicates a strong implicit value signal
The learned direction vector can be used in Phase 2 for runtime monitoring
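
Whether an accuracy is "significantly above chance" depends on how many held-out traces it was measured on. One way to check, offered as a sketch and not as the script's own evaluation, is a one-sided exact binomial test. The function name and the example sample sizes are assumptions.

```python
from math import comb

def p_value_above_chance(n_correct, n_total, chance=0.5):
    """One-sided exact binomial test: P(X >= n_correct) if the probe were guessing."""
    return sum(
        comb(n_total, k) * chance**k * (1 - chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )

# 70% accuracy on a 20-trace held-out set versus a 100-trace one.
p_small = p_value_above_chance(14, 20)
p_large = p_value_above_chance(70, 100)
```

With 14 of 20 correct, the p-value is about 0.058, so 70% accuracy on a small held-out split does not clear a 0.05 threshold. The same accuracy on 100 traces is far below it. Cross-validation or a larger trace set makes the 70% criterion more meaningful.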

Troubleshooting

Out of Memory: The script handles OOM errors gracefully, but if they persist:

Reduce max_new_tokens in trace generation
Use fewer target layers
Enable more aggressive truncation

Model Loading Fails: The script tries multiple loading strategies:

Native transformer_lens multi-GPU
Fallback to single-GPU loading

Probe Accuracy Low: Consider:

Collecting more data (increase targets)
Different aggregation ("last" instead of "mean")
Different layers (try earlier or later)
Per-token probing instead of trajectory-level
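
The difference between the "mean" and "last" aggregation options can be illustrated with plain lists. This is a sketch only: the script presumably works on tensors, and the function name and signature here are assumptions.

```python
def aggregate(token_activations, mode="mean"):
    """Collapse per-token activation vectors into one vector per trajectory.

    token_activations: list of equal-length lists of floats, one per token.
    """
    if not token_activations:
        raise ValueError("need at least one token")
    if mode == "last":
        # Only the final token's state: sharper, but ignores earlier reasoning.
        return list(token_activations[-1])
    if mode == "mean":
        # Average over tokens: smoother, but dilutes late-appearing signal.
        n = len(token_activations)
        return [sum(column) / n for column in zip(*token_activations)]
    raise ValueError(f"unknown mode: {mode}")
```

Per-token probing skips this step entirely and trains on each token's vector with the trajectory's label, which yields many more (though correlated) training examples.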
