This is the official repository for the paper "Skill-Targeted Adaptive Training".
Authors: Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora.
Blog Post | Arxiv | Twitter | Connect with authors
🚨 We introduce Skill-Targeted Adaptive Training (STAT), which uses a supervisor model and a skill catalog to construct a Missing-Skill-Profile for each student model, and then modifies training to squeeze out ≥7% more performance! The intervention can be as simple as reweighting existing training sets. You can also think of this as a more effective distillation method.
- Part 1: Skill-targeted training data
- Part 2: Model training code
- Part 3: Training data creation code
We recommend using uv for fast and reliable dependency management.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

This will automatically create a virtual environment and install all dependencies. To activate the environment:
```bash
source .venv/bin/activate   # On Unix/macOS
# or
.venv\Scripts\activate      # On Windows
```

We conducted adaptive training data selection for three models: Llama-3.2-3B-Instruct, Llama-3.2-1B-Instruct, and Qwen2.5-3B.
The model-specific training data are provided under STAT_data/. Each dataset contains roughly 4k unique questions and 9.5k QA pairs.
We create two sets of training data (STAT-Sel/ and STAT-Syn/) for each model using two method variants:
STAT-Sel:

- We begin by filtering 500 difficult questions from the validation set using our process reward model. For each such question, the teacher model identifies 2–3 missing skills in the student's response.
- We then create the training set by selecting 5 questions for each missing skill in the question's Missing-Skill-Profile.
- We use 3 answers for each question and randomly sample a subset of 9.5k question–answer pairs as our training set.
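The selection steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name, the `missing_skill_profiles` / `skill_to_questions` data shapes, and the index-based answer placeholders are all assumptions made for the sketch.

```python
import random

def build_stat_sel(missing_skill_profiles, skill_to_questions,
                   answers_per_question=3, target_pairs=9500,
                   questions_per_skill=5, seed=0):
    """Hypothetical sketch of STAT-Sel data construction.

    missing_skill_profiles: maps each difficult question to its missing skills.
    skill_to_questions: maps each skill to candidate training questions.
    """
    rng = random.Random(seed)
    selected = set()
    # Select up to `questions_per_skill` questions per missing skill.
    for skills in missing_skill_profiles.values():
        for skill in skills:
            pool = skill_to_questions.get(skill, [])
            k = min(questions_per_skill, len(pool))
            selected.update(rng.sample(pool, k))
    # Pair each selected question with its answers (represented here
    # schematically by an answer index), then sample a fixed-size subset.
    pairs = [(q, i) for q in sorted(selected)
             for i in range(answers_per_question)]
    rng.shuffle(pairs)
    return pairs[:target_pairs]
```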
STAT-Syn:

- We begin by filtering 500 difficult questions from the validation set using our process reward model. For each such question, the teacher model identifies 2–3 missing skills in the student's response.
- For each pair of (difficult_question, missing_skill), we retrieve 3 questions from the MATH training set. We input these 3 questions, along with the `missing_skill`, to the teacher model, prompting it to synthesize 2 new questions. The teacher further generates 3 solutions for each new question.
- We then filter the newly synthesized data by:
  a. Computing a consistency score for each set of (new_question, solution) pairs, equal to the number of solutions agreeing on the final answer. For example, a new question with 2 solutions agreeing on the final answer has a consistency score of 2.
  b. Keeping only the `new_question` entries with a consistency score of ≥ 2.
  c. For each filtered question, keeping only the `solution` entries that agree on the final answer.
This process enables our approach to generate diverse data, as we input 3 questions to the teacher model as references each time. The consistency-filtering step filters out both invalid questions and solutions, ensuring the quality of STAT-Syn.
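The consistency-filtering step can be sketched as follows. This is an illustrative reading of the description above, not the repository's implementation: the function name and the `candidates` data shape are assumptions, and we interpret "solutions agreeing on the final answer" as the majority answer among a question's solutions.

```python
from collections import Counter

def consistency_filter(candidates, min_score=2):
    """Hypothetical sketch of STAT-Syn consistency filtering.

    candidates: maps each new question to a list of
    (solution_text, final_answer) tuples. A question's consistency score
    is the number of solutions sharing the majority final answer.
    """
    kept = {}
    for question, solutions in candidates.items():
        answer_counts = Counter(ans for _, ans in solutions)
        majority_answer, score = answer_counts.most_common(1)[0]
        if score >= min_score:
            # Keep only solutions agreeing on the majority final answer.
            kept[question] = [s for s, ans in solutions
                              if ans == majority_answer]
    return kept
```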
To fine-tune a model on STAT data, run the corresponding training script:
```bash
# For Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct
bash scripts/run_sft_llama_instruct.sh

# For Qwen2.5-3B
bash scripts/run_sft_qwen_base.sh
```

You can modify the `DATA_NAME` variable in the scripts to use either the STAT-Sel or STAT-Syn dataset.
To evaluate a fine-tuned model:
```bash
bash scripts/eval_sft.sh
```

Configure the evaluation by editing the script variables:
- `BASE_MODEL_PATH`: the base model to evaluate
- `TRAIN_DATA_NAME`: which training data was used (`STAT-Sel` or `STAT-Syn`)
- `TEST_DATA_NAME`: test dataset (`math500`, `math2`, `gsm8k`, `math_perturb_simple`, `math_perturb_hard`, `amc23`, or `aime`)
If you have any questions on the code or the paper, feel free to email Yinghui (yh0068@princeton.edu). We welcome all kinds of constructive discussions!
If you find our work useful, please consider citing it! 🤗
@article{he2025skilltargetedadaptivetraining,
title={Skill-Targeted Adaptive Training},
author={Yinghui He and Abhishek Panigrahi and Yong Lin and Sanjeev Arora},
journal={arXiv preprint arXiv:2510.10023},
year={2025},
url={https://arxiv.org/abs/2510.10023},
}