DySCO is a training-free decoding algorithm that improves long-context reasoning for off-the-shelf LMs. At each decoding step, DySCO uses retrieval heads (QRHeads specifically) to identify task-relevant tokens in the context and explicitly up-weights them, dynamically adjusting attention during generation to better utilize relevant context.
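The up-weighting idea described above can be illustrated with a toy sketch. It is not DySCO's implementation: the function names, the choice of adding `log(strength)` to attention logits, and the toy scores are all assumptions made for illustration, with `relevance` standing in for scores aggregated from retrieval heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def upweight_relevant(attn_logits, relevance, top_k=2, strength=2.0):
    """Boost the top_k most relevant context positions, then renormalize.

    Assumption: boosting is done by adding log(strength) to the selected
    logits, which multiplies their unnormalized attention weight by strength.
    """
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    selected = set(ranked[:top_k])
    bias = math.log(strength)
    rescaled = [x + bias if i in selected else x for i, x in enumerate(attn_logits)]
    return softmax(rescaled), sorted(selected)

# Toy example: four context tokens with uniform attention logits.
attn_logits = [0.0, 0.0, 0.0, 0.0]
relevance = [0.05, 0.60, 0.10, 0.25]  # hypothetical retrieval-head scores
weights, selected = upweight_relevant(attn_logits, relevance, top_k=2, strength=2.0)
```

With uniform logits, the two selected positions end up with twice the attention mass of the others, which is the effect the `strength` parameter is meant to convey in this sketch.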
Please install the following packages:
- torch (tested with 2.6.0)
- transformers (tested with 4.57.3)
- flash_attn (tested with 2.8.3)
Some other dependencies: numpy, datasets, tqdm, pyyaml.
1. Download evaluation data
sh scripts/setup/prepare_data.sh
This downloads the evaluation data to data_eval/.
2. Prepare models
Please create a models/ directory with symlinks to your model checkpoints, so that scripts and examples can refer to them by a fixed path:
mkdir -p models
ln -s /path/to/Qwen3-8B models/Qwen3-8B
For Qwen models, we create YaRN versions by symlinking the model weights and only modifying config.json to enable YaRN rope scaling. See Qwen3-8B for YaRN instructions.
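As a rough sketch of the symlink-plus-config approach, the helper below links every file except config.json and writes a modified copy of the config. The helper name `make_yarn_variant`, the `Qwen3-8B-yarn` directory name, and the default `factor` and `original_max_len` values are assumptions for illustration; the `rope_scaling` keys follow the convention we understand the Qwen3 model card to use, so please check them against the linked YaRN instructions.

```python
import json
import os
import tempfile

def make_yarn_variant(src_dir, dst_dir, factor=4.0, original_max_len=32768):
    """Create dst_dir with symlinked weights and a YaRN-enabled config.json."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if name == "config.json":
            continue  # the config is rewritten below, not linked
        os.symlink(
            os.path.abspath(os.path.join(src_dir, name)),
            os.path.join(dst_dir, name),
        )
    with open(os.path.join(src_dir, "config.json")) as f:
        config = json.load(f)
    # Assumed key names and example values; verify before use.
    config["rope_scaling"] = {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max_len,
    }
    with open(os.path.join(dst_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    return config

# Demonstration on a fake checkpoint inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "Qwen3-8B")
    dst = os.path.join(tmp, "Qwen3-8B-yarn")  # hypothetical variant name
    os.makedirs(src)
    with open(os.path.join(src, "config.json"), "w") as f:
        json.dump({"model_type": "qwen3"}, f)
    open(os.path.join(src, "model.safetensors"), "w").close()
    yarn_config = make_yarn_variant(src, dst)
    weights_are_linked = os.path.islink(os.path.join(dst, "model.safetensors"))
```

Because only config.json is copied, the variant costs almost no extra disk space and the original checkpoint stays untouched.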
3. Quick check
Run a quick sanity check with 20 samples on path_walking_16k using Qwen3-8B:
sh scripts/setup/quick_check.sh
This runs both vanilla (Flash Attention) and DySCO generation. Expected results:
| Method | Accuracy |
|---|---|
| Vanilla | ~25% |
| DySCO | ~30% |
We provide a standalone example in dysco_inference_example.py:
import json, yaml, torch
from transformers import AutoTokenizer
from dysco.custom_modeling_qwen3 import RescaleQwen3ForCausalLM
from dysco.custom_mixin import RescaleConfig
# load model
tokenizer = AutoTokenizer.from_pretrained("models/Qwen3-8B")
model = RescaleQwen3ForCausalLM.from_pretrained(
    "models/Qwen3-8B",
    attn_implementation="flash_attention_2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# build RescaleConfig from yaml
with open("dysco_cfgs/qwen3_8b.yaml") as f:
    cfg = yaml.safe_load(f)
selected_heads = eval(cfg["selected_heads"])
rescale_config = RescaleConfig(
    selected_heads=selected_heads,
    top_k=cfg["top_k"], top_p=cfg["top_p"],
    strength=cfg["strength"], decay_factor=cfg["decay_factor"],
)
# tokenize
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=True, add_generation_prompt=True,
    return_tensors="pt", enable_thinking=False,
).to(model.device)
# generate with DySCO
generated_ids, logging_info = model.rescale_generate(
    input_ids,
    rescale_config=rescale_config,
    max_new_tokens=512,
    temperature=0.0,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
)
# decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(generated_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
Pre-built DySCO configs with detected QRHeads are provided in dysco_cfgs/ for each supported model.
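The example above reads `selected_heads` from the YAML config as a string and evaluates it. If you load configs from a source you do not fully trust, a stricter parser may be preferable. The sketch below is one possible approach; the helper name `parse_selected_heads` and the assumed format (a list of `(layer, head)` integer pairs) are hypothetical and should be checked against the files in dysco_cfgs/.

```python
import ast

def parse_selected_heads(raw):
    """Parse a head list such as "[(12, 3), (17, 9)]" without running code.

    Assumption: heads are stored as (layer, head) integer pairs.
    """
    # literal_eval accepts only Python literals, so expressions like
    # function calls raise ValueError instead of being executed.
    heads = ast.literal_eval(raw) if isinstance(raw, str) else raw
    parsed = []
    for item in heads:
        layer, head = item
        if not (isinstance(layer, int) and isinstance(head, int)):
            raise ValueError(f"expected integer (layer, head) pair, got {item!r}")
        parsed.append((layer, head))
    return parsed

# Hypothetical head indices, for illustration only.
example_heads = parse_selected_heads("[(12, 3), (17, 9)]")

# A non-literal expression is rejected rather than evaluated.
try:
    parse_selected_heads("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

The parsed list can then be passed as `selected_heads` in place of the `eval` result, assuming the config format matches.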
[ ] Experiment scripts.
Please cite our paper and the backbone QRHead if you find DySCO useful:
@article{ye2026dysco,
title={DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs},
author={Xi Ye and Wuwei Zhang and Fangcong Yin and Howard Yen and Danqi Chen},
year={2026},
eprint={2602.22175},
archivePrefix={arXiv},
primaryClass={cs.CL},
}
@inproceedings{zhang25qrhead,
title={Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking},
author={Wuwei Zhang and Fangcong Yin and Howard Yen and Danqi Chen and Xi Ye},
booktitle={Proceedings of EMNLP},
year={2025}
}