This directory keeps a small local runner around the upstream ContextBench repo.

- contextbench_official_repo/: upstream ContextBench code and data.
- scripts/*.py: local preparation, run, and evaluation scripts.
- mitmproxy_addons/trace_recorder.py: HTTP trace recorder used while Claude runs.
- requirements-run.txt: extra Python dependencies for these local scripts.

Generated directories such as .venv/, .mitmproxy-venv/, traces/,
logs/, scripts/contextbench_work_dir_*, and scripts/contextbench_eval_repos/
can be deleted and regenerated.
Run from this directory:
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r contextbench_official_repo/requirements.txt
pip install -r requirements-run.txt

Install LEANN:
uv tool install leann-core --with leann

Install mitmdump in a separate environment:
python3.11 -m venv .mitmproxy-venv
.mitmproxy-venv/bin/python -m pip install mitmproxy

The run script also expects:
- claude CLI available on PATH.
- Node/npm available for npx ccusage.
- A Claude login session or ANTHROPIC_API_KEY in the environment.
- If using LEANN MCP mode, a Claude MCP server named leann-server, or LEANN_MCP_SERVER/CLAUDE_MCP_CONFIG_PATH configured accordingly.

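The expectations listed above can be checked before starting a long run. The following is a minimal sketch, not part of this repo: the function name check_prerequisites is hypothetical, and it only covers what can be detected from the environment (a Claude login session and the MCP server configuration are not checked).

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: `env` and `which` are injectable only to make testing easy.
    """
    env = os.environ if env is None else env
    problems = []

    # The run script expects the claude CLI and npx (for ccusage) on PATH.
    for tool in ("claude", "npx"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # A Claude login session also satisfies this requirement,
    # but this sketch has no way to detect one.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append(
            "ANTHROPIC_API_KEY not set (fine if you use a Claude login session)"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

Running the file directly prints one warning per missing prerequisite and nothing if everything was found.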
Prepare repos with LEANN:
cd scripts
WORK_ROOT=contextbench_work_dir_claude python prepare_repos_with_leann.py

Useful overrides:
SELECTED_IDS=id1,id2 WORK_ROOT=contextbench_work_dir_claude python prepare_repos_with_leann.py
BENCH_FILTER=Pro WORK_ROOT=contextbench_work_dir_claude python prepare_repos_with_leann.py
LEANN_AST_CHUNK_SIZE=600 LEANN_AST_CHUNK_OVERLAP=96 python prepare_repos_with_leann.py

Run with LEANN:
cd scripts
LEANN_ENABLED=1 \
WORK_ROOT=contextbench_work_dir_claude \
OUTPUT_FILE=all_predictions_claude.jsonl \
python batch_run_selected.py

Run without LEANN:
LEANN_ENABLED=0 \
WORK_ROOT=contextbench_work_dir_claude \
OUTPUT_FILE=all_predictions_claude_baseline.jsonl \
python batch_run_selected.py

Run specific IDs without editing the script:
SELECTED_IDS=id1,id2 python batch_run_selected.py

Context retrieval metrics:
cd scripts
python evaluate_run.py \
--predictions all_predictions_claude.jsonl \
--metrics task_metrics.jsonl \
--output-json eval_report.json

Patch-based proxy accuracy:
python evaluate_contextbench_accuracy.py \
--predictions all_predictions_claude.jsonl \
--metrics task_metrics.jsonl \
--output-json contextbench_accuracy_report.json

To remove generated files, run from this directory:
rm -rf .venv .mitmproxy-venv .eval-venv .leann .pycache_tmp logs traces
rm -rf scripts/.leann scripts/scripts
rm -rf scripts/contextbench_eval_repos scripts/contextbench_work_dir_claude scripts/contextbench_work_dir_claude_overlap160
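
Because rm -rf is unforgiving, it can help to preview what would be deleted first. The sketch below is an illustration, not a script shipped in this directory: the helper names find_generated and clean are hypothetical, the fixed target names are copied from the rm -rf commands above, and the work-directory glob is the scripts/contextbench_work_dir_* pattern this README lists among the generated directories.

```python
import glob
import os
import shutil

# Fixed names copied from the rm -rf commands in this README.
FIXED_TARGETS = [
    ".venv", ".mitmproxy-venv", ".eval-venv", ".leann", ".pycache_tmp",
    "logs", "traces", "scripts/.leann", "scripts/scripts",
    "scripts/contextbench_eval_repos",
]
# Glob pattern for per-run work directories.
GLOB_TARGETS = ["scripts/contextbench_work_dir_*"]


def find_generated(root="."):
    """Return the generated paths that currently exist under root, sorted."""
    candidates = (os.path.join(root, name) for name in FIXED_TARGETS)
    found = [path for path in candidates if os.path.exists(path)]
    for pattern in GLOB_TARGETS:
        found.extend(glob.glob(os.path.join(root, pattern)))
    return sorted(found)


def clean(root=".", dry_run=True):
    """List the generated paths, and delete them only when dry_run is False."""
    targets = find_generated(root)
    for path in targets:
        if dry_run:
            print("would remove", path)
        elif os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return targets


if __name__ == "__main__":
    clean(dry_run=True)  # preview only; nothing is deleted
```

Run as-is from this directory, the script only prints what it would remove; deleting requires an explicit clean(dry_run=False).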