Conversation

@ShaneIsley (Owner)

Add an extensible framework to recreate tests from the original RLM paper
(arXiv:2512.24601) and support new benchmark development.

Framework components (interfaces sketched after this list):
- benchmarks/base.py: Abstract Benchmark class, BenchmarkSample, BenchmarkResult
- benchmarks/metrics.py: Evaluation metrics (exact match, containment, token F1, pairwise F1)
- benchmarks/runner.py: BenchmarkRunner with support for RLM, direct LLM, and summarization methods
- benchmarks/cli.py: CLI for running benchmarks and comparing methods
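As a rough illustration of how these pieces could fit together, here is a
minimal sketch assuming a dataclass-based design. Apart from Benchmark,
BenchmarkSample, BenchmarkResult, and token F1 (all named above), every
field, method, and signature below is an assumption, not the actual API:

  from abc import ABC, abstractmethod
  from dataclasses import dataclass, field

  @dataclass
  class BenchmarkSample:
      """One evaluation item: a long context plus a question and gold answer."""
      sample_id: str
      context: str
      question: str
      expected_answer: str

  @dataclass
  class BenchmarkResult:
      """Outcome of running one sample with one method."""
      sample_id: str
      method: str              # e.g. "rlm", "direct", "summarization"
      predicted_answer: str
      score: float             # metric value in [0, 1]
      metadata: dict = field(default_factory=dict)

  class Benchmark(ABC):
      """Abstract base: subclasses generate samples and score predictions."""

      name: str = "base"

      @abstractmethod
      def generate_samples(self, n: int) -> list[BenchmarkSample]:
          """Produce (or load) n evaluation samples."""

      @abstractmethod
      def score(self, sample: BenchmarkSample, prediction: str) -> float:
          """Score a prediction against the sample's expected answer."""

  def token_f1(prediction: str, reference: str) -> float:
      """Token-level F1 (one of the metrics.py metrics): harmonic mean of
      precision and recall over whitespace-split tokens."""
      pred, ref = prediction.lower().split(), reference.lower().split()
      if not pred or not ref:
          return 0.0
      ref_counts: dict[str, int] = {}
      for t in ref:
          ref_counts[t] = ref_counts.get(t, 0) + 1
      common = 0
      for t in pred:
          if ref_counts.get(t, 0) > 0:      # count overlap with multiplicity
              common += 1
              ref_counts[t] -= 1
      if common == 0:
          return 0.0
      precision, recall = common / len(pred), common / len(ref)
      return 2 * precision * recall / (precision + recall)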

Benchmark implementations (sample-generation sketch after this list):
- NIAHBenchmark: Single-Needle-in-a-Haystack (synthetic retrieval)
- OolongBenchmark: Semantic aggregation using the HuggingFace oolongbench dataset
- OolongPairsBenchmark: Pairwise combinatorial aggregation (hardest setting)
- BrowseCompPlusBenchmark: Multi-hop QA over document corpora (synthetic)
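To make the synthetic benchmarks concrete, here is a hypothetical
single-needle sample generator in the spirit of NIAHBenchmark (reusing the
BenchmarkSample sketch above; the real generation logic is not shown in
this PR):

  import random
  import uuid

  def make_niah_sample(haystack_tokens: int = 50_000) -> BenchmarkSample:
      """Bury one 'needle' fact at a random depth inside filler text."""
      magic = random.randint(100_000, 999_999)
      needle = f"The magic number is {magic}. "
      filler = "The quick brown fox jumps over the lazy dog. "
      n_sentences = haystack_tokens // 9           # ~9 tokens per sentence
      sentences = [filler] * n_sentences
      sentences.insert(random.randrange(n_sentences), needle)
      return BenchmarkSample(
          sample_id=str(uuid.uuid4()),
          context="".join(sentences),
          question="What is the magic number?",
          expected_answer=str(magic),
      )

Exact match or containment can then score the answer, since the needle has
a single unambiguous value.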

Usage:
  python -m benchmarks.cli --benchmark oolong --methods rlm direct -n 10
  python -m benchmarks.cli --benchmark all --output results.json

Add a results storage system (JSON-lines sketch after this list) that enables:
- Persistent storage of benchmark results in JSON-lines format
- Historical comparison across experiment runs
- Query by benchmark, model, method, or environment
- CSV export for external analysis tools
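A sketch of what JSON-lines persistence and querying might look like; the
actual module name, record schema, and helper functions are assumptions:

  import json
  from pathlib import Path

  RESULTS_DIR = Path("benchmark_results")

  def append_result(record: dict, filename: str = "results.jsonl") -> None:
      """Append one result record as a single JSON line (earlier lines
      stay intact even if a later write fails)."""
      RESULTS_DIR.mkdir(exist_ok=True)
      with open(RESULTS_DIR / filename, "a") as f:
          f.write(json.dumps(record) + "\n")

  def query_results(filename: str = "results.jsonl", **filters) -> list[dict]:
      """Return records whose fields equal all given filters, e.g.
      query_results(benchmark="oolong", method="rlm")."""
      path = RESULTS_DIR / filename
      if not path.exists():
          return []
      records = [json.loads(line) for line in path.read_text().splitlines() if line]
      return [r for r in records
              if all(r.get(k) == v for k, v in filters.items())]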

New CLI subcommands (example invocations after this list):
- `run`: Run benchmarks (default, backward compatible)
- `history`: Show historical results with filters
- `compare`: Compare results grouped by method/model/environment
- `list`: Show summary of stored results
- `export`: Export benchmark results to CSV
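Hypothetical invocations of the new subcommands (this PR does not show the
filter flags, so the flag spellings below are assumptions modeled on the
run subcommand's options):

  # Show stored oolong results for one model
  python -m benchmarks.cli history --benchmark oolong --model gpt-5

  # Compare runs grouped by method
  python -m benchmarks.cli compare --group-by method

  # Dump everything to CSV for external analysis
  python -m benchmarks.cli export --output results.csv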

Results are automatically saved to the ./benchmark_results/ directory
with full experiment metadata, including the git commit and RLM version.
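Capturing that metadata typically looks something like the following
sketch; the helper name and exact fields are assumptions, and only the
git commit and RLM version are named above:

  import subprocess
  from importlib.metadata import PackageNotFoundError, version

  def experiment_metadata() -> dict:
      """Collect the git commit and package version to store with each run."""
      try:
          commit = subprocess.run(
              ["git", "rev-parse", "--short", "HEAD"],
              capture_output=True, text=True,
          ).stdout.strip() or "unknown"
      except OSError:                      # git not installed
          commit = "unknown"
      try:
          rlm_version = version("rlm")     # distribution name assumed
      except PackageNotFoundError:
          rlm_version = "unknown"
      return {"git_commit": commit, "rlm_version": rlm_version}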

Add --max-workers (-w) option to run samples concurrently using
ThreadPoolExecutor. This significantly speeds up evaluation when
making many independent LLM API calls.
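Internally this likely amounts to fanning samples out to a thread pool, a
good fit because each evaluation mostly waits on network I/O rather than
holding the GIL. A sketch under that assumption (evaluate and run_samples
are illustrative names):

  from concurrent.futures import ThreadPoolExecutor

  def run_samples(evaluate, samples, max_workers: int = 1):
      """Evaluate samples sequentially or in a thread pool."""
      if max_workers <= 1:
          return [evaluate(s) for s in samples]      # predictable, ordered
      with ThreadPoolExecutor(max_workers=max_workers) as pool:
          return list(pool.map(evaluate, samples))   # preserves input order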

Usage:
  # Run 4 samples in parallel
  python -m benchmarks.cli run --benchmark niah -n 20 --max-workers 4

  # Or via Python API
  runner = BenchmarkRunner(backend="openai", model="gpt-5", max_workers=4)
  results = runner.run(benchmark, num_samples=100)

Default is sequential (max_workers=1) for predictable behavior.
ShaneIsley merged commit 916c803 into main on Jan 15, 2026
2 of 3 checks passed