Conversation

@ShaneIsley (Owner)

Add an extensible framework to recreate tests from the original RLM paper
(arXiv:2512.24601) and support new benchmark development.

Framework components (interfaces sketched after this list):
- benchmarks/base.py: Abstract Benchmark class, BenchmarkSample, BenchmarkResult
- benchmarks/metrics.py: Evaluation metrics (exact match, containment, token F1, pairwise F1)
- benchmarks/runner.py: BenchmarkRunner with support for RLM, direct LLM, and summarization methods
- benchmarks/cli.py: CLI for running benchmarks and comparing methods
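As a rough illustration of how these pieces could fit together, here is a
minimal sketch assuming a dataclass-based design. Apart from Benchmark,
BenchmarkSample, BenchmarkResult, and token F1 (all named above), every
field, method, and signature below is an assumption, not the actual API:

  from abc import ABC, abstractmethod
  from dataclasses import dataclass, field

  @dataclass
  class BenchmarkSample:
      """One evaluation item: a long context plus a question and gold answer."""
      sample_id: str
      context: str
      question: str
      expected_answer: str

  @dataclass
  class BenchmarkResult:
      """Outcome of running one sample with one method."""
      sample_id: str
      method: str              # e.g. "rlm", "direct", "summarization"
      predicted_answer: str
      score: float             # metric value in [0, 1]
      metadata: dict = field(default_factory=dict)

  class Benchmark(ABC):
      """Abstract base: subclasses generate samples and score predictions."""

      name: str = "base"

      @abstractmethod
      def generate_samples(self, n: int) -> list[BenchmarkSample]:
          """Produce (or load) n evaluation samples."""

      @abstractmethod
      def score(self, sample: BenchmarkSample, prediction: str) -> float:
          """Score a prediction against the sample's expected answer."""

  def token_f1(prediction: str, reference: str) -> float:
      """Token-level F1 (one of the metrics.py metrics): harmonic mean of
      precision and recall over whitespace-split tokens."""
      pred, ref = prediction.lower().split(), reference.lower().split()
      if not pred or not ref:
          return 0.0
      ref_counts: dict[str, int] = {}
      for t in ref:
          ref_counts[t] = ref_counts.get(t, 0) + 1
      common = 0
      for t in pred:
          if ref_counts.get(t, 0) > 0:      # count overlap with multiplicity
              common += 1
              ref_counts[t] -= 1
      if common == 0:
          return 0.0
      precision, recall = common / len(pred), common / len(ref)
      return 2 * precision * recall / (precision + recall)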

Benchmark implementations (sample-generation sketch after this list):
- NIAHBenchmark: Single-Needle-in-a-Haystack (synthetic retrieval)
- OolongBenchmark: Semantic aggregation using the HuggingFace oolongbench dataset
- OolongPairsBenchmark: Pairwise combinatorial aggregation (hardest setting)
- BrowseCompPlusBenchmark: Multi-hop QA over document corpora (synthetic)
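To make the synthetic benchmarks concrete, here is a hypothetical
single-needle sample generator in the spirit of NIAHBenchmark (reusing the
BenchmarkSample sketch above; the real generation logic is not shown in
this PR):

  import random
  import uuid

  def make_niah_sample(haystack_tokens: int = 50_000) -> BenchmarkSample:
      """Bury one 'needle' fact at a random depth inside filler text."""
      magic = random.randint(100_000, 999_999)
      needle = f"The magic number is {magic}. "
      filler = "The quick brown fox jumps over the lazy dog. "
      n_sentences = haystack_tokens // 9           # ~9 tokens per sentence
      sentences = [filler] * n_sentences
      sentences.insert(random.randrange(n_sentences), needle)
      return BenchmarkSample(
          sample_id=str(uuid.uuid4()),
          context="".join(sentences),
          question="What is the magic number?",
          expected_answer=str(magic),
      )

Exact match or containment can then score the answer, since the needle has
a single unambiguous value.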

Usage:
  python -m benchmarks.cli --benchmark oolong --methods rlm direct -n 10
  python -m benchmarks.cli --benchmark all --output results.json

Add a results storage system (JSON-lines sketch after this list) that enables:
- Persistent storage of benchmark results in JSON-lines format
- Historical comparison across experiment runs
- Query by benchmark, model, method, or environment
- CSV export for external analysis tools
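A sketch of what JSON-lines persistence and querying might look like; the
actual module name, record schema, and helper functions are assumptions:

  import json
  from pathlib import Path

  RESULTS_DIR = Path("benchmark_results")

  def append_result(record: dict, filename: str = "results.jsonl") -> None:
      """Append one result record as a single JSON line (earlier lines
      stay intact even if a later write fails)."""
      RESULTS_DIR.mkdir(exist_ok=True)
      with open(RESULTS_DIR / filename, "a") as f:
          f.write(json.dumps(record) + "\n")

  def query_results(filename: str = "results.jsonl", **filters) -> list[dict]:
      """Return records whose fields equal all given filters, e.g.
      query_results(benchmark="oolong", method="rlm")."""
      path = RESULTS_DIR / filename
      if not path.exists():
          return []
      records = [json.loads(line) for line in path.read_text().splitlines() if line]
      return [r for r in records
              if all(r.get(k) == v for k, v in filters.items())]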

New CLI subcommands (example invocations after this list):
- `run`: Run benchmarks (default, backward compatible)
- `history`: Show historical results with filters
- `compare`: Compare results grouped by method/model/environment
- `list`: Show summary of stored results
- `export`: Export benchmark results to CSV
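Hypothetical invocations of the new subcommands (this PR does not show the
filter flags, so the flag spellings below are assumptions modeled on the
run subcommand's options):

  # Show stored oolong results for one model
  python -m benchmarks.cli history --benchmark oolong --model gpt-5

  # Compare runs grouped by method
  python -m benchmarks.cli compare --group-by method

  # Dump everything to CSV for external analysis
  python -m benchmarks.cli export --output results.csv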

Results are automatically saved to the ./benchmark_results/ directory
with full experiment metadata, including the git commit and RLM version.
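Capturing that metadata typically looks something like the following
sketch; the helper name and exact fields are assumptions, and only the
git commit and RLM version are named above:

  import subprocess
  from importlib.metadata import PackageNotFoundError, version

  def experiment_metadata() -> dict:
      """Collect the git commit and package version to store with each run."""
      try:
          commit = subprocess.run(
              ["git", "rev-parse", "--short", "HEAD"],
              capture_output=True, text=True,
          ).stdout.strip() or "unknown"
      except OSError:                      # git not installed
          commit = "unknown"
      try:
          rlm_version = version("rlm")     # distribution name assumed
      except PackageNotFoundError:
          rlm_version = "unknown"
      return {"git_commit": commit, "rlm_version": rlm_version}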

Add --max-workers (-w) option to run samples concurrently using
ThreadPoolExecutor. This significantly speeds up evaluation when
making many independent LLM API calls.
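Internally this likely amounts to fanning samples out to a thread pool, a
good fit because each evaluation mostly waits on network I/O rather than
holding the GIL. A sketch under that assumption (evaluate and run_samples
are illustrative names):

  from concurrent.futures import ThreadPoolExecutor

  def run_samples(evaluate, samples, max_workers: int = 1):
      """Evaluate samples sequentially or in a thread pool."""
      if max_workers <= 1:
          return [evaluate(s) for s in samples]      # predictable, ordered
      with ThreadPoolExecutor(max_workers=max_workers) as pool:
          return list(pool.map(evaluate, samples))   # preserves input order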

Usage:
  # Run 4 samples in parallel
  python -m benchmarks.cli run --benchmark niah -n 20 --max-workers 4

  # Or via Python API
  runner = BenchmarkRunner(backend="openai", model="gpt-5", max_workers=4)
  results = runner.run(benchmark, num_samples=100)

Default is sequential (max_workers=1) for predictable behavior.
ShaneIsley merged commit 916c803 into main on Jan 15, 2026
2 of 3 checks passed