Scaffold deterministic ScienceAgentBench local experiment track by Darkroom4364 · Pull Request #121 · 0sec-labs/noeris

Darkroom4364 · 2026-05-25T10:25:23Z

Closes #117

Summary

- add an experimental local ScienceAgentBench-style scaffold with TaskRegistry, ScienceTaskSpec, ExperimentDatabase, deterministic strategy routing, and memory extraction
- add a deterministic toy bioinformatics normalization fixture and runner, with CI-facing tests for the flow register -> choose strategy -> record result -> extract reusable memory
- document the kernel-search mapping, the deferred real-benchmark pieces, and the next concrete integration step for #85 (Explore ML experiment track: ScienceAgentBench as generalization proof), without claiming any ScienceAgentBench score
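The fixture bullet above can be illustrated with a minimal sketch. It is not the PR's code: the normalization rule (counts scaled to fractions of the total), the helper names `normalize_counts` and `memory_id`, and the gene labels are all assumptions made for illustration. It shows why exact equality is a workable success check when the arithmetic is exact, and how a content-derived ID stays deterministic across runs.

```python
from __future__ import annotations

import hashlib
import json
from fractions import Fraction


def normalize_counts(counts: dict[str, int]) -> dict[str, Fraction]:
    """Scale raw counts to fractions of the total (an assumed normalization)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive total")
    return {gene: Fraction(value, total) for gene, value in counts.items()}


def memory_id(task_id: str, strategy: str, rule: str) -> str:
    """Hypothetical content-derived ID: the same inputs always give the same ID."""
    payload = json.dumps([task_id, strategy, rule], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical gold output for a two-gene toy input.
expected = {"geneA": Fraction(1, 4), "geneB": Fraction(3, 4)}
observed = normalize_counts({"geneA": 5, "geneB": 15})

# Exact equality is safe here because Fraction arithmetic has no rounding error.
success = observed == expected
```

Using `Fraction` avoids a float tolerance in the comparison; a real fixture might instead round to a fixed number of decimals, which this page does not specify.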

Verification

python3 -m unittest tests.test_science_agent_bench
python3 -m compileall src/research_engine/science_agent_bench.py tests/test_science_agent_bench.py
git diff --check

Tracking

- posted a follow-up comment on #85 (Explore ML experiment track: ScienceAgentBench as generalization proof)

Summary by CodeRabbit

New Features
- Added experimental local execution framework for science agent tasks supporting deterministic fixture running, strategy routing, and persistent result tracking.
- Enables extraction and reuse of learned patterns across tasks.
Documentation
- Added architecture documentation for the local framework and roadmap for future integration.
Tests
- Added comprehensive tests covering registration, routing, persistence, and error cases.

coderabbitai · 2026-05-25T10:25:35Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: cc4732ed-c5c8-48cd-8bee-1d834034d945

📥 Commits

Reviewing files that changed from the base of the PR and between cdfb8d9 and dc19f43.

📒 Files selected for processing (3)

docs/research/scienceagentbench-local-scaffold.md
src/research_engine/science_agent_bench.py
tests/test_science_agent_bench.py

Walkthrough

The PR adds a local-only deterministic scaffold for ScienceAgentBench-style tasks: data models for tasks and results, a task registry enforcing unique IDs, a JSON-backed experiment database, a deterministic strategy router, a runner orchestrating execution, a toy normalization fixture task, and comprehensive test coverage with documentation.
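As a rough illustration of two of the pieces named above, the sketch below shows a registry that rejects duplicate IDs and a JSON-backed result store. The class names `TaskSpec`, `Registry`, and `JsonResultStore`, their fields, and the record layout are hypothetical stand-ins; the PR's actual `ScienceTaskSpec`, `TaskRegistry`, and `ExperimentDatabase` may differ.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for ScienceTaskSpec; the real fields are not shown here."""

    task_id: str
    discipline: str
    enabled_strategies: tuple[str, ...]


class Registry:
    """Hypothetical registry: rejects duplicate IDs and tasks with no enabled strategy."""

    def __init__(self) -> None:
        self._tasks: dict[str, TaskSpec] = {}

    def register(self, task: TaskSpec) -> None:
        if task.task_id in self._tasks:
            raise ValueError(f"duplicate task id: {task.task_id}")
        if not task.enabled_strategies:
            raise ValueError("task needs at least one enabled strategy")
        self._tasks[task.task_id] = task

    def get(self, task_id: str) -> TaskSpec:
        return self._tasks[task_id]


class JsonResultStore:
    """Hypothetical JSON-backed store: loads existing records, rewrites the file on append."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.records: list[dict] = json.loads(path.read_text()) if path.exists() else []

    def record(self, record: dict) -> None:
        self.records.append(record)
        self.path.write_text(json.dumps(self.records, indent=2, sort_keys=True))


registry = Registry()
registry.register(TaskSpec("toy-normalization", "bioinformatics", ("direct",)))

with tempfile.TemporaryDirectory() as tmp:
    store = JsonResultStore(Path(tmp) / "results.json")
    store.record({"task_id": "toy-normalization", "strategy": "direct", "success": True})
    reloaded = JsonResultStore(store.path)  # round-trip: a second store reads the same file
```

Writing with `sort_keys=True` keeps the file byte-stable for identical content, which suits a scaffold meant to be deterministic.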

Changes

ScienceAgentBench Local Deterministic Scaffold

| Layer / File(s) | Summary |
| --- | --- |
| Data Models and Strategy Constants `src/research_engine/science_agent_bench.py` (lines 1–73) | Strategy name constants and three frozen dataclasses model task specs, reusable memory items, and experiment results with scoring, notes, and UTC timestamps. |
| Task Registry and Experiment Database `src/research_engine/science_agent_bench.py` (lines 75–188) | `TaskRegistry` enforces unique task IDs and validates at least one enabled strategy. `ExperimentDatabase` persists results to JSON, loads existing records, filters by task/discipline/type, consolidates memory items, and calculates per-strategy success rates. |
| Strategy Router and Deterministic Runner `src/research_engine/science_agent_bench.py` (lines 190–254) | `ScienceStrategyRouter` selects strategies deterministically: prefers retrieval if enabled and relevant memory exists, otherwise picks the best-performing strategy with tie-breaking. `DeterministicScienceRunner` orchestrates task lookup, strategy selection, memory gathering, execution, and result recording. |
| Local Fixture Task and Execution `src/research_engine/science_agent_bench.py` (lines 256–362) | `build_local_fixture_registry` creates a toy normalization task. `run_local_fixture_task` validates strategy/task type, computes outputs by strategy mode, determines success via exact equality, and extracts reusable memory on success. Supporting helpers normalize counts and emit memory items with deterministic IDs. |
| Persistence Serialization `src/research_engine/science_agent_bench.py` (lines 364–386) | JSON serialization/deserialization helpers convert experiment results and memory items to/from JSON-compatible dicts, preserving timestamps and backward compatibility. |
| Test Suite `tests/test_science_agent_bench.py` | Four integration tests exercise deterministic task execution, strategy selection, router memory-based switching to retrieval, JSON persistence round-trip of results and memory, and registry duplicate-rejection validation. |
| Scaffold Documentation `docs/research/scienceagentbench-local-scaffold.md` | Describes the experimental local scaffold, enumerates components (registry, database, router, runner), explains the toy fixture, maps Noeris concepts to local equivalents, lists deferred features, and specifies the next benchmark integration step. |
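The routing rule described for `ScienceStrategyRouter` can be sketched as follows. This is a guess at the shape of the logic, not the PR's implementation: the constant `RETRIEVAL`, the function names, the `(strategy, succeeded)` input format, the 0.0 score for untried strategies, and the alphabetical tie-break are all assumptions.

```python
from __future__ import annotations

RETRIEVAL = "retrieval"  # assumed constant; the PR's actual strategy names are not shown


def success_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-strategy success rate from (strategy, succeeded) pairs."""
    totals: dict[str, list[int]] = {}
    for strategy, succeeded in results:
        wins_runs = totals.setdefault(strategy, [0, 0])
        wins_runs[0] += int(succeeded)
        wins_runs[1] += 1
    return {name: wins / runs for name, (wins, runs) in totals.items()}


def choose_strategy(
    enabled: tuple[str, ...],
    results: list[tuple[str, bool]],
    has_memory: bool,
) -> str:
    """Prefer retrieval when it is enabled and memory exists, else the best success rate."""
    if not enabled:
        raise ValueError("no enabled strategies")
    if RETRIEVAL in enabled and has_memory:
        return RETRIEVAL
    rates = success_rates(results)
    # Highest rate wins; untried strategies score 0.0; ties fall back to
    # alphabetical order (an assumed tie-break that keeps the choice deterministic).
    return min(enabled, key=lambda name: (-rates.get(name, 0.0), name))


cold_start = choose_strategy(("direct", RETRIEVAL), [], has_memory=False)
with_memory = choose_strategy(("direct", RETRIEVAL), [], has_memory=True)
by_rate = choose_strategy(
    ("direct", "template"),
    [("template", True), ("direct", False)],
    has_memory=False,
)
```

Whatever tie-break the PR uses, it needs to be a total order over strategy names so that identical history always yields the same choice.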

Sequence Diagram

sequenceDiagram
  participant User
  participant DeterministicScienceRunner
  participant TaskRegistry
  participant ScienceStrategyRouter
  participant ExperimentDatabase
  participant FixtureTask as run_local_fixture_task

  User->>DeterministicScienceRunner: run_task(task_id)
  DeterministicScienceRunner->>TaskRegistry: get(task_id)
  TaskRegistry-->>DeterministicScienceRunner: ScienceTaskSpec
  DeterministicScienceRunner->>ScienceStrategyRouter: choose_strategy(task, database)
  ScienceStrategyRouter->>ExperimentDatabase: results_for(task)
  ExperimentDatabase-->>ScienceStrategyRouter: prior results
  ScienceStrategyRouter->>ExperimentDatabase: memory_items(task)
  ExperimentDatabase-->>ScienceStrategyRouter: available memory
  ScienceStrategyRouter-->>DeterministicScienceRunner: selected strategy
  DeterministicScienceRunner->>ExperimentDatabase: memory_items(task, strategy)
  ExperimentDatabase-->>DeterministicScienceRunner: relevant memory
  DeterministicScienceRunner->>FixtureTask: run_local_fixture_task(task, strategy, memory)
  FixtureTask-->>DeterministicScienceRunner: ScienceExperimentResult
  DeterministicScienceRunner->>ExperimentDatabase: record_result(result)
  DeterministicScienceRunner-->>User: ScienceExperimentResult
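The call order in the diagram can be approximated by the sketch below. The method names `run_task`, `results_for`, `memory_items`, and `record_result` come from the diagram; everything else (the `Result` and `InMemoryDatabase` classes, the function signatures, the toy `choose` and `execute` callables) is hypothetical. The sketch also simplifies the diagram by gathering memory once, before routing, rather than a second time per strategy.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for ScienceExperimentResult."""

    task_id: str
    strategy: str
    success: bool
    memory: tuple[str, ...] = ()


@dataclass
class InMemoryDatabase:
    """Hypothetical stand-in for ExperimentDatabase, without JSON persistence."""

    results: list[Result] = field(default_factory=list)

    def results_for(self, task_id: str) -> list[Result]:
        return [r for r in self.results if r.task_id == task_id]

    def memory_items(self, task_id: str) -> list[str]:
        return [item for r in self.results_for(task_id) if r.success for item in r.memory]

    def record_result(self, result: Result) -> None:
        self.results.append(result)


def run_task(
    task_id: str,
    tasks: Mapping[str, Sequence[str]],
    database: InMemoryDatabase,
    choose: Callable[[Sequence[str], list[Result], list[str]], str],
    execute: Callable[[str, str, list[str]], Result],
) -> Result:
    """Look up the task, route, execute with memory, record, and return the result."""
    enabled = tasks[task_id]  # raises KeyError for an unregistered task
    memory = database.memory_items(task_id)
    strategy = choose(enabled, database.results_for(task_id), memory)
    result = execute(task_id, strategy, memory)
    database.record_result(result)
    return result


def choose(enabled, prior_results, memory):
    # Assumed routing rule: retrieval once memory exists, else the first enabled strategy.
    return "retrieval" if "retrieval" in enabled and memory else enabled[0]


def execute(task_id, strategy, memory):
    # Toy executor: always succeeds and emits one reusable memory item.
    return Result(task_id, strategy, True, (f"{task_id}:rule",))


tasks = {"toy-normalization": ("direct", "retrieval")}
database = InMemoryDatabase()
first = run_task("toy-normalization", tasks, database, choose, execute)
second = run_task("toy-normalization", tasks, database, choose, execute)
```

Running the same task twice exercises the memory-based switch: the first run has no memory and uses the first enabled strategy, and the second run finds the recorded memory item and routes to retrieval.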

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

A rabbit hops through experiments with care,
Registry, database, router all there.
Strategies chosen with wisdom and grace,
Memory extracted, results saved in place.
Normalization tasks pass the test—
A local scaffold, deterministically blessed! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Scaffold deterministic ScienceAgentBench local experiment track' clearly and concisely summarizes the main change: adding a deterministic local scaffold for ScienceAgentBench experiments. |
| Linked Issues check | ✅ Passed | All coding requirements from issue `#117` are met: TaskRegistry, ScienceTaskSpec, ExperimentDatabase, deterministic local fixture task, strategy router, result recording, memory extraction, and CI tests. |
| Out of Scope Changes check | ✅ Passed | All changes (documentation, module implementation, and tests) are directly related to implementing the local ScienceAgentBench scaffold as specified in issue `#117`. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/research_engine/science_agent_bench.py (1)
138-138: 💤 Low value

Optional: Remove redundant list() call.

The list(results) call is unnecessary since results is already a list created by the filter comprehensions above. This is a minor redundancy but doesn't affect correctness.
♻️ Simplify by removing redundant list() call
-        return list(results)
+        return results
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/research_engine/science_agent_bench.py` at line 138, The return uses an
unnecessary list() wrapper—replace the final "return list(results)" with "return
results" to avoid redundant copying; locate the return statement (currently
"return list(results)") in src/research_engine/science_agent_bench.py (the
function that builds the filtered results list) and return the existing results
list directly.
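A small sketch of why the extra `list()` call is redundant, assuming the result list is always built by a comprehension as the comment states. The helper name `results_for` and the record layout are hypothetical; the function at line 138 is not shown on this page.

```python
def results_for(records, task_id=None):
    """Hypothetical filter helper; the real method's name and signature are not shown here."""
    results = [r for r in records if task_id is None or r["task_id"] == task_id]
    # The comprehension already produced a new list, so list(results) would only copy it again.
    return results


stored = [{"task_id": "a"}, {"task_id": "b"}]
everything = results_for(stored)
everything.clear()  # mutating the returned list leaves the stored records untouched
only_a = results_for(stored, "a")
```

The wrapper would matter only if some code path returned the internal list itself, for example an unfiltered early return; in that case `list()` acts as a defensive copy and should stay.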

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/research/scienceagentbench-local-scaffold.md`:
- Around line 38-47: Update the wording to distinguish the two sandbox scopes:
clarify that "sandbox execution is deferred" refers to full Docker/E2B
arbitrary-code sandboxing, while the immediate Next Benchmark Integration Step
requires a minimal fixture-only sandbox harness that converts a
ScienceAgentBench-Lite task fixture into a ScienceTaskSpec, runs it in an
offline sandbox, and compares output to the benchmark gold; rephrase the lines
mentioning "sandbox execution" and the Next Benchmark Integration Step so they
explicitly say "minimal fixture-only/offline sandbox for running predefined task
fixtures (not full arbitrary-code Docker/E2B sandboxing)" to remove ambiguity.

---

Nitpick comments:
In `@src/research_engine/science_agent_bench.py`:
- Line 138: The return uses an unnecessary list() wrapper—replace the final
"return list(results)" with "return results" to avoid redundant copying; locate
the return statement (currently "return list(results)") in
src/research_engine/science_agent_bench.py (the function that builds the
filtered results list) and return the existing results list directly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3f5d0f72-fad6-438f-b0e2-8644a08f0380

📥 Commits

Reviewing files that changed from the base of the PR and between 8235f8c and cdfb8d9.

📒 Files selected for processing (3)

docs/research/scienceagentbench-local-scaffold.md
src/research_engine/science_agent_bench.py
tests/test_science_agent_bench.py

coderabbitai Bot reviewed May 25, 2026


Comment thread docs/research/scienceagentbench-local-scaffold.md

feat: scaffold local ScienceAgentBench experiment track

dc19f43

Darkroom4364 force-pushed the issue-117-scienceagentbench-local branch from cdfb8d9 to dc19f43 on May 26, 2026, 20:18

Darkroom4364 wants to merge 1 commit into main from issue-117-scienceagentbench-local
