
Scaffold deterministic ScienceAgentBench local experiment track #121

Open
Darkroom4364 wants to merge 1 commit into main from issue-117-scienceagentbench-local

Conversation

Darkroom4364 commented May 25, 2026

Collaborator

Closes #117

Summary

  • add an experimental local ScienceAgentBench-style scaffold with TaskRegistry, ScienceTaskSpec, ExperimentDatabase, deterministic strategy routing, and memory extraction
  • add a deterministic toy bioinformatics normalization fixture and runner with CI-facing tests for register -> choose strategy -> record result -> extract reusable memory
  • document the kernel-search mapping, the deferred real-benchmark pieces, and the next concrete integration step for #85 (Explore ML experiment track: ScienceAgentBench as generalization proof), without claiming any ScienceAgentBench score

Verification

  • python3 -m unittest tests.test_science_agent_bench
  • python3 -m compileall src/research_engine/science_agent_bench.py tests/test_science_agent_bench.py
  • git diff --check

Tracking

Summary by CodeRabbit

  • New Features

    • Added experimental local execution framework for science agent tasks supporting deterministic fixture running, strategy routing, and persistent result tracking.
    • Enables extraction and reuse of learned patterns across tasks.
  • Documentation

    • Added architecture documentation for the local framework and roadmap for future integration.
  • Tests

    • Added comprehensive tests covering registration, routing, persistence, and error cases.


coderabbitai Bot commented May 25, 2026


Warning

Review limit reached

@Darkroom4364, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 23 minutes and 30 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: cc4732ed-c5c8-48cd-8bee-1d834034d945

📥 Commits

Reviewing files that changed from the base of the PR and between cdfb8d9 and dc19f43.

📒 Files selected for processing (3)
  • docs/research/scienceagentbench-local-scaffold.md
  • src/research_engine/science_agent_bench.py
  • tests/test_science_agent_bench.py
📝 Walkthrough


The PR adds a local-only deterministic scaffold for ScienceAgentBench-style tasks: data models for tasks and results, a task registry enforcing unique IDs, a JSON-backed experiment database, a deterministic strategy router, a runner orchestrating execution, a toy normalization fixture task, and comprehensive test coverage with documentation.
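
As a rough illustration of two of the pieces named above, the sketch below shows a frozen task-spec dataclass and a registry that rejects duplicate IDs and specs with no enabled strategy. The class names `ScienceTaskSpec` and `TaskRegistry` come from the PR; every field name, method name, and strategy constant here is an assumption made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical strategy names; the PR defines its own constants.
STRATEGY_DIRECT = "direct"
STRATEGY_RETRIEVAL = "retrieval"


@dataclass(frozen=True)
class ScienceTaskSpec:
    """Immutable description of one task (field names are assumed)."""

    task_id: str
    discipline: str
    task_type: str
    enabled_strategies: Tuple[str, ...]


class TaskRegistry:
    """Holds task specs keyed by ID and validates them on registration."""

    def __init__(self) -> None:
        self._tasks: Dict[str, ScienceTaskSpec] = {}

    def register(self, spec: ScienceTaskSpec) -> None:
        # Reject a second spec with an ID that is already registered.
        if spec.task_id in self._tasks:
            raise ValueError(f"duplicate task id: {spec.task_id}")
        # Require at least one enabled strategy.
        if not spec.enabled_strategies:
            raise ValueError(f"task {spec.task_id} has no enabled strategy")
        self._tasks[spec.task_id] = spec

    def get(self, task_id: str) -> ScienceTaskSpec:
        return self._tasks[task_id]
```

Freezing the dataclass means a registered spec cannot be modified after validation, which is one plausible reason for the PR's choice of frozen models.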

Changes

ScienceAgentBench Local Deterministic Scaffold

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data Models and Strategy Constants | src/research_engine/science_agent_bench.py (lines 1–73) | Strategy name constants and three frozen dataclasses model task specs, reusable memory items, and experiment results with scoring, notes, and UTC timestamps. |
| Task Registry and Experiment Database | src/research_engine/science_agent_bench.py (lines 75–188) | TaskRegistry enforces unique task IDs and validates at least one enabled strategy. ExperimentDatabase persists results to JSON, loads existing records, filters by task/discipline/type, consolidates memory items, and calculates per-strategy success rates. |
| Strategy Router and Deterministic Runner | src/research_engine/science_agent_bench.py (lines 190–254) | ScienceStrategyRouter selects strategies deterministically: prefers retrieval if enabled and relevant memory exists, otherwise picks the best-performing strategy with tie-breaking. DeterministicScienceRunner orchestrates task lookup, strategy selection, memory gathering, execution, and result recording. |
| Local Fixture Task and Execution | src/research_engine/science_agent_bench.py (lines 256–362) | build_local_fixture_registry creates a toy normalization task. run_local_fixture_task validates strategy/task type, computes outputs by strategy mode, determines success via exact equality, and extracts reusable memory on success. Supporting helpers normalize counts and emit memory items with deterministic IDs. |
| Persistence Serialization | src/research_engine/science_agent_bench.py (lines 364–386) | JSON serialization/deserialization helpers convert experiment results and memory items to/from JSON-compatible dicts, preserving timestamps and backward compatibility. |
| Test Suite | tests/test_science_agent_bench.py | Four integration tests exercise deterministic task execution, strategy selection, router memory-based switching to retrieval, JSON persistence round-trip of results and memory, and registry duplicate-rejection validation. |
| Scaffold Documentation | docs/research/scienceagentbench-local-scaffold.md | Describes the experimental local scaffold, enumerates components (registry, database, router, runner), explains the toy fixture, maps Noeris concepts to local equivalents, lists deferred features, and specifies the next benchmark integration step. |
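
The routing rule summarized above (prefer retrieval when it is enabled and relevant memory exists, otherwise pick the best-performing strategy with a deterministic tie-break) can be sketched as a pure function. The function names, the argument shapes, and the alphabetical tie-break are assumptions; the walkthrough does not say which tie-breaking rule the PR uses.

```python
from typing import Dict, List, Sequence, Tuple


def success_rates(history: Sequence[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-strategy success rate from (strategy, succeeded) records."""
    totals: Dict[str, List[int]] = {}
    for strategy, succeeded in history:
        wins_runs = totals.setdefault(strategy, [0, 0])
        wins_runs[0] += int(succeeded)
        wins_runs[1] += 1
    return {name: wins / runs for name, (wins, runs) in totals.items()}


def choose_strategy(
    enabled: Sequence[str],
    history: Sequence[Tuple[str, bool]],
    has_memory: bool,
    retrieval: str = "retrieval",  # hypothetical strategy name
) -> str:
    """Pick a strategy deterministically from the enabled ones."""
    if not enabled:
        raise ValueError("no enabled strategies")
    # Retrieval wins outright when it is enabled and memory is available.
    if retrieval in enabled and has_memory:
        return retrieval
    rates = success_rates(history)
    # Untried strategies count as 0.0. The sort key orders by highest
    # rate first, then by name, so equal rates always resolve the same way.
    return min(enabled, key=lambda name: (-rates.get(name, 0.0), name))
```

Because the function depends only on its arguments, the same history and memory state always yield the same choice, which is the property the tests for router switching rely on.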

Sequence Diagram

sequenceDiagram
  participant User
  participant DeterministicScienceRunner
  participant TaskRegistry
  participant ScienceStrategyRouter
  participant ExperimentDatabase
  participant FixtureTask as run_local_fixture_task

  User->>DeterministicScienceRunner: run_task(task_id)
  DeterministicScienceRunner->>TaskRegistry: get(task_id)
  TaskRegistry-->>DeterministicScienceRunner: ScienceTaskSpec
  DeterministicScienceRunner->>ScienceStrategyRouter: choose_strategy(task, database)
  ScienceStrategyRouter->>ExperimentDatabase: results_for(task)
  ExperimentDatabase-->>ScienceStrategyRouter: prior results
  ScienceStrategyRouter->>ExperimentDatabase: memory_items(task)
  ExperimentDatabase-->>ScienceStrategyRouter: available memory
  ScienceStrategyRouter-->>DeterministicScienceRunner: selected strategy
  DeterministicScienceRunner->>ExperimentDatabase: memory_items(task, strategy)
  ExperimentDatabase-->>DeterministicScienceRunner: relevant memory
  DeterministicScienceRunner->>FixtureTask: run_local_fixture_task(task, strategy, memory)
  FixtureTask-->>DeterministicScienceRunner: ScienceExperimentResult
  DeterministicScienceRunner->>ExperimentDatabase: record_result(result)
  DeterministicScienceRunner-->>User: ScienceExperimentResult
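
The call sequence in the diagram can be approximated in a few lines. This is a minimal sketch, not the PR's implementation: the in-memory database stands in for the JSON-backed `ExperimentDatabase`, tasks are plain dicts rather than `ScienceTaskSpec` objects, and "normalization" is assumed to mean scaling counts so they sum to 1.

```python
from typing import Dict, List


def normalize_counts(counts: Dict[str, int]) -> Dict[str, float]:
    """Toy normalization: scale counts so they sum to 1 (assumed semantics)."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


class InMemoryDatabase:
    """Hypothetical stand-in for the PR's JSON-backed ExperimentDatabase."""

    def __init__(self) -> None:
        self.results: List[dict] = []

    def memory_items(self, task_id: str) -> List[str]:
        return [
            item
            for result in self.results
            if result["task_id"] == task_id
            for item in result["memory"]
        ]

    def record_result(self, result: dict) -> None:
        self.results.append(result)


def run_task(task: dict, database: InMemoryDatabase) -> dict:
    # 1. Choose a strategy: retrieval once memory exists, otherwise direct.
    memory = database.memory_items(task["task_id"])
    strategy = "retrieval" if memory else "direct"
    # 2. Execute the fixture and judge success by exact equality.
    output = normalize_counts(task["counts"])
    success = output == task["expected"]
    # 3. Extract reusable memory only from successful runs.
    new_memory = [f"{task['task_id']}:divide-by-total"] if success else []
    # 4. Record the result so later runs can route on it.
    result = {
        "task_id": task["task_id"],
        "strategy": strategy,
        "success": success,
        "memory": new_memory,
    }
    database.record_result(result)
    return result
```

Running the same task twice against one database shows the memory-based switch: the first run has no memory and uses the direct strategy, and the second run finds the memory item recorded by the first and routes to retrieval.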

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

A rabbit hops through experiments with care,
Registry, database, router all there.
Strategies chosen with wisdom and grace,
Memory extracted, results saved in place.
Normalization tasks pass the test—
A local scaffold, deterministically blessed! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Scaffold deterministic ScienceAgentBench local experiment track' clearly and concisely summarizes the main change: adding a deterministic local scaffold for ScienceAgentBench experiments. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #117 are met: TaskRegistry, ScienceTaskSpec, ExperimentDatabase, deterministic local fixture task, strategy router, result recording, memory extraction, and CI tests. |
| Out of Scope Changes check | ✅ Passed | All changes (documentation, module implementation, and tests) are directly related to implementing the local ScienceAgentBench scaffold as specified in issue #117. |
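
For context on the failed docstring check: docstring coverage is the share of documentable definitions that carry a docstring. The sketch below estimates it with the standard-library `ast` module. It is an assumption about how such a metric can be computed, not CodeRabbit's actual implementation, which may count definitions differently.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the fraction of modules, classes, and functions with docstrings."""
    tree = ast.parse(source)
    documentable = (
        ast.Module,
        ast.ClassDef,
        ast.FunctionDef,
        ast.AsyncFunctionDef,
    )
    # ast.walk always yields the Module node, so nodes is never empty.
    nodes = [node for node in ast.walk(tree) if isinstance(node, documentable)]
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return documented / len(nodes)
```

Running a function like this over `src/research_engine/science_agent_bench.py` before and after adding docstrings would give a local estimate to compare against the 80% threshold.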


coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/research_engine/science_agent_bench.py (1)

138-138: 💤 Low value

Optional: Remove redundant list() call.

The list(results) call is unnecessary since results is already a list created by the filter comprehensions above. This is a minor redundancy but doesn't affect correctness.

♻️ Simplify by removing redundant list() call
-        return list(results)
+        return results
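
The trade-off behind this nitpick: `list(results)` builds a new list object with the same elements, which only matters if the caller could otherwise mutate a list the function still holds. A minimal sketch with hypothetical helper names, assuming (as the comment states) that `results` is a fresh list built inside the function:

```python
def filtered_with_copy(records, task_id):
    results = [r for r in records if r["task_id"] == task_id]
    return list(results)  # extra shallow copy of an already-fresh list


def filtered_direct(records, task_id):
    results = [r for r in records if r["task_id"] == task_id]
    return results  # same contents, one fewer allocation


records = [{"task_id": "a"}, {"task_id": "b"}, {"task_id": "a"}]
copied = filtered_with_copy(records, "a")
direct = filtered_direct(records, "a")

# Both variants return equal lists.
same_contents = copied == direct
# Neither returned list is the stored list, so appending to the
# returned value cannot change `records`.
returns_fresh_list = direct is not records
```

If the function instead returned an internal attribute unchanged when no filter applied, the copy would be protective and should stay, which is why the prompt asks to verify the finding against the current code first.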
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/research_engine/science_agent_bench.py` at line 138, The return uses an
unnecessary list() wrapper—replace the final "return list(results)" with "return
results" to avoid redundant copying; locate the return statement (currently
"return list(results)") in src/research_engine/science_agent_bench.py (the
function that builds the filtered results list) and return the existing results
list directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/research/scienceagentbench-local-scaffold.md`:
- Around line 38-47: Update the wording to distinguish the two sandbox scopes:
clarify that "sandbox execution is deferred" refers to full Docker/E2B
arbitrary-code sandboxing, while the immediate Next Benchmark Integration Step
requires a minimal fixture-only sandbox harness that converts a
ScienceAgentBench-Lite task fixture into a ScienceTaskSpec, runs it in an
offline sandbox, and compares output to the benchmark gold; rephrase the lines
mentioning "sandbox execution" and the Next Benchmark Integration Step so they
explicitly say "minimal fixture-only/offline sandbox for running predefined task
fixtures (not full arbitrary-code Docker/E2B sandboxing)" to remove ambiguity.

---

Nitpick comments:
In `@src/research_engine/science_agent_bench.py`:
- Line 138: The return uses an unnecessary list() wrapper—replace the final
"return list(results)" with "return results" to avoid redundant copying; locate
the return statement (currently "return list(results)") in
src/research_engine/science_agent_bench.py (the function that builds the
filtered results list) and return the existing results list directly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3f5d0f72-fad6-438f-b0e2-8644a08f0380

📥 Commits

Reviewing files that changed from the base of the PR and between 8235f8c and cdfb8d9.

📒 Files selected for processing (3)
  • docs/research/scienceagentbench-local-scaffold.md
  • src/research_engine/science_agent_bench.py
  • tests/test_science_agent_bench.py

Comment thread docs/research/scienceagentbench-local-scaffold.md
Darkroom4364 force-pushed the issue-117-scienceagentbench-local branch from cdfb8d9 to dc19f43 on May 26, 2026 20:18

Development

Successfully merging this pull request may close these issues.

[ScienceAgentBench] Scaffold deterministic local experiment track

1 participant