feat(engine): scenario pool / random sampling #5

Merged
regevguym merged 1 commit into mondaycom:main from nymeria-ai:feat/scenario-pools
Mar 19, 2026
@nymeria-ai
Contributor

Summary

Adds a scenario pool mechanism to suite YAML that enables random task selection at runtime.

Why

Evaluation suites with fixed tests are gameable — agents (or their owners) can memorize answers. Pools let suite authors define a bank of scenarios and have the engine randomly select N per run.

Use case: agentalent.ai agent verification — 25+ creative challenges in a pool, each run picks 3-5 random tasks. Agents can't pre-script answers.

YAML Syntax

scenarios:
  # Regular scenarios work unchanged
  - id: fixed-task
    name: Always runs
    layer: execution
    input:
      prompt: Do this
    kpis: [...]

  # NEW: Pool — engine picks `count` random scenarios
  - pool:
      id: creative-challenges
      count: 3
      seed: 42  # optional: deterministic selection
      scenarios:
        - id: roast-yourself
          name: Self Roast
          layer: execution
          input:
            prompt: Roast yourself
          kpis: [...]
        # ... more scenarios in pool
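
The two entry shapes in the YAML above could be modelled roughly as follows. This is only a sketch: the field lists are inferred from the example, and `isPoolEntry` is a hypothetical helper, so the actual definitions in types.ts may differ.

```typescript
// Hypothetical shapes inferred from the YAML example; the real
// definitions in types.ts may carry more (or different) fields.
interface ScenarioDefinition {
  id: string;
  name: string;
  layer: string;
  input: { prompt: string };
  kpis: unknown[];
}

interface ScenarioPool {
  id: string;
  count: number; // how many scenarios to pick per run
  seed?: number; // optional: makes the selection deterministic
  scenarios: ScenarioDefinition[];
}

// A suite entry is either a plain scenario or a `pool:` wrapper.
type ScenarioEntry = ScenarioDefinition | { pool: ScenarioPool };

// Hypothetical type guard for telling the two entry kinds apart.
function isPoolEntry(entry: ScenarioEntry): entry is { pool: ScenarioPool } {
  return "pool" in entry;
}
```

Wrapping the pool under a `pool:` key keeps plain scenarios untouched, which is what lets existing suites parse unchanged.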

Changes

  • types.ts: ScenarioPool interface, ScenarioEntry union type
  • schema.ts: Zod validation for pool syntax
  • loader.ts: resolvePools() with seeded PRNG (mulberry32), Fisher-Yates shuffle
  • index.ts: exports
  • pool.test.ts: 14 new tests
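
For readers unfamiliar with the two algorithms named in the loader.ts entry, here is a minimal sketch of a mulberry32 generator driving a Fisher-Yates shuffle. The function names `shuffle` and `sample`, and the fallback to `Math.random` when no seed is given, are assumptions for illustration and not necessarily what loader.ts does.

```typescript
// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy, so the caller's array is not mutated.
function shuffle<T>(items: readonly T[], random: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // 0 <= j <= i
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Hypothetical helper: pick `count` items, deterministically when seeded.
// Falling back to Math.random for unseeded pools is an assumption.
function sample<T>(items: readonly T[], count: number, seed?: number): T[] {
  const random = seed === undefined ? Math.random : mulberry32(seed);
  return shuffle(items, random).slice(0, count);
}
```

Calling `sample(bank, 3, 42)` twice returns the same three items in the same order, which is the reproducibility the `seed` option is meant to provide.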

Design

  • Pool resolution in Loader → Runner receives flat ScenarioDefinition[] (zero Runner changes)
  • Seeded PRNG for reproducible runs (no external deps)
  • Count > pool size → clamp with warning
  • Backward compatible — existing suites work unchanged
  • All 208 engine tests pass (14 new + 194 existing)
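
The flattening and clamping behaviour described above might look like the following sketch. The `Scenario`, `Pool`, and `Entry` shapes are trimmed-down stand-ins, and the injectable `sample` and `warn` parameters are assumptions made to keep the example self-contained; the real resolvePools() signature may differ.

```typescript
// Hypothetical minimal shapes, for illustration only.
interface Scenario { id: string }
interface Pool { id: string; count: number; seed?: number; scenarios: Scenario[] }
type Entry = Scenario | { pool: Pool };

type Sampler = (items: Scenario[], count: number, seed?: number) => Scenario[];

// Placeholder sampler that takes the first `count` items. The PR describes
// a seeded shuffle here; this stand-in only keeps the sketch short.
const takeFirst: Sampler = (items, count) => items.slice(0, count);

function resolvePools(
  entries: Entry[],
  sample: Sampler = takeFirst,
  warn: (message: string) => void = console.warn,
): Scenario[] {
  const flat: Scenario[] = [];
  for (const entry of entries) {
    if (!("pool" in entry)) {
      flat.push(entry); // regular scenarios pass through unchanged
      continue;
    }
    const { id, count, seed, scenarios } = entry.pool;
    let n = count;
    if (count > scenarios.length) {
      // Clamp instead of failing, as the design notes describe.
      warn(`pool "${id}": count ${count} exceeds pool size ${scenarios.length}; clamping`);
      n = scenarios.length;
    }
    flat.push(...sample(scenarios, n, seed));
  }
  return flat; // the runner only ever sees a flat scenario list
}
```

Because the result is a flat list, the runner does not need to know that pools exist.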

Adds a 'pool' mechanism to suite YAML that lets authors define pools
of scenarios with random selection at runtime. The engine picks N
scenarios from each pool, enabling:

- Anti-gaming: agents can't memorize fixed test sets
- Variety: different runs test different capabilities
- Scalability: large task banks with configurable sample sizes

YAML syntax:
  - pool:
      id: my-pool
      count: 3          # pick 3 random scenarios
      seed: 42          # optional: reproducible selection
      scenarios: [...]  # ScenarioDefinition[]

Features:
- Seeded PRNG (mulberry32) for deterministic runs
- Fisher-Yates shuffle for unbiased selection
- Count clamped to pool size (warns, doesn't error)
- Validates: no empty pools, no count=0, no ID collisions
- Pool resolution in loader — runner receives flat scenario list
- Fully backward compatible with existing suite YAML

14 new tests, all 208 engine tests pass.
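
As an illustration of the validation rules listed under Features, a dependency-free check could be sketched like this. The PR does its validation with Zod in schema.ts; the plain-TypeScript `validateEntries` below is hypothetical, and treating ID collisions as suite-wide (top-level and pooled scenarios sharing one namespace) is an assumption.

```typescript
// Hypothetical minimal shapes, for illustration only.
interface Scenario { id: string }
interface Pool { id: string; count: number; scenarios: Scenario[] }
type Entry = Scenario | { pool: Pool };

// Returns a list of human-readable problems; an empty list means valid.
function validateEntries(entries: Entry[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  const claim = (id: string, where: string): void => {
    if (seen.has(id)) errors.push(`duplicate scenario id "${id}" (${where})`);
    seen.add(id);
  };
  for (const entry of entries) {
    if (!("pool" in entry)) {
      claim(entry.id, "top level");
      continue;
    }
    const { id, count, scenarios } = entry.pool;
    if (scenarios.length === 0) errors.push(`pool "${id}" is empty`);
    if (!Number.isInteger(count) || count < 1) {
      errors.push(`pool "${id}": count must be a positive integer`);
    }
    for (const s of scenarios) claim(s.id, `pool "${id}"`);
  }
  return errors;
}
```

Note the asymmetry with clamping: a `count` larger than the pool only produces a warning at resolution time, whereas an empty pool or `count: 0` is rejected outright.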
@regevguym regevguym merged commit 738d493 into mondaycom:main Mar 19, 2026
1 check passed