Architecture

Abhishek Gahlot edited this page Mar 27, 2026 · 1 revision

How it fits together

+------------------------------------------+
|        Training Framework                |
|   (TRL / verl / OpenRLHF / custom)       |
|                                          |
|   model.generate(prompt) -> completions  |
+------------------+-----------------------+
                   |
                   | completions (code strings)
                   v
+------------------------------------------+
|              DeepGym                     |
|                                          |
|  +------------+  +-------------------+   |
|  | Environment|  | Reward Function   |   |
|  | (task +    |  | (sync / async)    |   |
|  |  verifier) |  +-------------------+   |
|  +------------+                          |
|                                          |
|  +------------------------------------+  |
|  |           Executor                 |  |
|  |  LocalExecutor | DaytonaSandbox    |  |
|  +------------------------------------+  |
+------------------+-----------------------+
                   |
                   | RunResult (score, cases, metrics)
                   v
+------------------------------------------+
|        Training Framework                |
|                                          |
|   reward = result.score                  |
|   advantages = (r - mean) / std          |
|   optimizer.step(advantages)             |
+------------------------------------------+
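
The `advantages = (r - mean) / std` line in the last box can be sketched in plain Python. This is an illustration, not DeepGym or training-framework code: the function name `group_advantages`, the `eps` guard, and the choice of population standard deviation are all assumptions, since the diagram does not specify them.

```python
from statistics import mean, pstdev


def group_advantages(scores: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize one group of rewards as (r - mean) / std.

    `eps` is an assumption added here so that a group where every
    completion gets the same score does not divide by zero.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population std over the group (assumed)
    return [(r - mu) / (sigma + eps) for r in scores]


# Four completions for one prompt, each scored in [0.0, 1.0].
advantages = group_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions that score above the group mean get a positive advantage, those below get a negative one, and a group with identical scores yields all zeros.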

Module map

src/deepgym/
|-- core.py              # DeepGym sync client
|-- async_core.py        # AsyncDeepGym async client
|-- models.py            # all Pydantic models
|-- sandbox.py           # DaytonaSandbox + LocalExecutor
|-- verifier.py          # Verifier model + protocol
|-- verifier_template.py # auto-wrapper for simple verifiers
|-- registry.py          # environment loader + registry
|-- gym.py               # Gymnasium-style wrapper
|-- multi_turn.py        # multi-step episodes
|-- adversarial.py       # reward hack attack strategies
|-- reward_qa.py         # verifier auditing + risk scoring
|-- exploit_db.py        # SQLite exploit storage
|-- benchmark_ops.py     # benchmark split auditing
|-- computer_use.py      # GUI/computer interaction
|-- exceptions.py        # DeepGymError hierarchy
|-- cli.py               # CLI entry point
|-- web.py               # web debugging UI
|
|-- integrations/
|   |-- trl.py           # HuggingFace TRL reward functions
|   |-- verl.py          # ByteDance verl compute_score
|   |-- openrlhf.py      # OpenRLHF FastAPI router
|   |-- lm_eval.py       # lm-evaluation-harness tasks
|   |-- hf.py            # HuggingFace Hub push/load
|   |-- reward.py        # universal RewardFunction
|
|-- api/
|   |-- app.py           # FastAPI application
|   |-- routes.py        # API endpoints
|   |-- schemas.py       # request/response models
|   |-- deps.py          # dependency injection
|
|-- envs/                # 24 built-in environments
    |-- coin_change/
    |   |-- task.md
    |   |-- verifier.py
    |   |-- solution.py
    |   |-- metadata.json
    |-- two_sum/
    |-- ...

Data flow

Single run

1. DeepGym.run(env, model_output)
2. Pick executor (LocalExecutor or DaytonaSandbox)
3. Write solution to temp file (or upload to sandbox)
4. Wrap verifier_code with template (if inline)
5. Execute: python verifier.py solution.py
6. Parse JSON stdout -> VerifierResult
7. Build RunResult (score, cases, timing)
8. Cleanup sandbox / temp files
9. Return RunResult
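
Steps 3 to 8 can be sketched with the standard library alone. This is a minimal illustration of the local path, not DeepGym's implementation: `run_once`, `VERIFIER_SRC`, and the toy `add` task are hypothetical, and the sketch returns a plain dict where the real client builds a `RunResult`.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Toy standalone verifier (hypothetical): loads the solution file given on
# the command line, checks one case, and prints a JSON result to stdout.
VERIFIER_SRC = '''
import json, runpy, sys
ns = runpy.run_path(sys.argv[1])
ok = ns["add"](2, 3) == 5
print(json.dumps({"score": 1.0 if ok else 0.0, "passed": ok}))
'''


def run_once(solution_src: str, timeout: float = 10.0) -> dict:
    """Write the files, run `python verifier.py solution.py`, parse stdout."""
    with tempfile.TemporaryDirectory() as tmp:  # deleted on exit (step 8)
        solution = Path(tmp) / "solution.py"
        verifier = Path(tmp) / "verifier.py"
        solution.write_text(solution_src)  # step 3
        verifier.write_text(VERIFIER_SRC)
        proc = subprocess.run(  # step 5
            [sys.executable, str(verifier), str(solution)],
            capture_output=True, text=True, timeout=timeout,
        )
        return json.loads(proc.stdout)  # step 6


result = run_once("def add(a, b):\n    return a + b\n")
```

The sketch assumes the verifier always prints valid JSON; a crashing verifier would make `json.loads` raise, which a real client would need to handle.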

Batch run

1. DeepGym.run_batch(env, outputs, max_parallel)
2. Create ThreadPoolExecutor (sync) or Semaphore (async)
3. Submit N run() calls with concurrency limit
4. Collect results (on the async path, asyncio.gather with return_exceptions=True)
5. Compute aggregate stats (total, passed, avg_score)
6. Return BatchResult
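
The async variant of these steps can be sketched as follows. The names `run_batch_sketch` and `fake_run` are hypothetical, the result is a plain dict rather than a `BatchResult`, and two choices are assumptions made for the example: a failed run counts as score 0.0, and "passed" means a score of at least 1.0.

```python
import asyncio


async def run_batch_sketch(outputs, run_one, max_parallel=4):
    """Run `run_one` over all outputs with at most `max_parallel` in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(output):
        async with sem:  # concurrency limit (step 3)
            return await run_one(output)

    # With return_exceptions=True a failed run comes back as an exception
    # object in the results list instead of raising out of gather().
    results = await asyncio.gather(
        *(guarded(o) for o in outputs), return_exceptions=True
    )
    # Assumption: a run that raised counts as score 0.0.
    scores = [0.0 if isinstance(r, BaseException) else r for r in results]
    return {
        "total": len(scores),
        "passed": sum(1 for s in scores if s >= 1.0),
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
    }


async def fake_run(output: str) -> float:
    """Stand-in for a single run: returns a score or raises."""
    await asyncio.sleep(0)
    if output == "crash":
        raise RuntimeError("sandbox failed")
    return 1.0 if output == "good" else 0.0


stats = asyncio.run(
    run_batch_sketch(["good", "bad", "crash", "good"], fake_run, max_parallel=2)
)
```

The sync path would follow the same shape with a `ThreadPoolExecutor(max_workers=max_parallel)` in place of the semaphore.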

Verifier wrapping

1. User writes function body (inline verifier_code)
2. verifier_template.py wraps it into:
   def _run_verifier(solution_path, test_cases_path=None):
       <user code>
3. Wrapper catches exceptions, normalizes return type:
   - bool -> {"score": 1.0/0.0, "passed": true/false}
   - float -> {"score": <clamped>, "passed": score >= 1.0}
   - dict -> validated against VerifierResult schema
4. JSON printed to stdout
5. Exit code set (0/1/2)
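
The normalization in step 3 might look roughly like the sketch below. It is not the code in verifier_template.py: `normalize` and `run_wrapped` are hypothetical names, the dict branch only checks for a "score" key instead of validating against the real VerifierResult schema, and mapping an exception to score 0.0 with an "error" field is an assumption. Exit codes are left out because their meanings are not documented here.

```python
def normalize(raw) -> dict:
    """Map a verifier's return value to a {"score", "passed"} dict."""
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(raw, bool):
        return {"score": 1.0 if raw else 0.0, "passed": raw}
    # Accepting int as well as float is an assumption of this sketch.
    if isinstance(raw, (int, float)):
        score = min(1.0, max(0.0, float(raw)))  # clamp to [0.0, 1.0]
        return {"score": score, "passed": score >= 1.0}
    if isinstance(raw, dict):
        if "score" not in raw:
            raise ValueError("verifier dict must contain 'score'")
        score = min(1.0, max(0.0, float(raw["score"])))
        return {**raw, "score": score, "passed": raw.get("passed", score >= 1.0)}
    raise TypeError(f"unsupported verifier return type: {type(raw).__name__}")


def run_wrapped(verifier_fn, *args) -> dict:
    """Call the user's verifier and never let an exception escape."""
    try:
        return normalize(verifier_fn(*args))
    except Exception as exc:  # assumed behavior: failure scores 0.0
        return {"score": 0.0, "passed": False, "error": str(exc)}
```

For example, `normalize(1.7)` clamps to a score of 1.0, and `run_wrapped` turns a verifier that raises into a failing result rather than a crash.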

Design decisions

Pydantic everywhere. All data models use Pydantic for validation. No dataclasses, no plain dicts at boundaries.

Protocol over framework. Verifiers are standalone scripts that print JSON, with no SDK dependency. In principle, a verifier could be written in any language.

Executor abstraction. LocalExecutor and DaytonaSandbox share the same interface. Switching modes is a one-parameter change.
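
One way to picture a shared executor interface is a `typing.Protocol` with interchangeable backends. Everything in this sketch is hypothetical (`Executor`, `execute`, `LocalSketch`, `SandboxSketch`, `make_executor`, and the `mode` parameter); the actual method names and the parameter that selects the backend are not stated on this page.

```python
import subprocess
import sys
from typing import Protocol


class Executor(Protocol):
    """Hypothetical shared interface: run a command, return its stdout."""

    def execute(self, argv: list[str]) -> str: ...


class LocalSketch:
    """Runs the command as a local subprocess."""

    def execute(self, argv: list[str]) -> str:
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        return done.stdout


class SandboxSketch:
    """Stand-in for a remote sandbox backend; deliberately not implemented."""

    def execute(self, argv: list[str]) -> str:
        raise NotImplementedError("remote execution is out of scope here")


def make_executor(mode: str) -> Executor:
    """The single parameter that picks the backend (name is an assumption)."""
    return {"local": LocalSketch, "sandbox": SandboxSketch}[mode]()


out = make_executor("local").execute([sys.executable, "-c", "print('ok')"])
```

Because callers only depend on `execute`, swapping `"local"` for `"sandbox"` changes where the code runs without touching the calling code.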

Score clamping. All scores are clamped to [0.0, 1.0], so there are no negative rewards and no scores above 1.

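Deterministic seeding. Verifiers use fixed seeds for random test generation, so the same solution always gets the same score. GRPO relies on this: it compares scores within a group of completions, so score differences should come from the solutions, not from verifier randomness.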
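
A fixed-seed test generator can be sketched like this. The function `make_cases` and the (nums, target) case shape are hypothetical, loosely modeled on a two-sum style task; the built-in verifiers may generate cases differently.

```python
import random


def make_cases(seed: int = 0, n: int = 5) -> list[tuple[list[int], int]]:
    """Generate (nums, target) cases from a private, fixed-seed RNG."""
    rng = random.Random(seed)  # local RNG; global random state is untouched
    cases = []
    for _ in range(n):
        nums = [rng.randint(-50, 50) for _ in range(rng.randint(2, 8))]
        i, j = rng.sample(range(len(nums)), 2)  # two distinct indices
        cases.append((nums, nums[i] + nums[j]))
    return cases


# Two calls with the same seed produce identical test cases.
first = make_cases()
second = make_cases()
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the verifier's test cases independent of anything the solution under test does to the global generator.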
