Architecture
Abhishek Gahlot edited this page Mar 27, 2026 · 1 revision
+------------------------------------------+
|  Training Framework                      |
|  (TRL / verl / OpenRLHF / custom)        |
|                                          |
|  model.generate(prompt) -> completions   |
+------------------+-----------------------+
                   |
                   | completions (code strings)
                   v
+------------------------------------------+
|  DeepGym                                 |
|                                          |
|  +------------+  +-------------------+   |
|  | Environment|  | Reward Function   |   |
|  | (task +    |  | (sync / async)    |   |
|  | verifier)  |  +-------------------+   |
|  +------------+                          |
|                                          |
|  +------------------------------------+  |
|  | Executor                           |  |
|  | LocalExecutor | DaytonaSandbox     |  |
|  +------------------------------------+  |
+------------------+-----------------------+
                   |
                   | RunResult (score, cases, metrics)
                   v
+------------------------------------------+
|  Training Framework                      |
|                                          |
|  reward = result.score                   |
|  advantages = (r - mean) / std           |
|  optimizer.step(advantages)              |
+------------------------------------------+
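The `advantages = (r - mean) / std` step in the diagram can be sketched as follows. This is a minimal illustration, assuming rewards are normalized within one group of completions for the same prompt. The function name `group_advantages` and the `eps` guard are illustrative and not part of DeepGym or any training framework.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group: (r - mean) / std."""
    mu = mean(rewards)
    # Population standard deviation; a framework may use the sample version instead.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same score.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical DeepGym scores for four completions of the same prompt.
scores = [1.0, 0.0, 0.5, 0.5]
advantages = group_advantages(scores)
```

Completions that score above the group mean get a positive advantage and those below get a negative one. If every completion receives the same score, all advantages are zero and the group contributes no learning signal.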
src/deepgym/
|-- core.py               # DeepGym sync client
|-- async_core.py         # AsyncDeepGym async client
|-- models.py             # all Pydantic models
|-- sandbox.py            # DaytonaSandbox + LocalExecutor
|-- verifier.py           # Verifier model + protocol
|-- verifier_template.py  # auto-wrapper for simple verifiers
|-- registry.py           # environment loader + registry
|-- gym.py                # Gymnasium-style wrapper
|-- multi_turn.py         # multi-step episodes
|-- adversarial.py        # reward hack attack strategies
|-- reward_qa.py          # verifier auditing + risk scoring
|-- exploit_db.py         # SQLite exploit storage
|-- benchmark_ops.py      # benchmark split auditing
|-- computer_use.py       # GUI/computer interaction
|-- exceptions.py         # DeepGymError hierarchy
|-- cli.py                # CLI entry point
|-- web.py                # web debugging UI
|
|-- integrations/
|   |-- trl.py            # HuggingFace TRL reward functions
|   |-- verl.py           # ByteDance verl compute_score
|   |-- openrlhf.py       # OpenRLHF FastAPI router
|   |-- lm_eval.py        # lm-evaluation-harness tasks
|   |-- hf.py             # HuggingFace Hub push/load
|   |-- reward.py         # universal RewardFunction
|
|-- api/
|   |-- app.py            # FastAPI application
|   |-- routes.py         # API endpoints
|   |-- schemas.py        # request/response models
|   |-- deps.py           # dependency injection
|
|-- envs/                 # 24 built-in environments
    |-- coin_change/
    |   |-- task.md
    |   |-- verifier.py
    |   |-- solution.py
    |   |-- metadata.json
    |-- two_sum/
    |-- ...
1. DeepGym.run(env, model_output)
2. Pick executor (LocalExecutor or DaytonaSandbox)
3. Write solution to temp file (or upload to sandbox)
4. Wrap verifier_code with template (if inline)
5. Execute: python verifier.py solution.py
6. Parse JSON stdout -> VerifierResult
7. Build RunResult (score, cases, timing)
8. Cleanup sandbox / temp files
9. Return RunResult
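Steps 3 through 8 of the single-run flow can be sketched with the standard library alone. This is an illustration of the local path only, not DeepGym's `LocalExecutor`. The names `run_once` and `TOY_VERIFIER` are hypothetical, and the result is returned as a plain dict rather than a `RunResult`.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_once(verifier_source: str, solution_source: str, timeout: float = 30.0) -> dict:
    """Write both files to a temp dir, run the verifier, parse its JSON stdout."""
    with tempfile.TemporaryDirectory() as tmp:  # directory is removed on exit (step 8)
        solution = Path(tmp) / "solution.py"
        verifier = Path(tmp) / "verifier.py"
        solution.write_text(solution_source)    # step 3
        verifier.write_text(verifier_source)
        proc = subprocess.run(                  # step 5: python verifier.py solution.py
            [sys.executable, str(verifier), str(solution)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # Step 6: the verifier is expected to print a single JSON object to stdout.
    return json.loads(proc.stdout)


# Hypothetical verifier: loads the solution file and checks one function.
TOY_VERIFIER = """
import json, runpy, sys
namespace = runpy.run_path(sys.argv[1])
ok = namespace["add"](2, 3) == 5
print(json.dumps({"score": 1.0 if ok else 0.0, "passed": ok}))
"""

result = run_once(TOY_VERIFIER, "def add(a, b):\n    return a + b\n")
```

A real executor would also need to handle a verifier that crashes or prints malformed JSON. This sketch lets those errors propagate.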
1. DeepGym.run_batch(env, outputs, max_parallel)
2. Create ThreadPoolExecutor (sync) or Semaphore (async)
3. Submit N run() calls with concurrency limit
4. Collect results (gather with return_exceptions=True)
5. Compute aggregate stats (total, passed, avg_score)
6. Return BatchResult
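The sync branch of the batch flow can be sketched with `ThreadPoolExecutor`. This is a simplified illustration, not DeepGym's `run_batch`. It assumes `run` is any callable returning a float score, captures exceptions in the result list to mirror `return_exceptions=True`, and computes `avg_score` over successful runs only. That last choice is an assumption the source does not specify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_batch(run: Callable[[str], float], outputs: list[str], max_parallel: int = 4) -> dict:
    """Score every output with at most max_parallel concurrent run() calls."""

    def safe_run(output: str):
        try:
            return run(output)
        except Exception as exc:  # keep the failure in place instead of aborting the batch
            return exc

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(safe_run, outputs))  # map preserves input order

    scores = [r for r in results if not isinstance(r, Exception)]
    return {
        "total": len(results),
        "passed": sum(1 for s in scores if s >= 1.0),
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
        "results": results,
    }


def fake_run(output: str) -> float:
    """Hypothetical stand-in for a single DeepGym run."""
    if output == "boom":
        raise ValueError("verifier crashed")
    return 1.0 if output == "ok" else 0.0


batch = run_batch(fake_run, ["ok", "bad", "boom", "ok"], max_parallel=2)
```

The async branch would follow the same shape, with an `asyncio.Semaphore` limiting concurrency and `asyncio.gather(..., return_exceptions=True)` collecting results.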
1. User writes function body (inline verifier_code)
2. verifier_template.py wraps it into:
       def _run_verifier(solution_path, test_cases_path=None):
           <user code>
3. Wrapper catches exceptions, normalizes return type:
   - bool  -> {"score": 1.0/0.0, "passed": true/false}
   - float -> {"score": <clamped>, "passed": score >= 1.0}
   - dict  -> validated against VerifierResult schema
4. JSON printed to stdout
5. Exit code set (0/1/2)
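The normalization in step 3 might look roughly like the sketch below. It is not the code in `verifier_template.py`. The names `normalize` and `run_wrapped` are hypothetical, the `error` key and the NaN handling are assumptions, and a minimal key check stands in for validation against the `VerifierResult` schema.

```python
import json


def clamp(value: float) -> float:
    """Clamp a score to [0.0, 1.0]; NaN is treated as 0.0 (an assumption)."""
    value = float(value)
    if value != value:  # NaN is the only float not equal to itself
        return 0.0
    return max(0.0, min(1.0, value))


def normalize(raw) -> dict:
    """Map a verifier's bool, float, or dict return value to a result dict."""
    # bool must be tested first because bool is a subclass of int in Python.
    if isinstance(raw, bool):
        return {"score": 1.0 if raw else 0.0, "passed": raw}
    if isinstance(raw, (int, float)):
        score = clamp(raw)
        return {"score": score, "passed": score >= 1.0}
    if isinstance(raw, dict):
        if "score" not in raw:
            raise ValueError("verifier dict must contain 'score'")
        score = clamp(raw["score"])
        return {**raw, "score": score, "passed": bool(raw.get("passed", score >= 1.0))}
    raise TypeError(f"unsupported verifier return type: {type(raw).__name__}")


def run_wrapped(verifier_fn, solution_path: str) -> str:
    """Call the verifier, catch any exception, and return the JSON for stdout."""
    try:
        result = normalize(verifier_fn(solution_path))
    except Exception as exc:
        result = {"score": 0.0, "passed": False, "error": str(exc)}
    return json.dumps(result)


payload = run_wrapped(lambda path: 1.7, "solution.py")
```

Checking `bool` before the numeric types matters here: otherwise `True` would be treated as the number 1 and take the float branch.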
Pydantic everywhere. All data models use Pydantic for validation. No dataclasses, no plain dicts at boundaries.
Protocol over framework. Verifiers are standalone scripts that print JSON. No SDK dependency. Any language could be a verifier.
Executor abstraction. LocalExecutor and DaytonaSandbox share the same interface. Switching modes is a one-parameter change.
Score clamping. All scores clamped to [0.0, 1.0]. No negative rewards, no scores above 1.
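Deterministic seeding. Verifiers use fixed seeds for random test generation, so the same solution always receives the same score. GRPO depends on this: it compares completions against each other within a group, and verifier noise would be indistinguishable from real quality differences.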
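The deterministic-seeding principle can be illustrated with a verifier that generates its random test cases from a fixed seed. This is a sketch under assumptions: the seed value, the `make_cases` and `score` helpers, and the summation task are all hypothetical and not taken from a built-in environment.

```python
import random


def make_cases(seed: int = 1234, n: int = 5) -> list[tuple[list[int], int]]:
    """Generate the same random test cases on every call."""
    rng = random.Random(seed)  # local generator, independent of the global random state
    cases = []
    for _ in range(n):
        nums = [rng.randint(-50, 50) for _ in range(rng.randint(1, 8))]
        cases.append((nums, sum(nums)))  # (input, expected output)
    return cases


def score(solution) -> float:
    """Fraction of seeded cases the solution gets right, always in [0.0, 1.0]."""
    cases = make_cases()
    passed = sum(1 for nums, expected in cases if solution(nums) == expected)
    return passed / len(cases)


first = score(sum)
second = score(sum)
```

Because the generator is created from a fixed seed inside `make_cases`, repeated runs see identical inputs, so a given solution cannot score differently from one run to the next.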