Add NewtonBench Resource Server #650

Kelvin0110 · 2026-02-05T07:53:47Z

Contributing To NeMo-Gym (NewtonBench Resource Server)

1) Basic information

i. Description of the environment

A resource server wrapping the NewtonBench benchmark.

Tasks: 324 scientific law discovery tasks across 12 physics domains.
Observation Space: Experimental results (numeric or structured dictionaries) returned after tool use.
Tools:
- run_experiment: Query the environment with specific parameters to receive physical observations.
- execute_python: (Optional) Python code-assisted discovery for complex data analysis.
Server: FastAPI resource server following NeMo Gym conventions.

ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

Law Extraction: Attempts to find a law within <final_law> tags in the assistant's final response.
Success Criteria: Evaluates both symbolic equivalence (via an LLM judge) and numeric accuracy (Root Mean Squared Logarithmic Error, RMSLE).
Reward Calculation:
- $\text{reward} = 0.3 \cdot R_{symbolic} + 0.7 \cdot R_{numeric}$.
  - $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
  - $R_{numeric} = 1.0 - (2.0 \cdot \text{RMSLE} / (\text{RMSLE} + 3.0))$, yielding a score in $(-1, 1]$.
The /verify endpoint processes the agent's submission and returns these detailed performance metrics.
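
The extraction and scoring steps described above can be sketched as follows. This is a minimal illustration, not the code in app.py: the function names are hypothetical, and taking the last `<final_law>` block when several are present is an assumption. The symbolic equivalence flag is treated as an input here because it comes from the LLM judge.

```python
import re

# Matches the content of a <final_law>...</final_law> block, across newlines.
FINAL_LAW_RE = re.compile(r"<final_law>(.*?)</final_law>", re.DOTALL)


def extract_final_law(text):
    """Return the content of the last <final_law> block, or None if absent.

    Choosing the last block is an assumption made for this sketch.
    """
    matches = FINAL_LAW_RE.findall(text)
    return matches[-1].strip() if matches else None


def symbolic_reward(is_equivalent):
    """R_symbolic: 1.0 if the judge says equivalent, -1.0 otherwise."""
    return 1.0 if is_equivalent else -1.0


def numeric_reward(rmsle):
    """R_numeric = 1 - 2*RMSLE / (RMSLE + 3), which lies in (-1, 1] for RMSLE >= 0."""
    if rmsle < 0:
        raise ValueError("RMSLE must be non-negative")
    return 1.0 - (2.0 * rmsle / (rmsle + 3.0))


def combined_reward(is_equivalent, rmsle):
    """reward = 0.3 * R_symbolic + 0.7 * R_numeric."""
    return 0.3 * symbolic_reward(is_equivalent) + 0.7 * numeric_reward(rmsle)
```

Under this formula, a symbolically equivalent law with RMSLE of 0 scores 1.0, a non-equivalent law with RMSLE of 0 scores about 0.4, and a non-equivalent law with RMSLE of 3 scores about -0.3, since $R_{numeric}$ crosses zero at RMSLE = 3.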

iii. Description of the prompts/tasks (source + domain)

Domain: Maths (Scientific Law Discovery).
Source: Tasks and prompts adapted from the NewtonBench benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.

iv. License information

Code: Apache 2.0.
Data: Apache 2.0
NewtonBench Benchmark: MIT (Copyright (c) 2025 HKUST-KnowComp).

2) Environment validity check

i. Commands used to collect rollouts

# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl

ii. Resulting rollouts (5 examples)

See resources_servers/newton_bench/data/example_rollouts.jsonl
Expected behavior:

Agent performs several experiments, analyzes data, and submits a scientific law.
Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.
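
To check collected rollouts against this expected behavior, a small summary script can help. The sketch below assumes each JSONL record carries a top-level `reward` field; that field name is an assumption and should be adjusted to the actual rollout schema.

```python
import json
from statistics import mean


def summarize_rewards(lines):
    """Summarize rewards from an iterable of JSONL lines.

    Assumes each record has a top-level "reward" key (hypothetical schema);
    blank lines and records without that key are skipped.
    """
    rewards = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "reward" in record:
            rewards.append(float(record["reward"]))
    if not rewards:
        return {"count": 0}
    return {
        "count": len(rewards),
        "mean": mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
    }
```

To use it, open resources_servers/newton_bench/data/example_rollouts.jsonl and pass the file object to `summarize_rewards`.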

3) Tests

i. Commands used to run the tests

source resources_servers/newton_bench/.venv/bin/activate 
pytest resources_servers/newton_bench/tests/test_app.py

Coverage notes:
Resource server tests provide comprehensive coverage of the following areas:

Session Lifecycle: Successful seeding, error handling for invalid modules, session ending, and background cleanup.
Experiment Execution: Dynamic handler registration for each module, basic run_experiment execution, and error handling for uninitialized sessions, mismatched module calls, etc.
Python Sandbox: Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
Verification Logic: Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE.
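
As an illustration of the kind of security validation the sandbox tests exercise, the sketch below statically scans submitted code for blocked imports and calls using Python's `ast` module. It is not the server's implementation: the blocklists and the function name are hypothetical, and a static scan alone is not a complete sandbox.

```python
import ast

# Hypothetical blocklists, chosen only for this sketch.
BLOCKED_MODULES = {"os", "sys", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}


def find_violations(source):
    """Return a list of human-readable violations found in the source code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b" is checked against its top-level package "a".
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in BLOCKED_MODULES:
                    violations.append(f"import of '{root}'")
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            root = (node.module or "").split(".")[0]
            if root in BLOCKED_MODULES:
                violations.append(f"import of '{root}'")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"call to '{node.func.id}'")
    return violations
```

A check like this is cheap to run before execution, but it should be combined with process isolation and timeouts, since name-based scans can be bypassed.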

4) Reward profiling

Models: Qwen/Qwen3-VL-8B-Thinking

Method:

108 prompts based on version v0 of scientific laws.
4 rollouts per prompt (432 total).
The run_experiment tool is enabled, and the agent loops until it submits a law.

Results:
Overall Metrics

Total Rollouts: 432
Mean Reward: $\approx$ 0.0675
Median Reward: 0.0
Min Reward: $\approx$ -0.8786
Max Reward: 1.0

Tool Call Statistics

Average Tool Calls: 22.95 per rollout
Min Tool Calls: 0
Max Tool Calls: 1770
Correlation (tool calls $\leftrightarrow$ reward): $\approx$ -0.0211 (negligible, slightly negative correlation)
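
For reference, a Pearson correlation between per-rollout tool-call counts and rewards can be computed as in the sketch below. Whether the reported figure is a Pearson coefficient is an assumption; the PR does not say which coefficient was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Pearson correlation is sensitive to outliers such as the 1770-call rollout, so a rank-based coefficient may be a useful cross-check.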

Reward Distribution (Buckets)

| Reward Range | Count |
| --- | --- |
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |
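
The bucket boundaries above (half-open intervals, with the last one closed at 1.0) can be reproduced with a lookup over explicit edges, which avoids floating-point drift from dividing by the bin width. This is a sketch with hypothetical names; the counts are copied from the table.

```python
from bisect import bisect_right

# Interior bucket edges; bucket i covers [edge[i-1], edge[i]).
EDGES = [-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]

# Counts copied from the reward distribution table above.
TABLE_COUNTS = [16, 16, 60, 39, 24, 150, 46, 2, 1, 78]


def bucket_index(reward):
    """Return the 0-based bucket index for a reward in [-1.0, 1.0]."""
    if not -1.0 <= reward <= 1.0:
        raise ValueError("reward outside [-1.0, 1.0]")
    # bisect_right places a value equal to an edge in the bucket that starts there;
    # 1.0 falls in the final bucket, which is closed on the right.
    return bisect_right(EDGES, reward)


def histogram(rewards):
    """Count rewards per bucket."""
    counts = [0] * (len(EDGES) + 1)
    for r in rewards:
        counts[bucket_index(r)] += 1
    return counts
```

The table counts sum to 432, matching the total number of rollouts.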

Performance by Tool Call Count Bins

| Tool Call Range | Rollouts (n) | Mean Reward |
| --- | --- | --- |
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |
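
As a consistency check, the per-bin means can be combined into a rollout-weighted mean and compared with the overall mean reward reported earlier. The sketch below uses only the values from the table; the variable names are illustrative.

```python
# (tool call range, rollouts, mean reward), copied from the table above.
BINS = [
    ("0", 23, -0.1112),
    ("1-10", 329, 0.0824),
    ("11-50", 60, 0.1308),
    ("51-200", 15, -0.1959),
    ("201-2000", 5, -0.0600),
]

total_rollouts = sum(n for _, n, _ in BINS)

# Weight each bin's mean by its rollout count to recover the overall mean.
overall_mean = sum(n * m for _, n, m in BINS) / total_rollouts

# Number of rollouts in the two bins above 50 tool calls.
rollouts_over_50 = sum(n for label, n, _ in BINS if label in ("51-200", "201-2000"))
```

The weighted mean agrees with the reported overall mean of about 0.0675 across 432 rollouts. The two bins above 50 calls hold 20 rollouts, so their means rest on a small sample.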

Key observations:

Symbolic Accuracy: Approximately 19.7% symbolic accuracy and a widespread RMSLE distribution indicate frequent failures to recover exact symbolic forms or precise numeric behavior.
Reward Distribution: Rewards cluster near zero (median 0.0, mean ~0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
Tool Usage Sweet Spot: Positive performance is observed with moderate tool use (1–50 calls), with a peak in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
Diminishing Returns: Mean reward turns negative beyond 50 calls (the 51–200 and 201–2000 bins, 20 rollouts in total), suggesting that additional tool calls stop helping and that successful discovery depends more on reasoning and hypothesis selection than on raw data volume.

Signed-off-by: cmunley1 <[email protected]>

@bxyu-nvidia

commit 647d1e5 Author: fsiino-nvidia <[email protected]> Date: Fri Dec 19 18:40:39 2025 -0800 Remove PlainTextResponse response_class (NVIDIA-NeMo#544) https://nvidia.slack.com/archives/C08TG7CLEGY/p1766191655660079 Initially in NVIDIA-NeMo#290 , the `response_class=PlainTextResponse` was added to the `/global_config_dict_yaml` endpoint of the HeadServer as an attempt to debug parsing server info for the `ng_status` command. This lead to a parsing error in `load_from_global_config`. This command now uses it's own separate endpoint `server_instances`, so this needs to be removed. Signed-off-by: Frankie Siino <[email protected]> commit f250e0c Author: cmunley1 <[email protected]> Date: Fri Dec 19 16:38:29 2025 -0800 docs: remove trl docs (NVIDIA-NeMo#543) remove trl from docs, leaving just unsloth. was unclear that they are together. will make a trl section when we have a standalone trl notebook, or a section on trl's docs too. --------- Signed-off-by: Christian Munley <[email protected]> commit 34a2b0f Author: cmunley1 <[email protected]> Date: Fri Dec 19 14:01:56 2025 -0800 add unsloth and trl to docs (NVIDIA-NeMo#536) adds a section for single-step training with unsloth and trl not sure if these should be broken into separate sections. Left as one since the same notebook works for both, but could be confusing. not sure if we should also add more info about multi-step (hopefully) coming soon. 
Signed-off-by: Christian Munley <[email protected]> commit 146b1a5 Author: cmunley1 <[email protected]> Date: Fri Dec 19 12:56:33 2025 -0800 python flag for colab venv installation (NVIDIA-NeMo#526) need to set uv pip install python flag in colab environments when launching servers usage: `ng_run "+config_paths=[...]" +uv_pip_set_python=true ` defaults to false For NVIDIA-NeMo#370 Needed for notebook here: https://docs.unsloth.ai/models/nemotron-3#reinforcement-learning--nemo-gym --------- Signed-off-by: Christian Munley <[email protected]> commit ba2153a Author: cmunley1 <[email protected]> Date: Fri Dec 19 10:42:44 2025 -0800 Salesforce xlam-function-calling-60k resources server (NVIDIA-NeMo#262) function calling resources server based on https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k --------- Signed-off-by: Christian Munley <[email protected]> Signed-off-by: cmunley1 <[email protected]> commit 29d3511 Author: pjin-nvidia <[email protected]> Date: Fri Dec 19 10:28:28 2025 -0800 VLLMModel supports chat template kwargs (NVIDIA-NeMo#538) Signed-off-by: Brian Yu <[email protected]> Signed-off-by: Peter Jin <[email protected]> commit 7d8fdda Author: fsiino-nvidia <[email protected]> Date: Wed Dec 17 18:38:18 2025 -0800 List running server health and status (NVIDIA-NeMo#290) This implements the `ng_status` command to list all running servers on the system and ping for health check. --------- Signed-off-by: Frankie Siino <[email protected]> commit 076d002 Author: fsiino-nvidia <[email protected]> Date: Tue Dec 16 10:25:14 2025 -0800 Debug server package versions (NVIDIA-NeMo#406) Adds `ng_pip_list` command to see the underlying uv pip list of the specified environment. 
--------- Signed-off-by: Frankie Siino <[email protected]> commit c192ee4 Author: Lawrence Lane <[email protected]> Date: Tue Dec 16 12:19:31 2025 -0500 docs settings update (NVIDIA-NeMo#525) Signed-off-by: Lawrence Lane <[email protected]> commit 8ca39d6 Author: bxyu-nvidia <[email protected]> Date: Mon Dec 15 19:56:03 2025 -0800 docs: Miscellaneous GRPO tutorial fixes (NVIDIA-NeMo#512) Signed-off-by: Brian Yu <[email protected]> commit 1539b2b Author: Lawrence Lane <[email protected]> Date: Mon Dec 15 18:28:11 2025 -0500 docs: redirect setup (NVIDIA-NeMo#513) Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: Brian Yu <[email protected]> Co-authored-by: Brian Yu <[email protected]> commit 96ccdfc Author: cmunley1 <[email protected]> Date: Mon Dec 15 14:31:59 2025 -0800 reasoning-gym resource server (NVIDIA-NeMo#113) single turn tasks across various domains: "Reasoning Gym is a community-created Python library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). The goal is to generate virtually infinite training data with adjustable complexity. It currently provides more than 100 tasks over many domains, including but not limited to algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games." Tested all 100+ environments for errors, and tested training on many, demonstrated convergence. 
This dataset of 100+ environments is also used in ProRL (https://arxiv.org/abs/2505.24864) --------- Signed-off-by: cmunley1 <[email protected]> Signed-off-by: Christian Munley <[email protected]> Co-authored-by: ARC Bot <[email protected]> commit 8c4c5e3 Author: bxyu-nvidia <[email protected]> Date: Sun Dec 14 16:38:21 2025 -0800 Bump to v0.2.0 (NVIDIA-NeMo#510) Signed-off-by: Brian Yu <[email protected]> commit 3897ff4 Author: bxyu-nvidia <[email protected]> Date: Sun Dec 14 16:28:58 2025 -0800 Change to v0.1.1 release version (NVIDIA-NeMo#509) Signed-off-by: Brian Yu <[email protected]> commit b1bf0f4 Author: bxyu-nvidia <[email protected]> Date: Sun Dec 14 16:24:49 2025 -0800 Update dataset configs with HuggingFace links (NVIDIA-NeMo#508) Signed-off-by: Brian Yu <[email protected]> commit 9a9177e Author: bxyu-nvidia <[email protected]> Date: Sun Dec 14 16:12:06 2025 -0800 docs: End-to-end GRPO Training with NeMo RL tutorial [master branch] (NVIDIA-NeMo#481) Signed-off-by: Brian Yu <[email protected]> Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: Frankie Siino <[email protected]> Co-authored-by: L.B. <[email protected]> Co-authored-by: Frankie Siino <[email protected]> commit d3646c5 Author: Chris Wing <[email protected]> Date: Fri Dec 12 12:20:25 2025 -0800 Reorder README structure (NVIDIA-NeMo#501) move available environments higher up in the README after the quickstart Signed-off-by: Chris Wing <[email protected]> commit b9cf8b2 Author: Chris Wing <[email protected]> Date: Fri Dec 12 08:13:32 2025 -0800 Simplify contributing.md (NVIDIA-NeMo#500) added links to contribute section of docs site and removed redundant content. 
links need to be verified after NVIDIA-NeMo#498 is merged to main --------- Signed-off-by: Chris Wing <[email protected]> Signed-off-by: Lawrence Lane <[email protected]> Co-authored-by: Lawrence Lane <[email protected]> commit eabcbcf Author: Chris Wing <[email protected]> Date: Fri Dec 12 07:43:04 2025 -0800 FAQ cleanup (NVIDIA-NeMo#499) This PR removes redundant content from the FAQ and better organizes the documentation structure. **Removed redundant FAQ sections** now covered in dedicated documentation: - `ng_version` → `docs/reference/cli-commands.md` - Config anatomy → `docs/reference/configuration.md` (section was incomplete TODO) - DCO and commit signing → `CONTRIBUTING.md` and `docs/contribute/development-setup.md` - Copyright errors → `docs/contribute/development-setup.md` - CI/CD requirements → `docs/contribute/development-setup.md` **Reorganized FAQ placement:** - Moved `docs/how-to-faq.md` → `docs/reference/faq.md` (consistent with other reference docs) - Repositioned FAQ to bottom of Reference section (after Configuration, CLI Commands, API Reference) - Updated intro to clarify FAQ provides quick answers while comprehensive docs are developed --------- Signed-off-by: Chris Wing <[email protected]> Co-authored-by: Lawrence Lane <[email protected]> commit fc59615 Author: Chris Wing <[email protected]> Date: Fri Dec 12 07:38:48 2025 -0800 Add environment contribution docs (NVIDIA-NeMo#498) Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: Chris Wing <[email protected]> Co-authored-by: Lawrence Lane <[email protected]> Co-authored-by: Copilot <[email protected]> commit 39ee39e Author: Chris Wing <[email protected]> Date: Thu Dec 11 15:52:13 2025 -0800 Docs: Contribution Home & Dev Setup (NVIDIA-NeMo#494) Added types of contribution to contribution overview and replicated dev setup instructions from contributing.md to docs --------- Signed-off-by: Chris Wing <[email protected]> Signed-off-by: Lawrence Lane <[email protected]> Co-authored-by: 
Lawrence Lane <[email protected]> commit aa48c20 Author: Chris Wing <[email protected]> Date: Thu Dec 11 14:16:47 2025 -0800 improve framing of training framework integration guide for contributing (NVIDIA-NeMo#493) Make it more clear this guide is for contributing training framework integrations Signed-off-by: Chris Wing <[email protected]> commit a4cfd5e Author: pjin-nvidia <[email protected]> Date: Thu Dec 11 13:31:09 2025 -0800 Misc rollout fixes (NVIDIA-NeMo#447) Signed-off-by: Peter Jin <[email protected]> commit def5fdd Author: L.B. <[email protected]> Date: Thu Dec 11 15:00:38 2025 -0500 docs: contribute section (NVIDIA-NeMo#490) - move training content into new contribute section - create contributing overview page - add contributing section on home page with link to RL integrations content hub --------- Signed-off-by: Lawrence Lane <[email protected]> commit 8f4d638 Author: L.B. <[email protected]> Date: Thu Dec 11 14:17:03 2025 -0500 docs: move FAQ (NVIDIA-NeMo#489) moves how-to-faq to render under "references" and display as FAQ. no material changes to the content. Signed-off-by: Lawrence Lane <[email protected]> commit 54b21db Author: bxyu-nvidia <[email protected]> Date: Thu Dec 11 10:27:28 2025 -0800 Fix NeMo Gym Pyproject links (NVIDIA-NeMo#486) Signed-off-by: Brian Yu <[email protected]> commit 82f0f0c Author: fsiino-nvidia <[email protected]> Date: Thu Dec 11 10:18:58 2025 -0800 More single tool call filename updates cont (NVIDIA-NeMo#484) Signed-off-by: Frankie Siino <[email protected]> commit 8654ecf Author: L.B. 
<[email protected]> Date: Wed Dec 10 22:08:20 2025 -0500 docs: home pg, quickstart move, gh icon (NVIDIA-NeMo#463) - adds GH icon + link to global top nav - rebuilds the home page to standard layout - adds CTA to quickstart and tutorials - moves quickstart into get started - clarifies differences between the quickstart and more detailed onboarding materials --------- Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: Chris Wing <[email protected]> Co-authored-by: Chris Wing <[email protected]> commit c345e5d Author: bxyu-nvidia <[email protected]> Date: Wed Dec 10 19:05:20 2025 -0800 Fix duplicate reference sections (NVIDIA-NeMo#483) Signed-off-by: Brian Yu <[email protected]> commit be25806 Author: fsiino-nvidia <[email protected]> Date: Wed Dec 10 17:24:13 2025 -0800 docs: Fix wrong count vs actual (NVIDIA-NeMo#482) Signed-off-by: Frankie Siino <[email protected]> commit a3417ce Author: fsiino-nvidia <[email protected]> Date: Wed Dec 10 16:58:55 2025 -0800 More single tool call filename updates (NVIDIA-NeMo#480) Signed-off-by: Frankie Siino <[email protected]> commit 25808bf Author: fsiino-nvidia <[email protected]> Date: Wed Dec 10 16:36:05 2025 -0800 Rename examples simple_weather and stateful_counter (NVIDIA-NeMo#479) Signed-off-by: Frankie Siino <[email protected]> commit bf0b0c5 Author: Ahmad Kiswani <[email protected]> Date: Wed Dec 10 15:44:25 2025 -0800 Expose server host and port in dataset viewer CLI (NVIDIA-NeMo#476) Closes https://github.com/NVIDIA-NeMo/Internal-Planning/issues/126 @bxyu-nvidia Per the issue, the PR also changes the default `server_host` to `0.0.0.0` (accessible from everywhere). But I would advise against this for security reasons. I think keeping the default to `127.0.0.1` is the right call even if the user needs to modify the command to access the server. 
--------- Signed-off-by: Ahmad Kiswani <[email protected]> commit 993543a Author: pjin-nvidia <[email protected]> Date: Wed Dec 10 14:38:22 2025 -0800 Miscellaneous infra improvements/fixes (NVIDIA-NeMo#317) should resolve NVIDIA-NeMo#342 Signed-off-by: Brian Yu <[email protected]> Co-authored-by: Brian Yu <[email protected]> Co-authored-by: Peter Jin <[email protected]> commit 845bf71 Author: Ahmad Kiswani <[email protected]> Date: Wed Dec 10 14:15:07 2025 -0800 pyproject typos and grammar fixes (NVIDIA-NeMo#473) Closes https://github.com/NVIDIA-NeMo/Internal-Planning/issues/132 Signed-off-by: Ahmad Kiswani <[email protected]> commit 81a0013 Author: bxyu-nvidia <[email protected]> Date: Wed Dec 10 14:11:08 2025 -0800 docs: Improve server reference info (NVIDIA-NeMo#474) Signed-off-by: Brian Yu <[email protected]> commit 1d78f22 Author: bxyu-nvidia <[email protected]> Date: Wed Dec 10 13:50:27 2025 -0800 Bug: inconsistent documentation around servers running (NVIDIA-NeMo#472) Signed-off-by: Brian Yu <[email protected]> commit 9f26473 Author: bxyu-nvidia <[email protected]> Date: Wed Dec 10 13:25:42 2025 -0800 docs: Training framework integration (NVIDIA-NeMo#439) Signed-off-by: Brian Yu <[email protected]> commit f67fa48 Author: Ahmad Kiswani <[email protected]> Date: Wed Dec 10 13:19:24 2025 -0800 Remove penguin references (NVIDIA-NeMo#469) After this PR, the only remaining penguin references are in the NeMo-RL tutorial, but these should be fixed with tutorial rewrite. Closes https://github.com/NVIDIA-NeMo/Internal-Planning/issues/131 Signed-off-by: Ahmad Kiswani <[email protected]> commit eecb93c Author: L.B. 
<[email protected]> Date: Wed Dec 10 16:13:44 2025 -0500 docs(readme): fix Example Resource Servers table - correct Multi Step… (NVIDIA-NeMo#464) Update 'Demonstrates' column for Multi Step example: - Before: Instruction_Following example - After: Multi-step tool calling Fixes NVIDIA-NeMo#417 --------- Signed-off-by: Brian Yu <[email protected]> Co-authored-by: Brian Yu <[email protected]> commit 0e367c2 Author: Sanjay Kariyappa <[email protected]> Date: Thu Dec 11 02:38:51 2025 +0530 add calendar env for multi-turn IF (NVIDIA-NeMo#297) This PR introduces the **Calendar Resource Server**, a new training environment that challenges models to schedule multiple events on a calendar while satisfying complex temporal constraints. The constraints are mentioned in a multi-turn conversation format (generated synthetically using a role-playing model). Achieving high performance on this benchmark requires the model to satisfy constraints mentioned in different user turns. When trained on this synthetic dataset, we observe an improvement in the model's multi-turn instruction following ability. The Calendar environment simulates a realistic scheduling task where an AI agent must: - Schedule multiple events within a working day time window - Satisfy various temporal constraints: - **"before"**: Event must end before a specific time - **"after"**: Event must start after a specific time - **"between"**: Event must start and end within a time window - **"at"**: Event must start at an exact time - Ensure no time conflicts between events - Match exact event durations - Stay within global min/max time boundaries This environment tests an agent's ability to: - Parse and understand natural language constraints. - Follow instructions that are mentioned in multiple user messages. - Infer scheduling conflicts and satisfy multiple constraints simultaneously. - Perform temporal reasoning and arithmetic. 
- **4 constraint types**: before, after, between, at - **Time window enforcement**: Global min/max boundaries for all events - **Conflict detection**: Automatic validation of event overlaps - **Duration matching**: Exact duration requirements per event The server includes a robust verification pipeline that: - Extracts JSON schedules from model responses - Validates all temporal constraints - Detects overlapping events - Returns binary rewards (1 for valid, 0 for invalid) - Filters out responses with thinking tags (`<think>`) - Script to generate diverse scheduling scenarios - Configurable number of events and constraint types - Natural language constraint descriptions - Validation data included - Tests for each constraint type (valid and violation cases) - Edge cases: empty schedules, wrong event counts, time conflicts - Complex multi-event scenarios Qwen3-8b shows steady improvement in rewards when trained with GRPO with a dataset of 4K synthetic samples. Wandb logs are below. https://wandb.ai/nvidia/skariyappa-nemo-gym-rl-integration/runs/t4v06nbg https://wandb.ai/nvidia/skariyappa-nemo-gym-rl-integration/runs/70yc23ew https://wandb.ai/nvidia/skariyappa-nemo-gym-rl-integration/runs/1jnwuhi3 --------- Signed-off-by: Sanjay Kariyappa <[email protected]> Signed-off-by: Brian Yu <[email protected]> Co-authored-by: Brian Yu <[email protected]> commit a182171 Author: bxyu-nvidia <[email protected]> Date: Wed Dec 10 12:58:07 2025 -0800 Explain where the name Gym comes from; Gym Key Terminology doc is missing some of the old material (NVIDIA-NeMo#470) Signed-off-by: Brian Yu <[email protected]> commit d8ecb8b Author: Chris Wing <[email protected]> Date: Wed Dec 10 10:56:29 2025 -0800 Add benefits to About page aligned with README (NVIDIA-NeMo#452) Fixes NVIDIA-NeMo#451 Signed-off-by: Chris Wing <[email protected]> commit e08906c Author: Ahmad Kiswani <[email protected]> Date: Wed Dec 10 10:35:33 2025 -0800 docs: Moved configuration system under about (NVIDIA-NeMo#420) 
Moved configuration systems under "About" instead of "About>Concepts". Also removed configuration mentions and examples from core abstraction pages Closes NVIDIA-NeMo#392 and NVIDIA-NeMo#393 --------- Signed-off-by: Ahmad Kiswani <[email protected]> Signed-off-by: L.B <[email protected]> Signed-off-by: Chris Wing <[email protected]> Signed-off-by: Brian Yu <[email protected]> Co-authored-by: L.B <[email protected]> Co-authored-by: Chris Wing <[email protected]> Co-authored-by: Brian Yu <[email protected]> commit 7aa8306 Author: Chris Wing <[email protected]> Date: Wed Dec 10 05:59:38 2025 -0800 Add Data Designer and links to ecosystem page (NVIDIA-NeMo#462) Fixes NVIDIA-NeMo#450 Signed-off-by: Chris Wing <[email protected]> commit 287d08d Author: Chris Wing <[email protected]> Date: Tue Dec 9 12:45:35 2025 -0800 Change NeMo Gym from framework to library (NVIDIA-NeMo#456) Changed description of NeMo Gym from a framework to library for consistency across NeMo products Signed-off-by: Chris Wing <[email protected]>


- Expanded the README with detailed instructions for dataset generation, rollout collection, and testing - Added example_rollouts.jsonl and updated example.jsonl - Improved generate_dataset.py to support new CLI options for dataset customization

copy-pr-bot · 2026-02-05T07:53:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cmunley1 · 2026-02-05T22:48:36Z

can you please merge main?

Kelvin0110 · 2026-02-06T10:16:48Z

Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.

cmunley1 · 2026-02-06T18:07:05Z

Have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision language model. Does anything require vision here (not an issue, just curious)?

cmunley1 · 2026-02-06T20:06:03Z

DCO is failing, can you try to resolve that? https://docs.nvidia.com/nemo/gym/latest/contribute/development-setup.html#dco-and-commit-signing

Also see here https://docs.nvidia.com/nemo/gym/latest/contribute/environments/new-environment.html#contribution-workflow

cmunley1 · 2026-02-06T20:41:19Z

Please also run the pre-commit checks, such as ruff: https://docs.nvidia.com/nemo/gym/latest/contribute/development-setup.html#pre-commit-hook-failures

cmunley1

need to pass dco and precommit

Kelvin0110 · 2026-02-10T05:52:47Z

Thanks for checking.
Our resource server doesn’t require any vision. We selected Qwen/Qwen3-VL-8B-Thinking because this vision language model provides stronger pure-text performance than the corresponding non‑VL models (e.g., qwen3-8b-thinking). Since our tasks involve relatively complex reasoning, using the stronger model helps ensure a more stable and reliable reward distribution.

newtdes · 2026-02-10T06:01:36Z

Also, for DCO and pre-commit, we will make sure our pull request passes both checks.

cmunley1 and others added 27 commits October 29, 2025 11:46

init newton bench

5c6f7ae

Signed-off-by: cmunley1 <[email protected]>

Merge branch 'main' into cmunley1/newton

abb2770

Add dynamic module support to NewtonBench server

bdd079d

Remove legacy run_experiment endpoint

9d3a2d2

Refactor generate_dataset to support all modules

3381182

Refactor NewtonBench config and session handling

f1b708d

Expand tests for session seeding, experiment running, error handling

26aa8b1

[Execute Python passed test_app.py] For execute python endpoints

64c2316

Improve session management and python_execute endpoint

8a1e672

- Add math to pre-imported libraries - Implement session TTL and background cleanup for expired sessions - Update the dataset tool description to reflect available libraries

Refactor session cleanup to use pop

b45dedb

Refactor NewtonBench utils and python sandbox

b79283e

Add more test cases, current number of test cases is 42

54c4e86

Catch EOF error in test_app.py caused by _session_worker function

9ef6c11

New client.py that test the module and execute python

719eba8

Add test cases for session management and different end points

7408ce2

Reorganize test_app.py with clearer sectioning

3ece03f

Update workflow demo

140a4f9

Push the useless

bfb4a92

More detailed README that has instructions on cloning NewtonBench and…

95afd71

… descriptions of the benchmark

Update README with NewtonBench descriptions

88192fb

readme with reward distribution and tool-reward correlation

2159ead

update README.md and example_metrics.json

2fb1fd1

New reward function in app.py

4445700

Adapt test_app.py to new reward function

91bea35

Refactor session cleanup and update example rollouts

5091a65

Kelvin0110 requested a review from a team as a code owner February 5, 2026 07:53

cmunley1 self-requested a review February 5, 2026 22:41

Kelvin0110 added 2 commits February 6, 2026 17:40

Merge branch 'main' into cmunley1/newton

3c43f4d

Merge branch 'main' into cmunley1/newton

52df42a

Merge branch 'main' into cmunley1/newton

9414137

cmunley1 requested changes Feb 10, 2026


Merge branch 'main' into cmunley1/newton

9288ad5
