
Add LibTrace recipe for generating domain-specific reasoning data #1224

Open
jubick1337 wants to merge 7 commits into main from mnovikov/libtrace

Conversation


@jubick1337 jubick1337 commented Feb 9, 2026

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added LibTrace recipe documentation detailing a pipeline for building domain-specific reasoning datasets from library APIs across physics, chemistry, and biology domains.
  • Bug Fixes

    • Improved error handling for sandbox execution with graceful recovery from transport failures.
  • Chores

    • Added supporting infrastructure for LibTrace pipeline execution, including container configuration and utility scripts for data processing.

@jubick1337 jubick1337 force-pushed the mnovikov/libtrace branch 2 times, most recently from 316ee6f to 7f5b80d on February 9, 2026 at 20:55
@jubick1337 jubick1337 marked this pull request as ready for review February 9, 2026 20:56
@jubick1337 jubick1337 requested a review from Copilot February 9, 2026 20:58

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


Comment on lines 70 to 75
generate(
    ctx=wrap_arguments(" ".join(extra_args)),
    cluster=args.cluster,
    input_file=args.input_file,
    output_dir=args.output_dir,
    expname=f"{args.expname_prefix}-{args.domain or 'custom'}",

Domain flag is ignored

--domain is only used to change expname, but it never gets forwarded into the generate() Hydra args. As written, passing --domain chem has no effect on the actual prompt variables/behavior, which is likely unintended given the CLI and the prompt expects domain-specific labeling. Consider adding it to extra_args (e.g. ++domain=... / whatever config key the prompt uses) or removing the flag to avoid a misleading interface.
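
A minimal sketch of the first option, assuming the prompt reads a Hydra key literally named domain (the real key may differ):

# Hypothetical sketch: forward the CLI flag as a Hydra override so the prompt
# config actually receives it; "++domain=..." is a placeholder for whatever key
# the prompt template reads.
if args.domain:
    extra_args.append(f"++domain={args.domain}")

generate(
    ctx=wrap_arguments(" ".join(extra_args)),
    # ... remaining arguments unchanged ...
)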

Comment on lines +590 to +594
extra_srun_args = None
if isinstance(command.script, SandboxScript):
    # Also disable PMI/PMIx for the sandbox step. The sandbox runs a
    # single SLURM task but spawns many child processes (uwsgi workers,
    # IPython shells). On some clusters, PMIx can treat child crashes

Duplicate --mpi srun args

The sandbox path sets extra_srun_args = ["--kill-on-bad-exit=0", "--mpi=none"], but get_executor() always seeds srun_args with "--mpi=pmix" and then extends with extra_srun_args (see nemo_skills/pipeline/utils/exp.py). This produces two --mpi flags in the same srun command, which can be rejected by Slurm or can keep the first value (making the override ineffective). It needs to replace the default --mpi=pmix rather than append a second --mpi=....

Signed-off-by: jubick1337 <mnovikov@nvidia.com>

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

This pull request introduces LibTrace, a comprehensive recipe for building domain-specific reasoning data from library APIs. It adds documentation, Python pipeline scripts for data harvesting and processing, Docker infrastructure, prompt templates, and enhances sandbox error handling for transport failures.

Changes

  • Documentation and Navigation (docs/recipes/libtrace.md, mkdocs.yml, recipes/README.md): Added LibTrace recipe documentation, navigation entry, and cross-references describing the 8-step pipeline and file organization.
  • Sandbox Error Handling (nemo_skills/code_execution/sandbox.py): Added handling for httpx.TransportError and httpx.TimeoutException across code execution, state restoration, proof checking, and session cleanup paths with standardized error payloads and logging.
  • Pipeline Infrastructure (nemo_skills/pipeline/utils/declarative.py, nemo_skills/pipeline/utils/exp.py): Introduced an extra_srun_args parameter to get_executor() and added sandbox-specific SLURM overrides (--kill-on-bad-exit=0, --mpi=none) to allow worker process restarts without step-wide kills.
  • LibTrace Docker & Environment Setup (recipes/libtrace/dockerfiles/Dockerfile.sandbox, recipes/libtrace/dockerfiles/environment.yml, recipes/libtrace/dockerfiles/start-with-nginx.sh): New Docker image for the LibTrace sandbox with scientific libraries (Geant4, GROMACS, chemistry/biology tools), a conda environment specification, and an entrypoint script configuring MPI/BLAS thread limits.
  • LibTrace Prompt Templates (recipes/libtrace/prompts/applicability-relevance.yaml, recipes/libtrace/prompts/problem-generation.yaml): New YAML prompts for evaluating library API applicability/domain relevance and generating graduate-level domain problems requiring specific library functions.
  • LibTrace Pipeline Scripts (recipes/libtrace/scripts/harvest_docs.py, prepare_inference_jsonl.py, filter_applicability_relevance.py, collect_generated_problems.py, gather_solutions.py): Five core utilities implementing the LibTrace pipeline: harvesting library docstrings, preparing inference datasets, filtering by applicability/relevance, aggregating generated problems, and sampling solved solutions with library usage tracking.
  • LibTrace Documentation (recipes/libtrace/README.md): Comprehensive walkthrough of LibTrace pipeline steps, setup instructions, environment notes, and expected outputs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 8.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the main change: adding a LibTrace recipe for generating domain-specific reasoning data. It is clear, specific, and directly reflects the primary addition across documentation, scripts, and configuration files.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@recipes/libtrace/dockerfiles/environment.yml`:
- Around line 24-25: Remove the redundant conda package entry "geant4" from
environment.yml so the Dockerfile.sandbox build uses only the source-built
Geant4 v11.2.2; edit environment.yml to delete the geant4 line referenced in the
diff, ensuring the Dockerfile.sandbox steps that build and install Geant4 into
${CONDA_PREFIX} and the dependent geant4_pybind linkage remain unchanged to
avoid duplicate installation and file conflicts.

In `@recipes/libtrace/scripts/collect_generated_problems.py`:
- Around line 175-176: Replace the use of row.get("name", "<missing>") and
row.get("source", "<missing>") with direct dictionary access (row["name"] and
row["source"]) so the code fails fast when validated keys are unexpectedly
missing; locate the usages of row at the point where name and source are
assigned in collect_generated_problems.py (the block that validated keys
earlier) and update those two expressions accordingly, keeping existing variable
names and surrounding logic unchanged.

In `@recipes/libtrace/scripts/filter_applicability_relevance.py`:
- Around line 188-197: Replace the awkward raise/catch pattern around extracting
and validating source: stop raising TypeError and remove the broad except
Exception; instead explicitly handle missing key and wrong type. Specifically,
try to get source = item["source"] in a try/except KeyError as e and on KeyError
either increment invalid_count and continue when args.skip_invalid is true or
raise ValueError(f"Missing or invalid source in {input_path}:{line_num}") from
e; then check if not isinstance(source, str) and likewise increment
invalid_count and continue when args.skip_invalid is true or raise
ValueError(...) (use raise ... from where appropriate if you have an underlying
exception); only after passing these explicit checks update
source_dist_pre[source] += 1 and append to valid_items. This removes the blanket
except and uses raise ... from per B904.

In `@recipes/libtrace/scripts/gather_solutions.py`:
- Line 1: Add the repository's required copyright header to the very top of
gather_solutions.py (before the existing module docstring/triple-quote), using
the standard project header template (year and organization/owner) that CI
expects; ensure the header is placed above any code or docstrings and preserve
the existing module docstring after the header.
- Around line 528-550: The try/except in process_line (and likewise in
process_line_for_sampling) currently catches all Exception types and hides
programming errors; narrow the exception handling to only the expected
input/parsing errors (e.g., json.JSONDecodeError, KeyError, TypeError,
ValueError) and handle those by logging the error and returning the failure
tuple, but remove the broad except so other exceptions (AttributeError,
IndexError, etc.) propagate to the parent; ensure you include the exception
details in the log message when returning on malformed input and apply the same
change to process_line_for_sampling.
- Around line 846-852: The loop that calls _json_loads(line) and adds
data["name"] currently swallows all exceptions; modify it to track and report
skipped lines: introduce a skipped_lines_count (and optionally collect up to N
examples in malformed_examples) in the except block instead of pass, increment
it when JSONDecodeError or KeyError (or any Exception) occurs, and after the
loop log or raise a clear warning including skipped_lines_count, total lines
processed, and a few example malformed lines; keep the existing RuntimeError("No
names found in input data.") for the all_names empty case but augment it or the
surrounding log with the skipped-line statistics so data quality issues are
visible.
- Around line 427-439: In _extract_all_code_from_messages, replace the
permissive dict.get() usage with direct indexing to fail fast: check
tool_call["function"]["name"] == "stateful_python_code_exec" and access args =
tool_call["function"]["arguments"] then pull code via _json_loads(args)["code"]
or args["code"] as before; this uses the known structure created by
parse_messages and surfaces errors immediately instead of silently masking
missing keys. Ensure you update all occurrences of
tool_call.get(...)/msg.get(...) in that function to msg["role"],
tool_call["function"]["name"], tool_call["function"]["arguments"], and
args["code"] accordingly.
- Around line 639-653: The try/except in the loop that writes records (using
_json_loads, parse_messages, _json_dumps, and writing to output_file via
_progress) is swallowing errors and causing silent data loss; update
save_sampled_messages so the except block increments a failure counter and
records the failed indices (e.g., failed_count and failed_indices), continue
writing remaining records, and after the with block log or print a concise
summary including failed_count and failed_indices (or re-raise if failures > 0
depending on policy) so callers can detect incomplete output instead of silently
dropping records.
🟡 Minor comments (15)
nemo_skills/code_execution/sandbox.py-399-408 (1)

399-408: ⚠️ Potential issue | 🟡 Minor

Fix ambiguous EN DASH character on line 404.

Static analysis (RUF001) flags the – character (EN DASH, U+2013) in the log format string. Replace it with a standard hyphen for consistency.

Proposed fix
-                LOG.warning("Retry %d/%d deleting session %s – %s", attempt + 1, max_retries, session_id, e)
+                LOG.warning("Retry %d/%d deleting session %s - %s", attempt + 1, max_retries, session_id, e)
recipes/libtrace/scripts/run_boxed_inference.py-1-2 (1)

1-2: ⚠️ Potential issue | 🟡 Minor

Missing copyright header — CI failure.

The pipeline reports a copyright header is missing. Add the project's standard copyright header at the top of this file.

docs/recipes/libtrace.md-10-11 (1)

10-11: ⚠️ Potential issue | 🟡 Minor

Broken link — README.md is outside the docs/ directory.

The link /recipes/libtrace/README.md points to a file that lives in the repo root's recipes/ folder, not inside docs/. MkDocs only serves content from the docs/ directory, so this link will 404 on the rendered site. Consider linking to the GitHub source directly (e.g., the repo URL for that file) or moving the relevant content into the docs tree.

Proposed fix (link to GitHub source)
 For the full walkthrough, command examples, and configuration details see
-[recipes/libtrace/README.md](/recipes/libtrace/README.md).
+[recipes/libtrace/README.md](https://github.com/NVIDIA-NeMo/Skills/blob/main/recipes/libtrace/README.md).
#!/bin/bash
# Verify that the README is not inside the docs directory
fd 'README.md' --full-path --glob '**/libtrace/**'
recipes/libtrace/scripts/filter_applicability_relevance.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Missing copyright header — CI failure.

The pipeline reports a copyright header is missing. Add the project's standard copyright header at the top of this file.

recipes/libtrace/scripts/run_applicability_relevance.py-64-71 (1)

64-71: ⚠️ Potential issue | 🟡 Minor

Space-containing Hydra overrides in extra_args will be split incorrectly.

wrap_arguments (from nemo_skills/pipeline/cli.py) splits its input on spaces: arguments.split(" "). If a user passes --extra_args '++inference.stop_phrases="end of response"', the value will be broken into multiple tokens. This affects both this script and run_problem_generation.py.

Consider whether wrap_arguments should accept a list instead, or ensure extra_args is split and extended individually before joining.
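
A quick illustration of the failure mode (the override value is hypothetical):

# wrap_arguments splits on single spaces, so a quoted value is broken apart:
override = '++inference.stop_phrases="end of response"'
print(override.split(" "))
# ['++inference.stop_phrases="end', 'of', 'response"'] -- three tokens instead of one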

recipes/libtrace/dockerfiles/Dockerfile.sandbox-31-36 (1)

31-36: ⚠️ Potential issue | 🟡 Minor

Update micromamba download URL to current official endpoint.

The URL https://micromamba.snakepit.net/api/micromamba/linux-64/latest is an older endpoint. The current official endpoint is https://micro.mamba.pm/api/micromamba/linux-64/latest. Update line 33 to use the current recommended source.

recipes/libtrace/scripts/harvest_docs.py-253-258 (1)

253-258: ⚠️ Potential issue | 🟡 Minor

Help text contradicts validation logic.

The help string says --field is "ignored if --libraries is provided", but line 282-283 raises a ValueError if both are supplied. Update the help text to reflect the actual mutual-exclusivity constraint.

Proposed fix
-        help="Library group to process (ignored if --libraries is provided).",
+        help="Library group to process (mutually exclusive with --libraries).",
recipes/libtrace/scripts/harvest_docs.py-229-247 (1)

229-247: ⚠️ Potential issue | 🟡 Minor

"source" field is added after per-library JSONL is written — schema mismatch.

write_jsonl on line 232 serializes docs without the "source" key. It's only added on line 246, so per-library files have a different schema than the unified file. If any downstream consumer expects "source" in per-library files, it will fail.

If the omission is intentional, a comment would help. Otherwise, move the mutation before the write:

Proposed fix
     docs = extract_docs_from_module(lib, library_name, seen=set())
 
+    for doc in docs:
+        doc["source"] = library_name
+
     output_file = output_dir / f"{library_name}.jsonl"
     write_jsonl(output_file, docs)
 
     functions = sum(1 for doc in docs if doc["type"] == "function")
     classes = sum(1 for doc in docs if doc["type"] == "class")
     methods = sum(1 for doc in docs if doc["type"] == "method")
 
     print(f"\nExtracted {len(docs)} items from {library_name}")
     print(f"Saved to: {output_file}")
     print("Statistics:")
     print(f"  Functions: {functions}")
     print(f"  Classes: {classes}")
     print(f"  Methods: {methods}")
 
-    for doc in docs:
-        doc["source"] = library_name
     return docs
recipes/libtrace/scripts/collect_generated_problems.py-1-14 (1)

1-14: ⚠️ Potential issue | 🟡 Minor

Missing copyright header.

The pipeline failure indicates a missing copyright header. This applies to this file as well (and likely all new Python files in this PR).

recipes/libtrace/scripts/collect_generated_problems.py-98-102 (1)

98-102: ⚠️ Potential issue | 🟡 Minor

--max_problem_tokens cannot be disabled from CLI; the "or omitted" path is unreachable.

With default=10000 and type=int, args.max_problem_tokens is never None, so the is not None guards on lines 128 and 134 are always True, and the error message "or omitted" on line 129 is misleading. If the intent is to allow disabling token-length filtering, change the default to None:

Proposed fix
     parser.add_argument(
         "--max_problem_tokens",
         type=int,
-        default=10000,
-        help="Drop problems whose tokenized length exceeds this threshold.",
+        default=None,
+        help="Drop problems whose tokenized length exceeds this threshold (default: disabled).",
     )

Alternatively, if filtering should always be on, remove the is not None guards and the "or omitted" message to avoid confusion.

Also applies to: 128-139

recipes/libtrace/scripts/gather_solutions.py-442-446 (1)

442-446: ⚠️ Potential issue | 🟡 Minor

Redundant check: r"\boxed{" and "\\boxed{" are identical strings.

Both r"\boxed{" (raw string) and "\\boxed{" (escaped backslash) evaluate to the same string \boxed{. The or clause is dead code. If you intended to check for two different representations, one side needs adjustment; otherwise, simplify to a single check.

Proposed simplification
 def _has_boxed_in_last_assistant(messages: list) -> bool:
     for msg in reversed(messages):
         if msg.get("role") == "assistant" and msg.get("content"):
-            return r"\boxed{" in msg["content"] or "\\boxed{" in msg["content"]
+            return r"\boxed{" in msg["content"]
     return False
recipes/libtrace/scripts/gather_solutions.py-49-50 (1)

49-50: ⚠️ Potential issue | 🟡 Minor

_json_dumps fallback is missing ensure_ascii=False, unlike all other LibTrace scripts.

Every other script in this recipe passes ensure_ascii=False to _json_std.dumps in the fallback path. This omission means non-ASCII characters (chemical names, mathematical symbols, etc.) would be escaped to \uXXXX sequences when orjson is not installed.

Proposed fix
     def _json_dumps(obj) -> str:
-        return _json_std.dumps(obj)
+        return _json_std.dumps(obj, ensure_ascii=False)
recipes/libtrace/scripts/gather_solutions.py-409-424 (1)

409-424: ⚠️ Potential issue | 🟡 Minor

Dead code block at lines 418–423.

Line 411 already adds all_names & used_names to matched. The loop at line 418 iterates over the same intersection and adds names that are already in matched. The entire block (lines 418–423) is a no-op and can be removed.

Proposed fix
 def _find_matching_names(all_names: Set[str], used_names: Set[str], imported_from_modules: Set[str]) -> Set[str]:
     matched = set()
     matched.update(all_names & used_names)
     for used in used_names:
         parts = used.split(".")
         for i in range(1, len(parts)):
             prefix = ".".join(parts[:i])
             if prefix in all_names:
                 matched.add(prefix)
-    for name in all_names & used_names:
-        parts = name.split(".")
-        if len(parts) >= 2:
-            module_prefix = ".".join(parts[:-1])
-            if module_prefix in imported_from_modules:
-                matched.add(name)
     return matched
recipes/libtrace/scripts/gather_solutions.py-375-388 (1)

375-388: ⚠️ Potential issue | 🟡 Minor

Dead code: redundant name in used_names check on line 386.

The check on line 376 already tests name in used_names and returns True if it matches. If execution reaches line 386, name is guaranteed to not be in used_names, making this condition always False.

Proposed fix
 def _check_name_in_extracted(name: str, used_names: Set[str], imported_from_modules: Set[str]) -> bool:
     if name in used_names:
         return True
     name_prefix = name + "."
     for used in used_names:
         if used.startswith(name_prefix):
             return True
     parts = name.split(".")
     if len(parts) >= 2:
         module_prefix = ".".join(parts[:-1])
-        if module_prefix in imported_from_modules:
-            if name in used_names:
-                return True
+        if module_prefix in imported_from_modules and parts[-1] in {u.split(".")[-1] for u in used_names}:
+            return True
     return False

If the intent was to check that the attribute part of the name is used (not the full qualified name), the condition needs to be fixed as above. Otherwise, simply remove the dead branch.

recipes/libtrace/scripts/gather_solutions.py-255-361 (1)

255-361: ⚠️ Potential issue | 🟡 Minor

Remove unused NameUsageChecker class.

This class is defined at lines 255–361 but never instantiated anywhere in the codebase. Its functionality has been superseded by UsedNamesExtractor (lines 184–252), which provides a more efficient single-pass extraction approach combined with helper functions like _check_name_in_extracted() and _check_any_name_in_extracted() for name matching.

🧹 Nitpick comments (21)
nemo_skills/pipeline/utils/exp.py (1)

180-180: extra_srun_args parameter is undocumented in the docstring.

The existing docstring (lines 196–228) documents every other parameter but omits extra_srun_args (and the pre-existing ray_template). Consider adding a brief entry for discoverability.

nemo_skills/pipeline/utils/declarative.py (1)

578-598: Consider letting the script declare its own srun overrides instead of hardcoding them here.

The isinstance(command.script, SandboxScript) check + hardcoded args creates a coupling between the generic _create_executor and a specific script type. If SandboxScript (or BaseJobScript) exposed an extra_srun_args property, this method could stay script-agnostic:

extra_srun_args = getattr(command.script, "extra_srun_args", None)

Not urgent since there's only one case today, but it would make the declarative layer more extensible.
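
A rough sketch of that shape, reusing names from this PR (illustrative, not the actual class definitions):

class SandboxScript(BaseJobScript):
    @property
    def extra_srun_args(self) -> list[str] | None:
        # Sandbox workers may crash and restart; don't let SLURM kill the whole
        # step and don't manage the child processes via PMIx.
        return ["--kill-on-bad-exit=0", "--mpi=none"]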

recipes/libtrace/README.md (1)

1-192: Well-structured documentation with clear step-by-step examples.

The README provides comprehensive pipeline documentation. One minor observation: Steps 3, 5, and 7 invoke scripts directly with python (because they internally call the Nemo pipeline), while other steps use ns run_cmd. Consider adding a brief note near Step 3 (or in the introduction) explaining this distinction, so users understand why some steps use ns run_cmd and others don't.

recipes/libtrace/scripts/run_boxed_inference.py (1)

102-102: Unnecessary f-string.

f"{args.expname_prefix}" is equivalent to args.expname_prefix since no formatting is applied.

Proposed fix
-        expname=f"{args.expname_prefix}",
+        expname=args.expname_prefix,
recipes/libtrace/scripts/filter_applicability_relevance.py (3)

23-41: Duplicated JSON helper and read_jsonl across multiple scripts.

_json_loads, _json_dumps, and read_jsonl are copy-pasted from collect_generated_problems.py (and possibly prepare_inference_jsonl.py). Consider extracting these into a shared utility module under recipes/libtrace/scripts/ to avoid drift between copies.

#!/bin/bash
# Verify the extent of duplication across libtrace scripts
rg -l '_json_loads|_json_dumps|read_jsonl' --glob 'recipes/libtrace/scripts/*.py'

90-104: Regex is recompiled on every call.

parse_scores is called per-line and recompiles two regexes each time. For large JSONL files this adds up. Consider caching compiled patterns (e.g., with functools.lru_cache on the label, or precompiling outside the loop).
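
One possible shape for that caching, assuming the "<Label>-Applicability" / "<Label>-Relevance" line format referenced elsewhere in this review (the exact patterns are illustrative):

import functools
import re

@functools.lru_cache(maxsize=None)
def _score_patterns(label: str) -> tuple[re.Pattern, re.Pattern]:
    # Compiled once per label and reused for every parsed line.
    applicability = re.compile(rf"{re.escape(label)}-Applicability:\s*([0-9]+)")
    relevance = re.compile(rf"{re.escape(label)}-Relevance:\s*([0-9]+)")
    return applicability, relevance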


145-155: store_true with default=True makes the flag a no-op.

--require_applicable uses action="store_true" with default=True, so passing the flag changes it from True to True — it has no effect. The only way to change the value is via --no_require_applicable. While functionally correct as a pair, the --require_applicable flag itself is misleading since it suggests the user can opt in, but it's already on by default. Consider removing the --require_applicable flag and only keeping --no_require_applicable, or simply documenting that it's always on unless negated.
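
A sketch of the single-flag alternative with argparse.BooleanOptionalAction (Python 3.9+); note it auto-generates --no-require_applicable with a dash, unlike the current --no_require_applicable:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--require_applicable",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Keep only applicable items (disable with --no-require_applicable).",
)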

recipes/libtrace/dockerfiles/start-with-nginx.sh (1)

38-41: Consider logging a warning when micromamba activate sandbox fails.

The || true on Line 40 silently swallows activation failures. Since the if guard already confirms micromamba is on PATH, a failure to activate the sandbox environment likely indicates a real problem (e.g., broken or missing env). A brief warning to stderr would aid debugging without breaking the entrypoint.

Suggested improvement
 if command -v micromamba >/dev/null 2>&1; then
     eval "$(micromamba shell hook -s bash)" || true
-    micromamba activate sandbox || true
+    micromamba activate sandbox || echo "WARNING: failed to activate 'sandbox' env" >&2
 fi
recipes/libtrace/dockerfiles/environment.yml (1)

1-64: No version pins on most packages — builds may not be reproducible.

Nearly all conda and pip dependencies lack version constraints. While strict pinning can cause solver conflicts with this many packages, consider at minimum generating and committing a lock file (e.g., conda-lock or micromamba env export) for reproducible builds.

recipes/libtrace/dockerfiles/Dockerfile.sandbox (2)

15-15: Pin the base image digest for reproducible builds.

tiangolo/uwsgi-nginx-flask:python3.10 can change upstream at any time. Consider pinning to a specific digest (e.g., @sha256:...) to ensure deterministic builds.


80-99: geant4_pybind is cloned without a version pin.

Line 96 clones geant4_pybind at HEAD, making the build non-reproducible and vulnerable to breaking upstream changes. Pin to a specific tag or commit hash.

Suggested fix
-    git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
+    git clone --recursive --branch v0.XX.X https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
recipes/libtrace/scripts/run_applicability_relevance.py (1)

57-90: Near-identical boilerplate shared with run_problem_generation.py.

Both scripts duplicate the prompt-config validation, extra-args construction, and the large generate(...) call. Consider extracting the shared logic (arg parsing base, prompt resolution, generate invocation) into a small shared helper module under recipes/libtrace/scripts/ to reduce drift between the two.

recipes/libtrace/scripts/run_problem_generation.py (1)

90-90: Unnecessary f-string — just use the variable directly.

f"{args.expname_prefix}" is equivalent to args.expname_prefix. Compare with run_applicability_relevance.py Line 75 which actually interpolates domain into the f-string.

Suggested fix
-        expname=f"{args.expname_prefix}",
+        expname=args.expname_prefix,
recipes/libtrace/scripts/prepare_inference_jsonl.py (2)

140-154: Validate --target_per_library before processing all rows.

The <= 0 check on Line 153 runs after the entire input has been read and validated (Lines 145-151). Moving this check to just after arg parsing would fail fast and avoid wasted work on large inputs.

Suggested fix
 if __name__ == "__main__":
     args = parse_args()
     input_path = Path(args.input_file)
     output_path = Path(args.output_file)
     domain_label = resolve_domain_label(args.domain)
+
+    if args.target_per_library is not None and args.target_per_library <= 0:
+        raise ValueError("--target_per_library must be a positive integer.")
+
     rows = read_jsonl(input_path)
     if not rows:
         raise RuntimeError(f"No entries found in {input_path}")
 
     output_rows: list[dict] = []
     for line_num, row in enumerate(rows, start=1):
         output_rows.append(validate_row(row, input_path, line_num, domain_label))
 
-    if args.target_per_library is not None and args.target_per_library <= 0:
-        raise ValueError("--target_per_library must be a positive integer.")
-
     if args.target_per_library:

22-38: Extract duplicated JSON helpers into a shared module.

The _json_loads/_json_dumps orjson-fallback pattern is identically duplicated across five files in recipes/libtrace/scripts/: prepare_inference_jsonl.py, harvest_docs.py, gather_solutions.py, filter_applicability_relevance.py, and collect_generated_problems.py. This represents ~75 lines of identical boilerplate code spread across the codebase. Extracting these helpers into a shared module (e.g., recipes/libtrace/scripts/_io.py) will reduce maintenance burden and prevent future drift. Note that read_jsonl and write_jsonl have API variations across files (e.g., skip_invalid parameter in some, mkdir behavior in others) that would require careful abstraction, but the core JSON serialization utilities are good candidates for consolidation.
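
A minimal sketch of such a shared module, mirroring the orjson-fallback behavior the scripts already use (the module name is just a suggestion):

# recipes/libtrace/scripts/_io.py (hypothetical)
import json as _json_std

try:
    import orjson as _orjson

    def _json_loads(s):
        return _orjson.loads(s)

    def _json_dumps(obj) -> str:
        # orjson.dumps returns bytes, so decode for text-mode writes.
        return _orjson.dumps(obj).decode("utf-8")

except ImportError:  # fall back to the standard library

    def _json_loads(s):
        return _json_std.loads(s)

    def _json_dumps(obj) -> str:
        return _json_std.dumps(obj, ensure_ascii=False)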

recipes/libtrace/scripts/harvest_docs.py (3)

31-31: Narrow the except clause to ImportError.

Catching bare Exception here could silently swallow a real error during orjson initialization (e.g., a broken native extension). The only expected failure is the module not being installed.

Proposed fix
-except Exception:  # pragma: no cover - best effort
+except ImportError:  # pragma: no cover - best effort

169-169: Hardcoded workaround — consider making this configurable.

This single-entry skip is fragile. If other problematic symbols appear, you'll accumulate more hardcoded lines. A module-level SKIP_SYMBOLS: set[str] constant would be cleaner and easier to maintain.


210-213: Add mkdir for robustness, consistent with sibling script.

prepare_inference_jsonl.py includes path.parent.mkdir(parents=True, exist_ok=True) in its write_jsonl. While current callers pre-create the directory, adding it here would be defensive and consistent.

Proposed fix
 def write_jsonl(path: Path, rows: list[dict]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
     with path.open("w", encoding="utf-8") as f:
         for row in rows:
             f.write(_json_dumps(row) + "\n")
recipes/libtrace/scripts/collect_generated_problems.py (2)

24-24: Top-level AutoTokenizer import is unconditional but only needed when token filtering is active.

transformers is a heavy dependency. Since max_problem_tokens defaults to 10000 (always set via argparse), this import will always be hit in practice, so this is fine for the current CLI. However, if someone imports this module as a library or if max_problem_tokens is ever made truly optional, the import would fail unnecessarily. Consider a lazy import inside the if args.max_problem_tokens is not None: block.

Proposed lazy import
-from transformers import AutoTokenizer  # type: ignore
 ...
     tokenizer = None
     if args.max_problem_tokens is not None:
+        from transformers import AutoTokenizer  # type: ignore
+
         tokenizer_name = args.tokenizer

26-43: Significant code duplication across LibTrace scripts.

_json_loads/_json_dumps, read_jsonl, and resolve_input_files are copy-pasted verbatim across at least 4–5 scripts (collect_generated_problems.py, filter_applicability_relevance.py, gather_solutions.py, prepare_inference_jsonl.py, run_problem_generation.py). Consider extracting these into a shared utility module (e.g., recipes/libtrace/scripts/_io_utils.py) to reduce maintenance burden and divergence risk.

Note: gather_solutions.py already has a subtle difference — its _json_dumps fallback uses _json_std.dumps(obj) (no ensure_ascii=False), unlike the other scripts. This inconsistency is likely unintentional and would be caught by consolidation.

Also applies to: 52-70, 73-89

recipes/libtrace/scripts/gather_solutions.py (1)

829-832: Move import os to the top of the file with other imports.

os is a standard library module — placing it inside if __name__ == "__main__" is unconventional and can confuse readers.

Comment on lines +24 to +25
# Geant4 (very large; remove if not needed or unavailable)
- geant4

⚠️ Potential issue | 🟠 Major



Remove geant4 from environment.yml; it's redundantly installed and conflicts with the source build.

The Dockerfile.sandbox creates the conda environment (line 42) with the geant4 package from conda-forge, then builds Geant4 v11.2.2 from source and installs it to the same location (${CONDA_PREFIX}, line 87). Since the source-built version is the one actually used (geant4_pybind depends on it, lines 96–97), the conda package is redundant. Both installations target the same directory, leading to file conflicts and build inefficiency. Remove line 25 from environment.yml.


Comment on lines +188 to +197
try:
    source = item["source"]
    if not isinstance(source, str):
        raise TypeError
    source_dist_pre[source] += 1
    valid_items.append(item)
except Exception:
    if not args.skip_invalid:
        raise ValueError(f"Missing or invalid source in {input_path}:{line_num}")
    invalid_count += 1

🛠️ Refactor suggestion | 🟠 Major

Awkward raise TypeError caught by broad except Exception.

Lines 190-191 artificially raise TypeError only to be caught by the blanket except Exception on line 194. This obscures intent and makes the control flow hard to follow. A simple if/continue/raise would be clearer and avoids the broad catch. Additionally, line 196 should use raise ... from per B904.

Proposed fix
     for line_num, item in enumerate(data, start=1):
-        try:
-            source = item["source"]
-            if not isinstance(source, str):
-                raise TypeError
-            source_dist_pre[source] += 1
-            valid_items.append(item)
-        except Exception:
-            if not args.skip_invalid:
-                raise ValueError(f"Missing or invalid source in {input_path}:{line_num}")
-            invalid_count += 1
+        source = item.get("source")
+        if not isinstance(source, str):
+            if not args.skip_invalid:
+                raise ValueError(f"Missing or invalid source in {input_path}:{line_num}")
+            invalid_count += 1
+            continue
+        source_dist_pre[source] += 1
+        valid_items.append(item)

As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 191-191: Abstract raise to an inner function

(TRY301)


[warning] 194-194: Do not catch blind exception: Exception

(BLE001)


[warning] 196-196: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 196-196: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines 846 to 852
    try:
        data = _json_loads(line)
        all_names.add(data["name"])
    except Exception:
        pass
if not all_names:
    raise RuntimeError("No names found in input data.")

⚠️ Potential issue | 🟠 Major

Silent try-except-pass when extracting names hides data quality issues.

Lines 849–850 silently swallow all exceptions during name extraction. If a significant portion of input lines are malformed (missing "name" key, invalid JSON), the all_names set will be silently incomplete, leading to incorrect usage statistics and sampling. At minimum, count and report skipped lines.

Proposed fix
+    name_extraction_errors = 0
     for line in _progress(all_lines, desc="Extracting names", unit="line"):
         try:
             data = _json_loads(line)
             all_names.add(data["name"])
         except Exception:
-            pass
+            name_extraction_errors += 1
     if not all_names:
         raise RuntimeError("No names found in input data.")
+    if name_extraction_errors:
+        print(f"Warning: {name_extraction_errors} lines skipped during name extraction", file=sys.stderr)
     print(f"Found {len(all_names):,} unique names to check against")

As per coding guidelines, "Follow the Zen of Python principles... ensure errors never pass silently unless explicitly silenced".

🧰 Tools
🪛 Ruff (0.14.14)

[error] 849-850: try-except-pass detected, consider logging the exception

(S110)


[warning] 849-849: Do not catch blind exception: Exception

(BLE001)


[warning] 852-852: Avoid specifying long messages outside the exception class

(TRY003)


Copilot AI left a comment


Pull request overview

Adds a new LibTrace recipe to generate domain-specific reasoning datasets from scientific library docstrings, along with a custom sandbox image and a few pipeline/runtime robustness tweaks to better support “heavy” scientific dependencies.

Changes:

  • Introduces LibTrace end-to-end scripts (doc harvesting → labeling → filtering → problem generation → boxed inference → solution gathering) plus prompt templates and documentation.
  • Adds a LibTrace-specific sandbox Docker image (micromamba env + scientific stacks) and a wrapper entrypoint with runtime env tuning.
  • Extends SLURM executor plumbing to allow extra srun args for sandbox resiliency, and improves sandbox client handling of transport/connection errors.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
  • recipes/libtrace/scripts/run_problem_generation.py: CLI wrapper around nemo_skills.pipeline.generate for domain problem generation.
  • recipes/libtrace/scripts/run_boxed_inference.py: CLI wrapper to solve generated problems using generic/general-boxed (optionally with sandbox).
  • recipes/libtrace/scripts/run_applicability_relevance.py: Runs applicability+relevance labeling using a LibTrace prompt config.
  • recipes/libtrace/scripts/prepare_inference_jsonl.py: Validates/normalizes harvested docs into inference-ready JSONL; optional per-library sampling.
  • recipes/libtrace/scripts/harvest_docs.py: Extracts public API docs from selected libraries/modules into per-lib + unified JSONL.
  • recipes/libtrace/scripts/gather_solutions.py: Parses boxed inference outputs, checks library usage via AST, computes stats and samples records.
  • recipes/libtrace/scripts/filter_applicability_relevance.py: Filters labeling outputs by applicability/relevance thresholds and reports distributions.
  • recipes/libtrace/scripts/collect_generated_problems.py: Consolidates generated problems across seeds; optional token-length filtering.
  • recipes/libtrace/prompts/problem-generation.yaml: Prompt template for generating domain problems from docstrings.
  • recipes/libtrace/prompts/applicability-relevance.yaml: Prompt template for applicability/relevance labeling.
  • recipes/libtrace/dockerfiles/start-with-nginx.sh: Sandbox entrypoint wrapper setting env for MPI/threading/PySCF stability.
  • recipes/libtrace/dockerfiles/environment.yml: Conda environment spec for the LibTrace sandbox image.
  • recipes/libtrace/dockerfiles/Dockerfile.sandbox: Builds the LibTrace sandbox image (micromamba + scientific libs + extra builds).
  • recipes/libtrace/README.md: Full LibTrace walkthrough and command examples.
  • recipes/README.md: Adds LibTrace to the recipes index.
  • nemo_skills/pipeline/utils/exp.py: Adds extra_srun_args support and uses it for sandbox srun overrides.
  • nemo_skills/pipeline/utils/declarative.py: Adds sandbox-specific srun overrides and adjusts executor GPU allocation logic.
  • nemo_skills/code_execution/sandbox.py: Handles httpx.TransportError during execution/restoration/proof-check and makes delete-session best-effort.
  • mkdocs.yml: Adds LibTrace to MkDocs navigation.
  • docs/recipes/libtrace.md: Adds docs landing page for the LibTrace recipe.
Comments suppressed due to low confidence (1)

nemo_skills/code_execution/sandbox.py:183

  • New TransportError handling changes sandbox behavior on connection failures, but there’s no unit test exercising this path. Consider adding a test that mocks http_session.post (and/or delete_session) to raise httpx.TransportError and asserts execute_code()/is_proof_correct() return the expected status and user-facing stderr message.
                    "Session state was lost. You need to re-import all modules and re-define all variables "
                    "before continuing. Consider simplifying the code to avoid crashes.\n"
                ),
            }
        new_session_created = output.pop("new_session_created", False)

        # Rebuild state by re-executing history first, then execute the new code.
        # NOTE: Only cells that completed successfully are stored, so we intentionally omit re-running cells that errored


Comment on lines +834 to +838
)
parser.add_argument("--no_dump_json", action="store_false", dest="dump_json")
parser.add_argument("--user_prompt_prefix", type=str, default=DEFAULT_USER_PROMPT_PREFIX)
return parser.parse_args()

Copilot AI Feb 9, 2026

This script reads all input lines into memory (all_lines.extend(f.readlines())), which can be prohibitively large for output-rs*.jsonl at scale. Consider streaming processing for stats, and for sampling use a single-pass reservoir/weighted reservoir approach (or store only offsets/indices) to avoid O(file_size) RAM usage.
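
For reference, a minimal single-pass reservoir sampler (Algorithm R) along the lines suggested above; it keeps at most k lines in memory regardless of input size:

import random

def reservoir_sample(lines, k: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    sample: list[str] = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Keep each incoming element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample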

Comment on lines +114 to +116
if relevance_match is None:
    raise ValueError(f"Missing {label}-Relevance in {path}:{line_num}")
relevance = int(relevance_match.group(1))
Copilot AI Feb 9, 2026

resolve_label() checks DOMAIN_TO_LABEL using the raw --domain string; DOMAIN_TO_LABEL currently omits common aliases like "chemistry"/"biology" and doesn’t normalize case. Since prepare_inference_jsonl supports these aliases, consider doing domain = domain.lower() and expanding DOMAIN_TO_LABEL to include the long-form keys for consistency.
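
A sketch of the suggested normalization (the label strings are illustrative):

DOMAIN_TO_LABEL = {
    "phys": "Physics",
    "physics": "Physics",
    "chem": "Chemistry",
    "chemistry": "Chemistry",
    "bio": "Biology",
    "biology": "Biology",
}

def resolve_label(domain: str) -> str:
    return DOMAIN_TO_LABEL[domain.lower()]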

srun_args.append("--overlap")
if not cluster_config.get("disable_gpus_per_node", False) and gpus_per_node is not None:
srun_args.append(f"--gpus-per-node={gpus_per_node}")
if extra_srun_args:
Copilot AI Feb 9, 2026

Appending extra_srun_args directly to srun_args can introduce conflicting/duplicate flags (e.g., default "--mpi=pmix" plus extra "--mpi=none" for the sandbox). Depending on Slurm version, repeating --mpi may error or behave inconsistently; consider explicitly overriding/removing existing --mpi/--kill-on-bad-exit entries when an override is provided (or add a dedicated mpi/kill_on_bad_exit parameter).

Suggested change
-if extra_srun_args:
+if extra_srun_args:
+    # Allow extra_srun_args to override specific options like --mpi and --kill-on-bad-exit
+    has_mpi_override = any(arg.startswith("--mpi") for arg in extra_srun_args)
+    has_kill_override = any(arg.startswith("--kill-on-bad-exit") for arg in extra_srun_args)
+    if has_mpi_override or has_kill_override:
+        filtered_srun_args = []
+        for arg in srun_args:
+            if has_mpi_override and arg.startswith("--mpi"):
+                # Skip default MPI setting if an override is provided
+                continue
+            if has_kill_override and arg.startswith("--kill-on-bad-exit"):
+                # Skip default kill-on-bad-exit setting if an override is provided
+                continue
+            filtered_srun_args.append(arg)
+        srun_args = filtered_srun_args

Comment on lines 578 to 580
# Only the server component (span_group_nodes=True) should claim group GPUs.
# Other components (client, sandbox) run on the same node but don't need GPUs.
if span_group_nodes:
Copilot AI Feb 9, 2026

The GPU allocation logic keys off span_group_nodes, but SandboxScript has span_group_nodes=True, so the sandbox will still request/claim group GPUs even though the comment says it shouldn’t. This defeats the goal of making client/sandbox GPU-free; consider switching the condition to the specific server script type (e.g., ServerScript) or introducing an explicit “needs_gpus” flag on scripts.

Suggested change
-# Only the server component (span_group_nodes=True) should claim group GPUs.
-# Other components (client, sandbox) run on the same node but don't need GPUs.
-if span_group_nodes:
+# Decide whether this script actually needs GPUs. By default, scripts that span
+# the group are assumed to need GPUs, but this can be overridden via a
+# script-level `needs_gpus` attribute. SandboxScript explicitly does not
+# need GPUs, even though it may span the group.
+needs_gpus = getattr(command.script, "needs_gpus", span_group_nodes)
+if isinstance(command.script, SandboxScript):
+    needs_gpus = False
+# Only components that declare they need GPUs should claim group GPUs.
+# Other components (client, sandbox) run on the same node but don't need GPUs.
+if needs_gpus:

inputs and output paths.

For the full walkthrough, command examples, and configuration details see
[recipes/libtrace/README.md](/recipes/libtrace/README.md).
Copilot AI Feb 9, 2026

This link is site-absolute ("/recipes/libtrace/README.md") and likely won’t resolve in the rendered MkDocs site because the repo-root recipes/ tree isn’t part of the docs build. Consider linking to the GitHub file URL or copying the walkthrough content into docs/ and using a relative docs link.

Suggested change
-[recipes/libtrace/README.md](/recipes/libtrace/README.md).
+the README at `recipes/libtrace/README.md` in the repository.

Collaborator

This is correct. @jubick1337 please run mkdocs serve to double-check that the docs render properly and fix any warnings.

num_workers = args.num_workers if args.num_workers is not None else max(1, (os.cpu_count() or 2) // 2)

print("Reading all lines from files...")
all_lines: list[str] = []
Copilot AI Feb 9, 2026

'except' clause does nothing but pass and there is no explanatory comment.


@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment



greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

nemo_skills/pipeline/utils/exp.py
Duplicate --mpi flag

get_executor() always seeds srun_args with --mpi=pmix and then appends extra_srun_args (and add_task() now passes extra_srun_args=["--kill-on-bad-exit=0", "--mpi=none"]). This will produce two --mpi flags in the final srun invocation, which can be rejected by Slurm or ignore the intended override depending on parsing order. This needs to replace the default --mpi=pmix rather than append another --mpi=... when an override is requested.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@recipes/libtrace/dockerfiles/Dockerfile.sandbox`:
- Around line 96-97: The Dockerfile currently clones geant4_pybind without a
version, making builds non-reproducible; change the git clone in
recipes/libtrace/dockerfiles/Dockerfile.sandbox (the line that calls git clone
--recursive https://github.com/HaarigerHarald/geant4_pybind.git
/tmp/geant4_pybind) to clone a specific existing tag (e.g., v0.1.1) instead of
HEAD, and then pip install that pinned copy; also verify that the chosen tag
(v0.1.1) is compatible with Geant4 v11.2.2 before merging.
- Around line 33-34: The Dockerfile RUN that downloads micromamba from
micromamba.snakepit.net is using a deprecated endpoint; update the download step
in the RUN that invokes curl and tar to use the official installer or release
URL (either run the official installer at https://micro.mamba.pm/install.sh or
fetch the appropriate binary from
https://github.com/mamba-org/micromamba-releases/releases/download/), adjust the
curl invocation and any tar/extract steps to match the chosen source, and ensure
the command still places the micromamba binary into /usr/local/bin (the RUN line
that writes to /usr/local/bin and strips components should be updated
accordingly).

In `@recipes/libtrace/scripts/collect_generated_problems.py`:
- Around line 142-143: The CLI currently sets args.max_problem_tokens to 10000
so the "omit to disable" behavior is impossible; change the argparse definition
for --max_problem_tokens to use default=None (keep type=int) so omitting the
flag disables tokenization, and adjust the validation around
args.max_problem_tokens to only raise ValueError when args.max_problem_tokens is
not None and <= 0, updating the error text to mention that omitting the flag
disables tokenization; reference the argparse flag for --max_problem_tokens and
the runtime check that inspects args.max_problem_tokens.

In `@recipes/libtrace/scripts/filter_applicability_relevance.py`:
- Around line 159-169: The --require_applicable flag is effectively a no-op
because parser.add_argument("--require_applicable", action="store_true",
default=True, ...) always yields True; update the argument handling so the CLI
can toggle the value: replace the current paired arguments
("--require_applicable" with action="store_true" and "--no_require_applicable")
by a single parser.add_argument for "require_applicable" that uses
argparse.BooleanOptionalAction (or alternatively set default=False with
action="store_true" and remove the separate --no_require_applicable) so the
"require_applicable" boolean toggles correctly.

In `@recipes/libtrace/scripts/gather_solutions.py`:
- Around line 456-460: The condition in _has_boxed_in_last_assistant contains a
redundant check (r"\boxed{" and "\\boxed{" are identical), so simplify the
boolean test by keeping a single representation (e.g., r"\boxed{") or, if you
intended to detect a different escaped form, replace the second operand with the
correct alternative; update the return expression to something like: return
r"\boxed{" in msg["content"] (or the intended distinct string) and remove the
duplicate or dead code.
- Around line 710-728: The skipped_no_boxed counter is dead because
process_line_for_sampling already sets has_code=False when a solution lacks a
\\boxed{} answer, so the require_boxed check in the loop (inside
gather_solutions.py) never sees those rows; fix by either removing
skipped_no_boxed and its print (clean up dead code around the loop that
references require_boxed and skipped_no_boxed) or modify
process_line_for_sampling to return a distinct flag (e.g., has_boxed or
was_filtered_by_boxed) separate from has_code so the loop (where
given_name_indices, any_name_only_data, name_usage_counts are updated) can
correctly increment skipped_no_boxed when require_boxed=True. Ensure references
to process_line_for_sampling, require_boxed, skipped_no_boxed, and the loop that
appends to given_name_indices/any_name_only_data are updated consistently.

In `@recipes/libtrace/scripts/harvest_docs.py`:
- Around line 243-261: The `"source"` field is assigned to each doc after
`write_jsonl` has already written the docs to the per-library JSONL file, which
means the saved files will be missing this field. Move the loop that assigns
`doc["source"] = library_name` to occur before the `write_jsonl` call so that
the `source` field is included in both the written JSONL file and the returned
docs list.
🧹 Nitpick comments (15)
nemo_skills/code_execution/sandbox.py (1)

168-179: Consider extracting a helper for the repeated transport-error payload.

The sandbox-crash error dict (with the "re-import all modules" message) is duplicated at lines 171–179 and 261–269. A small helper like _transport_error_output(error) would reduce duplication and ensure the user-facing message stays consistent if it changes.

♻️ Example helper
+    @staticmethod
+    def _transport_error_output(error: Exception) -> dict:
+        return {
+            "process_status": "error",
+            "stdout": "",
+            "stderr": (
+                f"Sandbox environment crashed: {error}\n"
+                "Session state was lost. You need to re-import all modules and re-define all variables "
+                "before continuing. Consider simplifying the code to avoid crashes.\n"
+            ),
+        }

Also applies to: 259-269

recipes/libtrace/dockerfiles/Dockerfile.sandbox (3)

26-29: Hardcoded linux64 architecture for PyPy — no multi-arch support.

The standard sandbox Dockerfile handles multi-arch via TARGETARCH. This will fail on ARM builds. If this image is x86_64-only by design, consider adding a comment; otherwise, replicate the arch-detection logic.


15-15: Container runs as root while executing user-submitted code.

Static analysis flagged DS002 (no USER directive). Since this sandbox executes potentially untrusted code, running the application as root amplifies the blast radius of any container escape. Consider adding a non-root user for the uWSGI workers, even if some build steps require root.

Sketch — add a non-root runtime user
 COPY recipes/libtrace/dockerfiles/start-with-nginx.sh /start-with-nginx.sh
 RUN chmod +x /start-with-nginx-base.sh /start-with-nginx.sh
 
+RUN useradd -m -s /bin/bash sandbox-user
+
 WORKDIR /app
 
 # Multi-worker settings (same envs as default sandbox image)

Then switch to the user before CMD or inside the entrypoint for the worker processes. You may need to adjust file permissions for nginx/uWSGI sockets.

Also applies to: 104-124


101-102: Installing aiohttp and requests via conda after pip may cause conflicts.

Lines 59-61 already install pip packages from requirements/code_execution.txt. If that file includes requests or aiohttp (common in networking stacks), the subsequent conda install can overwrite them with incompatible versions, leading to subtle runtime errors. Prefer installing these via pip alongside the other requirements, or ensure they're in environment.yml from the start.

recipes/libtrace/scripts/run_boxed_inference.py (2)

40-138: Significant CLI boilerplate duplication across runner scripts.

run_boxed_inference.py, run_problem_generation.py, and run_applicability_relevance.py share ~80% of their parse_args() definitions and generate() call sites. Consider extracting the shared argument definitions and generate() call into a common helper (e.g., recipes/libtrace/scripts/_common.py) to reduce maintenance burden.
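
A minimal sketch of the idea, using the module name _common.py from the suggestion above and an illustrative subset of flags (the real scripts may expose more):

import argparse

def build_common_parser(description: str) -> argparse.ArgumentParser:
    # Flags shared by the three runner scripts; each script adds its own on top.
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--cluster", required=True)
    parser.add_argument("--input_file", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--expname_prefix", default="libtrace")
    return parser

Each runner would then call build_common_parser(...), register its script-specific flags, and forward the parsed values to the shared generate() call.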


116-116: Unnecessary f-string wrapper.

f"{args.expname_prefix}" is equivalent to args.expname_prefix. This also appears in run_problem_generation.py line 104.

Proposed fix
-        expname=f"{args.expname_prefix}",
+        expname=args.expname_prefix,
recipes/libtrace/scripts/collect_generated_problems.py (1)

38-38: Top-level transformers import is heavy and unconditional.

AutoTokenizer is imported at module load even though tokenization could be made optional (currently only by changing the default in code, since the flag always takes a value). Consider a lazy import inside the if args.max_problem_tokens is not None block at line 148.

Proposed fix
-from transformers import AutoTokenizer  # type: ignore

Then at line 148:

    if args.max_problem_tokens is not None:
        from transformers import AutoTokenizer  # type: ignore
        tokenizer_name = args.tokenizer
        ...
recipes/libtrace/scripts/harvest_docs.py (3)

183-184: Hardcoded skip for scipy.LowLevelCallable lacks explanation.

Why is this specific symbol excluded? A brief comment would help future maintainers understand the reason (e.g., crashes on introspection, produces garbage docs, etc.).


330-342: Manual counter-building — Counter is already imported in other scripts.

Lines 330–342 manually build frequency dicts. collections.Counter would be more concise and idiomatic.

Proposed fix
+from collections import Counter
 ...
-    library_counts: dict[str, int] = {}
-    for doc in all_docs:
-        source = doc["source"]
-        library_counts[source] = library_counts.get(source, 0) + 1
+    library_counts = Counter(doc["source"] for doc in all_docs)
 
     print("\nItems per library:")
     for lib, count in sorted(library_counts.items()):
         print(f"  {lib}: {count}")
 
-    type_counts: dict[str, int] = {}
-    for doc in all_docs:
-        doc_type = doc["type"]
-        type_counts[doc_type] = type_counts.get(doc_type, 0) + 1
+    type_counts = Counter(doc["type"] for doc in all_docs)

37-49: Duplicated JSON helper pattern across all scripts.

The try: import orjson ... except: fallback to json block is copy-pasted in at least 5 files (harvest_docs.py, filter_applicability_relevance.py, collect_generated_problems.py, prepare_inference_jsonl.py, gather_solutions.py). Extract this into a shared utility module (e.g., recipes/libtrace/scripts/_json_utils.py) to eliminate the duplication.
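
A rough sketch of such a module, using the filename _json_utils.py from the suggestion above (the helper names are placeholders):

try:  # orjson is significantly faster; fall back to the stdlib json
    import orjson as _orjson

    def json_loads(s: str):
        return _orjson.loads(s)

    def json_dumps(obj) -> str:
        return _orjson.dumps(obj).decode("utf-8")

except ImportError:
    import json as _json_std

    def json_loads(s: str):
        return _json_std.loads(s)

    def json_dumps(obj) -> str:
        # ensure_ascii=False keeps non-ASCII text identical to the orjson path
        return _json_std.dumps(obj, ensure_ascii=False)

Each script would then import json_loads / json_dumps from this module instead of re-declaring the fallback.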

recipes/libtrace/scripts/prepare_inference_jsonl.py (1)

163-170: Validate --target_per_library before processing rows, and prefer explicit is not None check.

Two minor items:

  1. The <= 0 guard on Line 167 runs after the validation loop (Lines 164–165). Moving it before saves unnecessary work on invalid input.
  2. Line 170 relies on int truthiness (if args.target_per_library:) to mean "is set and positive." After the guard this happens to work, but if args.target_per_library is not None: is more explicit and won't silently skip if the guard is ever removed.
Proposed reordering
+    if args.target_per_library is not None and args.target_per_library <= 0:
+        raise ValueError("--target_per_library must be a positive integer.")
+
     output_rows: list[dict] = []
     for line_num, row in enumerate(rows, start=1):
         output_rows.append(validate_row(row, input_path, line_num, domain_label))
 
-    if args.target_per_library is not None and args.target_per_library <= 0:
-        raise ValueError("--target_per_library must be a positive integer.")
-
-    if args.target_per_library:
+    if args.target_per_library is not None:
         grouped: dict[str, list[dict]] = defaultdict(list)
recipes/libtrace/scripts/gather_solutions.py (4)

274-287: Dead attributes in NameUsageChecker.__init__.

target_module, target_attr, and target_parts are assigned but never read by any method in the class — all matching logic uses self.target_name directly. These appear to be leftovers from a refactor.

Cleanup suggestion
     def __init__(self, target_name: str):
         self.target_name = target_name
-        parts = target_name.split(".")
-        if len(parts) == 1:
-            self.target_module = target_name
-            self.target_attr = None
-        else:
-            self.target_parts = parts
-            self.target_module = None
-            self.target_attr = None
-
         self.import_aliases: Dict[str, str] = {}
         self.wildcard_imports: Set[str] = set()
         self.found_usage = False

396-402: Redundant name in used_names check on Line 400.

Line 390 already returns True when name in used_names, so the same check on Line 400 is dead code — it can never be reached with a truthy result.

Simplify
     parts = name.split(".")
     if len(parts) >= 2:
         module_prefix = ".".join(parts[:-1])
-        if module_prefix in imported_from_modules:
-            if name in used_names:
-                return True
+        if module_prefix in imported_from_modules and name in used_names:
+            return True
     return False

Actually, since name in used_names was already checked at line 390, this entire block (lines 396–402) is unreachable when name in used_names is true. The block can only contribute if module_prefix in imported_from_modules but name is not in used_names — in which case it returns False anyway. Consider whether this branch is intended to check something else (e.g., whether any sub-name of the module is present).


843-846: Move import os to the top-level imports.

Line 844 places import os after the first function call in the __main__ block. Standard practice is to keep stdlib imports at the top of the file with the other imports (Lines 35–45).


818-835: --require_code_rounds / --require_boxed / --dump_json flags are no-ops since default is already True.

action="store_true" with default=True means passing --require_code_rounds has no effect — it's always True unless --no_require_code_rounds is used. The flags work but are misleading in --help. Consider argparse.BooleanOptionalAction (Python 3.9+) which generates --flag / --no-flag from a single declaration.

Example using BooleanOptionalAction
-    parser.add_argument(
-        "--require_code_rounds",
-        action="store_true",
-        default=True,
-        help="Skip records without code_rounds_executed field (default: True)",
-    )
-    parser.add_argument("--no_require_code_rounds", action="store_false", dest="require_code_rounds")
+    parser.add_argument(
+        "--require_code_rounds",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Skip records without code_rounds_executed field (default: True)",
+    )

Comment on lines +33 to +34
RUN curl -L https://micromamba.snakepit.net/api/micromamba/linux-64/latest | \
tar -xvj -C /usr/local/bin --strip-components=1 bin/micromamba
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

current official micromamba download URL

💡 Result:

Official micromamba download/install URLs:

  • Linux/macOS (and Git Bash): https://micro.mamba.pm/install.sh (run via curl -L ...) [1]
  • Windows (PowerShell): https://micro.mamba.pm/install.ps1 [1]

If you want a direct binary from the official GitHub release mirror, the current latest release shown is 2.5.0-1 (Jan 8, 2026) [2], e.g.:

  • Linux x86_64: https://github.com/mamba-org/micromamba-releases/releases/download/2.5.0-1/micromamba-linux-64 [3]

Sources: [1] [2] [3]


Update to use the current official micromamba download endpoint.

The URL micromamba.snakepit.net is outdated. The current official endpoint is https://micro.mamba.pm/install.sh for installation, or https://github.com/mamba-org/micromamba-releases/releases/download/ for direct binary downloads. Using the deprecated endpoint risks breakage when the old host is retired.

🧰 Tools
🪛 Trivy (0.58.1)

[error] 20-34: 'apt-get' missing '--no-install-recommends'

'--no-install-recommends' flag is missed: 'apt-get update && apt-get install -y curl git net-tools bzip2 build-essential libseccomp-dev && ARCH="${TARGETARCH:-$(dpkg --print-architecture)}" && case "$ARCH" in amd64) PYPY_ARCH=linux64 ;; arm64|aarch64) PYPY_ARCH=aarch64 ;; x86_64) PYPY_ARCH=linux64 ;; *) echo "Unsupported TARGETARCH '$ARCH'" >&2; exit 1 ;; esac && curl -L https://downloads.python.org/pypy/pypy3.10-v7.3.17-$PYPY_ARCH.tar.bz2 -o /tmp/pypy.tar.bz2 && tar -xjf /tmp/pypy.tar.bz2 -C /opt/ && ln -s /opt/pypy3.10-v7.3.17-$PYPY_ARCH/bin/pypy3 /usr/bin/pypy3 && /usr/bin/pypy3 -m ensurepip && rm /tmp/pypy.tar.bz2 && rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*'

Rule: DS029

Learn more

(IaC/Dockerfile)

🤖 Prompt for AI Agents
In `@recipes/libtrace/dockerfiles/Dockerfile.sandbox` around lines 33 - 34, The
Dockerfile RUN that downloads micromamba from micromamba.snakepit.net is using a
deprecated endpoint; update the download step in the RUN that invokes curl and
tar to use the official installer or release URL (either run the official
installer at https://micro.mamba.pm/install.sh or fetch the appropriate binary
from https://github.com/mamba-org/micromamba-releases/releases/download/),
adjust the curl invocation and any tar/extract steps to match the chosen source,
and ensure the command still places the micromamba binary into /usr/local/bin
(the RUN line that writes to /usr/local/bin and strips components should be
updated accordingly).

Comment on lines +96 to +97
git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
pip install --no-cache-dir /tmp/geant4_pybind && \
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check available tags for geant4_pybind
git ls-remote --tags https://github.com/HaarigerHarald/geant4_pybind.git | tail -20

Repository: NVIDIA-NeMo/Skills

Length of output: 1355


🏁 Script executed:

# Also check the Dockerfile to confirm the current state
cat -n recipes/libtrace/dockerfiles/Dockerfile.sandbox | sed -n '90,100p'

Repository: NVIDIA-NeMo/Skills

Length of output: 646


Pin geant4_pybind to a specific tag — the proposed version v0.38.0 does not exist.

The issue is valid: geant4_pybind is cloned without a version pin, making builds non-reproducible. Unlike GROMACS (v2025.3) and Geant4 (v11.2.2), this uses the default branch HEAD.

However, the proposed tag v0.38.0 does not exist in the repository. Available tags are v0.1.0 through v0.1.1. Pin to an actual compatible version:

-    git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
+    git clone --recursive --depth 1 --branch v0.1.1 https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \

Verify v0.1.1 (or the appropriate tag) is compatible with Geant4 v11.2.2 before merging.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
pip install --no-cache-dir /tmp/geant4_pybind && \
git clone --recursive --depth 1 --branch v0.1.1 https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
pip install --no-cache-dir /tmp/geant4_pybind && \
🤖 Prompt for AI Agents
In `@recipes/libtrace/dockerfiles/Dockerfile.sandbox` around lines 96 - 97, The
Dockerfile currently clones geant4_pybind without a version, making builds
non-reproducible; change the git clone in
recipes/libtrace/dockerfiles/Dockerfile.sandbox (the line that calls git clone
--recursive https://github.com/HaarigerHarald/geant4_pybind.git
/tmp/geant4_pybind) to clone a specific existing tag (e.g., v0.1.1) instead of
HEAD, and then pip install that pinned copy; also verify that the chosen tag
(v0.1.1) is compatible with Geant4 v11.2.2 before merging.

Comment on lines +142 to +143
if args.max_problem_tokens is not None and args.max_problem_tokens <= 0:
raise ValueError("--max_problem_tokens must be a positive integer or omitted.")
Contributor

⚠️ Potential issue | 🟡 Minor

--max_problem_tokens validation is unreachable for the default value.

args.max_problem_tokens defaults to 10000 (line 115), so the is not None and <= 0 check on line 142 can only trigger if the user explicitly passes --max_problem_tokens 0 or a negative number. This is fine as defensive validation, but note that argparse doesn't support "omitted" vs "set to None" here — a user cannot disable tokenization without modifying the code. If disabling tokenization at the CLI is intended, consider using default=None and documenting "omit to disable".
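
A sketch of the default=None variant described here (the flag name and validation are taken from the script; the help wording is illustrative):

parser.add_argument(
    "--max_problem_tokens",
    type=int,
    default=None,
    help="Maximum problem length in tokens; omit to disable tokenization.",
)

# later, after args = parser.parse_args():
if args.max_problem_tokens is not None and args.max_problem_tokens <= 0:
    raise ValueError("--max_problem_tokens must be a positive integer or omitted.")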

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 143-143: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/collect_generated_problems.py` around lines 142 -
143, The CLI currently sets args.max_problem_tokens to 10000 so the "omit to
disable" behavior is impossible; change the argparse definition for
--max_problem_tokens to use default=None (keep type=int) so omitting the flag
disables tokenization, and adjust the validation around args.max_problem_tokens
to only raise ValueError when args.max_problem_tokens is not None and <= 0,
updating the error text to mention that omitting the flag disables tokenization;
reference the argparse flag for --max_problem_tokens and the runtime check that
inspects args.max_problem_tokens.

Comment on lines +159 to +169
parser.add_argument(
"--require_applicable",
action="store_true",
default=True,
help="Require Applicable=1 (default: True).",
)
parser.add_argument(
"--no_require_applicable",
action="store_false",
dest="require_applicable",
)
Contributor

⚠️ Potential issue | 🟡 Minor

--require_applicable flag is a no-op due to default=True with store_true.

Since default=True is already set, passing --require_applicable on the CLI doesn't change anything — the value is True either way. Only --no_require_applicable has any effect. The action="store_true" should be removed (or the default changed) so the flag can actually toggle the behavior.

Proposed fix
     parser.add_argument(
         "--require_applicable",
-        action="store_true",
+        action=argparse.BooleanOptionalAction,
         default=True,
         help="Require Applicable=1 (default: True).",
     )
-    parser.add_argument(
-        "--no_require_applicable",
-        action="store_false",
-        dest="require_applicable",
-    )

This uses BooleanOptionalAction (Python 3.9+) to support both --require_applicable and --no-require_applicable cleanly.

🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/filter_applicability_relevance.py` around lines 159
- 169, The --require_applicable flag is effectively a no-op because
parser.add_argument("--require_applicable", action="store_true", default=True,
...) always yields True; update the argument handling so the CLI can toggle the
value: replace the current paired arguments ("--require_applicable" with
action="store_true" and "--no_require_applicable") by a single
parser.add_argument for "require_applicable" that uses
argparse.BooleanOptionalAction (or alternatively set default=False with
action="store_true" and remove the separate --no_require_applicable) so the
"require_applicable" boolean toggles correctly.

Comment on lines 456 to 460
def _has_boxed_in_last_assistant(messages: list) -> bool:
for msg in reversed(messages):
if msg.get("role") == "assistant" and msg.get("content"):
return r"\boxed{" in msg["content"] or "\\boxed{" in msg["content"]
return False
Contributor

⚠️ Potential issue | 🟡 Minor

Redundant \boxed{ check — both expressions are identical strings.

r"\boxed{" and "\\boxed{" both evaluate to the same 7-character string \boxed{. The or clause is dead code. If you intended to match two different representations, one of the conditions needs to be different.

Simplify
 def _has_boxed_in_last_assistant(messages: list) -> bool:
     for msg in reversed(messages):
         if msg.get("role") == "assistant" and msg.get("content"):
-            return r"\boxed{" in msg["content"] or "\\boxed{" in msg["content"]
+            return r"\boxed{" in msg["content"]
     return False
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 456 - 460, The
condition in _has_boxed_in_last_assistant contains a redundant check (r"\boxed{"
and "\\boxed{" are identical), so simplify the boolean test by keeping a single
representation (e.g., r"\boxed{") or, if you intended to detect a different
escaped form, replace the second operand with the correct alternative; update
the return expression to something like: return r"\boxed{" in msg["content"] (or
the intended distinct string) and remove the duplicate or dead code.

Comment on lines +710 to +728
for line_idx, is_valid, has_code, has_boxed, used_given, names_used in _progress(
results, total=len(all_lines), desc="Categorizing", unit="line"
):
if not is_valid or not has_code:
continue
if require_boxed and not has_boxed:
skipped_no_boxed += 1
continue
if used_given:
given_name_indices.append(line_idx)
elif names_used:
any_name_only_data.append((line_idx, names_used))
for name in names_used:
name_usage_counts[name] += 1

if require_boxed:
print(f"\nSkipped {skipped_no_boxed:,} solutions without \\boxed{{}}")
print(f"Found {len(given_name_indices):,} solutions using given name")
print(f"Found {len(any_name_only_data):,} solutions using any name but not given")
Contributor

⚠️ Potential issue | 🟡 Minor

skipped_no_boxed counter is always 0 — dead code due to worker-side filtering.

When require_boxed=True, the worker (process_line_for_sampling, Line 577–578) already returns has_code=False for solutions without \boxed{}. Line 713 then skips these rows before reaching the has_boxed check on Line 715. So Lines 715–716 are unreachable and the counter is always 0.

If require_boxed=False, the condition on Line 715 short-circuits to False, so the counter never increments in that case either.

Either remove the dead counter or restructure the worker to return filtered-out-by-boxed as a distinct category so it can be reported.

Option A: Remove dead code
-    skipped_no_boxed = 0
 
     with mp.Pool(...) as pool:
         results = pool.imap_unordered(...)
         for line_idx, is_valid, has_code, has_boxed, used_given, names_used in _progress(...):
             if not is_valid or not has_code:
                 continue
-            if require_boxed and not has_boxed:
-                skipped_no_boxed += 1
-                continue
             if used_given:
                 ...
 
-    if require_boxed:
-        print(f"\nSkipped {skipped_no_boxed:,} solutions without \\boxed{{}}")
Option B: Have worker distinguish boxed-filtered from no-code

Change process_line_for_sampling to return has_code=True when require_boxed fails (since there is code, just no boxed answer), so the caller can tally it:

         if _REQUIRE_BOXED and not has_boxed:
-            return (line_idx, True, False, False, False, ())
+            return (line_idx, True, True, False, False, ())
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 710 - 728, The
skipped_no_boxed counter is dead because process_line_for_sampling already sets
has_code=False when a solution lacks a \\boxed{} answer, so the require_boxed
check in the loop (inside gather_solutions.py) never sees those rows; fix by
either removing skipped_no_boxed and its print (clean up dead code around the
loop that references require_boxed and skipped_no_boxed) or modify
process_line_for_sampling to return a distinct flag (e.g., has_boxed or
was_filtered_by_boxed) separate from has_code so the loop (where
given_name_indices, any_name_only_data, name_usage_counts are updated) can
correctly increment skipped_no_boxed when require_boxed=True. Ensure references
to process_line_for_sampling, require_boxed, skipped_no_boxed, and the loop that
appends to given_name_indices/any_name_only_data are updated consistently.

Comment on lines +243 to +261
docs = extract_docs_from_module(lib, library_name, seen=set())

output_file = output_dir / f"{library_name}.jsonl"
write_jsonl(output_file, docs)

functions = sum(1 for doc in docs if doc["type"] == "function")
classes = sum(1 for doc in docs if doc["type"] == "class")
methods = sum(1 for doc in docs if doc["type"] == "method")

print(f"\nExtracted {len(docs)} items from {library_name}")
print(f"Saved to: {output_file}")
print("Statistics:")
print(f" Functions: {functions}")
print(f" Classes: {classes}")
print(f" Methods: {methods}")

for doc in docs:
doc["source"] = library_name
return docs
Contributor

⚠️ Potential issue | 🟠 Major

"source" field is added after per-library JSONL is written — per-library files will lack it.

write_jsonl at line 246 writes docs without the "source" key. The doc["source"] = library_name mutation at line 260 only takes effect for the unified file. If any downstream consumer reads per-library files and expects "source", it will fail.

If this is intentional (per-library filename implies the source), consider documenting it. Otherwise, move the source assignment before the write.

Proposed fix (if source should be in per-library files too)
 def process_library(library_name: str, output_dir: Path) -> list[dict]:
     ...
     docs = extract_docs_from_module(lib, library_name, seen=set())
 
+    for doc in docs:
+        doc["source"] = library_name
+
     output_file = output_dir / f"{library_name}.jsonl"
     write_jsonl(output_file, docs)
     ...
-    for doc in docs:
-        doc["source"] = library_name
     return docs
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/harvest_docs.py` around lines 243 - 261, The
`"source"` field is assigned to each doc after `write_jsonl` has already written
the docs to the per-library JSONL file, which means the saved files will be
missing this field. Move the loop that assigns `doc["source"] = library_name` to
occur before the `write_jsonl` call so that the `source` field is included in
both the written JSONL file and the returned docs list.

inputs and output paths.

For the full walkthrough, command examples, and configuration details see
[recipes/libtrace/README.md](/recipes/libtrace/README.md).
Collaborator

This is correct. @jubick1337, please run mkdocs serve to double-check that the docs render properly and fix any warnings.

output = await self._send_request(request, timeout)
except httpx.TimeoutException:
output = {"process_status": "timeout", "stdout": "", "stderr": "Client timed out\n"}
except httpx.TransportError as e:
Collaborator

@gwarmstrong to review this change; I'm not fully sure about the impact or whether it's needed. E.g., is the session really lost here?

Collaborator Author

I hit this in practice: the custom sandbox image uses PySCF, which can hit SIGILL. When a uwsgi worker crashes, the sandbox restarts it, but the HTTP connection to the client is already gone, so the request fails with httpx.TransportError. Without this catch, it bubbles up and kills all 32 concurrent generations.
With this change, only the affected generation gets an error and the rest continue. The "session was lost" message tells the LLM to re-import, which is harmless if the session is actually still alive. @gwarmstrong, happy to hear if there's a better recovery path.

Collaborator

IMO, the right thing to do would be to:

  1. include httpx.TransportError in the preceding except block
  2. add a log line like LOG.warning("Sandbox communication error for session %s: %s", request_session_id, e) to the above, so internally we can distinguish between the errors

This would catch both sources. Then, on the subsequent request, if the session was lost, it will message back to the client accordingly. This is consistent with how things are done presently for timeouts:

elif session_id is not None and new_session_created and self.disable_session_restore:
# Session was recreated but restore is disabled — clear stale history and warn the model.
self.session_histories.pop(session_id_str, None)
self.session_histories.pop(request_session_id_str, None)
LOG.warning("Session %s was recreated but restore is disabled; history cleared", session_id)
output["stderr"] = (
"RuntimeError: Sandbox state restoration failed after the execution worker restarted. "
"The interactive session history has been cleared; "
"please re-run the last code block without relying on prior state.\n"
) + output.get("stderr", "")

With the changes proposed presently in this PR, if the session/worker is lost, the client would actually get two separate "the session died" warnings: once from the proposed addition and again from the existing handler that catches the session loss retroactively. In the case where the session is not lost, the model would still get one message about the session being lost.
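
A minimal sketch of that shape, keeping the names used in this thread (request, timeout, request_session_id) and reusing the existing timeout payload; this illustrates the suggestion and is not a committable patch:

try:
    output = await self._send_request(request, timeout)
except (httpx.TimeoutException, httpx.TransportError) as e:
    # TimeoutException is a subclass of TransportError; listing both keeps the
    # original timeout handling explicit while also covering worker crashes.
    LOG.warning("Sandbox communication error for session %s: %s", request_session_id, e)
    output = {"process_status": "timeout", "stdout": "", "stderr": "Client timed out\n"}

The existing session-restore path would then report the lost session on the next request, so the model only ever sees one warning.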


# Only the server component (span_group_nodes=True) should claim group GPUs.
# Other components (client, sandbox) run on the same node but don't need GPUs.
if span_group_nodes:
Collaborator

Was this causing problems? I think we might have legit cases of the sandbox needing GPUs, e.g. for CUDA code execution. And the main client might also need GPUs in the case of run_cmd jobs.

Collaborator Author

Fair point, I was matching the legacy add_task behavior (sandbox gets 0, client gets 0 when the server is present), but you're right that it could break run_cmd jobs or a CUDA sandbox. I've reverted the GPU change; the real fix for my case was the srun flags.

@@ -0,0 +1,138 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Collaborator

Do we need scripts like this? Isn't this exactly the same as ns generate with just an appropriate prompt config set?

Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
- Remove silent try/except wrappers in gather_solutions.py (let it crash)
- Replace .get() with direct access where keys are expected
- Remove dead NameUsageChecker class (superseded by UsedNamesExtractor)
- Replace typing imports with builtins (Dict->dict, List->list, etc.)
- Make --target required (no magic default)
- Rename --field to --domain in harvest_docs.py for consistency
- Add domain aliases and .lower() to filter_applicability_relevance.py

Signed-off-by: jubick1337 <mnovikov@nvidia.com>
@jubick1337 jubick1337 requested a review from dgtm777 February 11, 2026 03:57
Contributor

@greptile-apps greptile-apps bot left a comment

17 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Contributor

greptile-apps bot commented Feb 11, 2026

Additional Comments (1)

nemo_skills/pipeline/utils/exp.py
Duplicate --mpi flags still present. Line 316 hardcodes --mpi=pmix, then line 329 extends with extra_srun_args which includes --mpi=none for sandbox. This creates two --mpi flags. As noted in the previous thread, this needs to replace the default rather than append.

Consider either:

  • Check if extra_srun_args contains --mpi= and filter it from the base list (see the sketch after this list)
  • Or replace --mpi=pmix conditionally based on whether extra_srun_args has an override
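
A hedged sketch of the first option (the helper name and the exact way get_executor() assembles its list are assumptions, not the current code):

def merge_srun_args(base_args, extra_srun_args=None):
    """Append extra srun args, letting them override the default --mpi=pmix."""
    extra_srun_args = extra_srun_args or []
    if any(arg.startswith("--mpi=") for arg in extra_srun_args):
        # Drop the hardcoded --mpi= entry so only one --mpi flag survives.
        base_args = [arg for arg in base_args if not arg.startswith("--mpi=")]
    return list(base_args) + list(extra_srun_args)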

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipes/libtrace/scripts/gather_solutions.py`:
- Around line 46-63: The stdlib fallback _json_dumps currently calls
_json_std.dumps(obj) without ensure_ascii=False, causing non-ASCII chars to be
escaped and producing inconsistent output versus orjson; update the fallback
implementation of _json_dumps to call _json_std.dumps(obj, ensure_ascii=False)
so it matches the orjson behavior, keeping the existing _json_loads and _orjson
handling unchanged and referencing the _json_dumps, _json_loads, _orjson, and
_json_std symbols to locate the change.
- Around line 686-716: The CLI currently declares --target with required=True
inside parse_args(), which forces a value even in stats mode; change
parse_args() so --target is not required (set required=False and default=None)
and adjust its help text accordingly, then in the main entry (the code that
consumes parse_args(), e.g. after calling parse_args() in main or if __name__ ==
"__main__") validate that when args.mode == "sample" args.target is provided (if
args.target is None call parser.error("--target is required when --mode=sample")
or raise the same error through the ArgumentParser instance); locate parse_args,
the --target argument definition, and the main block that inspects args.mode to
implement this check.
🧹 Nitpick comments (8)
nemo_skills/code_execution/sandbox.py (1)

399-408: Two small nits in the retry exception handling.

  1. httpx.TimeoutException is a subclass of httpx.TransportError, so listing both in the same except tuple is redundant — TransportError already covers it.
  2. Line 404: The – (EN DASH) should be a - (HYPHEN-MINUS) per Ruff RUF001.
Proposed cleanup
             except (
-                httpx.TransportError,  # Covers ReadError, ConnectError, RemoteProtocolError, etc.
-                httpx.TimeoutException,
+                httpx.TransportError,  # Covers ReadError, ConnectError, RemoteProtocolError, TimeoutException, etc.
                 httpx.HTTPStatusError,
             ) as e:
-                LOG.warning("Retry %d/%d deleting session %s – %s", attempt + 1, max_retries, session_id, e)
+                LOG.warning("Retry %d/%d deleting session %s - %s", attempt + 1, max_retries, session_id, e)
recipes/libtrace/scripts/collect_generated_problems.py (2)

87-103: Minor inconsistency: glob.glob vs Path.glob used in sibling script.

gather_solutions.py (line 360) uses Path.glob(pattern), while this function uses glob.glob(str(Path(input_dir) / pattern)). Both work, but Path.glob is more idiomatic and would let you drop the import glob at line 33.
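
For illustration, the two spellings side by side (input_dir and pattern are example values):

import glob
from pathlib import Path

input_dir, pattern = "generations", "output-rs*.jsonl"  # example values
files_via_glob = sorted(glob.glob(str(Path(input_dir) / pattern)))       # current style
files_via_path = sorted(str(p) for p in Path(input_dir).glob(pattern))   # Path.glob, no glob import needed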


138-153: from transformers import AutoTokenizer at module level makes the script unusable without transformers, even if tokenization could theoretically be disabled.

The top-level import on line 38 will crash before parse_args() runs. Since max_problem_tokens defaults to 10000 (always on), this is currently fine in practice. However, if you ever intend to allow disabling tokenization (e.g., default=None), consider deferring the import to line 153 where the tokenizer is actually created.

Suggested deferred import
-from transformers import AutoTokenizer  # type: ignore
 ...
     if args.max_problem_tokens is not None:
         tokenizer_name = args.tokenizer
         if tokenizer_name is None:
             local_default = Path("/hf_models/gpt-oss-120b")
             tokenizer_name = str(local_default) if local_default.exists() else "openai/gpt-oss-120b"
+        from transformers import AutoTokenizer  # type: ignore
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
recipes/libtrace/scripts/filter_applicability_relevance.py (2)

106-120: Regex patterns are recompiled on every call to parse_scores.

If this function is called in a tight loop (once per row), recompiling the regex each time is wasteful. Consider accepting pre-compiled patterns or caching them.

Suggested optimization
+import functools
+
+@functools.lru_cache(maxsize=16)
+def _get_score_patterns(label: str):
+    return (
+        re.compile(rf"{re.escape(label)}-Applicable:\s*([01])"),
+        re.compile(rf"{re.escape(label)}-Relevance:\s*([123])"),
+    )
+
 def parse_scores(generation: str, label: str, path: str, line_num: int) -> tuple[int, int]:
-    applicable_pattern = re.compile(rf"{re.escape(label)}-Applicable:\s*([01])")
-    relevance_pattern = re.compile(rf"{re.escape(label)}-Relevance:\s*([123])")
+    applicable_pattern, relevance_pattern = _get_score_patterns(label)

69-87: read_jsonl is duplicated in collect_generated_problems.py and should be extracted to a shared utility module.

The function in filter_applicability_relevance.py is identical to the one in collect_generated_problems.py. A similar variant exists in prepare_inference_jsonl.py (without the skip_invalid parameter). Extract the two identical versions into a shared utility module (e.g., recipes/libtrace/scripts/libtrace_utils.py) to reduce duplication and maintenance burden.
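
A sketch of the shared reader (libtrace_utils.py is the reviewer's example name; the merged signature with skip_invalid is an assumption):

import json

def read_jsonl(path, skip_invalid: bool = False) -> list[dict]:
    """Read a JSONL file, optionally skipping lines that fail to parse."""
    rows: list[dict] = []
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as e:
                if skip_invalid:
                    continue
                raise ValueError(f"{path}:{line_num}: invalid JSON ({e})") from e
    return rows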

recipes/libtrace/scripts/harvest_docs.py (1)

302-348: Shared libraries across domains will produce duplicate entries in the unified file.

Libraries like sympy, numpy, scipy appear in all domain groups. Since process_library creates a fresh seen=set() per call and the main loop extends all_docs with each library's output, the unified file will contain duplicate entries when the same library is listed multiple times (e.g., if --libraries numpy numpy is passed, or a custom list overlaps). For domain groups this doesn't happen since each library name appears once in each group list. But if this were used with --libraries, duplicates could occur.

This is low-risk for the default domain usage, but worth a note.
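
If defensive deduplication is ever wanted, a possible guard (the (source, type, name) key is an assumption about the doc schema; only source and type are confirmed above):

seen_keys: set[tuple] = set()
deduped_docs: list[dict] = []
for doc in all_docs:
    key = (doc.get("source"), doc.get("type"), doc.get("name"))
    if key in seen_keys:
        continue
    seen_keys.add(key)
    deduped_docs.append(doc)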

recipes/libtrace/scripts/gather_solutions.py (2)

697-714: Same store_true + default=True no-op pattern as in filter_applicability_relevance.py.

--require_code_rounds, --require_boxed, and --dump_json all use action="store_true" with default=True. Passing these flags on the CLI has no effect — only the --no_* variants work. Consider using argparse.BooleanOptionalAction for clarity.


719-743: import os should be at the top of the file, not inside the __main__ block.

Line 723 imports os inside the main block. Move it to the top-level imports for consistency and clarity.

Proposed fix
+import os
 import argparse
 import ast
 import importlib
 ...
 
 if __name__ == "__main__":
     args = parse_args()
 
     input_files = resolve_input_files(args.input_dir, args.input_files, args.pattern)
-    import os
 
     num_workers = args.num_workers if args.num_workers is not None else max(1, (os.cpu_count() or 2) // 2)

Comment on lines +46 to +63
try: # orjson is significantly faster; fallback to std json
import orjson as _orjson # type: ignore

def _json_loads(s: str):
return _orjson.loads(s)

def _json_dumps(obj) -> str:
return _orjson.dumps(obj).decode("utf-8")

except Exception: # pragma: no cover - best effort
_orjson = None

def _json_loads(s: str):
return _json_std.loads(s)

def _json_dumps(obj) -> str:
return _json_std.dumps(obj)

Contributor

⚠️ Potential issue | 🟡 Minor

Fallback _json_dumps missing ensure_ascii=False — inconsistent with all other scripts.

Every other script in this PR passes ensure_ascii=False in the stdlib fallback. This omission means non-ASCII characters (e.g., in docstrings, problem text) will be escaped as \uXXXX sequences, producing output that differs depending on whether orjson is installed.

Proposed fix
     def _json_dumps(obj) -> str:
-        return _json_std.dumps(obj)
+        return _json_std.dumps(obj, ensure_ascii=False)
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 55-55: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 46 - 63, The
stdlib fallback _json_dumps currently calls _json_std.dumps(obj) without
ensure_ascii=False, causing non-ASCII chars to be escaped and producing
inconsistent output versus orjson; update the fallback implementation of
_json_dumps to call _json_std.dumps(obj, ensure_ascii=False) so it matches the
orjson behavior, keeping the existing _json_loads and _orjson handling unchanged
and referencing the _json_dumps, _json_loads, _orjson, and _json_std symbols to
locate the change.

Comment on lines +686 to +716
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Gather and analyze solutions.")
parser.add_argument("--mode", choices=["stats", "sample"], default="stats")
parser.add_argument("--input_dir", type=str, default=None)
parser.add_argument("--input_files", type=str, nargs="+", default=None)
parser.add_argument("--pattern", type=str, default="output-rs*.jsonl")
parser.add_argument("--dataset_name", type=str, default="dataset")
parser.add_argument("--target", type=int, required=True, help="Target number of sampled records.")
parser.add_argument("--output_dir", type=str, default="sampled_solutions")
parser.add_argument("--random_seed", type=int, default=42)
parser.add_argument("--num_workers", type=int, default=None, help="Default: os.cpu_count()//2")
parser.add_argument(
"--require_code_rounds",
action="store_true",
default=True,
help="Skip records without code_rounds_executed field (default: True)",
)
parser.add_argument("--no_require_code_rounds", action="store_false", dest="require_code_rounds")
parser.add_argument(
"--require_boxed", action="store_true", default=True, help="Skip records without \\boxed{} (default: True)"
)
parser.add_argument("--no_require_boxed", action="store_false", dest="require_boxed")
parser.add_argument(
"--dump_json",
action="store_true",
default=True,
help="Serialize tool_call arguments as JSON strings (default: True)",
)
parser.add_argument("--no_dump_json", action="store_false", dest="dump_json")
parser.add_argument("--user_prompt_prefix", type=str, default=DEFAULT_USER_PROMPT_PREFIX)
return parser.parse_args()
Contributor

⚠️ Potential issue | 🟡 Minor

--target is required=True but unused in stats mode — forces users to provide a meaningless argument.

In stats mode (line 745), args.target is never read. Users are forced to specify --target even when it has no effect, which is confusing.

Proposed fix
-    parser.add_argument("--target", type=int, required=True, help="Target number of sampled records.")
+    parser.add_argument("--target", type=int, default=None, help="Target number of sampled records (required for sample mode).")

Then validate in the main block:

if args.mode == "sample" and args.target is None:
    parser.error("--target is required when --mode=sample")
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 686 - 716, The CLI
currently declares --target with required=True inside parse_args(), which forces
a value even in stats mode; change parse_args() so --target is not required (set
required=False and default=None) and adjust its help text accordingly, then in
the main entry (the code that consumes parse_args(), e.g. after calling
parse_args() in main or if __name__ == "__main__") validate that when args.mode
== "sample" args.target is provided (if args.target is None call
parser.error("--target is required when --mode=sample") or raise the same error
through the ArgumentParser instance); locate parse_args, the --target argument
definition, and the main block that inspects args.mode to implement this check.
