Add LibTrace recipe for generating domain-specific reasoning data #1224

jubick1337 wants to merge 7 commits into main
Conversation
316ee6f to 7f5b80d
```python
generate(
    ctx=wrap_arguments(" ".join(extra_args)),
    cluster=args.cluster,
    input_file=args.input_file,
    output_dir=args.output_dir,
    expname=f"{args.expname_prefix}-{args.domain or 'custom'}",
```
Domain flag is ignored

--domain is only used to change expname, but it is never forwarded into the generate() Hydra args. As written, passing --domain chem has no effect on the actual prompt variables or behavior, which is likely unintended given that the CLI and the prompt expect domain-specific labeling. Consider adding it to extra_args (e.g. ++domain=..., or whatever config key the prompt uses) or removing the flag to avoid a misleading interface.
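One hedged sketch of forwarding the flag, assuming extra_args is the list joined into the Hydra context and that the prompt reads a domain config key (the exact key name is an assumption):

```python
# Sketch only: append a Hydra override for the domain before building the context.
# "++domain=..." is an assumed key; use whatever variable the prompt config actually reads.
if args.domain:
    extra_args.append(f"++domain={args.domain}")

ctx = wrap_arguments(" ".join(extra_args))
```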
```python
extra_srun_args = None
if isinstance(command.script, SandboxScript):
    # Also disable PMI/PMIx for the sandbox step. The sandbox runs a
    # single SLURM task but spawns many child processes (uwsgi workers,
    # IPython shells). On some clusters, PMIx can treat child crashes
```
Duplicate --mpi srun args

The sandbox path sets extra_srun_args = ["--kill-on-bad-exit=0", "--mpi=none"], but get_executor() always seeds srun_args with "--mpi=pmix" and then extends it with extra_srun_args (see nemo_skills/pipeline/utils/exp.py). This produces two --mpi flags in the same srun command, which Slurm may reject or may resolve by keeping only the first value, making the override ineffective. The override needs to replace the default --mpi=pmix rather than append a second --mpi flag.
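A small sketch of one way to make the override win inside get_executor(); the flag values come from the comment above, while the key-based dedupe helper itself is illustrative:

```python
# Sketch: drop any seeded flag whose name (the part before "=") is also supplied
# via extra_srun_args, so "--mpi=none" replaces the default "--mpi=pmix".
def merge_srun_args(srun_args: list[str], extra_srun_args: list[str] | None) -> list[str]:
    extra = extra_srun_args or []
    overridden = {arg.split("=", 1)[0] for arg in extra}
    kept = [arg for arg in srun_args if arg.split("=", 1)[0] not in overridden]
    return kept + extra
```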
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
4693193 to b53767b
📝 Walkthrough

This pull request introduces LibTrace, a comprehensive recipe for building domain-specific reasoning data from library APIs. It adds documentation, Python pipeline scripts for data harvesting and processing, Docker infrastructure, prompt templates, and enhances sandbox error handling for transport failures.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@recipes/libtrace/dockerfiles/environment.yml`:
- Around line 24-25: Remove the redundant conda package entry "geant4" from
environment.yml so the Dockerfile.sandbox build uses only the source-built
Geant4 v11.2.2; edit environment.yml to delete the geant4 line referenced in the
diff, ensuring the Dockerfile.sandbox steps that build and install Geant4 into
${CONDA_PREFIX} and the dependent geant4_pybind linkage remain unchanged to
avoid duplicate installation and file conflicts.
In `@recipes/libtrace/scripts/collect_generated_problems.py`:
- Around line 175-176: Replace the use of row.get("name", "<missing>") and
row.get("source", "<missing>") with direct dictionary access (row["name"] and
row["source"]) so the code fails fast when validated keys are unexpectedly
missing; locate the usages of row at the point where name and source are
assigned in collect_generated_problems.py (the block that validated keys
earlier) and update those two expressions accordingly, keeping existing variable
names and surrounding logic unchanged.
In `@recipes/libtrace/scripts/filter_applicability_relevance.py`:
- Around line 188-197: Replace the awkward raise/catch pattern around extracting
and validating source: stop raising TypeError and remove the broad except
Exception; instead explicitly handle missing key and wrong type. Specifically,
try to get source = item["source"] in a try/except KeyError as e and on KeyError
either increment invalid_count and continue when args.skip_invalid is true or
raise ValueError(f"Missing or invalid source in {input_path}:{line_num}") from
e; then check if not isinstance(source, str) and likewise increment
invalid_count and continue when args.skip_invalid is true or raise
ValueError(...) (use raise ... from where appropriate if you have an underlying
exception); only after passing these explicit checks update
source_dist_pre[source] += 1 and append to valid_items. This removes the blanket
except and uses raise ... from per B904.
In `@recipes/libtrace/scripts/gather_solutions.py`:
- Line 1: Add the repository's required copyright header to the very top of
gather_solutions.py (before the existing module docstring/triple-quote), using
the standard project header template (year and organization/owner) that CI
expects; ensure the header is placed above any code or docstrings and preserve
the existing module docstring after the header.
- Around line 528-550: The try/except in process_line (and likewise in
process_line_for_sampling) currently catches all Exception types and hides
programming errors; narrow the exception handling to only the expected
input/parsing errors (e.g., json.JSONDecodeError, KeyError, TypeError,
ValueError) and handle those by logging the error and returning the failure
tuple, but remove the broad except so other exceptions (AttributeError,
IndexError, etc.) propagate to the parent; ensure you include the exception
details in the log message when returning on malformed input and apply the same
change to process_line_for_sampling.
- Around line 846-852: The loop that calls _json_loads(line) and adds
data["name"] currently swallows all exceptions; modify it to track and report
skipped lines: introduce a skipped_lines_count (and optionally collect up to N
examples in malformed_examples) in the except block instead of pass, increment
it when JSONDecodeError or KeyError (or any Exception) occurs, and after the
loop log or raise a clear warning including skipped_lines_count, total lines
processed, and a few example malformed lines; keep the existing RuntimeError("No
names found in input data.") for the all_names empty case but augment it or the
surrounding log with the skipped-line statistics so data quality issues are
visible.
- Around line 427-439: In _extract_all_code_from_messages, replace the
permissive dict.get() usage with direct indexing to fail fast: check
tool_call["function"]["name"] == "stateful_python_code_exec" and access args =
tool_call["function"]["arguments"] then pull code via _json_loads(args)["code"]
or args["code"] as before; this uses the known structure created by
parse_messages and surfaces errors immediately instead of silently masking
missing keys. Ensure you update all occurrences of
tool_call.get(...)/msg.get(...) in that function to msg["role"],
tool_call["function"]["name"], tool_call["function"]["arguments"], and
args["code"] accordingly.
- Around line 639-653: The try/except in the loop that writes records (using
_json_loads, parse_messages, _json_dumps, and writing to output_file via
_progress) is swallowing errors and causing silent data loss; update
save_sampled_messages so the except block increments a failure counter and
records the failed indices (e.g., failed_count and failed_indices), continue
writing remaining records, and after the with block log or print a concise
summary including failed_count and failed_indices (or re-raise if failures > 0
depending on policy) so callers can detect incomplete output instead of silently
dropping records.
🟡 Minor comments (15)
nemo_skills/code_execution/sandbox.py-399-408 (1)
399-408: ⚠️ Potential issue | 🟡 Minor
Fix ambiguous EN DASH character on line 404.
Static analysis (RUF001) flags the
– (EN DASH, U+2013) in the log format string. Replace with a standard hyphen for consistency.

Proposed fix:
- LOG.warning("Retry %d/%d deleting session %s – %s", attempt + 1, max_retries, session_id, e) + LOG.warning("Retry %d/%d deleting session %s - %s", attempt + 1, max_retries, session_id, e)recipes/libtrace/scripts/run_boxed_inference.py-1-2 (1)
1-2: ⚠️ Potential issue | 🟡 Minor
Missing copyright header — CI failure.
The pipeline reports a copyright header is missing. Add the project's standard copyright header at the top of this file.
docs/recipes/libtrace.md-10-11 (1)
10-11: ⚠️ Potential issue | 🟡 Minor
Broken link — README.md is outside the docs/ directory.

The link /recipes/libtrace/README.md points to a file that lives in the repo root's recipes/ folder, not inside docs/. MkDocs only serves content from the docs/ directory, so this link will 404 on the rendered site. Consider linking to the GitHub source directly (e.g., the repo URL for that file) or moving the relevant content into the docs tree.

Proposed fix (link to GitHub source):
```diff
 For the full walkthrough, command examples, and configuration details see
-[recipes/libtrace/README.md](/recipes/libtrace/README.md).
+[recipes/libtrace/README.md](https://github.com/NVIDIA-NeMo/Skills/blob/main/recipes/libtrace/README.md).
```

```bash
#!/bin/bash
# Verify that the README is not inside the docs directory
fd 'README.md' --full-path --glob '**/libtrace/**'
```

recipes/libtrace/scripts/filter_applicability_relevance.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Missing copyright header — CI failure.
The pipeline reports a copyright header is missing. Add the project's standard copyright header at the top of this file.
recipes/libtrace/scripts/run_applicability_relevance.py-64-71 (1)
64-71: ⚠️ Potential issue | 🟡 Minor
Space-containing Hydra overrides in extra_args will be split incorrectly.

wrap_arguments (from nemo_skills/pipeline/cli.py) splits its input on spaces: arguments.split(" "). If a user passes --extra_args '++inference.stop_phrases="end of response"', the value will be broken into multiple tokens. This affects both this script and run_problem_generation.py.

Consider whether wrap_arguments should accept a list instead, or ensure extra_args is split and extended individually before joining.
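A quick illustration of the failure mode, plus a shell-style split that keeps the quoted value together (only useful if the resulting tokens can then be forwarded as a list rather than re-joined on spaces):

```python
import shlex

raw = '++inference.stop_phrases="end of response"'

# Naive space split (what happens after " ".join(...) today): the override is torn apart.
print(raw.split(" "))   # ['++inference.stop_phrases="end', 'of', 'response"']

# Shell-style splitting keeps the quoted value as a single token.
print(shlex.split(raw))  # ['++inference.stop_phrases=end of response']
```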
recipes/libtrace/dockerfiles/Dockerfile.sandbox-31-36 (1)

31-36: ⚠️ Potential issue | 🟡 Minor
Update micromamba download URL to current official endpoint.
The URL https://micromamba.snakepit.net/api/micromamba/linux-64/latest is an older endpoint. The current official endpoint is https://micro.mamba.pm/api/micromamba/linux-64/latest. Update line 33 to use the current recommended source.

recipes/libtrace/scripts/harvest_docs.py-253-258 (1)
253-258: ⚠️ Potential issue | 🟡 Minor
Help text contradicts validation logic.
The help string says
--fieldis "ignored if --libraries is provided", but line 282-283 raises aValueErrorif both are supplied. Update the help text to reflect the actual mutual-exclusivity constraint.Proposed fix
- help="Library group to process (ignored if --libraries is provided).", + help="Library group to process (mutually exclusive with --libraries).",recipes/libtrace/scripts/harvest_docs.py-229-247 (1)
229-247:⚠️ Potential issue | 🟡 Minor
"source"field is added after per-library JSONL is written — schema mismatch.
write_jsonlon line 232 serializes docs without the"source"key. It's only added on line 246, so per-library files have a different schema than the unified file. If any downstream consumer expects"source"in per-library files, it will fail.If the omission is intentional, a comment would help. Otherwise, move the mutation before the write:
Proposed fix
```diff
     docs = extract_docs_from_module(lib, library_name, seen=set())
+    for doc in docs:
+        doc["source"] = library_name
+
     output_file = output_dir / f"{library_name}.jsonl"
     write_jsonl(output_file, docs)
     functions = sum(1 for doc in docs if doc["type"] == "function")
     classes = sum(1 for doc in docs if doc["type"] == "class")
     methods = sum(1 for doc in docs if doc["type"] == "method")
     print(f"\nExtracted {len(docs)} items from {library_name}")
     print(f"Saved to: {output_file}")
     print("Statistics:")
     print(f"  Functions: {functions}")
     print(f"  Classes: {classes}")
     print(f"  Methods: {methods}")
-    for doc in docs:
-        doc["source"] = library_name
     return docs
```

recipes/libtrace/scripts/collect_generated_problems.py-1-14 (1)
1-14: ⚠️ Potential issue | 🟡 Minor
Missing copyright header.
The pipeline failure indicates a missing copyright header. This applies to this file as well (and likely all new Python files in this PR).
recipes/libtrace/scripts/collect_generated_problems.py-98-102 (1)
98-102:⚠️ Potential issue | 🟡 Minor
--max_problem_tokenscannot be disabled from CLI; the "or omitted" path is unreachable.With
default=10000andtype=int,args.max_problem_tokensis neverNone, so theis not Noneguards on lines 128 and 134 are alwaysTrue, and the error message "or omitted" on line 129 is misleading. If the intent is to allow disabling token-length filtering, change the default toNone:Proposed fix
parser.add_argument( "--max_problem_tokens", type=int, - default=10000, - help="Drop problems whose tokenized length exceeds this threshold.", + default=None, + help="Drop problems whose tokenized length exceeds this threshold (default: disabled).", )Alternatively, if filtering should always be on, remove the
is not Noneguards and the "or omitted" message to avoid confusion.Also applies to: 128-139
recipes/libtrace/scripts/gather_solutions.py-442-446 (1)
442-446: ⚠️ Potential issue | 🟡 Minor
Redundant check: r"\boxed{" and "\\boxed{" are identical strings.

Both r"\boxed{" (raw string) and "\\boxed{" (escaped backslash) evaluate to the same string \boxed{. The or clause is dead code. If you intended to check for two different representations, one side needs adjustment; otherwise, simplify to a single check.

Proposed simplification:
```diff
 def _has_boxed_in_last_assistant(messages: list) -> bool:
     for msg in reversed(messages):
         if msg.get("role") == "assistant" and msg.get("content"):
-            return r"\boxed{" in msg["content"] or "\\boxed{" in msg["content"]
+            return r"\boxed{" in msg["content"]
     return False
```

recipes/libtrace/scripts/gather_solutions.py-49-50 (1)
49-50: ⚠️ Potential issue | 🟡 Minor
_json_dumps fallback is missing ensure_ascii=False, unlike all other LibTrace scripts.

Every other script in this recipe passes ensure_ascii=False to _json_std.dumps in the fallback path. This omission means non-ASCII characters (chemical names, mathematical symbols, etc.) would be escaped to \uXXXX sequences when orjson is not installed.

Proposed fix:
```diff
 def _json_dumps(obj) -> str:
-    return _json_std.dumps(obj)
+    return _json_std.dumps(obj, ensure_ascii=False)
```

recipes/libtrace/scripts/gather_solutions.py-409-424 (1)
409-424: ⚠️ Potential issue | 🟡 Minor
Dead code block at lines 418–423.

Line 411 already adds all_names & used_names to matched. The loop at line 418 iterates over the same intersection and adds names that are already in matched. The entire block (lines 418–423) is a no-op and can be removed.

Proposed fix:
```diff
 def _find_matching_names(all_names: Set[str], used_names: Set[str], imported_from_modules: Set[str]) -> Set[str]:
     matched = set()
     matched.update(all_names & used_names)
     for used in used_names:
         parts = used.split(".")
         for i in range(1, len(parts)):
             prefix = ".".join(parts[:i])
             if prefix in all_names:
                 matched.add(prefix)
-    for name in all_names & used_names:
-        parts = name.split(".")
-        if len(parts) >= 2:
-            module_prefix = ".".join(parts[:-1])
-            if module_prefix in imported_from_modules:
-                matched.add(name)
     return matched
```

recipes/libtrace/scripts/gather_solutions.py-375-388 (1)
375-388: ⚠️ Potential issue | 🟡 Minor
Dead code: redundant name in used_names check on line 386.

The check on line 376 already tests name in used_names and returns True if it matches. If execution reaches line 386, name is guaranteed to not be in used_names, making this condition always False.

Proposed fix:
```diff
 def _check_name_in_extracted(name: str, used_names: Set[str], imported_from_modules: Set[str]) -> bool:
     if name in used_names:
         return True
     name_prefix = name + "."
     for used in used_names:
         if used.startswith(name_prefix):
             return True
     parts = name.split(".")
     if len(parts) >= 2:
         module_prefix = ".".join(parts[:-1])
-        if module_prefix in imported_from_modules:
-            if name in used_names:
-                return True
+        if module_prefix in imported_from_modules and parts[-1] in {u.split(".")[-1] for u in used_names}:
+            return True
     return False
```

If the intent was to check that the attribute part of the name is used (not the full qualified name), the condition needs to be fixed as above. Otherwise, simply remove the dead branch.
recipes/libtrace/scripts/gather_solutions.py-255-361 (1)
255-361: ⚠️ Potential issue | 🟡 Minor
Remove unused NameUsageChecker class.

This class is defined at lines 255–361 but never instantiated anywhere in the codebase. Its functionality has been superseded by UsedNamesExtractor (lines 184–252), which provides a more efficient single-pass extraction approach combined with helper functions like _check_name_in_extracted() and _check_any_name_in_extracted() for name matching.
🧹 Nitpick comments (21)
nemo_skills/pipeline/utils/exp.py (1)
180-180: extra_srun_args parameter is undocumented in the docstring.

The existing docstring (lines 196–228) documents every other parameter but omits extra_srun_args (and the pre-existing ray_template). Consider adding a brief entry for discoverability.

nemo_skills/pipeline/utils/declarative.py (1)
578-598: Consider letting the script declare its own srun overrides instead of hardcoding them here.

The isinstance(command.script, SandboxScript) check + hardcoded args creates a coupling between the generic _create_executor and a specific script type. If SandboxScript (or BaseJobScript) exposed an extra_srun_args property, this method could stay script-agnostic:

```python
extra_srun_args = getattr(command.script, "extra_srun_args", None)
```

Not urgent since there's only one case today, but it would make the declarative layer more extensible.
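A sketch of what that could look like; BaseJobScript and SandboxScript exist in the PR, but the attribute shown here is hypothetical:

```python
# Hypothetical attribute on the script hierarchy; _create_executor would then read it
# via getattr(command.script, "extra_srun_args", None) as suggested above.
class BaseJobScript:
    extra_srun_args: list[str] | None = None


class SandboxScript(BaseJobScript):
    # Keep the sandbox step alive when child processes crash and skip PMIx.
    extra_srun_args = ["--kill-on-bad-exit=0", "--mpi=none"]
```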
recipes/libtrace/README.md (1)
1-192: Well-structured documentation with clear step-by-step examples.The README provides comprehensive pipeline documentation. One minor observation: Steps 3, 5, and 7 invoke scripts directly with
python(because they internally call the Nemo pipeline), while other steps usens run_cmd. Consider adding a brief note near Step 3 (or in the introduction) explaining this distinction, so users understand why some steps usens run_cmdand others don't.recipes/libtrace/scripts/run_boxed_inference.py (1)
102-102: Unnecessary f-string.
f"{args.expname_prefix}"is equivalent toargs.expname_prefixsince no formatting is applied.Proposed fix
- expname=f"{args.expname_prefix}", + expname=args.expname_prefix,recipes/libtrace/scripts/filter_applicability_relevance.py (3)
23-41: Duplicated JSON helper and read_jsonl across multiple scripts.

_json_loads, _json_dumps, and read_jsonl are copy-pasted from collect_generated_problems.py (and possibly prepare_inference_jsonl.py). Consider extracting these into a shared utility module under recipes/libtrace/scripts/ to avoid drift between copies.

```bash
#!/bin/bash
# Verify the extent of duplication across libtrace scripts
rg -l '_json_loads\|_json_dumps\|read_jsonl' --glob 'recipes/libtrace/scripts/*.py'
```
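A minimal sketch of such a shared helper; the module name and function names are suggestions, not part of the PR:

```python
# recipes/libtrace/scripts/_json_utils.py (hypothetical shared module)
import json as _json_std

try:
    import orjson  # optional speedup

    def json_loads(s):
        return orjson.loads(s)

    def json_dumps(obj) -> str:
        return orjson.dumps(obj).decode("utf-8")

except ImportError:  # fall back to the stdlib
    def json_loads(s):
        return _json_std.loads(s)

    def json_dumps(obj) -> str:
        return _json_std.dumps(obj, ensure_ascii=False)
```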
90-104: Regex is recompiled on every call.
parse_scores is called per-line and recompiles two regexes each time. For large JSONL files this adds up. Consider caching compiled patterns (e.g., with functools.lru_cache on the label, or precompiling outside the loop).
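A sketch of per-label caching with functools.lru_cache; the exact patterns that parse_scores matches are assumptions here and would need to mirror the real ones:

```python
import functools
import re

@functools.lru_cache(maxsize=None)
def _score_patterns(label: str) -> tuple[re.Pattern, re.Pattern]:
    # Illustrative patterns only; keep them identical to what parse_scores uses today.
    applicability = re.compile(rf"{re.escape(label)}-Applicability:\s*(\d+)")
    relevance = re.compile(rf"{re.escape(label)}-Relevance:\s*(\d+)")
    return applicability, relevance
```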
145-155:store_truewithdefault=Truemakes the flag a no-op.
--require_applicableusesaction="store_true"withdefault=True, so passing the flag changes it fromTruetoTrue— it has no effect. The only way to change the value is via--no_require_applicable. While functionally correct as a pair, the--require_applicableflag itself is misleading since it suggests the user can opt in, but it's already on by default. Consider removing the--require_applicableflag and only keeping--no_require_applicable, or simply documenting that it's always on unless negated.recipes/libtrace/dockerfiles/start-with-nginx.sh (1)
38-41: Consider logging a warning whenmicromamba activate sandboxfails.The
|| trueon Line 40 silently swallows activation failures. Since theifguard already confirmsmicromambais onPATH, a failure to activate thesandboxenvironment likely indicates a real problem (e.g., broken or missing env). A brief warning to stderr would aid debugging without breaking the entrypoint.Suggested improvement
if command -v micromamba >/dev/null 2>&1; then eval "$(micromamba shell hook -s bash)" || true - micromamba activate sandbox || true + micromamba activate sandbox || echo "WARNING: failed to activate 'sandbox' env" >&2 firecipes/libtrace/dockerfiles/environment.yml (1)
1-64: No version pins on most packages — builds may not be reproducible.Nearly all conda and pip dependencies lack version constraints. While strict pinning can cause solver conflicts with this many packages, consider at minimum generating and committing a lock file (e.g.,
conda-lockormicromamba env export) for reproducible builds.recipes/libtrace/dockerfiles/Dockerfile.sandbox (2)
15-15: Pin the base image digest for reproducible builds.
tiangolo/uwsgi-nginx-flask:python3.10can change upstream at any time. Consider pinning to a specific digest (e.g.,@sha256:...) to ensure deterministic builds.
80-99:geant4_pybindis cloned without a version pin.Line 96 clones
geant4_pybindatHEAD, making the build non-reproducible and vulnerable to breaking upstream changes. Pin to a specific tag or commit hash.Suggested fix
- git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \ + git clone --recursive --branch v0.XX.X https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \recipes/libtrace/scripts/run_applicability_relevance.py (1)
57-90: Near-identical boilerplate shared withrun_problem_generation.py.Both scripts duplicate the prompt-config validation, extra-args construction, and the large
generate(...)call. Consider extracting the shared logic (arg parsing base, prompt resolution,generateinvocation) into a small shared helper module underrecipes/libtrace/scripts/to reduce drift between the two.recipes/libtrace/scripts/run_problem_generation.py (1)
90-90: Unnecessary f-string — just use the variable directly.
f"{args.expname_prefix}"is equivalent toargs.expname_prefix. Compare withrun_applicability_relevance.pyLine 75 which actually interpolates domain into the f-string.Suggested fix
- expname=f"{args.expname_prefix}", + expname=args.expname_prefix,recipes/libtrace/scripts/prepare_inference_jsonl.py (2)
140-154: Validate--target_per_librarybefore processing all rows.The
<= 0check on Line 153 runs after the entire input has been read and validated (Lines 145-151). Moving this check to just after arg parsing would fail fast and avoid wasted work on large inputs.Suggested fix
if __name__ == "__main__": args = parse_args() input_path = Path(args.input_file) output_path = Path(args.output_file) domain_label = resolve_domain_label(args.domain) + + if args.target_per_library is not None and args.target_per_library <= 0: + raise ValueError("--target_per_library must be a positive integer.") + rows = read_jsonl(input_path) if not rows: raise RuntimeError(f"No entries found in {input_path}") output_rows: list[dict] = [] for line_num, row in enumerate(rows, start=1): output_rows.append(validate_row(row, input_path, line_num, domain_label)) - if args.target_per_library is not None and args.target_per_library <= 0: - raise ValueError("--target_per_library must be a positive integer.") - if args.target_per_library:
22-38: Extract duplicated JSON helpers into a shared module.The
_json_loads/_json_dumpsorjson-fallback pattern is identically duplicated across five files inrecipes/libtrace/scripts/:prepare_inference_jsonl.py,harvest_docs.py,gather_solutions.py,filter_applicability_relevance.py, andcollect_generated_problems.py. This represents ~75 lines of identical boilerplate code spread across the codebase. Extracting these helpers into a shared module (e.g.,recipes/libtrace/scripts/_io.py) will reduce maintenance burden and prevent future drift. Note thatread_jsonlandwrite_jsonlhave API variations across files (e.g.,skip_invalidparameter in some,mkdirbehavior in others) that would require careful abstraction, but the core JSON serialization utilities are good candidates for consolidation.recipes/libtrace/scripts/harvest_docs.py (3)
31-31: Narrow theexceptclause toImportError.Catching bare
Exceptionhere could silently swallow a real error duringorjsoninitialization (e.g., a broken native extension). The only expected failure is the module not being installed.Proposed fix
-except Exception: # pragma: no cover - best effort +except ImportError: # pragma: no cover - best effort
169-169: Hardcoded workaround — consider making this configurable.

This single-entry skip is fragile. If other problematic symbols appear, you'll accumulate more hardcoded lines. A module-level SKIP_SYMBOLS: set[str] constant would be cleaner and easier to maintain.
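A small sketch of that constant-based approach; how harvest_docs actually identifies the symbol is an assumption:

```python
# Symbols known to misbehave during introspection; extend as needed.
SKIP_SYMBOLS: set[str] = {"scipy.LowLevelCallable"}

def should_skip(qualified_name: str) -> bool:
    return qualified_name in SKIP_SYMBOLS
```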
SKIP_SYMBOLS: set[str]constant would be cleaner and easier to maintain.
210-213: Addmkdirfor robustness, consistent with sibling script.
prepare_inference_jsonl.pyincludespath.parent.mkdir(parents=True, exist_ok=True)in itswrite_jsonl. While current callers pre-create the directory, adding it here would be defensive and consistent.Proposed fix
def write_jsonl(path: Path, rows: list[dict]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) with path.open("w", encoding="utf-8") as f: for row in rows: f.write(_json_dumps(row) + "\n")recipes/libtrace/scripts/collect_generated_problems.py (2)
24-24: Top-levelAutoTokenizerimport is unconditional but only needed when token filtering is active.
transformersis a heavy dependency. Sincemax_problem_tokensdefaults to10000(always set via argparse), this import will always be hit in practice, so this is fine for the current CLI. However, if someone imports this module as a library or ifmax_problem_tokensis ever made truly optional, the import would fail unnecessarily. Consider a lazy import inside theif args.max_problem_tokens is not None:block.Proposed lazy import
-from transformers import AutoTokenizer # type: ignore ... tokenizer = None if args.max_problem_tokens is not None: + from transformers import AutoTokenizer # type: ignore + tokenizer_name = args.tokenizer
26-43: Significant code duplication across LibTrace scripts.
_json_loads/_json_dumps,read_jsonl, andresolve_input_filesare copy-pasted verbatim across at least 4–5 scripts (collect_generated_problems.py,filter_applicability_relevance.py,gather_solutions.py,prepare_inference_jsonl.py,run_problem_generation.py). Consider extracting these into a shared utility module (e.g.,recipes/libtrace/scripts/_io_utils.py) to reduce maintenance burden and divergence risk.Note:
gather_solutions.pyalready has a subtle difference — its_json_dumpsfallback uses_json_std.dumps(obj)(noensure_ascii=False), unlike the other scripts. This inconsistency is likely unintentional and would be caught by consolidation.Also applies to: 52-70, 73-89
recipes/libtrace/scripts/gather_solutions.py (1)
829-832: Moveimport osto the top of the file with other imports.
osis a standard library module — placing it insideif __name__ == "__main__"is unconventional and can confuse readers.
```yaml
  # Geant4 (very large; remove if not needed or unavailable)
  - geant4
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check environment.yml for geant4
echo "=== environment.yml lines 20-30 ==="
head -30 recipes/libtrace/dockerfiles/environment.yml | tail -11
echo -e "\n=== Check for geant4 in Dockerfile.sandbox ==="
find recipes/libtrace/dockerfiles -name "Dockerfile*" -type fRepository: NVIDIA-NeMo/Skills
Length of output: 485
🏁 Script executed:
#!/bin/bash
# List files in dockerfiles directory
ls -la recipes/libtrace/dockerfiles/Repository: NVIDIA-NeMo/Skills
Length of output: 375
🏁 Script executed:
#!/bin/bash
# Search for geant4 references in all Dockerfiles
find recipes/libtrace/dockerfiles -name "Dockerfile*" -type f -exec echo "File: {}" \; -exec rg -n 'geant4' {} \;Repository: NVIDIA-NeMo/Skills
Length of output: 687
🏁 Script executed:
#!/bin/bash
# Get the full Dockerfile.sandbox to see when environment is created vs when geant4 is built
cat -n recipes/libtrace/dockerfiles/Dockerfile.sandboxRepository: NVIDIA-NeMo/Skills
Length of output: 6024
🏁 Script executed:
#!/bin/bash
# Check what conda's geant4 package provides
# Look for geant4-pybind alternative in conda
rg -i 'geant4' recipes/libtrace/dockerfiles/environment.yml -B 2 -A 2Repository: NVIDIA-NeMo/Skills
Length of output: 285
Remove geant4 from environment.yml; it's redundantly installed and conflicts with the source build.
The Dockerfile.sandbox creates the conda environment (line 42) with the geant4 package from conda-forge, then builds Geant4 v11.2.2 from source and installs it to the same location (${CONDA_PREFIX}, line 87). Since the source-built version is the one actually used (geant4_pybind depends on it, lines 96–97), the conda package is redundant. Both installations target the same directory, leading to file conflicts and build inefficiency. Remove line 25 from environment.yml.
🤖 Prompt for AI Agents
In `@recipes/libtrace/dockerfiles/environment.yml` around lines 24 - 25, Remove
the redundant conda package entry "geant4" from environment.yml so the
Dockerfile.sandbox build uses only the source-built Geant4 v11.2.2; edit
environment.yml to delete the geant4 line referenced in the diff, ensuring the
Dockerfile.sandbox steps that build and install Geant4 into ${CONDA_PREFIX} and
the dependent geant4_pybind linkage remain unchanged to avoid duplicate
installation and file conflicts.
```python
try:
    source = item["source"]
    if not isinstance(source, str):
        raise TypeError
    source_dist_pre[source] += 1
    valid_items.append(item)
except Exception:
    if not args.skip_invalid:
        raise ValueError(f"Missing or invalid source in {input_path}:{line_num}")
    invalid_count += 1
```
🛠️ Refactor suggestion | 🟠 Major
Awkward raise TypeError caught by broad except Exception.
Lines 190-191 artificially raise TypeError only to be caught by the blanket except Exception on line 194. This obscures intent and makes the control flow hard to follow. A simple if/continue/raise would be clearer and avoids the broad catch. Additionally, line 196 should use raise ... from per B904.
Proposed fix
for line_num, item in enumerate(data, start=1):
- try:
- source = item["source"]
- if not isinstance(source, str):
- raise TypeError
- source_dist_pre[source] += 1
- valid_items.append(item)
- except Exception:
- if not args.skip_invalid:
- raise ValueError(f"Missing or invalid source in {input_path}:{line_num}")
- invalid_count += 1
+ source = item.get("source")
+ if not isinstance(source, str):
+ if not args.skip_invalid:
+ raise ValueError(f"Missing or invalid source in {input_path}:{line_num}")
+ invalid_count += 1
+ continue
+ source_dist_pre[source] += 1
+ valid_items.append(item)As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 191-191: Abstract raise to an inner function
(TRY301)
[warning] 194-194: Do not catch blind exception: Exception
(BLE001)
[warning] 196-196: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 196-196: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/filter_applicability_relevance.py` around lines 188
- 197, Replace the awkward raise/catch pattern around extracting and validating
source: stop raising TypeError and remove the broad except Exception; instead
explicitly handle missing key and wrong type. Specifically, try to get source =
item["source"] in a try/except KeyError as e and on KeyError either increment
invalid_count and continue when args.skip_invalid is true or raise
ValueError(f"Missing or invalid source in {input_path}:{line_num}") from e; then
check if not isinstance(source, str) and likewise increment invalid_count and
continue when args.skip_invalid is true or raise ValueError(...) (use raise ...
from where appropriate if you have an underlying exception); only after passing
these explicit checks update source_dist_pre[source] += 1 and append to
valid_items. This removes the blanket except and uses raise ... from per B904.
```python
try:
    data = _json_loads(line)
    all_names.add(data["name"])
except Exception:
    pass
if not all_names:
    raise RuntimeError("No names found in input data.")
```
Silent try-except-pass when extracting names hides data quality issues.
Lines 849–850 silently swallow all exceptions during name extraction. If a significant portion of input lines are malformed (missing "name" key, invalid JSON), the all_names set will be silently incomplete, leading to incorrect usage statistics and sampling. At minimum, count and report skipped lines.
Proposed fix
+ name_extraction_errors = 0
for line in _progress(all_lines, desc="Extracting names", unit="line"):
try:
data = _json_loads(line)
all_names.add(data["name"])
except Exception:
- pass
+ name_extraction_errors += 1
if not all_names:
raise RuntimeError("No names found in input data.")
+ if name_extraction_errors:
+ print(f"Warning: {name_extraction_errors} lines skipped during name extraction", file=sys.stderr)
print(f"Found {len(all_names):,} unique names to check against")As per coding guidelines, "Follow the Zen of Python principles... ensure errors never pass silently unless explicitly silenced".
🧰 Tools
🪛 Ruff (0.14.14)
[error] 849-850: try-except-pass detected, consider logging the exception
(S110)
[warning] 849-849: Do not catch blind exception: Exception
(BLE001)
[warning] 852-852: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 846 - 852, The
loop that calls _json_loads(line) and adds data["name"] currently swallows all
exceptions; modify it to track and report skipped lines: introduce a
skipped_lines_count (and optionally collect up to N examples in
malformed_examples) in the except block instead of pass, increment it when
JSONDecodeError or KeyError (or any Exception) occurs, and after the loop log or
raise a clear warning including skipped_lines_count, total lines processed, and
a few example malformed lines; keep the existing RuntimeError("No names found in
input data.") for the all_names empty case but augment it or the surrounding log
with the skipped-line statistics so data quality issues are visible.
Pull request overview
Adds a new LibTrace recipe to generate domain-specific reasoning datasets from scientific library docstrings, along with a custom sandbox image and a few pipeline/runtime robustness tweaks to better support “heavy” scientific dependencies.
Changes:
- Introduces LibTrace end-to-end scripts (doc harvesting → labeling → filtering → problem generation → boxed inference → solution gathering) plus prompt templates and documentation.
- Adds a LibTrace-specific sandbox Docker image (micromamba env + scientific stacks) and a wrapper entrypoint with runtime env tuning.
- Extends SLURM executor plumbing to allow extra
srun args for sandbox resiliency, and improves sandbox client handling of transport/connection errors.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| recipes/libtrace/scripts/run_problem_generation.py | CLI wrapper around nemo_skills.pipeline.generate for domain problem generation. |
| recipes/libtrace/scripts/run_boxed_inference.py | CLI wrapper to solve generated problems using generic/general-boxed (optionally with sandbox). |
| recipes/libtrace/scripts/run_applicability_relevance.py | Runs applicability+relevance labeling using a LibTrace prompt config. |
| recipes/libtrace/scripts/prepare_inference_jsonl.py | Validates/normalizes harvested docs into inference-ready JSONL; optional per-library sampling. |
| recipes/libtrace/scripts/harvest_docs.py | Extracts public API docs from selected libraries/modules into per-lib + unified JSONL. |
| recipes/libtrace/scripts/gather_solutions.py | Parses boxed inference outputs, checks library usage via AST, computes stats and samples records. |
| recipes/libtrace/scripts/filter_applicability_relevance.py | Filters labeling outputs by applicability/relevance thresholds and reports distributions. |
| recipes/libtrace/scripts/collect_generated_problems.py | Consolidates generated problems across seeds; optional token-length filtering. |
| recipes/libtrace/prompts/problem-generation.yaml | Prompt template for generating domain problems from docstrings. |
| recipes/libtrace/prompts/applicability-relevance.yaml | Prompt template for applicability/relevance labeling. |
| recipes/libtrace/dockerfiles/start-with-nginx.sh | Sandbox entrypoint wrapper setting env for MPI/threading/PySCF stability. |
| recipes/libtrace/dockerfiles/environment.yml | Conda environment spec for the LibTrace sandbox image. |
| recipes/libtrace/dockerfiles/Dockerfile.sandbox | Builds the LibTrace sandbox image (micromamba + scientific libs + extra builds). |
| recipes/libtrace/README.md | Full LibTrace walkthrough and command examples. |
| recipes/README.md | Adds LibTrace to the recipes index. |
| nemo_skills/pipeline/utils/exp.py | Adds extra_srun_args support and uses it for sandbox srun overrides. |
| nemo_skills/pipeline/utils/declarative.py | Adds sandbox-specific srun overrides and adjusts executor GPU allocation logic. |
| nemo_skills/code_execution/sandbox.py | Handles httpx.TransportError during execution/restoration/proof-check and makes delete-session best-effort. |
| mkdocs.yml | Adds LibTrace to MkDocs navigation. |
| docs/recipes/libtrace.md | Adds docs landing page for the LibTrace recipe. |
Comments suppressed due to low confidence (1)
nemo_skills/code_execution/sandbox.py:183
- New TransportError handling changes sandbox behavior on connection failures, but there's no unit test exercising this path. Consider adding a test that mocks http_session.post (and/or delete_session) to raise httpx.TransportError and asserts that execute_code()/is_proof_correct() return the expected status and user-facing stderr message; a rough sketch follows the snippet below.
"Session state was lost. You need to re-import all modules and re-define all variables "
"before continuing. Consider simplifying the code to avoid crashes.\n"
),
}
new_session_created = output.pop("new_session_created", False)
# Rebuild state by re-executing history first, then execute the new code.
# NOTE: Only cells that completed successfully are stored, so we intentionally omit re-running cells that errored
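A rough sketch of such a test; the sandbox fixture, class construction, and method/attribute names are placeholders that would need to match the real client in nemo_skills:

```python
import httpx
from unittest.mock import MagicMock

# Hypothetical test shape: "sandbox" is an assumed pytest fixture that builds the
# real client; execute_code() and http_session are the names used in this review.
def test_execute_code_reports_sandbox_crash_on_transport_error(sandbox):
    sandbox.http_session.post = MagicMock(side_effect=httpx.TransportError("connection reset"))
    output = sandbox.execute_code("print(1)")
    # Expected user-facing error payload described in this PR's sandbox handling.
    assert output["process_status"] == "error"
    assert "Sandbox environment crashed" in output["stderr"]
```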
```python
    )
    parser.add_argument("--no_dump_json", action="store_false", dest="dump_json")
    parser.add_argument("--user_prompt_prefix", type=str, default=DEFAULT_USER_PROMPT_PREFIX)
    return parser.parse_args()
```
This script reads all input lines into memory (all_lines.extend(f.readlines())), which can be prohibitively large for output-rs*.jsonl at scale. Consider streaming processing for stats, and for sampling use a single-pass reservoir/weighted reservoir approach (or store only offsets/indices) to avoid O(file_size) RAM usage.
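A sketch of the single-pass reservoir idea (Algorithm R); the sample size and how the sampled lines are consumed afterwards are illustrative:

```python
import random

def reservoir_sample(lines, k: int, seed: int = 0) -> list[str]:
    """Uniformly sample k items from an iterable of unknown length in one pass."""
    rng = random.Random(seed)
    sample: list[str] = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# Usage sketch: stream each output-rs*.jsonl file instead of loading it with readlines().
# with open(path, encoding="utf-8") as f:
#     sampled_lines = reservoir_sample(f, k=10_000)
```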
```python
    if relevance_match is None:
        raise ValueError(f"Missing {label}-Relevance in {path}:{line_num}")
    relevance = int(relevance_match.group(1))
```
resolve_label() checks DOMAIN_TO_LABEL using the raw --domain string; DOMAIN_TO_LABEL currently omits common aliases like "chemistry"/"biology" and doesn’t normalize case. Since prepare_inference_jsonl supports these aliases, consider doing domain = domain.lower() and expanding DOMAIN_TO_LABEL to include the long-form keys for consistency.
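A hedged sketch of the normalization; the existing short-form keys and the label strings in DOMAIN_TO_LABEL are assumptions:

```python
# Assumed mapping: only the short forms exist today; the long-form aliases are the addition.
DOMAIN_TO_LABEL = {
    "chem": "Chemistry",
    "chemistry": "Chemistry",
    "bio": "Biology",
    "biology": "Biology",
}

def resolve_label(domain: str) -> str:
    key = domain.strip().lower()
    try:
        return DOMAIN_TO_LABEL[key]
    except KeyError as err:
        raise ValueError(f"Unknown --domain value: {domain!r}") from err
```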
| srun_args.append("--overlap") | ||
| if not cluster_config.get("disable_gpus_per_node", False) and gpus_per_node is not None: | ||
| srun_args.append(f"--gpus-per-node={gpus_per_node}") | ||
| if extra_srun_args: |
Appending extra_srun_args directly to srun_args can introduce conflicting/duplicate flags (e.g., default "--mpi=pmix" plus extra "--mpi=none" for the sandbox). Depending on Slurm version, repeating --mpi may error or behave inconsistently; consider explicitly overriding/removing existing --mpi/--kill-on-bad-exit entries when an override is provided (or add a dedicated mpi/kill_on_bad_exit parameter).
Suggested change:
```diff
     if extra_srun_args:
+        # Allow extra_srun_args to override specific options like --mpi and --kill-on-bad-exit
+        has_mpi_override = any(arg.startswith("--mpi") for arg in extra_srun_args)
+        has_kill_override = any(arg.startswith("--kill-on-bad-exit") for arg in extra_srun_args)
+        if has_mpi_override or has_kill_override:
+            filtered_srun_args = []
+            for arg in srun_args:
+                if has_mpi_override and arg.startswith("--mpi"):
+                    # Skip default MPI setting if an override is provided
+                    continue
+                if has_kill_override and arg.startswith("--kill-on-bad-exit"):
+                    # Skip default kill-on-bad-exit setting if an override is provided
+                    continue
+                filtered_srun_args.append(arg)
+            srun_args = filtered_srun_args
```
```python
    # Only the server component (span_group_nodes=True) should claim group GPUs.
    # Other components (client, sandbox) run on the same node but don't need GPUs.
    if span_group_nodes:
```
The GPU allocation logic keys off span_group_nodes, but SandboxScript has span_group_nodes=True, so the sandbox will still request/claim group GPUs even though the comment says it shouldn’t. This defeats the goal of making client/sandbox GPU-free; consider switching the condition to the specific server script type (e.g., ServerScript) or introducing an explicit “needs_gpus” flag on scripts.
Suggested change:
```diff
-    # Only the server component (span_group_nodes=True) should claim group GPUs.
-    # Other components (client, sandbox) run on the same node but don't need GPUs.
-    if span_group_nodes:
+    # Decide whether this script actually needs GPUs. By default, scripts that span
+    # the group are assumed to need GPUs, but this can be overridden via a
+    # script-level `needs_gpus` attribute. SandboxScript explicitly does not
+    # need GPUs, even though it may span the group.
+    needs_gpus = getattr(command.script, "needs_gpus", span_group_nodes)
+    if isinstance(command.script, SandboxScript):
+        needs_gpus = False
+    # Only components that declare they need GPUs should claim group GPUs.
+    # Other components (client, sandbox) run on the same node but don't need GPUs.
+    if needs_gpus:
```
docs/recipes/libtrace.md (outdated)

```markdown
inputs and output paths.

For the full walkthrough, command examples, and configuration details see
[recipes/libtrace/README.md](/recipes/libtrace/README.md).
```
This link is site-absolute ("/recipes/libtrace/README.md") and likely won’t resolve in the rendered MkDocs site because the repo-root recipes/ tree isn’t part of the docs build. Consider linking to the GitHub file URL or copying the walkthrough content into docs/ and using a relative docs link.
Suggested change:
```diff
-[recipes/libtrace/README.md](/recipes/libtrace/README.md).
+the README at `recipes/libtrace/README.md` in the repository.
```
There was a problem hiding this comment.
this is correct @jubick1337 please run mkdocs serve to double check docs are rendered properly and fix any warnings
```python
    num_workers = args.num_workers if args.num_workers is not None else max(1, (os.cpu_count() or 2) // 2)

    print("Reading all lines from files...")
    all_lines: list[str] = []
```
'except' clause does nothing but pass and there is no explanatory comment.
Additional Comments (1)
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@recipes/libtrace/dockerfiles/Dockerfile.sandbox`:
- Around line 96-97: The Dockerfile currently clones geant4_pybind without a
version, making builds non-reproducible; change the git clone in
recipes/libtrace/dockerfiles/Dockerfile.sandbox (the line that calls git clone
--recursive https://github.com/HaarigerHarald/geant4_pybind.git
/tmp/geant4_pybind) to clone a specific existing tag (e.g., v0.1.1) instead of
HEAD, and then pip install that pinned copy; also verify that the chosen tag
(v0.1.1) is compatible with Geant4 v11.2.2 before merging.
- Around line 33-34: The Dockerfile RUN that downloads micromamba from
micromamba.snakepit.net is using a deprecated endpoint; update the download step
in the RUN that invokes curl and tar to use the official installer or release
URL (either run the official installer at https://micro.mamba.pm/install.sh or
fetch the appropriate binary from
https://github.com/mamba-org/micromamba-releases/releases/download/), adjust the
curl invocation and any tar/extract steps to match the chosen source, and ensure
the command still places the micromamba binary into /usr/local/bin (the RUN line
that writes to /usr/local/bin and strips components should be updated
accordingly).
In `@recipes/libtrace/scripts/collect_generated_problems.py`:
- Around line 142-143: The CLI currently sets args.max_problem_tokens to 10000
so the "omit to disable" behavior is impossible; change the argparse definition
for --max_problem_tokens to use default=None (keep type=int) so omitting the
flag disables tokenization, and adjust the validation around
args.max_problem_tokens to only raise ValueError when args.max_problem_tokens is
not None and <= 0, updating the error text to mention that omitting the flag
disables tokenization; reference the argparse flag for --max_problem_tokens and
the runtime check that inspects args.max_problem_tokens.
In `@recipes/libtrace/scripts/filter_applicability_relevance.py`:
- Around line 159-169: The --require_applicable flag is effectively a no-op
because parser.add_argument("--require_applicable", action="store_true",
default=True, ...) always yields True; update the argument handling so the CLI
can toggle the value: replace the current paired arguments
("--require_applicable" with action="store_true" and "--no_require_applicable")
by a single parser.add_argument for "require_applicable" that uses
argparse.BooleanOptionalAction (or alternatively set default=False with
action="store_true" and remove the separate --no_require_applicable) so the
"require_applicable" boolean toggles correctly.
In `@recipes/libtrace/scripts/gather_solutions.py`:
- Around line 456-460: The condition in _has_boxed_in_last_assistant contains a
redundant check (r"\boxed{" and "\\boxed{" are identical), so simplify the
boolean test by keeping a single representation (e.g., r"\boxed{") or, if you
intended to detect a different escaped form, replace the second operand with the
correct alternative; update the return expression to something like: return
r"\boxed{" in msg["content"] (or the intended distinct string) and remove the
duplicate or dead code.
- Around line 710-728: The skipped_no_boxed counter is dead because
process_line_for_sampling already sets has_code=False when a solution lacks a
\\boxed{} answer, so the require_boxed check in the loop (inside
gather_solutions.py) never sees those rows; fix by either removing
skipped_no_boxed and its print (clean up dead code around the loop that
references require_boxed and skipped_no_boxed) or modify
process_line_for_sampling to return a distinct flag (e.g., has_boxed or
was_filtered_by_boxed) separate from has_code so the loop (where
given_name_indices, any_name_only_data, name_usage_counts are updated) can
correctly increment skipped_no_boxed when require_boxed=True. Ensure references
to process_line_for_sampling, require_boxed, skipped_no_boxed, and the loop that
appends to given_name_indices/any_name_only_data are updated consistently.
In `@recipes/libtrace/scripts/harvest_docs.py`:
- Around line 243-261: The `"source"` field is assigned to each doc after
`write_jsonl` has already written the docs to the per-library JSONL file, which
means the saved files will be missing this field. Move the loop that assigns
`doc["source"] = library_name` to occur before the `write_jsonl` call so that
the `source` field is included in both the written JSONL file and the returned
docs list.
🧹 Nitpick comments (15)
nemo_skills/code_execution/sandbox.py (1)
168-179: Consider extracting a helper for the repeated transport-error payload.The sandbox-crash error dict (with the "re-import all modules" message) is duplicated at lines 171–179 and 261–269. A small helper like
_transport_error_output(error)would reduce duplication and ensure the user-facing message stays consistent if it changes.♻️ Example helper
```python
@staticmethod
def _transport_error_output(error: Exception) -> dict:
    return {
        "process_status": "error",
        "stdout": "",
        "stderr": (
            f"Sandbox environment crashed: {error}\n"
            "Session state was lost. You need to re-import all modules and re-define all variables "
            "before continuing. Consider simplifying the code to avoid crashes.\n"
        ),
    }
```

Also applies to: 259-269
recipes/libtrace/dockerfiles/Dockerfile.sandbox (3)
26-29: Hardcodedlinux64architecture for PyPy — no multi-arch support.The standard sandbox Dockerfile handles multi-arch via
TARGETARCH. This will fail on ARM builds. If this image is x86_64-only by design, consider adding a comment; otherwise, replicate the arch-detection logic.
15-15: Container runs as root while executing user-submitted code.Static analysis flagged DS002 (no
USERdirective). Since this sandbox executes potentially untrusted code, running the application as root amplifies the blast radius of any container escape. Consider adding a non-root user for the uWSGI workers, even if some build steps require root.Sketch — add a non-root runtime user
COPY recipes/libtrace/dockerfiles/start-with-nginx.sh /start-with-nginx.sh RUN chmod +x /start-with-nginx-base.sh /start-with-nginx.sh +RUN useradd -m -s /bin/bash sandbox-user + WORKDIR /app # Multi-worker settings (same envs as default sandbox image)Then switch to the user before
CMDor inside the entrypoint for the worker processes. You may need to adjust file permissions for nginx/uWSGI sockets.Also applies to: 104-124
101-102: Installingaiohttpandrequestsvia conda after pip may cause conflicts.Lines 59-61 already install pip packages from
requirements/code_execution.txt. If that file includesrequestsoraiohttp(common in networking stacks), the subsequent conda install can overwrite them with incompatible versions, leading to subtle runtime errors. Prefer installing these via pip alongside the other requirements, or ensure they're inenvironment.ymlfrom the start.recipes/libtrace/scripts/run_boxed_inference.py (2)
40-138: Significant CLI boilerplate duplication across runner scripts.
run_boxed_inference.py, run_problem_generation.py, and run_applicability_relevance.py share ~80% of their parse_args() definitions and generate() call sites. Consider extracting the shared argument definitions and generate() call into a common helper (e.g., recipes/libtrace/scripts/_common.py) to reduce maintenance burden.
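A sketch of the kind of helper the comment suggests; recipes/libtrace/scripts/_common.py comes from the comment itself, while the specific flags and defaults below are illustrative:

```python
# _common.py (hypothetical): shared argparse base for the run_* wrapper scripts.
import argparse

def base_parser(description: str) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--cluster", required=True)
    parser.add_argument("--input_file", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--expname_prefix", default="libtrace")
    parser.add_argument("--extra_args", default="", help="Extra Hydra overrides forwarded to generate().")
    return parser
```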
116-116: Unnecessary f-string wrapper.
f"{args.expname_prefix}"is equivalent toargs.expname_prefix. This also appears inrun_problem_generation.pyline 104.Proposed fix
- expname=f"{args.expname_prefix}", + expname=args.expname_prefix,recipes/libtrace/scripts/collect_generated_problems.py (1)
38-38: Top-leveltransformersimport is heavy and unconditional.
AutoTokenizeris imported at module load even though tokenization can be disabled (by passing--max_problem_tokenswith no value or modifying the default). Consider a lazy import inside theif args.max_problem_tokens is not Noneblock at line 148.Proposed fix
-from transformers import AutoTokenizer # type: ignoreThen at line 148:
if args.max_problem_tokens is not None: from transformers import AutoTokenizer # type: ignore tokenizer_name = args.tokenizer ...recipes/libtrace/scripts/harvest_docs.py (3)
183-184: Hardcoded skip forscipy.LowLevelCallablelacks explanation.Why is this specific symbol excluded? A brief comment would help future maintainers understand the reason (e.g., crashes on introspection, produces garbage docs, etc.).
330-342: Manual counter-building —Counteris already imported in other scripts.Lines 330–342 manually build frequency dicts.
collections.Counterwould be more concise and idiomatic.Proposed fix
+from collections import Counter ... - library_counts: dict[str, int] = {} - for doc in all_docs: - source = doc["source"] - library_counts[source] = library_counts.get(source, 0) + 1 + library_counts = Counter(doc["source"] for doc in all_docs) print("\nItems per library:") for lib, count in sorted(library_counts.items()): print(f" {lib}: {count}") - type_counts: dict[str, int] = {} - for doc in all_docs: - doc_type = doc["type"] - type_counts[doc_type] = type_counts.get(doc_type, 0) + 1 + type_counts = Counter(doc["type"] for doc in all_docs)
37-49: Duplicated JSON helper pattern across all scripts.The
try: import orjson ... except: fallback to jsonblock is copy-pasted in at least 5 files (harvest_docs.py,filter_applicability_relevance.py,collect_generated_problems.py,prepare_inference_jsonl.py,gather_solutions.py). Extract this into a shared utility module (e.g.,recipes/libtrace/scripts/_json_utils.py) to eliminate the duplication.recipes/libtrace/scripts/prepare_inference_jsonl.py (1)
163-170: Validate--target_per_librarybefore processing rows, and prefer explicitis not Nonecheck.Two minor items:
- The
<= 0guard on Line 167 runs after the validation loop (Lines 164–165). Moving it before saves unnecessary work on invalid input.- Line 170 relies on int truthiness (
if args.target_per_library:) to mean "is set and positive." After the guard this happens to work, butif args.target_per_library is not None:is more explicit and won't silently skip if the guard is ever removed.Proposed reordering
+ if args.target_per_library is not None and args.target_per_library <= 0: + raise ValueError("--target_per_library must be a positive integer.") + output_rows: list[dict] = [] for line_num, row in enumerate(rows, start=1): output_rows.append(validate_row(row, input_path, line_num, domain_label)) - if args.target_per_library is not None and args.target_per_library <= 0: - raise ValueError("--target_per_library must be a positive integer.") - - if args.target_per_library: + if args.target_per_library is not None: grouped: dict[str, list[dict]] = defaultdict(list)recipes/libtrace/scripts/gather_solutions.py (4)
274-287: Dead attributes inNameUsageChecker.__init__.
target_module,target_attr, andtarget_partsare assigned but never read by any method in the class — all matching logic usesself.target_namedirectly. These appear to be leftovers from a refactor.Cleanup suggestion
def __init__(self, target_name: str): self.target_name = target_name - parts = target_name.split(".") - if len(parts) == 1: - self.target_module = target_name - self.target_attr = None - else: - self.target_parts = parts - self.target_module = None - self.target_attr = None - self.import_aliases: Dict[str, str] = {} self.wildcard_imports: Set[str] = set() self.found_usage = False
396-402: Redundantname in used_namescheck on Line 400.Line 390 already returns
Truewhenname in used_names, so the same check on Line 400 is dead code — it can never be reached with a truthy result.Simplify
parts = name.split(".") if len(parts) >= 2: module_prefix = ".".join(parts[:-1]) - if module_prefix in imported_from_modules: - if name in used_names: - return True + if module_prefix in imported_from_modules and name in used_names: + return True return FalseActually, since
name in used_nameswas already checked at line 390, this entire block (lines 396–402) is unreachable whenname in used_namesis true. The block can only contribute ifmodule_prefix in imported_from_modulesbutnameis not inused_names— in which case it returnsFalseanyway. Consider whether this branch is intended to check something else (e.g., whether any sub-name of the module is present).
843-846: Move `import os` to the top-level imports.

Line 844 places `import os` after the first function call in the `__main__` block. Standard practice is to keep stdlib imports at the top of the file with the other imports (Lines 35-45).
818-835: `--require_code_rounds` / `--require_boxed` / `--dump_json` flags are no-ops since the default is already `True`.

`action="store_true"` with `default=True` means passing `--require_code_rounds` has no effect; it's always `True` unless `--no_require_code_rounds` is used. The flags work but are misleading in `--help`. Consider `argparse.BooleanOptionalAction` (Python 3.9+), which generates `--flag / --no-flag` from a single declaration.

Example using BooleanOptionalAction

```diff
-    parser.add_argument(
-        "--require_code_rounds",
-        action="store_true",
-        default=True,
-        help="Skip records without code_rounds_executed field (default: True)",
-    )
-    parser.add_argument("--no_require_code_rounds", action="store_false", dest="require_code_rounds")
+    parser.add_argument(
+        "--require_code_rounds",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Skip records without code_rounds_executed field (default: True)",
+    )
```
```dockerfile
RUN curl -L https://micromamba.snakepit.net/api/micromamba/linux-64/latest | \
    tar -xvj -C /usr/local/bin --strip-components=1 bin/micromamba
```
🧩 Analysis chain
🌐 Web query:
current official micromamba download URL
💡 Result:
Official micromamba download/install URLs:
- Linux/macOS (and Git Bash): https://micro.mamba.pm/install.sh (run via `curl -L ...`) [1]
- Windows (PowerShell): https://micro.mamba.pm/install.ps1 [1]

If you want a direct binary from the official GitHub release mirror, the current latest release shown is 2.5.0-1 (Jan 8, 2026) [2], e.g.:

- Linux x86_64: https://github.com/mamba-org/micromamba-releases/releases/download/2.5.0-1/micromamba-linux-64 [3]
Sources: [1] [2] [3]
Update to use the current official micromamba download endpoint.
The URL micromamba.snakepit.net is outdated. The current official endpoint is https://micro.mamba.pm/install.sh for installation, or https://github.com/mamba-org/micromamba-releases/releases/download/ for direct binary downloads. Using the deprecated endpoint risks breakage when the old host is retired.
🧰 Tools
🪛 Trivy (0.58.1)
[error] 20-34: 'apt-get' missing '--no-install-recommends'
'--no-install-recommends' flag is missed: 'apt-get update && apt-get install -y curl git net-tools bzip2 build-essential libseccomp-dev && ARCH="${TARGETARCH:-$(dpkg --print-architecture)}" && case "$ARCH" in amd64) PYPY_ARCH=linux64 ;; arm64|aarch64) PYPY_ARCH=aarch64 ;; x86_64) PYPY_ARCH=linux64 ;; *) echo "Unsupported TARGETARCH '$ARCH'" >&2; exit 1 ;; esac && curl -L https://downloads.python.org/pypy/pypy3.10-v7.3.17-$PYPY_ARCH.tar.bz2 -o /tmp/pypy.tar.bz2 && tar -xjf /tmp/pypy.tar.bz2 -C /opt/ && ln -s /opt/pypy3.10-v7.3.17-$PYPY_ARCH/bin/pypy3 /usr/bin/pypy3 && /usr/bin/pypy3 -m ensurepip && rm /tmp/pypy.tar.bz2 && rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*'
Rule: DS029
(IaC/Dockerfile)
🤖 Prompt for AI Agents
In `@recipes/libtrace/dockerfiles/Dockerfile.sandbox` around lines 33 - 34, The
Dockerfile RUN that downloads micromamba from micromamba.snakepit.net is using a
deprecated endpoint; update the download step in the RUN that invokes curl and
tar to use the official installer or release URL (either run the official
installer at https://micro.mamba.pm/install.sh or fetch the appropriate binary
from https://github.com/mamba-org/micromamba-releases/releases/download/),
adjust the curl invocation and any tar/extract steps to match the chosen source,
and ensure the command still places the micromamba binary into /usr/local/bin
(the RUN line that writes to /usr/local/bin and strips components should be
updated accordingly).
```dockerfile
    git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
    pip install --no-cache-dir /tmp/geant4_pybind && \
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Check available tags for geant4_pybind
git ls-remote --tags https://github.com/HaarigerHarald/geant4_pybind.git | tail -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1355

🏁 Script executed:

```shell
# Also check the Dockerfile to confirm the current state
cat -n recipes/libtrace/dockerfiles/Dockerfile.sandbox | sed -n '90,100p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 646
Pin geant4_pybind to a specific tag — the proposed version v0.38.0 does not exist.
The issue is valid: geant4_pybind is cloned without a version pin, making builds non-reproducible. Unlike GROMACS (v2025.3) and Geant4 (v11.2.2), this uses the default branch HEAD.
However, the proposed tag v0.38.0 does not exist in the repository. Available tags are v0.1.0 through v0.1.1. Pin to an actual compatible version:
```diff
-    git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
+    git clone --recursive --depth 1 --branch v0.1.1 https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
```

Verify v0.1.1 (or the appropriate tag) is compatible with Geant4 v11.2.2 before merging.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    git clone --recursive https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
-    pip install --no-cache-dir /tmp/geant4_pybind && \
+    git clone --recursive --depth 1 --branch v0.1.1 https://github.com/HaarigerHarald/geant4_pybind.git /tmp/geant4_pybind && \
+    pip install --no-cache-dir /tmp/geant4_pybind && \
```
🤖 Prompt for AI Agents
In `@recipes/libtrace/dockerfiles/Dockerfile.sandbox` around lines 96 - 97, The
Dockerfile currently clones geant4_pybind without a version, making builds
non-reproducible; change the git clone in
recipes/libtrace/dockerfiles/Dockerfile.sandbox (the line that calls git clone
--recursive https://github.com/HaarigerHarald/geant4_pybind.git
/tmp/geant4_pybind) to clone a specific existing tag (e.g., v0.1.1) instead of
HEAD, and then pip install that pinned copy; also verify that the chosen tag
(v0.1.1) is compatible with Geant4 v11.2.2 before merging.
```python
    if args.max_problem_tokens is not None and args.max_problem_tokens <= 0:
        raise ValueError("--max_problem_tokens must be a positive integer or omitted.")
```
--max_problem_tokens validation is unreachable for the default value.
args.max_problem_tokens defaults to 10000 (line 115), so the is not None and <= 0 check on line 142 can only trigger if the user explicitly passes --max_problem_tokens 0 or a negative number. This is fine as defensive validation, but note that argparse doesn't support "omitted" vs "set to None" here — a user cannot disable tokenization without modifying the code. If disabling tokenization at the CLI is intended, consider using default=None and documenting "omit to disable".
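If disabling the check from the CLI is intended, a minimal sketch of the `default=None` variant (a suggestion, not the script's current behavior):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_problem_tokens",
    type=int,
    default=None,  # omitting the flag disables token-length filtering entirely
    help="Drop problems longer than this many tokens; omit to disable.",
)
args = parser.parse_args()

if args.max_problem_tokens is not None and args.max_problem_tokens <= 0:
    raise ValueError("--max_problem_tokens must be a positive integer or omitted.")
```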
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 143-143: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/collect_generated_problems.py` around lines 142 -
143, The CLI currently sets args.max_problem_tokens to 10000 so the "omit to
disable" behavior is impossible; change the argparse definition for
--max_problem_tokens to use default=None (keep type=int) so omitting the flag
disables tokenization, and adjust the validation around args.max_problem_tokens
to only raise ValueError when args.max_problem_tokens is not None and <= 0,
updating the error text to mention that omitting the flag disables tokenization;
reference the argparse flag for --max_problem_tokens and the runtime check that
inspects args.max_problem_tokens.
```python
    parser.add_argument(
        "--require_applicable",
        action="store_true",
        default=True,
        help="Require Applicable=1 (default: True).",
    )
    parser.add_argument(
        "--no_require_applicable",
        action="store_false",
        dest="require_applicable",
    )
```
--require_applicable flag is a no-op due to default=True with store_true.
Since default=True is already set, passing --require_applicable on the CLI doesn't change anything — the value is True either way. Only --no_require_applicable has any effect. The action="store_true" should be removed (or the default changed) so the flag can actually toggle the behavior.
Proposed fix

```diff
     parser.add_argument(
         "--require_applicable",
-        action="store_true",
+        action=argparse.BooleanOptionalAction,
         default=True,
         help="Require Applicable=1 (default: True).",
     )
-    parser.add_argument(
-        "--no_require_applicable",
-        action="store_false",
-        dest="require_applicable",
-    )
```

This uses BooleanOptionalAction (Python 3.9+) to support both `--require_applicable` and `--no-require_applicable` cleanly.
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/filter_applicability_relevance.py` around lines 159
- 169, The --require_applicable flag is effectively a no-op because
parser.add_argument("--require_applicable", action="store_true", default=True,
...) always yields True; update the argument handling so the CLI can toggle the
value: replace the current paired arguments ("--require_applicable" with
action="store_true" and "--no_require_applicable") by a single
parser.add_argument for "require_applicable" that uses
argparse.BooleanOptionalAction (or alternatively set default=False with
action="store_true" and remove the separate --no_require_applicable) so the
"require_applicable" boolean toggles correctly.
```python
def _has_boxed_in_last_assistant(messages: list) -> bool:
    for msg in reversed(messages):
        if msg.get("role") == "assistant" and msg.get("content"):
            return r"\boxed{" in msg["content"] or "\\boxed{" in msg["content"]
    return False
```
Redundant \boxed{ check — both expressions are identical strings.
r"\boxed{" and "\\boxed{" both evaluate to the same 7-character string \boxed{. The or clause is dead code. If you intended to match two different representations, one of the conditions needs to be different.
Simplify

```diff
 def _has_boxed_in_last_assistant(messages: list) -> bool:
     for msg in reversed(messages):
         if msg.get("role") == "assistant" and msg.get("content"):
-            return r"\boxed{" in msg["content"] or "\\boxed{" in msg["content"]
+            return r"\boxed{" in msg["content"]
     return False
```

🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 456 - 460, The
condition in _has_boxed_in_last_assistant contains a redundant check (r"\boxed{"
and "\\boxed{" are identical), so simplify the boolean test by keeping a single
representation (e.g., r"\boxed{") or, if you intended to detect a different
escaped form, replace the second operand with the correct alternative; update
the return expression to something like: return r"\boxed{" in msg["content"] (or
the intended distinct string) and remove the duplicate or dead code.
```python
for line_idx, is_valid, has_code, has_boxed, used_given, names_used in _progress(
    results, total=len(all_lines), desc="Categorizing", unit="line"
):
    if not is_valid or not has_code:
        continue
    if require_boxed and not has_boxed:
        skipped_no_boxed += 1
        continue
    if used_given:
        given_name_indices.append(line_idx)
    elif names_used:
        any_name_only_data.append((line_idx, names_used))
        for name in names_used:
            name_usage_counts[name] += 1

if require_boxed:
    print(f"\nSkipped {skipped_no_boxed:,} solutions without \\boxed{{}}")
print(f"Found {len(given_name_indices):,} solutions using given name")
print(f"Found {len(any_name_only_data):,} solutions using any name but not given")
```
skipped_no_boxed counter is always 0 — dead code due to worker-side filtering.
When require_boxed=True, the worker (process_line_for_sampling, Line 577–578) already returns has_code=False for solutions without \boxed{}. Line 713 then skips these rows before reaching the has_boxed check on Line 715. So Lines 715–716 are unreachable and the counter is always 0.
If require_boxed=False, Line 715 evaluates to False and ... — also never increments.
Either remove the dead counter or restructure the worker to return filtered-out-by-boxed as a distinct category so it can be reported.
Option A: Remove dead code

```diff
-    skipped_no_boxed = 0
     with mp.Pool(...) as pool:
         results = pool.imap_unordered(...)
         for line_idx, is_valid, has_code, has_boxed, used_given, names_used in _progress(...):
             if not is_valid or not has_code:
                 continue
-            if require_boxed and not has_boxed:
-                skipped_no_boxed += 1
-                continue
             if used_given:
                 ...
-    if require_boxed:
-        print(f"\nSkipped {skipped_no_boxed:,} solutions without \\boxed{{}}")
```

Option B: Have worker distinguish boxed-filtered from no-code

Change `process_line_for_sampling` to return `has_code=True` when `require_boxed` fails (since there is code, just no boxed answer), so the caller can tally it:

```diff
     if _REQUIRE_BOXED and not has_boxed:
-        return (line_idx, True, False, False, False, ())
+        return (line_idx, True, True, False, False, ())
```
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 710 - 728, The
skipped_no_boxed counter is dead because process_line_for_sampling already sets
has_code=False when a solution lacks a \\boxed{} answer, so the require_boxed
check in the loop (inside gather_solutions.py) never sees those rows; fix by
either removing skipped_no_boxed and its print (clean up dead code around the
loop that references require_boxed and skipped_no_boxed) or modify
process_line_for_sampling to return a distinct flag (e.g., has_boxed or
was_filtered_by_boxed) separate from has_code so the loop (where
given_name_indices, any_name_only_data, name_usage_counts are updated) can
correctly increment skipped_no_boxed when require_boxed=True. Ensure references
to process_line_for_sampling, require_boxed, skipped_no_boxed, and the loop that
appends to given_name_indices/any_name_only_data are updated consistently.
```python
    docs = extract_docs_from_module(lib, library_name, seen=set())

    output_file = output_dir / f"{library_name}.jsonl"
    write_jsonl(output_file, docs)

    functions = sum(1 for doc in docs if doc["type"] == "function")
    classes = sum(1 for doc in docs if doc["type"] == "class")
    methods = sum(1 for doc in docs if doc["type"] == "method")

    print(f"\nExtracted {len(docs)} items from {library_name}")
    print(f"Saved to: {output_file}")
    print("Statistics:")
    print(f"  Functions: {functions}")
    print(f"  Classes: {classes}")
    print(f"  Methods: {methods}")

    for doc in docs:
        doc["source"] = library_name
    return docs
```
"source" field is added after per-library JSONL is written — per-library files will lack it.
write_jsonl at line 246 writes docs without the "source" key. The doc["source"] = library_name mutation at line 260 only takes effect for the unified file. If any downstream consumer reads per-library files and expects "source", it will fail.
If this is intentional (per-library filename implies the source), consider documenting it. Otherwise, move the source assignment before the write.
Proposed fix (if source should be in per-library files too)

```diff
 def process_library(library_name: str, output_dir: Path) -> list[dict]:
     ...
     docs = extract_docs_from_module(lib, library_name, seen=set())
+    for doc in docs:
+        doc["source"] = library_name
+
     output_file = output_dir / f"{library_name}.jsonl"
     write_jsonl(output_file, docs)
     ...
-    for doc in docs:
-        doc["source"] = library_name
     return docs
```
In `@recipes/libtrace/scripts/harvest_docs.py` around lines 243 - 261, The
`"source"` field is assigned to each doc after `write_jsonl` has already written
the docs to the per-library JSONL file, which means the saved files will be
missing this field. Move the loop that assigns `doc["source"] = library_name` to
occur before the `write_jsonl` call so that the `source` field is included in
both the written JSONL file and the returned docs list.
docs/recipes/libtrace.md
Outdated
```markdown
inputs and output paths.

For the full walkthrough, command examples, and configuration details see
[recipes/libtrace/README.md](/recipes/libtrace/README.md).
```
this is correct @jubick1337 please run mkdocs serve to double check docs are rendered properly and fix any warnings
```python
            output = await self._send_request(request, timeout)
        except httpx.TimeoutException:
            output = {"process_status": "timeout", "stdout": "", "stderr": "Client timed out\n"}
        except httpx.TransportError as e:
```
@gwarmstrong to review this change, not fully sure on the impact or whether this should be needed. E.g. is session really lost here?
I hit this in practice: the custom sandbox image uses PySCF, which can get SIGILL. When a uwsgi worker crashes, the sandbox restarts it, but the HTTP connection to the client is already gone with httpx.TransportError. Without this catch, it bubbles up and kills all 32 concurrent generations.
With this change, only the affected generation gets an error and the rest continue. The "session was lost" message tells the LLM to re-import, which is harmless if the session is actually still alive. @gwarmstrong happy to hear if there's a better recovery path.
IMO, the right thing to do would be to:

- include `httpx.TransportError` in the preceding `except` block
- add a logging command like `LOG.warning("Sandbox communication error for session %s: %s", request_session_id, e)` to the above, so internally we can distinguish between the errors

This would catch both sources. Then, on the subsequent request, if the session was lost, it will message back to the client accordingly. This is consistent with how things are done presently for timeouts:
Skills/nemo_skills/code_execution/sandbox.py, lines 252 to 261 in 8e838e1
With the changes proposed presently in this PR, if the session/worker is lost, the client would actually get two separate "the session died" warnings: once from the proposed addition and again from the existing handler that catches the session loss retroactively. In the case where the session is not lost, the model would get one message about the session being lost.
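A minimal sketch of that combined handling, written against the names visible in this thread (`_send_request`, `request_session_id`, `LOG`); this is the reviewer's suggestion sketched out, not the code currently in the PR:

```python
import logging

import httpx

LOG = logging.getLogger(__name__)


async def send_with_transport_handling(send, request: dict, timeout: float, session_id) -> dict:
    """Illustrative wrapper: treat dropped connections like client timeouts,
    while logging them so the two failure modes stay distinguishable."""
    try:
        return await send(request, timeout)
    except httpx.TransportError as e:
        # Also covers httpx.TimeoutException (a TransportError subclass). If the
        # worker actually died, the existing retroactive session-lost message on
        # the next request tells the model to re-import its state.
        LOG.warning("Sandbox communication error for session %s: %s", session_id, e)
        return {"process_status": "timeout", "stdout": "", "stderr": "Client timed out\n"}
```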
```python
        # Only the server component (span_group_nodes=True) should claim group GPUs.
        # Other components (client, sandbox) run on the same node but don't need GPUs.
        if span_group_nodes:
```
was this causing problems? I think we might have legit cases of sandbox needing gpus, e.g. for cuda code execution. And main client might also need gpus in case of run_cmd jobs
Fair point, I was matching the legacy add_task behavior (sandbox gets 0, client gets 0 when server is present), but you're right it could break run_cmd jobs or CUDA sandbox. Reverted the GPU change. The real fix for my case was the srun flags.
```diff
@@ -0,0 +1,138 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
do we need scripts like this? Isn't this exactly the same as ns generate with just an appropriate prompt config set?
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Signed-off-by: jubick1337 <mnovikov@nvidia.com>
- Remove silent try/except wrappers in gather_solutions.py (let it crash)
- Replace .get() with direct access where keys are expected
- Remove dead NameUsageChecker class (superseded by UsedNamesExtractor)
- Replace typing imports with builtins (Dict->dict, List->list, etc.)
- Make --target required (no magic default)
- Rename --field to --domain in harvest_docs.py for consistency
- Add domain aliases and .lower() to filter_applicability_relevance.py

Signed-off-by: jubick1337 <mnovikov@nvidia.com>
Additional Comments (1)
Consider either:
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/libtrace/scripts/gather_solutions.py`:
- Around line 46-63: The stdlib fallback _json_dumps currently calls
_json_std.dumps(obj) without ensure_ascii=False, causing non-ASCII chars to be
escaped and producing inconsistent output versus orjson; update the fallback
implementation of _json_dumps to call _json_std.dumps(obj, ensure_ascii=False)
so it matches the orjson behavior, keeping the existing _json_loads and _orjson
handling unchanged and referencing the _json_dumps, _json_loads, _orjson, and
_json_std symbols to locate the change.
- Around line 686-716: The CLI currently declares --target with required=True
inside parse_args(), which forces a value even in stats mode; change
parse_args() so --target is not required (set required=False and default=None)
and adjust its help text accordingly, then in the main entry (the code that
consumes parse_args(), e.g. after calling parse_args() in main or if __name__ ==
"__main__") validate that when args.mode == "sample" args.target is provided (if
args.target is None call parser.error("--target is required when --mode=sample")
or raise the same error through the ArgumentParser instance); locate parse_args,
the --target argument definition, and the main block that inspects args.mode to
implement this check.
🧹 Nitpick comments (8)
nemo_skills/code_execution/sandbox.py (1)
399-408: Two small nits in the retry exception handling.

- `httpx.TimeoutException` is a subclass of `httpx.TransportError`, so listing both in the same `except` tuple is redundant; `TransportError` already covers it.
- Line 404: The `–` (EN DASH) should be a `-` (HYPHEN-MINUS) per Ruff RUF001.

Proposed cleanup

```diff
         except (
-            httpx.TransportError,  # Covers ReadError, ConnectError, RemoteProtocolError, etc.
-            httpx.TimeoutException,
+            httpx.TransportError,  # Covers ReadError, ConnectError, RemoteProtocolError, TimeoutException, etc.
             httpx.HTTPStatusError,
         ) as e:
-            LOG.warning("Retry %d/%d deleting session %s – %s", attempt + 1, max_retries, session_id, e)
+            LOG.warning("Retry %d/%d deleting session %s - %s", attempt + 1, max_retries, session_id, e)
```
recipes/libtrace/scripts/collect_generated_problems.py (2)

87-103: Minor inconsistency: `glob.glob` vs `Path.glob` used in sibling script.

gather_solutions.py (line 360) uses `Path.glob(pattern)`, while this function uses `glob.glob(str(Path(input_dir) / pattern))`. Both work, but `Path.glob` is more idiomatic and would let you drop the `import glob` at line 33.
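For reference, the `Path.glob` form could look like this sketch (illustrative, not the code in the PR):

```python
from pathlib import Path


def list_input_files(input_dir: str, pattern: str) -> list[str]:
    # Equivalent to glob.glob(str(Path(input_dir) / pattern)), sorted for
    # deterministic ordering across runs.
    return sorted(str(p) for p in Path(input_dir).glob(pattern))
```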
138-153: `from transformers import AutoTokenizer` at module level makes the script unusable without `transformers`, even if tokenization could theoretically be disabled.

The top-level import on line 38 will crash before `parse_args()` runs. Since `max_problem_tokens` defaults to `10000` (always on), this is currently fine in practice. However, if you ever intend to allow disabling tokenization (e.g., `default=None`), consider deferring the import to line 153 where the tokenizer is actually created.

Suggested deferred import

```diff
-from transformers import AutoTokenizer  # type: ignore
 ...
     if args.max_problem_tokens is not None:
         tokenizer_name = args.tokenizer
         if tokenizer_name is None:
             local_default = Path("/hf_models/gpt-oss-120b")
             tokenizer_name = str(local_default) if local_default.exists() else "openai/gpt-oss-120b"
+        from transformers import AutoTokenizer  # type: ignore
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
```

recipes/libtrace/scripts/filter_applicability_relevance.py (2)
106-120: Regex patterns are recompiled on every call to `parse_scores`.

If this function is called in a tight loop (once per row), recompiling the regex each time is wasteful. Consider accepting pre-compiled patterns or caching them.

Suggested optimization

```diff
+import functools
+
+@functools.lru_cache(maxsize=16)
+def _get_score_patterns(label: str):
+    return (
+        re.compile(rf"{re.escape(label)}-Applicable:\s*([01])"),
+        re.compile(rf"{re.escape(label)}-Relevance:\s*([123])"),
+    )
+
 def parse_scores(generation: str, label: str, path: str, line_num: int) -> tuple[int, int]:
-    applicable_pattern = re.compile(rf"{re.escape(label)}-Applicable:\s*([01])")
-    relevance_pattern = re.compile(rf"{re.escape(label)}-Relevance:\s*([123])")
+    applicable_pattern, relevance_pattern = _get_score_patterns(label)
```
69-87: `read_jsonl` is duplicated in collect_generated_problems.py and should be extracted to a shared utility module.

The function in filter_applicability_relevance.py is identical to the one in collect_generated_problems.py. A similar variant exists in prepare_inference_jsonl.py (without the `skip_invalid` parameter). Extract the two identical versions into a shared utility module (e.g., recipes/libtrace/scripts/libtrace_utils.py) to reduce duplication and maintenance burden.

recipes/libtrace/scripts/harvest_docs.py (1)
302-348: Shared libraries across domains will produce duplicate entries in the unified file.

Libraries like `sympy`, `numpy`, and `scipy` appear in all domain groups. Since `process_library` creates a fresh `seen=set()` per call and the main loop extends `all_docs` with each library's output, the unified file will contain duplicate entries when the same library is listed multiple times (e.g., if `--libraries numpy numpy` is passed, or a custom list overlaps). For domain groups this doesn't happen since each library name appears once in each group list. But if this were used with `--libraries`, duplicates could occur. This is low-risk for the default domain usage, but worth a note; a possible dedup pass is sketched below.
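If guarding against overlapping `--libraries` lists ever becomes necessary, a dedup pass over the unified list could look like the sketch below; the key fields are assumptions about the harvested records, not confirmed against the script's schema:

```python
def dedupe_docs(all_docs: list[dict]) -> list[dict]:
    # Keyed on fields assumed to identify an item uniquely; adjust as needed.
    seen: set[tuple] = set()
    unique: list[dict] = []
    for doc in all_docs:
        key = (doc.get("source"), doc.get("type"), doc.get("name"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```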
recipes/libtrace/scripts/gather_solutions.py (2)
697-714: Same `store_true` + `default=True` no-op pattern as in filter_applicability_relevance.py.

`--require_code_rounds`, `--require_boxed`, and `--dump_json` all use `action="store_true"` with `default=True`. Passing these flags on the CLI has no effect; only the `--no_*` variants work. Consider using `argparse.BooleanOptionalAction` for clarity.
719-743: `import os` should be at the top of the file, not inside the `__main__` block.

Line 723 imports `os` inside the main block. Move it to the top-level imports for consistency and clarity.

Proposed fix

```diff
+import os
 import argparse
 import ast
 import importlib
 ...
 if __name__ == "__main__":
     args = parse_args()
     input_files = resolve_input_files(args.input_dir, args.input_files, args.pattern)

-    import os
     num_workers = args.num_workers if args.num_workers is not None else max(1, (os.cpu_count() or 2) // 2)
```
```python
try:  # orjson is significantly faster; fallback to std json
    import orjson as _orjson  # type: ignore

    def _json_loads(s: str):
        return _orjson.loads(s)

    def _json_dumps(obj) -> str:
        return _orjson.dumps(obj).decode("utf-8")

except Exception:  # pragma: no cover - best effort
    _orjson = None

    def _json_loads(s: str):
        return _json_std.loads(s)

    def _json_dumps(obj) -> str:
        return _json_std.dumps(obj)
```
Fallback _json_dumps missing ensure_ascii=False — inconsistent with all other scripts.
Every other script in this PR passes ensure_ascii=False in the stdlib fallback. This omission means non-ASCII characters (e.g., in docstrings, problem text) will be escaped as \uXXXX sequences, producing output that differs depending on whether orjson is installed.
Proposed fix

```diff
     def _json_dumps(obj) -> str:
-        return _json_std.dumps(obj)
+        return _json_std.dumps(obj, ensure_ascii=False)
```

🧰 Tools
🪛 Ruff (0.15.0)
[warning] 55-55: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 46 - 63, The
stdlib fallback _json_dumps currently calls _json_std.dumps(obj) without
ensure_ascii=False, causing non-ASCII chars to be escaped and producing
inconsistent output versus orjson; update the fallback implementation of
_json_dumps to call _json_std.dumps(obj, ensure_ascii=False) so it matches the
orjson behavior, keeping the existing _json_loads and _orjson handling unchanged
and referencing the _json_dumps, _json_loads, _orjson, and _json_std symbols to
locate the change.
```python
def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Gather and analyze solutions.")
    parser.add_argument("--mode", choices=["stats", "sample"], default="stats")
    parser.add_argument("--input_dir", type=str, default=None)
    parser.add_argument("--input_files", type=str, nargs="+", default=None)
    parser.add_argument("--pattern", type=str, default="output-rs*.jsonl")
    parser.add_argument("--dataset_name", type=str, default="dataset")
    parser.add_argument("--target", type=int, required=True, help="Target number of sampled records.")
    parser.add_argument("--output_dir", type=str, default="sampled_solutions")
    parser.add_argument("--random_seed", type=int, default=42)
    parser.add_argument("--num_workers", type=int, default=None, help="Default: os.cpu_count()//2")
    parser.add_argument(
        "--require_code_rounds",
        action="store_true",
        default=True,
        help="Skip records without code_rounds_executed field (default: True)",
    )
    parser.add_argument("--no_require_code_rounds", action="store_false", dest="require_code_rounds")
    parser.add_argument(
        "--require_boxed", action="store_true", default=True, help="Skip records without \\boxed{} (default: True)"
    )
    parser.add_argument("--no_require_boxed", action="store_false", dest="require_boxed")
    parser.add_argument(
        "--dump_json",
        action="store_true",
        default=True,
        help="Serialize tool_call arguments as JSON strings (default: True)",
    )
    parser.add_argument("--no_dump_json", action="store_false", dest="dump_json")
    parser.add_argument("--user_prompt_prefix", type=str, default=DEFAULT_USER_PROMPT_PREFIX)
    return parser.parse_args()
```
--target is required=True but unused in stats mode — forces users to provide a meaningless argument.
In stats mode (line 745), args.target is never read. Users are forced to specify --target even when it has no effect, which is confusing.
Proposed fix

```diff
-    parser.add_argument("--target", type=int, required=True, help="Target number of sampled records.")
+    parser.add_argument("--target", type=int, default=None, help="Target number of sampled records (required for sample mode).")
```

Then validate in the main block:

```python
if args.mode == "sample" and args.target is None:
    parser.error("--target is required when --mode=sample")
```
In `@recipes/libtrace/scripts/gather_solutions.py` around lines 686 - 716, The CLI
currently declares --target with required=True inside parse_args(), which forces
a value even in stats mode; change parse_args() so --target is not required (set
required=False and default=None) and adjust its help text accordingly, then in
the main entry (the code that consumes parse_args(), e.g. after calling
parse_args() in main or if __name__ == "__main__") validate that when args.mode
== "sample" args.target is provided (if args.target is None call
parser.error("--target is required when --mode=sample") or raise the same error
through the ArgumentParser instance); locate parse_args, the --target argument
definition, and the main block that inspects args.mode to implement this check.
Summary by CodeRabbit
Release Notes
Documentation
Bug Fixes
Chores