
feat(results): enhance report generation pipeline and fix incomplete reports#708

Open
sephmard wants to merge 11 commits into NVIDIA-NeMo:main from sephmard:codex/report-enhancements

Conversation


@sephmard sephmard commented Feb 9, 2026

Evaluation Report Enhancements

This PR overhauls the post-evaluation reporting system — the HTML reports generated after each evaluation run. It fixes data completeness issues, redesigns the visual layout, adds new CLI tooling,
and significantly expands test coverage.


CLI

  • Added nel analyze <artifacts_dir> — generates a standalone HTML analysis report from evaluation artifacts (results.yml, report.json, eval_factory_metrics.json)
  • Added nel view --log-dir <path> — local HTTP viewer for browsing evaluation runs, tasks, logs, and reports with auto-refreshing logs
  • Registered both commands in the launcher CLI entry point
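
The artifact-gathering step behind nel analyze can be sketched roughly as follows — a minimal illustration only, assuming the three artifact file names listed above; the function name and return shape are hypothetical, not the actual CLI implementation:

```python
from pathlib import Path

# File names from the PR description; everything else here is illustrative.
EXPECTED_ARTIFACTS = ("results.yml", "report.json", "eval_factory_metrics.json")

def discover_artifacts(artifacts_dir):
    """Map each known artifact name to its path, skipping files that are absent."""
    root = Path(artifacts_dir)
    return {name: root / name for name in EXPECTED_ARTIFACTS if (root / name).is_file()}
```

Missing artifacts simply drop out of the mapping, which is what lets the report fall back to secondary sources when results.yml is absent.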

Report Data Fixes

  • Fixed incomplete MMLU Pro and IFEval reports missing Benchmark, Container, and Git Hash pills when results.yml is absent — added fallback resolution through run_config.yml (config.type,
    config.params.task) and metadata.yaml (versioning.git-hash)
  • Fixed container resolution failing for ambiguous task names — uses framework_name from run_config.yml to construct qualified lookups (e.g. lm-evaluation-harness.ifeval vs codec.ifeval)
  • Fixed ARC expected/predicted values always null — lm-eval-harness uses target + filtered_resps, not expected_answer + predicted_answer
  • Fixed graded_metrics (inst_level_strict_acc, symbolic_correct) collected but never rendered — now displayed in detail view
  • Fixed graded_source (originating JSONL file path) collected but never rendered
  • Fixed IFEval instruction_id_list labels lost during grading — now preserved and rendered as badges
  • Fixed grading records not discovered from output.jsonl-async files
  • Fixed Jinja2 autoescape ineffective against XSS — select_autoescape(["html", "xml"]) does not apply to from_string() templates, changed to autoescape=True
  • Fixed curl command heredoc breaking on single quotes in payloads — replaced echo '...' with cat <<'PAYLOAD_EOF' heredoc
  • Fixed table row numbers not updating after filtering
  • Removed redundant character count recomputation loop (already computed earlier in pipeline)
  • Removed dead normalizeGradeClasses() JS function and call
  • Removed duplicate [data-graded="correct/incorrect/unknown"] CSS selectors (.grade-* classes already handle this)
  • Removed unused .brand-meta and .meta-pill CSS blocks
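
The autoescape fix can be illustrated with a minimal sketch; the template markup below is illustrative, not the report's actual template. With autoescape=True, escaping applies unconditionally, including to templates compiled via from_string(), whereas extension-based selection keys off a template name that from_string() templates do not have:

```python
from jinja2 import Environment

# autoescape=True escapes every template, regardless of how it was loaded.
env = Environment(autoescape=True)
rendered = env.from_string("<td>{{ value }}</td>").render(
    value="<script>alert(1)</script>"
)
# The injected markup comes out entity-escaped rather than executable.
print(rendered)
```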

Report Visual Improvements

  • Merged flat "Run Summary" and "Sample Stats" sections into unified layout with hero stats banner (Accuracy %, Total Samples, Correct, Error Rate)
  • Added category-grouped cards with color-coded left borders — Configuration (blue), Evaluation Results (green), Performance (yellow), System (light blue)
  • Added eval parameter pills (temperature, top_p, max_new_tokens, parallelism, etc.) extracted from run_config.yml
  • Added grade distribution bar (tri-color correct/incorrect/unknown)
  • Added versioning info from metadata.yaml (evaluator version, launcher version)
  • Added framework name display in Configuration group
  • Added metadata.yaml to raw artifacts section
  • Added CSV/JSON export buttons
  • Added lang="en" to <html> tag
  • Added keyboard focus styles (:focus-visible)
  • Added debounced search input (200ms)
  • Split confusing per-sample "Tokens: 711 / 3 / 714" into separate labeled cards (Prompt Tokens, Completion Tokens, Total Tokens)
  • Renamed abbreviated labels "Req chars" / "Resp chars" to "Request Chars" / "Response Chars"
  • Changed table Input column to show problem text only (without repeated options blob)

Architecture

  • Decomposed 333-line post_eval_hook into _load_auxiliary_data, _build_grading_index, _match_entries_to_grades, _compute_report_stats, _build_report_meta
  • Pre-normalized fuzzy match data to eliminate redundant _normalize_ws() calls in O(n*m) loop
  • Added diagnostic logging — DEBUG on match failure with prompt preview, INFO summary of matched vs unmatched counts
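
A minimal sketch of the pre-normalization idea — normalize each side once and match on cached keys instead of calling _normalize_ws() inside the comparison loop. The "prompt" field name and exact-match lookup are illustrative; the real pipeline uses scored multi-signal matching:

```python
import re

_WS = re.compile(r"\s+")

def _normalize_ws(text):
    # Collapse runs of whitespace so cosmetic differences don't block matches.
    return _WS.sub(" ", text).strip()

def match_entries_to_grades(entries, grades):
    # Normalize each grade prompt exactly once up front; the per-entry loop
    # then does dict lookups instead of re-normalizing in an O(n*m) scan.
    grade_index = {_normalize_ws(g["prompt"]): g for g in grades}
    matched, unmatched = [], []
    for entry in entries:
        grade = grade_index.get(_normalize_ws(entry["prompt"]))
        if grade is not None:
            matched.append((entry, grade))
        else:
            unmatched.append(entry)
    return matched, unmatched
```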

Tests

  • test_report_counts_and_target_for_ifeval: IFEval grading + instruction metrics
  • test_report_counts_and_target_for_ns_mmlu_pro: MMLU Pro grading + expected answer
  • test_report_grading_from_jsonl_async: grading from output.jsonl-async files
  • test_grading_match_failure: unmatched entries render as ungraded
  • test_lm_eval_harness_arc_format: ARC target + filtered_resps format
  • test_whitespace_fuzzy_matching: normalized whitespace matching
  • test_container_url_from_image: NGC image → catalog URL mapping
  • test_flatten_numeric: nested metric flattening
  • test_task_name_fallback_from_run_config_type: task name from run_config.config.type
  • test_git_hash_fallback_from_metadata: git hash + versions from metadata.yaml
  • test_eval_params_rendered: eval param pills in HTML output

Example Report

Example new report HTML:
arc_challenge_report.html

@sephmard sephmard requested review from a team as code owners February 9, 2026 15:27

copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@sephmard sephmard changed the title from "Claude/report enhancements" to "feat(results): enhance report generation pipeline and fix incomplete reports" on Feb 9, 2026
Seph Mard iMac Mini and others added 8 commits February 9, 2026 10:34
Signed-off-by: Seph Mard <smard@nvidia.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Enhance the report hook with better answer display (maps numeric targets
to choice labels like "F) Shareholders..."), improved grading match
algorithm using scored multi-signal matching, scoped metrics collection
for groups/tasks, and benchmark container/harness resolution from the
launcher task registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Overhaul the report template with NVIDIA brand bar, scoped metrics
tables (group rollups and per-task breakdowns), improved sample viewer
with prompt sections and choice-label targets, and refined styling
for better readability across the report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Add test cases covering instruction-following tasks (ifeval) and
multiple-choice tasks (mmlu_pro) to verify correct target display,
grading label counts, and HTML row generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
The glob in _collect_grading_records now matches *.jsonl* (not just
*.jsonl) and the filename filter uses startswith("output.jsonl") so
output.jsonl-async is picked up. The .done sentinel is explicitly
skipped.  _derive_correctness falls back to comparing expected_answer
vs predicted_answer when no other correctness indicator exists.
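
The discovery logic described above can be sketched like this — a minimal illustration under the stated glob and prefix rules; the function name and iteration structure are hypothetical:

```python
from pathlib import Path

def iter_grading_files(artifacts_dir):
    # *.jsonl* also matches output.jsonl-async, not just output.jsonl.
    for path in sorted(Path(artifacts_dir).rglob("*.jsonl*")):
        # Explicitly skip the .done sentinel files.
        if path.name.endswith(".done"):
            continue
        # Only output.jsonl-prefixed files carry grading records.
        if path.name.startswith("output.jsonl"):
            yield path
```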

Report template improvements:
- Add search box for filtering samples by keyword
- Show predicted answer letter prominently in the Answer column
- Enable "Graded" detail toggle by default
- Add "Sort: incorrect first / correct first" options
- Add accuracy progress bar visualization
- Auto-collapse lower reference sections on load
- Refactor mmlu_pro test fixtures into shared helper; add
  test_report_grading_from_jsonl_async

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
…line

Fix incomplete MMLU Pro and IFEval reports by adding fallback paths for
task_name (run_config.config.type), git_hash (metadata.yaml), and
container resolution (framework_name-qualified lookup). Load metadata.yaml
as a new data source for versioning info and eval parameters.
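
The fallback chain for the task name might look roughly like the sketch below. The key names (config.type, config.params.task) come from the commit message; the precedence order and function shape are assumptions, not the actual hook code:

```python
def resolve_task_name(results, run_config):
    # Prefer results.yml when it exists and names the task.
    if results and results.get("task_name"):
        return results["task_name"]
    # Otherwise fall back to run_config.yml: config.params.task, then config.type.
    cfg = (run_config or {}).get("config") or {}
    params = cfg.get("params") or {}
    return params.get("task") or cfg.get("type")
```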

Redesign the Run Summary section: merge Run Summary and Sample Stats into
a unified layout with a hero stats banner (accuracy, samples, correct,
error rate), color-coded category groups (Configuration, Evaluation,
Performance, System), eval parameter pills, and a grade distribution bar.

Also includes Phase 1-3 improvements: render graded_metrics and
graded_source, fix ARC expected/predicted values, preserve IFEval
instruction labels, add problem-only text in table, decompose
post_eval_hook into helper methods, pre-normalize fuzzy match data,
fix autoescape, debounce search, add CSV/JSON export, and 8 new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Split the combined "Tokens: 711 / 3 / 714" display into separate
labeled cards: "Prompt Tokens", "Completion Tokens", "Total Tokens".
Also expand abbreviated labels "Req chars" and "Resp chars" to full
"Request Chars" and "Response Chars".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
@sephmard sephmard force-pushed the codex/report-enhancements branch from 05d646c to 490d3c6 Compare February 9, 2026 15:35
Signed-off-by: Seph Mard <smard@nvidia.com>
The detect-secrets scanner flagged the test fixture "abc123def456"
as a hex high entropy string. This is a fake git hash used in
test_git_hash_fallback_from_metadata, not a real secret.

Signed-off-by: Seph Mard <smard@nvidia.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
