
feat(results): enhance report generation pipeline and fix incomplete reports#708

Open
sephmard wants to merge 11 commits into NVIDIA-NeMo:main from sephmard:codex/report-enhancements

Conversation


@sephmard sephmard commented Feb 9, 2026

Evaluation Report Enhancements

This PR overhauls the post-evaluation reporting system — the HTML reports generated after each evaluation run. It fixes data completeness issues, redesigns the visual layout, adds new CLI tooling,
and significantly expands test coverage.


CLI

  • Added nel analyze <artifacts_dir> — generates a standalone HTML analysis report from evaluation artifacts (results.yml, report.json, eval_factory_metrics.json)
  • Added nel view --log-dir <path> — local HTTP viewer for browsing evaluation runs, tasks, logs, and reports with auto-refreshing logs
  • Registered both commands in the launcher CLI entry point
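
The artifact-gathering step behind nel analyze can be sketched roughly as follows — a minimal illustration only, assuming the three artifact file names listed above; the function name and return shape are hypothetical, not the actual CLI implementation:

```python
from pathlib import Path

# File names from the PR description; everything else here is illustrative.
EXPECTED_ARTIFACTS = ("results.yml", "report.json", "eval_factory_metrics.json")

def discover_artifacts(artifacts_dir):
    """Map each known artifact name to its path, skipping files that are absent."""
    root = Path(artifacts_dir)
    return {name: root / name for name in EXPECTED_ARTIFACTS if (root / name).is_file()}
```

Missing artifacts simply drop out of the mapping, which is what lets the report fall back to secondary sources when results.yml is absent.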

Report Data Fixes

  • Fixed incomplete MMLU Pro and IFEval reports missing Benchmark, Container, and Git Hash pills when results.yml is absent — added fallback resolution through run_config.yml (config.type,
    config.params.task) and metadata.yaml (versioning.git-hash)
  • Fixed container resolution failing for ambiguous task names — uses framework_name from run_config.yml to construct qualified lookups (e.g. lm-evaluation-harness.ifeval vs codec.ifeval)
  • Fixed ARC expected/predicted values always null — lm-eval-harness uses target + filtered_resps, not expected_answer + predicted_answer
  • Fixed graded_metrics (inst_level_strict_acc, symbolic_correct) collected but never rendered — now displayed in detail view
  • Fixed graded_source (originating JSONL file path) collected but never rendered
  • Fixed IFEval instruction_id_list labels lost during grading — now preserved and rendered as badges
  • Fixed grading records not discovered from output.jsonl-async files
  • Fixed Jinja2 autoescape ineffective against XSS — select_autoescape(["html", "xml"]) does not apply to from_string() templates, changed to autoescape=True
  • Fixed curl command heredoc breaking on single quotes in payloads — replaced echo '...' with cat <<'PAYLOAD_EOF' heredoc
  • Fixed table row numbers not updating after filtering
  • Removed redundant character count recomputation loop (already computed earlier in pipeline)
  • Removed dead normalizeGradeClasses() JS function and call
  • Removed duplicate [data-graded="correct/incorrect/unknown"] CSS selectors (.grade-* classes already handle this)
  • Removed unused .brand-meta and .meta-pill CSS blocks
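
The autoescape fix can be illustrated with a minimal sketch; the template markup below is illustrative, not the report's actual template. With autoescape=True, escaping applies unconditionally, including to templates compiled via from_string(), whereas extension-based selection keys off a template name that from_string() templates do not have:

```python
from jinja2 import Environment

# autoescape=True escapes every template, regardless of how it was loaded.
env = Environment(autoescape=True)
rendered = env.from_string("<td>{{ value }}</td>").render(
    value="<script>alert(1)</script>"
)
# The injected markup comes out entity-escaped rather than executable.
print(rendered)
```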

Report Visual Improvements

  • Merged flat "Run Summary" and "Sample Stats" sections into unified layout with hero stats banner (Accuracy %, Total Samples, Correct, Error Rate)
  • Added category-grouped cards with color-coded left borders — Configuration (blue), Evaluation Results (green), Performance (yellow), System (light blue)
  • Added eval parameter pills (temperature, top_p, max_new_tokens, parallelism, etc.) extracted from run_config.yml
  • Added grade distribution bar (tri-color correct/incorrect/unknown)
  • Added versioning info from metadata.yaml (evaluator version, launcher version)
  • Added framework name display in Configuration group
  • Added metadata.yaml to raw artifacts section
  • Added CSV/JSON export buttons
  • Added lang="en" to <html> tag
  • Added keyboard focus styles (:focus-visible)
  • Added debounced search input (200ms)
  • Split confusing per-sample "Tokens: 711 / 3 / 714" into separate labeled cards (Prompt Tokens, Completion Tokens, Total Tokens)
  • Renamed abbreviated labels "Req chars" / "Resp chars" to "Request Chars" / "Response Chars"
  • Changed table Input column to show problem text only (without repeated options blob)

Architecture

  • Decomposed 333-line post_eval_hook into _load_auxiliary_data, _build_grading_index, _match_entries_to_grades, _compute_report_stats, _build_report_meta
  • Pre-normalized fuzzy match data to eliminate redundant _normalize_ws() calls in O(n*m) loop
  • Added diagnostic logging — DEBUG on match failure with prompt preview, INFO summary of matched vs unmatched counts
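
A minimal sketch of the pre-normalization idea — normalize each side once and match on cached keys instead of calling _normalize_ws() inside the comparison loop. The "prompt" field name and exact-match lookup are illustrative; the real pipeline uses scored multi-signal matching:

```python
import re

_WS = re.compile(r"\s+")

def _normalize_ws(text):
    # Collapse runs of whitespace so cosmetic differences don't block matches.
    return _WS.sub(" ", text).strip()

def match_entries_to_grades(entries, grades):
    # Normalize each grade prompt exactly once up front; the per-entry loop
    # then does dict lookups instead of re-normalizing in an O(n*m) scan.
    grade_index = {_normalize_ws(g["prompt"]): g for g in grades}
    matched, unmatched = [], []
    for entry in entries:
        grade = grade_index.get(_normalize_ws(entry["prompt"]))
        if grade is not None:
            matched.append((entry, grade))
        else:
            unmatched.append(entry)
    return matched, unmatched
```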

Tests

  • test_report_counts_and_target_for_ifeval: IFEval grading + instruction metrics
  • test_report_counts_and_target_for_ns_mmlu_pro: MMLU Pro grading + expected answer
  • test_report_grading_from_jsonl_async: grading from output.jsonl-async files
  • test_grading_match_failure: unmatched entries render as ungraded
  • test_lm_eval_harness_arc_format: ARC target + filtered_resps format
  • test_whitespace_fuzzy_matching: normalized whitespace matching
  • test_container_url_from_image: NGC image → catalog URL mapping
  • test_flatten_numeric: nested metric flattening
  • test_task_name_fallback_from_run_config_type: task name from run_config.config.type
  • test_git_hash_fallback_from_metadata: git hash + versions from metadata.yaml
  • test_eval_params_rendered: eval param pills in HTML output

Example Report

Example new report HTML:
arc_challenge_report.html

@sephmard sephmard requested review from a team as code owners February 9, 2026 15:27

copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@sephmard sephmard changed the title from "Claude/report enhancements" to "feat(results): enhance report generation pipeline and fix incomplete reports" on Feb 9, 2026
Seph Mard iMac Mini and others added 8 commits February 9, 2026 10:34
Signed-off-by: Seph Mard <smard@nvidia.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Enhance the report hook with better answer display (maps numeric targets
to choice labels like "F) Shareholders..."), improved grading match
algorithm using scored multi-signal matching, scoped metrics collection
for groups/tasks, and benchmark container/harness resolution from the
launcher task registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Overhaul the report template with NVIDIA brand bar, scoped metrics
tables (group rollups and per-task breakdowns), improved sample viewer
with prompt sections and choice-label targets, and refined styling
for better readability across the report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Add test cases covering instruction-following tasks (ifeval) and
multiple-choice tasks (mmlu_pro) to verify correct target display,
grading label counts, and HTML row generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
The glob in _collect_grading_records now matches *.jsonl* (not just
*.jsonl) and the filename filter uses startswith("output.jsonl") so
output.jsonl-async is picked up. The .done sentinel is explicitly
skipped.  _derive_correctness falls back to comparing expected_answer
vs predicted_answer when no other correctness indicator exists.
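
The discovery logic described above can be sketched like this — a minimal illustration under the stated glob and prefix rules; the function name and iteration structure are hypothetical:

```python
from pathlib import Path

def iter_grading_files(artifacts_dir):
    # *.jsonl* also matches output.jsonl-async, not just output.jsonl.
    for path in sorted(Path(artifacts_dir).rglob("*.jsonl*")):
        # Explicitly skip the .done sentinel files.
        if path.name.endswith(".done"):
            continue
        # Only output.jsonl-prefixed files carry grading records.
        if path.name.startswith("output.jsonl"):
            yield path
```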

Report template improvements:
- Add search box for filtering samples by keyword
- Show predicted answer letter prominently in the Answer column
- Enable "Graded" detail toggle by default
- Add "Sort: incorrect first / correct first" options
- Add accuracy progress bar visualization
- Auto-collapse lower reference sections on load
- Refactor mmlu_pro test fixtures into shared helper; add
  test_report_grading_from_jsonl_async

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
…line

Fix incomplete MMLU Pro and IFEval reports by adding fallback paths for
task_name (run_config.config.type), git_hash (metadata.yaml), and
container resolution (framework_name-qualified lookup). Load metadata.yaml
as a new data source for versioning info and eval parameters.
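
The fallback chain for the task name might look roughly like the sketch below. The key names (config.type, config.params.task) come from the commit message; the precedence order and function shape are assumptions, not the actual hook code:

```python
def resolve_task_name(results, run_config):
    # Prefer results.yml when it exists and names the task.
    if results and results.get("task_name"):
        return results["task_name"]
    # Otherwise fall back to run_config.yml: config.params.task, then config.type.
    cfg = (run_config or {}).get("config") or {}
    params = cfg.get("params") or {}
    return params.get("task") or cfg.get("type")
```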

Redesign the Run Summary section: merge Run Summary and Sample Stats into
a unified layout with a hero stats banner (accuracy, samples, correct,
error rate), color-coded category groups (Configuration, Evaluation,
Performance, System), eval parameter pills, and a grade distribution bar.

Also includes Phase 1-3 improvements: render graded_metrics and
graded_source, fix ARC expected/predicted values, preserve IFEval
instruction labels, add problem-only text in table, decompose
post_eval_hook into helper methods, pre-normalize fuzzy match data,
fix autoescape, debounce search, add CSV/JSON export, and 8 new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
Split the combined "Tokens: 711 / 3 / 714" display into separate
labeled cards: "Prompt Tokens", "Completion Tokens", "Total Tokens".
Also expand abbreviated labels "Req chars" and "Resp chars" to full
"Request Chars" and "Response Chars".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
@sephmard sephmard force-pushed the codex/report-enhancements branch from 05d646c to 490d3c6 Compare February 9, 2026 15:35
Signed-off-by: Seph Mard <smard@nvidia.com>
The detect-secrets scanner flagged the test fixture "abc123def456"
as a hex high entropy string. This is a fake git hash used in
test_git_hash_fallback_from_metadata, not a real secret.

Signed-off-by: Seph Mard <smard@nvidia.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
