feat(results): enhance report generation pipeline and fix incomplete reports#708
Open
sephmard wants to merge 11 commits into NVIDIA-NeMo:main
Conversation
Signed-off-by: Seph Mard <smard@nvidia.com>
Enhance the report hook with better answer display (maps numeric targets to choice labels like "F) Shareholders..."), improved grading match algorithm using scored multi-signal matching, scoped metrics collection for groups/tasks, and benchmark container/harness resolution from the launcher task registry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Seph Mard <smard@nvidia.com>
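The numeric-target-to-choice-label mapping described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and the letter range are assumptions.

```python
def target_to_choice_label(target, choices):
    """Map a numeric target index to its choice label, e.g. 5 -> 'F) ...'.
    Hypothetical helper; the report hook's real implementation may differ."""
    letters = "ABCDEFGHIJ"
    if isinstance(target, int) and 0 <= target < len(choices):
        return f"{letters[target]}) {choices[target]}"
    # Non-numeric targets (already a letter, or free text) pass through.
    return str(target)
```

With `choices[5] == "Shareholders"`, a target of `5` renders as `F) Shareholders`, matching the display style quoted in the commit message.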
Overhaul the report template with NVIDIA brand bar, scoped metrics tables (group rollups and per-task breakdowns), improved sample viewer with prompt sections and choice-label targets, and refined styling for better readability across the report. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Seph Mard <smard@nvidia.com>
Add test cases covering instruction-following tasks (ifeval) and multiple-choice tasks (mmlu_pro) to verify correct target display, grading label counts, and HTML row generation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Seph Mard <smard@nvidia.com>
The glob in _collect_grading_records now matches *.jsonl* (not just
*.jsonl) and the filename filter uses startswith("output.jsonl") so
output.jsonl-async is picked up. The .done sentinel is explicitly
skipped. _derive_correctness falls back to comparing expected_answer
vs predicted_answer when no other correctness indicator exists.
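A minimal sketch of the collection and fallback logic described above, assuming simplified function names; the field names `expected_answer`/`predicted_answer` come from the commit message.

```python
from pathlib import Path

def collect_grading_files(artifacts_dir):
    # Match *.jsonl* so suffixed files like output.jsonl-async are found,
    # keep only names starting with "output.jsonl", and skip the .done
    # sentinel explicitly.
    files = []
    for path in Path(artifacts_dir).rglob("*.jsonl*"):
        if path.name.endswith(".done"):
            continue  # completion sentinel, not grading data
        if path.name.startswith("output.jsonl"):
            files.append(path)
    return sorted(files)

def derive_correctness(record):
    # Fall back to comparing expected_answer vs predicted_answer when no
    # other correctness indicator exists on the record.
    if "correct" in record:
        return bool(record["correct"])
    return record.get("expected_answer") == record.get("predicted_answer")
```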
Report template improvements:
- Add search box for filtering samples by keyword
- Show predicted answer letter prominently in the Answer column
- Enable "Graded" detail toggle by default
- Add "Sort: incorrect first / correct first" options
- Add accuracy progress bar visualization
- Auto-collapse lower reference sections on load
- Refactor mmlu_pro test fixtures into shared helper; add
test_report_grading_from_jsonl_async
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Seph Mard <smard@nvidia.com>
…line

Fix incomplete MMLU Pro and IFEval reports by adding fallback paths for task_name (run_config.config.type), git_hash (metadata.yaml), and container resolution (framework_name-qualified lookup). Load metadata.yaml as a new data source for versioning info and eval parameters.

Redesign the Run Summary section: merge Run Summary and Sample Stats into a unified layout with a hero stats banner (accuracy, samples, correct, error rate), color-coded category groups (Configuration, Evaluation, Performance, System), eval parameter pills, and a grade distribution bar.

Also includes Phase 1-3 improvements: render graded_metrics and graded_source, fix ARC expected/predicted values, preserve IFEval instruction labels, add problem-only text in the table, decompose post_eval_hook into helper methods, pre-normalize fuzzy-match data, fix autoescape, debounce search, add CSV/JSON export, and 8 new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Seph Mard <smard@nvidia.com>
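The fuzzy-match pre-normalization mentioned above can be sketched like this. It is illustrative only; the record fields and helper names are assumptions, not the PR's actual code.

```python
import re

def _normalize_ws(text):
    # Collapse whitespace runs so matching ignores layout differences.
    return re.sub(r"\s+", " ", text).strip()

def match_entries_to_grades(entries, grades):
    # Normalize each grading record once up front instead of calling
    # _normalize_ws() repeatedly inside the O(n*m) comparison loop.
    normalized = [(_normalize_ws(g["prompt"]), g) for g in grades]
    matches = []
    for entry in entries:
        key = _normalize_ws(entry["prompt"])  # once per entry
        for grade_key, grade in normalized:
            if key == grade_key:
                matches.append((entry, grade))
                break
    return matches
```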
Split the combined "Tokens: 711 / 3 / 714" display into separate labeled cards: "Prompt Tokens", "Completion Tokens", "Total Tokens". Also expand abbreviated labels "Req chars" and "Resp chars" to full "Request Chars" and "Response Chars". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Seph Mard <smard@nvidia.com>
05d646c to 490d3c6
Signed-off-by: Seph Mard <smard@nvidia.com>
The detect-secrets scanner flagged the test fixture "abc123def456" as a hex high entropy string. This is a fake git hash used in test_git_hash_fallback_from_metadata, not a real secret. Signed-off-by: Seph Mard <smard@nvidia.com>
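One common way to suppress this kind of detect-secrets false positive is an inline allowlist pragma on the fixture line; the variable name below is an assumption for illustration.

```python
# Fake git hash used only as a test fixture; the pragma tells
# detect-secrets to skip this line when scanning.
FAKE_GIT_HASH = "abc123def456"  # pragma: allowlist secret
```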
Evaluation Report Enhancements
This PR overhauls the post-evaluation reporting system — the HTML reports generated after each evaluation run. It fixes data completeness issues, redesigns the visual layout, adds new CLI tooling,
and significantly expands test coverage.
CLI
- `nel analyze <artifacts_dir>` — generates a standalone HTML analysis report from evaluation artifacts (results.yml, report.json, eval_factory_metrics.json)
- `nel view --log-dir <path>` — local HTTP viewer for browsing evaluation runs, tasks, logs, and reports with auto-refreshing logs

Report Data Fixes
- Task name and git hash missing when results.yml is absent — added fallback resolution through run_config.yml (config.type, config.params.task) and metadata.yaml (versioning.git-hash)
- Container resolution uses framework_name from run_config.yml to construct qualified lookups (e.g. lm-evaluation-harness.ifeval vs codec.ifeval)
- ARC entries in the lm-evaluation-harness format use target + filtered_resps, not expected_answer + predicted_answer
- graded_metrics (inst_level_strict_acc, symbolic_correct) collected but never rendered — now displayed in detail view
- graded_source (originating JSONL file path) collected but never rendered — now shown
- instruction_id_list labels lost during grading — now preserved and rendered as badges
- Grading records are now picked up from output.jsonl-async files
- select_autoescape(["html", "xml"]) does not apply to from_string() templates — changed to autoescape=True
- Replaced echo '...' with a cat <<'PAYLOAD_EOF' heredoc
- Removed the normalizeGradeClasses() JS function and its call, plus the [data-graded="correct/incorrect/unknown"] CSS selectors (.grade-* classes already handle this)
- .brand-meta and .meta-pill CSS blocks

Report Visual Improvements
- run_config.yml
- metadata.yaml (evaluator version, launcher version)
- Added metadata.yaml to the raw artifacts section
- Added lang="en" to the <html> tag
- :focus-visible focus styles

Architecture
- Decomposed post_eval_hook into _load_auxiliary_data, _build_grading_index, _match_entries_to_grades, _compute_report_stats, _build_report_meta
- Pre-normalized fuzzy-match data to avoid repeated _normalize_ws() calls in the O(n*m) matching loop

Tests
- test_report_counts_and_target_for_ifeval
- test_report_counts_and_target_for_ns_mmlu_pro
- test_report_grading_from_jsonl_async — output.jsonl-async files
- test_grading_match_failure
- test_lm_eval_harness_arc_format — target + filtered_resps format
- test_whitespace_fuzzy_matching
- test_container_url_from_image
- test_flatten_numeric
- test_task_name_fallback_from_run_config_type — run_config.config.type
- test_git_hash_fallback_from_metadata — metadata.yaml
- test_eval_params_rendered

Example Report
Example new report HTML:
arc_challenge_report.html