feat: Add eval coverage grid generator#92
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #92 +/- ##
=======================================
Coverage ? 73.27%
=======================================
Files ? 133
Lines ? 15495
Branches ? 0
=======================================
Hits ? 11354
Misses ? 3302
Partials ? 839
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a new waza coverage CLI command to generate an eval-coverage “grid” across discovered skills, with supporting docs and tests.
Changes:
- Introduces
waza coverage [root]withtext,markdown, andjsonoutput. - Implements skill/eval discovery and a coverage classification (none/partial/full) based on tasks + grader types.
- Updates CLI reference docs and README, and adds unit tests for report building + markdown rendering.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Documents the new waza coverage command, flags, and examples. |
| cmd/waza/root.go | Registers the new coverage subcommand on the root CLI. |
| cmd/waza/cmd_coverage.go | Implements discovery, report generation, and output renderers (text/markdown/json). |
| cmd/waza/cmd_coverage_test.go | Adds unit tests for report classification, markdown output, and command registration. |
| README.md | Adds waza coverage usage example and CLI reference entry. |
| .squad/log/2026-03-05T00-36-issue-assignment-pipeline.md | Adds team process log (non-functional). |
| .squad/log/2026-03-05T00-26-rusty-token-diff-design.md | Adds team process log (non-functional). |
| .squad/decisions.md | Records team workflow/design decisions (non-functional). |
spboyer
left a comment
There was a problem hiding this comment.
Verified by Rusty (Opus 4.6) — LGTM ✅
Clean eval coverage grid generator:
- New \waza coverage\ command with text/markdown/json output
- Smart skill/eval discovery with deduplication, hidden dir skipping
- Coverage classification (Full/Partial/None) is conservative and correct
- Tests cover no-eval, partial/full, markdown rendering, root command integration
- Docs updated: README, CLI reference
- CI green on ubuntu + windows + lint
Minor: \�valSpecLite.Tasks\ is []string\ while real eval YAML has structured task objects — means task count defaults to 0, showing Partial instead of Full for real evals. Conservative for a reporting tool. Worth fixing to []any\ in a follow-up.
Note: Can't self-approve via API. Setting auto-merge.
40b1f4c to
e1365f2
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a6ca52d to
f3aa0c4
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #92 - feat: Add eval coverage grid generator
What Looks Good
- Clean architecture - discovery, parsing, classification, rendering well-separated
- Parse failures properly reported (not silently dropped)
- Both eval.yaml and eval.yml supported
- README and CLI reference updated with correct flag names
- Path deduplication via seenPaths map
Suggestions (non-blocking)
- isDir/isFile helpers duplicate workspace utilities. Consider importing or extracting.
- No test for JSON output format.
- -f shorthand missing from site CLI docs.
Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 1 |
| Low | 2 |
Overall Assessment: Approve
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
8f51dd0 to
b9f0f01
Compare
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #92 — feat: Add eval coverage grid generator
✅ What Looks Good
- Discovery logic — properly walks
skills/,.github/skills/, and custom--pathdirectories with dedup - Eval inference — checks skill-adjacent dirs and infers skill name from eval path
- Error surfacing — collects all parse failures, reports together (not fail-on-first)
- Three output formats — text, markdown, JSON all tested
- Test coverage — 8 test functions covering no evals, partial+full, yml, parse errors, format validation
Findings Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 1 |
| Low | 1 |
| Total | 2 |
Overall Assessment: Approve — well-implemented command. Medium finding is a documentation suggestion.
- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR #92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed wbreza review feedback in 2e6a818:
|
- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR microsoft#92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
cmd/waza/cmd_coverage.go:253
- Eval discovery currently looks under
{root}/evalsand a few skill-relative locations, but it does not consider the common single-skill layout{root}/eval/eval.yamlshown in README’s “Expected Skill Structure”. Ifcoverageis intended to work on those repos, it should include{root}/evalas a default search root and/or addeval/eval.yamlto the per-skill candidate list (in addition to the existingevals/locations).
func discoverEvalFiles(root string, skillPaths map[string]string, discoverPaths []string) ([]string, error) {
searchRoots := []string{filepath.Join(root, "evals")}
for _, p := range discoverPaths {
searchRoots = append(searchRoots, resolvePath(root, p))
}
You can also share your feedback on Copilot code review. Take the survey.
cmd/waza/cmd_coverage.go
Outdated
| Skill string `yaml:"skill"` | ||
| Tasks []string `yaml:"tasks"` | ||
| Graders []models.GraderConfig `yaml:"graders"` |
There was a problem hiding this comment.
evalSpecLite doesn’t include tasks_from, but buildCoverageReport uses len(spec.Tasks) to decide whether a skill has tasks. This will misclassify eval specs that define tasks via tasks_from (common in this repo’s eval spec docs) as having 0 tasks and therefore only “Partial” coverage. Consider adding TasksFrom string yaml:"tasks_from,omitempty"`` to the parsed struct and treating tasks_from != "" as “has tasks” for coverage purposes (and task counting, if you keep the count field).
| Skill string `yaml:"skill"` | |
| Tasks []string `yaml:"tasks"` | |
| Graders []models.GraderConfig `yaml:"graders"` | |
| Skill string `yaml:"skill"` | |
| Tasks []string `yaml:"tasks"` | |
| TasksFrom string `yaml:"tasks_from,omitempty"` | |
| Graders []models.GraderConfig `yaml:"graders"` |
wbreza
left a comment
There was a problem hiding this comment.
No significant issues found in the reviewed changes.
Re-Review Summary
I performed a thorough code review of PR #92 focusing on changes since the prior approval on 2026-03-10T17:31:36Z. The PR adds an �val coverage command that generates coverage grids for discovered skills.
Review Scope
- Reviewed 8 commits pushed since prior approval
- Examined the complete diff covering cmd_coverage.go, cmd_coverage_test.go, documentation updates, and command registration
- Verified build succeeds without errors
- Confirmed all tests pass (6/6 coverage tests passing)
- Ran go vet with no issues found
Analysis Performed
- Logic verification: Traced through the coverage classification algorithm (lines 166-191) - correctly categorizes skills as Full/Partial/None based on tasks and grader count
- Error handling: Verified proper error wrapping and fallback paths for file I/O operations
- Edge cases: Checked handling of empty grader lists, missing skill names, invalid YAML, and filepath.Abs failures
- Concurrency: No race conditions (single-threaded execution)
- Nil safety: No nil pointer dereferences found
Notable Design Decisions (Intentional, Not Bugs)
- Task counting across multiple eval files for the same skill is additive - acknowledged in prior review as approximation for the common case
- Markdown output omits coverage percentage summary (text format includes it) - confirmed intentional by test assertions
- asks_from counts as 1 task rather than resolving actual count - documented limitation, appropriate for lightweight discovery
Recommendation
APPROVE - The code is production-ready. All prior review feedback has been addressed, error handling is comprehensive, and tests provide good coverage of the functionality.
wbreza
left a comment
There was a problem hiding this comment.
Re-review complete — new commits since prior approval look good. No issues found. ✅
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR microsoft#92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Validate root path is a directory, not just exists - Update help text to mention both eval.yaml and eval.yml - Update CLI docs to reference eval.yaml/eval.yml consistently - Add test for file-path rejection Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
59b1c7d to
a55c52a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
cmd/waza/cmd_coverage.go:25
- For
--format json, thecoveragefield includes emoji-prefixed strings ("✅ Full", "⚠️ Partial", "❌ None"). That makes machine consumption harder and couples API output to presentation. Consider emitting a stable enum/string like "full"/"partial"/"none" (and optionally separatecoverage_label/iconfields), while keeping emoji only in text/markdown renderers.
type coverageSkillRow struct {
Skill string `json:"skill"`
Tasks int `json:"tasks"`
Graders []string `json:"graders"`
Coverage string `json:"coverage"`
}
You can also share your feedback on Copilot code review. Take the survey.
wbreza
left a comment
There was a problem hiding this comment.
Re-review complete. All 7 new commits since prior approval address review feedback: root path validation, tasks_from handling, coverage percentage fix, documentation, and formatting. Changes look good. ✅
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR microsoft#92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Validate root path is a directory, not just exists - Update help text to mention both eval.yaml and eval.yml - Update CLI docs to reference eval.yaml/eval.yml consistently - Add test for file-path rejection Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eval specs using tasks_from instead of inline tasks were misclassified as Partial coverage. Now tasks_from is parsed and treated as having tasks for coverage purposes. Updated docs to clarify both forms qualify. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Parse failures now warn to stderr instead of aborting the report, making waza coverage usable in repos with broken eval files. - Use tabwriter placeholders for emoji to fix column alignment. - Updated test to match new warn-not-error behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a55c52a to
10f3a73
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| } | ||
| var sk skill.Skill | ||
| if err := sk.UnmarshalText(data); err != nil { | ||
| return "" |
There was a problem hiding this comment.
We should definitely not eat the error here. Let's return and fail the command
| Skills []coverageSkillRow `json:"skills"` | ||
| } | ||
|
|
||
| type evalSpecLite struct { |
There was a problem hiding this comment.
Let's just use the BenchmarkConfig and not have multiple copies of the same data structure floating around.
richardpark-msft
left a comment
There was a problem hiding this comment.
I think there's some places that could re-use common code, and I've left a few review comments that are worth addressing.
| } | ||
| for _, g := range spec.Graders { | ||
| kind := strings.TrimSpace(string(g.Kind)) | ||
| if kind != "" { |
There was a problem hiding this comment.
Let's error out on this - there's no reason to have a grader without a kind.
| } | ||
| if len(parseFailures) > 0 { | ||
| sort.Strings(parseFailures) | ||
| fmt.Fprintf(os.Stderr, "warning: failed to parse %d eval file(s): %s\n", len(parseFailures), strings.Join(parseFailures, "; ")) |
There was a problem hiding this comment.
I don't think this is a warning at this point - I would just quit out. Any results you get are going to be skewed by the fact that not all the evals were parsed, etc...
|
|
||
| report := &coverageReport{ | ||
| TotalSkills: len(skillNames), | ||
| Skills: make([]coverageSkillRow, 0, len(skillNames)), |
There was a problem hiding this comment.
| Skills: make([]coverageSkillRow, 0, len(skillNames)), |
It's okay for Skills to be 'nil'. The append below works properly with it.
| switch { | ||
| case !hasEval: | ||
| report.Uncovered++ | ||
| case tasks > 0 && len(graders) >= 2: |
There was a problem hiding this comment.
Some graders can be rather complex, like Program graders. I think we could consider a skill fully covered, even with just a single grader.
|
|
||
| for _, name := range skillNames { | ||
| graderSet := gradersBySkill[name] | ||
| graders := sortedKeys(graderSet) |
There was a problem hiding this comment.
| graders := sortedKeys(graderSet) | |
| graders := slices.Sorted(maps.Keys(graderSet)) |
| gradersBySkill := make(map[string]map[string]struct{}) | ||
| var parseFailures []string | ||
|
|
||
| evalPaths, err := discoverEvalFiles(absRoot, skillPaths, discoverPaths) |
There was a problem hiding this comment.
I think resolveSpecPaths does this already - can you re-use it? It's in cmd_run.go, so same package.
|
Hey @richardpark-msft — all 8 review items addressed in 0ed6510:
Also rebased on main. All tests passing. Net: −37 lines. |
Closes #82