feat: Add eval coverage grid generator by spboyer · Pull Request #92 · microsoft/waza

spboyer · 2026-03-05T01:56:26Z

Closes #82

codecov-commenter · 2026-03-05T02:00:11Z

Codecov Report

❌ Patch coverage is 67.42424% with 86 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@d1b8c25). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
cmd/waza/cmd_coverage.go	67.30%	64 Missing and 22 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #92   +/-   ##
=======================================
  Coverage        ?   73.27%           
=======================================
  Files           ?      133           
  Lines           ?    15495           
  Branches        ?        0           
=======================================
  Hits            ?    11354           
  Misses          ?     3302           
  Partials        ?      839

Flag	Coverage Δ
go-implementation	`73.27% <67.42%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Adds a new waza coverage CLI command to generate an eval-coverage “grid” across discovered skills, with supporting docs and tests.

Changes:

Introduces waza coverage [root] with text, markdown, and json output.
Implements skill/eval discovery and a coverage classification (none/partial/full) based on tasks + grader types.
Updates CLI reference docs and README, and adds unit tests for report building + markdown rendering.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
site/src/content/docs/reference/cli.mdx	Documents the new `waza coverage` command, flags, and examples.
cmd/waza/root.go	Registers the new `coverage` subcommand on the root CLI.
cmd/waza/cmd_coverage.go	Implements discovery, report generation, and output renderers (text/markdown/json).
cmd/waza/cmd_coverage_test.go	Adds unit tests for report classification, markdown output, and command registration.
README.md	Adds `waza coverage` usage example and CLI reference entry.
.squad/log/2026-03-05T00-36-issue-assignment-pipeline.md	Adds team process log (non-functional).
.squad/log/2026-03-05T00-26-rusty-token-diff-design.md	Adds team process log (non-functional).
.squad/decisions.md	Records team workflow/design decisions (non-functional).

cmd/waza/cmd_coverage.go

spboyer

Verified by Rusty (Opus 4.6) — LGTM ✅

Clean eval coverage grid generator:

New \waza coverage\ command with text/markdown/json output
Smart skill/eval discovery with deduplication, hidden dir skipping
Coverage classification (Full/Partial/None) is conservative and correct
Tests cover no-eval, partial/full, markdown rendering, root command integration
Docs updated: README, CLI reference
CI green on ubuntu + windows + lint

Minor: \�valSpecLite.Tasks\ is []string\ while real eval YAML has structured task objects — means task count defaults to 0, showing Partial instead of Full for real evals. Conservative for a reporting tool. Worth fixing to []any\ in a follow-up.

Note: Can't self-approve via API. Setting auto-merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

README.md

cmd/waza/cmd_coverage.go

site/src/content/docs/reference/cli.mdx

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

microsoft#56) (microsoft#92)

wbreza

Code Review: PR #92 - feat: Add eval coverage grid generator

What Looks Good

Clean architecture - discovery, parsing, classification, rendering well-separated
Parse failures properly reported (not silently dropped)
Both eval.yaml and eval.yml supported
README and CLI reference updated with correct flag names
Path deduplication via seenPaths map

Suggestions (non-blocking)

isDir/isFile helpers duplicate workspace utilities. Consider importing or extracting.
No test for JSON output format.
-f shorthand missing from site CLI docs.

Summary

Priority	Count
Critical	0
High	0
Medium	1
Low	2

Overall Assessment: Approve

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

cmd/waza/cmd_coverage.go

wbreza

Code Review: PR #92 — feat: Add eval coverage grid generator

✅ What Looks Good

Discovery logic — properly walks skills/, .github/skills/, and custom --path directories with dedup
Eval inference — checks skill-adjacent dirs and infers skill name from eval path
Error surfacing — collects all parse failures, reports together (not fail-on-first)
Three output formats — text, markdown, JSON all tested
Test coverage — 8 test functions covering no evals, partial+full, yml, parse errors, format validation

Findings Summary

Priority	Count
Critical	0
High	0
Medium	1
Low	1
Total	2

Overall Assessment: Approve — well-implemented command. Medium finding is a documentation suggestion.

cmd/waza/cmd_coverage.go

- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR #92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer · 2026-03-09T22:07:06Z

Addressed wbreza review feedback in 2e6a818:

CoveragePct now counts only fully covered skills (was including Partial in the numerator)
Added inline comment documenting the Full coverage threshold (≥2 grader types)
Updated summary line to clarify 'fully covered' count

- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR microsoft#92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

cmd/waza/cmd_coverage.go:253

Eval discovery currently looks under {root}/evals and a few skill-relative locations, but it does not consider the common single-skill layout {root}/eval/eval.yaml shown in README’s “Expected Skill Structure”. If coverage is intended to work on those repos, it should include {root}/eval as a default search root and/or add eval/eval.yaml to the per-skill candidate list (in addition to the existing evals/ locations).

func discoverEvalFiles(root string, skillPaths map[string]string, discoverPaths []string) ([]string, error) {
	searchRoots := []string{filepath.Join(root, "evals")}
	for _, p := range discoverPaths {
		searchRoots = append(searchRoots, resolvePath(root, p))
	}

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-11T18:07:50Z

cmd/waza/cmd_coverage.go

+	Skill   string                `yaml:"skill"`
+	Tasks   []string              `yaml:"tasks"`
+	Graders []models.GraderConfig `yaml:"graders"`


evalSpecLite doesn’t include tasks_from, but buildCoverageReport uses len(spec.Tasks) to decide whether a skill has tasks. This will misclassify eval specs that define tasks via tasks_from (common in this repo’s eval spec docs) as having 0 tasks and therefore only “Partial” coverage. Consider adding TasksFrom string yaml:"tasks_from,omitempty"`` to the parsed struct and treating tasks_from != "" as “has tasks” for coverage purposes (and task counting, if you keep the count field).

Suggested change

Skill string `yaml:"skill"`

Tasks []string `yaml:"tasks"`

Graders []models.GraderConfig `yaml:"graders"`

Skill string `yaml:"skill"`

Tasks []string `yaml:"tasks"`

TasksFrom string `yaml:"tasks_from,omitempty"`

Graders []models.GraderConfig `yaml:"graders"`

README.md

site/src/content/docs/reference/cli.mdx

wbreza

No significant issues found in the reviewed changes.

Re-Review Summary

I performed a thorough code review of PR #92 focusing on changes since the prior approval on 2026-03-10T17:31:36Z. The PR adds an �val coverage command that generates coverage grids for discovered skills.

Review Scope

Reviewed 8 commits pushed since prior approval
Examined the complete diff covering cmd_coverage.go, cmd_coverage_test.go, documentation updates, and command registration
Verified build succeeds without errors
Confirmed all tests pass (6/6 coverage tests passing)
Ran go vet with no issues found

Analysis Performed

Logic verification: Traced through the coverage classification algorithm (lines 166-191) - correctly categorizes skills as Full/Partial/None based on tasks and grader count
Error handling: Verified proper error wrapping and fallback paths for file I/O operations
Edge cases: Checked handling of empty grader lists, missing skill names, invalid YAML, and filepath.Abs failures
Concurrency: No race conditions (single-threaded execution)
Nil safety: No nil pointer dereferences found

Notable Design Decisions (Intentional, Not Bugs)

Task counting across multiple eval files for the same skill is additive - acknowledged in prior review as approximation for the common case
Markdown output omits coverage percentage summary (text format includes it) - confirmed intentional by test assertions
asks_from counts as 1 task rather than resolving actual count - documented limitation, appropriate for lightweight discovery

Recommendation

APPROVE - The code is production-ready. All prior review feedback has been addressed, error handling is comprehensive, and tests provide good coverage of the functionality.

wbreza

Re-review complete — new commits since prior approval look good. No issues found. ✅

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR microsoft#92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Validate root path is a directory, not just exists - Update help text to mention both eval.yaml and eval.yml - Update CLI docs to reference eval.yaml/eval.yml consistently - Add test for file-path rejection Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

cmd/waza/cmd_coverage.go:25

For --format json, the coverage field includes emoji-prefixed strings ("✅ Full", "⚠️ Partial", "❌ None"). That makes machine consumption harder and couples API output to presentation. Consider emitting a stable enum/string like "full"/"partial"/"none" (and optionally separate coverage_label/icon fields), while keeping emoji only in text/markdown renderers.

type coverageSkillRow struct {
	Skill    string   `json:"skill"`
	Tasks    int      `json:"tasks"`
	Graders  []string `json:"graders"`
	Coverage string   `json:"coverage"`
}

You can also share your feedback on Copilot code review. Take the survey.

cmd/waza/cmd_coverage.go

wbreza

Re-review complete. All 7 new commits since prior approval address review feedback: root path validation, tasks_from handling, coverage percentage fix, documentation, and formatting. Changes look good. ✅

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- CoveragePct now counts only 'Full' (≥2 grader types) skills, not Partial - Add comment clarifying that Full requires tasks + multiple grader types - Update summary line to say 'fully covered' Addresses wbreza review feedback on PR microsoft#92. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Validate root path is a directory, not just exists - Update help text to mention both eval.yaml and eval.yml - Update CLI docs to reference eval.yaml/eval.yml consistently - Add test for file-path rejection Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Eval specs using tasks_from instead of inline tasks were misclassified as Partial coverage. Now tasks_from is parsed and treated as having tasks for coverage purposes. Updated docs to clarify both forms qualify. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Parse failures now warn to stderr instead of aborting the report, making waza coverage usable in repos with broken eval files. - Use tabwriter placeholders for emoji to fix column alignment. - Updated test to match new warn-not-error behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

richardpark-msft · 2026-03-17T00:48:14Z

cmd/waza/cmd_coverage.go

+	}
+	var sk skill.Skill
+	if err := sk.UnmarshalText(data); err != nil {
+		return ""


We should definitely not eat the error here. Let's return and fail the command

richardpark-msft · 2026-03-17T00:49:17Z

cmd/waza/cmd_coverage.go

+	Skills      []coverageSkillRow `json:"skills"`
+}
+
+type evalSpecLite struct {


Let's just use the BenchmarkConfig and not have multiple copies of the same data structure floating around.

richardpark-msft

I think there's some places that could re-use common code, and I've left a few review comments that are worth addressing.

richardpark-msft · 2026-03-17T00:51:24Z

cmd/waza/cmd_coverage.go

+		}
+		for _, g := range spec.Graders {
+			kind := strings.TrimSpace(string(g.Kind))
+			if kind != "" {


Let's error out on this - there's no reason to have a grader without a kind.

richardpark-msft · 2026-03-17T00:56:48Z

cmd/waza/cmd_coverage.go

+	}
+	if len(parseFailures) > 0 {
+		sort.Strings(parseFailures)
+		fmt.Fprintf(os.Stderr, "warning: failed to parse %d eval file(s): %s\n", len(parseFailures), strings.Join(parseFailures, "; "))


I don't think this is a warning at this point - I would just quit out. Any results you get are going to be skewed by the fact that not all the evals were parsed, etc...

richardpark-msft · 2026-03-17T00:57:28Z

cmd/waza/cmd_coverage.go

+
+	report := &coverageReport{
+		TotalSkills: len(skillNames),
+		Skills:      make([]coverageSkillRow, 0, len(skillNames)),


Suggested change

Skills: make([]coverageSkillRow, 0, len(skillNames)),

It's okay for Skills to be 'nil'. The append below works properly with it.

richardpark-msft · 2026-03-17T00:59:21Z

cmd/waza/cmd_coverage.go

+		switch {
+		case !hasEval:
+			report.Uncovered++
+		case tasks > 0 && len(graders) >= 2:


Some graders can be rather complex, like Program graders. I think we could consider a skill fully covered, even with just a single grader.

richardpark-msft · 2026-03-17T01:04:13Z

cmd/waza/cmd_coverage.go

+
+	for _, name := range skillNames {
+		graderSet := gradersBySkill[name]
+		graders := sortedKeys(graderSet)


Suggested change

graders := sortedKeys(graderSet)

graders := slices.Sorted(maps.Keys(graderSet))

richardpark-msft · 2026-03-17T01:18:23Z

cmd/waza/cmd_coverage.go

+	gradersBySkill := make(map[string]map[string]struct{})
+	var parseFailures []string
+
+	evalPaths, err := discoverEvalFiles(absRoot, skillPaths, discoverPaths)


I think resolveSpecPaths does this already - can you re-use it? It's in cmd_run.go, so same package.

spboyer · 2026-03-19T21:06:43Z

Hey @richardpark-msft — all 8 review items addressed in 0ed6510:

Don't eat errors in parseSkillName — now returns (string, error), propagated through discoverSkillFiles
Use BenchmarkSpec — removed evalSpecLite, using models.BenchmarkSpec directly
Error on graders without a kind — returns error: "grader %q in %s has no type"
Hard error on parse failures — no more warnings, returns immediately on parse error
Drop Skills pre-allocation — removed make(), nil slice + append works fine
Full coverage at ≥1 grader — threshold lowered from >= 2 to >= 1
Use slices.Sorted(maps.Keys()) — replaced custom sortedKeys with stdlib
Remove resolvePath/isDir/isFile helpers — inlined filepath.Join and os.Stat calls

Also rebased on main. All tests passing. Net: −37 lines.

Copilot AI review requested due to automatic review settings March 5, 2026 01:56

spboyer requested review from chlowell and richardpark-msft as code owners March 5, 2026 01:56

spboyer self-assigned this Mar 5, 2026

github-actions bot enabled auto-merge (squash) March 5, 2026 01:57

Copilot started reviewing on behalf of spboyer March 5, 2026 01:57 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

cmd/waza/cmd_coverage.go Show resolved Hide resolved

cmd/waza/cmd_coverage.go Outdated Show resolved Hide resolved

cmd/waza/cmd_coverage.go Show resolved Hide resolved

spboyer commented Mar 5, 2026

View reviewed changes

spboyer force-pushed the squad/82-eval-coverage-grid branch from 40b1f4c to e1365f2 Compare March 5, 2026 17:12

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address review feedback on PR #92

a6ca52d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 5, 2026 17:42

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address review feedback on PR #92

f3aa0c4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer force-pushed the squad/82-eval-coverage-grid branch from a6ca52d to f3aa0c4 Compare March 5, 2026 17:46

Copilot AI reviewed Mar 5, 2026

View reviewed changes

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address PR #92 coverage command review comments

8f51dd0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chlowell pushed a commit to chlowell/waza that referenced this pull request Mar 5, 2026

feat: implement result interpretation and CLI invocation (microsoft#55,

b992cea

microsoft#56) (microsoft#92)

wbreza approved these changes Mar 5, 2026

View reviewed changes

spboyer added a commit to spboyer/waza-fk that referenced this pull request Mar 6, 2026

fix: address review feedback on PR microsoft#92

e0e6c38

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer added a commit to spboyer/waza-fk that referenced this pull request Mar 6, 2026

fix: address PR microsoft#92 coverage command review comments

b9f0f01

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 6, 2026 00:04

spboyer force-pushed the squad/82-eval-coverage-grid branch from 8f51dd0 to b9f0f01 Compare March 6, 2026 00:04

Copilot started reviewing on behalf of spboyer March 6, 2026 00:06 View session

Copilot AI reviewed Mar 6, 2026

View reviewed changes

cmd/waza/cmd_coverage.go Show resolved Hide resolved

cmd/waza/cmd_coverage.go Show resolved Hide resolved

cmd/waza/cmd_coverage.go Show resolved Hide resolved

cmd/waza/cmd_coverage.go Show resolved Hide resolved

wbreza previously approved these changes Mar 6, 2026

View reviewed changes

cmd/waza/cmd_coverage.go Show resolved Hide resolved

cmd/waza/cmd_coverage.go Show resolved Hide resolved

spboyer dismissed wbreza’s stale review via ef01501 March 9, 2026 22:24

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: address review feedback on PR #92

8c0e8e8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot started reviewing on behalf of spboyer March 11, 2026 18:03 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

spboyer dismissed wbreza’s stale review via 59b1c7d March 11, 2026 18:17

wbreza reviewed Mar 11, 2026

View reviewed changes

wbreza approved these changes Mar 11, 2026

View reviewed changes

spboyer added a commit to spboyer/waza-fk that referenced this pull request Mar 11, 2026

fix: address review feedback on PR microsoft#92

bf66b68

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer added a commit to spboyer/waza-fk that referenced this pull request Mar 11, 2026

fix: address PR microsoft#92 coverage command review comments

0a37389

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 11, 2026 20:26

spboyer force-pushed the squad/82-eval-coverage-grid branch from 59b1c7d to a55c52a Compare March 11, 2026 20:26

Copilot started reviewing on behalf of spboyer March 11, 2026 20:27 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

cmd/waza/cmd_coverage.go Show resolved Hide resolved

cmd/waza/cmd_coverage.go Outdated Show resolved Hide resolved

cmd/waza/cmd_coverage.go Show resolved Hide resolved

wbreza previously approved these changes Mar 11, 2026

View reviewed changes

spboyer and others added 9 commits March 12, 2026 06:34

feat: add eval coverage grid generator microsoft#82

d3eb11e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address review feedback on PR microsoft#92

db26aaf

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address PR microsoft#92 coverage command review comments

82427c1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

docs: Document waza coverage levels and percentage calculation

972722c

fix: gofmt formatting in cmd_coverage.go

51f1878

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer dismissed wbreza’s stale review via 10f3a73 March 12, 2026 13:35

spboyer force-pushed the squad/82-eval-coverage-grid branch from a55c52a to 10f3a73 Compare March 12, 2026 13:35

spboyer added a commit that referenced this pull request Mar 12, 2026

fix: address review feedback on PR #92

a027139

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

richardpark-msft reviewed Mar 17, 2026

View reviewed changes

	graders := sortedKeys(graderSet)
	graders := slices.Sorted(maps.Keys(graderSet))

Conversation

spboyer commented Mar 5, 2026

Uh oh!

codecov-commenter commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

spboyer left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Code Review: PR #92 - feat: Add eval coverage grid generator

What Looks Good

Suggestions (non-blocking)

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Code Review: PR #92 — feat: Add eval coverage grid generator

✅ What Looks Good

Findings Summary

Uh oh!

Uh oh!

Uh oh!

spboyer commented Mar 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Re-Review Summary

Review Scope

Analysis Performed

Notable Design Decisions (Intentional, Not Bugs)

Recommendation

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

codecov-commenter commented Mar 5, 2026 •

edited

Loading