Releases: microsoft/waza
Waza v0.22.0
What's Changed
- waza check uses configured token limits from .waza.yaml by @chlowell in #124
- Add `grade` verb for grading run outcomes by @chlowell in #102
- Fixing skill loading to use .waza.yaml by @richardpark-msft in #131
- Removing an `err != nil` check that we don't need since `errors.Is()` works great by @richardpark-msft in #138
- Apply timeout configuration to `waza run` by @chlowell in #130
Full Changelog: v0.21.0...v0.22.0
Waza azd Extension v0.22.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.21.0] - 2026-03-12
Added
- `waza new task from-prompt` command — Record Copilot sessions into task YAML files for eval creation (#110)
- Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
- Eval scaffolding command — `waza eval new` generates eval.yaml scaffolding for skills (#94)
- Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
- Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
- Per-file token budget configuration — Configure token budgets per-file in `.waza.yaml` (#96)
- Skill-aware thresholds — `waza tokens compare` supports skill-specific threshold configuration (#93)
- Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
- CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
- FileWriter service — Refactored `waza init` inventory with FileWriter abstraction (#63)
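The two `.waza.yaml` entries above (#96, #93) can be pictured with a small config sketch. The key names below are illustrative assumptions, not the documented schema — check the waza docs for the real field layout:

```yaml
# Hypothetical .waza.yaml sketch — key names are assumptions.
tokens:
  default_limit: 4000        # fallback budget for files without an entry
  files:
    SKILL.md: 1500           # per-file budget (#96)
    references/api.md: 2500
  skills:
    my-skill:
      threshold: 0.10        # skill-specific compare threshold (#93)
```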
Fixed
- `waza suggest` deadlock — `Execute()` now applies the request timeout before calling `Start()`, preventing goroutine deadlock (#43)
- `ResourceFile.Content` type — Changed from `string` to `[]byte` for proper binary file handling (#117)
- `tokens compare` in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
- `--output-dir` ignored — Fixed `--output-dir` having no effect for single-skill runs (#109)
- Web dashboard build order — Build dashboard assets before Go compilation (#107)
- Test file leak — Fixed test that leaked files into the repo (#120)
- Config schema defaults — Aligned `config.schema.json` defaults with Go source of truth (#65)
- Skill discovery path — Discover skills under `.github/skills/` directory (#69)
Changed
- Custom YAML deserializers for config types (#106)
- Token limits priority inverted to `.waza.yaml` first (#64)
- `@wbreza` added to CODEOWNERS (#111)
- Go 1.26+ noted in agent instruction files (#108)
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks eval coverage (#392)
- Releases page — New docs site page at `reference/releases` with platform download links, install commands, and azd extension info (#383)
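The `tool_constraint` grader's field names come from the changelog entry above; how they nest inside an eval file is an assumption in this sketch:

```yaml
# Sketch of a tool_constraint grader (#391). Field names are from the
# changelog; the surrounding eval.yaml structure is assumed.
graders:
  - type: tool_constraint
    expect_tools: [read_file, edit_file]   # must be called
    reject_tools: [delete_file]            # must never be called
    max_turns: 12
    max_tokens: 20000
```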
Fixed
- Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings
Changed
- Competitive research — Added OpenAI Evals analysis (`docs/research/waza-vs-openai-evals.md`), skill-validator analysis (`docs/research/waza-vs-skill-validator.md`), and eval registry design doc (`docs/research/waza-eval-registry-design.md`)
- Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md
[0.8.0] - 2026-02-21
Added
- MCP Server — `waza serve` now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
- `waza suggest` command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: `--model`, `--dry-run`, `--apply`, `--output-dir`, `--format` (#287)
- Interactive workflow skill — `skills/waza-interactive/SKILL.md` with 5 workflow scenarios for conversational eval orchestration (#288)
- Grader weighting — `weight` field on grader configs, `ComputeWeightedRunScore` method, dashboard weighted scores column (#299)
- Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
- Judge model support — `--judge-model` flag and `judge_model` config for separate LLM-as-judge model (#309)
- Spec compliance checks — 8 agentskills.io compliance checks in `waza check` and `waza dev` (#314)
- SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
- MCP integration scoring — 4 MCP integration checks in `waza dev` (#316)
- Batch skill processing — `waza dev` processes multiple skills in one run (#317)
- Token compare `--strict` — Budget enforcement mode for `waza tokens compare` (#318)
- Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
- Skill profile — `waza tokens profile` for static analysis of skill token distribution (#311)
- JUnit XML reporter — `--format junit` output for CI integration (#312)
- Template Variables — New `internal/template` package with `Render()` for Go text/template syntax in hooks and commands. System variables: `JobID`, `TaskName`, `Iteration`, `Attempt`, `Timestamp`. User variables via `vars` map (#186)
- GroupBy Results — New `group_by` config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes `GroupStats` with name/passed/total/avg_score (#188)
- Custom Input Variables — New `inputs` section in eval.yaml for defining key-value pairs available as `{{.Vars.key}}` throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
- CSV Dataset Support — New `tasks_from` field to generate tasks from CSV files. Each row becomes a task with columns accessible as `{{.Vars.column}}`. Optional `range: [start, end]` for row filtering. First row treated as headers (#187)
- Retry/Attempts — Add `max_attempts` config field for retrying failed task executions within each trial (#191)
- Lifecycle Hooks — Add `hooks` section with `before_run`/`after_run`/`before_task`/`after_task` lifecycle points (#191)
- `prompt` grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
  - Two modes: `clean` (fresh context) and `continue_session` (resumes test session)
  - Tool-based grading: `set_waza_grade_pass` and `set_waza_grade_fail` tools for LLM graders
  - Separate judge model configuration: run evaluation with a different model than the executor
  - Pre-built rubric templates adapted from Azure ML evaluators
- `trigger_tests.yaml` auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
  - New `internal/trigger/` package for trigger testing
  - Automatically discovered alongside `eval.yaml`
  - Confidence weighting: `high` (weight 1.0) and `medium` (weight 0.5) for borderline cases
  - `trigger_accuracy` metric with configurable cutoff threshold
  - Metrics: accuracy, precision, recall, F1, error count
- New `diff` grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
- Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in `examples/rubrics/` adapted from Azure ML evaluators (#160, #161):
  - Tool call rubrics: `tool_call_accuracy`, `tool_selection`, `tool_input_accuracy`, `tool_output_utilization`
  - Task evaluation rubrics: `task_completion`, `task_adherence`, `intent_resolution`, `response_completeness`
- MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
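Several of the 0.8.0 config additions compose in a single eval file. The field names in this sketch (`inputs`, `tasks_from`, `range`, `max_attempts`, `group_by`, `hooks`, `weight`, `mode`) are all taken from the entries above, but the overall eval.yaml layout is an assumption:

```yaml
# Sketch only — field names from the changelog, layout assumed.
inputs:                      # custom variables (#189), used as {{.Vars.region}}
  region: eastus
tasks_from: cases.csv        # CSV rows become tasks (#187)
range: [0, 50]               # optional row filter
max_attempts: 2              # retry failed executions within a trial (#191)
group_by: model              # grouped result stats (#188)
hooks:                       # lifecycle hooks (#191)
  before_run: ["./setup.sh"]
  after_task: ["echo finished {{.TaskName}}"]
graders:
  - type: prompt             # LLM-as-judge grader (#177)
    mode: clean              # or continue_session
    weight: 2.0              # grader weighting (#299)
```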
Changed
- Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
- Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)
Fixed
- install.sh macOS checksum — added `shasum -a 256` fallback for macOS (which lacks `sha256sum`) (#163)
- Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
- GitHub icon alignment and search bar width on docs site
[0.4.0-alpha.1] - 2026-02-17
Added
- Go cross-platform release pipeline — `go-release.yml` workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
- `install.sh` installer — one-line binary install with checksum verification: `curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash`
- `skill_invocation` grader — validates orchestration workflows by checking which skills were invoked (#146)
- `required_skills` preflight validation — verifies skill dependencies before evaluation (#147)
- Multi-model `--model` flag — run evaluations across multiple models in a single command (#39)
- `waza check` command — skill submission readiness checks (#151)
- Evaluation result caching — incremental testing with cache invalidation (#150)
- GitHub PR comment reporter — post eval results as PR comments (#140)
- Skills CI integration — GitHub Actions workflow for microsoft/skills (#141)
Fixed
- Engine shutdown leak — `runSingleModel()` now calls `engine.Shutdown(context.Background())` via defer after engine creation (#153, #154)
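The defer pattern in this fix can be sketched in Go. The `engine` type and constructor below are stand-ins, since the release notes don't show waza's real interfaces:

```go
package main

import (
	"context"
	"fmt"
)

// engine is a stand-in for waza's engine type; the real interface is
// not shown in the release notes, so this sketch is illustrative only.
type engine struct{ shutdownCalled bool }

func newEngine() (*engine, error) { return &engine{}, nil }

func (e *engine) Shutdown(ctx context.Context) { e.shutdownCalled = true }

// runSingleModel sketches the #153/#154 fix: defer Shutdown immediately
// after a successful engine creation, so every return path (including
// later error returns) releases the engine instead of leaking it.
func runSingleModel() (*engine, error) {
	eng, err := newEngine()
	if err != nil {
		return nil, err
	}
	defer eng.Shutdown(context.Background())
	// ... run the evaluation for a single model ...
	return eng, nil
}

func main() {
	eng, err := runSingleModel()
	fmt.Println(err == nil, eng.shutdownCalled)
}
```

Deferring right after the error check is the idiomatic Go way to guarantee cleanup without repeating Shutdown calls at every return site.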
Changed
- Python release deprecated — the Python release workflow is no longer maintained; Go binaries are the official distribution
- First Go binary release — v0.4.0-alpha.1 is the first release distributed as pre-built binaries
Waza v0.21.0
What's Changed
- Copilot SDK usage display for waza run by @chlowell in #47
- Fix broken dashboard-explore doc images across deployment bases by @Copilot in #50
- Fix Docker build by @chlowell in #51
- Working around an issue in the copilot SDK, Start() and contexts by @richardpark-msft in #53
- fix: Standardize emoji spacing in waza check display by @Copilot in #45
- fix: regression test + changelog for waza suggest deadlock by @Copilot in #43
- fix: make site base path configurable + remove unused workflow by @spboyer in #56
- fix: repair broken test blocking all PR CI by @spboyer in #67
- feat: add FileWriter service and refactor waza init inventory #48 by @wbreza in #63
- chore(deps): Bump svgo from 4.0.0 to 4.0.1 in /site by @dependabot[bot] in #88
- chore: add MIT LICENSE file by @spboyer in #99
- fix: discover skills under .github/skills/ directory by @spboyer in #69
- feat: invert token limits priority to .waza.yaml first #59 by @wbreza in #64
- fix: align config.schema.json defaults with Go source of truth #57 by @wbreza in #65
- docs: Add CI/CD integration guide (GitHub Actions, Azure DevOps) by @spboyer in #100
- fix: update docs link to GitHub Pages URL by @spboyer in #87
- chore: update repo refs + disable heartbeat auto-triggers by @spboyer in #101
- feat: Add skill-aware thresholds to waza tokens compare by @spboyer in #93
- feat: Per-file token budget configuration in .waza.yaml by @spboyer in #96
- feat: Snapshot auto-update workflow for diff grader by @spboyer in #95
- feat: Add multi-trial flakiness detection for evals by @spboyer in #103
- feat: Add eval scaffolding command (waza eval new) by @spboyer in #94
- Add in custom YAML deserializers for our config by @richardpark-msft in #106
- fix: Build web dashboard assets before Go compilation by @chlowell in #107
- docs: note Go 1.26+ in agent instruction files by @Copilot in #108
- Fix `--output-dir` having no effect for single-skill runs by @chlowell in #109
- feat: sensei scoring parity — WHEN: triggers, spec-security, Invalid level, advisory checks 16-18 by @spboyer in #79
- chore: add @wbreza to CODEOWNERS by @spboyer in #111
- feat: Add trigger heuristic grader by @spboyer in #90
- Adding in `wz new task from-prompt` by @richardpark-msft in #110
- Fix test that leaks files into the repo by @chlowell in #120
- chore(deps): Bump devalue from 5.6.3 to 5.6.4 in /site by @dependabot[bot] in #119
- Fix `tokens compare` in subdirectory showing all files as "added" by @chlowell in #105
- fix: change ResourceFile.Content from string to []byte by @Copilot in #117
- Release v0.21.0 by @spboyer in #122
Full Changelog: v0.1.0...v0.21.0
Waza azd Extension v0.21.0
Waza v0.12.0
Full Changelog: v0.11.0...v0.12.0
Waza azd Extension v0.12.0
[0.3.0] - 2026-02-13
Added
- Grader showcase examples demonstrating all grader types (#134)
- Reusable GitHub Actions workflow for waza evaluations (#132)
- Documentation for prompt and action_sequence grader types (#133)
- Documentation for `waza dev` command and compliance scoring (#131)
- Auto-loading of skills for testing (#129)
- Debug logging support (`--debug` flag) (#130)
Fixed
- Always output test run errors to help debug failures (#128)
- Include cwd as a skill folder when running waza (workspace fix)
Changed
- Exit codes for CI/CD integration: 0=success, 1=test failure, 2=config error (#135)
- Reordered azd-publish skill workflow steps (#127)
- Auto-merge bot registry PRs in release workflow
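The exit-code contract from #135 (0=success, 1=test failure, 2=config error) is what a CI wrapper would branch on. This is a generic shell sketch, not code from the waza repo; in a real pipeline you would capture the status with `waza run ...; status=$?`:

```shell
#!/bin/sh
# Map waza's documented exit codes (#135) to CI messages.
describe_status() {
  case "$1" in
    0) echo "success" ;;
    1) echo "test failure" ;;
    2) echo "config error" ;;
    *) echo "unknown" ;;
  esac
}

describe_status 2   # prints "config error"
```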
[0.2.1] - 2026-02-12
Added
- `waza dev` command for interactive skill development and testing (#117)
- Prerelease input to azd publish workflow
- CHANGELOG.md as release notes source for azd extension releases
- `waza generate --skill <name>` - Filter to specific skill when using `--repo` or `--scan`
Fixed
- Fixed azd extensions documentation link
- Corrected `azd ext source add` command syntax
- Branch release PR from origin/main to avoid workflow permission error (#121)
Changed
- Removed path filters from Go CI to unblock non-code PRs
- Removed auto-merge from azd publish PR workflow
- Added azd extension installation instructions to README
[0.2.0] - 2026-02-02
Added
- Skill Discovery (#3)
  - `waza generate --repo <org/repo>` - Scan GitHub repos for SKILL.md files
  - `waza generate --scan` - Scan local directory for skills
  - `waza generate --all` - Generate evals for all discovered skills (CI-friendly)
  - Interactive skill selection with checkboxes when not using `--all`
- GitHub Issue Creation (#3)
  - Post-run prompt to create GitHub issues with eval results
  - Options: create for failed tasks only, all tasks, or none
  - Issues include results table, failed task details, and suggestions
  - `--no-issues` ...
Waza v0.11.0
What's Changed
- Adding Microsoft SECURITY.MD by @microsoft-github-policy-service[bot] in #2
- chore(deps): Bump go.opentelemetry.io/otel/sdk from 1.38.0 to 1.40.0 by @dependabot[bot] in #4
- chore(deps): Bump rollup from 4.58.0 to 4.59.0 in /site by @dependabot[bot] in #5
New Contributors
- @microsoft-github-policy-service[bot] made their first contribution in #2
- @dependabot[bot] made their first contribution in #4
Full Changelog: v0.10.0...v0.11.0
Waza azd Extension v0.11.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without the skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on the `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks eval coverage (#392)
- Releases page — New docs site page at `reference/releases` with platform download links, install commands, and azd extension info (#383)
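To make the new grader concrete, here is a rough sketch of what a `tool_constraint` grader entry might look like in an eval config. The field names (`expect_tools`, `reject_tools`, `max_turns`, `max_tokens`) come from the notes above; the surrounding structure, tool names, and values are illustrative assumptions, not the documented schema.

```yaml
graders:
  - type: tool_constraint    # grader type added in #391
    expect_tools:            # tools the agent is expected to call (illustrative names)
      - read_file
      - edit_file
    reject_tools:            # tools that must not be called
      - delete_file
    max_turns: 12            # illustrative limits, not defaults
    max_tokens: 20000
```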
Fixed
- Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings
Changed
- Competitive research — Added OpenAI Evals analysis (`docs/research/waza-vs-openai-evals.md`), skill-validator analysis (`docs/research/waza-vs-skill-validator.md`), and eval registry design doc (`docs/research/waza-eval-registry-design.md`)
- Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md
[0.8.0] - 2026-02-21
Added
- MCP Server — `waza serve` now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
- `waza suggest` command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: `--model`, `--dry-run`, `--apply`, `--output-dir`, `--format` (#287)
- Interactive workflow skill — `skills/waza-interactive/SKILL.md` with 5 workflow scenarios for conversational eval orchestration (#288)
- Grader weighting — `weight` field on grader configs, `ComputeWeightedRunScore` method, dashboard weighted scores column (#299)
- Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
- Judge model support — `--judge-model` flag and `judge_model` config for a separate LLM-as-judge model (#309)
- Spec compliance checks — 8 agentskills.io compliance checks in `waza check` and `waza dev` (#314)
- SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
- MCP integration scoring — 4 MCP integration checks in `waza dev` (#316)
- Batch skill processing — `waza dev` processes multiple skills in one run (#317)
- Token compare `--strict` — Budget enforcement mode for `waza tokens compare` (#318)
- Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
- Skill profile — `waza tokens profile` for static analysis of skill token distribution (#311)
- JUnit XML reporter — `--format junit` output for CI integration (#312)
- Template Variables — New `internal/template` package with `Render()` for Go text/template syntax in hooks and commands. System variables: `JobID`, `TaskName`, `Iteration`, `Attempt`, `Timestamp`. User variables via `vars` map (#186)
- GroupBy Results — New `group_by` config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes `GroupStats` with name/passed/total/avg_score (#188)
- Custom Input Variables — New `inputs` section in eval.yaml for defining key-value pairs available as `{{.Vars.key}}` throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
- CSV Dataset Support — New `tasks_from` field to generate tasks from CSV files. Each row becomes a task with columns accessible as `{{.Vars.column}}`. Optional `range: [start, end]` for row filtering. First row treated as headers (#187)
- Retry/Attempts — Add `max_attempts` config field for retrying failed task executions within each trial (#191)
- Lifecycle Hooks — Add `hooks` section with `before_run`/`after_run`/`before_task`/`after_task` lifecycle points (#191)
- `prompt` grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
  - Two modes: `clean` (fresh context) and `continue_session` (resumes test session)
  - Tool-based grading: `set_waza_grade_pass` and `set_waza_grade_fail` tools for LLM graders
  - Separate judge model configuration: run evaluation with a different model than the executor
  - Pre-built rubric templates adapted from Azure ML evaluators
- `trigger_tests.yaml` auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
  - New `internal/trigger/` package for trigger testing
  - Automatically discovered alongside `eval.yaml`
  - Confidence weighting: `high` (weight 1.0) and `medium` (weight 0.5) for borderline cases
  - `trigger_accuracy` metric with configurable cutoff threshold
  - Metrics: accuracy, precision, recall, F1, error count
- `diff` grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
- Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in `examples/rubrics/` adapted from Azure ML evaluators (#160, #161):
  - Tool call rubrics: `tool_call_accuracy`, `tool_selection`, `tool_input_accuracy`, `tool_output_utilization`
  - Task evaluation rubrics: `task_completion`, `task_adherence`, `intent_resolution`, `response_completeness`
- MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
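Several of the eval.yaml additions in this release (custom inputs, CSV datasets, retries, grouping, and lifecycle hooks) are designed to compose. The sketch below uses only field names mentioned in these notes; the exact schema, nesting, and hook syntax are assumptions for illustration, not documented configuration.

```yaml
inputs:                   # key-value pairs, available as {{.Vars.key}} (#189)
  team: platform
tasks_from: tasks.csv     # each CSV row becomes a task; columns as {{.Vars.column}} (#187)
max_attempts: 2           # retry failed task executions within each trial (#191)
group_by: model           # organize results by dimension (#188)
hooks:                    # lifecycle points (#191); hook syntax here is assumed
  before_run:
    - echo "job {{.JobID}} starting for {{.Vars.team}}"
  after_task:
    - echo "{{.TaskName}} attempt {{.Attempt}} finished"
```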
Changed
- Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
- Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)
Fixed
- install.sh macOS checksum — added `shasum -a 256` fallback for macOS (which lacks `sha256sum`) (#163)
- Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
- GitHub icon alignment and search bar width on docs site
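The macOS checksum fix reflects a common portability pattern: macOS ships `shasum` but not `sha256sum`. A minimal sketch of such a fallback (not install.sh's actual code):

```shell
# Print a file's SHA-256 digest using whichever tool the platform provides.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}
```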
[0.4.0-alpha.1] - 2026-02-17
Added
- Go cross-platform release pipeline — `go-release.yml` workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
- `install.sh` installer — one-line binary install with checksum verification: `curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash`
- `skill_invocation` grader — validates orchestration workflows by checking which skills were invoked (#146)
- `required_skills` preflight validation — verifies skill dependencies before evaluation (#147)
- Multi-model `--model` flag — run evaluations across multiple models in a single command (#39)
- `waza check` command — skill submission readiness checks (#151)
- Evaluation result caching — incremental testing with cache invalidation (#150)
- GitHub PR comment reporter — post eval results as PR comments (#140)
- Skills CI integration — GitHub Actions workflow for microsoft/skills (#141)
Fixed
- Engine shutdown leak — `runSingleModel()` now calls `engine.Shutdown(context.Background())` via defer after engine creation (#153, #154)
Changed
- Python release deprecated — the Python release workflow is no longer maintained; Go binaries are the official distribution
- First Go binary release — v0.4.0-alpha.1 is the first release distributed as pre-built binaries
[0.3.0] - 2026-02-13
Added
- Grader showcase examples demonstrating all grader types (#134)
- Reusable GitHub Actions workflow for waza evaluations (#132)
- Documentation for prompt and action_sequence grader types (#133)
- Documentation for `waza dev` command and compliance scoring (#131)
- Auto-loading of skills for testing (#129)
- Debug logging support (`--debug` flag) (#130)
Fixed
- Always output test run errors to help debug failures (#128)
- Include cwd as a skill folder when running waza (workspace fix)
Changed
- Exit codes for CI/CD integration: 0=success, 1=test failure, 2=config error (#135)
- Reordered azd-publish skill workflow steps (#127)
- Auto-merge bot registry PRs in release workflow
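The 0.3.0 exit-code contract (0 = success, 1 = test failure, 2 = config error) makes CI branching straightforward. A hedged sketch; the helper function below is hypothetical, not part of waza:

```shell
# Translate waza's documented exit codes into CI log messages.
describe_waza_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "test failure" ;;
    2) echo "config error" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}
```

A pipeline step might run the evaluation and then log `describe_waza_exit $?`, failing the job on any non-zero code.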
[0.2.1] - 2026-02-12
Added
- `waza dev` command for interactive skill development and testing (#117)
- Prerelease input to azd publish workflow
- CHANGELOG.md as release notes source for azd extension releases
- `waza generate --skill <name>` - Filter to specific skill when using `--repo` or `--scan`
Fixed
- Fixed azd extensions documentation link
- Corrected `azd ext source add` command syntax
- Branch release PR from origin/main to avoid workflow permission error (#121)
Changed
- Removed path filters from Go CI to unblock non-code PRs
- Removed auto-merge from azd publish PR workflow
- Added azd extension installation instructions to README
[0.2.0] - 2026-02-02
Added
- Skill Discovery (#3)
  - `waza generate --repo <org/repo>` - Scan GitHub repos for SKILL.md files
  - `waza generate --scan` - Scan local directory for skills
  - `waza generate --all` - Generate evals for all discovered skills (CI-friendly)
  - Interactive skill selection with checkboxes when not using `--all`
- GitHub Issue Creation (#3)
  - Post-run prompt to create GitHub issues with eval results
  - Options: create for failed tasks only, all tasks, or none
  - Issues include results table, failed task details, and suggestions
  - `--no-issues`...