Releases: microsoft/waza
Waza v0.22.0
What's Changed
- waza check uses configured token limits from .waza.yaml by @chlowell in #124
- Add `grade` verb for grading run outcomes by @chlowell in #102
- Fixing skill loading to use .waza.yaml by @richardpark-msft in #131
- Removing an `err != nil` check that we don't need since `errors.Is()` works great by @richardpark-msft in #138
- Apply timeout configuration to `waza run` by @chlowell in #130
Full Changelog: v0.21.0...v0.22.0
Waza azd Extension v0.22.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.21.0] - 2026-03-12
Added
- `waza new task from-prompt` command — Record Copilot sessions into task YAML files for eval creation (#110)
- Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
- Eval scaffolding command — `waza eval new` generates eval.yaml scaffolding for skills (#94)
- Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
- Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
- Per-file token budget configuration — Configure token budgets per-file in `.waza.yaml` (#96)
- Skill-aware thresholds — `waza tokens compare` supports skill-specific threshold configuration (#93)
- Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
- CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
- FileWriter service — Refactored `waza init` inventory with FileWriter abstraction (#63)
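The two `.waza.yaml` entries above (#96, #93) can be pictured with a small config sketch. The key names below are illustrative assumptions, not the documented schema — check the waza docs for the real field layout:

```yaml
# Hypothetical .waza.yaml sketch — key names are assumptions.
tokens:
  default_limit: 4000        # fallback budget for files without an entry
  files:
    SKILL.md: 1500           # per-file budget (#96)
    references/api.md: 2500
  skills:
    my-skill:
      threshold: 0.10        # skill-specific compare threshold (#93)
```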
Fixed
- `waza suggest` deadlock — `Execute()` now applies the request timeout before calling `Start()`, preventing goroutine deadlock (#43)
- `ResourceFile.Content` type — Changed from `string` to `[]byte` for proper binary file handling (#117)
- `tokens compare` in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
- `--output-dir` ignored — Fixed `--output-dir` having no effect for single-skill runs (#109)
- Web dashboard build order — Build dashboard assets before Go compilation (#107)
- Test file leak — Fixed test that leaked files into the repo (#120)
- Config schema defaults — Aligned `config.schema.json` defaults with Go source of truth (#65)
- Skill discovery path — Discover skills under `.github/skills/` directory (#69)
Changed
- Custom YAML deserializers for config types (#106)
- Token limits priority inverted to `.waza.yaml` first (#64)
- `@wbreza` added to CODEOWNERS (#111)
- Go 1.26+ noted in agent instruction files (#108)
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks eval coverage (#392)
- Releases page — New docs site page at `reference/releases` with platform download links, install commands, and azd extension info (#383)
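The `tool_constraint` grader's field names come from the changelog entry above; how they nest inside an eval file is an assumption in this sketch:

```yaml
# Sketch of a tool_constraint grader (#391). Field names are from the
# changelog; the surrounding eval.yaml structure is assumed.
graders:
  - type: tool_constraint
    expect_tools: [read_file, edit_file]   # must be called
    reject_tools: [delete_file]            # must never be called
    max_turns: 12
    max_tokens: 20000
```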
Fixed
- Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings
Changed
- Competitive research — Added OpenAI Evals analysis (`docs/research/waza-vs-openai-evals.md`), skill-validator analysis (`docs/research/waza-vs-skill-validator.md`), and eval registry design doc (`docs/research/waza-eval-registry-design.md`)
- Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md
[0.8.0] - 2026-02-21
Added
- MCP Server — `waza serve` now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
- `waza suggest` command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: `--model`, `--dry-run`, `--apply`, `--output-dir`, `--format` (#287)
- Interactive workflow skill — `skills/waza-interactive/SKILL.md` with 5 workflow scenarios for conversational eval orchestration (#288)
- Grader weighting — `weight` field on grader configs, `ComputeWeightedRunScore` method, dashboard weighted scores column (#299)
- Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
- Judge model support — `--judge-model` flag and `judge_model` config for separate LLM-as-judge model (#309)
- Spec compliance checks — 8 agentskills.io compliance checks in `waza check` and `waza dev` (#314)
- SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
- MCP integration scoring — 4 MCP integration checks in `waza dev` (#316)
- Batch skill processing — `waza dev` processes multiple skills in one run (#317)
- Token compare `--strict` — Budget enforcement mode for `waza tokens compare` (#318)
- Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
- Skill profile — `waza tokens profile` for static analysis of skill token distribution (#311)
- JUnit XML reporter — `--format junit` output for CI integration (#312)
- Template Variables — New `internal/template` package with `Render()` for Go text/template syntax in hooks and commands. System variables: `JobID`, `TaskName`, `Iteration`, `Attempt`, `Timestamp`. User variables via `vars` map (#186)
- GroupBy Results — New `group_by` config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes `GroupStats` with name/passed/total/avg_score (#188)
- Custom Input Variables — New `inputs` section in eval.yaml for defining key-value pairs available as `{{.Vars.key}}` throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
- CSV Dataset Support — New `tasks_from` field to generate tasks from CSV files. Each row becomes a task with columns accessible as `{{.Vars.column}}`. Optional `range: [start, end]` for row filtering. First row treated as headers (#187)
- Retry/Attempts — Add `max_attempts` config field for retrying failed task executions within each trial (#191)
- Lifecycle Hooks — Add `hooks` section with `before_run`/`after_run`/`before_task`/`after_task` lifecycle points (#191)
- `prompt` grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
  - Two modes: `clean` (fresh context) and `continue_session` (resumes test session)
  - Tool-based grading: `set_waza_grade_pass` and `set_waza_grade_fail` tools for LLM graders
  - Separate judge model configuration: run evaluation with a different model than the executor
  - Pre-built rubric templates adapted from Azure ML evaluators
- `trigger_tests.yaml` auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
  - New `internal/trigger/` package for trigger testing
  - Automatically discovered alongside `eval.yaml`
  - Confidence weighting: `high` (weight 1.0) and `medium` (weight 0.5) for borderline cases
  - `trigger_accuracy` metric with configurable cutoff threshold
  - Metrics: accuracy, precision, recall, F1, error count
- New `diff` grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
- Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in `examples/rubrics/` adapted from Azure ML evaluators (#160, #161):
  - Tool call rubrics: `tool_call_accuracy`, `tool_selection`, `tool_input_accuracy`, `tool_output_utilization`
  - Task evaluation rubrics: `task_completion`, `task_adherence`, `intent_resolution`, `response_completeness`
- MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
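Several of the 0.8.0 config additions compose in a single eval file. The field names in this sketch (`inputs`, `tasks_from`, `range`, `max_attempts`, `group_by`, `hooks`, `weight`, `mode`) are all taken from the entries above, but the overall eval.yaml layout is an assumption:

```yaml
# Sketch only — field names from the changelog, layout assumed.
inputs:                      # custom variables (#189), used as {{.Vars.region}}
  region: eastus
tasks_from: cases.csv        # CSV rows become tasks (#187)
range: [0, 50]               # optional row filter
max_attempts: 2              # retry failed executions within a trial (#191)
group_by: model              # grouped result stats (#188)
hooks:                       # lifecycle hooks (#191)
  before_run: ["./setup.sh"]
  after_task: ["echo finished {{.TaskName}}"]
graders:
  - type: prompt             # LLM-as-judge grader (#177)
    mode: clean              # or continue_session
    weight: 2.0              # grader weighting (#299)
```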
Changed
- Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
- Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)
Fixed
- install.sh macOS checksum — added `shasum -a 256` fallback for macOS (which lacks `sha256sum`) (#163)
- Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
- GitHub icon alignment and search bar width on docs site
[0.4.0-alpha.1] - 2026-02-17
Added
- Go cross-platform release pipeline — `go-release.yml` workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
- `install.sh` installer — one-line binary install with checksum verification: `curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash`
- `skill_invocation` grader — validates orchestration workflows by checking which skills were invoked (#146)
- `required_skills` preflight validation — verifies skill dependencies before evaluation (#147)
- Multi-model `--model` flag — run evaluations across multiple models in a single command (#39)
- `waza check` command — skill submission readiness checks (#151)
- Evaluation result caching — incremental testing with cache invalidation (#150)
- GitHub PR comment reporter — post eval results as PR comments (#140)
- Skills CI integration — GitHub Actions workflow for microsoft/skills (#141)
Fixed
- Engine shutdown leak — `runSingleModel()` now calls `engine.Shutdown(context.Background())` via defer after engine creation (#153, #154)
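The defer pattern in this fix can be sketched in Go. The `engine` type and constructor below are stand-ins, since the release notes don't show waza's real interfaces:

```go
package main

import (
	"context"
	"fmt"
)

// engine is a stand-in for waza's engine type; the real interface is
// not shown in the release notes, so this sketch is illustrative only.
type engine struct{ shutdownCalled bool }

func newEngine() (*engine, error) { return &engine{}, nil }

func (e *engine) Shutdown(ctx context.Context) { e.shutdownCalled = true }

// runSingleModel sketches the #153/#154 fix: defer Shutdown immediately
// after a successful engine creation, so every return path (including
// later error returns) releases the engine instead of leaking it.
func runSingleModel() (*engine, error) {
	eng, err := newEngine()
	if err != nil {
		return nil, err
	}
	defer eng.Shutdown(context.Background())
	// ... run the evaluation for a single model ...
	return eng, nil
}

func main() {
	eng, err := runSingleModel()
	fmt.Println(err == nil, eng.shutdownCalled)
}
```

Deferring right after the error check is the idiomatic Go way to guarantee cleanup without repeating Shutdown calls at every return site.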
Changed
- Python release deprecated — the Python release workflow is no longer maintained; Go binaries are the official distribution
- First Go binary release — v0.4.0-alpha.1 is the first release distributed as pre-built binaries
Waza v0.21.0
What's Changed
- Copilot SDK usage display for waza run by @chlowell in #47
- Fix broken dashboard-explore doc images across deployment bases by @Copilot in #50
- Fix Docker build by @chlowell in #51
- Working around an issue in the copilot SDK, Start() and contexts by @richardpark-msft in #53
- fix: Standardize emoji spacing in waza check display by @Copilot in #45
- fix: regression test + changelog for waza suggest deadlock by @Copilot in #43
- fix: make site base path configurable + remove unused workflow by @spboyer in #56
- fix: repair broken test blocking all PR CI by @spboyer in #67
- feat: add FileWriter service and refactor waza init inventory #48 by @wbreza in #63
- chore(deps): Bump svgo from 4.0.0 to 4.0.1 in /site by @dependabot[bot] in #88
- chore: add MIT LICENSE file by @spboyer in #99
- fix: discover skills under .github/skills/ directory by @spboyer in #69
- feat: invert token limits priority to .waza.yaml first #59 by @wbreza in #64
- fix: align config.schema.json defaults with Go source of truth #57 by @wbreza in #65
- docs: Add CI/CD integration guide (GitHub Actions, Azure DevOps) by @spboyer in #100
- fix: update docs link to GitHub Pages URL by @spboyer in #87
- chore: update repo refs + disable heartbeat auto-triggers by @spboyer in #101
- feat: Add skill-aware thresholds to waza tokens compare by @spboyer in #93
- feat: Per-file token budget configuration in .waza.yaml by @spboyer in #96
- feat: Snapshot auto-update workflow for diff grader by @spboyer in #95
- feat: Add multi-trial flakiness detection for evals by @spboyer in #103
- feat: Add eval scaffolding command (waza eval new) by @spboyer in #94
- Add in custom YAML deserializers for our config by @richardpark-msft in #106
- fix: Build web dashboard assets before Go compilation by @chlowell in #107
- docs: note Go 1.26+ in agent instruction files by @Copilot in #108
- Fix `--output-dir` having no effect for single-skill runs by @chlowell in #109
- feat: sensei scoring parity — WHEN: triggers, spec-security, Invalid level, advisory checks 16-18 by @spboyer in #79
- chore: add @wbreza to CODEOWNERS by @spboyer in #111
- feat: Add trigger heuristic grader by @spboyer in #90
- Adding in `wz new task from-prompt` by @richardpark-msft in #110
- Fix test that leaks files into the repo by @chlowell in #120
- chore(deps): Bump devalue from 5.6.3 to 5.6.4 in /site by @dependabot[bot] in #119
- Fix `tokens compare` in subdirectory showing all files as "added" by @chlowell in #105
- fix: change ResourceFile.Content from string to []byte by @Copilot in #117
- Release v0.21.0 by @spboyer in #122
Full Changelog: v0.1.0...v0.21.0
Waza azd Extension v0.21.0
Waza v0.12.0
Full Changelog: v0.11.0...v0.12.0
Waza azd Extension v0.12.0
[0.3.0] - 2026-02-13
Added
- Grader showcase examples demonstrating all grader types (#134)
- Reusable GitHub Actions workflow for waza evaluations (#132)
- Documentation for prompt and action_sequence grader types (#133)
- Documentation for `waza dev` command and compliance scoring (#131)
- Auto-loading of skills for testing (#129)
- Debug logging support (`--debug` flag) (#130)
Fixed
- Always output test run errors to help debug failures (#128)
- Include cwd as a skill folder when running waza (workspace fix)
Changed
- Exit codes for CI/CD integration: 0=success, 1=test failure, 2=config error (#135)
- Reordered azd-publish skill workflow steps (#127)
- Auto-merge bot registry PRs in release workflow
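The exit-code contract from #135 (0=success, 1=test failure, 2=config error) is what a CI wrapper would branch on. This is a generic shell sketch, not code from the waza repo; in a real pipeline you would capture the status with `waza run ...; status=$?`:

```shell
#!/bin/sh
# Map waza's documented exit codes (#135) to CI messages.
describe_status() {
  case "$1" in
    0) echo "success" ;;
    1) echo "test failure" ;;
    2) echo "config error" ;;
    *) echo "unknown" ;;
  esac
}

describe_status 2   # prints "config error"
```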
[0.2.1] - 2026-02-12
Added
- `waza dev` command for interactive skill development and testing (#117)
- Prerelease input to azd publish workflow
- CHANGELOG.md as release notes source for azd extension releases
- `waza generate --skill <name>` - Filter to specific skill when using `--repo` or `--scan`
Fixed
- Fixed azd extensions documentation link
- Corrected `azd ext source add` command syntax
- Branch release PR from origin/main to avoid workflow permission error (#121)
Changed
- Removed path filters from Go CI to unblock non-code PRs
- Removed auto-merge from azd publish PR workflow
- Added azd extension installation instructions to README
[0.2.0] - 2026-02-02
Added
- Skill Discovery (#3)
  - `waza generate --repo <org/repo>` - Scan GitHub repos for SKILL.md files
  - `waza generate --scan` - Scan local directory for skills
  - `waza generate --all` - Generate evals for all discovered skills (CI-friendly)
  - Interactive skill selection with checkboxes when not using `--all`
- GitHub Issue Creation (#3)
  - Post-run prompt to create GitHub issues with eval results
  - Options: create for failed tasks only, all tasks, or none
  - Issues include results table, failed task details, and suggestions
  - `--no-issues` ...
Waza v0.11.0
What's Changed
- Adding Microsoft SECURITY.MD by @microsoft-github-policy-service[bot] in #2
- chore(deps): Bump go.opentelemetry.io/otel/sdk from 1.38.0 to 1.40.0 by @dependabot[bot] in #4
- chore(deps): Bump rollup from 4.58.0 to 4.59.0 in /site by @dependabot[bot] in #5
New Contributors
- @microsoft-github-policy-service[bot] made their first contribution in #2
- @dependabot[bot] made their first contribution in #4
Full Changelog: v0.10.0...v0.11.0
Waza azd Extension v0.11.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without the skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on the `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks eval coverage (#392)
- Releases page — New docs site page at `reference/releases` with platform download links, install commands, and azd extension info (#383)
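To make the new grader concrete, here is a rough sketch of what a `tool_constraint` grader entry might look like in an eval config. The field names (`expect_tools`, `reject_tools`, `max_turns`, `max_tokens`) come from the notes above; the surrounding structure, tool names, and values are illustrative assumptions, not the documented schema.

```yaml
graders:
  - type: tool_constraint    # grader type added in #391
    expect_tools:            # tools the agent is expected to call (illustrative names)
      - read_file
      - edit_file
    reject_tools:            # tools that must not be called
      - delete_file
    max_turns: 12            # illustrative limits, not defaults
    max_tokens: 20000
```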
Fixed
- Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings
Changed
- Competitive research — Added OpenAI Evals analysis (`docs/research/waza-vs-openai-evals.md`), skill-validator analysis (`docs/research/waza-vs-skill-validator.md`), and eval registry design doc (`docs/research/waza-eval-registry-design.md`)
- Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md
[0.8.0] - 2026-02-21
Added
- MCP Server — `waza serve` now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
- `waza suggest` command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: `--model`, `--dry-run`, `--apply`, `--output-dir`, `--format` (#287)
- Interactive workflow skill — `skills/waza-interactive/SKILL.md` with 5 workflow scenarios for conversational eval orchestration (#288)
- Grader weighting — `weight` field on grader configs, `ComputeWeightedRunScore` method, dashboard weighted scores column (#299)
- Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
- Judge model support — `--judge-model` flag and `judge_model` config for a separate LLM-as-judge model (#309)
- Spec compliance checks — 8 agentskills.io compliance checks in `waza check` and `waza dev` (#314)
- SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
- MCP integration scoring — 4 MCP integration checks in `waza dev` (#316)
- Batch skill processing — `waza dev` processes multiple skills in one run (#317)
- Token compare `--strict` — Budget enforcement mode for `waza tokens compare` (#318)
- Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
- Skill profile — `waza tokens profile` for static analysis of skill token distribution (#311)
- JUnit XML reporter — `--format junit` output for CI integration (#312)
- Template Variables — New `internal/template` package with `Render()` for Go text/template syntax in hooks and commands. System variables: `JobID`, `TaskName`, `Iteration`, `Attempt`, `Timestamp`. User variables via `vars` map (#186)
- GroupBy Results — New `group_by` config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes `GroupStats` with name/passed/total/avg_score (#188)
- Custom Input Variables — New `inputs` section in eval.yaml for defining key-value pairs available as `{{.Vars.key}}` throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
- CSV Dataset Support — New `tasks_from` field to generate tasks from CSV files. Each row becomes a task with columns accessible as `{{.Vars.column}}`. Optional `range: [start, end]` for row filtering. First row treated as headers (#187)
- Retry/Attempts — Add `max_attempts` config field for retrying failed task executions within each trial (#191)
- Lifecycle Hooks — Add `hooks` section with `before_run`/`after_run`/`before_task`/`after_task` lifecycle points (#191)
- `prompt` grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
  - Two modes: `clean` (fresh context) and `continue_session` (resumes test session)
  - Tool-based grading: `set_waza_grade_pass` and `set_waza_grade_fail` tools for LLM graders
  - Separate judge model configuration: run evaluation with a different model than the executor
  - Pre-built rubric templates adapted from Azure ML evaluators
- `trigger_tests.yaml` auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
  - New `internal/trigger/` package for trigger testing
  - Automatically discovered alongside `eval.yaml`
  - Confidence weighting: `high` (weight 1.0) and `medium` (weight 0.5) for borderline cases
  - `trigger_accuracy` metric with configurable cutoff threshold
  - Metrics: accuracy, precision, recall, F1, error count
- `diff` grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
- Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in `examples/rubrics/` adapted from Azure ML evaluators (#160, #161):
  - Tool call rubrics: `tool_call_accuracy`, `tool_selection`, `tool_input_accuracy`, `tool_output_utilization`
  - Task evaluation rubrics: `task_completion`, `task_adherence`, `intent_resolution`, `response_completeness`
- MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
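Several of the eval.yaml additions in this release (custom inputs, CSV datasets, retries, grouping, and lifecycle hooks) are designed to compose. The sketch below uses only field names mentioned in these notes; the exact schema, nesting, and hook syntax are assumptions for illustration, not documented configuration.

```yaml
inputs:                   # key-value pairs, available as {{.Vars.key}} (#189)
  team: platform
tasks_from: tasks.csv     # each CSV row becomes a task; columns as {{.Vars.column}} (#187)
max_attempts: 2           # retry failed task executions within each trial (#191)
group_by: model           # organize results by dimension (#188)
hooks:                    # lifecycle points (#191); hook syntax here is assumed
  before_run:
    - echo "job {{.JobID}} starting for {{.Vars.team}}"
  after_task:
    - echo "{{.TaskName}} attempt {{.Attempt}} finished"
```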
Changed
- Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
- Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)
Fixed
- install.sh macOS checksum — added `shasum -a 256` fallback for macOS (which lacks `sha256sum`) (#163)
- Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
- GitHub icon alignment and search bar width on docs site
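The macOS checksum fix reflects a common portability pattern: macOS ships `shasum` but not `sha256sum`. A minimal sketch of such a fallback (not install.sh's actual code):

```shell
# Print a file's SHA-256 digest using whichever tool the platform provides.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}
```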
[0.4.0-alpha.1] - 2026-02-17
Added
- Go cross-platform release pipeline — `go-release.yml` workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
- `install.sh` installer — one-line binary install with checksum verification: `curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash`
- `skill_invocation` grader — validates orchestration workflows by checking which skills were invoked (#146)
- `required_skills` preflight validation — verifies skill dependencies before evaluation (#147)
- Multi-model `--model` flag — run evaluations across multiple models in a single command (#39)
- `waza check` command — skill submission readiness checks (#151)
- Evaluation result caching — incremental testing with cache invalidation (#150)
- GitHub PR comment reporter — post eval results as PR comments (#140)
- Skills CI integration — GitHub Actions workflow for microsoft/skills (#141)
Fixed
- Engine shutdown leak — `runSingleModel()` now calls `engine.Shutdown(context.Background())` via defer after engine creation (#153, #154)
Changed
- Python release deprecated — the Python release workflow is no longer maintained; Go binaries are the official distribution
- First Go binary release — v0.4.0-alpha.1 is the first release distributed as pre-built binaries
[0.3.0] - 2026-02-13
Added
- Grader showcase examples demonstrating all grader types (#134)
- Reusable GitHub Actions workflow for waza evaluations (#132)
- Documentation for prompt and action_sequence grader types (#133)
- Documentation for `waza dev` command and compliance scoring (#131)
- Auto-loading of skills for testing (#129)
- Debug logging support (`--debug` flag) (#130)
Fixed
- Always output test run errors to help debug failures (#128)
- Include cwd as a skill folder when running waza (workspace fix)
Changed
- Exit codes for CI/CD integration: 0=success, 1=test failure, 2=config error (#135)
- Reordered azd-publish skill workflow steps (#127)
- Auto-merge bot registry PRs in release workflow
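The 0.3.0 exit-code contract (0 = success, 1 = test failure, 2 = config error) makes CI branching straightforward. A hedged sketch; the helper function below is hypothetical, not part of waza:

```shell
# Translate waza's documented exit codes into CI log messages.
describe_waza_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "test failure" ;;
    2) echo "config error" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}
```

A pipeline step might run the evaluation and then log `describe_waza_exit $?`, failing the job on any non-zero code.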
[0.2.1] - 2026-02-12
Added
- `waza dev` command for interactive skill development and testing (#117)
- Prerelease input to azd publish workflow
- CHANGELOG.md as release notes source for azd extension releases
- `waza generate --skill <name>` - Filter to specific skill when using `--repo` or `--scan`
Fixed
- Fixed azd extensions documentation link
- Corrected `azd ext source add` command syntax
- Branch release PR from origin/main to avoid workflow permission error (#121)
Changed
- Removed path filters from Go CI to unblock non-code PRs
- Removed auto-merge from azd publish PR workflow
- Added azd extension installation instructions to README
[0.2.0] - 2026-02-02
Added
- Skill Discovery (#3)
  - `waza generate --repo <org/repo>` - Scan GitHub repos for SKILL.md files
  - `waza generate --scan` - Scan local directory for skills
  - `waza generate --all` - Generate evals for all discovered skills (CI-friendly)
  - Interactive skill selection with checkboxes when not using `--all`
- GitHub Issue Creation (#3)
  - Post-run prompt to create GitHub issues with eval results
  - Options: create for failed tasks only, all tasks, or none
  - Issues include results table, failed task details, and suggestions
  - `--no-issues`...