
Releases: microsoft/waza

Waza v0.22.0

18 Mar 16:47
b07221f


What's Changed

Full Changelog: v0.21.0...v0.22.0

Waza azd Extension v0.22.0

18 Mar 16:49
93fdb85


Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.21.0] - 2026-03-12

Added

  • waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
  • Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
  • Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
  • Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
  • Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
  • Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
  • Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
  • Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
  • CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
  • FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)

Fixed

  • waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
  • ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
  • tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
  • --output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
  • Web dashboard build order — Build dashboard assets before Go compilation (#107)
  • Test file leak — Fixed test that leaked files into the repo (#120)
  • Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
  • Skill discovery path — Discover skills under .github/skills/ directory (#69)

Changed

  • Custom YAML deserializers for config types (#106)
  • Token limits priority inverted to .waza.yaml first (#64)
  • @wbreza added to CODEOWNERS (#111)
  • Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

  • A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
  • Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
  • Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
  • Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks eval coverage (#392)
  • Releases page — New docs site page at reference/releases with platform download links, install commands, and azd extension info (#383)
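The position-swap mitigation mentioned above rests on a simple idea: judge the pair in both orders and average, so any constant preference for whichever answer appears first cancels out. A minimal sketch under that assumption — the judge here is a stand-in, not waza's prompt grader:

```go
package main

import "fmt"

// judge scores how much it prefers the first answer over the second,
// on a magnitude scale from -2 (much worse) to +2 (much better).
type judge func(first, second string) float64

// pairwiseScore asks the judge about (a, b) and the swapped pair (b, a),
// then averages the two views of "how much better is a than b".
// A constant bias toward the first position cancels in the average.
func pairwiseScore(j judge, a, b string) float64 {
	forward := j(a, b)   // a in the first position
	backward := -j(b, a) // b first; negate to get "a vs b" again
	return (forward + backward) / 2
}

func main() {
	// A judge with a fixed +0.5 bias toward whatever it sees first,
	// on top of a true preference of +1.0 for answer "A".
	biased := func(first, second string) float64 {
		score := 0.5 // position bias
		if first == "A" {
			score += 1.0
		} else {
			score -= 1.0
		}
		return score
	}
	fmt.Printf("%.1f\n", biased("A", "B"))               // 1.5: raw score inflated by position bias
	fmt.Printf("%.1f\n", pairwiseScore(biased, "A", "B")) // 1.0: bias cancelled
}
```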

Fixed

  • Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings

Changed

  • Competitive research — Added OpenAI Evals analysis (docs/research/waza-vs-openai-evals.md), skill-validator analysis (docs/research/waza-vs-skill-validator.md), and eval registry design doc (docs/research/waza-eval-registry-design.md)
  • Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md

[0.8.0] - 2026-02-21

Added

  • MCP Server — waza serve now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
  • waza suggest command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: --model, --dry-run, --apply, --output-dir, --format (#287)
  • Interactive workflow skill — skills/waza-interactive/SKILL.md with 5 workflow scenarios for conversational eval orchestration (#288)
  • Grader weighting — weight field on grader configs, ComputeWeightedRunScore method, dashboard weighted scores column (#299)
  • Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
  • Judge model support — --judge-model flag and judge_model config for separate LLM-as-judge model (#309)
  • Spec compliance checks — 8 agentskills.io compliance checks in waza check and waza dev (#314)
  • SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
  • MCP integration scoring — 4 MCP integration checks in waza dev (#316)
  • Batch skill processing — waza dev processes multiple skills in one run (#317)
  • Token compare --strict — Budget enforcement mode for waza tokens compare (#318)
  • Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
  • Skill profile — waza tokens profile for static analysis of skill token distribution (#311)
  • JUnit XML reporter — --format junit output for CI integration (#312)
  • Template Variables — New internal/template package with Render() for Go text/template syntax in hooks and commands. System variables: JobID, TaskName, Iteration, Attempt, Timestamp. User variables via vars map (#186)
  • GroupBy Results — New group_by config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes GroupStats with name/passed/total/avg_score (#188)
  • Custom Input Variables — New inputs section in eval.yaml for defining key-value pairs available as {{.Vars.key}} throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
  • CSV Dataset Support — New tasks_from field to generate tasks from CSV files. Each row becomes a task with columns accessible as {{.Vars.column}}. Optional range: [start, end] for row filtering. First row treated as headers (#187)
  • Retry/Attempts — Add max_attempts config field for retrying failed task executions within each trial (#191)
  • Lifecycle Hooks — Add hooks section with before_run/after_run/before_task/after_task lifecycle points (#191)
  • prompt grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
    • Two modes: clean (fresh context) and continue_session (resumes test session)
    • Tool-based grading: set_waza_grade_pass and set_waza_grade_fail tools for LLM graders
    • Separate judge model configuration: run evaluation with a different model than the executor
    • Pre-built rubric templates adapted from Azure ML evaluators
  • trigger_tests.yaml auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
    • New internal/trigger/ package for trigger testing
    • Automatically discovered alongside eval.yaml
    • Confidence weighting: high (weight 1.0) and medium (weight 0.5) for borderline cases
    • trigger_accuracy metric with configurable cutoff threshold
    • Metrics: accuracy, precision, recall, F1, error count
  • diff grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
  • Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in examples/rubrics/ adapted from Azure ML evaluators (#160, #161):
    • Tool call rubrics: tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization
    • Task evaluation rubrics: task_completion, task_adherence, intent_resolution, response_completeness
  • MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
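The Template Variables entry above builds on Go's standard text/template package, which is where the {{.Vars.key}} syntax comes from. A minimal sketch of that mechanism — the struct layout and render helper are illustrative, not waza's actual internal/template API:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderContext mirrors the kind of data the changelog describes: a few
// system variables plus free-form user variables. The exact fields are
// assumptions; only the template syntax is standard Go text/template.
type renderContext struct {
	JobID    string
	TaskName string
	Attempt  int
	Vars     map[string]string
}

func render(tmpl string, ctx renderContext) (string, error) {
	t, err := template.New("hook").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, ctx); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(
		"job {{.JobID}} task {{.TaskName}} attempt {{.Attempt}} region {{.Vars.region}}",
		renderContext{JobID: "j1", TaskName: "lint", Attempt: 2, Vars: map[string]string{"region": "eastus"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // job j1 task lint attempt 2 region eastus
}
```

The same dotted-field syntax reaches both the system variables ({{.JobID}}) and user-supplied entries ({{.Vars.region}}), which is what lets one template language serve hooks, task templates, and grader configs alike.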

Changed

  • Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
  • Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)

Fixed

  • install.sh macOS checksum — added shasum -a 256 fallback for macOS (which lacks sha256sum) (#163)
  • Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
  • GitHub icon alignment and search bar width on docs site

[0.4.0-alpha.1] - 2026-02-17

Added

  • Go cross-platform release pipeline — go-release.yml workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
  • install.sh installer — one-line binary install with checksum verification: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
  • skill_invocation grader — validates orchestration workflows by checking which skills were invoked (#146)
  • required_skills preflight validation — verifies skill dependencies before evaluation (#147)
  • Multi-model --model flag — run evaluations across multiple models in a single command (#39)
  • waza check command — skill submission readiness checks (#151)
  • Evaluation result caching — incremental testing with cache invalidation (#150)
  • GitHub PR comment reporter — post eval results as PR comments (#140)
  • Skills CI integration — GitHub Actions workflow for microsoft/skills (#141)

Fixed

  • Engine shutdown leak — runSingleModel() now calls engine.Shutdown(context.Background()) via defer after engine creation (#153, #154)
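The shutdown fix above is the usual Go resource-cleanup pattern: register the deferred Shutdown immediately after a successful creation, so every exit path (including errors) releases the engine. A sketch with a stand-in engine interface — the types here are illustrative, not waza's real ones:

```go
package main

import (
	"context"
	"fmt"
)

// engine stands in for the evaluation engine; only the defer-after-creation
// pattern reflects the fix described above.
type engine interface {
	Shutdown(ctx context.Context) error
}

type fakeEngine struct{ shutdowns *int }

func (f fakeEngine) Shutdown(ctx context.Context) error { *f.shutdowns++; return nil }

func runSingleModel(newEngine func() (engine, error), run func(engine) error) error {
	eng, err := newEngine()
	if err != nil {
		return err
	}
	// Registered immediately after creation: even if run returns early with
	// an error, the engine is still shut down.
	defer eng.Shutdown(context.Background())
	return run(eng)
}

func main() {
	calls := 0
	newEngine := func() (engine, error) { return fakeEngine{&calls}, nil }
	_ = runSingleModel(newEngine, func(engine) error { return fmt.Errorf("task failed") })
	fmt.Println(calls) // 1: Shutdown ran despite the error
}
```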

Changed

  • Python release deprecated — the Python release workflow is no longer maintained; Go binaries are the official distribution
  • First Go binary release — v0.4.0-alpha.1 is the first release distributed as pre-built binaries

Waza v0.21.0

12 Mar 22:25
97ac57a


What's Changed

  • Copilot SDK usage display for waza run by @chlowell in #47
  • Fix broken dashboard-explore doc images across deployment bases by @Copilot in #50
  • Fix Docker build by @chlowell in #51
  • Working around an issue in the copilot SDK, Start() and contexts by @richardpark-msft in #53
  • fix: Standardize emoji spacing in waza check display by @Copilot in #45
  • fix: regression test + changelog for waza suggest deadlock by @Copilot in #43
  • fix: make site base path configurable + remove unused workflow by @spboyer in #56
  • fix: repair broken test blocking all PR CI by @spboyer in #67
  • feat: add FileWriter service and refactor waza init inventory #48 by @wbreza in #63
  • chore(deps): Bump svgo from 4.0.0 to 4.0.1 in /site by @dependabot[bot] in #88
  • chore: add MIT LICENSE file by @spboyer in #99
  • fix: discover skills under .github/skills/ directory by @spboyer in #69
  • feat: invert token limits priority to .waza.yaml first #59 by @wbreza in #64
  • fix: align config.schema.json defaults with Go source of truth #57 by @wbreza in #65
  • docs: Add CI/CD integration guide (GitHub Actions, Azure DevOps) by @spboyer in #100
  • fix: update docs link to GitHub Pages URL by @spboyer in #87
  • chore: update repo refs + disable heartbeat auto-triggers by @spboyer in #101
  • feat: Add skill-aware thresholds to waza tokens compare by @spboyer in #93
  • feat: Per-file token budget configuration in .waza.yaml by @spboyer in #96
  • feat: Snapshot auto-update workflow for diff grader by @spboyer in #95
  • feat: Add multi-trial flakiness detection for evals by @spboyer in #103
  • feat: Add eval scaffolding command (waza eval new) by @spboyer in #94
  • Add in custom YAML deserializers for our config by @richardpark-msft in #106
  • fix: Build web dashboard assets before Go compilation by @chlowell in #107
  • docs: note Go 1.26+ in agent instruction files by @Copilot in #108
  • Fix --output-dir having no effect for single-skill runs by @chlowell in #109
  • feat: sensei scoring parity — WHEN: triggers, spec-security, Invalid level, advisory checks 16-18 by @spboyer in #79
  • chore: add @wbreza to CODEOWNERS by @spboyer in #111
  • feat: Add trigger heuristic grader by @spboyer in #90
  • Adding in wz new task from-prompt by @richardpark-msft in #110
  • Fix test that leaks files into the repo by @chlowell in #120
  • chore(deps): Bump devalue from 5.6.3 to 5.6.4 in /site by @dependabot[bot] in #119
  • Fix tokens compare in subdirectory showing all files as "added" by @chlowell in #105
  • fix: change ResourceFile.Content from string to []byte by @Copilot in #117
  • Release v0.21.0 by @spboyer in #122

Full Changelog: v0.1.0...v0.21.0

Waza azd Extension v0.21.0

12 Mar 22:27
97ac57a



Waza v0.12.0

02 Mar 18:16


Waza azd Extension v0.12.0

02 Mar 18:16
d2c158d


Changelog


[0.3.0] - 2026-02-13

Added

  • Grader showcase examples demonstrating all grader types (#134)
  • Reusable GitHub Actions workflow for waza evaluations (#132)
  • Documentation for prompt and action_sequence grader types (#133)
  • Documentation for waza dev command and compliance scoring (#131)
  • Auto-loading of skills for testing (#129)
  • Debug logging support (--debug flag) (#130)

Fixed

  • Always output test run errors to help debug failures (#128)
  • Include cwd as a skill folder when running waza (workspace fix)

Changed

  • Exit codes for CI/CD integration: 0=success, 1=test failure, 2=config error (#135)
  • Reordered azd-publish skill workflow steps (#127)
  • Auto-merge bot registry PRs in release workflow
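The exit-code contract above (0 = success, 1 = test failure, 2 = config error) is what lets CI pipelines distinguish a broken eval from a broken config. A minimal sketch of that mapping from a caller's perspective — the outcome flags are illustrative; only the 0/1/2 contract comes from the changelog:

```go
package main

import "fmt"

// The documented contract: 0 = success, 1 = test failure, 2 = config error.
const (
	exitSuccess     = 0
	exitTestFailure = 1
	exitConfigError = 2
)

// exitCodeFor maps a run outcome to the documented exit code. Config errors
// take precedence: a run that never parsed its config has no test results.
func exitCodeFor(configOK, testsPassed bool) int {
	switch {
	case !configOK:
		return exitConfigError
	case !testsPassed:
		return exitTestFailure
	default:
		return exitSuccess
	}
}

func main() {
	fmt.Println(exitCodeFor(true, true))   // 0
	fmt.Println(exitCodeFor(true, false))  // 1
	fmt.Println(exitCodeFor(false, false)) // 2
}
```

In a pipeline this means a step can retry or surface config errors differently from genuine eval failures instead of treating every nonzero exit the same way.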

[0.2.1] - 2026-02-12

Added

  • waza dev command for interactive skill development and testing (#117)
  • Prerelease input to azd publish workflow
  • CHANGELOG.md as release notes source for azd extension releases
  • waza generate --skill <name> - Filter to specific skill when using --repo or --scan

Fixed

  • Fixed azd extensions documentation link
  • Corrected azd ext source add command syntax
  • Branch release PR from origin/main to avoid workflow permission error (#121)

Changed

  • Removed path filters from Go CI to unblock non-code PRs
  • Removed auto-merge from azd publish PR workflow
  • Added azd extension installation instructions to README

[0.2.0] - 2026-02-02

Added

  • Skill Discovery (#3)

    • waza generate --repo <org/repo> - Scan GitHub repos for SKILL.md files
    • waza generate --scan - Scan local directory for skills
    • waza generate --all - Generate evals for all discovered skills (CI-friendly)
    • Interactive skill selection with checkboxes when not using --all
  • GitHub Issue Creation (#3)

    • Post-run prompt to create GitHub issues with eval results
    • Options: create for failed tasks only, all tasks, or none
    • Issues include results table, failed task details, and suggestions
    • --no-issues...

Waza v0.11.0

28 Feb 04:02


What's Changed

  • Adding Microsoft SECURITY.MD by @microsoft-github-policy-service[bot] in #2
  • chore(deps): Bump go.opentelemetry.io/otel/sdk from 1.38.0 to 1.40.0 by @dependabot[bot] in #4
  • chore(deps): Bump rollup from 4.58.0 to 4.59.0 in /site by @dependabot[bot] in #5

New Contributors

  • @microsoft-github-policy-service[bot] made their first contribution in #2
  • @dependabot[bot] made their first contribution in #4

Full Changelog: v0.10.0...v0.11.0

Waza azd Extension v0.11.0

28 Feb 04:06


Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.9.0] - 2026-02-23

Added

  • A/B baseline testing — --baseline flag runs each task with and without the skill, computing weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
  • Pairwise LLM judging — pairwise mode on the prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
  • Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
  • Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks eval coverage (#392)
  • Releases page — New docs site page at reference/releases with platform download links, install commands, and azd extension info (#383)

Fixed

  • Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings

Changed

  • Competitive research — Added OpenAI Evals analysis (docs/research/waza-vs-openai-evals.md), skill-validator analysis (docs/research/waza-vs-skill-validator.md), and eval registry design doc (docs/research/waza-eval-registry-design.md)
  • Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md

[0.8.0] - 2026-02-21

Added

  • MCP Server — waza serve now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
  • waza suggest command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: --model, --dry-run, --apply, --output-dir, --format (#287)
  • Interactive workflow skill — skills/waza-interactive/SKILL.md with 5 workflow scenarios for conversational eval orchestration (#288)
  • Grader weighting — weight field on grader configs, ComputeWeightedRunScore method, dashboard weighted scores column (#299)
  • Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
  • Judge model support — --judge-model flag and judge_model config for a separate LLM-as-judge model (#309)
  • Spec compliance checks — 8 agentskills.io compliance checks in waza check and waza dev (#314)
  • SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
  • MCP integration scoring — 4 MCP integration checks in waza dev (#316)
  • Batch skill processing — waza dev processes multiple skills in one run (#317)
  • Token compare --strict — Budget enforcement mode for waza tokens compare (#318)
  • Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
  • Skill profile — waza tokens profile for static analysis of skill token distribution (#311)
  • JUnit XML reporter — --format junit output for CI integration (#312)
  • Template Variables — New internal/template package with Render() for Go text/template syntax in hooks and commands. System variables: JobID, TaskName, Iteration, Attempt, Timestamp. User variables via vars map (#186)
  • GroupBy Results — New group_by config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes GroupStats with name/passed/total/avg_score (#188)
  • Custom Input Variables — New inputs section in eval.yaml for defining key-value pairs available as {{.Vars.key}} throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
  • CSV Dataset Support — New tasks_from field to generate tasks from CSV files. Each row becomes a task with columns accessible as {{.Vars.column}}. Optional range: [start, end] for row filtering. First row treated as headers (#187)
  • Retry/Attempts — Add max_attempts config field for retrying failed task executions within each trial (#191)
  • Lifecycle Hooks — Add hooks section with before_run/after_run/before_task/after_task lifecycle points (#191)
  • prompt grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
    • Two modes: clean (fresh context) and continue_session (resumes test session)
    • Tool-based grading: set_waza_grade_pass and set_waza_grade_fail tools for LLM graders
    • Separate judge model configuration: run evaluation with a different model than the executor
    • Pre-built rubric templates adapted from Azure ML evaluators
  • trigger_tests.yaml auto-discovery — Measure prompt trigger accuracy for skills (#166, closes #36)
    • New internal/trigger/ package for trigger testing
    • Automatically discovered alongside eval.yaml
    • Confidence weighting: high (weight 1.0) and medium (weight 0.5) for borderline cases
    • trigger_accuracy metric with configurable cutoff threshold
    • Metrics: accuracy, precision, recall, F1, error count
  • diff grader — New grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
  • Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in examples/rubrics/ adapted from Azure ML evaluators (#160, #161):
    • Tool call rubrics: tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization
    • Task evaluation rubrics: task_completion, task_adherence, intent_resolution, response_completeness
  • MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)

Changed

  • Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
  • Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)

Fixed

  • install.sh macOS checksum — added shasum -a 256 fallback for macOS (which lacks sha256sum) (#163)
  • Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
  • GitHub icon alignment and search bar width on docs site

[0.4.0-alpha.1] - 2026-02-17

Added

  • Go cross-platform release pipeline — go-release.yml workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
  • install.sh installer — one-line binary install with checksum verification: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
  • skill_invocation grader — validates orchestration workflows by checking which skills were invoked (#146)
  • required_skills preflight validation — verifies skill dependencies before evaluation (#147)
  • Multi-model --model flag — run evaluations across multiple models in a single command (#39)
  • waza check command — skill submission readiness checks (#151)
  • Evaluation result caching — incremental testing with cache invalidation (#150)
  • GitHub PR comment reporter — post eval results as PR comments (#140)
  • Skills CI integration — GitHub Actions workflow for microsoft/skills (#141)

Fixed

  • Engine shutdown leak — runSingleModel() now calls engine.Shutdown(context.Background()) via defer after engine creation (#153, #154)

Changed

  • Python release deprecated — the Python release workflow is no longer maintained; Go binaries are the official distribution
  • First Go binary release — v0.4.0-alpha.1 is the first release distributed as pre-built binaries

[0.3.0] - 2026-02-13

Added

  • Grader showcase examples demonstrating all grader types (#134)
  • Reusable GitHub Actions workflow for waza evaluations (#132)
  • Documentation for prompt and action_sequence grader types (#133)
  • Documentation for waza dev command and compliance scoring (#131)
  • Auto-loading of skills for testing (#129)
  • Debug logging support (--debug flag) (#130)

Fixed

  • Always output test run errors to help debug failures (#128)
  • Include cwd as a skill folder when running waza (workspace fix)

Changed

  • Exit codes for CI/CD integration: 0=success, 1=test failure, 2=config error (#135)
  • Reordered azd-publish skill workflow steps (#127)
  • Auto-merge bot registry PRs in release workflow

[0.2.1] - 2026-02-12

Added

  • waza dev command for interactive skill development and testing (#117)
  • Prerelease input to azd publish workflow
  • CHANGELOG.md as release notes source for azd extension releases
  • waza generate --skill <name> - Filter to specific skill when using --repo or --scan

Fixed

  • Fixed azd extensions documentation link
  • Corrected azd ext source add command syntax
  • Branch release PR from origin/main to avoid workflow permission error (#121)

Changed

  • Removed path filters from Go CI to unblock non-code PRs
  • Removed auto-merge from azd publish PR workflow
  • Added azd extension installation instructions to README

[0.2.0] - 2026-02-02

Added

  • Skill Discovery (#3)

    • waza generate --repo <org/repo> - Scan GitHub repos for SKILL.md files
    • waza generate --scan - Scan local directory for skills
    • waza generate --all - Generate evals for all discovered skills (CI-friendly)
    • Interactive skill selection with checkboxes when not using --all
  • GitHub Issue Creation (#3)

    • Post-run prompt to create GitHub issues with eval results
    • Options: create for failed tasks only, all tasks, or none
    • Issues include results table, failed task details, and suggestions
    • --no-issues...