A Go CLI for evaluating AI agent skills — scaffold eval suites, run benchmarks, and compare results across models.
Download and install the latest pre-built binary with the install script:
```bash
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
```

The script auto-detects your OS and architecture (linux/darwin/windows, amd64/arm64), downloads the binary, verifies the checksum, and installs to `/usr/local/bin` (or `~/bin` if that is not writable).
Or download binaries directly from the latest release.
Or install from source (requires Go 1.26+):
```bash
go install github.com/microsoft/waza/cmd/waza@latest
```

Waza is also available as an azd extension:
```bash
# Add the waza extension registry
azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json

# Install the extension
azd ext install microsoft.azd.waza

# Verify it's working
azd waza --help
```

Once installed, all waza commands are available under `azd waza`. For example:
```bash
azd waza init my-eval --interactive
azd waza run examples/code-explainer/eval.yaml -v
```

See the Getting Started Guide for a complete walkthrough:
```bash
# Initialize a new project
waza init my-project && cd my-project

# Create a new skill
waza new skill my-skill

# Define the skill in skills/my-skill/SKILL.md
# Write evaluation tasks in evals/my-skill/tasks/
# Add test fixtures in evals/my-skill/fixtures/

# Run evaluations
waza run my-skill

# Check skill readiness
waza check my-skill
```
```bash
# Build
make build

# Initialize a project workspace
waza init [directory]

# Create a new skill
waza new skill skill-name

# Create a new eval scaffold from an existing SKILL.md
waza new eval skill-name

# Generate a task YAML by recording a prompt run
waza new task from-prompt "Explain this code and suggest fixes" evals/code-explainer/tasks/recorded-task.yaml

# Check if a skill is ready for submission
waza check skills/my-skill

# Suggest an eval suite from SKILL.md
waza suggest skills/my-skill --dry-run
waza suggest skills/my-skill --apply

# Note: 'generate' is available as an alias for 'new' (see the new command below)

# Run evaluations
waza run examples/code-explainer/eval.yaml --context-dir examples/code-explainer/fixtures -v

# Grade output from a previous `waza run --output results.json ...`
waza grade eval.yaml --results results.json

# Compare results across models
waza compare results-gpt4.json results-sonnet.json

# Count tokens in skill files
waza tokens count skills/

# Compare skill token budgets vs main
waza tokens compare main --skills --threshold 10

# Suggest token optimizations
waza tokens suggest skills/
```

Initialize a waza project workspace with separated `skills/` and `evals/` directories. Idempotent — creates only missing files.
| Flag | Description |
|---|---|
| `--no-skill` | Skip the first-skill creation prompt |
Creates:

- `skills/` — Skill definitions directory
- `evals/` — Evaluation suites directory
- `.github/workflows/eval.yml` — CI/CD pipeline for running evals on PRs
- `.gitignore` — Waza-specific exclusions
- `README.md` — Getting started guide for your project
Example:

```bash
waza init my-project
# Optionally creates the first skill interactively

waza init my-project --no-skill
# Skip the skill creation prompt
```

Create a new skill with a scaffolded structure and evaluation suite. Detects workspace context and adapts its output.
| Flag | Short | Description |
|---|---|---|
| `--template` | `-t` | Template pack (coming soon) |
Modes:

Project mode (detects a `skills/` directory):

```
project/
├── skills/{skill-name}/SKILL.md
└── evals/{skill-name}/
    ├── eval.yaml
    ├── tasks/*.yaml
    └── fixtures/
```

Standalone mode (no `skills/` detected):

```
{skill-name}/
├── SKILL.md
├── evals/
│   ├── eval.yaml
│   ├── tasks/*.yaml
│   └── fixtures/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```
Example:

```bash
# In project mode (see Modes above): creates skills/code-explainer/SKILL.md + evals/code-explainer/
waza new skill code-explainer

# In standalone mode (see Modes above): creates a self-contained code-explainer/ directory
waza new skill code-explainer
```
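The generated SKILL.md carries the skill definition. Below is a rough sketch of the shape, using only the frontmatter fields the `waza check` spec checks look for (name, description, license, `metadata.version`); the exact layout, including where the USE FOR / DO NOT USE FOR hints live, is defined by the agentskills.io spec rather than this README:

```markdown
---
name: code-explainer
description: >
  Explains code and suggests fixes. USE FOR explaining unfamiliar code or
  diffs. DO NOT USE FOR writing new features from scratch.
license: MIT
metadata:
  version: "1.0.0"
---

# Code Explainer

Instructions the agent follows when this skill is selected...
```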
Scaffold an eval suite from an existing SKILL.md (reads frontmatter trigger hints from USE FOR and DO NOT USE FOR).

Creates:

- `evals/<skill-name>/eval.yaml`
- `evals/<skill-name>/tasks/positive-trigger-1.yaml`
- `evals/<skill-name>/tasks/positive-trigger-2.yaml`
- `evals/<skill-name>/tasks/negative-trigger-1.yaml`
| Flag | Description |
|---|---|
| `--output <path>` | Custom path for eval.yaml (tasks are generated under a sibling `tasks/` directory) |
Example:

```bash
# Default output location
waza new eval code-explainer

# Custom eval path
waza new eval code-explainer --output evals/custom-code-explainer/eval.yaml
```
Run a prompt through Copilot and generate a task YAML with inferred validators based on observed behavior (response text, tool usage, and invoked skills).

| Flag | Description |
|---|---|
| `--model <name>` | Copilot model to use for recording (default: claude-sonnet-4.5) |
| `--testname <name>` | Test name and ID written into the generated task (default: auto-generated-test) |
| `--tags <a,b,...>` | Comma-separated tags to attach to the generated task |
| `--timeout <duration>` | Maximum time for prompt execution (default: 5m) |
| `--overwrite` | Overwrite the output task file if it already exists |
| `--root <dir>` | Root directory used for skill discovery (default: .) |
Example:

```bash
# Record a prompt and generate a reusable task YAML
waza new task from-prompt "Refactor this function for readability" evals/code-explainer/tasks/refactor-readability.yaml

# Add metadata and overwrite an existing file
waza new task from-prompt "Explain this diff and risks" evals/code-explainer/tasks/diff-analysis.yaml \
  --testname diff-analysis \
  --tags recorded,regression \
  --overwrite
```
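The generated file is an ordinary task YAML that you can edit and re-run. The emitted schema is not documented in this README, so the sketch below is illustrative only: `prompt` matches the task-prompt examples later in this document, the validator entry borrows the `graders:` shape from the eval spec, and the `name`/`tags` keys are assumptions standing in for whatever `--testname` and `--tags` actually populate.

```yaml
# evals/code-explainer/tasks/diff-analysis.yaml (illustrative sketch, not the exact schema)
name: diff-analysis            # assumed key, populated from --testname
tags: [recorded, regression]   # assumed key, populated from --tags
prompt: "Explain this diff and risks"
graders:                       # key name assumed; these are the "inferred validators"
  - type: text
    name: mentions-risks
    config:
      regex_match: ["risk"]
```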
Run an evaluation benchmark from a spec file.

| Flag | Short | Description |
|---|---|---|
| `--context-dir <dir>` | | Fixture directory (default: `./fixtures` relative to the spec) |
| `--output <file>` | `-o` | Save results to JSON |
| `--verbose` | `-v` | Detailed progress output |
| `--transcript-dir <dir>` | | Save per-task transcript JSON files |
| `--task <glob>` | | Filter tasks by name/ID pattern (repeatable) |
| `--parallel` | | Run tasks concurrently |
| `--workers <n>` | | Concurrent workers (default: 4; requires `--parallel`) |
| `--trials <n>` | | Run each task n times to detect flakiness (omit to use `config.trials_per_task`; if provided, n must be >= 1) |
| `--interpret` | | Print a plain-language interpretation of the results |
| `--format <fmt>` | | Output format: `default` or `github-comment` (default: `default`) |
| `--cache` | | Enable result caching to speed up repeated runs |
| `--no-cache` | | Explicitly disable result caching |
| `--cache-dir <dir>` | | Cache directory (default: `.waza-cache`) |
| `--reporter <spec>` | | Output reporters: `json` (default), `junit:<path>` (repeatable) |
| `--baseline` | | A/B testing mode — runs each task twice (without the skill = baseline, with the skill = normal) and computes improvement scores |
| `--discover` | | Auto skill discovery — walks the directory tree for SKILL.md + eval.yaml (root/tests/evals) |
| `--strict` | | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |
| `--suggest` | | Generate a Copilot suggestion report based on test outcomes (the mock engine emits a deterministic fake report) |
Result Caching

Enable caching with `--cache` to store test results and skip re-execution on repeated runs:

```bash
# First run executes all tests and caches results
waza run eval.yaml --cache

# Second run uses cached results (much faster)
waza run eval.yaml --cache

# Clear the cache when needed
waza cache clear
```

Cached results are automatically invalidated when:

- Spec configuration changes (model, timeout, graders, etc.)
- Task definitions change
- Fixture files change

Note: Caching is automatically disabled for evaluations using non-deterministic graders (`behavior`, `prompt`).
Exit Codes
The run command uses exit codes to enable CI/CD integration:
| Exit Code | Condition | Description |
|---|---|---|
| `0` | Success | All tests passed |
| `1` | Test failure | One or more tests failed validation |
| `2` | Configuration error | Invalid spec, missing files, or runtime error |
Example CI usage:

```bash
# Fail the build if any tests fail
waza run eval.yaml || exit $?

# Capture specific exit codes
waza run eval.yaml
EXIT_CODE=$?
if [ $EXIT_CODE -eq 1 ]; then
  echo "Tests failed - check results"
elif [ $EXIT_CODE -eq 2 ]; then
  echo "Configuration error"
fi

# Post results as a PR comment (GitHub Actions)
waza run eval.yaml --format github-comment > comment.md
gh pr comment $PR_NUMBER --body-file comment.md

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# Both JSON output and JUnit XML
waza run eval.yaml -o results.json --reporter junit:results.xml
```

Note: `waza generate` is an alias for `waza new`. Both commands support the same functionality, including the `--output-dir` flag for specifying custom output locations.
Compare results from multiple evaluation runs side by side — per-task score deltas, pass rate differences, and aggregate statistics.
| Flag | Short | Description |
|---|---|---|
| `--format <fmt>` | `-f` | Output format: `table` or `json` (default: `table`) |
Clear all cached evaluation results to force re-execution on the next run.
| Flag | Description |
|---|---|
| `--cache-dir <dir>` | Cache directory to clear (default: `.waza-cache`) |
Iteratively score and improve skill frontmatter in a SKILL.md file.
Use `--copilot` for a non-interactive, single-pass markdown report that:

- Summarizes current skill details and token usage
- Loads trigger test prompts as examples (when `trigger_tests.yaml` exists; see the sketch below)
- Requests Copilot suggestions for improving skill selection
- Prints the report to stdout without applying any changes

When `--copilot` is set, the iterative-mode flags (`--target`, `--max-iterations`, `--auto`) are invalid.
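The `trigger_tests.yaml` format is not documented in this README; the sketch below is purely hypothetical and only illustrates the idea of prompts that should and should not trigger the skill. See the `trigger_tests` grader in docs/GRADERS.md for the real schema.

```yaml
# trigger_tests.yaml (hypothetical structure, for illustration only)
trigger_tests:
  - prompt: "Explain what this function does"
    should_trigger: true
  - prompt: "Write a new REST endpoint from scratch"
    should_trigger: false
```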
| Flag | Description |
|---|---|
| `--target <level>` | Target adherence level for iterative mode: `low`, `medium`, `medium-high`, `high` (default: `medium-high`) |
| `--max-iterations <n>` | Maximum improvement iterations for iterative mode (default: 5) |
| `--auto` | Apply improvements without prompting in iterative mode |
| `--copilot` | Generate a non-interactive markdown report with Copilot suggestions |
| `--model <id>` | Model to use with `--copilot` |
Check if a skill is ready for submission with a comprehensive readiness report.
Performs five types of checks:

- Compliance scoring — Validates frontmatter adherence (Low/Medium/Medium-High/High)
- Token budget — Checks whether SKILL.md is within token limits (configurable via `tokens.limits` in `.waza.yaml`)
- Evaluation suite — Checks for the presence of eval.yaml
- Spec compliance — Validates the skill against the agentskills.io spec (frontmatter structure, required fields, naming rules, directory match, description length, compatibility, license, and version)
- Advisory checks — Detects quality and maintainability issues (reference module count, complexity classification, negative-delta risk patterns, procedural content, and over-specificity)
Provides a plain-language summary and actionable next steps to improve the skill.
Example output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer

📋 Compliance Score: High
✅ Excellent! Your skill meets all compliance requirements.

📊 Token Budget: 450 / 500 tokens
✅ Within budget (50 tokens remaining).

🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Spec Compliance (agentskills.io)
✅ spec-frontmatter      Frontmatter structure valid with required fields
✅ spec-allowed-fields   All frontmatter fields are spec-allowed
✅ spec-name             Name follows spec naming rules
✅ spec-dir-match        Directory name matches skill name
✅ spec-description      Description is valid
✅ spec-license          License field present
✅ spec-version          metadata.version present

🔬 Advisory Checks
✅ module-count          Found 2 reference modules (2-3 is optimal)
✅ complexity            Complexity: detailed (350 tokens, 2 modules)
✅ negative-delta-risk   No negative delta risk patterns detected
✅ procedural-content    Description contains procedural language
✅ over-specificity      No over-specificity patterns detected

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Your skill is ready for submission!

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✨ No action needed! Your skill looks great.

Consider:
  • Running 'waza run eval.yaml' to verify functionality
  • Sharing your skill with the community
```
Usage:

```bash
# Check current directory
waza check

# Check a specific skill
waza check skills/my-skill

# Suggested workflow
waza check skills/my-skill   # Check readiness
waza dev skills/my-skill     # Improve compliance if needed
waza check skills/my-skill   # Verify improvements
```
Use an LLM to analyze SKILL.md and generate suggested evaluation artifacts.

| Flag | Description |
|---|---|
| `--model <model>` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggested output to stdout (default) |
| `--apply` | Write files to disk |
| `--output-dir <dir>` | Output directory (default: `<skill-path>/evals`) |
| `--format yaml\|json` | Output format (default: `yaml`) |
Examples:

```bash
# Preview generated eval/task/fixture files as YAML
waza suggest skills/code-explainer --dry-run

# Write generated files to disk
waza suggest skills/code-explainer --apply

# Print the suggestion payload as JSON
waza suggest skills/code-explainer --format json
```

Count tokens in markdown files. Paths may be files or directories (scanned recursively for .md/.mdx).
| Flag | Description |
|---|---|
| `--format <fmt>` | Output format: `table` or `json` (default: `table`) |
| `--sort <field>` | Sort by `tokens`, `name`, or `path` (default: `path`) |
| `--min-tokens <n>` | Filter out files below n tokens |
| `--no-total` | Hide the total row in table output |
Compare markdown token counts between git refs.
With no arguments, compares HEAD to the working tree. With one ref, compares that ref to the working tree. With two refs, compares the first ref to the second.
| Flag | Description |
|---|---|
| `--format <fmt>` | Output format: `table` or `json` (default: `table`) |
| `--show-unchanged` | Include unchanged files in the output |
| `--strict` | Exit with code 1 if any file exceeds its absolute token limit |
| `--skills` | Only compare SKILL.md files under configured skill roots |
| `--threshold <n>` | Fail when any existing file increases by more than n percent (0 = disabled) |
Use `--skills` to restrict the comparison to SKILL.md files under configured skill roots (`skills/`, `.github/skills/`, and `paths.skills` from `.waza.yaml`). In skills mode the default base ref is `origin/main` (falling back to `main`).

Use `--threshold` for CI gating — newly added files are exempt from threshold checks (they have no baseline) but are still subject to absolute limit checks with `--strict`.
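Both the absolute limits and the extra skill roots mentioned above come from `.waza.yaml`. A minimal sketch, assuming a plausible nesting for the `tokens.limits` and `paths.skills` keys this README references (the exact structure is not documented here):

```yaml
# .waza.yaml (sketch; key paths taken from this README, nesting assumed)
tokens:
  limits:
    SKILL.md: 500     # absolute per-file token limit enforced by --strict
paths:
  skills:
    - custom/skills/  # additional skill roots scanned in --skills mode
```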
```bash
# Compare all markdown tokens between HEAD and the working tree
waza tokens compare

# Skill-aware comparison vs main with a CI threshold
waza tokens compare main --skills --threshold 10

# JSON output for CI pipelines
waza tokens compare main --skills --threshold 10 --strict --format json
```

Structural analysis of SKILL.md files — reports token count, section count, code block count, and workflow step detection, with a one-line summary and warnings.
| Flag | Description |
|---|---|
| `--format <fmt>` | Output format: `text` or `json` (default: `text`) |
| `--tokenizer <t>` | Tokenizer: `bpe` or `estimate` (default: `bpe`) |
Example output:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```
Suggest ways to reduce token usage in markdown files. Paths may be files or
directories (scanned recursively for .md/.mdx).
| Flag | Description |
|---|---|
| `--format <fmt>` | Output format: `text` or `json` (default: `text`) |
| `--min-savings <n>` | Minimum estimated token savings for heuristic suggestions |
| `--copilot` | Enable Copilot-powered suggestions |
| `--model <id>` | Model to use with `--copilot` |
Start the waza dashboard server to visualize evaluation results. The HTTP server opens in your browser automatically and scans the specified directory for .json result files.
Optionally, run a JSON-RPC 2.0 server (for IDE integration) instead of the HTTP dashboard using the --tcp flag.
| Flag | Default | Description |
|---|---|---|
| `--port <port>` | `3000` | HTTP server port |
| `--no-browser` | `false` | Don't auto-open the browser |
| `--results-dir <dir>` | `.` | Directory to scan for result files |
| `--tcp <addr>` | (off) | TCP address for JSON-RPC (e.g., `:9000`); defaults to loopback for security |
| `--tcp-allow-remote` | `false` | Allow TCP binding to non-loopback addresses |
Examples:

```bash
# Start the HTTP dashboard on port 3000
waza serve

# Start the HTTP dashboard on a custom port and scan a results directory
waza serve --port 8080 --results-dir ./results

# Start the dashboard without auto-opening the browser
waza serve --no-browser

# Start a JSON-RPC server for IDE integration
waza serve --tcp :9000
```

Dashboard Views:
The dashboard displays evaluation results with:
- Task-level pass/fail status
- Score distributions across trials
- Model comparisons
- Aggregated metrics and trends
For detailed documentation on the dashboard and result visualization, see docs/GUIDE.md.
Manage evaluation results stored in cloud or local storage.
List all evaluation runs from configured cloud storage or local results directory.
| Flag | Description |
|---|---|
| `--limit <n>` | Maximum results to display (default: 20) |
| `--format <fmt>` | Output format: `table` or `json` (default: `table`) |
```bash
# List recent results
waza results list

# List with a custom limit
waza results list --limit 50

# Output as JSON
waza results list --format json
```

Compare two evaluation runs side by side. Displays per-task score deltas, pass rate differences, and key metrics.
| Flag | Description |
|---|---|
| `--format <fmt>` | Output format: `table` or `json` (default: `table`) |
```bash
# Compare two runs
waza results compare run-20250226-001 run-20250226-002

# Output as JSON for further processing
waza results compare run-20250226-001 run-20250226-002 --format json
```

Run graders against agent output without executing an agent. Designed for standalone grading of previous eval runs.
| Flag | Description |
|---|---|
| `--task <id>` | Task ID to grade |
| `--results <file>` | Path to `waza run` output JSON |
| `--workspace <dir>` | Agent workspace directory for file-based graders; must point to the agent's actual workspace (default: .) |
| `--judge-model <model>` | Model for prompt graders |
| `-o, --output <file>` | Write the full EvaluationOutcome JSON (compatible with `waza compare`) |
| `-v, --verbose` | Verbose output |
```bash
waza run eval.yaml --output results.json
waza grade eval.yaml --results results.json
```

Waza can automatically upload evaluation results to Azure Blob Storage for team collaboration and historical tracking.
Add a `storage:` section to your `.waza.yaml`:

```yaml
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true
```

| Field | Description | Required |
|---|---|---|
| `provider` | Cloud provider (`azure-blob` currently supported) | Yes |
| `accountName` | Azure Storage account name | Yes |
| `containerName` | Blob container name (default: `waza-results`) | No |
| `enabled` | Enable/disable uploads (default: `true` when configured) | No |
Waza uses DefaultAzureCredential — it automatically detects and uses available credentials in this order:

1. Environment variables (`AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`)
2. Managed Identity (on Azure services)
3. Azure CLI (`az login`)
4. Visual Studio Code (if signed in)
5. Azure PowerShell (if signed in)
In most cases, running `az login` is all you need:

```bash
az login
waza run eval.yaml  # Results auto-upload to Azure Storage
```

- Auto-upload on run: When `storage:` is configured, `waza run` automatically uploads results to Azure Blob Storage
- Organized by skill: Results are stored as `{skill-name}/{run-id}.json`
- Local copy kept: Results are also saved locally (via the `-o` flag)
- List remote results: Use `waza results list` to browse uploaded runs
- Compare runs: Use `waza results compare` to diff two remote results
```bash
# Configure once (edit .waza.yaml)
cat > .waza.yaml <<EOF
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true
EOF

# Authenticate
az login

# Run evaluations — results auto-upload
waza run evals/my-skill/eval.yaml -v

# Browse uploaded results
waza results list

# Compare two runs
waza results compare run-id-1 run-id-2
```

For step-by-step setup and troubleshooting, see the Getting Started with Azure Storage guide.
```bash
make build    # Compile binary to ./waza
make test     # Run tests with coverage
make lint     # Run golangci-lint
make fmt      # Format code and tidy modules
make install  # Install to GOPATH
```

```
cmd/waza/          CLI entrypoint and command definitions
  tokens/          Token counting subcommand
internal/
  config/          Configuration with functional options
  execution/       AgentEngine interface (mock, copilot)
  graders/         Validator registry and built-in graders
  metrics/         Scoring metrics
  models/          Data structures (BenchmarkSpec, TestCase, EvaluationOutcome)
  orchestration/   TestRunner for coordinating execution
  reporting/       Result formatting and output
  transcript/      Per-task transcript capture
  wizard/          Interactive init wizard
examples/          Example eval suites
skills/            Example skills
```
```yaml
name: my-eval
skill: my-skill
version: "1.0"

config:
  trials_per_task: 3
  max_attempts: 3        # Retry failed graders up to 3 times (default: 1, no retries)
  timeout_seconds: 300
  parallel: false
  executor: mock         # or copilot-sdk
  model: claude-sonnet-4-20250514
  group_by: model        # Group results by model (or another dimension)

# Custom input variables available as {{.Vars.key}} in tasks and hooks
inputs:
  api_version: v2
  environment: production
  max_retries: 3

hooks:
  before_run:
    - command: "echo 'Starting evaluation'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
  after_run:
    - command: "echo 'Evaluation complete'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
  before_task:
    - command: "echo 'Running task: {{.TaskName}}'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
  after_task:
    - command: "echo 'Task {{.TaskName}} completed'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["\\d+ tests passed"]
  - type: behavior
    name: efficiency
    config:
      max_tool_calls: 20
      max_duration_ms: 300000
  - type: action_sequence
    name: workflow_check
    config:
      matching_mode: in_order_match
      expected_actions: ["bash", "edit", "report_progress"]

# Task definitions: glob patterns or CSV dataset
tasks:
  - "tasks/*.yaml"
# Optional: Generate tasks from a CSV dataset
# tasks_from: ./test-cases.csv
# range: [1, 10]  # Only include rows 1-10 (0-indexed, skips header)
```

Use the `inputs` section to define key-value variables available throughout your evaluation as `{{.Vars.key}}`:
```yaml
inputs:
  api_endpoint: https://api.example.com
  timeout: 30
  environment: staging

hooks:
  before_run:
    - command: "echo 'Testing against {{.Vars.environment}}'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
```

Variables are accessible in:
- Hook commands
- Task prompts and fixtures (via template rendering)
- Grader configurations (see the sketch below)
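For example, a grader config can interpolate an input variable. A minimal sketch reusing the `text` grader shape from the spec example above, with `environment` coming from an `inputs` section like the one shown earlier:

```yaml
graders:
  - type: text
    name: environment-check
    config:
      # Rendered via template substitution before grading
      regex_match: ["deployed to {{.Vars.environment}}"]
```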
Generate tasks dynamically from a CSV file using `tasks_from`:

```yaml
# eval.yaml
tasks_from: ./test-cases.csv
range: [0, 50]  # Optional: limit to rows 0-50 (row 0 is the header and is skipped)
```

CSV Format:
```csv
prompt,expected_output,language
"Explain this function","Function explanation",python
"Review this code","Code review",javascript
```

Task Generation:
- The first row is treated as column headers
- Each subsequent row becomes a task
- Column values are available as `{{.Vars.column_name}}`
- Range filtering (optional) limits generation to a subset of rows
Example task prompt using CSV variables, in your task file or inline prompt:

```yaml
prompt: "{{.Vars.prompt}}"
expected_output: "{{.Vars.expected_output}}"
language: "{{.Vars.language}}"
```

Tasks can also be mixed — use both explicit task files and CSV-generated tasks:
```yaml
tasks:
  - "tasks/*.yaml"             # Explicit tasks
tasks_from: ./test-cases.csv   # CSV-generated tasks
range: [0, 20]                 # Only the first 20 rows
```

CSV vs Inputs:
- `inputs`: Static key-value pairs defined once in eval.yaml
- `tasks_from`: Generates multiple tasks from CSV rows
- Conflict resolution: CSV column values override `inputs` for the same key
Use `max_attempts` to retry failed grader validations within each trial:

```yaml
config:
  max_attempts: 3  # Retry failed graders up to 3 times (default: 1, no retries)
```

When a grader fails, waza retries the task execution up to `max_attempts` times. The evaluation outcome includes an `attempts` field showing how many executions were needed to pass. This is useful for handling transient failures in external services or non-deterministic grader behavior.
Output: JSON results include `attempts` per task, showing the number of executions performed.
Use `group_by` to organize results by a dimension (e.g., model, environment). Results are grouped in CLI output, and JSON results include group statistics:

```yaml
config:
  group_by: model
```

Grouped results in JSON output include `GroupStats`:
```json
{
  "group_stats": [
    {
      "name": "claude-sonnet-4-20250514",
      "passed": 8,
      "total": 10,
      "avg_score": 0.85
    }
  ]
}
```

Use hooks to run commands before/after evaluations and tasks:
```yaml
hooks:
  before_run:
    - command: "npm install"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: true
  after_run:
    - command: "rm -rf node_modules"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
  before_task:
    - command: "echo 'Task: {{.TaskName}}'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
  after_task:
    - command: "echo 'Done: {{.TaskName}}'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
```

Hook Fields:
- `command` — Shell command to execute
- `working_directory` — Directory to run the command in (relative to eval.yaml)
- `exit_codes` — List of acceptable exit codes (default: `[0]`)
- `error_on_fail` — Fail the entire evaluation if the hook fails (default: `false`)
Lifecycle Points:
- `before_run` — Execute once before all tasks
- `after_run` — Execute once after all tasks
- `before_task` — Execute before each task
- `after_task` — Execute after each task
Template Variables in Hooks and Commands:
Available variables in hook commands and task execution contexts:
- `{{.JobID}}` — Unique evaluation run identifier
- `{{.TaskName}}` — Name/ID of the current task (available in `before_task`/`after_task` only)
- `{{.Iteration}}` — Current trial number (1-indexed)
- `{{.Attempt}}` — Current attempt number (1-indexed, used for retries)
- `{{.Timestamp}}` — ISO 8601 timestamp of execution
- `{{.Vars.key}}` — User-defined variables from the `inputs` section or CSV columns
Custom variables can be defined in the `inputs` section and referenced in hooks:

```yaml
inputs:
  environment: production
  api_version: v2
  debug_mode: "true"

hooks:
  before_run:
    - command: "echo 'Starting eval {{.JobID}} in {{.Vars.environment}}'"
      working_directory: "."
      exit_codes: [0]
      error_on_fail: false
```

When using CSV-generated tasks, each row's column values are also available as `{{.Vars.column_name}}`.
Waza is designed to work seamlessly with CI/CD pipelines.
Waza can validate your skill in CI before publishing:
Option 1: Binary install (recommended)

```bash
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
```

Option 2: Install from source

```bash
# Requires Go 1.26+
go install github.com/microsoft/waza/cmd/waza@latest
```

Option 3: Use Docker

```bash
docker build -t waza:local .
docker run -v $(pwd):/workspace waza:local run eval/eval.yaml
```

Copy `.github/workflows/skills-ci-example.yml` to your skill repository:
```yaml
jobs:
  evaluate-skill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install waza
        run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - run: waza run eval/eval.yaml --verbose --output results.json
      - uses: actions/upload-artifact@v4
        with:
          name: waza-evaluation-results
          path: results.json
```

| Requirement | Details |
|---|---|
| Go Version | 1.26 or higher |
| Executor | Use the `mock` executor for CI (no API keys needed) |
| GitHub Token | Only required for the `copilot-sdk` executor: set the `GITHUB_TOKEN` env var |
| Exit Codes | 0 = success, 1 = test failure, 2 = config error |
```
your-skill/
├── SKILL.md          # Skill definition
└── eval/             # Evaluation suite
    ├── eval.yaml     # Benchmark spec
    ├── tasks/        # Task definitions
    │   └── *.yaml
    └── fixtures/     # Context files
        └── *.txt
```
This repository includes reusable workflows:

- `.github/workflows/waza-eval.yml` — Reusable workflow for running evals:

  ```yaml
  jobs:
    eval:
      uses: ./.github/workflows/waza-eval.yml
      with:
        eval-yaml: 'examples/code-explainer/eval.yaml'
        verbose: true
  ```

- `examples/ci/eval-on-pr.yml` — Matrix testing across models
- `examples/ci/basic-example.yml` — Minimal workflow example

See examples/ci/README.md for detailed documentation and more examples.
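As a rough sketch of the matrix pattern, one reusable-workflow call can be fanned out over several eval specs (the spec paths below are placeholders; examples/ci/eval-on-pr.yml is the maintained example, and it matrixes across models rather than spec files):

```yaml
jobs:
  eval:
    strategy:
      matrix:
        spec:                                  # placeholder spec paths
          - examples/code-explainer/eval.yaml
          - evals/my-skill/eval.yaml
    uses: ./.github/workflows/waza-eval.yml
    with:
      eval-yaml: ${{ matrix.spec }}
      verbose: true
```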
Waza supports multiple grader types for comprehensive evaluation:
| Grader | Purpose | Documentation |
|---|---|---|
| `code` | Python/JavaScript assertion-based validation | docs/GRADERS.md |
| `text` | Substring and pattern matching in output | docs/GRADERS.md |
| `file` | File existence and content validation | docs/GRADERS.md |
| `diff` | Workspace file comparison with snapshots and fragments | docs/GRADERS.md |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) | docs/GRADERS.md |
| `action_sequence` | Tool call sequence validation with F1 scoring | docs/GRADERS.md |
| `skill_invocation` | Skill orchestration sequence validation | docs/GRADERS.md |
| `prompt` | LLM-as-judge evaluation with rubrics | docs/GRADERS.md |
| `trigger_tests` | Prompt trigger accuracy detection | docs/GRADERS.md |
See the complete Grader Reference for detailed configuration options and examples.
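As one concrete illustration, a `prompt` (LLM-as-judge) entry would sit alongside the graders in the eval spec above. The `type`/`name`/`config` shape matches that spec, but the `rubric` and `judge_model` keys here are assumptions; docs/GRADERS.md has the authoritative options:

```yaml
graders:
  - type: prompt
    name: explanation-quality
    config:
      # Hypothetical keys; check docs/GRADERS.md for the real schema
      rubric: |
        Score 1.0 if the explanation names the function's purpose and at
        least one concrete improvement; score 0.0 otherwise.
      judge_model: claude-sonnet-4-20250514
```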
- Getting Started - Complete walkthrough: init → new → run → check
- Demo Guide - 7 live demo scenarios for presentations
- Grader Reference - Complete grader types and configuration
- Tutorial - Getting started with writing skill evals
- CI Integration - GitHub Actions workflows for skill evaluation
- Token Management - Tracking and optimizing skill context size
See AGENTS.md for coding guidelines.
- Use conventional commits (`feat:`, `fix:`, `docs:`, etc.)
- Go CI is required: Build and Test Go Implementation and Lint Go Code must pass
- Add tests for new features
- Update docs when changing the CLI surface
The Python implementation has been superseded by the Go CLI. The last Python release is available at v0.3.2. Starting with v0.4.0-alpha.1, waza is distributed exclusively as pre-built Go binaries.
See LICENSE.