Automatically evaluate your LLMs on every pull request.
Gate merges on correctness, faithfulness, hallucination, toxicity, and bias — enforced in CI.
This action connects to a VerifyWise instance to run your models against curated test datasets and block merges when quality drops below your standards.
- Catch regressions before they ship. Prompt changes, fine-tuning updates, or model swaps can silently degrade output quality. This action measures it.
- Enforce quality gates. Set pass/fail thresholds per metric. If correctness drops below 70% or hallucination rises above 30%, the PR fails.
- Works with any model. OpenAI, Anthropic, Google, Mistral, xAI, or self-hosted — if VerifyWise can talk to it, this action can evaluate it.
- Full visibility. Results are posted as PR comments, uploaded as build artifacts, and stored in your VerifyWise dashboard for trend tracking.
Add this step to any GitHub Actions workflow:
```yaml
- uses: verifywise-ai/verifywise-eval-action@v1
  with:
    api_url: https://app.verifywise.ai
    project_id: proj_abc
    dataset_id: '2'
    metrics: 'correctness,faithfulness,hallucination'
    model_name: gpt-4o-mini
    model_provider: openai
    vw_api_token: ${{ secrets.VW_API_TOKEN }}
    llm_api_key: ${{ secrets.LLM_API_KEY }}
```

The action will:
- Create an evaluation experiment on your VerifyWise instance
- Run your model against the specified dataset using an LLM judge
- Wait for results (polling automatically)
- Fail the step if any metric is below threshold
- Upload structured JSON results and a Markdown summary as build artifacts
Copy this file to .github/workflows/llm-eval.yml in your repository. That's it — every PR will be evaluated and results posted as a comment automatically.
```yaml
# .github/workflows/llm-eval.yml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  eval:
    name: Evaluate LLM
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://app.verifywise.ai
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,faithfulness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          threshold: '0.7'
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
```

The action automatically:
- Runs the evaluation and waits for results
- Posts a summary comment on the PR (updates the same comment on re-runs)
- Fails the check if any metric is below threshold
- Compares scores against the previous experiment (shows deltas)
- Uploads JSON results and Markdown summary as build artifacts
Required secrets — add these in your repo's Settings > Secrets and variables > Actions:
| Secret | Required | Where to get it |
|---|---|---|
| `VW_API_TOKEN` | yes | VerifyWise dashboard > Settings > API Tokens |
| `LLM_API_KEY` | yes | API key for the model being evaluated (e.g. OpenAI, Anthropic) |
| `JUDGE_API_KEY` | no | API key for the judge LLM. Defaults to `LLM_API_KEY` if not set. Only needed when the model and judge use different providers. |
How it works: The evaluation uses two LLMs — the model generates responses to your prompts, and the judge scores those responses against the selected metrics. If both use the same provider (e.g. both OpenAI), a single `LLM_API_KEY` is enough. If they use different providers (e.g. evaluating a Claude model with GPT-4o as judge), set `JUDGE_API_KEY` separately.
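As a sketch, a cross-provider setup might look like the step below. The model name and secret names (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`) are illustrative, not requirements:

```yaml
# Illustrative only: evaluate an Anthropic model while judging with the
# action's default judge (gpt-4o via OpenAI), so two keys are needed.
- uses: verifywise-ai/verifywise-eval-action@v1
  with:
    api_url: https://app.verifywise.ai
    project_id: proj_abc
    dataset_id: '2'
    metrics: correctness,faithfulness
    model_name: claude-3-5-sonnet
    model_provider: anthropic
    vw_api_token: ${{ secrets.VW_API_TOKEN }}
    llm_api_key: ${{ secrets.ANTHROPIC_API_KEY }}   # key for the evaluated model
    judge_api_key: ${{ secrets.OPENAI_API_KEY }}    # key for the judge LLM
```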
| Input | Required | Default | Description |
|---|---|---|---|
| `api_url` | yes | — | Base URL of your VerifyWise instance |
| `project_id` | yes | — | Project ID (find it in the VerifyWise dashboard) |
| `dataset_id` | yes | — | Dataset to evaluate against |
| `metrics` | yes | — | Comma-separated metric names (see Metrics) |
| `model_name` | yes | — | Model to evaluate (e.g. `gpt-4o-mini`, `claude-3-5-sonnet`) |
| `model_provider` | yes | — | `openai`, `anthropic`, `google`, `mistral`, `xai`, or `self-hosted` |
| `vw_api_token` | yes | — | VerifyWise API token (store as a repository secret) |
| `llm_api_key` | yes | — | API key for the model being evaluated |
| `judge_api_key` | no | (same as `llm_api_key`) | API key for the judge LLM (only needed when model and judge use different providers) |
| `judge_model` | no | `gpt-4o` | LLM used to judge responses |
| `judge_provider` | no | `openai` | Provider for the judge LLM |
| `threshold` | no | `0.7` | Pass/fail threshold (0.0–1.0) |
| `timeout_minutes` | no | `30` | Max minutes to wait for completion |
| `poll_interval_seconds` | no | `15` | Seconds between status checks |
| `experiment_name` | no | (auto) | Custom name for the experiment |
| `fail_on_threshold` | no | `true` | Set to `false` to report without failing |
| `post_pr_comment` | no | `true` | Post results as a comment on the PR |
| Output | Description |
|---|---|
| `passed` | `true` if every metric met its threshold |
| `results_path` | Path to the JSON results file (use in subsequent steps) |
| `summary_path` | Path to the Markdown summary (use for PR comments) |
| `experiment_id` | Experiment ID on VerifyWise (link back to the dashboard) |
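As a sketch, later steps in the same job can read these outputs by giving the action step an `id`; the step id `eval` below is illustrative:

```yaml
# Illustrative only: reference the action's outputs via the steps context.
- name: Run evaluation
  id: eval
  uses: verifywise-ai/verifywise-eval-action@v1
  # with: ...inputs as shown above...

- name: Inspect results
  if: always()   # run even when the quality gate fails
  run: |
    echo "Passed: ${{ steps.eval.outputs.passed }}"
    echo "Experiment: ${{ steps.eval.outputs.experiment_id }}"
    cat "${{ steps.eval.outputs.results_path }}"
```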
| Metric | Category | Direction | What it measures |
|---|---|---|---|
| `answer_relevancy` | Universal | Higher is better | Is the response relevant to what was asked? |
| `correctness` | Universal | Higher is better | Are the answers factually right? |
| `completeness` | Universal | Higher is better | Does the answer cover all parts of the question? |
| `instruction_following` | Universal | Higher is better | Does the response follow the instructions? |
| `hallucination` | Universal | Lower is better | How much of the response is fabricated? |
| `toxicity` | Universal | Lower is better | Does the response contain harmful content? |
| `bias` | Universal | Lower is better | Does the response exhibit unfair bias? |
| `faithfulness` | RAG | Higher is better | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Higher is better | Is the retrieved context relevant? |
| `context_precision` | RAG | Higher is better | Is the retrieved context precise? |
| `context_recall` | RAG | Higher is better | Was all relevant context retrieved? |
| `tool_correctness` | Agent | Higher is better | Are the right tools selected? |
| `argument_correctness` | Agent | Higher is better | Are tool arguments correct? |
| `task_completion` | Agent | Higher is better | Is the overall task completed? |
| `step_efficiency` | Agent | Higher is better | Are steps efficient (no redundancy)? |
| `plan_quality` | Agent | Higher is better | Is the execution plan well-structured? |
| `plan_adherence` | Agent | Higher is better | Does execution follow the plan? |
How thresholds work: For standard metrics (higher is better), the score must be at or above the threshold to pass. For inverted metrics (lower is better), the score must be at or below the complement of the threshold (1 − threshold) to pass. A threshold of 0.7 therefore means "70% correct is the minimum" for standard metrics, or "30% hallucination is the maximum" for inverted ones.
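Following the worked example above (threshold 0.7 means at most 30% hallucination), the pass/fail rule can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the action's actual source:

```python
# Sketch of the documented threshold rule: standard metrics pass at or
# above the threshold; inverted ("lower is better") metrics pass at or
# below the complement (1 - threshold).
INVERTED = {"hallucination", "toxicity", "bias"}

def metric_passes(name: str, score: float, threshold: float = 0.7) -> bool:
    if name in INVERTED:
        return score <= 1.0 - threshold
    return score >= threshold

# With the default threshold of 0.7, correctness 0.72 passes while
# hallucination 0.35 fails (the maximum allowed is 0.3).
```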
This repo also ships a Python SDK for programmatic access to the full VerifyWise API. Use it in custom scripts, Jupyter notebooks, or non-GitHub CI systems.
```bash
pip install verifywise
```

```python
from verifywise import VerifyWiseClient

client = VerifyWiseClient(api_url="https://app.verifywise.ai", token="your-token")

results = client.experiments.run_and_wait(
    project_id="proj_abc",
    name="Nightly Eval",
    model_name="gpt-4o-mini",
    model_provider="openai",
    dataset_id="2",
    metrics=["correctness", "faithfulness", "hallucination"],
    threshold=0.7,
)

assert results.passed, f"Evaluation failed: {[m.name for m in results.metrics if not m.passed]}"
```

| Namespace | What it does |
|---|---|
| `client.experiments` | Run evaluations, poll for results, get scores |
| `client.datasets` | Upload test datasets, list built-in presets |
| `client.reports` | Generate and download PDF/HTML evaluation reports |
| `client.arena` | Run head-to-head model comparisons |
| `client.projects` | Create and manage evaluation projects |
| `client.models` | Save and validate model configurations |
| `client.scorers` | Build custom LLM-as-judge scoring functions |
| `client.metrics` | Query available metrics and historical aggregates |
| `client.bias_audits` | Run fairness and bias evaluations on tabular data |
| `client.orgs` | Manage organizations |
| `client.logs` | Query per-prompt evaluation logs |
All exceptions inherit from `VerifyWiseError` and include `status_code` and `response_body`:

| Exception | When |
|---|---|
| `AuthenticationError` | Invalid, expired, or missing token (401/403) |
| `NotFoundError` | Resource doesn't exist (404) |
| `ValidationError` | Bad request body or parameters (400/422) |
| `ServerError` | Server-side failure (5xx) |
| `TimeoutError` | Polling exceeded the configured timeout |
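Because every exception derives from `VerifyWiseError`, a script can catch specific failures first and fall back to the base class. The sketch below uses minimal stand-in classes mirroring the documented hierarchy so it is self-contained; in real code, import the exceptions from the `verifywise` package instead:

```python
# Stand-in classes for illustration only; the real ones live in the
# verifywise package. They mirror the documented shape: a common base
# carrying status_code and response_body, with specific subclasses.
class VerifyWiseError(Exception):
    def __init__(self, message, status_code=None, response_body=None):
        super().__init__(message)
        self.status_code = status_code
        self.response_body = response_body

class AuthenticationError(VerifyWiseError): ...
class NotFoundError(VerifyWiseError): ...

def run_safely(call):
    """Run an SDK call, turning documented errors into CI-friendly messages."""
    try:
        return call()
    except AuthenticationError as e:
        return f"auth failed (HTTP {e.status_code}): check VW_API_TOKEN"
    except VerifyWiseError as e:  # catch-all for the rest of the hierarchy
        return f"VerifyWise error (HTTP {e.status_code}): {e.response_body}"
```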
Full SDK reference: verifywise-python-sdk.md
The `verifywise` command lets you run evaluations and manage resources from the terminal.

```bash
export VW_API_URL=https://app.verifywise.ai
export VW_API_TOKEN=your-token

verifywise projects list

verifywise experiments run \
  --project-id proj_abc --name "Quick check" \
  --model-name gpt-4o-mini --model-provider openai \
  --dataset-id 2 --metrics correctness,faithfulness --threshold 0.7

verifywise experiments list --json   # machine-readable output

verifywise reports generate --experiments exp1,exp2 --format pdf

verifywise config   # verify your setup
```

The `experiments run` command exits with code 1 on threshold failure — drop it into any CI script.
| Command | Subcommands |
|---|---|
| `config` | Show API URL, masked token, SDK version |
| `projects` | `list` `get` `create` `delete` `stats` |
| `experiments` | `list` `get` `delete` `run` |
| `datasets` | `list` `list-builtin` `upload` `read` |
| `reports` | `list` `generate` `download` |
| `metrics` | `list` `aggregates` |
| `models` | `list` `create` `delete` `validate` |
| `scorers` | `list` |
| `logs` | `list` |
The bundled `ci_eval_runner.py` script works in any CI system. Its only dependency is `requests`:

```bash
pip install requests

python ci_eval_runner.py \
  --api-url "$VW_API_URL" --token "$VW_API_TOKEN" \
  --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID" \
  --metrics "correctness,faithfulness" \
  --model-name "gpt-4o-mini" --model-provider "openai" \
  --threshold 0.7 \
  --output results.json --markdown-output summary.md
```

| Exit code | Meaning |
|---|---|
| `0` | All metrics passed |
| `1` | One or more metrics below threshold |
| `2` | Error (timeout, API failure, bad config) |
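As a sketch, these exit codes can drive messaging in a plain shell pipeline. The function name below is illustrative; call it with `$?` immediately after running the script:

```shell
# Illustrative helper: map the runner's documented exit codes to messages.
# Usage: python ci_eval_runner.py ...; handle_eval_status "$?"
handle_eval_status() {
  case "$1" in
    0) echo "all metrics passed" ;;
    1) echo "quality gate failed: a metric missed its threshold" ;;
    2) echo "runner error: timeout, API failure, or bad config" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```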
```bash
cd sdk
pip install -e .
python -m pytest tests/ -v   # 93 tests, ~0.2s
```

```text
verifywise-eval-action/
├── action.yml              # GitHub Action (Marketplace entry point)
├── ci_eval_runner.py       # Standalone CI script
├── README.md
├── LICENSE
├── .github/workflows/
│   └── test.yml            # CI: Python 3.9–3.13 matrix
└── sdk/
    ├── pyproject.toml      # Package: pip install verifywise
    ├── src/verifywise/     # SDK source (15 modules)
    └── tests/              # 93 tests (SDK + CLI)
```
VerifyWise is an open-source AI governance platform that helps teams comply with the EU AI Act, ISO 42001, NIST AI RMF, and more. The evaluation engine is powered by DeepEval and supports 20+ LLM quality metrics out of the box.
- Platform: github.com/verifywise-ai/verifywise
- Docs: docs.verifywise.ai
Apache-2.0 — see LICENSE.