Automatically evaluate your LLMs on every pull request.
Gate merges on correctness, faithfulness, hallucination, toxicity, and bias — enforced in CI.
This action connects to a VerifyWise instance to run your models against curated test datasets and block merges when quality drops below your standards.
- Catch regressions before they ship. Prompt changes, fine-tuning updates, or model swaps can silently degrade output quality. This action measures it.
- Enforce quality gates. Set pass/fail thresholds per metric. If correctness drops below 70% or hallucination rises above 30%, the PR fails.
- Works with any model. OpenAI, Anthropic, Google, Mistral, xAI, or self-hosted — if VerifyWise can talk to it, this action can evaluate it.
- Full visibility. Results are posted as PR comments, uploaded as build artifacts, and stored in your VerifyWise dashboard for trend tracking.
Add this step to any GitHub Actions workflow:
```yaml
- uses: verifywise-ai/verifywise-eval-action@v1
  with:
    api_url: https://app.verifywise.ai
    project_id: proj_abc
    dataset_id: '2'
    metrics: 'correctness,faithfulness,hallucination'
    model_name: gpt-4o-mini
    model_provider: openai
    vw_api_token: ${{ secrets.VW_API_TOKEN }}
    llm_api_key: ${{ secrets.LLM_API_KEY }}
```

The action will:
- Create an evaluation experiment on your VerifyWise instance
- Run your model against the specified dataset using an LLM judge
- Wait for results (polling automatically)
- Fail the step if any metric is below threshold
- Upload structured JSON results and a Markdown summary as build artifacts
Copy this file to .github/workflows/llm-eval.yml in your repository. That's it — every PR will be evaluated and results posted as a comment automatically.
```yaml
# .github/workflows/llm-eval.yml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  eval:
    name: Evaluate LLM
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://app.verifywise.ai
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,faithfulness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          threshold: '0.7'
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
```

The action automatically:
- Runs the evaluation and waits for results
- Posts a summary comment on the PR (updates the same comment on re-runs)
- Fails the check if any metric is below threshold
- Compares scores against the previous experiment (shows deltas)
- Uploads JSON results and Markdown summary as build artifacts
Required secrets — add these in your repo's Settings > Secrets and variables > Actions:
| Secret | Required | Where to get it |
|---|---|---|
| `VW_API_TOKEN` | yes | VerifyWise dashboard > Settings > API Tokens |
| `LLM_API_KEY` | yes | API key for the model being evaluated (e.g. OpenAI, Anthropic) |
| `JUDGE_API_KEY` | no | API key for the judge LLM. Defaults to `LLM_API_KEY` if not set. Only needed when the model and judge use different providers. |
How it works: The evaluation uses two LLMs — the model generates responses to your prompts, and the judge scores those responses against the selected metrics. If both use the same provider (e.g. both OpenAI), a single `LLM_API_KEY` is enough. If they use different providers (e.g. evaluating a Claude model with GPT-4o as judge), set `JUDGE_API_KEY` separately.
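As a sketch, a cross-provider setup might look like the step below. The model name and secret names (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`) are illustrative, not requirements:

```yaml
# Illustrative only: evaluate an Anthropic model while judging with the
# action's default judge (gpt-4o via OpenAI), so two keys are needed.
- uses: verifywise-ai/verifywise-eval-action@v1
  with:
    api_url: https://app.verifywise.ai
    project_id: proj_abc
    dataset_id: '2'
    metrics: correctness,faithfulness
    model_name: claude-3-5-sonnet
    model_provider: anthropic
    vw_api_token: ${{ secrets.VW_API_TOKEN }}
    llm_api_key: ${{ secrets.ANTHROPIC_API_KEY }}   # key for the evaluated model
    judge_api_key: ${{ secrets.OPENAI_API_KEY }}    # key for the judge LLM
```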
| Input | Required | Default | Description |
|---|---|---|---|
| `api_url` | yes | — | Base URL of your VerifyWise instance |
| `project_id` | yes | — | Project ID (find it in the VerifyWise dashboard) |
| `dataset_id` | yes | — | Dataset to evaluate against |
| `metrics` | yes | — | Comma-separated metric names (see Metrics) |
| `model_name` | yes | — | Model to evaluate (e.g. `gpt-4o-mini`, `claude-3-5-sonnet`) |
| `model_provider` | yes | — | `openai`, `anthropic`, `google`, `mistral`, `xai`, or `self-hosted` |
| `vw_api_token` | yes | — | VerifyWise API token (store as a repository secret) |
| `llm_api_key` | yes | — | API key for the model being evaluated |
| `judge_api_key` | no | (same as `llm_api_key`) | API key for the judge LLM (only needed when model and judge use different providers) |
| `judge_model` | no | `gpt-4o` | LLM used to judge responses |
| `judge_provider` | no | `openai` | Provider for the judge LLM |
| `threshold` | no | `0.7` | Pass/fail threshold (0.0–1.0) |
| `timeout_minutes` | no | `30` | Max minutes to wait for completion |
| `poll_interval_seconds` | no | `15` | Seconds between status checks |
| `experiment_name` | no | (auto) | Custom name for the experiment |
| `fail_on_threshold` | no | `true` | Set to `false` to report without failing |
| `post_pr_comment` | no | `true` | Post results as a comment on the PR |
| Output | Description |
|---|---|
| `passed` | `true` if every metric met its threshold |
| `results_path` | Path to the JSON results file (use in subsequent steps) |
| `summary_path` | Path to the Markdown summary (use for PR comments) |
| `experiment_id` | Experiment ID on VerifyWise (link back to the dashboard) |
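As a sketch, later steps in the same job can read these outputs by giving the action step an `id`; the step id `eval` below is illustrative:

```yaml
# Illustrative only: reference the action's outputs via the steps context.
- name: Run evaluation
  id: eval
  uses: verifywise-ai/verifywise-eval-action@v1
  # with: ...inputs as shown above...

- name: Inspect results
  if: always()   # run even when the quality gate fails
  run: |
    echo "Passed: ${{ steps.eval.outputs.passed }}"
    echo "Experiment: ${{ steps.eval.outputs.experiment_id }}"
    cat "${{ steps.eval.outputs.results_path }}"
```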
| Metric | Category | Direction | What it measures |
|---|---|---|---|
| `answer_relevancy` | Universal | Higher is better | Is the response relevant to what was asked? |
| `correctness` | Universal | Higher is better | Are the answers factually right? |
| `completeness` | Universal | Higher is better | Does the answer cover all parts of the question? |
| `instruction_following` | Universal | Higher is better | Does the response follow the instructions? |
| `hallucination` | Universal | Lower is better | How much of the response is fabricated? |
| `toxicity` | Universal | Lower is better | Does the response contain harmful content? |
| `bias` | Universal | Lower is better | Does the response exhibit unfair bias? |
| `faithfulness` | RAG | Higher is better | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Higher is better | Is the retrieved context relevant? |
| `context_precision` | RAG | Higher is better | Is the retrieved context precise? |
| `context_recall` | RAG | Higher is better | Was all relevant context retrieved? |
| `tool_correctness` | Agent | Higher is better | Are the right tools selected? |
| `argument_correctness` | Agent | Higher is better | Are tool arguments correct? |
| `task_completion` | Agent | Higher is better | Is the overall task completed? |
| `step_efficiency` | Agent | Higher is better | Are steps efficient (no redundancy)? |
| `plan_quality` | Agent | Higher is better | Is the execution plan well-structured? |
| `plan_adherence` | Agent | Higher is better | Does execution follow the plan? |
How thresholds work: For standard metrics (higher is better), the score must be at or above the threshold to pass. For inverted metrics (lower is better), the score must be at or below the complement of the threshold (1 − threshold) to pass. A threshold of 0.7 therefore means "70% correct is the minimum" for standard metrics, or "30% hallucination is the maximum" for inverted ones.
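Following the worked example above (threshold 0.7 means at most 30% hallucination), the pass/fail rule can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the action's actual source:

```python
# Sketch of the documented threshold rule: standard metrics pass at or
# above the threshold; inverted ("lower is better") metrics pass at or
# below the complement (1 - threshold).
INVERTED = {"hallucination", "toxicity", "bias"}

def metric_passes(name: str, score: float, threshold: float = 0.7) -> bool:
    if name in INVERTED:
        return score <= 1.0 - threshold
    return score >= threshold

# With the default threshold of 0.7, correctness 0.72 passes while
# hallucination 0.35 fails (the maximum allowed is 0.3).
```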
This repo also ships a Python SDK for programmatic access to the full VerifyWise API. Use it in custom scripts, Jupyter notebooks, or non-GitHub CI systems.
```bash
pip install verifywise
```

```python
from verifywise import VerifyWiseClient

client = VerifyWiseClient(api_url="https://app.verifywise.ai", token="your-token")

results = client.experiments.run_and_wait(
    project_id="proj_abc",
    name="Nightly Eval",
    model_name="gpt-4o-mini",
    model_provider="openai",
    dataset_id="2",
    metrics=["correctness", "faithfulness", "hallucination"],
    threshold=0.7,
)

assert results.passed, f"Evaluation failed: {[m.name for m in results.metrics if not m.passed]}"
```

| Namespace | What it does |
|---|---|
| `client.experiments` | Run evaluations, poll for results, get scores |
| `client.datasets` | Upload test datasets, list built-in presets |
| `client.reports` | Generate and download PDF/HTML evaluation reports |
| `client.arena` | Run head-to-head model comparisons |
| `client.projects` | Create and manage evaluation projects |
| `client.models` | Save and validate model configurations |
| `client.scorers` | Build custom LLM-as-judge scoring functions |
| `client.metrics` | Query available metrics and historical aggregates |
| `client.bias_audits` | Run fairness and bias evaluations on tabular data |
| `client.orgs` | Manage organizations |
| `client.logs` | Query per-prompt evaluation logs |
All exceptions inherit from `VerifyWiseError` and include `status_code` and `response_body`:

| Exception | When |
|---|---|
| `AuthenticationError` | Invalid, expired, or missing token (401/403) |
| `NotFoundError` | Resource doesn't exist (404) |
| `ValidationError` | Bad request body or parameters (400/422) |
| `ServerError` | Server-side failure (5xx) |
| `TimeoutError` | Polling exceeded the configured timeout |
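Because every exception derives from `VerifyWiseError`, a script can catch specific failures first and fall back to the base class. The sketch below uses minimal stand-in classes mirroring the documented hierarchy so it is self-contained; in real code, import the exceptions from the `verifywise` package instead:

```python
# Stand-in classes for illustration only; the real ones live in the
# verifywise package. They mirror the documented shape: a common base
# carrying status_code and response_body, with specific subclasses.
class VerifyWiseError(Exception):
    def __init__(self, message, status_code=None, response_body=None):
        super().__init__(message)
        self.status_code = status_code
        self.response_body = response_body

class AuthenticationError(VerifyWiseError): ...
class NotFoundError(VerifyWiseError): ...

def run_safely(call):
    """Run an SDK call, turning documented errors into CI-friendly messages."""
    try:
        return call()
    except AuthenticationError as e:
        return f"auth failed (HTTP {e.status_code}): check VW_API_TOKEN"
    except VerifyWiseError as e:  # catch-all for the rest of the hierarchy
        return f"VerifyWise error (HTTP {e.status_code}): {e.response_body}"
```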
Full SDK reference: verifywise-python-sdk.md
The `verifywise` command lets you run evaluations and manage resources from the terminal.

```bash
export VW_API_URL=https://app.verifywise.ai
export VW_API_TOKEN=your-token

verifywise projects list

verifywise experiments run \
  --project-id proj_abc --name "Quick check" \
  --model-name gpt-4o-mini --model-provider openai \
  --dataset-id 2 --metrics correctness,faithfulness --threshold 0.7

verifywise experiments list --json   # machine-readable output

verifywise reports generate --experiments exp1,exp2 --format pdf

verifywise config   # verify your setup
```

The `experiments run` command exits with code 1 on threshold failure — drop it into any CI script.
| Command | Subcommands |
|---|---|
| `config` | Show API URL, masked token, SDK version |
| `projects` | `list` `get` `create` `delete` `stats` |
| `experiments` | `list` `get` `delete` `run` |
| `datasets` | `list` `list-builtin` `upload` `read` |
| `reports` | `list` `generate` `download` |
| `metrics` | `list` `aggregates` |
| `models` | `list` `create` `delete` `validate` |
| `scorers` | `list` |
| `logs` | `list` |
The bundled `ci_eval_runner.py` script works in any CI system. Its only dependency is `requests`:

```bash
pip install requests

python ci_eval_runner.py \
  --api-url "$VW_API_URL" --token "$VW_API_TOKEN" \
  --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID" \
  --metrics "correctness,faithfulness" \
  --model-name "gpt-4o-mini" --model-provider "openai" \
  --threshold 0.7 \
  --output results.json --markdown-output summary.md
```

| Exit code | Meaning |
|---|---|
| `0` | All metrics passed |
| `1` | One or more metrics below threshold |
| `2` | Error (timeout, API failure, bad config) |
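As a sketch, these exit codes can drive messaging in a plain shell pipeline. The function name below is illustrative; call it with `$?` immediately after running the script:

```shell
# Illustrative helper: map the runner's documented exit codes to messages.
# Usage: python ci_eval_runner.py ...; handle_eval_status "$?"
handle_eval_status() {
  case "$1" in
    0) echo "all metrics passed" ;;
    1) echo "quality gate failed: a metric missed its threshold" ;;
    2) echo "runner error: timeout, API failure, or bad config" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```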
```bash
cd sdk
pip install -e .
python -m pytest tests/ -v   # 93 tests, ~0.2s
```

```text
verifywise-eval-action/
├── action.yml              # GitHub Action (Marketplace entry point)
├── ci_eval_runner.py       # Standalone CI script
├── README.md
├── LICENSE
├── .github/workflows/
│   └── test.yml            # CI: Python 3.9–3.13 matrix
└── sdk/
    ├── pyproject.toml      # Package: pip install verifywise
    ├── src/verifywise/     # SDK source (15 modules)
    └── tests/              # 93 tests (SDK + CLI)
```
VerifyWise is an open-source AI governance platform that helps teams comply with the EU AI Act, ISO 42001, NIST AI RMF, and more. The evaluation engine is powered by DeepEval and supports 20+ LLM quality metrics out of the box.
- Platform: github.com/verifywise-ai/verifywise
- Docs: docs.verifywise.ai
Apache-2.0 — see LICENSE.