# VerifyWise LLM Evaluation Action

Automatically evaluate your LLMs on every pull request.
Gate merges on correctness, faithfulness, hallucination, toxicity, and bias — enforced in CI.


This action connects to a VerifyWise instance to run your models against curated test datasets and block merges when quality drops below your standards.


## Why Use This

- **Catch regressions before they ship.** Prompt changes, fine-tuning updates, or model swaps can silently degrade output quality. This action measures it.
- **Enforce quality gates.** Set pass/fail thresholds per metric. If correctness drops below 70% or hallucination rises above 30%, the PR fails.
- **Works with any model.** OpenAI, Anthropic, Google, Mistral, xAI, or self-hosted — if VerifyWise can talk to it, this action can evaluate it.
- **Full visibility.** Results are posted as PR comments, uploaded as build artifacts, and stored in your VerifyWise dashboard for trend tracking.

## Quick Start

Add this step to any GitHub Actions workflow:

```yaml
- uses: verifywise-ai/verifywise-eval-action@v1
  with:
    api_url: https://app.verifywise.ai
    project_id: proj_abc
    dataset_id: '2'
    metrics: 'correctness,faithfulness,hallucination'
    model_name: gpt-4o-mini
    model_provider: openai
    vw_api_token: ${{ secrets.VW_API_TOKEN }}
    llm_api_key: ${{ secrets.LLM_API_KEY }}
```

The action will:

  1. Create an evaluation experiment on your VerifyWise instance
  2. Run your model against the specified dataset using an LLM judge
  3. Wait for results (polling automatically)
  4. Fail the step if any metric is below threshold
  5. Upload structured JSON results and a Markdown summary as build artifacts
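
The five steps above can be sketched in plain Python. Note that `create_experiment`, `get_status`, and `get_scores` are hypothetical stand-ins for the VerifyWise API calls, not the real endpoints:

```python
import time

# Hypothetical stand-ins for the VerifyWise API calls the action makes;
# these names are illustrative, not the real endpoints.
def create_experiment():
    return "exp_123"

_STATUSES = iter(["running", "running", "completed"])

def get_status(experiment_id):
    return next(_STATUSES)

def get_scores(experiment_id):
    return {"correctness": 0.82, "faithfulness": 0.75}

def run_gate(threshold=0.7, poll_interval=0.01, timeout=5.0):
    exp_id = create_experiment()                 # 1. create the experiment
    deadline = time.monotonic() + timeout
    while get_status(exp_id) != "completed":     # 2-3. run the model, poll for results
        if time.monotonic() > deadline:
            raise TimeoutError("evaluation did not finish in time")
        time.sleep(poll_interval)
    scores = get_scores(exp_id)
    failed = [m for m, s in scores.items() if s < threshold]  # 4. fail below threshold
    return exp_id, scores, failed                # 5. caller writes JSON/Markdown artifacts

exp_id, scores, failed = run_gate()
print(exp_id, failed)  # exp_123 []
```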

## Run on Every Pull Request

Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR will be evaluated and results posted as a comment automatically.

```yaml
# .github/workflows/llm-eval.yml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  eval:
    name: Evaluate LLM
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://app.verifywise.ai
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,faithfulness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          threshold: '0.7'
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
```

The action automatically:

- Runs the evaluation and waits for results
- Posts a summary comment on the PR (updates the same comment on re-runs)
- Fails the check if any metric is below threshold
- Compares scores against the previous experiment (shows deltas)
- Uploads JSON results and Markdown summary as build artifacts

Required secrets — add these in your repo's **Settings > Secrets and variables > Actions**:

| Secret | Required | Where to get it |
|---|---|---|
| `VW_API_TOKEN` | yes | VerifyWise dashboard > Settings > API Tokens |
| `LLM_API_KEY` | yes | API key for the model being evaluated (e.g. OpenAI, Anthropic) |
| `JUDGE_API_KEY` | no | API key for the judge LLM. Defaults to `LLM_API_KEY` if not set. Only needed when the model and judge use different providers. |

How it works: The evaluation uses two LLMs — the model generates responses to your prompts, and the judge scores those responses against the selected metrics. If both use the same provider (e.g. both OpenAI), a single LLM_API_KEY is enough. If they use different providers (e.g. evaluating a Claude model with GPT-4o as judge), set JUDGE_API_KEY separately.
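
The key fallback behaves like this two-line sketch (the function name is illustrative, not the action's actual source):

```python
def resolve_judge_key(llm_api_key, judge_api_key=None):
    """JUDGE_API_KEY falls back to LLM_API_KEY when it is not set."""
    return judge_api_key or llm_api_key

# Same provider for model and judge: one key is enough.
print(resolve_judge_key("sk-model-key"))                  # sk-model-key
# Different providers (e.g. Claude model, GPT-4o judge): the judge key wins.
print(resolve_judge_key("sk-model-key", "sk-judge-key"))  # sk-judge-key
```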


## Inputs

| Input | Required | Default | Description |
|---|---|---|---|
| `api_url` | yes | | Base URL of your VerifyWise instance |
| `project_id` | yes | | Project ID (find it in the VerifyWise dashboard) |
| `dataset_id` | yes | | Dataset to evaluate against |
| `metrics` | yes | | Comma-separated metric names (see Metrics) |
| `model_name` | yes | | Model to evaluate (e.g. `gpt-4o-mini`, `claude-3-5-sonnet`) |
| `model_provider` | yes | | `openai`, `anthropic`, `google`, `mistral`, `xai`, or `self-hosted` |
| `vw_api_token` | yes | | VerifyWise API token (store as a repository secret) |
| `llm_api_key` | yes | | API key for the model being evaluated |
| `judge_api_key` | no | (same as `llm_api_key`) | API key for the judge LLM (only needed when model and judge use different providers) |
| `judge_model` | no | `gpt-4o` | LLM used to judge responses |
| `judge_provider` | no | `openai` | Provider for the judge LLM |
| `threshold` | no | `0.7` | Pass/fail threshold (0.0–1.0) |
| `timeout_minutes` | no | `30` | Max minutes to wait for completion |
| `poll_interval_seconds` | no | `15` | Seconds between status checks |
| `experiment_name` | no | (auto) | Custom name for the experiment |
| `fail_on_threshold` | no | `true` | Set to `false` to report without failing |
| `post_pr_comment` | no | `true` | Post results as a comment on the PR |

## Outputs

| Output | Description |
|---|---|
| `passed` | `true` if every metric met its threshold |
| `results_path` | Path to the JSON results file (use in subsequent steps) |
| `summary_path` | Path to the Markdown summary (use for PR comments) |
| `experiment_id` | Experiment ID on VerifyWise (link back to the dashboard) |

## Metrics

| Metric | Category | Direction | What it measures |
|---|---|---|---|
| `answer_relevancy` | Universal | Higher is better | Is the response relevant to what was asked? |
| `correctness` | Universal | Higher is better | Are the answers factually right? |
| `completeness` | Universal | Higher is better | Does the answer cover all parts of the question? |
| `instruction_following` | Universal | Higher is better | Does the response follow the instructions? |
| `hallucination` | Universal | Lower is better | How much of the response is fabricated? |
| `toxicity` | Universal | Lower is better | Does the response contain harmful content? |
| `bias` | Universal | Lower is better | Does the response exhibit unfair bias? |
| `faithfulness` | RAG | Higher is better | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Higher is better | Is the retrieved context relevant? |
| `context_precision` | RAG | Higher is better | Is the retrieved context precise? |
| `context_recall` | RAG | Higher is better | Was all relevant context retrieved? |
| `tool_correctness` | Agent | Higher is better | Are the right tools selected? |
| `argument_correctness` | Agent | Higher is better | Are tool arguments correct? |
| `task_completion` | Agent | Higher is better | Is the overall task completed? |
| `step_efficiency` | Agent | Higher is better | Are steps efficient (no redundancy)? |
| `plan_quality` | Agent | Higher is better | Is the execution plan well-structured? |
| `plan_adherence` | Agent | Higher is better | Does execution follow the plan? |

**How thresholds work:** For standard metrics (higher is better), the score must be at or above the threshold to pass. For inverted metrics (lower is better), the threshold is flipped: the score must be at or below `1 − threshold` to pass. A threshold of 0.7 therefore means "70% correct is the minimum" for standard metrics and "30% hallucination is the maximum" for inverted ones.
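
That rule can be written as a small predicate. This is a sketch of the documented behavior, not the action's source; the metric names come from the table above:

```python
# Lower-is-better metrics from the table above.
INVERTED = {"hallucination", "toxicity", "bias"}

def metric_passes(name: str, score: float, threshold: float = 0.7) -> bool:
    """Standard metrics must score >= threshold; inverted ones <= 1 - threshold."""
    if name in INVERTED:
        return score <= 1 - threshold
    return score >= threshold

print(metric_passes("correctness", 0.72))    # True  (at or above 0.7)
print(metric_passes("hallucination", 0.25))  # True  (at or below 0.3)
print(metric_passes("hallucination", 0.40))  # False (above 0.3)
```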


## Python SDK

This repo also ships a Python SDK for programmatic access to the full VerifyWise API. Use it in custom scripts, Jupyter notebooks, or non-GitHub CI systems.

### Install

```shell
pip install verifywise
```

### Example — CI quality gate in 10 lines

```python
from verifywise import VerifyWiseClient

client = VerifyWiseClient(api_url="https://app.verifywise.ai", token="your-token")

results = client.experiments.run_and_wait(
    project_id="proj_abc",
    name="Nightly Eval",
    model_name="gpt-4o-mini",
    model_provider="openai",
    dataset_id="2",
    metrics=["correctness", "faithfulness", "hallucination"],
    threshold=0.7,
)

assert results.passed, f"Evaluation failed: {[m.name for m in results.metrics if not m.passed]}"
```

### What you can do with the SDK

| Namespace | What it does |
|---|---|
| `client.experiments` | Run evaluations, poll for results, get scores |
| `client.datasets` | Upload test datasets, list built-in presets |
| `client.reports` | Generate and download PDF/HTML evaluation reports |
| `client.arena` | Run head-to-head model comparisons |
| `client.projects` | Create and manage evaluation projects |
| `client.models` | Save and validate model configurations |
| `client.scorers` | Build custom LLM-as-judge scoring functions |
| `client.metrics` | Query available metrics and historical aggregates |
| `client.bias_audits` | Run fairness and bias evaluations on tabular data |
| `client.orgs` | Manage organizations |
| `client.logs` | Query per-prompt evaluation logs |

### Error handling

All exceptions inherit from `VerifyWiseError` and include `status_code` and `response_body`:

| Exception | When |
|---|---|
| `AuthenticationError` | Invalid, expired, or missing token (401/403) |
| `NotFoundError` | Resource doesn't exist (404) |
| `ValidationError` | Bad request body or parameters (400/422) |
| `ServerError` | Server-side failure (5xx) |
| `TimeoutError` | Polling exceeded the configured timeout |
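
The hierarchy maps onto ordinary try/except handling. The classes are stubbed below so the snippet is self-contained; in real code you would import them from `verifywise`, and `fetch_experiment` is a hypothetical call:

```python
# Stub hierarchy mirroring the documented exceptions (normally imported from verifywise).
class VerifyWiseError(Exception):
    def __init__(self, message, status_code=None, response_body=None):
        super().__init__(message)
        self.status_code = status_code
        self.response_body = response_body

class AuthenticationError(VerifyWiseError): ...
class NotFoundError(VerifyWiseError): ...

def fetch_experiment(experiment_id):
    # Hypothetical call that fails with a 404 for an unknown ID.
    raise NotFoundError("experiment not found", status_code=404, response_body="{}")

try:
    fetch_experiment("exp_missing")
except AuthenticationError:
    outcome = "rotate the token"
except VerifyWiseError as e:  # catches NotFoundError and every other SDK error
    outcome = f"failed with HTTP {e.status_code}"

print(outcome)  # failed with HTTP 404
```

Catching the base class is usually enough in CI scripts; catch the specific subclasses only when you want different recovery paths.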

Full SDK reference: verifywise-python-sdk.md


## CLI

The `verifywise` command lets you run evaluations and manage resources from the terminal.

```shell
export VW_API_URL=https://app.verifywise.ai
export VW_API_TOKEN=your-token

verifywise projects list
verifywise experiments run \
  --project-id proj_abc --name "Quick check" \
  --model-name gpt-4o-mini --model-provider openai \
  --dataset-id 2 --metrics correctness,faithfulness --threshold 0.7

verifywise experiments list --json       # machine-readable output
verifywise reports generate --experiments exp1,exp2 --format pdf
verifywise config                        # verify your setup
```

The `experiments run` command exits with code 1 on threshold failure — drop it into any CI script.

| Command | Subcommands |
|---|---|
| `config` | Show API URL, masked token, SDK version |
| `projects` | `list` `get` `create` `delete` `stats` |
| `experiments` | `list` `get` `delete` `run` |
| `datasets` | `list` `list-builtin` `upload` `read` |
| `reports` | `list` `generate` `download` |
| `metrics` | `list` `aggregates` |
| `models` | `list` `create` `delete` `validate` |
| `scorers` | `list` |
| `logs` | `list` |

## Using Outside GitHub Actions

The bundled `ci_eval_runner.py` script works in any CI system. It only requires `requests`:

```shell
pip install requests

python ci_eval_runner.py \
  --api-url "$VW_API_URL" --token "$VW_API_TOKEN" \
  --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID" \
  --metrics "correctness,faithfulness" \
  --model-name "gpt-4o-mini" --model-provider "openai" \
  --threshold 0.7 \
  --output results.json --markdown-output summary.md
```

| Exit code | Meaning |
|---|---|
| 0 | All metrics passed |
| 1 | One or more metrics below threshold |
| 2 | Error (timeout, API failure, bad config) |
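
A CI wrapper can branch on those exit codes directly. Here is a minimal Python sketch; the `sys.executable -c ...` command is a stand-in for the real `ci_eval_runner.py` invocation:

```python
import subprocess
import sys

# Stand-in for: python ci_eval_runner.py ...  (exits 1 to simulate a threshold failure).
result = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])

# Documented exit codes of ci_eval_runner.py.
MEANINGS = {0: "all metrics passed", 1: "threshold failure", 2: "error"}
status = MEANINGS.get(result.returncode, "error")
print(status)  # threshold failure
```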

## Development

```shell
cd sdk
pip install -e .
python -m pytest tests/ -v   # 93 tests, ~0.2s
```

### Repository structure

```
verifywise-eval-action/
├── action.yml              # GitHub Action (Marketplace entry point)
├── ci_eval_runner.py       # Standalone CI script
├── README.md
├── LICENSE
├── .github/workflows/
│   └── test.yml            # CI: Python 3.9–3.13 matrix
└── sdk/
    ├── pyproject.toml      # Package: pip install verifywise
    ├── src/verifywise/     # SDK source (15 modules)
    └── tests/              # 93 tests (SDK + CLI)
```

## About VerifyWise

VerifyWise is an open-source AI governance platform that helps teams comply with the EU AI Act, ISO 42001, NIST AI RMF, and more. The evaluation engine is powered by DeepEval and supports 20+ LLM quality metrics out of the box.

## License

Apache-2.0 — see LICENSE.
