Prevent silent AI failures before they hit production
LLM Eval Guard is a lightweight, production-minded evaluation framework that detects quality regressions in LLM outputs when prompts, datasets, or models change.
In real products, prompts evolve, models update, and teams "optimize" responses — but most teams don't test LLM outputs like code. Failures slip into production silently.
This project answers one question: Did this change make the AI worse?
| Failure Type | Description |
|---|---|
| 🔻 Information Loss | Critical details missing from responses |
| 📉 Over-simplification | Incomplete or shallow answers |
| 🎭 Hallucinations | Fabricated technical details |
| 🔄 Behavioral Drift | Inconsistent behavior after updates |
```
┌─────────────┐
│   Dataset   │
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────────────┐
│                                          │
│  ┌─────────┐        ┌─────────┐          │
│  │Prompt v1│        │Prompt v2│          │
│  └────┬────┘        └────┬────┘          │
│       ▼                  ▼               │
│  ┌─────────┐        ┌─────────┐          │
│  │   LLM   │        │   LLM   │          │
│  └────┬────┘        └────┬────┘          │
│       ▼                  ▼               │
│  ┌──────────┐       ┌──────────┐         │
│  │Validators│       │Validators│         │
│  └────┬─────┘       └────┬─────┘         │
│       ▼                  ▼               │
│  ┌─────────┐        ┌─────────┐          │
│  │ Score: 3│        │ Score: 2│          │
│  └─────────┘        └─────────┘          │
│                                          │
└──────────────────┬───────────────────────┘
                   ▼
          ┌────────────────┐
          │  Regression    │
          │   Detected!    │
          └────────────────┘
```
- ✔️ Runs the same dataset across multiple prompt versions
- ✔️ Evaluates outputs using deterministic validators
- ✔️ Scores responses objectively
- ✔️ Flags regressions with structured JSON reports
- ✔️ Works with cloud + local models (Ollama supported)
The framework validates responses for:
- Minimum completeness — Is the response substantive?
- Required domain keywords — Does it cover expected concepts?
- Refusal / non-answer patterns — Did the model decline to answer?
- Hallucinated entities — Are there ungrounded claims?
- Regression score — Is v2 measurably worse than v1?
Validators are simple and hackable:
```python
# validators/keywords.py
REQUIRED = ["authentication", "authorization", "token"]

def validate(response_text):
    text = response_text.lower()
    matches = sum(1 for keyword in REQUIRED if keyword in text)
    return {
        "passed": matches == len(REQUIRED),
        "score": matches,
        "max_score": len(REQUIRED),
        "missing": [k for k in REQUIRED if k not in text],
    }
```

Example report entry:

```json
{
  "id": 2,
  "regression": true,
  "v1_score": { "score": 3, "max_score": 4 },
  "v2_score": { "score": 2, "max_score": 4 }
}
```

Interpretation: Prompt v2 produced a weaker answer than v1 → regression detected.
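As a minimal sketch of this comparison (the function name and report shape here are illustrative, not the framework's actual API), per-case regression detection can be as simple as checking whether the v2 score fell below the v1 score:

```python
# Hypothetical sketch of per-case regression detection.
def detect_regression(v1_score, v2_score):
    """Flag a regression when v2 scores strictly lower than v1."""
    return v2_score["score"] < v1_score["score"]

entry = {
    "id": 2,
    "v1_score": {"score": 3, "max_score": 4},
    "v2_score": {"score": 2, "max_score": 4},
}
entry["regression"] = detect_regression(entry["v1_score"], entry["v2_score"])
print(entry["regression"])  # → True
```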
- Python
- Deterministic custom validators
- Supported LLM Providers:
- OpenAI
- Google Gemini
- Ollama (local LLMs)
- CI-driven evaluation workflow
```bash
pip install -r requirements.txt
python -m runner.run_eval
```

Reports are saved to:

```
/reports/latest_report.json
```
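In CI, the generated report can also be consumed programmatically. The sketch below assumes the report is a JSON list of entries carrying the `regression` flag shown above; the exact schema may differ:

```python
# Hypothetical report consumer for a CI quality gate (assumed schema).
import json
import sys

def gate(report_path="reports/latest_report.json"):
    """Exit non-zero if any evaluated case regressed."""
    with open(report_path) as f:
        results = json.load(f)
    regressed = [r["id"] for r in results if r.get("regression")]
    if regressed:
        print(f"Regression detected in cases: {regressed}")
        sys.exit(1)
    print("No regressions detected.")
```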
```
├── datasets/
│   └── eval_dataset.json      # Test cases
├── prompts/
│   ├── v1.txt                 # Baseline prompt
│   └── v2.txt                 # Updated prompt
├── validators/
│   ├── length.py              # Length / completeness checks
│   ├── keywords.py            # Domain keyword coverage
│   ├── refusal.py             # Non-answer detection
│   └── hallucination.py       # Ungrounded claim detection
├── runner/
│   └── run_eval.py            # Evaluation runner
└── reports/
    └── latest_report.json     # Generated reports
```
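To make the layout concrete, here is a hedged, minimal sketch of the evaluation loop a runner like `runner/run_eval.py` performs. The `call_llm` stub and the validator interface are assumptions for illustration, not the repo's actual code:

```python
# Hypothetical evaluation loop: run each test case through both prompt
# versions, score the responses with validators, and flag regressions.
def call_llm(prompt, question):
    """Stub standing in for an OpenAI/Gemini/Ollama call."""
    return f"{prompt}\n\n{question}"

def score(response, validators):
    """Sum every validator's score over one response."""
    results = [v(response) for v in validators]
    return {
        "score": sum(r["score"] for r in results),
        "max_score": sum(r["max_score"] for r in results),
    }

def run_eval(dataset, prompt_v1, prompt_v2, validators):
    report = []
    for case in dataset:
        s1 = score(call_llm(prompt_v1, case["question"]), validators)
        s2 = score(call_llm(prompt_v2, case["question"]), validators)
        report.append({
            "id": case["id"],
            "regression": s2["score"] < s1["score"],
            "v1_score": s1,
            "v2_score": s2,
        })
    return report
```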
This repo includes GitHub Actions CI that automatically runs evaluation when:
- Prompts change
- Datasets change
- Validators change
This prevents unnoticed regressions from entering `main`.
CI enforces evaluation discipline without relying on manual review.
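A workflow in this spirit might look like the following sketch; the file path, triggers, and action versions are assumptions, not this repo's actual workflow file:

```yaml
# Hypothetical .github/workflows/eval.yml sketch, not the repo's actual file.
name: eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "datasets/**"
      - "validators/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python -m runner.run_eval
```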
- AI Feature QA — Validate outputs before releases
- Prompt Engineering Quality Gates — Ensure prompt changes don't degrade quality
- Model Upgrade Regression Checks — Test new model versions safely
- Enterprise Reliability Workflows — Build trust in AI systems
- Fail CI if regression score drops below threshold ← Quality gates
- Human review mode for flagged edge cases
- Baseline locking for enterprise audit trails
- Richer validators (JSON/schema validation)
- Scoring dashboards with historical trends
- Multi-model comparison reports
Apache License 2.0
Contributions welcome! Please open an issue or PR.