
Commit 5edd013 (1 parent: b354eaa)

feat: show per-sample failure details and drop emojis

- Parse `detailed_results` from the experiment API to include per-sample input/response/reason in the Markdown summary
- Replace emoji indicators with text (PASS/FAIL) throughout
- Add a "Failure Details" section showing each sample that failed, with the model's response and the judge's reasoning
- Update README with a copy-paste PR workflow and secrets setup guide

Made-with: Cursor

File tree: 3 files changed (+167 −46 lines)

README.md

Lines changed: 27 additions & 15 deletions
````diff
@@ -60,67 +60,79 @@ The action will:
 4. Fail the step if any metric is below threshold
 5. Upload structured JSON results and a Markdown summary as build artifacts
-### Full Workflow — Quality Gate on PRs
+### Run on Every Pull Request
+
+Copy this file to `.github/workflows/llm-eval.yml` in your repository. That's it — every PR against `main` or `develop` will be evaluated automatically.
 
 ```yaml
+# .github/workflows/llm-eval.yml
 name: LLM Quality Gate
+
 on:
   pull_request:
     branches: [main, develop]
 
 jobs:
   eval:
+    name: Evaluate LLM
     runs-on: ubuntu-latest
     permissions:
       pull-requests: write
       contents: read
     steps:
       - uses: actions/checkout@v4
-      - name: Evaluate LLM quality
+      - name: Run evaluation
         id: eval
         uses: verifywise-ai/verifywise-eval-action@v1
         with:
           api_url: https://app.verifywise.ai
           project_id: proj_abc
           dataset_id: '2'
-          metrics: 'correctness,faithfulness,hallucination'
+          metrics: correctness,faithfulness,hallucination
           model_name: gpt-4o-mini
           model_provider: openai
           threshold: '0.7'
-          fail_on_threshold: 'true'
           vw_api_token: ${{ secrets.VW_API_TOKEN }}
           llm_api_key: ${{ secrets.LLM_API_KEY }}
-      - name: Comment results on PR
-        if: github.event_name == 'pull_request' && always()
+      # Optional: post results as a PR comment
+      - name: Comment on PR
+        if: always() && github.event_name == 'pull_request'
         uses: actions/github-script@v7
         with:
           script: |
             const fs = require('fs');
-            const summaryPath = '${{ steps.eval.outputs.summary_path }}';
-            if (!summaryPath || !fs.existsSync(summaryPath)) return;
-            const body = fs.readFileSync(summaryPath, 'utf8');
-            const marker = '<!-- verifywise-eval-results -->';
+            const path = '${{ steps.eval.outputs.summary_path }}';
+            if (!path || !fs.existsSync(path)) return;
+            const body = fs.readFileSync(path, 'utf8');
+            const tag = '<!-- verifywise-eval -->';
             const { data: comments } = await github.rest.issues.listComments({
               owner: context.repo.owner, repo: context.repo.repo,
               issue_number: context.issue.number,
             });
-            const existing = comments.find(c => c.body.includes(marker));
-            const fullBody = `${marker}\n${body}`;
-            if (existing) {
+            const prev = comments.find(c => c.body.includes(tag));
+            const full = `${tag}\n${body}`;
+            if (prev) {
               await github.rest.issues.updateComment({
                 owner: context.repo.owner, repo: context.repo.repo,
-                comment_id: existing.id, body: fullBody,
+                comment_id: prev.id, body: full,
               });
             } else {
               await github.rest.issues.createComment({
                 owner: context.repo.owner, repo: context.repo.repo,
-                issue_number: context.issue.number, body: fullBody,
+                issue_number: context.issue.number, body: full,
               });
             }
 ```
+
+**Required secrets** — add these in your repo's Settings > Secrets and variables > Actions:
+
+| Secret | Where to get it |
+|--------|----------------|
+| `VW_API_TOKEN` | VerifyWise dashboard > Settings > API Tokens |
+| `LLM_API_KEY` | Your LLM provider (OpenAI, Anthropic, etc.) |
+
 ---
 
 ## Inputs
````
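The comment step above uses a hidden HTML marker so each run updates its own previous comment instead of stacking a new one on every push. A minimal sketch of that upsert pattern, with plain dicts standing in for the GitHub API's comment objects (illustrative only, not the real `github.rest.issues` client):

```python
# Sketch of the marker-based PR-comment upsert from the workflow above.
# The dicts below stand in for the listComments payload; fields are made up.
TAG = "<!-- verifywise-eval -->"

def upsert_comment(comments, body):
    """Return ('update', comment) if a tagged comment exists, else ('create', comment)."""
    full = f"{TAG}\n{body}"
    prev = next((c for c in comments if TAG in c["body"]), None)
    if prev:
        return ("update", {**prev, "body": full})
    return ("create", {"body": full})

comments = [
    {"id": 1, "body": "LGTM"},
    {"id": 2, "body": f"{TAG}\nold results"},
]
action, comment = upsert_comment(comments, "## New results")
print(action, comment["id"])  # update 2
```

Because the marker survives in the comment body, re-runs are idempotent: the PR never accumulates more than one results comment.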

action.yml

Lines changed: 18 additions & 12 deletions
```diff
@@ -146,23 +146,25 @@ runs:
           SUMMARY_PATH: ${{ steps.eval.outputs.summary_path }}
           FAIL_ON_THRESHOLD: ${{ inputs.fail_on_threshold }}
         run: |
-          # If no results at all, report error
           if [ ! -f "$RESULTS_PATH" ]; then
-            echo "::error title=VerifyWise Evaluation Failed::No results produced — the evaluation may have failed to start. Check the logs above."
+            echo "::error title=Evaluation Failed::No results produced. The evaluation may have failed to connect to the VerifyWise instance."
             echo "passed=false" >> "$GITHUB_OUTPUT"
-            echo "### ❌ VerifyWise Evaluation Failed" >> "$GITHUB_STEP_SUMMARY"
-            echo "" >> "$GITHUB_STEP_SUMMARY"
-            echo "No results were produced. The evaluation may have failed to connect to the VerifyWise instance." >> "$GITHUB_STEP_SUMMARY"
+            {
+              echo "## VerifyWise Evaluation"
+              echo ""
+              echo "**FAILED** -- No results were produced. The evaluation may have failed to start."
+              echo "Check the logs in the \"Run evaluation\" step for details."
+            } >> "$GITHUB_STEP_SUMMARY"
             [ "$FAIL_ON_THRESHOLD" = "true" ] && exit 1
             exit 0
           fi
 
-          # Write the Markdown summary to the Job Summary (visible on the run page)
+          # Write Markdown summary to Job Summary (shown on the run page)
           if [ -f "$SUMMARY_PATH" ]; then
             cat "$SUMMARY_PATH" >> "$GITHUB_STEP_SUMMARY"
           fi
 
-          # Parse pass/fail and annotate
+          # Parse results, create annotations, set outputs
          python3 << 'PYEOF'
          import json, os
@@ -174,25 +176,29 @@ runs:
          passed = data.get("passed", False)
          metrics = data.get("metrics", [])
+          samples = data.get("samples", [])
          name = data.get("name", "Evaluation")
          model = data.get("model", "unknown")
 
          failing = [m for m in metrics if not m.get("passed")]
          passing = [m for m in metrics if m.get("passed")]
-          # Write outputs
          with open(os.environ["GITHUB_OUTPUT"], "a") as out:
              out.write(f"passed={'true' if passed else 'false'}\n")
 
          if passed:
-              print(f"::notice title=✅ {name} Passed::All {len(metrics)} metrics passed for {model}")
+              print(f"::notice title=All metrics passed::{len(metrics)} metrics passed for {model}")
          else:
              for m in failing:
-                  inv = " (inverted — lower is better)" if m.get("inverted") else ""
-                  print(f"::error title=❌ {m['name']} failed threshold::{m['name']}: scored {m['score']*100:.1f}% against {m['threshold']*100:.0f}% threshold{inv}")
+                  inv = " (inverted -- lower is better)" if m.get("inverted") else ""
+                  print(f"::error title={m['name']} failed threshold::"
+                        f"{m['name']}: scored {m['score']*100:.1f}% "
+                        f"against {m['threshold']*100:.0f}% threshold{inv}")
 
              summary = ", ".join(f"{m['name']}={m['score']*100:.0f}%" for m in failing)
-              print(f"::error title=VerifyWise Evaluation Failed::{len(failing)}/{len(metrics)} metrics below threshold on {model}: {summary}")
+              print(f"::error title=Evaluation Failed::"
+                    f"{len(failing)}/{len(metrics)} metrics below threshold "
+                    f"on {model}: {summary}")
 
          if fail_on:
              raise SystemExit(1)
```
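The failure summary produced by the embedded Python step can be exercised outside Actions. A sketch under the assumption that the results JSON carries the `passed`, `model`, and `metrics` fields shown in the diff (the sample values here are made up):

```python
# Sketch of the failure-summary logic from the action's embedded Python step.
# `data` mirrors the results JSON fields the step reads; values are invented.
data = {
    "passed": False,
    "model": "gpt-4o-mini",
    "metrics": [
        {"name": "correctness", "score": 0.62, "threshold": 0.7, "passed": False},
        {"name": "faithfulness", "score": 0.91, "threshold": 0.7, "passed": True},
    ],
}

# Collect metrics below threshold and build the one-line ::error headline.
failing = [m for m in data["metrics"] if not m.get("passed")]
summary = ", ".join(f"{m['name']}={m['score']*100:.0f}%" for m in failing)
headline = (f"{len(failing)}/{len(data['metrics'])} metrics below threshold "
            f"on {data['model']}: {summary}")
print(headline)  # 1/2 metrics below threshold on gpt-4o-mini: correctness=62%
```

In the real step this headline goes through the `::error title=...::` workflow command, which is what turns it into an annotation on the run page.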

ci_eval_runner.py

Lines changed: 122 additions & 19 deletions
```diff
@@ -17,7 +17,7 @@
 import os
 import sys
 import time
-from datetime import datetime
+from datetime import datetime, timezone
 from typing import Any, Dict, List, Optional
 
 try:
@@ -26,6 +26,8 @@
     print("ERROR: 'requests' package required. Install with: pip install requests")
     sys.exit(2)
 
+INVERTED_KEYWORDS = ("bias", "toxicity", "hallucination", "conversationsafety")
+
 
 def parse_args() -> argparse.Namespace:
     p = argparse.ArgumentParser(description="VerifyWise CI/CD Evaluation Runner")
@@ -95,8 +97,8 @@ def create_experiment(
     dataset_name = dataset_info.get("name", f"dataset-{dataset_id}")
     print(f"Resolved dataset '{dataset_name}' -> {dataset_path}")
 
-    now = datetime.now(tz=__import__('datetime').timezone.utc)
-    experiment_name = name or f"CI Eval {now.strftime('%Y-%m-%d %H:%M')}"
+    now = datetime.now(tz=timezone.utc)
+    experiment_name = name or f"CI Eval -- {now.strftime('%Y-%m-%d %H:%M')}"
 
     payload = {
         "project_id": project_id,
@@ -170,13 +172,18 @@ def poll_experiment(
     raise TimeoutError(f"Experiment did not complete within {timeout_minutes} minutes")
 
 
+def is_inverted(name: str) -> bool:
+    return any(k in name.lower() for k in INVERTED_KEYWORDS)
+
+
 def parse_results(experiment: Dict[str, Any], threshold: float) -> Dict[str, Any]:
     results = experiment.get("results", {})
     if isinstance(results, str):
         results = json.loads(results)
 
     avg_scores = results.get("avg_scores", {})
     metric_thresholds_raw = results.get("metric_thresholds", {})
+    detailed_results = results.get("detailed_results", [])
 
     config = experiment.get("config", {})
     if isinstance(config, str):
@@ -189,7 +196,7 @@ def parse_results(experiment: Dict[str, Any], threshold: float) -> Dict[str, Any
         score = float(score)
         mt = metric_thresholds_raw.get(name)
         mt = float(mt) if mt is not None else threshold
-        inverted = any(k in name.lower() for k in ["bias", "toxicity", "hallucination", "conversationsafety"])
+        inverted = is_inverted(name)
         passed = (score <= mt) if inverted else (score >= mt)
         if not passed:
             all_passed = False
@@ -201,6 +208,31 @@ def parse_results(experiment: Dict[str, Any], threshold: float) -> Dict[str, Any
             "inverted": inverted,
         })
 
+    samples = []
+    for i, sample in enumerate(detailed_results):
+        sample_entry = {
+            "index": i + 1,
+            "input": sample.get("input", ""),
+            "output": sample.get("output", ""),
+            "expected": sample.get("expected", ""),
+            "metric_scores": {},
+        }
+        raw_scores = sample.get("metric_scores", {})
+        for metric_name, metric_data in raw_scores.items():
+            if isinstance(metric_data, dict):
+                sample_entry["metric_scores"][metric_name] = {
+                    "score": metric_data.get("score"),
+                    "passed": metric_data.get("passed"),
+                    "reason": metric_data.get("reason", ""),
+                }
+            else:
+                sample_entry["metric_scores"][metric_name] = {
+                    "score": metric_data,
+                    "passed": None,
+                    "reason": "",
+                }
+        samples.append(sample_entry)
+
     return {
         "experiment_id": experiment.get("id", ""),
         "name": experiment.get("name", ""),
@@ -210,47 +242,118 @@ def parse_results(experiment: Dict[str, Any], threshold: float) -> Dict[str, Any
         "duration_ms": results.get("duration"),
         "passed": all_passed,
         "metrics": metrics_out,
+        "samples": samples,
     }
 
 
+def _truncate(text: str, max_len: int = 200) -> str:
+    if not text:
+        return "(empty)"
+    text = text.replace("\n", " ").strip()
+    if len(text) <= max_len:
+        return text
+    return text[:max_len] + "..."
+
+
 def generate_markdown(results: Dict[str, Any]) -> str:
     lines = [
         "## VerifyWise LLM Evaluation Results",
         "",
-        f"**Experiment:** {results['name']}",
-        f"**Model:** {results['model']}",
-        f"**Status:** {results['status']}",
-        f"**Samples:** {results['total_prompts']}",
+        f"**Experiment:** {results['name']}  ",
+        f"**Model:** {results['model']}  ",
+        f"**Status:** {results['status']}  ",
+        f"**Samples:** {results['total_prompts']}  ",
     ]
 
     if results.get("duration_ms"):
-        lines.append(f"**Duration:** {results['duration_ms'] / 1000:.1f}s")
+        lines.append(f"**Duration:** {results['duration_ms'] / 1000:.1f}s  ")
 
     overall = "PASS" if results["passed"] else "FAIL"
-    emoji = "white_check_mark" if results["passed"] else "x"
     lines.extend([
         "",
-        f"### Overall: :{emoji}: **{overall}**",
+        f"### Overall: **{overall}**",
         "",
-        "| Metric | Score | Threshold | Status |",
-        "|--------|-------|-----------|--------|",
+        "| Metric | Score | Threshold | Result |",
+        "|--------|------:|----------:|--------|",
    ])
 
     for m in results["metrics"]:
-        status_icon = ":white_check_mark:" if m["passed"] else ":x:"
-        inv = " *(inverted)*" if m["inverted"] else ""
+        inv = " (inverted)" if m["inverted"] else ""
+        result = "PASS" if m["passed"] else "FAIL"
        lines.append(
-            f"| {m['name']}{inv} | {m['score']*100:.1f}% | {m['threshold']*100:.0f}% | {status_icon} |"
+            f"| {m['name']}{inv} | {m['score']*100:.1f}% | {m['threshold']*100:.0f}% | {result} |"
        )
 
+    # Per-sample breakdown for failing metrics
+    failing_metrics = {m["name"] for m in results["metrics"] if not m["passed"]}
+    samples = results.get("samples", [])
+
+    if failing_metrics and samples:
+        lines.extend(["", "---", "", "### Failure Details", ""])
+        lines.append(
+            "Showing per-sample breakdown for metrics that did not meet the threshold."
+        )
+
+        for sample in samples:
+            sample_scores = sample.get("metric_scores", {})
+            has_failing = any(
+                _metric_name_matches(name, failing_metrics)
+                for name in sample_scores
+            )
+            if not has_failing:
+                continue
+
+            lines.extend([
+                "",
+                f"#### Sample {sample['index']}",
+                "",
+                f"> **Input:** {_truncate(sample['input'], 300)}",
+                "",
+                f"> **Response:** {_truncate(sample['output'], 300)}",
+            ])
+
+            if sample.get("expected"):
+                lines.append(f"> **Expected:** {_truncate(sample['expected'], 300)}")
+
+            lines.extend(["", "| Metric | Score | Result | Reason |", "|--------|------:|--------|--------|"])
+            for metric_name, score_data in sample_scores.items():
+                score_val = score_data.get("score")
+                passed = score_data.get("passed")
+                reason = score_data.get("reason", "")
+
+                if score_val is not None:
+                    score_str = f"{score_val * 100:.1f}%" if isinstance(score_val, float) else str(score_val)
+                else:
+                    score_str = "N/A"
+
+                if passed is True:
+                    result_str = "PASS"
+                elif passed is False:
+                    result_str = "FAIL"
+                else:
+                    result_str = "-"
+
+                reason_str = _truncate(reason, 120) if reason else "-"
+                lines.append(f"| {metric_name} | {score_str} | {result_str} | {reason_str} |")
+
     lines.extend([
         "",
-        f"*Generated by [VerifyWise](https://verifywise.ai) at {datetime.now(tz=__import__('datetime').timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}*",
+        "---",
+        f"*Generated by [VerifyWise](https://verifywise.ai) at {datetime.now(tz=timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}*",
    ])
 
    return "\n".join(lines)
 
 
+def _metric_name_matches(name: str, targets: set) -> bool:
+    """Check if a metric name matches any target, case-insensitively."""
+    lower = name.lower()
+    for t in targets:
+        if t.lower() == lower or t.lower().replace("_", "") == lower.replace("_", ""):
+            return True
+    return False
+
+
 def main():
     args = parse_args()
 
@@ -319,10 +422,10 @@ def main():
         print(f"  [{icon}] {m['name']}: {m['score']*100:.1f}% (threshold: {m['threshold']*100:.0f}%)")
 
         if not results["passed"]:
-            print("\nEvaluation FAILED one or more metrics below threshold")
+            print("\nEvaluation FAILED -- one or more metrics below threshold")
             sys.exit(1)
         else:
-            print("\nEvaluation PASSED all metrics within threshold")
+            print("\nEvaluation PASSED -- all metrics within threshold")
             sys.exit(0)
 
     except TimeoutError as e:
```
