Commit 110c0a8

Merge pull request #86 from EMMUUU28/feat/ai-evals
feat(ai-evals): add BrowserStack AI Evals GitHub Action
2 parents 3d416d0 + 7a1de51 commit 110c0a8

13 files changed

Lines changed: 6168 additions & 0 deletions

ai-evals/.eslintrc.json

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
{
  "root": true,
  "parser": "@typescript-eslint/parser",
  "plugins": ["@typescript-eslint"],
  "extends": [
    "eslint:recommended",
    "plugin:@typescript-eslint/recommended"
  ],
  "env": {
    "node": true,
    "es2022": true,
    "mocha": true
  },
  "rules": {
    "@typescript-eslint/no-explicit-any": "off",
    "@typescript-eslint/no-unused-vars": ["error", { "argsIgnorePattern": "^_" }]
  },
  "ignorePatterns": ["dist/", "node_modules/", "coverage/"]
}

ai-evals/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
node_modules/
*.log
.nyc_output/
coverage/

ai-evals/README.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# BrowserStack AI Evals — GitHub Action

Run AI evaluation experiments on every pull request. Compares scores against the previous baseline and reports pass/regression status with a sticky PR comment, Job Summary, and CI metadata tracking.

## How it works

1. Looks up the experiment by name (configured in the BrowserStack AI Evals UI)
2. Triggers a new experiment run with CI metadata (branch, commit, actor, PR number)
3. Waits for the run to complete
4. Fetches a server-computed comparison against the previous baseline run
5. Posts a sticky PR comment and Job Summary with per-evaluator scores, deltas, and threshold status
6. Fails the job if any threshold is breached (configurable)
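
Step 2 above attaches CI metadata to each run. The following is a minimal sketch of how that metadata could be gathered from GitHub's default environment variables (`GITHUB_REF`, `GITHUB_HEAD_REF`, `GITHUB_REF_NAME`, `GITHUB_SHA`, `GITHUB_ACTOR`). The `CiMetadata` shape and the `collectCiMetadata` helper are hypothetical names used for illustration; the payload the action actually sends is not shown in this diff.

```typescript
// Hypothetical shape: the real payload sent by the action is not shown in this diff.
interface CiMetadata {
  branch: string | undefined;
  commit: string | undefined;
  actor: string | undefined;
  prNumber: number | undefined;
}

function collectCiMetadata(env: Record<string, string | undefined>): CiMetadata {
  // On pull_request events GITHUB_REF has the form "refs/pull/<number>/merge".
  const match = /^refs\/pull\/(\d+)\/merge$/.exec(env.GITHUB_REF ?? "");
  return {
    // GITHUB_HEAD_REF is only populated for pull request events;
    // otherwise fall back to the short ref name.
    branch: env.GITHUB_HEAD_REF || env.GITHUB_REF_NAME,
    commit: env.GITHUB_SHA,
    actor: env.GITHUB_ACTOR,
    prNumber: match ? Number(match[1]) : undefined,
  };
}
```

In a workflow this would be called as `collectCiMetadata(process.env)`. Taking the environment as a parameter keeps the function easy to exercise outside of CI.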

## Quickstart

```yaml
name: AI Evals
on:
  pull_request:
    paths: ['src/**', 'prompts/**']

jobs:
  evals:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: browserstack/github-actions/ai-evals@v1
        with:
          experiment: refund-bot-eval
          public-key: ${{ secrets.AISDK_PUBLIC_KEY }}
          secret-key: ${{ secrets.AISDK_SECRET_KEY }}
```

## Inputs

| Name | Required | Default | Description |
|---|---|---|---|
| `experiment` | yes | — | Experiment name (configured in the UI). |
| `public-key` | no | — | API public key. Falls back to `AISDK_PUBLIC_KEY` env var. |
| `secret-key` | no | — | API secret key. Falls back to `AISDK_SECRET_KEY` env var. |
| `github-token` | no | `${{ github.token }}` | Token for the PR comment. |
| `fail-on-regression` | no | `true` | Fail the job when a threshold is breached. |
| `comment-on-pr` | no | `true` | Post/edit a sticky PR comment. |
| `timeout` | no | `900` | Max seconds to wait for the run to complete and its comparison scores to be ready. |
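
The inputs above combine explicit values, environment fallbacks, and string-encoded booleans. As a rough illustration of those rules, here is a sketch of how they might be resolved. `ResolvedInputs` and `resolveInputs` are hypothetical names, and the treatment of unrecognized boolean strings is an assumption rather than documented behavior.

```typescript
interface ResolvedInputs {
  publicKey: string;
  secretKey: string;
  failOnRegression: boolean;
  commentOnPr: boolean;
  timeoutSeconds: number;
}

function resolveInputs(
  inputs: Record<string, string | undefined>,
  env: Record<string, string | undefined>,
): ResolvedInputs {
  // An empty string counts as "omitted", so the env var fallback applies.
  const publicKey = inputs["public-key"] || env.AISDK_PUBLIC_KEY || "";
  const secretKey = inputs["secret-key"] || env.AISDK_SECRET_KEY || "";
  if (!publicKey || !secretKey) {
    throw new Error("Missing API keys: set public-key/secret-key or AISDK_PUBLIC_KEY/AISDK_SECRET_KEY");
  }
  const timeoutSeconds = Number(inputs["timeout"] || "900");
  if (!Number.isFinite(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new Error(`Invalid timeout: ${inputs["timeout"]}`);
  }
  return {
    publicKey,
    secretKey,
    // Assumption: only the literal "false" disables a flag; anything else keeps the default of true.
    failOnRegression: inputs["fail-on-regression"] !== "false",
    commentOnPr: inputs["comment-on-pr"] !== "false",
    timeoutSeconds,
  };
}
```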

## Exit codes

| Code | Meaning |
|---|---|
| 0 | All thresholds passed |
| 1 | At least one threshold breached |
| 2 | Experiment not found |
| 3 | Run failed or timed out |
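
The exit codes above, together with the `fail-on-regression` input, can be expressed as a small mapping. This is a sketch only: the `Outcome` labels and the `exitCodeFor` helper are hypothetical, and returning 0 for a breached threshold when `fail-on-regression` is `false` is an assumption based on the input's "report without blocking the PR" description.

```typescript
type Outcome =
  | "passed"
  | "threshold-breached"
  | "experiment-not-found"
  | "run-failed-or-timed-out";

function exitCodeFor(outcome: Outcome, failOnRegression: boolean): number {
  switch (outcome) {
    case "passed":
      return 0;
    case "threshold-breached":
      // Assumption: with fail-on-regression disabled, a breach is reported but does not fail the job.
      return failOnRegression ? 1 : 0;
    case "experiment-not-found":
      return 2;
    case "run-failed-or-timed-out":
      return 3;
  }
}
```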

## Multiple experiments

Each experiment gets its own sticky comment. Run them in parallel or sequence:

```yaml
steps:
  - uses: browserstack/github-actions/ai-evals@v1
    with:
      experiment: refund-bot-eval
      public-key: ${{ secrets.AISDK_PUBLIC_KEY }}
      secret-key: ${{ secrets.AISDK_SECRET_KEY }}

  - uses: browserstack/github-actions/ai-evals@v1
    with:
      experiment: search-ranking-eval
      public-key: ${{ secrets.AISDK_PUBLIC_KEY }}
      secret-key: ${{ secrets.AISDK_SECRET_KEY }}
```
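
One common way to give each experiment its own sticky comment is to embed a hidden marker in the comment body and look for it on later runs. The sketch below illustrates that idea under stated assumptions: the marker format, `markerFor`, and `planStickyComment` are all hypothetical, and the action's real mechanism is not visible in this diff (its bundled `dist/index.js` is not rendered).

```typescript
interface ExistingComment {
  id: number;
  body: string;
}

type CommentPlan =
  | { action: "create"; body: string }
  | { action: "update"; id: number; body: string };

// Hypothetical hidden marker, keyed by experiment name so experiments never share a comment.
function markerFor(experiment: string): string {
  return `<!-- ai-evals:${experiment} -->`;
}

function planStickyComment(
  existing: ExistingComment[],
  experiment: string,
  summary: string,
): CommentPlan {
  const marker = markerFor(experiment);
  const body = `${marker}\n${summary}`;
  // Reuse the first comment that already carries this experiment's marker.
  const previous = existing.find((comment) => comment.body.includes(marker));
  return previous
    ? { action: "update", id: previous.id, body }
    : { action: "create", body };
}
```

The function only decides what to do; the caller would then create or edit the comment through the GitHub API using the configured `github-token`.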

ai-evals/action.yml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
name: 'BrowserStack AI Evals'
description: 'Run AI evaluation experiments, compare scores against the previous baseline, and report pass/regression status with a PR comment and Job Summary.'
author: 'BrowserStack'
branding:
  icon: 'check-circle'
  color: 'green'

inputs:
  experiment:
    description: 'Name of an Experiment in BrowserStack AI Evals to run. The Action triggers the experiment (prompt + dataset + evaluators + thresholds configured in the UI) and waits for results.'
    required: true

  public-key:
    description: 'BrowserStack AI Evals public API key. Falls back to the AISDK_PUBLIC_KEY environment variable when omitted.'
    required: false
    default: ''

  secret-key:
    description: 'BrowserStack AI Evals secret API key. Falls back to the AISDK_SECRET_KEY environment variable when omitted.'
    required: false
    default: ''

  github-token:
    description: 'Token used to post the sticky PR comment. Defaults to the workflow GITHUB_TOKEN; override only if you need to post as a different identity (e.g., a GitHub App).'
    required: false
    default: ${{ github.token }}

  fail-on-regression:
    description: 'Exit with a non-zero code if any evaluator breaches its threshold. Set to "false" to report without blocking the PR.'
    required: false
    default: 'true'

  comment-on-pr:
    description: 'When running on a pull_request event, post (or edit) a sticky summary comment on the PR. Set to "false" to disable.'
    required: false
    default: 'true'

  timeout:
    description: 'Maximum time (in seconds) to wait for the experiment run to complete and its comparison scores to be ready. Applies to both lifecycle polling and score aggregation polling. Default is 900 (15 minutes).'
    required: false
    default: '900'

runs:
  using: 'node20'
  main: 'dist/index.js'
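
The `timeout` description says the limit applies to both lifecycle polling and score aggregation polling, but it does not say whether the two phases share one budget or each get their own. The sketch below assumes a single shared budget. `Deadline` is a hypothetical helper for illustration, not code from the action.

```typescript
// Hypothetical helper: tracks one time budget that several polling phases can share.
class Deadline {
  private readonly expiresAtMs: number;
  private readonly now: () => number;

  // The clock is injectable so the behavior can be checked without real waiting.
  constructor(timeoutSeconds: number, now: () => number = Date.now) {
    this.now = now;
    this.expiresAtMs = now() + timeoutSeconds * 1000;
  }

  // Milliseconds left in the budget, never negative.
  remainingMs(): number {
    return Math.max(0, this.expiresAtMs - this.now());
  }

  expired(): boolean {
    return this.remainingMs() === 0;
  }
}
```

Under this assumption, a single `new Deadline(900)` would be created before lifecycle polling starts and passed on to score aggregation polling, so a slow first phase leaves less time for the second.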

ai-evals/dist/index.js

Lines changed: 192 additions & 0 deletions
Some generated files are not rendered by default.

ai-evals/dist/index.js.map

Lines changed: 7 additions & 0 deletions
Some generated files are not rendered by default.