A reusable web-based human evaluation instrument for blind review discrimination tasks. Built for the CERA research paper but designed to be adaptable to any study comparing synthetic vs. real text across multiple sources and domains.
Evaluators judge review quality through two forced-choice tasks with full bias controls (length normalization, seeded randomization, no source labels).
Each question presents three reviews from the same domain — one from each source — in randomized order. The evaluator selects which review was written by a real person.
- 5 triplets: 2 Laptop, 2 Restaurant, 1 Hotel
- Position within each triplet is shuffled per evaluator
- Source labels are never shown
- Chance level: 33% (3-way forced choice)
Each question presents two reviews side-by-side from the same domain. The evaluator selects which sounds more natural and realistic.
- 2 CERA vs Heuristic, 2 CERA vs Real, 1 Heuristic vs Real
- Left/right placement randomized per evaluator
- Source labels are never shown; reviews labeled "Review A" / "Review B"
- Chance level: 50% (2-way forced choice)
Total: 10 judgments per evaluator, ~3 minutes.
All review data lives in the `datasets/` directory as SemEval-format XML files.
Naming convention: `{source}-{domain}s.xml`
| File | Source | Domain |
|---|---|---|
| `real-laptops.xml` | Real (human-written) | Laptop |
| `real-restaurants.xml` | Real (human-written) | Restaurant |
| `real-hotels.xml` | Real (human-written) | Hotel |
| `cera-laptops.xml` | CERA (generated) | Laptop |
| `cera-restaurants.xml` | CERA (generated) | Restaurant |
| `cera-hotels.xml` | CERA (generated) | Hotel |
| `heuristic-laptops.xml` | Heuristic (generated) | Laptop |
| `heuristic-restaurants.xml` | Heuristic (generated) | Restaurant |
| `heuristic-hotels.xml` | Heuristic (generated) | Hotel |
The build script derives source and domain from the filename: `{source}-{domain}s.xml` → source `real`/`cera`/`heuristic`, domain `laptop`/`restaurant`/`hotel` (plural suffix stripped).
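Under that convention, the parsing step is a single pattern match. A minimal sketch (the function name `parseDatasetName` is illustrative, not the script's actual API):

```typescript
// Derive source and domain from a dataset filename like "real-laptops.xml".
// The trailing "s" before ".xml" is the plural suffix, stripped by the regex.
function parseDatasetName(
  file: string,
): { source: string; domain: string } | null {
  const m = /^([a-z]+)-([a-z]+)s\.xml$/.exec(file);
  if (!m) return null; // not a dataset file
  return { source: m[1], domain: m[2] };
}
```

For example, `parseDatasetName("cera-restaurants.xml")` yields source `cera` and domain `restaurant`.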
Each file follows the SemEval XML format:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
  <Review rid="0">
    <sentences>
      <sentence id="0:0">
        <text>First sentence of the review.</text>
      </sentence>
      <sentence id="0:1">
        <text>Second sentence of the review.</text>
        <Opinions>
          <Opinion target="display" category="LAPTOP#DISPLAY" polarity="positive" from="0" to="0" />
        </Opinions>
      </sentence>
    </sentences>
  </Review>
  <!-- more reviews -->
</Reviews>
```

The `<Opinions>` element is optional; the build script only reads the `<text>` content. Sentences are concatenated to form the full review text.
After modifying dataset files, regenerate the TypeScript review pool:
```
npm run build:reviews   # Parses XML → src/data/all-reviews.ts
```

This script:

- Reads all 9 `datasets/*.xml` files
- Concatenates sentence texts into full reviews
- Truncates each review to a maximum of 4 sentences
- Filters to 2-4 sentences and 120-600 characters
- Outputs `src/data/all-reviews.ts` with reviews grouped by `{source}_{domain}`
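The truncate-then-filter rule can be sketched as a small function (names are illustrative; the actual script's internals may differ):

```typescript
// Truncate to at most 4 sentences, then keep the review only if it has
// 2-4 sentences and 120-600 characters after joining. Returns the review
// text, or null if it is filtered out.
function keepReview(sentences: string[]): string | null {
  const kept = sentences.slice(0, 4); // truncate to max 4 sentences
  const text = kept.join(" ");
  const ok =
    kept.length >= 2 && kept.length <= 4 &&
    text.length >= 120 && text.length <= 600;
  return ok ? text : null;
}
```

Note the order matters: a 6-sentence review is first truncated to 4 sentences, and only then checked against the length bounds.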
The generated `all-reviews.ts` is keyed as:

```ts
reviewPools["real_laptop"]       // Real laptop reviews
reviewPools["cera_restaurant"]   // CERA restaurant reviews
reviewPools["heuristic_hotel"]   // Heuristic hotel reviews
// ... etc.
```

To change the sources (e.g., replace "heuristic" with "rag") or domains:

- Name your XML files following the `{source}-{domain}s.xml` convention
- Update the `SOURCES` and `DOMAINS` arrays in `scripts/build-reviews.ts`
- Update `DOMAIN_MAP` in the same file (maps plural filename suffix → singular key)
- Update the pair/triplet specs in `src/data/reviews.ts` (`tripletDomains`, `pairSpec`)
- Run `npm run build:reviews`
- All reviews are truncated at a word boundary with "..." appended
- Within each triplet/pair, reviews are truncated to 85-100% of the shortest review's character count
- Truncation target is randomized per review via session PRNG, so which source appears longest varies
- The most length-constrained source is selected first as an anchor
- Other sources are matched to within ±40% character length and ±1 sentence count
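The matching rule above can be expressed as a predicate. A sketch, assuming the ±40% bound is applied symmetrically around the anchor's character count (the function name is illustrative):

```typescript
// Does a candidate review match the anchor review's length profile?
// Characters must fall within ±40% of the anchor; sentence counts may
// differ by at most 1.
function isLengthMatch(
  anchorChars: number, anchorSentences: number,
  chars: number, sentences: number,
): boolean {
  return (
    chars >= anchorChars * 0.6 &&
    chars <= anchorChars * 1.4 &&
    Math.abs(sentences - anchorSentences) <= 1
  );
}
```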
- All randomization uses a seeded PRNG (Mulberry32 + Fisher-Yates shuffle) for full reproducibility
- Each evaluator gets a unique seed (timestamp-based)
- Same seed always produces the identical evaluation sequence
- Every review shown to an evaluator is unique — no review appears in both Task 1 and Task 2
- Within each triplet/pair, all reviews come from the same domain
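The seeded pipeline named above (Mulberry32 + Fisher-Yates) is small enough to sketch in full; given the same seed, the shuffle order is always identical:

```typescript
// Mulberry32: a fast 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the supplied PRNG; returns a new array.
function shuffle<T>(items: T[], rand: () => number): T[] {
  const a = items.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}
```

This is why storing only the `seed` per session is enough to reconstruct exactly which reviews each evaluator saw and in what order.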
- Node.js 18+
- A Convex Cloud project (free tier)
```
npm install
npx convex init        # Create a new Convex project (first time only)
```

Create a `.env.local` file:

```
# Convex Cloud connection (provided by `npx convex init`)
CONVEX_DEPLOYMENT=dev:your-deployment-name
VITE_CONVEX_URL=https://your-deployment-name.convex.cloud

# Results dashboard access key (choose any password)
VITE_RESULTS_KEY=your-secret-key-here
```

| Variable | Required | Description |
|---|---|---|
| `CONVEX_DEPLOYMENT` | Yes | Convex deployment identifier |
| `VITE_CONVEX_URL` | Yes | Convex Cloud URL for the frontend |
| `VITE_RESULTS_KEY` | Yes | Password for the `/results` admin dashboard. Access via `/results?key=YOUR_KEY`. No fallback: if unset, the dashboard is inaccessible. |
```
npx convex dev   # Start Convex dev server (watches for schema changes)
npm run dev      # Start Vite dev server (http://localhost:5173)
```

To deploy:

- Push to GitHub
- Connect the repo to Vercel
- Set environment variables in the Vercel dashboard: `VITE_CONVEX_URL`, `VITE_RESULTS_KEY`
- Deploy: Vercel auto-detects Vite and builds the SPA

The `vercel.json` handles SPA routing (all paths → `index.html`).
`sessions` table — one row per evaluator:

| Field | Type | Description |
|---|---|---|
| `sessionId` | string | UUID, unique per evaluation session |
| `evaluatorName` | string | Self-reported name |
| `startedAt` | number | Timestamp |
| `completedAt` | number? | Timestamp (null if abandoned) |
| `userAgent` | string | Browser user agent |
| `seed` | number | PRNG seed for reproducibility |
`evaluations` table — one row per response:

| Field | Type | Description |
|---|---|---|
| `evaluatorName` | string | Self-reported name |
| `sessionId` | string | Session UUID |
| `task` | `"turing"` / `"pairwise"` | Which task |
| `questionIndex` | number | 0-4 |
| `response` | string | Triplet: actual source picked (`"real"` / `"cera"` / `"heuristic"`). Pairwise: `"left"` / `"right"` |
| `reviewSource` | string | Triplet: `"triplet"`. Pairwise: pair type (e.g., `"cera_vs_heuristic"`) |
| `domain` | string | `"laptop"` / `"restaurant"` / `"hotel"` |
| `timeSpentMs` | number | Per-question response time (ms) |
| `leftSource` | string? | Pairwise only: actual source shown on left |
| `rightSource` | string? | Pairwise only: actual source shown on right |
| `createdAt` | number | Timestamp |
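Expressed as a Convex schema, the two tables might look like the following sketch (field names come from the tables above; the file layout and validator choices are assumptions, not the project's actual `convex/schema.ts`):

```typescript
// Hypothetical convex/schema.ts mirroring the sessions/evaluations tables.
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  sessions: defineTable({
    sessionId: v.string(),            // UUID, unique per session
    evaluatorName: v.string(),        // self-reported name
    startedAt: v.number(),            // timestamp
    completedAt: v.optional(v.number()), // null/absent if abandoned
    userAgent: v.string(),
    seed: v.number(),                 // PRNG seed for reproducibility
  }),
  evaluations: defineTable({
    evaluatorName: v.string(),
    sessionId: v.string(),
    task: v.union(v.literal("turing"), v.literal("pairwise")),
    questionIndex: v.number(),        // 0-4
    response: v.string(),
    reviewSource: v.string(),
    domain: v.string(),
    timeSpentMs: v.number(),
    leftSource: v.optional(v.string()),  // pairwise only
    rightSource: v.optional(v.string()), // pairwise only
    createdAt: v.number(),
  }),
});
```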
Task 1 — Triplet Identification:
- Selection rate per source (% of times each was picked as "real")
- Fleiss' kappa for inter-annotator agreement
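Fleiss' kappa can be computed directly from per-item category counts. A self-contained sketch, assuming every item (triplet question) receives the same number of ratings:

```typescript
// Fleiss' kappa from per-item category counts.
// ratings[i][c] = number of raters who assigned item i to category c;
// every item must have the same total number of ratings.
function fleissKappa(ratings: number[][]): number {
  const N = ratings.length;                        // items
  const k = ratings[0].length;                     // categories
  const n = ratings[0].reduce((s, x) => s + x, 0); // raters per item
  const categoryTotals = new Array(k).fill(0);
  let agreementSum = 0;
  for (const row of ratings) {
    let pairs = 0;
    for (let c = 0; c < k; c++) {
      pairs += row[c] * (row[c] - 1); // agreeing rater pairs in this item
      categoryTotals[c] += row[c];
    }
    agreementSum += pairs / (n * (n - 1));
  }
  const pBar = agreementSum / N; // mean observed agreement
  const pE = categoryTotals.reduce(
    (s, t) => s + (t / (N * n)) ** 2, 0); // chance agreement
  return (pBar - pE) / (1 - pE);
}
```

Here the categories would be the three sources (`real`, `cera`, `heuristic`), each item one triplet question, and each rater one evaluator.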
Task 2 — Pairwise Naturalness:
- Win rate per source within each pair type
- Resolved using `leftSource`/`rightSource` to map position → actual source
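The position-to-source resolution amounts to a one-line lookup. A sketch using an illustrative `PairwiseRow` shape (field names match the `evaluations` table; function names are assumptions):

```typescript
// Minimal shape of a pairwise evaluation row (subset of the evaluations table).
interface PairwiseRow {
  response: "left" | "right"; // which side the evaluator picked
  leftSource: string;         // actual source shown on the left
  rightSource: string;        // actual source shown on the right
}

// Map the positional response back to the underlying source.
function winningSource(row: PairwiseRow): string {
  return row.response === "left" ? row.leftSource : row.rightSource;
}

// Fraction of rows in which the given source won.
function winRate(rows: PairwiseRow[], source: string): number {
  const wins = rows.filter((r) => winningSource(r) === source).length;
  return wins / rows.length;
}
```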
Access at `/results?key=YOUR_RESULTS_KEY`. Features:
- Live stats: Real-time triplet rates, pairwise preferences, Fleiss' kappa, per-evaluator breakdowns
- Session management: Include/exclude evaluators, soft-delete (trash) with restore, permanent delete
- Pause/Resume: Temporarily prevent new evaluators from starting
- End Survey: Show "Study Complete" to all visitors
- Copy LaTeX Table: Publication-ready table with all metrics
- Download CSV: Raw evaluation data
- Frontend: Vite + React 19 + TypeScript + Tailwind CSS v4
- Backend: Convex Cloud (free tier) — reactive queries for live dashboard updates
- Deployment: Vercel (SPA)
- Charts: Recharts (results dashboard)