CERA Human Evaluation

A reusable web-based human evaluation instrument for blind review discrimination tasks. Built for the CERA research paper but designed to be adaptable to any study comparing synthetic vs. real text across multiple sources and domains.

Evaluators judge review quality through two forced-choice tasks with full bias controls (length normalization, seeded randomization, no source labels).

Study Design

Task 1: Triplet Identification (5 questions)

Each question presents three reviews from the same domain — one from each source — in randomized order. The evaluator selects which review was written by a real person.

  • 5 triplets: 2 Laptop, 2 Restaurant, 1 Hotel
  • Position within each triplet is shuffled per evaluator
  • Source labels are never shown
  • Chance level: 33% (3-way forced choice)

Task 2: Pairwise Naturalness (5 questions)

Each question presents two reviews side-by-side from the same domain. The evaluator selects which sounds more natural and realistic.

  • 2 CERA vs Heuristic, 2 CERA vs Real, 1 Heuristic vs Real
  • Left/right placement randomized per evaluator
  • Source labels are never shown; reviews labeled "Review A" / "Review B"
  • Chance level: 50% (2-way forced choice)

Total: 10 judgments per evaluator, ~3 minutes.

Datasets

Adding Your Own Reviews

All review data lives in the datasets/ directory as SemEval-format XML files.

Naming convention: {source}-{domain}s.xml

| File | Source | Domain |
|------|--------|--------|
| real-laptops.xml | Real (human-written) | Laptop |
| real-restaurants.xml | Real (human-written) | Restaurant |
| real-hotels.xml | Real (human-written) | Hotel |
| cera-laptops.xml | CERA (generated) | Laptop |
| cera-restaurants.xml | CERA (generated) | Restaurant |
| cera-hotels.xml | CERA (generated) | Hotel |
| heuristic-laptops.xml | Heuristic (generated) | Laptop |
| heuristic-restaurants.xml | Heuristic (generated) | Restaurant |
| heuristic-hotels.xml | Heuristic (generated) | Hotel |

The build script derives source and domain from the filename: {source}-{domain}s.xml → source real/cera/heuristic, domain laptop/restaurant/hotel (plural suffix stripped).
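The derivation step can be sketched as follows. This is illustrative, not the actual script internals; the function name `parseDatasetFilename` and the inline `DOMAIN_MAP` shape are assumptions based on the convention described above.

```typescript
// Hypothetical sketch: derive { source, domain } from a dataset filename.
// Mirrors the {source}-{domain}s.xml convention with the plural suffix mapped
// back to a singular domain key.
const DOMAIN_MAP: Record<string, string> = {
  laptops: "laptop",
  restaurants: "restaurant",
  hotels: "hotel",
};

function parseDatasetFilename(
  filename: string
): { source: string; domain: string } | null {
  const match = /^([a-z]+)-([a-z]+)\.xml$/.exec(filename);
  if (!match) return null;
  const [, source, plural] = match;
  const domain = DOMAIN_MAP[plural];
  return domain ? { source, domain } : null;
}
```

For example, `parseDatasetFilename("cera-hotels.xml")` yields source `cera` and domain `hotel`, while a file that does not follow the convention is skipped.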

XML Format

Each file follows SemEval XML format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
  <Review rid="0">
    <sentences>
      <sentence id="0:0">
        <text>First sentence of the review.</text>
      </sentence>
      <sentence id="0:1">
        <text>Second sentence of the review.</text>
        <Opinions>
          <Opinion target="display" category="LAPTOP#DISPLAY" polarity="positive" from="0" to="0" />
        </Opinions>
      </sentence>
    </sentences>
  </Review>
  <!-- more reviews -->
</Reviews>

The <Opinions> element is optional — the build script only reads the <text> content. Sentences are concatenated to form the full review text.
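A minimal sketch of that extraction, assuming a regex-based pass rather than whatever XML parser the real script uses: only `<text>` contents are collected, `<Opinions>` elements are ignored, and sentences are joined with spaces.

```typescript
// Hypothetical sketch of the text-extraction step: pull every <text> element
// out of each <Review> block and concatenate sentences into one review string.
function extractReviews(xml: string): string[] {
  const reviews: string[] = [];
  // \b prevents <Review from also matching the outer <Reviews> wrapper.
  const reviewBlocks = xml.match(/<Review\b[\s\S]*?<\/Review>/g) ?? [];
  for (const block of reviewBlocks) {
    const sentences = [...block.matchAll(/<text>([\s\S]*?)<\/text>/g)].map(
      (m) => m[1].trim()
    );
    reviews.push(sentences.join(" "));
  }
  return reviews;
}
```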

Build Pipeline

After modifying dataset files, regenerate the TypeScript review pool:

npm run build:reviews    # Parses XML → src/data/all-reviews.ts

This script:

  1. Reads all 9 datasets/*.xml files
  2. Concatenates sentence texts into full reviews
  3. Truncates to max 4 sentences per review
  4. Filters to 2-4 sentences and 120-600 characters
  5. Outputs src/data/all-reviews.ts with reviews grouped by {source}_{domain}
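Steps 3 and 4 above might look like the following. The thresholds come from the list; the function shape and name are assumptions.

```typescript
// Hypothetical sketch of truncate-then-filter: keep at most 4 sentences, then
// accept only reviews with 2-4 sentences and 120-600 characters of text.
function normalizeReview(sentences: string[]): string | null {
  const kept = sentences.slice(0, 4); // truncate to max 4 sentences
  const text = kept.join(" ");
  const okCount = kept.length >= 2 && kept.length <= 4;
  const okLength = text.length >= 120 && text.length <= 600;
  return okCount && okLength ? text : null;
}
```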

The generated all-reviews.ts is keyed as:

reviewPools["real_laptop"]       // Real laptop reviews
reviewPools["cera_restaurant"]   // CERA restaurant reviews
reviewPools["heuristic_hotel"]   // Heuristic hotel reviews
// ... etc.

Customizing Sources/Domains

To change the sources (e.g., replace "heuristic" with "rag") or domains:

  1. Name your XML files following the {source}-{domain}s.xml convention
  2. Update SOURCES and DOMAINS arrays in scripts/build-reviews.ts
  3. Update DOMAIN_MAP in the same file (maps plural filename suffix → singular key)
  4. Update the pair/triplet specs in src/data/reviews.ts (tripletDomains, pairSpec)
  5. Run npm run build:reviews

Bias Controls

Length Normalization

  • All reviews are truncated at a word boundary with "..." appended
  • Within each triplet/pair, reviews are truncated to 85-100% of the shortest review's character count
  • Truncation target is randomized per review via session PRNG, so which source appears longest varies
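Put together, the normalization rule could be sketched like this. `rand` stands in for the session PRNG (a function returning values in `[0, 1)`); the function name and exact word-boundary handling are assumptions.

```typescript
// Hypothetical sketch of per-group length normalization: cut each review at a
// word boundary to a randomized 85-100% of the shortest review's length and
// append "..." when anything was removed.
function normalizeLengths(reviews: string[], rand: () => number): string[] {
  const minLen = Math.min(...reviews.map((r) => r.length));
  return reviews.map((r) => {
    const target = Math.floor(minLen * (0.85 + 0.15 * rand()));
    if (r.length <= target) return r;
    const cut = r.lastIndexOf(" ", target); // back up to a word boundary
    return r.slice(0, cut > 0 ? cut : target) + "...";
  });
}
```

Because `rand` is seeded per session, the same evaluator always sees the same truncations, but across evaluators no single source is systematically the longest.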

Length Matching at Selection

  • The most length-constrained source is selected first as an anchor
  • Other sources are matched to within ±40% character length and ±1 sentence count
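As a sketch, the matching predicate might look like this; the `Candidate` shape and function name are assumptions, while the ±40% and ±1-sentence tolerances come from the rule above.

```typescript
// Hypothetical sketch of selection-time length matching: a candidate review
// qualifies if its character length is within +/-40% of the anchor's and its
// sentence count differs by at most 1.
interface Candidate {
  text: string;
  sentenceCount: number;
}

function matchesAnchor(anchor: Candidate, candidate: Candidate): boolean {
  const lenRatio = candidate.text.length / anchor.text.length;
  const okLen = lenRatio >= 0.6 && lenRatio <= 1.4;
  const okSentences =
    Math.abs(candidate.sentenceCount - anchor.sentenceCount) <= 1;
  return okLen && okSentences;
}
```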

Randomization

  • All randomization uses a seeded PRNG (Mulberry32 + Fisher-Yates shuffle) for full reproducibility
  • Each evaluator gets a unique seed (timestamp-based)
  • Same seed always produces the identical evaluation sequence
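These two primitives are standard and commonly implemented as below (this is the textbook form, not necessarily this repo's exact code): Mulberry32 is a tiny 32-bit seeded PRNG, and Fisher-Yates shuffles with it, so a fixed seed reproduces the same order.

```typescript
// Mulberry32: a small, fast 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG (returns a new array).
function shuffle<T>(items: T[], rand: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Calling `shuffle(triplet, mulberry32(seed))` twice with the same seed yields the identical ordering, which is what makes each evaluator's sequence reproducible from the stored seed.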

No Overlap

  • Every review shown to an evaluator is unique — no review appears in both Task 1 and Task 2

Domain Consistency

  • Within each triplet/pair, all reviews come from the same domain

Setup

Prerequisites

  • Node.js with npm
  • A Convex account (the free tier is sufficient)

Installation

npm install
npx convex init           # Create a new Convex project (first time only)

Environment Variables

Create a .env.local file:

# Convex Cloud connection (provided by `npx convex init`)
CONVEX_DEPLOYMENT=dev:your-deployment-name
VITE_CONVEX_URL=https://your-deployment-name.convex.cloud

# Results dashboard access key (choose any password)
VITE_RESULTS_KEY=your-secret-key-here

| Variable | Required | Description |
|----------|----------|-------------|
| CONVEX_DEPLOYMENT | Yes | Convex deployment identifier |
| VITE_CONVEX_URL | Yes | Convex Cloud URL for the frontend |
| VITE_RESULTS_KEY | Yes | Password for the /results admin dashboard. Access via /results?key=YOUR_KEY. No fallback: if unset, the dashboard is inaccessible. |

Development

npx convex dev          # Start Convex dev server (watches for schema changes)
npm run dev             # Start Vite dev server (http://localhost:5173)

Deployment (Vercel)

  1. Push to GitHub
  2. Connect the repo to Vercel
  3. Set environment variables in Vercel dashboard: VITE_CONVEX_URL, VITE_RESULTS_KEY
  4. Deploy — Vercel auto-detects Vite and builds the SPA

The vercel.json handles SPA routing (all paths → index.html).

Collected Data

Schema

sessions table — one row per evaluator:

| Field | Type | Description |
|-------|------|-------------|
| sessionId | string | UUID, unique per evaluation session |
| evaluatorName | string | Self-reported name |
| startedAt | number | Timestamp |
| completedAt | number? | Timestamp (null if abandoned) |
| userAgent | string | Browser user agent |
| seed | number | PRNG seed for reproducibility |

evaluations table — one row per response:

| Field | Type | Description |
|-------|------|-------------|
| evaluatorName | string | Self-reported name |
| sessionId | string | Session UUID |
| task | "turing" / "pairwise" | Which task |
| questionIndex | number | 0-4 |
| response | string | Triplet: actual source picked ("real" / "cera" / "heuristic"). Pairwise: "left" / "right" |
| reviewSource | string | Triplet: "triplet". Pairwise: pair type (e.g., "cera_vs_heuristic") |
| domain | string | "laptop" / "restaurant" / "hotel" |
| timeSpentMs | number | Per-question response time (ms) |
| leftSource | string? | Pairwise only: actual source shown on left |
| rightSource | string? | Pairwise only: actual source shown on right |
| createdAt | number | Timestamp |

Computed Metrics

Task 1 — Triplet Identification:

  • Selection rate per source (% of times each was picked as "real")
  • Fleiss' kappa for inter-annotator agreement
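Fleiss' kappa over the triplet responses can be computed with the standard formula, sketched below (this is the textbook computation, not necessarily the dashboard's exact code). `counts[i][j]` is the number of evaluators who picked category `j` on question `i`, with every row summing to the same number of raters.

```typescript
// Fleiss' kappa: chance-corrected agreement among n raters over N subjects
// and k categories, from a subject x category count matrix.
function fleissKappa(counts: number[][]): number {
  const N = counts.length; // subjects (questions)
  const n = counts[0].reduce((a, b) => a + b, 0); // raters per subject
  const k = counts[0].length; // categories

  // Per-subject observed agreement P_i.
  const Pi = counts.map(
    (row) => (row.reduce((s, c) => s + c * c, 0) - n) / (n * (n - 1))
  );
  const Pbar = Pi.reduce((a, b) => a + b, 0) / N;

  // Expected agreement from the marginal category proportions p_j.
  let PbarE = 0;
  for (let j = 0; j < k; j++) {
    const pj = counts.reduce((s, row) => s + row[j], 0) / (N * n);
    PbarE += pj * pj;
  }
  return (Pbar - PbarE) / (1 - PbarE);
}
```

Perfect agreement gives kappa = 1; agreement at chance level gives 0; systematic disagreement goes negative.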

Task 2 — Pairwise Naturalness:

  • Win rate per source within each pair type
  • Resolved using leftSource/rightSource to map position → actual source
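The position-to-source resolution follows directly from the schema's leftSource/rightSource fields; a minimal sketch (field names are from the schema above, the function name is an assumption):

```typescript
// Resolve a pairwise "left"/"right" response to the actual source that won,
// using the per-row leftSource/rightSource fields from the evaluations table.
interface PairwiseRow {
  response: "left" | "right";
  leftSource: string;
  rightSource: string;
}

function winningSource(row: PairwiseRow): string {
  return row.response === "left" ? row.leftSource : row.rightSource;
}
```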

Admin Dashboard

Access at /results?key=YOUR_RESULTS_KEY. Features:

  • Live stats: Real-time triplet rates, pairwise preferences, Fleiss' kappa, per-evaluator breakdowns
  • Session management: Include/exclude evaluators, soft-delete (trash) with restore, permanent delete
  • Pause/Resume: Temporarily prevent new evaluators from starting
  • End Survey: Show "Study Complete" to all visitors
  • Copy LaTeX Table: Publication-ready table with all metrics
  • Download CSV: Raw evaluation data

Tech Stack

  • Frontend: Vite + React 19 + TypeScript + Tailwind CSS v4
  • Backend: Convex Cloud (free tier) — reactive queries for live dashboard updates
  • Deployment: Vercel (SPA)
  • Charts: Recharts (results dashboard)

About

Human evaluation web app for the CERA conference paper (Canadian AI 2026)
