A reusable web-based human evaluation instrument for blind review discrimination tasks. Built for the CERA research paper but designed to be adaptable to any study comparing synthetic vs. real text across multiple sources and domains.
Evaluators judge review quality through two forced-choice tasks with full bias controls (length normalization, seeded randomization, no source labels).
Each question presents three reviews from the same domain — one from each source — in randomized order. The evaluator selects which review was written by a real person.
- 5 triplets: 2 Laptop, 2 Restaurant, 1 Hotel
- Position within each triplet is shuffled per evaluator
- Source labels are never shown
- Chance level: 33% (3-way forced choice)
Each question presents two reviews side-by-side from the same domain. The evaluator selects which sounds more natural and realistic.
- 2 CERA vs Heuristic, 2 CERA vs Real, 1 Heuristic vs Real
- Left/right placement randomized per evaluator
- Source labels are never shown; reviews labeled "Review A" / "Review B"
- Chance level: 50% (2-way forced choice)
Total: 10 judgments per evaluator, ~3 minutes.
All review data lives in the `datasets/` directory as SemEval-format XML files.
Naming convention: `{source}-{domain}s.xml`
| File | Source | Domain |
|---|---|---|
| `real-laptops.xml` | Real (human-written) | Laptop |
| `real-restaurants.xml` | Real (human-written) | Restaurant |
| `real-hotels.xml` | Real (human-written) | Hotel |
| `cera-laptops.xml` | CERA (generated) | Laptop |
| `cera-restaurants.xml` | CERA (generated) | Restaurant |
| `cera-hotels.xml` | CERA (generated) | Hotel |
| `heuristic-laptops.xml` | Heuristic (generated) | Laptop |
| `heuristic-restaurants.xml` | Heuristic (generated) | Restaurant |
| `heuristic-hotels.xml` | Heuristic (generated) | Hotel |
The build script derives source and domain from the filename: `{source}-{domain}s.xml` → source `real`/`cera`/`heuristic`, domain `laptop`/`restaurant`/`hotel` (plural suffix stripped).
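Under that convention, the parsing step is a single pattern match. A minimal sketch (the function name `parseDatasetName` is illustrative, not the script's actual API):

```typescript
// Derive source and domain from a dataset filename like "real-laptops.xml".
// The trailing "s" before ".xml" is the plural suffix, stripped by the regex.
function parseDatasetName(
  file: string,
): { source: string; domain: string } | null {
  const m = /^([a-z]+)-([a-z]+)s\.xml$/.exec(file);
  if (!m) return null; // not a dataset file
  return { source: m[1], domain: m[2] };
}
```

For example, `parseDatasetName("cera-restaurants.xml")` yields source `cera` and domain `restaurant`.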
Each file follows the SemEval XML format:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
  <Review rid="0">
    <sentences>
      <sentence id="0:0">
        <text>First sentence of the review.</text>
      </sentence>
      <sentence id="0:1">
        <text>Second sentence of the review.</text>
        <Opinions>
          <Opinion target="display" category="LAPTOP#DISPLAY" polarity="positive" from="0" to="0" />
        </Opinions>
      </sentence>
    </sentences>
  </Review>
  <!-- more reviews -->
</Reviews>
```

The `<Opinions>` element is optional; the build script only reads the `<text>` content. Sentences are concatenated to form the full review text.
After modifying dataset files, regenerate the TypeScript review pool:
```
npm run build:reviews   # Parses XML → src/data/all-reviews.ts
```

This script:

- Reads all 9 `datasets/*.xml` files
- Concatenates sentence texts into full reviews
- Truncates each review to a maximum of 4 sentences
- Filters to 2-4 sentences and 120-600 characters
- Outputs `src/data/all-reviews.ts` with reviews grouped by `{source}_{domain}`
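The truncate-then-filter rule can be sketched as a small function (names are illustrative; the actual script's internals may differ):

```typescript
// Truncate to at most 4 sentences, then keep the review only if it has
// 2-4 sentences and 120-600 characters after joining. Returns the review
// text, or null if it is filtered out.
function keepReview(sentences: string[]): string | null {
  const kept = sentences.slice(0, 4); // truncate to max 4 sentences
  const text = kept.join(" ");
  const ok =
    kept.length >= 2 && kept.length <= 4 &&
    text.length >= 120 && text.length <= 600;
  return ok ? text : null;
}
```

Note the order matters: a 6-sentence review is first truncated to 4 sentences, and only then checked against the length bounds.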
The generated `all-reviews.ts` is keyed as:

```ts
reviewPools["real_laptop"]       // Real laptop reviews
reviewPools["cera_restaurant"]   // CERA restaurant reviews
reviewPools["heuristic_hotel"]   // Heuristic hotel reviews
// ... etc.
```

To change the sources (e.g., replace "heuristic" with "rag") or domains:

- Name your XML files following the `{source}-{domain}s.xml` convention
- Update the `SOURCES` and `DOMAINS` arrays in `scripts/build-reviews.ts`
- Update `DOMAIN_MAP` in the same file (maps plural filename suffix → singular key)
- Update the pair/triplet specs in `src/data/reviews.ts` (`tripletDomains`, `pairSpec`)
- Run `npm run build:reviews`
- All reviews are truncated at a word boundary with "..." appended
- Within each triplet/pair, reviews are truncated to 85-100% of the shortest review's character count
- Truncation target is randomized per review via session PRNG, so which source appears longest varies
- The most length-constrained source is selected first as an anchor
- Other sources are matched to within ±40% character length and ±1 sentence count
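The matching rule above can be expressed as a predicate. A sketch, assuming the ±40% bound is applied symmetrically around the anchor's character count (the function name is illustrative):

```typescript
// Does a candidate review match the anchor review's length profile?
// Characters must fall within ±40% of the anchor; sentence counts may
// differ by at most 1.
function isLengthMatch(
  anchorChars: number, anchorSentences: number,
  chars: number, sentences: number,
): boolean {
  return (
    chars >= anchorChars * 0.6 &&
    chars <= anchorChars * 1.4 &&
    Math.abs(sentences - anchorSentences) <= 1
  );
}
```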
- All randomization uses a seeded PRNG (Mulberry32 + Fisher-Yates shuffle) for full reproducibility
- Each evaluator gets a unique seed (timestamp-based)
- Same seed always produces the identical evaluation sequence
- Every review shown to an evaluator is unique — no review appears in both Task 1 and Task 2
- Within each triplet/pair, all reviews come from the same domain
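The seeded pipeline named above (Mulberry32 + Fisher-Yates) is small enough to sketch in full; given the same seed, the shuffle order is always identical:

```typescript
// Mulberry32: a fast 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the supplied PRNG; returns a new array.
function shuffle<T>(items: T[], rand: () => number): T[] {
  const a = items.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}
```

This is why storing only the `seed` per session is enough to reconstruct exactly which reviews each evaluator saw and in what order.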
- Node.js 18+
- A Convex Cloud project (free tier)
```
npm install
npx convex init        # Create a new Convex project (first time only)
```

Create a `.env.local` file:

```
# Convex Cloud connection (provided by `npx convex init`)
CONVEX_DEPLOYMENT=dev:your-deployment-name
VITE_CONVEX_URL=https://your-deployment-name.convex.cloud

# Results dashboard access key (choose any password)
VITE_RESULTS_KEY=your-secret-key-here
```

| Variable | Required | Description |
|---|---|---|
| `CONVEX_DEPLOYMENT` | Yes | Convex deployment identifier |
| `VITE_CONVEX_URL` | Yes | Convex Cloud URL for the frontend |
| `VITE_RESULTS_KEY` | Yes | Password for the `/results` admin dashboard. Access via `/results?key=YOUR_KEY`. No fallback: if unset, the dashboard is inaccessible. |
```
npx convex dev   # Start Convex dev server (watches for schema changes)
npm run dev      # Start Vite dev server (http://localhost:5173)
```

To deploy:

- Push to GitHub
- Connect the repo to Vercel
- Set environment variables in the Vercel dashboard: `VITE_CONVEX_URL`, `VITE_RESULTS_KEY`
- Deploy: Vercel auto-detects Vite and builds the SPA

The `vercel.json` handles SPA routing (all paths → `index.html`).
`sessions` table — one row per evaluator:

| Field | Type | Description |
|---|---|---|
| `sessionId` | string | UUID, unique per evaluation session |
| `evaluatorName` | string | Self-reported name |
| `startedAt` | number | Timestamp |
| `completedAt` | number? | Timestamp (null if abandoned) |
| `userAgent` | string | Browser user agent |
| `seed` | number | PRNG seed for reproducibility |
`evaluations` table — one row per response:

| Field | Type | Description |
|---|---|---|
| `evaluatorName` | string | Self-reported name |
| `sessionId` | string | Session UUID |
| `task` | `"turing"` / `"pairwise"` | Which task |
| `questionIndex` | number | 0-4 |
| `response` | string | Triplet: actual source picked (`"real"` / `"cera"` / `"heuristic"`). Pairwise: `"left"` / `"right"` |
| `reviewSource` | string | Triplet: `"triplet"`. Pairwise: pair type (e.g., `"cera_vs_heuristic"`) |
| `domain` | string | `"laptop"` / `"restaurant"` / `"hotel"` |
| `timeSpentMs` | number | Per-question response time (ms) |
| `leftSource` | string? | Pairwise only: actual source shown on left |
| `rightSource` | string? | Pairwise only: actual source shown on right |
| `createdAt` | number | Timestamp |
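Expressed as a Convex schema, the two tables might look like the following sketch (field names come from the tables above; the file layout and validator choices are assumptions, not the project's actual `convex/schema.ts`):

```typescript
// Hypothetical convex/schema.ts mirroring the sessions/evaluations tables.
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  sessions: defineTable({
    sessionId: v.string(),            // UUID, unique per session
    evaluatorName: v.string(),        // self-reported name
    startedAt: v.number(),            // timestamp
    completedAt: v.optional(v.number()), // null/absent if abandoned
    userAgent: v.string(),
    seed: v.number(),                 // PRNG seed for reproducibility
  }),
  evaluations: defineTable({
    evaluatorName: v.string(),
    sessionId: v.string(),
    task: v.union(v.literal("turing"), v.literal("pairwise")),
    questionIndex: v.number(),        // 0-4
    response: v.string(),
    reviewSource: v.string(),
    domain: v.string(),
    timeSpentMs: v.number(),
    leftSource: v.optional(v.string()),  // pairwise only
    rightSource: v.optional(v.string()), // pairwise only
    createdAt: v.number(),
  }),
});
```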
Task 1 — Triplet Identification:
- Selection rate per source (% of times each was picked as "real")
- Fleiss' kappa for inter-annotator agreement
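Fleiss' kappa can be computed directly from per-item category counts. A self-contained sketch, assuming every item (triplet question) receives the same number of ratings:

```typescript
// Fleiss' kappa from per-item category counts.
// ratings[i][c] = number of raters who assigned item i to category c;
// every item must have the same total number of ratings.
function fleissKappa(ratings: number[][]): number {
  const N = ratings.length;                        // items
  const k = ratings[0].length;                     // categories
  const n = ratings[0].reduce((s, x) => s + x, 0); // raters per item
  const categoryTotals = new Array(k).fill(0);
  let agreementSum = 0;
  for (const row of ratings) {
    let pairs = 0;
    for (let c = 0; c < k; c++) {
      pairs += row[c] * (row[c] - 1); // agreeing rater pairs in this item
      categoryTotals[c] += row[c];
    }
    agreementSum += pairs / (n * (n - 1));
  }
  const pBar = agreementSum / N; // mean observed agreement
  const pE = categoryTotals.reduce(
    (s, t) => s + (t / (N * n)) ** 2, 0); // chance agreement
  return (pBar - pE) / (1 - pE);
}
```

Here the categories would be the three sources (`real`, `cera`, `heuristic`), each item one triplet question, and each rater one evaluator.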
Task 2 — Pairwise Naturalness:
- Win rate per source within each pair type
- Resolved using `leftSource`/`rightSource` to map position → actual source
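The position-to-source resolution amounts to a one-line lookup. A sketch using an illustrative `PairwiseRow` shape (field names match the `evaluations` table; function names are assumptions):

```typescript
// Minimal shape of a pairwise evaluation row (subset of the evaluations table).
interface PairwiseRow {
  response: "left" | "right"; // which side the evaluator picked
  leftSource: string;         // actual source shown on the left
  rightSource: string;        // actual source shown on the right
}

// Map the positional response back to the underlying source.
function winningSource(row: PairwiseRow): string {
  return row.response === "left" ? row.leftSource : row.rightSource;
}

// Fraction of rows in which the given source won.
function winRate(rows: PairwiseRow[], source: string): number {
  const wins = rows.filter((r) => winningSource(r) === source).length;
  return wins / rows.length;
}
```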
Access at `/results?key=YOUR_RESULTS_KEY`. Features:
- Live stats: Real-time triplet rates, pairwise preferences, Fleiss' kappa, per-evaluator breakdowns
- Session management: Include/exclude evaluators, soft-delete (trash) with restore, permanent delete
- Pause/Resume: Temporarily prevent new evaluators from starting
- End Survey: Show "Study Complete" to all visitors
- Copy LaTeX Table: Publication-ready table with all metrics
- Download CSV: Raw evaluation data
- Frontend: Vite + React 19 + TypeScript + Tailwind CSS v4
- Backend: Convex Cloud (free tier) — reactive queries for live dashboard updates
- Deployment: Vercel (SPA)
- Charts: Recharts (results dashboard)