PrivacyLens

A differentially-private analytics engine with a real-time, uncertainty-aware UI. Analysts write SQL aggregates; the engine adds calibrated Laplace or Gaussian noise and tracks a per-dataset ε-budget; the dashboard renders noisy estimates with confidence intervals, suppression, and live attack-pattern alerts.

PrivacyLens is an end-to-end demonstration of a production-shaped DP query system: a FastAPI + DuckDB backend, a React + Vite frontend, JWT auth with RBAC, an audit-bearing dataset registration pipeline, and a streaming WebSocket layer for budget and alert updates.

Main dashboard: query builder, ε-slider with simulation preview, noisy bar chart with 68 % / 95 % confidence intervals, suppression markers (⊘) for groups under min_group_size, and the live ε-budget panel.

Attack-Replay Theater: a step-by-step reconstruction of a detected differencing / averaging attack — shows the offending query sequence, the heuristic that fired, and what the analyst saw in real time.

Features

Privacy engine

Laplace and Gaussian mechanisms with per-aggregate sensitivity bounds (COUNT, SUM, MEAN, PERCENTILE, HISTOGRAM).
Per-dataset ε-budget ledger persisted in PostgreSQL — every successful query atomically deducts its cost; once exhausted the API returns 402 Payment Required.
Bin-based ε allocation with a reserved sub-budget (EPSILON_TEST = 0.5) for the noisy threshold used to suppress small groups.
Suppression contract: groups below min_group_size are returned as explicit null values (never zero) and rendered with a hatched ⊘ pattern in the UI.
Free simulation preview (no budget charged) so analysts can preview the expected confidence interval at a chosen ε before committing.
DuckDB execution over Parquet datasets with per-column clipping bounds validated at registration time.
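
As a rough illustration of the mechanism and confidence-interval math listed above, here is a minimal, stdlib-only sketch of a clipped SUM released under the Laplace mechanism. The function names and the add/remove-one-row sensitivity convention are assumptions made for this example; it is not the engine's actual code.

```python
import math
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of Laplace noise that gives epsilon-DP for an L1 sensitivity."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def ci_half_width(scale: float, level: float) -> float:
    """Half-width t such that P(|noise| <= t) = level for Laplace(0, scale)."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    return -scale * math.log(1.0 - level)


def noisy_sum(values, lower, upper, epsilon, rng):
    """Clip each value to [lower, upper], sum, and add Laplace noise.

    Assumption: neighbouring datasets differ by adding or removing one row,
    so one row changes the clipped sum by at most max(|lower|, |upper|).
    Returns (noisy estimate, noise scale).
    """
    clipped = [min(max(v, lower), upper) for v in values]
    scale = laplace_scale(max(abs(lower), abs(upper)), epsilon)
    return sum(clipped) + sample_laplace(scale, rng), scale


if __name__ == "__main__":
    rng = random.Random(7)
    estimate, b = noisy_sum([52_000, 61_500, 300_000], 20_000, 250_000, 0.5, rng)
    print(f"estimate={estimate:.0f}")
    print(f"68% half-width={ci_half_width(b, 0.68):.0f}")
    print(f"95% half-width={ci_half_width(b, 0.95):.0f}")
```

For Laplace noise with scale b, the 95 % half-width works out to b·ln 20 (about 3.0 b) and the 68 % half-width to roughly 1.14 b, which is why halving ε doubles the width of both bands.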

Front-end

Visual query builder + CodeMirror-based SQL editor, with column-aware filter widgets and a free cohort preview for early small-group warnings.
Real-time noisy charts (visx + D3) with 68 % / 95 % confidence intervals, noise indicators, and explicit suppression rendering.
Significance Inspector: posterior P(μ_A > μ_B) computed from the analytical CDF of the difference of noise variables (Laplace and Gaussian).
Noise Sandbox: Hypothetical-Outcome-Plot (HOP) cloud of 80 posterior samples that can be re-rolled and ε-scaled to make uncertainty physically visible.
Trust Radar / ε-Oracle: five-axis glyph and verdict for each query (sample size, ε spent, suppression, freshness, attack-detector flag).
Attack-Replay Theater: post-hoc, step-by-step reconstruction of detected differencing / averaging / rapid-query attacks against the audit log.
Session Budget Planner & What-If Console: plan ε allocation across a session before spending it, and probe how a chart would change at a different ε.
History / Audit / Provenance Drawer: every query, its ε cost, the analyst who ran it, and the resulting noise are auditable from the UI.
Admin UI: create users, drag-and-drop dataset upload + 4-step registration wizard with bounds validation.
Accessibility: keyboard-first navigation, ARIA labels, multi-channel encoding (color + shape + text) so suppression and confidence are visible without color vision.
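
To make the Significance Inspector's quantity concrete, the sketch below computes P(μ_A > μ_B) for the Gaussian case only. It assumes a flat prior on the true values and independent Gaussian noise with known standard deviations on each estimate, so the difference is Gaussian with variance σ_A² + σ_B². The function name is hypothetical, and the Laplace case (which has a different closed form) is not covered here.

```python
import math


def prob_a_greater_gaussian(est_a: float, est_b: float,
                            sigma_a: float, sigma_b: float) -> float:
    """P(mu_A > mu_B) given noisy estimates and their Gaussian noise std devs.

    Assumes a flat prior and independent noise, so mu_A - mu_B is
    Normal(est_a - est_b, sigma_a**2 + sigma_b**2).
    """
    if sigma_a <= 0 or sigma_b <= 0:
        raise ValueError("noise standard deviations must be positive")
    z = (est_a - est_b) / math.hypot(sigma_a, sigma_b)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    print(round(prob_a_greater_gaussian(103.0, 100.0, 3.0, 4.0), 4))
```

When the two estimates are equal the result is exactly 0.5, and it moves toward 0 or 1 as the gap grows relative to the combined noise.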

Operations & security

JWT password login (PBKDF2-SHA256, 240 k iterations) with three roles — ADMIN, ANALYST, AUDITOR — enforced by FastAPI dependencies and React route guards.
Rate limiting (fixed window per minute, per (tenant, user)) backed by Redis when available, in-process otherwise.
Attack detector: rolling-window heuristics for differencing, averaging, and rapid-query patterns; suspicious queries are flagged in the audit log.
Audit log for every privacy-affecting event: query execution, dataset registration / replacement / archive, user creation, attack flags.
WebSocket hub with heartbeat, exponential-backoff reconnect, and topic pub/sub for budget and alert channels.
Sentry integration (DSN-gated, no PII / request bodies) and Prometheus metrics endpoint.
Production-startup guards: the backend refuses to boot when ENV=production if AUTH_DEV_MODE=true or JWT_SECRET is the shipped default.
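
The in-process fallback for the fixed-window rate limiter described above could look roughly like the following. This is a sketch under assumptions: the class name, the injectable clock, and the single-process dictionary are all illustrative, and the Redis-backed path is not shown.

```python
import time
from typing import Callable


class FixedWindowLimiter:
    """Per-(tenant, user) fixed-window limiter with one-minute windows."""

    def __init__(self, limit_per_minute: int = 120,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit_per_minute
        self.clock = clock
        # (tenant_id, user_id) -> (window index, requests seen in that window)
        self._state: dict[tuple[str, str], tuple[int, int]] = {}

    def allow(self, tenant_id: str, user_id: str) -> bool:
        """Return True and count the request if the caller is under quota."""
        window = int(self.clock() // 60)
        key = (tenant_id, user_id)
        seen_window, count = self._state.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new minute started, so the counter resets
        if count >= self.limit:
            return False
        self._state[key] = (window, count + 1)
        return True


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit_per_minute=2)
    print([limiter.allow("t1", "u1") for _ in range(3)])
```

A fixed window is simple and cheap, at the cost of allowing short bursts around a window boundary.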

Architecture

┌──────────────────────┐     HTTP / JSON     ┌──────────────────────────┐
│  React + Vite SPA    │ ──────────────────▶ │  FastAPI                 │
│  (TanStack Query,    │ ◀── WebSocket ───── │   ├─ DP engine (Laplace/ │
│   Zustand, visx,     │                     │   │   Gaussian + ledger) │
│   CodeMirror)        │                     │   ├─ Attack detector     │
└──────────────────────┘                     │   ├─ Audit + RBAC        │
                                             │   └─ Auth (JWT)          │
                                             └────────────┬─────────────┘
                                                          │
                          ┌───────────────────────────────┼─────────────────┐
                          ▼                               ▼                 ▼
                    ┌──────────┐                    ┌──────────┐      ┌──────────┐
                    │ DuckDB   │                    │ Postgres │      │ Redis    │
                    │ (Parquet │                    │ (ε-ledger│      │ (cache,  │
                    │  scans)  │                    │  + audit)│      │  sims,   │
                    └──────────┘                    └──────────┘      │  ratelim)│
                                                                      └──────────┘

The single source of truth for API types is the backend's Pydantic schemas → OpenAPI → packages/shared/types.ts. CI fails on drift.

Tech stack

Layer	Stack
Backend	Python 3.11, FastAPI, Pydantic v2, SQLAlchemy 2 (async), Alembic, DuckDB, NumPy, Google `dp-accounting`
Storage	PostgreSQL 15 (ledger, audit, users), Redis 7 (cache, rate limit), Parquet (datasets)
Frontend	React 18, TypeScript 5, Vite 6, TanStack Query 5, Zustand 5, Radix UI, visx + D3, CodeMirror 6, Tailwind
Testing	pytest + Hypothesis, Vitest + React Testing Library, Playwright
Observability	Prometheus client, Sentry SDK
Auth	PyJWT (HS256), PBKDF2-SHA256 password hashing

Quick start (Docker)

Prerequisites

Tool	Version
Docker + Docker Compose	≥ 24
GNU `make`	any recent
Python	≥ 3.11 (only for tests / type-gen on the host)
Node.js	≥ 20 (only for tests / type-gen on the host)

1. Generate the synthetic demo dataset (one-time)

python data/generate_demo.py
# → data/employees_demo.parquet  (~15 000 synthetic employee rows)

2. Start everything

make dev
# or: make dev-detached  (background)

Services exposed:

Service	URL
Frontend (Vite)	http://localhost:5173
Backend (FastAPI)	http://localhost:8000
Swagger / OpenAPI	http://localhost:8000/docs

Wait until the backend logs Application startup complete.

3. Seed demo users + audit history

make seed-users   # creates admin@demo.local / analyst@demo.local / auditor@demo.local
make seed         # runs 4 representative queries, warms the simulation cache

4. Sign in

Open http://localhost:5173 and sign in with any of:

Username	Role	Password
`admin@demo.local`	`ADMIN`	`demo`
`analyst@demo.local`	`ANALYST`	`demo`
`auditor@demo.local`	`AUDITOR`	`demo`

Pick Employees Demo in the dataset dropdown and run a query.

Stop / clean up

make stop         # stop containers (data preserved)
make clean        # stop + remove volumes (Postgres data wiped)

Configuration

All settings are environment-driven. Copy .env.example to .env and edit.

Variable	Default	Purpose
`JWT_SECRET`	`dev-secret-change-in-production`	HS256 signing key. Must be overridden in production (`openssl rand -hex 32`).
`STORAGE_BACKEND`	`db`	`db` (Postgres, durable) or `memory` (ephemeral, tests).
`AUTH_DEV_MODE`	`false`	When `true`, identity comes from `X-User-Id` / `X-Tenant-Id` / `X-Role` headers (local-dev only).
`RATE_LIMIT_ENABLED`	`true`	Toggle the per-`(tenant, user)` rate limiter.
`RATE_LIMIT_REQUESTS_PER_MINUTE`	`120`	Fixed-window quota.
`SIM_CACHE_TTL_SECONDS`	`600`	TTL for cached simulation previews.
`SENTRY_DSN`	empty	Set to enable Sentry; empty disables.
`SENTRY_ENVIRONMENT`	`dev`	Sentry environment tag.
`SENTRY_TRACES_SAMPLE_RATE`	`0.0`	Performance sampling fraction.
`DATABASE_URL`	(compose default)	Override only if you point both host + container to the same managed Postgres.
`REDIS_URL`	(compose default)	Leave unset to disable cross-replica shared state.
`VITE_API_BASE_URL`	`http://localhost:8000`	Browser-side API base.
`VITE_WS_BASE_URL`	`ws://localhost:8000`	Browser-side WebSocket base.

Production guard: when ENV=production, the backend refuses to start if either AUTH_DEV_MODE=true or JWT_SECRET is still the shipped default.
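
A startup check of this kind can be sketched as a pure function over the environment. The function name and error strings are hypothetical; only the variable names and the default secret come from the table above.

```python
import os
from collections.abc import Mapping

DEFAULT_JWT_SECRET = "dev-secret-change-in-production"


def production_guard_errors(env: Mapping[str, str]) -> list[str]:
    """Return the reasons startup should be refused (empty list means OK)."""
    if env.get("ENV") != "production":
        return []
    errors: list[str] = []
    if env.get("AUTH_DEV_MODE", "false").lower() == "true":
        errors.append("AUTH_DEV_MODE must not be true in production")
    # An unset JWT_SECRET falls back to the shipped default, so it also fails.
    if env.get("JWT_SECRET", DEFAULT_JWT_SECRET) == DEFAULT_JWT_SECRET:
        errors.append("JWT_SECRET is still the shipped default")
    return errors


if __name__ == "__main__":
    problems = production_guard_errors(os.environ)
    if problems:
        raise SystemExit("refusing to start: " + "; ".join(problems))
    print("startup checks passed")
```

Returning a list rather than raising on the first problem lets the operator see every misconfiguration in one boot attempt.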

Authentication & roles

PrivacyLens ships real password login with JWT bearer tokens — there is no public signup; accounts are seeded by an admin.

Roles

Role	Can do
`ADMIN`	Everything `ANALYST` can do, plus: create users, upload + register datasets, archive datasets.
`ANALYST`	Run queries, simulate previews, view their own history, view the audit log for their tenant.
`AUDITOR`	Read-only — view datasets, audit log, attack flags. Cannot run queries.

Programmatic login

curl -X POST http://localhost:8000/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"analyst@demo.local","password":"demo"}'

TOKEN=...   # returned access_token
curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/v1/auth/me

The frontend stores the JWT in localStorage under privacylens_token, attaches it to every API call, and clears it on logout or HTTP 401.
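
The curl flow above can also be driven from Python. The sketch below uses only the standard library; the helper names are made up for this example, and it assumes the login response is a JSON object containing an access_token field, as the curl comment suggests.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def build_login_request(username: str, password: str,
                        base: str = API_BASE) -> urllib.request.Request:
    """POST /v1/auth/login with a JSON body."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/v1/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_me_request(token: str, base: str = API_BASE) -> urllib.request.Request:
    """GET /v1/auth/me with the bearer token attached."""
    return urllib.request.Request(
        f"{base}/v1/auth/me",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)


def login_and_whoami(username: str, password: str):
    """Log in, then return the identity reported by /v1/auth/me."""
    token = fetch_json(build_login_request(username, password))["access_token"]
    return fetch_json(build_me_request(token))


if __name__ == "__main__":
    # Requires the stack from the quick start to be running and seeded.
    print(login_and_whoami("analyst@demo.local", "demo"))
```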

Dataset registration

Datasets carry their own ε-budget, clipping bounds, and allowed aggregations. Registration is an audit-bearing admin operation and can be done two ways.

From the UI (`/admin/datasets/new`)

Upload — drag-and-drop a .csv or .parquet file (≤ 200 MiB; override with DATASET_UPLOAD_MAX_BYTES).
Schema preview — the backend infers types via DuckDB and returns observed min/max plus a 10-row sample.
Configure — edit dataset id, display name, per-column types, configured bounds, allowed aggregates, ε-budget, min_group_size.
Register — bounds are validated against the staged file (registration fails when any column has > 1 % of rows outside the configured bounds). On success the dataset appears immediately in the Query Builder dropdown and a dataset_registered audit event is written.
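
The bounds check in the Register step can be pictured as follows. This is a simplified in-memory sketch with hypothetical function names; the real validation runs against the staged file, and details such as null handling are not specified here.

```python
from collections.abc import Iterable, Mapping


def out_of_bounds_fraction(values: Iterable[float],
                           lower: float, upper: float) -> float:
    """Fraction of values that fall outside the closed interval [lower, upper]."""
    values = list(values)
    if not values:
        return 0.0
    outside = sum(1 for v in values if v < lower or v > upper)
    return outside / len(values)


def validate_bounds(columns: Mapping[str, list[float]],
                    bounds: Mapping[str, tuple[float, float]],
                    max_fraction: float = 0.01) -> dict[str, float]:
    """Return {column: fraction} for every column exceeding max_fraction."""
    failures: dict[str, float] = {}
    for name, (lower, upper) in bounds.items():
        fraction = out_of_bounds_fraction(columns[name], lower, upper)
        if fraction > max_fraction:
            failures[name] = fraction
    return failures


if __name__ == "__main__":
    salaries = [50_000.0] * 97 + [10_000.0, 400_000.0, 900_000.0]
    print(validate_bounds({"salary": salaries}, {"salary": (20_000, 250_000)}))
```

An empty result means registration may proceed; a non-empty one names the columns whose configured bounds are too tight for the data.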

From the CLI

cd packages/backend
python -m scripts.register_dataset \
  --config datasets/employees_demo.yaml \
  [--force] [--archive] [--dry-run] [--skip-bounds-check]

Example config:

dataset_id:        employees_demo
display_name:      "Employees Demo"
parquet_path:      data/employees_demo.parquet
table_name:        employees_demo
epsilon_budget:    1.0
min_group_size:    50
allowed_aggregations: [count, sum, mean, percentile, histogram]
columns:
  - { name: salary,     type: numeric,     clipping_bounds: [20000, 250000] }
  - { name: department, type: categorical }
  - { name: hire_date,  type: date }

Audit event	When
`dataset_registered`	New `dataset_id` (or re-registering an archived one)
`dataset_replaced`	`--force` after a column-set diff
`dataset_archived`	`--archive` flag

See docs/DATASETS.md for the full lifecycle.

Running tests

make install            # one-time: pip install + npm install
make install-e2e        # one-time: Playwright browsers

make test               # full suite (backend + frontend unit)
make test-backend       # pytest only
make test-frontend      # Vitest only
make test-e2e           # Playwright (requires `make dev-detached && make seed`)
make test-e2e-ui        # interactive Playwright UI

What the suites cover:

Backend — DP mechanism math, sensitivity bounds, budget ledger accounting and exhaustion, suppression contract, audit store, attack heuristics, WebSocket hub, full API integration including 402-on-exhaustion.
Frontend — Zustand stores, WebSocket hub dispatch, React Query hooks, posterior math, render-level component tests with React Testing Library.
E2E (Playwright) — login flow, role guards, end-to-end query → chart → audit, attack-detector alert path, dataset wizard.

Project layout

privacylens/
├── docker-compose.yml         # postgres, redis, backend, frontend
├── Makefile                   # dev / test / type-gen entrypoints
├── .env.example               # documented configuration
├── data/
│   ├── generate_demo.py       # synthetic dataset generator
│   └── uploads/               # admin-uploaded staging files (gitignored)
├── docs/
│   ├── DATASETS.md            # dataset-registration lifecycle
│   ├── RBAC.md                # roles and route guards
│   └── BACKUP.md              # Postgres backup / restore
├── imgs/                      # README screenshots
├── packages/
│   ├── backend/               # FastAPI app, DP engine, migrations
│   │   ├── app/{api,core,models,ws}
│   │   ├── tests/
│   │   ├── scripts/           # seed.py, register_dataset.py, seed_users.py
│   │   └── migrations/        # Alembic
│   ├── frontend/              # React SPA
│   │   ├── src/{components,hooks,pages,stores,lib}
│   │   ├── tests/             # Playwright E2E
│   │   ├── vite.config.ts
│   │   └── playwright.config.ts
│   └── shared/                # Auto-generated TypeScript types
│       └── types.ts           # ← do not edit; run `make generate-types`
└── scripts/
    ├── add-named-exports.js
    ├── backup-postgres.sh
    └── restore-postgres.sh

Development

Install host-side dev dependencies (for tests + type-gen)

make install
# pip install -e ".[dev]" + npm install

Run the backend in isolation

cd packages/backend
uvicorn app.main:app --reload

The host-native run uses the localhost defaults from app/config.py — Postgres / Redis must be running locally (or override DATABASE_URL / REDIS_URL).

Run the frontend dev server

cd packages/frontend
npm run dev      # Vite at http://localhost:5173
npm run typecheck

Database migrations

make migrate                                 # apply all
cd packages/backend && alembic revision --autogenerate -m "msg"   # author a new one

Backups

./scripts/backup-postgres.sh
./scripts/restore-postgres.sh

See docs/BACKUP.md.

Type-generation pipeline

The single source of truth for shared types is the backend's Pydantic schemas. The flow:

schemas.py  ──▶  openapi.yaml  ──▶  packages/shared/types.ts
                (FastAPI)            (openapi-typescript)

Regenerate after any schema change:

make generate-types

CI runs make check-drift and fails the build if packages/shared/types.ts is out of sync with the backend.

API reference

The full live OpenAPI spec is at http://localhost:8000/docs. Highlights:

Endpoint	Method	What
`/v1/auth/login`	`POST`	Username + password → JWT access token
`/v1/auth/me`	`GET`	Current user identity
`/v1/datasets`	`GET`	Active datasets visible to the caller
`/v1/datasets/{id}/schema`	`GET`	Columns, bounds, allowed aggregations
`/v1/queries/preview-cohort`	`POST`	Free cohort-size preview (used by Filter Panel)
`/v1/queries/simulate`	`POST`	Free CI preview at a chosen ε (no budget charged)
`/v1/queries/execute`	`POST`	Run query, deduct ε, return noisy result + audit id
`/v1/budget/{dataset_id}`	`GET`	Current per-dataset ε ledger snapshot
`/v1/audit`	`GET`	Paginated audit log (filterable by attack flags)
`/v1/admin/users`	`GET`/`POST`	Admin: list / create users
`/v1/admin/datasets/uploads`	`POST`	Admin: stream-upload a dataset file
`/v1/admin/datasets`	`POST`	Admin: register a staged dataset
`/ws`	`WS`	Pub/sub for `budget` and `alert` topics
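
To illustrate how `/v1/queries/execute` and the per-dataset ledger interact, here is a minimal in-memory sketch of deduct-or-refuse accounting. It is only a stand-in: the class and function names are hypothetical, a thread lock replaces the PostgreSQL transaction, and the mapping to status codes is reduced to 200 versus 402.

```python
import threading


class BudgetExhausted(Exception):
    """Raised when a charge would exceed the dataset's remaining budget."""


class EpsilonLedger:
    """Tracks epsilon spent against a fixed per-dataset budget."""

    def __init__(self, total_epsilon: float) -> None:
        self.total = total_epsilon
        self.spent = 0.0
        self._lock = threading.Lock()

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def charge(self, epsilon: float) -> float:
        """Atomically deduct epsilon and return what remains afterwards."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        with self._lock:
            # Small tolerance so float rounding does not reject an exact fit.
            if self.spent + epsilon > self.total + 1e-12:
                raise BudgetExhausted(f"need {epsilon}, have {self.remaining}")
            self.spent += epsilon
            return self.total - self.spent


def execute_status(ledger: EpsilonLedger, epsilon: float) -> int:
    """HTTP status a hypothetical execute handler would return."""
    try:
        ledger.charge(epsilon)
    except BudgetExhausted:
        return 402  # Payment Required: budget exhausted
    return 200


if __name__ == "__main__":
    ledger = EpsilonLedger(total_epsilon=1.0)
    print([execute_status(ledger, 0.4) for _ in range(3)])
```

The check and the deduction happen under the same lock so that two concurrent queries cannot both spend the last of the budget.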

Limitations

PrivacyLens is a research / thesis-scale system. Known constraints:

Single-tenant per workspace. Multi-tenancy is wired through the data model but not exercised at scale.
Laplace / Gaussian only. Advanced composition accounting (RDP, zCDP) is not wired in beyond the dp-accounting plumbing.
DuckDB single-node. Datasets must fit on one host; there is no distributed execution layer.
No external joins. Queries are restricted to a single registered dataset at a time.
Demo password (demo) is intentional for the seeded users — change it before any internet-facing deploy.

License & acknowledgements

This project is released under the MIT License (see LICENSE — add one before publishing if you fork it).

PrivacyLens was built as a master's thesis project and stands on the shoulders of:

Differential privacy — Dwork & Roth, The Algorithmic Foundations of Differential Privacy (2014); Google's dp-accounting; Microsoft's SmartNoise.
Uncertainty visualisation — Hullman et al., Hypothetical Outcome Plots (CHI 2015); Kale et al., Hypothetical Outcome Plots Help Untrained Observers Judge Trends in Ambiguous Data (TVCG 2019).
Open-source toolchain — FastAPI, DuckDB, React, Vite, TanStack Query, Zustand, visx, Radix UI, CodeMirror, Playwright.

Issues, PRs, and academic-feedback discussions are welcome.
