A differentially-private analytics engine with a real-time, uncertainty-aware UI. Analysts write SQL aggregates, the engine adds calibrated Laplace/Gaussian noise, tracks a per-dataset ε-budget, and the dashboard renders noisy estimates with confidence intervals, suppression, and live attack-pattern alerts.
PrivacyLens is an end-to-end demonstration of a production-shaped DP query system: a FastAPI + DuckDB backend, a React + Vite frontend, JWT auth with RBAC, an audit-bearing dataset registration pipeline, and a streaming WebSocket layer for budget and alert updates.
Main dashboard: query builder, ε-slider with simulation preview, noisy bar chart with 68 % / 95 % confidence intervals, suppression markers (⊘) for groups under min_group_size, and the live ε-budget panel.
Attack-Replay Theater: a step-by-step reconstruction of a detected differencing / averaging attack — shows the offending query sequence, the heuristic that fired, and what the analyst saw in real time.
- Features
- Architecture
- Tech stack
- Quick start
- Configuration
- Authentication & roles
- Dataset registration
- Running tests
- Project layout
- Development
- Type-generation pipeline
- API reference
- Limitations
- License & acknowledgements
- Laplace and Gaussian mechanisms with per-aggregate sensitivity bounds (`COUNT`, `SUM`, `MEAN`, `PERCENTILE`, `HISTOGRAM`).
- Per-dataset ε-budget ledger persisted in PostgreSQL: every successful query atomically deducts its cost; once the budget is exhausted the API returns `402 Payment Required`.
- Bin-based ε allocation with a reserved sub-budget (`EPSILON_TEST = 0.5`) for the noisy threshold used to suppress small groups.
- Suppression contract: groups below `min_group_size` are returned as explicit `null` values (never zero) and rendered with a hatched ⊘ pattern in the UI.
- Free simulation preview (no budget charged) so analysts can see the expected confidence interval at a chosen ε before committing.
- DuckDB execution over Parquet datasets with per-column clipping bounds validated at registration time.
- Visual query builder + CodeMirror-based SQL editor, with column-aware filter widgets and a free cohort preview for early small-group warnings.
- Real-time noisy charts (visx + D3) with 68 % / 95 % confidence intervals, noise indicators, and explicit suppression rendering.
- Significance Inspector: posterior P(μ_A > μ_B) computed from the analytical CDF of the difference of noise variables (Laplace and Gaussian).
- Noise Sandbox: Hypothetical-Outcome-Plot (HOP) cloud of 80 posterior samples that can be re-rolled and ε-scaled to make uncertainty physically visible.
- Trust Radar / ε-Oracle: five-axis glyph and verdict for each query (sample size, ε spent, suppression, freshness, attack-detector flag).
- Attack-Replay Theater: post-hoc, step-by-step reconstruction of detected differencing / averaging / rapid-query attacks against the audit log.
- Session Budget Planner & What-If Console: plan ε allocation across a session before spending it, and probe how a chart would change at a different ε.
- History / Audit / Provenance Drawer: every query, its ε cost, the analyst who ran it, and the resulting noise are auditable from the UI.
- Admin UI: create users, drag-and-drop dataset upload + 4-step registration wizard with bounds validation.
- Accessibility: keyboard-first navigation, ARIA labels, multi-channel encoding (color + shape + text) so suppression and confidence are visible without color vision.
- JWT password login (PBKDF2-SHA256, 240 k iterations) with three roles (`ADMIN`, `ANALYST`, `AUDITOR`) enforced by FastAPI dependencies and React route guards.
- Rate limiting (fixed window per minute, per `(tenant, user)`) backed by Redis when available, in-process otherwise.
- Attack detector: rolling-window heuristics for differencing, averaging, and rapid-query patterns; suspicious queries are flagged in the audit log.
- Audit log for every privacy-affecting event: query execution, dataset registration / replacement / archive, user creation, attack flags.
- WebSocket hub with heartbeat, exponential-backoff reconnect, and topic pub/sub for `budget` and `alert` channels.
- Sentry integration (DSN-gated, no PII / request bodies) and Prometheus metrics endpoint.
- Production-startup guards: the backend refuses to boot when `ENV=production` if `AUTH_DEV_MODE=true` or `JWT_SECRET` is the shipped default.
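
The noise-and-interval behaviour listed above can be pictured with a small sketch. It assumes a COUNT-style query with L1 sensitivity 1 and pure-ε Laplace noise; the function names are hypothetical and not taken from the PrivacyLens codebase, which may compute its intervals differently.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_ci_half_width(scale: float, level: float) -> float:
    """Half-width t with P(|noise| <= t) = level for Laplace(0, scale)."""
    # For Laplace noise, P(|X| <= t) = 1 - exp(-t / scale).
    return -scale * math.log(1.0 - level)


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> dict:
    """Noisy count plus 68 % and 95 % intervals (illustrative only)."""
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
    estimate = true_count + laplace_noise(scale, rng)
    hw68 = laplace_ci_half_width(scale, 0.68)
    hw95 = laplace_ci_half_width(scale, 0.95)
    return {
        "estimate": estimate,
        "ci68": (estimate - hw68, estimate + hw68),
        "ci95": (estimate - hw95, estimate + hw95),
    }
```

With `epsilon=0.5` the scale is 2, so the 95 % half-width is `2 · ln 20 ≈ 6.0` regardless of the true count. This is why a free simulation preview can show the expected interval without touching the data.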

    ┌──────────────────────┐     HTTP / JSON     ┌──────────────────────────┐
    │  React + Vite SPA    │ ──────────────────▶ │  FastAPI                 │
    │  (TanStack Query,    │ ◀── WebSocket ───── │  ├─ DP engine (Laplace/  │
    │   Zustand, visx,     │                     │  │   Gaussian + ledger)  │
    │   CodeMirror)        │                     │  ├─ Attack detector      │
    └──────────────────────┘                     │  ├─ Audit + RBAC         │
                                                 │  └─ Auth (JWT)           │
                                                 └────────────┬─────────────┘
                                                              │
                              ┌───────────────────────────────┼─────────────────┐
                              ▼                               ▼                 ▼
                         ┌──────────┐                    ┌──────────┐      ┌──────────┐
                         │  DuckDB  │                    │ Postgres │      │  Redis   │
                         │ (Parquet │                    │ (ε-ledger│      │ (cache,  │
                         │  scans)  │                    │  + audit)│      │  sims,   │
                         └──────────┘                    └──────────┘      │  ratelim)│
                                                                           └──────────┘

The single source of truth for API types is the backend's Pydantic schemas → OpenAPI → packages/shared/types.ts. CI fails on drift.
| Layer | Stack |
|---|---|
| Backend | Python 3.11, FastAPI, Pydantic v2, SQLAlchemy 2 (async), Alembic, DuckDB, NumPy, Google dp-accounting |
| Storage | PostgreSQL 15 (ledger, audit, users), Redis 7 (cache, rate limit), Parquet (datasets) |
| Frontend | React 18, TypeScript 5, Vite 6, TanStack Query 5, Zustand 5, Radix UI, visx + D3, CodeMirror 6, Tailwind |
| Testing | pytest + Hypothesis, Vitest + React Testing Library, Playwright |
| Observability | Prometheus client, Sentry SDK |
| Auth | PyJWT (HS256), PBKDF2-SHA256 password hashing |
| Tool | Version |
|---|---|
| Docker + Docker Compose | ≥ 24 |
| GNU `make` | any recent |
| Python | ≥ 3.11 (only for tests / type-gen on the host) |
| Node.js | ≥ 20 (only for tests / type-gen on the host) |

    python data/generate_demo.py
    # → data/employees_demo.parquet (~15 000 synthetic employee rows)

    make dev
    # or: make dev-detached (background)

Services exposed:

| Service | URL |
|---|---|
| Frontend (Vite) | http://localhost:5173 |
| Backend (FastAPI) | http://localhost:8000 |
| Swagger / OpenAPI | http://localhost:8000/docs |

Wait until the backend logs `Application startup complete`.

    make seed-users   # creates admin@demo.local / analyst@demo.local / auditor@demo.local
    make seed         # runs 4 representative queries, warms the simulation cache

Open http://localhost:5173 and sign in with any of:

| Username | Role | Password |
|---|---|---|
| `admin@demo.local` | `ADMIN` | `demo` |
| `analyst@demo.local` | `ANALYST` | `demo` |
| `auditor@demo.local` | `AUDITOR` | `demo` |

Pick Employees Demo in the dataset dropdown and run a query.

    make stop    # stop containers (data preserved)
    make clean   # stop + remove volumes (Postgres data wiped)

All settings are environment-driven. Copy `.env.example` to `.env` and edit.

| Variable | Default | Purpose |
|---|---|---|
| `JWT_SECRET` | `dev-secret-change-in-production` | HS256 signing key. Must be overridden in production (`openssl rand -hex 32`). |
| `STORAGE_BACKEND` | `db` | `db` (Postgres, durable) or `memory` (ephemeral, tests). |
| `AUTH_DEV_MODE` | `false` | When `true`, identity comes from `X-User-Id` / `X-Tenant-Id` / `X-Role` headers (local-dev only). |
| `RATE_LIMIT_ENABLED` | `true` | Toggle the per-`(tenant, user)` rate limiter. |
| `RATE_LIMIT_REQUESTS_PER_MINUTE` | `120` | Fixed-window quota. |
| `SIM_CACHE_TTL_SECONDS` | `600` | TTL for cached simulation previews. |
| `SENTRY_DSN` | empty | Set to enable Sentry; empty disables. |
| `SENTRY_ENVIRONMENT` | `dev` | Sentry environment tag. |
| `SENTRY_TRACES_SAMPLE_RATE` | `0.0` | Performance sampling fraction. |
| `DATABASE_URL` | (compose default) | Override only if you point both host + container to the same managed Postgres. |
| `REDIS_URL` | (compose default) | Leave unset to disable cross-replica shared state. |
| `VITE_API_BASE_URL` | `http://localhost:8000` | Browser-side API base. |
| `VITE_WS_BASE_URL` | `ws://localhost:8000` | Browser-side WebSocket base. |

Production guard: when `ENV=production`, the backend refuses to start if `AUTH_DEV_MODE=true` or if `JWT_SECRET` is left at the shipped default.
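
A startup check of this kind might look roughly like the sketch below. The function name `check_production_guards` and the error wording are hypothetical; only the variable names and the shipped default secret come from the configuration table.

```python
from typing import Mapping

# Shipped default from the configuration table; must never reach production.
DEFAULT_JWT_SECRET = "dev-secret-change-in-production"


def check_production_guards(env: Mapping[str, str]) -> None:
    """Raise RuntimeError if a production environment is configured unsafely."""
    if env.get("ENV", "").lower() != "production":
        return  # guards only apply in production

    problems = []
    if env.get("AUTH_DEV_MODE", "false").lower() == "true":
        problems.append("AUTH_DEV_MODE must not be true in production")
    if env.get("JWT_SECRET", DEFAULT_JWT_SECRET) == DEFAULT_JWT_SECRET:
        problems.append("JWT_SECRET must be overridden in production")

    if problems:
        raise RuntimeError("refusing to start: " + "; ".join(problems))
```

Calling `check_production_guards(os.environ)` before the app object is created would make a misconfigured deploy fail fast instead of serving traffic with header-based identity or a known signing key.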
PrivacyLens ships real password login with JWT bearer tokens — there is no public signup; accounts are seeded by an admin.
| Role | Can do |
|---|---|
| `ADMIN` | Everything `ANALYST` can do, plus: create users, upload + register datasets, archive datasets. |
| `ANALYST` | Run queries, simulate previews, view their own history, view the audit log for their tenant. |
| `AUDITOR` | Read-only: view datasets, audit log, attack flags. Cannot run queries. |

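
The role table can be read as a permission map. The sketch below is illustrative only: the permission strings and `is_allowed` are hypothetical names, it mirrors just the capabilities the table lists, and the actual FastAPI dependencies and React route guards may be organised differently.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "ADMIN"
    ANALYST = "ANALYST"
    AUDITOR = "AUDITOR"


# Hypothetical permission names derived from the role table.
ANALYST_PERMS = frozenset(
    {"run_query", "simulate_preview", "view_own_history", "view_audit_log"}
)

PERMISSIONS = {
    Role.ANALYST: ANALYST_PERMS,
    # ADMIN inherits everything ANALYST can do, plus admin operations.
    Role.ADMIN: ANALYST_PERMS
    | {"create_user", "register_dataset", "archive_dataset"},
    # AUDITOR is read-only and cannot run queries.
    Role.AUDITOR: frozenset(
        {"view_datasets", "view_audit_log", "view_attack_flags"}
    ),
}


def is_allowed(role: Role, permission: str) -> bool:
    """Return True if the role carries the given permission."""
    return permission in PERMISSIONS[role]
```

Keeping the map in one place means a backend guard and a frontend route guard can be checked against the same table.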

    curl -X POST http://localhost:8000/v1/auth/login \
      -H 'Content-Type: application/json' \
      -d '{"username":"analyst@demo.local","password":"demo"}'
    TOKEN=...   # returned access_token
    curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/v1/auth/me

The frontend stores the JWT in `localStorage` under `privacylens_token`, attaches it to every API call, and clears it on logout or HTTP 401.

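
On the server side, a login request ends in a password check. The feature list names PBKDF2-SHA256 with 240 k iterations; a minimal standard-library sketch of that scheme follows. The helper names, the 16-byte salt, and the storage layout are assumptions, not the project's actual code.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 240_000  # the "240 k iterations" stated in the feature list


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # assumed salt length; a fresh salt per user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

The constant-time comparison avoids leaking how many leading bytes of the digest matched.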
Datasets carry their own ε-budget, clipping bounds, and allowed aggregations. Registration is an audit-bearing admin operation and can be done two ways.
- Upload: drag-and-drop a `.csv` or `.parquet` file (≤ 200 MiB; override with `DATASET_UPLOAD_MAX_BYTES`).
- Schema preview: the backend infers types via DuckDB and returns observed `min` / `max` plus a 10-row sample.
- Configure: edit dataset id, display name, per-column types, configured bounds, allowed aggregates, ε-budget, `min_group_size`.
- Register: bounds are validated against the staged file (registration fails when any column has > 1 % of rows outside the configured bounds). On success the dataset appears immediately in the Query Builder dropdown and a `dataset_registered` audit event is written.
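
The bounds check in the Register step can be sketched in plain Python. This is an illustration under assumptions: the function names are hypothetical, nulls are skipped rather than counted, and the real check presumably runs inside DuckDB instead of iterating in Python.

```python
from typing import Iterable, Optional


def out_of_bounds_fraction(values: Iterable[Optional[float]],
                           lower: float, upper: float) -> float:
    """Fraction of non-null values that fall outside [lower, upper]."""
    total = 0
    outside = 0
    for value in values:
        if value is None:
            continue  # assumption: nulls do not count against the bounds
        total += 1
        if value < lower or value > upper:
            outside += 1
    return outside / total if total else 0.0


def bounds_ok(values: Iterable[Optional[float]], lower: float, upper: float,
              max_fraction: float = 0.01) -> bool:
    """Pass unless more than max_fraction of rows violate the bounds."""
    return out_of_bounds_fraction(values, lower, upper) <= max_fraction
```

With clipping bounds of `[20000, 250000]`, a salary column in which exactly 1 % of rows exceed 250 000 would still register; one more outlier would fail the check.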

    cd packages/backend
    python -m scripts.register_dataset \
      --config datasets/employees_demo.yaml \
      [--force] [--archive] [--dry-run] [--skip-bounds-check]

Example config:

    dataset_id: employees_demo
    display_name: "Employees Demo"
    parquet_path: data/employees_demo.parquet
    table_name: employees_demo
    epsilon_budget: 1.0
    min_group_size: 50
    allowed_aggregations: [count, sum, mean, percentile, histogram]
    columns:
      - { name: salary, type: numeric, clipping_bounds: [20000, 250000] }
      - { name: department, type: categorical }
      - { name: hire_date, type: date }

| Audit event | When |
|---|---|
| `dataset_registered` | New `dataset_id` (or re-registering an archived one) |
| `dataset_replaced` | `--force` after a column-set diff |
| `dataset_archived` | `--archive` flag |

See docs/DATASETS.md for the full lifecycle.

    make install        # one-time: pip install + npm install
    make install-e2e    # one-time: Playwright browsers
    make test           # full suite (backend + frontend unit)
    make test-backend   # pytest only
    make test-frontend  # Vitest only
    make test-e2e       # Playwright (requires `make dev-detached && make seed`)
    make test-e2e-ui    # interactive Playwright UI

What the suites cover:

- Backend — DP mechanism math, sensitivity bounds, budget ledger accounting and exhaustion, suppression contract, audit store, attack heuristics, WebSocket hub, full API integration including 402-on-exhaustion.
- Frontend — Zustand stores, WebSocket hub dispatch, React Query hooks, posterior math, render-level component tests with React Testing Library.
- E2E (Playwright) — login flow, role guards, end-to-end query → chart → audit, attack-detector alert path, dataset wizard.
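
The "posterior math" covered by the frontend suite presumably includes the Significance Inspector's P(μ_A > μ_B). As a rough picture of what such a test would pin down, here is a sketch in Python rather than the frontend's TypeScript. It assumes both groups received independent Laplace noise with the same scale and a flat prior, so the probability is the CDF of the noise difference evaluated at the observed gap. The function names are hypothetical.

```python
import math


def laplace_diff_cdf(t: float, scale: float) -> float:
    """CDF of D = N1 - N2 for independent N1, N2 ~ Laplace(0, scale)."""
    u = abs(t) / scale
    # Upper tail P(D > |t|) of the difference of two equal-scale Laplace draws.
    tail = 0.25 * (2.0 + u) * math.exp(-u)
    return 1.0 - tail if t >= 0 else tail


def prob_a_greater(noisy_a: float, noisy_b: float, scale: float) -> float:
    """P(mu_A > mu_B) given two noisy estimates, under the stated assumptions."""
    return laplace_diff_cdf(noisy_a - noisy_b, scale)
```

Equal noisy estimates give exactly 0.5, and the probability approaches 1 as the gap grows relative to the noise scale. Those are the kinds of invariants a unit test can check without any randomness.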

    privacylens/
    ├── docker-compose.yml        # postgres, redis, backend, frontend
    ├── Makefile                  # dev / test / type-gen entrypoints
    ├── .env.example              # documented configuration
    ├── data/
    │   ├── generate_demo.py      # synthetic dataset generator
    │   └── uploads/              # admin-uploaded staging files (gitignored)
    ├── docs/
    │   ├── DATASETS.md           # dataset-registration lifecycle
    │   ├── RBAC.md               # roles and route guards
    │   └── BACKUP.md             # Postgres backup / restore
    ├── imgs/                     # README screenshots
    ├── packages/
    │   ├── backend/              # FastAPI app, DP engine, migrations
    │   │   ├── app/{api,core,models,ws}
    │   │   ├── tests/
    │   │   ├── scripts/          # seed.py, register_dataset.py, seed_users.py
    │   │   └── migrations/       # Alembic
    │   ├── frontend/             # React SPA
    │   │   ├── src/{components,hooks,pages,stores,lib}
    │   │   ├── tests/            # Playwright E2E
    │   │   ├── vite.config.ts
    │   │   └── playwright.config.ts
    │   └── shared/               # Auto-generated TypeScript types
    │       └── types.ts          # ← do not edit; run `make generate-types`
    └── scripts/
        ├── add-named-exports.js
        ├── backup-postgres.sh
        └── restore-postgres.sh


    make install
    # pip install -e ".[dev]" + npm install

    cd packages/backend
    uvicorn app.main:app --reload

The host-native run uses the localhost defaults from `app/config.py`; Postgres / Redis must be running locally (or override `DATABASE_URL` / `REDIS_URL`).

    cd packages/frontend
    npm run dev         # Vite at http://localhost:5173
    npm run typecheck

    make migrate   # apply all
    cd packages/backend && alembic revision --autogenerate -m "msg"   # author a new one

    ./scripts/backup-postgres.sh
    ./scripts/restore-postgres.sh

See `docs/BACKUP.md`.

The single source of truth for shared types is the backend's Pydantic schemas. The flow:

    schemas.py ──▶ openapi.yaml ──▶ packages/shared/types.ts
            (FastAPI)  (openapi-typescript)

Regenerate after any schema change:

    make generate-types

CI runs `make check-drift` and fails the build if `packages/shared/types.ts` is out of sync with the backend.

The full live OpenAPI spec is at http://localhost:8000/docs. Highlights:
| Endpoint | Method | What |
|---|---|---|
| `/v1/auth/login` | `POST` | Username + password → JWT access token |
| `/v1/auth/me` | `GET` | Current user identity |
| `/v1/datasets` | `GET` | Active datasets visible to the caller |
| `/v1/datasets/{id}/schema` | `GET` | Columns, bounds, allowed aggregations |
| `/v1/queries/preview-cohort` | `POST` | Free cohort-size preview (used by Filter Panel) |
| `/v1/queries/simulate` | `POST` | Free CI preview at a chosen ε (no budget charged) |
| `/v1/queries/execute` | `POST` | Run query, deduct ε, return noisy result + audit id |
| `/v1/budget/{dataset_id}` | `GET` | Current per-dataset ε ledger snapshot |
| `/v1/audit` | `GET` | Paginated audit log (filterable by attack flags) |
| `/v1/admin/users` | `GET` / `POST` | Admin: list / create users |
| `/v1/admin/datasets/uploads` | `POST` | Admin: stream-upload a dataset file |
| `/v1/admin/datasets` | `POST` | Admin: register a staged dataset |
| `/ws` | `WS` | Pub/sub for `budget` and `alert` topics |

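
The `/v1/queries/execute` row implies a check-then-deduct step that has to be atomic, with exhaustion surfaced as `402 Payment Required`. The sketch below uses an in-memory ledger guarded by a lock as a stand-in for the PostgreSQL-backed ledger; the class and exception names are hypothetical.

```python
import threading


class BudgetExhausted(Exception):
    """Signals that the API layer should answer 402 Payment Required."""

    status_code = 402


class InMemoryLedger:
    """Toy per-dataset epsilon ledger; not the persisted implementation."""

    def __init__(self, total_epsilon: float) -> None:
        self._total = total_epsilon
        self._spent = 0.0
        self._lock = threading.Lock()

    def charge(self, cost: float) -> float:
        """Atomically deduct cost and return the remaining epsilon."""
        if cost <= 0:
            raise ValueError("cost must be positive")
        with self._lock:  # check and deduct under one lock
            if self._spent + cost > self._total + 1e-12:
                raise BudgetExhausted(
                    f"needs {cost}, only {self._total - self._spent} left"
                )
            self._spent += cost
            return self._total - self._spent
```

A handler would call `charge()` before running the query and translate `BudgetExhausted` into the 402 response. In a database-backed version the same guarantee would typically come from a single conditional `UPDATE` or a row lock inside a transaction.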
PrivacyLens is a research / thesis-scale system. Known constraints:
- Single-tenant per workspace. Multi-tenancy is wired through the data model but not exercised at scale.
- Laplace / Gaussian only. No advanced composition (RDP, zCDP) wiring beyond `dp-accounting` plumbing.
- DuckDB single-node. Datasets must fit on one host; there is no distributed execution layer.
- No external joins. Queries are restricted to a single registered dataset at a time.
- Demo password (`demo`) is intentional for the seeded users; change it before any internet-facing deploy.
This project is released under the MIT License (see LICENSE — add one before publishing if you fork it).
PrivacyLens was built as a master's thesis project and stands on the shoulders of:
- Differential privacy: Dwork & Roth, The Algorithmic Foundations of Differential Privacy (2014); Google's `dp-accounting`; Microsoft's SmartNoise.
- Uncertainty visualisation: Hullman et al., Hypothetical Outcome Plots (PLOS ONE 2015); Kale et al., Hypothetical Outcome Plots Help Untrained Observers Judge Trends in Ambiguous Data (TVCG 2019).
- Open-source toolchain — FastAPI, DuckDB, React, Vite, TanStack Query, Zustand, visx, Radix UI, CodeMirror, Playwright.
Issues, PRs, and academic-feedback discussions are welcome.