g1phy/privacylens

PrivacyLens

A differentially-private analytics engine with a real-time, uncertainty-aware UI. Analysts write SQL aggregates, the engine adds calibrated Laplace/Gaussian noise, tracks a per-dataset ε-budget, and the dashboard renders noisy estimates with confidence intervals, suppression, and live attack-pattern alerts.

PrivacyLens is an end-to-end demonstration of a production-shaped DP query system: a FastAPI + DuckDB backend, a React + Vite frontend, JWT auth with RBAC, an audit-bearing dataset registration pipeline, and a streaming WebSocket layer for budget and alert updates.

[Screenshot: PrivacyLens main page]
Main dashboard: query builder, ε-slider with simulation preview, noisy bar chart with 68 % / 95 % confidence intervals, suppression markers (⊘) for groups under min_group_size, and the live ε-budget panel.

[Screenshot: Attack-replay theater]
Attack-Replay Theater: a step-by-step reconstruction of a detected differencing / averaging attack. It shows the offending query sequence, the heuristic that fired, and what the analyst saw in real time.


Features

Privacy engine

  • Laplace and Gaussian mechanisms with per-aggregate sensitivity bounds (COUNT, SUM, MEAN, PERCENTILE, HISTOGRAM).
  • Per-dataset ε-budget ledger persisted in PostgreSQL — every successful query atomically deducts its cost; once exhausted the API returns 402 Payment Required.
  • Bin-based ε allocation with a reserved sub-budget (EPSILON_TEST = 0.5) for the noisy threshold used to suppress small groups.
  • Suppression contract: groups below min_group_size are returned as explicit null values (never zero) and rendered with a hatched ⊘ pattern in the UI.
  • Free simulation preview (no budget charged) so analysts can preview the expected confidence interval at a chosen ε before committing.
  • DuckDB execution over Parquet datasets with per-column clipping bounds validated at registration time.
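As an illustration of the Laplace path described above, the following is a minimal standard-library sketch, not the engine's actual code. The function names (`clip`, `laplace_noise`, `noisy_sum`, `laplace_ci_half_width`) are hypothetical, and the sensitivity rule assumes add/remove-one-row neighbouring datasets, which the source does not specify.

```python
import math
import random


def clip(value, lower, upper):
    """Clamp a value into the registered clipping bounds."""
    return max(lower, min(upper, value))


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_sum(values, lower, upper, epsilon, rng):
    """Return (noisy estimate, noise scale) for a clipped SUM.

    Assumption: adding or removing one row changes the clipped sum by at
    most max(|lower|, |upper|), so that is used as the L1 sensitivity.
    """
    clipped = [clip(v, lower, upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    scale = sensitivity / epsilon
    return sum(clipped) + laplace_noise(scale, rng), scale


def laplace_ci_half_width(scale, level):
    """Half-width t such that P(|noise| <= t) = level for Laplace(0, scale)."""
    return scale * math.log(1.0 / (1.0 - level))


if __name__ == "__main__":
    rng = random.Random(42)
    salaries = [30_000, 85_000, 400_000]  # the last value is clipped to 250 000
    estimate, scale = noisy_sum(salaries, 20_000, 250_000, epsilon=0.5, rng=rng)
    print("noisy sum:", round(estimate))
    print("68 % half-width:", round(laplace_ci_half_width(scale, 0.68)))
    print("95 % half-width:", round(laplace_ci_half_width(scale, 0.95)))
```

Because the interval depends only on the noise scale and not on the data, the same half-width calculation could back a free simulation preview: no query needs to run to know how wide the interval will be at a given ε.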

Front-end

  • Visual query builder + CodeMirror-based SQL editor, with column-aware filter widgets and a free cohort preview for early small-group warnings.
  • Real-time noisy charts (visx + D3) with 68 % / 95 % confidence intervals, noise indicators, and explicit suppression rendering.
  • Significance Inspector: posterior P(μ_A > μ_B) computed from the analytical CDF of the difference of noise variables (Laplace and Gaussian).
  • Noise Sandbox: Hypothetical-Outcome-Plot (HOP) cloud of 80 posterior samples that can be re-rolled and ε-scaled to make uncertainty physically visible.
  • Trust Radar / ε-Oracle: five-axis glyph and verdict for each query (sample size, ε spent, suppression, freshness, attack-detector flag).
  • Attack-Replay Theater: post-hoc, step-by-step reconstruction of detected differencing / averaging / rapid-query attacks against the audit log.
  • Session Budget Planner & What-If Console: plan ε allocation across a session before spending it, and probe how a chart would change at a different ε.
  • History / Audit / Provenance Drawer: every query, its ε cost, the analyst who ran it, and the resulting noise are auditable from the UI.
  • Admin UI: create users, drag-and-drop dataset upload + 4-step registration wizard with bounds validation.
  • Accessibility: keyboard-first navigation, ARIA labels, multi-channel encoding (color + shape + text) so suppression and confidence are visible without color vision.
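To make the Significance Inspector idea concrete, here is a hedged sketch of the Gaussian case only. It assumes a flat prior on the true values and independent Gaussian noise on each estimate, so the difference of the two noise terms is again Gaussian. The function names are hypothetical, and the Laplace case (which has a different closed form) is not shown.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_a_greater_gaussian(est_a, est_b, sigma_a, sigma_b):
    """P(mu_A > mu_B) given two noisy estimates with Gaussian noise.

    Assumption: flat prior and independent noise, so mu_A - mu_B is
    Gaussian with mean est_a - est_b and variance sigma_a**2 + sigma_b**2.
    """
    sigma_diff = math.hypot(sigma_a, sigma_b)
    return normal_cdf((est_a - est_b) / sigma_diff)


if __name__ == "__main__":
    # Two group means that differ by 3 000, each with noise sigma of 2 000.
    print(prob_a_greater_gaussian(72_000, 69_000, 2_000, 2_000))
```

A value near 0.5 means the noise swamps the observed difference, while values near 0 or 1 indicate that the ordering of the two groups is unlikely to be a noise artifact.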

Operations & security

  • JWT password login (PBKDF2-SHA256, 240 k iterations) with three roles — ADMIN, ANALYST, AUDITOR — enforced by FastAPI dependencies and React route guards.
  • Rate limiting (fixed window per minute, per (tenant, user)) backed by Redis when available, in-process otherwise.
  • Attack detector: rolling-window heuristics for differencing, averaging, and rapid-query patterns; suspicious queries are flagged in the audit log.
  • Audit log for every privacy-affecting event: query execution, dataset registration / replacement / archive, user creation, attack flags.
  • WebSocket hub with heartbeat, exponential-backoff reconnect, and topic pub/sub for budget and alert channels.
  • Sentry integration (DSN-gated, no PII / request bodies) and Prometheus metrics endpoint.
  • Production-startup guards: the backend refuses to boot when ENV=production if AUTH_DEV_MODE=true or JWT_SECRET is the shipped default.
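The in-process fallback for the fixed-window rate limiter could look roughly like the sketch below. The class name `FixedWindowLimiter` and its methods are hypothetical, and the real implementation (including the Redis-backed variant) may differ. The default of 120 requests per minute mirrors RATE_LIMIT_REQUESTS_PER_MINUTE.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Per-(tenant, user) fixed-window counter with one-minute windows."""

    def __init__(self, limit_per_minute=120, clock=time.time):
        self.limit = limit_per_minute
        self.clock = clock  # injectable so tests can control time
        self.counts = defaultdict(int)

    def allow(self, tenant_id, user_id):
        """Return True and count the request if the quota is not yet used up."""
        window = int(self.clock() // 60)  # index of the current minute
        key = (tenant_id, user_id, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True

    def prune(self):
        """Drop counters from past windows so memory does not grow unbounded."""
        current = int(self.clock() // 60)
        for key in [k for k in self.counts if k[2] < current]:
            del self.counts[key]


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit_per_minute=2)
    print([limiter.allow("tenant-1", "analyst") for _ in range(3)])
```

A fixed window is simple and cheap but allows short bursts at window boundaries; that trade-off is inherent to the technique rather than to this sketch.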

Architecture

┌──────────────────────┐     HTTP / JSON     ┌──────────────────────────┐
│  React + Vite SPA    │ ──────────────────▶ │  FastAPI                 │
│  (TanStack Query,    │ ◀── WebSocket ─────│   ├─ DP engine (Laplace/  │
│   Zustand, visx,     │                     │   │   Gaussian + ledger) │
│   CodeMirror)        │                     │   ├─ Attack detector     │
└──────────────────────┘                     │   ├─ Audit + RBAC        │
                                             │   └─ Auth (JWT)          │
                                             └────────────┬─────────────┘
                                                          │
                          ┌───────────────────────────────┼─────────────────┐
                          ▼                               ▼                 ▼
                    ┌──────────┐                    ┌──────────┐      ┌──────────┐
                    │ DuckDB   │                    │ Postgres │      │ Redis    │
                    │ (Parquet │                    │ (ε-ledger│      │ (cache,  │
                    │  scans)  │                    │  + audit)│      │  sims,   │
                    └──────────┘                    └──────────┘      │  ratelim)│
                                                                      └──────────┘

The single source of truth for API types is the backend's Pydantic schemas → OpenAPI → packages/shared/types.ts. CI fails on drift.


Tech stack

| Layer | Stack |
| --- | --- |
| Backend | Python 3.11, FastAPI, Pydantic v2, SQLAlchemy 2 (async), Alembic, DuckDB, NumPy, Google dp-accounting |
| Storage | PostgreSQL 15 (ledger, audit, users), Redis 7 (cache, rate limit), Parquet (datasets) |
| Frontend | React 18, TypeScript 5, Vite 6, TanStack Query 5, Zustand 5, Radix UI, visx + D3, CodeMirror 6, Tailwind |
| Testing | pytest + Hypothesis, Vitest + React Testing Library, Playwright |
| Observability | Prometheus client, Sentry SDK |
| Auth | PyJWT (HS256), PBKDF2-SHA256 password hashing |

Quick start (Docker)

Prerequisites

| Tool | Version |
| --- | --- |
| Docker + Docker Compose | ≥ 24 |
| GNU make | any recent |
| Python | ≥ 3.11 (only for tests / type-gen on the host) |
| Node.js | ≥ 20 (only for tests / type-gen on the host) |

1. Generate the synthetic demo dataset (one-time)

python data/generate_demo.py
# → data/employees_demo.parquet  (~15 000 synthetic employee rows)

2. Start everything

make dev
# or: make dev-detached  (background)

Services exposed:

| Service | URL |
| --- | --- |
| Frontend (Vite) | http://localhost:5173 |
| Backend (FastAPI) | http://localhost:8000 |
| Swagger / OpenAPI | http://localhost:8000/docs |

Wait until the backend logs "Application startup complete."

3. Seed demo users + audit history

make seed-users   # creates admin@demo.local / analyst@demo.local / auditor@demo.local
make seed         # runs 4 representative queries, warms the simulation cache

4. Sign in

Open http://localhost:5173 and sign in with any of:

| Username | Role | Password |
| --- | --- | --- |
| admin@demo.local | ADMIN | demo |
| analyst@demo.local | ANALYST | demo |
| auditor@demo.local | AUDITOR | demo |

Pick Employees Demo in the dataset dropdown and run a query.

Stop / clean up

make stop         # stop containers (data preserved)
make clean        # stop + remove volumes (Postgres data wiped)

Configuration

All settings are environment-driven. Copy .env.example to .env and edit.

| Variable | Default | Purpose |
| --- | --- | --- |
| JWT_SECRET | dev-secret-change-in-production | HS256 signing key. Must be overridden in production (openssl rand -hex 32). |
| STORAGE_BACKEND | db | db (Postgres, durable) or memory (ephemeral, tests). |
| AUTH_DEV_MODE | false | When true, identity comes from X-User-Id / X-Tenant-Id / X-Role headers (local-dev only). |
| RATE_LIMIT_ENABLED | true | Toggle the per-(tenant, user) rate limiter. |
| RATE_LIMIT_REQUESTS_PER_MINUTE | 120 | Fixed-window quota. |
| SIM_CACHE_TTL_SECONDS | 600 | TTL for cached simulation previews. |
| SENTRY_DSN | empty | Set to enable Sentry; empty disables. |
| SENTRY_ENVIRONMENT | dev | Sentry environment tag. |
| SENTRY_TRACES_SAMPLE_RATE | 0.0 | Performance sampling fraction. |
| DATABASE_URL | (compose default) | Override only if you point both host + container to the same managed Postgres. |
| REDIS_URL | (compose default) | Leave unset to disable cross-replica shared state. |
| VITE_API_BASE_URL | http://localhost:8000 | Browser-side API base. |
| VITE_WS_BASE_URL | ws://localhost:8000 | Browser-side WebSocket base. |

Production guard: when ENV=production, the backend refuses to start if either AUTH_DEV_MODE=true or JWT_SECRET is left at the shipped default.
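The production guard amounts to a small check at startup. The sketch below is one way it might be written, assuming settings are read from a plain mapping such as os.environ. The function name `check_production_guards` is hypothetical; the default secret string is the one from the configuration table.

```python
import os

# Shipped default from the configuration table; must never reach production.
DEFAULT_JWT_SECRET = "dev-secret-change-in-production"


def check_production_guards(env):
    """Raise RuntimeError if a production deployment uses unsafe settings.

    `env` is any mapping of setting names to string values.
    """
    if env.get("ENV") != "production":
        return  # guards apply only in production
    if env.get("AUTH_DEV_MODE", "false").lower() == "true":
        raise RuntimeError("AUTH_DEV_MODE must be false when ENV=production")
    if env.get("JWT_SECRET", DEFAULT_JWT_SECRET) == DEFAULT_JWT_SECRET:
        raise RuntimeError("JWT_SECRET must be overridden when ENV=production")


if __name__ == "__main__":
    # Hypothetical usage: call this before the application starts serving.
    check_production_guards(os.environ)
    print("startup guards passed")
```

Failing at startup rather than at first request means a misconfigured deployment never serves traffic with header-based identity or a publicly known signing key.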


Authentication & roles

PrivacyLens ships real password login with JWT bearer tokens — there is no public signup; accounts are seeded by an admin.

Roles

| Role | Can do |
| --- | --- |
| ADMIN | Everything ANALYST can do, plus: create users, upload + register datasets, archive datasets. |
| ANALYST | Run queries, simulate previews, view their own history, view the audit log for their tenant. |
| AUDITOR | Read-only: view datasets, audit log, attack flags. Cannot run queries. |

Programmatic login

# Log in and capture the returned access_token (requires jq)
TOKEN=$(curl -s -X POST http://localhost:8000/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"analyst@demo.local","password":"demo"}' | jq -r .access_token)

# Call an authenticated endpoint with the bearer token
curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/v1/auth/me

The frontend stores the JWT in localStorage under privacylens_token, attaches it to every API call, and clears it on logout or HTTP 401.
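For reference, PBKDF2-SHA256 password hashing at 240 000 iterations (the parameters listed under Operations & security) can be done with the Python standard library alone. This is a sketch under assumptions: the 16-byte salt, the function names, and returning the salt and digest as a tuple are illustrative choices, not the project's storage format.

```python
import hashlib
import hmac
import os

ITERATIONS = 240_000  # matches the iteration count stated in the README


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # assumption: 16-byte random salt per user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


if __name__ == "__main__":
    salt, digest = hash_password("demo")
    print(verify_password("demo", salt, digest))
    print(verify_password("wrong", salt, digest))
```

The constant-time comparison avoids leaking how many leading bytes of a guess were correct.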


Dataset registration

Datasets carry their own ε-budget, clipping bounds, and allowed aggregations. Registration is an audit-bearing admin operation and can be done two ways.

From the UI (/admin/datasets/new)

  1. Upload — drag-and-drop a .csv or .parquet file (≤ 200 MiB; override with DATASET_UPLOAD_MAX_BYTES).
  2. Schema preview — the backend infers types via DuckDB and returns observed min/max plus a 10-row sample.
  3. Configure — edit dataset id, display name, per-column types, configured bounds, allowed aggregates, ε-budget, min_group_size.
  4. Register — bounds are validated against the staged file (registration fails when any column has > 1 % of rows outside the configured bounds). On success the dataset appears immediately in the Query Builder dropdown and a dataset_registered audit event is written.
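The bounds check in step 4 can be pictured as follows. This is a simplified pure-Python sketch over in-memory lists; the real check presumably runs against the staged file via DuckDB. The function names and the dictionary shapes are hypothetical, while the 1 % threshold comes from the description above.

```python
def fraction_out_of_bounds(values, lower, upper):
    """Fraction of values that fall outside [lower, upper]."""
    if not values:
        return 0.0
    outside = sum(1 for v in values if v < lower or v > upper)
    return outside / len(values)


def validate_bounds(columns, bounds, max_fraction=0.01):
    """Return {column: fraction} for columns that violate the threshold.

    `columns` maps column name -> list of numeric values.
    `bounds` maps column name -> (lower, upper) configured clipping bounds.
    An empty result means registration may proceed.
    """
    failures = {}
    for name, (lower, upper) in bounds.items():
        fraction = fraction_out_of_bounds(columns[name], lower, upper)
        if fraction > max_fraction:  # strictly more than 1 % of rows
            failures[name] = fraction
    return failures


if __name__ == "__main__":
    salaries = [50_000] * 980 + [900_000] * 20  # 2 % above the upper bound
    print(validate_bounds({"salary": salaries}, {"salary": (20_000, 250_000)}))
```

Rejecting loosely fitting bounds at registration matters because clipping bounds determine sensitivity: bounds that clip many rows bias every later SUM or MEAN on that column.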

From the CLI

cd packages/backend
python -m scripts.register_dataset \
  --config datasets/employees_demo.yaml \
  [--force] [--archive] [--dry-run] [--skip-bounds-check]

Example config:

dataset_id:        employees_demo
display_name:      "Employees Demo"
parquet_path:      data/employees_demo.parquet
table_name:        employees_demo
epsilon_budget:    1.0
min_group_size:    50
allowed_aggregations: [count, sum, mean, percentile, histogram]
columns:
  - { name: salary,     type: numeric,     clipping_bounds: [20000, 250000] }
  - { name: department, type: categorical }
  - { name: hire_date,  type: date }
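
Each registration path writes one of the following audit events:

| Audit event | When |
| --- | --- |
| dataset_registered | New dataset_id (or re-registering an archived one) |
| dataset_replaced | --force after a column-set diff |
| dataset_archived | --archive flag |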

See docs/DATASETS.md for the full lifecycle.


Running tests

make install            # one-time: pip install + npm install
make install-e2e        # one-time: Playwright browsers

make test               # full suite (backend + frontend unit)
make test-backend       # pytest only
make test-frontend      # Vitest only
make test-e2e           # Playwright (requires `make dev-detached && make seed`)
make test-e2e-ui        # interactive Playwright UI

What the suites cover:

  • Backend — DP mechanism math, sensitivity bounds, budget ledger accounting and exhaustion, suppression contract, audit store, attack heuristics, WebSocket hub, full API integration including 402-on-exhaustion.
  • Frontend — Zustand stores, WebSocket hub dispatch, React Query hooks, posterior math, render-level component tests with React Testing Library.
  • E2E (Playwright) — login flow, role guards, end-to-end query → chart → audit, attack-detector alert path, dataset wizard.

Project layout

privacylens/
├── docker-compose.yml         # postgres, redis, backend, frontend
├── Makefile                   # dev / test / type-gen entrypoints
├── .env.example               # documented configuration
├── data/
│   ├── generate_demo.py       # synthetic dataset generator
│   └── uploads/               # admin-uploaded staging files (gitignored)
├── docs/
│   ├── DATASETS.md            # dataset-registration lifecycle
│   ├── RBAC.md                # roles and route guards
│   └── BACKUP.md              # Postgres backup / restore
├── imgs/                      # README screenshots
├── packages/
│   ├── backend/               # FastAPI app, DP engine, migrations
│   │   ├── app/{api,core,models,ws}
│   │   ├── tests/
│   │   ├── scripts/           # seed.py, register_dataset.py, seed_users.py
│   │   └── migrations/        # Alembic
│   ├── frontend/              # React SPA
│   │   ├── src/{components,hooks,pages,stores,lib}
│   │   ├── tests/             # Playwright E2E
│   │   ├── vite.config.ts
│   │   └── playwright.config.ts
│   └── shared/                # Auto-generated TypeScript types
│       └── types.ts           # ← do not edit; run `make generate-types`
└── scripts/
    ├── add-named-exports.js
    ├── backup-postgres.sh
    └── restore-postgres.sh

Development

Install host-side dev dependencies (for tests + type-gen)

make install
# pip install -e ".[dev]" + npm install

Run the backend in isolation

cd packages/backend
uvicorn app.main:app --reload

The host-native run uses the localhost defaults from app/config.py — Postgres / Redis must be running locally (or override DATABASE_URL / REDIS_URL).

Run the frontend dev server

cd packages/frontend
npm run dev      # Vite at http://localhost:5173
npm run typecheck

Database migrations

make migrate                                 # apply all
cd packages/backend && alembic revision --autogenerate -m "msg"   # author a new one

Backups

./scripts/backup-postgres.sh
./scripts/restore-postgres.sh

See docs/BACKUP.md.


Type-generation pipeline

The single source of truth for shared types is the backend's Pydantic schemas. The flow:

schemas.py  ──▶  openapi.yaml  ──▶  packages/shared/types.ts
                (FastAPI)            (openapi-typescript)

Regenerate after any schema change:

make generate-types

CI runs make check-drift and fails the build if packages/shared/types.ts is out of sync with the backend.


API reference

The full live OpenAPI spec is at http://localhost:8000/docs. Highlights:

| Endpoint | Method | What |
| --- | --- | --- |
| /v1/auth/login | POST | Username + password → JWT access token |
| /v1/auth/me | GET | Current user identity |
| /v1/datasets | GET | Active datasets visible to the caller |
| /v1/datasets/{id}/schema | GET | Columns, bounds, allowed aggregations |
| /v1/queries/preview-cohort | POST | Free cohort-size preview (used by Filter Panel) |
| /v1/queries/simulate | POST | Free CI preview at a chosen ε (no budget charged) |
| /v1/queries/execute | POST | Run query, deduct ε, return noisy result + audit id |
| /v1/budget/{dataset_id} | GET | Current per-dataset ε ledger snapshot |
| /v1/audit | GET | Paginated audit log (filterable by attack flags) |
| /v1/admin/users | GET/POST | Admin: list / create users |
| /v1/admin/datasets/uploads | POST | Admin: stream-upload a dataset file |
| /v1/admin/datasets | POST | Admin: register a staged dataset |
| /ws | WS | Pub/sub for budget and alert topics |
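The deduct-or-reject behaviour behind /v1/queries/execute (atomic ε deduction, 402 once the budget is exhausted) can be sketched with an in-memory stand-in. The real ledger is persisted in PostgreSQL; the class `EpsilonLedger`, the exception `BudgetExhausted`, and the use of a thread lock in place of a database transaction are all assumptions made for illustration.

```python
import threading


class BudgetExhausted(Exception):
    """Raised when a query would overspend; the API would map this to HTTP 402."""


class EpsilonLedger:
    """In-memory per-dataset ε ledger with atomic check-and-deduct."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0
        self._lock = threading.Lock()  # stands in for a DB transaction

    def deduct(self, cost):
        """Deduct `cost` and return the remaining budget, or raise."""
        if cost <= 0:
            raise ValueError("epsilon cost must be positive")
        with self._lock:
            # Small tolerance so float rounding does not reject an exact spend.
            if self.spent + cost > self.total + 1e-12:
                raise BudgetExhausted(
                    f"requested {cost}, remaining {self.total - self.spent}"
                )
            self.spent += cost
            return self.total - self.spent


if __name__ == "__main__":
    ledger = EpsilonLedger(total_epsilon=1.0)
    print(ledger.deduct(0.4))
    print(ledger.deduct(0.6))
    try:
        ledger.deduct(0.1)
    except BudgetExhausted as exc:
        print("would return 402:", exc)
```

Performing the check and the update under one lock (or one transaction) is what prevents two concurrent queries from both passing the check and jointly overspending the budget.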

Limitations

PrivacyLens is a research / thesis-scale system. Known constraints:

  • Single-tenant per workspace. Multi-tenancy is wired through the data model but not exercised at scale.
  • Laplace / Gaussian only. No advanced composition (RDP, zCDP) wiring beyond dp-accounting plumbing.
  • DuckDB single-node. Datasets must fit on one host; there is no distributed execution layer.
  • No external joins. Queries are restricted to a single registered dataset at a time.
  • Demo password (demo) is intentional for the seeded users — change it before any internet-facing deploy.

License & acknowledgements

This project is released under the MIT License (see LICENSE; if you fork it, add a license file before publishing).

PrivacyLens was built as a master's thesis project and stands on the shoulders of:

  • Differential privacy — Dwork & Roth, The Algorithmic Foundations of Differential Privacy (2014); Google's dp-accounting; Microsoft's SmartNoise.
  • Uncertainty visualisation — Hullman et al., Hypothetical Outcome Plots (CHI 2015); Kale et al., Hypothetical Outcome Plots Help Untrained Observers Judge Trends in Ambiguous Data (TVCG 2019).
  • Open-source toolchain — FastAPI, DuckDB, React, Vite, TanStack Query, Zustand, visx, Radix UI, CodeMirror, Playwright.

Issues, PRs, and academic-feedback discussions are welcome.
