Open-source ML competition judging platform
Securely evaluate untrusted machine learning models in isolated Docker containers,
score them on accuracy, size, and latency, and rank teams on a live anonymous leaderboard.
Evalda was built for DataQuest 2026 — the annual data science competition run by the IEEE INSAT SB CS Chapter and ACM INSAT as part of the DataOverflow event.
| Metric | Value |
|---|---|
| Duration | 7 continuous hours |
| Participants | 164 across 41 teams |
| Total submissions | 1,452 (435 accepted, 1,017 rejected) |
| Peak throughput | 400 submissions in one hour |
| Total downtime | < 1 minute (two brief maintenance windows) |
Teams submit a `.zip` containing a Python solution, a trained model, and a `requirements.txt`. The system:

- Validates the zip in a security sandbox (path traversal, zip bombs, symlinks, extension allowlist)
- Extracts and sanitizes `requirements.txt` (blocks malicious pip options)
- Installs dependencies in an isolated container with outbound-only internet
- Runs the model in a fully sandboxed container (no network, no labels, resource-capped)
- Scores the predictions in a trusted process outside all containers
- Streams real-time progress to the participant via WebSocket
- Updates the anonymous leaderboard with blind-hour support
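The sandbox checks above can be sketched in pure Python. This is an illustrative approximation, not Evalda's actual `verify.py`; the allowlist and size threshold are assumptions.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Illustrative allowlist and zip-bomb threshold -- assumptions, not Evalda's real values.
ALLOWED_EXTENSIONS = {".py", ".txt", ".pkl", ".joblib", ".json", ".onnx"}
MAX_UNCOMPRESSED = 500 * 1024 * 1024

def validate_zip(data: bytes) -> list[str]:
    """Return a list of human-readable rejection reasons (empty list == accepted)."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = 0
        for info in zf.infolist():
            name = PurePosixPath(info.filename)
            # Path traversal: absolute paths or ".." components escape the extraction dir
            if name.is_absolute() or ".." in name.parts:
                problems.append(f"path traversal: {info.filename}")
            # Symlinks: the upper 16 bits of external_attr hold the Unix file mode
            if (info.external_attr >> 16) & 0o170000 == 0o120000:
                problems.append(f"symlink: {info.filename}")
            # Extension allowlist (directories are exempt)
            if not info.is_dir() and name.suffix.lower() not in ALLOWED_EXTENSIONS:
                problems.append(f"disallowed extension: {info.filename}")
            total += info.file_size
        # Zip bomb heuristic: reject archives whose declared uncompressed size is huge
        if total > MAX_UNCOMPRESSED:
            problems.append("zip bomb: uncompressed size too large")
    return problems
```

A traversal entry like `../evil.sh` would be flagged twice here, once for the `..` component and once for the disallowed `.sh` extension.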
Every container runs with all capabilities dropped, no privilege escalation, memory and CPU limits, and tmpfs with noexec. Ground-truth labels never enter any container. See ARCHITECTURE.md for the full rationale.
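Expressed as keyword arguments for docker-py's `containers.run()`, that hardening looks roughly like the sketch below. The specific limits (CPU count, pids cap, tmpfs size) are assumptions; Evalda reads its real values from environment variables.

```python
# Sketch of per-container hardening as docker-py run() kwargs.
# Limits are illustrative assumptions, not Evalda's configured values.
def hardened_opts(mem_limit: str = "1g", allow_network: bool = False) -> dict:
    return {
        "network_disabled": not allow_network,   # Run phase gets no network at all
        "cap_drop": ["ALL"],                     # drop every Linux capability
        "security_opt": ["no-new-privileges"],   # block setuid privilege escalation
        "mem_limit": mem_limit,                  # hard memory cap
        "nano_cpus": 1_000_000_000,              # 1 CPU worth of quota
        "pids_limit": 256,                       # contain fork bombs
        "read_only": True,                       # read-only root filesystem
        "tmpfs": {"/tmp": "rw,noexec,nosuid,size=64m"},  # writable but non-executable scratch
    }

# client.containers.run(image, command, **hardened_opts())  # requires a Docker client
```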
```mermaid
flowchart LR
%% Defining aesthetic styles for the docs
classDef proxy fill:#2C3E50,stroke:#fff,stroke-width:2px,color:#fff;
classDef api fill:#059669,stroke:#fff,stroke-width:2px,color:#fff;
classDef worker fill:#D97706,stroke:#fff,stroke-width:2px,color:#fff;
classDef docker fill:#2496ED,stroke:#fff,stroke-width:2px,color:#fff;
classDef db fill:#3ECF8E,stroke:#fff,stroke-width:2px,color:#111;
classDef cache fill:#DC382D,stroke:#fff,stroke-width:2px,color:#fff;
%% Core Pipeline
N[Nginx]:::proxy -->|HTTP / REST| F(FastAPI<br>4 Workers):::api
F -->|Enqueues Tasks| C{Celery<br>2 Workers}:::worker
C -->|Spins up / Executes| D[[Docker Containers]]:::docker
D -->|Persists Data| S[(Supabase)]:::db
%% Redis Sub-system
R[(Redis<br>Broker, Streams,<br>Rate Limits, Cache)]:::cache
%% Connections to Redis
F <-->|Rate limits, Cache,<br>Publishes to Stream| R
C <-->|Consumes from Stream,<br>Updates State| R
%% Creating a bounding box for the Backend Logic to group them visually
subgraph Core Backend
F
C
R
end
```
| Component | Technology | Role |
|---|---|---|
| Frontend | Next.js 16, shadcn/ui, TanStack Query | Submission UI, team dashboard, leaderboard, admin panel |
| Backend API | FastAPI, 4 uvicorn workers | Auth, rate limiting, submission intake, WebSocket streaming |
| Task Queue | Celery, 2 workers | Judging pipeline orchestration |
| Containers | Docker (socket-mounted, sibling containers) | Sandboxed code execution across 4 phases |
| Database | Supabase Postgres + RLS | Profiles, teams, submissions, whitelist |
| Auth | Supabase Auth (JWT) | Whitelist-gated registration, token verification |
| Storage | Supabase Storage | Submission zip files (private, 50MB, zip-only) |
| Cache / Broker | Redis 7.2 | Task broker, verdict streams, rate limits, leaderboard cache |
| Reverse Proxy | Nginx | SSL termination, request filtering |
| Deployment | Azure VM + Vercel + Supabase Cloud | Single-VM backend, frontend, managed DB |
| Document | Description |
|---|---|
| ARCHITECTURE.md | System design, trust boundaries, security model, technology justifications, and lessons learned |
| WORKFLOW.md | Step-by-step submission lifecycle from upload to leaderboard, with sequence diagram |
- Docker and Docker Compose
- Node.js 18+
- A Supabase project (free tier works)
```bash
cd backend
cp .env.example .env
# Fill in your Supabase credentials, Redis password, and admin account
docker compose up --build
```

This starts Nginx, Redis, the FastAPI backend, and the Celery worker. On first run, the system automatically:
- Downloads validation data (features + labels) from configured URLs
- Seeds teams and whitelist entries from `data/teams.json`
- Creates the admin account
- Builds the sandbox and judge Docker images
```bash
cd frontend
npm install
cp .env.example .env.local
# Set NEXT_PUBLIC_SUPABASE_URL, NEXT_PUBLIC_SUPABASE_ANON_KEY, BACKEND_URL
npm run dev
```

Apply the Supabase migrations in order:

```bash
supabase db push
```

Or apply them manually from `supabase/migrations/` in your Supabase dashboard. The migrations create:
- `profiles` and `teams` tables with RLS policies
- `team_whitelist` for registration gating
- `submissions` table and storage bucket
- RPCs for atomic score updates (locked to `service_role`)
- Triggers for role protection and auto-profile creation
Create a `data/teams.json` (see `data/teams.example.json`) with your teams:

```json
[
  {
    "name": "Team Alpha",
    "leader_email": "leader@example.com",
    "members": ["member1@example.com", "member2@example.com"]
  }
]
```

Only whitelisted emails can register. On signup, users are automatically linked to their team.
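A hypothetical helper shows how such a file maps onto the registration whitelist: every leader and member email becomes one entry linked to its team. The function name is illustrative, not the seeder's actual API.

```python
import json

def build_whitelist(teams_json: str) -> dict[str, str]:
    """Map each whitelisted email (lowercased) to its team name.

    Illustrative sketch of the seeding step, not Evalda's seeder itself.
    """
    whitelist = {}
    for team in json.loads(teams_json):
        for email in [team["leader_email"], *team["members"]]:
            whitelist[email.lower()] = team["name"]
    return whitelist
```

At registration time, a single dictionary lookup then decides whether an email is allowed to sign up and which team it joins.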
Key environment variables (see `backend/.env.example` for the full list):

| Variable | Description |
|---|---|
| `SUPABASE_URL` | Your Supabase project URL |
| `SUPABASE_KEY` | Supabase `service_role` key (not the anon key) |
| `SUPABASE_JWT_SECRET` | JWT secret for HS256 token verification |
| `REDIS_PASSWORD` | Redis authentication password |
| `ADMIN_EMAIL` / `ADMIN_PASSWORD` | Admin account credentials (seeded on startup) |
| `COMPETITION_START` / `COMPETITION_END` | ISO timestamps for the competition window |
| `BLIND_DURATION_HOURS` | Hours before the end when the leaderboard freezes (default: 1) |
| `MAX_SUBMISSIONS_PER_TEAM` | Accepted submission cap per team (default: 20) |
| `JUDGE_MEM_LIMIT` | Memory limit for builder/runner containers (default: 1g) |
| `JUDGE_TIMEOUT_SECONDS` | Container execution timeout (default: 120) |
| `FEATURES_URL` / `LABELS_URL` | URLs to download validation data on startup |
| `TEAMS_URL` | URL to download team data JSON on startup |
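The interaction between `COMPETITION_END` and `BLIND_DURATION_HOURS` can be sketched as a simple window check: the public leaderboard freezes once "now" enters the final blind window, while scores keep being recorded. The function name is illustrative, not Evalda's actual API.

```python
from datetime import datetime, timedelta, timezone

def leaderboard_frozen(now: datetime, competition_end: datetime,
                       blind_hours: float = 1.0) -> bool:
    """True while the public leaderboard should be frozen (the blind window)."""
    blind_start = competition_end - timedelta(hours=blind_hours)
    return blind_start <= now < competition_end

# With a 1-hour blind window ending at 18:00 UTC, 17:30 falls inside it:
end = datetime(2026, 2, 1, 18, 0, tzinfo=timezone.utc)
leaderboard_frozen(datetime(2026, 2, 1, 17, 30, tzinfo=timezone.utc), end)  # True
```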
```
Evalda-DQ/
├── backend/
│   ├── main.py                  # FastAPI app, lifespan, startup
│   ├── app/
│   │   ├── src/
│   │   │   ├── routers/         # Thin HTTP/WS endpoints
│   │   │   ├── services/        # Business logic layer
│   │   │   │   ├── submissions_service.py
│   │   │   │   ├── security_service.py
│   │   │   │   ├── judge_service.py        # 4-phase container pipeline
│   │   │   │   ├── docker_service.py       # Container lifecycle management
│   │   │   │   ├── scorer.py               # Trusted scoring (runs in worker)
│   │   │   │   ├── stream_service.py       # WebSocket verdict streaming
│   │   │   │   ├── leaderboard_service.py
│   │   │   │   ├── worker/                 # Celery task definitions
│   │   │   │   └── scripts/                # Template scripts for containers
│   │   │   ├── auth/            # JWT verification, rate limiting, WS guards
│   │   │   ├── models/          # Pydantic models
│   │   │   ├── db/              # Supabase + Redis client factories
│   │   │   └── settings/        # Centralized configuration
│   │   └── utils/               # Logger, seeder, janitor, data downloader
│   ├── sandbox/                 # Security sandbox (verify.py + Dockerfile)
│   ├── judge/                   # Runner + judge Dockerfile
│   ├── template/                # Participant solution template + docs
│   ├── compose.yml              # Dev environment
│   └── compose.prod.yml         # Production (SSL, Let's Encrypt)
├── frontend/                    # Next.js 16 app
│   ├── app/                     # App Router pages
│   ├── components/              # UI components (shadcn/ui)
│   └── lib/                     # Supabase clients, server actions, types
├── supabase/
│   ├── migrations/              # SQL migrations (schema, RLS, RPCs)
│   └── config.toml              # Supabase local dev config
├── docs/
│   ├── ARCHITECTURE.md          # System design deep dive
│   └── WORKFLOW.md              # Submission lifecycle walkthrough
└── README.md                    # This file
```
These are covered in depth in ARCHITECTURE.md. The highlights:
- **Scoring runs outside all containers.** The original design ran scoring alongside user code, which meant a participant who discovered the communication channel could overwrite predictions with perfect answers. The scorer now runs in the trusted worker process, and labels never enter any container.
- **Four-phase pipeline with progressive network access.** Verify (no network) → Extract (no network) → Build (outbound only, `--only-binary :all:`) → Run (no network). Each phase gets exactly the permissions it needs.
- **`requirements.txt` sanitization.** A regex blocks dangerous pip options (`--extra-index-url`, `--no-build-isolation`, etc.) before dependencies are installed. `--only-binary :all:` eliminates `setup.py` as an attack surface entirely.
- **Anonymous leaderboard with blind mode.** Teams only see their own name. During the final hour, rankings freeze publicly while scores are still recorded.
- **JWKS-verified rate limiting.** The rate limiter cryptographically verifies JWTs using Supabase's public keys, preventing spoofed user IDs from bypassing per-user limits.
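The `requirements.txt` sanitizer can be sketched as a line-by-line regex scan. This is a minimal illustration in the spirit described above; the blocked-option list here is an assumption, not Evalda's exact set.

```python
import re

# Illustrative blocklist of per-line pip options -- an assumption, not
# Evalda's actual pattern.
BLOCKED_OPTIONS = re.compile(
    r"--(?:extra-)?index-url|--no-build-isolation|--find-links|"
    r"--trusted-host|(?:^|\s)-e\b|--editable",
    re.IGNORECASE,
)

def sanitize_requirements(text: str) -> str:
    """Raise ValueError on any line that smuggles in a dangerous pip option."""
    for lineno, line in enumerate(text.splitlines(), 1):
        if BLOCKED_OPTIONS.search(line):
            raise ValueError(f"blocked pip option on line {lineno}: {line.strip()}")
    return text
```

Ordinary pinned requirements (`numpy==1.26.4`) pass through untouched; a line like `--extra-index-url http://evil.example` is rejected before `pip install` ever sees it.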
Evalda was designed for ML model evaluation, but the architecture generalizes. To adapt it:

- Change the scoring logic — modify `backend/app/src/services/scorer.py` and `backend/judge/runner.py`
- Change the submission format — modify `backend/sandbox/verify.py` (extension allowlist, required files)
- Change resource limits — adjust environment variables (`JUDGE_MEM_LIMIT`, `JUDGE_TIMEOUT_SECONDS`, etc.)
- Change the dataset — point `FEATURES_URL` and `LABELS_URL` to your data
- Change team structure — edit `data/teams.json` and the seeder
The security sandbox, container pipeline, streaming infrastructure, and leaderboard work independently of the scoring domain.
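As a purely illustrative example of the kind of function you might drop into `scorer.py` when adapting the platform, here is a composite metric blending accuracy with size and latency penalties. The weights and normalization are invented for this sketch and are not the DataQuest 2026 formula.

```python
# Hypothetical composite score: higher accuracy is better, smaller and
# faster models earn bonus terms. Weights/caps are assumptions.
def composite_score(accuracy: float, model_size_mb: float, latency_s: float,
                    max_size_mb: float = 100.0, max_latency_s: float = 60.0) -> float:
    size_term = max(0.0, 1.0 - model_size_mb / max_size_mb)        # 1.0 at 0 MB, 0.0 at cap
    latency_term = max(0.0, 1.0 - latency_s / max_latency_s)       # 1.0 instant, 0.0 at cap
    return round(0.7 * accuracy + 0.2 * size_term + 0.1 * latency_term, 4)
```

Because scoring runs in the trusted worker, a replacement like this needs no container-side changes beyond whatever artifacts the runner must emit (predictions, model size, timing).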
Evalda was created for DataQuest 2026, the annual data science competition closing out the DataOverflow event — a collaboration between the IEEE INSAT SB CS Chapter and ACM INSAT.
Security audit and penetration testing by Salah Chafai, whose findings directly shaped the hardening of the RLS policies, RPC permissions, and WebSocket infrastructure.
