YassWorks/Evalda-DQ


Evalda cover

Open-source ML competition judging platform

Securely evaluate untrusted machine learning models in isolated Docker containers,
score them on accuracy, size, and latency, and rank teams on a live anonymous leaderboard.

License: MIT · Python 3.13 · FastAPI · Next.js · Supabase · Redis · Docker · Celery


Architecture · Workflow


Battle-Tested

Evalda was built for DataQuest 2026 — the annual data science competition run by the IEEE INSAT SB CS Chapter and ACM INSAT as part of the DataOverflow event.

| Metric | Value |
| --- | --- |
| Duration | 7 continuous hours |
| Participants | 164 across 41 teams |
| Total submissions | 1,452 (435 accepted, 1,017 rejected) |
| Peak throughput | 400 submissions in one hour |
| Total downtime | < 1 minute (two brief maintenance windows) |

How It Works

Teams submit a .zip containing a Python solution, a trained model, and a requirements.txt. The system:

  1. Validates the zip in a security sandbox (path traversal, zip bombs, symlinks, extension allowlist)
  2. Extracts and sanitizes requirements.txt (blocks malicious pip options)
  3. Installs dependencies in an isolated container with outbound-only internet
  4. Runs the model in a fully sandboxed container (no network, no labels, resource-capped)
  5. Scores the predictions in a trusted process outside all containers
  6. Streams real-time progress to the participant via WebSocket
  7. Updates the anonymous leaderboard with blind-hour support
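Step 1's checks can be sketched with the standard library. This is a simplified illustration, not the actual implementation in backend/sandbox/verify.py; the allowlist and size ceiling here are hypothetical:

```python
import stat
import zipfile

# Hypothetical limits and allowlist -- the real rules live in backend/sandbox/verify.py
ALLOWED_EXTENSIONS = {".py", ".txt", ".pkl", ".joblib", ".onnx", ".json"}
MAX_UNCOMPRESSED_BYTES = 500 * 1024 * 1024  # crude zip-bomb ceiling

def validate_zip(path: str) -> list[str]:
    """Return a list of violations; an empty list means the archive passed."""
    problems: list[str] = []
    total = 0
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            name = info.filename
            # Path traversal: absolute paths or parent-directory references
            if name.startswith("/") or ".." in name.split("/"):
                problems.append(f"path traversal: {name}")
            # Symlinks: the Unix mode sits in the high bits of external_attr
            if stat.S_ISLNK(info.external_attr >> 16):
                problems.append(f"symlink: {name}")
            # Extension allowlist (skip directory entries)
            if not name.endswith("/"):
                ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
                if ext not in ALLOWED_EXTENSIONS:
                    problems.append(f"disallowed extension: {name}")
            total += info.file_size
    if total > MAX_UNCOMPRESSED_BYTES:
        problems.append("uncompressed size exceeds limit (possible zip bomb)")
    return problems
```

Checking declared (uncompressed) sizes before extraction is what makes the zip-bomb defense cheap: nothing is inflated until the archive passes.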

Every container runs with all capabilities dropped, no privilege escalation, memory and CPU limits, and tmpfs with noexec. Ground-truth labels never enter any container. See ARCHITECTURE.md for the full rationale.
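Those hardening flags translate roughly into a docker run invocation like the one assembled below. This is an illustrative sketch; the real container lifecycle lives in docker_service.py and its exact flags may differ:

```python
def hardened_run_args(image: str, workdir: str, mem_limit: str = "1g") -> list[str]:
    """Assemble docker run flags for the fully sandboxed Run phase
    (illustrative; see docker_service.py for the real lifecycle code)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                 # Run phase gets no network at all
        "--cap-drop", "ALL",                 # drop every Linux capability
        "--security-opt", "no-new-privileges",
        "--memory", mem_limit,               # JUDGE_MEM_LIMIT
        "--cpus", "1",
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=256m",  # writable but non-executable scratch
        "-v", f"{workdir}:/workspace:ro",    # submission mounted read-only
        image,
    ]
```

Note what is absent as much as what is present: no label mount exists because, as stated above, ground-truth labels never enter any container.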

flowchart LR
%% Defining aesthetic styles for the docs
classDef proxy fill:#2C3E50,stroke:#fff,stroke-width:2px,color:#fff;
classDef api fill:#059669,stroke:#fff,stroke-width:2px,color:#fff;
classDef worker fill:#D97706,stroke:#fff,stroke-width:2px,color:#fff;
classDef docker fill:#2496ED,stroke:#fff,stroke-width:2px,color:#fff;
classDef db fill:#3ECF8E,stroke:#fff,stroke-width:2px,color:#111;
classDef cache fill:#DC382D,stroke:#fff,stroke-width:2px,color:#fff;

    %% Core Pipeline
    N[Nginx]:::proxy -->|HTTP / REST| F(FastAPI<br>4 Workers):::api
    F -->|Enqueues Tasks| C{Celery<br>2 Workers}:::worker
    C -->|Spins up / Executes| D[[Docker Containers]]:::docker
    D -->|Persists Data| S[(Supabase)]:::db

    %% Redis Sub-system
    R[(Redis<br>Broker, Streams,<br>Rate Limits, Cache)]:::cache

    %% Connections to Redis
    F <-->|Rate limits, Cache,<br>Publishes to Stream| R
    C <-->|Consumes from Stream,<br>Updates State| R

    %% Creating a bounding box for the Backend Logic to group them visually
    subgraph Core Backend
        F
        C
        R
    end

Tech Stack

| Component | Technology | Role |
| --- | --- | --- |
| Frontend | Next.js 16, shadcn/ui, TanStack Query | Submission UI, team dashboard, leaderboard, admin panel |
| Backend API | FastAPI, 4 uvicorn workers | Auth, rate limiting, submission intake, WebSocket streaming |
| Task Queue | Celery, 2 workers | Judging pipeline orchestration |
| Containers | Docker (socket-mounted, sibling containers) | Sandboxed code execution across 4 phases |
| Database | Supabase Postgres + RLS | Profiles, teams, submissions, whitelist |
| Auth | Supabase Auth (JWT) | Whitelist-gated registration, token verification |
| Storage | Supabase Storage | Submission zip files (private, 50MB, zip-only) |
| Cache / Broker | Redis 7.2 | Task broker, verdict streams, rate limits, leaderboard cache |
| Reverse Proxy | Nginx | SSL termination, request filtering |
| Deployment | Azure VM + Vercel + Supabase Cloud | Single-VM backend, frontend, managed DB |

Documentation

| Document | Description |
| --- | --- |
| ARCHITECTURE.md | System design, trust boundaries, security model, technology justifications, and lessons learned |
| WORKFLOW.md | Step-by-step submission lifecycle from upload to leaderboard, with sequence diagram |

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Node.js 18+
  • A Supabase project (free tier works)

Backend

cd backend
cp .env.example .env
# Fill in your Supabase credentials, Redis password, and admin account

docker compose up --build

This starts Nginx, Redis, the FastAPI backend, and the Celery worker. On first run, the system automatically:

  • Downloads validation data (features + labels) from configured URLs
  • Seeds teams and whitelist entries from data/teams.json
  • Creates the admin account
  • Builds the sandbox and judge Docker images

Frontend

cd frontend
npm install
cp .env.example .env.local
# Set NEXT_PUBLIC_SUPABASE_URL, NEXT_PUBLIC_SUPABASE_ANON_KEY, BACKEND_URL

npm run dev

Database

Apply the Supabase migrations in order:

supabase db push

Or apply them manually from supabase/migrations/ in your Supabase dashboard. The migrations create:

  • profiles and teams tables with RLS policies
  • team_whitelist for registration gating
  • submissions table and storage bucket
  • RPCs for atomic score updates (locked to service_role)
  • Triggers for role protection and auto-profile creation

Team Data

Create a data/teams.json (see data/teams.example.json) with your teams:

[
    {
        "name": "Team Alpha",
        "leader_email": "leader@example.com",
        "members": ["member1@example.com", "member2@example.com"]
    }
]

Only whitelisted emails can register. On signup, users are automatically linked to their team.
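The whitelist gating described above boils down to a mapping from email to team. A sketch with a hypothetical helper name follows; the real seeder additionally writes these rows to the Supabase team_whitelist table:

```python
import json

def whitelist_from_teams(teams_json: str) -> dict[str, str]:
    """Map every whitelisted email (leader + members) to its team name."""
    mapping: dict[str, str] = {}
    for team in json.loads(teams_json):
        # Both the leader and all members may register
        for email in (team["leader_email"], *team["members"]):
            mapping[email.lower()] = team["name"]
    return mapping
```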


Configuration

Key environment variables (see backend/.env.example for the full list):

| Variable | Description |
| --- | --- |
| SUPABASE_URL | Your Supabase project URL |
| SUPABASE_KEY | Supabase service_role key (not the anon key) |
| SUPABASE_JWT_SECRET | JWT secret for HS256 token verification |
| REDIS_PASSWORD | Redis authentication password |
| ADMIN_EMAIL / ADMIN_PASSWORD | Admin account credentials (seeded on startup) |
| COMPETITION_START / COMPETITION_END | ISO timestamps for the competition window |
| BLIND_DURATION_HOURS | Hours before the end when the leaderboard freezes (default: 1) |
| MAX_SUBMISSIONS_PER_TEAM | Accepted-submission cap per team (default: 20) |
| JUDGE_MEM_LIMIT | Memory limit for builder/runner containers (default: 1g) |
| JUDGE_TIMEOUT_SECONDS | Container execution timeout (default: 120) |
| FEATURES_URL / LABELS_URL | URLs to download validation data on startup |
| TEAMS_URL | URL to download team data JSON on startup |

Project Structure

Evalda-DQ/
├── backend/
│   ├── main.py                              # FastAPI app, lifespan, startup
│   ├── app/
│   │   ├── src/
│   │   │   ├── routers/                     # Thin HTTP/WS endpoints
│   │   │   ├── services/                    # Business logic layer
│   │   │   │   ├── submissions_service.py
│   │   │   │   ├── security_service.py
│   │   │   │   ├── judge_service.py         # 4-phase container pipeline
│   │   │   │   ├── docker_service.py        # Container lifecycle management
│   │   │   │   ├── scorer.py                # Trusted scoring (runs in worker)
│   │   │   │   ├── stream_service.py        # WebSocket verdict streaming
│   │   │   │   ├── leaderboard_service.py
│   │   │   │   ├── worker/                  # Celery task definitions
│   │   │   │   └── scripts/                 # Template scripts for containers
│   │   │   ├── auth/                        # JWT verification, rate limiting, WS guards
│   │   │   ├── models/                      # Pydantic models
│   │   │   ├── db/                          # Supabase + Redis client factories
│   │   │   └── settings/                    # Centralized configuration
│   │   └── utils/                           # Logger, seeder, janitor, data downloader
│   ├── sandbox/                             # Security sandbox (verify.py + Dockerfile)
│   ├── judge/                               # Runner + judge Dockerfile
│   ├── template/                            # Participant solution template + docs
│   ├── compose.yml                          # Dev environment
│   └── compose.prod.yml                     # Production (SSL, Let's Encrypt)
├── frontend/                                # Next.js 16 app
│   ├── app/                                 # App Router pages
│   ├── components/                          # UI components (shadcn/ui)
│   └── lib/                                 # Supabase clients, server actions, types
├── supabase/
│   ├── migrations/                          # SQL migrations (schema, RLS, RPCs)
│   └── config.toml                          # Supabase local dev config
├── docs/
│   ├── ARCHITECTURE.md                      # System design deep dive
│   └── WORKFLOW.md                          # Submission lifecycle walkthrough
└── README.md                                # This file

Key Design Decisions

These are covered in depth in ARCHITECTURE.md. The highlights:

  • Scoring runs outside all containers. The original design ran scoring alongside user code, which would have let participants discover the communication channel and overwrite their predictions with perfect answers. The scorer now runs in the trusted worker process, and labels never enter any container.

  • Four-phase pipeline with progressive network access. Verify (no network) → Extract (no network) → Build (outbound only, --only-binary :all:) → Run (no network). Each phase gets exactly the permissions it needs.

  • requirements.txt sanitization. A regex blocks dangerous pip options (--extra-index-url, --no-build-isolation, etc.) before dependencies are installed. --only-binary :all: eliminates setup.py as an attack surface entirely.

  • Anonymous leaderboard with blind mode. Teams only see their own name. During the final hour, rankings freeze publicly while scores are still recorded.

  • JWKS-verified rate limiting. The rate limiter cryptographically verifies JWTs using Supabase's public keys, preventing spoofed user IDs from bypassing per-user limits.
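The requirements.txt sanitization above can be pictured like this. A simplified sketch: the blocked-option pattern here is hypothetical and narrower than the production one:

```python
import re

# Hypothetical blocklist -- the production pattern covers more pip options
BLOCKED_PIP_OPTIONS = re.compile(
    r"--index-url|--extra-index-url|--find-links|--no-build-isolation|--trusted-host"
)

def sanitize_requirements(text: str) -> list[str]:
    """Return plain requirement lines, rejecting any line that smuggles
    a dangerous pip option."""
    clean: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # drop blanks and comments
        if BLOCKED_PIP_OPTIONS.search(line):
            raise ValueError(f"blocked pip option in line: {line!r}")
        clean.append(line)
    return clean
```

Because the Build phase also passes --only-binary :all:, even a requirement that slips past the pattern cannot run a setup.py during installation.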


Adapting for Your Competition

Evalda was designed for ML model evaluation, but the architecture generalizes. To adapt it:

  1. Change the scoring logic — modify backend/app/src/services/scorer.py and backend/judge/runner.py
  2. Change the submission format — modify backend/sandbox/verify.py (extension allowlist, required files)
  3. Change resource limits — adjust environment variables (JUDGE_MEM_LIMIT, JUDGE_TIMEOUT_SECONDS, etc.)
  4. Change the dataset — point FEATURES_URL and LABELS_URL to your data
  5. Change team structure — edit data/teams.json and the seeder

The security sandbox, container pipeline, streaming infrastructure, and leaderboard work independently of the scoring domain.
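For step 1, swapping the scoring logic amounts to replacing functions shaped roughly like these. This is a hedged sketch: the 0.8/0.1/0.1 weighting and the exact interface of scorer.py are assumptions, not the shipped formula:

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions matching the held-out labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("prediction count mismatch")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def composite_score(acc: float, model_mb: float, latency_s: float) -> float:
    """Blend accuracy, model size, and latency into one number.
    The weights and scaling constants are purely illustrative."""
    size_term = 1.0 / (1.0 + model_mb / 100.0)      # smaller models score higher
    latency_term = 1.0 / (1.0 + latency_s / 10.0)   # faster models score higher
    return 0.8 * acc + 0.1 * size_term + 0.1 * latency_term
```

Any domain that produces a comparable scalar per submission (test pass rate, BLEU, profit in a trading sim) can slot into the same pipeline.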


Acknowledgments

Evalda was created for DataQuest 2026, the annual data science competition closing out the DataOverflow event — a collaboration between the IEEE INSAT SB CS Chapter and ACM INSAT.

Security audit and penetration testing by Salah Chafai, whose findings directly shaped the hardening of the RLS policies, RPC permissions, and WebSocket infrastructure.


License

MIT
