Predictive Runtime and Inference Serving Module
PRISM is a local-first model serving platform that lets you upload ML model artifacts, build isolated Docker runtimes automatically, and serve predictions through a unified FastAPI interface and web dashboard.
Most model serving tools are optimized for production clusters and cloud infrastructure. PRISM is designed for fast iteration and lightweight sharing from a single machine.
With PRISM you can:
- Upload `.pkl`, `.pickle`, `.joblib`, and `.onnx` models
- Build model-specific Docker images automatically
- Launch and manage containerized inference runtimes
- Call predictions via stable API routes
- Use a browser dashboard for deployment, logs, prediction testing, and lifecycle actions
- Optionally expose model endpoints via reverse tunnel
Key features:

- Containerized model isolation: Each deployed model runs in its own container
- Unified inference API: Predict through `POST /models/{model_id}/predict`
- Model registry: Tracks container IDs, ports, metadata, and optional tunnel URL
- Access control: Optional API key auth + in-memory rate limiting for public inference route
- Health monitoring: Background monitor can restart stopped containers and prune stale entries
- Developer CLI: `prism` command for running/stopping the server, linting, and formatting
Architecture overview:

```
Client (UI / API)
        │
        ▼
PRISM FastAPI App
        │
        ├── /models/* endpoints (upload, run, list, delete, predict proxy)
        ├── /registry/* endpoints
        ├── /health/monitor endpoint
        └── Dashboard + HTMX UI routes
        │
        ▼
Docker Model Containers
   (one per model)
```
Tech stack:

- Python 3.12+
- FastAPI + Uvicorn
- Docker (for model runtime isolation)
- ONNX Runtime and scikit-learn adapters
- Poetry for dependency management
- Pytest, Ruff, Black for quality checks
Before running PRISM, ensure you have:
- Python `>=3.12`
- Poetry installed
- Docker installed and daemon running (a quick check is sketched after this list)
- (Optional) ngrok authtoken for public tunnels
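If you want to verify the Docker requirement from Python, here is a small sanity check. It assumes the Docker SDK for Python (`pip install docker`), which is not a PRISM dependency:

```python
import docker  # Docker SDK for Python; install separately with `pip install docker`

# Connect using the standard environment configuration (DOCKER_HOST etc.).
# Raises docker.errors.DockerException if the daemon is unreachable.
client = docker.from_env()

# Ping the daemon; raises on failure, so reaching the print means Docker is up.
client.ping()
print("Docker daemon is reachable")
```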
Install dependencies and activate the virtual environment:

```bash
poetry install
poetry shell
```

Or run commands without activating the shell:

```bash
poetry run <command>
```

Start the server in foreground mode:

```bash
poetry run uvicorn app.main:app --reload --host 127.0.0.1 --port 8000
```

Or in detached mode via the CLI:

```bash
poetry run prism run 8000 --reload
```

Stop the detached server:

```bash
poetry run prism stop
```

Tip: before uploading a model, run

```bash
pkill -f "app.core.tunnel_worker" || true
rm -rf /tmp/prism/tunnels
```

to ensure no stale tunnel worker is running in the background and interfering with PRISM's tunneling.
Visit:
http://127.0.0.1:8000/
To deploy a model, use the UI route:
http://127.0.0.1:8000/upload-model
Or the API:

```bash
curl -X POST http://127.0.0.1:8000/models/upload-and-run \
  -F "file=@model_store/linear_regression.pkl" \
  -F "model_name=Linear Regression" \
  -F "model_description=Demo model" \
  -F 'expected_input_json={"feature1": 0.1}'
```

Then call the deployed model:

```bash
curl -X POST http://127.0.0.1:8000/models/<model_id>/predict \
  -H "Content-Type: application/json" \
  -d '{"feature1": 0.1}'
```
The API at a glance:

- `GET /` → basic service identity
- `GET /health/monitor` → background monitor status and last cycle metrics

Model lifecycle:

- `POST /models/upload` → upload + build image only
- `POST /models/upload-and-run` → upload + build + run container + register model
- `POST /models` → alias of the upload-and-run flow
- `GET /models` → list deployed models
- `GET /models/{model_id}` → model metadata
- `DELETE /models/{model_id}` → stop/remove container + delete registry record
- `POST /models/{model_id}/predict` → proxy inference request to container
Registry:

- `GET /registry` → full registry payload
- `GET /registry/{model_id}` → one model registry entry
- `POST /registry/prune-stale` → remove stale records
UI routes:

- `GET /` → dashboard
- `GET /upload-model` → upload page
- `GET /model-logs` → logs page
- `GET /predict?model_id=...` → prediction UI
- `POST /api/upload-and-run-ui` → UI upload+deploy action
- `POST /predict-result` → UI prediction action
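For scripting, the lifecycle endpoints can be driven the same way; a small illustrative sketch with `requests` (the model ID is a placeholder):

```python
import requests

BASE_URL = "http://127.0.0.1:8000"

# List currently deployed models.
models = requests.get(f"{BASE_URL}/models", timeout=10).json()
print(models)

# Remove a deployment: stops the container and deletes its registry record.
# model_id is a placeholder for an ID returned by the list call above.
model_id = "<model_id>"
resp = requests.delete(f"{BASE_URL}/models/{model_id}", timeout=30)
print(resp.status_code)
```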
When you upload a model through upload-and-run:

1. The file is saved under `model_store/uploads/<model_id>/`
2. Runtime files (`runtime.py`, `requirements.txt`, `entrypoint.sh`) are copied into the build context
3. PRISM generates a model-specific `Dockerfile`
4. PRISM builds the image `prism_model_<model_id>`
5. PRISM starts the container on an allocated localhost port
6. PRISM writes metadata to the registry (`app/registry/containers.json` by default; see the inspection sketch below)
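Because the registry is a plain JSON file, you can inspect it directly. A minimal sketch; the entry layout and field names below are assumptions, so check the actual file for the exact schema:

```python
import json
from pathlib import Path

# Default location; override with MODEL_CONTAINER_REGISTRY_PATH.
registry_path = Path("app/registry/containers.json")
registry = json.loads(registry_path.read_text())

# The registry tracks container IDs, ports, metadata, and optional tunnel
# URL; the field names used here are illustrative assumptions.
entries = registry.values() if isinstance(registry, dict) else registry
for entry in entries:
    print(entry.get("container_id"), entry.get("port"), entry.get("tunnel_url"))
```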
Environment variables commonly used in PRISM:
| Variable | Default | Purpose |
|---|---|---|
| `MODEL_UPLOAD_ROOT` | `model_store/uploads` | Upload/build context root |
| `MODEL_CONTAINER_REGISTRY_PATH` | `app/registry/containers.json` | Registry file location |
| `PRISM_SINGLE_ACTIVE_MODEL` | `true` | If true, old deployments are removed when deploying a new one |
| `PRISM_ENABLE_HEALTH_MONITOR` | `true` | Enables background health monitor |
| `PRISM_HEALTH_MONITOR_INTERVAL_SECONDS` | `10` | Monitor cycle interval (seconds) |
| `ENABLE_TUNNEL` | `false` | Enables tunnel creation in the `/models/upload-and-run` flow |
| `NGROK_AUTHTOKEN` | unset | Required for the ngrok tunnel worker |
| `PRISM_API_KEYS` | unset | Comma-separated API keys for protected inference |
| `PRISM_RATE_LIMIT_REQUESTS` | `120` | Requests per rate-limit window |
| `PRISM_RATE_LIMIT_WINDOW_SECONDS` | `60` | Rate-limit window duration (seconds) |
| `PRISM_BATCH_WINDOW_MS` | `50` | Request batching window for `/models/{model_id}/predict` (milliseconds) |
| `PRISM_TUNNEL_START_TIMEOUT` | `30` | Timeout for tunnel worker startup |
Example:

```bash
export PRISM_API_KEYS="key-one,key-two"
export PRISM_RATE_LIMIT_REQUESTS="120"
export PRISM_RATE_LIMIT_WINDOW_SECONDS="60"
export PRISM_BATCH_WINDOW_MS="50"
export NGROK_AUTHTOKEN="<your-token>"
```

The endpoint `POST /models/{model_id}/predict` supports:
- API key via `X-API-Key`
- API key via `Authorization: Bearer <token>`
- In-memory sliding-window rate limiting
If `PRISM_API_KEYS` is not set, the endpoint runs in open mode.
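For intuition, this is roughly how an in-memory sliding-window limiter works; the class below is an illustrative sketch, not PRISM's actual implementation (defaults mirror the `PRISM_RATE_LIMIT_*` defaults above):

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client key."""

    def __init__(self, limit: int = 120, window: float = 60.0):
        self.limit = limit
        self.window = window
        # One deque of request timestamps per client key, all in memory.
        self.hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        q = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: the caller should answer 429
        q.append(now)
        return True


limiter = SlidingWindowLimiter(limit=120, window=60.0)
print(limiter.allow("client-1"))  # True until the limit is exhausted
```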
Example request with an API key:

```bash
curl -X POST http://127.0.0.1:8000/models/<model_id>/predict \
  -H "Content-Type: application/json" \
  -H "X-API-Key: key-one" \
  -d '{"feature1": 0.1}'
```

For development, kill any running tunnels first:

```bash
pkill -f "app.core.tunnel_worker" || true
rm -rf /tmp/prism/tunnels
```

Run lint and format:
```bash
poetry run prism lint .
poetry run prism format .
```

Run tests:

```bash
poetry run pytest -q
```

Run selected tests:

```bash
poetry run pytest tests/test_frontend.py -v
poetry run pytest tests/test_model_lifecycle_endpoints.py -v
```

Benchmark methodology and the latest results are documented in BENCHMARKS.md.

Run the benchmark script:

```bash
poetry run python scripts/benchmark_models.py --iterations 2000
```

Troubleshooting:

- Docker errors during upload/build: Ensure the Docker daemon is running and its socket is accessible
- Model not reachable: Check container status via dashboard/logs and validate registry port entry
- `401` on predict endpoint: Verify the request API key when `PRISM_API_KEYS` is configured
- `429` responses: Increase the rate-limit variables or reduce request burst frequency
- Tunnel startup failure: Confirm `NGROK_AUTHTOKEN` and retry after the worker startup delay
Project structure:

```
app/
  main.py                   # FastAPI app bootstrap
  cli.py                    # prism CLI command
  routing/                  # API + UI routes
  services/                 # dashboard + health monitor services
  registry/containers.json  # model container registry (default)
  runtime/                  # model loader/runtime adapters
model_container/            # template files copied into per-model build contexts
model_store/                # sample artifacts + generated uploads
tests/                      # API, UI, and service tests
scripts/benchmark_models.py
```
Credits:

- Author: Aaryan Kumar Sinha
- Project: PRISM (Predictive Runtime and Inference Serving Module)
- Research inspiration: practical ideas from systems such as Clipper and broader model-serving/runtime literature
- Open-source ecosystem: FastAPI, Uvicorn, Docker, ONNX Runtime, scikit-learn, Pytest, Ruff, Black
MIT License.







