joonsoome/embed-rerank

🔥 Embeddings + Reranking on your Mac (MLX‑first)

OpenAI rerank supported (/v1/openai/rerank, auto‑sigmoid on by default) · Available on PyPI

Blazing‑fast local embeddings and true cross‑encoder reranking on Apple Silicon, exposed through Native, OpenAI‑compatible, TEI, and Cohere‑compatible APIs.

This page is a beginner‑friendly quick start. Detailed guides live in docs/.

🌐 Four APIs, One Service

| API | Endpoints | Use case |
| --- | --- | --- |
| Native | /api/v1/embed, /api/v1/rerank | New projects |
| OpenAI | /v1/embeddings, /v1/openai/rerank (alias: /v1/rerank_openai) | Existing OpenAI code |
| TEI | /embed, /rerank, /info | Hugging Face TEI replacement |
| Cohere | /v1/rerank, /v2/rerank | Cohere API replacement |
| Utility | /docs, /health | Interactive docs and health check |

📈 Performance Visualization

Latency Comparison (Projected)

Single Text Embedding Latency (milliseconds)

Apple MLX    β–ˆβ–ˆβ–ˆβ–ˆ 0.2ms
PyTorch MPS  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 45ms  
PyTorch CPU  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 120ms
CUDA (Est.)  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 12ms
Vulkan (Est.) β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 25ms

0ms        25ms       50ms       75ms       100ms      125ms

Throughput Comparison (texts/second)

Maximum Throughput (texts per second)

Apple MLX     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 35,000
CUDA (Est.)   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 8,000  
PyTorch MPS   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 1,500
Vulkan (Est.) β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 3,000
PyTorch CPU   β–ˆβ–ˆ 500

0          10k        20k        30k        40k

🚀 Start here (60 seconds)

  1. Install and run (embeddings only)
pip install embed-rerank

# Minimal .env
cat > .env <<'ENV'
BACKEND=auto
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ
PORT=9000
HOST=0.0.0.0
ENV

embed-rerank  # http://localhost:9000

Want 2560‑D vectors by default? Add this to .env and restart:

cat >> .env <<'ENV'
# Use the model hidden_size (2560 for Qwen3-Embedding-4B) as output dimension
DIMENSION_STRATEGY=hidden_size
# Or enforce a fixed size (pads/truncates as needed):
# OUTPUT_EMBEDDING_DIMENSION=2560
# DIMENSION_STRATEGY=pad_or_truncate
ENV

# Verify
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["hello"],"normalize":true}' | jq '.vectors[0] | length'
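To make the two dimension strategies concrete, here is a minimal sketch of what hidden_size versus pad_or_truncate imply for a raw model vector. This is illustrative only; `apply_dimension_strategy` is a hypothetical helper, not the service's actual code:

```python
def apply_dimension_strategy(vector, strategy="hidden_size", target_dim=None):
    """Illustrative only: map a raw embedding to the served output dimension."""
    if strategy == "hidden_size":
        # Serve the model's native width unchanged (2560 for Qwen3-Embedding-4B).
        return list(vector)
    if strategy == "pad_or_truncate":
        # Enforce OUTPUT_EMBEDDING_DIMENSION: cut extra dims or zero-pad missing ones.
        clipped = list(vector[:target_dim])
        clipped += [0.0] * (target_dim - len(clipped))
        return clipped
    raise ValueError(f"unknown strategy: {strategy}")

vec = [0.1, 0.2, 0.3]
print(len(apply_dimension_strategy(vec, "pad_or_truncate", target_dim=5)))  # 5
```

Either way, the served length is what the `jq '.vectors[0] | length'` check above reports.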
  2. Try it (embeddings + simple rerank)
# Embeddings (Native)
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["Hello MLX","Apple Silicon rocks"]}' | jq '.embeddings | length'

# Rerank fallback (no dedicated reranker yet)
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
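The fallback path above scores documents with embeddings instead of a cross‑encoder. Conceptually, that amounts to ranking documents by cosine similarity between the query embedding and each document embedding; the sketch below illustrates the idea only and is not the service's actual implementation:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fallback_rerank(query_vec, doc_vecs, top_n=2):
    # Score each document against the query, highest similarity first.
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Toy vectors: doc 0 points roughly the same way as the query, doc 1 is orthogonal.
print(fallback_rerank([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]]))
```

A dedicated cross‑encoder (next step) generally ranks better because it reads query and document together rather than comparing independent vectors.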
  3. Add a dedicated reranker (better quality)
cat >> .env <<'ENV'
# Torch cross-encoder (stable)
RERANKER_BACKEND=auto
RERANKER_MODEL_ID=cross-encoder/ms-marco-MiniLM-L-6-v2

# MLX reranker (Apple Silicon). Required for MLX models:
# RERANKER_BACKEND=mlx
# RERANKER_MODEL_NAME=vserifsaglam/Qwen3-Reranker-4B-4bit-MLX
ENV

# Optional: pre-download the MLX reranker to the default HF cache
# ./download-rerank-model.sh

# Restart server, then call Native or OpenAI-compatible rerank
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
  4. (Optional) Run as a macOS service
# Uses your .env to generate a LaunchAgent and start the service
./tools/setup-macos-service.sh

# Check status and health
launchctl list | grep com.embed-rerank.server
open http://localhost:9000/health/

Notes

  • OpenAI drop-in supported for both embeddings and rerank (/v1/embeddings, /v1/openai/rerank). See docs for a tiny SDK example.
  • Scores may be auto‑sigmoid‑normalized for OpenAI clients by default (disable via OPENAI_RERANK_AUTO_SIGMOID=false).
  • The root endpoint / shows both embedding_dimension (served) and hidden_size (model config) for clarity.
  • Reranker backend auto defaults to Torch for compatibility; MLX reranker models require RERANKER_BACKEND=mlx.
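The auto‑sigmoid note above refers to squashing raw cross‑encoder logits into (0, 1). A minimal sketch of that transformation (the server's exact implementation may differ):

```python
import math

def sigmoid_normalize(raw_scores):
    # Map unbounded cross-encoder logits into (0, 1) with the logistic function.
    return [1.0 / (1.0 + math.exp(-s)) for s in raw_scores]

print(sigmoid_normalize([-2.0, 0.0, 3.5]))
```

Because the sigmoid is monotonic, the ranking order is unchanged; only the score scale becomes comparable across requests, which is what many OpenAI‑style clients expect.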

Run the full validation suite

./tools/server-tests.sh --full

🧭 Pick your path

  • Deployment profiles (Embeddings‑only, Fallback rerank, Dedicated reranker): docs/DEPLOYMENT_PROFILES.md
  • OpenAI usage (tiny example + options): docs/ENHANCED_OPENAI_API.md
  • Quality benchmarks (JSONL/CSV judgments): docs/QUALITY_BENCHMARKS.md
  • Troubleshooting: docs/TROUBLESHOOTING.md
  • Backend specs and performance: docs/BACKEND_TECHNICAL_SPECS.md, docs/PERFORMANCE_COMPARISON_CHARTS.md

Try it with OpenAI SDK (tiny)

import openai
import requests

client = openai.OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

# Embeddings
res = client.embeddings.create(model="text-embedding-ada-002", input=["hello world"])
print(len(res.data[0].embedding))

# Rerank (OpenAI-compatible endpoint; the OpenAI SDK has no rerank method,
# so call the endpoint directly)
rr = requests.post(
    "http://localhost:9000/v1/openai/rerank",
    json={
        "query": "capital of france",
        "documents": [
            {"id": "a", "text": "Paris is the capital of France"},
            {"id": "b", "text": "Berlin is in Germany"},
        ],
        "top_n": 2,
    },
).json()
print(rr.get("results", rr))

Tested Frameworks

| Framework | Tests |
| --- | --- |
| Open WebUI | ✅ Embed |
| LightRAG | ✅ Embed, ✅ Rerank |
| continue.dev | ✅ Embed, ✅ Rerank |
| Kilo Code | ✅ Embed |
| LiteLLM | ✅ Embed, ✅ Rerank |
| RisuAI | ✅ Embed |
Reports for other frameworks are welcome!

LightRAG Option Parity

  • EMBEDDING_SEND_DIM: gate honoring OpenAI-compatible dimensions in /v1/embeddings
  • EMBEDDING_TOKEN_LIMIT: alias for DEFAULT_MAX_TOKENS_OVERRIDE (default per-text token budget)
  • RERANK_ENABLE_CHUNKING, RERANK_MAX_TOKENS_PER_DOC: Cohere-compatible rerank document chunking
  • When unset, defaults remain unchanged.
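To illustrate what RERANK_MAX_TOKENS_PER_DOC‑style chunking does, here is a sketch using a naive whitespace tokenizer; the service itself would use the model's tokenizer, and `chunk_document` is a hypothetical helper:

```python
def chunk_document(text, max_tokens_per_doc):
    # Naive whitespace "tokenizer" for illustration only; real token counts
    # come from the model tokenizer.
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens_per_doc])
        for i in range(0, len(tokens), max_tokens_per_doc)
    ]

print(chunk_document("one two three four five", 2))
```

Each chunk is scored separately, so long documents stay within the per‑document token budget instead of being truncated outright.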

📄 License

MIT License – build amazing things locally.

About

A Python-based API service built with FastAPI and PyTorch that performs both text embedding and reranking using a powerful omni-model such as Qwen3-Embedding-4B. It loads a single model instance in memory to handle embedding and reranking requests simultaneously, enabling fast and resource-efficient operation.
