
SMG Roadmap #511

@slin1237


Part I — Features

1. Day-0 Model Support

Goal: Serve new models on launch day with minimal friction.

What SMG provides:

  • Model onboarding checklist and validation harness (routing, health checks, streaming, tool-calling e2e)
  • Automated compatibility tests triggered on new model registration
  • Documentation template for model-specific quirks (special stop tokens, non-standard tool formats)

Deliverables:

  • Model onboarding guide — step-by-step checklist for enabling a new model
  • Validation harness — automated test suite runnable against a staged deployment
  • Model metadata schema — standardized format for declaring model capabilities
  • Launch readiness gate — CI job that blocks model GA if integration tests fail
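
As a sketch of what the model metadata schema could look like (field names here are illustrative, not a finalized schema), a new model might be declared as:

```yaml
# Hypothetical model metadata entry (illustrative field names only).
model: example-org/new-model-7b
capabilities:
  streaming: true
  tool_calling: true
  multimodal: [image]
quirks:
  stop_tokens: ["<|eot|>"]        # model-specific stop tokens
  tool_format: non_standard_json  # documented deviation from the standard format
validation:
  harness_suites: [routing, health, streaming, tool_calling_e2e]
  launch_gate: block_on_failure   # CI blocks model GA if any suite fails
```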

2. OSS Engine Support

Goal: Full, production-quality coverage across SGLang, vLLM, and TensorRT-LLM over both gRPC and HTTP.

Key decisions:

  • gRPC is primary — tokenization, function call parsing, and reasoning parser features live here
  • HTTP is fallback — will NOT get tokenization/func-call/reasoning-parser features

Scope:

  • Interface parity — close gaps where an engine only works over one transport
  • Feature coverage matrix — document which features (streaming, tool calling, multimodality, structured output, prefix caching) are supported per engine x transport

Deliverables:

  • gRPC + HTTP parity audit
  • Feature coverage matrix (feature x engine x transport)

3. Multimodality

Goal: Close gaps in multimodal support. High priority — at least one model and one engine must work end-to-end, or VLM day-0 support is blocked.

Current state:

  • Image input works for select models (LLaVA, Qwen-VL variants) over HTTP only
  • No VLM support in gRPC mode
  • Gaps in audio, video, and mixed-modality inputs

Priorities:

  1. Audit current coverage — modality x model x engine x transport
  2. Image input hardening — cross-engine reliability, content-type handling, base64/URL, size/format validation
  3. Audio and video — evaluate demand and engine support, define phased plan
  4. Multimodal output — image generation, TTS if applicable
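
The image-input hardening step (priority 2) could be sketched as follows. This is an illustrative Rust sketch, not the actual SMG implementation — the function and type names are hypothetical — showing the kind of base64/URL, size, and format validation described above:

```rust
// Hypothetical image-input validation (illustrative names, not SMG APIs):
// accept either a base64 data URL or an http(s) URL, enforce a size cap,
// and allow only known image formats.

#[derive(Debug, PartialEq)]
enum ImageRef {
    DataUrl { mime: String, bytes_b64: String },
    RemoteUrl(String),
}

const ALLOWED_MIMES: &[&str] = &["image/png", "image/jpeg", "image/webp"];
const MAX_B64_LEN: usize = 20 * 1024 * 1024; // illustrative cap (~15 MiB decoded)

fn validate_image_input(input: &str) -> Result<ImageRef, String> {
    if let Some(rest) = input.strip_prefix("data:") {
        let (mime, payload) = rest
            .split_once(";base64,")
            .ok_or("data URL must be base64-encoded")?;
        if !ALLOWED_MIMES.contains(&mime) {
            return Err(format!("unsupported image format: {mime}"));
        }
        if payload.len() > MAX_B64_LEN {
            return Err("image exceeds size limit".into());
        }
        Ok(ImageRef::DataUrl { mime: mime.to_string(), bytes_b64: payload.to_string() })
    } else if input.starts_with("http://") || input.starts_with("https://") {
        Ok(ImageRef::RemoteUrl(input.to_string()))
    } else {
        Err("expected data URL or http(s) URL".into())
    }
}

fn main() {
    assert!(validate_image_input("data:image/png;base64,iVBORw0KGgo=").is_ok());
    assert!(validate_image_input("https://example.com/cat.jpg").is_ok());
    assert!(validate_image_input("data:image/tiff;base64,AAAA").is_err());
}
```

A cross-engine test suite would then run the same validated inputs against each engine × transport combination from the coverage audit.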

Deliverables:

  • Multimodality support matrix
  • Image input hardening (bug fixes, test coverage, cross-engine validation)
  • Audio/video phased roadmap
  • Multimodal test suite

4. MCP Semantic Search

Goal: Efficient tool discovery across MCP servers with hundreds/thousands of registered tools.

Problem: As MCP adoption grows, linear listing and exact-name matching won't scale. Models need to find tools by intent, not just name.

Scope:

  • Tool registry indexing — discover and index tools from all connected MCP servers at startup + periodic refresh
  • Semantic search — embed tool descriptions into vector space, expose search API for natural-language queries returning ranked matches
  • Integration with tool dispatch — when a tool call doesn't match an exact name, optionally fall back to semantic search with configurable confidence threshold
  • Multi-server aggregation — search spans all servers; handle namespace conflicts via precedence rules or server.tool_name namespacing
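
The core ranking step can be sketched in a few lines. This is an illustrative Rust sketch (not the actual SMG implementation): tool descriptions are pre-embedded, a query embedding is compared by cosine similarity, and results above a confidence threshold come back ranked, with `server.tool_name` namespacing as described above:

```rust
// Illustrative sketch of semantic tool search: rank MCP tools by cosine
// similarity between a query embedding and precomputed description
// embeddings. Names are hypothetical.

struct IndexedTool {
    qualified_name: String, // "server.tool_name" to avoid cross-server conflicts
    embedding: Vec<f32>,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return tools scoring at or above `threshold`, best match first.
fn semantic_search(query: &[f32], index: &[IndexedTool], threshold: f32) -> Vec<(String, f32)> {
    let mut hits: Vec<(String, f32)> = index
        .iter()
        .map(|t| (t.qualified_name.clone(), cosine(query, &t.embedding)))
        .filter(|(_, score)| *score >= threshold)
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    hits
}

fn main() {
    let index = vec![
        IndexedTool { qualified_name: "fs.read_file".into(), embedding: vec![1.0, 0.0] },
        IndexedTool { qualified_name: "web.search".into(), embedding: vec![0.0, 1.0] },
    ];
    let query = vec![0.9, 0.1]; // query embedding closer to fs.read_file
    let hits = semantic_search(&query, &index, 0.5);
    assert_eq!(hits[0].0, "fs.read_file");
}
```

In the dispatch integration, this search would only run when an exact tool-name match fails, with the threshold taken from the configurable resolution policy.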

Deliverables:

  • Tool registry with embedding index
  • Semantic search API
  • Multi-server MCP config schema
  • Configurable resolution policy (thresholds, fallback, conflict handling)

5. Semantic Routing

Goal: Lightweight classification-based routing — dispatch requests to different backends based on semantic content, not just static rules.

Scope:

  • Request classifier — small, fast model (or embedding + nearest-neighbor) running inline, <5ms p99
  • Routing policies — declarative rules mapping classifier outputs to backends (coding → code model, simple Q&A → small model, safety-sensitive → guardrailed model)
  • Fallback and override — below-threshold confidence falls back to default route; explicit overrides via request headers/metadata
  • Observability — log classification decisions, expose metrics on route distribution
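
A declarative routing policy along these lines might look like the following (a hypothetical schema, for illustration — not the final config format):

```yaml
# Hypothetical semantic routing policy (illustrative schema).
classifier:
  type: embedding_nn          # embedding + nearest-neighbor, inline, <5ms p99 target
  confidence_threshold: 0.7
routes:
  coding: code-model-backend
  simple_qa: small-model-backend
  safety_sensitive: guardrailed-backend
fallback: default-backend     # used when classifier confidence is below threshold
override_header: x-smg-route  # explicit per-request override
```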

Deliverables:

  • Pluggable classifier interface with default lightweight implementation
  • Declarative routing policy config
  • Fallback and override logic
  • Routing observability (metrics + trace annotations)

6. Mixture of Vendors (MoV)

Goal: Route requests across multiple cloud/API vendors for the same model — A/B testing, cost optimization, quality comparison, semantic routing across providers.

Vision: The same model (e.g. GPT-5) may be served by multiple vendors with different pricing, latency, and quality. SMG becomes a vendor-agnostic gateway that intelligently routes across them.

Target vendors:

Vendor                                        Auth Mechanism                     Complexity
TogetherAI, OpenAI, xAI, DeepSeek, Anthropic  API key                            Low
GCP GenAI                                     OAuth 2.0 / API key                Medium
Azure, GCP Vertex                             OAuth 2.0 / service account        Medium-High
AWS Bedrock                                   SigV4 (IAM)                        High
OCI GenAI                                     RSA-SHA256 request signing (IAM)   High

Two core problems to solve:

  1. Generic credential crate (smg-credential) — standalone Rust crate for all auth: API key injection, OIDC/OAuth 2.0 token fetch/refresh, SigV4, RSA-SHA256, GCP service account exchange. Key challenge: many cloud providers lack Rust SDKs — signing/auth must be implemented from scratch.

  2. Protocol adapters — map vendor-specific req/resp formats to SMG's internal representation. Tedious but well-defined. Each vendor gets an adapter module in protocols/.
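
One plausible shape for the smg-credential crate is a single trait per vendor auth scheme, so the gateway itself stays auth-agnostic. The sketch below is hypothetical (trait and type names are illustrative), showing the simplest API-key case; OAuth 2.0, SigV4, and RSA-SHA256 providers would implement the same trait:

```rust
// Hypothetical shape for smg-credential (illustrative names).

/// A request about to be sent to a vendor; the provider mutates it in place.
struct OutboundRequest {
    headers: Vec<(String, String)>,
}

trait CredentialProvider {
    /// Attach auth material: a static header for API-key vendors, a freshly
    /// refreshed bearer token for OAuth 2.0, or a full SigV4 / RSA-SHA256
    /// signature over the canonical request for AWS / OCI.
    fn authorize(&self, req: &mut OutboundRequest) -> Result<(), String>;
}

/// Simplest case: static API key (TogetherAI, OpenAI, xAI, DeepSeek, Anthropic).
struct ApiKeyProvider {
    header: String,
    key: String,
}

impl CredentialProvider for ApiKeyProvider {
    fn authorize(&self, req: &mut OutboundRequest) -> Result<(), String> {
        req.headers.push((self.header.clone(), self.key.clone()));
        Ok(())
    }
}

fn main() {
    let p = ApiKeyProvider { header: "Authorization".into(), key: "Bearer sk-test".into() };
    let mut req = OutboundRequest { headers: vec![] };
    p.authorize(&mut req).unwrap();
    assert_eq!(req.headers[0].0, "Authorization");
}
```

The signing-heavy providers (SigV4, OCI request signing) are where the from-scratch work lands, since they must canonicalize and sign the full request rather than just add a header.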

Routing possibilities (once auth + adapters are solved):

  • A/B testing across vendors for the same model
  • Cost-based routing — cheapest provider for a given quality threshold
  • Latency-based routing — fastest provider for a given region
  • Quality-based routing — route based on eval scores per vendor
  • Semantic routing (Section 5) combined with vendor selection
  • Failover — automatic fallback to alternate vendor on errors

Deliverables:

  • smg-credential crate
  • Vendor protocol adapters
  • Vendor routing config (declarative)
  • SMG SDK for applications

7. Data Connector Redesign

Goal: Redesign the data_connector crate to support flexible schemas and pluggable storage logic via WASM plugins.

Problem: Current implementation has hardcoded database schemas duplicated across backends (Postgres, Oracle, Redis, Memory). No way to customize table/column names, add fields, or inject custom storage behavior without forking.

Architecture (3 layers):

  1. Schema Registry — centralized schema definition replacing inline SQL scattered across backends. Schema defined once, shared by all backends. Supports custom fields, column renaming, and versioned migrations.
  2. Schema Configuration — YAML-based config allowing users to customize table names, column names, and add custom fields without recompilation.
  3. WASM Plugin System — users can implement custom storage behavior (audit logging, encryption, multi-tenancy, custom query patterns) via WASM plugins written in any language. Runs on wasmtime with host functions for storage operations.
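
A schema configuration in this style might look like the following (hypothetical keys, for illustration only):

```yaml
# Hypothetical schema configuration (illustrative keys).
tables:
  conversations:
    name: chat_conversations      # rename the default table
    columns:
      user_id: tenant_user_id     # rename a column
    custom_fields:
      - name: region
        type: text
plugins:
  - path: ./audit_logger.wasm     # WASM plugin invoked via wasmtime host functions
    hooks: [before_write]
```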

Deliverables:

  • Schema registry crate with centralized definitions
  • YAML schema configuration format
  • WASM plugin runtime integration (wasmtime)
  • Plugin SDK for custom storage backends
  • Migration framework with version tracking
  • Updated built-in backends using the new schema registry

8. Metrics Service & Worker Load Monitoring

Goal: Unified, event-driven metrics architecture — single source of truth for all worker metrics, decoupled from routing, with real-time response piggyback and pluggable collection.

Problem: Current state has fragmented metrics handling — response piggyback tightly coupled to request flow, LoadMonitor polling with stale data, no unified real-time view of worker performance.

Architecture:

  • Metrics Store — per-worker metrics cache with change detection, backed by DashMap
  • Event Bus — broadcast channel for fan-out to subscribers (policy engine, load balancer, observability). Non-blocking, per-subscriber bounded queues with health monitoring
  • Response Piggyback — real-time metrics extracted from response headers (HTTP X-SGLang-*) and gRPC metadata on every chunk. Zero additional network cost
  • Direct Scraper — background polling of worker /metrics endpoints as fallback
  • Prometheus Scraper — query Prometheus for custom metrics used in CEL policies

Subscribers:

  • Policy Engine — subscribes to metrics events, evaluates CEL policies on demand, tiered fallback (fresh → stale → round-robin → 503)
  • Load Balancer — maintains local cache updated by events, no more polling. Uses kv_cache_tokens as primary metric when fresh, falls back to in_flight * avg_tokens
  • Observability — exports all worker metrics as Prometheus gauges

Key design decisions:

  • Monotonic sequence numbers per worker to prevent out-of-order updates
  • Source priority: piggyback (100) > direct scrape (50) > Prometheus (25)
  • Snapshot-then-subscribe API for late joiners (no cold start gap)
  • Shared immutable snapshots with Arc for O(1) memory per worker regardless of subscriber count
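
The first two design decisions combine into a single update rule, sketched below (illustrative names, not the actual implementation): an incoming metrics update is applied only if its sequence number advances, or if it carries the same sequence from a higher-priority source:

```rust
// Sketch of the "monotonic sequence + source priority" update rule
// (illustrative names). Discriminants mirror the priorities above:
// piggyback (100) > direct scrape (50) > Prometheus (25).

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Source {
    Prometheus = 25,
    DirectScrape = 50,
    Piggyback = 100,
}

#[derive(Debug)]
struct WorkerMetrics {
    seq: u64,
    source: Source,
    kv_cache_tokens: u64,
}

fn should_apply(current: &WorkerMetrics, incoming: &WorkerMetrics) -> bool {
    incoming.seq > current.seq
        || (incoming.seq == current.seq && incoming.source > current.source)
}

fn main() {
    let cur = WorkerMetrics { seq: 10, source: Source::DirectScrape, kv_cache_tokens: 4096 };
    // Out-of-order update is rejected regardless of source.
    let stale = WorkerMetrics { seq: 9, source: Source::Prometheus, kv_cache_tokens: 0 };
    assert!(!should_apply(&cur, &stale));
    // Same sequence, but higher-priority piggyback data wins.
    let piggy = WorkerMetrics { seq: 10, source: Source::Piggyback, kv_cache_tokens: 5000 };
    assert!(should_apply(&cur, &piggy));
}
```

The store would apply this check under the per-worker DashMap entry, then publish the new snapshot on the event bus.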

Deliverables:

  • MetricsService crate (store, event bus, staleness detection)
  • Response piggyback collectors (HTTP headers, gRPC metadata)
  • Direct and Prometheus scrapers
  • Policy engine with CEL support
  • Event-driven load balancer
  • Prometheus metrics exporter (40+ worker metrics)
  • Legacy LoadMonitor removal

9. Custom Metrics Load Balancing

Goal: Allow users to define custom routing policies using CEL expressions over arbitrary metrics — GPU memory, queue depth, business metrics — with sub-millisecond routing overhead.

Problem: Current load balancing supports only predefined policies (round-robin, least-connections, bucket-mode). Users cannot route based on GPU utilization from DCGM exporters, factor in custom business metrics, or apply hard constraints like "exclude backends above 95% memory."

Architecture:

  • CEL Policy Engine — one policy per model. CEL expressions compiled at creation time (~1-5ms), executed per-request in ~100ns-1us. Supports scoring expressions and hard constraints
  • Metrics Sources — pluggable: Prometheus server (PromQL queries), direct backend /metrics scraping, OpenTelemetry (future). Background scraping with configurable interval
  • Control Plane API — REST API for CRUD on policies and metrics sources. Validation at creation time (CEL compilation, connectivity test, dry-run). Simulation endpoint for testing routing decisions
  • Storage — policies persisted in storage backend (Memory, Postgres, Oracle). Per-replica cache with polling refresh + jitter

CEL capabilities:

  • Arithmetic, comparison, logical, ternary operators
  • Access to custom metrics, request context (request.max_tokens, request.stream, request.prompt_tokens)
  • Hard constraints (e.g., gpu_mem_used / gpu_mem_total < 0.95) with fail-safe exclusion
  • Scoring modes: lower_is_better / higher_is_better
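
Putting these pieces together, a policy might be declared like this (a hypothetical request body — the schema is illustrative, though the CEL expressions are of the kind described above):

```yaml
# Hypothetical policy body for POST /v1/policies (illustrative schema).
model: example-model
scoring:
  expression: "gpu_mem_used / gpu_mem_total + 0.1 * queue_depth"
  mode: lower_is_better
constraints:
  - expression: "gpu_mem_used / gpu_mem_total < 0.95"  # hard exclusion, fail-safe
metrics_source: prometheus-main
```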

Deliverables:

  • CEL policy engine with compilation caching
  • Metrics source CRUD API (/v1/metrics-sources, /v1/policies)
  • Prometheus and direct scraper implementations
  • Policy simulation and validation endpoints
  • Storage trait extension for policies and metrics sources
  • Observability (routing scores, constraint violations, cache age, CEL eval duration)

Part II — Infrastructure & CI

10. Image Building

Goal: Reproducible, automated container image builds for SMG bundled with each supported engine.

Scope:

  • Support building from a specific commit or branch of each engine
  • 9 base configurations: 3 engines (vLLM, SGLang, TensorRT-LLM) x 3 sub-versions each
  • H200 runner support for GPU-accelerated builds
  • Publish engine-specific images (e.g. smg:v1.1.0-sglang-0.4.7) plus standalone SMG-only image

Deliverables:

  • Multi-stage Dockerfiles per engine with version-parameterized builds
  • Build automation workflow (GitHub Actions — release + manual trigger)
  • Image naming convention (smg:<smg-version>-<engine>-<engine-version>)
  • H200 runner integration

11. Version Matrix & CI

Goal: Clear, tested compatibility matrix between SMG releases and engine versions, validated automatically.

Scope:

  • Support contract — for each SMG release, publish supported engine versions (and vice versa)
  • PR-level CI — integration tests against supported engine versions on every PR (GPU ARC-Runner)
  • Release-level CI — gate releases on full matrix pass (perf regression, multimodal, tool dispatch)
  • Auto-generate matrix from machine-readable source of truth (SUPPORTED_ENGINES.yaml)
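
SUPPORTED_ENGINES.yaml could take a shape like the following (a sketch only — the real schema is still to be defined, and engine versions other than the sglang-0.4.7 example used elsewhere in this roadmap are placeholders):

```yaml
# Sketch of SUPPORTED_ENGINES.yaml (illustrative structure).
smg: v1.1.0
engines:
  sglang:
    supported: ["0.4.7"]
    transports: [grpc, http]
  vllm:
    supported: ["x.y.z"]   # placeholder
    transports: [grpc, http]
  tensorrt-llm:
    supported: ["x.y.z"]   # placeholder
    transports: [grpc, http]
```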

CI modes:

Mode                    Trigger                   Scope
PR test                 On PR                     SMG + runtime combo on GPU ARC-Runner
Matrix bench (manual)   Manual                    Benchmark across all engines
Matrix bench (auto)     New runtime image build   Auto-verify on new images
Nightly                 Cron                      Top models across all engines

Test suites:

Suite              Coverage
Function calling   Lighter BFCL
Performance        genai-bench (PR + nightly)
Quality            GPQA / Livebench

Deliverables:

  • SUPPORTED_ENGINES.yaml
  • CI matrix workflow (GitHub Actions)
  • README compatibility table / badge
  • Version support policy doc (rolling window, deprecation timeline)

12. Nightly Chaos & Load Testing

Goal: Prove 100% stability of SMG under real-world conditions — mixed backends, mixed models, extreme load, and injected failures.

Environment:

  • SMG routing across multiple backends simultaneously:
    • Self-hosted: SGLang, vLLM, TensorRT-LLM each serving different models
    • 3rd-party: OpenAI, xAI, and other vendor endpoints
  • Realistic model mix (chat, code, multimodal, reasoning models)

Chaos testing:

  • Random backend kills / restarts mid-request
  • Network partition injection between SMG and backends
  • Backend latency spikes and timeout simulation
  • Partial backend failures (some models healthy, some not)
  • Certificate / auth token expiry during traffic
  • Health check flapping

Extreme load testing:

  • Sustained high-throughput traffic across all backends simultaneously
  • Burst traffic (sudden 10x spike)
  • Long-running streaming requests under load
  • Connection exhaustion / backpressure scenarios
  • Mixed workloads (short completions + long streaming + tool calls + multimodal)
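
The load profiles above might be expressed declaratively, e.g. (a hypothetical config format; all values are illustrative):

```yaml
# Hypothetical load-generation profiles (illustrative format and values).
profiles:
  sustained:
    rps: 500
    duration: 4h
  burst:
    base_rps: 100
    spike_multiplier: 10   # sudden 10x spike
    spike_duration: 60s
  mixed:
    workloads:
      - {type: short_completion, weight: 0.4}
      - {type: long_streaming,   weight: 0.3}
      - {type: tool_call,        weight: 0.2}
      - {type: multimodal,       weight: 0.1}
```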

Validation criteria:

  • Zero dropped requests that should have succeeded
  • Correct failover behavior on backend failures
  • No memory leaks or resource exhaustion over extended runs
  • Graceful degradation under overload (proper error responses, no panics)
  • Metrics and logs remain accurate under stress

Deliverables:

  • Nightly CI job with full chaos + load suite
  • Test harness for spinning up multi-backend SMG environment
  • Chaos injection framework (backend kill, network partition, latency injection)
  • Load generation config (sustained, burst, mixed workload profiles)
  • Stability report (auto-generated per nightly run)
