SMG Roadmap
Part I — Features
1. Day-0 Model Support
Goal: Serve new models on launch day with minimal friction.
What SMG provides:
- Model onboarding checklist and validation harness (routing, health checks, streaming, tool-calling e2e)
- Automated compatibility tests triggered on new model registration
- Documentation template for model-specific quirks (special stop tokens, non-standard tool formats)
Deliverables:
2. OSS Engine Support
Goal: Full, production-quality coverage across SGLang, vLLM, TensorRT-LLM over both gRPC and HTTP.
Key decisions:
- gRPC is primary — tokenization, function call parsing, and reasoning parser features live here
- HTTP is fallback — will NOT get tokenization/func-call/reasoning-parser features
Scope:
- Interface parity — close gaps where an engine only works over one transport
- Feature coverage matrix — document which features (streaming, tool calling, multimodality, structured output, prefix caching) are supported per engine x transport
Deliverables:
3. Multimodality
Goal: Close gaps in multi-modal support. High priority — at least 1 model + 1 engine must work e2e or VLM day-0 support is blocked.
Current state:
- Image input works for select models (LLaVA, Qwen-VL variants) over HTTP only
- No VLM support in gRPC mode
- Gaps in audio, video, and mixed-modality inputs
Priorities:
- Audit current coverage — modality x model x engine x transport
- Image input hardening — cross-engine reliability, content-type handling, base64/URL, size/format validation
- Audio and video — evaluate demand and engine support, define phased plan
- Multimodal output — image generation, TTS if applicable
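The image-hardening item above (base64/URL handling, size/format validation) can be sketched as a small gateway-side validator. This is a minimal sketch under assumed limits — the 10 MiB cap, the supported MIME list, and the function name are illustrative, not SMG's actual API:

```rust
// Sketch: validate an image input that may be a URL or a base64 data URI.
// Limits and supported formats are assumptions for illustration.

const MAX_IMAGE_BYTES: usize = 10 * 1024 * 1024; // assumed 10 MiB cap
const SUPPORTED: &[&str] = &["image/png", "image/jpeg", "image/webp"];

pub fn validate_image_input(input: &str) -> Result<(), String> {
    // URLs pass through here; the fetched bytes would be re-validated later.
    if input.starts_with("http://") || input.starts_with("https://") {
        return Ok(());
    }
    if let Some(rest) = input.strip_prefix("data:") {
        let (mime, payload) = rest
            .split_once(";base64,")
            .ok_or("data URI must be base64-encoded")?;
        if !SUPPORTED.contains(&mime) {
            return Err(format!("unsupported content type: {mime}"));
        }
        // base64 inflates payloads by ~4/3, so bound the encoded length.
        if payload.len() > MAX_IMAGE_BYTES / 3 * 4 {
            return Err("image too large".into());
        }
        return Ok(());
    }
    Err("expected URL or data URI".into())
}
```

Rejecting malformed inputs at the gateway keeps per-engine behavior consistent regardless of how strictly each engine validates downstream.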
Deliverables:
4. MCP Semantic Search
Goal: Efficient tool discovery across MCP servers with hundreds/thousands of registered tools.
Problem: As MCP adoption grows, linear listing and exact-name matching won't scale. Models need to find tools by intent, not just name.
Scope:
- Tool registry indexing — discover and index tools from all connected MCP servers at startup + periodic refresh
- Semantic search — embed tool descriptions into vector space, expose search API for natural-language queries returning ranked matches
- Integration with tool dispatch — when a tool call doesn't match an exact name, optionally fall back to semantic search with configurable confidence threshold
- Multi-server aggregation — search spans all servers; handle namespace conflicts via precedence rules or server.tool_name namespacing
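The ranked-search step above can be sketched as cosine similarity over precomputed tool-description embeddings. ToolEntry and search are illustrative names, not SMG's API, and embedding generation is assumed to happen at index time:

```rust
// Sketch: rank registered tools by similarity to a query embedding.

#[derive(Debug, Clone)]
pub struct ToolEntry {
    pub name: String,       // namespaced as "server.tool_name"
    pub embedding: Vec<f32>, // precomputed from the tool description
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return up to top_k tools ranked by similarity, best first.
pub fn search(tools: &[ToolEntry], query: &[f32], top_k: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = tools
        .iter()
        .map(|t| (t.name.clone(), cosine(&t.embedding, query)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(top_k);
    scored
}
```

The same ranked list also serves the dispatch fallback: take the top hit only when its score clears the configured confidence threshold.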
Deliverables:
5. Semantic Routing
Goal: Lightweight classification-based routing — dispatch requests to different backends based on semantic content, not just static rules.
Scope:
- Request classifier — small, fast model (or embedding + nearest-neighbor) running inline, <5ms p99
- Routing policies — declarative rules mapping classifier outputs to backends (coding → code model, simple Q&A → small model, safety-sensitive → guardrailed model)
- Fallback and override — below-threshold confidence falls back to default route; explicit overrides via request headers/metadata
- Observability — log classification decisions, expose metrics on route distribution
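The policy-plus-fallback flow above can be sketched as a small routing function: the classifier emits a (label, confidence) pair, a declarative table maps labels to backends, and anything below threshold takes the default route. Names here are illustrative, not SMG's actual types:

```rust
// Sketch: threshold-gated semantic routing with a default fallback route.

pub struct RoutePolicy {
    pub threshold: f32,         // below this confidence, use the default
    pub default_backend: String,
}

/// Map classifier output to a backend; fall back when confidence is low
/// or the label has no declared route.
pub fn route(policy: &RoutePolicy, label: &str, confidence: f32, table: &[(&str, &str)]) -> String {
    if confidence >= policy.threshold {
        if let Some((_, backend)) = table.iter().find(|(l, _)| *l == label) {
            return backend.to_string();
        }
    }
    policy.default_backend.clone()
}
```

Explicit header/metadata overrides would simply bypass this function before classification runs.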
Deliverables:
6. Mixture of Vendors (MoV)
Goal: Route requests across multiple cloud/API vendors for the same model — A/B testing, cost optimization, quality comparison, semantic routing across providers.
Vision: The same model (e.g. GPT-5) may be served by multiple vendors with different pricing, latency, and quality. SMG becomes a vendor-agnostic gateway that intelligently routes across them.
Target vendors:
| Vendor | Auth Mechanism | Complexity |
| --- | --- | --- |
| TogetherAI, OpenAI, xAI, DeepSeek, Anthropic | API key | Low |
| GCP GenAI | OAuth 2.0 / API key | Medium |
| Azure, GCP Vertex | OAuth 2.0 / service account | Medium-High |
| AWS Bedrock | SigV4 (IAM) | High |
| OCI GenAI | RSA-SHA256 request signing (IAM) | High |
Two core problems to solve:
- Generic credential crate (smg-credential) — standalone Rust crate for all auth: API key injection, OIDC/OAuth 2.0 token fetch/refresh, SigV4, RSA-SHA256, GCP service account exchange. Key challenge: many cloud providers lack Rust SDKs — signing/auth must be implemented from scratch.
- Protocol adapters — map vendor-specific request/response formats to SMG's internal representation. Tedious but well-defined. Each vendor gets an adapter module in protocols/.
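One way the credential crate could unify these five auth mechanisms is behind a single signing trait, with one implementation per vendor family. This is a sketch of an assumed shape, not the actual smg-credential API; only the simplest variant (static API key) is shown:

```rust
// Sketch: a vendor-agnostic credential trait. Each auth mechanism
// (API key, OAuth 2.0, SigV4, RSA-SHA256 signing) gets an implementation.

pub struct SignedRequest {
    pub headers: Vec<(String, String)>,
}

pub trait CredentialProvider {
    /// Produce the auth headers for one outbound request. SigV4 and OCI
    /// signing would also need the body hash and a timestamp here.
    fn sign(&self, method: &str, url: &str) -> SignedRequest;
}

/// Simplest case: static API key injection (TogetherAI, OpenAI, xAI, ...).
pub struct ApiKey(pub String);

impl CredentialProvider for ApiKey {
    fn sign(&self, _method: &str, _url: &str) -> SignedRequest {
        SignedRequest {
            headers: vec![("authorization".into(), format!("Bearer {}", self.0))],
        }
    }
}
```

Token-based providers (OAuth 2.0, GCP service account exchange) would hold a refresh handle behind the same trait, so routing code never sees vendor differences.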
Routing possibilities (once auth + adapters are solved):
- A/B testing across vendors for the same model
- Cost-based routing — cheapest provider for a given quality threshold
- Latency-based routing — fastest provider for a given region
- Quality-based routing — route based on eval scores per vendor
- Semantic routing (Section 5) combined with vendor selection
- Failover — automatic fallback to alternate vendor on errors
Deliverables:
- smg-credential crate
7. Data Connector Redesign
Goal: Redesign the data_connector crate to support flexible schemas and pluggable storage logic via WASM plugins.
Problem: Current implementation has hardcoded database schemas duplicated across backends (Postgres, Oracle, Redis, Memory). No way to customize table/column names, add fields, or inject custom storage behavior without forking.
Architecture (3 layers):
- Schema Registry — centralized schema definition replacing inline SQL scattered across backends. Schema defined once, shared by all backends. Supports custom fields, column renaming, and versioned migrations.
- Schema Configuration — YAML-based config allowing users to customize table names, column names, and add custom fields without recompilation.
- WASM Plugin System — users can implement custom storage behavior (audit logging, encryption, multi-tenancy, custom query patterns) via WASM plugins written in any language. Runs on wasmtime with host functions for storage operations.
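The interaction between layers 1 and 2 — a canonical schema plus user overrides resolved without recompilation — can be sketched as follows. The types and field names are illustrative, and versioned migrations are omitted:

```rust
// Sketch: resolve effective table/column names from a registry default
// plus user-supplied overrides (as would be loaded from the YAML config).
use std::collections::HashMap;

pub struct TableSchema {
    pub default_table: &'static str,
    pub default_columns: Vec<&'static str>,
}

pub struct SchemaOverrides {
    pub table_name: Option<String>,
    pub column_renames: HashMap<String, String>,
    pub custom_fields: Vec<String>,
}

/// Apply overrides: optional table rename, per-column renames,
/// then appended custom fields.
pub fn resolve(schema: &TableSchema, ov: &SchemaOverrides) -> (String, Vec<String>) {
    let table = ov.table_name.clone().unwrap_or_else(|| schema.default_table.to_string());
    let mut cols: Vec<String> = schema
        .default_columns
        .iter()
        .map(|c| ov.column_renames.get(*c).cloned().unwrap_or_else(|| c.to_string()))
        .collect();
    cols.extend(ov.custom_fields.iter().cloned());
    (table, cols)
}
```

Because every backend renders SQL (or key layouts) from this one resolved schema, the duplication across Postgres/Oracle/Redis/Memory disappears.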
Deliverables:
8. Metrics Service & Worker Load Monitoring
Goal: Unified, event-driven metrics architecture — single source of truth for all worker metrics, decoupled from routing, with real-time response piggyback and pluggable collection.
Problem: Current state has fragmented metrics handling — response piggyback tightly coupled to request flow, LoadMonitor polling with stale data, no unified real-time view of worker performance.
Architecture:
- Metrics Store — per-worker metrics cache with change detection, backed by DashMap
- Event Bus — broadcast channel for fan-out to subscribers (policy engine, load balancer, observability). Non-blocking, per-subscriber bounded queues with health monitoring
- Response Piggyback — real-time metrics extracted from response headers (HTTP X-SGLang-*) and gRPC metadata on every chunk. Zero additional network cost
- Direct Scraper — background polling of worker /metrics endpoints as fallback
- Prometheus Scraper — query Prometheus for custom metrics used in CEL policies
Subscribers:
- Policy Engine — subscribes to metrics events, evaluates CEL policies on demand, tiered fallback (fresh → stale → round-robin → 503)
- Load Balancer — maintains local cache updated by events, no more polling. Uses kv_cache_tokens as primary metric when fresh, falls back to in_flight * avg_tokens
- Observability — exports all worker metrics as Prometheus gauges
Key design decisions:
- Monotonic sequence numbers per worker to prevent out-of-order updates
- Source priority: piggyback (100) > direct scrape (50) > Prometheus (25)
- Snapshot-then-subscribe API for late joiners (no cold start gap)
- Shared immutable snapshots with Arc for O(1) memory per worker regardless of subscriber count
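The sequence-number and source-priority rules above combine into one update-acceptance predicate. A minimal sketch, with field names assumed for illustration:

```rust
// Sketch: should the metrics store accept an incoming worker update?
// Newer sequence numbers win; ties go to the higher-priority source;
// out-of-order updates are dropped.

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Source { Piggyback, DirectScrape, Prometheus }

impl Source {
    fn priority(self) -> u8 {
        match self {
            Source::Piggyback => 100,
            Source::DirectScrape => 50,
            Source::Prometheus => 25,
        }
    }
}

#[derive(Clone, Copy, Debug)]
pub struct WorkerMetrics {
    pub seq: u64,       // monotonic per worker
    pub source: Source,
    pub kv_cache_tokens: u64,
}

pub fn accept(current: Option<&WorkerMetrics>, incoming: &WorkerMetrics) -> bool {
    match current {
        None => true,
        Some(cur) if incoming.seq > cur.seq => true,
        Some(cur) if incoming.seq == cur.seq => incoming.source.priority() > cur.source.priority(),
        _ => false, // stale / out-of-order: drop
    }
}
```

Accepted updates would then be published on the event bus; rejected ones never reach subscribers, which is what keeps the load balancer's cache monotonic.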
Deliverables:
- MetricsService crate (store, event bus, staleness detection)
9. Custom Metrics Load Balancing
Goal: Allow users to define custom routing policies using CEL expressions over arbitrary metrics — GPU memory, queue depth, business metrics — with sub-millisecond routing overhead.
Problem: Current load balancing supports only predefined policies (round-robin, least-connections, bucket-mode). Users cannot route based on GPU utilization from DCGM exporters, factor in custom business metrics, or apply hard constraints like "exclude backends above 95% memory."
Architecture:
- CEL Policy Engine — one policy per model. CEL expressions compiled at creation time (~1-5ms), executed per-request in ~100ns-1us. Supports scoring expressions and hard constraints
- Metrics Sources — pluggable: Prometheus server (PromQL queries), direct backend /metrics scraping, OpenTelemetry (future). Background scraping with configurable interval
- Control Plane API — REST API for CRUD on policies and metrics sources. Validation at creation time (CEL compilation, connectivity test, dry-run). Simulation endpoint for testing routing decisions
- Storage — policies persisted in storage backend (Memory, Postgres, Oracle). Per-replica cache with polling refresh + jitter
CEL capabilities:
- Arithmetic, comparison, logical, ternary operators
- Access to custom metrics and request context (request.max_tokens, request.stream, request.prompt_tokens)
- Hard constraints (e.g., gpu_mem_used / gpu_mem_total < 0.95) with fail-safe exclusion
- Scoring modes: lower_is_better / higher_is_better
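The constraint-then-score selection described above can be sketched as follows; a plain Rust closure stands in for the compiled CEL expression, and the backend fields are illustrative:

```rust
// Sketch: apply the hard constraint, then pick the best-scoring backend
// (lower_is_better mode shown). Every-backend-excluded returns None,
// which would map to the tiered fallback (stale -> round-robin -> 503).

pub struct Backend {
    pub name: &'static str,
    pub gpu_mem_used: f64,
    pub gpu_mem_total: f64,
    pub queue_depth: f64,
}

pub fn pick<'a>(
    backends: &'a [Backend],
    constraint: impl Fn(&Backend) -> bool, // stand-in for a compiled CEL constraint
    score: impl Fn(&Backend) -> f64,       // stand-in for a compiled CEL scoring expr
) -> Option<&'a Backend> {
    backends
        .iter()
        .filter(|b| constraint(b))
        .min_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
}
```

With compiled expressions executing in the ~100ns-1us range, this filter-and-scan stays comfortably inside the sub-millisecond routing budget for realistic backend counts.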
Deliverables:
- REST API (/v1/metrics-sources, /v1/policies)
Part II — Infrastructure & CI
10. Image Building
Goal: Reproducible, automated container image builds for SMG bundled with each supported engine.
Scope:
- Support building from a specific commit or branch of each engine
- 9 base configurations: 3 engines (vLLM, SGLang, TensorRT-LLM) x 3 sub-versions each
- H200 runner support for GPU-accelerated builds
- Publish engine-specific images (e.g. smg:v1.1.0-sglang-0.4.7) plus a standalone SMG-only image
Deliverables:
- Engine-specific images (smg:<smg-version>-<engine>-<engine-version>)
11. Version Matrix & CI
Goal: Clear, tested compatibility matrix between SMG releases and engine versions, validated automatically.
Scope:
- Support contract — for each SMG release, publish supported engine versions (and vice versa)
- PR-level CI — integration tests against supported engine versions on every PR (GPU ARC-Runner)
- Release-level CI — gate releases on full matrix pass (perf regression, multimodal, tool dispatch)
- Auto-generate matrix from a machine-readable source of truth (SUPPORTED_ENGINES.yaml)
CI modes:
| Mode | Trigger | Scope |
| --- | --- | --- |
| PR test | On PR | SMG + runtime combo on GPU ARC-Runner |
| Matrix bench (manual) | Manual | Benchmark across all engines |
| Matrix bench (auto) | New runtime image build | Auto-verify on new images |
| Nightly | Cron | Top models across all engines |
Test suites:
| Suite | Coverage |
| --- | --- |
| Function calling | Lighter BFCL |
| Performance | genai-bench (PR + nightly) |
| Quality | GPQA / Livebench |
Deliverables:
- SUPPORTED_ENGINES.yaml
12. Nightly Chaos & Load Testing
Goal: Prove 100% stability of SMG under real-world conditions — mixed backends, mixed models, extreme load, and injected failures.
Environment:
- SMG routing across multiple backends simultaneously:
- Self-hosted: SGLang, vLLM, TensorRT-LLM each serving different models
- 3rd-party: OpenAI, xAI, and other vendor endpoints
- Realistic model mix (chat, code, multimodal, reasoning models)
Chaos testing:
- Random backend kills / restarts mid-request
- Network partition injection between SMG and backends
- Backend latency spikes and timeout simulation
- Partial backend failures (some models healthy, some not)
- Certificate / auth token expiry during traffic
- Health check flapping
Extreme load testing:
- Sustained high-throughput traffic across all backends simultaneously
- Burst traffic (sudden 10x spike)
- Long-running streaming requests under load
- Connection exhaustion / backpressure scenarios
- Mixed workloads (short completions + long streaming + tool calls + multimodal)
Validation criteria:
- Zero dropped requests that should have succeeded
- Correct failover behavior on backend failures
- No memory leaks or resource exhaustion over extended runs
- Graceful degradation under overload (proper error responses, no panics)
- Metrics and logs remain accurate under stress
Deliverables: