
SMG Roadmap #511

@slin1237


Part I — Features

1. Day-0 Model Support

Goal: Serve new models on launch day with minimal friction.

What SMG provides:

  • Model onboarding checklist and validation harness (routing, health checks, streaming, tool-calling e2e)
  • Automated compatibility tests triggered on new model registration
  • Documentation template for model-specific quirks (special stop tokens, non-standard tool formats)

Deliverables:

  • Model onboarding guide — step-by-step checklist for enabling a new model
  • Validation harness — automated test suite runnable against a staged deployment
  • Model metadata schema — standardized format for declaring model capabilities
  • Launch readiness gate — CI job that blocks model GA if integration tests fail
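
As a sketch of what the model metadata schema could look like (field names here are illustrative, not a finalized schema), a new model might be declared as:

```yaml
# Hypothetical model metadata entry (illustrative field names only).
model: example-org/new-model-7b
capabilities:
  streaming: true
  tool_calling: true
  multimodal: [image]
quirks:
  stop_tokens: ["<|eot|>"]        # model-specific stop tokens
  tool_format: non_standard_json  # documented deviation from the standard format
validation:
  harness_suites: [routing, health, streaming, tool_calling_e2e]
  launch_gate: block_on_failure   # CI blocks model GA if any suite fails
```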

2. OSS Engine Support

Goal: Full, production-quality coverage across SGLang, vLLM, and TensorRT-LLM over both gRPC and HTTP.

Key decisions:

  • gRPC is primary — tokenization, function call parsing, and reasoning parser features live here
  • HTTP is fallback — will NOT get tokenization/func-call/reasoning-parser features

Scope:

  • Interface parity — close gaps where an engine only works over one transport
  • Feature coverage matrix — document which features (streaming, tool calling, multimodality, structured output, prefix caching) are supported per engine x transport

Deliverables:

  • gRPC + HTTP parity audit
  • Feature coverage matrix (feature x engine x transport)

3. Multimodality

Goal: Close gaps in multimodal support. High priority — at least one model and one engine must work end-to-end, or VLM day-0 support is blocked.

Current state:

  • Image input works for select models (LLaVA, Qwen-VL variants) over HTTP only
  • No VLM support in gRPC mode
  • Gaps in audio, video, and mixed-modality inputs

Priorities:

  1. Audit current coverage — modality x model x engine x transport
  2. Image input hardening — cross-engine reliability, content-type handling, base64/URL, size/format validation
  3. Audio and video — evaluate demand and engine support, define phased plan
  4. Multimodal output — image generation, TTS if applicable
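
The image-input hardening step (priority 2) could be sketched as follows. This is an illustrative Rust sketch, not the actual SMG implementation — the function and type names are hypothetical — showing the kind of base64/URL, size, and format validation described above:

```rust
// Hypothetical image-input validation (illustrative names, not SMG APIs):
// accept either a base64 data URL or an http(s) URL, enforce a size cap,
// and allow only known image formats.

#[derive(Debug, PartialEq)]
enum ImageRef {
    DataUrl { mime: String, bytes_b64: String },
    RemoteUrl(String),
}

const ALLOWED_MIMES: &[&str] = &["image/png", "image/jpeg", "image/webp"];
const MAX_B64_LEN: usize = 20 * 1024 * 1024; // illustrative cap (~15 MiB decoded)

fn validate_image_input(input: &str) -> Result<ImageRef, String> {
    if let Some(rest) = input.strip_prefix("data:") {
        let (mime, payload) = rest
            .split_once(";base64,")
            .ok_or("data URL must be base64-encoded")?;
        if !ALLOWED_MIMES.contains(&mime) {
            return Err(format!("unsupported image format: {mime}"));
        }
        if payload.len() > MAX_B64_LEN {
            return Err("image exceeds size limit".into());
        }
        Ok(ImageRef::DataUrl { mime: mime.to_string(), bytes_b64: payload.to_string() })
    } else if input.starts_with("http://") || input.starts_with("https://") {
        Ok(ImageRef::RemoteUrl(input.to_string()))
    } else {
        Err("expected data URL or http(s) URL".into())
    }
}

fn main() {
    assert!(validate_image_input("data:image/png;base64,iVBORw0KGgo=").is_ok());
    assert!(validate_image_input("https://example.com/cat.jpg").is_ok());
    assert!(validate_image_input("data:image/tiff;base64,AAAA").is_err());
}
```

A cross-engine test suite would then run the same validated inputs against each engine × transport combination from the coverage audit.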

Deliverables:

  • Multimodality support matrix
  • Image input hardening (bug fixes, test coverage, cross-engine validation)
  • Audio/video phased roadmap
  • Multimodal test suite

4. MCP Semantic Search

Goal: Efficient tool discovery across MCP servers with hundreds/thousands of registered tools.

Problem: As MCP adoption grows, linear listing and exact-name matching won't scale. Models need to find tools by intent, not just name.

Scope:

  • Tool registry indexing — discover and index tools from all connected MCP servers at startup + periodic refresh
  • Semantic search — embed tool descriptions into vector space, expose search API for natural-language queries returning ranked matches
  • Integration with tool dispatch — when a tool call doesn't match an exact name, optionally fall back to semantic search with configurable confidence threshold
  • Multi-server aggregation — search spans all servers; handle namespace conflicts via precedence rules or server.tool_name namespacing
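
The core ranking step can be sketched in a few lines. This is an illustrative Rust sketch (not the actual SMG implementation): tool descriptions are pre-embedded, a query embedding is compared by cosine similarity, and results above a confidence threshold come back ranked, with `server.tool_name` namespacing as described above:

```rust
// Illustrative sketch of semantic tool search: rank MCP tools by cosine
// similarity between a query embedding and precomputed description
// embeddings. Names are hypothetical.

struct IndexedTool {
    qualified_name: String, // "server.tool_name" to avoid cross-server conflicts
    embedding: Vec<f32>,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return tools scoring at or above `threshold`, best match first.
fn semantic_search(query: &[f32], index: &[IndexedTool], threshold: f32) -> Vec<(String, f32)> {
    let mut hits: Vec<(String, f32)> = index
        .iter()
        .map(|t| (t.qualified_name.clone(), cosine(query, &t.embedding)))
        .filter(|(_, score)| *score >= threshold)
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    hits
}

fn main() {
    let index = vec![
        IndexedTool { qualified_name: "fs.read_file".into(), embedding: vec![1.0, 0.0] },
        IndexedTool { qualified_name: "web.search".into(), embedding: vec![0.0, 1.0] },
    ];
    let query = vec![0.9, 0.1]; // query embedding closer to fs.read_file
    let hits = semantic_search(&query, &index, 0.5);
    assert_eq!(hits[0].0, "fs.read_file");
}
```

In the dispatch integration, this search would only run when an exact tool-name match fails, with the threshold taken from the configurable resolution policy.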

Deliverables:

  • Tool registry with embedding index
  • Semantic search API
  • Multi-server MCP config schema
  • Configurable resolution policy (thresholds, fallback, conflict handling)

5. Semantic Routing

Goal: Lightweight classification-based routing — dispatch requests to different backends based on semantic content, not just static rules.

Scope:

  • Request classifier — small, fast model (or embedding + nearest-neighbor) running inline, <5ms p99
  • Routing policies — declarative rules mapping classifier outputs to backends (coding → code model, simple Q&A → small model, safety-sensitive → guardrailed model)
  • Fallback and override — below-threshold confidence falls back to default route; explicit overrides via request headers/metadata
  • Observability — log classification decisions, expose metrics on route distribution
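
A declarative routing policy along these lines might look like the following (a hypothetical schema, for illustration — not the final config format):

```yaml
# Hypothetical semantic routing policy (illustrative schema).
classifier:
  type: embedding_nn          # embedding + nearest-neighbor, inline, <5ms p99 target
  confidence_threshold: 0.7
routes:
  coding: code-model-backend
  simple_qa: small-model-backend
  safety_sensitive: guardrailed-backend
fallback: default-backend     # used when classifier confidence is below threshold
override_header: x-smg-route  # explicit per-request override
```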

Deliverables:

  • Pluggable classifier interface with default lightweight implementation
  • Declarative routing policy config
  • Fallback and override logic
  • Routing observability (metrics + trace annotations)

6. Mixture of Vendors (MoV)

Goal: Route requests across multiple cloud/API vendors for the same model — A/B testing, cost optimization, quality comparison, semantic routing across providers.

Vision: The same model (e.g. GPT-5) may be served by multiple vendors with different pricing, latency, and quality. SMG becomes a vendor-agnostic gateway that intelligently routes across them.

Target vendors:

Vendor                                        Auth Mechanism                     Complexity
TogetherAI, OpenAI, xAI, DeepSeek, Anthropic  API key                            Low
GCP GenAI                                     OAuth 2.0 / API key                Medium
Azure, GCP Vertex                             OAuth 2.0 / service account        Medium-High
AWS Bedrock                                   SigV4 (IAM)                        High
OCI GenAI                                     RSA-SHA256 request signing (IAM)   High

Two core problems to solve:

  1. Generic credential crate (smg-credential) — standalone Rust crate for all auth: API key injection, OIDC/OAuth 2.0 token fetch/refresh, SigV4, RSA-SHA256, GCP service account exchange. Key challenge: many cloud providers lack Rust SDKs — signing/auth must be implemented from scratch.

  2. Protocol adapters — map vendor-specific req/resp formats to SMG's internal representation. Tedious but well-defined. Each vendor gets an adapter module in protocols/.
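
One plausible shape for the smg-credential crate is a single trait per vendor auth scheme, so the gateway itself stays auth-agnostic. The sketch below is hypothetical (trait and type names are illustrative), showing the simplest API-key case; OAuth 2.0, SigV4, and RSA-SHA256 providers would implement the same trait:

```rust
// Hypothetical shape for smg-credential (illustrative names).

/// A request about to be sent to a vendor; the provider mutates it in place.
struct OutboundRequest {
    headers: Vec<(String, String)>,
}

trait CredentialProvider {
    /// Attach auth material: a static header for API-key vendors, a freshly
    /// refreshed bearer token for OAuth 2.0, or a full SigV4 / RSA-SHA256
    /// signature over the canonical request for AWS / OCI.
    fn authorize(&self, req: &mut OutboundRequest) -> Result<(), String>;
}

/// Simplest case: static API key (TogetherAI, OpenAI, xAI, DeepSeek, Anthropic).
struct ApiKeyProvider {
    header: String,
    key: String,
}

impl CredentialProvider for ApiKeyProvider {
    fn authorize(&self, req: &mut OutboundRequest) -> Result<(), String> {
        req.headers.push((self.header.clone(), self.key.clone()));
        Ok(())
    }
}

fn main() {
    let p = ApiKeyProvider { header: "Authorization".into(), key: "Bearer sk-test".into() };
    let mut req = OutboundRequest { headers: vec![] };
    p.authorize(&mut req).unwrap();
    assert_eq!(req.headers[0].0, "Authorization");
}
```

The signing-heavy providers (SigV4, OCI request signing) are where the from-scratch work lands, since they must canonicalize and sign the full request rather than just add a header.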

Routing possibilities (once auth + adapters are solved):

  • A/B testing across vendors for the same model
  • Cost-based routing — cheapest provider for a given quality threshold
  • Latency-based routing — fastest provider for a given region
  • Quality-based routing — route based on eval scores per vendor
  • Semantic routing (Section 5) combined with vendor selection
  • Failover — automatic fallback to alternate vendor on errors

Deliverables:

  • smg-credential crate
  • Vendor protocol adapters
  • Vendor routing config (declarative)
  • SMG SDK for applications

7. Data Connector Redesign

Goal: Redesign the data_connector crate to support flexible schemas and pluggable storage logic via WASM plugins.

Problem: Current implementation has hardcoded database schemas duplicated across backends (Postgres, Oracle, Redis, Memory). No way to customize table/column names, add fields, or inject custom storage behavior without forking.

Architecture (3 layers):

  1. Schema Registry — centralized schema definition replacing inline SQL scattered across backends. Schema defined once, shared by all backends. Supports custom fields, column renaming, and versioned migrations.
  2. Schema Configuration — YAML-based config allowing users to customize table names, column names, and add custom fields without recompilation.
  3. WASM Plugin System — users can implement custom storage behavior (audit logging, encryption, multi-tenancy, custom query patterns) via WASM plugins written in any language. Runs on wasmtime with host functions for storage operations.
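
A schema configuration in this style might look like the following (hypothetical keys, for illustration only):

```yaml
# Hypothetical schema configuration (illustrative keys).
tables:
  conversations:
    name: chat_conversations      # rename the default table
    columns:
      user_id: tenant_user_id     # rename a column
    custom_fields:
      - name: region
        type: text
plugins:
  - path: ./audit_logger.wasm     # WASM plugin invoked via wasmtime host functions
    hooks: [before_write]
```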

Deliverables:

  • Schema registry crate with centralized definitions
  • YAML schema configuration format
  • WASM plugin runtime integration (wasmtime)
  • Plugin SDK for custom storage backends
  • Migration framework with version tracking
  • Updated built-in backends using the new schema registry

8. Metrics Service & Worker Load Monitoring

Goal: Unified, event-driven metrics architecture — single source of truth for all worker metrics, decoupled from routing, with real-time response piggyback and pluggable collection.

Problem: Current state has fragmented metrics handling — response piggyback tightly coupled to request flow, LoadMonitor polling with stale data, no unified real-time view of worker performance.

Architecture:

  • Metrics Store — per-worker metrics cache with change detection, backed by DashMap
  • Event Bus — broadcast channel for fan-out to subscribers (policy engine, load balancer, observability). Non-blocking, per-subscriber bounded queues with health monitoring
  • Response Piggyback — real-time metrics extracted from response headers (HTTP X-SGLang-*) and gRPC metadata on every chunk. Zero additional network cost
  • Direct Scraper — background polling of worker /metrics endpoints as fallback
  • Prometheus Scraper — query Prometheus for custom metrics used in CEL policies

Subscribers:

  • Policy Engine — subscribes to metrics events, evaluates CEL policies on demand, tiered fallback (fresh → stale → round-robin → 503)
  • Load Balancer — maintains local cache updated by events, no more polling. Uses kv_cache_tokens as primary metric when fresh, falls back to in_flight * avg_tokens
  • Observability — exports all worker metrics as Prometheus gauges

Key design decisions:

  • Monotonic sequence numbers per worker to prevent out-of-order updates
  • Source priority: piggyback (100) > direct scrape (50) > Prometheus (25)
  • Snapshot-then-subscribe API for late joiners (no cold start gap)
  • Shared immutable snapshots with Arc for O(1) memory per worker regardless of subscriber count
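
The first two design decisions combine into a single update rule, sketched below (illustrative names, not the actual implementation): an incoming metrics update is applied only if its sequence number advances, or if it carries the same sequence from a higher-priority source:

```rust
// Sketch of the "monotonic sequence + source priority" update rule
// (illustrative names). Discriminants mirror the priorities above:
// piggyback (100) > direct scrape (50) > Prometheus (25).

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Source {
    Prometheus = 25,
    DirectScrape = 50,
    Piggyback = 100,
}

#[derive(Debug)]
struct WorkerMetrics {
    seq: u64,
    source: Source,
    kv_cache_tokens: u64,
}

fn should_apply(current: &WorkerMetrics, incoming: &WorkerMetrics) -> bool {
    incoming.seq > current.seq
        || (incoming.seq == current.seq && incoming.source > current.source)
}

fn main() {
    let cur = WorkerMetrics { seq: 10, source: Source::DirectScrape, kv_cache_tokens: 4096 };
    // Out-of-order update is rejected regardless of source.
    let stale = WorkerMetrics { seq: 9, source: Source::Prometheus, kv_cache_tokens: 0 };
    assert!(!should_apply(&cur, &stale));
    // Same sequence, but higher-priority piggyback data wins.
    let piggy = WorkerMetrics { seq: 10, source: Source::Piggyback, kv_cache_tokens: 5000 };
    assert!(should_apply(&cur, &piggy));
}
```

The store would apply this check under the per-worker DashMap entry, then publish the new snapshot on the event bus.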

Deliverables:

  • MetricsService crate (store, event bus, staleness detection)
  • Response piggyback collectors (HTTP headers, gRPC metadata)
  • Direct and Prometheus scrapers
  • Policy engine with CEL support
  • Event-driven load balancer
  • Prometheus metrics exporter (40+ worker metrics)
  • Legacy LoadMonitor removal

9. Custom Metrics Load Balancing

Goal: Allow users to define custom routing policies using CEL expressions over arbitrary metrics — GPU memory, queue depth, business metrics — with sub-millisecond routing overhead.

Problem: Current load balancing supports only predefined policies (round-robin, least-connections, bucket-mode). Users cannot route based on GPU utilization from DCGM exporters, factor in custom business metrics, or apply hard constraints like "exclude backends above 95% memory."

Architecture:

  • CEL Policy Engine — one policy per model. CEL expressions compiled at creation time (~1-5ms), executed per-request in ~100ns-1us. Supports scoring expressions and hard constraints
  • Metrics Sources — pluggable: Prometheus server (PromQL queries), direct backend /metrics scraping, OpenTelemetry (future). Background scraping with configurable interval
  • Control Plane API — REST API for CRUD on policies and metrics sources. Validation at creation time (CEL compilation, connectivity test, dry-run). Simulation endpoint for testing routing decisions
  • Storage — policies persisted in storage backend (Memory, Postgres, Oracle). Per-replica cache with polling refresh + jitter

CEL capabilities:

  • Arithmetic, comparison, logical, ternary operators
  • Access to custom metrics, request context (request.max_tokens, request.stream, request.prompt_tokens)
  • Hard constraints (e.g., gpu_mem_used / gpu_mem_total < 0.95) with fail-safe exclusion
  • Scoring modes: lower_is_better / higher_is_better
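
Putting these pieces together, a policy might be declared like this (a hypothetical request body — the schema is illustrative, though the CEL expressions are of the kind described above):

```yaml
# Hypothetical policy body for POST /v1/policies (illustrative schema).
model: example-model
scoring:
  expression: "gpu_mem_used / gpu_mem_total + 0.1 * queue_depth"
  mode: lower_is_better
constraints:
  - expression: "gpu_mem_used / gpu_mem_total < 0.95"  # hard exclusion, fail-safe
metrics_source: prometheus-main
```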

Deliverables:

  • CEL policy engine with compilation caching
  • Metrics source CRUD API (/v1/metrics-sources, /v1/policies)
  • Prometheus and direct scraper implementations
  • Policy simulation and validation endpoints
  • Storage trait extension for policies and metrics sources
  • Observability (routing scores, constraint violations, cache age, CEL eval duration)

Part II — Infrastructure & CI

10. Image Building

Goal: Reproducible, automated container image builds for SMG bundled with each supported engine.

Scope:

  • Support building from a specific commit or branch of each engine
  • 9 base configurations: 3 engines (vLLM, SGLang, TensorRT-LLM) x 3 sub-versions each
  • H200 runner support for GPU-accelerated builds
  • Publish engine-specific images (e.g. smg:v1.1.0-sglang-0.4.7) plus standalone SMG-only image

Deliverables:

  • Multi-stage Dockerfiles per engine with version-parameterized builds
  • Build automation workflow (GitHub Actions — release + manual trigger)
  • Image naming convention (smg:<smg-version>-<engine>-<engine-version>)
  • H200 runner integration

11. Version Matrix & CI

Goal: Clear, tested compatibility matrix between SMG releases and engine versions, validated automatically.

Scope:

  • Support contract — for each SMG release, publish supported engine versions (and vice versa)
  • PR-level CI — integration tests against supported engine versions on every PR (GPU ARC-Runner)
  • Release-level CI — gate releases on full matrix pass (perf regression, multimodal, tool dispatch)
  • Auto-generate matrix from machine-readable source of truth (SUPPORTED_ENGINES.yaml)
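
SUPPORTED_ENGINES.yaml could take a shape like the following (a sketch only — the real schema is still to be defined, and engine versions other than the sglang-0.4.7 example used elsewhere in this roadmap are placeholders):

```yaml
# Sketch of SUPPORTED_ENGINES.yaml (illustrative structure).
smg: v1.1.0
engines:
  sglang:
    supported: ["0.4.7"]
    transports: [grpc, http]
  vllm:
    supported: ["x.y.z"]   # placeholder
    transports: [grpc, http]
  tensorrt-llm:
    supported: ["x.y.z"]   # placeholder
    transports: [grpc, http]
```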

CI modes:

Mode                    Trigger                   Scope
PR test                 On PR                     SMG + runtime combo on GPU ARC-Runner
Matrix bench (manual)   Manual                    Benchmark across all engines
Matrix bench (auto)     New runtime image build   Auto-verify on new images
Nightly                 Cron                      Top models across all engines

Test suites:

Suite              Coverage
Function calling   Lighter BFCL
Performance        genai-bench (PR + nightly)
Quality            GPQA / Livebench

Deliverables:

  • SUPPORTED_ENGINES.yaml
  • CI matrix workflow (GitHub Actions)
  • README compatibility table / badge
  • Version support policy doc (rolling window, deprecation timeline)

12. Nightly Chaos & Load Testing

Goal: Prove 100% stability of SMG under real-world conditions — mixed backends, mixed models, extreme load, and injected failures.

Environment:

  • SMG routing across multiple backends simultaneously:
    • Self-hosted: SGLang, vLLM, TensorRT-LLM each serving different models
    • 3rd-party: OpenAI, xAI, and other vendor endpoints
  • Realistic model mix (chat, code, multimodal, reasoning models)

Chaos testing:

  • Random backend kills / restarts mid-request
  • Network partition injection between SMG and backends
  • Backend latency spikes and timeout simulation
  • Partial backend failures (some models healthy, some not)
  • Certificate / auth token expiry during traffic
  • Health check flapping

Extreme load testing:

  • Sustained high-throughput traffic across all backends simultaneously
  • Burst traffic (sudden 10x spike)
  • Long-running streaming requests under load
  • Connection exhaustion / backpressure scenarios
  • Mixed workloads (short completions + long streaming + tool calls + multimodal)
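
The load profiles above might be expressed declaratively, e.g. (a hypothetical config format; all values are illustrative):

```yaml
# Hypothetical load-generation profiles (illustrative format and values).
profiles:
  sustained:
    rps: 500
    duration: 4h
  burst:
    base_rps: 100
    spike_multiplier: 10   # sudden 10x spike
    spike_duration: 60s
  mixed:
    workloads:
      - {type: short_completion, weight: 0.4}
      - {type: long_streaming,   weight: 0.3}
      - {type: tool_call,        weight: 0.2}
      - {type: multimodal,       weight: 0.1}
```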

Validation criteria:

  • Zero dropped requests that should have succeeded
  • Correct failover behavior on backend failures
  • No memory leaks or resource exhaustion over extended runs
  • Graceful degradation under overload (proper error responses, no panics)
  • Metrics and logs remain accurate under stress

Deliverables:

  • Nightly CI job with full chaos + load suite
  • Test harness for spinning up multi-backend SMG environment
  • Chaos injection framework (backend kill, network partition, latency injection)
  • Load generation config (sustained, burst, mixed workload profiles)
  • Stability report (auto-generated per nightly run)
