| 
 | 1 | +# Dynamo integration with Inference Gateway  | 
 | 2 | + | 
 | 3 | +**Status**: Draft  | 
 | 4 | + | 
 | 5 | +**Authors**: [Biswa Panda](https://github.com/biswapanda)   | 
 | 6 | + | 
 | 7 | +**Category**: Architecture  | 
 | 8 | + | 
 | 9 | +**Replaces**: [Link of previous proposal if applicable]   | 
 | 10 | + | 
 | 11 | +**Replaced By**: [Link of previous proposal if applicable]   | 
 | 12 | + | 
 | 13 | +**Sponsor**: [Name of code owner or maintainer to shepherd process]  | 
 | 14 | + | 
 | 15 | +**Required Reviewers**: [Names of technical leads that are required for acceptance]  | 
 | 16 | + | 
 | 17 | +**Review Date**: [Date for review]  | 
 | 18 | + | 
 | 19 | +**Pull Request**: [Link to Pull Request of the Proposal itself]  | 
 | 20 | + | 
 | 21 | +**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]  | 
 | 22 | + | 
 | 23 | +# Summary  | 
 | 24 | + | 
 | 25 | +This proposal outlines the integration of Dynamo components with the Gateway API Inference Extension.   | 
 | 26 | + | 
 | 27 | +The current Inference Gateway is tightly coupled with the model's tokenizer. However, several use cases require:  | 
 | 28 | +1. **External Tokenization**: Preprocessing requests outside the gateway for specialized tokenization logic  | 
 | 29 | +2. **KV-Aware Routing**: Intelligent routing based on prefix cache status and token analysis  | 
 | 30 | +3. **Flexible side channel to offload tokens**: Support for both external cache and direct token passing strategies. This helps when transferring large blobs of tokens for VLMs (image/audio/video tokens)  | 
 | 31 | +4. **Unified Dynamo Architecture**: Consolidated deployment model for all processing components  | 
 | 32 | + | 
 | 33 | +## Terminology & Definitions  | 
 | 34 | + | 
 | 35 | +| Term | Definition |  | 
 | 36 | +| :---- | :---- |  | 
 | 37 | +| **Dynamo EPP** | Enhanced Endpoint Picker Protocol service with Dynamo integration |  | 
 | 38 | +| **Dynamo Processor** | Dynamo component responsible for request tokenization and preprocessing |  | 
 | 39 | +| **Dynamo Router** | Dynamo component responsible for the KV-aware routing strategy |  | 
 | 40 | +| **Token Cache / Side Channel** | External storage system for tokenized requests |  | 
 | 41 | + | 
 | 42 | +## Acronyms & Abbreviations  | 
 | 43 | + | 
 | 44 | +**EPP:** Endpoint Picker Protocol  | 
 | 45 | +**IGW:** Inference Gateway  | 
 | 46 | + | 
 | 47 | +## Goals  | 
 | 48 | + | 
 | 49 | +* Integrate Dynamo Processor for request preprocessing and tokenization  | 
 | 50 | +* Enable KV-aware routing through Dynamo Router Service  | 
 | 51 | +* Support flexible token management (cache keys vs direct values)  | 
 | 52 | +* Provide unified deployment architecture for all Dynamo components  | 
 | 53 | +* Maintain backward compatibility with existing EPP functionality  | 
 | 54 | + | 
 | 55 | +### Non Goals  | 
 | 56 | + | 
 | 57 | +* Replace existing EPP internal scheduling completely  | 
 | 58 | +* Modify core Gateway API specifications  | 
 | 59 | +* Change existing worker pod interfaces significantly  | 
 | 60 | + | 
 | 61 | +## Requirements  | 
 | 62 | + | 
 | 63 | +### REQ 1 External Processing Integration  | 
 | 64 | + | 
 | 65 | +Dynamo EPP (Endpoint picker) **MUST** support calling LLM processors for request preprocessing and tokenization while maintaining the existing ext-proc interface.  | 
 | 66 | + | 
 | 67 | +### REQ 2 Flexible Routing Strategies  | 
 | 68 | + | 
 | 69 | +The system **SHOULD** support both external routing (via Dynamo Router) and internal EPP scheduling based on request configuration.  | 
 | 70 | + | 
 | 71 | +### REQ 3 Token offloading capability   | 
 | 72 | + | 
 | 73 | +The system **SHOULD** support both external cache-based token storage and direct token value passing to worker pods.  | 
 | 74 | + | 
 | 75 | +### REQ 4 Unified Dynamo Architecture  | 
 | 76 | + | 
 | 77 | +Dynamo EPP and components (Processor, Router, Workers) **MUST** be deployable as a unified dynamo graph within Kubernetes.  | 
 | 78 | + | 
 | 79 | +### REQ 5 Maintain compatibility with Inference Gateway protocols  | 
 | 80 | + | 
 | 81 | +Dynamo EPP **MUST** remain compatible with the Inference Gateway protocols (ext-proc and the Endpoint Picker Protocol).  | 
 | 82 | + | 
 | 83 | +# Proposal  | 
 | 84 | + | 
 | 85 | +## Design Principles  | 
 | 86 | + | 
 | 87 | +## Architecture Overview  | 
 | 88 | + | 
 | 89 | +The updated architecture unifies Inference Gateway with Dynamo Graph deployment. See architecture diagram below for detailed component interactions.  | 
 | 90 | + | 
 | 91 | +  | 
 | 92 | + | 
 | 93 | +## Sequence Diagram  | 
 | 94 | + | 
 | 95 | +```mermaid  | 
 | 96 | +sequenceDiagram  | 
 | 97 | +    participant Client  | 
 | 98 | +    participant IGW as Inference Gateway<br/>(Envoy/kGateway)  | 
 | 99 | +    participant EPP as EPP Service<br/>(ext-proc/Endpoint Picker)  | 
 | 100 | +    participant ExtProcessor as (Dynamo) External LLM<br/>Processor  | 
 | 101 | +    participant Router as (Dynamo) Router  | 
 | 102 | +    participant TokenCache as External Token<br/>Cache/Side-channel  | 
 | 103 | +    participant Worker as (Dynamo) Worker<br/>Pod  | 
 | 104 | +
  | 
 | 105 | +    Note over Client,Worker: Token Handling & Routing Strategies  | 
 | 106 | +
  | 
 | 107 | +    %% Client Request  | 
 | 108 | +    Client->>IGW: POST /v1/chat/completions<br/>{"model": "llama-instruct",<br/> "messages": [...]<br/> }  | 
 | 109 | +
  | 
 | 110 | +    IGW->>EPP: ext-proc: RequestHeaders   | 
 | 111 | +    EPP->>EPP: Parse model name from request<br/>Set X-Gateway-Model-Name header  | 
 | 112 | +    IGW->>EPP: ext-proc: RequestBody  | 
 | 113 | +
  | 
 | 114 | +    %% Scenario 1: route=true (External routing via Router Service)  | 
 | 115 | +    alt route=true  | 
 | 116 | +        EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/>   "model": "llama-instruct",<br/>   "messages": [...],<br/>   "route": true<br/> },<br/> "headers": {"x-request-id": "req-123"}}  | 
 | 117 | +          | 
 | 118 | +        ExtProcessor->>ExtProcessor: Tokenize prompt (always)<br/>Generate token_ids: [1, 15043, 29892, ...]  | 
 | 119 | +          | 
 | 120 | +        ExtProcessor->>Router: POST /route<br/>{"token_ids": [1, 15043, 29892, ...]}  | 
 | 121 | +          | 
 | 122 | +        Router->>Router: Apply KV aware routing:<br/>- Check prefix cache<br/>- Apply custom routing strategy<br/>- Select optimal worker  | 
 | 123 | +          | 
 | 124 | +        Router-->>ExtProcessor: Worker selection:<br/>{"worker_address": "worker-3:8080"}  | 
 | 125 | +          | 
 | 126 | +        %% Token storage decision  | 
 | 127 | +        alt Using External Cache  | 
 | 128 | +            ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_abc123"<br/>Value: [1, 15043, 29892, ...]  | 
 | 129 | +              | 
 | 130 | +            ExtProcessor-->>EPP: Response with token_key:<br/>{"worker_address": "worker-3:8080",<br/> "token_key": "cache_key_abc123"}  | 
 | 131 | +              | 
 | 132 | +            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-3:8080"<br/>Set routing metadata  | 
 | 133 | +            EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_abc123"  | 
 | 134 | +              | 
 | 135 | +        else Direct Token Values  | 
 | 136 | +            ExtProcessor-->>EPP: Response with token_value:<br/>{"worker_address": "worker-2:8080",<br/> "token_value": "[1,15043,29892,...]"}  | 
 | 137 | +              | 
 | 138 | +            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-2:8080"<br/>Set routing metadata  | 
 | 139 | +            EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...]  | 
 | 140 | +        end  | 
 | 141 | +          | 
 | 142 | +    %% Scenario 2: route not specified (Internal routing)  | 
 | 143 | +    else route not specified  | 
 | 144 | +        Note over EPP: EPP schedules worker pods<br/>using internal logic  | 
 | 145 | +          | 
 | 146 | +        EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/>   "model": "llama-instruct",<br/>   "messages": [...]<br/> },<br/> "headers": {"x-request-id": "req-123"}}  | 
 | 147 | +          | 
 | 148 | +        ExtProcessor->>ExtProcessor: Tokenize prompt (always)  | 
 | 149 | +          | 
 | 150 | +        %% Allow external cache for internal routing too  | 
 | 151 | +        alt Store in External Cache  | 
 | 152 | +            ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_xyz789"<br/>Value: [1, 15043, 29892, ...]  | 
 | 153 | +              | 
 | 154 | +            ExtProcessor-->>EPP: Response with token_key:<br/>{"token_key": "cache_key_xyz789"}  | 
 | 155 | +              | 
 | 156 | +            EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Check pod availability<br/>- Select: worker-pool-1:8080  | 
 | 157 | +              | 
 | 158 | +            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata  | 
 | 159 | +            EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_xyz789"  | 
 | 160 | +              | 
 | 161 | +        else Direct Token Values  | 
 | 162 | +            ExtProcessor-->>EPP: Response with tokens:<br/>{"token_value": "[1,15043,29892,...]"}  | 
 | 163 | +              | 
 | 164 | +            EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Select: worker-pool-1:8080  | 
 | 165 | +              | 
 | 166 | +            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata  | 
 | 167 | +            EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...]  | 
 | 168 | +        end  | 
 | 169 | +    end  | 
 | 170 | +
  | 
 | 171 | +    EPP-->>IGW: ext-proc Response<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if using cache)<br/>Modified request body (if direct tokens)  | 
 | 172 | +
  | 
 | 173 | +    %% Request forwarding  | 
 | 174 | +    IGW->>Worker: HTTP Request to selected worker<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if applicable)<br/>Body: includes token_ids (if direct)  | 
 | 175 | +
  | 
 | 176 | +    alt Worker receives token_key in header  | 
 | 177 | +        Worker->>TokenCache: Fetch tokens<br/>Key from header: x-req-tokens-key  | 
 | 178 | +        TokenCache-->>Worker: Token array: [1,15043,29892,...]  | 
 | 179 | +    else Worker receives token_ids in body  | 
 | 180 | +        Worker->>Worker: Use token_ids directly<br/>from request body  | 
 | 181 | +    end  | 
 | 182 | +
  | 
 | 183 | +    Worker->>Worker: LLM Inference with tokens  | 
 | 184 | +    Worker-->>IGW: Response<br/>{"choices": [...], "usage": {...}}  | 
 | 185 | +
  | 
 | 186 | +    IGW-->>Client: Final Response  | 
 | 187 | +```  | 
 | 188 | + | 
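The two token-passing branches in the sequence diagram can be sketched as follows. This is a minimal illustration, not the EPP implementation: the `Mutation` type, the `build_mutation` function, and the `fallback_worker` argument (standing in for the result of EPP's internal scheduling) are hypothetical names, and the sketch assumes `token_value` arrives as a JSON-encoded list, as the diagram suggests.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Header names taken from the sequence diagram.
DEST_HEADER = "x-gateway-destination-endpoint"
TOKENS_KEY_HEADER = "x-req-tokens-key"


@dataclass
class Mutation:
    """Hypothetical container for what EPP returns to the gateway."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: Optional[Dict[str, Any]] = None  # None means the body is left unchanged


def build_mutation(
    request_body: Dict[str, Any],
    processor_response: Dict[str, Any],
    fallback_worker: str,
) -> Mutation:
    """Turn a processor response into header and body mutations.

    fallback_worker stands in for EPP's internal scheduling result and is
    used only when the processor did not return a worker_address.
    """
    worker = processor_response.get("worker_address") or fallback_worker
    mutation = Mutation(headers={DEST_HEADER: worker})

    if "token_key" in processor_response:
        # External cache: only the key travels with the request.
        mutation.headers[TOKENS_KEY_HEADER] = processor_response["token_key"]
    elif "token_value" in processor_response:
        # Direct values: assumed to be a JSON-encoded list of token ids.
        token_ids = json.loads(processor_response["token_value"])
        mutation.body = {**request_body, "token_ids": token_ids}
    return mutation
```

With a cache-based response the body stays untouched and the worker fetches tokens by key; with a direct response the body gains a `token_ids` field and no token header is set.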
 | 189 | +# Implementation Details  | 
 | 190 | + | 
 | 191 | +## Key Components  | 
 | 192 | + | 
 | 193 | +### Dynamo EPP (ext-proc)  | 
 | 194 | +- Integrates with Gateway via ext-proc protocol  | 
 | 195 | +- Parses model names and sets `X-Gateway-Model-Name` header  | 
 | 196 | +- Calls External LLM Processor for tokenization  | 
 | 197 | +- Handles both external and internal routing strategies  | 
 | 198 | +- Manages token key/value header and body modifications  | 
 | 199 | + | 
 | 200 | +### Dynamo Processor  | 
 | 201 | +- Performs request tokenization  | 
 | 202 | +- Supports both routing modes (external via Router, internal via EPP)  | 
 | 203 | +- Manages token transfer strategies (cache vs direct)  | 
 | 204 | +- Returns the worker selection and a request that is agnostic to the Dynamo backend framework (vLLM, TensorRT-LLM, SGLang)  | 
 | 205 | + | 
 | 206 | +### Dynamo Router Service  | 
 | 207 | +- Implements KV-aware routing algorithms  | 
 | 208 | +- Analyzes token_ids for optimal worker selection based on prefix cache  | 
 | 209 | +- Called only when `route=true` is specified  | 
 | 210 | + | 
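To make prefix-based worker selection concrete, here is a deliberately simplified sketch. It is not the Dynamo Router algorithm: it assumes the router can see, per worker, a list of cached token prefixes (the hypothetical `cached_prefixes` mapping) and picks the worker with the longest shared prefix, ignoring load and other signals a real router would likely weigh.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_worker(
    token_ids: Sequence[int],
    cached_prefixes: Dict[str, List[List[int]]],
) -> str:
    """Pick the worker whose cached prefixes overlap most with token_ids.

    Workers are visited in sorted order, so ties resolve to the
    lexicographically smallest address.
    """
    best_worker = None
    best_overlap = -1
    for worker in sorted(cached_prefixes):
        overlap = max(
            (shared_prefix_len(token_ids, p) for p in cached_prefixes[worker]),
            default=0,
        )
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    if best_worker is None:
        raise ValueError("no workers available")
    return best_worker
```

A request whose tokens extend a prefix already cached on one worker is sent to that worker, which is the behavior the "Check prefix cache" step in the sequence diagram describes.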
 | 211 | +### Dynamo Worker Pods  | 
 | 212 | +- Perform LLM inference with preprocessed tokens  | 
 | 213 | +- Support both token retrieval methods (cache keys, direct values)  | 
 | 214 | +- Maintain compatibility with existing worker interfaces  | 
 | 215 | +- Expose an HTTP endpoint for direct integration with the Inference Gateway  | 
 | 216 | + | 
 | 217 | +### Token Cache / Side channel  | 
 | 218 | +- External storage system that provides a key/value store interface to transfer token_ids from the Processor to the worker  | 
 | 219 | +- Stores tokenized data with generated keys  | 
 | 220 | +- Enables efficient token sharing between components  | 
 | 221 | +- Optional component (direct token passing also supported)  | 
 | 222 | + | 
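The key/value contract described above could look like the sketch below. The class name, method names, and key format are hypothetical, and an in-memory dictionary stands in for whatever external store is eventually chosen, since that choice is deferred to implementation.

```python
import hashlib
import json
from typing import Dict, List, Optional


class InMemoryTokenCache:
    """Stand-in for an external key/value token store (illustrative only)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[int]] = {}

    def put(self, request_id: str, token_ids: List[int]) -> str:
        """Store token ids and return the generated cache key.

        The key format (request id plus a content digest) is an assumption
        made for this sketch, not something the proposal specifies.
        """
        payload = json.dumps(token_ids).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:16]
        key = f"{request_id}:{digest}"
        self._store[key] = list(token_ids)
        return key

    def get(self, key: str) -> Optional[List[int]]:
        """Return the stored token ids, or None if the key is unknown."""
        return self._store.get(key)
```

In this model the Processor calls `put` and forwards the returned key, and the worker calls `get` with the value of the `x-req-tokens-key` header.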
 | 223 | +## Configuration  | 
 | 224 | + | 
 | 225 | +### Environment Variables  | 
 | 226 | +- `EXTERNAL_LLM_PROCESSOR_ENDPOINT`: Dynamo External LLM Processor URL  | 
 | 227 | +- `USE_EXTERNAL_LLM_PROCESSOR`: Enable/disable external pre-processing (apply prompt templates/tokenization)  | 
 | 228 | +- `USE_EXTERNAL_LLM_ROUTER`: Enable/disable external routing (in this case it's Dynamo Router)  | 
 | 229 | + | 
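A minimal sketch of how these variables might be read at startup follows. The `EppConfig` type, the accepted boolean spellings, and the rule that enabling the processor requires an endpoint are assumptions made for illustration; the proposal does not define them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed spellings


def _as_bool(value: Optional[str], default: bool = False) -> bool:
    """Interpret an environment string as a boolean flag."""
    if value is None:
        return default
    return value.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class EppConfig:
    processor_endpoint: Optional[str]
    use_external_processor: bool
    use_external_router: bool


def load_config(env: Mapping[str, str] = os.environ) -> EppConfig:
    """Build the EPP configuration from environment variables."""
    cfg = EppConfig(
        processor_endpoint=env.get("EXTERNAL_LLM_PROCESSOR_ENDPOINT"),
        use_external_processor=_as_bool(env.get("USE_EXTERNAL_LLM_PROCESSOR")),
        use_external_router=_as_bool(env.get("USE_EXTERNAL_LLM_ROUTER")),
    )
    # Assumed validation: external processing needs somewhere to send requests.
    if cfg.use_external_processor and not cfg.processor_endpoint:
        raise ValueError(
            "USE_EXTERNAL_LLM_PROCESSOR is set but "
            "EXTERNAL_LLM_PROCESSOR_ENDPOINT is missing"
        )
    return cfg
```

Passing the environment as a mapping keeps the function easy to exercise in tests without mutating `os.environ`.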
 | 230 | +### Headers  | 
 | 231 | +- `X-Gateway-Model-Name`: Set by EPP from parsed model name in user request's body  | 
 | 232 | +- `x-req-tokens-key`: Token cache key (when using external cache)  | 
 | 233 | +- `x-req-tokens-value`: Direct token values (alternative to cache)  | 
 | 234 | + | 
 | 235 | +## Deferred to Implementation  | 
 | 236 | + | 
 | 237 | +- Specific token cache implementation details (Redis vs alternatives)  | 
 | 238 | +- Fallback mechanisms for external service failures  | 
 | 239 | +- Metrics and observability integration  | 
 | 240 | + | 
 | 241 | +# Implementation Phases  | 
 | 242 | + | 
 | 243 | +## Phase 1 Core Integration  | 
 | 244 | +**Supported API / Behavior:**  | 
 | 245 | +- External tokenization via Dynamo Processor  | 
 | 246 | +- External scheduling/routing using Dynamo Router  | 
 | 247 | +- Direct token value passing to workers  | 
 | 248 | + | 
 | 249 | +**Not Supported:**  | 
 | 250 | +- External cache-based token passing   | 
 | 251 | + | 
 | 252 | +## Phase 2 Token transfer through side channel/cache  | 
 | 253 | +**Supported API / Behavior:**  | 
 | 254 | +- External cache-based token passing   | 
 | 255 | + | 
 | 256 | +# Related Proposals  | 
 | 257 | +* Gateway API Inference Extension Architecture  | 
 | 258 | +* EPP Architecture Proposal   | 
 | 259 | +* Model Server Protocol  | 
 | 260 | + | 
 | 261 | +# Alternate Solutions  | 
 | 262 | + | 
 | 263 | +## Alt 1 Direct Tokenizer Integration in EPP (current EPP architecture)  | 
 | 264 | + | 
 | 265 | +**Pros:**  | 
 | 266 | +- Simpler architecture without additional layer  | 
 | 267 | +- Lower latency for request processing  | 
 | 268 | +- Fewer network hops  | 
 | 269 | + | 
 | 270 | +**Cons:**  | 
 | 271 | +- Less flexible for different models  | 
 | 272 | +- Harder to maintain separation of concerns  | 
 | 273 | + | 
 | 274 | +**Reason Rejected:**  | 
 | 275 | +- Violates Gateway API integration principles  | 
 | 276 | +- Reduces portability across models  | 
 | 277 | +- Increases complexity/TCO by requiring a Go-based tokenizer  | 
 | 278 | + | 
 | 279 | +## Alt 2 Sidecar Pattern  | 
 | 280 | +- TODO  | 
 | 281 | + | 
 | 282 | +## References  | 
 | 283 | + | 
 | 284 | +* [Gateway API Inference Extension Documentation](https://gateway-api-inference-extension.sigs.k8s.io/)  | 
 | 285 | +* [Envoy External Processing Filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)  | 
 | 286 | +* [Gateway API Specification](https://gateway-api.sigs.k8s.io/)  | 