diff --git a/docs/devnotes/posts/owning-the-model-stack.md b/docs/devnotes/posts/owning-the-model-stack.md index 53a9eb00..16b142cf 100644 --- a/docs/devnotes/posts/owning-the-model-stack.md +++ b/docs/devnotes/posts/owning-the-model-stack.md @@ -28,8 +28,12 @@ So we built a native client layer. Thin HTTP adapters with adaptive rate-limit h The replacement is a layered stack where each layer does one thing. `ModelFacade`, the public orchestration surface that column generators call, didn't change at all. Everything below it is new. +
+ ![Native model client architecture: six layers from ModelFacade down to provider HTTP APIs](assets/owning-the-model-stack/native-model-client-layers.png){ style="max-width:75%; height:auto" } +
+ From top to bottom: 1. **ModelFacade**: orchestrates correction loops, MCP tool-calling, and usage tracking. This is the public API. Column generators talk to this layer, and it was untouched during the migration. If you've written a Data Designer pipeline, nothing about your code changes. @@ -67,8 +71,12 @@ The asymmetry is deliberate. You probe upward slowly because overshooting wastes The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes. +
+ ![AIMD concurrency control over time: initial phase, 429 drop, recovery, ceiling stabilization, steady state](assets/owning-the-model-stack/aimd-concurrency-over-time.png){ style="max-width:75%; height:auto" } +
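In code, the loop above fits in a few lines. This is an illustrative sketch rather than the actual implementation: the probe threshold of ten successes and the halving factor are assumptions, though the state fields (`current_limit`, `success_streak`, `rate_limit_ceiling`) mirror the ones the throttle manager tracks.

```python
class AimdThrottle:
    """Illustrative AIMD state for a single throttled endpoint."""

    def __init__(self, max_parallel_requests: int):
        self.max_limit = max_parallel_requests
        self.current_limit = float(max_parallel_requests)  # start at the configured cap
        self.success_streak = 0
        self.rate_limit_ceiling = None  # the limit in force when the last 429 arrived

    def on_success(self) -> None:
        # Additive increase: probe upward slowly, one permit per success streak.
        self.success_streak += 1
        if self.success_streak >= 10:  # assumed probe threshold
            self.success_streak = 0
            ceiling = self.rate_limit_ceiling or self.max_limit
            self.current_limit = min(self.current_limit + 1, ceiling)

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: back off hard and remember where the wall was.
        self.rate_limit_ceiling = self.current_limit
        self.current_limit = max(1.0, self.current_limit / 2)
        self.success_streak = 0
```

The asymmetry falls out directly: a 429 halves the limit immediately, while recovery costs a full success streak per upward step, and the recorded ceiling keeps probing from repeatedly slamming into the same wall.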
+ This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle. ### **Ceiling stabilization** @@ -89,8 +97,12 @@ Real pipelines aren't simple. A single provider+model combination might serve ch The throttle manager handles this with two-level keying: +
+ ![Two-level throttle keying: global cap per provider+model, independent domain states for chat, embedding, image](assets/owning-the-model-stack/throttle-keying.png){ style="max-width:75%; height:auto" } +
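Concretely, the scheme can be sketched with a pair of dictionaries. The class name and the `register`/`state_for` helpers are hypothetical, here only to show the keying; the real throttle manager's API may differ.

```python
class ThrottleManager:
    """Sketch of two-level throttle keying (illustrative, not the real API)."""

    def __init__(self):
        self.global_caps = {}     # (provider_name, model_id) -> effective max
        self.domain_states = {}   # (provider_name, model_id, domain) -> AIMD state

    def register(self, provider_name: str, model_id: str, max_parallel_requests: int) -> None:
        # Aliases targeting the same provider+model share upstream capacity,
        # so the effective cap is the min() of their configured limits.
        key = (provider_name, model_id)
        prev = self.global_caps.get(key, max_parallel_requests)
        self.global_caps[key] = min(prev, max_parallel_requests)

    def state_for(self, provider_name: str, model_id: str, domain: str) -> dict:
        # Each domain gets independent AIMD state, capped by the global max.
        key = (provider_name, model_id, domain)
        if key not in self.domain_states:
            cap = self.global_caps[(provider_name, model_id)]
            self.domain_states[key] = {
                "current_limit": float(cap),
                "in_flight": 0,
                "blocked_until": 0.0,
                "success_streak": 0,
                "rate_limit_ceiling": None,
            }
        return self.domain_states[key]
```

Two aliases registered with limits of 64 and 32 against the same provider and model end up sharing an effective cap of 32, while the `chat` and `embedding` states float independently beneath it.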
- **Global cap**: keyed by `(provider_name, model_id)`. When multiple model aliases target the same provider and model, the effective max is `min()` of their configured `max_parallel_requests`. This enforces the most conservative limit for shared upstream capacity, because the provider doesn't care what you *call* the model: it sees the same API key. - **Domain state**: keyed by `(provider_name, model_id, throttle_domain)`. Each domain (`chat`, `embedding`, `image`, `healthcheck`) maintains its own AIMD state: `current_limit`, `in_flight`, `blocked_until`, `success_streak`, and `rate_limit_ceiling`. Domains float independently but are always capped by the global max. @@ -105,8 +117,12 @@ The transport layer (via `httpx` with `RetryTransport`) handles transient server But **429 is explicitly excluded from transport retries**. +
+ ![Retry boundary: 502/503/504 retried at transport, 429 passed through to ThrottledModelClient for AIMD feedback](assets/owning-the-model-stack/retry-boundary.png){ style="max-width:75%; height:auto" } +
Why? Because if the retry layer swallows 429s, the throttle manager never learns the provider is overloaded. The whole AIMD feedback loop depends on seeing raw rate-limit signals. A 429 must bubble up to `ThrottledModelClient` so it can call `release_rate_limited()`, cut the concurrency limit, apply the cooldown, and record the ceiling. The next attempt then re-enters the throttle acquire path, waiting for a permit, before making another HTTP call. The split is clean and worth remembering. Transport retries handle *server problems*. Throttle adaptation handles *capacity problems*. The provider is working fine; you're just sending too many requests. Conflating the two is how you get retry storms.
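As a sketch, the boundary is just a retry predicate that refuses to treat 429 as retryable. The helper below is hypothetical (the real path runs through `httpx` transports), with the HTTP call abstracted into an injected `send` callable:

```python
TRANSPORT_RETRY_STATUSES = {502, 503, 504}  # server problems: safe to retry in place

def send_with_transport_retries(send, max_retries: int = 3) -> int:
    """Retry transient 5xx at the transport layer.

    429 is deliberately absent from TRANSPORT_RETRY_STATUSES, so a
    rate-limit response returns immediately and bubbles up to the
    throttle layer, which cuts concurrency and records the ceiling.
    """
    status = send()
    attempts = 0
    while status in TRANSPORT_RETRY_STATUSES and attempts < max_retries:
        attempts += 1  # real code would back off between attempts
        status = send()
    return status
```

A 503 gets retried in place; a 429 falls straight through to the caller, which is exactly the split described above.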