82 changes: 69 additions & 13 deletions docs/concepts/architecture-and-performance.md
@@ -17,7 +17,8 @@ This guide explains the architecture, execution model, and how to tune performan
β”‚ β€’ Column dependency resolution β”‚ β”‚ β€’ GPU allocation and scheduling β”‚
β”‚ β€’ Batching and parallelism β”‚ β”‚ β€’ Request queuing β”‚
β”‚ β€’ Retry and error handling β”‚ β”‚ β€’ Token generation β”‚
β”‚ β€’ Data validation and quality β”‚ β”‚ β€’ Rate limiting (optional) β”‚
β”‚ β€’ Adaptive concurrency (AIMD) β”‚ β”‚ β€’ Rate limiting (optional) β”‚
β”‚ β€’ Data validation and quality β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–² β–²
β”‚ β”‚
@@ -31,6 +32,7 @@ This guide explains the architecture, execution model, and how to tune performan
- **Resolves dependencies** between columns (DAG-based execution)
- **Batches** work into manageable chunks (`buffer_size`)
- **Parallelizes** LLM calls within batches (`max_parallel_requests`)
- **Adapts to rate limits** automatically via AIMD concurrency control
- **Handles errors** with retries and early shutdown logic
- **Validates** generated data against schemas and constraints

Expand All @@ -39,14 +41,14 @@ This guide explains the architecture, execution model, and how to tune performan
- **Host models**: You must provide LLM endpoints
- **Manage GPUs**: Your inference server handles GPU allocation
- **Scale inference**: You must provision sufficient capacity
- **Rate limit**: Your server or API gateway handles this
- **Impose rate limits**: Your server or API gateway sets rate limits (Data Designer *reacts* to them automatically)

---

## Execution Model

!!! note "Column-Wise Generator"
This describes Data Designer's current **column-wise dataset generator**. Other dataset generation strategies are in development.
!!! note "Dataset Builder"
This describes Data Designer's current **`DatasetBuilder`**, which generates columns sequentially within batches. Other dataset generation strategies are in development.

Data Designer processes datasets in **batches**, with **parallel** operations within each batch.

@@ -102,12 +104,22 @@ At any moment, the number of concurrent LLM requests is:
```python
concurrent_requests = min(
buffer_size, # Records in current batch
max_parallel_requests, # Per-model limit
current_throttle_limit, # AIMD-managed limit (≀ max_parallel_requests)
remaining_cells_in_column # Cells left to generate
)
```
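Concretely, the `min()` can be evaluated with plain numbers (illustrative values only; they assume AIMD has already backed off from a ceiling of 32):

```python
# Illustrative values, not taken from a real run.
buffer_size = 100                # records in the current batch
current_throttle_limit = 24      # AIMD backed off from max_parallel_requests=32
remaining_cells_in_column = 100  # cells left to generate in this column

concurrent_requests = min(
    buffer_size,
    current_throttle_limit,
    remaining_cells_in_column,
)
print(concurrent_requests)  # 24 -- the throttle is the binding constraint
```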

**Example**: With `buffer_size=100` and `max_parallel_requests=8`, Data Designer sends up to 8 LLM requests at a time until all 100 cells in the column are complete.
`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:

- **On a 429 response**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied.
- **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.

This means Data Designer automatically finds the right concurrency level for your server without manual tuning.
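The control loop can be sketched as a standalone model. This is hypothetical code: the class and method names are illustrative rather than the library's internal API, though the defaults mirror the documented ones (`reduce_factor=0.75`, `additive_increase=1`, `success_window=25`):

```python
class AimdThrottle:
    """Toy model of AIMD (Additive Increase / Multiplicative Decrease)
    concurrency control. Illustrative only, not the library's internals."""

    def __init__(self, ceiling: int, reduce_factor: float = 0.75,
                 additive_increase: int = 1, success_window: int = 25):
        self.ceiling = ceiling          # max_parallel_requests
        self.limit = ceiling            # current_throttle_limit
        self.reduce_factor = reduce_factor
        self.additive_increase = additive_increase
        self.success_window = success_window
        self._successes = 0

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: cut the limit and reset the streak.
        self.limit = max(1, int(self.limit * self.reduce_factor))
        self._successes = 0

    def on_success(self) -> None:
        # Additive increase: after a window of consecutive successes,
        # add a slot, never exceeding the configured ceiling.
        self._successes += 1
        if self._successes >= self.success_window:
            self.limit = min(self.ceiling, self.limit + self.additive_increase)
            self._successes = 0
```

Starting from a ceiling of 32, two rate-limit events drop the limit to 24 and then 18, and 25 consecutive successes then raise it back by one slot.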

!!! note "Sync engine caveat"
AIMD adaptive concurrency is fully active on the **async engine** path. On the current **sync engine** path, 429 responses are retried transparently at the HTTP transport layer and do not reach the AIMD controller, so the concurrency limit stays fixed at `max_parallel_requests`. The async engine is landing soon and will be the recommended path for production workloads.

**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.

---

@@ -141,7 +153,7 @@ designer.set_run_config(run_config)

### `max_parallel_requests` (InferenceParams)

Controls concurrent LLM API calls **per model alias**.
Sets the **maximum** concurrent LLM API calls **per model**. This is the ceiling that the AIMD throttle controller can ramp up to β€” the actual concurrency at runtime may be lower if the server signals rate limits.

```python
import data_designer.config as dd
@@ -157,13 +169,15 @@ model = dd.ModelConfig(

**Default**: 4

**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs
**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. With AIMD, setting an aggressively high value is safer than before β€” the system will self-correct downward if the server can't keep up.

**When to decrease**: You're hitting rate limits or 429 errors, the inference server is overloaded, or you want more predictable/debuggable execution
**When to decrease**: You want to cap resource usage to a known safe level, or you want more predictable/debuggable execution.

!!! tip "Finding the optimal value"
The right value depends on your inference stack and model. Self-hosted vLLM servers can often handle values as high as 256, 512, or even 1024 depending on your hardware.

With AIMD, a practical approach is to set `max_parallel_requests` to the **upper bound** you're comfortable with and let the throttle controller find the sustainable level automatically. If you see frequent 429 β†’ recovery cycles in the logs, your ceiling is above the server's true capacity but the system is handling it. If you never see any throttle activity, you may have room to increase the ceiling further.

**Benchmark approach**: Run a small dataset (e.g., 100 records) with increasing `max_parallel_requests` values (4 β†’ 8 β†’ 16 β†’ 32 β†’ ...) and measure generation time. Stop increasing when the runtime stops decreasingβ€”that's when your inference server is saturated.
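The stopping rule from this benchmark can be captured in a small helper. This is a hypothetical sketch: `measurements` stands for whatever `(concurrency, runtime)` pairs you record, and the 10% threshold is an assumed tolerance, not a library default:

```python
def saturation_point(measurements: list[tuple[int, float]],
                     min_speedup: float = 0.10) -> int:
    """Return the concurrency level past which runtime stops improving.

    `measurements` pairs each max_parallel_requests value with the
    measured generation time, e.g. [(4, 120.0), (8, 65.0), ...].
    `min_speedup` is the relative improvement required to keep scaling.
    """
    measurements = sorted(measurements)
    best = measurements[0][0]
    for (_, prev_t), (conc, cur_t) in zip(measurements, measurements[1:]):
        if (prev_t - cur_t) / prev_t < min_speedup:
            break  # runtime has plateaued: the inference server is saturated
        best = conc
    return best
```

For example, with timings `[(4, 120.0), (8, 65.0), (16, 40.0), (32, 38.0)]` the helper returns 16, because the step from 16 to 32 improves runtime by only 5%.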

---
@@ -183,6 +197,46 @@ designer.set_run_config(run_config)

---

### Adaptive Throttling (RunConfig)

Data Designer uses an AIMD (Additive Increase / Multiplicative Decrease) controller to automatically adjust concurrency per model based on rate-limit feedback from the inference server. The defaults work well for most workloads. Override them via `ThrottleConfig` only when you understand the trade-offs.

!!! note "Requires the async engine"
Adaptive throttling is active on the **async engine** path, where 429 responses propagate to the AIMD controller. On the sync engine path, 429s are retried at the HTTP transport layer and `ThrottleConfig` settings have no effect. The async engine is landing soon.

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

run_config = dd.RunConfig(
throttle=dd.ThrottleConfig(
reduce_factor=0.75, # Multiply limit by this on a 429 (default: 0.75)
additive_increase=1, # Add this many slots after success_window successes (default: 1)
success_window=25, # Consecutive successes before increasing (default: 25)
cooldown_seconds=2.0, # Pause after a 429 when no Retry-After header (default: 2.0)
ceiling_overshoot=0.10, # Probe 10% above observed server limit (default: 0.10)
),
)

designer = DataDesigner()
designer.set_run_config(run_config)
```

| Parameter | Default | Effect |
|-----------|---------|--------|
| `reduce_factor` | 0.75 | How aggressively to cut concurrency on a 429. Lower = more aggressive. |
| `additive_increase` | 1 | Slots added per recovery step. Higher = faster ramp-up, but riskier. |
| `success_window` | 25 | Consecutive successes required before each increase step. |
| `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
| `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |

!!! tip "How it works in practice"
When a model endpoint returns HTTP 429, the controller reduces the concurrency limit for that model and pauses briefly. After enough consecutive successes, it begins ramping back up. If the server rate-limits again, the controller records that level as a ceiling and stabilizes just below it, with a small overshoot band to detect when the server can handle more load.

You can observe this in the logs β€” look for messages like `concurrency reduced from X β†’ Y` and `concurrency increased from X β†’ Y`.
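As a rough sketch of the overshoot band (illustrative arithmetic derived from the `ceiling_overshoot` description above; the controller's exact formula may differ):

```python
def probe_ceiling(observed_limit: int, ceiling_overshoot: float = 0.10) -> int:
    """Highest concurrency the controller may probe above a rate-limit
    ceiling it observed via a 429. Illustrative only, not the library's
    actual formula."""
    return int(observed_limit * (1 + ceiling_overshoot))
```

For example, after stabilizing at an observed ceiling of 24, the controller may periodically probe up to 26 concurrent requests to detect when the server can handle more load.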

---

### Error Handling (RunConfig)

Control retry behavior and early shutdown for failed generations.
@@ -210,7 +264,8 @@ designer.set_run_config(run_config)

| Problem | Symptom | Solution |
|---------|---------|----------|
| **Low throughput** | Low GPU utilization | Increase `max_parallel_requests` and/or `buffer_size` |
| **Low throughput** | Low GPU utilization | Increase `max_parallel_requests` and/or `buffer_size`. If the throttle has self-reduced due to earlier 429s (check logs for "concurrency reduced" messages), the server may need more capacity or you can wait for AIMD recovery. |
| **Frequent 429 β†’ recovery cycles** | Logs show repeated concurrency drops and ramp-ups | The `max_parallel_requests` ceiling is above the server's sustained capacity. This is handled automatically, but you can lower the ceiling to reduce the sawtooth or tune `reduce_factor` / `success_window`. |
| **Long tail of slow generations** | Most records fast, few very slow | Reduce `max_conversation_restarts`, simplify schemas, improve prompts |
| **Multi-model idle periods** | One model busy, others idle | Reduce `buffer_size` for faster cycling, or consolidate models |
| **Memory errors** | OOM crashes | Reduce `buffer_size` and `max_parallel_requests` |
@@ -220,10 +275,11 @@ designer.set_run_config(run_config)

## Tuning Workflow

1. **Start with defaults** for initial development
1. **Start with defaults** for initial development β€” AIMD handles rate-limit adaptation automatically
2. **Profile your workload**: How many LLM columns? How many records? What models?
3. **Identify bottleneck**: Low GPU util β†’ increase `max_parallel_requests`. Memory issues β†’ decrease `buffer_size`. Long tails β†’ tune retry settings.
4. **Iterate**: Make one change at a time, measure impact before next change
3. **Identify bottleneck**: Low GPU util β†’ increase `max_parallel_requests` (AIMD will self-correct if you overshoot). Memory issues β†’ decrease `buffer_size`. Long tails β†’ tune retry settings.
4. **Check throttle logs**: Look for "concurrency reduced" / "concurrency increased" messages to understand whether rate limits are the bottleneck
5. **Iterate**: Make one change at a time, measure impact before next change

---
