BudAIScaler uses a multi-layered prediction system that combines time-series analysis, seasonal pattern learning, and workload pattern detection to make proactive scaling decisions.
The prediction system operates at three levels:
- Time-Series Prediction - Linear regression on recent metric history
- Seasonal Learning - EWMA-based hourly/daily pattern learning
- Pattern Detection - Workload-specific pattern recognition
┌─────────────────────────────────────────────────────────────────┐
│                       Prediction Pipeline                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Metrics Sources                                                │
│             │                                                   │
│             ▼                                                   │
│  ┌──────────────────────┐                                       │
│  │ Metric Collector     │ ─────► Collects from pods/external    │
│  └──────────┬───────────┘                                       │
│             │                                                   │
│             ▼                                                   │
│  ┌──────────────────────┐    ┌──────────────────────┐           │
│  │ Time-Series          │    │ Seasonal Profile     │           │
│  │ Predictor            │    │ Manager              │           │
│  │                      │    │                      │           │
│  │ • Linear Regression  │    │ • 168 time buckets   │           │
│  │ • Moving Average     │    │ • EWMA smoothing     │           │
│  │ • Confidence calc    │    │ • Factor prediction  │           │
│  └──────────┬───────────┘    └──────────┬───────────┘           │
│             │                           │                       │
│             ▼                           ▼                       │
│  ┌──────────────────────────────────────────────────┐           │
│  │                 Pattern Detector                 │           │
│  │                                                  │           │
│  │  • KV Cache Pressure     • Batch Ingestion       │           │
│  │  • Traffic Spike/Drain   • Cold Start            │           │
│  └─────────────────────────┬────────────────────────┘           │
│                            │                                    │
│                            ▼                                    │
│  ┌──────────────────────────────────────────────────┐           │
│  │             Final Prediction Output              │           │
│  │                                                  │           │
│  │  • Predicted Replicas                            │           │
│  │  • Confidence Score (0-1)                        │           │
│  │  • Pattern Adjustment Factor                     │           │
│  └──────────────────────────────────────────────────┘           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
The prediction system is metric-agnostic and can use any metric defined in your metricsSources:
apiVersion: scaler.bud.studio/v1alpha1
kind: BudAIScaler
spec:
  metricsSources:
    - targetMetric: "app:cache_usage"
      targetValue: "70"
    - targetMetric: "app:queue_depth"
      targetValue: "50"
    - targetMetric: "app:response_time_ms"
      targetValue: "200"
  predictionConfig:
    enabled: true
    lookAheadMinutes: 15
    # Which metrics to use for time-series prediction
    # These metrics drive the linear regression model
    predictionMetrics:
      - "app:cache_usage"
    # Which metrics to track for seasonal learning
    # These metrics build up hourly/daily patterns over time
    seasonalMetrics:
      - "app:cache_usage"
      - "app:queue_depth"

| Field | Default |
|---|---|
| predictionMetrics | First metric in metricsSources |
| seasonalMetrics | Same as predictionMetrics |
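The defaulting rules in the table can be sketched as follows. This is a minimal illustration that treats the spec as a plain dictionary; `resolve_prediction_metrics` is a hypothetical helper, not part of BudAIScaler, and it assumes an empty or missing list falls back to the default.

```python
def resolve_prediction_metrics(spec):
    """Return (prediction_metrics, seasonal_metrics) after applying defaults.

    Hypothetical helper: assumes `spec` is the parsed `spec` mapping.
    """
    sources = [s["targetMetric"] for s in spec.get("metricsSources", [])]
    config = spec.get("predictionConfig", {})
    # Default: the first metric listed in metricsSources.
    prediction = config.get("predictionMetrics") or sources[:1]
    # Default: whatever predictionMetrics resolved to.
    seasonal = config.get("seasonalMetrics") or prediction
    return prediction, seasonal
```

With the manifest above, both lists are set explicitly, so no defaults would apply.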
The predictor uses least-squares linear regression on recent metric history to forecast future values:
predicted_value = intercept + slope × future_time
Where:
slope = (n × Σxy - Σx × Σy) / (n × Σx² - (Σx)²)
intercept = (Σy - slope × Σx) / n
future_time = current_time + lookAheadMinutes
Once the predicted metric value is known:
ratio = predicted_value / target_value
desired_replicas = ceil(current_replicas × ratio)
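The regression and replica formulas above can be combined into a short sketch. It assumes timestamps are expressed in minutes and that the last sample is "now"; the function names are illustrative, not the controller's actual API.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for the points (xs[i], ys[i])."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if n < 2 or denom == 0:
        raise ValueError("need at least two samples with distinct timestamps")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

def predict_value(xs, ys, look_ahead_minutes):
    """Extrapolate the fitted line look_ahead_minutes past the last sample."""
    slope, intercept = fit_line(xs, ys)
    future_time = xs[-1] + look_ahead_minutes  # assumption: xs[-1] is the current time
    return intercept + slope * future_time

def desired_replicas(current_replicas, predicted_value, target_value):
    """Scale the current replica count by predicted/target, rounded up."""
    ratio = predicted_value / target_value
    return math.ceil(current_replicas * ratio)
```

For example, a metric rising from 50 to 65 over 15 minutes extrapolates to 80 at a 15-minute look-ahead; with a target of 70 and 4 current replicas, the ratio is about 1.14 and the result rounds up to 5 replicas.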
Confidence is calculated based on:
- Data point count - More history = higher confidence
- Variance - Lower variance = higher confidence
count_factor = 1 - 1/(1 + history_length/10)
cv = std_dev / mean # Coefficient of variation
variance_factor = 1 / (1 + cv)
confidence = count_factor × variance_factor
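A minimal sketch of the confidence formula follows. The source does not say whether population or sample standard deviation is used; this sketch assumes population standard deviation, and it returns 0 when the mean is zero or fewer than two points exist, which is also an assumption.

```python
import statistics

def prediction_confidence(history):
    """Confidence in [0, 1) from history length and coefficient of variation."""
    n = len(history)
    if n < 2:
        return 0.0  # assumption: too little data to judge
    mean = statistics.fmean(history)
    if mean == 0:
        return 0.0  # assumption: coefficient of variation is undefined here
    count_factor = 1 - 1 / (1 + n / 10)
    cv = statistics.pstdev(history) / abs(mean)  # assumption: population std dev
    variance_factor = 1 / (1 + cv)
    return count_factor * variance_factor
```

Under these formulas, even a perfectly steady metric (cv = 0) needs at least 24 data points before the confidence reaches 0.7, because count_factor alone is 0.5 at 10 points and 0.75 at 30.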
The seasonal profile tracks 168 buckets (24 hours × 7 days) to capture weekly patterns:
bucket_index = day_of_week × 24 + hour_of_day
Example:
Monday 9 AM = 0 × 24 + 9 = bucket 9
Friday 5 PM = 4 × 24 + 17 = bucket 113
Sunday 2 AM = 6 × 24 + 2 = bucket 146
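The bucket mapping can be reproduced with the standard library, since `datetime.weekday()` also numbers Monday as 0. Which timezone the controller uses for bucketing is not stated in this document, so the sketch simply uses whatever timezone the given `datetime` carries.

```python
from datetime import datetime

def bucket_index(ts):
    """Map a datetime to one of 168 weekly buckets (Monday 00:00 = bucket 0)."""
    # weekday(): Monday=0 ... Sunday=6, matching the examples above.
    return ts.weekday() * 24 + ts.hour
```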
Each bucket maintains an exponentially weighted moving average:
new_mean = α × new_value + (1 - α) × old_mean
Where α (alpha) is the smoothing factor:
- Default: 0.2
- Higher α = more responsive to recent data
- Lower α = more stable, slower to adapt
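The per-bucket update is a one-line formula; the sketch below adds one assumption that the source does not specify, namely that the first observation seeds an empty bucket directly.

```python
def ewma_update(old_mean, new_value, alpha=0.2):
    """Return the bucket's new mean after observing new_value."""
    if old_mean is None:
        return float(new_value)  # assumption: first sample seeds the bucket
    return alpha * new_value + (1 - alpha) * old_mean
```

With the default α of 0.2, a bucket at 100 that observes 200 moves to 120, so a sustained shift takes several weekly observations to be fully reflected.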
The seasonal factor indicates how much the current time bucket deviates from the global average:
factor = bucket_mean / global_mean
Example:
- Factor 1.5 at Monday 9 AM → expect 50% higher load than average
- Factor 0.7 at Sunday 3 AM → expect 30% lower load than average
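A sketch of the factor calculation follows. The document does not define how `global_mean` is computed; this example assumes it is the average of all populated buckets and returns a neutral factor of 1.0 when data is missing.

```python
def seasonal_factor(bucket_means, index):
    """Ratio of one bucket's mean to the mean of all populated buckets."""
    known = [m for m in bucket_means if m is not None]
    if not known or bucket_means[index] is None:
        return 1.0  # assumption: no data means no seasonal adjustment
    global_mean = sum(known) / len(known)  # assumption: mean over populated buckets
    if global_mean == 0:
        return 1.0
    return bucket_means[index] / global_mean
```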
The system detects several workload patterns specific to GenAI/LLM workloads:
Trigger conditions (KV Cache Pressure):
- GPU cache usage > 85% AND requests waiting > 5
- OR cache usage rapidly increasing toward 70%+
Response:
- Scale factor: 1.5× (default)
- Priority: Highest
Trigger conditions (Batch Ingestion):
- Current requests > 2× baseline (5-10 min ago)
- Sudden spike pattern detected
Response:
- Scale factor: 2.0× (default)
- Priority: High
Trigger conditions (Traffic Spike/Drain):
- Trend analysis shows >50% increase (spike) or decrease (drain)
Response:
- Spike: Scale factor 1.5×
- Drain: Scale factor 0.8×
Trigger conditions (Cold Start):
- Low cache usage (<20%)
- Low request count (<10)
- Increasing trend
Response:
- Scale factor: 1.2× (preemptive)
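The four trigger blocks above can be read as a priority-ordered rule chain. The sketch below is illustrative only: the pattern names come from the pipeline diagram, pairing "Batch Ingestion" with the 2× baseline trigger is an inference, the `Snapshot` fields are hypothetical, and the relative order of the last two patterns is assumed because the source gives priorities only for the first two. The "rapidly increasing toward 70%+" condition is omitted because the source does not quantify it.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    cache_usage: float        # percent, 0-100
    requests_waiting: int
    current_requests: float
    baseline_requests: float  # request level 5-10 minutes ago
    trend: float              # fractional change, e.g. 0.6 means +60%

def detect_pattern(s):
    """Return (pattern_name, scale_factor) for the first matching rule."""
    if s.cache_usage > 85 and s.requests_waiting > 5:
        return "kv_cache_pressure", 1.5   # highest priority
    if s.baseline_requests > 0 and s.current_requests > 2 * s.baseline_requests:
        return "batch_ingestion", 2.0     # high priority
    if s.trend > 0.5:
        return "traffic_spike", 1.5
    if s.trend < -0.5:
        return "traffic_drain", 0.8
    if s.cache_usage < 20 and s.current_requests < 10 and s.trend > 0:
        return "cold_start", 1.2          # preemptive
    return "none", 1.0
```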
The system records predictions and verifies them after the look-ahead period:
T=0: Record prediction (P=5 replicas in 15 min)
T=15: Verify actual state (A=4 replicas)
Error = (P - A) / A = 25%
The system tracks:
- MAE - Mean Absolute Error
- MAPE - Mean Absolute Percentage Error
- Direction Accuracy - Did we predict scale up/down correctly?
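The three tracked metrics can be computed from verified predictions as sketched below. The record layout (replicas at prediction time, predicted replicas, actual replicas) is an assumption made for this example; direction is judged relative to the replica count at the moment the prediction was recorded.

```python
def _sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def accuracy_metrics(records):
    """records: list of (replicas_at_prediction, predicted, actual) tuples.

    Returns (MAE, MAPE in percent, direction accuracy in percent).
    """
    if not records:
        raise ValueError("no verified predictions yet")
    abs_errors = [abs(p - a) for _, p, a in records]
    mae = sum(abs_errors) / len(records)
    # Skip records with zero actual replicas to avoid dividing by zero.
    pct = [abs(p - a) / a for _, p, a in records if a > 0]
    mape = 100 * sum(pct) / len(pct) if pct else 0.0
    hits = sum(1 for base, p, a in records if _sign(p - base) == _sign(a - base))
    direction_accuracy = 100 * hits / len(records)
    return mae, mape, direction_accuracy
```

For the example above (P=5, A=4), a single record starting from 3 replicas gives an MAE of 1, a MAPE of 25%, and a correct direction, since both the prediction and the outcome were scale-ups.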
When accuracy drops, the system automatically:
- Adjusts EWMA alpha for seasonal profiles
- Resets poorly-performing time buckets
- Updates pattern detection thresholds
Prediction integrates with the BudScaler algorithm through weighted blending:
// Only if confidence >= 0.7
weight = 0.3 × confidence
adjusted = (metric_based × (1 - weight)) + (prediction × weight)
The prediction can influence the final decision by up to 30%, proportional to its confidence level.
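The blending rule can be written as a small function. The parameter names are illustrative; how the controller rounds the blended value to a whole replica count is not stated here, so the sketch returns the unrounded result.

```python
def blend(metric_based, prediction, confidence, min_confidence=0.7, max_weight=0.3):
    """Blend the metric-based recommendation with the prediction."""
    if confidence < min_confidence:
        return float(metric_based)  # prediction is ignored below the threshold
    weight = max_weight * confidence
    return metric_based * (1 - weight) + prediction * weight
```

With a metric-based recommendation of 4 and a prediction of 8, a confidence of 0.8 gives a weight of 0.24 and a blended value of 4.96, while a confidence of 0.5 leaves the recommendation at 4.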
Schedule hints and predictions serve different purposes:
| Aspect | Schedule Hints | Prediction |
|---|---|---|
| Source | User-defined cron schedules | Learned from metrics |
| Priority | Acts as a floor | Weighted input |
| Certainty | 100% confidence | Variable confidence |
| Use case | Known traffic patterns | Unknown/varying patterns |
When both are active:
- Schedule hint sets a minimum floor
- Prediction can only increase above the floor
- Cost constraints can still cap the maximum
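The interaction of the three rules can be sketched as a floor followed by a cap. This assumes the cost cap wins when it is lower than the schedule floor, which the last bullet suggests but does not state outright; the function and parameter names are hypothetical.

```python
def apply_constraints(recommended, schedule_floor=None, cost_cap=None):
    """Apply the schedule-hint floor, then the cost cap, to a replica count."""
    result = recommended
    if schedule_floor is not None:
        result = max(result, schedule_floor)  # the hint acts as a minimum
    if cost_cap is not None:
        result = min(result, cost_cap)        # assumption: the cap is applied last
    return result
```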
Select metrics that:
- Correlate with scaling needs
- Change before resource pressure (leading indicators)
- Have relatively stable patterns
Good choices:
- Cache/memory utilization
- Queue depth
- Request latency percentiles
Avoid:
- Highly volatile metrics without patterns
- Metrics with lots of noise
| Learning Duration | Expected Accuracy |
|---|---|
| < 1 day | Low - insufficient data |
| 1-3 days | Moderate - some patterns |
| 1 week | Good - full weekly cycle |
| 2+ weeks | Best - refined patterns |
predictionConfig:
  # Longer look-ahead = more time to scale, but less accurate
  lookAheadMinutes: 15 # 5-30 minutes typical
  # More history = better patterns, more storage
  historicalDataDays: 7 # 3-14 days typical
  # Higher confidence = only act on strong signals
  minConfidence: 0.7 # 0.5-0.9 typical
Check prediction status via the BudAIScaler status:
kubectl get budaiscaler my-scaler -o jsonpath='{.status.learningStatus}'
Key fields:
- dataPointsCollected - Total learning data points
- overallAccuracy - Direction prediction accuracy (%)
- overallMAPE - Mean Absolute Percentage Error
- currentPattern - Detected workload pattern
- patternConfidence - Pattern detection confidence
- calibrationNeeded - Whether recalibration is needed