Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
142 changes: 142 additions & 0 deletions docs/HARDWARE_REQUIREMENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
# Hardware Requirements for Bud-Stack Platform

## Executive Summary

Bud-Stack is a comprehensive multi-service platform for AI/ML model deployment and cluster management. This document provides infrastructure requirements for Cloud Service Providers (CSPs) and organizations planning to deploy the platform.

### Platform Overview

The platform consists of:
- **14 Microservices** (Application, cluster management, ML optimization, model registry, etc.)
- **Core Infrastructure** (Databases, message queues, object storage, authentication)
- **Observability Stack** (Metrics, logging, distributed tracing)
- **High-Performance Gateway** (Rust-based API routing)

---

## Infrastructure Requirements Summary

### AI-In-A-Box - OEM

| Resource | Requirement |
|----------|-------------|
| **CPU Cores** | 32 cores |
| **Memory (RAM)** | 64 GiB |
| **Storage (SSD)** | 200 GiB |
| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) |
| **Kubernetes** | Version 1.29+ |

**Max concurrency**: Upto 100 concurrent users

---

### Enterprise deployment

| Resource | Requirement |
|----------|-------------|
| **CPU Cores** | 96 cores |
| **Memory (RAM)** | 384 GiB |
| **Storage (SSD)** | 5 TiB |
| **Network Bandwidth** | 10 Gbps |
| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) |
| **Kubernetes** | Version 1.29+ |

**Max concurreny**: Upto 1000 concurrent users

---

### CSP Deployment

| Resource | Requirement |
|----------|-------------|
| **CPU Cores** | 120-200 cores |
| **Memory (RAM)** | 0.5 - 1 TiB |
| **Storage (SSD)** | 10 - 20 TiB |
| **Network Bandwidth** | 10-40 Gbps |
| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) |
| **Kubernetes** | Version 1.29+ |
Comment on lines 50 to 57
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The resource requirements in this summary table are inconsistent with the totals derived from the 'Detailed Production Architecture' section below. For example, you specify 120-200 CPU cores here, but the detailed breakdown sums to 168-304 vCPU. Similar discrepancies exist for RAM and Storage. To avoid confusion for capacity planning, this summary table should be updated to accurately reflect the totals from the detailed breakdown (after correcting the calculation errors in that section).


**Concurrency**: 10000+

---

## Detailed Architecture

### Node Pool Breakdown

Production deployments use specialized node pools for optimal resource allocation:

| Node Pool | Purpose | Node Spec | Count | Total Resources |
|-----------|---------|-----------|-------|-----------------|
| **Control Plane** | Databases, state management | 8 vCPU, 32GB RAM, 500GB SSD | 3-5 | 24-40 vCPU, 96-160GB RAM |
| **Application** | Microservices, APIs | 16 vCPU, 32GB RAM, 200GB SSD | 5-10 | 80-160 vCPU, 160-320GB RAM |
| **Data Plane** | Analytics, storage, messaging | 16 vCPU, 64GB RAM, 1TB SSD | 3-5 | 48-80 vCPU, 192-320GB RAM |
| **Gateway** | API gateway, ingress | 8 vCPU, 16GB RAM, 100GB SSD | 2-3 | 16-24 vCPU, 32-48GB RAM |


---

### Persistent Storage Breakdown

| Component | Size (Min) | Size (Recommended) | Performance |
|-----------|------------|-------------------|-------------|
| **Databases** (PostgreSQL) | 10 GiB | 100-200 GiB | 3,000-10,000 IOPS, <10ms latency |
| **Analytics** (ClickHouse) | 30 GiB | 200-500 GiB | 5,000-20,000 IOPS, <5ms latency |
| **Object Storage** (Models, Datasets) | 50 GiB | 500 GiB-1 TiB | 1,000-5,000 IOPS, <20ms latency |
| **Message Queue** (Kafka) | 20 GiB | 100-200 GiB | 2,000-10,000 IOPS, <10ms latency |
| **Application Data** | 50 GiB | 100-200 GiB | Standard SSD |
| **Backups** | - | 500 GiB-1 TiB | Standard/Archive |

**Total Storage**:
- **Minimum**: 256 GiB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The 'Minimum' total storage is listed as 256 GiB, but summing the 'Size (Min)' column in the 'Persistent Storage Breakdown' table gives 160 GiB (10+30+50+20+50). This discrepancy is confusing. Please either correct the total or clarify what other storage components are included in this figure.

- **Recommended**: 2 TiB

### Storage Type Requirements

- **Premium SSD/NVMe**: Required for databases (PostgreSQL, ClickHouse)
- **Standard SSD**: Acceptable for application data, metrics
- **Network Storage**: Supported for shared volumes (NFS, Azure Files, EFS)

---

## Network Requirements

| Traffic Type | Minimum | Recommended | Notes |
|--------------|---------|-------------|-------|
| **Inter-Node** | 5 Gbps | 10 Gbps | Between cluster nodes |
| **Internet Ingress** | 1 Gbps | 5 Gbps | API traffic, model uploads |
| **Internet Egress** | 1 Gbps | 5 Gbps | Model downloads, webhooks |

---

## High Availability Scenarios

### Standard HA Configuration

| Component | Configuration | Failover Time |
|-----------|--------------|---------------|
| **Kubernetes Masters** | 3 nodes (multi-zone) | <30 seconds |
| **PostgreSQL** | 1 master + 2 replicas | <1 minute |
| **ClickHouse** | 3-node cluster | <2 minutes |
| **Redis** | 3-node Sentinel | <10 seconds |
| **Microservices** | 3+ replicas | Immediate |
| **Gateway** | 3+ replicas | Immediate |


### Key HA Features

- **Auto-Scaling**: HPA enabled for all stateless services (CPU/memory threshold: 75%)
- **Health Checks**: Liveness/readiness probes on all pods (5-second intervals)
- **Anti-Affinity**: Pods distributed across zones to prevent single point of failure
- **PodDisruptionBudget**: Minimum 50% pods available during updates
- **Backup Schedule**: Daily database backups, 30-day retention, WAL archiving



## Required Software

- **Kubernetes**: Version 1.29+
- **Helm**: Version 3.10 or higher
- **Container Runtime**: containerd 1.6+
- **kubectl**: Matching Kubernetes version
- **Operating System**: Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+