---
title: Cluster sizing
weight: 50
aliases: /openshift-aiops-platform/openshift-aiops-platform-cluster-sizing/
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY

include::modules/comm-attributes.adoc[]
include::modules/openshift-aiops-platform/metadata-openshift-aiops-platform.adoc[]

include::modules/cluster-sizing-template.adoc[]

[id="additional-sizing-considerations-openshift-aiops"]
== Additional sizing considerations for AIOps workloads

The OpenShift AIOps Self-Healing Platform has specific resource requirements beyond the baseline cluster sizing due to its observability, machine learning, and data storage needs.

=== Hub cluster sizing recommendations

The hub cluster hosts the majority of the AIOps platform components, including observability aggregation, ML training and inference, and the self-healing decision engine.

The pattern supports two deployment topologies, which are automatically detected during deployment using `make show-cluster-info`:

*Standard Highly Available Topology*::
+
Recommended for production multi-cluster deployments:
+
*Control Plane Nodes*::
* Minimum: 3 nodes
* vCPUs per node: 8
* Memory per node: 32 GB
* Sufficient for ACM, GitOps, and platform operators

*Compute Nodes*::
* Minimum: 6 nodes
* vCPUs per node: 16
* Memory per node: 64 GB
* Required for OpenShift AI workloads, observability stack, and data storage

*Total Hub Cluster Resources*::
* Control plane: 24 vCPUs, 96 GB memory
* Compute: 96 vCPUs, 384 GB memory
* Combined: 120 vCPUs, 480 GB memory
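
The totals above follow directly from the per-node minimums; a quick shell check of the arithmetic (node counts and per-node sizes are the figures listed above):

```shell
# Minimum node counts and per-node sizes from the lists above
cp_nodes=3;  cp_vcpu=8;   cp_mem=32   # control plane
wk_nodes=6;  wk_vcpu=16;  wk_mem=64   # compute

cp_total_vcpu=$((cp_nodes * cp_vcpu)); cp_total_mem=$((cp_nodes * cp_mem))
wk_total_vcpu=$((wk_nodes * wk_vcpu)); wk_total_mem=$((wk_nodes * wk_mem))

echo "Control plane: ${cp_total_vcpu} vCPUs, ${cp_total_mem} GB"  # 24 vCPUs, 96 GB
echo "Compute: ${wk_total_vcpu} vCPUs, ${wk_total_mem} GB"        # 96 vCPUs, 384 GB
echo "Combined: $((cp_total_vcpu + wk_total_vcpu)) vCPUs, $((cp_total_mem + wk_total_mem)) GB"  # 120 vCPUs, 480 GB
```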

*Single Node OpenShift (SNO) Topology*::
+
Suitable for edge deployments, development, or single-cluster self-healing scenarios:
+
*Single Node Requirements*::
* Minimum: 1 node
* vCPUs: 8 minimum, 16+ recommended
* Memory: 32 GB minimum, 64 GB recommended
* Storage: 120 GB minimum, 250 GB recommended
* Combined control plane and compute workloads on one node
+
[NOTE]
====
SNO deployments have reduced high availability but are suitable for edge locations, development environments, or scenarios where a single cluster is being managed. The pattern automatically detects SNO topology and adjusts resource allocation and storage configuration accordingly.

To verify cluster topology before deployment:

[source,terminal]
----
make show-cluster-info
----

During deployment, OpenShift Data Foundation (ODF) installation is automated via `make configure-cluster`, which adjusts for SNO topology when detected.
====
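
In essence, the topology detection reduces to counting nodes. The following sketch is illustrative only (the `detect_topology` helper is not part of the pattern's Makefile); against a live cluster the count would come from something like `oc get nodes --no-headers | wc -l`:

```shell
# Illustrative sketch: classify topology from the node count,
# mirroring what `make show-cluster-info` reports for a live cluster.
detect_topology() {
  local node_count=$1
  if [ "$node_count" -eq 1 ]; then
    echo "SNO"             # single node: combined control plane and compute
  else
    echo "HighlyAvailable" # multi-node: standard HA topology
  fi
}

detect_topology 1   # SNO
detect_topology 9   # HighlyAvailable
```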

=== Spoke cluster requirements

Spoke clusters have minimal overhead from the AIOps platform because most processing occurs on the hub:

* Standard OpenShift cluster sizing for your workloads
* Add 2 vCPUs and 4 GB memory per node for observability agents (Prometheus, Fluentd, OpenTelemetry)
* No additional nodes required specifically for AIOps

=== Storage considerations

The pattern requires persistent storage for several components:

*Metrics Storage (Thanos)*::
* 500 GB minimum for 30 days of retention
* 1 TB recommended for 60 days
* Scale based on the number of clusters and metric cardinality
* Storage class: block storage with good IOPS (gp3, Premium SSD)

*Log Storage (Loki)*::
* 200 GB minimum for 15 days of retention
* 500 GB recommended for 30 days
* Scale based on log volume from applications
* Storage class: block or object storage

*Model Storage (S3-compatible)*::
* 50 GB minimum for model artifacts and registry
* 100 GB recommended for multiple model versions and A/B testing
* Storage class: object storage (S3, MinIO, ODF)

*Incident History Database*::
* 50 GB minimum for incident data and ML training datasets
* 100 GB recommended for extended history
* Storage class: block storage with good IOPS

*Total Storage Requirements*::
* Minimum: 800 GB
* Recommended: 1.7 TB
* Consider using OpenShift Data Foundation for unified storage
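
The totals are simply the sums of the per-component figures above:

```shell
# Per-component storage figures (GB) from the lists above
thanos_min=500;  loki_min=200;  model_min=50;  incident_min=50
thanos_rec=1000; loki_rec=500;  model_rec=100; incident_rec=100

total_min=$((thanos_min + loki_min + model_min + incident_min))
total_rec=$((thanos_rec + loki_rec + model_rec + incident_rec))

echo "Minimum: ${total_min} GB"      # 800 GB
echo "Recommended: ${total_rec} GB"  # 1700 GB, i.e. ~1.7 TB
```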

=== Scaling recommendations by cluster count

Resource requirements scale with the number of managed clusters:

*1-5 Spoke Clusters*::
* Use baseline hub sizing (6 compute nodes)
* 1 TB total storage
* Suitable for development and small production deployments

*6-20 Spoke Clusters*::
* Scale to 7-9 compute nodes
* 2 TB total storage
* Consider dedicated nodes for observability workloads
* May require metrics downsampling for cost optimization

*21-50 Spoke Clusters*::
* Scale to 10-15 compute nodes
* 4 TB total storage
* Use separate node pools for ML, observability, and data storage
* Implement metric federation and sampling strategies
* Consider a dedicated Kafka deployment or similar for event streaming

*50+ Spoke Clusters*::
* Enterprise deployment requiring detailed capacity planning
* Consider horizontal scaling of observability components
* Implement tiered storage with a hot/warm/cold data lifecycle
* May require multiple hub clusters for geographic distribution
* Consult Red Hat for sizing recommendations
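
The tiers above can be sketched as a simple lookup (the `sizing_tier` helper is illustrative, not part of the pattern; the cut-offs and figures are the ones listed above):

```shell
# Illustrative lookup: map spoke-cluster count to a sizing tier.
sizing_tier() {
  local spokes=$1
  if   [ "$spokes" -le 5 ];  then echo "baseline hub sizing, 1 TB total storage"
  elif [ "$spokes" -le 20 ]; then echo "7-9 compute nodes, 2 TB total storage"
  elif [ "$spokes" -le 50 ]; then echo "10-15 compute nodes, 4 TB total storage"
  else                            echo "enterprise: detailed capacity planning required"
  fi
}

sizing_tier 3    # baseline tier
sizing_tier 35   # large tier
```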

=== Network requirements

*Bandwidth*::
* Each spoke cluster generates approximately 1-5 Mbps of metrics and logs
* The hub cluster needs sufficient ingress bandwidth: 50 Mbps for 10 spokes, 250 Mbps for 50 spokes
* Model inference is low bandwidth (<1 Mbps)

*Latency*::
* Observability can tolerate latency up to 1 second
* Real-time self-healing performs best with <200 ms latency to spoke clusters
* Consider regional hub clusters for global deployments
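
The ingress figures assume the worst-case 5 Mbps per spoke; the estimate is simply the spoke count times the per-spoke rate:

```shell
# Hub ingress bandwidth estimate: worst-case 5 Mbps of metrics/logs per spoke
per_spoke_mbps=5

hub_ingress_mbps() { echo $(($1 * per_spoke_mbps)); }

echo "10 spokes: $(hub_ingress_mbps 10) Mbps"   # 50 Mbps
echo "50 spokes: $(hub_ingress_mbps 50) Mbps"   # 250 Mbps
```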

=== GPU requirements (optional)

GPU acceleration is optional but recommended for ML training:

*ML Model Training*::
* Not required for inference (CPU-based inference is sufficient)
* Recommended for faster model training: 1-2 NVIDIA GPUs (T4, V100, or A100)
* Reduces training time from hours to minutes for large datasets
* Use GPU node pools with taints to reserve them for ML workloads

The baseline cluster sizing includes sufficient CPU resources for inference. Add GPUs only if training time is a concern or if you are experimenting with larger neural network models.
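
For example, a GPU node can be tainted so that only ML workloads carrying a matching toleration schedule onto it; the node name `gpu-node-1` and the taint key below are placeholders, not values defined by the pattern:

```shell
# Taint a GPU node so general workloads are not scheduled onto it;
# ML pods must then declare a toleration for nvidia.com/gpu=true:NoSchedule.
oc adm taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
```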