Commit d37226f
Add OpenShift AIOps Self-Healing Platform pattern documentation
Comprehensive documentation for the OpenShift AIOps Platform validated pattern, including deployment guides, architecture overview, customization examples, and cluster sizing recommendations.

Key additions:

- Pattern overview and solution elements
- Architecture documentation with component details
- Deployment instructions for HA and SNO topologies
- Customization guide with real-world examples from live clusters
- Platform configuration tuning examples
- Storage configuration patterns (ODF, CephFS, cloud storage)
- Resource planning and sizing for HA/SNO deployments
- Development workflow with Jupyter notebooks
- Workshop resources and learning materials
- Cluster sizing guidelines

Includes a catalog of 33 Jupyter notebooks covering anomaly detection, self-healing logic, model serving, MCP/Lightspeed integration, and advanced scenarios.
1 parent 7ac8632 commit d37226f

12 files changed: +3542 additions, -0 deletions

Lines changed: 42 additions & 0 deletions
---
title: OpenShift AIOps Self-Healing Platform
date: 2026-02-26
tier: community
summary: This pattern provides an AI-powered self-healing platform for OpenShift clusters, combining deterministic automation with machine learning for intelligent incident response.
rh_products:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift AI
- Red Hat OpenShift GitOps
- Red Hat OpenShift Pipelines
- Red Hat OpenShift Data Foundation
- Red Hat Advanced Cluster Management for Kubernetes
industries:
- General
pattern_logo: openshift-aiops-platform.png
links:
  github: https://github.com/KubeHeal/openshift-aiops-platform
  install: getting-started
  arch: https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/002-hybrid-self-healing-approach.md
  bugs: https://github.com/KubeHeal/openshift-aiops-platform/issues
  feedback: https://github.com/KubeHeal/openshift-aiops-platform/discussions
  ci: openshift-aiops-platform
---
:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

include::modules/oaiops-about.adoc[leveloffset=+1]

include::modules/oaiops-solution-elements.adoc[leveloffset=+2]

include::modules/oaiops-architecture.adoc[leveloffset=+1]

[id="next-steps_openshift-aiops-platform-index"]
== Next steps

* link:getting-started[Deploy the OpenShift AIOps Self-Healing Platform]
* Review the link:cluster-sizing[cluster sizing requirements]
* Explore link:ideas-for-customization[customization options] to adapt the pattern to your use case
* Read the link:{github-url}/blob/main/docs/adrs/002-hybrid-self-healing-approach.md[hybrid self-healing architecture decision record]
* Join the discussion at link:{feedback-url}[GitHub Discussions]
Lines changed: 162 additions & 0 deletions
---
title: Cluster sizing
weight: 50
aliases: /openshift-aiops-platform/openshift-aiops-platform-cluster-sizing/
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY

include::modules/comm-attributes.adoc[]
include::modules/openshift-aiops-platform/metadata-openshift-aiops-platform.adoc[]

include::modules/cluster-sizing-template.adoc[]

[id="additional-sizing-considerations-openshift-aiops"]
== Additional sizing considerations for AIOps workloads

The OpenShift AIOps Self-Healing Platform has specific resource requirements beyond the baseline cluster sizing due to its observability, machine learning, and data storage needs.

=== Hub cluster sizing recommendations

The hub cluster hosts the majority of the AIOps platform components, including observability aggregation, ML training and inference, and the self-healing decision engine.

The pattern supports two deployment topologies, which are automatically detected during deployment using `make show-cluster-info`:

*Standard HighlyAvailable Topology*::
+
Recommended for production multi-cluster deployments:
+
*Control Plane Nodes*::
* Minimum: 3 nodes
* vCPUs per node: 8
* Memory per node: 32 GB
* Sufficient for ACM, GitOps, and platform operators

*Compute Nodes*::
* Minimum: 6 nodes
* vCPUs per node: 16
* Memory per node: 64 GB
* Required for OpenShift AI workloads, observability stack, and data storage

*Total Hub Cluster Resources*::
* Control plane: 24 vCPUs, 96 GB memory
* Compute: 96 vCPUs, 384 GB memory
* Combined: 120 vCPUs, 480 GB memory

*Single Node OpenShift (SNO) Topology*::
+
Suitable for edge deployments, development, or single-cluster self-healing scenarios:
+
*Single Node Requirements*::
* Minimum: 1 node
* vCPUs: 8 minimum, 16+ recommended
* Memory: 32 GB minimum, 64 GB recommended
* Storage: 120 GB minimum, 250 GB recommended
* Combined control plane and compute workloads on one node
+
[NOTE]
====
SNO deployments have reduced high availability but are suitable for edge locations, development environments, or scenarios where a single cluster is being managed. The pattern automatically detects SNO topology and adjusts resource allocation and storage configuration accordingly.

To verify cluster topology before deployment:
[source,terminal]
----
make show-cluster-info
----

During deployment, OpenShift Data Foundation (ODF) installation is automated via `make configure-cluster`, which adjusts for SNO topology when detected.
====
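
The topology detection described above can be approximated by inspecting the cluster's `Infrastructure` resource, which reports `HighlyAvailable` or `SingleReplica` in `status.infrastructureTopology`. The following is an illustrative sketch, not part of the pattern's Makefile; the `classify_topology` helper and its output strings are hypothetical:

```shell
#!/bin/sh
# Illustrative sketch: classify the infrastructureTopology value reported by
# `oc get infrastructure cluster -o jsonpath='{.status.infrastructureTopology}'`.
# The function name and output labels are hypothetical, for demonstration only.
classify_topology() {
  case "$1" in
    HighlyAvailable) echo "standard-ha" ;;  # 3 control plane + 6 compute nodes
    SingleReplica)   echo "sno" ;;          # single combined node
    *)               echo "unknown" ;;
  esac
}

# In practice the value would come from the live cluster, for example:
# topology=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureTopology}')
classify_topology "HighlyAvailable"
```
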

=== Spoke cluster requirements

Spoke clusters have minimal overhead from the AIOps platform since most processing occurs on the hub:

* Standard OpenShift cluster sizing for your workloads
* Add 2 vCPUs and 4 GB memory per node for observability agents (Prometheus, Fluentd, OpenTelemetry)
* No additional nodes required specifically for AIOps
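
The per-node agent overhead translates into a simple back-of-the-envelope calculation. A minimal sketch (the `agent_overhead` helper is illustrative, not part of the pattern) totaling the extra capacity a spoke cluster needs for the observability agents:

```shell
#!/bin/sh
# Illustrative: total extra spoke-cluster capacity for observability agents,
# at 2 vCPUs and 4 GB memory per node. Helper name is hypothetical.
agent_overhead() {
  nodes="$1"
  echo "vcpus=$((nodes * 2)) memory_gb=$((nodes * 4))"
}

agent_overhead 6   # for a 6-node spoke cluster
```
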

=== Storage considerations

The pattern requires persistent storage for several components:

*Metrics Storage (Thanos)*::
* 500 GB minimum for 30 days of retention
* 1 TB recommended for 60 days
* Scale based on number of clusters and metric cardinality
* Storage class: Block storage with good IOPS (gp3, Premium SSD)

*Log Storage (Loki)*::
* 200 GB minimum for 15 days of retention
* 500 GB recommended for 30 days
* Scale based on log volume from applications
* Storage class: Block or object storage

*Model Storage (S3-compatible)*::
* 50 GB minimum for model artifacts and registry
* 100 GB recommended for multiple model versions and A/B testing
* Storage class: Object storage (S3, MinIO, ODF)

*Incident History Database*::
* 50 GB minimum for incident data and ML training datasets
* 100 GB recommended for extended history
* Storage class: Block storage with good IOPS

*Total Storage Requirements*::
* Minimum: 800 GB
* Recommended: 1.7 TB
* Consider using OpenShift Data Foundation for unified storage
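
The minimum total is the sum of the per-component minimums (500 + 200 + 50 + 50 = 800 GB), and the recommended figures sum to 1700 GB. A quick sanity-check script (illustrative, not part of the pattern):

```shell
#!/bin/sh
# Illustrative: sum the per-component storage minimums and recommendations (GB).
thanos_min=500;   thanos_rec=1000
loki_min=200;     loki_rec=500
models_min=50;    models_rec=100
incidents_min=50; incidents_rec=100

echo "minimum_gb=$((thanos_min + loki_min + models_min + incidents_min))"
echo "recommended_gb=$((thanos_rec + loki_rec + models_rec + incidents_rec))"
```
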

=== Scaling recommendations by cluster count

Resource requirements scale with the number of managed clusters:

*1-5 Spoke Clusters*::
* Use baseline hub sizing (6 compute nodes)
* 1 TB total storage
* Suitable for development and small production deployments

*6-20 Spoke Clusters*::
* Scale to 7-9 compute nodes
* 2 TB total storage
* Consider dedicated nodes for observability workloads
* May require metrics downsampling for cost optimization

*21-50 Spoke Clusters*::
* Scale to 10-15 compute nodes
* 4 TB total storage
* Use separate node pools for ML, observability, and data storage
* Implement metric federation and sampling strategies
* Consider dedicated Kafka or similar for event streaming

*50+ Spoke Clusters*::
* Enterprise deployment requiring detailed capacity planning
* Consider horizontal scaling of observability components
* Implement tiered storage with hot/warm/cold data lifecycle
* May require multiple hub clusters for geographic distribution
* Consult Red Hat for sizing recommendations
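
The tiers above can be encoded as a simple lookup. A hedged sketch (the `sizing_tier` function and its output format are illustrative, not from the pattern) mapping spoke-cluster count to the suggested compute-node range and storage:

```shell
#!/bin/sh
# Illustrative: map managed spoke-cluster count to the sizing tier above.
# Function name and output labels are hypothetical.
sizing_tier() {
  n="$1"
  if   [ "$n" -le 5 ];  then echo "compute=baseline storage_tb=1"
  elif [ "$n" -le 20 ]; then echo "compute=7-9 storage_tb=2"
  elif [ "$n" -le 50 ]; then echo "compute=10-15 storage_tb=4"
  else                       echo "compute=custom storage_tb=custom"  # consult Red Hat
  fi
}

sizing_tier 12
```
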

=== Network requirements

*Bandwidth*::
* Each spoke cluster generates approximately 1-5 Mbps of metrics and logs
* Hub cluster needs sufficient ingress bandwidth: 50 Mbps for 10 spokes, 250 Mbps for 50 spokes
* Model inference is low bandwidth (<1 Mbps)
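
At the stated 1-5 Mbps per spoke, the hub ingress figures follow from the 5 Mbps worst case times the spoke count. A small sketch (the helper is illustrative) reproducing the numbers above:

```shell
#!/bin/sh
# Illustrative: hub ingress bandwidth estimate at the 5 Mbps
# worst case per spoke cluster. Helper name is hypothetical.
hub_ingress_mbps() {
  echo $(( $1 * 5 ))
}

hub_ingress_mbps 10   # 10 spokes -> 50 Mbps
hub_ingress_mbps 50   # 50 spokes -> 250 Mbps
```
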

*Latency*::
* Observability can tolerate latency up to 1 second
* Real-time self-healing performs best with <200 ms latency to spoke clusters
* Consider regional hub clusters for global deployments

=== GPU requirements (optional)

GPU acceleration is optional but recommended for ML training:

*ML Model Training*::
* Not required for inference (CPU-based inference is sufficient)
* Recommended for faster model training: 1-2 NVIDIA GPUs (T4, V100, or A100)
* Reduces training time from hours to minutes for large datasets
* Use GPU node pools with taints to reserve them for ML workloads

The baseline cluster sizing includes sufficient CPU resources for inference. Add GPUs only if training time is a concern or if experimenting with larger neural network models.
Lines changed: 20 additions & 0 deletions
---
title: Getting started
weight: 10
aliases: /openshift-aiops-platform/getting-started/
---
:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

include::modules/oaiops-deploying.adoc[leveloffset=+1]

[id="next-steps_openshift-aiops-platform-getting-started"]
== Next steps

* Review link:../cluster-sizing[cluster sizing requirements] to ensure your infrastructure meets the pattern's needs
* Explore link:../ideas-for-customization[customization options] to adapt the self-healing platform to your environment
* Check the Grafana dashboards to monitor self-healing activity and model performance
* Review and extend the runbook library for your specific use cases
* Configure integrations with external systems like ServiceNow, PagerDuty, or Slack
