Add docs for decommissioning nodes with the operator

jhlodin · jhlodin · commit 12c889cd8cef · 2025-10-09T17:26:57.000-04:00
diff --git a/src/current/v25.2/scale-cockroachdb-operator.md b/src/current/v25.2/scale-cockroachdb-operator.md
@@ -104,3 +104,53 @@ Do not scale down to fewer than 3 nodes. This is considered an anti-pattern on C
     ~~~ shell
     kubectl get pods
     ~~~
+
+## Decommission nodes
+
+When a Kubernetes node is scheduled for removal or maintenance, the {{ site.data.products.cockroachdb-operator }} can be instructed to decommission the node, which safely moves data and workloads away before the node goes offline.
+
+{{site.data.alerts.callout_info}}
+The CockroachDB node begins decommissioning immediately once the annotation is applied; the annotation is not a mark for future removal. Once the node is decommissioned, it is cordoned so that no further pods are scheduled on it.
+
+If cluster capacity is limited, replacement pods may remain in the `Pending` state until new nodes are available. This is expected, as the operator prioritizes data safety and full replication over immediate scheduling.
+{{site.data.alerts.end}}
+
+The following prerequisites must be met before the {{ site.data.products.cockroachdb-operator }} can decommission a CockroachDB node:
+
+- The `--enable-k8s-node-controller=true` flag must be set in the operator's `.yaml` values file, for example:
+    {% include_cached copy-clipboard.html %}
+    ~~~ yaml
+    containers:
+        - name: cockroach-operator
+          image: {{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
+          args:
+            - "-enable-k8s-node-controller=true"
+    ~~~
+- The role-based access control system must be configured to allow the operator to patch nodes.
+- At least one replica of the operator must not be on the target node.
+- There must be no under-replicated ranges on the CockroachDB cluster.
+
+To mark a node for decommissioning, follow these steps:
+
+1. Identify the name of the Kubernetes node that is to be removed.
+
+1. Annotate the Kubernetes node with `crdb.cockroachlabs.com/decommission="true"`. The decommissioning process begins immediately after this annotation is applied. For example, using `kubectl`:
+
+    {% include_cached copy-clipboard.html %}
+    ~~~ shell
+    kubectl annotate node {example-node-name} crdb.cockroachlabs.com/decommission="true"
+    ~~~
+
+1. Monitor the cluster:
+    - Confirm the decommissioned node's cordoned status:
+      {% include_cached copy-clipboard.html %}
+      ~~~ shell
+      kubectl describe node {example-node-name}
+      ~~~
+    - Monitor operator events and logs for decommission start and completion messages:
+      {% include_cached copy-clipboard.html %}
+      ~~~ shell
+      kubectl logs {operator-pod-name}
+      ~~~
+
+If the replacement pods remain in a `Pending` state, this typically means there is not enough available capacity in the cluster for these pods to be scheduled.
diff --git a/src/current/v25.3/scale-cockroachdb-operator.md b/src/current/v25.3/scale-cockroachdb-operator.md
@@ -104,3 +104,53 @@ Do not scale down to fewer than 3 nodes. This is considered an anti-pattern on C
     ~~~ shell
     kubectl get pods
     ~~~
+
+## Decommission nodes
+
+When a Kubernetes node is scheduled for removal or maintenance, the {{ site.data.products.cockroachdb-operator }} can be instructed to decommission the node, which safely moves data and workloads away before the node goes offline.
+
+{{site.data.alerts.callout_info}}
+The CockroachDB node begins decommissioning immediately once the annotation is applied; the annotation is not a mark for future removal. Once the node is decommissioned, it is cordoned so that no further pods are scheduled on it.
+
+If cluster capacity is limited, replacement pods may remain in the `Pending` state until new nodes are available. This is expected, as the operator prioritizes data safety and full replication over immediate scheduling.
+{{site.data.alerts.end}}
+
+The following prerequisites must be met before the {{ site.data.products.cockroachdb-operator }} can decommission a CockroachDB node:
+
+- The `--enable-k8s-node-controller=true` flag must be set in the operator's `.yaml` values file, for example:
+    {% include_cached copy-clipboard.html %}
+    ~~~ yaml
+    containers:
+        - name: cockroach-operator
+          image: {{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
+          args:
+            - "-enable-k8s-node-controller=true"
+    ~~~
+- The role-based access control system must be configured to allow the operator to patch nodes.
+- At least one replica of the operator must not be on the target node.
+- There must be no under-replicated ranges on the CockroachDB cluster.
+
+To mark a node for decommissioning, follow these steps:
+
+1. Identify the name of the Kubernetes node that is to be removed.
+
+1. Annotate the Kubernetes node with `crdb.cockroachlabs.com/decommission="true"`. The decommissioning process begins immediately after this annotation is applied. For example, using `kubectl`:
+
+    {% include_cached copy-clipboard.html %}
+    ~~~ shell
+    kubectl annotate node {example-node-name} crdb.cockroachlabs.com/decommission="true"
+    ~~~
+
+1. Monitor the cluster:
+    - Confirm the decommissioned node's cordoned status:
+      {% include_cached copy-clipboard.html %}
+      ~~~ shell
+      kubectl describe node {example-node-name}
+      ~~~
+    - Monitor operator events and logs for decommission start and completion messages:
+      {% include_cached copy-clipboard.html %}
+      ~~~ shell
+      kubectl logs {operator-pod-name}
+      ~~~
+
+If the replacement pods remain in a `Pending` state, this typically means there is not enough available capacity in the cluster for these pods to be scheduled.
diff --git a/src/current/v25.4/scale-cockroachdb-operator.md b/src/current/v25.4/scale-cockroachdb-operator.md
@@ -104,3 +104,53 @@ Do not scale down to fewer than 3 nodes. This is considered an anti-pattern on C
     ~~~ shell
     kubectl get pods
     ~~~
+
+## Decommission nodes
+
+When a Kubernetes node is scheduled for removal or maintenance, the {{ site.data.products.cockroachdb-operator }} can be instructed to decommission the node, which safely moves data and workloads away before the node goes offline.
+
+{{site.data.alerts.callout_info}}
+The CockroachDB node begins decommissioning immediately once the annotation is applied; the annotation is not a mark for future removal. Once the node is decommissioned, it is cordoned so that no further pods are scheduled on it.
+
+If cluster capacity is limited, replacement pods may remain in the `Pending` state until new nodes are available. This is expected, as the operator prioritizes data safety and full replication over immediate scheduling.
+{{site.data.alerts.end}}
+
+The following prerequisites must be met before the {{ site.data.products.cockroachdb-operator }} can decommission a CockroachDB node:
+
+- The `--enable-k8s-node-controller=true` flag must be set in the operator's `.yaml` values file, for example:
+    {% include_cached copy-clipboard.html %}
+    ~~~ yaml
+    containers:
+        - name: cockroach-operator
+          image: {{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
+          args:
+            - "-enable-k8s-node-controller=true"
+    ~~~
+- The role-based access control system must be configured to allow the operator to patch nodes.
+- At least one replica of the operator must not be on the target node.
+- There must be no under-replicated ranges on the CockroachDB cluster.
+
+To mark a node for decommissioning, follow these steps:
+
+1. Identify the name of the Kubernetes node that is to be removed.
+
+1. Annotate the Kubernetes node with `crdb.cockroachlabs.com/decommission="true"`. The decommissioning process begins immediately after this annotation is applied. For example, using `kubectl`:
+
+    {% include_cached copy-clipboard.html %}
+    ~~~ shell
+    kubectl annotate node {example-node-name} crdb.cockroachlabs.com/decommission="true"
+    ~~~
+
+1. Monitor the cluster:
+    - Confirm the decommissioned node's cordoned status:
+      {% include_cached copy-clipboard.html %}
+      ~~~ shell
+      kubectl describe node {example-node-name}
+      ~~~
+    - Monitor operator events and logs for decommission start and completion messages:
+      {% include_cached copy-clipboard.html %}
+      ~~~ shell
+      kubectl logs {operator-pod-name}
+      ~~~
+
+If the replacement pods remain in a `Pending` state, this typically means there is not enough available capacity in the cluster for these pods to be scheduled.