Add readme #1
base: main

@@ -0,0 +1,18 @@
#### What type of PR is this?

<!--
Add one of the following kinds:
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind design
-->

#### What this PR does / why we need it:

#### Special notes for your reviewer:

@@ -1,2 +1,81 @@
# JARVIS
Machine auto healer!

## Problem

For a Kubernetes cluster to remain in a healthy state, all of its nodes must stay in a healthy, running state.

## Solution

- The machine auto healer operator will always try to keep the nodes (machines) in your cluster in a healthy, running state.
- It will perform periodic checks on the health state of each node (machine) in your cluster (a simplified version of this check is sketched below).
- If a node (machine) fails consecutive health checks over an extended time period, it will initiate a repair process for that node (machine).

![](./docs/images/machine_auto_healer.png)
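
A minimal sketch of the periodic check described above (not the operator's actual code), assuming a hypothetical `grace` period and package layout: a node counts as unhealthy once its `Ready` condition has not been `True` for longer than the grace period.

```go
package healer

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// unhealthyFor reports whether the node has failed its Ready check for longer
// than the given grace period, i.e. consecutive failed health checks over an
// extended time, as described above.
func unhealthyFor(node corev1.Node, grace time.Duration) bool {
	for _, c := range node.Status.Conditions {
		if c.Type != corev1.NodeReady {
			continue
		}
		notReady := c.Status != corev1.ConditionTrue
		return notReady && time.Since(c.LastTransitionTime.Time) > grace
	}
	// A node that never reported a Ready condition is treated as unhealthy.
	return true
}
```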
##### Node Conditions
- The `conditions` field describes the status of all running nodes.
- By describing any node we can see each NodeCondition and its respective status:
```bash
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:44:18 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:44:18 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:44:18 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:45:45 +0530   KubeletReady                 kubelet is posting ready status
```
- The following NodeConditions are supported by default in a Kubernetes cluster:

| ConditionType | Condition Status | Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
| Ready | True | - | |
| | False | NoExecute | node.kubernetes.io/not-ready |
| | Unknown | NoExecute | node.kubernetes.io/unreachable |
| OutOfDisk | True | NoSchedule | node.kubernetes.io/out-of-disk |
| | False | - | |
| | Unknown | - | |
| MemoryPressure | True | NoSchedule | node.kubernetes.io/memory-pressure |
| | False | - | |
| | Unknown | - | |
| DiskPressure | True | NoSchedule | node.kubernetes.io/disk-pressure |
| | False | - | |
| | Unknown | - | |
| NetworkUnavailable | True | NoSchedule | node.kubernetes.io/network-unavailable |
| | False | - | |
| | Unknown | - | |
| PIDPressure | True | NoSchedule | node.kubernetes.io/pid-pressure |
| | False | - | |
| | Unknown | - | |

> Review thread on this table:
> - One node can have several bad conditions at one time.
> - We can have either a …
> - I suggest a ConfigMap, because the conditions are not objects either.
> - Yes, conditions are not objects, but a ConditionSet can be an object with conditions, taint effect, taint key, etc. as properties. Or should we rename it to NodeConditionSet?
- `Node Problem Detector`
  - By default, Kubernetes supports only a limited set of NodeConditions.
  - We can use [node-problem-detector](https://github.com/kubernetes/node-problem-detector), which runs as a DaemonSet, collects different node problems, and reports them in the form of NodeConditions.

- Based on the NodeConditions we can apply taint effects such as:
  - `NoSchedule`: Does not allow new pods to be scheduled onto the node unless they tolerate the taint. Does not interrupt already running pods.
  - `PreferNoSchedule`: The scheduler tries not to schedule new pods onto the node.
  - `NoExecute`: Evicts any already-running pods that do not tolerate the taint.

## Node Taint Controller (NTC)
- The NTC control loop continuously watches the `node conditions` of all nodes and applies a taint based on the `condition` type, as sketched below.
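
A rough sketch of that control-loop step, assuming the default condition-to-taint mapping from the table above; writing the updated Node object back through the API is omitted, and the function names are illustrative only.

```go
package ntc

import corev1 "k8s.io/api/core/v1"

// taintForCondition maps a node condition to the taint the NTC would apply,
// following the default condition table above; it returns nil when no taint
// is needed for that condition/status pair.
func taintForCondition(c corev1.NodeCondition) *corev1.Taint {
	switch {
	case c.Type == corev1.NodeReady && c.Status == corev1.ConditionFalse:
		return &corev1.Taint{Key: "node.kubernetes.io/not-ready", Effect: corev1.TaintEffectNoExecute}
	case c.Type == corev1.NodeReady && c.Status == corev1.ConditionUnknown:
		return &corev1.Taint{Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoExecute}
	case c.Type == corev1.NodeMemoryPressure && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/memory-pressure", Effect: corev1.TaintEffectNoSchedule}
	case c.Type == corev1.NodeDiskPressure && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/disk-pressure", Effect: corev1.TaintEffectNoSchedule}
	case c.Type == corev1.NodePIDPressure && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/pid-pressure", Effect: corev1.TaintEffectNoSchedule}
	case c.Type == corev1.NodeNetworkUnavailable && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/network-unavailable", Effect: corev1.TaintEffectNoSchedule}
	default:
		return nil
	}
}

// ensureTaint adds the taint to the node's spec unless an identical
// key/effect pair is already present.
func ensureTaint(node *corev1.Node, t corev1.Taint) {
	for _, existing := range node.Spec.Taints {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			return
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, t)
}
```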

## Node Auto Healer Operator (NAHO)

> Review thread:
> - I think taints applied by the NTC should be distinguished from taints applied by other processes or humans.
> - Default node conditions have keys in the format …

- NAHO reconciles all Nodes, looks for the `taints` applied to each node, and applies the taint effect based on the taint type on that node.
- Next, it evicts all the pods from that Node (a simplified drain step is sketched after this list).
- Once all the pods are evicted from that Node, or the eviction exceeds the default eviction period, it deletes that Node (Machine) resource.
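
A simplified sketch of that drain step, assuming a hypothetical `PodEvictor` abstraction over the Kubernetes eviction API; DaemonSet/mirror-pod handling and retry/backoff are omitted.

```go
package naho

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// PodEvictor is an assumed interface; a real implementation would call the
// pod Eviction subresource through client-go.
type PodEvictor interface {
	PodsOnNode(ctx context.Context, nodeName string) ([]corev1.Pod, error)
	Evict(ctx context.Context, pod corev1.Pod) error
}

// drainNode evicts every pod running on the given node; once this succeeds
// (or the eviction period is exceeded), the caller deletes the Node/Machine
// resource as described above.
func drainNode(ctx context.Context, evictor PodEvictor, nodeName string) error {
	pods, err := evictor.PodsOnNode(ctx, nodeName)
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if err := evictor.Evict(ctx, pod); err != nil {
			return err
		}
	}
	return nil
}
```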

@@ -0,0 +1,117 @@

### Node Auto Healer Operator (NAHO)

##### NodeAutoHealer

- `NodeAutoHealer` is a CR of NAHO that defines how NAHO is enabled and configured to provide node auto healing.

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: NodeAutoHealer
metadata:
  name: node-auto-healer-1
spec:
  disableAutoHealing: false
  nodeSelector:
    matchLabels:
      type: small
    matchExpressions:
      - {key: tier, operator: In, values: [cache]}
      - {key: environment, operator: NotIn, values: [dev]}
  noScheduleThresholdLimit: 30mins
  parallelHealing:
    enable: true
    maxAllowedNodesToDrain: 20%
  forceDelete: true
status:
  state: active
  disabledAt: {LastTime when disabled}
```

- **kind**: NodeAutoHealer
- **spec**:
  - **disableAutoHealing**: A boolean flag that changes the state of NAHO. Setting it to `true` pauses auto healing.
  - **nodeSelector**: If provided, nodes can be filtered using `matchLabels` or `matchExpressions`, and only the matching nodes are considered for auto healing.
    - If not provided, all nodes within the cluster are considered for auto healing.
  - **noScheduleThresholdLimit**: Threshold on how long a node may remain in the `NoSchedule` state. If it exceeds the threshold, we can apply a `NoExecute` taint, which will evict all pods, after which the Node will be drained.
  - **parallelHealing**: If enabled, multiple nodes can be drained in parallel.
    - **maxAllowedNodesToDrain**: Maximum number of nodes that may remain in a drained state at any given time.
      - The value can be an absolute number (e.g. 5) or a percentage of the total nodes at that moment (e.g. 10%); see the sketch after this list.
  - **forceDelete**: If set to true, the node is deleted even if draining it fails.
- **status**:
  - **state**: {Active|Paused}, represents whether the auto healer is active or paused.
  - **disabledAt**: The last datetime at which auto healing was disabled.
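
A minimal sketch of how `maxAllowedNodesToDrain` could be resolved into an absolute node count; the function name and the round-down behaviour for percentages are assumptions.

```go
package naho

import (
	"strconv"
	"strings"
)

// maxDrainable converts the maxAllowedNodesToDrain value ("5" or "20%") into
// an absolute number of nodes, given the current total node count.
func maxDrainable(value string, totalNodes int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return totalNodes * pct / 100, nil
	}
	return strconv.Atoi(value)
}
```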

---

##### HealedNode

- `HealedNode` is a CR of NAHO that is created when a node matches a certain `ConditionSet` and requires recovery.
- Using this CR we can monitor the progress of a node's healing process (drain node -> delete machine -> monitor creation of the replacement node).

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: HealedNode
metadata:
  name: kind-control-plane
spec:
  nodeDetails:
    nodeName: kind-control-plane
    taints:
      - effect: NoSchedule
        key: key1
        value: value1
      - effect: NoExecute
        key: key1
        value: value1
    conditions:
      - lastHeartbeatTime: "2021-04-25T11:51:06Z"
        lastTransitionTime: "2021-04-25T11:36:05Z"
        message: Kubelet never posted node status.
        reason: NodeStatusNeverUpdated
        status: "Unknown"
        type: Ready
    addresses:
      - address: 172.18.0.2
        type: InternalIP
      - address: kind-control-plane
        type: Hostname
    nodeSystemInfo:
      architecture: amd64
      bootID: 3b622bbf-a04c-4a50-81d9-7afb89502684
      containerRuntimeVersion: containerd://1.4.0-106-gce4439a8
      kernelVersion: 5.8.0-50-generic
      kubeProxyVersion: v1.20.2
      kubeletVersion: v1.20.2
      machineID: bed729392962410587918db70d475183
      operatingSystem: linux
      osImage: Ubuntu 20.10
      systemUUID: 1e3f9f51-2c77-473d-a173-3e095e6e652c
  matchedConditionSets:
    - name: NetworkUnavailable
      appliedAction: {drainNode|deleteNode}
status:
  currentState: {draining|drained|deleting|deleted|recovering|recovered}
  lastStateChangeTime:
  isHealingProcessStable: true
```

- **kind**: HealedNode
- **spec**:
  - **nodeDetails**: Details of the target node, such as its name, addresses, system info, taints, conditions, etc.
  - **matchedConditionSets**: The matched ConditionSets, which show the cause of the node being unhealthy.
    - **appliedAction**: Once a node becomes unhealthy, it should be drained first and then deleted, so two action types are supported: `drainNode` and `deleteNode`.
- **status**:
  - **currentState**: (a sketch of these transitions follows this list)
    - Once the `drainNode` action is applied, the node goes through these 2 states:
      - `draining`: pod eviction is in progress.
      - `drained`: pod eviction has completed.
    - After draining completes, the Machine object is deleted and goes through these 2 states:
      - `deleting`: the machine object is being deleted, once the `HealedNode`'s `currentState` becomes `drained`.
      - `deleted`: deletion of the machine object has completed.
    - Once the machine object has been deleted, the node goes through these 2 states:
      - `recovering`: before deleting the machine object, we store the total number of nodes; once the delete completes, we keep checking whether the current node count equals the previous count.
      - `recovered`: once the current node count matches the previous count, the state becomes `recovered`.
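
A sketch of the state progression described above; the state names come from the CR example, while the explicit transition function is an assumption about how the controller could advance `currentState`.

```go
package naho

// HealState mirrors the currentState values of a HealedNode.
type HealState string

const (
	StateDraining   HealState = "draining"
	StateDrained    HealState = "drained"
	StateDeleting   HealState = "deleting"
	StateDeleted    HealState = "deleted"
	StateRecovering HealState = "recovering"
	StateRecovered  HealState = "recovered"
)

// nextState returns the state that follows the current one, or "" once the
// healing process has completed.
func nextState(s HealState) HealState {
	switch s {
	case StateDraining:
		return StateDrained
	case StateDrained:
		return StateDeleting
	case StateDeleting:
		return StateDeleted
	case StateDeleted:
		return StateRecovering
	case StateRecovering:
		return StateRecovered
	default:
		return ""
	}
}
```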

@@ -0,0 +1,161 @@

##### Node Conditions

- Apart from the default conditions supported natively by Kubernetes, `NodeProblemDetector` provides additional NodeConditions based on the `ProblemDaemon`s enabled with NodeProblemDetector.

- Kubernetes native NodeConditions:

| ConditionType | Condition Status | Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
| Ready | True | - | |
| | False | NoExecute | node.kubernetes.io/not-ready |
| | Unknown | NoExecute | node.kubernetes.io/unreachable |
| OutOfDisk | True | NoSchedule | node.kubernetes.io/out-of-disk |
| | False | - | |
| | Unknown | - | |
| MemoryPressure | True | NoSchedule | node.kubernetes.io/memory-pressure |
| | False | - | |
| | Unknown | - | |
| DiskPressure | True | NoSchedule | node.kubernetes.io/disk-pressure |
| | False | - | |
| | Unknown | - | |
| NetworkUnavailable | True | NoSchedule | node.kubernetes.io/network-unavailable |
| | False | - | |
| | Unknown | - | |
| PIDPressure | True | NoSchedule | node.kubernetes.io/pid-pressure |
| | False | - | |
| | Unknown | - | |

- `NodeProblemDetector` (NPD) supported NodeConditions:
  - NPD only patches Nodes with conditions; it does not apply taints on Nodes. We will have to decide on the effect and taint key for each of the following conditions supported by NPD.

- `ntp-custom-plugin-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | NTPProblem | True | | |
  | | False | | |
  | | Unknown | | |

- `docker-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | CorruptDockerOverlay2 | True | | |
  | | False | | |
  | | Unknown | | |

- `Health-checker-containerd, docker`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | ContainerRuntimeUnhealthy | True | | |
  | | False | | |
  | | Unknown | | |

- `Health-checker-kubelet`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | KubeletUnhealthy | True | | |
  | | False | | |
  | | Unknown | | |

- `kernel-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | KernelDeadlock | True | | |
  | | False | | |
  | | Unknown | | |
  | ReadonlyFilesystem | True | | |
  | | False | | |
  | | Unknown | | |
  | FrequentUnregisterNetDevice | True | | |
  | | False | | |
  | | Unknown | | |

- `systemd-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | FrequentKubeletRestart | True | | |
  | | False | | |
  | | Unknown | | |
  | FrequentDockerRestart | True | | |
  | | False | | |
  | | Unknown | | |
  | FrequentContainerdRestart | True | | |
  | | False | | |
  | | Unknown | | |

---

##### One node can have several bad conditions at the same time. We need new condition types that can be a combination of multiple condition types.
- For cases where a single condition is not sufficient to mark a node as unhealthy, we can support a new type called `ConditionSet`.

##### Approach 1

##### ConditionSet as a ConfigMap

- A `ConditionSet` is a combination of one or more conditions.
- Sample ConfigMap:
```yaml
type: ConditionSets
conditionSets:
  - type: KubeletContainerRuntimeUnhealthy
    effect: NoExecute
    taintKey: node.stakater.com/KubeletContainerRuntimeUnhealthy
    conditions:
      - ConditionType: KubeletUnhealthy
        conditionStatus: true
      - ConditionType: ContainerRuntimeUnhealthy
        conditionStatus: Unknown
  - type: KernelDeadlock
    effect: NoExecute
    taintKey: node.stakater.com/KernelDeadlock
    conditions:
      - ConditionType: KernelDeadlock
        conditionStatus: true
```
- If a Node's conditions match any of the ConditionSets, the corresponding effect is applied.
- If a Node's conditions match multiple ConditionSets, the highest-level effect (`NoExecute > NoSchedule > PreferNoSchedule`) is applied; a sketch of this matching logic follows.
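
An illustrative sketch of that matching rule (the Go type and function names are assumptions): a node matches a ConditionSet only when every listed condition is present with the listed status, and the strongest effect among all matching sets wins.

```go
package conditionset

import corev1 "k8s.io/api/core/v1"

// ConditionSet is an in-memory form of one entry from the ConfigMap above.
type ConditionSet struct {
	Type       string
	Effect     corev1.TaintEffect
	TaintKey   string
	Conditions map[corev1.NodeConditionType]corev1.ConditionStatus
}

// effectRank encodes NoExecute > NoSchedule > PreferNoSchedule.
var effectRank = map[corev1.TaintEffect]int{
	corev1.TaintEffectPreferNoSchedule: 1,
	corev1.TaintEffectNoSchedule:       2,
	corev1.TaintEffectNoExecute:        3,
}

// matches reports whether every condition in the set is present on the node
// with the expected status.
func matches(node corev1.Node, set ConditionSet) bool {
	for wantType, wantStatus := range set.Conditions {
		found := false
		for _, c := range node.Status.Conditions {
			if c.Type == wantType && c.Status == wantStatus {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// strongestEffect returns the highest-ranked effect among all matching sets,
// and false when no set matches.
func strongestEffect(node corev1.Node, sets []ConditionSet) (corev1.TaintEffect, bool) {
	var best corev1.TaintEffect
	matched := false
	for _, s := range sets {
		if matches(node, s) && effectRank[s.Effect] > effectRank[best] {
			best, matched = s.Effect, true
		}
	}
	return best, matched
}
```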

##### Approach 2

##### ConditionSet as a new CustomResource type

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: ConditionSet
metadata:
  name: conditionset-1
spec:
  type: KubeletContainerRuntimeUnhealthy
  effect: NoExecute
  taintKey: node.stakater.com/KubeletContainerRuntimeUnhealthy
  conditions:
    KubeletUnhealthy:
      status: true
    ContainerRuntimeUnhealthy:
      status: unknown
---
apiVersion: autohealer.stakater.com/v1alpha1
kind: ConditionSet
metadata:
  name: conditionset-2
spec:
  type: KernelDeadlock
  effect: NoExecute
  taintKey: node.stakater.com/KernelDeadlock
  conditions:
    KernelDeadlock:
      status: true
```

- The advantage of having `ConditionSet` as a new resource type is that we can apply a [Validating Webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook) that validates each `ConditionSet`.
- For example, since we are giving users the power to configure conditions and their effects: if `effect` were set to `NoExecute` for condition type `OutOfDisk` with condition status `false` instead of `true`, the operator would start evicting all healthy nodes in the cluster and could take down the whole cluster if not noticed.
- So, by having a validation webhook, we can make sure that a `condition` which could impact the cluster does not get configured (a sketch of such a check follows).
- Another advantage is avoiding duplicate condition entries and rejecting unsupported condition types.
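
A hedged sketch of the kind of check the validating webhook could run (the rule set, struct, and function names are assumptions; the `ConditionSet` struct mirrors the CR spec above): it rejects a set that would taint nodes on a healthy condition value, such as the `OutOfDisk=false` example above.

```go
package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// ConditionSet mirrors the spec of the ConditionSet CR shown above.
type ConditionSet struct {
	Type       string
	Effect     corev1.TaintEffect
	Conditions map[corev1.NodeConditionType]corev1.ConditionStatus
}

// healthyStatus lists, per condition type, the status that means the node is
// healthy; tainting on these values would cordon or evict healthy nodes.
var healthyStatus = map[corev1.NodeConditionType]corev1.ConditionStatus{
	corev1.NodeReady:              corev1.ConditionTrue,
	corev1.NodeMemoryPressure:     corev1.ConditionFalse,
	corev1.NodeDiskPressure:       corev1.ConditionFalse,
	corev1.NodePIDPressure:        corev1.ConditionFalse,
	corev1.NodeNetworkUnavailable: corev1.ConditionFalse,
}

// validateConditionSet rejects sets whose taint effect is attached to a
// healthy condition value.
func validateConditionSet(set ConditionSet) error {
	if set.Effect == "" {
		return fmt.Errorf("conditionSet %q has no taint effect", set.Type)
	}
	for condType, status := range set.Conditions {
		if healthy, ok := healthyStatus[condType]; ok && status == healthy {
			return fmt.Errorf("conditionSet %q would taint nodes on healthy condition %s=%s",
				set.Type, condType, status)
		}
	}
	return nil
}
```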

> Review thread:
> - How should a `NoSchedule`-tainted node be processed? Maybe set a threshold on the status duration and then drain the node as well?
> - We can have a threshold time limit on how long a node can stay in the `NoSchedule` state. If it exceeds the threshold, we can apply a `NoExecute` taint, which will evict all pods, and later the Node will be drained. Using the `LastTransitionTime` of the Node condition that caused the `NoSchedule` taint, we can know whether there is any change in that specific Node's condition state. If there is no change and `CurrentTime - LastTransitionTime` exceeds the threshold time, the Node will get evicted and drained.
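
A minimal sketch of the escalation rule proposed in this thread, assuming the triggering condition is already known; reading `noScheduleThresholdLimit` from the NodeAutoHealer spec and actually applying the `NoExecute` taint are left to the caller.

```go
package naho

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldEscalateToNoExecute reports whether the condition that caused the
// NoSchedule taint has stayed unchanged for longer than the threshold, i.e.
// CurrentTime - LastTransitionTime > noScheduleThresholdLimit.
func shouldEscalateToNoExecute(cond corev1.NodeCondition, threshold time.Duration, now time.Time) bool {
	return now.Sub(cond.LastTransitionTime.Time) > threshold
}
```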