Add readme #1

Open · wants to merge 9 commits into main
18 changes: 18 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,18 @@

#### What type of PR is this?

<!--
Add one of the following kinds:
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind design
-->

#### What this PR does / why we need it:


#### Special notes for your reviewer:


81 changes: 80 additions & 1 deletion README.md
@@ -1,2 +1,81 @@
# jarvis
# JARVIS
Machine auto healer!



## Problem



For a Kubernetes cluster to remain in a healthy state, all of its nodes must remain in a healthy, running state.


## Solution



- The machine auto healer operator will always try to keep the nodes (machines)
in your cluster in a healthy, running state.
- It will perform periodic checks on the health state of each node (machine) in your cluster.
- If a node (machine) fails consecutive health checks over an extended time period,
it will initiate a repair process for that node (machine).



![](./docs/images/machine_auto_healer.png)

##### Node Conditions
- The `conditions` field describes the status of all running nodes.
- By describing any node (`kubectl describe node <node-name>`) we can see each NodeCondition and its respective status:
```bash
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sat, 10 Apr 2021 02:23:21 +0530 Fri, 09 Apr 2021 15:44:18 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 10 Apr 2021 02:23:21 +0530 Fri, 09 Apr 2021 15:44:18 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 10 Apr 2021 02:23:21 +0530 Fri, 09 Apr 2021 15:44:18 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 10 Apr 2021 02:23:21 +0530 Fri, 09 Apr 2021 15:45:45 +0530 KubeletReady kubelet is posting ready status

```
- The following NodeConditions are supported by default in a K8s cluster.


| ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|Ready |True | - | |
| |False | NoExecute | node.kubernetes.io/not-ready |
| |Unknown | NoExecute | node.kubernetes.io/unreachable |
|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk |

**Reviewer:**

How should a node tainted with `NoSchedule` be processed?
Maybe set a threshold for how long the status lasts and also drain the node?

**@tanalam2411 (Collaborator, Author) · Apr 22, 2021:**

We can have a threshold time limit on how long a node can remain in the `NoSchedule` state. If it exceeds the threshold, we can apply the `NoExecute` taint, which will evict all pods; the node will then be drained.
Using the `LastTransitionTime` of the node condition that caused the `NoSchedule` taint, we can tell whether that specific condition has changed. If there is no change and `CurrentTime - LastTransitionTime` exceeds the threshold, the node will be evicted and drained.

| |False | - | |
| |Unknown | - | |
|MemoryPressure |True | NoSchedule | node.kubernetes.io/memory-pressure |
| |False | - | |
| |Unknown | - | |
|DiskPressure |True | NoSchedule | node.kubernetes.io/disk-pressure |
| |False | - | |
| |Unknown | - | |
|NetworkUnavailable |True | NoSchedule | node.kubernetes.io/network-unavailable |

**Reviewer:**

One node can have several bad conditions at the same time.
Maybe we need new condition types that are a mix of the basic condition types?

**@tanalam2411 (Collaborator, Author):**

We can have either a ConfigMap or a new resource type, ConditionSet, which would be a collection of one or more conditions.
I have described the two approaches here; please suggest the preferred one.

**Reviewer:**

I suggest a ConfigMap because the conditions themselves are not objects either.

**@tanalam2411 (Collaborator, Author):**

Yes, conditions are not objects, but a ConditionSet can be an object with conditions, taint effect, taint key, etc. as properties. Or should we rename it to NodeConditionSet?
Please suggest; if ConditionSet doesn't make sense as an object, I will revert and use a ConfigMap.

| |False | - | |
| |Unknown | - | |
|PIDPressure |True | NoSchedule | node.kubernetes.io/pid-pressure |
| |False | - | |
| |Unknown | - | |


- `Node Problem Detector`
- By default, k8s supports a limited set of NodeConditions.
- We can use [node-problem-detector](https://github.com/kubernetes/node-problem-detector), which runs as a DaemonSet, collects different node problems, and reports them in the form of NodeConditions.

- Based on the NodeConditions, we can apply taint effects such as:
- `NoSchedule`: Does not allow new pods to schedule onto the node unless they tolerate the taint. Does not interrupt already running pods.
- `PreferNoSchedule`: Scheduler tries not to schedule new pods onto the node.
- `NoExecute`: Evicts any already-running pods that do not tolerate the taint.


## Node Taint Controller (NTC)
- The NTC control loop will continuously watch the `node conditions` of all nodes and apply taints based on the `condition` type, as sketched below.
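
As a rough illustration of the NTC loop (not code from this repo), the sketch below maps the default (condition type, status) pairs from the table above to their taints using the `k8s.io/api/core/v1` types; `conditionTaints` and `desiredTaints` are hypothetical names, and actually patching the node is left out.

```go
package ntc

import (
	corev1 "k8s.io/api/core/v1"
)

// conditionTaints maps a (condition type, status) pair from the table above
// to the taint the NTC would apply. Illustrative subset only.
var conditionTaints = map[corev1.NodeConditionType]map[corev1.ConditionStatus]corev1.Taint{
	corev1.NodeReady: {
		corev1.ConditionFalse:   {Key: "node.kubernetes.io/not-ready", Effect: corev1.TaintEffectNoExecute},
		corev1.ConditionUnknown: {Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoExecute},
	},
	corev1.NodeMemoryPressure: {
		corev1.ConditionTrue: {Key: "node.kubernetes.io/memory-pressure", Effect: corev1.TaintEffectNoSchedule},
	},
	corev1.NodeDiskPressure: {
		corev1.ConditionTrue: {Key: "node.kubernetes.io/disk-pressure", Effect: corev1.TaintEffectNoSchedule},
	},
	corev1.NodePIDPressure: {
		corev1.ConditionTrue: {Key: "node.kubernetes.io/pid-pressure", Effect: corev1.TaintEffectNoSchedule},
	},
	corev1.NodeNetworkUnavailable: {
		corev1.ConditionTrue: {Key: "node.kubernetes.io/network-unavailable", Effect: corev1.TaintEffectNoSchedule},
	},
}

// desiredTaints returns the taints the NTC would want on a node given its
// current conditions; applying them (e.g. via a patch) is left to the controller.
func desiredTaints(node *corev1.Node) []corev1.Taint {
	var taints []corev1.Taint
	for _, cond := range node.Status.Conditions {
		if byStatus, ok := conditionTaints[cond.Type]; ok {
			if taint, ok := byStatus[cond.Status]; ok {
				taints = append(taints, taint)
			}
		}
	}
	return taints
}
```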

## Node Auto Healer Operator (NAHO)

**Reviewer:**

I think taints applied by the NTC should be distinguishable from taints applied by other processes or humans.

**@tanalam2411 (Collaborator, Author):**

Default node conditions have taint keys in the format `node.kubernetes.io/{{condition}}={{value}}:{{effect}}`. For all the other conditions we can use the same format, either `node.stakater.com/{{condition}}={{value}}:{{effect}}` (e.g. ref) or `NTC/{{condition}}={{value}}:{{effect}}`, or please suggest another.

- NAHO will reconcile all Nodes, look for the `taints` applied to each node, and act based on the taint type on that node.
- Next, it will evict all the pods from that Node.
- Once all the pods are evicted from that Node, or the default eviction period is exceeded, it will delete that Node (Machine) resource.
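
A minimal, hedged Go sketch of the drain-then-delete flow described above, assuming client-go and the `policy/v1` Eviction API; `healNode` is an illustrative helper name, and handling of DaemonSet pods, PodDisruptionBudgets, and the eviction-period timeout is omitted.

```go
package naho

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// healNode sketches the drain-then-delete flow: evict every pod scheduled on
// the node, then delete the Node object so the machine can be replaced.
func healNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// List all pods currently scheduled on the target node.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return fmt.Errorf("listing pods on %s: %w", nodeName, err)
	}

	// Evict each pod via the policy/v1 eviction subresource.
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return fmt.Errorf("evicting %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}

	// Once all pods are evicted (or the eviction period is exceeded),
	// delete the Node; the machine controller is expected to replace it.
	return client.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
}
```
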
Binary file added docs/images/machine_auto_healer.png
117 changes: 117 additions & 0 deletions docs/node-auto-healer-operator.md
@@ -0,0 +1,117 @@


### Node Auto Healer Operator (NAHO)

##### NodeAutoHealer

- `NodeAutoHealer` is a custom resource (CR) of NAHO, which defines how node auto healing is enabled and configured.
```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: NodeAutoHealer
metadata:
name: node-auto-healer-1
spec:
disableAutoHealing: false
nodeSelector:
matchLabels:
type: small
matchExpressions:
- {key: tier, operator: In, values: [cache]}
- {key: environment, operator: NotIn, values: [dev]}
noScheduleThresholdLimit: 30mins
parallelHealing:
enable: true
maxAllowedNodesToDrain: 20%
forceDelete: true
status:
state: active
disabledAt: {LastTime when disabled}
```
- **kind**: NodeAutoHealer
- **spec**:
- **disableAutoHealing**: A boolean flag to change the state of NAHO; setting it to `true` pauses auto healing.
- **nodeSelector**: If provided, nodes are filtered using `matchLabels` or `matchExpressions`, and only matching nodes are considered for auto healing.
- If not provided, all nodes within the cluster are considered for auto healing.
- **noScheduleThresholdLimit**: Threshold time limit on how long a node can remain in the `NoSchedule` state. If it exceeds the threshold, the `NoExecute` taint is applied, which evicts all pods; the node is then drained (see the sketch after this list).
- **parallelHealing**: If enabled, multiple nodes can be drained in parallel.
- **maxAllowedNodesToDrain**: Maximum number of nodes that can remain in a drained state at any given time.
- The value can be an absolute number (e.g. 5) or a percentage of the total nodes at that moment (e.g. 10%).
- **forceDelete**: If set to true, the node is deleted even if draining it fails.
- **status**:
- **state**: {Active|Paused}, represents whether the auto healer is active or paused.
- **disabledAt**: The last datetime when auto healing was disabled.
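
A minimal Go sketch of the `noScheduleThresholdLimit` check described above, using the `k8s.io/api` types; `exceededNoScheduleThreshold` is an illustrative helper name, not an existing function in this repo.

```go
package naho

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// exceededNoScheduleThreshold reports whether the node condition that caused
// the NoSchedule taint has stayed unchanged for longer than the configured
// threshold, in which case the operator would escalate to NoExecute and drain.
func exceededNoScheduleThreshold(cond corev1.NodeCondition, threshold time.Duration) bool {
	// LastTransitionTime only changes when the condition's status changes, so
	// CurrentTime - LastTransitionTime is how long the node has been in this state.
	return time.Since(cond.LastTransitionTime.Time) > threshold
}
```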

---

##### HealedNode

- `HealedNode` is a custom resource (CR) of NAHO, which is created when a node matches a `ConditionSet` and requires recovery.
- Using this CR we can monitor the progress of a node's healing process (drain node → delete machine → monitor creation of the replacement node).
```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: HealedNode
metadata:
name: kind-control-plane
spec:
nodeDetails:
nodeName: kind-control-plane
taints:
- effect: NoSchedule
key: key1
value: value1
- effect: NoExecute
key: key1
value: value1
conditions:
- lastHeartbeatTime: "2021-04-25T11:51:06Z"
lastTransitionTime: "2021-04-25T11:36:05Z"
message: Kubelet never posted node status.
reason: NodeStatusNeverUpdated
status: "Unknown"
type: Ready
addresses:
- address: 172.18.0.2
type: InternalIP
- address: kind-control-plane
type: Hostname
nodeSystemInfo:
architecture: amd64
bootID: 3b622bbf-a04c-4a50-81d9-7afb89502684
containerRuntimeVersion: containerd://1.4.0-106-gce4439a8
kernelVersion: 5.8.0-50-generic
kubeProxyVersion: v1.20.2
kubeletVersion: v1.20.2
machineID: bed729392962410587918db70d475183
operatingSystem: linux
osImage: Ubuntu 20.10
systemUUID: 1e3f9f51-2c77-473d-a173-3e095e6e652c
matchedConditionSets:
- name: NetworkUnavailable
appliedAction: {drainNode|deleteNode}
status:
currentState: {draining|drained|deleting|deleted|recovering|recovered}
lastStateChangeTime:
isHealingProcessStable: true
```

- **kind**: HealedNode
- **spec**:
- **nodeDetails**: The target node's details, such as its name, addresses, system info, taints, conditions, etc.
- **matchedConditionSets**: The matched ConditionSets that show why the node is considered unhealthy.
- **appliedAction**: Once a node becomes unhealthy, it should be drained first and then deleted.
- So we will support two action types: `drainNode` and `deleteNode`.
- **status**:
- **currentState**:
- Once the `drainNode` action is applied, the node goes through these 2 states:
- `draining`: the pod eviction process is running.
- `drained`: pod eviction has completed.
- After the draining process completes, the Machine object is deleted and goes through these 2 states:
- `deleting`: the machine object is being deleted, once the `HealedNode`'s `currentState` becomes `drained`.
- `deleted`: deletion of the machine object has completed.
- Once the machine object has been deleted, it goes through these 2 states:
- `recovering`: before deleting the machine object, we store the total number of nodes.
- Then, once the delete completes, we keep checking whether the current total number of nodes equals the previous total.
- `recovered`: once the current node count matches the stored previous count, the state becomes `recovered`.
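
For illustration, a hedged Go sketch of the `currentState` progression above; the `HealingState` type, constants, and `nextState` helper are hypothetical names, and in the real operator each transition would additionally be gated on the corresponding operation (eviction, machine deletion, node re-registration) actually completing.

```go
package naho

// HealingState mirrors the currentState values listed above.
type HealingState string

const (
	StateDraining   HealingState = "draining"
	StateDrained    HealingState = "drained"
	StateDeleting   HealingState = "deleting"
	StateDeleted    HealingState = "deleted"
	StateRecovering HealingState = "recovering"
	StateRecovered  HealingState = "recovered"
)

// nextState advances to the next state in the sequence; only the
// recovering -> recovered transition is gated here, on the node count
// returning to the count recorded before the machine was deleted.
func nextState(current HealingState, nodesBeforeDelete, nodesNow int) HealingState {
	switch current {
	case StateDraining:
		return StateDrained
	case StateDrained:
		return StateDeleting
	case StateDeleting:
		return StateDeleted
	case StateDeleted:
		return StateRecovering
	case StateRecovering:
		if nodesNow >= nodesBeforeDelete {
			return StateRecovered
		}
	}
	return current
}
```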

161 changes: 161 additions & 0 deletions docs/node-conditions.md
@@ -0,0 +1,161 @@

##### Node Conditions

- Apart from the default conditions supported by native k8s, `NodeProblemDetector` provides additional NodeConditions based
on the `ProblemDaemon`s enabled in NodeProblemDetector.

- K8s Native NodeConditions:

| ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|Ready |True | - | |
| |False | NoExecute | node.kubernetes.io/not-ready |
| |Unknown | NoExecute | node.kubernetes.io/unreachable |
|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk |
| |False | - | |
| |Unknown | - | |
|MemoryPressure |True | NoSchedule | node.kubernetes.io/memory-pressure |
| |False | - | |
| |Unknown | - | |
|DiskPressure |True | NoSchedule | node.kubernetes.io/disk-pressure |
| |False | - | |
| |Unknown | - | |
|NetworkUnavailable |True | NoSchedule | node.kubernetes.io/network-unavailable |
| |False | - | |
| |Unknown | - | |
|PIDPressure |True | NoSchedule | node.kubernetes.io/pid-pressure |
| |False | - | |
| |Unknown | - | |


- NodeConditions supported by `NodeProblemDetector` (NPD):
  - NPD only patches Nodes with conditions; it doesn't apply taints to Nodes.
We will have to decide on the effect and taint key for each of the following conditions supported by NPD.

- `ntp-custom-plugin-monitor`
- | ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|NTPProblem |True | | |
| |False | | |
| |Unknown | | |

- `docker-monitor`
- | ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|CorruptDockerOverlay2 |True | | |
| |False | | |
| |Unknown | | |

- `Health-checker-containerd, docker`
- | ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|ContainerRuntimeUnhealthy |True | | |
| |False | | |
| |Unknown | | |

- `Health-checker-kubelet`
- | ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|KubeletUnhealthy |True | | |
| |False | | |
| |Unknown | | |

- `kernel-monitor`
- | ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|KernelDeadlock |True | | |
| |False | | |
| |Unknown | | |
|ReadonlyFilesystem |True | | |
| |False | | |
| |Unknown | | |
|FrequentUnregisterNetDevice |True | | |
| |False | | |
| |Unknown | | |

- `systemd-monitor`
- | ConditionType | Condition Status |Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
|FrequentKubeletRestart |True | | |
| |False | | |
| |Unknown | | |
|FrequentDockerRestart |True | | |
| |False | | |
| |Unknown | | |
|FrequentContainerdRestart |True | | |
| |False | | |
| |Unknown | | |


---

##### One node can have several bad conditions at the same time. We need new condition types that are combinations of multiple condition types.
- In cases where a single condition is not sufficient to mark a node as unhealthy, we can support a new type, `ConditionSet`.

##### Approach 1

##### ConditionSet as ConfigMap

- A `ConditionSet` would be a combination of one or more conditions.
- Sample ConfigMap data:
```yaml
type: ConditionSets
conditionSets:
- type: KubeletContainerRuntimeUnhealthy
effect: NoExecute
taintKey: node.stakater.com/KubeletContainerRuntimeUnhealthy
conditions:
- ConditionType: KubeletUnhealthy
conditionStatus: true
- ConditionType: ContainerRuntimeUnhealthy
conditionStatus: Unknown
- type: KernelDeadlock
effect: NoExecute
taintKey: node.stakater.com/KernelDeadlock
conditions:
- ConditionType: KernelDeadlock
conditionStatus: true
```
- If a Node's conditions match any ConditionSet, the corresponding effect is applied.
- If a Node's conditions match multiple ConditionSets, the highest-severity effect (`NoExecute` > `NoSchedule` > `PreferNoSchedule`) is applied, as sketched below.
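
A minimal Go sketch, under the assumption that ConditionSets are loaded from the ConfigMap data above into an in-memory slice, of how matching and the `NoExecute > NoSchedule > PreferNoSchedule` precedence could be evaluated; the `ConditionSet` struct and helper names are illustrative, not the repo's actual types.

```go
package conditionsets

import (
	corev1 "k8s.io/api/core/v1"
)

// ConditionSet mirrors one entry of the ConfigMap data above: a node matches
// the set only if every listed (type, status) pair is present on the node.
type ConditionSet struct {
	Type       string
	Effect     corev1.TaintEffect
	TaintKey   string
	Conditions map[corev1.NodeConditionType]corev1.ConditionStatus
}

// effectSeverity encodes NoExecute > NoSchedule > PreferNoSchedule.
var effectSeverity = map[corev1.TaintEffect]int{
	corev1.TaintEffectNoExecute:        3,
	corev1.TaintEffectNoSchedule:       2,
	corev1.TaintEffectPreferNoSchedule: 1,
}

// matches reports whether all conditions in the set hold on the node.
func (cs ConditionSet) matches(node *corev1.Node) bool {
	current := map[corev1.NodeConditionType]corev1.ConditionStatus{}
	for _, c := range node.Status.Conditions {
		current[c.Type] = c.Status
	}
	for condType, wantStatus := range cs.Conditions {
		if current[condType] != wantStatus {
			return false
		}
	}
	return true
}

// strongestEffect returns the highest-severity effect among all ConditionSets
// matched by the node, implementing the precedence rule above.
func strongestEffect(node *corev1.Node, sets []ConditionSet) (corev1.TaintEffect, bool) {
	var best corev1.TaintEffect
	found := false
	for _, cs := range sets {
		if cs.matches(node) && effectSeverity[cs.Effect] > effectSeverity[best] {
			best, found = cs.Effect, true
		}
	}
	return best, found
}
```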


##### Approach 2

##### ConditionSet as a new CustomResource type

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: ConditionSet
metadata:
name: conditionset-1
spec:
type: KubeletContainerRuntimeUnhealthy
effect: NoExecute
taintKey: node.stakater.com/KubeletContainerRuntimeUnhealthy
conditions:
KubeletUnhealthy:
status: true
ContainerRuntimeUnhealthy:
status: unknown
---
apiVersion: autohealer.stakater.com/v1alpha1
kind: ConditionSet
metadata:
name: conditionset-2
spec:
type: KernelDeadlock
effect: NoExecute
taintKey: node.stakater.com/KernelDeadlock
conditions:
KernelDeadlock:
status: true
```
- The advantage of having `ConditionSet` as a new resource type is that we can apply a [Validation Webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook),
which can validate each `ConditionSet`.
- For example, since we are giving users the power to configure conditions and their effects:
if `effect` were set to `NoExecute` for condition type `OutOfDisk` with condition status `false` instead of `true`,
it would start evicting pods from every healthy node in the cluster and could take down the whole cluster if not noticed.
- So by having a validation webhook, we can make sure that a `condition` configuration which could impact the cluster never gets admitted.
- Another advantage is avoiding duplicate condition entries and rejecting unsupported condition types.
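
A hedged sketch of the kind of check such a validating webhook could run; `validateConditionSet` and the set of "healthy" statuses it guards against are assumptions for illustration, not the repo's actual webhook logic.

```go
package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validateConditionSet rejects a ConditionSet whose (condition, status) pairs
// describe a healthy node while its effect is disruptive, such as the
// OutOfDisk=false with NoExecute example above.
func validateConditionSet(effect corev1.TaintEffect, conditions map[corev1.NodeConditionType]corev1.ConditionStatus) error {
	// Statuses that describe a healthy node for the default "pressure"-style
	// conditions; pairing them with a disruptive effect is almost certainly a
	// misconfiguration.
	healthy := map[corev1.NodeConditionType]corev1.ConditionStatus{
		"OutOfDisk":                   corev1.ConditionFalse,
		corev1.NodeMemoryPressure:     corev1.ConditionFalse,
		corev1.NodeDiskPressure:       corev1.ConditionFalse,
		corev1.NodePIDPressure:        corev1.ConditionFalse,
		corev1.NodeNetworkUnavailable: corev1.ConditionFalse,
	}
	// Only disruptive effects need this guard.
	if effect != corev1.TaintEffectNoExecute && effect != corev1.TaintEffectNoSchedule {
		return nil
	}
	for condType, status := range conditions {
		if healthyStatus, ok := healthy[condType]; ok && status == healthyStatus {
			return fmt.Errorf("condition %s=%s describes a healthy node; refusing to pair it with effect %s",
				condType, status, effect)
		}
	}
	return nil
}
```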