Add readme #1
base: main

@@ -0,0 +1,18 @@
#### What type of PR is this?

<!--
Add one of the following kinds:
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind design
-->

#### What this PR does / why we need it:

#### Special notes for your reviewer:

@@ -1,2 +1,81 @@
# JARVIS
Machine auto healer!

## Problem

For a Kubernetes cluster to remain in a healthy state, all of its nodes must stay in a healthy, running state.

## Solution

- The machine auto healer operator will always try to keep the nodes (machines) in your cluster in a healthy, running state.
- It will perform periodic checks on the health state of each node (machine) in your cluster (a simplified version of this check is sketched below).
- If a node (machine) fails consecutive health checks over an extended time period, it will initiate a repair process for that node (machine).

![](./docs/images/machine_auto_healer.png)
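
A minimal sketch of the periodic check described above (not the operator's actual code), assuming a hypothetical `grace` period and package layout: a node counts as unhealthy once its `Ready` condition has not been `True` for longer than the grace period.

```go
package healer

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// unhealthyFor reports whether the node has failed its Ready check for longer
// than the given grace period, i.e. consecutive failed health checks over an
// extended time, as described above.
func unhealthyFor(node corev1.Node, grace time.Duration) bool {
	for _, c := range node.Status.Conditions {
		if c.Type != corev1.NodeReady {
			continue
		}
		notReady := c.Status != corev1.ConditionTrue
		return notReady && time.Since(c.LastTransitionTime.Time) > grace
	}
	// A node that never reported a Ready condition is treated as unhealthy.
	return true
}
```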
##### Node Conditions
- The `conditions` field describes the status of all running nodes.
- By describing any node we can see each NodeCondition and its respective status:
```bash
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:44:18 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:44:18 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:44:18 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sat, 10 Apr 2021 02:23:21 +0530   Fri, 09 Apr 2021 15:45:45 +0530   KubeletReady                 kubelet is posting ready status
```
- The following NodeConditions are supported by default in a Kubernetes cluster:

| ConditionType | Condition Status | Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
| Ready | True | - | |
| | False | NoExecute | node.kubernetes.io/not-ready |
| | Unknown | NoExecute | node.kubernetes.io/unreachable |
| OutOfDisk | True | NoSchedule | node.kubernetes.io/out-of-disk |
| | False | - | |
| | Unknown | - | |
| MemoryPressure | True | NoSchedule | node.kubernetes.io/memory-pressure |
| | False | - | |
| | Unknown | - | |
| DiskPressure | True | NoSchedule | node.kubernetes.io/disk-pressure |
| | False | - | |
| | Unknown | - | |
| NetworkUnavailable | True | NoSchedule | node.kubernetes.io/network-unavailable |
| | False | - | |
| | Unknown | - | |
| PIDPressure | True | NoSchedule | node.kubernetes.io/pid-pressure |
| | False | - | |
| | Unknown | - | |

> Review thread on this table:
> - One node can have several bad conditions at one time.
> - We can have either a …
> - I suggest a ConfigMap, because the conditions are not objects either.
> - Yes, conditions are not objects, but a ConditionSet can be an object with conditions, taint effect, taint key, etc. as properties. Or should we rename it to NodeConditionSet?
- `Node Problem Detector`
  - By default, Kubernetes supports only a limited set of NodeConditions.
  - We can use [node-problem-detector](https://github.com/kubernetes/node-problem-detector), which runs as a DaemonSet, collects different node problems, and reports them in the form of NodeConditions.

- Based on the NodeConditions we can apply taint effects such as:
  - `NoSchedule`: Does not allow new pods to be scheduled onto the node unless they tolerate the taint. Does not interrupt already running pods.
  - `PreferNoSchedule`: The scheduler tries not to schedule new pods onto the node.
  - `NoExecute`: Evicts any already-running pods that do not tolerate the taint.

## Node Taint Controller (NTC)
- The NTC control loop continuously watches the `node conditions` of all nodes and applies a taint based on the `condition` type, as sketched below.
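
A rough sketch of that control-loop step, assuming the default condition-to-taint mapping from the table above; writing the updated Node object back through the API is omitted, and the function names are illustrative only.

```go
package ntc

import corev1 "k8s.io/api/core/v1"

// taintForCondition maps a node condition to the taint the NTC would apply,
// following the default condition table above; it returns nil when no taint
// is needed for that condition/status pair.
func taintForCondition(c corev1.NodeCondition) *corev1.Taint {
	switch {
	case c.Type == corev1.NodeReady && c.Status == corev1.ConditionFalse:
		return &corev1.Taint{Key: "node.kubernetes.io/not-ready", Effect: corev1.TaintEffectNoExecute}
	case c.Type == corev1.NodeReady && c.Status == corev1.ConditionUnknown:
		return &corev1.Taint{Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoExecute}
	case c.Type == corev1.NodeMemoryPressure && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/memory-pressure", Effect: corev1.TaintEffectNoSchedule}
	case c.Type == corev1.NodeDiskPressure && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/disk-pressure", Effect: corev1.TaintEffectNoSchedule}
	case c.Type == corev1.NodePIDPressure && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/pid-pressure", Effect: corev1.TaintEffectNoSchedule}
	case c.Type == corev1.NodeNetworkUnavailable && c.Status == corev1.ConditionTrue:
		return &corev1.Taint{Key: "node.kubernetes.io/network-unavailable", Effect: corev1.TaintEffectNoSchedule}
	default:
		return nil
	}
}

// ensureTaint adds the taint to the node's spec unless an identical
// key/effect pair is already present.
func ensureTaint(node *corev1.Node, t corev1.Taint) {
	for _, existing := range node.Spec.Taints {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			return
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, t)
}
```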

## Node Auto Healer Operator (NAHO)

> Review thread:
> - I think taints applied by the NTC should be distinguished from taints applied by other processes or humans.
> - Default node conditions have keys in the format …

- NAHO reconciles all Nodes, looks for the `taints` applied to each node, and applies the taint effect based on the taint type on that node.
- Next, it evicts all the pods from that Node (a simplified drain step is sketched after this list).
- Once all the pods are evicted from that Node, or the eviction exceeds the default eviction period, it deletes that Node (Machine) resource.
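
A simplified sketch of that drain step, assuming a hypothetical `PodEvictor` abstraction over the Kubernetes eviction API; DaemonSet/mirror-pod handling and retry/backoff are omitted.

```go
package naho

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// PodEvictor is an assumed interface; a real implementation would call the
// pod Eviction subresource through client-go.
type PodEvictor interface {
	PodsOnNode(ctx context.Context, nodeName string) ([]corev1.Pod, error)
	Evict(ctx context.Context, pod corev1.Pod) error
}

// drainNode evicts every pod running on the given node; once this succeeds
// (or the eviction period is exceeded), the caller deletes the Node/Machine
// resource as described above.
func drainNode(ctx context.Context, evictor PodEvictor, nodeName string) error {
	pods, err := evictor.PodsOnNode(ctx, nodeName)
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if err := evictor.Evict(ctx, pod); err != nil {
			return err
		}
	}
	return nil
}
```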

@@ -0,0 +1,117 @@

### Node Auto Healer Operator (NAHO)

##### NodeAutoHealer

- `NodeAutoHealer` is a CR of NAHO that defines how NAHO is enabled and configured to provide node auto healing.

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: NodeAutoHealer
metadata:
  name: node-auto-healer-1
spec:
  disableAutoHealing: false
  nodeSelector:
    matchLabels:
      type: small
    matchExpressions:
      - {key: tier, operator: In, values: [cache]}
      - {key: environment, operator: NotIn, values: [dev]}
  noScheduleThresholdLimit: 30mins
  parallelHealing:
    enable: true
    maxAllowedNodesToDrain: 20%
  forceDelete: true
status:
  state: active
  disabledAt: {LastTime when disabled}
```

- **kind**: NodeAutoHealer
- **spec**:
  - **disableAutoHealing**: A boolean flag that changes the state of NAHO. Setting it to `true` pauses auto healing.
  - **nodeSelector**: If provided, nodes can be filtered using `matchLabels` or `matchExpressions`, and only the matching nodes are considered for auto healing.
    - If not provided, all nodes within the cluster are considered for auto healing.
  - **noScheduleThresholdLimit**: Threshold on how long a node may remain in the `NoSchedule` state. If it exceeds the threshold, we can apply a `NoExecute` taint, which will evict all pods, after which the Node will be drained.
  - **parallelHealing**: If enabled, multiple nodes can be drained in parallel.
    - **maxAllowedNodesToDrain**: Maximum number of nodes that may remain in a drained state at any given time.
      - The value can be an absolute number (e.g. 5) or a percentage of the total nodes at that moment (e.g. 10%); see the sketch after this list.
  - **forceDelete**: If set to true, the node is deleted even if draining it fails.
- **status**:
  - **state**: {Active|Paused}, represents whether the auto healer is active or paused.
  - **disabledAt**: The last datetime at which auto healing was disabled.
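
A minimal sketch of how `maxAllowedNodesToDrain` could be resolved into an absolute node count; the function name and the round-down behaviour for percentages are assumptions.

```go
package naho

import (
	"strconv"
	"strings"
)

// maxDrainable converts the maxAllowedNodesToDrain value ("5" or "20%") into
// an absolute number of nodes, given the current total node count.
func maxDrainable(value string, totalNodes int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return totalNodes * pct / 100, nil
	}
	return strconv.Atoi(value)
}
```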

---

##### HealedNode

- `HealedNode` is a CR of NAHO that is created when a node matches a certain `ConditionSet` and requires recovery.
- Using this CR we can monitor the progress of a node's healing process (drain node -> delete machine -> monitor creation of the replacement node).

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: HealedNode
metadata:
  name: kind-control-plane
spec:
  nodeDetails:
    nodeName: kind-control-plane
    taints:
      - effect: NoSchedule
        key: key1
        value: value1
      - effect: NoExecute
        key: key1
        value: value1
    conditions:
      - lastHeartbeatTime: "2021-04-25T11:51:06Z"
        lastTransitionTime: "2021-04-25T11:36:05Z"
        message: Kubelet never posted node status.
        reason: NodeStatusNeverUpdated
        status: "Unknown"
        type: Ready
    addresses:
      - address: 172.18.0.2
        type: InternalIP
      - address: kind-control-plane
        type: Hostname
    nodeSystemInfo:
      architecture: amd64
      bootID: 3b622bbf-a04c-4a50-81d9-7afb89502684
      containerRuntimeVersion: containerd://1.4.0-106-gce4439a8
      kernelVersion: 5.8.0-50-generic
      kubeProxyVersion: v1.20.2
      kubeletVersion: v1.20.2
      machineID: bed729392962410587918db70d475183
      operatingSystem: linux
      osImage: Ubuntu 20.10
      systemUUID: 1e3f9f51-2c77-473d-a173-3e095e6e652c
  matchedConditionSets:
    - name: NetworkUnavailable
      appliedAction: {drainNode|deleteNode}
status:
  currentState: {draining|drained|deleting|deleted|recovering|recovered}
  lastStateChangeTime:
  isHealingProcessStable: true
```

- **kind**: HealedNode
- **spec**:
  - **nodeDetails**: Details of the target node, such as its name, addresses, system info, taints, conditions, etc.
  - **matchedConditionSets**: The matched ConditionSets, which show the cause of the node being unhealthy.
    - **appliedAction**: Once a node becomes unhealthy, it should be drained first and then deleted, so two action types are supported: `drainNode` and `deleteNode`.
- **status**:
  - **currentState**: (a sketch of these transitions follows this list)
    - Once the `drainNode` action is applied, the node goes through these 2 states:
      - `draining`: pod eviction is in progress.
      - `drained`: pod eviction has completed.
    - After draining completes, the Machine object is deleted and goes through these 2 states:
      - `deleting`: the machine object is being deleted, once the `HealedNode`'s `currentState` becomes `drained`.
      - `deleted`: deletion of the machine object has completed.
    - Once the machine object has been deleted, the node goes through these 2 states:
      - `recovering`: before deleting the machine object, we store the total number of nodes; once the delete completes, we keep checking whether the current node count equals the previous count.
      - `recovered`: once the current node count matches the previous count, the state becomes `recovered`.
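
A sketch of the state progression described above; the state names come from the CR example, while the explicit transition function is an assumption about how the controller could advance `currentState`.

```go
package naho

// HealState mirrors the currentState values of a HealedNode.
type HealState string

const (
	StateDraining   HealState = "draining"
	StateDrained    HealState = "drained"
	StateDeleting   HealState = "deleting"
	StateDeleted    HealState = "deleted"
	StateRecovering HealState = "recovering"
	StateRecovered  HealState = "recovered"
)

// nextState returns the state that follows the current one, or "" once the
// healing process has completed.
func nextState(s HealState) HealState {
	switch s {
	case StateDraining:
		return StateDrained
	case StateDrained:
		return StateDeleting
	case StateDeleting:
		return StateDeleted
	case StateDeleted:
		return StateRecovering
	case StateRecovering:
		return StateRecovered
	default:
		return ""
	}
}
```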

@@ -0,0 +1,161 @@

##### Node Conditions

- Apart from the default conditions supported natively by Kubernetes, `NodeProblemDetector` provides additional NodeConditions based on the `ProblemDaemon`s enabled with NodeProblemDetector.

- Kubernetes native NodeConditions:

| ConditionType | Condition Status | Effect | Key |
| ------------------ | ------------------ | ------------ | -------- |
| Ready | True | - | |
| | False | NoExecute | node.kubernetes.io/not-ready |
| | Unknown | NoExecute | node.kubernetes.io/unreachable |
| OutOfDisk | True | NoSchedule | node.kubernetes.io/out-of-disk |
| | False | - | |
| | Unknown | - | |
| MemoryPressure | True | NoSchedule | node.kubernetes.io/memory-pressure |
| | False | - | |
| | Unknown | - | |
| DiskPressure | True | NoSchedule | node.kubernetes.io/disk-pressure |
| | False | - | |
| | Unknown | - | |
| NetworkUnavailable | True | NoSchedule | node.kubernetes.io/network-unavailable |
| | False | - | |
| | Unknown | - | |
| PIDPressure | True | NoSchedule | node.kubernetes.io/pid-pressure |
| | False | - | |
| | Unknown | - | |

- `NodeProblemDetector` (NPD) supported NodeConditions:
  - NPD only patches Nodes with conditions; it does not apply taints on Nodes. We will have to decide on the effect and taint key for each of the following conditions supported by NPD.

- `ntp-custom-plugin-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | NTPProblem | True | | |
  | | False | | |
  | | Unknown | | |

- `docker-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | CorruptDockerOverlay2 | True | | |
  | | False | | |
  | | Unknown | | |

- `Health-checker-containerd, docker`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | ContainerRuntimeUnhealthy | True | | |
  | | False | | |
  | | Unknown | | |

- `Health-checker-kubelet`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | KubeletUnhealthy | True | | |
  | | False | | |
  | | Unknown | | |

- `kernel-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | KernelDeadlock | True | | |
  | | False | | |
  | | Unknown | | |
  | ReadonlyFilesystem | True | | |
  | | False | | |
  | | Unknown | | |
  | FrequentUnregisterNetDevice | True | | |
  | | False | | |
  | | Unknown | | |

- `systemd-monitor`

  | ConditionType | Condition Status | Effect | Key |
  | ------------------ | ------------------ | ------------ | -------- |
  | FrequentKubeletRestart | True | | |
  | | False | | |
  | | Unknown | | |
  | FrequentDockerRestart | True | | |
  | | False | | |
  | | Unknown | | |
  | FrequentContainerdRestart | True | | |
  | | False | | |
  | | Unknown | | |

---

##### One node can have several bad conditions at the same time. We need new condition types that can be a combination of multiple condition types.
- For cases where a single condition is not sufficient to mark a node as unhealthy, we can support a new type called `ConditionSet`.

##### Approach 1

##### ConditionSet as a ConfigMap

- A `ConditionSet` is a combination of one or more conditions.
- Sample ConfigMap:
```yaml
type: ConditionSets
conditionSets:
  - type: KubeletContainerRuntimeUnhealthy
    effect: NoExecute
    taintKey: node.stakater.com/KubeletContainerRuntimeUnhealthy
    conditions:
      - ConditionType: KubeletUnhealthy
        conditionStatus: true
      - ConditionType: ContainerRuntimeUnhealthy
        conditionStatus: Unknown
  - type: KernelDeadlock
    effect: NoExecute
    taintKey: node.stakater.com/KernelDeadlock
    conditions:
      - ConditionType: KernelDeadlock
        conditionStatus: true
```
- If a Node's conditions match any of the ConditionSets, the corresponding effect is applied.
- If a Node's conditions match multiple ConditionSets, the highest-level effect (`NoExecute > NoSchedule > PreferNoSchedule`) is applied; a sketch of this matching logic follows.
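
An illustrative sketch of that matching rule (the Go type and function names are assumptions): a node matches a ConditionSet only when every listed condition is present with the listed status, and the strongest effect among all matching sets wins.

```go
package conditionset

import corev1 "k8s.io/api/core/v1"

// ConditionSet is an in-memory form of one entry from the ConfigMap above.
type ConditionSet struct {
	Type       string
	Effect     corev1.TaintEffect
	TaintKey   string
	Conditions map[corev1.NodeConditionType]corev1.ConditionStatus
}

// effectRank encodes NoExecute > NoSchedule > PreferNoSchedule.
var effectRank = map[corev1.TaintEffect]int{
	corev1.TaintEffectPreferNoSchedule: 1,
	corev1.TaintEffectNoSchedule:       2,
	corev1.TaintEffectNoExecute:        3,
}

// matches reports whether every condition in the set is present on the node
// with the expected status.
func matches(node corev1.Node, set ConditionSet) bool {
	for wantType, wantStatus := range set.Conditions {
		found := false
		for _, c := range node.Status.Conditions {
			if c.Type == wantType && c.Status == wantStatus {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// strongestEffect returns the highest-ranked effect among all matching sets,
// and false when no set matches.
func strongestEffect(node corev1.Node, sets []ConditionSet) (corev1.TaintEffect, bool) {
	var best corev1.TaintEffect
	matched := false
	for _, s := range sets {
		if matches(node, s) && effectRank[s.Effect] > effectRank[best] {
			best, matched = s.Effect, true
		}
	}
	return best, matched
}
```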

##### Approach 2

##### ConditionSet as a new CustomResource type

```yaml
apiVersion: autohealer.stakater.com/v1alpha1
kind: ConditionSet
metadata:
  name: conditionset-1
spec:
  type: KubeletContainerRuntimeUnhealthy
  effect: NoExecute
  taintKey: node.stakater.com/KubeletContainerRuntimeUnhealthy
  conditions:
    KubeletUnhealthy:
      status: true
    ContainerRuntimeUnhealthy:
      status: unknown
---
apiVersion: autohealer.stakater.com/v1alpha1
kind: ConditionSet
metadata:
  name: conditionset-2
spec:
  type: KernelDeadlock
  effect: NoExecute
  taintKey: node.stakater.com/KernelDeadlock
  conditions:
    KernelDeadlock:
      status: true
```

- The advantage of having `ConditionSet` as a new resource type is that we can apply a [Validating Webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook) that validates each `ConditionSet`.
- For example, since we are giving users the power to configure conditions and their effects: if `effect` were set to `NoExecute` for condition type `OutOfDisk` with condition status `false` instead of `true`, the operator would start evicting all healthy nodes in the cluster and could take down the whole cluster if not noticed.
- So, by having a validation webhook, we can make sure that a `condition` which could impact the cluster does not get configured (a sketch of such a check follows).
- Another advantage is avoiding duplicate condition entries and rejecting unsupported condition types.
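
A hedged sketch of the kind of check the validating webhook could run (the rule set, struct, and function names are assumptions; the `ConditionSet` struct mirrors the CR spec above): it rejects a set that would taint nodes on a healthy condition value, such as the `OutOfDisk=false` example above.

```go
package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// ConditionSet mirrors the spec of the ConditionSet CR shown above.
type ConditionSet struct {
	Type       string
	Effect     corev1.TaintEffect
	Conditions map[corev1.NodeConditionType]corev1.ConditionStatus
}

// healthyStatus lists, per condition type, the status that means the node is
// healthy; tainting on these values would cordon or evict healthy nodes.
var healthyStatus = map[corev1.NodeConditionType]corev1.ConditionStatus{
	corev1.NodeReady:              corev1.ConditionTrue,
	corev1.NodeMemoryPressure:     corev1.ConditionFalse,
	corev1.NodeDiskPressure:       corev1.ConditionFalse,
	corev1.NodePIDPressure:        corev1.ConditionFalse,
	corev1.NodeNetworkUnavailable: corev1.ConditionFalse,
}

// validateConditionSet rejects sets whose taint effect is attached to a
// healthy condition value.
func validateConditionSet(set ConditionSet) error {
	if set.Effect == "" {
		return fmt.Errorf("conditionSet %q has no taint effect", set.Type)
	}
	for condType, status := range set.Conditions {
		if healthy, ok := healthyStatus[condType]; ok && status == healthy {
			return fmt.Errorf("conditionSet %q would taint nodes on healthy condition %s=%s",
				set.Type, condType, status)
		}
	}
	return nil
}
```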

> Review thread:
> - How should a `NoSchedule`-tainted node be processed? Maybe set a threshold on the status duration and then drain the node as well?
> - We can have a threshold time limit on how long a node can stay in the `NoSchedule` state. If it exceeds the threshold, we can apply a `NoExecute` taint, which will evict all pods, and later the Node will be drained. Using the `LastTransitionTime` of the Node condition that caused the `NoSchedule` taint, we can know whether there is any change in that specific Node's condition state. If there is no change and `CurrentTime - LastTransitionTime` exceeds the threshold time, the Node will get evicted and drained.
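
A minimal sketch of the escalation rule proposed in this thread, assuming the triggering condition is already known; reading `noScheduleThresholdLimit` from the NodeAutoHealer spec and actually applying the `NoExecute` taint are left to the caller.

```go
package naho

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldEscalateToNoExecute reports whether the condition that caused the
// NoSchedule taint has stayed unchanged for longer than the threshold, i.e.
// CurrentTime - LastTransitionTime > noScheduleThresholdLimit.
func shouldEscalateToNoExecute(cond corev1.NodeCondition, threshold time.Duration, now time.Time) bool {
	return now.Sub(cond.LastTransitionTime.Time) > threshold
}
```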