DOC-12691-Updated automatic-failover.adoc #3820
base: release/7.6
Conversation
corrected formatting issues
updated image under Equal groups as per https://docs.google.com/document/d/1Lnzqu7mW8PtGDzrWbS-e7SU0J457ZC2d1ijfsx_nbTw/edit?tab=t.0
Updated description of images under Equal Groups
updated as per Ben's comments on https://docs.google.com/document/d/1Lnzqu7mW8PtGDzrWbS-e7SU0J457ZC2d1ijfsx_nbTw/edit?tab=t.0
Added note that... at the beginning
I've read about half of the automatic-failover page, will finish tomorrow.
_Automatic Failover_ — or _auto-failover_ — can be configured to fail over one or more nodes automatically. No immediate administrator intervention is required.
Specifically, the Cluster Manager autonomously detects and verifies that the nodes are unresponsive, and then initiates the _hard_ failover process.
Auto-failover does not fix or identify problems that may have occurred.
Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required.
Auto-failover is always _hard_ failover.
For information on how services are affected by hard failover, see xref:learn:clusters-and-availability/hard-failover.adoc[Hard Failover].
This page describes auto-failover concepts and policy.
As a reminder, failover is a mechanism in the Couchbase Server that allows a node to be taken out of the cluster so that applications no longer reference the services on the failed node and availability is maintained. The failover is at the node level, and the automatic failover process for a non-responsive or an unhealthy node starts when the cluster manager detects, per the configured auto-failover settings, that the node is unresponsive (node ns-server process is not responding to the cluster manager) or the Data or the Index Service on a node is not healthy (the service heartbeat or process is not responding to the cluster manager). Then, multiple safety checks are run to see if an auto-failover can be performed. If all checks pass, the cluster manager performs the hard failover process.
(node ns-server process is not responding to the cluster manager)
I think that this would read better to a user as "(the cluster manager of the node is not sending heartbeats to the cluster manager of other nodes)"
With Ben's suggested change, should read:
As a reminder, failover is a mechanism in the Couchbase Server that allows a node to be taken out of the cluster so that applications no longer reference the services on the failed node and availability is maintained. The failover is at the node level, and the automatic failover process for a non-responsive or an unhealthy node starts when the cluster manager detects, per the configured auto-failover settings, that the node is unresponsive (the cluster manager of the node is not sending heartbeats to the cluster manager of other nodes) or the Data or the Index Service on a node is not healthy (the service heartbeat or process is not responding to the cluster manager). Then, multiple safety checks are run to see if an auto-failover can be performed. If all checks pass, the cluster manager performs the hard failover process.
Auto-failover occurs in response to failed/failing events.
There are three types of event that can trigger auto-failover:
Auto-failover occurs in response to failed/failing events. Auto-failover applies to the node -- it’s the node that fails over regardless of the triggering event. There are specific types of events that trigger auto-failover processing. However, auto-failover will only actually occur if all of the checks (constraints and policies) for auto-failover pass.
nit: I'd probably make the above and below one paragraph, but I'm not the expert :)
* _Node failure_.
A server-node within the cluster is unresponsive (due to a network failure, out-of-memory problem, or other node-specific issue).
A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the ns-server process on the node has not responded to the cluster manager for the user-specified configured amount of time -- the health of the services running on the node is unknown.
Similarly to the previous comment, we might want to revise mention of the ns-server process and just talk about the cluster manager heartbeat
With Ben's suggested change, line 42 may be changed to:
A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore, the health of the services running on the node is unknown.
Note for Supritha: The "Node-Failure Detection" section of Cluster Manager doc has info on the heartbeat mechanism in case some context is needed.
Attempts to read from or write to disk on a particular node have resulted in a significant rate of failure, for longer than a specified time-period.
The node is removed by auto-failover, even though the node continues to be contactable.
* _Data Service disk read/write issues_.
Data Service disk read/write errors. Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.
@hyunjuV should we not also document the disk responsiveness trigger?
@BenHuddleston
The disk non-responsiveness issue will be in the 8.0 documentation.
These updates/changes are for current latest, which is 7.6.x.
The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover since the cluster must be able to form a quorum to initiate a failover, following the failure of one of the nodes. For Server Groups, this means that if you have two server groups with equal number of nodes, for auto-failover of all nodes in one server group to be able to occur, you could deploy an xref:learn:clusters-and-availability:nodes.adoc#adding-arbiter-nodes[arbiter node] (or another) in a third physical server group which will allow the remaining nodes to form a quorum.

Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator-intervention is required. If you want one entire server group of nodes to be able to be all automatically failed over, then the `maxCount` value should be at least the number of nodes in the server group. You can check the value of `maxCount` in `GET /settings/autoFailover` to see what the `maxCount` setting is. The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset. Running a rebalance will reset the count value back to 0. You can also reset the count back to 0 using `POST /settings/autoFailover/resetCount`, but it is rare that you would need to manually reset the count.
You can also reset the count back to 0 using `POST /settings/autoFailover/resetCount`, but it is rare that you would need to manually reset the count.
I don't think that we should document this without giving a reason why, or more explicit guidance.
@BenHuddleston
The POST /settings/autoFailover/resetCount is already listed in the documentation, referenced from here. It's been in the documentation since 7.0.
Also, on line 80, which has also been in the documentation for a while, it says:
"After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed."
Should we advise here (in line 67) by saying:
Running a rebalance will reset the count value back to 0. The count should not be reset manually unless guided by Support, since resetting manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hyunjuV, I think that that's a reasonable addition.
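For illustration, a minimal sketch of inspecting these settings and resetting the count over the REST API (the host `localhost:8091` and the `Administrator:password` credentials are placeholders, not values from this PR):

[source,console]
----
# Inspect the current auto-failover settings; the response includes
# "maxCount" (the configured maximum) and "count" (auto-failovers so far).
curl -s -u Administrator:password http://localhost:8091/settings/autoFailover

# Reset the count to 0. Rarely needed manually, since a successful
# rebalance also resets it; see the guidance discussed above.
curl -s -u Administrator:password -X POST \
  http://localhost:8091/settings/autoFailover/resetCount
----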
The list below describes other conditions that must be met for an auto-failover to be executed even after a monitored or configured auto-failover event has occurred.

* If the majority of nodes in the cluster can form a quorum to initiate failover, following the failure of one of the nodes.
following the failure of one of the nodes.
Should be "following the failure of some of the nodes". If we have 2 simultaneous failures in a 3 node cluster then we will not have a quorum.
Auto-failover may take significantly longer if the unresponsive node is that on which the _orchestrator_ is running; since _time-outs_ must occur, before available nodes can elect a new orchestrator-node and thereby continue.
* Auto-failover may take significantly longer if the unresponsive node is that on which the _orchestrator_ is running; since _time-outs_ must occur, before available nodes can elect a new orchestrator-node and thereby continue.
We should mention and link here to the faster failover configuration with arbiter/service-less node - https://docs.couchbase.com/server/current/learn/clusters-and-availability/nodes.html#fast-failover
Ah I suppose it's a couple of lines below in the further docs references, but may be nice to call out explicitly in this paragraph.
Agreed.
- Auto-failover may take significantly longer if the unresponsive node is that on which the orchestrator is running; since time-outs must occur, before available nodes can elect a new orchestrator-node and thereby continue. Faster failover can be achieved by deploying an arbiter node, which is a node that hosts no Couchbase service. (See Fast Failover.)
[#failover-policy]
== Service-Specific Auto-Failover Policy

When a monitored or configured auto-failover event occurs on a node, there are constraints that need to be checked to determine if the node can be automatically failed-over. An example of such an event is the node ns-server not responding to the cluster manager. In such instances, one of the constraints is the policies or rules specific to the services that are running on the unresponsive node. Since a number of different service configurations are possible, below is information about the auto-failover policy for Couchbase Services, followed by specific examples.
Again, recommended referring to ns_server as the "Cluster Manager" here.
Suggested change below:
When a monitored or configured auto-failover event occurs on a node, there are constraints that need to be checked to determine if the node can be automatically failed-over. An example of such an event is the node cluster manager not being responsive. In such instances, one of the constraints is the policies or rules specific to the services that are running on the unresponsive node. Since a number of different service configurations are possible, below is information about the auto-failover policy for Couchbase Services, followed by specific examples.
@@ -77,7 +118,6 @@ The auto-failover policy for Couchbase Services is as follows:
* If the Data Service is running on its required minimum number of nodes, auto-failover may be applied to any of those nodes, even when auto-failover policy is thereby violated for one or more other, co-hosted services.
This is referred to as xref:learn:clusters-and-availability/automatic-failover.adoc#data-service-preference[Data Service Preference].

* The index service shares the same Auto-Failover settings of the Data Service.
* When the Index service is co-located with the Data service, it will not be consulted on failing over the node.

The node-minimum for each service is provided in the following table:
In the table below, the Data service required nodes is now 2, rather than 3, as of 7.6 (MB-56023).
Per @BenHuddleston's note that the Data service "Nodes Required" is 2 now, rather than 3, the number 3 should be changed to number 2 in line 130 (in the table).
Since the Data Service nodes required is 2 (instead of 3 previously), this changes the example that can be shown for the "A cluster has the following three nodes:" example -- an example that starts on line 179.
Instead of 3 Data nodes, we can show 2 Data nodes, and node #3 can be an arbiter node -- so, line 192 can be "Arbiter Node, no services" instead of "Data".
Then lines 195 and 196 remain the same, but line 197 can be:
In this case, even though the Query and Search Services were both running on only a single node (#1), which is below the auto-failover policy requirement for each of those services (2), the Data Service was running on two nodes (#1 and #2), which meets the auto-failover policy requirement for the Data Service (2).
Note that the monitoring for the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of Index Service failure to send heartbeats to the cluster manager) for the Timeout amount of time, then the node that the Index Service is on will be considered for auto-failover (despite the fact that that the node ns-server may be responding to the cluster manager).
WARNING: Care must be taken when running an un-replicated Index Service and a Data Service configured for fast failover (i.e., 1 second) on the same node.
The "fast failover" configuration has no bearing on this case, care should be taken regardless.
It's also just a repeat of line 226, not sure it adds value.
Agreed. Please remove line 237 (the line "WARNING: Care must be taken...")
Note that node failover in the Couchbase Server is in the context of a single cluster, and auto-failover only occurs in a single cluster.
In the context of Cross Data Center Replication (XDCR), the failover refers to application failover to a different cluster. Application failovers are always determined and controlled by the user.
_Automatic Failover_ — or _auto-failover_ — can be configured to fail over one or more nodes automatically. No immediate administrator intervention is required.
Specifically, the Cluster Manager autonomously detects and verifies that the nodes are unresponsive, and then initiates the _hard_ failover process.
Auto-failover does not fix or identify problems that may have occurred.
Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required.
Auto-failover is always _hard_ failover.
Typo -- missing "a" : Auto-failover is always a hard failover.
** Data Service is unhealthy.
Besides the Data Service disk read/write issues configured monitoring for auto-failover, the Data Service running on a node can be deemed unhealthy per various other internal monitoring. If the Data Service stays unhealthy for the user-specified threshold time for auto-failover, the cluster manager will start the auto-failover checks for the node that the data service is on.

Note that the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness (see xref:learn:clusters-and-availability:automatic-failover.adoc[Configuring Auto-Failover] -- this is the user-specified threshold time for auto-failover mentioned in the Data Service and Index Service monitoring.
Typo: Missing closing parenthesis.
Note that the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness (see xref:learn:clusters-and-availability:automatic-failover.adoc[Configuring Auto-Failover])
Note that on a node where there are only Search, Eventing, Query, or Analytics services running, the services could become unhealthy, but as long as the node ns-server process is responding to the cluster manager, an auto-failover of the node will not be attempted -- this is because only the Data and Index Services health are monitored for auto-failover.
Instead of "node ns-server process is responding to the cluster manager", say "node heartbeats are being sent":
Note that on a node where there are only Search, Eventing, Query, or Analytics services running, the services could become unhealthy, but as long as the node heartbeats are being sent, an auto-failover of the node will not be attempted -- this is because only the Data and Index Services health are monitored for node auto-failover.
@hyunjuV , I'd revise this slightly (two/three changes):
- Addition of backup service
- swap "node heartbeats" for "cluster manager heartbeats" - probably best to use consistent nomenclature throughout
- swap "[node heartbeats] are being sent" for "[cluster manager heartbeats] are sent and processed by the rest of the cluster" - it's not enough to just send them, they must be received/processed too
Note that on a node where there are only Search, Eventing, Query, Analytics, or Backup services running, the services could become unhealthy, but as long as the cluster manager heartbeats are sent and processed by the rest of the cluster, an auto-failover of the node will not be attempted -- this is because only the Data and Index Services health are monitored for node auto-failover.
Auto-failover is triggered:
If a monitored or configured auto-failover event occurs, an auto-failover will not be performed if all the safety checks do not pass. These checks are explained in this section and the xref:learn:clusters-and-availability:automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] section.

The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover since the cluster must be able to form a quorum to initiate a failover, following the failure of one of the nodes. For Server Groups, this means that if you have two server groups with equal number of nodes, for auto-failover of all nodes in one server group to be able to occur, you could deploy an xref:learn:clusters-and-availability:nodes.adoc#adding-arbiter-nodes[arbiter node] (or another) in a third physical server group which will allow the remaining nodes to form a quorum.
Instead of "(or another)", should be "(or another node)" in the last sentence.
WARNING: Care must be taken when running an un-replicated Index Service and a Data Service configured for fast failover (i.e., 5 seconds) on the same node.
The number of seconds that must elapse, after a node or group has become unresponsive, before auto-failover is triggered. The default is 120 seconds, the minimum permitted is 1 second and the maximum is 3600 seconds. Note that a low number reduces the potential time-period during which a consistently unresponsive node remains unresponsive before auto-failover is triggered; but may also result in auto-failover being unnecessarily triggered, in consequence of short, intermittent periods of node unavailability.
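As a possible illustration of this Timeout setting, a sketch of configuring it through the REST API (host and credentials are placeholders, and the values shown are only examples within the documented 1-3600 second range):

[source,console]
----
# Enable auto-failover with a 120-second timeout, and allow up to 2
# sequential node auto-failovers before administrator intervention.
curl -s -u Administrator:password -X POST \
  http://localhost:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'maxCount=2'
----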
Note that the monitoring for the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of Index Service failure to send heartbeats to the cluster manager) for the Timeout amount of time, then the node that the Index Service is on will be considered for auto-failover (despite the fact that that the node ns-server may be responding to the cluster manager).
Typo (that that) and remove ns-server phrasing to be consistent with changes being made.
Note that the monitoring for the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of Index Service failure to send heartbeats) for the Timeout amount of time, then the node that the Index Service is on will be considered for auto-failover (despite the fact that the node cluster manager may be responding and sending heartbeats).
@@ -206,10 +248,7 @@ The value is incremented by 1 for every node that has an automatic-failover that
* _Enablement of disk-related automatic failover; with corresponding time-period_.
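If an example helps here, a sketch of enabling the disk read/write failover trigger with its time-period via the REST API (placeholder host and credentials, as in the earlier sketches; the 120-second value is only an example):

[source,console]
----
# Turn on auto-failover and additionally fail over a node if Data Service
# disk reads/writes keep failing for 120 seconds.
curl -s -u Administrator:password -X POST \
  http://localhost:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'failoverOnDataDiskIssues[enabled]=true' \
  -d 'failoverOnDataDiskIssues[timePeriod]=120'
----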
This comment is in reference to line 247 which describes the "Count".
It should say:
The value is incremented by 1 for every node that has an automatic-failover that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0. Running a rebalance will reset the count value back to 0.
@@ -68,13 +71,16 @@ In each illustration, all servers are assumed to be running the Data Service.
[#vbucket-distribution-across-equal-groups]
This comment is for line 69, which currently says:
In each illustration, all servers are assumed to be running the Data Service.
Should be updated to say:
In each illustration, all servers are assumed to be running the Data Service, except for the arbiter node server, which does not run any service.
@@ -68,13 +71,16 @@ In each illustration, all servers are assumed to be running the Data Service.
[#vbucket-distribution-across-equal-groups]
=== Equal Groups

The following illustration shows how vBuckets are distributed across two groups; each group containing four of its cluster's eight nodes.
The following illustration shows how vBuckets are distributed across three groups; two group containing four of its cluster's eight nodes and a thrid group that can include a single arbiter node..
The original description was correct since the vBuckets are distributed only across two groups. The third group only contains an arbiter node to allow a quorum to be formed if all the nodes in one server group fails. So, should say:
The following illustration shows how vBuckets are distributed across two groups; each group containing four of the cluster's nodes. The third group only contains one node, an arbiter node, which exists to allow a quorum to be formed if all the nodes in server group 1 or 2 fails.
@@ -208,6 +214,8 @@ For example, given a cluster:

At a minimum, one instance of the Index Service and one instance of the Search Service should be deployed on each rack.

Also, for auto-failover to be possible, the service-specific auto-failover constraints be met -- the policy information is documented in xref:learn:clusters-and-availability/automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] -- it lists the number of nodes that each service must be running on and explains the xref:learn:clusters-and-availability/automatic-failover.adoc#data-service-preference[Data Service Preference] when a service is co-located with the Data Service.
Typo.
Instead of:
Also, for auto-failover to be possible, the service-specific auto-failover constraints be met
Should be:
Also, for auto-failover to be possible, the service-specific auto-failover constraints must be met
updated as per https://docs.google.com/document/d/1Lnzqu7mW8PtGDzrWbS-e7SU0J457ZC2d1ijfsx_nbTw/edit?tab=t.0