DOC-12691-Updated automatic-failover.adoc #3820
base: release/7.6
Conversation
corrected formatting issues
updated image under Equal groups as per https://docs.google.com/document/d/1Lnzqu7mW8PtGDzrWbS-e7SU0J457ZC2d1ijfsx_nbTw/edit?tab=t.0
Updated description of images under Equal Groups
updated as per Ben's comments on https://docs.google.com/document/d/1Lnzqu7mW8PtGDzrWbS-e7SU0J457ZC2d1ijfsx_nbTw/edit?tab=t.0
Added note that... at the beginning
I've read about half of the automatic-failover page, will finish tomorrow.
_Automatic Failover_ — or _auto-failover_ — can be configured to fail over one or more nodes automatically. No immediate administrator intervention is required.
Specifically, the Cluster Manager autonomously detects and verifies that the nodes are unresponsive, and then initiates the _hard_ failover process.
Auto-failover does not fix or identify problems that may have occurred.
Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required.
Auto-failover is always _hard_ failover.
For information on how services are affected by hard failover, see xref:learn:clusters-and-availability/hard-failover.adoc[Hard Failover].
This page describes auto-failover concepts and policy.
As a reminder, failover is a mechanism in the Couchbase Server that allows a node to be taken out of the cluster so that applications no longer reference the services on the failed node and availability is maintained. The failover is at the node level, and the automatic failover process for a non-responsive or an unhealthy node starts when the cluster manager detects, per the configured auto-failover settings, that the node is unresponsive (node ns-server process is not responding to the cluster manager) or the Data or the Index Service on a node is not healthy (the service heartbeat or process is not responding to the cluster manager). Then, multiple safety checks are run to see if an auto-failover can be performed. If all checks pass, the cluster manager performs the hard failover process.
(node ns-server process is not responding to the cluster manager)
I think that this would read better to a user as "(the cluster manager of the node is not sending heartbeats to the cluster manager of other nodes)"
With Ben's suggested change, should read:
As a reminder, failover is a mechanism in the Couchbase Server that allows a node to be taken out of the cluster so that applications no longer reference the services on the failed node and availability is maintained. The failover is at the node level, and the automatic failover process for a non-responsive or an unhealthy node starts when the cluster manager detects, per the configured auto-failover settings, that the node is unresponsive (the cluster manager of the node is not sending heartbeats to the cluster manager of other nodes) or the Data or the Index Service on a node is not healthy (the service heartbeat or process is not responding to the cluster manager). Then, multiple safety checks are run to see if an auto-failover can be performed. If all checks pass, the cluster manager performs the hard failover process.
Auto-failover occurs in response to failed/failing events.
There are three types of event that can trigger auto-failover:
Auto-failover occurs in response to failed/failing events. Auto-failover applies to the node -- it’s the node that fails over regardless of the triggering event. There are specific types of events that trigger auto-failover processing. However, auto-failover will only actually occur if all of the checks (constraints and policies) for auto-failover pass.
nit: I'd probably make the above and below one paragraph, but I'm not the expert :)
* _Node failure_.
A server-node within the cluster is unresponsive (due to a network failure, out-of-memory problem, or other node-specific issue).
A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the ns-server process on the node has not responded to the cluster manager for the user-specified configured amount of time -- the health of the services running on the node is unknown.
Similarly to the previous comment, we might want to revise mention of the ns-server process and just talk about the cluster manager heartbeat
With Ben's suggested change, line 42 may be changed to:
A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore, the health of the services running on the node is unknown.
Note for Supritha: The "Node-Failure Detection" section of Cluster Manager doc has info on the heartbeat mechanism in case some context is needed.
Attempts to read from or write to disk on a particular node have resulted in a significant rate of failure, for longer than a specified time-period.
The node is removed by auto-failover, even though the node continues to be contactable.
* _Data Service disk read/write issues_.
Data Service disk read/write errors. Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.
@hyunjuV should we not also document the disk responsiveness trigger?
@BenHuddleston
The disk non-responsiveness issue will be in the 8.0 documentation.
These updates/changes are for current latest, which is 7.6.x.
The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover since the cluster must be able to form a quorum to initiate a failover, following the failure of one of the nodes. For Server Groups, this means that if you have two server groups with equal number of nodes, for auto-failover of all nodes in one server group to be able to occur, you could deploy an xref:learn:clusters-and-availability:nodes.adoc#adding-arbiter-nodes[arbiter node] (or another) in a third physical server group which will allow the remaining nodes to form a quorum.

Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator-intervention is required. If you want one entire server group of nodes to be able to be all automatically failed over, then the `maxCount` value should be at least the number of nodes in the server group. You can check the value of `maxCount` in `GET /settings/autoFailover` to see what the `maxCount` setting is. The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset. Running a rebalance will reset the count value back to 0. You can also reset the count back to 0 using `POST /settings/autoFailover/resetCount`, but it is rare that you would need to manually reset the count.
You can also reset the count back to 0 using `POST /settings/autoFailover/resetCount`, but it is rare that you would need to manually reset the count.
I don't think that we should document this without giving a reason why, or more explicit guidance.
@BenHuddleston
The POST /settings/autoFailover/resetCount is already listed in the documentation, referenced from here. It's been in the documentation since 7.0.
Also, on line 80, which has also been in the documentation for a while, it says:
"After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed."
Should we advise here (in line 67) by saying:
Running a rebalance will reset the count value back to 0. The count should not be reset manually unless guided by Support, since resetting manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hyunjuV, I think that that's a reasonable addition.
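For illustration, a minimal sketch of inspecting these settings and resetting the count over the REST API (the host `localhost:8091` and the `Administrator:password` credentials are placeholders, not values from this PR):

[source,console]
----
# Inspect the current auto-failover settings; the response includes
# "maxCount" (the configured maximum) and "count" (auto-failovers so far).
curl -s -u Administrator:password http://localhost:8091/settings/autoFailover

# Reset the count to 0. Rarely needed manually, since a successful
# rebalance also resets it; see the guidance discussed above.
curl -s -u Administrator:password -X POST \
  http://localhost:8091/settings/autoFailover/resetCount
----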
The list below describes other conditions that must be met for an auto-failover to be executed even after a monitored or configured auto-failover event has occurred.

* If the majority of nodes in the cluster can form a quorum to initiate failover, following the failure of one of the nodes.
following the failure of one of the nodes.
Should be "following the failure of some of the nodes". If we have 2 simultaneous failures in a 3 node cluster then we will not have a quorum.
Auto-failover may take significantly longer if the unresponsive node is that on which the _orchestrator_ is running; since _time-outs_ must occur, before available nodes can elect a new orchestrator-node and thereby continue.
* Auto-failover may take significantly longer if the unresponsive node is that on which the _orchestrator_ is running; since _time-outs_ must occur, before available nodes can elect a new orchestrator-node and thereby continue.
We should mention and link here to the faster failover configuration with arbiter/service-less node - https://docs.couchbase.com/server/current/learn/clusters-and-availability/nodes.html#fast-failover
Ah I suppose it's a couple of lines below in the further docs references, but may be nice to call out explicitly in this paragraph.
Agreed.
- Auto-failover may take significantly longer if the unresponsive node is that on which the orchestrator is running; since time-outs must occur, before available nodes can elect a new orchestrator-node and thereby continue. Faster failover can be achieved by deploying an arbiter node, which is a node that hosts no Couchbase service. (See Fast Failover.)
[#failover-policy]
== Service-Specific Auto-Failover Policy

When a monitored or configured auto-failover event occurs on a node, there are constraints that need to be checked to determine if the node can be automatically failed-over. An example of such an event is the node ns-server not responding to the cluster manager. In such instances, one of the constraints is the policies or rules specific to the services that are running on the unresponsive node. Since a number of different service configurations are possible, below is information about the auto-failover policy for Couchbase Services, followed by specific examples.
Again, recommended referring to ns_server as the "Cluster Manager" here.
Suggested change below:
When a monitored or configured auto-failover event occurs on a node, there are constraints that need to be checked to determine if the node can be automatically failed-over. An example of such an event is the node cluster manager not being responsive. In such instances, one of the constraints is the policies or rules specific to the services that are running on the unresponsive node. Since a number of different service configurations are possible, below is information about the auto-failover policy for Couchbase Services, followed by specific examples.
@@ -77,7 +118,6 @@ The auto-failover policy for Couchbase Services is as follows:
* If the Data Service is running on its required minimum number of nodes, auto-failover may be applied to any of those nodes, even when auto-failover policy is thereby violated for one or more other, co-hosted services.
This is referred to as xref:learn:clusters-and-availability/automatic-failover.adoc#data-service-preference[Data Service Preference].

* The index service shares the same Auto-Failover settings of the Data Service.
* When the Index service is co-located with the Data service, it will not be consulted on failing over the node.

The node-minimum for each service is provided in the following table:
In the table below, the Data service required nodes is now 2, rather than 3, as of 7.6 (MB-56023).
Per @BenHuddleston's note that the Data service "Nodes Required" is 2 now, rather than 3, the number 3 should be changed to number 2 in line 130 (in the table).
Since the Data Service nodes required is 2 (instead of 3 previously), this changes the example that can be shown for the "A cluster has the following three nodes:" example -- an example that starts on line 179.
Instead of 3 Data nodes, we can show 2 Data nodes, and node #3 can be an arbiter node -- so, line 192 can be "Arbiter Node, no services" instead of "Data".
Then lines 195 and 196 remain the same, but line 197 can be:
In this case, even though the Query and Search Services were both running on only a single node (#1), which is below the auto-failover policy requirement for each of those services (2), the Data Service was running on two nodes (#1 and #2), which meets the auto-failover policy requirement for the Data Service (2).
Note that the monitoring for the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of Index Service failure to send heartbeats to the cluster manager) for the Timeout amount of time, then the node that the Index Service is on will be considered for auto-failover (despite the fact that that the node ns-server may be responding to the cluster manager).
WARNING: Care must be taken when running an un-replicated Index Service and a Data Service configured for fast failover (i.e., 1 second) on the same node.
The "fast failover" configuration has no bearing on this case, care should be taken regardless.
It's also just a repeat of line 226, not sure it adds value.
Agreed. Please remove line 237 (the line "WARNING: Care must be taken...")
Note that node failover in the Couchbase Server is in the context of a single cluster, and auto-failover only occurs in a single cluster.
In the context of Cross Data Center Replication (XDCR), the failover refers to application failover to a different cluster. Application failovers are always determined and controlled by the user.
_Automatic Failover_ — or _auto-failover_ — can be configured to fail over one or more nodes automatically. No immediate administrator intervention is required.
Specifically, the Cluster Manager autonomously detects and verifies that the nodes are unresponsive, and then initiates the _hard_ failover process.
Auto-failover does not fix or identify problems that may have occurred.
Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required.
Auto-failover is always _hard_ failover.
Typo -- missing "a" : Auto-failover is always a hard failover.
** Data Service is unhealthy.
Besides the Data Service disk read/write issues configured monitoring for auto-failover, the Data Service running on a node can be deemed unhealthy per various other internal monitoring. If the Data Service stays unhealthy for the user-specified threshold time for auto-failover, the cluster manager will start the auto-failover checks for the node that the data service is on.

Note that the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness (see xref:learn:clusters-and-availability:automatic-failover.adoc[Configuring Auto-Failover] -- this is the user-specified threshold time for auto-failover mentioned in the Data Service and Index Service monitoring.
Typo: Missing closing parenthesis.
Note that the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness (see xref:learn:clusters-and-availability:automatic-failover.adoc[Configuring Auto-Failover])
Note that on a node where there are only Search, Eventing, Query, or Analytics services running, the services could become unhealthy, but as long as the node ns-server process is responding to the cluster manager, an auto-failover of the node will not be attempted -- this is because only the Data and Index Services health are monitored for auto-failover.
Instead of "node ns-server process is responding to the cluster manager", say "node heartbeats are being sent":
Note that on a node where there are only Search, Eventing, Query, or Analytics services running, the services could become unhealthy, but as long as the node heartbeats are being sent, an auto-failover of the node will not be attempted -- this is because only the Data and Index Services health are monitored for node auto-failover.
@hyunjuV , I'd revise this slightly (two/three changes):
- Addition of backup service
- swap "node heartbeats" for "cluster manager heartbeats" - probably best to use consistent nomenclature throughout
- swap "[node heartbeats] are being sent" for "[cluster manager heartbeats] are sent and processed by the rest of the cluster" - it's not enough to just send them, they must be received/processed too
Note that on a node where there are only Search, Eventing, Query, Analytics, or Backup services running, the services could become unhealthy, but as long as the cluster manager heartbeats are sent and processed by the rest of the cluster, an auto-failover of the node will not be attempted -- this is because only the Data and Index Services health are monitored for node auto-failover.
Auto-failover is triggered:
If a monitored or configured auto-failover event occurs, an auto-failover will not be performed if all the safety checks do not pass. These checks are explained in this section and the xref:learn:clusters-and-availability:automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] section.

The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover since the cluster must be able to form a quorum to initiate a failover, following the failure of one of the nodes. For Server Groups, this means that if you have two server groups with equal number of nodes, for auto-failover of all nodes in one server group to be able to occur, you could deploy an xref:learn:clusters-and-availability:nodes.adoc#adding-arbiter-nodes[arbiter node] (or another) in a third physical server group which will allow the remaining nodes to form a quorum.
Instead of "(or another)", should be "(or another node)" in the last sentence.
WARNING: Care must be taken when running an un-replicated Index Service and a Data Service configured for fast failover (i.e., 5 seconds) on the same node.
The number of seconds that must elapse, after a node or group has become unresponsive, before auto-failover is triggered. The default is 120 seconds, the minimum permitted is 1 second and the maximum is 3600 seconds. Note that a low number reduces the potential time-period during which a consistently unresponsive node remains unresponsive before auto-failover is triggered; but may also result in auto-failover being unnecessarily triggered, in consequence of short, intermittent periods of node unavailability.
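As a possible illustration of this Timeout setting, a sketch of configuring it through the REST API (host and credentials are placeholders, and the values shown are only examples within the documented 1-3600 second range):

[source,console]
----
# Enable auto-failover with a 120-second timeout, and allow up to 2
# sequential node auto-failovers before administrator intervention.
curl -s -u Administrator:password -X POST \
  http://localhost:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'maxCount=2'
----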
Note that the monitoring for the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of Index Service failure to send heartbeats to the cluster manager) for the Timeout amount of time, then the node that the Index Service is on will be considered for auto-failover (despite the fact that that the node ns-server may be responding to the cluster manager).
Typo (that that) and remove ns-server phrasing to be consistent with changes being made.
Note that the monitoring for the Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of Index Service failure to send heartbeats) for the Timeout amount of time, then the node that the Index Service is on will be considered for auto-failover (despite the fact that the node cluster manager may be responding and sending heartbeats).
@@ -206,10 +248,7 @@ The value is incremented by 1 for every node that has an automatic-failover that
* _Enablement of disk-related automatic failover; with corresponding time-period_.
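If an example helps here, a sketch of enabling the disk read/write failover trigger with its time-period via the REST API (placeholder host and credentials, as in the earlier sketches; the 120-second value is only an example):

[source,console]
----
# Turn on auto-failover and additionally fail over a node if Data Service
# disk reads/writes keep failing for 120 seconds.
curl -s -u Administrator:password -X POST \
  http://localhost:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'failoverOnDataDiskIssues[enabled]=true' \
  -d 'failoverOnDataDiskIssues[timePeriod]=120'
----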
This comment is in reference to line 247 which describes the "Count".
It should say:
The value is incremented by 1 for every node that has an automatic-failover that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0. Running a rebalance will reset the count value back to 0.
@@ -68,13 +71,16 @@ In each illustration, all servers are assumed to be running the Data Service.
[#vbucket-distribution-across-equal-groups]
This comment is for line 69, which currently says:
In each illustration, all servers are assumed to be running the Data Service.
Should be updated to say:
In each illustration, all servers are assumed to be running the Data Service, except for the arbiter node server, which does not run any service.
@@ -68,13 +71,16 @@ In each illustration, all servers are assumed to be running the Data Service.
[#vbucket-distribution-across-equal-groups]
=== Equal Groups

The following illustration shows how vBuckets are distributed across two groups; each group containing four of its cluster's eight nodes.
The following illustration shows how vBuckets are distributed across three groups; two group containing four of its cluster's eight nodes and a thrid group that can include a single arbiter node..
The original description was correct since the vBuckets are distributed only across two groups. The third group only contains an arbiter node to allow a quorum to be formed if all the nodes in one server group fails. So, should say:
The following illustration shows how vBuckets are distributed across two groups; each group containing four of the cluster's nodes. The third group only contains one node, an arbiter node, which exists to allow a quorum to be formed if all the nodes in server group 1 or 2 fails.
@@ -208,6 +214,8 @@ For example, given a cluster:

At a minimum, one instance of the Index Service and one instance of the Search Service should be deployed on each rack.

Also, for auto-failover to be possible, the service-specific auto-failover constraints be met -- the policy information is documented in xref:learn:clusters-and-availability/automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] -- it lists the number of nodes that each service must be running on and explains the xref:learn:clusters-and-availability/automatic-failover.adoc#data-service-preference[Data Service Preference] when a service is co-located with the Data Service.
Typo.
Instead of:
Also, for auto-failover to be possible, the service-specific auto-failover constraints be met
Should be:
Also, for auto-failover to be possible, the service-specific auto-failover constraints must be met
updated as per https://docs.google.com/document/d/1Lnzqu7mW8PtGDzrWbS-e7SU0J457ZC2d1ijfsx_nbTw/edit?tab=t.0