Update docs for Leader Leases GA in v25.2 #19587


Merged · 6 commits · May 12, 2025
2 changes: 1 addition & 1 deletion src/current/_includes/v25.2/essential-metrics.md
@@ -93,7 +93,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is neither a negative nor a positive signal; rather, it reflects elastic cluster activity. For example, this metric is high during cluster topology changes. A high value is often the cause of NotLeaseHolderErrors, which are normal and expected during rebalancing. Observing this metric can help confirm the cause of such errors. |
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates that more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates that more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates that more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the KV-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram that maintains buckets that can be queried for, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram that maintains buckets that can be queried for, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
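
The rows above reference the `kv.allocator.load_based_rebalancing.objective` cluster setting. As a hedged sketch of how to inspect and change it from SQL (the setting name comes from the table itself; the `'cpu'` and `'qps'` values are assumptions based on the values documented in recent releases):

```sql
-- Check which load signal drives store-level rebalancing (and therefore the
-- lease transfers counted by rebalancing.lease.transfers).
SHOW CLUSTER SETTING kv.allocator.load_based_rebalancing.objective;

-- Switch the objective; 'cpu' and 'qps' are the values used in recent
-- releases (assumption: both remain valid in v25.2).
SET CLUSTER SETTING kv.allocator.load_based_rebalancing.objective = 'cpu';
```
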
@@ -1,4 +1,3 @@
[Per-replica circuit breakers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#per-replica-circuit-breakers) have the following limitations:

- They cannot prevent requests from hanging when the node's [liveness range]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#node-liveness-issues).
- They are not tripped if _all_ replicas of a range [become unavailable]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#db-console-shows-under-replicated-unavailable-ranges), because the circuit breaker mechanism operates per-replica. This means at least one replica needs to be available to receive the request in order for the breaker to trip.
@@ -1,4 +1,4 @@
By default under `SERIALIZABLE` isolation, locks acquired using `SELECT ... FOR UPDATE` and `SELECT ... FOR SHARE` are implemented as fast, in-memory [unreplicated locks](architecture/transaction-layer.html#unreplicated-locks). If a [lease transfer]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) or [range split/merge]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#range-merges) occurs on a range held by an unreplicated lock, the lock is dropped. The following behaviors can occur:
By default under `SERIALIZABLE` isolation, locks acquired using `SELECT ... FOR UPDATE` and `SELECT ... FOR SHARE` are implemented as fast, in-memory [unreplicated locks](architecture/transaction-layer.html#unreplicated-locks). If a [lease transfer]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases) or [range split/merge]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#range-merges) occurs on a range held by an unreplicated lock, the lock is dropped. The following behaviors can occur:

- The desired ordering of concurrent accesses to one or more rows of a table expressed by your use of `SELECT ... FOR UPDATE` may not be preserved (that is, a transaction _B_ against some table _T_ that was supposed to wait behind another transaction _A_ operating on _T_ may not wait for transaction _A_).
- The transaction that acquired the (now dropped) unreplicated lock may fail to commit, leading to [transaction retry errors with code `40001`]({% link {{ page.version.version }}/transaction-retry-error-reference.md %}) and the [`restart transaction` error message]({% link {{ page.version.version }}/common-errors.md %}#restart-transaction).
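
Since the paragraph above describes unreplicated locks being dropped on a lease transfer or range split/merge, here is a minimal sketch of the locking pattern in question; the `accounts` table and its columns are hypothetical:

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance DECIMAL);

BEGIN;
-- Under SERIALIZABLE isolation this acquires a fast, in-memory unreplicated
-- lock. If a lease transfer or range split/merge drops that lock, this
-- transaction may later fail with a 40001 retry error, as described above.
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;
```
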
2 changes: 1 addition & 1 deletion src/current/_includes/v25.2/leader-leases-intro.md
@@ -1 +1 @@
CockroachDB offers an improved leasing system rebuilt atop a stronger form of [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) leadership that ensures that the Raft leader is **always** the range's leaseholder. This new type of lease is called a _Leader lease_, and supersedes [epoch-based leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) and [expiration-based leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#expiration-based-leases-meta-and-system-ranges) leases while combining the performance of the former with the resilience of the latter. **Leader leases are not enabled by default.**
CockroachDB offers an improved leasing system rebuilt atop a stronger form of [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) leadership that ensures that the Raft leader is always the range's leaseholder, except briefly during [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node). This type of lease is called a _Leader lease_. It supersedes the former system of separate epoch-based and expiration-based lease types, combining the performance of epoch-based leases with the resilience of expiration-based leases.
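
As a sketch of how one might confirm that the mechanism underlying Leader leases is active, assuming the `kv.raft.leader_fortification.fraction_enabled` cluster setting introduced in recent releases still governs it in v25.2:

```sql
-- Assumption: this setting controls the fraction of ranges that use Raft
-- leader fortification (and therefore Leader leases); 1 means all ranges.
SHOW CLUSTER SETTING kv.raft.leader_fortification.fraction_enabled;
```
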
@@ -0,0 +1,7 @@
{% include_cached new-in.html version="v25.2" %} For the purposes of [Raft replication]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and determining the [leaseholder]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) of a [range]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range), node health is no longer determined by heartbeating a single "liveness range"; instead it is determined using [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases).

However, node heartbeats of a single range are still used to determine:

- Whether a node is still a member of a cluster (this is used by [`cockroach node decommission`]({% link {{ page.version.version }}/cockroach-node.md %}#node-decommission)).
- Whether a node is dead (in which case [its leases will be transferred away]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node)).
- How to avoid placing replicas on dead, decommissioning, or unhealthy nodes, and how to make decisions about lease transfers.
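
A hedged way to inspect the liveness state described by the bullets above is to query the `crdb_internal.gossip_liveness` virtual table; the column list below is an assumption carried over from recent releases:

```sql
-- Assumption: these columns exist on crdb_internal.gossip_liveness in v25.2.
-- 'membership' reflects the decommissioning state consulted by
-- `cockroach node decommission`.
SELECT node_id, epoch, draining, decommissioning, membership, updated_at
FROM crdb_internal.gossip_liveness;
```
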
4 changes: 3 additions & 1 deletion src/current/_includes/v25.2/misc/basic-terms.md
@@ -24,13 +24,15 @@ The replica that holds the "range lease." This replica receives and coordinates

For most types of tables and queries, the leaseholder is the only replica that can serve consistent reads (reads that return "the latest" data).

{% include_cached new-in.html version="v25.2" %} The leaseholder is always the same replica as the [Raft leader](#architecture-raft-leader), except briefly during [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node). For more information, refer to [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases).

### Raft protocol
<a name="architecture-raft"></a>
The [consensus protocol]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) employed in CockroachDB that ensures that your data is safely stored on multiple nodes and that those nodes agree on the current state even if some of them are temporarily disconnected.

### Raft leader
<a name="architecture-raft-leader"></a>
For each range, the replica that is the "leader" for write requests. The leader uses the Raft protocol to ensure that a majority of replicas (the leader and enough followers) agree, based on their Raft logs, before committing the write. The Raft leader is almost always the same replica as the leaseholder.
For each range, the replica that is the "leader" for write requests. The leader uses the Raft protocol to ensure that a majority of replicas (the leader and enough followers) agree, based on their Raft logs, before committing the write. {% include_cached new-in.html version="v25.2" %} The Raft leader is always the same replica as the [leaseholder](#architecture-leaseholder), except briefly during [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node). For more information, refer to [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases).

### Raft log
A time-ordered log of writes to a range that its replicas have agreed on. This log exists on-disk with each replica and is the range's source of truth for consistent replication.
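
The leaseholder and Raft leader entries above state that the two roles now coincide on the same replica. As a sketch of how to see which node holds a table's lease, assuming the `WITH DETAILS` option and `lease_holder` column of `SHOW RANGES` behave as in recent releases (the table name is hypothetical):

```sql
-- WITH DETAILS adds the lease_holder column (assumption: column names match
-- recent CockroachDB releases).
SHOW RANGES FROM TABLE accounts WITH DETAILS;
```
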
4 changes: 2 additions & 2 deletions src/current/v25.2/admission-control.md
@@ -5,7 +5,7 @@ toc: true
docs_area: develop
---

CockroachDB supports an admission control system to maintain cluster performance and availability when some nodes experience high load. When admission control is enabled, CockroachDB sorts request and response operations into work queues by priority, giving preference to higher priority operations. Internal operations critical to node health, like [node liveness heartbeats]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#node-liveness-issues), are high priority. The admission control system also prioritizes transactions that hold [locks]({% link {{ page.version.version }}/crdb-internal.md %}#cluster_locks), to reduce [contention]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and release locks earlier.
CockroachDB supports an admission control system to maintain cluster performance and availability when some nodes experience high load. When admission control is enabled, CockroachDB sorts request and response operations into work queues by priority, giving preference to higher priority operations. Internal operations critical to node health are high priority. The admission control system also prioritizes transactions that hold [locks]({% link {{ page.version.version }}/crdb-internal.md %}#cluster_locks), to reduce [contention]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and release locks earlier.

## How admission control works

@@ -95,7 +95,7 @@ When you enable or disable admission control settings for one layer, Cockroach L

When admission control is enabled, request and response operations are sorted into work queues where the operations are organized by priority and transaction start time.

Higher priority operations are processed first. The criteria for determining higher and lower priority operations is different at each processing layer, and is determined by the CPU and storage I/O of the operation. Write operations in the [KV storage layer]({% link {{ page.version.version }}/architecture/storage-layer.md %}) in particular are often the cause of performance bottlenecks, and admission control prevents [the Pebble storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}#pebble) from experiencing high [read amplification]({% link {{ page.version.version }}/architecture/storage-layer.md %}#read-amplification). Critical cluster operations like node heartbeats are processed as high priority, as are transactions that hold [locks]({% link {{ page.version.version }}/crdb-internal.md %}#cluster_locks) in order to avoid [contention]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention) and release locks earlier.
Higher priority operations are processed first. The criteria for determining higher and lower priority operations is different at each processing layer, and is determined by the CPU and storage I/O of the operation. Write operations in the [KV storage layer]({% link {{ page.version.version }}/architecture/storage-layer.md %}) in particular are often the cause of performance bottlenecks, and admission control prevents [the Pebble storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}#pebble) from experiencing high [read amplification]({% link {{ page.version.version }}/architecture/storage-layer.md %}#read-amplification). Critical cluster operations are processed as high priority, as are transactions that hold [locks]({% link {{ page.version.version }}/crdb-internal.md %}#cluster_locks) in order to avoid [contention]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention) and release locks earlier.

The transaction start time is used within the priority queue and gives preference to operations with earlier transaction start times. For example, within the high priority queue operations with an earlier transaction start time are processed first.
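
The paragraphs above describe how admission control sorts work into priority queues. A hedged sketch of the cluster settings that toggle it per layer, assuming the setting names from recent releases carry over to v25.2:

```sql
-- Assumption: these setting names are unchanged in v25.2.
SHOW CLUSTER SETTING admission.kv.enabled;
SHOW CLUSTER SETTING admission.sql_kv_response.enabled;
SHOW CLUSTER SETTING admission.sql_sql_response.enabled;
```
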

2 changes: 1 addition & 1 deletion src/current/v25.2/architecture/distribution-layer.md
@@ -239,7 +239,7 @@ The distribution layer's `DistSender` receives `BatchRequests` from its own node

### Distribution and replication layer

The distribution layer routes `BatchRequests` to nodes containing ranges of data, which is ultimately routed to the Raft group leader or leaseholder, which are handled in the replication layer.
The distribution layer routes `BatchRequests` to nodes containing ranges of data; these requests are ultimately routed to the Raft group leader and leaseholder, which are handled in the replication layer.

## What's next?
