Frequent PD-TiKV Connection Retries and High GC Activity Affecting Performance #59646

Open · FangCundi opened this issue Feb 19, 2025 · 0 comments
Labels: type/bug (The issue is confirmed as a bug.)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  • Cluster Environment:

    • TiDB version: v9.0.0-alpha-100-g91706ec8df-dirty
    • Operating system: Kubernetes-based containerized environment with cgroup resource control
    • Cluster topology: 4 PD nodes, 4 TiKV nodes, 1 TiDB node (all in the same network)
  • Reproduction Scenario:

    1. Deploy the TiDB cluster in a Kubernetes environment.
    2. Load a significant amount of data and put the cluster under sustained high load.
    3. During the high-load period (especially during GC and periodic compaction), observe the PD-TiKV connection retries and GC Safe Point updates.
  • Observed Timeframe:

    • The issue is most apparent from 14:00 to 15:00 on February 18, 2025.
    • PD logs show frequent GC Safe Point updates while TiKV repeatedly reconnects to PD, suggesting a communication problem between the two components (see the log-counting sketch after this list).

2. What did you expect to see? (Required)

  • Stable connections between TiKV and PD without frequent retries.
  • PD's GC operation should be efficient without causing excessive disk I/O or performance degradation.
  • The TiDB cluster should maintain high availability and performance during high load periods.

3. What did you see instead (Required)

  • TiKV logs show repeated attempts, at a regular ten-minute interval, to connect to PD endpoints:

```text
[2025/02/18 15:00:09.856 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://166.18.0.11:2379] [thread_id=12]
[2025/02/18 15:10:09.856 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://166.18.0.11:2379] [thread_id=12]
[2025/02/18 15:20:09.859 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://166.18.0.11:2379] [thread_id=12]
```

  • PD logs show frequent GC Safe Point updates, compaction operations, and storage of new hash values:

```text
[2025/02/18 14:06:07.190 +08:00] [INFO] [periodic.go:134] ["starting auto periodic compaction"] [revision=2857322] [compact-period=1h0m0s]
[2025/02/18 14:06:07.192 +08:00] [INFO] [periodic.go:142] ["completed auto periodic compaction"] [revision=2857322] [compact-period=1h0m0s] [took=1.302374ms]
[2025/02/18 14:06:07.192 +08:00] [INFO] [index.go:214] ["compact tree index"] [revision=2857322]
[2025/02/18 14:06:07.364 +08:00] [INFO] [kvstore_compaction.go:69] ["finished scheduled compaction"] [compact-revision=2857322] [took=167.096203ms]
[2025/02/18 14:06:07.364 +08:00] [INFO] [hash.go:137] ["storing new hash"] [hash=989640134] [revision=2857322] [compact-revision=2853381]
```
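
Note the durations above: the scheduled compaction took ~167 ms versus ~1.3 ms for the periodic pass. A sketch like the following (an assumed helper, not part of PD; the path and threshold are placeholders) scans a PD log for [took=...] fields and flags slow operations:

```go
// pd_compaction_took.go — a minimal sketch (assumed helper, not a PD tool) that
// extracts [took=...] durations from a PD log and reports the slow ones.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Captures the value inside fields such as [took=167.096203ms].
var tookRe = regexp.MustCompile(`\[took=([^\]]+)\]`)

func main() {
	f, err := os.Open("/var/log/pd/pd.log") // placeholder path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	const slow = 100 * time.Millisecond // arbitrary threshold for illustration
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		m := tookRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		if d >= slow {
			fmt.Printf("slow operation (%v): %s\n", d, line)
		}
	}
}
```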

  • TiDB logs also show frequent updates and transitions, particularly in the adaptive update ts interval state transition:

```text
[2025/02/18 15:03:28.432 +08:00] [INFO] [pd.go:431] ["adaptive update ts interval state transition"] [configuredInterval=2s] [prevAdaptiveUpdateInterval=2s] [newAdaptiveUpdateInterval=2s] [requiredStaleness=0s] [prevState=unknown(0)] [newState=normal]
[2025/02/18 15:04:29.600 +08:00] [INFO] [meminfo.go:179] ["use cgroup memory hook because TiDB is in the container"]
[2025/02/18 15:04:29.601 +08:00] [INFO] [cgmon.go:60] ["cgroup monitor started"]
```

Taken together, the frequent connection retries and GC activity may cause performance degradation: slow query responses, increased I/O load, and possible service interruptions. A sketch that cross-references the two log streams follows.
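
To check whether the retries line up with compaction activity, a sketch like the following (assumed helper; the paths and the one-minute window are placeholders) reports each TiKV connection attempt that falls near a PD "finished scheduled compaction" event:

```go
// retry_gc_correlate.go — a minimal sketch (assumed helper) that cross-references
// TiKV PD-connection attempts with PD compaction events within a time window.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

const tsLayout = "2006/01/02 15:04:05.000 -07:00"

// timesMatching returns the timestamps of log lines containing substr.
func timesMatching(path, substr string) ([]time.Time, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var out []time.Time
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, substr) {
			continue
		}
		end := strings.Index(line, "]") // timestamp is the first bracketed field
		if !strings.HasPrefix(line, "[") || end < 0 {
			continue
		}
		if ts, err := time.Parse(tsLayout, line[1:end]); err == nil {
			out = append(out, ts)
		}
	}
	return out, sc.Err()
}

func main() {
	retries, err := timesMatching("/var/log/tikv/tikv.log", "connecting to PD endpoint") // placeholder
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	compactions, err := timesMatching("/var/log/pd/pd.log", "finished scheduled compaction") // placeholder
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	const window = time.Minute // arbitrary correlation window
	for _, r := range retries {
		for _, c := range compactions {
			if d := r.Sub(c); d > -window && d < window {
				fmt.Printf("retry at %s within %v of compaction at %s\n",
					r.Format(tsLayout), window, c.Format(tsLayout))
				break
			}
		}
	}
}
```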

4. What is your TiDB version? (Required)

```text
Release Version: v9.0.0-alpha-100-g91706ec8df-dirty
Edition: Community
Git Commit Hash: 91706ec
Git Branch: master
UTC Build Time: 2025-01-13 18:22:07
GoVersion: go1.23.4
Race Enabled: false
Check Table Before Drop: false
Store: tikv
```
