Frequent PD-TiKV Connection Retries and High GC Activity Affecting Performance #59646

Open · FangCundi opened this issue Feb 19, 2025 · 0 comments
Labels: type/bug (The issue is confirmed as a bug.)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  • Cluster Environment:

    • TiDB version: v9.0.0-alpha-100-g91706ec8df-dirty
    • Operating system: Kubernetes-based containerized environment with cgroup resource control
    • Cluster topology: 4 PD nodes, 4 TiKV nodes, 1 TiDB node (all in the same network)
  • Reproduction Scenario:

    1. Deploy the TiDB cluster in a Kubernetes environment.
    2. Load a significant amount of data and put the cluster under sustained high load.
    3. During the high-load period (especially during GC and periodic compaction), observe the PD-TiKV connection retries and GC Safe Point updates.
  • Observed Timeframe:

    • The issue is most apparent from 14:00 to 15:00 on February 18, 2025.
    • PD logs show frequent GC Safe Point updates while TiKV repeatedly reconnects to PD, suggesting a communication problem between the two components (see the log-counting sketch after this list).

2. What did you expect to see? (Required)

  • Stable connections between TiKV and PD without frequent retries.
  • PD's GC operation should be efficient without causing excessive disk I/O or performance degradation.
  • The TiDB cluster should maintain high availability and performance during high load periods.

3. What did you see instead (Required)

  • TiKV logs show repeated attempts, at a regular ten-minute interval, to connect to PD endpoints:

```text
[2025/02/18 15:00:09.856 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://166.18.0.11:2379] [thread_id=12]
[2025/02/18 15:10:09.856 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://166.18.0.11:2379] [thread_id=12]
[2025/02/18 15:20:09.859 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://166.18.0.11:2379] [thread_id=12]
```

  • PD logs show frequent GC Safe Point updates, compaction operations, and storage of new hash values:

```text
[2025/02/18 14:06:07.190 +08:00] [INFO] [periodic.go:134] ["starting auto periodic compaction"] [revision=2857322] [compact-period=1h0m0s]
[2025/02/18 14:06:07.192 +08:00] [INFO] [periodic.go:142] ["completed auto periodic compaction"] [revision=2857322] [compact-period=1h0m0s] [took=1.302374ms]
[2025/02/18 14:06:07.192 +08:00] [INFO] [index.go:214] ["compact tree index"] [revision=2857322]
[2025/02/18 14:06:07.364 +08:00] [INFO] [kvstore_compaction.go:69] ["finished scheduled compaction"] [compact-revision=2857322] [took=167.096203ms]
[2025/02/18 14:06:07.364 +08:00] [INFO] [hash.go:137] ["storing new hash"] [hash=989640134] [revision=2857322] [compact-revision=2853381]
```
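
Note the durations above: the scheduled compaction took ~167 ms versus ~1.3 ms for the periodic pass. A sketch like the following (an assumed helper, not part of PD; the path and threshold are placeholders) scans a PD log for [took=...] fields and flags slow operations:

```go
// pd_compaction_took.go — a minimal sketch (assumed helper, not a PD tool) that
// extracts [took=...] durations from a PD log and reports the slow ones.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Captures the value inside fields such as [took=167.096203ms].
var tookRe = regexp.MustCompile(`\[took=([^\]]+)\]`)

func main() {
	f, err := os.Open("/var/log/pd/pd.log") // placeholder path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	const slow = 100 * time.Millisecond // arbitrary threshold for illustration
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		m := tookRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		if d >= slow {
			fmt.Printf("slow operation (%v): %s\n", d, line)
		}
	}
}
```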

  • TiDB logs also show frequent updates and transitions, particularly in the adaptive update ts interval state transition:

```text
[2025/02/18 15:03:28.432 +08:00] [INFO] [pd.go:431] ["adaptive update ts interval state transition"] [configuredInterval=2s] [prevAdaptiveUpdateInterval=2s] [newAdaptiveUpdateInterval=2s] [requiredStaleness=0s] [prevState=unknown(0)] [newState=normal]
[2025/02/18 15:04:29.600 +08:00] [INFO] [meminfo.go:179] ["use cgroup memory hook because TiDB is in the container"]
[2025/02/18 15:04:29.601 +08:00] [INFO] [cgmon.go:60] ["cgroup monitor started"]
```

Taken together, the frequent connection retries and GC activity may cause performance degradation: slow query responses, increased I/O load, and possible service interruptions. A sketch that cross-references the two log streams follows.
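
To check whether the retries line up with compaction activity, a sketch like the following (assumed helper; the paths and the one-minute window are placeholders) reports each TiKV connection attempt that falls near a PD "finished scheduled compaction" event:

```go
// retry_gc_correlate.go — a minimal sketch (assumed helper) that cross-references
// TiKV PD-connection attempts with PD compaction events within a time window.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

const tsLayout = "2006/01/02 15:04:05.000 -07:00"

// timesMatching returns the timestamps of log lines containing substr.
func timesMatching(path, substr string) ([]time.Time, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var out []time.Time
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, substr) {
			continue
		}
		end := strings.Index(line, "]") // timestamp is the first bracketed field
		if !strings.HasPrefix(line, "[") || end < 0 {
			continue
		}
		if ts, err := time.Parse(tsLayout, line[1:end]); err == nil {
			out = append(out, ts)
		}
	}
	return out, sc.Err()
}

func main() {
	retries, err := timesMatching("/var/log/tikv/tikv.log", "connecting to PD endpoint") // placeholder
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	compactions, err := timesMatching("/var/log/pd/pd.log", "finished scheduled compaction") // placeholder
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	const window = time.Minute // arbitrary correlation window
	for _, r := range retries {
		for _, c := range compactions {
			if d := r.Sub(c); d > -window && d < window {
				fmt.Printf("retry at %s within %v of compaction at %s\n",
					r.Format(tsLayout), window, c.Format(tsLayout))
				break
			}
		}
	}
}
```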

4. What is your TiDB version? (Required)

```text
Release Version: v9.0.0-alpha-100-g91706ec8df-dirty
Edition: Community
Git Commit Hash: 91706ec
Git Branch: master
UTC Build Time: 2025-01-13 18:22:07
GoVersion: go1.23.4
Race Enabled: false
Check Table Before Drop: false
Store: tikv
```
