Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown hung on database updates [SQLite] #3436

Closed
SlimeDog opened this issue Jan 27, 2024 · 60 comments
Closed

Shutdown hung on database updates [SQLite] #3436

SlimeDog opened this issue Jan 27, 2024 · 60 comments
Labels
Bug Issues that contain unintended behavior DEV release Issues that are related to a DEV version and not present in previous Release version

Comments

@SlimeDog
Copy link

Describe the issue

On one of two (insofar as possible) identical servers, one hung on shutdown waiting for Plan database updates (see below). Still waiting 5 minutes later, force killed the server. Plan database seemed ok, in that there were no start-up errors (see further below) and Plan website displayed correct data.

Exceptions & Other Logs

On server shutdown

...
[14:14:02 INFO]: [Plan] Disabling Plan v5.6 build 2731
[14:14:02 INFO]: [Plan] Processing complete.
[14:14:02 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[14:14:22 WARN]: [Plan] 181 unfinished database transactions were not executed.
[14:14:22 INFO]: [Plan] Transaction queue closed.
[14:14:22 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..

Plugin versions

> version
[13:46:26 INFO]: This server is running Paper version git-Paper-401 (MC: 1.20.4) (Implementing API version 1.20.4-R0.1-SNAPSHOT) (Git: 1ed50da)
> version Plan
[14:19:50 INFO]: Plan version 5.6 build 2731

Additional information

On server restart

$ grep -i plan logs/latest
[14:18:16] [Server thread/INFO]: [Plan] Loading server plugin Plan v5.6 build 2731
[14:18:37] [Server thread/INFO]: [Plan] Enabling Plan v5.6 build 2731
[14:18:38] [Server thread/INFO]: [Plan] 
[14:18:38] [Server thread/INFO]: [Plan] 
[14:18:38] [Server thread/INFO]: [Plan] Locale: 'English' by AuroraLS3
[14:18:38] [Server thread/INFO]: [Plan] Downloading SQLite Driver, this may take a while...
[14:18:38] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] Database: Making sure schema is up to date..
[14:18:38] [Server thread/INFO]: [Plan] SQLite-database connection established.
[14:18:38] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] All database patches applied successfully.
[14:18:38] [Server thread/INFO]: [Plan] WebServer: Certificate KeyStore File not Found: plugins/Plan/Cert.keystore
[14:18:38] [Server thread/INFO]: [Plan] WebServer: No Certificate -> Using HTTP-server for Visualization.
[14:18:38] [Server thread/INFO]: [Plan] WebServer: User Authorization Disabled! (Not secure over HTTP)
[14:18:38] [Server thread/INFO]: [Plan] Webserver running on PORT 8804 ( http://blackdog.straight8.com:8804 )
[14:18:38] [Server thread/INFO]: [Plan] Loading server identifying information
[14:18:38] [Server thread/INFO]: [Plan] Server identifier loaded: 6af88d11-64d3-3a61-b49d-d327d5932d73
[14:18:38] [Server thread/INFO]: [Plan] Registered extension: BSkyBlock
[14:18:38] [Server thread/INFO]: [Plan] Registered extension: Essentials
[14:18:38] [Server thread/INFO]: [Plan] Registered extension: EssentialsEco
[14:18:39] [Server thread/INFO]: [Plan] Registered extension: LuckPerms
[14:18:39] [Server thread/INFO]: [Plan] Registered extension: PlaceholderAPI
[14:18:39] [Server thread/INFO]: [Plan] Registered extension: Economy (Vault)
[14:18:39] [Server thread/INFO]: [Plan] Player Analytics Enabled.
[14:19:02] [Server thread/INFO]: [PlaceholderAPI] Successfully registered internal expansion: plan [5.6 build 2731]
[14:19:50] [Server thread/INFO]: Plan version 5.6 build 2731
@SlimeDog SlimeDog added the Bug Issues that contain unintended behavior label Jan 27, 2024
@AuroraLS3 AuroraLS3 added the DEV release Issues that are related to a DEV version and not present in previous Release version label Jan 28, 2024
@github-project-automation github-project-automation bot moved this to INBOX in Backlog Jan 28, 2024
@AuroraLS3 AuroraLS3 changed the title Shutdown hung on database updates Shutdown hung on database updates [SQLite] Jan 28, 2024
@AuroraLS3
Copy link
Collaborator

Might be related to recent SQLite driver update 🤔 Did the server have an existing database when it started?

@SlimeDog
Copy link
Author

SlimeDog commented Jan 28, 2024

Did the server have an existing database when it started?

Yes. The database is nearly a decade old, and has traversed many Plan (and MC) updates since then. I have restarted both servers several times since this error occurred, without further issues.

@SlimeDog
Copy link
Author

SlimeDog commented Jan 28, 2024

Spoke too soon. The following on the same server

> plan reload
[21:27:21 INFO]: [Plan] Processing complete.
[21:27:21 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[21:27:41 WARN]: [Plan] 184 unfinished database transactions were not executed.
[21:27:41 INFO]: [Plan] Transaction queue closed.
[21:27:41 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..

and the server is hung again. Force-stopped and restarted.

@SlimeDog
Copy link
Author

SlimeDog commented Jan 28, 2024

Followed (after restart) by

> plan reload
[21:39:12 INFO]: [Plan] Processing complete.
[21:39:12 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[21:39:12 INFO]: [Plan] Transaction queue closed.
[21:39:12 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[21:39:12 INFO]: [Plan] Closed SQLite connection.
[21:39:12 INFO]: [Plan] Player Analytics Disabled.
[21:39:12 INFO]: [Plan] 
[21:39:12 INFO]: [Plan] 
[21:39:12 INFO]: [Plan] Locale: 'English' by AuroraLS3
[21:39:12 INFO]: [Plan] Downloading SQLite Driver, this may take a while...
[21:39:12 ERROR]: [STDERR] [org.slf4j.helpers.Util] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[21:39:12 ERROR]: [STDERR] [org.slf4j.helpers.Util] SLF4J: Defaulting to no-operation (NOP) logger implementation
[21:39:12 ERROR]: [STDERR] [org.slf4j.helpers.Util] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[21:39:12 INFO]: [Plan] SQLite-database connection established.
[21:39:12 INFO]: [Plan] Database: Making sure schema is up to date..
[21:39:12 INFO]: [Plan] WebServer: Certificate KeyStore File not Found: plugins/Plan/Cert.keystore
[21:39:12 INFO]: [Plan] WebServer: No Certificate -> Using HTTP-server for Visualization.
[21:39:12 INFO]: [Plan] WebServer: User Authorization Disabled! (Not secure over HTTP)
[21:39:12 INFO]: [Plan] Webserver running on PORT ELIDED ( URL_ELIDED )
[21:39:12 INFO]: [Plan] Loading server identifying information
[21:39:12 INFO]: [Plan] Server identifier loaded: 6af88d11-64d3-3a61-b49d-d327d5932d73
[21:39:12 INFO]: [Plan] All database patches applied successfully.
[21:39:12 INFO]: [Plan] Registered extension: BSkyBlock
[21:39:12 INFO]: [Plan] Registered extension: Essentials
[21:39:12 INFO]: [Plan] Registered extension: EssentialsEco
[21:39:12 INFO]: [Plan] Registered extension: LuckPerms
[21:39:12 INFO]: [Plan] Registered extension: PlaceholderAPI
[21:39:12 INFO]: [Plan] Registered extension: Economy (Vault)
[21:39:12 INFO]: [Plan] Player Analytics Enabled.
[21:39:12 INFO]: Reload Complete

@AuroraLS3
Copy link
Collaborator

AuroraLS3 commented Jan 28, 2024

Hmm, I'll have to investigate if it's something to do with WAL (write ahead log) becoming slow or something

@SlimeDog
Copy link
Author

SlimeDog commented Jan 28, 2024

Given that this was happening on only one of the two (nearly identical) servers, I think it likely that the Plan database became corrupted. I restored it from backup, prior to Plan build 2731. The problem has not recurred through five server stop/restart sequences (performed to test the proposition). I restart these test servers several times a day, to test various plugin updates, so I will report if the issue recurs.

@AuroraLS3
Copy link
Collaborator

AuroraLS3 commented Jan 28, 2024

Could you run this build
Plan-5.6-dev-build-2739.jar.zip

It has some debug logging benchmarks for transactions, it should catch if some transaction is causing the queue to be this long. I couldn't unfortunately reproduce the issue locally

@AuroraLS3
Copy link
Collaborator

Should be possible to grep for [Plan] to get those benchmarks from the logs if it occurs again

@SlimeDog
Copy link
Author

Could you run this build Plan-5.6-dev-build-2739.jar.zip

It has some debug logging benchmarks for transactions, it should catch if some transaction is causing the queue to be this long. I couldn't unfortunately reproduce the issue locally

Will do. Thanks.

@SlimeDog
Copy link
Author

Plan 5.6-b2739

On one server, all TPSStoreTransaction events report fast and double-digit ms timings.
On the other, some TPSStoreTransaction events report woah. The fast events have single-digit ms timings, but the woah events have 6-digit ns timings. For example

$ grep TPSStoreT logs/latest.log
[00:08:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (5ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:09:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:10:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:11:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (2ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:12:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:13:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:14:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] woah (939807ns)  - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:15:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:16:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] woah (923289ns)  - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:17:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] woah (926757ns)  - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:18:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] woah (902576ns)  - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:19:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:20:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] woah (881481ns)  - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:21:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:22:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[00:23:37] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] fast (1ms)       - com.djrapitops.plan.storage.database.transactions.events.TPSStoreTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355

@AuroraLS3
Copy link
Collaborator

Ah sorry the "woah" is faster than "fast" 😅 - It's going to say slow if something is clogging up the queue, though if it never finishes it won't show up in the logs at all 🤔 https://github.com/plan-player-analytics/Plan/blob/master/Plan/common/src/main/java/com/djrapitops/plan/utilities/dev/Benchmark.java#L86

It's woah/fast/slow to keep it 4 characters for consistent logging width

@SlimeDog
Copy link
Author

Excellent. Thanks. For the last six hours (new log at midnight), the second server showed

[ram@whitedog MC]$ grep TPSStoreT logs/latest.log | grep -c woah
0
[ram@whitedog MC]$ grep TPSStoreT logs/latest.log | grep -c fast
356

which is expected.

The first server showed

[ram@blackdog MC]$ grep TPSStoreT logs/latest.log | grep -c woah
191
[ram@blackdog MC]$ grep TPSStoreT logs/latest.log | grep -c fast
17

so only 208 per-minute reports. It should be 360 minutes.

The second server also completed plan reload

> plan reload
[06:07:48 INFO]: [Plan] Processing complete.
[06:07:48 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[06:08:08 WARN]: [Plan] 4 unfinished database transactions were not executed.
[06:08:08 INFO]: [Plan] Transaction queue closed.
[06:08:08 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[06:08:35 INFO]: [Plan] slow (70s)       - com.djrapitops.plan.storage.database.transactions.init.RemoveOldSampledDataTransaction, com.djrapitops.plan.storage.database.SQLDB#lambda$executeTransaction$5:355
[06:08:35 INFO]: [Plan] Closed SQLite connection.
[06:08:35 INFO]: [Plan] Player Analytics Disabled.
[06:08:35 INFO]: [Plan] 
[06:08:35 INFO]: [Plan] 
[06:08:35 INFO]: [Plan] Locale: 'English' by AuroraLS3
[06:08:35 INFO]: [Plan] Downloading SQLite Driver, this may take a while...
[06:08:36 ERROR]: [STDERR] [org.slf4j.helpers.Util] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[06:08:36 ERROR]: [STDERR] [org.slf4j.helpers.Util] SLF4J: Defaulting to no-operation (NOP) logger implementation
[06:08:36 ERROR]: [STDERR] [org.slf4j.helpers.Util] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[06:08:36 INFO]: [Plan] SQLite-database connection established.
[06:08:36 INFO]: [Plan] WebServer: Certificate KeyStore File not Found: plugins/Plan/Cert.keystore
[06:08:36 INFO]: [Plan] Database: Making sure schema is up to date..
[06:08:36 INFO]: [Plan] WebServer: No Certificate -> Using HTTP-server for Visualization.
[06:08:36 INFO]: [Plan] WebServer: User Authorization Disabled! (Not secure over HTTP)
[06:08:36 INFO]: [Plan] Player Analytics Enabled.
[06:08:36 INFO]: Reload Complete

The first server is hung

> plan reload
[06:08:09 INFO]: [Plan] Processing complete.
[06:08:09 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[06:08:29 WARN]: [Plan] 176 unfinished database transactions were not executed.
[06:08:29 INFO]: [Plan] Transaction queue closed.
[06:08:29 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..

requiring a force-stop. :(
There were four SQLite locks in /tmp/.

@AuroraLS3
Copy link
Collaborator

Alright I'll compile another version to log all the unfinished transactions to figure out what is hanging the server

@SlimeDog
Copy link
Author

SlimeDog commented Jan 28, 2024

Oh, I should use grep djrapitops.plan rather than grep TPSStoreT. The reports are not all the latter. Next time. Or tell me what values you want to see.

@AuroraLS3
Copy link
Collaborator

AuroraLS3 commented Jan 28, 2024

It's going to take a bit more time to figure out what to log - looks like there is something (a query or a transaction) that is holding the SQLite connection (The locks in /tmp/ suggest same). I don't unfortunately have time today since it turned out to be more complicated than I thought.

It could be related to looking up some data for the website (or export) near the reload - which can explain how it sometimes works and sometimes hangs.

For context the plugin will wait until the SQLite connection is released since without it there's a potential JVM crash with a Segmentation fault in SQLite Native C code.

@SlimeDog
Copy link
Author

No worries, no hurry. I am going to revert to Plan 5.6-b2703 in the interim. The issue started with Plan 5.6-b2731 (or sometime between them, to be more precise).

@AuroraLS3
Copy link
Collaborator

Plan-5.6-dev-build-2742.jar.zip

Alright I managed to get it to log something at the end if something is holding a connection. It was pretty tricky.

In this above build there will be something like these messages:

[06:08:29 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[06:08:29 INFO]: [Plan] Following call sites are holding connection:
[06:08:29 INFO]: [Plan] [com.djrapitops.plan.storage.database.SQLiteTest.setUp(SQLiteTest.java:79)]

So you should be able to grep [Plan] to get the messages or look at the end of the log when the server is frozen

Note that these are not logged if server doesn't freeze

@SlimeDog
Copy link
Author

SlimeDog commented Feb 3, 2024

Thanks. Installed build 2742 on both servers. At first shutdowns, nothing to report. I'll report as I recycle them over the next few days.

@SlimeDog
Copy link
Author

SlimeDog commented Feb 3, 2024

First hit:

[09:24:56 INFO]: [Plan] Disabling Plan v5.6 build 2742
[09:24:56 INFO]: [Plan] Processing complete.
[09:24:56 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[09:25:16 WARN]: [Plan] 95 unfinished database transactions were not executed.
[09:25:16 INFO]: [Plan] Transaction queue closed.
[09:25:16 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[09:25:16 INFO]: [Plan] Following call sites are holding connection:
[09:25:16 INFO]: [Plan] []
[09:25:16 INFO]: [Plan] [Plan-5.6-b2742.jar//com.djrapitops.plan.storage.database.Database.queryCollection(Database.java:83), Plan-5.6-b2742.jar//com.djrapitops.plan.storage.upkeep.DBCleanTask.cleanOldPlayers(DBCleanTask.java:165)]
[09:25:16 INFO]: [Plan] [Plan-5.6-b2742.jar//com.djrapitops.plan.storage.database.queries.objects.ServerQueries.lambda$fetchPlanServerInformationCollection$0(ServerQueries.java:107), Plan-5.6-b2742.jar//com.djrapitops.plan.delivery.webserver.Addresses.getAnyValidServerAddress(Addresses.java:116)]

@AuroraLS3
Copy link
Collaborator

Alright, that narrows it down a bit - looks like I missed some method that can reserve connections leading to the empty brackets in the log

@SlimeDog
Copy link
Author

SlimeDog commented Feb 3, 2024

The issue recurred on 4 of 16 shutdowns since installing Plan build 2742, both with the same message set. Interesting (perhaps) all of the failures to date (not just with Plan build 2742) have occurred on the same server (of the two in the test bed), the one with the faster woah processing.

Reverting to Plan build 2703 until you need more data.

@AuroraLS3
Copy link
Collaborator

It looks like it's the mystery [] call is the one holding everything up - the other two are trying to access plan_servers table with very quick queries.

I added logic to print full stacktrace for the empty thing since I don't know why it's empty.

I also changed the message a bit since this logic is now in the master branch

[09:25:16 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[09:25:16 INFO]: [Plan] Waiting for these connections to finish:
[09:25:16 INFO]: [Plan] [...]

New dev build will appear here in 10 minutes https://github.com/plan-player-analytics/Plan/actions/runs/7772660658

@SlimeDog
Copy link
Author

SlimeDog commented Feb 4, 2024

[08:09:24 INFO]: [Plan] Disabling Plan v5.6 build 2753
[08:09:24 INFO]: [Plan] Processing complete.
[08:09:24 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[08:09:44 WARN]: [Plan] 167 unfinished database transactions were not executed.
[08:09:44 INFO]: [Plan] Transaction queue closed.
[08:09:44 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[08:09:44 INFO]: [Plan] Waiting for these connections to finish:
[08:09:44 INFO]: [Plan] [Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.Database.queryCollection(Database.java:83), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.upkeep.DBCleanTask.cleanOldPlayers(DBCleanTask.java:165)]
[08:09:44 INFO]: [Plan] [java.base/java.lang.Thread.getStackTrace(Thread.java:2450), Plan-5.6-b2753.jar//com.djrapitops.plan.utilities.SemaphoreAccessCounter.getAccessingThing(SemaphoreAccessCounter.java:57), Plan-5.6-b2753.jar//com.djrapitops.plan.utilities.SemaphoreAccessCounter.enter(SemaphoreAccessCounter.java:63), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.SQLiteDB.getConnection(SQLiteDB.java:198), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.transactions.Transaction.initializeConnection(Transaction.java:179), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.transactions.Transaction.executeTransaction(Transaction.java:83), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.SQLDB.lambda$executeTransaction$4(SQLDB.java:353), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.DBAccessLock.performDatabaseOperation(DBAccessLock.java:78), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.DBAccessLock.performDatabaseOperation(DBAccessLock.java:56), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.SQLDB.lambda$executeTransaction$5(SQLDB.java:352), java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768), java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144), java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642), java.base/java.lang.Thread.run(Thread.java:1583)]
[08:09:44 INFO]: [Plan] [Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.queries.objects.ServerQueries.lambda$fetchPlanServerInformationCollection$0(ServerQueries.java:107), Plan-5.6-b2753.jar//com.djrapitops.plan.delivery.webserver.Addresses.getAnyValidServerAddress(Addresses.java:116)]
[08:09:44 INFO]: [Plan] [java.base/java.lang.Thread.getStackTrace(Thread.java:2450), Plan-5.6-b2753.jar//com.djrapitops.plan.utilities.SemaphoreAccessCounter.getAccessingThing(SemaphoreAccessCounter.java:57), Plan-5.6-b2753.jar//com.djrapitops.plan.utilities.SemaphoreAccessCounter.enter(SemaphoreAccessCounter.java:63), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.SQLiteDB.getConnection(SQLiteDB.java:198), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.transactions.Transaction.initializeConnection(Transaction.java:179), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.transactions.Transaction.executeTransaction(Transaction.java:83), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.transactions.init.OperationCriticalTransaction.executeTransaction(OperationCriticalTransaction.java:36), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.SQLDB.lambda$executeTransaction$4(SQLDB.java:353), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.DBAccessLock.performDatabaseOperation(DBAccessLock.java:61), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.DBAccessLock.performDatabaseOperation(DBAccessLock.java:56), Plan-5.6-b2753.jar//com.djrapitops.plan.storage.database.SQLDB.lambda$executeTransaction$5(SQLDB.java:352), java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768), java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144), java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642), java.base/java.lang.Thread.run(Thread.java:1583)]

Is that sufficient data, or do you need more?

@AuroraLS3
Copy link
Collaborator

I think I figured it out! Thanks for your help.

My best guess of what seems to be happening is that when this happens:

[08:09:24 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[08:09:44 INFO]: [Plan] Transaction queue closed.

there are still some transactions running, but since the queue is killed after 20 second wait, they never release the connection they held, and the server keeps waiting for something that will never happen since the thread supposed to do it is already dead.

@SlimeDog
Copy link
Author

SlimeDog commented Feb 4, 2024

Transaction queue closed.

Sounds right.

@SlimeDog
Copy link
Author

Got it. Thanks for the info.

AuroraLS3 added a commit that referenced this issue Feb 17, 2024
@AuroraLS3
Copy link
Collaborator

Here's another attempt at fixing this: https://github.com/plan-player-analytics/Plan/actions/runs/7940164586

  • Added more detail to the log in case this doesn't fix it
  • Stopped rebooting database on session shutdown save if database is closed (The sessions are preserved elsewhere in a backup mechanism). This was my best guess on what was happening previously

@SlimeDog
Copy link
Author

With build 2757, no hangs on four shutdown attempts on the server that occasionally hangs. For example:

[07:11:06] [Server thread/INFO]: [Plan] Disabling Plan v5.6 build 2757
[07:11:06] [Server thread/INFO]: [Plan] Processing complete.
[07:11:06] [Server thread/INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[07:11:06] [Server thread/INFO]: [Plan] Transaction queue closed.
[07:11:06] [Server thread/INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[07:11:06] [Server thread/INFO]: [Plan] Closed SQLite connection.
[07:11:06] [Server thread/INFO]: [Plan] Player Analytics Disabled.

It continues to be the case that the second server has never hung.

@SlimeDog
Copy link
Author

Unfortunately

[09:47:29 INFO]: [Plan] Disabling Plan v5.6 build 2757
[09:47:29 INFO]: [Plan] Processing complete.
[09:47:29 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[09:47:49 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[09:47:49 INFO]: [Plan] Waiting for these connections to finish:
[09:47:49 INFO]: [Plan] [Plan-5.6-b2757.jar//com.djrapitops.plan.storage.database.Database.queryCollection(Database.java:83), Plan-5.6-b2757.jar//com.djrapitops.plan.storage.upkeep.DBCleanTask.cleanOldPlayers(DBCleanTask.java:165)]
[09:47:49 INFO]: [Plan] [Plan-5.6-b2757.jar//com.djrapitops.plan.storage.upkeep.DBCleanTask.run(DBCleanTask.java:109)]
[09:47:49 INFO]: [Plan] [Plan-5.6-b2757.jar//com.djrapitops.plan.storage.database.queries.objects.ServerQueries.lambda$fetchPlanServerInformationCollection$0(ServerQueries.java:107), Plan-5.6-b2757.jar//com.djrapitops.plan.delivery.webserver.Addresses.getAnyValidServerAddress(Addresses.java:116)]

@AuroraLS3
Copy link
Collaborator

Alright. I'll try see if I can speed up the clean task somehow. Not tomorrow though, got birthday :)

I think it might be possible to reduce this occurring by changing Clean_database_every (or it was something like that) under Time settings

@SlimeDog
Copy link
Author

SlimeDog commented Feb 17, 2024

What about

Wait_for_DB_Transactions_on_disable:
            Time: 20
            Unit: SECONDS

or is it going to wait forever regardless (given the discussion above)?

@SlimeDog
Copy link
Author

Hang has now occurred on both servers, due to DBCleanTask. Reverting to build 2703.

@AuroraLS3
Copy link
Collaborator

is it going to wait forever regardless (given the discussion above)?

Yes

@AuroraLS3
Copy link
Collaborator

Someone mentioned slow query in clean task on discord so it's probably to blame - it's related to extensions so server with more extension data is going to experience the wait at the end more frequently.

It might stop waiting after the query ends though, I'm not sure how long it can take. The query was modified recently to try optimize it, maybe I accidentally made it worse and it's that causing the issue and not sqlite driver 🤔 I'll look into it

AuroraLS3 added a commit that referenced this issue Mar 2, 2024
Extensions support @conditional value where a boolean provider determines if other values should exist.
Unsatisfied values were being removed during database cleanup task.
The cleanup transaction was very slow and could hang the server if it was performed near shutdown.

The cleanup is now performed on boolean value change (individual value for one player)
instead of with large cleanup transaction (all values and all players).

Affects issues:
- #3436
@AuroraLS3
Copy link
Collaborator

I have made a dev release that might fix the issue https://github.com/plan-player-analytics/Plan/releases/tag/5.6.2796

It includes the commit above.

@SlimeDog
Copy link
Author

SlimeDog commented Mar 3, 2024

Paper 1.20.4-436
Plan 5.6-b2796

Update (from build 2703) and start-up

[04:27:13] [Server thread/INFO]: [Plan] Enabling Plan v5.6 build 2796
[04:27:14] [Server thread/INFO]: [Plan] 
[04:27:14] [Server thread/INFO]: [Plan] 
[04:27:14] [Server thread/INFO]: [Plan] Locale: 'English' by AuroraLS3
[04:27:14] [Server thread/INFO]: [Plan] Downloading SQLite Driver, this may take a while...
[04:27:14] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] Database: Making sure schema is up to date..
[04:27:14] [Server thread/INFO]: [Plan] SQLite-database connection established.
[04:27:14] [Plan SQLiteDB-transaction-thread-1/INFO]: [Plan] All database patches applied successfully.
[04:27:14] [Server thread/INFO]: [Plan] WebServer: Certificate KeyStore File not Found: plugins/Plan/Cert.keystore
[04:27:14] [Server thread/INFO]: [Plan] WebServer: No Certificate -> Using HTTP-server for Visualization.
[04:27:14] [Server thread/INFO]: [Plan] WebServer: User Authorization Disabled! (Not secure over HTTP)
[04:27:15] [Server thread/INFO]: [Plan] Webserver running on PORT 8804 ( http://blackdog.straight8.com:8804 )
[04:27:15] [Server thread/INFO]: [Plan] Loading server identifying information
[04:27:15] [Server thread/INFO]: [Plan] Server identifier loaded: 6af88d11-64d3-3a61-b49d-d327d5932d73
[04:27:15] [Server thread/INFO]: [Plan] Registered extension: BSkyBlock
[04:27:15] [Server thread/INFO]: [Plan] Registered extension: Essentials
[04:27:15] [Server thread/INFO]: [Plan] Registered extension: EssentialsEco
[04:27:15] [Server thread/INFO]: [Plan] Registered extension: LuckPerms
[04:27:15] [Server thread/INFO]: [Plan] Registered extension: PlaceholderAPI
[04:27:15] [Server thread/INFO]: [Plan] Registered extension: Economy (Vault)
[04:27:15] [Server thread/INFO]: [Plan] Player Analytics Enabled.

and a few hours later, first shutdown attempt

[07:18:19] [Craft Scheduler Thread - 651 - Plan/INFO]: Disabled 'Kick Counting' temporarily until next plugin reload.
[07:18:21] [Server thread/INFO]: [Essentials] CONSOLE issued server command: /essentials:kickall The Big Dog says: We'll be right back!
[07:18:25 INFO]: [Plan] Disabling Plan v5.6 build 2796
[07:18:25 INFO]: [Plan] Processing complete.
[07:18:25 INFO]: [Plan] Waiting for unfinished transactions to avoid data loss..
[07:18:45 INFO]: [Plan] Waiting queries to finish to avoid SQLite crashing JVM..
[07:18:45 INFO]: [Plan] Waiting for these connections to finish:
[07:18:45 INFO]: [Plan] [Plan-5.6-b2796.jar//com.djrapitops.plan.storage.upkeep.DBCleanTask.run(DBCleanTask.java:107)]
[07:18:45 INFO]: [Plan] [Plan-5.6-b2796.jar//com.djrapitops.plan.storage.database.Database.queryCollection(Database.java:83), Plan-5.6-b2796.jar//com.djrapitops.plan.storage.upkeep.DBCleanTask.cleanOldPlayers(DBCleanTask.java:163)]
[07:18:45 INFO]: [Plan] [Plan-5.6-b2796.jar//com.djrapitops.plan.storage.database.queries.objects.ServerQueries.lambda$fetchPlanServerInformationCollection$0(ServerQueries.java:107), Plan-5.6-b2796.jar//com.djrapitops.plan.delivery.webserver.Addresses.getAnyValidServerAddress(Addresses.java:116)]

and hung ...

@SlimeDog
Copy link
Author

SlimeDog commented Mar 3, 2024

Note above that the shutdown sequence includes the following commands

> plan:plan analyze
(wait 2 seconds)
> plan:plan disable kickCount
(wait 1 second)
> essentials:kickall
(wait 1 second, and no players were connected)

They appear to complete before disable is attempted at 07:18:25. But are these contributing to the problem?

@SlimeDog
Copy link
Author

SlimeDog commented Mar 3, 2024

I just noticed an anomaly.

sql> select * from plan_servers ;
id          uuid                                  name        web_address                         is_installed  max_players  is_proxy    plan_version  
----------  ------------------------------------  ----------  ----------------------------------  ------------  -----------  ----------  --------------
1           6af88d11-64d3-3a61-b49d-d327d5932d73  BlackDog    http://blackdog.straight8.com:8804  1             25           0           5.6 build 2703
2           adc7e80a-9e51-3adf-ac41-1ca978fd9c57  WhiteDog    http://whitedog.straight8.com:8804  1             25           0           5.6 build 2731

and

Webserver:
    Port: 8804
    Alternative_IP:
        Enabled: true
        # %port% is replaced automatically with Webserver.Port
        Address: blackdog.straight8.com:%port%
...
    External_Webserver_address: "http://blackdog.straight8.com"

but

> plan analyze
[08:17:14 INFO]: Server page: http://whitedog.straight8.com:8804/server/6af88d11-64d3-3a61-b49d-d327d5932d73

Surely that should be blackdog.straight8.com

@AuroraLS3
Copy link
Collaborator

If there's multiple webservers running the commands give one of them since all webservers can show any server in the same database.

Since this is a SQLite database, the database file might have been copied over. You can mark the other server as uninstalled with /plan db uninstalled [id]

@AuroraLS3
Copy link
Collaborator

/plan analyze likely leaves [07:18:45 INFO]: [Plan] [Plan-5.6-b2796.jar//com.djrapitops.plan.storage.database.queries.objects.ServerQueries.lambda$fetchPlanServerInformationCollection$0(ServerQueries.java:107), Plan-5.6-b2796.jar//com.djrapitops.plan.delivery.webserver.Addresses.getAnyValidServerAddress(Addresses.java:116)] in the waiting queue

@AuroraLS3
Copy link
Collaborator

Seems that the hanging transaction is cleaning time-series data.

If the database is copied over and other server uses different Time series deletion threshold in the config this can take a while.

Please check that these settings match with your test server and server the data is from
Time.Thresholds.Remove_time_series_data_after (Default 90 days)
Time.Thresholds.Remove_ping_data_after (Default 14 days)

@SlimeDog
Copy link
Author

SlimeDog commented Mar 3, 2024

As noted earlier, the two configurations are identical, except for ServerName, Address, and External_Webserver_address.

diff server-specific/{black,white}dog/plugins/Plan/config.yml
7c7
<     ServerName: BlackDog
---
>     ServerName: WhiteDog
50c50
<         Address: blackdog.straight8.com:%port%
---
>         Address: whitedog.straight8.com:%port%
93c93
<     External_Webserver_address: "http://blackdog.straight8.com"
---
>     External_Webserver_address: "http://whitedog.straight8.com"

@SlimeDog
Copy link
Author

SlimeDog commented Mar 3, 2024

$ grep -A 1 Remove_ server-specific/{black,white}dog/plugins/Plan/config.yml
server-specific/blackdog/plugins/Plan/config.yml:            Remove_logs_after_days: 30
server-specific/blackdog/plugins/Plan/config.yml-        IP_whitelist:
--
server-specific/blackdog/plugins/Plan/config.yml:        Remove_inactive_player_data_after:
server-specific/blackdog/plugins/Plan/config.yml-            Time: 3650
--
server-specific/blackdog/plugins/Plan/config.yml:        Remove_time_series_data_after:
server-specific/blackdog/plugins/Plan/config.yml-            Time: 90
--
server-specific/blackdog/plugins/Plan/config.yml:        Remove_ping_data_after:
server-specific/blackdog/plugins/Plan/config.yml-            Time: 14
--
server-specific/blackdog/plugins/Plan/config.yml:        Remove_disabled_extension_data_after:
server-specific/blackdog/plugins/Plan/config.yml-            Time: 2
--
server-specific/whitedog/plugins/Plan/config.yml:            Remove_logs_after_days: 30
server-specific/whitedog/plugins/Plan/config.yml-        IP_whitelist:
--
server-specific/whitedog/plugins/Plan/config.yml:        Remove_inactive_player_data_after:
server-specific/whitedog/plugins/Plan/config.yml-            Time: 3650
--
server-specific/whitedog/plugins/Plan/config.yml:        Remove_time_series_data_after:
server-specific/whitedog/plugins/Plan/config.yml-            Time: 90
--
server-specific/whitedog/plugins/Plan/config.yml:        Remove_ping_data_after:
server-specific/whitedog/plugins/Plan/config.yml-            Time: 14
--
server-specific/whitedog/plugins/Plan/config.yml:        Remove_disabled_extension_data_after:
server-specific/whitedog/plugins/Plan/config.yml-            Time: 2

@AuroraLS3
Copy link
Collaborator

Alright, I guess it's necessary to revert the driver then

@SlimeDog
Copy link
Author

SlimeDog commented Mar 4, 2024

If there is anything else you want me to test or help debug, I am happy to do so.

@AuroraLS3
Copy link
Collaborator

Thanks I'll comment here again once I've reverted the sqlite driver

AuroraLS3 added a commit that referenced this issue Mar 9, 2024
@AuroraLS3
Copy link
Collaborator

A build with reverted sqlite driver should appear here in 10 minutes https://github.com/plan-player-analytics/Plan/actions/runs/8213232733

If this doesn't solve it I don't know what will

@SlimeDog
Copy link
Author

SlimeDog commented Mar 9, 2024

With Plan build 2799 (per link above), neither server has failed on shutdown through 7 attempts each (recall that server B has never failed during this entire debugging cycle). So while not conclusive (difficult to prove a negative), I am optimistic that the issue is resolved by the SQLite driver reversion. Thanks.

@AuroraLS3
Copy link
Collaborator

Thanks, that gives me confidence to release tomorrow

@github-project-automation github-project-automation bot moved this from INBOX to Done in Backlog Mar 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Issues that contain unintended behavior DEV release Issues that are related to a DEV version and not present in previous Release version
Projects
Status: Done
Development

No branches or pull requests

2 participants