Shutdown hung on database updates [SQLite] #3436
Comments
Might be related to recent SQLite driver update 🤔 Did the server have an existing database when it started? |
Yes. The database is nearly a decade old, and has traversed many Plan (and MC) updates since then. I have restarted both servers several times since this error occurred, without further issues. |
Spoke too soon. The following on the same server
and the server is hung again. Force-stopped and restarted. |
Followed (after restart) by
|
Hmm, I'll have to investigate if it's something to do with WAL (write ahead log) becoming slow or something |
Given that this was happening on only one of the two (nearly identical) servers, I think it likely that the Plan database became corrupted. I restored it from backup, prior to Plan build 2731. The problem has not recurred through five server stop/restart sequences (performed to test the proposition). I restart these test servers several times a day, to test various plugin updates, so I will report if the issue recurs. |
Could you run this build? It has some debug logging benchmarks for transactions, so it should catch any transaction that is causing the queue to be this long. Unfortunately, I couldn't reproduce the issue locally |
Should be possible to |
Will do. Thanks. |
Plan 5.6-b2739 On one server, all TPSStoreTransaction events report
|
Ah sorry, the "woah" is faster than "fast" 😅 It's going to say slow if something is clogging up the queue, though if a transaction never finishes it won't show up in the logs at all 🤔 https://github.com/plan-player-analytics/Plan/blob/master/Plan/common/src/main/java/com/djrapitops/plan/utilities/dev/Benchmark.java#L86 The labels are woah/fast/slow to keep them at 4 characters for consistent logging width |
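For readers following along, here is a minimal sketch of how a fixed-width speed label like this could be derived from an elapsed time. The class name `SpeedLabel` and both thresholds are assumptions made up for illustration; the real cut-offs are in the linked Benchmark.java and may differ.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the class name and thresholds are invented for
// illustration and are not taken from Plan's Benchmark.java.
final class SpeedLabel {
    // Assumed cut-offs in milliseconds.
    static final long WOAH_BELOW_MS = 5;
    static final long FAST_BELOW_MS = 100;

    private SpeedLabel() {}

    // Returns a 4-character label so that log columns line up.
    static String of(long elapsedNanos) {
        long ms = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        if (ms < WOAH_BELOW_MS) return "woah";
        if (ms < FAST_BELOW_MS) return "fast";
        return "slow";
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // The timed work (for example a transaction) would run here.
        long elapsed = System.nanoTime() - start;
        System.out.println("[" + of(elapsed) + "] example transaction");
    }
}
```

A benchmark of this kind only reports work that completes, which is why a transaction that never finishes leaves no line in the log.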
Excellent. Thanks. For the last six hours (new log at midnight), the second server showed
which is expected. The first server showed
so only 208 per-minute reports. It should be 360 minutes. The second server also completed
The first server is hung
requiring a force-stop. :( |
Alright I'll compile another version to log all the unfinished transactions to figure out what is hanging the server |
Oh, I should use |
It's going to take a bit more time to figure out what to log. It looks like there is something (a query or a transaction) that is holding the SQLite connection (the locks in /tmp/ suggest the same). Unfortunately I don't have time today, since it turned out to be more complicated than I thought. It could be related to looking up some data for the website (or export) near the reload, which would explain why it sometimes works and sometimes hangs. For context, the plugin waits until the SQLite connection is released, since without that wait there's a potential JVM crash with a segmentation fault in SQLite's native C code. |
No worries, no hurry. I am going to revert to Plan 5.6-b2703 in the interim. The issue started with Plan 5.6-b2731 (or sometime between them, to be more precise). |
Plan-5.6-dev-build-2742.jar.zip Alright I managed to get it to log something at the end if something is holding a connection. It was pretty tricky. In this above build there will be something like these messages:
So you should be able to Note that these are not logged if server doesn't freeze |
Thanks. Installed build 2742 on both servers. At first shutdowns, nothing to report. I'll report as I recycle them over the next few days. |
First hit:
|
Alright, that narrows it down a bit - looks like I missed some method that can reserve connections leading to the empty brackets in the log |
The issue recurred on 4 of 16 shutdowns since installing Plan build 2742, both with the same message set. Interesting (perhaps) all of the failures to date (not just with Plan build 2742) have occurred on the same server (of the two in the test bed), the one with the faster Reverting to Plan build 2703 until you need more data. |
It looks like the mystery [] call is the one holding everything up. The other two are trying to access the plan_servers table with very quick queries. I added logic to print the full stacktrace for the empty one, since I don't know why it's empty. I also changed the message a bit, since this logic is now in the master branch
New dev build will appear here in 10 minutes https://github.com/plan-player-analytics/Plan/actions/runs/7772660658 |
Is that sufficient data, or do you need more? |
I think I figured it out! Thanks for your help. My best guess of what seems to be happening is that when this happens:
there are still some transactions running, but since the queue is killed after a 20 second wait, they never release the connection they held, and the server keeps waiting for something that will never happen, since the thread that was supposed to do it is already dead. |
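The failure mode described above can be modelled roughly as follows. This is a sketch under stated assumptions, not Plan's code: `QueueShutdownSketch`, the `Semaphore` standing in for the single SQLite connection, and the millisecond parameters are all hypothetical. It shows the two safeguards that prevent an endless wait: the transaction releases the connection in `finally` even when its thread is interrupted, and the final wait for the connection is bounded.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the hang; none of these names come from Plan.
final class QueueShutdownSketch {
    // One permit stands in for the single SQLite connection.
    private final Semaphore connection = new Semaphore(1);
    private final ExecutorService queue = Executors.newSingleThreadExecutor();

    // Submits a "transaction" that holds the connection for workMillis.
    void submit(long workMillis) {
        queue.execute(() -> {
            boolean held = false;
            try {
                connection.acquire();
                held = true;
                Thread.sleep(workMillis); // stands in for a slow query
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // the queue was killed
            } finally {
                if (held) connection.release(); // released even when killed
            }
        });
    }

    // Waits for the queue, kills it on timeout, then waits a bounded time
    // for the connection. Returns true if the connection became free.
    boolean shutdown(long queueWaitMillis, long connectionWaitMillis) {
        try {
            queue.shutdown();
            if (!queue.awaitTermination(queueWaitMillis, TimeUnit.MILLISECONDS)) {
                queue.shutdownNow(); // interrupts the running transaction
            }
            return connection.tryAcquire(connectionWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If the `finally` block were missing and the last wait used a plain `acquire()`, the shutdown thread would block forever after the queue was killed, which matches the hang reported in this issue.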
Sounds right. |
Got it. Thanks for the info. |
Here's another attempt at fixing this: https://github.com/plan-player-analytics/Plan/actions/runs/7940164586
|
With build 2757, no hangs on four shutdown attempts on the server that occasionally hangs. For example:
It continues to be the case that the second server has never hung. |
Unfortunately
|
Alright. I'll try to see if I can speed up the clean task somehow. Not tomorrow though, it's my birthday :) I think it might be possible to make this occur less often by changing Clean_database_every (or it was something like that) under Time settings |
What about
or is it going to wait forever regardless (given the discussion above)? |
Hang has now occurred on both servers, due to |
Yes |
Someone mentioned a slow query in the clean task on Discord, so it's probably to blame. It's related to extensions, so a server with more extension data is going to experience the wait at the end more frequently. It might stop waiting after the query ends though; I'm not sure how long it can take. The query was modified recently to try to optimize it, so maybe I accidentally made it worse and that is causing the issue, not the SQLite driver 🤔 I'll look into it |
Extensions support @conditional value where a boolean provider determines if other values should exist. Unsatisfied values were being removed during database cleanup task. The cleanup transaction was very slow and could hang the server if it was performed near shutdown. The cleanup is now performed on boolean value change (individual value for one player) instead of with large cleanup transaction (all values and all players). Affects issues: - #3436
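The change described in the commit message can be pictured with a small in-memory sketch. Everything here is hypothetical (the class `ConditionalValues`, its methods, and the example value names); Plan keeps this data in SQL tables, and the actual commit is not reproduced here. The point is only the shape of the work: one player's values are pruned when one condition flips to false, instead of scanning all values for all players in a single large transaction.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical in-memory model; not Plan's schema or API.
final class ConditionalValues {
    // player -> (value name -> name of the condition it depends on)
    private final Map<UUID, Map<String, String>> values = new HashMap<>();

    void store(UUID player, String valueName, String requiredCondition) {
        values.computeIfAbsent(player, k -> new HashMap<>())
              .put(valueName, requiredCondition);
    }

    // Called when one boolean value changes for one player.
    // Removes only that player's values whose condition is no longer met
    // and returns how many values were removed.
    int onConditionChange(UUID player, String condition, boolean nowTrue) {
        Map<String, String> playerValues = values.get(player);
        if (nowTrue || playerValues == null) return 0;
        int before = playerValues.size();
        playerValues.values().removeIf(condition::equals);
        return before - playerValues.size();
    }

    Set<String> valuesOf(UUID player) {
        return new HashSet<>(values.getOrDefault(player, Map.of()).keySet());
    }
}
```

Each call touches a single player's entries, so the cost is spread across normal operation rather than concentrated in a cleanup that may still be running at shutdown.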
I have made a dev release that might fix the issue https://github.com/plan-player-analytics/Plan/releases/tag/5.6.2796 It includes the commit above. |
Paper 1.20.4-436 Update (from build 2703) and start-up
and a few hours later, first shutdown attempt
and hung ... |
Note above that the shutdown sequence includes the following commands
They appear to complete before |
I just noticed an anomaly.
and
but
Surely that should be |
If there are multiple webservers running, the commands give one of them, since all webservers can show any server in the same database. Since this is a SQLite database, the database file might have been copied over. You can mark the other server as uninstalled with |
|
It seems that the hanging transaction is cleaning time-series data. If the database was copied over and the other server uses a different time series deletion threshold in the config, this can take a while. Please check that these settings match between your test server and the server the data came from |
As noted earlier, the two configurations are identical, except for
|
|
Alright, I guess it's necessary to revert the driver then |
If there is anything else you want me to test or help debug, I am happy to do so. |
Thanks I'll comment here again once I've reverted the sqlite driver |
A build with reverted sqlite driver should appear here in 10 minutes https://github.com/plan-player-analytics/Plan/actions/runs/8213232733 If this doesn't solve it I don't know what will |
With Plan build 2799 (per link above), neither server has failed on shutdown through 7 attempts each (recall that server B has never failed during this entire debugging cycle). So while not conclusive (difficult to prove a negative), I am optimistic that the issue is resolved by the SQLite driver reversion. Thanks. |
Thanks, that gives me confidence to release tomorrow |
Describe the issue
One of two (insofar as possible) identical servers hung on shutdown while waiting for Plan database updates (see below). It was still waiting 5 minutes later, so I force-killed the server. The Plan database seemed OK, in that there were no start-up errors (see further below) and the Plan website displayed correct data.
Exceptions & Other Logs
On server shutdown
Plugin versions
Additional information
On server restart