[BUG] Memory leak in kitodo:reindex CLI job #1165
Comments
Previously all document instances were kept in a static registry property of the document class. As it seems, this property isn't used any more (at least it isn't populated any more), so my guess would be that the instances are now cached elsewhere and that this cache needs to be cleared regularly during indexing. (And maybe we should rename the clearing method to "clearCache()" as well and remove the unused property.)
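For illustration, this is roughly the pattern being described: a static registry that keeps every loaded document for the lifetime of the process, plus a clearing method that could be renamed to clearCache(). All names here are placeholders, not the actual Kitodo classes.

```php
<?php
// Illustrative sketch of the registry/cache pattern discussed above.
class DocumentRegistry
{
    /** @var array<string, object> every document instance ever loaded */
    protected static array $registry = [];

    public static function getInstance(string $location): object
    {
        // Each loaded document is kept for the whole process lifetime,
        // so a long-running CLI job accumulates memory per document.
        if (!isset(self::$registry[$location])) {
            self::$registry[$location] = new \stdClass(); // stand-in for the real document
        }
        return self::$registry[$location];
    }

    // The clearing method suggested above (possible rename to clearCache()).
    public static function clearCache(): void
    {
        self::$registry = [];
    }
}
```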
Thanks. If you have a code snippet, we can test that tomorrow.
I provided a pull request in #1168. Thanks for reviewing/testing!
I cherry-picked the 2nd commit (1a2ea0d) from that pull request. Indexing still leaks:
The memory traces were produced using this patch:
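As a rough standalone illustration of such a trace (not the actual patch; the 300 KB per document is just a stand-in figure), logging memory_get_usage() after every document makes the linear growth visible:

```php
<?php
// Standalone illustration of a per-document memory trace (not the actual patch).
$documents = [];

for ($i = 1; $i <= 5; $i++) {
    // Simulate ~300 KB being retained for every indexed document.
    $documents[] = str_repeat('x', 300 * 1024);

    fwrite(STDERR, sprintf(
        "doc=%d mem=%.1f MB peak=%.1f MB\n",
        $i,
        memory_get_usage(true) / 1048576,
        memory_get_peak_usage(true) / 1048576
    ));
}
```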
@sebastian-meyer I have noticed a static property that also keeps instances of the Solr class.
This one actually holds the core-specific instances of the Solr connection objects. That's fine and should not blow up memory usage, since all indexing uses the same connection instance (i.e. there should be only one instance kept by this property during indexing). But I found another potential memory leak in a different place.
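To illustrate why such a keyed instance cache should not grow during indexing (all names below are placeholders, not the actual Kitodo Solr class): it holds at most one connection object per core, and a reindex run keeps asking for the same core.

```php
<?php
// Illustrative per-core connection cache: at most one entry per Solr core.
class SolrConnectionCache
{
    /** @var array<string, object> */
    protected static array $instances = [];

    public static function get(string $core): object
    {
        if (!isset(self::$instances[$core])) {
            self::$instances[$core] = new \stdClass(); // stand-in for the real client
        }
        return self::$instances[$core];
    }
}

// During reindexing the same core is requested for every document,
// so the cache stays at a single entry and does not grow per document.
for ($i = 0; $i < 1000; $i++) {
    SolrConnectionCache::get('myCore');
}
```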
Short update on this topic. I have looked into this memory usage:
Results:
So basically the memory usage keeps growing with every indexed document.

By this we can calculate that every single document adds between 200 and 400 KB. The main offender is the per-document processing itself; I'm currently analyzing which of its parts are responsible for the growth.

At the moment when the indexer has passed 1500 documents, memory usage is around 872 MB, so not even close to the case presented in #822, where it is already around 5 GB. Please correct me if I have mistaken something.

Changes can be previewed here: https://github.com/kitodo/kitodo-presentation/compare/master...beatrycze-volk:memory?expand=1
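A rough cross-check of those figures: at 200-400 KB per document, 1500 documents account for roughly 300-600 MB of retained memory, which is in the same range as the ~872 MB total once the baseline memory of the CLI process itself is added.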
UPDATE

Results:

The part which is increasing the memory usage is the function that processes each document.

Results:

It looks like the biggest increase in memory usage actually comes from one particular call. There is one obvious reason why the size keeps growing, and a few other possible reasons as well. I'm continuing my investigation.
@sebastian-meyer
I didn't test it myself. Maybe @beatrycze-volk can give us a hint?
When I start a reindex process, I always get an error message with the 2nd data record.
I was running this many times and never got this error. It could be that something went wrong when I was splitting the changes into separate branches. I will check it out.
@beatrycze-volk
I have checked it out and found the reason. Of course, while I was moving the changes over, I missed one part. But in fact I'm not sure if this change is even needed now, as I have since switched my approach to chunked indexing. Clearing the state actually makes more sense after a given chunk of documents instead of after every document. I will make a PR for it today.
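A minimal sketch of clearing state per chunk rather than per document, assuming Extbase's PersistenceManager::clearState() is among the things being reset (chunk size, helper name, and loop structure are illustrative, not the actual implementation):

```php
<?php
use TYPO3\CMS\Extbase\Persistence\Generic\PersistenceManager;

// Sketch of chunked indexing: release state once per chunk instead of per document.
// indexOneDocument() is a hypothetical helper; the chunk size is an assumption.
function reindexInChunks(array $documentUids, PersistenceManager $persistenceManager, int $chunkSize = 100): void
{
    foreach (array_chunk($documentUids, $chunkSize) as $chunk) {
        foreach ($chunk as $uid) {
            indexOneDocument($uid); // hypothetical: load one document and send it to Solr
        }
        // After a whole chunk, drop Extbase's persistence session and run the
        // collector, so objects from that chunk can actually be freed.
        $persistenceManager->clearState();
        gc_collect_cycles();
    }
}
```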
No, that's not necessary.
#1203 also contains a bug fix for the reindex options.
We have added the change from PR #1202 manually and reindexing now proceeds past the first two documents. We will report back any news about memory issues. Thanks @beatrycze-volk!
I merged both PRs. Can you confirm that I can close this issue?
I have applied both PRs. Currently, there is still the problem of permanently increasing memory consumption. So far there has been no improvement, so the memory leak has not been fixed.
Would it be possible to let the indexer fork itself after indexing a couple of documents? I haven't worked with multi-threading in PHP yet, but maybe someone knows how to do it? I assume having multiple threads may help with garbage collection and constantly freeing up memory. What do you think?
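PHP's CLI has no built-in threads, but on Linux the pcntl extension allows forking. A rough sketch of the idea (indexChunk() and $documentUids are hypothetical placeholders, the chunk size is arbitrary): each chunk is indexed in a child process, so all of its memory is returned to the OS when the child exits.

```php
<?php
// Sketch only: one child process per chunk, so its memory is freed on exit.
// Requires the pcntl extension (CLI on Linux); indexChunk() is hypothetical.
$documentUids = range(1, 1000); // stand-in for the real list of document UIDs

foreach (array_chunk($documentUids, 100) as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new \RuntimeException('Could not fork child process');
    }
    if ($pid === 0) {
        indexChunk($chunk); // child: index this chunk ...
        exit(0);            // ... then exit and release all of its memory
    }
    pcntl_waitpid($pid, $status); // parent: wait before starting the next chunk
}
```

In practice, forking inside a running TYPO3 process is tricky because the child inherits open database and Solr connections, so spawning a fresh CLI process per chunk would be the simpler variant of the same idea.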
I have even tried manually forcing garbage collection after each single document. It didn't give any improvement: it constantly reported that there were no cycles to collect, while memory kept growing. I tried closing and recreating the Solr client after every document, and the same with the persistence manager. Still no improvement. I have split functions into smaller ones, as this should trigger garbage collection more often: no improvement. I was unsetting every variable in every loop, still no difference. Then I started searching the internet and found this quite old topic, https://stackoverflow.com/a/849618/2701807:
And there is also a newer comment:
The upgrade has introduced not only changes in our code, but also the usage of new components, e.g. Extbase. As I have already tried removing and resetting everything in our code, I sadly suspect that the problem may be in some other layer.
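For reference, the forced per-document collection described above amounts to something like this sketch ($documentUids, loadDocument() and indexDocument() are placeholders, not the code that was actually tested); in those tests gc_collect_cycles() kept reporting zero collected cycles while memory_get_usage() kept climbing:

```php
<?php
// Sketch of the per-document cleanup that was tried (not the exact code):
// unset everything, force the collector, and log what it reports.
gc_enable();

foreach ($documentUids as $uid) {      // $documentUids: stand-in for the real UID list
    $document = loadDocument($uid);    // hypothetical loader
    indexDocument($document);          // hypothetical Solr submission

    unset($document);                  // drop local references ...
    $collected = gc_collect_cycles();  // ... and force a collection run

    echo sprintf(
        "uid=%d collected=%d cycles mem=%d bytes\n",
        $uid,
        $collected,
        memory_get_usage(true)
    );
}
```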
I was thinking about it and the answer is no. I wanted to implement functionality in which one command calls other commands, something like the administrator calling one command which then runs the others.
There are currently two possibilities:
OK, I'll release the current version with the workaround, although it's quite annoying. For 5.1 we have already planned to support newer PHP versions up to 8.3, so maybe this will help with indexing as well (see #1172).
I don't expect that PHP 8.3 will have an effect on the memory leak, but maybe we'll know more soon. |
Description
Running the reindex job works with a small number of documents (or a very large amount of memory), but fails with an out-of-memory error for huge collections. While the job is running, the used memory grows continuously. 10 GB of memory is not sufficient for fewer than 3000 documents.
Reproduction
Just run the CLI job and watch how the memory usage grows.
Expected Behavior
There is no reason why the required memory should grow with the number of indexed documents.
As soon as a document has been indexed, there is no longer any need to keep its metadata in RAM.
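As a small self-contained illustration of that expectation (stand-in data, not Kitodo code): when nothing holds a reference to a document after it has been indexed, memory stays flat no matter how many documents are processed.

```php
<?php
// Stand-in demonstration: memory stays flat when each "document" is released
// right after it has been "indexed".
$indexDocument = function (array $document): void {
    // Sending the document to Solr is omitted; the point is that nothing stores it.
};

for ($i = 1; $i <= 1000; $i++) {
    $document = ['uid' => $i, 'metadata' => str_repeat('x', 300 * 1024)];
    $indexDocument($document);
    unset($document); // last reference gone, the ~300 KB can be reused immediately
}

echo sprintf("peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);
```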
Environment
Additional Context
See also issue #822. The fix added there obviously no longer works.