performance of input_files on large workspaces #723
Comments
This is obviously bad and needs fixing :(
I did work on an …
Okay, so this is an elaborate filter for … I think that we should try to parse …
Just to add another use case to the scenario: even a simple … (see core/ocrd_models/ocrd_models/ocrd_mets.py, line 304 in ad32b00)
So the proposed caching should also happen here.
@MehmedGIT what's the status of your OcrdMets profiling experiment?
@bertsky I have pushed my latest changes to the benchmarking branch and have not worked on that experiment since. @mweidling is investigating this topic in more depth, and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionality as soon as possible.
Thanks for pointing that out, @bertsky!
@MehmedGIT I don't understand: the benchmark-mets branch does not seem to contain any actual changes to the modules, only additional tests in … Could it be that you actually need to incorporate these changes into …?
@bertsky you are right, I have not changed the actual modules. To implement my own versions of the functions, I extended the OcrdMets class; I did that because I did not know how to compile just the OcrdMets class on its own 😅 and this was a fast way around it. Moreover, the ExtendedOcrdMets class was meant as a proof of concept, to see whether the functions can be optimized at all and to get some comparison numbers to trigger further discussion. The changes in the benchmarks branch are not a proper implementation ready to be merged.
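For what it's worth, a comparison like that can be driven by a very small harness. The sketch below is purely illustrative (the two lookup functions are stand-ins I made up, not the real OcrdMets methods); it just shows timing a baseline scan against a pre-built index on the same input with `timeit`:

```python
# Hypothetical benchmarking harness, in the spirit of the proof-of-concept
# comparison described above. The two lookup functions are stand-ins, NOT
# the real OcrdMets methods.
import timeit

def baseline_lookup(pages, wanted):
    # linear scan per query, analogous to re-searching the METS for every file
    return [p for p in pages if p == wanted]

def cached_lookup(index, wanted):
    # dict lookup, analogous to querying a prebuilt pageId mapping
    return index.get(wanted, [])

pages = [f'PHYS_{i:04d}' for i in range(2000)]
index = {p: [p] for p in pages}

t_base = timeit.timeit(lambda: baseline_lookup(pages, 'PHYS_1999'), number=1000)
t_cache = timeit.timeit(lambda: cached_lookup(index, 'PHYS_1999'), number=1000)
print(f'baseline: {t_base:.4f}s  cached: {t_cache:.4f}s')
```

The same pattern (identical queries against the original and the extended class, timed over a realistic workspace) would give comparable numbers for the actual OcrdMets functions.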
@MehmedGIT I see, thanks.
By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.
Here's a typical scenario:
- `self.input_files`: [core/ocrd/ocrd/processor/base.py, lines 294 to 299 in 9069a65]
- … `mets:file` entries, matching them for `fileGrp` (which is reasonably fast, it only gets a little inefficient when additionally filtering by `pageId`): [core/ocrd_models/ocrd_models/ocrd_mets.py, lines 176 to 208 in 9069a65]
- … `OcrdFile.pageId`: [core/ocrd_models/ocrd_models/ocrd_file.py, lines 116 to 122 in 9069a65]
- … (… in `input_files`): [core/ocrd_models/ocrd_models/ocrd_mets.py, lines 434 to 441 in 9069a65]

(A rough sketch of the resulting cost follows right below.)
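To make the cost concrete, here is a minimal, self-contained sketch of that access pattern on a synthetic METS built with lxml. This is not the actual ocrd_models code, just an approximation of the shape of the problem: every pageId lookup re-scans the physical structMap, so filtering a single fileGrp by pageId costs on the order of files × pages node visits.

```python
# Illustrative sketch (NOT the actual ocrd_models implementation): shows why
# resolving pageId per file by re-scanning the physical structMap is expensive.
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}
METS = '{%s}' % NS['mets']

def build_fake_mets(n_pages, n_filegrps):
    """Synthetic METS with n_filegrps fileGrps of n_pages files each."""
    mets = etree.Element(METS + 'mets', nsmap=NS)
    filesec = etree.SubElement(mets, METS + 'fileSec')
    structmap = etree.SubElement(mets, METS + 'structMap', TYPE='PHYSICAL')
    for g in range(n_filegrps):
        grp = etree.SubElement(filesec, METS + 'fileGrp', USE=f'GRP{g}')
        for p in range(n_pages):
            etree.SubElement(grp, METS + 'file', ID=f'GRP{g}_{p:04d}')
    for p in range(n_pages):
        page = etree.SubElement(structmap, METS + 'div', TYPE='page', ID=f'PHYS_{p:04d}')
        for g in range(n_filegrps):
            etree.SubElement(page, METS + 'fptr', FILEID=f'GRP{g}_{p:04d}')
    return mets

def page_id_of(mets, file_id):
    """Full scan of the physical structMap for the page referencing file_id --
    this is the kind of lookup that gets repeated for every single file."""
    hits = mets.xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]'
        '[mets:fptr[@FILEID="%s"]]/@ID' % file_id, namespaces=NS)
    return hits[0] if hits else None

mets = build_fake_mets(n_pages=500, n_filegrps=5)
# Emulate filtering one fileGrp by pageId: each of the 500 files triggers a
# scan over all 500 pages, i.e. ~250000 page visits for a single fileGrp.
files = mets.xpath('//mets:fileGrp[@USE="GRP0"]/mets:file', namespaces=NS)
print(sum(1 for f in files if page_id_of(mets, f.get('ID'))), 'pageIds resolved')
```

Multiply that by the number of fileGrps a processor touches, and by every processor in a workflow, and the slowdown on large workspaces follows directly.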
A little cosmetics like turning
OcrdFile.pageId
into afunctools.cached_property
won't help here, the problem is bigger. METS with its mutually related fileGrp and pageId mappings is inherently expensive to parse. I know we have in the past decided against in-memory representations like dicts because that looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed the cost of that memory-cputime tradeoff carefully (and considering the necessity for pageId/mimetype filtering) yet? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?The text was updated successfully, but these errors were encountered: