performance of input_files on large workspaces #723
Comments
This is obviously bad and needs fixing :(
I did work on an …
Okay, so this is an elaborate filter for … I think that we should try to parse …
Just to add another use case to the scenario: even a simple … (see core/ocrd_models/ocrd_models/ocrd_mets.py, line 304 in ad32b00)
So the proposed caching should also happen here.
@MehmedGIT what's the status of your OcrdMets profiling experiment?
@bertsky I have pushed my latest changes to the benchmarking branch and have not worked on that experiment since. @mweidling is investigating this topic in more depth, and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionality as soon as possible.
Thanks for pointing that out, @bertsky!
@MehmedGIT I don't understand: the benchmark-mets branch does not seem to contain any actual changes to the modules, only additional tests in … Could it be that you actually need to incorporate these changes into …?
@bertsky you are right, I have not changed the actual modules. To implement my own versions of the functions, I extended the OcrdMets class; I did that because I did not know how to compile just the OcrdMets class on its own 😅 and this was a fast way around it. Moreover, the ExtendedOcrdMets class was meant as a proof of concept, to see whether the functions can be optimized at all and to get some comparison numbers to trigger further discussion. The changes in the benchmarks branch are not a proper implementation ready to be merged.
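For what it's worth, a comparison like that can be driven by a very small harness. The sketch below is purely illustrative (the two lookup functions are stand-ins I made up, not the real OcrdMets methods); it just shows timing a baseline scan against a pre-built index on the same input with `timeit`:

```python
# Hypothetical benchmarking harness, in the spirit of the proof-of-concept
# comparison described above. The two lookup functions are stand-ins, NOT
# the real OcrdMets methods.
import timeit

def baseline_lookup(pages, wanted):
    # linear scan per query, analogous to re-searching the METS for every file
    return [p for p in pages if p == wanted]

def cached_lookup(index, wanted):
    # dict lookup, analogous to querying a prebuilt pageId mapping
    return index.get(wanted, [])

pages = [f'PHYS_{i:04d}' for i in range(2000)]
index = {p: [p] for p in pages}

t_base = timeit.timeit(lambda: baseline_lookup(pages, 'PHYS_1999'), number=1000)
t_cache = timeit.timeit(lambda: cached_lookup(index, 'PHYS_1999'), number=1000)
print(f'baseline: {t_base:.4f}s  cached: {t_cache:.4f}s')
```

The same pattern (identical queries against the original and the extended class, timed over a realistic workspace) would give comparable numbers for the actual OcrdMets functions.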
@MehmedGIT I see, thanks.
By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.
Here's a typical scenario:
- `self.input_files`: [core/ocrd/ocrd/processor/base.py, lines 294 to 299 in 9069a65]
- … `mets:file` entries, matching them for `fileGrp` (which is reasonably fast, it only gets a little inefficient when additionally filtering by `pageId`): [core/ocrd_models/ocrd_models/ocrd_mets.py, lines 176 to 208 in 9069a65]
- … `OcrdFile.pageId`: [core/ocrd_models/ocrd_models/ocrd_file.py, lines 116 to 122 in 9069a65]
- … (… in `input_files`): [core/ocrd_models/ocrd_models/ocrd_mets.py, lines 434 to 441 in 9069a65]

(A rough sketch of the resulting cost follows right below.)
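To make the cost concrete, here is a minimal, self-contained sketch of that access pattern on a synthetic METS built with lxml. This is not the actual ocrd_models code, just an approximation of the shape of the problem: every pageId lookup re-scans the physical structMap, so filtering a single fileGrp by pageId costs on the order of files × pages node visits.

```python
# Illustrative sketch (NOT the actual ocrd_models implementation): shows why
# resolving pageId per file by re-scanning the physical structMap is expensive.
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}
METS = '{%s}' % NS['mets']

def build_fake_mets(n_pages, n_filegrps):
    """Synthetic METS with n_filegrps fileGrps of n_pages files each."""
    mets = etree.Element(METS + 'mets', nsmap=NS)
    filesec = etree.SubElement(mets, METS + 'fileSec')
    structmap = etree.SubElement(mets, METS + 'structMap', TYPE='PHYSICAL')
    for g in range(n_filegrps):
        grp = etree.SubElement(filesec, METS + 'fileGrp', USE=f'GRP{g}')
        for p in range(n_pages):
            etree.SubElement(grp, METS + 'file', ID=f'GRP{g}_{p:04d}')
    for p in range(n_pages):
        page = etree.SubElement(structmap, METS + 'div', TYPE='page', ID=f'PHYS_{p:04d}')
        for g in range(n_filegrps):
            etree.SubElement(page, METS + 'fptr', FILEID=f'GRP{g}_{p:04d}')
    return mets

def page_id_of(mets, file_id):
    """Full scan of the physical structMap for the page referencing file_id --
    this is the kind of lookup that gets repeated for every single file."""
    hits = mets.xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]'
        '[mets:fptr[@FILEID="%s"]]/@ID' % file_id, namespaces=NS)
    return hits[0] if hits else None

mets = build_fake_mets(n_pages=500, n_filegrps=5)
# Emulate filtering one fileGrp by pageId: each of the 500 files triggers a
# scan over all 500 pages, i.e. ~250000 page visits for a single fileGrp.
files = mets.xpath('//mets:fileGrp[@USE="GRP0"]/mets:file', namespaces=NS)
print(sum(1 for f in files if page_id_of(mets, f.get('ID'))), 'pageIds resolved')
```

Multiply that by the number of fileGrps a processor touches, and by every processor in a workflow, and the slowdown on large workspaces follows directly.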
A little cosmetics like turning
OcrdFile.pageId
into afunctools.cached_property
won't help here, the problem is bigger. METS with its mutually related fileGrp and pageId mappings is inherently expensive to parse. I know we have in the past decided against in-memory representations like dicts because that looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed the cost of that memory-cputime tradeoff carefully (and considering the necessity for pageId/mimetype filtering) yet? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?The text was updated successfully, but these errors were encountered: