Precision limit
To store the links of relatively big web corpora (~1000 web entities) while keeping good performance when working with the index, it has been decided to set a Precision Limit on the corpus.
The Precision Limit sets the depth in the Reverse_URLs chain beyond which links are no longer stored in the memory structure.
Note that this limitation drops information only from the links index, not from the LRUs index. The crawler will still follow those links (see below). Moreover, the information is not lost, since the whole content of each page is stored at the raw data level.
For example, a Precision Limit of 3 implies that a link between:
fr | sciences-po | medialab | contact.html -> fr | sciences-po | recherche | news.html
will not be stored, because both LRUs are four stems deep.
The two pages fr | sciences-po | medialab | contact.html and fr | sciences-po | recherche | news.html will nevertheless exist in the Page index.
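The rule in this example can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the function names `lru_depth` and `is_link_stored` are hypothetical, and the sketch assumes that a link is kept only when both of its endpoints are within the Precision Limit.

```python
PRECISION_LIMIT = 3  # value used in the example above


def lru_depth(lru: str) -> int:
    """Count the stems of an LRU written as 'fr | sciences-po | ...'."""
    return len([stem for stem in lru.split("|") if stem.strip()])


def is_link_stored(source: str, target: str, limit: int = PRECISION_LIMIT) -> bool:
    """Assumption: a link is stored only if both endpoints are within the limit."""
    return lru_depth(source) <= limit and lru_depth(target) <= limit


if __name__ == "__main__":
    source = "fr | sciences-po | medialab | contact.html"
    target = "fr | sciences-po | recherche | news.html"
    # Both LRUs have four stems, so the link falls beyond a limit of 3.
    print(is_link_stored(source, target))
```

Counting stems on the textual form keeps the sketch independent of how LRUs are actually encoded in the memory structure.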
However, the user might locally want to define a web entity at a level beyond the Precision Limit. The user can therefore set a FULL_PRECISION flag on an LRU, asking the system to make an exception to the Precision Limit. This allows the best level of precision, but only in the specific cases where it is needed, while the optimisation rule is kept everywhere else.
In other words, this exception allows the user to define what has been called a page-level web entity at some point of the project.
Thus it is only possible to attach a web entity to an LRU_prefix that is longer than the PRECISION_LIMIT by setting a FULL_PRECISION exception.
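The exception rule could look like the following sketch. All names here (`stems`, `has_full_precision`, `can_define_web_entity`) are hypothetical, and the sketch assumes that a FULL_PRECISION flag set on an LRU also covers the LRUs below it; the text does not say whether the flag applies to descendants.

```python
PRECISION_LIMIT = 3  # example value, as in the text above


def stems(lru: str) -> list[str]:
    """Split an LRU written as 'fr | sciences-po | ...' into its stems."""
    return [s.strip() for s in lru.split("|") if s.strip()]


def has_full_precision(lru: str, flagged: set[str]) -> bool:
    """True if the LRU, or one of its ancestors, carries a FULL_PRECISION flag.

    Assumption: a flag on an LRU also applies to everything underneath it.
    """
    candidate = stems(lru)
    for flag in flagged:
        flag_stems = stems(flag)
        if flag_stems and candidate[: len(flag_stems)] == flag_stems:
            return True
    return False


def can_define_web_entity(lru_prefix: str, flagged: set[str],
                          limit: int = PRECISION_LIMIT) -> bool:
    """A prefix within the limit is always allowed; a longer one needs a flag."""
    if len(stems(lru_prefix)) <= limit:
        return True
    return has_full_precision(lru_prefix, flagged)


if __name__ == "__main__":
    flags = {"fr | sciences-po | medialab | contact.html"}
    print(can_define_web_entity("fr | sciences-po | medialab | contact.html", flags))
    print(can_define_web_entity("fr | sciences-po | recherche | news.html", flags))
```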
If, at a given time, we want to retrieve link information that has been dropped from the memory structure, it will be possible to scan the raw data level storage to reconstruct the linkage information.
Although the Precision Limit prevents some links from being stored in the memory structure, this threshold is not related to the crawl depth limit. Pages whose links cannot be stored in the memory structure because they lie beyond the Precision Limit will therefore still be crawled if they are within the crawl depth limit.
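To make the independence of the two thresholds concrete, here is a minimal sketch. The `Page` class, its `click_depth` field and both function names are hypothetical, and the sketch assumes that the crawl depth counts the number of links followed from the starting page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lru: str          # e.g. "fr | sciences-po | medialab | contact.html"
    click_depth: int  # assumption: hops followed from the crawl's start page


def should_crawl(page: Page, crawl_depth_limit: int) -> bool:
    """The crawl decision looks only at the crawl depth."""
    return page.click_depth <= crawl_depth_limit


def links_storable(page: Page, precision_limit: int) -> bool:
    """The storage decision looks only at the number of stems in the LRU."""
    depth = len([s for s in page.lru.split("|") if s.strip()])
    return depth <= precision_limit


if __name__ == "__main__":
    page = Page("fr | sciences-po | medialab | contact.html", click_depth=1)
    # The page can be crawled even though its links are beyond the Precision Limit.
    print(should_crawl(page, crawl_depth_limit=2), links_storable(page, precision_limit=3))
```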