Storing URL and its Index for reference and doing selective indexing #10
When the crawler runs again, it should first update the old pages (i.e. those whose new timestamp differs from the stored one) and then index the new documents, if any.
The filemtime() function allows PHP to check the last-modified timestamp of a file.
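The timestamp comparison described above could be sketched roughly as follows. This is a Python illustration of the logic, not the project's PHP code: `os.path.getmtime` stands in for PHP's filemtime(), and both the function name `classify_by_mtime` and the `stored_mtimes` dictionary (path to timestamp recorded at the previous crawl) are hypothetical. It also assumes the pages are local files the crawler can stat.

```python
import os


def classify_by_mtime(path, stored_mtimes):
    """Return 'new', 'updated', or 'unchanged' for a local file.

    stored_mtimes is a hypothetical dict mapping path -> modification
    time recorded during the previous crawl.
    """
    current = os.path.getmtime(path)  # Python counterpart of PHP's filemtime()
    previous = stored_mtimes.get(path)

    if previous is None:
        status = "new"        # never seen before: index it
    elif current != previous:
        status = "updated"    # timestamp changed: re-index it
    else:
        status = "unchanged"  # same timestamp: nothing to do

    # Remember the timestamp for the next crawl.
    stored_mtimes[path] = current
    return status
```

In a real run the stored timestamps would have to persist between crawls (for example alongside each document in the index) rather than live in an in-memory dictionary.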
An alternative to checking the timestamp of a file (web page) is to check a hash value (e.g. SHA) of the content of the web page. PHP has the function " string sha1 ( string $str [, bool $raw_output = false ] ) " to calculate the SHA-1 value of a string. We can store the SHA value of the content of each URL in an additional field in ElasticSearch (as suggested by Parag). Then, on re-indexing, check whether the URL of the currently crawled web page exists in ElasticSearch. If it is not found, the page is new, so index it. If it is found, calculate the SHA value of the page content and compare it with the stored SHA value for that URL. If they don't match, the page has been modified, so re-index it; otherwise do not re-index.
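The decision procedure in the comment above could look roughly like this. It is a minimal Python sketch of the logic only, not the project's PHP code: `hashlib.sha1` plays the role of PHP's sha1(), and the plain dictionary `stored_hashes` is an assumed stand-in for the extra hash field in ElasticSearch. The function names are hypothetical.

```python
import hashlib


def content_hash(content):
    """Return the SHA-1 hex digest of the page content (a str)."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def decide_action(url, content, stored_hashes):
    """Return 'index', 'reindex', or 'skip' for a crawled page.

    stored_hashes is a hypothetical dict mapping url -> SHA-1 of the
    content that was indexed last time.
    """
    new_hash = content_hash(content)
    old_hash = stored_hashes.get(url)

    if old_hash is None:
        action = "index"      # URL not found: new page
    elif old_hash != new_hash:
        action = "reindex"    # hash differs: page was modified
    else:
        action = "skip"       # hash matches: leave the index alone

    if action != "skip":
        # Record the hash of what is about to be (re-)indexed.
        stored_hashes[url] = new_hash
    return action
```

One caveat with hashing the whole page: any content that changes on every request (a rendered date, a session token) would make the hash differ every time, so it may be worth hashing only the extracted text that actually gets indexed.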
As suggested by Parag, the "crontab" utility can be used to schedule the crawler and indexer at regular intervals.
For people on Windows: https://www.digitalcitizen.life/how-create-task-basic-task-wizard
We need to store each URL and its index, as in
PUT /<INDEX_NAME>/<INDEX_TYPE>/<"ID">. This ID field is the unique identifier for each document in the index. Hence, if the admin adds a new page to the website and we run the crawler again, there is a chance that this ID will be assigned to a different document and everything will be indexed again.
Instead, after the initial indexing, we only need to selectively index:
1) New pages
2) Updated pages
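One way to keep IDs stable across crawler runs is to remember which ID each URL was given and reuse it. The sketch below is a Python illustration under assumptions, not the project's implementation: the class `UrlIdRegistry` and its in-memory dictionary are hypothetical, and a real version would need to persist the mapping between runs.

```python
class UrlIdRegistry:
    """Hypothetical URL -> document ID mapping that survives re-crawls."""

    def __init__(self):
        self._ids = {}      # url -> ID assigned on first sight
        self._next_id = 1   # next unused ID

    def id_for(self, url):
        """Return the ID for url, assigning a fresh one only if it is new."""
        if url not in self._ids:
            self._ids[url] = self._next_id
            self._next_id += 1
        return self._ids[url]


# First crawl: two pages get IDs 1 and 2.
registry = UrlIdRegistry()
first_a = registry.id_for("/about")
first_b = registry.id_for("/contact")

# Second crawl: a new page shows up first, but old pages keep their IDs.
new_page = registry.id_for("/news")
again_a = registry.id_for("/about")
```

With the ID tied to the URL rather than to crawl order, a PUT to the same ID targets the existing document, so only new or updated pages need to be sent to the index.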