Storing URL and its Index for reference and doing selective indexing #10

TusharAgey · 2017-10-21T01:06:02Z

We need to store each URL together with the ID of its document in the index, as in
PUT /<INDEX_NAME>/<INDEX_TYPE>/<ID>. The ID is the unique identifier of each document in the index. If the admin adds a new page to the website and we run the crawler again, there is a chance that an ID will be assigned to a different document than before, and everything will be indexed again.
Instead, after the initial indexing we only need to index selectively:
1) New pages
2) Updated pages
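
One possible way to keep the ID stable across crawler runs is to derive it from the URL itself rather than from crawl order. The sketch below is only an illustration, written in Python for brevity; the helper name `doc_id_for_url` and the choice of SHA-1 over the URL are assumptions, not something the project already does.

```python
import hashlib


def doc_id_for_url(url: str) -> str:
    """Return a deterministic document ID derived from the URL.

    Hypothetical helper: the same URL always maps to the same ID,
    so a re-crawl overwrites the existing document instead of
    shifting IDs around when a new page appears.
    """
    normalized = url.strip()  # only trims whitespace; no other normalization
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The ID would then be used in place of <ID> in the PUT request.
    print(doc_id_for_url("http://example.com/about.html"))
```

With IDs like this, adding a page to the site changes nothing for the pages that were already indexed.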

TusharAgey · 2017-10-21T03:40:25Z

When the crawler runs again, it should first update the old pages (i.e. those whose current modification time differs from the stored one) and then index the new documents, if any.

TusharAgey · 2017-10-21T03:41:21Z

PHP's filemtime() function returns the last-modification timestamp of a file.
It can be used to extract the timestamp of each file and serve as a solution to this issue.
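
The timestamp check could look roughly like the sketch below. It is written in Python, where `os.path.getmtime` plays the role of PHP's filemtime(); the function name `needs_reindex_by_mtime` and the idea of passing the stored timestamp as an argument are assumptions for illustration. It also assumes the crawler can read the page files directly from disk.

```python
import os
from typing import Optional


def needs_reindex_by_mtime(path: str, stored_mtime: Optional[float]) -> bool:
    """Decide whether a file should be (re-)indexed.

    stored_mtime is the modification time saved at the previous
    indexing run, or None if the file has never been indexed.
    """
    current_mtime = os.path.getmtime(path)  # counterpart of PHP's filemtime()
    if stored_mtime is None:
        return True  # new page
    return current_mtime != stored_mtime  # updated page if the times differ
```

The stored time would have to be saved alongside each document at indexing time so it is available on the next run.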

ypk4 · 2017-11-02T15:23:08Z

An alternative to checking the timestamp of a file (webpage) is to check a hash value (e.g. SHA-1) of the content of the webpage. PHP has the function "string sha1 ( string $str [, bool $raw_output = false ] )" to calculate the SHA-1 hash of a string. We can store the hash of the content of each URL in an additional field in Elasticsearch. (As suggested by Parag)

Then, on re-indexing, check whether the URL of the currently crawled webpage exists in Elasticsearch. If it is not found, the page is new, so index it. If it is found, calculate the hash of the page content and compare it with the hash stored for that URL. If they differ, the page has been modified, so re-index it; otherwise, do not re-index it.
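
The decision logic above can be sketched as follows. This is a minimal Python illustration, not the project's implementation: a plain dict (`stored_hashes`) stands in for the hash field stored in Elasticsearch, and `classify_page`, `reindex` and `index_fn` are hypothetical names. `hashlib.sha1` corresponds to PHP's sha1().

```python
import hashlib
from typing import Callable, Dict


def content_hash(content: str) -> str:
    """SHA-1 hex digest of the page content."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def classify_page(url: str, content: str, stored_hashes: Dict[str, str]) -> str:
    """Return 'new', 'modified' or 'unchanged' for a crawled page."""
    old_hash = stored_hashes.get(url)
    if old_hash is None:
        return "new"  # URL not found in the index
    if old_hash != content_hash(content):
        return "modified"  # content changed since the last run
    return "unchanged"


def reindex(pages: Dict[str, str],
            stored_hashes: Dict[str, str],
            index_fn: Callable[[str, str], None]) -> None:
    """Index only new or modified pages and record their hashes."""
    for url, content in pages.items():
        if classify_page(url, content, stored_hashes) != "unchanged":
            index_fn(url, content)  # assumed to write the document to the index
            stored_hashes[url] = content_hash(content)
```

In the real crawler, the dict lookup would be replaced by a query against the index and `index_fn` by the actual indexing call.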

ypk4 · 2017-11-02T15:26:55Z

As suggested by Parag, the "crontab" utility can be used to schedule the crawler and indexer to run at regular intervals.
https://www.computerhope.com/unix/ucrontab.htm

paragverma · 2017-11-03T08:42:25Z

For people on Windows -> https://www.digitalcitizen.life/how-create-task-basic-task-wizard

ypk4 mentioned this issue Nov 2, 2017

Timestamping the indexing #15

Open
