Fix for purge crawl failing if primary crawl comes up empty #436
Merged
Closes #381
When the primary crawl finishes without indexing any documents into a freshly-created index (e.g. every page returns 4xx, `robots.txt` is forbidden, the seed URL is unreachable, etc.), the index has no field mappings.

The purge phase then issues a search sorted by `last_crawled_at`, and Elasticsearch responds with `400 query_shard_exception: No mapping found for [last_crawled_at] in order to sort on`. The retry logic in `ES::Client#execute_with_retry` blindly re-issues the same bad query 4 times, and then the entire crawl fails.

This PR fixes this by adding `unmapped_type: 'date'` to the sort clause in `fetch_purge_docs`.

Per the official Elasticsearch sort docs, this is the canonical idiom for telling ES "treat this field as a date if it has no mapping yet." With the fix, the search succeeds against an empty/unmapped index, returns 0 hits, and `run_purge_crawl!` in `Coordinator.rb` takes over and finishes the crawl successfully.

Checklists
Pre-Review Checklist
`crawler.yml.example` and `elasticsearch.yml.example`)
`v0.1.0`)
`make notice` if any dependencies have been added

Release Note
Fixed a bug where a crawl would fail at the purge phase if the primary crawl didn't index any documents into a newly-created index.
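
For reviewers who want to see the shape of the change, the sort clause described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `purge_search_body` is a hypothetical helper, and the range query, the `asc` order, and the page size are assumptions. Only the `last_crawled_at` field and the `unmapped_type: 'date'` option come from the description.

```ruby
require 'json'
require 'time'

# Hypothetical helper (not the crawler's actual fetch_purge_docs): builds a
# search body for documents last crawled before the current crawl started.
# The range query, sort order, and page size are assumptions for illustration.
def purge_search_body(crawl_start_time, page_size: 100)
  {
    query: {
      range: { last_crawled_at: { lt: crawl_start_time.utc.iso8601 } }
    },
    size: page_size,
    sort: [
      # unmapped_type tells Elasticsearch to treat the field as a date when
      # the index has no mapping for it, instead of rejecting the sort.
      { last_crawled_at: { order: 'asc', unmapped_type: 'date' } }
    ]
  }
end

body = purge_search_body(Time.utc(2024, 1, 1))
puts JSON.pretty_generate(body)
```

Without `unmapped_type`, the same body sent to an index that has no mapping for `last_crawled_at` is what produces the `No mapping found for [last_crawled_at] in order to sort on` error quoted in the description.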