Fix for purge crawl failing if primary crawl comes up empty #436

Merged
mattnowzari merged 2 commits into main from fix_purge_phase_failures on May 7, 2026

Conversation

@mattnowzari

Contributor

Closes #381

When the primary crawl finishes without indexing any documents into a freshly-created index (e.g. every page returns 4xx, robots.txt is forbidden, or the seed URL is unreachable), the index has no field mappings.

The purge phase then issues a search sorted by last_crawled_at, and Elasticsearch responds with 400 query_shard_exception: No mapping found for [last_crawled_at] in order to sort on. The retry logic in ES::Client#execute_with_retry blindly re-issues the same bad query 4 times and then the entire crawl fails.
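The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the crawler's actual code: `BadRequestError`, `execute_with_retry_sketch`, and `fake_search` are hypothetical stand-ins, and the retry count of 4 is taken from the description above rather than from the source of `ES::Client#execute_with_retry`.

```ruby
# Hypothetical stand-in for an Elasticsearch 400 response (not the crawler's real error class).
class BadRequestError < StandardError; end

# Hypothetical retry wrapper: re-runs the block after a failure, up to max_retries times.
# It makes no attempt to tell a transient error from a permanently invalid query.
def execute_with_retry_sketch(max_retries: 4)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue BadRequestError => e
    retry if attempts <= max_retries
    raise e.class, "#{e.message} (gave up after #{attempts} attempts)"
  end
end

# Fake search: fails when the sort field has no mapping and no unmapped_type is supplied,
# otherwise returns an empty hit list.
def fake_search(mapped_fields, sort_field, unmapped_type: nil)
  if !mapped_fields.include?(sort_field) && unmapped_type.nil?
    raise BadRequestError, "No mapping found for [#{sort_field}] in order to sort on"
  end
  []
end

calls = 0
begin
  # An empty mapping list models the freshly-created index with no documents.
  execute_with_retry_sketch do
    calls += 1
    fake_search([], 'last_crawled_at')
  end
rescue BadRequestError => e
  puts e.message
end
```

Because the query itself is invalid for an unmapped index, every retry fails the same way, so retrying only delays the failure.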

This PR fixes this by adding unmapped_type: 'date' to the sort clause in fetch_purge_docs.

Per the official Elasticsearch sort docs, this is the canonical idiom for telling ES "treat this field as a date if it has no mapping yet." With the fix, the search succeeds against an empty/unmapped index, returns 0 hits, and run_purge_crawl! in Coordinator.rb takes over and finishes the crawl successfully.
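For illustration, a sort clause carrying `unmapped_type` might look like the sketch below. The helper names `purge_sort_clause` and `purge_search_body`, the range query, the sort order, and the page size are all assumptions made for this example; the actual body of `fetch_purge_docs` is not shown in this PR description. Only the `unmapped_type: 'date'` option on the `last_crawled_at` sort comes from the description.

```ruby
require 'json'

# Hypothetical helper: builds the sort portion of the search body.
# unmapped_type tells Elasticsearch which type to assume for sorting
# when the index has no mapping for the field.
def purge_sort_clause(field = 'last_crawled_at')
  [{ field => { order: 'asc', unmapped_type: 'date' } }]
end

# Hypothetical helper: the range query and size are assumed, not taken from the crawler.
def purge_search_body(cutoff_iso8601, size: 50)
  {
    size: size,
    query: { range: { last_crawled_at: { lt: cutoff_iso8601 } } },
    sort: purge_sort_clause
  }
end

# Print the request body that would be sent to the search endpoint.
puts JSON.pretty_generate(purge_search_body('2026-05-05T16:03:00Z'))
```

Against an index with no `last_crawled_at` mapping, a search with this sort clause is expected to return zero hits rather than a 400, which matches the behavior described above.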

Checklists

Pre-Review Checklist

  • This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
  • This PR has a meaningful title
  • This PR links to all relevant GitHub issues that it fixes or partially addresses
    • If there is no GitHub issue, please create it. Each PR should have a link to an issue
  • This PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v0.1.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference
  • Ran make notice if any dependencies have been added

Release Note

Fixed a bug where a crawl would fail at the purge phase if the primary crawl didn't index any documents into a newly-created index.

@mattnowzari mattnowzari requested a review from a team as a code owner May 5, 2026 16:03

@artem-shelkovnikov artem-shelkovnikov left a comment

Member

🚀 🌖

@mattnowzari mattnowzari merged commit 0900fdf into main May 7, 2026
2 checks passed
@mattnowzari mattnowzari deleted the fix_purge_phase_failures branch May 7, 2026 19:45

Development

Successfully merging this pull request may close these issues.

Purge phase fails if primary phase ends with empty queue

2 participants