CSW harvester: recover records when a result page cannot be fully returned #294
Draft
juanluisrp wants to merge 3 commits into
…eturned

When harvesting via CSW GetRecords, the source returns pages of several records at once. Some CSW servers (GeoNetwork included) abort the whole GetRecords response when a single record in the page cannot be serialized in the requested outputSchema, for instance an ISO 19110 feature catalogue requested with outputSchema=gmd, which has no gmd presentation. The harvester turned that page error into a fatal OperationAbortedEx and the whole harvest run stopped, leaving the catalogue partially harvested.

Make the page fetch resilient: when the source returns an OWS exception for a page, split the page in half and retry each half. A single record that still fails is logged and skipped, and the harvest carries on with the rest.

Each bad record costs O(log n) extra requests to isolate, so in the sparse case the overhead stays low. In the worst case (every record in the page is bad) the split visits 2n-1 nodes, so the total is linear in the page size. That is bounded in practice because pages are small (typically 10-200 records), and because only failing pages are split, the harvest as a whole still needs far fewer requests than fetching records one by one from the start.

A SearchResults element is synthesized for the recovered page with consistent numberOfRecordsMatched, numberOfRecordsReturned (positions consumed, i.e. returned plus skipped) and nextRecord attributes, so the existing paging and end-of-set detection keep working unchanged.

Only server-side OWS exceptions trigger this recovery. Connection and protocol errors, and all the existing handling of well-behaved and misbehaving servers (nextRecord-based termination, old CSW namespace, GET/POST fallback), are left untouched.

The recovery is on the harvester side, so it also handles non-GeoNetwork CSW servers that fail a page for any reason. The matching server-side behaviour, making a GeoNetwork source skip the records it cannot present instead of failing the whole page, is tracked in geonetwork#6940 and proposed in geonetwork#6941; the two changes are complementary.
Related to geonetwork#6940
Related to geonetwork#6941
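
The split-and-retry recovery described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PageRecoverySketch`, `PageFetcher`, `PageFailedException` and `failingAt` are hypothetical names, records are plain strings, and the fake server simply fails any request whose range covers a bad position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of split-and-retry page recovery; not the harvester's real classes. */
final class PageRecoverySketch {

    /** Stand-in for a server-side OWS exception returned for a GetRecords page. */
    static final class PageFailedException extends Exception {
        PageFailedException(String message) { super(message); }
    }

    /** Stand-in for one GetRecords call: `count` records from 1-based position `start`. */
    interface PageFetcher {
        List<String> fetch(int start, int count) throws PageFailedException;
    }

    /**
     * Tries the range [start, start + count). On a page failure the range is split in
     * half and each half is retried; a single position that still fails is skipped.
     * Only PageFailedException is caught, so any other error propagates to the caller.
     */
    static void recover(PageFetcher fetcher, int start, int count,
                        List<String> recovered, List<Integer> skipped) {
        if (count <= 0) {
            return;
        }
        try {
            recovered.addAll(fetcher.fetch(start, count));
        } catch (PageFailedException e) {
            if (count == 1) {
                skipped.add(start); // isolated bad record: note it and move on
                return;
            }
            int half = count / 2;
            recover(fetcher, start, half, recovered, skipped);
            recover(fetcher, start + half, count - half, recovered, skipped);
        }
    }

    /** Fake server for demonstration: a request covering any bad position fails as a whole. */
    static PageFetcher failingAt(Set<Integer> badPositions) {
        return (start, count) -> {
            List<String> page = new ArrayList<>();
            for (int pos = start; pos < start + count; pos++) {
                if (badPositions.contains(pos)) {
                    throw new PageFailedException("cannot serialize record at position " + pos);
                }
                page.add("rec-" + pos);
            }
            return page;
        };
    }
}
```

With `failingAt(Set.of(3, 7))` and a page of 8 records starting at position 1, `recover` collects the six good records in order and reports positions 3 and 7 as skipped.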
When recoverRange isolates a single position and the server returns a successful but empty SearchResults (no exception, just 0 children), return 1 instead of 0 so the caller advances past that position. Without this the synthesized page reports numberOfRecordsReturned=0 and the outer paging loop stops prematurely, silently dropping the remaining records.
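
To illustrate why the count must be 1 rather than 0, here is a simplified model of an outer paging loop. It is only a sketch under assumptions: `PagingLoopSketch` and `lastPositionReached` are hypothetical names, the page size is fixed at 1, and the loop is assumed to treat a page reporting numberOfRecordsReturned=0 as the end of the result set.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical model of an outer paging loop; not the harvester's real loop. */
final class PagingLoopSketch {

    /**
     * Walks 1-based positions 1..matched one at a time and returns the last position
     * that was requested. `reportedAt` maps a start position to the
     * numberOfRecordsReturned value the (synthesized) page reports for it.
     */
    static int lastPositionReached(int matched, IntUnaryOperator reportedAt) {
        int start = 1;
        int last = 0;
        while (start <= matched) {
            last = start;
            int consumed = reportedAt.applyAsInt(start);
            if (consumed == 0) {
                break; // an empty page is read as "nothing left to harvest"
            }
            start += consumed;
        }
        return last;
    }
}
```

If position 3 of 5 reports 0, the loop stops at 3 and positions 4 and 5 are never requested; if it reports 1, the loop reaches position 5.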
2eb21b1 to 06f7d6a
- Remove the redundant request.setStartPosition() call in the outer paging loop; executeGetRecords already sets it.
- Qualify the "declared vs actual record count" warning to mention that the mismatch is expected when page recovery skips positions, and add the number of skipped positions to the message.
- Emit a dedicated warning when an entire recovered page is empty (all records skipped), pointing at a possible harvester misconfiguration.
- Add a test verifying that REQUEST_REJECTED errors (e.g. InvalidParameterValueEx) propagate out of recoverRange instead of being silently treated as skippable single-record failures.
Summary
Details
Each bad record costs O(log n) extra requests to isolate. In the worst case (every record in the page is bad) the split visits 2n-1 nodes (linear in the page size), which is still bounded in practice because pages are small (typically 10-200 records).
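
The request counts can be worked out as follows. This is a sketch that assumes a failed range is always split into two halves down to single positions, and that the original page request counts as the root of the split tree; n is the page size and d is the depth at which a bad record ends up isolated.

```latex
% Worst case: every record in a page of n records is bad, so every request fails
% and the recursion visits a full binary tree with n leaves.
\begin{align*}
\text{leaves} &= n, \qquad \text{internal nodes} = n - 1,\\
\text{requests}_{\text{worst}} &= n + (n - 1) = 2n - 1.
\end{align*}
% Sparse case: a single bad record, isolated at depth d \le \lceil \log_2 n \rceil.
% Each level below the root costs two requests: the failing half and its sibling.
\begin{align*}
\text{extra requests} &= 2d \le 2\lceil \log_2 n \rceil,\\
n = 100:\quad 2\lceil \log_2 100 \rceil &= 2 \cdot 7 = 14, \qquad 2n - 1 = 199.
\end{align*}
```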
A SearchResults element is synthesized for the recovered page with consistent numberOfRecordsMatched, numberOfRecordsReturned, and nextRecord attributes, so the existing paging and end-of-set detection keep working unchanged.

Only server-side OWS exceptions trigger recovery. Connection and protocol errors are re-thrown as before.
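
One way the three attributes could be kept consistent is sketched below. It is an assumption-laden illustration rather than the PR's code: `RecoveredPageAttrs` is a hypothetical holder, numberOfRecordsReturned is taken to mean positions consumed (returned plus skipped), the total match count is assumed to be known already, and nextRecord follows the CSW 2.0.2 convention of being 0 once no records remain.

```java
/** Hypothetical holder for the paging attributes of a synthesized SearchResults element. */
final class RecoveredPageAttrs {
    final int numberOfRecordsMatched;
    final int numberOfRecordsReturned;
    final int nextRecord;

    private RecoveredPageAttrs(int matched, int returned, int next) {
        this.numberOfRecordsMatched = matched;
        this.numberOfRecordsReturned = returned;
        this.nextRecord = next;
    }

    /**
     * @param start    1-based startPosition of the recovered page
     * @param consumed positions the page covered, counting skipped records too
     * @param matched  total number of records the server reported for the query
     */
    static RecoveredPageAttrs of(int start, int consumed, int matched) {
        int next = start + consumed;
        if (next > matched) {
            next = 0; // assumed CSW convention: 0 means the result set is exhausted
        }
        return new RecoveredPageAttrs(matched, consumed, next);
    }
}
```

For example, a recovered page starting at position 11 that consumed 10 positions out of 25 matched records yields nextRecord=21, while a final page starting at 21 that consumed 5 positions yields nextRecord=0.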
Related
Test plan
HarvesterTest covering: single bad record, all-bad page, partial failure, connection errors propagating correctly
outputSchema=gmd