CSW harvester: recover records when a result page cannot be fully returned by juanluisrp · Pull Request #294 · GeoCat/core-geonetwork

juanluisrp · 2026-06-09T09:42:26Z

Summary

When a CSW source returns an OWS exception for a full page (e.g. a record that cannot be serialized in the requested outputSchema), the harvester used to abort the entire harvest run with a fatal error.
This change makes page fetching resilient: on failure, the page is split in half and each half is retried recursively. A single bad record is logged, skipped, and harvesting continues.
Recovery is harvester-side, so it works against any CSW server, not just GeoNetwork sources.
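
The split-and-retry idea can be sketched in a few lines. This is an illustrative Python model, not the harvester's Java code: `OwsException`, `recover_range`'s signature, and `make_fake_server` are hypothetical names chosen for the example, and the sketch assumes positions are 1-based as in CSW `startPosition`.

```python
from typing import Callable, List, Set


class OwsException(Exception):
    """Hypothetical stand-in for a server-side OWS exception report."""


def recover_range(
    fetch: Callable[[int, int], List[str]],
    start: int,
    count: int,
    skipped: List[int],
) -> List[str]:
    """Fetch positions [start, start + count), isolating records that fail.

    Only OwsException triggers a split; any other error propagates unchanged.
    """
    try:
        return fetch(start, count)
    except OwsException:
        if count == 1:
            # A single position that still fails is recorded and skipped.
            skipped.append(start)
            return []
        half = count // 2
        left = recover_range(fetch, start, half, skipped)
        right = recover_range(fetch, start + half, count - half, skipped)
        return left + right


def make_fake_server(bad: Set[int]) -> Callable[[int, int], List[str]]:
    """Fake CSW source that fails any range containing a bad position."""

    def fetch(start: int, count: int) -> List[str]:
        positions = range(start, start + count)
        if any(p in bad for p in positions):
            raise OwsException("record cannot be serialized")
        return [f"record-{p}" for p in positions]

    return fetch


skipped: List[int] = []
records = recover_range(make_fake_server({4}), 1, 8, skipped)
```

With one bad record at position 4 in a page of 8, `records` holds the other seven records and `skipped` holds `[4]`. A `ConnectionError` raised by `fetch` is not caught, so it reaches the caller as before.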

Details

Isolating each bad record costs O(log n) extra requests, where n is the page size. In the worst case (every record in the page is bad) the split visits 2n-1 nodes, which is linear in the page size and still bounded in practice because pages are small (typically 10-200 records).
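
The request counts can be checked with a small model. The function below is a hypothetical sketch, not harvester code. It assumes a range fails whenever it contains at least one bad position, and that ranges are split at `count // 2`.

```python
from typing import Set


def count_requests(bad: Set[int], start: int, count: int) -> int:
    """Number of requests issued to recover positions [start, start + count)."""
    range_fails = any(p in bad for p in range(start, start + count))
    if not range_fails or count == 1:
        # A clean range, or an isolated single position, costs one request.
        return 1
    half = count // 2
    return (
        1  # the request for this range, which failed
        + count_requests(bad, start, half)
        + count_requests(bad, start + half, count - half)
    )


one_bad = count_requests({5}, 1, 16)              # single bad record in a page of 16
all_bad = count_requests(set(range(1, 17)), 1, 16)  # every record bad
```

For a page of 16, `one_bad` is 9 (the original request plus two per level over four levels) and `all_bad` is 31, which matches 2n-1.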

A SearchResults element is synthesized for the recovered page with consistent numberOfRecordsMatched, numberOfRecordsReturned, and nextRecord attributes so the existing paging and end-of-set detection keep working unchanged.
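
A minimal sketch of how such an element could be assembled, using Python's `xml.etree.ElementTree`. The function name and parameters are hypothetical. Treating `numberOfRecordsReturned` as positions consumed (returned plus skipped) follows the commit message in this PR; using `nextRecord="0"` to signal the end of the set follows the CSW 2.0.2 convention and is an assumption about what the harvester emits.

```python
import xml.etree.ElementTree as ET
from typing import List

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"


def synthesize_search_results(
    records: List[ET.Element], start: int, consumed: int, matched: int
) -> ET.Element:
    """Build a SearchResults element for a recovered page.

    consumed counts positions used up by the page: returned plus skipped.
    """
    next_record = start + consumed
    if next_record > matched:
        # Assumption: 0 means "no more records", as in CSW 2.0.2.
        next_record = 0
    results = ET.Element(
        f"{{{CSW_NS}}}SearchResults",
        {
            "numberOfRecordsMatched": str(matched),
            "numberOfRecordsReturned": str(consumed),
            "nextRecord": str(next_record),
        },
    )
    results.extend(records)
    return results


page = synthesize_search_results([ET.Element("Record")], start=11, consumed=10, matched=50)
```

Here the page started at position 11 and consumed 10 positions, so `nextRecord` is `21` even though only one record survived; the paging loop advances by positions, not by records kept.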

Only server-side OWS exceptions trigger recovery. Connection and protocol errors are re-thrown as before.

Test plan

- Unit tests added in HarvesterTest covering: single bad record, all-bad page, partial failure, connection errors propagating correctly
- Run CSW harvest against a GeoNetwork source that contains ISO 19110 feature catalogues with outputSchema=gmd
- Verify skipped records appear in the harvest report and the rest of the catalogue is fully harvested

…eturned

When harvesting via CSW GetRecords, the source returns pages of several records at once. Some CSW servers (GeoNetwork included) abort the whole GetRecords response when a single record of the page cannot be serialized in the requested outputSchema, for instance an ISO 19110 feature catalogue requested with outputSchema=gmd, which has no gmd presentation. The harvester turned that page error into a fatal OperationAbortedEx and the whole harvest run stopped, leaving the catalogue partially harvested.

Make the page fetch resilient: when the source returns an OWS exception for a page, split the page in half and retry each half. A single record that still fails is logged, skipped, and the harvest carries on with the rest.

Each bad record costs O(log n) extra requests to isolate; in the sparse case this keeps the overhead low. In the worst case (every record in the page is bad) the split visits 2n-1 nodes, so the total is linear in the page size - bounded in practice because pages are small (typically 10-200 records), and still far fewer requests than fetching records one by one from the start.

A SearchResults element is synthesized for the recovered page with consistent numberOfRecordsMatched, numberOfRecordsReturned (positions consumed, i.e. returned plus skipped) and nextRecord attributes, so the existing paging and end-of-set detection keep working unchanged.

Only server-side OWS exceptions trigger this recovery. Connection and protocol errors, and all the existing handling of well-behaved and misbehaving servers (nextRecord based termination, old CSW namespace, GET/POST fallback), are left untouched.

The recovery is on the harvester side, so it also handles non-GeoNetwork CSW servers that fail a page for any reason. The matching server-side behaviour, making a GeoNetwork source skip the records it cannot present instead of failing the whole page, is tracked in geonetwork#6940 and proposed in geonetwork#6941; the two changes are complementary.

Related to geonetwork#6940
Related to geonetwork#6941

When recoverRange isolates a single position and the server returns a successful but empty SearchResults (no exception, just 0 children), return 1 instead of 0 so the caller advances past that position. Without this the synthesized page reports numberOfRecordsReturned=0 and the outer paging loop stops prematurely, silently dropping the remaining records.
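
Why the return value matters can be shown with a simplified model of the outer loop. Everything below is a hypothetical sketch: each page is represented as a list holding the number of child records found at each isolated position, and the loop stops as soon as a page reports zero positions consumed, which is an assumption about the end-of-set check.

```python
from typing import List


def page_consumed(children_per_position: List[int], count_empty: bool) -> int:
    """Positions consumed by a recovered page, one list entry per position."""
    total = 0
    for children in children_per_position:
        if children > 0:
            total += children
        elif count_empty:
            # Successful but empty response: still advance past the position.
            total += 1
    return total


def positions_visited(pages: List[List[int]], count_empty: bool) -> int:
    """Outer paging loop: stops when a page reports nothing consumed."""
    visited = 0
    for page in pages:
        consumed = page_consumed(page, count_empty)
        if consumed == 0:
            break  # treated as end of set
        visited += consumed
    return visited


pages = [[0], [1], [1]]  # first position is empty, two valid records follow
with_fix = positions_visited(pages, count_empty=True)
without_fix = positions_visited(pages, count_empty=False)
```

With the empty position counted, `with_fix` is 3 and both later records are reached. Without it, `without_fix` is 0: the loop stops at the first page and the remaining records are dropped.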

- Remove redundant request.setStartPosition() call in the outer paging loop; executeGetRecords already sets it.
- Qualify the "declared vs actual record count" warning to mention that the mismatch is expected when page recovery skips positions, adding the number of skipped positions to the message.
- Emit a dedicated warning when an entire recovered page is empty (all records skipped), pointing at a possible harvester misconfiguration.
- Add a test verifying that REQUEST_REJECTED errors (e.g. InvalidParameterValueEx) propagate out of recoverRange instead of being silently treated as skippable single-record failures.

juanluisrp added 2 commits June 8, 2026 13:33

juanluisrp force-pushed the csw-harvester-resilient-paging branch from 2eb21b1 to 06f7d6a Compare June 9, 2026 10:12

juanluisrp wants to merge 3 commits into main from csw-harvester-resilient-paging
