CSW harvester: recover records when a result page cannot be fully returned #294

Draft
juanluisrp wants to merge 3 commits into main from csw-harvester-resilient-paging


@juanluisrp


Summary

  • When a CSW source returns an OWS exception for a full page (e.g. a record that cannot be serialized in the requested outputSchema), the harvester used to abort the entire harvest run with a fatal error.
  • This change makes page fetching resilient: on failure, the page is split in half and each half is retried recursively. A single bad record is logged, skipped, and harvesting continues.
  • Recovery is harvester-side, so it works against any CSW server, not just GeoNetwork sources.
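The split-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PageRecoverySketch`, `PageFetcher`, `OwsException` and `recover` are hypothetical names, records are plain strings, and positions are 1-based as in CSW `startPosition`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of bisect-and-retry page recovery; not the PR's actual code. */
final class PageRecoverySketch {

    /** Stand-in for a server-side OWS exception returned for a whole GetRecords response. */
    static final class OwsException extends Exception {
        OwsException(String message) { super(message); }
    }

    /** Stand-in for one GetRecords call: fetch `count` records from 1-based `start`. */
    interface PageFetcher {
        List<String> fetch(int start, int count) throws OwsException;
    }

    /** Positions that were isolated as failing and skipped, in ascending order. */
    final List<Integer> skipped = new ArrayList<>();

    /** Fetches the range [start, start + count), bisecting whenever the source fails it. */
    List<String> recover(PageFetcher fetcher, int start, int count) {
        if (count <= 0) {
            return new ArrayList<>();
        }
        try {
            return new ArrayList<>(fetcher.fetch(start, count));
        } catch (OwsException e) {
            if (count == 1) {
                // A single position that still fails is recorded and skipped.
                skipped.add(start);
                return new ArrayList<>();
            }
            // Otherwise retry each half; the bad record ends up alone in one branch.
            int half = count / 2;
            List<String> records = recover(fetcher, start, half);
            records.addAll(recover(fetcher, start + half, count - half));
            return records;
        }
    }
}
```

With a fetcher that fails any range containing position 7, `recover(fetcher, 1, 10)` would return the other nine records and leave `7` in `skipped`.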

Details

Each bad record costs O(log n) extra requests to isolate, where n is the page size. In the worst case (every record in the page is bad) the split visits 2n-1 nodes, which is linear in the page size and bounded in practice because pages are small (typically 10-200 records).
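To make the request counts concrete, here is a small counting sketch. It assumes the page is always split at `count / 2` and that a request fails exactly when its range contains a bad position; `RecoveryCostSketch` and `requests` are hypothetical names, and positions are 0-based array indexes for brevity.

```java
/** Hypothetical counter for the number of GetRecords requests issued by bisection. */
final class RecoveryCostSketch {

    /**
     * Counts the requests needed to recover [start, start + count),
     * where bad[p] marks a position the source cannot serialize.
     * The count includes the initial request for the whole range.
     */
    static int requests(boolean[] bad, int start, int count) {
        if (count <= 0) {
            return 0;
        }
        boolean fails = false;
        for (int p = start; p < start + count; p++) {
            fails |= bad[p];
        }
        // A successful request, or a failing single position, ends the recursion.
        if (!fails || count == 1) {
            return 1;
        }
        int half = count / 2;
        return 1 + requests(bad, start, half) + requests(bad, start + half, count - half);
    }
}
```

For a 16-record page with one bad record this counts 9 requests (the failed page plus two per bisection level), and with all 16 records bad it counts 31, i.e. 2n-1.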

A SearchResults element is synthesized for the recovered page with consistent numberOfRecordsMatched, numberOfRecordsReturned (positions consumed, i.e. returned plus skipped), and nextRecord attributes, so the existing paging and end-of-set detection keep working unchanged.
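A rough sketch of how the three attributes could be derived for a recovered page is shown below. `SearchResultsSketch` is a hypothetical holder, not the element the harvester actually builds. Treating numberOfRecordsReturned as returned plus skipped follows the commit message; emitting nextRecord=0 once the set is exhausted follows the CSW 2.0.2 convention and is an assumption about what this PR does.

```java
/** Hypothetical holder for the attributes of a synthesized SearchResults element. */
final class SearchResultsSketch {
    final int numberOfRecordsMatched;
    final int numberOfRecordsReturned;
    final int nextRecord;

    /**
     * @param matched  total matches reported by the source
     * @param start    1-based start position of the recovered page
     * @param returned records actually recovered
     * @param skipped  positions isolated as failing and skipped
     */
    SearchResultsSketch(int matched, int start, int returned, int skipped) {
        // Skipped positions count as consumed so the outer loop advances past them.
        int consumed = returned + skipped;
        int next = start + consumed;
        this.numberOfRecordsMatched = matched;
        this.numberOfRecordsReturned = consumed;
        // 0 signals end of set (assumed here, per CSW 2.0.2).
        this.nextRecord = next > matched ? 0 : next;
    }
}
```

For example, a page starting at position 11 with 9 records recovered and 1 skipped, out of 25 matches, would report numberOfRecordsReturned=10 and nextRecord=21.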

Only server-side OWS exceptions trigger recovery. Connection and protocol errors are re-thrown as before.
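The distinction between recoverable and non-recoverable failures might look like the sketch below. `OwsServerException`, `SingleFetch` and `fetchOrSkip` are hypothetical stand-ins for the harvester's own types; `java.io.IOException` represents connection and protocol errors.

```java
import java.io.IOException;
import java.util.Optional;

/** Hypothetical sketch: only server-side OWS exceptions are treated as skippable. */
final class ErrorPolicySketch {

    /** Stand-in for an OWS exception report returned by the source. */
    static final class OwsServerException extends Exception {
        OwsServerException(String message) { super(message); }
    }

    /** Stand-in for a GetRecords call that requests a single position. */
    interface SingleFetch {
        String fetch(int position) throws OwsServerException, IOException;
    }

    /** Returns the record, or empty if the source rejected this one position. */
    static Optional<String> fetchOrSkip(SingleFetch fetcher, int position) throws IOException {
        try {
            return Optional.of(fetcher.fetch(position));
        } catch (OwsServerException e) {
            // Server-side failure for an isolated record: skip it and carry on.
            return Optional.empty();
        }
        // IOException is deliberately not caught, so it propagates to the caller.
    }
}
```

Keeping the catch clause this narrow means a dropped connection still aborts the run instead of being mistaken for a bad record.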

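Related

  • geonetwork#6940 (tracks the matching server-side behaviour)
  • geonetwork#6941 (proposes the server-side change)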

Test plan

  • Unit tests added in HarvesterTest covering: single bad record, all-bad page, partial failure, connection errors propagating correctly
  • Run CSW harvest against a GeoNetwork source that contains ISO 19110 feature catalogues with outputSchema=gmd
  • Verify skipped records appear in the harvest report and the rest of the catalogue is fully harvested

…eturned

When harvesting via CSW GetRecords, the source returns pages of several
records at once. Some CSW servers (GeoNetwork included) abort the whole
GetRecords response when a single record in the page cannot be serialized
in the requested outputSchema, for instance an ISO 19110 feature catalogue
requested with outputSchema=gmd, which has no gmd presentation. The
harvester turned that page error into a fatal OperationAbortedEx and the
whole harvest run stopped, leaving the catalogue partially harvested.

Make the page fetch resilient: when the source returns an OWS exception for
a page, split the page in half and retry each half. A single record that
still fails is logged, skipped, and the harvest carries on with the rest.
Each bad record costs O(log n) extra requests to isolate, so in the sparse
case the overhead stays low and far below fetching every record one by one
from the start. In the worst case (every record in the page is bad) the
split visits 2n-1 nodes, roughly twice the cost of fetching that page one
record at a time, but still linear in the page size and bounded in practice
because pages are small (typically 10-200 records). A SearchResults element
is synthesized for the recovered
page with consistent numberOfRecordsMatched, numberOfRecordsReturned
(positions consumed, i.e. returned plus skipped) and nextRecord attributes,
so the existing paging and end-of-set detection keep working unchanged.

Only server-side OWS exceptions trigger this recovery. Connection and
protocol errors, and all the existing handling of well-behaved and
misbehaving servers (nextRecord based termination, old CSW namespace,
GET/POST fallback), are left untouched.

The recovery is on the harvester side, so it also handles non-GeoNetwork
CSW servers that fail a page for any reason. The matching server-side
behaviour, making a GeoNetwork source skip the records it cannot present
instead of failing the whole page, is tracked in geonetwork#6940 and proposed in
geonetwork#6941; the two changes are complementary.

Related to geonetwork#6940
Related to geonetwork#6941

When recoverRange isolates a single position and the server returns a
successful but empty SearchResults (no exception, just 0 children),
return 1 instead of 0 so the caller advances past that position. Without
this the synthesized page reports numberOfRecordsReturned=0 and the outer
paging loop stops prematurely, silently dropping the remaining records.
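
The effect of this fix can be illustrated with a simplified paging loop. This is a sketch under stated assumptions: every request covers a single position, `recordsAt[p - 1]` is the number of records the source returns for position `p`, and `ConsumedPositionsSketch`, `positionsConsumed` and `stopPosition` are hypothetical names rather than the harvester's own.

```java
/** Hypothetical sketch of why an empty single-position response must count as consumed. */
final class ConsumedPositionsSketch {

    /** Positions consumed by a range of `requested` positions that yielded `records` records. */
    static int positionsConsumed(int requested, int records) {
        // An isolated position with a successful but empty response is still consumed.
        if (requested == 1 && records == 0) {
            return 1;
        }
        return records;
    }

    /** Walks positions one at a time and returns the position at which paging stops. */
    static int stopPosition(int[] recordsAt, boolean applyFix) {
        int position = 1;
        while (position <= recordsAt.length) {
            int records = recordsAt[position - 1];
            int consumed = applyFix ? positionsConsumed(1, records) : records;
            if (consumed == 0) {
                // The outer loop sees numberOfRecordsReturned=0 and stops here.
                break;
            }
            position += consumed;
        }
        return position;
    }
}
```

With `recordsAt = {1, 0, 1, 1}`, the loop without the fix stops at position 2 and never reaches the last two records, while with the fix it advances to position 5, past the end of the set.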

@juanluisrp juanluisrp force-pushed the csw-harvester-resilient-paging branch from 2eb21b1 to 06f7d6a Compare June 9, 2026 10:12

- Remove redundant request.setStartPosition() call in the outer paging
  loop; executeGetRecords already sets it.
- Qualify the "declared vs actual record count" warning to mention that
  the mismatch is expected when page recovery skips positions, adding
  the number of skipped positions to the message.
- Emit a dedicated warning when an entire recovered page is empty (all
  records skipped), pointing at a possible harvester misconfiguration.
- Add a test verifying that REQUEST_REJECTED errors (e.g.
  InvalidParameterValueEx) propagate out of recoverRange instead of
  being silently treated as skippable single-record failures.