Problem Statement
Scanning large S3 / object-store buckets is extremely slow. A real scan of 291,422 objects took ~3 hours 14 minutes. The bottlenecks are in packages/server/engine-lib/engLib/store/endpoints/objstore/base/scan.cpp:
- Listing relied on StartAfter plus page result-equality heuristics to detect the end of a listing instead of proper ListObjectsV2 continuation-token pagination, which is fragile and wasteful.
- A fresh S3 client was created per scan call (including each recursive sub-prefix scan), so the HTTP connection pool was not reused across scanner threads.
- The default ClientConfiguration.maxConnections serialized the scanner threads.
- Every object incurred a separate HeadObject (Content-Type) round-trip.
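For scale, the quoted figures work out to roughly 25 objects per second, or about 40 ms per object, which would be consistent with per-object request latency dominating the scan. The sketch below only redoes that arithmetic. The constant names are illustrative and do not come from scan.cpp; the 1,000-key page size is the ListObjectsV2 per-request maximum, and the page count is a lower bound because recursive sub-prefix scans issue additional list requests.

```cpp
#include <cstdint>

// Figures quoted in the problem statement.
constexpr std::int64_t kObjects = 291422;
constexpr std::int64_t kElapsedSeconds = 3 * 3600 + 14 * 60;  // 3 h 14 min

// Average wall-clock cost per object in milliseconds (integer division).
constexpr std::int64_t kMillisPerObject = kElapsedSeconds * 1000 / kObjects;

// Average throughput in objects per second (integer division).
constexpr std::int64_t kObjectsPerSecond = kObjects / kElapsedSeconds;

// ListObjectsV2 returns at most 1,000 keys per request, so listing this many
// objects needs at least this many list requests (rounded up).
constexpr std::int64_t kMaxKeysPerPage = 1000;
constexpr std::int64_t kMinListPages =
    (kObjects + kMaxKeysPerPage - 1) / kMaxKeysPerPage;
```

Under these assumptions the listing itself needs only a few hundred requests, while the per-object HEAD adds one request for each of the 291,422 objects.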
Proposed Solution
- Walk listings with ListObjectsV2 continuation-token pagination.
- Cache and share one S3 client across scanner threads (mutex-guarded, reset on list errors so the next scan re-connects).
- Raise ClientConfiguration.maxConnections to 64 so the shared pool does not serialize threads.
- Drop the per-object Content-Type HEAD; fetch owner metadata inline via ListObjectsV2 FetchOwner.
This also benefits the generic objstore connector, which inherits the same base scan.
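The first two bullets of the proposed solution can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in scan.cpp: FakeClient, ListPage, ClientCache and listAll are hypothetical names, and FakeClient is an in-memory stand-in that only mimics the truncated-flag and continuation-token shape of a ListObjectsV2 response. In the real connector the factory would presumably build the S3 client with the raised maxConnections setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical page shape, modelled on a ListObjectsV2 response.
struct ListPage {
    std::vector<std::string> keys;
    bool truncated = false;   // true when more pages follow
    std::string nextToken;    // only meaningful when truncated is true
};

// Hypothetical in-memory stand-in for an S3 client.
class FakeClient {
public:
    FakeClient(std::vector<std::string> keys, std::size_t pageSize)
        : m_keys(std::move(keys)), m_pageSize(pageSize) {
        assert(pageSize > 0);
    }

    // The token is opaque to callers; in this fake it encodes an offset.
    ListPage list(const std::string& token) const {
        std::size_t begin = token.empty() ? 0 : std::stoul(token);
        std::size_t end = std::min(begin + m_pageSize, m_keys.size());
        ListPage page;
        for (std::size_t i = begin; i < end; ++i) page.keys.push_back(m_keys[i]);
        if (end < m_keys.size()) {
            page.truncated = true;
            page.nextToken = std::to_string(end);
        }
        return page;
    }

private:
    std::vector<std::string> m_keys;
    std::size_t m_pageSize;
};

// Mutex-guarded cache so all scanner threads share one client.
class ClientCache {
public:
    using Factory = std::function<std::shared_ptr<FakeClient>()>;
    explicit ClientCache(Factory factory) : m_factory(std::move(factory)) {}

    // Returns the cached client, creating it on first use.
    std::shared_ptr<FakeClient> get() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_client) m_client = m_factory();
        return m_client;
    }

    // Drops the cached client so the next get() builds a new one.
    void reset() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_client.reset();
    }

private:
    std::mutex m_mutex;
    Factory m_factory;
    std::shared_ptr<FakeClient> m_client;
};

// Walks every page by feeding each response's token into the next request.
std::vector<std::string> listAll(ClientCache& cache) {
    std::vector<std::string> all;
    std::shared_ptr<FakeClient> client = cache.get();
    std::string token;  // empty on the first request
    try {
        for (;;) {
            ListPage page = client->list(token);
            all.insert(all.end(), page.keys.begin(), page.keys.end());
            if (!page.truncated) break;  // the service says the listing is done
            token = page.nextToken;
        }
    } catch (...) {
        cache.reset();  // on a list error, force the next scan to reconnect
        throw;
    }
    return all;
}
```

The loop ends only when the response reports that it is not truncated, so no comparison of page contents is needed. Holding the client through a shared_ptr means a reset on one thread does not invalidate a client another thread is still using.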
Affected Modules