perf(objstore): S3/object-store scan is far too slow on large buckets #1208

@ksaurabhAparavi

Description

Problem Statement

Scanning large S3 / object-store buckets is extremely slow. A real scan of 291,422 objects took ~3 hours 14 minutes. The bottlenecks are in packages/server/engine-lib/engLib/store/endpoints/objstore/base/scan.cpp:

  • Listing relied on StartAfter plus page result-equality heuristics to detect the end of a listing, instead of proper ListObjectsV2 continuation-token pagination. This is fragile and wasteful.
  • A fresh S3 client was created per scan call (including each recursive sub-prefix scan), so the HTTP connection pool was not reused across scanner threads.
  • The default ClientConfiguration.maxConnections serialized the scanner threads.
  • Every object incurred a separate HeadObject (Content-Type) round-trip.

Proposed Solution

  • Walk listings with ListObjectsV2 continuation-token pagination.
  • Cache and share one S3 client across scanner threads (mutex-guarded, reset on list errors so the next scan re-connects).
  • Raise ClientConfiguration.maxConnections to 64 so the shared pool does not serialize threads.
  • Drop the per-object Content-Type HEAD; fetch owner metadata inline via ListObjectsV2 FetchOwner.
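
To make the continuation-token item concrete, here is a minimal sketch of the loop shape. It does not use the AWS SDK: ListPage, FetchPage and listAllKeys are hypothetical names invented for illustration, and the fetcher stands in for a ListObjectsV2 call whose response carries IsTruncated and NextContinuationToken. The actual code in scan.cpp may be structured differently.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one ListObjectsV2 response page.
struct ListPage {
    std::vector<std::string> keys;  // object keys returned on this page
    bool truncated = false;         // mirrors IsTruncated
    std::string nextToken;          // mirrors NextContinuationToken (valid when truncated)
};

// Hypothetical page fetcher: receives the continuation token from the
// previous page (empty string for the first request) and returns one page.
using FetchPage = std::function<ListPage(const std::string&)>;

// Walks the whole listing by following continuation tokens. The loop ends
// when the service reports the listing is no longer truncated, so no
// comparison of page contents is needed to detect the end.
std::vector<std::string> listAllKeys(const FetchPage& fetchPage) {
    assert(fetchPage && "fetchPage must be callable");
    std::vector<std::string> keys;
    std::string token;  // empty means "first page"
    bool more = true;
    while (more) {
        ListPage page = fetchPage(token);
        keys.insert(keys.end(), page.keys.begin(), page.keys.end());
        more = page.truncated;
        token = page.nextToken;
    }
    return keys;
}
```

In a real scanner the fetcher would wrap the SDK request, set the continuation token on it when non-empty, and could also request owner data in the same call, which is what makes the per-object HEAD unnecessary.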

This also benefits the generic objstore connector, which inherits the same base scan.
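
The shared-client item in the proposed solution could look roughly like the sketch below. SharedClientCache and FakeClient are hypothetical names, and the factory is assumed to be where the real code would build its client with maxConnections raised to 64. This is one possible shape under those assumptions, not the implementation in the engine.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the real S3 client type.
struct FakeClient {
    int maxConnections;
};

// Caches one client and hands out shared ownership to scanner threads.
template <typename Client>
class SharedClientCache {
public:
    using Factory = std::function<std::shared_ptr<Client>()>;

    explicit SharedClientCache(Factory factory) : m_factory(std::move(factory)) {
        assert(m_factory && "factory must be callable");
    }

    // Returns the cached client, creating it on first use or after reset().
    std::shared_ptr<Client> get() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_client) m_client = m_factory();
        return m_client;
    }

    // Drops the cached client, for example after a list error, so the next
    // get() reconnects. Threads already holding a shared_ptr keep theirs alive.
    void reset() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_client.reset();
    }

private:
    std::mutex m_mutex;
    Factory m_factory;
    std::shared_ptr<Client> m_client;
};
```

Returning a shared_ptr rather than a reference means a reset() triggered by one thread cannot invalidate a client that another thread is still using mid-request.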

Affected Modules

  • server (C++ engine)
