Add dc.bucket_overview() — fast hierarchical summary for large buckets #1750

@dmpetrov

Problem

dc.read_storage(uri) enumerates every file (one row per leaf). On large buckets (e.g. gs://gresearch/robotics/droid_raw/ — 380k files, 1.7 TB) this is intractable: the listing times out, and downstream consumers like bucket_scan.py can't produce a useful overview.

Current workaround in the knowledge skill is to write the bucket markdown by hand, which the auto-render pipeline can't maintain. That's a smell — manual editing of artifacts that should be auto-generated.

Proposed primitive

dc.bucket_overview(
    "gs://bucket/prefix/",
    anon=True,
    name=None,                # default: "overview_{slug(uri)}_{timestamp}"
    max_listing_seconds=60,
)

Returns a saved DataChain dataset. If name is not provided, it defaults to overview_{slug(uri)}_{timestamp}. The result is reusable like any other dataset (dc.read_dataset(name)).
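A minimal sketch of how the default name could be derived. The issue does not define slug() or the timestamp format, so both are assumptions here: slug() lowercases the URI and collapses non-alphanumeric runs into underscores, and the timestamp is UTC in YYYYMMDD_HHMMSS form.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def slug(uri: str) -> str:
    """Hypothetical slug helper: lowercase, non-alphanumeric runs become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", uri.lower()).strip("_")


def default_overview_name(uri: str, now: Optional[datetime] = None) -> str:
    """Build 'overview_{slug(uri)}_{timestamp}' (timestamp format is assumed)."""
    now = now or datetime.now(timezone.utc)
    return f"overview_{slug(uri)}_{now:%Y%m%d_%H%M%S}"


print(default_overview_name("gs://bucket/prefix/"))
```

Passing `now` explicitly keeps the name deterministic in tests; whether DataChain dataset names accept this exact character set would need to be checked.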

Implementation sketch

Use cloud-SDK list with delimiter='/' rather than full enumeration:

  1. List depth-1 prefixes + immediate files at the given URI
  2. For each top-level prefix, sample 1-2 sub-prefixes deeper to capture extension distribution and file-size hints
  3. Aggregate into per-prefix rows (path, files-estimated, total-size-estimated, top extensions, sample paths)
  4. Save as named DataChain dataset
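The steps above can be sketched without any cloud SDK by emulating a delimiter listing over an in-memory mapping of key to size. Everything named here (list_one_level, summarize_prefix, max_samples, the linear extrapolation, and the meaning given to depth_explored) is an illustrative assumption, not the proposed implementation.

```python
import posixpath
from collections import Counter


def list_one_level(keys, prefix, delimiter="/"):
    """Emulate a delimiter listing: (sub-prefixes, immediate files) under prefix."""
    sub_prefixes, files = set(), []
    for key, size in keys.items():
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            sub_prefixes.add(prefix + head + delimiter)
        elif rest:
            files.append((key, size))
    return sorted(sub_prefixes), files


def summarize_prefix(keys, prefix, max_samples=2):
    """Sample a few sub-prefixes one level down and extrapolate linearly."""
    subs, files = list_one_level(keys, prefix)
    sampled = subs[:max_samples]
    sampled_files = []
    for sub in sampled:
        _, sub_files = list_one_level(keys, sub)
        sampled_files.extend(sub_files)
    # Assume unsampled sub-prefixes look like the sampled ones on average.
    scale = len(subs) / len(sampled) if sampled else 0
    seen = files + sampled_files
    exts = Counter(
        ext for ext in (posixpath.splitext(k)[1] for k, _ in seen) if ext
    )
    return {
        "prefix": prefix,
        "files_estimated": len(files) + round(len(sampled_files) * scale),
        "total_size_estimated": sum(s for _, s in files)
        + round(sum(s for _, s in sampled_files) * scale),
        "extensions": [ext for ext, _ in exts.most_common(3)],
        "sample_paths": [k for k, _ in seen[:3]],
        "depth_explored": 2 if sampled else 1,  # assumed meaning: levels listed
    }
```

With a real backend, list_one_level would be replaced by the SDK's delimiter listing, so the cost scales with the number of prefixes visited rather than with the number of files.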

Output shape (one row per top-level prefix):

| Column | Type | Notes |
| --- | --- | --- |
| prefix | str | top-level subdirectory path |
| files_estimated | int | from sample × extrapolation, with accuracy flag |
| total_size_estimated | int | bytes, estimated |
| extensions | list[str] | top extensions found |
| sample_paths | list[str] | a handful of representative leaf paths |
| depth_explored | int | how deep the sampler went |
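For reference, the row shape could be written down as a typed record. This sketch uses a plain dataclass so it runs on its own; the class name PrefixOverview and the example values are made up, and a real implementation would presumably use whatever model type DataChain datasets expect.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PrefixOverview:
    """One row per top-level prefix (hypothetical name, fields from the table)."""
    prefix: str
    files_estimated: int
    total_size_estimated: int  # bytes
    extensions: List[str] = field(default_factory=list)
    sample_paths: List[str] = field(default_factory=list)
    depth_explored: int = 1


# Illustrative values only, not measurements from a real bucket.
row = PrefixOverview(
    prefix="gs://bucket/prefix/a/",
    files_estimated=7,
    total_size_estimated=335,
    extensions=[".mp4", ".json"],
    sample_paths=["gs://bucket/prefix/a/1/x.mp4"],
    depth_explored=2,
)
print(asdict(row))
```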

Why not put it in read_storage()

read_storage() returns a row-per-file DataChain — that's its contract. Falling back to hierarchical mode there would silently change return shape based on bucket size, breaking downstream UDFs. Bucket overview is a different primitive with a different output shape; it deserves its own name.

Side request: read_storage(..., max_listing_seconds=N)

Today read_storage() on a huge bucket hangs without diagnostics. A timeout parameter would let users fail fast and switch to bucket_overview() deliberately.
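One way the fail-fast behavior could work is a deadline check wrapped around the listing iterator. This is a sketch under assumptions: ListingTimeout and with_deadline are hypothetical names, and the injectable clock exists only to make the behavior testable.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class ListingTimeout(TimeoutError):
    """Hypothetical error raised when a listing exceeds its time budget."""


def with_deadline(
    items: Iterable[T],
    max_listing_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items until the time budget is spent, then raise ListingTimeout."""
    deadline = clock() + max_listing_seconds
    for count, item in enumerate(items, start=1):
        if clock() > deadline:
            raise ListingTimeout(
                f"listing exceeded {max_listing_seconds}s "
                f"after {count - 1} entries"
            )
        yield item
```

The check runs only between entries, so a single blocked network call would not be interrupted; a real implementation would likely also need request-level timeouts in the storage client.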

Storage backends

Initial scope: GCS, S3, Azure Blob — same backends dc.read_storage() supports. Each has native delimiter-listing support in its SDK.
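As an illustration for S3, a depth-1 listing with boto3's list_objects_v2 paginator and Delimiter="/" returns sub-prefixes under CommonPrefixes and immediate files under Contents. The helper name list_depth_one is an assumption; the function takes the client as an argument so it can be exercised with a stub.

```python
from typing import List, Tuple


def list_depth_one(
    s3_client, bucket: str, prefix: str = ""
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (sub-prefixes, [(key, size)]) directly under prefix, no recursion."""
    paginator = s3_client.get_paginator("list_objects_v2")
    prefixes: List[str] = []
    files: List[Tuple[str, int]] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # Either key may be absent on a page, hence the defaults.
        prefixes += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
        files += [(o["Key"], o["Size"]) for o in page.get("Contents", [])]
    return prefixes, files


# Usage with a real client (requires boto3 and network access):
#   import boto3
#   prefixes, files = list_depth_one(boto3.client("s3"), "my-bucket", "raw/")
```

The GCS and Azure SDKs would need their own equivalents of this helper, returning the same pair so the sampling logic stays backend-agnostic.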

Affected files

  • New: dc.bucket_overview() in the SDK
  • Refactor: ~/src/datachain/src/datachain/skill/knowledge/scripts/bucket_scan.py becomes a thin CLI wrapper
  • Skill docs: update knowledge/SKILL.md Step 1 (Bucket Enlistment) to use bucket_overview for buckets that exceed the timeout

Context

Surfaced during a YC demo session running DataChain on the public DROID raw bucket. Manual hand-written bucket markdown was needed because no SDK primitive gave a usable summary in seconds. Reverted that workaround; filing this as the right-layer fix. A short-term skill-side script (separate from bucket_scan.py) will provide the capability until this lands.
