Problem
dc.read_storage(uri) enumerates every file (one row per leaf). On large buckets (e.g. gs://gresearch/robotics/droid_raw/ — 380k files, 1.7 TB) this is intractable: the listing times out, and downstream consumers like bucket_scan.py can't produce a useful overview.
The current workaround in the knowledge skill is to write the bucket markdown by hand, which the auto-render pipeline can't maintain. That's a smell: artifacts that should be auto-generated are being edited manually.
Proposed primitive
dc.bucket_overview(
    "gs://bucket/prefix/",
    anon=True,
    name=None,  # default: "overview_{slug(uri)}_{timestamp}"
    max_listing_seconds=60,
)
Returns a saved DataChain dataset. If name is not provided, it defaults to overview_{slug(uri)}_{timestamp}. The result is reusable like any other dataset (dc.read_dataset(name)).
Implementation sketch
Use cloud-SDK list with delimiter='/' rather than full enumeration:
- List depth-1 prefixes + immediate files at the given URI
- For each top-level prefix, sample 1-2 sub-prefixes deeper to capture extension distribution and file-size hints
- Aggregate into per-prefix rows (path, files-estimated, total-size-estimated, top extensions, sample paths)
- Save as named DataChain dataset
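The sampling steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the proposed implementation: `list_dir`, `PrefixSummary`, `summarize_prefix`, `overview_rows`, `fanout`, and `max_depth` are hypothetical names, and `list_dir` stands in for whatever one-level delimiter listing the backend SDK provides. The extrapolation step is left out because the issue does not specify it, so the sketch reports only what was actually seen, plus a flag saying whether anything was skipped.

```python
from collections import Counter
from dataclasses import dataclass, field
from posixpath import splitext
from typing import Callable

# Result of one delimiter-style listing call:
# (sub-prefixes, [(file path, size in bytes)]).
Listing = tuple[list[str], list[tuple[str, int]]]


@dataclass
class PrefixSummary:
    prefix: str
    files_seen: int = 0
    bytes_seen: int = 0
    extensions: Counter = field(default_factory=Counter)
    sample_paths: list[str] = field(default_factory=list)
    depth_explored: int = 0
    truncated: bool = False  # True if some sub-prefixes were never listed


def summarize_prefix(
    prefix: str,
    list_dir: Callable[[str], Listing],
    max_depth: int = 2,
    fanout: int = 2,
    max_samples: int = 3,
) -> PrefixSummary:
    """Sample one top-level prefix, following at most `fanout` sub-prefixes per level."""
    summary = PrefixSummary(prefix=prefix)
    frontier = [prefix]
    depth = 0
    while frontier and depth < max_depth:
        depth += 1
        next_frontier: list[str] = []
        for current in frontier:
            subprefixes, files = list_dir(current)
            for path, size in files:
                summary.files_seen += 1
                summary.bytes_seen += size
                summary.extensions[splitext(path)[1] or "<none>"] += 1
                if len(summary.sample_paths) < max_samples:
                    summary.sample_paths.append(path)
            if len(subprefixes) > fanout:
                summary.truncated = True  # only a sample of siblings is followed
            next_frontier.extend(subprefixes[:fanout])
        frontier = next_frontier
    summary.depth_explored = depth
    # Anything still in the frontier lies below max_depth and was not listed.
    summary.truncated = summary.truncated or bool(frontier)
    return summary


def overview_rows(uri: str, list_dir: Callable[[str], Listing], **kwargs) -> list[PrefixSummary]:
    """One summary per depth-1 prefix; files directly under `uri` are ignored here."""
    top_level, _immediate_files = list_dir(uri)
    return [summarize_prefix(p, list_dir, **kwargs) for p in top_level]
```

Because the listing function is injected, the same logic can be exercised against an in-memory dict before it is wired to a real bucket. The cost is bounded by roughly `fanout ** max_depth` listing calls per top-level prefix rather than by the number of files.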
Output shape (one row per top-level prefix):
| Column | Type | Notes |
| --- | --- | --- |
| prefix | str | top-level subdirectory path |
| files_estimated | int | from sample × extrapolation, with accuracy flag |
| total_size_estimated | int | bytes, estimated |
| extensions | list[str] | top extensions found |
| sample_paths | list[str] | a handful of representative leaf paths |
| depth_explored | int | how deep the sampler went |
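For reference, the row shape could be written down as a typed record. This is only a sketch: `OverviewRow` is a hypothetical name, a plain dataclass is used instead of whatever model type the SDK would actually choose, and `is_exact` is an assumed stand-in for the accuracy flag, which the table mentions but does not give its own column. The values in the example row are illustrative, not measurements.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class OverviewRow:
    """One row per top-level prefix, mirroring the columns listed above."""

    prefix: str                 # top-level subdirectory path
    files_estimated: int        # extrapolated from the sample
    total_size_estimated: int   # bytes, estimated
    extensions: list[str] = field(default_factory=list)
    sample_paths: list[str] = field(default_factory=list)
    depth_explored: int = 1
    # Hypothetical field: marks counts that came from a full listing.
    is_exact: bool = False


# Illustrative values only.
row = OverviewRow(
    prefix="gs://bucket/prefix/episodes/",
    files_estimated=1200,
    total_size_estimated=5_000_000_000,
    extensions=[".mp4", ".json"],
    sample_paths=["gs://bucket/prefix/episodes/0001/video.mp4"],
    depth_explored=2,
)
print(asdict(row))
```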
Why not put it in read_storage()
read_storage() returns a row-per-file DataChain — that's its contract. Falling back to hierarchical mode there would silently change return shape based on bucket size, breaking downstream UDFs. Bucket overview is a different primitive with a different output shape; it deserves its own name.
Side request: read_storage(..., max_listing_seconds=N)
Today read_storage() on a huge bucket hangs without diagnostics. A timeout parameter would let users fail fast and switch to bucket_overview() deliberately.
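One way such a timeout could behave is sketched below: wrap the listing iterator and raise once a wall-clock budget is exceeded. `with_deadline` and `ListingTimeout` are hypothetical names, not part of DataChain, and the check only runs between entries, so a single SDK call that blocks would not be interrupted by this approach alone.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class ListingTimeout(TimeoutError):
    """Raised when a listing exceeds its time budget (hypothetical name)."""


def with_deadline(
    items: Iterable[T],
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield entries from `items` until `max_seconds` have elapsed, then raise."""
    deadline = clock() + max_seconds
    count = 0
    for item in items:
        # Checked once per entry; a blocking page fetch is not interrupted.
        if clock() > deadline:
            raise ListingTimeout(
                f"listing exceeded {max_seconds}s after {count} entries"
            )
        count += 1
        yield item
```

The error message carries the number of entries seen so far, which gives the caller a rough sense of bucket size before switching to bucket_overview().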
Storage backends
Initial scope: GCS, S3, Azure Blob — same backends dc.read_storage() supports. Each has native delimiter-listing support in its SDK.
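As an illustration of what delimiter listing looks like on one backend, here is a minimal sketch for S3 using the boto3 `list_objects_v2` paginator. The function name `list_one_level_s3` is hypothetical, and the client is passed in so the sketch carries no import of its own. GCS and Azure offer analogous options (`Client.list_blobs(..., delimiter="/")` and `ContainerClient.walk_blobs(..., delimiter="/")` respectively), which are not shown here.

```python
def list_one_level_s3(client, bucket: str, prefix: str):
    """Return (sub-prefixes, [(key, size)]) directly under `prefix`, without recursing.

    `client` is expected to behave like boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    subprefixes: list[str] = []
    files: list[tuple[str, int]] = []
    # With Delimiter="/", keys below the next "/" are rolled up into
    # CommonPrefixes instead of being returned one by one.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        subprefixes += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
        files += [(o["Key"], o["Size"]) for o in page.get("Contents", [])]
    return subprefixes, files
```

For a public bucket, the client would typically be created with unsigned requests, for example `boto3.client("s3", config=Config(signature_version=UNSIGNED))` with `Config` from `botocore.client` and `UNSIGNED` from `botocore`.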
Affected files
- New: dc.bucket_overview() in the SDK
- Refactor: ~/src/datachain/src/datachain/skill/knowledge/scripts/bucket_scan.py becomes a thin CLI wrapper
- Skill docs: update knowledge/SKILL.md Step 1 (Bucket Enlistment) to use bucket_overview for buckets that exceed the timeout
Context
Surfaced during a YC demo session running DataChain on the public DROID raw bucket. Manual hand-written bucket markdown was needed because no SDK primitive gave a usable summary in seconds. Reverted that workaround; filing this as the right-layer fix. A short-term skill-side script (separate from bucket_scan.py) will provide the capability until this lands.