Add dc.bucket_overview() — fast hierarchical summary for large buckets #1750

## Problem

`dc.read_storage(uri)` enumerates every file (one row per leaf). On large buckets (e.g. `gs://gresearch/robotics/droid_raw/` — 380k files, 1.7 TB) this is intractable: the listing times out, and downstream consumers like `bucket_scan.py` can't produce a useful overview.

The current workaround in the knowledge skill is to write the bucket markdown by hand, which the auto-render pipeline can't maintain. Hand-editing artifacts that should be auto-generated is a design smell.

## Proposed primitive

```python
dc.bucket_overview(
    "gs://bucket/prefix/",
    anon=True,
    name=None,                # default: "overview_{slug(uri)}_{timestamp}"
    max_listing_seconds=60,
)
```

**Returns a saved DataChain dataset.** Default name `overview_{slug}_{timestamp}` if `name` is not provided. Reusable like any other dataset (`dc.read_dataset(name)`).

## Implementation sketch

Use the cloud SDK's list call with `delimiter='/'` rather than full enumeration:

1. List depth-1 prefixes + immediate files at the given URI
2. For each top-level prefix, sample 1-2 sub-prefixes deeper to capture extension distribution and file-size hints
3. Aggregate into per-prefix rows (path, files-estimated, total-size-estimated, top extensions, sample paths)
4. Save as named DataChain dataset
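To make step 1 concrete, here is a small in-memory emulation of what a single delimiter listing returns: the sub-prefixes one level down plus the files directly under the prefix. In a real backend the object store does this grouping server-side; `list_one_level` and the `KEYS` fixture are hypothetical stand-ins so the sketch runs without credentials.

```python
def list_one_level(keys, prefix, delimiter="/"):
    """Emulate a delimiter listing: return (sub_prefixes, immediate_files).

    `keys` is an iterable of (object_key, size_in_bytes) pairs.
    """
    sub_prefixes = set()
    files = []
    for key, size in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something lives below this level: report only the rolled-up prefix.
            sub_prefixes.add(prefix + head + delimiter)
        elif rest:
            # No delimiter left: the object sits directly under `prefix`.
            files.append((key, size))
    return sorted(sub_prefixes), files


# Hypothetical keys, loosely shaped like a versioned dataset bucket.
KEYS = [
    ("droid/README.md", 100),
    ("droid/1.0/a/ep1.mp4", 5000),
    ("droid/1.0/a/ep1.json", 200),
    ("droid/1.0/b/ep2.mp4", 7000),
    ("droid/2.0/x/ep3.h5", 900),
]

prefixes, files = list_one_level(KEYS, "droid/")
print(prefixes)  # ['droid/1.0/', 'droid/2.0/']
print(files)     # [('droid/README.md', 100)]
```

The point of the delimiter is that the response size depends on the number of entries at one level, not on the number of leaf files underneath.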

Output shape (one row per top-level prefix):

| Column | Type | Notes |
|---|---|---|
| prefix | str | top-level subdirectory path |
| files_estimated | int | from sample × extrapolation, with `accuracy` flag |
| total_size_estimated | int | bytes, estimated |
| extensions | list[str] | top extensions found |
| sample_paths | list[str] | a handful of representative leaf paths |
| depth_explored | int | how deep the sampler went |
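One possible way to fill a row from a sample is linear extrapolation: list a few sub-prefixes fully, then scale the counts by the ratio of total to sampled sub-prefixes. The sketch below assumes that rule; the issue does not specify the estimator, and `OverviewRow` and `estimate_row` are hypothetical names that only mirror the columns in the table above.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class OverviewRow:
    prefix: str
    files_estimated: int
    total_size_estimated: int
    extensions: list[str]
    sample_paths: list[str]
    depth_explored: int


def estimate_row(prefix, n_subprefixes, sampled, depth_explored, top_n=3, n_samples=3):
    """Extrapolate one row from `sampled`, a list of per-sub-prefix listings.

    Each listing is a list of (path, size_in_bytes) pairs. Counts and sizes
    are scaled by n_subprefixes / len(sampled) (assumed linear extrapolation).
    """
    files = [f for listing in sampled for f in listing]
    scale = n_subprefixes / len(sampled) if sampled else 0
    ext_counts = Counter(PurePosixPath(p).suffix or "(none)" for p, _ in files)
    return OverviewRow(
        prefix=prefix,
        files_estimated=round(len(files) * scale),
        total_size_estimated=round(sum(size for _, size in files) * scale),
        extensions=[ext for ext, _ in ext_counts.most_common(top_n)],
        sample_paths=[p for p, _ in files[:n_samples]],
        depth_explored=depth_explored,
    )


# Two of ten sub-prefixes were sampled (made-up paths and sizes).
sampled = [
    [("1.0/a/ep1.mp4", 5000), ("1.0/a/ep1.json", 200), ("1.0/a/ep2.mp4", 6000)],
    [("1.0/b/ep3.mp4", 7000), ("1.0/b/ep3.json", 300)],
]
row = estimate_row("1.0/", n_subprefixes=10, sampled=sampled, depth_explored=2)
print(row.files_estimated, row.total_size_estimated, row.extensions)
# 25 92500 ['.mp4', '.json']
```

Linear scaling is only reasonable when sub-prefixes are roughly uniform in size, so a skewed layout would need more samples or a different estimator.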

## Why not put it in `read_storage()`

`read_storage()` returns a row-per-file DataChain — that's its contract. Falling back to hierarchical mode there would silently change return shape based on bucket size, breaking downstream UDFs. Bucket overview is a different primitive with a different output shape; it deserves its own name.

## Side request: `read_storage(..., max_listing_seconds=N)`

Today `read_storage()` on a huge bucket hangs without diagnostics. A timeout parameter would let users fail fast and switch to `bucket_overview()` deliberately.
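A minimal sketch of the fail-fast behaviour, assuming the listing is consumed as an iterator: wrap it in a generator that checks a monotonic deadline between entries. `with_deadline` is a hypothetical helper, not an existing DataChain function, and it cannot interrupt a single blocking network call; it only stops between yielded entries.

```python
import time


def with_deadline(items, max_listing_seconds, clock=time.monotonic):
    """Yield from `items`, raising TimeoutError once the time budget is spent."""
    deadline = clock() + max_listing_seconds
    for count, item in enumerate(items):
        if clock() > deadline:
            raise TimeoutError(
                f"listing exceeded {max_listing_seconds}s after {count} entries"
            )
        yield item


# Fast listings pass through unchanged.
print(list(with_deadline(["a.txt", "b.txt"], max_listing_seconds=60)))
# ['a.txt', 'b.txt']
```

Including the entry count in the error gives the caller the diagnostic that is missing today and a clear signal to switch to `bucket_overview()`.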

## Storage backends

Initial scope: GCS, S3, Azure Blob — same backends `dc.read_storage()` supports. Each has native delimiter-listing support in its SDK.

## Affected files

- New: `dc.bucket_overview()` in the SDK
- Refactor: `~/src/datachain/src/datachain/skill/knowledge/scripts/bucket_scan.py` becomes a thin CLI wrapper
- Skill docs: update `knowledge/SKILL.md` Step 1 (Bucket Enlistment) to use `bucket_overview` for buckets that exceed the timeout

## Context

Surfaced during a YC demo session running DataChain on the public DROID raw bucket. Manual hand-written bucket markdown was needed because no SDK primitive gave a usable summary in seconds. Reverted that workaround; filing this as the right-layer fix. A short-term skill-side script (separate from `bucket_scan.py`) will provide the capability until this lands.
