Commit cfea977

Add user guide section on globbing (#51)
* Add user guide section on globbing
* Goal-oriented name
* More goal oriented
* Add warning
1 parent 9809e0d commit cfea977

2 files changed

Lines changed: 155 additions & 0 deletions

docs/user-guide/finding-files.md

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
# Finding Files on Cloud Storage

This guide shows how to discover and list files stored in cloud object storage.

## Listing Files in a Directory

To see what files exist in a specific location, use the store's `list()` method with a prefix:

```python exec="on" source="above" session="find" result="code"
from obstore.store import S3Store

# Access public AWS Open Data
store = S3Store(
    bucket="nasanex",
    aws_region="us-west-2",
    skip_signature=True,
)

# List files in a specific directory; list() yields results in chunks
prefix = "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
files = []
for chunk in store.list(prefix=prefix):
    files.extend(chunk)

print(f"Found {len(files)} files in {prefix}")
print("\nFirst 5 files:")
for f in files[:5]:
    print(f" {f['path'].split('/')[-1]}")
```

!!! warning "Use the class methods rather than `obstore` top-level functions"
    When using `obspec_utils` wrappers like `CachingReadableStore`, call methods
    directly on the store (e.g., `store.list()`) rather than using `obstore` functions
    (e.g., `obstore.list(store)`). The wrappers implement the `obspec` protocol, which decouples them from any specific store implementation. The `obstore` top-level functions are tied to the specific stores implemented by `obstore`, so they will not work with the `obspec`-based wrappers provided by `obspec-utils`.
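
The idea behind this warning can be sketched with a toy example. Everything below is hypothetical and invented for illustration (`SupportsList`, `InMemoryStore`, `CountingWrapper`, and `count_files` are not part of `obspec`, `obstore`, or `obspec_utils`). It only shows why code that calls `store.list()` keeps working when a store is wrapped, as long as the wrapper exposes the same method.

```python
from typing import Iterable, Protocol, Sequence


class SupportsList(Protocol):
    """Hypothetical protocol: any object with a chunked list(prefix) method."""

    def list(self, prefix: str) -> Iterable[Sequence[dict]]: ...


class InMemoryStore:
    """Toy store over made-up paths; stands in for a concrete store."""

    def __init__(self, paths):
        self._paths = paths

    def list(self, prefix: str) -> Iterable[Sequence[dict]]:
        # Yield a single chunk of matching entries
        yield [{"path": p} for p in self._paths if p.startswith(prefix)]


class CountingWrapper:
    """Toy wrapper: delegates to any SupportsList and counts calls."""

    def __init__(self, inner: SupportsList):
        self._inner = inner
        self.calls = 0

    def list(self, prefix: str) -> Iterable[Sequence[dict]]:
        self.calls += 1
        return self._inner.list(prefix)


def count_files(store: SupportsList, prefix: str) -> int:
    # Depends only on the method, not on the concrete store class
    return sum(len(chunk) for chunk in store.list(prefix))


wrapped = CountingWrapper(InMemoryStore(["a/1.nc", "a/2.nc", "b/3.nc"]))
print(count_files(wrapped, "a/"))
```

A function written against a concrete store class, by contrast, would have no reason to accept `CountingWrapper`, which is the situation the warning describes for top-level functions.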

## Finding Files Matching a Pattern

When you need files matching specific criteria (e.g., all files from year 2100), use `glob`:

```python exec="on" source="above" session="find" result="code"
from obspec_utils import glob

# Find all NetCDF files for year 2100
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"))
print(f"Found {len(paths)} files for 2100:")
for path in paths[:5]:
    print(f" {path.split('/')[-1]}")
```

### Pattern Syntax

| Pattern | Matches | Example |
|---------|---------|---------|
| `*` | Any characters in one segment | `*_2100.nc` matches any model for 2100 |
| `**` | Any number of segments | `data/**/*.nc` matches all .nc files recursively |
| `?` | Exactly one character | `*_209?.nc` matches 2090-2099 |
| `[abc]` | Any character in set | `*_209[012].nc` matches 2090, 2091, 2092 |
| `[a-z]` | Any character in range | `*_209[0-5].nc` matches 2090-2095 |
| `[!abc]` | Any character NOT in set | `*_209[!9].nc` excludes 2099 |
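
To get a feel for the single-segment wildcards without touching a bucket, the patterns can be tried on plain filenames with Python's standard-library `fnmatch` module, which uses the same `*`, `?`, `[abc]`, `[a-z]`, and `[!abc]` notation. This is only a local sketch: the filenames are made up, `matching_years` is a hypothetical helper, and `fnmatch` has no `**` and does not treat `/` specially, so it is not a stand-in for how `glob` itself is implemented.

```python
from fnmatch import fnmatchcase

# Made-up filenames for illustration; not real objects in any bucket
names = [f"tasmax_model_{year}.nc" for year in (2089, 2090, 2091, 2095, 2099, 2100)]


def matching_years(pattern):
    """Return the years of the sample filenames that match `pattern`."""
    # n[-7:-3] is the four-digit year just before the ".nc" suffix
    return [int(n[-7:-3]) for n in names if fnmatchcase(n, pattern)]


for pattern in ("*_2100.nc", "*_209?.nc", "*_209[012].nc", "*_209[0-5].nc", "*_209[!9].nc"):
    print(f"{pattern:16} -> {matching_years(pattern)}")
```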

### More Pattern Examples

```python exec="on" source="above" session="find" result="code"
# Match a full decade of years (2090-2099) using ?
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209?.nc"))
print(f"Years 2090-2099: {len(paths)} files")
for p in paths[-4:]:  # Show the last 4 results
    print(f" {p.split('/')[-1]}")

# Match specific years using a character range
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209[5-9].nc"))
print(f"\nYears 2095-2099: {len(paths)} files")
for p in paths:
    print(f" {p.split('/')[-1]}")
```

## Getting File Sizes and Dates

To get metadata (size, last modified time) along with paths, use `glob_objects`:

```python exec="on" source="above" session="find" result="code"
from obspec_utils import glob_objects

# Get metadata for matching files
objects = list(glob_objects(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"))

# Calculate total size
total_bytes = sum(obj["size"] for obj in objects)
print(f"Total: {total_bytes / 1e9:.2f} GB across {len(objects)} files")

# Show details for a few files
print("\nSample files:")
for obj in objects[:3]:
    print(f" {obj['path'].split('/')[-1]}")
    print(f" Size: {obj['size'] / 1e6:.1f} MB")
    print(f" Modified: {obj['last_modified'].date()}")
```

## Improving Performance

Listing files in cloud storage requires network requests. The more files the server needs to enumerate, the slower the operation. Here's how to keep searches fast.

### Use Specific Prefixes

The `glob` function automatically extracts the longest literal prefix from your pattern to minimize the number of files the server must enumerate:

| Pattern | Server lists from | Files enumerated |
|---------|-------------------|------------------|
| `data/2024/january/*.nc` | `data/2024/january/` | Only January files |
| `data/2024/*/*.nc` | `data/2024/` | All of 2024 |
| `data/**/*.nc` | `data/` | Everything under data/ |
| `**/*.nc` | (root) | Entire bucket |
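
One way to picture the prefix extraction is the sketch below. It is an assumption about the general technique, not the actual `obspec_utils` implementation: `literal_prefix` is a hypothetical helper that keeps leading path segments until it reaches one containing a wildcard character, which reproduces the "Server lists from" column above.

```python
WILDCARD_CHARS = set("*?[")


def literal_prefix(pattern):
    """Hypothetical helper: leading segments of `pattern` with no wildcards."""
    literal_segments = []
    for segment in pattern.split("/"):
        if WILDCARD_CHARS & set(segment):
            break  # first segment with a wildcard ends the literal prefix
        literal_segments.append(segment)
    else:
        # No wildcard anywhere: the whole pattern is a literal path
        return pattern
    return "".join(segment + "/" for segment in literal_segments)


for pattern in (
    "data/2024/january/*.nc",
    "data/2024/*/*.nc",
    "data/**/*.nc",
    "**/*.nc",
):
    print(f"{pattern:24} -> {literal_prefix(pattern)!r}")
```

An empty result corresponds to listing from the bucket root, which is why a leading `**` is the most expensive case.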

Move literal path segments before wildcards when possible:

```python
# Slower: wildcard early means listing more files
glob(store, "NEX-GDDP/**/tasmax/**/v1.0/*_2100.nc")

# Faster: specific prefix narrows the listing
glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc")
```

### Process Results Lazily

Both `glob` and `glob_objects` return iterators, so you can process results as they arrive without loading all paths into memory:

```python exec="on" source="above" session="find" result="code"
# Stop after finding 3 files (doesn't load all results)
count = 0
for path in glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"):
    print(f"Found: {path.split('/')[-1]}")
    count += 1
    if count >= 3:
        break
```
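
The same early exit can be written with `itertools.islice`, which stops pulling from an iterator once it has enough items. The sketch below uses a stand-in generator, `fake_glob` (hypothetical, with made-up paths), so that it runs without network access; with a real store, a call such as `glob(store, pattern)` would take its place.

```python
from itertools import islice

produced = []  # records which paths the generator actually produced


def fake_glob():
    """Hypothetical stand-in for glob(): lazily yields made-up paths."""
    for year in range(2090, 2101):
        path = f"v1.0/example_{year}.nc"
        produced.append(path)
        yield path


# Take only the first three results; the generator is not run to completion
first_three = list(islice(fake_glob(), 3))
print(first_three)
print(f"Paths produced by the generator: {len(produced)}")
```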

## Async Usage

For async contexts, use `glob_async` and `glob_objects_async`:

```python exec="on" source="above" session="find" result="code"
import asyncio
from obspec_utils import glob_async

async def find_recent_years():
    paths = []
    async for path in glob_async(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209?.nc"):
        paths.append(path)
    return paths

paths = asyncio.run(find_recent_years())
print(f"Found {len(paths)} files asynchronously")
```
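
The explicit `async for` loop above can also be written as an async comprehension. The sketch below swaps in a stand-in async generator, `fake_glob_async` (hypothetical, with made-up paths), so it runs without a store; it is meant only to show the collection pattern, not the behaviour of `glob_async` itself.

```python
import asyncio


async def fake_glob_async():
    """Hypothetical stand-in for glob_async(): yields made-up paths."""
    for year in range(2090, 2100):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield f"v1.0/example_{year}.nc"


async def collect_paths():
    # Async comprehension: equivalent to appending inside `async for`
    return [path async for path in fake_glob_async()]


collected = asyncio.run(collect_paths())
print(f"Collected {len(collected)} paths")
```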

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ nav:
   - "index.md"
   - "User Guide":
     - "Opening Data with Xarray": "user-guide/opening-data-with-xarray.md"
+    - "Finding files on cloud object storage": "user-guide/finding-files.md"
   - "API":
     - Glob: "api/glob.md"
     - Protocols: "api/protocols.md"
