# Finding Files on Cloud Storage

This guide shows how to discover and list files stored in cloud object storage.

## Listing Files in a Directory

To see what files exist in a specific location, use the store's `list()` method with a prefix:

```python exec="on" source="above" session="find" result="code"
from obstore.store import S3Store

# Access public AWS Open Data
store = S3Store(
    bucket="nasanex",
    aws_region="us-west-2",
    skip_signature=True,
)

# List files in a specific directory
prefix = "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
files = []
for chunk in store.list(prefix=prefix):
    files.extend(chunk)

print(f"Found {len(files)} files in {prefix}")
print("\nFirst 5 files:")
for f in files[:5]:
    print(f"  {f['path'].split('/')[-1]}")
```

!!! warning "Call methods on the store, not `obstore` top-level functions"
    When using `obspec_utils` wrappers like `CachingReadableStore`, call methods
    directly on the store (e.g., `store.list()`) rather than passing the store to
    `obstore` functions (e.g., `obstore.list(store)`). The wrappers implement the
    `obspec` protocol, which decouples them from any specific store class. The
    `obstore` top-level functions, by contrast, are tied to the concrete store
    classes that `obstore` provides, so they will not work with the `obspec`-based
    wrappers from `obspec-utils`.
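Why does the method-call style work with any wrapper? Because only the method's presence matters, not the object's concrete type. The sketch below illustrates this with a hypothetical `FakeListableStore` class (not part of any library) that implements just a paged `list()` method:

```python
class FakeListableStore:
    """Hypothetical stand-in implementing only a paged list() method."""

    def __init__(self, paths):
        self._paths = paths

    def list(self, prefix=""):
        # Yield one chunk of matching entries, mimicking a paged listing
        yield [{"path": p} for p in self._paths if p.startswith(prefix)]


def count_files(store, prefix=""):
    # Works with any store-like object: only the list() method is assumed
    return sum(len(chunk) for chunk in store.list(prefix=prefix))


store = FakeListableStore(["data/a.nc", "data/b.nc", "logs/x.txt"])
print(count_files(store, prefix="data/"))  # → 2
```

A function written this way accepts real stores and protocol-based wrappers alike, which is exactly the decoupling the warning above describes.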

## Finding Files Matching a Pattern

When you need files matching specific criteria (e.g., all files from year 2100), use `glob`:

```python exec="on" source="above" session="find" result="code"
from obspec_utils import glob

# Find all NetCDF files for year 2100
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"))
print(f"Found {len(paths)} files for 2100:")
for path in paths[:5]:
    print(f"  {path.split('/')[-1]}")
```

### Pattern Syntax

| Pattern | Matches | Example |
|---------|---------|---------|
| `*` | Any characters in one segment | `*_2100.nc` matches any model for 2100 |
| `**` | Any number of segments | `data/**/*.nc` matches all .nc files recursively |
| `?` | Exactly one character | `*_209?.nc` matches 2090-2099 |
| `[abc]` | Any character in set | `*_209[012].nc` matches 2090, 2091, 2092 |
| `[a-z]` | Any character in range | `*_209[0-5].nc` matches 2090-2095 |
| `[!abc]` | Any character NOT in set | `*_209[!9].nc` excludes 2099 |

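For single-segment patterns, these rules follow shell-style matching, so you can experiment locally with Python's standard-library `fnmatch` module before issuing network requests. One caveat: `fnmatch`'s `*` also crosses `/` boundaries, unlike segment-aware path globbing, so test bare filenames rather than full paths:

```python
from fnmatch import fnmatchcase

# Sample filenames to test patterns against (hypothetical model/year names)
names = ["inmcm4_2090.nc", "inmcm4_2095.nc", "inmcm4_2099.nc", "inmcm4_2100.nc"]

print([n for n in names if fnmatchcase(n, "*_209?.nc")])      # 2090, 2095, 2099
print([n for n in names if fnmatchcase(n, "*_209[0-5].nc")])  # 2090, 2095
print([n for n in names if fnmatchcase(n, "*_209[!9].nc")])   # 2090, 2095
print([n for n in names if fnmatchcase(n, "*_2100.nc")])      # 2100
```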
### More Pattern Examples

```python exec="on" source="above" session="find" result="code"
# Match a range of years (2090-2099) using ?
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209?.nc"))
print(f"Years 2090-2099: {len(paths)} files")
for p in paths[-4:]:  # Show last 4 (2096-2099)
    print(f"  {p.split('/')[-1]}")

# Match specific years using a character range
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209[5-9].nc"))
print(f"\nYears 2095-2099: {len(paths)} files")
for p in paths:
    print(f"  {p.split('/')[-1]}")
```

| 76 | + |
| 77 | +## Getting File Sizes and Dates |
| 78 | + |
| 79 | +To get metadata (size, last modified time) along with paths, use `glob_objects`: |
| 80 | + |
| 81 | +```python exec="on" source="above" session="find" result="code" |
| 82 | +from obspec_utils import glob_objects |
| 83 | + |
| 84 | +# Get metadata for matching files |
| 85 | +objects = list(glob_objects(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc")) |
| 86 | + |
| 87 | +# Calculate total size |
| 88 | +total_bytes = sum(obj["size"] for obj in objects) |
| 89 | +print(f"Total: {total_bytes / 1e9:.2f} GB across {len(objects)} files") |
| 90 | + |
| 91 | +# Show details for a few files |
| 92 | +print(f"\nSample files:") |
| 93 | +for obj in objects[:3]: |
| 94 | + print(f" {obj['path'].split('/')[-1]}") |
| 95 | + print(f" Size: {obj['size'] / 1e6:.1f} MB") |
| 96 | + print(f" Modified: {obj['last_modified'].date()}") |
| 97 | +``` |
| 98 | + |
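Because each result is a plain dict, ordinary Python tools work on the metadata. The sketch below uses hypothetical sample records (same shape as the dicts above) to find the largest and most recently modified files:

```python
from datetime import datetime, timezone

# Hypothetical metadata records shaped like glob_objects results
objects = [
    {"path": "v1.0/ACCESS1-0_2100.nc", "size": 240_000_000,
     "last_modified": datetime(2015, 6, 1, tzinfo=timezone.utc)},
    {"path": "v1.0/inmcm4_2100.nc", "size": 180_000_000,
     "last_modified": datetime(2015, 8, 15, tzinfo=timezone.utc)},
    {"path": "v1.0/MIROC5_2100.nc", "size": 310_000_000,
     "last_modified": datetime(2015, 3, 20, tzinfo=timezone.utc)},
]

# Pick out the largest and the newest entries by their metadata fields
largest = max(objects, key=lambda o: o["size"])
newest = max(objects, key=lambda o: o["last_modified"])
print(f"Largest: {largest['path']} ({largest['size'] / 1e6:.0f} MB)")
print(f"Newest:  {newest['path']} ({newest['last_modified'].date()})")
```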
## Improving Performance

Listing files in cloud storage requires network requests. The more files the server needs to enumerate, the slower the operation. Here's how to keep searches fast.

### Use Specific Prefixes

The `glob` function automatically extracts the longest literal prefix from your pattern to minimize the files the server must enumerate:

| Pattern | Server lists from | Files enumerated |
|---------|-------------------|------------------|
| `data/2024/january/*.nc` | `data/2024/january/` | Only January files |
| `data/2024/*/*.nc` | `data/2024/` | All of 2024 |
| `data/**/*.nc` | `data/` | Everything under data/ |
| `**/*.nc` | (root) | Entire bucket |
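A simplified sketch of this prefix extraction is shown below. The `literal_prefix` helper is hypothetical, for illustration only; the actual logic inside `glob` may differ:

```python
def literal_prefix(pattern: str) -> str:
    """Return the directory prefix before the first wildcard segment.

    A simplified sketch of the optimization described in the table above.
    """
    parts = []
    for segment in pattern.split("/"):
        # Stop at the first segment containing a glob metacharacter
        if any(ch in segment for ch in "*?["):
            break
        parts.append(segment)
    return "/".join(parts) + "/" if parts else ""


print(literal_prefix("data/2024/january/*.nc"))  # data/2024/january/
print(literal_prefix("data/2024/*/*.nc"))        # data/2024/
print(literal_prefix("**/*.nc"))                 # (empty: lists from root)
```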

Move literal path segments before wildcards when possible:

```python
# Slower: an early wildcard means listing many more files
glob(store, "NEX-GDDP/**/tasmax/**/v1.0/*_2100.nc")

# Faster: a specific prefix narrows the listing
glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc")
```

### Process Results Lazily

Both `glob` and `glob_objects` return iterators, so you can process results as they arrive without loading all paths into memory:

```python exec="on" source="above" session="find" result="code"
# Stop after finding 3 files (doesn't load all results)
count = 0
for path in glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"):
    print(f"Found: {path.split('/')[-1]}")
    count += 1
    if count >= 3:
        break
```
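`itertools.islice` expresses the same "take the first N" pattern without a manual counter. The sketch below uses a hypothetical stand-in generator (`fake_glob`, not part of any library) so it runs without network access:

```python
from itertools import islice


def fake_glob():
    """Hypothetical stand-in mimicking glob's lazy, one-at-a-time iteration."""
    for year in range(2006, 2101):
        yield f"v1.0/tasmax_inmcm4_{year}.nc"


# Take just the first 3 results; the generator is never fully consumed
first_three = list(islice(fake_glob(), 3))
print(first_three)
```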

## Async Usage

For async contexts, use `glob_async` and `glob_objects_async`:

```python exec="on" source="above" session="find" result="code"
import asyncio
from obspec_utils import glob_async

async def find_recent_years():
    paths = []
    async for path in glob_async(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209?.nc"):
        paths.append(path)
    return paths

paths = asyncio.run(find_recent_years())
print(f"Found {len(paths)} files asynchronously")
```
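The main benefit of the async variants is running several searches concurrently with `asyncio.gather`. The sketch below demonstrates the pattern with a hypothetical stand-in async generator (`fake_glob_async`, not part of any library) so it runs offline:

```python
import asyncio


async def fake_glob_async(year):
    """Hypothetical stand-in for glob_async: yields paths for one year."""
    for model in ["inmcm4", "MIROC5"]:
        yield f"v1.0/tasmax_{model}_{year}.nc"


async def collect(year):
    # Drain one async iterator into a list, as in the example above
    return [path async for path in fake_glob_async(year)]


async def main():
    # Run several searches concurrently; gather preserves input order
    results = await asyncio.gather(*(collect(y) for y in (2098, 2099, 2100)))
    return [p for batch in results for p in batch]


paths = asyncio.run(main())
print(f"Collected {len(paths)} paths from 3 concurrent searches")  # 6 paths
```

With real `glob_async` calls, each search's network round-trips can overlap instead of running back to back.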