Update API documentation for new ArrivalOrder batch_size parameter
- Update all examples to use new ArrivalOrder(batch_size=X) syntax
- Add comprehensive memory formula with row size calculation
- Remove backward compatibility references (batch_size is new in this PR)
- Include performance characteristics and use case recommendations
- Provide clear guidance on TaskOrder vs ArrivalOrder memory behavior
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
mkdocs/docs/api.md (32 additions, 32 deletions)
```python
for buf in tbl.scan().to_arrow_batch_reader():
    print(f"Buffer contains {len(buf)} rows")
```

By default, each file's batches are materialized in memory before being yielded (`TaskOrder()`). For large files that may exceed available memory, use `ArrivalOrder()` to yield batches as they are produced without materializing entire files:

```python
from pyiceberg.table import ArrivalOrder

for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder()):
    print(f"Buffer contains {len(buf)} rows")
```

For maximum throughput, tune `concurrent_streams` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:

```python
from pyiceberg.table import ArrivalOrder

for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(concurrent_streams=4)):
    print(f"Buffer contains {len(buf)} rows")
```
| Configuration | Batch ordering across files | Batch ordering within a file |
|---|---|---|
| `TaskOrder()` (default) | Batches grouped by file, in task submission order | Row order |
| `ArrivalOrder()` | Interleaved across files (no grouping guarantee) | Row order within each file |

The `limit` parameter is enforced correctly regardless of configuration.
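The two modes can be illustrated with a toy simulation in plain Python (no PyIceberg required). The file and batch labels are invented for illustration, and the round-robin interleaving is only a stand-in for real arrival order, which depends on I/O timing:

```python
from itertools import chain

# Toy model: each "file" produces an ordered list of batches.
files = {
    "f1": ["f1-b0", "f1-b1", "f1-b2"],
    "f2": ["f2-b0", "f2-b1"],
}

def task_order(files):
    # TaskOrder: every batch of one file is yielded before the next file starts.
    return list(chain.from_iterable(files.values()))

def arrival_order(files):
    # ArrivalOrder (illustrative): batches interleave across files,
    # but within each file they still appear in row order.
    iters = [iter(batches) for batches in files.values()]
    out = []
    while iters:
        for it in list(iters):
            try:
                out.append(next(it))
            except StopIteration:
                iters.remove(it)
    return out

print(task_order(files))     # ['f1-b0', 'f1-b1', 'f1-b2', 'f2-b0', 'f2-b1']
print(arrival_order(files))  # ['f1-b0', 'f2-b0', 'f1-b1', 'f2-b1', 'f1-b2']
```

Note that in both modes the `f1` batches appear in row order; only the grouping across files differs.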
**Which configuration should I use?**

| Use case | Recommended config |
|---|---|
| Small tables, simple queries | Default — no extra args needed |
| Large tables, maximum throughput with bounded memory | `order=ArrivalOrder(concurrent_streams=N)` — tune N to balance throughput vs memory |
| Fine-grained memory control | `order=ArrivalOrder(concurrent_streams=N, batch_size=M, max_buffered_batches=K)` — tune all parameters |
**Memory usage and performance characteristics:**

- **TaskOrder (default)**: Uses full file materialization. Each file is loaded entirely into memory before yielding batches. Memory usage depends on file sizes.
- **ArrivalOrder**: Uses streaming with controlled memory usage. Memory is bounded by the batch buffering mechanism:
    - `concurrent_streams`: Number of files read in parallel (default: 8)
    - `batch_size`: Number of rows per batch (default: 131,072, can be set via the `ArrivalOrder` constructor)
    - `max_buffered_batches`: Internal buffering parameter (default: 16, can be tuned for advanced use cases)
    - Average row size depends on your schema and data; multiply the parameters above by it to estimate bytes.

**Note:** `ArrivalOrder()` yields batches in arrival order (interleaved across files when `concurrent_streams > 1`). For deterministic file ordering, use the default `TaskOrder()` mode. The `batch_size` parameter in `ArrivalOrder` controls streaming memory usage, while `TaskOrder` uses full file materialization regardless of batch size; it is usually an advanced tuning knob — the PyArrow default of 131,072 rows works well for most workloads.
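Putting the parameters above together gives a back-of-the-envelope upper bound on buffered memory: `concurrent_streams × max_buffered_batches × batch_size × average row size`. This is a sketch under assumptions (the reader's exact buffer accounting may differ, and the 100-byte average row size is made up for illustration):

```python
def estimate_buffered_bytes(concurrent_streams=8, max_buffered_batches=16,
                            batch_size=131_072, avg_row_bytes=100):
    # Worst case: every stream keeps its buffer full of maximum-size batches.
    return concurrent_streams * max_buffered_batches * batch_size * avg_row_bytes

# Defaults with ~100-byte rows bound buffered memory at roughly 1.56 GiB.
print(f"{estimate_buffered_bytes() / 2**30:.2f} GiB")  # → 1.56 GiB
```

Halving `concurrent_streams` halves the bound, which is the throughput-versus-memory trade-off described in the table above.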
To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:
```python
table.scan(
    ...
).to_arrow_batch_reader()
```

When using `concurrent_streams > 1`, batches from different files may be interleaved. Within each file, batches are always in row order. See the ordering semantics table in the [Apache Arrow section](#apache-arrow) above for details.