Commit 84adcfa

sumedhsakdeo and claude committed
Update API documentation for new ArrivalOrder batch_size parameter
- Update all examples to use new ArrivalOrder(batch_size=X) syntax
- Add comprehensive memory formula with row size calculation
- Remove backward compatibility references (batch_size is new in this PR)
- Include performance characteristics and use case recommendations
- Provide clear guidance on TaskOrder vs ArrivalOrder memory behavior

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 432cd81 commit 84adcfa

File tree

1 file changed (+32 −32 lines)


mkdocs/docs/api.md

Lines changed: 32 additions & 32 deletions
@@ -355,28 +355,21 @@ for buf in tbl.scan().to_arrow_batch_reader():
     print(f"Buffer contains {len(buf)} rows")
 ```
 
-You can control the number of rows per batch using the `batch_size` parameter:
-
-```python
-for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000):
-    print(f"Buffer contains {len(buf)} rows")
-```
-
 By default, each file's batches are materialized in memory before being yielded (`TaskOrder()`). For large files that may exceed available memory, use `ArrivalOrder()` to yield batches as they are produced without materializing entire files:
 
 ```python
 from pyiceberg.table import ArrivalOrder
 
-for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(), batch_size=1000):
+for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder()):
     print(f"Buffer contains {len(buf)} rows")
 ```
 
-For maximum throughput, use `concurrent_streams` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:
+For maximum throughput, tune `concurrent_streams` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:
 
 ```python
 from pyiceberg.table import ArrivalOrder
 
-for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(concurrent_streams=4), batch_size=1000):
+for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(concurrent_streams=4)):
     print(f"Buffer contains {len(buf)} rows")
 ```

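The two yielding strategies described in the hunk above can be illustrated with a small pure-Python sketch. This is a hypothetical simplification (interleaving is modeled as simple round-robin over lists of batches), not PyIceberg's actual implementation:

```python
# Illustrative sketch only: models the TaskOrder vs ArrivalOrder yielding
# strategies with plain generators, not PyIceberg's real machinery.
from typing import Iterator, List, Sequence


def task_order(files: Sequence[List[str]]) -> Iterator[str]:
    # Each file is fully materialized, then its batches are yielded
    # in task submission order (file by file).
    for f in files:
        yield from list(f)


def arrival_order(files: Sequence[List[str]]) -> Iterator[str]:
    # Batches interleave across files (modeled here as round-robin),
    # but stay in row order within each file.
    streams = [iter(f) for f in files]
    active = list(range(len(streams)))
    while active:
        for i in list(active):
            try:
                yield next(streams[i])
            except StopIteration:
                active.remove(i)


files = [["a1", "a2"], ["b1", "b2"]]
print(list(task_order(files)))     # ['a1', 'a2', 'b1', 'b2']
print(list(arrival_order(files)))  # ['a1', 'b1', 'a2', 'b2']
```

Note how both keep per-file row order, but only `task_order` guarantees that all batches from one file appear before any batch of the next.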
@@ -387,19 +380,35 @@ for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(concurrent_stream
 | `TaskOrder()` (default) | Batches grouped by file, in task submission order | Row order |
 | `ArrivalOrder()` | Interleaved across files (no grouping guarantee) | Row order within each file |
 
-Within each file, batch ordering always follows row order. The `limit` parameter is enforced correctly regardless of configuration.
+The `limit` parameter is enforced correctly regardless of configuration.
 
 **Which configuration should I use?**
 
 | Use case | Recommended config |
 |---|---|
 | Small tables, simple queries | Default — no extra args needed |
-| Large tables, memory-constrained | `order=ArrivalOrder()` — one file at a time, minimal memory |
-| Maximum throughput with bounded memory | `order=ArrivalOrder(concurrent_streams=N)` — tune N to balance throughput vs memory |
-| Fine-grained memory control | `order=ArrivalOrder(concurrent_streams=N, max_buffered_batches=M)` — tune both parameters |
-| Fine-grained batch control | Add `batch_size=N` to any of the above |
+| Large tables, maximum throughput with bounded memory | `order=ArrivalOrder(concurrent_streams=N)` — tune N to balance throughput vs memory |
+| Fine-grained memory control | `order=ArrivalOrder(concurrent_streams=N, batch_size=M, max_buffered_batches=K)` — tune all parameters |
+
+**Memory usage and performance characteristics:**
+
+- **TaskOrder (default)**: Uses full file materialization. Each file is loaded entirely into memory before yielding batches. Memory usage depends on file sizes.
+- **ArrivalOrder**: Uses streaming with controlled memory usage. Memory is bounded by the batch buffering mechanism.
+
+**Memory formula for ArrivalOrder:**
+
+```text
+Peak memory ≈ concurrent_streams × batch_size × max_buffered_batches × (average row size in bytes)
+```
+
+Where:
 
-**Note:** `ArrivalOrder()` yields batches in arrival order (interleaved across files when `concurrent_streams > 1`). For deterministic file ordering, use the default `TaskOrder()` mode. `batch_size` is usually an advanced tuning knob — the PyArrow default of 131,072 rows works well for most workloads.
+- `concurrent_streams`: Number of files read in parallel (default: 8)
+- `batch_size`: Number of rows per batch (default: 131,072; can be set via the `ArrivalOrder` constructor)
+- `max_buffered_batches`: Internal buffering parameter (default: 16; can be tuned for advanced use cases)
+- Average row size depends on your schema and data; multiply by it to estimate bytes.
+
+**Note:** `ArrivalOrder()` yields batches in arrival order (interleaved across files when `concurrent_streams > 1`). For deterministic file ordering, use the default `TaskOrder()` mode. The `batch_size` parameter in `ArrivalOrder` controls streaming memory usage, while `TaskOrder` uses full file materialization regardless of batch size.
 
 To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:
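The memory formula added in the hunk above translates into a simple back-of-the-envelope estimator. A minimal sketch, assuming the defaults stated in the diff (`concurrent_streams=8`, `batch_size=131072`, `max_buffered_batches=16`); the helper name is mine, not part of the API:

```python
# Sketch: estimate peak memory for an ArrivalOrder scan using the
# formula from the docs. Defaults are the ones the docs state.
def estimate_peak_memory_bytes(
    avg_row_size_bytes: float,
    concurrent_streams: int = 8,
    batch_size: int = 131_072,
    max_buffered_batches: int = 16,
) -> float:
    """Peak memory ≈ streams × batch_size × buffered batches × row size."""
    return concurrent_streams * batch_size * max_buffered_batches * avg_row_size_bytes


# Example: 100-byte rows with the defaults give roughly a 1.6 GiB upper bound.
estimate = estimate_peak_memory_bytes(avg_row_size_bytes=100)
print(f"{estimate / 2**30:.1f} GiB")  # 1.6 GiB
```

Plugging in smaller values for `concurrent_streams`, `batch_size`, or `max_buffered_batches` shows how each knob trades throughput for a lower memory ceiling.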

@@ -1665,38 +1674,29 @@ table.scan(
 ).to_arrow_batch_reader()
 ```
 
-The `batch_size` parameter controls the maximum number of rows per RecordBatch (default is PyArrow's 131,072 rows):
-
-```python
-table.scan(
-    row_filter=GreaterThanOrEqual("trip_distance", 10.0),
-    selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
-).to_arrow_batch_reader(batch_size=1000)
-```
-
-Use `order=ScanOrder.ARRIVAL` to avoid materializing entire files in memory. This yields batches as they are produced by PyArrow, one file at a time:
+To avoid materializing entire files in memory, use `ArrivalOrder`, which yields batches as they are produced by PyArrow:
 
 ```python
-from pyiceberg.table import ScanOrder
+from pyiceberg.table import ArrivalOrder
 
 table.scan(
     row_filter=GreaterThanOrEqual("trip_distance", 10.0),
     selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
-).to_arrow_batch_reader(order=ScanOrder.ARRIVAL)
+).to_arrow_batch_reader(order=ArrivalOrder())
 ```
 
-For concurrent file reads with arrival order, use `concurrent_files`. Note that batch ordering across files is not guaranteed:
+For concurrent file reads with arrival order, use `concurrent_streams`. Note that batch ordering across files is not guaranteed:
 
 ```python
-from pyiceberg.table import ScanOrder
+from pyiceberg.table import ArrivalOrder
 
 table.scan(
     row_filter=GreaterThanOrEqual("trip_distance", 10.0),
     selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
-).to_arrow_batch_reader(order=ScanOrder.ARRIVAL, concurrent_files=4)
+).to_arrow_batch_reader(order=ArrivalOrder(concurrent_streams=16))
 ```
 
-When using `concurrent_files > 1`, batches from different files may be interleaved. Within each file, batches are always in row order. See the ordering semantics table in the [Apache Arrow section](#apache-arrow) above for details.
+When using `concurrent_streams > 1`, batches from different files may be interleaved. Within each file, batches are always in row order. See the ordering semantics table in the [Apache Arrow section](#apache-arrow) above for details.
 
 ### Pandas

0 commit comments
