Summary
We observe very slow compaction times across a number of different tables, and we'd like to fix it.
Desired Result
It should take 1-3 seconds to compact 22MB, yet I consistently see at least 60 seconds.
Examples
We ran compaction on a table where it only needed to compact 5 files totaling 315KB of data. According to the metrics from the library, the batch fetch duration was 85,514ms. The total job time (which we measure ourselves) was 93,092ms.
Here is a summary of 10 other tables compacted over the course of an hour.
| Datetime | Input Files | Input Size (bytes) | Fetch Time | Total Time |
|---|---|---|---|---|
| 2026-01-09T18:18:43 | 8 | 153,606 | 12,777ms | 37,641ms |
| 2026-01-09T18:16:31 | 231 | 11,704,639,665 | 1,324,772ms | 536,109ms |
| 2026-01-09T18:13:56 | 21 | 255,614,268 | 95,795ms | 110,706ms |
| 2026-01-09T18:10:59 | 8 | 11,359,652 | 74,555ms | 95,978ms |
| 2026-01-09T18:07:26 | 8 | 144,834 | 32,370ms | 50,014ms |
| 2026-01-09T17:49:55 | 5 | 2,408,452 | 684,052ms | 692,934ms |
| 2026-01-09T17:48:06 | 5 | 1,781,963 | 241,175ms | 251,488ms |
| 2026-01-09T17:34:13 | 5 | 17,838,726 | 140,258ms | 152,979ms |
| 2026-01-09T17:33:56 | 9 | 199,839 | 91,041ms | 102,836ms |
| 2026-01-09T17:21:05 | 5 | 1,465,254 | 416,035ms | 421,645ms |
Reproduction
I can reproduce long times from my laptop on my home network (600Mbps download, 38Mbps upload). I edited the REST catalog example in the iceberg-compaction library to run against my table in R2 Data Catalog. I generated fake data tables with varying numbers of files and rows per file, and loaded them into test tables. I tried adding updates/deletes, and also tried a greater number of columns. I ran with configurations similar to our production system. I hard-coded a timer in datafusion/mod.rs::rewrite_files that times the reads, and confirmed that read time dominates the total time locally.
| Test | Columns | Files | Rows per file | Size per file | Deletes? | Time to Read | Time to compact |
|---|---|---|---|---|---|---|---|
| Test 1 | 4 | 5 | 6000 | 68KB | No | 9s | 17-18s |
| Test 2 | 8 | 5 | 6000 | 138KB | No | 12s | 21-22s |
| Test 3 | 8 | 10 | 100,000 | 2.3MB | No | 34-35s | 60s |
| Test 4 | 8 | 5 | 6000 | 23MB / 138MB | Yes | 26s | 47s |
| Test 5 | 8 | 10 | 100,000 | 2.3MB | Yes | 43s | 75s |
| Test 6 | 32 | 5 | 1000 | 125KB | No | 35s | 44s |
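The hard-coded timer mentioned above was nothing more elaborate than a stopwatch around the read phase. A minimal sketch of the pattern (the function below is an illustrative stand-in, not the real `rewrite_files` code):

```rust
use std::time::Instant;

// Illustrative stand-in for the read phase inside rewrite_files;
// the real code reads the input Parquet files via the ArrowReader.
fn read_input_files(num_files: usize) -> usize {
    (0..num_files).count() // simulate touching each file
}

fn main() {
    let start = Instant::now();
    let files_read = read_input_files(5);
    let read_time = start.elapsed();
    println!("read {} files in {:?}", files_read, read_time);
}
```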
Root Cause
Iceberg-rust's `ArrowReader` only reads one byte range at a time, and each range read issues a separate HTTP request to the same object. It reads one range per column in the table, so as the number of columns grows, so does the number of HTTP requests needed to read the whole file.
Algorithmically, the number of HTTP requests is O(N × M), where N is the number of columns and M is the number of files to read. Prefetching whole files reduces the number of HTTP requests to O(M).
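To make the request-count argument concrete, here is a tiny model (a sketch, not iceberg-rust code) of how the HTTP request count scales under the two strategies:

```rust
/// One HTTP request per column per file (current per-range reads): O(N × M).
fn requests_per_range(columns: u64, files: u64) -> u64 {
    columns * files
}

/// One HTTP request per file (prefetching the whole file): O(M).
fn requests_prefetch(_columns: u64, files: u64) -> u64 {
    files
}

fn main() {
    // Test 6 from the reproduction table: 32 columns, 5 files.
    assert_eq!(requests_per_range(32, 5), 160);
    assert_eq!(requests_prefetch(32, 5), 5);
    println!(
        "per-range: {} requests, prefetch: {} requests",
        requests_per_range(32, 5),
        requests_prefetch(32, 5)
    );
}
```

This matches the observed pattern in the tests: Test 6, with the most columns, saw the largest speedup from prefetching.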
Proof: Static analysis of code
`ArrowFileReader` only implements `get_bytes` on the `ArrowAsyncReader` trait (iceberg-rust/arrow/reader.rs). `ArrowAsyncReader::get_byte_ranges` therefore falls back to sequentially calling `get_bytes` (arrow-rs/parquet/arrow/async_reader.rs).
Proof: The data from local testing (see previous section).
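The fallback behavior described above can be sketched with a trait default method. This is illustrative, std-only Rust modeling the shape of the code paths (the real traits live in arrow-rs and iceberg-rust and are async):

```rust
use std::ops::Range;

// Illustrative stand-in for the reader trait: `get_byte_ranges` has a
// default implementation that just loops over `get_bytes`, issuing one
// request per range.
trait ByteReader {
    fn get_bytes(&mut self, range: Range<u64>) -> Vec<u8>;

    fn get_byte_ranges(&mut self, ranges: Vec<Range<u64>>) -> Vec<Vec<u8>> {
        // Sequential fallback: one call (one HTTP round trip) per range.
        ranges.into_iter().map(|r| self.get_bytes(r)).collect()
    }
}

struct CountingReader {
    requests: usize,
}

impl ByteReader for CountingReader {
    fn get_bytes(&mut self, range: Range<u64>) -> Vec<u8> {
        self.requests += 1; // each range costs a round trip
        vec![0; (range.end - range.start) as usize]
    }
}

fn main() {
    let mut reader = CountingReader { requests: 0 };
    // Reading 8 column ranges from one file costs 8 requests.
    let ranges: Vec<Range<u64>> = (0..8).map(|i| i * 100..(i + 1) * 100).collect();
    reader.get_byte_ranges(ranges);
    assert_eq!(reader.requests, 8);
    println!("requests issued: {}", reader.requests);
}
```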
Solutions
I prototyped two solutions, and both work: each eliminates the repeated per-column reads. I'll summarize both approaches below. I recommend Option A for now because it is cheaper to implement and performs better for compaction; Option B could still be submitted upstream as a general-purpose improvement.
- (Recommended) Option A: Pre-fetch files in iceberg-compaction
- (+) Aligns with compaction use case, where we know we always want to download 100% of the file.
- (+) 2x faster than Option B in our tests
- (+) Less effort than committing a fix to iceberg-rust
- (-) Reads more data into memory at once (although we were already expecting to load the full file into memory anyway).
- Option B: Implement `get_file_ranges` on `ArrowFileReader` to actually read multiple ranges at once
- (+) This is a general-purpose improvement that can speed up queries as well
- (-) Twice as slow as Option A in our tests (probably because it reads once for the metadata and then again for the data).
- (-) More effort: the fix must be implemented and landed in iceberg-rust before it can be adopted here.
How to implement Option A: Pre-fetch Files in Iceberg-compaction
I added the optimization in iceberg_file_task_scan.rs. The `ArrowReaderBuilder` does not offer a way to pipe in bytes or files directly; the only interface for providing files is the iceberg-rust `FileIO` object (link to code), which you get from the table. So I use the `FileIO` instance passed into the function to fetch the file, then create a new `FileIO` backed by a `Storage::Memory` implementation and load the file into that in-memory storage.
This may not be the original intent of the `Storage::Memory` enum variant, but it fits our purpose exactly. Instead of having the `ArrowReader` trigger calls to a remote object store through a `FileIO` with `Storage::S3`, we want to hand it bytes already held in memory, without calling the object store at all. `Storage::Memory` is a perfect fit for that.
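The prefetch-then-serve-from-memory pattern can be sketched with a std-only in-memory store standing in for iceberg-rust's `FileIO` with `Storage::Memory` (the type and method names below are illustrative, not the real iceberg-rust API):

```rust
use std::collections::HashMap;

// Illustrative stand-in for a FileIO backed by Storage::Memory.
struct MemoryFileIo {
    files: HashMap<String, Vec<u8>>,
}

impl MemoryFileIo {
    fn new() -> Self {
        Self { files: HashMap::new() }
    }

    // Stage prefetched bytes under the same path the reader will ask for.
    fn write(&mut self, path: &str, bytes: Vec<u8>) {
        self.files.insert(path.to_string(), bytes);
    }

    // The reader now gets its bytes from memory: no object-store round trips.
    fn read(&self, path: &str) -> Option<&[u8]> {
        self.files.get(path).map(|v| v.as_slice())
    }
}

// Hypothetical remote fetch: in the real code this is the FileIO instance
// passed into the scan, fetching the whole object in one request.
fn fetch_from_object_store(_path: &str) -> Vec<u8> {
    vec![1, 2, 3, 4] // stand-in for the full Parquet file
}

fn main() {
    let path = "s3://bucket/data/file.parquet";

    // 1. Prefetch the entire file with a single request.
    let bytes = fetch_from_object_store(path);

    // 2. Stage it in the in-memory FileIO under the same path.
    let mut memory_io = MemoryFileIo::new();
    memory_io.write(path, bytes);

    // 3. Hand the in-memory FileIO to the reader; subsequent per-column
    //    range reads are served from RAM instead of HTTP.
    assert_eq!(memory_io.read(path), Some(&[1u8, 2, 3, 4][..]));
    println!("prefetched {} bytes", memory_io.read(path).unwrap().len());
}
```

The key design point is that the staged path matches the path the `ArrowReader` will request, so no other code needs to change.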
Performance Tests
Note that I skipped Test 1 when evaluating the improvement.
| Test | Columns | Files | Rows per file | Size per file | Deletes? | Old Read | Old Total | Improved Read | Improved Total | Read Speed Up |
|---|---|---|---|---|---|---|---|---|---|---|
| Test 1 | 4 | 5 | 6000 | 68KB | No | 9s | 17-18s | n/a | n/a | n/a |
| Test 2 | 8 | 5 | 6000 | 138KB | No | 12s | 21-22s | 2.3s | 12s | 5x |
| Test 3 | 8 | 10 | 100,000 | 2.3MB | No | 34-35s | 60s | 9s | 29s | 4x |
| Test 4 | 8 | 5 | 6000 | 23MB / 138MB | Yes | 26s | 47s | 3s | 24s | 8x |
| Test 5 | 8 | 10 | 100,000 | 2.3MB | Yes | 43s | 75s | 9.5s | 42s | 4x |
| Test 6 | 32 | 5 | 1000 | 125KB | No | 35s | 44s | 2.2s | 12s | 16x |