
[Feature][DataLoader] Add log_duration timer to DataLoaderSplit batch loading #481

Closed
ShreyeshArangath wants to merge 2 commits into linkedin:main from ShreyeshArangath:feat/add-read-metrics

Conversation

@ShreyeshArangath
Collaborator

@ShreyeshArangath ShreyeshArangath commented Feb 27, 2026

Summary

Wrap the per-split record batch reading in __iter__ with log_duration to give visibility into per-split read latency. The file path from the FileScanTask is included in the log message for context.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

… loading

Wrap the per-split record batch reading in __iter__ with log_duration
to give visibility into per-split read latency. Logs the file path
from the FileScanTask for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves
Collaborator

The yield from inside the with log_duration(...) block means the timer won't stop until the generator is fully consumed. Since yield suspends execution, whatever the caller does between consuming batches (writing to disk, running inference, etc.) gets included in the logged duration.

To time just the batch reads, the yield needs to happen outside the timed section:

it = iter(arrow_scan.to_record_batches([self._file_scan_task]))
while True:
    with log_duration(logger, "record_batch %s", self._file_scan_task.file.file_path):
        batch = next(it, None)
    if batch is None:
        break
    yield batch

Move yield outside the log_duration block so the timer covers only
the next() call that fetches each record batch. Previously yield from
inside the with block kept the timer open while the caller processed
batches (writing to disk, running inference, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
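The difference the fix makes can be demonstrated with a toy generator and timer. This is a sketch: log_duration here is a stand-in that records elapsed times into a list instead of logging, and the sleeps simulate caller-side work between batches:

```python
import time
from contextlib import contextmanager

# Stand-in for log_duration that records durations instead of logging.
durations = []

@contextmanager
def log_duration(label):
    start = time.monotonic()
    try:
        yield
    finally:
        durations.append((label, time.monotonic() - start))

def batches():
    # Toy stand-in for the record batch iterator.
    for i in range(3):
        yield i

def timed_inside():
    # The timer stays open while the caller processes each batch,
    # because yield suspends execution inside the with block.
    with log_duration("inside"):
        yield from batches()

def timed_outside():
    # The timer covers only the next() call that fetches each batch.
    it = iter(batches())
    while True:
        with log_duration("outside"):
            batch = next(it, None)
        if batch is None:
            break
        yield batch

for b in timed_inside():
    time.sleep(0.05)  # simulate caller work (writing, inference, etc.)
inside_total = sum(d for label, d in durations if label == "inside")

durations.clear()
for b in timed_outside():
    time.sleep(0.05)  # same caller work, but excluded from the timer
outside_total = sum(d for label, d in durations if label == "outside")

# inside_total includes the caller's sleeps; outside_total does not.
print(inside_total, outside_total)
```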
@ShreyeshArangath ShreyeshArangath marked this pull request as ready for review February 28, 2026 00:01
row_filter=ctx.row_filter,
)
yield from arrow_scan.to_record_batches([self._file_scan_task])
it = iter(arrow_scan.to_record_batches([self._file_scan_task]))
Collaborator

Actually, I don't think this will work, since it materializes all batches up front. The actually expensive line is arrow_scan.to_record_batches, right?

Collaborator

The current approach would work if the read were truly streamed. Maybe we just add another timer around creating the iterator, to cover both cases and know where all the time is spent. WDYT?
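The two-timer suggestion could look like the following sketch, with a fake to_record_batches standing in for the PyIceberg call. All helper names here are illustrative, not the project's actual code:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dataloader")

# Hypothetical stand-in for the project's log_duration helper.
@contextmanager
def log_duration(log, fmt, *args):
    start = time.monotonic()
    try:
        yield
    finally:
        log.info(fmt + " took %.6f s", *args, time.monotonic() - start)

def fake_to_record_batches(tasks):
    # Stand-in for arrow_scan.to_record_batches: pretend each batch
    # read costs a little time.
    for i in range(3):
        time.sleep(0.01)
        yield {"batch": i, "task": tasks[0]}

def read_split(file_path="s3://bucket/table/data/part-0.parquet"):
    # Timer 1: iterator creation. If to_record_batches materializes
    # everything up front, the cost shows up here.
    with log_duration(logger, "to_record_batches %s", file_path):
        it = iter(fake_to_record_batches([file_path]))
    # Timer 2: each batch fetch. Under true streaming, per-batch read
    # latency shows up here instead.
    while True:
        with log_duration(logger, "record_batch %s", file_path):
            batch = next(it, None)
        if batch is None:
            break
        yield batch

result = list(read_split())
print(len(result))
```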

Collaborator Author

It's going to be switched over by this PR, though? Right now, without that change, this timing is meaningless IMO because of the materialization behavior.

Collaborator Author

I think I'm going to close this PR and add it back once we have true streaming in pyiceberg.

Collaborator

@robreeves robreeves Feb 28, 2026

If we time both, then we don't need to worry about knowing how PyIceberg does things. Timing both places is cheap. I think we should add it now. It will help in early performance testing and show us how much apache/iceberg-python#3046 improves performance.

