
[Feature][DataLoader] Add log_duration timer to DataLoaderSplit batch loading #481

Closed
ShreyeshArangath wants to merge 2 commits into linkedin:main from ShreyeshArangath:feat/add-read-metrics

Conversation

@ShreyeshArangath
Collaborator

@ShreyeshArangath ShreyeshArangath commented Feb 27, 2026

Summary

Wrap the per-split record batch reading in __iter__ with log_duration to give visibility into per-split read latency. The file path from the FileScanTask is included in the log message for context.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

… loading

Wrap the per-split record batch reading in __iter__ with log_duration
to give visibility into per-split read latency. Logs the file path
from the FileScanTask for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves
Collaborator

The yield from inside the with log_duration(...) block means the timer won't stop until the generator is fully consumed. Since yield suspends execution, whatever the caller does between consuming batches (writing to disk, running inference, etc.) gets included in the logged duration.

To time just the batch reads, the yield needs to happen outside the timed section:

it = iter(arrow_scan.to_record_batches([self._file_scan_task]))
while True:
    with log_duration(logger, "record_batch %s", self._file_scan_task.file.file_path):
        batch = next(it, None)
    if batch is None:
        break
    yield batch

Move yield outside the log_duration block so the timer covers only
the next() call that fetches each record batch. Previously yield from
inside the with block kept the timer open while the caller processed
batches (writing to disk, running inference, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
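The difference the fix makes can be demonstrated with a toy generator and timer. This is a sketch: log_duration here is a stand-in that records elapsed times into a list instead of logging, and the sleeps simulate caller-side work between batches:

```python
import time
from contextlib import contextmanager

# Stand-in for log_duration that records durations instead of logging.
durations = []

@contextmanager
def log_duration(label):
    start = time.monotonic()
    try:
        yield
    finally:
        durations.append((label, time.monotonic() - start))

def batches():
    # Toy stand-in for the record batch iterator.
    for i in range(3):
        yield i

def timed_inside():
    # The timer stays open while the caller processes each batch,
    # because yield suspends execution inside the with block.
    with log_duration("inside"):
        yield from batches()

def timed_outside():
    # The timer covers only the next() call that fetches each batch.
    it = iter(batches())
    while True:
        with log_duration("outside"):
            batch = next(it, None)
        if batch is None:
            break
        yield batch

for b in timed_inside():
    time.sleep(0.05)  # simulate caller work (writing, inference, etc.)
inside_total = sum(d for label, d in durations if label == "inside")

durations.clear()
for b in timed_outside():
    time.sleep(0.05)  # same caller work, but excluded from the timer
outside_total = sum(d for label, d in durations if label == "outside")

# inside_total includes the caller's sleeps; outside_total does not.
print(inside_total, outside_total)
```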
@ShreyeshArangath ShreyeshArangath marked this pull request as ready for review February 28, 2026 00:01
row_filter=ctx.row_filter,
)
yield from arrow_scan.to_record_batches([self._file_scan_task])
it = iter(arrow_scan.to_record_batches([self._file_scan_task]))
Collaborator

Actually, I don't think this will work, since it materializes all batches up front. The actually expensive line is arrow_scan.to_record_batches, right?

Collaborator

The current approach would work if the read were truly streamed. Maybe we just add another timer around creating the iterator, to cover both cases and know where all the time is spent. WDYT?
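The two-timer suggestion could look like the following sketch, with a fake to_record_batches standing in for the PyIceberg call. All helper names here are illustrative, not the project's actual code:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dataloader")

# Hypothetical stand-in for the project's log_duration helper.
@contextmanager
def log_duration(log, fmt, *args):
    start = time.monotonic()
    try:
        yield
    finally:
        log.info(fmt + " took %.6f s", *args, time.monotonic() - start)

def fake_to_record_batches(tasks):
    # Stand-in for arrow_scan.to_record_batches: pretend each batch
    # read costs a little time.
    for i in range(3):
        time.sleep(0.01)
        yield {"batch": i, "task": tasks[0]}

def read_split(file_path="s3://bucket/table/data/part-0.parquet"):
    # Timer 1: iterator creation. If to_record_batches materializes
    # everything up front, the cost shows up here.
    with log_duration(logger, "to_record_batches %s", file_path):
        it = iter(fake_to_record_batches([file_path]))
    # Timer 2: each batch fetch. Under true streaming, per-batch read
    # latency shows up here instead.
    while True:
        with log_duration(logger, "record_batch %s", file_path):
            batch = next(it, None)
        if batch is None:
            break
        yield batch

result = list(read_split())
print(len(result))
```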

Collaborator Author

It's going to be switched over by this PR, though? Right now, without that change, this timing is meaningless IMO because of the materialization behavior.

Collaborator Author

I think I'm going to close this PR and add it back once we have true streaming in pyiceberg.

Collaborator

@robreeves robreeves Feb 28, 2026

If we time both, then we don't need to worry about knowing how PyIceberg does things. Timing both places is cheap. I think we should add it now. It will help in early performance testing and show us how much apache/iceberg-python#3046 improves performance.

