[Feature][DataLoader] Add log_duration timer to DataLoaderSplit batch loading #481
ShreyeshArangath wants to merge 2 commits into linkedin:main
Conversation
Wrap the per-split record batch reading in __iter__ with log_duration to give visibility into per-split read latency. Logs the file path from the FileScanTask for context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
To time just the batch reads, the loop could be:

```python
it = iter(arrow_scan.to_record_batches([self._file_scan_task]))
while True:
    with log_duration(logger, "record_batch %s", self._file_scan_task.file.file_path):
        batch = next(it, None)
    if batch is None:
        break
    yield batch
```
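The PR doesn't show `log_duration` itself; a minimal sketch of such a helper, assuming it is a plain timing context manager (the signature and log format here are illustrative, not the project's actual implementation):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def log_duration(log, msg, *args):
    # Start the clock when the block is entered ...
    start = time.monotonic()
    try:
        yield
    finally:
        # ... and log the elapsed time when it exits, even on error.
        log.info(msg + " took %.3fs", *args, time.monotonic() - start)
```

Used as `with log_duration(logger, "record_batch %s", file_path): ...`, it emits one log line per timed block with the elapsed seconds appended.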
Move yield outside the log_duration block so the timer covers only the next() call that fetches each record batch. Previously yield from inside the with block kept the timer open while the caller processed batches (writing to disk, running inference, etc.). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```diff
     row_filter=ctx.row_filter,
 )
-yield from arrow_scan.to_record_batches([self._file_scan_task])
+it = iter(arrow_scan.to_record_batches([self._file_scan_task]))
```
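Why moving the `yield` outside the `with` block matters can be shown with a toy generator (a hedged sketch; the `timer` helper below is illustrative, not the project's `log_duration`). A `yield` suspends the generator inside the `with` block, so the timer stays open while the consumer does its own work:

```python
import time
from contextlib import contextmanager

durations = []  # elapsed time recorded by each timer use

@contextmanager
def timer():
    start = time.monotonic()
    yield
    durations.append(time.monotonic() - start)

def yield_inside():
    # yield suspends the generator *inside* the with block, so the
    # timer stays open while the consumer processes the item.
    for i in range(3):
        with timer():
            yield i

def yield_outside():
    # the with block closes before yield, so only the "fetch" is timed.
    for i in range(3):
        with timer():
            item = i
        yield item

def consume(gen):
    durations.clear()
    for _ in gen():
        time.sleep(0.02)  # simulate downstream work (writing, inference, ...)
    return max(durations)
```

`consume(yield_inside)` reports roughly the consumer's 20 ms of work per item, while `consume(yield_outside)` reports only the near-zero fetch time.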
Actually, I don't think this will work since it materializes all batches. The actual expensive line is `arrow_scan.to_record_batches`, right?
The current way works if it were truly streamed. Maybe we just add another timer around creating the iterator, to cover both cases and know where all the time is spent. WDYT?
It's going to be switched over by this PR, though? Right now, without that change, this timing is meaningless IMO because of the materialization behavior.
I think I'm going to close this PR and add it back once we have true streaming in pyiceberg
If we time both, then we don't need to worry about knowing how PyIceberg does things. Timing both places is cheap. I think we should add it now; it will help in early performance testing and show us how much apache/iceberg-python#3046 improves performance.
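The two-timer suggestion above could look like the following sketch. `iter_batches` stands in for the split's `__iter__`, and the `to_record_batches` callable stands in for `arrow_scan.to_record_batches`; the `log_duration` helper shown here is an assumed minimal timing context manager:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("dataloader_split")

@contextmanager
def log_duration(log, msg, *args):  # assumed simple timing helper
    start = time.monotonic()
    try:
        yield
    finally:
        log.info(msg + " took %.3fs", *args, time.monotonic() - start)

def iter_batches(to_record_batches, file_path):
    # Timer 1: iterator construction -- a materializing reader pays here.
    with log_duration(logger, "to_record_batches %s", file_path):
        it = iter(to_record_batches())
    # Timer 2: each fetch -- a truly streaming reader pays here.
    while True:
        with log_duration(logger, "record_batch %s", file_path):
            batch = next(it, None)
        if batch is None:
            break
        yield batch
```

Whichever way PyIceberg behaves, one of the two timers captures the read cost, so comparing them before and after apache/iceberg-python#3046 shows where the time moved.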
Summary
Wrap the per-split record batch reading in __iter__ with log_duration to give visibility into per-split read latency. Logs the file path from the FileScanTask.
Changes
Testing Done
Additional Information