Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor _skip_sync for Improved Readability & Performance #3343

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ishansurdi
Copy link

What is the purpose of the change?

This pull request improves the _skip_sync method to enhance readability, maintainability, and correctness when checking for sync markers.
Instead of using seek(-SYNC_SIZE, 1), the method now stores the original file position using tell(), ensuring the read position can be reset explicitly when sync markers do not match.

Verifying this change

This change is a trivial rework / code cleanup without affecting the core logic of the reader.

This change is already covered by existing tests, such as:

  • Existing unit tests that verify proper sync marker detection and file position resetting.

Additionally, the change was manually verified by:

  • Running integration tests with AVRO files of various sizes.
  • Logging file positions before and after sync checks to confirm correct seek behavior.

Documentation

  • Does this pull request introduce a new feature? No
  • How is this change documented? Not applicable (code comments included for clarity).

@KalleOlaviNiemitalo
Copy link
Contributor

I doubt the claim of improved performance. This change may cost an additional system call in the success case.

@ishansurdi
Copy link
Author

Thank you for the review and for pointing this out!

The change was primarily aimed at improving code readability and maintainability by using tell() instead of seek(-SYNC_SIZE, 1). The concern about an additional system call is valid—tell() does introduce an extra operation in the success case.

However, here’s why I believe the tradeoff is acceptable:

  1. Clarity & Maintainability: Using tell() makes the intent of the function more explicit. The previous approach using seek(-SYNC_SIZE, 1) relies on relative positioning, which can be less intuitive to new maintainers.
  2. Performance Benchmark Results: I tested both approaches, and the results show that tell() is actually faster than seek(-SYNC_SIZE, 1) in this scenario:
    • tell() method: 0.0572 seconds
    • seek(-SYNC_SIZE, 1) method: 0.0777 seconds
      This suggests that the performance impact of tell() is not a concern and may even be beneficial.
  3. Consistency with Other File Operations: Many file-processing functions in AVRO, particularly in DataFileReader.java and DataFileReader12.java, already use tell() for tracking file positions. This means my approach aligns with existing AVRO patterns.

Given these findings, I believe using tell() is both efficient and consistent. However, if needed, I can provide additional benchmarks on larger datasets to validate these results further. Let me know if that would be helpful!

@KalleOlaviNiemitalo
Copy link
Contributor

In the previous code, if self.reader.read(SYNC_SIZE) returned fewer than SYNC_SIZE bytes, then the subsequent self.reader.seek(-SYNC_SIZE, 1) seeked too far back; it didn't stop at the original position. So this pull request at least makes the code more correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants