@vigimite vigimite commented Jan 2, 2026

Which issue does this PR close?

Rationale for this change

The calculate_range function creates invalid byte ranges (where start > end) when reading single-line CSV/JSON files that are split into multiple partitions. This causes an error like:

ObjectStore(Generic { store: "S3", source: Inconsistent { start: 1149247, end: 1149246 } })

When find_first_newline doesn't find a newline (single-line file), it returns the remaining file length. This causes start + start_delta to exceed end + end_delta, creating an invalid range. The current check only handles range.start == range.end, not range.start > range.end.
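To make the arithmetic concrete, here is a minimal in-memory sketch of the boundary adjustment before the fix. It is not the DataFusion code: `first_newline_offset` and `adjusted_range_unfixed` are hypothetical stand-ins that work on a byte slice instead of an object store, and the sketch assumes the newline scan starts one byte before the partition boundary, which is consistent with the off-by-one in the error above.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `find_first_newline`: offset from `from` to the
/// first b'\n', or the remaining length when the data has no newline.
fn first_newline_offset(data: &[u8], from: usize) -> usize {
    data[from..]
        .iter()
        .position(|&b| b == b'\n')
        .unwrap_or(data.len() - from)
}

/// Simplified model of the boundary adjustment WITHOUT the fix.
/// Assumes `start <= end` and `0 < end <= data.len()`.
fn adjusted_range_unfixed(data: &[u8], start: usize, end: usize) -> Range<usize> {
    // Assumption: scanning begins one byte before each boundary.
    let start_delta = if start != 0 {
        first_newline_offset(data, start - 1)
    } else {
        0
    };
    let end_delta = if end != data.len() {
        first_newline_offset(data, end - 1)
    } else {
        0
    };
    (start + start_delta)..(end + end_delta)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn single_line_second_partition_is_inverted() {
        let data = br#"{"id": 1, "name": "single line"}"#;
        let mid = data.len() / 2;
        let r = adjusted_range_unfixed(data, mid, data.len());
        // No newline exists, so start is pushed one byte past the file end.
        assert_eq!(r.start, r.end + 1);
    }
}
```

In this model the second partition of a single-line file ends up with a start one byte past its end, the same shape as the `Inconsistent { start, end }` error reported by the object store.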

What changes are included in this PR?

  1. Added an early termination check after computing start_delta: if the first newline after start is beyond the partition boundary (start + start_delta > end), return TerminateEarly since no complete records exist in this partition.

  2. Changed the final range validation from == to >= as a safety net for edge cases.

  3. Added a regression test that reproduces the bug with a single-line file split into partitions.
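The two checks can be sketched on the same kind of in-memory model. This is an illustration under stated assumptions rather than the merged diff: `RangeCalc`, `first_newline_offset`, and `calculate_range_sketch` are hypothetical names, the data is a byte slice rather than an object store, and the scan is assumed to start one byte before each boundary.

```rust
use std::ops::Range;

/// Hypothetical, simplified mirror of the two outcomes discussed in this PR.
#[derive(Debug, PartialEq)]
enum RangeCalc {
    Range(Range<usize>),
    TerminateEarly,
}

/// Hypothetical stand-in for `find_first_newline`: offset from `from` to the
/// first b'\n', or the remaining length when the data has no newline.
fn first_newline_offset(data: &[u8], from: usize) -> usize {
    data[from..]
        .iter()
        .position(|&b| b == b'\n')
        .unwrap_or(data.len() - from)
}

/// Sketch of the adjusted logic. Assumes `start <= end` and
/// `0 < end <= data.len()`.
fn calculate_range_sketch(data: &[u8], start: usize, end: usize) -> RangeCalc {
    let start_delta = if start != 0 {
        first_newline_offset(data, start - 1)
    } else {
        0
    };
    // Change 1: the first newline after `start` lies beyond this partition,
    // so no complete record begins here.
    if start + start_delta > end {
        return RangeCalc::TerminateEarly;
    }
    let end_delta = if end != data.len() {
        first_newline_offset(data, end - 1)
    } else {
        0
    };
    let range = (start + start_delta)..(end + end_delta);
    // Change 2: `>=` instead of `==`, as a safety net for inverted ranges.
    if range.start >= range.end {
        return RangeCalc::TerminateEarly;
    }
    RangeCalc::Range(range)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn single_line_second_partition_terminates_early() {
        let data = br#"{"id": 1, "name": "single line"}"#;
        let mid = data.len() / 2;
        assert_eq!(
            calculate_range_sketch(data, mid, data.len()),
            RangeCalc::TerminateEarly
        );
    }

    #[test]
    fn multi_line_partitions_still_align_to_newlines() {
        let data = b"ab\ncd\nef";
        assert_eq!(calculate_range_sketch(data, 0, 4), RangeCalc::Range(0..6));
        assert_eq!(calculate_range_sketch(data, 4, 8), RangeCalc::Range(6..8));
    }
}
```

The early return also skips the second newline scan for partitions that cannot contain a record start, and multi-line inputs in this model still split on newline boundaries as before.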

Are these changes tested?

Yes, added test_calculate_range_single_line_file which:

  • Creates a single-line JSON file that contains no newlines
  • Simulates partition 2 (middle to end of file)
  • Verifies that calculate_range returns TerminateEarly instead of an invalid range

Are there any user-facing changes?

No

@github-actions github-actions bot added the datasource Changes to the datasource crate label Jan 2, 2026

@martin-g martin-g left a comment

LGTM

@Jefffrey Jefffrey added this pull request to the merge queue Jan 4, 2026
Merged via the queue into apache:main with commit 7fde30a Jan 4, 2026
28 checks passed

Jefffrey commented Jan 4, 2026

Nice pickup, thanks @vigimite & @martin-g

Development

Successfully merging this pull request may close these issues.

calculate_range creates invalid byte ranges for single-line JSON files