@wjones127
Contributor

Summary

  • project_by_schema previously only handled field reordering for top-level columns and direct Struct fields, but not for fields nested inside List<Struct>, LargeList<Struct>, or FixedSizeList<Struct> types
  • Added project_array helper that recursively handles these nested list types
  • Added unit tests and an integration test with checked-in test data that reproduces the original issue
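
To illustrate the kind of recursion the summary describes, here is a minimal sketch on a toy data model. It does not use the Arrow or Lance types: `Value`, `Schema`, `project_value`, and `reorder_example` are hypothetical names introduced only for this illustration, and silently skipping fields that are absent from the value is an assumption of the sketch, not a statement about what `project_by_schema` does.

```rust
// Toy nested value; a stand-in for an Arrow array, not the real type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    // Named children, kept in storage order.
    Struct(Vec<(String, Value)>),
    // List items that all share one layout.
    List(Vec<Value>),
}

// Toy target schema; field order here is the order we want back.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Struct(Vec<(String, Schema)>),
    List(Box<Schema>),
}

// Hypothetical helper: reorder struct children to match `schema`,
// recursing through both structs and lists.
fn project_value(value: &Value, schema: &Schema) -> Value {
    match (value, schema) {
        (Value::Struct(children), Schema::Struct(fields)) => Value::Struct(
            fields
                .iter()
                .filter_map(|(name, field_schema)| {
                    // Assumption of this sketch: a field missing from the
                    // value is skipped rather than reported or null-filled.
                    children
                        .iter()
                        .find(|(child_name, _)| child_name == name)
                        .map(|(_, child)| (name.clone(), project_value(child, field_schema)))
                })
                .collect(),
        ),
        // The case the summary says was missing: descend into list items
        // instead of cloning the list unchanged.
        (Value::List(items), Schema::List(item_schema)) => Value::List(
            items
                .iter()
                .map(|item| project_value(item, item_schema))
                .collect(),
        ),
        // Leaves and mismatched shapes are cloned as-is.
        _ => value.clone(),
    }
}

// A list of structs stored as <c, b, a>, projected to <a, b, c>.
fn reorder_example() -> Value {
    let stored = Value::List(vec![Value::Struct(vec![
        ("c".to_string(), Value::Int(3)),
        ("b".to_string(), Value::Int(2)),
        ("a".to_string(), Value::Int(1)),
    ])]);
    let target = Schema::List(Box::new(Schema::Struct(vec![
        ("a".to_string(), Schema::Int),
        ("b".to_string(), Schema::Int),
        ("c".to_string(), Schema::Int),
    ])));
    project_value(&stored, &target)
}
```

Without the `List` arm, the stored list would fall through to the final arm and keep its inner fields in `<c, b, a>` order, which is the shape of the bug described above.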

Test plan

  • Unit tests in lance-arrow verify project_by_schema correctly reorders fields inside List<Struct>
  • Integration test in dataset_migrations.rs reads test data that triggers the bug scenario
  • Test data was created with fragment 0 having List<Struct<a,b,c>> and fragment 1 having reordered/missing inner struct fields
  • Verified that without the fix, reading fails with "Incorrect datatype for StructArray field expected List(Struct(...)) got List(Struct(...))"
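
The quoted error can look confusing because the expected and actual types print the same once the struct contents are elided. A minimal sketch of why, assuming the check is an order-sensitive equality on data types: `ToyType`, `short_name`, and `check_field` are hypothetical names for this illustration, and the `(...)` elision mirrors how the message is quoted above rather than how any library formats it.

```rust
// Toy data type; a stand-in for an Arrow DataType, not the real enum.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    // Field order is part of the type, so derived equality is order-sensitive.
    Struct(Vec<(String, ToyType)>),
    List(Box<ToyType>),
}

// Print a type with struct contents elided, as in the quoted message.
fn short_name(t: &ToyType) -> String {
    match t {
        ToyType::Int32 => "Int32".to_string(),
        ToyType::Struct(_) => "Struct(...)".to_string(),
        ToyType::List(inner) => format!("List({})", short_name(inner)),
    }
}

// Hypothetical validation step: reject a field whose type differs at all.
fn check_field(expected: &ToyType, actual: &ToyType) -> Result<(), String> {
    if expected == actual {
        Ok(())
    } else {
        Err(format!(
            "Incorrect datatype for StructArray field expected {} got {}",
            short_name(expected),
            short_name(actual)
        ))
    }
}
```

In this model, `List<Struct<a, b>>` and `List<Struct<b, a>>` compare unequal, yet both shorten to `List(Struct(...))`, so the message shows two identical-looking types.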

Fixes #5702

🤖 Generated with Claude Code

wjones127 and others added 2 commits January 12, 2026 13:37
Previously, project_by_schema only recursively handled direct Struct
fields. List<Struct>, LargeList<Struct>, and FixedSizeList<Struct>
types fell through to the default case which cloned them without
reordering inner struct fields.

This caused Arrow validation errors when reading fragments where fields
were stored out of order (scrambled `fields` array in DataFile metadata)
combined with schema evolution requiring null-filling.

Fixes lance-format#5702

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ance-format#5702)

Adds test data and integration test that reproduces the original bug:
- Fragment 0: List<Struct<a, b, c>> with all fields + "extra" column
- Fragment 1: List<Struct<c, b>> with reordered/missing inner struct fields

This combination of out-of-order field storage + schema evolution inside
the List<Struct> triggers project_by_schema to reorder fields. Before
the fix, this would fail with:
"Incorrect datatype for StructArray field expected List(Struct(...))
got List(Struct(...))"

Also adds a direct unit test in dataset_schema_evolution.rs that tests
the project_by_schema function with misordered List<Struct> fields.

Fixes lance-format#5702

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the bug (Something isn't working) label Jan 12, 2026
wjones127 and others added 2 commits January 12, 2026 14:54
- Move test data to version-specific directory (v1.0.1)
- Replace prints with assertions in datagen.py
- Add assertion that scanning fails with issue lance-format#5702 error
- Remove redundant test from dataset_schema_evolution.rs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Regenerated the test data using pylance 1.0.1 as intended.
Also fixed assertion to match actual error message format.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review January 12, 2026 23:34
@codecov
codecov bot commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 97.57282% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-arrow/src/lib.rs 97.57% 0 Missing and 5 partials ⚠️


Collaborator

@Xuanwo Xuanwo left a comment

Thank you for working on this!

@wjones127 wjones127 merged commit 6ecb573 into lance-format:main Jan 13, 2026
29 checks passed
@wjones127 wjones127 deleted the fix/project-list-struct-reorder branch January 13, 2026 14:58
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…ance-format#5703)

Labels: bug

Linked issue: project_by_schema does not reorder fields inside List<Struct> types

2 participants