
Contributor

@fangbo fangbo commented Jan 5, 2026

Problem Description

We have a dataset with 10,000 fragments. When Spark reads from the dataset with a filter, it creates 10,000 scan tasks, even though the filter could prune the vast majority of fragments. Most of these scan tasks read no data, yet the sheer number of tasks degrades the execution efficiency of Spark jobs.

Purpose of this PR

The filter contains a column that has an index. Using the index, the filter can prune the fragments that do not need to be scanned.

This PR adds a prune_fragments method to Dataset. The method prunes fragments based on the filter and the index, and returns the fragments that need to be scanned.
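
To illustrate the idea, here is a minimal, self-contained sketch of index-based fragment pruning. It is not the code in this PR and does not use the Lance API: ToyScalarIndex, EqFilter, the example helper, and this prune_fragments signature are all hypothetical, and the index is modeled as a plain in-memory map from a column value to the fragment ids that contain it. The sketch also assumes that fragments not covered by the index must still be scanned, since the index cannot rule them out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a scalar index: for each indexed value,
/// the ids of the fragments that contain at least one matching row.
struct ToyScalarIndex {
    fragments_by_value: BTreeMap<i64, BTreeSet<u64>>,
}

/// Hypothetical equality filter on the indexed column (`col = value`).
struct EqFilter {
    value: i64,
}

/// Returns the ids of the fragments that still need to be scanned.
///
/// Assumption: a fragment that is not covered by the index is always kept,
/// because the index cannot prove that it holds no matching rows.
fn prune_fragments(
    all_fragments: &[u64],
    indexed_fragments: &BTreeSet<u64>,
    index: &ToyScalarIndex,
    filter: &EqFilter,
) -> Vec<u64> {
    // Fragments the index reports as containing the filter value, if any.
    let matching = index.fragments_by_value.get(&filter.value);
    all_fragments
        .iter()
        .copied()
        .filter(|id| {
            !indexed_fragments.contains(id)
                || matching.map_or(false, |m| m.contains(id))
        })
        .collect()
}

/// Builds a toy dataset layout: 10,000 fragments, of which the last one
/// was appended after the index was built and is therefore unindexed.
fn example() -> (Vec<u64>, BTreeSet<u64>, ToyScalarIndex) {
    let all_fragments: Vec<u64> = (0..10_000).collect();
    let indexed: BTreeSet<u64> = (0..9_999).collect();
    let mut fragments_by_value = BTreeMap::new();
    fragments_by_value.insert(42, BTreeSet::from([3, 17]));
    fragments_by_value.insert(7, BTreeSet::from([5]));
    (all_fragments, indexed, ToyScalarIndex { fragments_by_value })
}
```

With the layout from example() and a filter value of 42, the sketch keeps fragments 3 and 17 plus the unindexed fragment 9,999, so a planner built on it could create 3 scan tasks instead of 10,000.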

@github-actions github-actions bot added the enhancement New feature or request label Jan 5, 2026
@fangbo fangbo marked this pull request as draft January 5, 2026 09:11
@fangbo fangbo changed the title from "feat: prune a list of fragments using scalar indices for a given filter" to "feat: prune fragments using scalar indices for a given filter" Jan 5, 2026

codecov bot commented Jan 5, 2026

Codecov Report

❌ Patch coverage is 91.22807% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/scanner.rs | 89.13% | 1 Missing and 4 partials ⚠️ |


@fangbo fangbo force-pushed the prune_fragments branch 2 times, most recently from 70a05d0 to 9d59e88 Compare January 7, 2026 06:02
@fangbo fangbo marked this pull request as ready for review January 7, 2026 06:07
Contributor Author

fangbo commented Jan 7, 2026

@jackye1995 @majin1102 @yanghua @wojiaodoubao @Jay-ju This PR is ready. Could you please review it? Thank you.

@fangbo fangbo force-pushed the prune_fragments branch 4 times, most recently from 4f0f9a8 to cc4195e Compare January 9, 2026 06:06
@fangbo fangbo force-pushed the prune_fragments branch 6 times, most recently from 21df368 to c360059 Compare January 14, 2026 02:22
@majin1102
Contributor

It seems we already attempted similar work in PR #4835. Maybe we could revisit it.
