
Conversation

@jtuglu1 (Contributor) commented Sep 29, 2025

WIP

Related to #4163, #4469

@github-actions (Contributor)

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@jtuglu1 jtuglu1 changed the title [WIP]: Create barebones plan_splits method [WIP]: Create plan_splits method Sep 29, 2025
@jackye1995 (Contributor) commented Oct 1, 2025

Thanks for working on this! The current implementation feels tied to the zone map index, and I think it should be implemented more generically.

I think we probably need to define the experience a bit more first. What I was thinking is that this can be a scanner-level feature, so that, similar to how users read data today:

```rust
let mut scanner = dataset.scan();
scanner.filter("i = 100").unwrap().project(&["i"]).unwrap();
scanner.try_into_batch().await.unwrap();
```

We will have something like:

```rust
let mut scanner = dataset.scan();
scanner.filter("i = 100").unwrap().project(&["i"]).unwrap();
scanner.plan_splits().await.unwrap();
```

This allows the query planning to go through the normal logical planning and optimization paths, and we can return a list of Splits based on some plan parameters.

The plan parameters could include things like the expected split size (so we can do head/tail bin-packing to combine zones), whether a split should contain only complete fragments (more useful for writes), and which indexes the planning should use (e.g. prefer a zone map over a btree if the column has both).

With this, we can then expose APIs like dataset.plan_splits() in rust and python accordingly.
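To make the shape concrete, here is a hypothetical Rust sketch of what the plan parameters and split type could look like. `SplitPlanOptions`, `Split`, and all field names are illustrative, not existing Lance APIs:

```rust
/// Illustrative plan parameters for split planning (not an existing Lance type).
#[derive(Debug, Clone)]
pub struct SplitPlanOptions {
    /// Target number of rows per split; zones are bin-packed toward this size.
    pub target_rows_per_split: u64,
    /// If true, a split never crosses a fragment boundary (useful for writes).
    pub complete_fragments_only: bool,
    /// Optional list of index names the planner may prefer (e.g. a zone map
    /// over a btree when a column has both). `None` means planner's choice.
    pub preferred_indexes: Option<Vec<String>>,
}

impl Default for SplitPlanOptions {
    fn default() -> Self {
        Self {
            target_rows_per_split: 1_048_576,
            complete_fragments_only: false,
            preferred_indexes: None,
        }
    }
}

/// One independently scannable unit of work returned by split planning.
#[derive(Debug, Clone, PartialEq)]
pub struct Split {
    /// Fragments this split reads from.
    pub fragment_ids: Vec<u64>,
    /// Pre-filter offset into those fragments.
    pub rows_to_skip: u64,
    /// Pre-filter limit.
    pub rows_to_take: u64,
}
```

Each worker would then rehydrate a `Split` into a scanner configured with the corresponding fragments and a pre-filter offset/limit.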

@westonpace I don't know if this aligns with what you think, let us know if you have better ideas.

Also cc @Jay-ju since this is related to #4735 (comment)

Also cc @steFaiz since this is related to #4000 (comment)

@westonpace (Member)

> @westonpace I don't know if this aligns with what you think, let us know if you have better ideas.

Yes, your description matches roughly how I would expect the process to go. I think we could add this capability to the scanner. We have create_plan but we could also have create_estimation_plan or something along those lines. I think there are a few criteria:

If we create a FilterPlan then there are a few different scenarios:

  • No filter - split evenly based on fragment metadata
  • Only index_query - Run the index query, split evenly based on the index search result
  • Only non-index filter - Use column statistics (which we don't really have yet) combined with DF's cardinality estimation
  • Both index query & non-index filter - Run index query and potentially refine with column statistics in the future

If we are going to assume we have zone maps then we can focus on the "only index_query" case. So it should just be creating the initial filter plan, grabbing the index_query if it exists, evaluating it, and then splitting up the index search result. The index search result is an allow list / block list of row addresses.
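As a minimal sketch of that last step, the function below chops an allow-list of half-open row-address ranges into splits of at most a target number of rows, cutting ranges where needed. `RowRange` and `split_allow_list` are hypothetical names, not Lance's actual implementation:

```rust
/// A contiguous run of row addresses selected by the index query
/// (half-open: covers `start..end`). Illustrative type, not a Lance API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowRange {
    pub start: u64,
    pub end: u64,
}

/// Split an allow-list of row ranges into splits containing at most
/// `rows_per_split` rows each, preserving address order.
pub fn split_allow_list(ranges: &[RowRange], rows_per_split: u64) -> Vec<Vec<RowRange>> {
    assert!(rows_per_split > 0);
    let mut splits = Vec::new();
    let mut current: Vec<RowRange> = Vec::new();
    let mut remaining = rows_per_split;
    for &r in ranges {
        let mut start = r.start;
        while start < r.end {
            // Take as many rows as fit in the current split.
            let take = remaining.min(r.end - start);
            current.push(RowRange { start, end: start + take });
            start += take;
            remaining -= take;
            if remaining == 0 {
                // Current split is full; start a new one.
                splits.push(std::mem::take(&mut current));
                remaining = rows_per_split;
            }
        }
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

A block list would first be inverted into an allow list over the dataset's address space before applying the same chopping.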

@westonpace (Member)

I also like the API you've provided. I think that each "split" returned by plan_splits should (at a minimum) have a "rows to skip" and a "rows to take" (e.g. limit/offset).

Probably another feature that is missing is the ability for the scanner to define a pre-filter limit/offset. Right now we only allow the post-filter limit/offset to be specified. A coarse pre-filter can be specified using fragment ids, but it would be nice to have something more fine-grained. The read node itself (filtered read) supports defining a pre-filter, so this shouldn't be too big of a task (just plumbing through the scanner and API).
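Assuming a known total row count, the per-split pre-filter (skip, take) pairs could be derived by dividing the count evenly; `even_skip_take` below is a hypothetical helper, not an existing scanner API:

```rust
/// Divide `total_rows` into `num_splits` (rows_to_skip, rows_to_take) pairs
/// of near-equal size, i.e. a pre-filter limit/offset per split.
/// Earlier splits absorb the remainder, so sizes differ by at most one row.
pub fn even_skip_take(total_rows: u64, num_splits: u64) -> Vec<(u64, u64)> {
    assert!(num_splits > 0);
    let base = total_rows / num_splits;
    let extra = total_rows % num_splits;
    let mut out = Vec::with_capacity(num_splits as usize);
    let mut skip = 0;
    for i in 0..num_splits {
        let take = base + if i < extra { 1 } else { 0 };
        out.push((skip, take));
        skip += take;
    }
    out
}
```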

@jtuglu1 (Contributor, Author) commented Oct 5, 2025

> If we are going to assume we have zone maps then we can focus on the "only index_query" case. So it should just be creating the initial filter plan, grabbing the index_query if it exists, evaluating it, and then splitting up the index search result. The index search result is an allow list / block list of row addresses.

@westonpace regarding this: did we still want to return zonemap statistic results to the caller?

The reason being, my understanding was that we'd want to provide these stats, say, in the absence of a scalar index, to help filter out address ranges, etc. Or is the point that this isn't needed, assuming the indices are evaluated during the planning stage?

@jackye1995 (Contributor)

> did we still want to return zonemap statistic results to the caller?

I think we can start with not returning that. We might want something like the residual (e.g. if there are some filters that were not fully applied in the planning phase and should be further evaluated), but we can always add that later.

@Jay-ju (Contributor) commented Oct 11, 2025

This is a very useful idea; I am facing the same problem. I have a few questions I'd like to ask:

  1. Why do we need to reconstruct the fragment? Is it because the original fragment object stores too much redundant data?

  2. Why use ZoneMapStatistics here? How should this statistical information be used when returned to the computing engine? Could we instead directly determine the size of the data to be scanned per fragment, then bin-pack fragments into evenly sized splits?

  3. In addition, I would like to propose a logical basis for packing fragments: calculate the approximate number of rows in each fragment through the index. Based on this, we can evenly distribute fragments of different sizes into groups (see "feat: add approx_count_rows method for efficient row counting with index selection", #4930).
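The grouping idea in point 3 can be sketched as a greedy balancing pass over (fragment id, approximate row count) pairs: sort descending by row count, then always place the next fragment into the currently lightest group. `pack_fragments` is purely illustrative and is not the #4930 implementation:

```rust
/// Greedily distribute fragments (id, approx row count) into `num_groups`
/// groups so that the total row count per group is roughly even.
/// Returns the fragment ids assigned to each group.
pub fn pack_fragments(mut frags: Vec<(u64, u64)>, num_groups: usize) -> Vec<Vec<u64>> {
    assert!(num_groups > 0);
    // Largest fragments first, so small fragments fill the gaps at the end.
    frags.sort_by(|a, b| b.1.cmp(&a.1));
    // Each group tracks (current total rows, assigned fragment ids).
    let mut groups: Vec<(u64, Vec<u64>)> = vec![(0, Vec::new()); num_groups];
    for (id, rows) in frags {
        // Place into the group with the smallest current total.
        let g = groups.iter_mut().min_by_key(|g| g.0).unwrap();
        g.0 += rows;
        g.1.push(id);
    }
    groups.into_iter().map(|(_, ids)| ids).collect()
}
```

This is the classic longest-processing-time heuristic; with approximate row counts from the index it gives near-even groups without scanning any data.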
