-
Notifications
You must be signed in to change notification settings - Fork 526
[WIP]: Create plan_splits method #4835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
|
Thanks for working on this! The current implementation feels tied to the zone map index, and I think it should be implemented more generically. I think we probably need to define the experience a bit more first. What I was thinking is that this can be a scanner level feature, so that similar to today users can do this to read data: let mut scanner = dataset.scan();
scanner.filter("i = 100").unwrap().project(&["i"]).unwrap();
scanner.try_into_batch().await.unwrap()We will have something like let mut scanner = dataset.scan();
scanner.filter("i = 100").unwrap().project(&["i"]).unwrap();
scanner.plan_splits().await.unwrap()This allows the query planning to go through the normal logical planning and optimization paths, and we can return a list of The plan parameters could be things like the expected split size (so we can do the head tail bin pack to combine zones), if the split should contain complete fragments (more useful for writes), what indexes should the planning use (e.g. prefer zone map over btree if the column has both). With this, we can then expose APIs like @westonpace I don't know if this aligns with what you think, let us know if you have better ideas. Also cc @Jay-ju since this is related to #4735 (comment) Also cc @steFaiz since this is related to #4000 (comment) |
Yes, your description matches roughly how I would expect the process to go. I think we could add this capability to the scanner. We have If we create a
If we are going to assume we have zone maps then we can focus on the "only index_query" case. So it should just be creating the initial filter plan, grabbing the index_query if it exists, evaluating it, and then splitting up the index search result. The index search result is an allow list / block list of row addresses. |
|
I also like the API you've provided I think that each "split" returned by Probably another feature that is missing is the ability for the scanner to define a pre-filter limit/offset. Right now we only allow the post-filter limit/offset to be specified. A course pre-filter can be specified using fragment ids but it would be nice to have something more fine-grained. The read node itself (filtered read) supports defining a pre-filter so this shouldn't be too big of a task (just plumbing through the scanner and API) |
@westonpace regarding this: did we still want to return zonemap statistic results to the caller? Reason being is my understanding was we'd want to provide these stats, say, in the absence of a scalar index which could help filter out address ranges, etc. Or is the point that this isn't needed assuming the indices are evaluated during planning stage? |
I think we can start with not returning that. We might want something like the residual (e.g. if there are some filters that were not fully applied in the planning phase and should be further evaluated), but we can always add that later. |
|
A very useful idea; I am also facing the same problem.
|
WIP
Related to #4163, #4469