feat: add filter support to get_fragments method #4735
Open
Jay-ju wants to merge 1 commit into lance-format:main from Jay-ju:get-fragments-filter-support
I don't quite understand; this means you are not really improving from a perf perspective? And it would be quite expensive to check every fragment against the filter. We have added zone maps now; I think we should select fragments (zones would be even better) to distribute based on the zone map index.
---
@jackye1995 Actually, this is a tradeoff. A large number of fragments causes problems. For example, when a dataset has more than 20,000 fragments, and because the dataset caches the manifest by default, there is significant overhead every time physical plan tasks are distributed.
The intent here is to filter down to the valid fragments first; the filter argument can still be left unset.
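A minimal sketch of the approach discussed in this PR, as I understand it from the thread. All class and function names below are hypothetical stand-ins, not the actual Lance API: the idea is to keep only fragments whose filtered row count is non-zero, so distributed engines do not schedule empty tasks.

```python
# Hypothetical stand-in for a Lance fragment; `rows` is a list of dicts
# standing in for the fragment's data.
class Fragment:
    def __init__(self, fragment_id, rows):
        self.fragment_id = fragment_id
        self._rows = rows

    def count_rows(self, predicate=None):
        # Stand-in for a filtered count_rows: an *exact* count, which in
        # the real system requires scanning the fragment.
        if predicate is None:
            return len(self._rows)
        return sum(1 for r in self._rows if predicate(r))


def get_fragments(fragments, predicate=None):
    """Return fragments; with a predicate, drop those with no matching rows."""
    if predicate is None:
        return fragments
    return [f for f in fragments if f.count_rows(predicate) > 0]


frags = [
    Fragment(0, [{"x": 1}, {"x": 5}]),
    Fragment(1, [{"x": 2}]),
    Fragment(2, [{"x": 9}, {"x": 10}]),
]
survivors = get_fragments(frags, lambda r: r["x"] >= 5)
print([f.fragment_id for f in survivors])  # -> [0, 2]
```

This is exactly the cost being debated below: the coordinator pays one exact scan per fragment just to decide whether to distribute it.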
---
oh I am all for adding this filter. To be more specific, I am saying: it seems expensive since you are running a `count_rows` per fragment. I think we can expose a coarse-grained `plan_fragments` API that plans the fragments to distribute based on a filter, and we can apply that filter once against a zone map or bloom filter index.
---
@jackye1995 What you mean here is that after the dataset creates a zone map index, `dataset.scanner` can use the index, but `fragment.scanner` cannot? So we need to make the fragment also use the index? Is my understanding correct?
---
`fragment.scanner` should also be able to use `ScalarIndexQuery`, but in my tests the time taken by the fragment's `count_rows` is indeed much worse than the dataset's `count_rows`. I guess it is due to multi-threaded operations.
---
Maybe we need to align on the goal here first. In my understanding, we want to use `get_fragments` with a filter because of use cases like https://github.com/lancedb/lance-ray/blame/main/lance_ray/io.py#L265. So what happens is that you (1) get the list of fragments to distribute, and then (2) process each fragment in a worker.
For step 1 you don't necessarily need an exact match; you just want to prune down the list of fragments that might match the filter. In many cases this can already discard a large portion of the fragments. Then in the worker you can do the actual filtering.
That is why I think using inexact indexes like the zone map index and bloom filter index is ideal here: we would be able to very quickly answer the question "which fragments might match the filter?" and then distribute the work. In each worker, you can push down the filter and do the exact scan filtering there. If there are exact indexes like btree or bitmap we can still use them; they could be slower but still okay.
If no index exists on the filtered columns, we can just return the full list of fragments, which means all fragments might match the filter.
Compared to that, what is done in the current code is very heavyweight. You are doing an exact count, which actually triggers a scan for each fragment. You are also doing that for each fragment separately, which is even worse. I don't think that really achieves the goal of "reduce empty task overhead in distributed engines (Ray/Daft/Spark)", because you are doing much more compute in the coordinator.
I guess from an implementation perspective, `get_fragments` should do what you have implemented. But for the purpose of reducing empty task overhead, we might want to create another method like `prune_fragments`. What do you think?
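The two-phase flow described in this comment can be sketched as follows. Everything here is a hypothetical model, not the real Lance implementation: phase 1 on the coordinator prunes with inexact zone-map (min/max) statistics, so it may keep false positives but never drops a true match; phase 2 in each worker applies the exact filter during its scan.

```python
# Hypothetical fragment table: zone-map (min, max) stats plus the
# actual row values (here just integers for one column).
fragments = {
    0: {"stats": (1, 4),   "rows": [1, 2, 4]},
    1: {"stats": (5, 9),   "rows": [5, 7]},
    2: {"stats": (10, 20), "rows": [12, 18]},
}


def coordinator_prune(frags, lo, hi):
    # Phase 1 (inexact): keep any fragment whose [min, max] range
    # overlaps [lo, hi]. No data is scanned.
    return [i for i, f in frags.items()
            if f["stats"][0] <= hi and f["stats"][1] >= lo]


def worker_scan(frag, lo, hi):
    # Phase 2 (exact): filter the actual rows inside the worker.
    return [v for v in frag["rows"] if lo <= v <= hi]


candidates = coordinator_prune(fragments, 6, 12)  # fragment 0 is pruned
result = [v for i in candidates for v in worker_scan(fragments[i], 6, 12)]
print(candidates)  # -> [1, 2]
print(result)      # -> [7, 12]
```

Note that fragment 1 is a partial match (only one of its rows qualifies): the coordinator keeps it cheaply, and the worker does the exact filtering, which is the division of labor proposed above.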
---
Hmm, we can agree on the goal. My scenario is indeed that the driver uses a coarse-grained index and the worker executes the actual filtering. However, after looking at the current implementation of this part of the index, there are some minor issues that may need to be clarified first:
1. This means that if we want to select a specific index to use, we need to add an interface similar to
2. Is it necessary to support multiple indexes for one column here?
3. Suppose the index for this column has been obtained and, after applying the filter to the index, a `RowIdTreeMap` is returned. Is it then necessary to add an `approx_count_rows()` interface to count the number of `row_id`s in the `RowIdTreeMap`? Additionally, can this interface also be exposed on the fragment interface?
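A small sketch of the `approx_count_rows()` idea from point 3. The container below is a simplified stand-in for Lance's `RowIdTreeMap` (the real one is a Rust structure); the point it illustrates is that counting row ids in the index result needs no data scan, so it is cheap compared to an exact filtered count.

```python
# Hypothetical, simplified model of a RowIdTreeMap: row ids grouped by
# fragment id, each group stored as a set of local row offsets.
class RowIdTreeMap:
    def __init__(self):
        self._map = {}  # fragment_id -> set of row offsets

    def insert(self, fragment_id, row_offset):
        self._map.setdefault(fragment_id, set()).add(row_offset)

    def approx_count_rows(self):
        # Count row ids across all fragments; no data is read.
        return sum(len(s) for s in self._map.values())

    def fragment_ids(self):
        # Fragments containing at least one matching row id: this is
        # the set a prune_fragments-style API would distribute.
        return sorted(self._map)


m = RowIdTreeMap()
for frag, off in [(0, 1), (0, 7), (3, 2)]:
    m.insert(frag, off)
print(m.approx_count_rows())  # -> 3
print(m.fragment_ids())       # -> [0, 3]
```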
---
Yeah, today there is in general an implicit assumption of one index per column. With these inexact indexes this is no longer true, since we might want to have a zone map or bloom filter for many columns, and then have exact indexes to use at the worker. I created #4805 listing some work we need to do.
For here, I think we can assume one index per column for now so we are not blocked by it.
---
@jackye1995 I submitted PR #5625 to prune fragments using a filter and an index. Could you please take a look and let me know if it seems reasonable? Thank you.
cc @majin1102