
are_co_aligned is tokenizing too greedily, causing it to be slow #907

Open

fjetter opened this issue Feb 29, 2024 · 7 comments

fjetter commented Feb 29, 2024

The utility function are_co_aligned, see

https://github.com/dask-contrib/dask-expr/blob/9334e062a7b41161977ca1c42176197629569cc5/dask_expr/_expr.py#L2863-L2874

is unfortunately rather slow due to the tokenization and the lack of caching. In parquet reader benchmarks on larger datasets, I saw this slowing down the optimize step by almost a second (when using a pyarrow filesystem, such that filters are pushed down).
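One direction for the caching half of the problem, as a very rough sketch only (the `tokenize` below is a toy stand-in for `dask.base.tokenize`, and `cached_partial_token` is a hypothetical name, not existing dask-expr API): memoize the partial token so that repeated co-alignment checks do not re-tokenize the same operands.

```python
import hashlib
from functools import lru_cache


def tokenize(*args):
    # Toy stand-in for dask.base.tokenize: a deterministic digest of repr
    return hashlib.md5(repr(args).encode()).hexdigest()


@lru_cache(maxsize=None)
def cached_partial_token(expr_name, operands):
    # Keyed on the expression name plus a hashable tuple of the operands
    # that matter for alignment; repeated co-alignment checks on the same
    # expression hit the cache instead of re-tokenizing large operands.
    return tokenize(operands)
```

This assumes that expression names plus the selected operands uniquely identify the relevant state; whether that holds for every expression type would need checking.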

[screenshot attached to original issue]

On top of this, I believe the implementation is unsafe since it puts Expr objects into a set. Sets and dicts require both __hash__ and __eq__ to be implemented as the stdlib protocol defines them. While this is true for __hash__ (it hashes the name), it is not the case for __eq__, which creates another Expr instance instead of returning a bool. I suspect this only works because the set is effectively redundant, i.e. there has never been a hash collision / duplicate object here.
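To illustrate the hazard, here is a minimal, self-contained sketch (FakeExpr is a stand-in, not the real dask-expr class) of what set membership does when __eq__ has expression-building semantics:

```python
class FakeExpr:
    """Stand-in for dask-expr's Expr: __hash__ uses the name, but
    __eq__ builds a new expression instead of returning a bool."""

    def __init__(self, name):
        self._name = name

    def __hash__(self):
        return hash(self._name)

    def __eq__(self, other):
        # Expression-building semantics, like ``df.a == df.b``
        return FakeExpr(f"eq({self._name}, {other._name})")


a = FakeExpr("x")
b = FakeExpr("x")   # distinct object, same hash
s = {a}

# On a hash match the set falls back on __eq__; the returned FakeExpr
# is truthy, so the distinct object b looks like a member of the set.
print(b in s)       # prints True
```

Any object whose __eq__ returns something truthy would be treated as "already present" here, regardless of whether the two expressions are actually the same.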

@fjetter changed the title from "are_co_aligned slow and possibly dangerous" to "are_co_aligned slow and possibly unsafe" on Feb 29, 2024

phofl commented Feb 29, 2024

The unsafe thing is relatively easy to fix (see the PR). The slowdown probably needs a bit of care.


fjetter commented Feb 29, 2024

Also, I think are_co_aligned (which is a bit of a misleading function name, considering that we typically talk about alignment in the context of partitions) is buggy, since it doesn't include filters.


phofl commented Feb 29, 2024

Are you talking about filters in the tokenize partial call? Filters can change the number of partitions through pruning, so we shouldn't include them.

I thought a little bit about the filters in read_parquet; we shouldn't use are_co_aligned there anyway, since it is too generous and allows expressions that we don't want to allow. I have a solution for this soonish.


fjetter commented Feb 29, 2024

> Are you talking about filters in the tokenize partial call?

Yes. I haven't thought too deeply about this, but it is confusing that filters aren't included while columns are. No big deal, though.


fjetter commented Feb 29, 2024

Looking into this a little more deeply, it appears that _tokenize_partial is slow because it tokenizes too many things. The parquet reader, for instance, reuses one of its parameters as a cache, and tokenizing this cache is what makes it costly. ReadParquet accounts for this by overriding the _name method, but that override is obviously not respected by _tokenize_partial.


fjetter commented Feb 29, 2024

I think the _tokenize_partial approach is a little brittle in that sense: it compares objects without knowing anything about the objects themselves. Maybe an __approx_eq__ method on the expressions would be more sensible. It would perform the _tokenize_partial comparison but allow for caching and overriding.
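A rough sketch of what such a method could look like. Everything here is hypothetical (the class, the operand names, and the ignore tuple are illustrations only, the latter mirroring the existing _series ignore entry), not real dask-expr API:

```python
from functools import cached_property


class Expr:
    """Illustrative sketch of the __approx_eq__ idea, not the real class."""

    # Operands that should not affect co-alignment (cf. the existing
    # _series entry in the ignore list)
    _ignore_for_alignment = ("_series",)

    def __init__(self, **operands):
        self.operands = operands

    @cached_property
    def _alignment_token(self):
        # Subclasses such as a parquet reader carrying a cache operand
        # could override this to skip expensive operands entirely.
        return tuple(sorted(
            (k, repr(v))
            for k, v in self.operands.items()
            if k not in self._ignore_for_alignment
        ))

    def __approx_eq__(self, other):
        # Returns an actual bool, unlike the expression-building __eq__,
        # and the cached_property memoizes the per-expression token.
        return self._alignment_token == other._alignment_token
```

Compared with hashing everything through _tokenize_partial, this keeps the decision about what is alignment-relevant next to the expression class that defines the operands.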


fjetter commented Feb 29, 2024

This abstraction leak is already present with the _series argument in the ignore list.

@fjetter changed the title from "are_co_aligned slow and possibly unsafe" to "are_co_aligned is tokenizing too greedily, causing it to be slow" on Feb 29, 2024