`are_co_aligned` is tokenizing too greedily, causing it to be possibly slow #907
Comments
The unsafe thing is relatively easy to fix (see PR). The slowdown probably needs a bit of care.
Also, I think the |
Are you talking about filters in the tokenize partial call? Filters can change the number of partitions with pruning, so we shouldn't include them. I thought a little bit about the filters in read parquet: we shouldn't use are_co_aligned there anyway. It's too generous and allows expressions that we don't want to allow. I have a solution for this soonish.
Yes. I haven't thought too deeply about this, but it is confusing that it isn't included while columns are. No big deal, though.
Looking into this a little more deeply, it appears that the |
I think the tokenize_partial approach is a little brittle in that sense. It compares objects without knowing anything about the objects themselves. Maybe an |
This abstraction leak is already present with the |
The utility function `are_co_aligned` (see https://github.com/dask-contrib/dask-expr/blob/9334e062a7b41161977ca1c42176197629569cc5/dask_expr/_expr.py#L2863-L2874) is unfortunately rather slow due to the tokenization and lack of caching. In parquet_reader benchmarks on larger datasets, I saw this slowing down the optimize step by almost a second (when using pyarrowFS such that filters are pushed down).
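To illustrate the caching point, here is a minimal sketch of memoizing an expensive token per node so that repeated comparisons do not recompute it. The `Node` class, its `token` property, and the call counter are hypothetical and only stand in for the idea; this is not how dask-expr's `Expr` or `are_co_aligned` is implemented.

```python
import hashlib
from functools import cached_property


class Node:
    """Hypothetical expression node; NOT dask-expr's Expr."""

    calls = 0  # counts how often a token is actually computed

    def __init__(self, op, *operands):
        self.op = op
        self.operands = operands

    @cached_property
    def token(self):
        # Computed once per instance, then stored on the instance.
        Node.calls += 1
        parts = [self.op] + [
            o.token if isinstance(o, Node) else repr(o) for o in self.operands
        ]
        return hashlib.sha1("|".join(parts).encode()).hexdigest()


leaf = Node("read_parquet", "path/to/data")  # placeholder operand
expr = Node("add", leaf, leaf)

first = expr.token   # computes the token for expr and, once, for leaf
second = expr.token  # served from the cache, nothing is recomputed
```

With this shape, the shared `leaf` is tokenized once even though it appears twice, and asking for `expr.token` again costs only an attribute lookup.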
On top of this, I believe the implementation is unsafe since it is putting `Expr` objects into a `set`. Sets and dicts require both `__hash__` and `__eq__` to be implemented and to work as the stdlib protocol defines them. While this is true for `__hash__` (it hashes the name), it is not the case for `__eq__`, since this just creates another `Expr` instance instead of returning a bool. I suspect this just shows that the set is redundant, if there has never been a hash collision / duplicate object here.
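The hazard can be reproduced with a small stand-in class. This is a sketch under stated assumptions: `FakeExpr` and its `key` argument are hypothetical, the hash is forced through `key` only to make collisions deterministic, and the object returned by `__eq__` is truthy by default. Whether the real `Expr` behaves the same way depends on details this sketch does not model.

```python
class FakeExpr:
    """Hypothetical stand-in; NOT dask-expr's Expr."""

    def __init__(self, name, key):
        self.name = name
        self.key = key  # hypothetical knob used to force hash collisions

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        # Builds a new expression object instead of returning a bool.
        # Plain objects are truthy, so a set reads this as "equal".
        return FakeExpr(f"({self.name} == {other.name})", key=0)


# Different hashes: __eq__ is never consulted, both objects are kept.
distinct = {FakeExpr("x", key=1), FakeExpr("y", key=2)}

# Same hash, different objects: the set calls __eq__, gets a truthy
# object back, and silently drops the second element.
colliding = {FakeExpr("x", key=1), FakeExpr("y", key=1)}

print(len(distinct), len(colliding))
```

As long as hashes never collide, the set works by accident. On a collision, two different expressions are treated as duplicates.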