
are_co_aligned is tokenizing too greedily, causing it to be slow #907

Open

fjetter opened this issue Feb 29, 2024 · 7 comments

fjetter commented Feb 29, 2024

The utility function are_co_aligned, see

https://github.com/dask-contrib/dask-expr/blob/9334e062a7b41161977ca1c42176197629569cc5/dask_expr/_expr.py#L2863-L2874

is unfortunately rather slow due to the tokenization and the lack of caching. In parquet reader benchmarks on larger datasets, I saw this slowing down the optimize step by almost a second (when using a pyarrow filesystem, such that filters are pushed down).
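One direction for the caching half of the problem, as a very rough sketch only (the `tokenize` below is a toy stand-in for `dask.base.tokenize`, and `cached_partial_token` is a hypothetical name, not existing dask-expr API): memoize the partial token so that repeated co-alignment checks do not re-tokenize the same operands.

```python
import hashlib
from functools import lru_cache


def tokenize(*args):
    # Toy stand-in for dask.base.tokenize: a deterministic digest of repr
    return hashlib.md5(repr(args).encode()).hexdigest()


@lru_cache(maxsize=None)
def cached_partial_token(expr_name, operands):
    # Keyed on the expression name plus a hashable tuple of the operands
    # that matter for alignment; repeated co-alignment checks on the same
    # expression hit the cache instead of re-tokenizing large operands.
    return tokenize(operands)
```

This assumes that expression names plus the selected operands uniquely identify the relevant state; whether that holds for every expression type would need checking.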

[screenshot attached to original issue]

On top of this, I believe the implementation is unsafe since it puts Expr objects into a set. Sets and dicts require both __hash__ and __eq__ to be implemented as the stdlib protocol defines them. While this is true for __hash__ (it hashes the name), it is not the case for __eq__, which creates another Expr instance instead of returning a bool. I suspect this only works because the set is effectively redundant, i.e. there has never been a hash collision / duplicate object here.
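To illustrate the hazard, here is a minimal, self-contained sketch (FakeExpr is a stand-in, not the real dask-expr class) of what set membership does when __eq__ has expression-building semantics:

```python
class FakeExpr:
    """Stand-in for dask-expr's Expr: __hash__ uses the name, but
    __eq__ builds a new expression instead of returning a bool."""

    def __init__(self, name):
        self._name = name

    def __hash__(self):
        return hash(self._name)

    def __eq__(self, other):
        # Expression-building semantics, like ``df.a == df.b``
        return FakeExpr(f"eq({self._name}, {other._name})")


a = FakeExpr("x")
b = FakeExpr("x")   # distinct object, same hash
s = {a}

# On a hash match the set falls back on __eq__; the returned FakeExpr
# is truthy, so the distinct object b looks like a member of the set.
print(b in s)       # prints True
```

Any object whose __eq__ returns something truthy would be treated as "already present" here, regardless of whether the two expressions are actually the same.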

@fjetter changed the title from "are_co_aligned slow and possibly dangerous" to "are_co_aligned slow and possibly unsafe" on Feb 29, 2024

phofl commented Feb 29, 2024

The unsafe thing is relatively easy to fix (see the PR). The slowdown probably needs a bit of care.


fjetter commented Feb 29, 2024

Also, I think are_co_aligned (which is a bit of a misleading function name, considering that we typically talk about alignment in the context of partitions) is buggy, since it doesn't include filters.


phofl commented Feb 29, 2024

Are you talking about filters in the tokenize partial call? Filters can change the number of partitions through pruning, so we shouldn't include them.

I thought a little bit about the filters in read_parquet; we shouldn't use are_co_aligned there anyway, since it is too generous and allows expressions that we don't want to allow. I have a solution for this soonish.


fjetter commented Feb 29, 2024

> Are you talking about filters in the tokenize partial call?

Yes. I haven't thought too deeply about this, but it is confusing that filters aren't included while columns are. No big deal, though.


fjetter commented Feb 29, 2024

Looking into this a little more deeply, it appears that _tokenize_partial is slow because it tokenizes too many things. The parquet reader, for instance, reuses one of its parameters as a cache, and tokenizing this cache is what makes it costly. ReadParquet accounts for this by overriding the _name method, but that override is obviously not respected by _tokenize_partial.


fjetter commented Feb 29, 2024

I think the _tokenize_partial approach is a little brittle in that sense: it compares objects without knowing anything about the objects themselves. Maybe an __approx_eq__ method on the expressions would be more sensible. It would perform the _tokenize_partial comparison but allow for caching and overriding.
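A rough sketch of what such a method could look like. Everything here is hypothetical (the class, the operand names, and the ignore tuple are illustrations only, the latter mirroring the existing _series ignore entry), not real dask-expr API:

```python
from functools import cached_property


class Expr:
    """Illustrative sketch of the __approx_eq__ idea, not the real class."""

    # Operands that should not affect co-alignment (cf. the existing
    # _series entry in the ignore list)
    _ignore_for_alignment = ("_series",)

    def __init__(self, **operands):
        self.operands = operands

    @cached_property
    def _alignment_token(self):
        # Subclasses such as a parquet reader carrying a cache operand
        # could override this to skip expensive operands entirely.
        return tuple(sorted(
            (k, repr(v))
            for k, v in self.operands.items()
            if k not in self._ignore_for_alignment
        ))

    def __approx_eq__(self, other):
        # Returns an actual bool, unlike the expression-building __eq__,
        # and the cached_property memoizes the per-expression token.
        return self._alignment_token == other._alignment_token
```

Compared with hashing everything through _tokenize_partial, this keeps the decision about what is alignment-relevant next to the expression class that defines the operands.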


fjetter commented Feb 29, 2024

This abstraction leak is already present with the _series argument in the ignore list.

@fjetter changed the title from "are_co_aligned slow and possibly unsafe" to "are_co_aligned is tokenizing too greedily, causing it to be slow" on Feb 29, 2024