@charlesbluca
Member

Looks like we should be unblocked to support left anti joins when dataframe.backend="cudf", similar to the case in legacy Dask dataframe:

https://github.com/dask/dask/blob/df4de6ea53054790b09006c8ea68ef8725d39025/dask/dataframe/multi.py#L565

Note that, like the legacy code, we'll fail somewhere down in the computation stack if we try this on CPU. I'm not sure whether it makes sense to check the backend when how="leftanti" and eagerly raise a NotImplementedError if dataframe.backend != "cudf".

cc @rjzamora
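For illustration, the eager check floated above could look roughly like the sketch below. The helper name `validate_merge_how` and its `backend` argument are hypothetical (not part of dask); in practice the backend string would presumably come from the `dataframe.backend` config value.

```python
def validate_merge_how(how: str, backend: str) -> None:
    """Hypothetical helper: fail early instead of deep in the computation stack.

    `backend` is assumed to be the value of the "dataframe.backend" config
    option, passed in as a plain string so this sketch stays self-contained.
    """
    if how == "leftanti" and backend != "cudf":
        raise NotImplementedError(
            "how='leftanti' is only supported when dataframe.backend='cudf'; "
            f"got dataframe.backend={backend!r}"
        )
```

The trade-off is that a check like this would need to be removed again if CPU support is added later.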

Member

@rjzamora left a comment

Thanks @charlesbluca - Seems reasonable to add "leftanti" support given that the necessary logic is pretty simple, and the legacy dask.dataframe API supports it.

df2 = df2.rename(columns={"aa": "dd"})
assert_eq(
    df1.merge(df2, how="leftanti", left_on="aa", right_on="dd"),
    pdf1[~pdf1.aa.isin(pdf2.aa)],
Member

Seems like we could just do this in merge_chunk for pandas data to support how="leftanti" on CPU as well.
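As a rough sketch of that idea, the `isin` expression from the test above can be wrapped in a small function for the single-key case. The name `leftanti_chunk` is hypothetical and this is not the actual merge_chunk code, only an illustration of the pandas logic.

```python
import pandas as pd


def leftanti_chunk(left, right, left_on, right_on):
    """Hypothetical helper: keep rows of `left` whose key is absent from `right`.

    Handles a single key column on each side, mirroring the
    `pdf1[~pdf1.aa.isin(pdf2.aa)]` expression used in the test.
    """
    return left[~left[left_on].isin(right[right_on])]


pdf1 = pd.DataFrame({"aa": [1, 2, 3, 4], "bb": list("wxyz")})
pdf2 = pd.DataFrame({"dd": [2, 4, 5]})

# Rows with aa == 2 and aa == 4 have a match in pdf2.dd and are dropped.
result = leftanti_chunk(pdf1, pdf2, left_on="aa", right_on="dd")
```

Only the left-hand columns survive, which is what distinguishes an anti join from a filtered outer join.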

Member Author

Yeah, good point. I can look into this a bit more.

Member Author

Pushed some commits to dask/dask#11150 that, in conjunction with this PR, should unblock left anti/semi joins on CPU.
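For reference, one way to express both join types in plain pandas, including multi-column keys, is a left merge against the de-duplicated right keys with `indicator=True`. This is a minimal sketch and not necessarily what the linked commits do; the function name `semi_or_anti` is hypothetical, and the `"leftsemi"` spelling is assumed by analogy with `"leftanti"`.

```python
import pandas as pd


def semi_or_anti(left, right, on, how):
    """Hypothetical helper: left semi/anti join on one or more shared key columns."""
    if how not in ("leftsemi", "leftanti"):
        raise ValueError(f"unsupported how={how!r}")
    # De-duplicate the right keys so matching left rows are never repeated.
    keys = right[on].drop_duplicates()
    # indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
    marked = left.merge(keys, on=on, how="left", indicator=True)
    keep = "both" if how == "leftsemi" else "left_only"
    return marked[marked["_merge"] == keep].drop(columns="_merge")


left = pd.DataFrame({"a": [1, 1, 2, 3], "b": list("xyxz"), "c": [10, 20, 30, 40]})
right = pd.DataFrame({"a": [1, 3, 3], "b": list("xzz")})

semi = semi_or_anti(left, right, on=["a", "b"], how="leftsemi")
anti = semi_or_anti(left, right, on=["a", "b"], how="leftanti")
```

Note that the merge resets the index of the left frame, so a real implementation would need to decide whether the original index should be preserved.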
