Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join_nulls in "asof" join #17886

Open
MariusMerkleQC opened this issue Jul 26, 2024 · 4 comments · May be fixed by #21172
Open

join_nulls in "asof" join #17886

MariusMerkleQC opened this issue Jul 26, 2024 · 4 comments · May be fixed by #21172
Labels
enhancement New feature or an improvement of an existing feature

Comments

@MariusMerkleQC
Copy link

Description

Would it be possible to add the join_nulls: bool = False argument to the .join_asof() function, as it is also available to the .join() function?

I have a use case where I want to join two data frames using the "asof" logic, and I'd also like to join when the join keys (on/left_on/right_on) are Null. I would also be interested whether there is a workaround in the mean time.

@MariusMerkleQC MariusMerkleQC added the enhancement New feature or an improvement of an existing feature label Jul 26, 2024
@mcrumiller
Copy link
Contributor

mcrumiller commented Jul 26, 2024

Can you provide an example including the desired result? Do you mean when the left frame is null, or when the right frame does not have a matching row but does have a null value?

@MariusMerkleQC
Copy link
Author

Does this example clarify the desired result?

import polars as pl
from datetime import datetime

df_expected = pl.DataFrame(
    data=[
        (None, datetime(2024, 1, 1, 0, 0, 0), 5),
        ("a", datetime(2024, 1, 1, 0, 0, 0), 5),
    ],
    schema={"category": pl.Utf8, "timestamp": pl.Datetime, "value": pl.Int8},
    orient="row",
)

df_left = df_expected.drop("value")

df_right = pl.DataFrame(
    data=[
        (None, datetime(2023, 1, 1, 0, 0, 0), 5),
        ("a", datetime(2023, 1, 1, 0, 0, 0), 5),
    ],
    schema={"category": pl.Utf8, "timestamp": pl.Datetime, "value": pl.Int8},
    orient="row",
)

df_actual = df_left.join_asof(
    other=df_right,
    on="timestamp",
    by=["category"],
    strategy="backward",  # join_nulls=True
)

@mcrumiller
Copy link
Contributor

I see--I believe you're asking that the initial by=... include the ability to join on nulls.

I think there is an issue which is that joining on nulls produces the cartesian product of the matching records, and these are not guaranteed to have a sorted output order, which is a requirement of join_asof. But of course if the join itself is producing those records, it could probably sort them.

@NicolasMuellerQC
Copy link

NicolasMuellerQC commented Feb 10, 2025

I think there is an issue which is that joining on nulls produces the cartesian product of the matching records,

Isn't this also true for any other values? What is special about nulls here (in the by part, not the on part)?

@NicolasMuellerQC NicolasMuellerQC linked a pull request Feb 10, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants