feat(proto): implement AsOfJoinRel
#331
Something like this really requires updates to the website (built from the `site` directory) in order to describe what the behavior of multiple rhs relations in a normal join does in all cases, and what an as-of join is. I personally don't know what they are, but I also shouldn't need to know a priori because the documentation should tell me what they are.
The protos are also broken to the point of not compiling, but let's settle on the intended semantics by way of writing documentation first.
Got it. Please bear with me as I'm learning this repo and may take a couple of iterations.
Some thoughts.
### Join Properties

| Property | Description | Required |
| --------------- | ------------------------------------------------------------ | ---------------------------------- |
| Left Input | A relational input. | Required |
| Right Inputs | Each a relational input. | Required, at least one |
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the direct output order of the data. | Required. Can be the literal True. |
| Join Type | Same as the [Join](logical_relations.md#join-operator) operator. | Required |
| Tolerance | The maximum on-field value difference in an inexact match. | Required |
| On | The on-field. | Required |

### Join Types

Same as in the join operation.
I don't think these are all properties of the AsOf join (and an AsOf join, if I am reading the protobuf correctly, doesn't have a join type). So I think all you need is something like...
| Property | ...
| Join | ...
| Tolerance | ...
| On | ...
I define here a generalized as-of-join operation, which is a generalization of both the join operation and the current `AsofJoinNode` in Arrow (more details below). My understanding here is that it is OK (and desirable, I think) to define a more general logical operation, even if it is currently only partially implemented in physical operations, provided the general operation is well-designed.
At a high level, the table result of the generalized as-of-join operation, with given join-expression and join-type, is a (not necessarily strict) sub-table of the join-operation with the same join-expression and join-type. The site-doc for as-of-join says that a row in the join-result will also appear in the as-of-join-result if the inexact matching of the on-key evaluates to true for this row (false leads to null values in the result). This is why any join-type also works in the generalized as-of-join operation. Note that the `AsOfJoinNode` in Arrow is equivalent to the generalized as-of-join operation with a left-join as the join-type.
It's fine for it to be more general in principle, and for implementations to support the various join types with as of join, but let's keep the implementation decoupled from `JoinRel`.
You mean just duplicate the `JoinRel` fields into `AsOfJoinRel`?
proto/substrait/algebra.proto
Outdated
@@ -178,6 +178,18 @@ message JoinRel {
  substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// A time-series variant of the multi JOIN relational operator left-join-right(s), which joins on an ordered key using inexact matching
message AsOfJoinRel {
  JoinRel join = 1;
Does an AsOfJoinRel use all the properties of `join`? In particular, does it use the `expression` or `type`? If not, maybe composition is not the correct approach here and it would be better to just copy the relevant fields.
I agree with Weston here, let's repeat the fields instead of sticking a join inside of another join.
As-of joins are different enough from a generic join that they should not be subject to any changes that happen in `JoinRel`.
Yes and yes. See my earlier response.
@@ -219,6 +219,40 @@ The join operation will combine two separate inputs into a single output, based | |||
``` | |||

## AsOfJoin Operation

The as-of-join operation is a time-series operation that will combine a left input and multiple right inputs into a single output, based on a join expression, an on-field and a tolerance value. All inputs must have the on-field in ascending-order. The operation is similar to a join-operation where the join expression is used for exact matching whereas the on-field is used for inexact matching up to the tolerance value.
I wonder if the statement "All inputs must have the on-field in ascending-order" is too prescriptive. In theory, an as-of join could be created even if the on-field wasn't in ascending order by doing something similar to a hash-join and building up a table in memory. In this case, if the ordering is not present (or not compatible with the on-key) then a consumer that only supports the ordered version can simply reject the plan.
That being said...if no real consumer supports an unordered variant...then maybe this would just be pedantic.
The more I think of this the more I think maybe a non-ordered as-of join is just too weird. For example, distribution would not necessarily be maintained in this case.
IIUC, you're not asking here for a change, but let me know if I misunderstood.
proto/substrait/algebra.proto
Outdated
  oneof on_type {
    Expression.FieldReference on = 3;
    // Reserve next tags for future on_type alternatives
  }
What would be an example of a different `on_type`?
A natural example is an algebraic expression involving one or more fields and returning a value in an ordered domain (with a distance measure). Another, perhaps contrived, example is multiple on-fields interpreted as a radix.
proto/substrait/algebra.proto
Outdated
@@ -363,6 +375,7 @@ message Rel {
    ExtensionMultiRel extension_multi = 10;
    ExtensionLeafRel extension_leaf = 11;
    CrossRel cross = 12;
    AsOfJoinRel asofjoin = 13;
Nit:
- AsOfJoinRel asofjoin = 13;
+ AsOfJoinRel asof_join = 13;
proto/substrait/algebra.proto
Outdated
// A time-series variant of the multi JOIN relational operator left-join-right(s), which joins on an ordered key using inexact matching
message AsOfJoinRel {
  JoinRel join = 1;
  int64 tolerance = 2;
Can you elaborate on why you chose `int64` here as opposed to an expression or even a literal?
I agree this should be more general. The exact definition needed here is a value from an ordered domain with a distance measure. How would that be captured in the proto?
In the recent commit I used `Literal` and deferred the domain checking to the consumer.
Before I dive into the proposed Substrait representation, let me just write out what I now think a join relation behaviorally does or should do. It took about two hours of discussion with @cpcloud to get here, so it feels wrong not to share, and I'll be referring back to it as well.
Please bear in mind that this is just a behavioral description. You would certainly not be materializing the Cartesian product of all inputs as a first step of any sane join implementation. With that said;

Multi-table join

Fundamentally, multi-table join is already covered by Substrait. Something like

SELECT employee.first_name, employee.last_name, call.start_time, call.end_time, call_outcome.outcome_text
FROM employee
INNER JOIN call ON call.employee_id = employee.id
INNER JOIN call_outcome ON call.call_outcome_id = call_outcome.id
ORDER BY call.start_time ASC;

would look like two nested inner join operations; something like

I don't necessarily see this as a reason not to define a multi-table join, but it does make me question what it's useful for. If the only reason is along the lines of "this maps more nicely to Acero," that's not a good reason to make consuming JoinRels more complicated for everyone else. If we do decide to embrace this, then I have some concerns and notes:

For Acero and friends, I don't necessarily see why a tree is that much harder to deal with than a flattened multi-rhs join. For a producer the conversion is trivial. For a consumer, you'd look at the left input and see if it's a compatible join; if so, merge it, otherwise consume it like a normal relation. This will also help proper consumption from other producers later, that might not know to flatten a join tree, even if Substrait were to support it (so you should kind of implement this anyway). If you're really dead-set on wanting to mirror Acero's expressiveness for relations exactly, the correct solution is to use a relation extension. Again, this is something you should IMO do eventually anyway, because it allows you to decouple and make optional this tree-flattening logic in/from the consumer itself. It's just a matter of which you find higher-priority.

As-of join

I see a bunch of problems and implementation mistakes for this one, and have some random thoughts as well.

Some misc. notes:
Thanks for this very detailed feedback, @jvanstraten. It adds a lot of clarity to the discussion. I generally agree with a lot of the points you made. I'll address some points below.

Regarding the behavioral description. This is a good approach and, IIUC, it can be seen as a more detailed description of the generalization I described. However, I'm not sure I follow the as-of-join part. In particular, I would have expected to see the tolerance embedded within the as-of-inequality (or more generally, a range expression) rather than be separate from it and/or an optional item. For example, I would think an as-of-join with a tolerance of 1 hour would be represented with an as-of-inequality expression with a meaning like

Regarding multi-table joins. I agree there is symmetry between a left-join of one-left-with-multiple-right-tables and a right-join of one-right-with-multiple-left-tables, but then supporting one of them is enough and the choice is arbitrary. What I don't see the meaning of is a left-join or right-join of multiple-left-with-multiple-right-tables (no way to arbitrate between the multiple non-equivalent join-trees that would fit this), and if I'm reading the behavioral description correctly, it does not deal with this case anyway.

While Acero is a starting point for the design, it is not how I would justify defining a multi-table join. One justification is that it allows the Substrait consumer to represent the multi-join succinctly, so that any Substrait producer could optimize from this representation. Without a multi-join representation, the Substrait consumer would have to generate a tree-join and each Substrait producer would have to reimplement some kind of pattern-matcher to recover the multi-join from this tree-join before it could optimize.

Regarding as-of-join. Several points you raise are related to
My local protoc (libprotoc 3.6.1) compiled this fine, but I guess some CI job runs a different/older version. How should I correctly validate before committing?
What would be a proper way to allow for future types to be added to the
I think this is OK. If we do this, I'll give the as-of-join as a higher priority.
I believe the first commit's message is conventional. Does that mean I should use the same conventional message for all commits in this PR? In other repos there was no issue for the reviewer to pick the message of the first commit. |
I did it this way in part because I didn't fully consider your approach and in part because enforcing the inequality to be at the root of the expression and mandating that the LHS only uses fields from the left and the RHS only uses fields from the right makes it directly applicable for sorting as well. Conversely, something like

That being said, I like your idea of just making the sort explicit more. Sorts are normally represented using

I guess the operation could be implemented in two ways; either by sorting the entire RHS first, or doing a partial sort (i.e. min/max) operation after filtering all the candidates for a particular LHS record. To update the behavior description above, using the latter interpretation as the baseline:
I think the sort could also be left out entirely, and implicitly be whatever minimizes the difference between the LHS and RHS of the comparison operator at the root of the expression. But that's not as powerful, because the root of the expression doesn't actually need to be a comparison operator that Substrait knows about if you do include the sort, and without the sort it's much harder to derive how to sort the RHS input if you want to do it before the rest of the join operation. Actually... thinking about it like this, I think Substrait can already almost represent as-of: just insert a SortRel in the RHS with the appropriate sort, use a "single" join, and update the definition of that one that it needs to return the first value if ambiguous, rather than being allowed to error out or return any one of them regardless of sort. But I'm still fine with adding explicit support for as-of anyway.
Semi, anti, and single joins are not symmetrical so it falls apart there, but fair enough otherwise, although by the same logic one of left or right join need not exist either. It was mostly intended as a thought experiment.
Doesn't it? I'm kind of too lazy to reread my description in detail to see if I made any mistakes, but the generalization I intended is for any multi-table join of
I think you swapped producer and consumer in your last sentence, but I get what you're saying. My counterargument would (still) be that a consumer cannot generally rely on the producer doing this optimization for them unless it already knows what the producer is. So there are two cases here:
It is true that, in the former case, having multi-table join in Substrait allows them to not care about dealing with optimizing join trees. However, any consumer that does not support multi-table join internally would have to outright reject a plan that uses it unless they write new code to reduce it. So, IMO, the tradeoff here tips against introducing first-class support for multi-table join.
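To make the tree-versus-flattened tradeoff concrete, here is a minimal Python sketch of the conversion in both directions. The `Read` and `Join` classes and both helper functions are hypothetical stand-ins, not Substrait or Acero types, and the sketch assumes a left-deep chain of inner joins (flattening other join types is not generally valid).

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Read:
    """Hypothetical leaf relation: reads a named table."""
    table: str


@dataclass
class Join:
    """Hypothetical binary join node, standing in for a nested JoinRel."""
    left: "Rel"
    right: "Rel"
    condition: str
    join_type: str = "inner"


Rel = Union[Read, Join]


def flatten_left_deep(rel: Rel, join_type: str = "inner") -> Tuple[Rel, List[Tuple[Rel, str]]]:
    """Consumer side: walk down the left inputs and merge compatible joins.

    Returns the base relation plus the (right input, condition) pairs in
    the order in which they are joined.
    """
    rights = []
    while isinstance(rel, Join) and rel.join_type == join_type:
        rights.append((rel.right, rel.condition))
        rel = rel.left
    rights.reverse()
    return rel, rights


def unflatten(base: Rel, rights: List[Tuple[Rel, str]], join_type: str = "inner") -> Rel:
    """Producer side: rebuild the nested tree from a flattened multi-rhs join."""
    for right, condition in rights:
        base = Join(base, right, condition, join_type)
    return base


# The employee/call/call_outcome query from earlier in the thread, as a tree.
tree = Join(
    Join(Read("employee"), Read("call"), "call.employee_id = employee.id"),
    Read("call_outcome"),
    "call.call_outcome_id = call_outcome.id",
)
base, rights = flatten_left_deep(tree)
```

In this sketch both directions are a short loop, which is the point being argued: the work exists either way, and the open question is only whether it lives in the producer or the consumer.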
Correct.
That's weird IMO, but I've learned not to be surprised by inconsistencies in protobuf... I think CI uses
It is, protobuf doesn't care what the tags are. The only somewhat relevant thing to know about them is that tags > 15 cost one or more additional bytes in the serialization. The important thing for forward- and backward-compatibility is that all oneofs are semantically mandatory fields. This is because protobuf inherently doesn't know for unknown fields during deserialization which oneof they belong to (if any). It just treats them like any other unknown field; i.e. it stores them in a special place to retain them when reserializing and then silently moves on. So, a oneof that has a field populated that the deserializer doesn't know about looks like an unspecified oneof, and therefore needs to be rejected, or the plan would be executed incorrectly using whatever default behavior would be associated with not specifying the oneof.
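The unknown-field behavior can be illustrated without the protobuf library. The following is a minimal sketch that hand-decodes the wire format for varint fields only; the field numbers, the `KNOWN_ON_TYPE_FIELDS` table, and the helper names are hypothetical, and the real `on` field is a message rather than an integer.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Hypothetical: what an older consumer knows about the alternatives of on_type.
KNOWN_ON_TYPE_FIELDS = {3: "on"}


def decode_varint_fields(buf: bytes):
    """Split a message made only of varint fields into known and unknown fields."""
    known, unknown = {}, []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(buf, pos)
        if field_number in KNOWN_ON_TYPE_FIELDS:
            known[KNOWN_ON_TYPE_FIELDS[field_number]] = value
        else:
            # Retained for reserialization, but never linked to any oneof.
            unknown.append((field_number, value))
    return known, unknown


def which_on_type(known: dict):
    """Return the populated alternative of on_type, or None if it looks unset."""
    return next(iter(known), None)


# A newer producer populated a hypothetical alternative with field number 4.
newer_message = bytes([(4 << 3) | 0, 7])
known, unknown = decode_varint_fields(newer_message)

# An older producer populated field 3, which this consumer understands.
older_message = bytes([(3 << 3) | 0, 1])
older_known, older_unknown = decode_varint_fields(older_message)
```

For `newer_message` the oneof looks unset even though the producer filled it in, so a consumer that treats every oneof as mandatory would reject the plan at this point instead of falling back to a default.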
IMO it shouldn't work this way, but CI will simply reject your PR if the commits don't satisfy conventional commit style, and I don't have the authority to overrule that. It's silly, because it's the final squash-merge commit that actually matters, which is ultimately typed and checked by the reviewer, and is not checked by CI (because it can't be until it's already in main). I've brought this issue up before, but feedback was mixed so I just dropped it. |
  Rel left = 2;
  repeated Rel right = 3;
  Expression expression = 4;
  Expression post_join_filter = 5;
I don't generally see a post join filter in asof join APIs. Do we need this?
Agreed, I think we should remove this. This seems like a very logical node. If an engine could benefit from a post-join filter then that should be left for the physical definition (one could argue that our join rel shouldn't have a post-join filter as well)
  RelCommon common = 1;
  Rel left = 2;
  repeated Rel right = 3;
  Expression expression = 4;
What does this expression mean?
That would be the actual join expression. The FieldRef that's still in the proposal right now does not work the way you think it works in Substrait and is just not applicable for a join. See my first few notes on AsOf join in #331 (comment). I think @rtpsw and I had already aligned on a number of changes to this proposal, but they haven't updated it yet.
This is roughly equivalent to Acero's `by_key` I believe.
@jvanstraten Thank you for your detailed feedback; I think many of the points you are making make sense. For context, I work with @rtpsw on supporting asof join via Python (ibis) frontend -> substrait -> Acero backend. I think there are two separate goals here:

Ideally we can achieve both goals at the same time, but in reality, because of time/resource constraints, achieving (1) takes much more time, raises more design concerns, and needs more careful thinking. I wonder if there is a way to achieve (2) without:

One idea that I have is to allow an "experimental" relation to be added to Substrait and allow Acero to start using it / iterate on it; after all, the asof join API in Acero is experimental and subject to change. In my mind, "experimental" stuff is a good way for users of Substrait to achieve (2) and iterate faster while making it clear that this is not something officially supported and will change/break.
This is exactly what Substrait supports extensions for. Well, and also for things that Substrait does support, but doesn't support in exactly the way a particular producer/consumer pair would like, and portability is not (yet) a concern. To be more clear, this means using

substrait/proto/substrait/algebra.proto, line 362 in e03b9cf
substrait/proto/substrait/algebra.proto, lines 266 to 270 in e03b9cf

and populating the

I don't think that the "marked as experimental" thing that Arrow does would really work for Substrait, because we use an automated breaking-change detection and versioning system that would not be able to distinguish between breaking changes to experimental vs fully supported features.
Sorry for my response taking some time. I'll need a couple more days before I can give this issue my full attention again. |
Thanks @jvanstraten. Per discussion with @westonpace and Voltron Data folks, this sounds like the best way forward. |
Sorry it took a while to get back to this. Are we ready to move forward? If so, use merge or rebase to update the branch? |
@rtpsw are you able to sign the CLA? |
Signed. |
The branch requires updating; in this repo, do you prefer merge or rebase?
One of the PR checks requires that every commit comment follows semantic commit conventions. So merge commits will not work and rebase is probably necessary. |
@jvanstraten, it looks like you need to reapprove and then rebase can take place automatically. |
What is expected for this repo? A merge commit or a rebase?
@cpcloud ? |
@rtpsw Rebase |
proto/substrait/algebra.proto
Outdated
message JoinRel {
  RelCommon common = 1;
  Rel left = 2;
- Rel right = 3;
+ repeated Rel right = 3;
Please remove the changes to `JoinRel`.
proto/substrait/algebra.proto
Outdated
message AsOfJoinRel {
  RelCommon common = 1;
  Rel left = 2;
  repeated Rel right = 3;
This should follow `JoinRel` and be a single `Rel`.
Hi @westonpace could you please review this? I see this earlier comment: #331 (comment) thank you, Sri |
I'm going to push some commits to this PR that should address many of the review comments. |
ACTION NEEDED Substrait follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
AsOfJoinRel
I have addressed the feedback, and have added an example implementation of an asof join algorithm using @jvanstraten's suggested generalizations. Hopefully this clarifies how it is to be used! |
| Inputs | 1 |
| Outputs | 1 |
| Property Maintenance | Distribution is maintained. Orderedness is by the on-field only post operation. Physical relations may provide better property maintenance. |
| Direct Output Order | The emit order of the left input followed by the emit order of the right input. |
Do we allow de duplication of the on field (and by field)?
I think that's something to be done by ibis or something producing ibis expressions, so I don't think that'll be anywhere in the Substrait spec.
I'm confused about what the next steps here are and who would be taking them. @cpcloud ? |
This is confusing to me. It looks like two separate approaches mashed together. On the one hand, the proto feels like a very academic definition of an asof join that tries to make an asof join and a regular join categorically equivalent. The closest example I can find is https://clickhouse.com/docs/en/sql-reference/statements/select/join/ and they don't support tolerance so it's a bit easier.
- If we are going to be academic about this then I think the terminology of `on` and `expression` is confusing. In SQL syntax a join is `JOIN ... ON join_expression`. The `on` and `expression` are the same thing. I prefer clickhouse's terminology which uses "equality condition" (which is scalar/stateless) and "closest match condition" (which is not, strictly speaking, stateless) and the "join expression" is `equality condition AND closest match condition`. They also limit `closest_match_condition` to `<, <=, >, >=` which makes a lot of sense (frankly, I think you could limit it to just `<, <=` and require the producer to switch the sides if they want `> or >=`).
- In this very academic definition `tolerance` doesn't make sense. Isn't it just a specialization of the join expression? In other words, if your join expression is `l_key == r_key` and your tolerance is `r_on - l_on < constant` then you could just have a join expression `(l_key == r_key) && (r_on - l_on < constant)`.
On the other hand, the markdown is describing a typical time-series join in the way that anyone familiar with asof join would expect to see.
IMO, the most important next step is to pick which of the two approaches we prefer. Frankly, I would prefer the more physical description described in the markdown over what we have described in the proto. We can call it a physical relation if we want.
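The point about `tolerance` being a specialization of the join expression can be checked with a small Python sketch. The rows, the `CONSTANT` value, and all function names below are made up for illustration, and the tolerance is written as `l_on - r_on` (a backward-looking match), which is an assumption about the intended direction.

```python
from itertools import product

# Hypothetical rows; "key" plays the role of the by-key and "on" the on-field.
left = [{"key": "a", "on": 10}, {"key": "b", "on": 10}]
right = [{"key": "a", "on": 3}, {"key": "a", "on": 8}, {"key": "b", "on": 9}]
CONSTANT = 5


def equality_condition(l, r):
    """Scalar, stateless part of the join expression."""
    return l["key"] == r["key"]


def within_tolerance(l, r):
    """Tolerance written as a plain predicate over one pair of rows."""
    return 0 <= l["on"] - r["on"] < CONSTANT


def combined(l, r):
    """Tolerance folded into the join expression."""
    return equality_condition(l, r) and within_tolerance(l, r)


# Tolerance as a separate property versus folded into the join expression.
separate = [(l, r) for l, r in product(left, right)
            if equality_condition(l, r) and within_tolerance(l, r)]
folded = [(l, r) for l, r in product(left, right) if combined(l, r)]


def closest(l, candidates):
    """The one non-stateless step: keep only the nearest candidate per left row."""
    matches = [r for cand_l, r in candidates if cand_l is l]
    return max(matches, key=lambda r: r["on"], default=None)


result = [(l, closest(l, folded)) for l in left]
```

Both formulations select the same candidate pairs, so in this sketch the only as-of-specific step that remains is `closest`, which corresponds to the "closest match condition" in the clickhouse terminology.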
// possible_match_count = 0
//
// # typically on is `left_key <= right_key`
// # typically tolerance is `right_key - left_key < CONSTANT`
I believe this is `left_key - right_key < CONSTANT`, at least that matches `pandas.merge_asof`'s definition (and I'm pretty sure it matches what we have in Acero).
Yep, thanks.
// # track possible matches so we know what to emit
// possible_match_count = 0
//
// # typically on is `left_key <= right_key`
I think this is reversed. In the normal case (non-negative tolerance) your right keys will always be <= the left keys.
Indeed, thank you.
// if possible_match_count and expression(left_row, right_right) and tolerance(left_row.key, right_row.key):
// yield left_row + right fields where match
This doesn't actually work. First, `match` is undefined. Second, you can't apply `expression` this late in the process. Consider:
l_by | l_on | r_by | r_on |
---|---|---|---|
9 | 1 | ||
10 | 2 | ||
10 | 1 |
Given a tolerance of 5 you should output:
l_by | l_on | r_by | r_on |
---|---|---|---|
10 | 1 | 9 | 1 |
However, in this algorithm, `right_row` (which I assume is what is meant by `right_right`) will be `[10, 1]` and will not satisfy `expression`.
Since this is pseudocode and we don't care about performance, just apply `expression` inside the while loop.
> First, `match` is undefined.

It's not meant to be a variable in this case. It's meant to indicate that fields from the right row should be "merged" (or appended, or choose your verb) into the left row. `where match` is probably redundant.
Thanks for pointing this out. I'll correct it, and just evaluate all of the predicates as part of the while loop condition.
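For reference, the agreed correction (evaluate every predicate inside the scan over the right input) can be sketched as runnable Python. This is an illustrative nested-loop version, not the pseudocode from the PR: the `asof_join` function, the tuple row layout `(by, on)`, and the sample data are assumptions, and emitting unmatched left rows with `None` assumes left-join-like behavior.

```python
def asof_join(left_rows, right_rows, expression, on, tolerance):
    """Naive as-of join sketch.

    For each left row, emit it together with the last right row (in input
    order) that satisfies expression, on, and tolerance, or with None when
    no right row qualifies. With the right input sorted ascending on the
    on-field, the last qualifying row is the closest one.
    """
    for left_row in left_rows:
        best = None
        for right_row in right_rows:
            # All predicates are checked per candidate, so a row with the
            # wrong by-key can never shadow an earlier valid match.
            if (expression(left_row, right_row)
                    and on(left_row, right_row)
                    and tolerance(left_row, right_row)):
                best = right_row
        yield left_row, best


# Hypothetical rows laid out as (by, on); the right input is sorted by `on`.
left = [("x", 5), ("y", 5), ("z", 5)]
right = [("x", 1), ("y", 2), ("x", 4), ("y", 20)]

result = list(asof_join(
    left,
    right,
    expression=lambda l, r: l[0] == r[0],     # by-key equality
    on=lambda l, r: r[1] <= l[1],             # right key at or before left key
    tolerance=lambda l, r: l[1] - r[1] < 5,   # left_key - right_key < CONSTANT
))
```

The nested loop is quadratic and ignores the ordering that a real implementation would exploit; it is only meant to pin down which row is emitted for each left row.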
//
// # check that the join key (expression) matches and the tolerance is satisfied
// if possible_match_count and expression(left_row, right_right) and tolerance(left_row.key, right_row.key):
// yield left_row + right fields where match
For clarity we should just say `right row` and not `right fields`. The default emit is the entire row, not just the payload.
Yep, thanks.
@@ -235,6 +235,40 @@ The join operation will combine two separate inputs into a single output, based | |||
``` | |||

## AsOfJoin Operation

The as-of join operation is a time series operation that will combine a left input and a right input into a single output, based on a join expression, an `on` field and a tolerance value. All inputs must have the on-field in ascending-order. The operation is similar to a join operation where the join expression is used for exact matching whereas the `on` field is used for inexact matching up to the tolerance value.
Both `on` and `tolerance` are expressions above and referenced here as `field` and `value`.
@@ -235,6 +235,40 @@ The join operation will combine two separate inputs into a single output, based | |||
``` | |||

## AsOfJoin Operation

The as-of join operation is a time series operation that will combine a left input and a right input into a single output, based on a join expression, an `on` field and a tolerance value. All inputs must have the on-field in ascending-order. The operation is similar to a join operation where the join expression is used for exact matching whereas the `on` field is used for inexact matching up to the tolerance value.
"All inputs must have the on-field in ascending-order" is an odd thing to say given that `on` is an expression.
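To make the semantics under discussion concrete, here is a minimal Python sketch of a backward as-of join in which `on` and `by` are expressions (modelled as callables) rather than field names. The function name `asof_join`, the left-join style output, the backward direction, and the row representation are all assumptions for illustration; none of them come from the PR. The ordering requirement then reads as "each input is sorted ascending by the value of `on(row)`".

```python
from bisect import bisect_right


def asof_join(left, right, on, by, tolerance=None):
    """Backward as-of join over lists of dict rows (illustrative only).

    `on` and `by` are callables (row -> value) standing in for expressions.
    Assumption: both inputs are sorted ascending by on(row).
    Unmatched left rows are paired with None (left-join style).
    """
    # Group right rows by key; each group stays in ascending `on` order.
    groups = {}
    for r in right:
        groups.setdefault(by(r), []).append(r)
    on_values = {k: [on(r) for r in rows] for k, rows in groups.items()}

    out = []
    for l in left:
        key, t = by(l), on(l)
        values = on_values.get(key, [])
        # Index of the last right row in this group with on(r) <= t.
        i = bisect_right(values, t) - 1
        match = None
        if i >= 0 and (tolerance is None or t - values[i] <= tolerance):
            match = groups[key][i]
        out.append((l, match))
    return out
```

With trades as the left input and quotes as the right input, each trade is paired with the most recent quote for the same symbol, or with `None` when no quote falls within the tolerance.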
| Join Expression | A boolean condition that describes whether each record from the left set matches the record from the right set. Field references correspond to the direct output order of the data. | Required. Can be the literal True. |
| Post Join Filter | A boolean condition that describes which join rows appear in the output. | Optional, defaulting to True. |
| Join Type | Same as the [Join](logical_relations.md#join-operator) operator. | Required |
| Tolerance | The maximum on-field value difference in an inexact match. | Optional |
This is incorrect if `tolerance` is an expression.
| Post Join Filter | A boolean condition that describes which join rows appear in the output. | Optional, defaulting to True. |
| Join Type | Same as the [Join](logical_relations.md#join-operator) operator. | Required |
| Tolerance | The maximum on-field value difference in an inexact match. | Optional |
| On | The on-field. | Required |
This is incorrect if `on` is an expression.
RelCommon common = 1;
Rel left = 2;
Rel right = 3;
Expression expression = 4;
This field needs a comment:
a boolean expression that determines which rows are considered matches
Will do, thanks.
That's probably because I haven't yet had a chance to update much on the markdown side.
I don't follow the stateless/stateful distinction, in particular where any state comes in. In any case, we can change the terminology. I like the idea of calling the inequality
Yep, that makes sense to me. I think we can remove
Yep, as I said elsewhere I haven't had a chance to rewrite it.
In my mind, the academic approach is more flexible. At the same time it's probably more annoying to work with. The main thing that I dislike about what you're calling the physical version is that its implementation is motivated by existing code without considering the problem outside of that implementation. Calling it physical doesn't change that; I think it just pushes the problem down an abstraction layer. There's a tension here to be sure, people need to actually get stuff done. I'm aware of that. On the other hand, a more general approach is likely to work for systems that don't have the same exact physical API and representation as Acero. I think I could maybe be convinced to move back towards the physical representation, but
Understandable, I didn't explain much. That comment was motivated by the exercise "why couldn't this be expressed as a regular inequality join?" For example, what makes In fact, this question I believe (I don't really know Spark) describes how to implement an asof join using a union and a window function. I think this statefulness is the fundamental difference between join and asof join.
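As a rough illustration of the union-plus-window idea, the following Python sketch tags rows from both sides, sorts the union by time, and carries the last seen right value forward. The explicit loop stands in for the window function; the function name `asof_via_union` and the `(time, payload)` tuple layout are assumptions made for this example. The variable `last_right` is the running state that a plain inequality join does not need.

```python
def asof_via_union(left, right):
    """As-of join via union + ordered scan (illustrative only).

    left:  list of (time, payload) tuples
    right: list of (time, value) tuples
    Returns one (time, payload, last_right_value) tuple per left row.
    """
    # Tag each row with its side. At equal times, right rows (side 0)
    # sort before left rows (side 1), so an exact-time match is visible.
    tagged = [(t, 0, v) for t, v in right] + [(t, 1, p) for t, p in left]
    tagged.sort(key=lambda row: (row[0], row[1]))

    last_right = None  # running state carried across rows
    out = []
    for t, side, data in tagged:
        if side == 0:
            last_right = data  # remember the most recent right value
        else:
            out.append((t, data, last_right))
    return out
```

A left row that precedes every right row is emitted with `None`, since no state has been accumulated yet.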
I disagree that the physical version is motivated purely by code. It is usually motivated by the specific problem "I want to know what the most recent ticker values were when trades happened". That leads to a very natural understanding of what I called the physical implementation. I think the fact that the top 5 Google results [1][2][3][4][5] for asof join sound more like the physical version than the logical version is evidence of this. This is including the kdb version which at least tries to be domain independent.
I don't know if annoying is the right word. It's going to lead to lots of consumers that only partially implement the spec. I think this is a more general problem with logical relations though.
That's why I called it physical :) That being said, I know of 4 implementations (pandas, kdb, clickhouse, acero) that would be able to satisfy that physical implementation. To be clear, I am perfectly fine with eventually supporting both a logical and a physical relation. So this isn't "which one do we want" as much as "which one is this PR trying to provide?"
Statefulness aside :) How should we proceed? I still don't know how
I've done some more reading and I believe what we are stumbling towards is a window join. Asof join is a specialization of window join. Tolerance is how asof join defines the windows. @rtpsw and I have an engine that is capable of performing asof join but not window join. There are other engines (e.g. pandas) that are in this same situation. AFAIK, no engine other than kdb actually has an implementation of the more generic window join.
If Acero only has asof join and Ibis only wants window join then I see no path forward. I don't think it makes sense to have a specification without two implementations. Either we wait until Acero adds window join (not likely in the near future), Ibis supports asof join (don't know what this entails) or someone writes an optimizer/translator that supports both.
I’m an expert in streaming/temporal query and I don’t share your faith that window join will solve all problems. There are a lot of gnarly query patterns for streaming/temporal join. If you try to put them all under the banner of “window join” you’ll end up with a very complex definition of “window”. My advice is to deal with the common cases separately and don’t try to force them to converge.
We now have ConsistentPartitionWindowRel defined and merged. Does that meet the requirements of AsOfJoinRel?
Nope. A window join might do the trick but Even if a window join were defined we would probably still want an asof-join physical relation for the reasons that @julianhyde alluded to. For evidence I will point to:
@EpsilonPrime if we are going to revive this then I would propose a good starting point is to come up with python pseudo-code that we can all agree is correct. The current attempt would yield the incorrect answer if the The current pseudocode also does not correctly handle The second thing we need to do is decide how to handle
I think I am agreeing more with @cpcloud these days. Let's just make this a logical asof join and get rid of tolerance entirely. The producer, if they want it, can encode it in the expression. The consumer, if they don't support it, can reject the expression (both duckdb and clickhouse mandate the expression consist only of equality conditions).
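A minimal Python sketch of what that logical form could look like follows. It assumes the relation takes only a boolean join expression plus an ordering expression, and picks, for each left row, the matching right row with the greatest ordering value. The names `logical_asof_join`, `expr`, and `TOLERANCE` are hypothetical and do not come from the PR. Tolerance is not a property of the relation here; it is just one more conjunct in the expression.

```python
def logical_asof_join(left, right, expression, order):
    """Logical as-of join (illustrative only, O(len(left) * len(right))).

    expression: callable (left_row, right_row) -> bool
    order:      callable (right_row) -> sortable value
    For each left row, keeps the matching right row with the greatest
    order value, or None when nothing matches.
    """
    out = []
    for l in left:
        candidates = [r for r in right if expression(l, r)]
        best = max(candidates, key=order, default=None)
        out.append((l, best))
    return out


TOLERANCE = 3  # hypothetical value, encoded in the expression below


def expr(l, r):
    # Equality on the key, an inequality on time, and the tolerance
    # bound written as an ordinary condition.
    return (
        l["sym"] == r["sym"]
        and r["t"] <= l["t"]
        and l["t"] - r["t"] <= TOLERANCE
    )
```

A consumer that only accepts equality conditions could inspect `expression` and reject a plan containing the tolerance conjunct, which matches the behaviour described in the comment above.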
I'm closing this as abandoned. Can reopen if someone is actually working on it.
See #330