-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Value vs "expression" (lambda-ish) function arguments #320
Comments
Operators in Substrait are functional set operations. Each operator take in N sets of data and output one, so there can't be side effects (at least how I would generally use the term). Functions like Random doesn't have side effects, it is just non-deterministic. The one exception (no pun intended) that I see as something that one could consider a side effect is exceptions. If I do
|
Ah, good point. I hadn't considered exceptions.
Depends on whether it's seeded or not. But I'm fine with requiring that functions don't have side effects aside from exceptions, in which case seeded random is immediately off the table. Either way, the point I'm trying to make is that there is currently no way for a machine to know from just looking at a plan and associated YAML files whether something of for example the form
You're glossing over the reason why I used sort there though, namely that sorts tend to be passed callbacks to pass a comparator function. We've largely been skirting around that issue because for aggregate function bindings sorts are special-cased, but that workaround isn't available for a scalar function operating on the |
A few DBMSs have what they call 'lambda support' or 'higher-order functions', including operations such as True lambda support would allow function-values (lambdas and closures) that can be stored and processed in the same way as other values. The examples of higher-order that I have seen - filter, map, sort - can all be implemented using a sub-query, albeit a little less concisely. I suggest that substrait doesn't yield to this fad of putting a functional-programming-like veneer on relational algebra, and instead waits for real first-class-functions to arrive. |
Side effects are disputed in substrait-io/substrait#320
That's a good point; that would be taking it one step further. In that case, I suppose you would also need an expression type that constructs a lambda out of another expression and an argument type pack, and a data type that might look something like |
Yes, a function plus a (possibly empty) struct of argument values is a good way to implement lambdas and other closures. The danger of doing it the "constant" way is that you bake in too many assumptions based on the constant case. For example:
If you don't steer clear of those assumptions you might have to break compatibility when true first-class functions arrive. |
Substrait already has several "argument types," only one of which actually binds to values that exists in the normal type system, so lambda arguments could just be added to that list without making things any less consistent than they already are. They would simply bind to an expression tree with a new argument reference expression type as the leaves, of which the return type is just the return type of the lambda (rather than the lambda itself). If we then later decide to add first-class lambdas, we'd just add a lambda literal that constructs a lambda out of the context of one of those special argument slots, at which point we can deprecate the special lambda function argument type (or choose to just support both). It's certainly cleaner to do it right the first time, but that doesn't necessarily mean it's also better, because putting lambdas in the normal data type system that's also used for database schemas and the likes might be a bit alarming to the typical Substrait user because they're not really data to begin with. I don't think it's necessarily as complex to add to the spec as I originally thought, though. |
This issue appears to be covering two topics. First, clarifying our definition of short circuiting and, second, adding support for lambdas (expressions that are, themselves, arguments). I think we should leave this ticket open for lambdas (as the title implies). We can clarify short-circuiting in a separate ticket should someone choose. |
Also, on the topic of lambdas, I am +1 on adding these and I agree that (if I am reading the above correctly) making a "lambda type class" is a good way to tackle this. |
I am closing this in favor of #349 as I think we probably only need one issue open for lambdas. However, both that issue and this issue have useful info. |
This topic has come up in various forms a few times now, and in retrospect has IMO been overlooked by the specification entirely thus far, despite functions being defined for either binding method.
When a value argument is bound to a function, there is a distinction between binding the expression itself (which I'll call an expression argument) and binding the data value returned by the expression (which I'll call a value argument). The difference here is that, for expression arguments, the function can determine whether it wants to evaluate the expression (or it may even choose to evaluate it multiple times), while it can't for value arguments.
These are two fundamentally different things when expressions can have side effects (like random(), for instance) or if we want to care about short-circuiting behavior of functions like
and(a, b)
. Description of short-circuiting behavior is already used all over the specification (for example for if-then-else expressions), so I have to assume that we care about this. If we don't consider this in scope, i.e. we require that expressions never have side effects and leave it up to the consumer to determine whether it short-circuits (which is just an optimization, then), we should remove references to short-circuiting behavior or clarify that they are only recommendations for performance.The difference is especially important for an optimizer, because things like common-subexpression elimination only preserve behavior for value arguments and for expressions known to not have side effects. On that note, we're also not currently specifying in the YAMLs whether a function has side effects to begin with.
If it helps to conceptualize this, the representation of an expression argument in a typical programming language would be a lambda function with no arguments. Later on, we could extend this concept to passing lambda expressions with arguments, which, I suppose, would lead to a new expression type to refer to the argument values set by the function (similar to FieldRef, or just an extension thereof using a different root type). A feature like that is fundamentally necessary to represent functions like
sort_list(LIST<T>, |T, T| -> boolean) -> LIST<T>
in a generic way, to let you write for examplesort_list(x, less_than(ArgRef(0), ArgRef(1)))
, but also more complex things like sorting based on a particular field in a list of structs (sort_list(x, less_than(FieldRef(root=ArgRef(0), 1), FieldRef(root=ArgRef(1), 1)))
to compare based on the second field of the struct, for example).The text was updated successfully, but these errors were encountered: