Add support for lambdas as function arguments #349
If I read your level 2 correctly, you can define a closure - that is, a value that is a function pointer plus zero or more values. The function pointer may be a lambda, or it may be defined some other way (e.g. a built-in function). And the values may themselves be closures. If the function pointer is allowed to call itself, congratulations, Substrait is now Turing complete! I think support for closures is a great thing to aim for. It presents great challenges (e.g. with Turing completeness you have to worry about termination and unbounded memory usage) but great rewards (you now have the kernel of a functional programming language, not just a SQL database). That's a journey to be made in several steps.
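The closure model described here (a function pointer plus captured values, where self-reference yields general recursion) can be sketched in a few lines of Python. This is a hypothetical illustration; `make_closure` and `call` are made-up names, not Substrait APIs:

```python
# A closure modeled explicitly as (function pointer, captured values),
# rather than relying on Python's own closure support.
def make_closure(fn, *captured):
    return (fn, captured)

def call(closure, *args):
    fn, captured = closure
    return fn(captured, *args)

# A captured value parameterizes the function: `add_five` captures 5.
def _add(captured, x):
    (n,) = captured
    return x + n

add_five = make_closure(_add, 5)
assert call(add_five, 2) == 7

# If the function pointer can reach the closure itself (here via a
# one-element cell), we get general recursion -- the ingredient that
# makes the system Turing complete.
def _fact(captured, n):
    (cell,) = captured
    return 1 if n <= 1 else n * call(cell[0], n - 1)

cell = [None]
fact = make_closure(_fact, cell)
cell[0] = fact  # tie the knot: the closure captures itself
assert call(fact, 5) == 120
```

Note that the recursive case only works because the captured cell is mutated after construction; in a purely tree-shaped plan there is no way to tie that knot, which is the point made in the reply below about forward references.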
I agree with what you're saying (I don't expect level 2 to happen anytime soon if ever, I'll be happy to just get level 0 sorted and fix argument evaluation semantics), but just want to make a few notes.
Even if we don't disallow it, I think in the implementations I'm suggesting there's no way to do it. I'm pretty sure you'd need either a forward reference, or a lambda that takes its own signature as an argument. The former doesn't work because the expression system is tree-based and relations are DAG-based and thus don't contain cycles (IMO expressions should also be DAG-based, but that's a different story and doesn't change this argument). The latter doesn't work because the lambda type would be recursive; you'd need something of the form

When including extensions though, you could make lambdas Turing complete using something like
It's already trivial to break things with a Substrait plan. A consumer could already just define
It's not technically unbounded, but you can already easily bring a naive consumer to its knees by just spamming

The only way to deal with the above two things in general is just to impose memory, performance, and time limits on the execution of untrusted queries.
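As a toy illustration of the "spamming nesting" problem and the kind of limit a consumer can impose up front, here is a hedged sketch of a recursive expression evaluator with a hard depth cap (all names hypothetical, not part of any Substrait implementation):

```python
import operator

# Hypothetical sketch: an evaluator that refuses plans nested beyond a
# fixed depth, so an adversarial plan that just spams nesting cannot
# blow the stack or run away.
MAX_DEPTH = 64

class PlanTooDeep(Exception):
    pass

def evaluate(expr, depth=0):
    """expr is either a literal number or ('call', fn, [args...])."""
    if depth > MAX_DEPTH:
        raise PlanTooDeep(f"expression nesting exceeds {MAX_DEPTH}")
    if isinstance(expr, (int, float)):
        return expr
    _, fn, args = expr
    return fn(*(evaluate(a, depth + 1) for a in args))

# A well-behaved expression evaluates fine...
assert evaluate(('call', operator.add, [1, ('call', operator.mul, [2, 3])])) == 7

# ...while a pathologically nested one is rejected.
deep = 0
for _ in range(1000):
    deep = ('call', operator.neg, [deep])
try:
    evaluate(deep)
    assert False, "should have been rejected"
except PlanTooDeep:
    pass
```

Memory and wall-clock limits would sit outside the evaluator (OS-level rlimits, query timeouts), but a structural cap like this is the cheapest first line of defense.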
I'm generally in support of introducing lambdas at some point. I can see Level 2 making a lot of sense. That being said, I'm concerned about creating more surface area ahead of when we need it. It leads to later desires to refactor as we more fully understand the requirements we have. I also feel like in some ways, some of the arguments here for lambdas are to decompose some existing composite concepts into simpler primitives. While I think that can be valuable, I also think that there is a danger of over-decomposition. As an example, collation is a first-class concept in databases. Forcing people to define lambdas to communicate collation in Substrait seems like a step too far. We sit at a much higher level than a compiler, and we should embrace that higher level of abstraction. IOW, flexibility often comes at the cost of simplicity.
As a consumer, a lambda doesn't seem much different than an embedded UDF. For example, your lambda above has

I do think "query authors" (who are not likely to be using Substrait directly) will want lambda support. For example, in Ibis, I might want to do something like you described to specify my sort function. However, there is nothing stopping Ibis from compiling that UDF into WASM or "pickled python" and sending that as a UDF, completely transparent to the end user. So I don't think lambdas are giving us a usability benefit here.

If we accept the above two facts then this boils down to something like "is Substrait's expression language somehow preferable to producers or consumers over WASM's expression language when it comes to UDFs?", and I think the answer is no.
LOL, I totally think the answer to the first comment is a resounding yes. The moment we start to think about optimizers working with Substrait plans, it becomes very beneficial to be able to understand and decompose operations. This provides all sorts of optimization possibilities. For example, it could be that the optimizer decides to do some form of subexpression elimination because an expression occurs twice, once as a direct expression and once as part of a lambda expression. If one of those was just a wasm black box, that's going to be nigh impossible. That doesn't mean black boxes are bad; I just think that we only want black boxes when we have to have them. A concrete example of this is UDFs inside of Snowflake. You can write them in several languages, including something called Snowflake scripting. Because of the nature of this kind of horizon, there are all sorts of things Snowflake scripting can do that the other languages can't (in addition to various performance differences). Something similar would probably be true when you think about something like LLVM optimization windows. If I convert part of my expression tree into wasm from a logical Substrait tree (because I don't have lambdas), I imagine the injected boilerplate of the conversion will result in far fewer optimization opportunities as opposed to having a single view of the maximum number of operations.
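To make the subexpression-elimination point concrete, here is a toy sketch (tuple-encoded expression trees, hypothetical names): when both expressions are transparent trees, the optimizer can find the shared subtree, but when one side is an opaque blob there is nothing to match against.

```python
# Walk every subtree of a tuple-encoded expression like
# ('and', ('field', 0), ('big_expression', ...)).
def subtrees(expr):
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)

def common_subexpressions(a, b):
    tuples_a = {s for s in subtrees(a) if isinstance(s, tuple)}
    tuples_b = {s for s in subtrees(b) if isinstance(s, tuple)}
    return tuples_a & tuples_b

big = ('big_expression', ('field', 0), ('field', 1))
e1 = ('and', ('field', 0), big)
e2 = ('and', ('field', 1), big)

# Both sides transparent: the shared subtree is discoverable.
assert big in common_subexpressions(e1, e2)

# One side an opaque black box: the optimizer sees nothing shared,
# even if the blob happens to compute the same thing.
opaque = ('wasm_udf', b'\x00asm-bytecode')
assert big not in common_subexpressions(e1, opaque)
```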
Yes, but then there are closures, which act like both lambdas and UDFs (i.e. you can pass them around as values and call them). Closures definitely cannot be replaced by UDFs, because they can be created at run time. Some people use 'lambda' and 'closure' interchangeably, so I want to be clear that I would like closure support. It is probably possible to support closures without extending the expression language: just add some functions for partial evaluation (uncurrying).
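A minimal Python sketch of that partial-evaluation idea, using `functools.partial` as a stand-in for a hypothetical Substrait-level partial-application function: binding some arguments of an existing function produces a closure at run time, with no new expression syntax.

```python
from functools import partial

# An ordinary binary function the engine already provides.
def distance(origin, point):
    ox, oy = origin
    px, py = point
    return ((px - ox) ** 2 + (py - oy) ** 2) ** 0.5

# `target` could come from run-time data (e.g. a column value);
# binding it yields a unary key function -- a closure -- suitable
# for passing to a sort or min.
target = (1.0, 1.0)
distance_to_target = partial(distance, target)

points = [(0.0, 0.0), (1.0, 2.0), (4.0, 4.0)]
assert min(points, key=distance_to_target) == (1.0, 2.0)
```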
I'm not convinced these corner case optimizations are worth the work for any consumer to implement but I will admit there is a possibility here. I guess I'd like a concrete example. Sorting by a custom key (e.g. in Jeroen's example) could just be achieved by a projection before the sort node.
Again, can we give an example of how a closure would be used in a query? Closures capture state, but it's not clear to me what "state" would mean in the context of a query.
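The projection-before-sort rewrite mentioned above can be sketched on plain Python dicts (column names hypothetical): project the key as an extra column, sort on it, then drop it again.

```python
# Input relation as a list of records.
rows = [{'name': 'b', 'x': 3.0},
        {'name': 'a', 'x': -5.0},
        {'name': 'c', 'x': 1.0}]

# ProjectRel: add a computed key column (here, |x|).
projected = [dict(r, sort_key=abs(r['x'])) for r in rows]

# SortRel: plain ascending sort on the projected column.
projected.sort(key=lambda r: r['sort_key'])

# Final ProjectRel: drop the helper column again.
result = [{k: v for k, v in r.items() if k != 'sort_key'} for r in projected]

assert [r['name'] for r in result] == ['c', 'b', 'a']
```

This shows why the simple custom-key case does not strictly need lambdas; the later comments argue lambdas still win on ergonomics and optimizability.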
Here's an example in Morel:
The expression
Here's another example. The
The example is from slide 36 of Morel, a data-parallel programming language.
Thank you for the examples; in particular the first one, using apply on a list array, is a very helpful example for my mental model, and not something easily achievable today. As a consumer I feel like we are blurring two different contexts, and maybe that is ok. I'll call them an "array evaluation context" and a "singular evaluation context" (avoiding scalar and vector as those can be misleading/ambiguous). In Acero I have an expression evaluation for the array evaluation context. This takes the form of

I suppose "Weston doesn't have an efficient solution" is not a very good reason to prevent moving forwards on something. However, if I were tasked with coming up with an efficient solution, I would think code generation would be the way to go (Substrait expression -> machine code), and I really don't want to write a brand new code generation library, so then I'm left looking at how to convert Substrait expressions to WASM / LLVM, which maybe isn't a terrible thing. It does feel like this shifts the project from "let's define a universal library of functions for query engines" to "let's define a universal standard library for programming languages", and I think there are other approaches to that latter task (WASM / LLVM).
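For what it's worth, the "array evaluation context" versus "singular evaluation context" distinction can be sketched in plain Python (`list_apply` is a hypothetical name, not an Acero or Substrait API):

```python
# Apply a one-argument expression to every element of every list,
# preserving the per-record list structure.
def list_apply(list_column, fn):
    return [[fn(elem) for elem in lst] for lst in list_column]

records = [
    {'xs': [1, 2, 3], 'y': 10},
    {'xs': [4],       'y': 20},
]

# Singular context: the expression runs once per record.
assert [r['y'] * 2 for r in records] == [20, 40]

# Array context: the lambda runs once per *element*, nested under the
# record that owns the list.
assert list_apply([r['xs'] for r in records], lambda e: e + 1) == [[2, 3, 4], [5]]
```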
I don't totally follow your Acero example but I can imagine that you have some operations on relations (potentially unbounded streams of records) and other operations on arrays (bounded sets of records or scalars nested inside a record). When executing on actual hardware, with memory limits, that makes perfect sense. That is the world that Acero and Substrait need to operate in. Morel is a high-level language that makes no distinction between top-level relations and nested relations. The
In my opinion, framing relational algebra in functional programming terms makes a lot of sense. And if I am right, people will want their data-parallel framework to be able to execute programs expressed in a functional programming language. They will want to represent closures among their data values, and to put user-defined higher-order functions (table functions and aggregate functions) into their DAGs. Those programs will need to translate down to Substrait. So if the puck is going in that direction, Substrait should be heading there too.
I think we briefly discussed lambdas two sync meetings ago already. It took me a while to make this writeup for various reasons, but here goes.
User-defined functions vs lambdas vs embedded functions
IMO those are three very different things with different use cases.
I think it's really difficult to specify embedded functions in a consumer-agnostic way, while also retaining reasonable performance. For it to work at all, I guess Substrait would have to provide some interface API that each engine would then have to implement, such that the function really only combines a bunch of kernels provided by the engine together. If at any point it needs to do something that this API does not provide, the function either becomes consumer-specific or probably requires some level of serdes. We haven't even talked about the various encapsulations/languages that an embedded function might be written in (either consumers have to support all or a good portion of them or Substrait has to pick a preferred format, otherwise we lose consumer-agnosticism), or the fact that embedded functions are a black box to query optimizers. I'm not saying embedded functions don't have a place, but they are a highly suboptimal and very difficult-to-specify last-ditch effort to make things work in general.
Level 0: niladic lambda argument types
In the simplest case, what I'm calling a lambda here is just a Substrait expression that you can pass to a function, rather than passing the value returned by the expression. We're effectively already doing this for short-circuiting functions like boolean `and`, wherein the function gets to decide whether its arguments are evaluated at all. There is a big difference between the two in the presence of side effects, because many scalar expression transformations in optimizers rely on expression evaluation order. Substrait has almost no side effects, but we do support functions erroring out.

Example: if I write a projection with two new expressions, `and(field(0), big_expression(...))` and `and(field(1), big_expression(...))`, then a naive common subexpression elimination step might want to push that `big_expression(...)` to a preprocessing `ProjectRel`. However, if that big expression can error out for records where neither `field(0)` nor `field(1)` is true, then that optimization has changed behavior. Not to mention that it might have been a bad idea for the optimizer to do this in the first place, because `field(0)` and `field(1)` might be true only very rarely. Now... an optimizer might want to have a special case for something like `and` anyway, especially if it has heuristic knowledge of how often those fields are true. But shouldn't anyone be allowed to make their own extension functions with short-circuit behavior?

Whether we decide to support lambdas as a whole or not, there should at least be a way to specify the difference. This would only require a new flag field in the YAMLs for value arguments. But please note that saying that a value argument is actually an expression that the function may or may not evaluate is pretty much the same thing as just passing a niladic lambda.

Level 1: n-adic lambda argument types
It seems like a small step to extend that to n-adic lambdas; after all, expressions already execute in N record contexts (multiple for subqueries). So in the simplest case we could just say that an expression in a lambda argument context gets another "record" pushed on the stack for `FieldRef`s to access, just like what happens for expressions executing within a subquery, but rather than the record coming from a table or something, the function provides the records to call the expression on. This gives you not only callback functions basically for free (the expression could just be any extension function call), but it also lets you shuffle arguments, or compose multiple extension functions together to form the lambda.

For example, let's say I have field 0: `LIST<STRUCT<fp32, fp32>>` and field 1: `STRUCT<fp32, fp32>`, and I want to get the coordinate from field 0 that is closest to field 1. In the presence of lambdas like this, I could write this as a call to a function with prototype `array_min_with_key(T, lambda (T) -> S) -> T`. I imagine in this case the arguments passed to a lambda argument would be derived types, while the return type of the lambda is pattern-matched (just like a value argument). Of course I could have written the entire `min` operation as an embedded function as well, but this comes with all the downsides I mentioned before.

Level 2: n-adic lambda data types
We could take this one step further and instead add an entirely new compound type for lambdas, and thus allow lambdas to be passed around as values. In this case there would be a new literal type corresponding to that compound type, and the type of that literal would be `LAMBDA<STRUCT<arguments...>, body_return_type>`. This is much more powerful because it is a true implementation of higher-order functions, but it is a pretty big update. Note that this option is not necessarily mutually exclusive with just adding a lambda argument type for now, so we could always decide to add this later if someone wants it. Note also that everyone can already do this with a type extension if they really want to.

Why not just a function reference argument type?
So: pretty much passing function pointers to functions, but not specifying a way to define those functions within the plan (aside from embedded functions). This kind of addresses the same things that a lambda addresses, and is how custom sort functions are already implemented now. However... this is much less powerful.

For example, if that `array_min_with_key` function above were to take its key function that way, you would still either need an embedded function to encapsulate the functionality (assuming those can be referred to this way at all and won't just be an expression type!) or need the consumer to coincidentally implement exactly the function you're looking for. For this particular example though, the lambda is impure: it depends on the target coordinate field in the schema. I don't know if embedded functions can generally do that, but an extension function certainly can't. You could certainly make an extension function that takes two coordinates and returns the distance, but that doesn't work, because `array_min_with_key` would expect a function that takes only one coordinate. For the same reason, there's no way to specify optional or required enumeration arguments either. If the engine provides a distance-to-origin function, we could do some translations outside of the `min` to make things work, but then what if the engine's function implementation has a `[EUCLIDIAN, CARTESIAN]` enumeration argument? There's no easy way to pass that.

Of course we could just throw special cases at the problem and make some sort of argument swizzle message. But IMO you've then pretty much just made a less useful lambda with extra steps.
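The arity/enumeration mismatch being described can be sketched in Python (all names hypothetical): the engine's `distance` function is binary and takes an enumeration option, but the higher-order function wants a unary key. A lambda adapts in place; a bare function reference cannot.

```python
# The consumer's existing binary function, with an enumeration option.
def distance(a, b, metric='EUCLIDIAN'):
    assert metric in ('EUCLIDIAN', 'CARTESIAN')
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# The higher-order function expects a key taking exactly one argument,
# so a reference to `distance` itself cannot be passed directly.
def array_min_with_key(array, key):
    return min(array, key=key)

coords = [(0.0, 0.0), (2.0, 2.0), (5.0, 5.0)]
target = (3.0, 3.0)

# The "lambda" binds the second argument and the enumeration option,
# making the binary engine function fit the unary slot.
closest = array_min_with_key(coords,
                             lambda c: distance(c, target, metric='EUCLIDIAN'))
assert closest == (2.0, 2.0)
```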
These same problems apply to custom sort functions. Unless the engine provides scalar functions explicitly intended to be used in that context, and they don't need any optional arguments for corner cases, and all such functions are quadruplicated for the four orderings (asc/desc, nulls first/last), it's not going to work. Not to mention that you commonly implement sorts with `less_than` functions rather than `[-1, 0, 1]` functions, because you already need `less_than` anyway. Lambdas would be much more ergonomic here.

In fact, we could have done everything that `SortField` already does right now, without special-cased message types for it, without relying on some implicit and currently unspecified default sort order of types for the common case, and without adding too much complexity to the tree structure, with just a few core functions. For example:

I'm assuming here that `true` means "in the right order", `false` means "in the wrong order", and `null` means "equivalent", instead of `[1, 0, -1]`.

The rest is just down to defining `less_than` and `greater_than` functions. I'm not saying we should remove the `SortField` abstraction if we add lambdas (I'm also not not saying it though); I'm just arguing that if we had started off with lambdas as a concept, we would never have had to define it in the first place.
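Since the original example expression is not shown above, the following is only a hedged Python sketch of the three-valued convention described there (True = in the right order, False = in the wrong order, None = equivalent; all names hypothetical):

```python
from functools import cmp_to_key

# One three-valued `in_order` function covering ascending with nulls last.
def in_order_asc_nulls_last(a, b):
    if a == b:
        return None        # equivalent
    if a is None:
        return False       # a null before a value is the wrong order
    if b is None:
        return True        # a value before a null is the right order
    return a < b

# A generic sort built on any such three-valued function.
def sort_with(rows, in_order):
    def cmp(a, b):
        r = in_order(a, b)
        return 0 if r is None else (-1 if r else 1)
    return sorted(rows, key=cmp_to_key(cmp))

assert sort_with([3, None, 1, 2], in_order_asc_nulls_last) == [1, 2, 3, None]

# Descending with nulls first is a wrapper, not a fourth hand-written
# function -- the "quadruplication" disappears.
def flip(in_order):
    return lambda a, b: in_order(b, a)

assert sort_with([3, None, 1, 2], flip(in_order_asc_nulls_last)) == [None, 3, 2, 1]
```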