Add support for lambdas as function arguments #349
If I read your level 2 correctly, you can define a closure - that is, a value that is a function pointer plus zero or more values. The function pointer may be a lambda, or it may be defined some other way (e.g. a built-in function). And the values may themselves be closures. If the function pointer is allowed to call itself, congratulations, Substrait is now Turing complete! I think support for closures is a great thing to aim for. It presents great challenges (e.g. with Turing completeness you have to worry about termination and unbounded memory usage) but great rewards (you now have the kernel of a functional programming language, not just a SQL database). That's a journey to be made in several steps.
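The closure model described here (a function pointer plus captured values, where self-reference yields general recursion) can be sketched in a few lines of Python. This is a hypothetical illustration; `make_closure` and `call` are made-up names, not Substrait APIs:

```python
# A closure modeled explicitly as (function pointer, captured values),
# rather than relying on Python's own closure support.
def make_closure(fn, *captured):
    return (fn, captured)

def call(closure, *args):
    fn, captured = closure
    return fn(captured, *args)

# A captured value parameterizes the function: `add_five` captures 5.
def _add(captured, x):
    (n,) = captured
    return x + n

add_five = make_closure(_add, 5)
assert call(add_five, 2) == 7

# If the function pointer can reach the closure itself (here via a
# one-element cell), we get general recursion -- the ingredient that
# makes the system Turing complete.
def _fact(captured, n):
    (cell,) = captured
    return 1 if n <= 1 else n * call(cell[0], n - 1)

cell = [None]
fact = make_closure(_fact, cell)
cell[0] = fact  # tie the knot: the closure captures itself
assert call(fact, 5) == 120
```

Note that the recursive case only works because the captured cell is mutated after construction; in a purely tree-shaped plan there is no way to tie that knot, which is the point made in the reply below about forward references.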
I agree with what you're saying (I don't expect level 2 to happen anytime soon if ever, I'll be happy to just get level 0 sorted and fix argument evaluation semantics), but just want to make a few notes.
Even if we don't disallow it, I think in the implementations I'm suggesting there's no way to do it. I'm pretty sure you'd need either a forward reference, or a lambda that takes its own signature as an argument. The former doesn't work because the expression system is tree-based and relations are DAG-based and thus don't contain cycles (IMO expressions should also be DAG-based, but that's a different story and doesn't change this argument). The latter doesn't work because the lambda type would be recursive; you'd need something of the form

When including extensions though, you could make lambdas Turing complete using something like
It's already trivial to break things with a Substrait plan. A consumer could already just define
It's not technically unbounded, but you can already easily bring a naive consumer to its knees by just spamming

The only way to deal with the above two things in general is just to impose memory, performance, and time limits on the execution of untrusted queries.
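As a toy illustration of the "spamming nesting" problem and the kind of limit a consumer can impose up front, here is a hedged sketch of a recursive expression evaluator with a hard depth cap (all names hypothetical, not part of any Substrait implementation):

```python
import operator

# Hypothetical sketch: an evaluator that refuses plans nested beyond a
# fixed depth, so an adversarial plan that just spams nesting cannot
# blow the stack or run away.
MAX_DEPTH = 64

class PlanTooDeep(Exception):
    pass

def evaluate(expr, depth=0):
    """expr is either a literal number or ('call', fn, [args...])."""
    if depth > MAX_DEPTH:
        raise PlanTooDeep(f"expression nesting exceeds {MAX_DEPTH}")
    if isinstance(expr, (int, float)):
        return expr
    _, fn, args = expr
    return fn(*(evaluate(a, depth + 1) for a in args))

# A well-behaved expression evaluates fine...
assert evaluate(('call', operator.add, [1, ('call', operator.mul, [2, 3])])) == 7

# ...while a pathologically nested one is rejected.
deep = 0
for _ in range(1000):
    deep = ('call', operator.neg, [deep])
try:
    evaluate(deep)
    assert False, "should have been rejected"
except PlanTooDeep:
    pass
```

Memory and wall-clock limits would sit outside the evaluator (OS-level rlimits, query timeouts), but a structural cap like this is the cheapest first line of defense.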
I'm generally in support of introducing lambdas at some point. I can see Level 2 making a lot of sense. That being said, I'm concerned about creating more surface area ahead of when we need it. It leads to later desires to refactor as we more fully understand the requirements we have. I also feel like in some ways, some of the arguments here for lambdas are to decompose some existing composite concepts into simpler primitives. While I think that can be valuable, I also think that there is a danger of over-decomposition. As an example, collation is a first-class concept in databases. Forcing people to define lambdas to communicate collation in Substrait seems like a step too far. We sit at a much higher level than a compiler, and we should embrace that higher level of abstraction. IOW, flexibility often comes at the cost of simplicity.
As a consumer, a lambda doesn't seem much different than an embedded UDF. For example, your lambda above has

I do think "query authors" (who are not likely to be using Substrait directly) will want lambda support. For example, in Ibis, I might want to do something like you described to specify my sort function. However, there is nothing stopping Ibis from compiling that UDF into WASM or "pickled python" and sending that as a UDF, completely transparent to the end user. So I don't think lambdas are giving us a usability benefit here.

If we accept the above two facts then this boils down to something like "is Substrait's expression language somehow preferable to producers or consumers over WASM's expression language when it comes to UDFs?", and I think the answer is no.
LOL, I totally think the answer to the first comment is a resounding yes. The moment we start to think about optimizers working with Substrait plans, it becomes very beneficial to be able to understand and decompose operations. This provides all sorts of optimization possibilities. For example, it could be that the optimizer decides to do some form of subexpression elimination because an expression occurs twice, once as a direct expression and once as part of a lambda expression. If one of those was just a wasm black box, that's going to be nigh impossible. That doesn't mean black boxes are bad; I just think that we only want black boxes when we have to have them. A concrete example of this is UDFs inside of Snowflake. You can write them in several languages, including something called Snowflake scripting. Because of the nature of this kind of horizon, there are all sorts of things Snowflake scripting can do that the other languages can't (in addition to various performance differences). Something similar would probably be true when you think about something like LLVM optimization windows. If I convert part of my expression tree into wasm from a logical Substrait tree (because I don't have lambdas), I imagine the injected boilerplate of the conversion will result in far fewer optimization opportunities as opposed to having a single view of the maximum number of operations.
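To make the subexpression-elimination point concrete, here is a toy sketch (tuple-encoded expression trees, hypothetical names): when both expressions are transparent trees, the optimizer can find the shared subtree, but when one side is an opaque blob there is nothing to match against.

```python
# Walk every subtree of a tuple-encoded expression like
# ('and', ('field', 0), ('big_expression', ...)).
def subtrees(expr):
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)

def common_subexpressions(a, b):
    tuples_a = {s for s in subtrees(a) if isinstance(s, tuple)}
    tuples_b = {s for s in subtrees(b) if isinstance(s, tuple)}
    return tuples_a & tuples_b

big = ('big_expression', ('field', 0), ('field', 1))
e1 = ('and', ('field', 0), big)
e2 = ('and', ('field', 1), big)

# Both sides transparent: the shared subtree is discoverable.
assert big in common_subexpressions(e1, e2)

# One side an opaque black box: the optimizer sees nothing shared,
# even if the blob happens to compute the same thing.
opaque = ('wasm_udf', b'\x00asm-bytecode')
assert big not in common_subexpressions(e1, opaque)
```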
Yes, but then there are closures, which act like both lambdas and UDFs (i.e. you can pass them around as values and call them). Closures definitely cannot be replaced by UDFs, because they can be created at run time. Some people use 'lambda' and 'closure' interchangeably, so I want to be clear that I would like closure support. It is probably possible to support closures without extending the expression language: just add some functions for partial evaluation (uncurrying).
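A minimal Python sketch of that partial-evaluation idea, using `functools.partial` as a stand-in for a hypothetical Substrait-level partial-application function: binding some arguments of an existing function produces a closure at run time, with no new expression syntax.

```python
from functools import partial

# An ordinary binary function the engine already provides.
def distance(origin, point):
    ox, oy = origin
    px, py = point
    return ((px - ox) ** 2 + (py - oy) ** 2) ** 0.5

# `target` could come from run-time data (e.g. a column value);
# binding it yields a unary key function -- a closure -- suitable
# for passing to a sort or min.
target = (1.0, 1.0)
distance_to_target = partial(distance, target)

points = [(0.0, 0.0), (1.0, 2.0), (4.0, 4.0)]
assert min(points, key=distance_to_target) == (1.0, 2.0)
```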
I'm not convinced these corner case optimizations are worth the work for any consumer to implement but I will admit there is a possibility here. I guess I'd like a concrete example. Sorting by a custom key (e.g. in Jeroen's example) could just be achieved by a projection before the sort node.
Again, can we give an example of how a closure would be used in a query? Closures capture state, but it's not clear to me what "state" would mean in the context of a query.
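The projection-before-sort rewrite mentioned above can be sketched on plain Python dicts (column names hypothetical): project the key as an extra column, sort on it, then drop it again.

```python
# Input relation as a list of records.
rows = [{'name': 'b', 'x': 3.0},
        {'name': 'a', 'x': -5.0},
        {'name': 'c', 'x': 1.0}]

# ProjectRel: add a computed key column (here, |x|).
projected = [dict(r, sort_key=abs(r['x'])) for r in rows]

# SortRel: plain ascending sort on the projected column.
projected.sort(key=lambda r: r['sort_key'])

# Final ProjectRel: drop the helper column again.
result = [{k: v for k, v in r.items() if k != 'sort_key'} for r in projected]

assert [r['name'] for r in result] == ['c', 'b', 'a']
```

This shows why the simple custom-key case does not strictly need lambdas; the later comments argue lambdas still win on ergonomics and optimizability.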
Here's an example in Morel:
The expression
Here's another example. The
The example is from slide 36 of Morel, a data-parallel programming language.
Thank you for the examples; in particular the first one, using apply on a list array, is a very helpful example for my mental model, and not something easily achievable today. As a consumer I feel like we are blurring two different contexts, and maybe that is ok. I'll call them an "array evaluation context" and a "singular evaluation context" (avoiding scalar and vector as those can be misleading/ambiguous). In Acero I have an expression evaluation for the array evaluation context. This takes the form of

I suppose "Weston doesn't have an efficient solution" is not a very good reason to prevent moving forwards on something. However, if I were tasked with coming up with an efficient solution, I would think code generation would be the way to go (Substrait expression -> machine code), and I really don't want to write a brand new code generation library, so then I'm left looking at how to convert Substrait expressions to WASM / LLVM, which maybe isn't a terrible thing. It does feel like this shifts the project from "let's define a universal library of functions for query engines" to "let's define a universal standard library for programming languages", and I think there are other approaches to that latter task (WASM / LLVM).
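For what it's worth, the "array evaluation context" versus "singular evaluation context" distinction can be sketched in plain Python (`list_apply` is a hypothetical name, not an Acero or Substrait API):

```python
# Apply a one-argument expression to every element of every list,
# preserving the per-record list structure.
def list_apply(list_column, fn):
    return [[fn(elem) for elem in lst] for lst in list_column]

records = [
    {'xs': [1, 2, 3], 'y': 10},
    {'xs': [4],       'y': 20},
]

# Singular context: the expression runs once per record.
assert [r['y'] * 2 for r in records] == [20, 40]

# Array context: the lambda runs once per *element*, nested under the
# record that owns the list.
assert list_apply([r['xs'] for r in records], lambda e: e + 1) == [[2, 3, 4], [5]]
```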
I don't totally follow your Acero example but I can imagine that you have some operations on relations (potentially unbounded streams of records) and other operations on arrays (bounded sets of records or scalars nested inside a record). When executing on actual hardware, with memory limits, that makes perfect sense. That is the world that Acero and Substrait need to operate in. Morel is a high-level language that makes no distinction between top-level relations and nested relations. The
In my opinion, framing relational algebra in functional programming terms makes a lot of sense. And if I am right, people will want their data-parallel framework to be able to execute programs expressed in a functional programming language. They will want to represent closures among their data values, and to put user-defined higher-order functions (table functions and aggregate functions) into their DAGs. Those programs will need to translate down to Substrait. So if the puck is going in that direction, Substrait should be heading there too.
I think we briefly discussed lambdas two sync meetings ago already. It took me a while to make this writeup for various reasons, but here goes.
User-defined functions vs lambdas vs embedded functions
IMO those are three very different things with different use cases.
I think it's really difficult to specify embedded functions in a consumer-agnostic way, while also retaining reasonable performance. For it to work at all, I guess Substrait would have to provide some interface API that each engine would then have to implement, such that the function really only combines a bunch of kernels provided by the engine together. If at any point it needs to do something that this API does not provide, the function either becomes consumer-specific or probably requires some level of serdes. We haven't even talked about the various encapsulations/languages that an embedded function might be written in (either consumers have to support all or a good portion of them or Substrait has to pick a preferred format, otherwise we lose consumer-agnosticism), or the fact that embedded functions are a black box to query optimizers. I'm not saying embedded functions don't have a place, but they are a highly suboptimal and very difficult-to-specify last-ditch effort to make things work in general.
Level 0: niladic lambda argument types
In the simplest case, what I'm calling a lambda here is just a Substrait expression that you can pass to a function, rather than passing the value returned by the expression. We're effectively already doing this for short-circuiting functions like boolean `and`, wherein the function gets to decide whether its arguments are evaluated at all. There is a big difference between the two in the presence of side effects, because many scalar expression transformations in optimizers rely on expression evaluation order. Substrait has almost no side effects, but we do support functions erroring out.

Example: if I write a projection with two new expressions, `and(field(0), big_expression(...))` and `and(field(1), big_expression(...))`, then a naive common subexpression elimination step might want to push that `big_expression(...)` to a preprocessing `ProjectRel`. However, if that big expression can error out for records where neither `field(0)` nor `field(1)` is true, then that optimization has changed behavior. Not to mention that it might have been a bad idea for the optimizer to do this in the first place, because `field(0)` and `field(1)` might be true only very rarely. Now... an optimizer might want to have a special case for something like `and` anyway, especially if it has heuristic knowledge of how often those fields are true. But shouldn't anyone be allowed to make their own extension functions with short-circuit behavior?

Whether we decide to support lambdas as a whole or not, there should at least be a way to specify the difference. This would only require a new flag field in the YAMLs for value arguments. But please note that saying that a value argument is actually an expression that the function may or may not evaluate is pretty much the same thing as just passing a niladic lambda.

Level 1: n-adic lambda argument types
It seems like a small step to extend that to n-adic lambdas; after all, expressions already execute in N record contexts (multiple for subqueries). So in the simplest case we could just say that an expression in a lambda argument context gets another "record" pushed on the stack for `FieldRef`s to access, just like what happens for expressions executing within a subquery, but rather than the record coming from a table or something, the function provides the records to call the expression on. This gives you not only callback functions basically for free (the expression could just be any extension function call), but it also lets you shuffle arguments, or compose multiple extension functions together to form the lambda.

For example, let's say I have field 0: `LIST<STRUCT<fp32, fp32>>` and field 1: `STRUCT<fp32, fp32>`, and I want to get the coordinate from field 0 that is closest to field 1. In the presence of lambdas like this, I could write this as a call to a function with prototype `array_min_with_key(T, lambda (T) -> S) -> T`. I imagine in this case the arguments passed to a lambda argument would be derived types, while the return type of the lambda is pattern-matched (just like a value argument). Of course I could have written the entire `min` operation as an embedded function as well, but this comes with all the downsides I mentioned before.

Level 2: n-adic lambda data types
We could take this one step further and instead add an entirely new compound type for lambdas, and thus allow lambdas to be passed around as values. In this case there would be a new literal type corresponding to that compound type, and the type of that literal would be `LAMBDA<STRUCT<arguments...>, body_return_type>`. This is much more powerful because it is a true implementation of higher-order functions, but it is a pretty big update. Note that this option is not necessarily mutually exclusive with just adding a lambda argument type for now, so we could always decide to add this later if someone wants it. Note also that everyone can already do this with a type extension if they really want to.

Why not just a function reference argument type?
So: pretty much passing function pointers to functions, but not specifying a way to define those functions within the plan (aside from embedded functions). This kind of addresses the same things that a lambda addresses, and is how custom sort functions are already implemented now. However... this is much less powerful.

For example, if that `array_min_with_key` function above were to take its key function that way, you would still either need an embedded function to encapsulate the functionality (assuming those can be referred to this way at all and won't just be an expression type!) or need the consumer to coincidentally implement exactly the function you're looking for. For this particular example though, the lambda is impure: it depends on the target coordinate field in the schema. I don't know if embedded functions can generally do that, but an extension function certainly can't. You could certainly make an extension function that takes two coordinates and returns the distance, but that doesn't work, because `array_min_with_key` would expect a function that takes only one coordinate. For the same reason, there's no way to specify optional or required enumeration arguments either. If the engine provides a distance-to-origin function, we could do some translations outside of the `min` to make things work, but then what if the engine's function implementation has a `[EUCLIDIAN, CARTESIAN]` enumeration argument? There's no easy way to pass that.

Of course we could just throw special cases at the problem and make some sort of argument swizzle message. But IMO you've then pretty much just made a less useful lambda with extra steps.
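The arity/enumeration mismatch being described can be sketched in Python (all names hypothetical): the engine's `distance` function is binary and takes an enumeration option, but the higher-order function wants a unary key. A lambda adapts in place; a bare function reference cannot.

```python
# The consumer's existing binary function, with an enumeration option.
def distance(a, b, metric='EUCLIDIAN'):
    assert metric in ('EUCLIDIAN', 'CARTESIAN')
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# The higher-order function expects a key taking exactly one argument,
# so a reference to `distance` itself cannot be passed directly.
def array_min_with_key(array, key):
    return min(array, key=key)

coords = [(0.0, 0.0), (2.0, 2.0), (5.0, 5.0)]
target = (3.0, 3.0)

# The "lambda" binds the second argument and the enumeration option,
# making the binary engine function fit the unary slot.
closest = array_min_with_key(coords,
                             lambda c: distance(c, target, metric='EUCLIDIAN'))
assert closest == (2.0, 2.0)
```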
These same problems apply to custom sort functions. Unless the engine provides scalar functions explicitly intended to be used in that context, and they don't need any optional arguments for corner cases, and all such functions are quadruplicated for the four orderings (asc/desc, nulls first/last), it's not going to work. Not to mention that you commonly implement sorts with `less_than` functions rather than `[-1, 0, 1]` functions, because you already need `less_than` anyway. Lambdas would be much more ergonomic here.

In fact, we could have done everything that `SortField` already does right now, without special-cased message types for it, without relying on some implicit and currently unspecified default sort order of types for the common case, and without adding too much complexity to the tree structure, with just a few core functions. For example:

I'm assuming here that `true` means "in the right order", `false` means "in the wrong order", and `null` means "equivalent", instead of `[1, 0, -1]`.

The rest is just down to defining `less_than` and `greater_than` functions. I'm not saying we should remove the `SortField` abstraction if we add lambdas (I'm also not not saying it though); I'm just arguing that if we had started off with lambdas as a concept, we would never have had to define it in the first place.
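Since the original example expression is not shown above, the following is only a hedged Python sketch of the three-valued convention described there (True = in the right order, False = in the wrong order, None = equivalent; all names hypothetical):

```python
from functools import cmp_to_key

# One three-valued `in_order` function covering ascending with nulls last.
def in_order_asc_nulls_last(a, b):
    if a == b:
        return None        # equivalent
    if a is None:
        return False       # a null before a value is the wrong order
    if b is None:
        return True        # a value before a null is the right order
    return a < b

# A generic sort built on any such three-valued function.
def sort_with(rows, in_order):
    def cmp(a, b):
        r = in_order(a, b)
        return 0 if r is None else (-1 if r else 1)
    return sorted(rows, key=cmp_to_key(cmp))

assert sort_with([3, None, 1, 2], in_order_asc_nulls_last) == [1, 2, 3, None]

# Descending with nulls first is a wrapper, not a fourth hand-written
# function -- the "quadruplication" disappears.
def flip(in_order):
    return lambda a, b: in_order(b, a)

assert sort_with([3, None, 1, 2], flip(in_order_asc_nulls_last)) == [None, 3, 2, 1]
```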