feat(proto): add protobuf support for user-defined-function extension #234
Conversation
Rationale for the proposed changes:
Note that a complete UDF workflow using the proposed changes has been locally implemented and tested to work using Ibis + Ibis-Substrait + (Py)Arrow. One alternative discussed with @cpcloud, which he asked be described here, involves reusing as many of the existing protobuf definitions as possible.
```proto
message UserDefinedFunction {
  // A serialization of the user-defined function's code. The default serialization
  // format is using base64 of cloudpickle.dumps().
  string code = 1;
```
The reason is no longer relevant, so I'll fix this, and then base64 would not be needed. I think the type can be `bytes`. The format of the bytes would be cloudpickle by default.
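A minimal sketch of the producer/consumer round trip this implies, assuming cloudpickle and a `bytes` field (illustrative only):

```python
import cloudpickle

def add_one(x):
    return x + 1

# With a `bytes` field, the raw pickle payload can be stored directly;
# no base64 encoding step is required.
code: bytes = cloudpickle.dumps(add_one)

# A consumer restores the callable from the same bytes.
restored = cloudpickle.loads(code)
assert restored(1) == 2
```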
Note that if we go with the above-described alternative, then the cloudpickle format would be the default only in the `PythonPickleFunction` case.
Hello. There is already a sketch of how to do this in the spec under embedded functions. It would be great to start from the sketch and update it as you see fit. We don't call this a UDF, as we consider a UDF to be a function that is registered once and then referred to by some sort of identifier. Embedded functions are situations where the plan contains the actual code of the function.
Yes, this is basically the alternative, and the distinction between an embedded function and a UDF noted above. Is it already in use by some Substrait producer or consumer? I'd like to first reach agreement on which alternative to go with.
@jacques-n Thanks for the feedback; it's not obvious to me what the difference is between a UDF and an embedded function. Here is why I don't see a clear difference between the two:
I am guessing this is considered an embedded function, but you can also register this function in order to use it in a SQL string.
To me, the fact that the same function can be used either way suggests that the rationale behind the distinction is not very clear. Perhaps you can give some example justifying the different concepts of "UDF" vs "embedded function", and indicate which one is more suited for the Spark/Ibis Python UDFs?
@icexelloss, I'll let @jacques-n answer and just note my understanding that in Substrait a UDF is referred to by an identifier, whereas an embedded function carries its code inside the plan.
@rtpsw is right on the money. Substrait distinguishes between embedded/inline functions versus function references (aka user-defined functions). "User" here means a user of Substrait and shouldn't be confused with a user of Spark. Layers above Substrait are free to call both of these things UDFs.
Gotcha. Sounds like embedded functions currently carry their bodies inline in the plan, with no reference mechanism.
Yeah, that's how it is right now. I'm open to a pattern of references, similar to other function references, where the embedded bodies are stored at the root of the plan.
@rtpsw This sounds reasonable to me. What do you think?
The concepts are reasonable, but you should use the same anchor/reference system we use elsewhere in the plan (for normal functions, user-defined types, etc.).

For the Python pickle function, you also need to write up some documentation saying what the signature of a Python function must be to be allowed to use arguments of a certain type. For example, if the arguments are VARCHAR<24> and DECIMAL, what should the arguments be within Python? It is important that a second system could produce and/or consume the same Python pickle packaging, so we need to be clear about not only dependencies but all of the type mappings we support to start.

I'd also suspect that you should describe the version of Python that the pickle function was created within. (I'm thinking that prerequisites maybe should be tuples of requirement name and version number/range/etc.?) Would love Python people to fill this out to make sure.

As I said, the Python embedded functions should be defined well enough to be used in multiple systems. (If that requires them using a Substrait Python SDK, I think that would be fine.)
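For illustration only, a hypothetical sketch of the kind of type-mapping table being asked for here; every entry is an assumption for discussion, not a settled decision:

```python
import datetime
import decimal

# Hypothetical Substrait-to-Python type mapping for pickled UDF arguments.
# Every entry is an assumption for discussion, not a settled choice.
SUBSTRAIT_TO_PYTHON = {
    "i8": int,
    "i64": int,
    "fp64": float,
    "string": str,             # e.g. VARCHAR<24> arrives as a str of length <= 24
    "decimal": decimal.Decimal,
    "date": datetime.date,
    "binary": bytes,
}
```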
@icexelloss, I'm fine with trying this general approach. @jacques-n, I understand what you're trying to push for; however, I think this would turn the current focused issue (whose resolution would unblock a path on my UDF project) into a much larger one that involves setting standards and reaching agreement about them. My view is that since the existing spec text for embedded functions is only a sketch, this issue shouldn't be blocked on fully formalizing it. Having said that, here are some thoughts about your points:
Agree on the need to write up the type mapping for the UDFs. One question is: what data are coming into the UDF? Does/should Substrait have an opinion here? Without something like "Arrow data are coming in", there's no way that two independent systems could ever hope to execute the same plan if it contained a UDF, dependencies aside.
Maybe, but the pickle protocol number should be sufficient.
This seems well outside the scope of Substrait and deep into provisioning and dependency management. Requiring that two arbitrary consumers that have been deployed independently be able to execute the same plan that includes a Python UDF doesn't seem feasible to solve anywhere, let alone in Substrait. A list of requirements is barely enough to get a Python environment to work. What if one consumer is running on x86-64 and the other on arm64 and there are native dependencies like numpy in the mix? What if both are on x86-64 but with C++ dependencies compiled with different, incompatible C++ standard libraries? IMO it's up to the systems that want to share plans with Python UDFs to make sure their dependencies are compatible by using modern deployment and provisioning tools.
This definitely has to be Arrow data. The type system is already in place; it is used for inputs and outputs of extension functions, and it should be reused for the UDFs we are discussing here.
Agree. A common use case is delivering a Substrait plan for execution at a server/cloud host. In this case, it's normal for the caller to be responsible for a consistent distributed setup. Since there's no dependency-management problem for Substrait here, we can focus on UDF definitions in this issue. Note that the more difficult dependency-management problems arise when Substrait plans are intended for long-term storage, but this is a separate topic.
I'm not convinced of this. Definitely not without substantially more discussion around the design and vision of embedded operations. Without it, it's much harder to argue that a Substrait plan with an embedded function is well-intentioned and compatible across multiple systems. Let's make sure to keep "hard to do" distinct from "not preferred". We can decide to not do something because it is hard. But is this actually hard in common cases? (I have no doubt it is hard in some cases, but let's keep those "bucket of bytes" cases as rare as possible.)
I don't know pickle. Experts must say what the right version information is. It just seems like we definitely need concepts around versioning.
I think that is one model, and it makes sense to support it. However, from a Substrait POV, is it the first model we should introduce? Why not begin with something that works by defining a default Python-primitive-to-Substrait type mapping, to expose simple examples?
I don't think the issue here is narrow. Formalizing embedded functions into the Substrait spec should be seen as a meaningful undertaking. The content in the current spec around this topic has been marked as a "sketch" and hasn't been meaningfully touched since originally sketched 10 months ago. It lists a bunch of open questions that we should resolve to formalize it. I don't want to create a giant barrier to getting this done. At the same time, it seems worthwhile to put together a design and then figure out the bite-sized steps to get us to a good outcome. It's important to make sure that the proposal works for many different Substrait users.
I don't think it's a bucket-of-bytes situation; I think it's a "this is a really difficult problem without Substrait involved, and adding another technology's opinions into the mix is a bad idea" kind of problem. If I'm understanding you correctly, it sounds like you're saying that a plan must work across arbitrary consumer environments to be a valid plan. If so, I think this is an astronomically high barrier to using UDFs, and it will prevent people from using UDFs at all. Can you describe what a Substrait consumer and producer would do with a list of dependencies and their version constraints? It might also help to clarify what "the common case" is here. In my experience it's not easy to enumerate what "the common case" means for dependency management.
I've used pickle quite a bit. The documentation is good and provides detailed information about versioning. I don't understand what it means to say an expert must say what the right version information is. What is an expert here? Pickle already contains versioning information and compatibility guarantees around Python versions. I'm not saying we shouldn't include the Python version, only that we should look at what already exists and use it as-is if we can.
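As a concrete sketch of the point that pickle payloads already carry version information (the protocol number is embedded in the stream itself and can be read back without unpickling):

```python
import pickletools
import cloudpickle

payload = cloudpickle.dumps(lambda x: x + 1, protocol=5)

# For protocol >= 2, the first opcode in the stream is PROTO,
# whose argument is the protocol number.
opcode, arg, _pos = next(pickletools.genops(payload))
assert opcode.name == "PROTO" and arg == 5
```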
I was using Arrow as an example of one choice, not as the thing we should necessarily start with. We can start with Python objects; that seems fine, if horribly inefficient.
Here's a detailed (though not complete) technical proposal to start a concrete discussion. If it seems to be in the right direction, I could work on a PR for it. I designed it with the considerations discussed above in mind.
I'm saying someone who knows Python serialization, like yourself and several others on the list (and not me). If a pickle version alone is sufficient on the interpreter side, awesome. I just want to make sure that we have a clear and reliable versioning specification/approach.
IMO there are reasonable solutions to the issues raised above.
I think this is something that the producer can specify inside the UDF definition protobuf.
I think perhaps we can support two ways here.
I think these two options don't solve all the cases, but they're a reasonable start that covers common ones. @jacques-n @cpcloud @rtpsw WDYT? (Just as a side note: for the use case of this in Arrow, the "environment" doesn't really matter, because the producer and consumer share the same Python environment.)
I think even in this case, the single value should have one of the types defined in the Substrait type system.
This part I agree with.
These are reasonable options assuming the versioning problem is about dependencies; however, @cpcloud says the dependency problem is not in scope for Substrait. From the discussion leading to this post by @jacques-n, my understanding is that the versioning problem is about serialization compatibility, i.e., ensuring that the pickled UDF can be correctly unpickled by various Substrait consumers. In principle, picklers should take care of this compatibility, yet Substrait should have appropriate definitions for it.
Some other thoughts regarding dependencies. I have the same feeling that no matter what we include in Substrait in terms of Python dependencies, it's still hard to make these Substrait plans work across multiple systems. As a thought experiment: if something produces such a plan and passes it to Spark, then even if I fully specify the versions of the packages I need, Spark would still need to install these packages on its workers, which isn't something that Spark can do right now. Therefore, even if we come up with a spec here, we won't truly know whether it works until we try to pass such plans across systems.

Another reason that I think we shouldn't handle Python libraries and dependencies in Substrait is that systems like Spark/Dask don't handle these as part of the "query plan" themselves. If a user of Spark has a Spark script that constructs a query with UDFs, the library requirements for running this query (Python or Scala) are not part of the Spark script or the query itself; they are controlled by parameters used to start up the Spark cluster. I think most systems handle dependencies this way - not as part of the query plan but rather as part of "how to set up the system". Since Substrait is a representation of the query plan, I think dependencies are not a concern of Substrait either, IMO.

It's true that this makes the consumer of the Substrait plan responsible for making sure it has the dependencies needed to execute it, but I think this is reasonable, because it is similar to how whoever takes a Spark script and runs it needs to make sure the proper dependencies are available. This makes me lean towards punting on this problem for now; the UDF spec is already useful without the dependency part, IMO.

Edit: added more reasons for not handling dependencies in Substrait.
@cpcloud, @jacques-n: back to you. How do we move forward? If my proposal here is a reasonable starting point, I could get it ready and upload it here.
@jacques-n @cpcloud I haven't heard back from you for a while, so I'll try to summarize what I think is a way forward and see what you think.

Re @cpcloud's comments on the "object type that is passed in and out of the UDF": I think this is reasonable to add to the UDF definition protobuf (details pending: do we need just one type for the input/output, or one type for each input argument?).

Re @jacques-n's comments on "dependencies and versions for the UDF": I think it's best not to solve this problem here. (Elaboration of the reasoning: #234 (comment).)

Does this sound like a reasonable way forward?
@icexelloss we had some decent conversations about this at the last sync. In general, I think we have to work further on making things "self-contained". This means having some sort of dependency definition (amongst other things). I think @cpcloud had some ideas of how to do this.
I see - thanks for the update. What's the best way for us to track this work? (Since this is the blocker for us now, we'd like to get involved and help out if we can.)
I'd love for @cpcloud to chime in, as I'm not an expert here. I think his suggestion was that we support dependencies for Python defined as a pip lock file or a conda lock file to start(?). I definitely think the other items also need to be defined: how we express the function signature (argument types, etc.) for the Python code, and how we declare the entrypoint into the Python code. It would be great for you to put together a proposal for each of these items as markdown as part of this patch.
@icexelloss @rtpsw I think your proposal is a good starting point. I think the minimum requirements are:

- A way to specify dependencies: The main point of including support for this is for the plan to be self-contained. It's most definitely an optional field. I'm not 100% sold on it, but I'm willing to accept it to move this forward.
- A way to specify the expected input data kind.
- A way to specify the entrypoint.
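A minimal sketch of how these three requirements might be modeled, using hypothetical names (shown as a Python dataclass rather than protobuf, purely for discussion):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape for a pickled-UDF extension; all names here are
# placeholders for discussion, not proposed protobuf fields.
@dataclass
class PythonPickleUdf:
    code: bytes                         # cloudpickle.dumps() of the function
    input_kind: str = "arrow"           # expected input data kind, e.g. "arrow"
    entrypoint: Optional[str] = None    # possibly redundant for pickled callables
    dependencies: Optional[str] = None  # optional: pip/conda lock-file contents
```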
Addressing the items by @cpcloud:
There's a difference between the type system in Substrait and the kinds of objects a UDF operates on. For example, a consumer could call a scalar UDF on every element of the input(s), or it could send vectors of the input(s). The producer should specify which kind of input the UDF expects so that the consumer can do the right thing. Perhaps "calling convention" is a better name for this, rather than "kind"?
An example of what you already have would really help me understand what you're describing. It's difficult to understand the description without something to reference.
OK, I'll try this.
Would "input-shape" be a good term for what you're describing? We currently have analytic, elementwise, and reduction UDFs that are all scalar functions. In general, invocations of scalar functions differ in both input-shape and output-shape. For example, one invocation of an elementwise function could map a scalar shape to a scalar shape whereas another invocation of the same elementwise function could map a vector shape to a vector shape. Note that I'm working on adding tabular UDFs, where the output-shape could be 2-dimensional.
I've answered this elsewhere to point you to my work in progress.
Input shape isn't what I'm describing. In the simplest SQL engine (ignoring window functions and table functions) there are two kinds of functions: scalar functions and aggregate functions.

Semantically, scalar functions take a single value in and produce a single value. The physical representation of those column elements has no effect on the semantic "shape" of the function's output. I am trying to capture this fact here. The UDF must communicate whether it expects to be called with a single scalar value per argument or with a vector per argument. Imagine a UDF that adds one to every value in a column, using the per-row calling convention:

```python
def add_one(x):
    return x + 1
```

With a per-row calling convention, the consumer calls the function once per element. Now imagine the same UDF using a vector calling convention:

```python
def add_one(x):
    return pyarrow.compute.add(x, 1)
```

This would have no effect on the execution results, but the execution itself will differ quite a bit. We need to capture this variation in the information produced by the producer so the consumer knows how to call the function and also how to assemble the result of calling it.
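To illustrate the consumer's side of this, a sketch of dispatching on a declared calling convention, assuming Arrow arrays and hypothetical convention names:

```python
import pyarrow as pa

def call_scalar_udf(udf, convention: str, column: pa.Array) -> pa.Array:
    """Dispatch a scalar UDF according to its declared calling convention."""
    if convention == "per-row":
        # One Python call per element; each value crosses the Arrow/Python boundary.
        return pa.array([udf(value.as_py()) for value in column])
    if convention == "vector":
        # One call per batch; the UDF receives and returns Arrow data.
        return udf(column)
    raise ValueError(f"unknown calling convention: {convention!r}")
```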
Hm, I was looking for an example of the Python code. Can you put the example (or a link to it) here? Apologies if you have to repeat yourself, but it would be good to have the relevant examples with the discussion in one place.
Please bear in mind that Substrait's type system is entirely abstract: Substrait does not define any representation (in memory or otherwise) of those types, except the protobuf format it uses to represent literals. Formally, a consumer is free to use whatever internal data representation it wants, as long as the observable semantics of the plan match those of the Substrait types.
For example, it's perfectly fine for all integer types to be represented with some bigint, or for floats and doubles to be represented using quads or whatever, as long as 127i8 + 127i8 "overflows correctly" rather than yielding 254i8 (since that doesn't exist as far as Substrait is concerned).

Therefore, just saying that UDFs should use Substrait's type system is not sufficient. Some physical* representation is needed in addition. IMO the most obvious candidate for this would be Arrow, since many systems have been moving toward that over the past few years already... but Substrait's and Arrow's type systems are not 100% compatible. Just using Python objects also isn't sufficient unless you specify which ones (for integers, for example). I think Jacques was suggesting requiring the use of some Substrait-controlled API to access values. I guess that would also work... ish. IMO it mostly just shifts the problem to the signatures of the functions defined by that API, and would try to solve exactly the problem that Arrow was created to solve. But I might be misunderstanding or misremembering what was said.

To me it would make the most sense if the plan specified a concrete physical representation for UDF inputs and outputs, along with the mapping from Substrait types to it.
* Please excuse the overloaded term; I don't know what else to call it.
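To make the integer example concrete, a small sketch of why "just Python objects" underspecifies the mapping (plain Python ints are arbitrary-precision and do not model i8 by themselves):

```python
# Plain Python ints are arbitrary-precision, so a per-row UDF computing
# 127 + 127 happily returns 254 -- a value that does not exist for i8.
def add(x: int, y: int) -> int:
    return x + y

def fits_i8(v: int) -> bool:
    return -128 <= v <= 127

result = add(127, 127)
assert result == 254        # fine for Python
assert not fits_i8(result)  # out of range for Substrait i8; the consumer
                            # must detect and handle this overflow itself
```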
It sounds like what both of you have in mind is an ABI (though in a data-scientific sense that is narrower than usual), since you're describing the concrete input and output formats of a function. Is this a correct understanding? In my local work, the Substrait consumer didn't have to deal directly with an ABI because it (as well as the shapes I mentioned) was implied by the kind of function used, like elementwise or analytic. I guess you're asking to make this ABI explicit in the Substrait plan. This is a bit unexpected to me because a Substrait plan tends to keep things abstract; however, I'm not all that experienced with Substrait, so this could be the right approach, and I wouldn't oppose it.
I think so.
The problem is that the implementation of the UDF itself isn't abstract at all; it's bytecode for some specific VM or architecture (Python isn't really a VM, but in this context you can think of the Python interpreter as acting as a VM for the pickled function). So there's a disconnect there. If UDFs were instead implemented in terms of some set of abstract, primitive functions, akin to something like LLVM IR but using Substrait's type system, you wouldn't need this... but then you also wouldn't need UDFs, because you could just describe the functionality using existing expression trees. Not to mention that every consumer would have to implement that complete "Substrait VM". It could also be that my understanding of UDFs is lacking; I've been monitoring this thread but mostly abstaining from commenting in it because there's a significant gap in my background there.
I'm not sure I understand; are you looking for an example UDF? Here's an example that works in my local code and fits your vector calling convention.
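(A hypothetical sketch of such an elementwise UDF, assuming Arrow arrays; not the exact code from the local implementation:)

```python
import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical elementwise UDF: one Arrow array in, one Arrow array of the
# same length out -- i.e. a scalar function using the vector convention.
def centimeters_to_inches(values: pa.Array) -> pa.Array:
    return pc.divide(values, 2.54)
```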
The way this UDF is handled in my local code is the following: the pickled function is restored by the consumer and invoked once per Arrow batch of its inputs. There is no support in my local code for a function like your per-row `add_one`.
I agree this disconnect exists. The question I'm struggling with is whether resolving it should be the responsibility of the Substrait plan or of the Substrait producer/consumer. The former option would force the Substrait plan to get much more concrete than usual, but promises to keep the plan self-contained (if this is at all possible with UDFs), whereas the latter option would limit the Substrait producers/consumers with which a particular UDF can be used (since its ABI would need to be supported, and perhaps represented within the pickled function object), but promises to keep the Substrait plan abstract.
@cpcloud Thanks for putting the requirements together. I agree with most of what you said. One thing I wasn't sure about is "a way to specify the entrypoint". If you are talking about "import pandas as pd" before executing the UDF, AFAIK this is handled by cloudpickle already (https://github.com/cloudpipe/cloudpickle/blob/master/cloudpickle/cloudpickle.py#L345), and I don't think we need to do extra work here. Perhaps this doesn't need to be a requirement for the cloudpickled UDF?
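A small sketch of this point, assuming pandas is installed on both sides: cloudpickle records module references used by the function in the payload and re-imports them on load, so no separate entrypoint declaration seems necessary.

```python
import cloudpickle
import pandas as pd

def double(s: pd.Series) -> pd.Series:
    # cloudpickle records the reference to the pandas module by name,
    # so unpickling re-imports it (pandas must be installed on the consumer).
    return s * 2

payload = cloudpickle.dumps(double)
restored = cloudpickle.loads(payload)  # directly yields the callable
assert restored(pd.Series([1, 2])).tolist() == [2, 4]
```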
@cpcloud Any thoughts about #234 (comment) on "entry point"?
@icexelloss I think what you're saying makes sense. Since unpickling will result in an instance of the UDF object, there's no real need in that case to re-specify the name of the function anywhere.
Thanks. I think what @cpcloud specified above are reasonable requirements. Due to prioritization we will keep this in the backlog for a bit and hopefully find time to put up an updated proposal.
This effort appears to have stalled. Can we close this PR and re-open at a later date if you choose to revisit?
@icexelloss, is this still needed?
Yes, I think we can close this, as we have a solution without changing the Substrait spec.
For #233