Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SortRel is a little under documented #213

Open
westonpace opened this issue May 30, 2022 · 8 comments
Open

SortRel is a little under documented #213

westonpace opened this issue May 30, 2022 · 8 comments
Labels
enhancement New feature or request help wanted No one is currently implementing but it seems like a good idea

Comments

@westonpace
Copy link
Member

westonpace commented May 30, 2022

  1. What does SORT_DIRECTION_CLUSTERED mean?
  2. Let's say I was sorting by strings, how would I specify a natural sort vs alphabetical sort?
  3. Can we provide an example of comparison_function_reference? What kind of function is this? Why isn't this just an extension relation?
  4. Are we content leaving the stability of the sort as a hint? It seems like the producer / planner might care about this.
@westonpace
Copy link
Member Author

  1. What does SORT_DIRECTION_UNSPECIFIED imply?

@cpcloud
Copy link
Contributor

cpcloud commented May 30, 2022

  1. What does SORT_DIRECTION_UNSPECIFIED imply?

All of the unspecified suffixes mean that the value wasn't specified when constructing the protobuf. Their only semantic is to communicate that nothing was specified. They don't imply anything about default behavior for example, which imo should be specified by the producer.

@jacques-n
Copy link
Contributor

Is there specific stuff that you think should be covered here that is not? Custom function reference let's you choose to use an arbitrary binary function for ordering. This allow you to use alternative collations.

@westonpace
Copy link
Member Author

westonpace commented May 31, 2022

Is there specific stuff that you think should be covered here that is not?

That helps, I was looking at the documentation for SortRel. I'll try and answer my own questions then.

  1. This is pretty clear from the linked doc.
  2. I still don't know. I suppose I could use a custom function but how would I pass arguments too it? Or would I need to have a different sort function for each kind of way I can sort?
  3. The docs help a little.

Returns data using a custom function that returns -1, 0, or 1 depending on the order of the data.

However, is this a scalar function? Wouldn't a sort comparator (that returns int) be called compare(T left, T right) or does it make sense still to have int[] compare(T[], T[])? I don't really know how a sort node would be implemented but I think my first guess would be that a custom function would have the signature T[] sort(T[], T[]) (i.e. a function that can sort a chunk, in memory. The sort node can then implement the actual chunking, spilling, etc.)

  1. Still not sure what the answer is here.

All of the unspecified suffixes mean that the value wasn't specified when constructing the protobuf. Their only semantic is to communicate that nothing was specified. They don't imply anything about default behavior for example, which imo should be specified by the producer.

What is the correct behavior of the consumer if it encounters this? An error?

Also, if there is any kind of value that makes any kind of sense as a default we should generally use that instead of unspecified, especially as we add new things to the proto. Otherwise, forwards compatibility will just throw an error whenever an older file is encountered.

@jvanstraten
Copy link
Contributor

Is there specific stuff that you think should be covered here that is not?

Yes actually; the type (including nullability) or set of types that the ordering function may return; there's no such thing as 1, 0, and -1 in Substrait, as literals are typed. IIRC I (rather arbitrarily in hindsight) also allowed them to return a bool with less-than semantics in the validator, and I think I allowed every type of int and probably nullability.

Also,

based on the quality function associated with the type

implies to me that every type needs to have one of these, and thus has a defined ordering (I also have no idea what a quality function is or looks like, as it doesn't appear to be defined). I don't even know how a map would be ordered, let alone arbitrary custom types. I can accept that Substrait requires an equality function be implicitly defined for every type since it's needed in various places (aggregations for example), but also requiring an implicit ordering function seems a bit much to me. If not every type, which types have one and which don't? Also, how do the orderings work for the less obvious types? For example for STRUCT?<i8?, i8?> is the sort order of inner nulls affected by the nulls first/last as well? (If so, you need two different implicit ordering functions, one that treats null as negative "infinity" and one as positive "infinity")

What is the correct behavior of the consumer if it encounters [an *_UNSPECIFIED]? An error?

IIRC protobuf APIs convert unrecognized enum options to the *_UNSPECIFIED option internally, unless you ask for enum value in its raw i32 form. In which case, it's likely that if we add a variant to an enum later and pass that value to an older (otherwise backward-compatible) consumer, it will probably see *_UNSPECIFIED in its switch statement. Therefore, I don't think it's a good idea to attach behavior other than "reject the plan" to *_UNSPECIFIED, or naive consumer implementations will do the wrong thing rather than fail when presented with an otherwise backward compatible plan that uses a variant they don't understand. If we want consumer-default behavior for an enum, we can just explicitly add a *_CONSUMER_DEFAULT option. Likewise for oneofs.

@westonpace
Copy link
Member Author

Therefore, I don't think it's a good idea to attach behavior other than "reject the plan" to *_UNSPECIFIED, or naive consumer implementations will do the wrong thing rather than fail when presented with an otherwise backward compatible plan that uses a variant they don't understand. If we want consumer-default behavior for an enum, we can just explicitly add a *_CONSUMER_DEFAULT option. Likewise for oneofs.

I agree with you. I wasn't arguing for consumer defaults as much as spec defaults. I was simply saying that we should try and define a spec default value when it makes sense. For example, I think it probably would have been fine to let an ascending sort be the default sort or let an inner join be the default join. In retrospect I think I was making it more of an issue that it needs to be. Using UNKNOWN or UNSPECIFIED as a 0 value seems to be a common practice and we don't actually do it all that much anyways. It will be a more significant concern when we make changes that we want forwards compatibility for.

IIRC protobuf APIs convert unrecognized enum options to the *_UNSPECIFIED option internally, unless you ask for enum value in its raw i32 form. In which case, it's likely that if we add a variant to an enum later and pass that value to an older (otherwise backward-compatible) consumer, it will probably see *_UNSPECIFIED in its switch statement.

It appears that it will either just pass on the unrecognized value or use it's own language-specific unknown option. Either way I do not believe it will use the default value in this case:

During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation. In languages with closed enum types such as Java, a case in the enum is used to represent an unrecognized value, and the underlying integer can be accessed with special accessors. In either case, if the message is serialized the unrecognized value will still be serialized with the message.

@westonpace
Copy link
Member Author

Yes actually; the type (including nullability) or set of types that the ordering function may return; there's no such thing as 1, 0, and -1 in Substrait, as literals are typed. IIRC I (rather arbitrarily in hindsight) also allowed them to return a bool with less-than semantics in the validator, and I think I allowed every type of int and probably nullability.

Let's make this question number 6.

  1. What is the exact return type of a comparison function

implies to me that every type needs to have one of these, and thus has a defined ordering (I also have no idea what a quality function is or looks like, as it doesn't appear to be defined).

I think it might be good for every type to have a Substrait defined default ordering, even if consumers can't support all of them. This is a topic that could have a fair amount of explanation / discussion and might be better off as its own issue.

@jacques-n
Copy link
Contributor

2. I still don't know. I suppose I could use a custom function but how would I pass arguments too it? Or would I need to have a different sort function for each kind of way I can sort?
The docs help a little.

The thinking here is that the value is a reference to a function defined in extensions. We should more formally specify that: it is a 2 argument function that returns an int32 with the values of -1, 0, and 1 to provide a stable sort. So for each sort type + argument type, you have a different function you reference for comparison purposes. The function has a defined requirement of two arguments and a defined way things are output so you don't need to define any arguments (beyond what you do for any sort field).

I agree that the documentation needs to be enhanced. Happy to see PRs suggesting improvements.

@westonpace westonpace added enhancement New feature or request help wanted No one is currently implementing but it seems like a good idea labels Mar 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted No one is currently implementing but it seems like a good idea
Projects
None yet
Development

No branches or pull requests

4 participants