docs: extend extension documentation

- coin the term "core extensions" and describe why they (and extensions in general) exist
- document guidelines from substrait-io#251 and substrait-io#307
- document common options in core extensions (substrait-io#254)
- better integrate the function documentation generator output into the website
jvanstraten · Sep 5, 2022 · 5e98051
1 parent ae4b50d

Showing 15 changed files with 173 additions and 37 deletions.
diff --git a/site/docs/expressions/user_defined_functions.md b/site/docs/expressions/user_defined_functions.md
@@ -1,3 +1,3 @@
 # User-Defined Functions
 
-Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). In fact, the functions defined by Substrait use the same mechanism. The extension files for them can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions).
+Substrait supports the creation of custom functions using [simple extensions](../extensions/simple_extensions.md), using the facilities described in [scalar functions](scalar_functions.md). In fact, the functions defined by Substrait use the same mechanism. The extension files for them can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions).
diff --git a/site/docs/extensions/.gitignore b/site/docs/extensions/.gitignore
diff --git a/site/docs/extensions/_config b/site/docs/extensions/_config
@@ -0,0 +1,5 @@
+arrange:
+  - expressivity_vs_compatibility.md
+  - simple_extensions.md
+  - core_extensions
+  - advanced_extensions.md
diff --git a/site/docs/extensions/advanced_extensions.md b/site/docs/extensions/advanced_extensions.md
@@ -0,0 +1,14 @@
+# Advanced Extensions
+
+Less common extensions can be defined using customization support at the serialization level. This includes the following kinds of extensions:
+
+| Extension Type                       | Description                                                  |
+| ------------------------------------ | ------------------------------------------------------------ |
+| Relation Modification (semantic)     | Extensions to an existing relation that will alter the semantics of that relation. These kinds of extensions require that any plan consumer understand the extension to be able to manipulate or execute that operator. Ignoring these extensions will result in an incorrect interpretation of the plan. An example extension might be creating a customized version of Aggregate that can optionally apply a filter before aggregating the data. <br /><br />Note: Semantic-changing extensions shouldn't change the core characteristics of the underlying relation. For example, they should *not* change the default direct output field ordering, change the number of fields output or change the behavior of physical property characteristics. If one needs to change one of these behaviors, one should define a new relation as described below. |
+| Relation Modification (optimization) | Extensions to an existing relation that can improve the efficiency of a plan consumer but don't fundamentally change the behavior of the operation. An example might be an estimated amount of memory the relation is expected to use or a particular algorithmic pattern that is perceived to be optimal. |
+| New Relations                        | Creates an entirely new kind of relation. It is the most flexible way to extend Substrait but also makes the Substrait plan the least interoperable. In most cases it is better to use a semantic-changing extension to an existing relation as opposed to a new relation, as it means existing code patterns can easily be extended to work with the additional properties. |
+| New Read Types                       | Defines a new subcategory of read that can be used in a ReadRel. One of Substrait's goals is to provide a fairly extensive set of read patterns within the project as opposed to requiring people to define new types externally. As such, we suggest that you first talk with the Substrait community to determine whether your read type can be incorporated directly in the core specification. |
+| New Write Types                      | Similar to a read type but for writes. As with reads, we recommend that interested extenders first discuss new write types with the community before using the extension mechanisms. |
+| Plan Extensions                      | Semantic and/or optimization based additions at the plan level. |
+
+Because extension mechanisms are different for each serialization format, please refer to the corresponding serialization sections to understand how these extensions are defined in more detail.
diff --git a/site/docs/extensions/core_extensions/.gitignore b/site/docs/extensions/core_extensions/.gitignore
@@ -0,0 +1,3 @@
+*.md
+!common_options.md
+!guidelines.md
diff --git a/site/docs/extensions/core_extensions/_config b/site/docs/extensions/core_extensions/_config
@@ -0,0 +1,3 @@
+arrange:
+  - guidelines.md
+  - common_options.md
diff --git a/site/docs/extensions/core_extensions/common_options.md b/site/docs/extensions/core_extensions/common_options.md
@@ -0,0 +1,73 @@
+# Common options
+
+Some (optional) enumeration arguments in the core extensions represent very broad implementation differences and thus appear very frequently. These options are described centrally here for brevity.
+
+## Integer overflow
+
+This optional enumeration appears on all integer functions that can overflow.
+
+| Option      | Description |
+|-------------|-------------|
+| SILENT      | Overflow silently based on two's complement overflow, i.e. by discarding all higher-order bits that cannot be represented. |
+| SATURATE    | Overflow silently by saturating to the maximum or minimum value that can be represented. |
+| ERROR       | Throw an error when overflow is detected, aborting the whole query. |
+| unspecified | The consumer can decide. If it supports multiple modes, the order of preference is SILENT > SATURATE > ERROR. |
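
To make the three modes concrete, here is a minimal Python sketch of an 8-bit signed addition. The function name `add_i8` and the way the option is passed as a string are assumptions made for illustration only; they are not part of the Substrait specification. The sketch treats an unspecified option as SILENT, which is one choice a consumer supporting every mode could make.

```python
from typing import Optional

I8_MIN, I8_MAX = -128, 127  # range of a signed 8-bit integer


def add_i8(a: int, b: int, overflow: Optional[str] = None) -> int:
    """Add two i8 values under a hypothetical overflow option."""
    exact = a + b  # Python integers never overflow, so this is the exact sum
    if I8_MIN <= exact <= I8_MAX:
        return exact
    if overflow in (None, "SILENT"):
        # Keep only the low 8 bits, then reinterpret them as a signed value.
        low = exact & 0xFF
        return low - 256 if low > I8_MAX else low
    if overflow == "SATURATE":
        # Clamp to the nearest representable bound.
        return I8_MAX if exact > I8_MAX else I8_MIN
    if overflow == "ERROR":
        raise OverflowError(f"i8 overflow in {a} + {b}")
    raise ValueError(f"unknown overflow option: {overflow}")
```

For example, `add_i8(100, 100, "SILENT")` wraps around to a negative value, while `add_i8(100, 100, "SATURATE")` clamps to `I8_MAX`.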
+
+## Floating point (IEEE 754) rounding mode
+
+The IEEE 754 specification for floating point numbers defines five rounding modes, to be used when the exact result of a computation cannot be represented exactly. Note that this generally does NOT mean rounding to an integer! However, the rounding mode is also generalized to some functions that return integers or a fractional index into a list of non-numeric but orderable types.
+
+| Option             | Description |
+|--------------------|-------------|
+| TIE_TO_EVEN        | Round to the nearest available representation. If the exact value lies exactly between the two nearest representations, tie to even. Note that even-ness for floating point numbers is defined based on the least-significant bit of the mantissa, not on integer even-ness. This is the default behavior of the majority of floating point implementations. |
+| TIE_AWAY_FROM_ZERO | Round to the nearest available representation. If the exact value lies exactly between the two nearest representations, tie away from zero. |
+| TRUNCATE           | Round to the nearest available representation that lies between zero and the exact value. |
+| CEILING            | Round to the nearest available representation that lies between the exact value and positive infinity. |
+| FLOOR              | Round to the nearest available representation that lies between negative infinity and the exact value. |
+| unspecified        | The consumer can decide. It is recommended that producers use this, as consumers are unlikely to implement more than one mode. |
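
The difference between the five modes can be sketched with Python's `decimal` module, which offers an equivalent rounding mode for each. This is only an analogy: it rounds to one decimal digit rather than to a binary mantissa, and the mapping from option names to `decimal` constants is an assumption made for this illustration.

```python
from decimal import (
    Decimal,
    ROUND_CEILING,
    ROUND_DOWN,
    ROUND_FLOOR,
    ROUND_HALF_EVEN,
    ROUND_HALF_UP,
)

# Hypothetical mapping from the option names above to decimal rounding modes.
ROUNDING = {
    "TIE_TO_EVEN": ROUND_HALF_EVEN,
    "TIE_AWAY_FROM_ZERO": ROUND_HALF_UP,  # ties go away from zero
    "TRUNCATE": ROUND_DOWN,               # always toward zero
    "CEILING": ROUND_CEILING,             # always toward positive infinity
    "FLOOR": ROUND_FLOOR,                 # always toward negative infinity
}


def round_to_tenths(exact: str, option: str) -> Decimal:
    """Round an exact decimal value to one fractional digit."""
    return Decimal(exact).quantize(Decimal("0.1"), rounding=ROUNDING[option])
```

Here the "available representations" are the multiples of 0.1, so `0.25` is a tie between `0.2` and `0.3`: TIE_TO_EVEN picks `0.2` (even last digit) and TIE_AWAY_FROM_ZERO picks `0.3`.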
+
+## Floating point domain error handling
+
+The specification for floating point numbers defines that operations that yield a mathematical domain error may return NaN (not a number) or raise a floating point exception.
+
+| Option      | Description |
+|-------------|-------------|
+| NAN         | Return NaN when a domain error occurs. |
+| ERROR       | Throw an error when a domain error occurs, aborting the whole query. |
+| unspecified | The consumer can decide. If it can do both, it should return NaN. |
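
As a rough illustration, a consumer that supports both behaviors might handle a square root of a negative number as follows. The function name `sqrt_with_option` and the string-valued option are hypothetical and exist only for this sketch.

```python
import math
from typing import Optional


def sqrt_with_option(x: float, on_domain_error: Optional[str] = None) -> float:
    """Square root with a hypothetical domain error option."""
    if x < 0.0:
        if on_domain_error == "ERROR":
            # Abort instead of producing a value.
            raise ValueError(f"sqrt domain error for input {x}")
        # NAN, or unspecified: this consumer can do both, so it returns NaN.
        return math.nan
    return math.sqrt(x)
```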
+
+## Allowable optimizations for statistical functions
+
+Computation of statistical functions is commonly approximated by testing only a subset of the complete population.
+
+| Option      | Description |
+|-------------|-------------|
+| SAMPLE      | The consumer may choose to operate only on a representative subset of the data. It is up to the consumer to decide which algorithm to use for this, or what the error tolerance is. |
+| POPULATION  | The consumer must consider every member of the population when computing the statistical metric. |
+| unspecified | The consumer can decide. If it can optimize, it should do so. |
+
+For some functions, a more general enumeration is used, which may allow for more optimizations at the cost of accuracy.
+
+| Option      | Description |
+|-------------|-------------|
+| EXACT       | The consumer must compute the exact value of the metric for the population or sample thereof. |
+| APPROXIMATE | The consumer may approximate the metric beyond merely operating on a subset of the data. It is up to the consumer to decide which algorithm to use, or what the error tolerance is. |
+
+## Case sensitivity and conversion
+
+Functions that match strings can generally be configured to match case-sensitively or case-insensitively. In the latter case, matching may be restricted to ASCII characters, as this can be more performant than using a complete Unicode lookup table, and may be good enough.
+
+| Option                 | Description |
+|------------------------|-------------|
+| CASE_SENSITIVE         | Strings must be matched case-sensitively. |
+| CASE_INSENSITIVE       | Strings must be matched case-insensitively, using Unicode case conversion rules. |
+| CASE_INSENSITIVE_ASCII | Strings that only use ASCII characters must be matched case-insensitively. Non-ASCII characters are not expected. If a non-ASCII character appears nonetheless, the consumer may decide whether to match it case-sensitively or case-insensitively. |
+| unspecified            | All strings are expected to use the same case convention, so case sensitivity is not expected to affect the result. Thus, the consumer can decide. It should prefer case-sensitive matching if supported. |
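
The practical difference between the matching options can be sketched in Python. The helper names below are hypothetical, and `str.casefold()` is used as an approximation of Unicode caseless matching (a complete implementation would also normalize the strings). For CASE_INSENSITIVE_ASCII, this sketch chooses to match non-ASCII characters case-sensitively, which is one of the behaviors the option permits. An unspecified option is treated as case-sensitive.

```python
from typing import Optional


def ascii_lower(text: str) -> str:
    """Lower-case only the ASCII letters A-Z, leaving other characters as-is."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def strings_match(a: str, b: str, option: Optional[str] = None) -> bool:
    """Compare two strings under a hypothetical case sensitivity option."""
    if option in (None, "CASE_SENSITIVE"):
        return a == b
    if option == "CASE_INSENSITIVE":
        return a.casefold() == b.casefold()
    if option == "CASE_INSENSITIVE_ASCII":
        return ascii_lower(a) == ascii_lower(b)
    raise ValueError(f"unknown case sensitivity option: {option}")
```

With these definitions, `"Straße"` and `"STRASSE"` match under CASE_INSENSITIVE but not under CASE_INSENSITIVE_ASCII, because the ASCII-only path never touches `ß`.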
+
+Case conversion functions have a similar option.
+
+| Option      | Description |
+|-------------|-------------|
+| UTF8        | Case conversion must be done using the complete ruleset defined by Unicode. |
+| ASCII       | The consumer should only case-convert ASCII characters. |
+| unspecified | Strings that only use ASCII characters must be converted as specified. Non-ASCII characters are not expected. If a non-ASCII character appears nonetheless, the consumer may decide whether to convert its case or leave its case unchanged. |
diff --git a/...docs/extensions/generate_function_docs.py → ...core_extensions/generate_function_docs.py b/...docs/extensions/generate_function_docs.py → ...core_extensions/generate_function_docs.py
@@ -155,7 +155,7 @@ def write_markdown(file_obj: dict, file_name: str) -> None:
 
 current_file = Path(__file__).name
 cur_path = Path(__file__).resolve()
-functions_folder = os.path.join(str(Path(cur_path).parents[3]), "extensions")
+functions_folder = os.path.join(str(Path(cur_path).parents[4]), "extensions")
 
 # Get a list of all the function yaml files
 function_files = []
@@ -198,7 +198,7 @@ def write_markdown(file_obj: dict, file_name: str) -> None:
         if not out_path.exists() or not filecmp.cmp(in_path, out_path, shallow=False):
             with open(in_path, "r") as markdown_file:
                 with mkdocs_gen_files.open(
-                    f"extensions/{function_file_no_extension}.md", "w"
+                    f"extensions/core_extensions/{function_file_no_extension}.md", "w"
                 ) as f:
                     for line in markdown_file:
                         f.write(line)
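
The change from `parents[3]` to `parents[4]` follows from the script moving one directory deeper. The sketch below checks this with `pathlib`, assuming a hypothetical checkout at `/repo` and assuming the script moved from `site/docs/extensions/` to `site/docs/extensions/core_extensions/`, as the other paths in this commit suggest.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; the relative layout is inferred from the
# other file paths in this commit.
repo = PurePosixPath("/repo")
old_script = repo / "site/docs/extensions/generate_function_docs.py"
new_script = repo / "site/docs/extensions/core_extensions/generate_function_docs.py"

# parents[0] is the containing directory, parents[1] its parent, and so on.
old_root = old_script.parents[3]
new_root = new_script.parents[4]

# Both expressions resolve to the repository root, so the "extensions"
# folder with the YAML files is found in the same place as before.
functions_folder = new_root / "extensions"
```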
diff --git a/site/docs/extensions/core_extensions/guidelines.md b/site/docs/extensions/core_extensions/guidelines.md
@@ -0,0 +1,28 @@
+# Guidelines
+
+The process for deciding whether something should or should not be a core extension, and if so, how it should be specified, is based on the following guidelines.
+
+ - Avoid adding a new function if it is merely syntactic sugar and if producers could express its behavior easily enough by composing other existing functions. However, it might be acceptable to add such a function if adding it can enable consumers to more efficiently execute plans, or if adding it can avoid producers deconstructing a syntactic construct that consumers then need to reconstruct.
+    - Example: [#287 (comment)](https://github.com/substrait-io/substrait/pull/287#discussion_r942705485)
+
+ - Avoid adding functions that express behaviors that are already expressible with [specialized record expressions](../../expressions/specialized_record_expressions/).
+    - Example: the `if_else` function originally proposed in [#291](https://github.com/substrait-io/substrait/issues/291)
+
+ - Prefer adding new options to existing functions instead of adding new functions.
+    - Example: [#289](https://github.com/substrait-io/substrait/issues/289)
+    - Example: [#295](https://github.com/substrait-io/substrait/issues/295)
+
+ - Aim for syntactic and semantic consistency with widely used SQL dialects, especially PostgreSQL.
+    - Example: [#285 (comment)](https://github.com/substrait-io/substrait/pull/285#discussion_r944542030)
+
+ - Generalize the function as much as possible, to reduce the odds that we'll need to update it later.
+    - Example: for a function like `add`, consider making it variadic rather than only accepting two arguments.
+
+ - Be consistent when it comes to argument types. It is preferable to define a function that accepts and returns one type class over a function that promotes from one type class to another or accepts a mixture of type classes. This aims to prevent an explosion of function implementations.
+    - More information and examples: [#251](https://github.com/substrait-io/substrait/issues/251)
+
+ - Be pedantic when describing functionality. The corner cases that rarely come up in practice are exactly the places where different implementations are likely to differ, so for a plan to be implementation-agnostic, these are exactly the things that need to be specified exhaustively. For especially pedantic things, an optional enumeration argument may be suitable; this allows a producer to explicitly indicate that the consumer can pick the behavior.
+    - Example: the verbosity of the description of [regex_match_substring](https://github.com/substrait-io/substrait/blob/fbe5e0949b863334d02b5ad9ecac55ec8fc4debb/extensions/functions_string.yaml#L79-L139).
+    - Example: the floating point rounding option defined [here](common_options.md).
+
+ - The core extensions should generally not be defining type classes. If you believe a type class that isn't currently in the specification is important enough to include, it probably makes more sense to simply add it to the built-in types; otherwise, it should be a third-party extension.
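
The guideline about generalizing functions can be illustrated with a small Python analogy. Both function names are hypothetical and unrelated to how Substrait declares functions in its YAML files; the point is only that a variadic definition covers the two-argument case and avoids nesting calls.

```python
def add_binary(a: int, b: int) -> int:
    """Addition restricted to exactly two arguments."""
    return a + b


def add_variadic(first: int, *rest: int) -> int:
    """Addition accepting one or more arguments."""
    total = first
    for value in rest:
        total += value
    return total


# Summing four values needs nested calls with the binary form...
nested = add_binary(add_binary(add_binary(1, 2), 3), 4)
# ...but a single call with the variadic form.
flat = add_variadic(1, 2, 3, 4)
```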