v6.0.0 (#358)
* refactor: Switch to type module, use Mocha for tests.

* feat!: Major internal revision.

* fix: Fix file capitalization.

* feat!: Update arrow, formats, and build process.

* docs: Update documentation.

* fix: Fix column arrayType method.

* refactor: Internal refactoring and type fixes.

* refactor!: More refactoring and cleanup.

* test: Add table data freeze tests.

* feat: Update types and comments.

* chore: Update version to 6.0.0.beta.

* chore: Bump dependencies.

* chore: Bump version, update deps.
jheer authored Jul 17, 2024
1 parent dbb663d commit 3e8620a
Showing 304 changed files with 13,298 additions and 15,689 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
BSD 3-Clause License

Copyright (c) 2020-2022, UW Interactive Data Lab
Copyright (c) 2020-2024, UW Interactive Data Lab
All rights reserved.

Redistribution and use in source and binary forms, with or without
16 changes: 9 additions & 7 deletions README.md
@@ -93,13 +93,9 @@ Arquero uses modern JavaScript features, and so will not work with some outdated

### In Node.js or Application Bundles

First install `arquero` as a dependency, for example via `npm install arquero --save`. Arquero assumes Node version 12 or higher.

Import using CommonJS module syntax:

```js
const aq = require('arquero');
```
First install `arquero` as a dependency, for example via `npm install arquero --save`.
Arquero assumes Node version 18 or higher.
As of Arquero version 6, the library uses type `module` and should be loaded using ES module syntax.

Import using ES module syntax, import all exports into a single object:

@@ -113,6 +109,12 @@ Import using ES module syntax, with targeted imports:
import { op, table } from 'arquero';
```

Dynamic import (e.g., within a Node.js REPL):

```js
aq = await import('arquero');
```

## Build Instructions

To build and develop Arquero locally:
10 changes: 3 additions & 7 deletions docs/api/expressions.md
@@ -142,14 +142,10 @@ So why do we do this? Here are a few reasons:

* **Performance**. After parsing an expression, Arquero performs code generation, often creating more performant code in the process. This level of indirection also allows us to generate optimized expressions for certain inputs, such as Apache Arrow data.

* **Flexibility**. Providing our own parsing also allows us to introduce new kinds of backing data without changing the API. For example, we could add support for different underlying data formats and storage layouts.

* **Portability**. While a common use case of Arquero is to query data directly in the same JavaScript runtime, Arquero verbs can also be [*serialized as queries*](./#queries): one can specify verbs in one environment, but then send them to another environment for processing. For example, the [arquero-worker](https://github.com/uwdata/arquero-worker) package sends queries to a worker thread, while the [arquero-sql](https://github.com/chanwutk/arquero-sql) package sends them to a backing database server. As custom methods may not be defined in those environments, Arquero is designed to make this translation between environments possible and easier to reason about.

* **Safety**. Arquero table expressions do not let you call methods defined on input data values. For example, to trim a string you must call `op.trim(str)`, not `str.trim()`. Again, this aids portability: otherwise unsupported methods defined on input data elements might "sneak" in to the processing. Invoking arbitrary methods may also lead to security vulnerabilities when allowing untrusted third parties to submit queries into a system.
* **Flexibility**. Providing our own parsing also allows us to introduce new kinds of backing data without changing the API. For example, we could add support for different underlying data formats and storage layouts. More importantly, it also allows us to analyze expressions and incorporate aggregate and window functions in otherwise "normal" JavaScript expressions.

* **Discoverability**. Defining all functions on a single object provides a single catalog of all available operations. In most IDEs, you can simply type `op.` (and perhaps hit the <kbd>tab</kbd> key) to see a list of all available functions and benefit from auto-complete!

Of course, one might wish to make different trade-offs. Arquero is designed to support common use cases while also being applicable to more complex production setups. This goal comes with the cost of more rigid management of functions. However, Arquero can be extended with custom variables, functions, and even new table methods or verbs! As starting points, see the [params](table#params), [addFunction](extensibility#addFunction), and [addTableMethod](extensibility#addTableMethod) functions to introduce external variables, register new `op` functions, or extend tables with new methods.
Of course, one might wish to make different trade-offs. Arquero is designed to support common use cases while also being applicable to more complex production setups. This goal comes with the cost of more rigid management of functions. However, Arquero can be extended with custom variables, functions, and even new table methods or verbs! As starting points, see the [params](table#params) and [addFunction](extensibility#addFunction) methods to introduce external variables or register new `op` functions.

All that being said, not all use cases require portability, safety, etc. For such cases Arquero provides an escape hatch: use the [`escape()` expression helper](./#escape) to apply a standard JavaScript function *as-is*, skipping any internal parsing and code generation.
All that being said, Arquero provides an escape hatch: use the [`escape()` expression helper](./#escape) to apply a standard JavaScript function *as-is*, skipping any internal parsing and code generation. As a result, escaped functions do *not* support aggregation and window operations, as these depend on Arquero's internal parsing and code generation.
160 changes: 12 additions & 148 deletions docs/api/extensibility.md
@@ -14,6 +14,7 @@ title: Extensibility \| Arquero API Reference
* [addVerb](#addVerb)
* [Package Bundles](#packages)
* [addPackage](#addPackage)
* [Table Methods](#table-methods)

<br/>

@@ -123,158 +124,21 @@ aq.table({ x: [4, 3, 2, 1] })
## <a id="table-methods">Table Methods</a>
Add new table-level methods or verbs. The [addTableMethod](#addTableMethod) function registers a new function as an instance method of tables only. The [addVerb](#addVerb) method registers a new transformation verb with both tables and serializable [queries](./#query).
<hr/><a id="addTableMethod" href="#addTableMethod">#</a>
<em>aq</em>.<b>addTableMethod</b>(<i>name</i>, <i>method</i>[, <i>options</i>]) · [Source](https://github.com/uwdata/arquero/blob/master/src/register.js)
Register a custom table method, adding a new method with the given *name* to all table instances. The provided *method* must take a table as its first argument, followed by any additional arguments.
This method throws an error if the *name* argument is not a legal string value.
To protect Arquero internals, the *name* can not start with an underscore (`_`) character. If a custom method with the same name is already registered, the override option must be specified to overwrite it. In no case may a built-in method be overridden.
* *name*: The name to use for the table method.
* *method*: A function implementing the table method. This function should accept a table as its first argument, followed by any additional arguments.
* *options*: Function registration options.
* *override*: Boolean flag (default `false`) indicating if the added method is allowed to override an existing method with the same name. Built-in table methods can **not** be overridden; this flag applies only to methods previously added using the extensibility API.
*Examples*
```js
// add a table method named size, returning an array of row and column counts
aq.addTableMethod('size', table => [table.numRows(), table.numCols()]);
aq.table({ a: [1,2,3], b: [4,5,6] }).size() // [3, 2]
```
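The naming and override rules described above can be illustrated with a small standalone sketch. Everything in it is hypothetical: `registerMethod`, the `builtins` set, and the `custom` map are illustrative names, not Arquero's implementation, and the list of built-in names is only an assumption for the example.

```javascript
// Hypothetical registry illustrating the documented rules.
// This is NOT Arquero's implementation; all names here are assumptions.
const builtins = new Set(['filter', 'derive', 'rollup']); // assumed built-in names
const custom = new Map(); // methods added through the registry

function registerMethod(name, method, { override = false } = {}) {
  // the name must be a legal (non-empty) string
  if (typeof name !== 'string' || name.length === 0) {
    throw new Error('Name must be a non-empty string.');
  }
  // names starting with an underscore are reserved for internals
  if (name.startsWith('_')) {
    throw new Error('Name can not start with an underscore.');
  }
  // built-in methods can never be overridden
  if (builtins.has(name)) {
    throw new Error(`Built-in method "${name}" can not be overridden.`);
  }
  // previously registered custom methods require an explicit override
  if (custom.has(name) && !override) {
    throw new Error(`Method "${name}" exists; pass { override: true }.`);
  }
  custom.set(name, method);
}

// first registration succeeds
registerMethod('size', t => [t.rows, t.cols]);
// re-registering the same name only succeeds with the override flag
registerMethod('size', t => t.rows * t.cols, { override: true });
```

In this sketch the `override` flag only affects names already present in `custom`; names in `builtins` are rejected regardless of the flag, mirroring the rule that built-in methods can not be overridden.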
<hr/><a id="addVerb" href="#addVerb">#</a>
<em>aq</em>.<b>addVerb</b>(<i>name</i>, <i>method</i>, <i>params</i>[, <i>options</i>]) · [Source](https://github.com/uwdata/arquero/blob/master/src/register.js)
Register a custom transformation verb with the given *name*, adding both a table method and serializable [query](./#query) support. The provided *method* must take a table as its first argument, followed by any additional arguments. The required *params* argument describes the parameters the verb accepts. If you wish to add a verb to tables but do not require query serialization support, use [addTableMethod](#addTableMethod).
This method throws an error if the *name* argument is not a legal string value.
To protect Arquero internals, the *name* can not start with an underscore (`_`) character. If a custom method with the same name is already registered, the override option must be specified to overwrite it. In no case may a built-in method be overridden.
* *name*: The name to use for the table method.
* *method*: A function implementing the table method. This function should accept a table as its first argument, followed by any additional arguments.
* *params*: An array of schema descriptions for the verb parameters. These descriptors are needed to support query serialization. Each descriptor is an object with *name* (string-valued parameter name) and *type* properties (string-valued parameter type, see below). If a parameter has type `"Options"`, the descriptor can include an additional object-valued *props* property to describe any non-literal values, for which the keys are property names and the values are parameter types.
* *options*: Function registration options.
* *override*: Boolean flag (default `false`) indicating if the added method is allowed to override an existing method with the same name. Built-in verbs can **not** be overridden; this flag applies only to methods previously added using the extensibility API.
*Parameter Types*. The supported parameter types are:
* `"Expr"`: A single table expression, such as the input to [`filter()`](verbs/#filter).
* `"ExprList"`: A list of column references or expressions, such as the input to [`groupby()`](verbs/#groupby).
* `"ExprNumber"`: A number literal or numeric table expression, such as the *weight* option of [`sample()`](verbs/#sample).
* `"ExprObject"`: An object containing a set of expressions, such as the input to [`rollup()`](verbs/#rollup).
* `"JoinKeys"`: Input join keys, as in [`join()`](verbs/#join).
* `"JoinValues"`: Output join values, as in [`join()`](verbs/#join).
* `"Options"`: An options object of key-value pairs. If any of the option values are column references or table expressions, the descriptor should include a *props* property with property names as keys and parameter types as values.
* `"OrderKeys"`: A list of ordering criteria, as in [`orderby`](verbs/#orderby).
* `"SelectionList"`: A set of columns to select and potentially rename, as in [`select`](verbs/#select).
* `"TableRef"`: A reference to an additional input table, as in [`join()`](verbs/#join).
* `"TableRefList"`: A list of one or more additional input tables, as in [`concat()`](verbs/#concat).
*Examples*
```js
// add a bootstrapped confidence interval verb that
// accepts an aggregate expression plus options
aq.addVerb(
  'bootstrap_ci',
  (table, expr, options = {}) => table
    .params({ frac: options.frac || 1000 })
    .sample((d, $) => op.round($.frac * op.count()), { replace: true })
    .derive({ id: (d, $) => op.row_number() % $.frac })
    .groupby('id')
    .rollup({ bs: expr })
    .rollup({
      lo: op.quantile('bs', options.lo || 0.025),
      hi: op.quantile('bs', options.hi || 0.975)
    }),
  [
    { name: 'expr', type: 'Expr' },
    { name: 'options', type: 'Options' }
  ]
);

// apply the new verb
aq.table({ x: [1, 2, 3, 4, 6, 8, 9, 10] })
  .bootstrap_ci(op.mean('x'))
```
<br/>
## <a id="packages">Package Bundles</a>
Extend Arquero with a bundle of functions, table methods, and/or verbs.
<hr/><a id="addPackage" href="#addPackage">#</a>
<em>aq</em>.<b>addPackage</b>(<i>bundle</i>[, <i>options</i>]) · [Source](https://github.com/uwdata/arquero/blob/master/src/register.js)
Register a *bundle* of extensions, which may include standard functions, aggregate functions, window functions, table methods, and verbs. If the input *bundle* has a key named `"arquero_package"`, the value of that property is used; otherwise the *bundle* object is used directly. This method is particularly useful for publishing separate packages of Arquero extensions and then installing them with a single method call.
A package bundle has the following structure:
```js
const bundle = {
  functions: { ... },
  aggregateFunctions: { ... },
  windowFunctions: { ... },
  tableMethods: { ... },
  verbs: { ... }
};
```
All keys are optional. For example, `functions` or `verbs` may be omitted. Each sub-bundle is an object of key-value pairs, where the key is the name of the function and the value is the function to add.
The lone exception is the `verbs` bundle, which instead uses an object format with *method* and *params* keys, corresponding to the *method* and *params* arguments of [addVerb](#addVerb):
```js
const bundle = {
  verbs: {
    name: {
      method: (table, expr) => { ... },
      params: [ { name: 'expr', type: 'Expr' } ]
    }
  }
};
```
The package method performs validation prior to adding any package content. The method will throw an error if any of the package items fail validation. See the [addFunction](#addFunction), [addAggregateFunction](#addAggregateFunction), [addWindowFunction](#windowFunction), [addTableMethod](#addTableMethod), and [addVerb](#addVerb) methods for specific validation criteria. The *options* argument can be used to specify if method overriding is permitted, as supported by each of the aforementioned methods.
* *bundle*: The package bundle of extensions.
* *options*: Function registration options.
* *override*: Boolean flag (default `false`) indicating if the added method is allowed to override an existing method with the same name. Built-in table methods or verbs can **not** be overridden; for table methods and verbs this flag applies only to methods previously added using the extensibility API.
To add new table-level methods, including transformation verbs, simply assign new methods to the `ColumnTable` class prototype.
*Examples*
```js
// add a package
aq.addPackage({
  functions: {
    square: x => x * x,
  },
  tableMethods: {
    size: table => [table.numRows(), table.numCols()]
  }
});
```
```js
// add a package, ignores any content outside of "arquero_package"
aq.addPackage({
arquero_package: {
functions: {
square: x => x * x,
},
tableMethods: {
size: table => [table.numRows(), table.numCols()]
import { ColumnTable, op } from 'arquero';

// add a sum verb, which returns a new table containing summed
// values (potentially grouped) for a given column name
Object.assign(
ColumnTable.prototype,
{
sum(column, { as = 'sum' } = {}) {
return this.rollup({ [as]: op.sum(column) });
}
}
});
);
```
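The prototype-assignment pattern itself can be tried without Arquero installed. The sketch below uses a hypothetical `MiniTable` stand-in class (it is not part of Arquero and shares none of `ColumnTable`'s internals) to show how `Object.assign` on a class prototype makes a new method available on every instance.

```javascript
// MiniTable is a hypothetical stand-in for a table class.
// It is NOT Arquero's ColumnTable; it only stores plain column arrays.
class MiniTable {
  constructor(columns) {
    this.columns = columns; // e.g. { x: [1, 2, 3] }
  }
}

// Assigning to the prototype adds the method to all instances,
// including instances created before the assignment.
Object.assign(MiniTable.prototype, {
  // sum a numeric column and return a new single-row table
  sum(column, { as = 'sum' } = {}) {
    const total = this.columns[column].reduce((a, b) => a + b, 0);
    return new MiniTable({ [as]: [total] });
  }
});

const t = new MiniTable({ x: [1, 2, 3, 4] });
const result = t.sum('x', { as: 'total' });
console.log(result.columns); // { total: [ 10 ] }
```

Because the method lives on the shared prototype, a name that collides with an existing method silently replaces it, so some care with naming is advisable when extending a library class this way.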
```js
// add a package from a separate library
aq.addPackage(require('arquero-arrow'));
```