[SL-1967] Add support for statistical aggregate functions #1111

Open
tlento opened this issue Apr 3, 2024 · 1 comment
Labels
backlog, Medium priority (Created by Linear-GitHub Sync), Metricflow (Created by Linear-GitHub Sync)

Comments

@tlento
Contributor

tlento commented Apr 3, 2024

Several straightforward statistical aggregate functions should be supportable without too much effort, although as always we have to make some decisions about scope.

There is a current request for var_samp and covar_samp for BigQuery, but there are others we could add to this list.

Statistical aggregate functions recommended for consideration

  1. Sample variance var_samp
  2. Sample covariance covar_samp (multi-argument, not natively supported by Redshift)
  3. Sample standard deviation stddev_samp
  4. Population variance var_pop
  5. Population covariance covar_pop (multi-argument, not natively supported by Redshift)
  6. Population standard deviation stddev_pop
  7. Correlation coefficient: corr (multi-argument, not natively supported by Redshift)
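
For reference, the sample and population variants differ only in the divisor (`n - 1` versus `n`), and variance is the covariance of a column with itself. A minimal Python sketch of the definitions follows. The function names mirror the SQL names for readability, but they are hypothetical helpers and not MetricFlow code. Returning `None` for undefined results is an assumption standing in for SQL `NULL`; engines may differ on edge cases.

```python
from math import sqrt
from typing import Optional, Sequence


def _covar(xs: Sequence[float], ys: Sequence[float], ddof: int) -> Optional[float]:
    """Covariance with divisor n - ddof (ddof=1: sample, ddof=0: population)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n - ddof <= 0:
        return None  # undefined; stands in for SQL NULL (assumption)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - ddof)


def covar_samp(xs, ys):
    return _covar(xs, ys, ddof=1)


def covar_pop(xs, ys):
    return _covar(xs, ys, ddof=0)


def var_samp(xs):
    # Variance is the covariance of a column with itself.
    return _covar(xs, xs, ddof=1)


def var_pop(xs):
    return _covar(xs, xs, ddof=0)


def stddev_samp(xs):
    v = var_samp(xs)
    return None if v is None else sqrt(v)


def stddev_pop(xs):
    v = var_pop(xs)
    return None if v is None else sqrt(v)


def corr(xs, ys):
    """Pearson correlation: covar_pop / (stddev_pop(x) * stddev_pop(y))."""
    sx, sy = stddev_pop(xs), stddev_pop(ys)
    if not sx or not sy:
        return None  # empty input or a constant column (assumption: treat as NULL)
    return covar_pop(xs, ys) / (sx * sy)
```

Note that the covariance and correlation helpers take two inputs, which is what sets items 2, 5, and 7 apart from the single-argument functions in the list.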

Statistical aggregate functions NOT under consideration

  1. Kurtosis: kurtosis (not natively supported by BigQuery, Postgres, Redshift)
  2. Skewness: skewness (skew in Snowflake, not natively supported by BigQuery, Postgres, Redshift)

Native implementations are missing from too many engines to justify the effort for these, especially given how little use they're likely to see.

Overall recommendation

Start with the functions supported across all engines. They are much more straightforward to develop and test because they fit into our existing aggregate function model.

Separately, evaluate whether or not to bother with custom native-sql implementations of the covariance and correlation functions for Redshift. These are also more complex because they are the first multi-input aggregate functions we would be supporting.
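
As a rough illustration of what a custom implementation might involve, sample covariance can be rewritten in terms of single-argument `SUM` and `COUNT` aggregates. The sketch below renders such an expression and checks it against SQLite, which is used here only as a stand-in for an engine without a native `covar_samp`. The helper names `render_covar_samp` and `covar_samp_on_sqlite` are hypothetical, nothing here is MetricFlow's rendering API, and the expression has not been verified on Redshift.

```python
import sqlite3
from typing import Optional, Sequence, Tuple


def render_covar_samp(x: str, y: str) -> str:
    """Render sample covariance of columns x and y using only SUM and COUNT."""
    # Keep a value only when its partner is non-null, so every aggregate
    # sees the same set of complete (x, y) pairs.
    px = f"CASE WHEN {y} IS NOT NULL THEN CAST({x} AS FLOAT) END"
    py = f"CASE WHEN {x} IS NOT NULL THEN CAST({y} AS FLOAT) END"
    n = f"COUNT(({px}) * ({py}))"  # number of complete pairs
    # covar_samp = (sum(xy) - sum(x) * sum(y) / n) / (n - 1)
    # NULLIF guards both divisions, so fewer than two pairs yields NULL.
    return (
        f"(SUM(({px}) * ({py})) - SUM({px}) * SUM({py}) / NULLIF({n}, 0))"
        f" / NULLIF({n} - 1, 0)"
    )


def covar_samp_on_sqlite(
    rows: Sequence[Tuple[Optional[int], Optional[int]]]
) -> Optional[float]:
    """Evaluate the rendered expression on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        query = f"SELECT {render_covar_samp('x', 'y')} FROM t"
        (value,) = conn.execute(query).fetchone()
        return value
    finally:
        conn.close()
```

This single-pass form is simple to render but can lose precision when values are large relative to their spread, which is one more thing to weigh when deciding whether the custom implementations are worth it.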

SL-1967

@tlento tlento changed the title Add support for statistical aggregate functions [SL-1967] Add support for statistical aggregate functions Apr 3, 2024
@tlento tlento added the backlog label Apr 3, 2024
@tlento
Contributor Author

tlento commented Apr 3, 2024

Note: this is closely related to, and possibly a prerequisite for, #52

@tlento tlento added Metricflow Created by Linear-GitHub Sync Medium priority Created by Linear-GitHub Sync labels Apr 3, 2024