Support built-in names for Pandas GroupBy-Agg operations in Thicket's GroupBy #238

@ilumsden

Description

In docstrings and docs, we refer users to pandas for documentation on aggregation functions. Despite this, we do not currently support an important way of specifying aggregation functions: string function names.

For example, to aggregate with a "mean" operation, users currently have to write the following:

gb = thicket_obj.groupby(...)
gb.agg(numpy.mean)

By comparison, the much more common idiom for a pandas GroupBy-Agg is the following:

df.groupby(...).agg("mean")

We should also support string inputs to our GroupBy.agg method to be consistent with pandas.
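For reference, here is a minimal sketch of the pandas behavior this issue asks Thicket to match. It uses plain pandas rather than Thicket, and the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical profiling data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "compiler": ["gcc", "gcc", "clang", "clang"],
        "time": [1.0, 3.0, 2.0, 6.0],
    }
)

# A single built-in name selects pandas' own implementation of the operation.
mean_by_compiler = df.groupby("compiler").agg("mean")

# A list of names applies several built-in operations at once.
stats_by_compiler = df.groupby("compiler")["time"].agg(["mean", "max"])
```

Presumably the same call shapes (a single string or a list of strings) are what a Thicket GroupBy.agg would need to accept to stay consistent with pandas.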

Beyond consistency, there are two other reasons to do this:

  1. The logic behind a pandas mean (or similar operation) is not the same as the logic behind a NumPy mean (or equivalent operation). Current versions of pandas work around this by internally detecting when you pass NumPy functions in and replacing them with pandas' equivalents.
  2. Future versions of pandas (i.e., 3.0) will no longer replace NumPy functions with pandas' equivalents. That means choosing numpy.mean over "mean" will have consequences (e.g., for performance): the two will behave differently, and the NumPy functions may not produce correct output.
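To make the difference in logic concrete, here is a small sketch that applies the pandas and NumPy versions directly to values rather than through a groupby. It relies on two documented defaults: pandas' std uses ddof=1 and its reductions skip NaN, while NumPy's std uses ddof=0 and its reductions propagate NaN. The sample values are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas computes the sample standard deviation (ddof=1) by default.
pandas_std = values.std()
# NumPy computes the population standard deviation (ddof=0) by default.
numpy_std = np.std(values.to_numpy())

with_gap = pd.Series([1.0, np.nan, 3.0])

# pandas skips missing values by default (skipna=True).
pandas_mean = with_gap.mean()
# NumPy propagates NaN when applied to the raw array.
numpy_mean = np.mean(with_gap.to_numpy())
```

If the NumPy callables are used as-is inside an aggregation, differences of this kind would show up in the aggregated results.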

Metadata

    Labels

    area-thicket: Issues and PRs involving Thicket's core Thicket data structure and associated classes
    priority-normal: Normal priority issues and PRs
    type-feature: Requests for new features or PRs which implement new features
