groupby_bins fails on time series data #10217

Open
5 tasks done
relativistic opened this issue Apr 10, 2025 · 3 comments · May be fixed by #10227

@relativistic

What happened?

I'm not sure if this is a bug, or just surprising behavior.

When I have a dataset with time series variables and I do a groupby_bins operation followed by mean(), the time series data is silently dropped from the result instead of being aggregated.

What did you expect to happen?

I expect groupby_bins aggregations to be applied to time series variables whenever the operation is well defined for them. In the example code below, mean() should return the average time in each bin.

Some aggregation operations might not be well defined for time (arguably sum(), for example). In such cases I'd expect the operation to return NaNs or raise an error.
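To make the distinction concrete, here is a minimal plain-NumPy sketch of why a mean of timestamps is well defined while a sum is not. The helper `datetime_mean` is a hypothetical name introduced for illustration; it is not how xarray computes the result, only one way to reason about it.

```python
import numpy as np

# Ten evenly spaced timestamps over one minute, similar to the example's range.
start = np.datetime64("2024-01-01T15:00:00", "ns")
offsets_ns = np.linspace(0, 60e9, 10).astype("int64")
times = start + offsets_ns.astype("timedelta64[ns]")


def datetime_mean(values):
    """Mean of datetime64 values, computed on offsets from the first element.

    Hypothetical helper for illustration only.
    """
    origin = values[0]
    deltas_ns = (values - origin).astype("int64")  # nanoseconds since origin
    return origin + np.timedelta64(int(deltas_ns.mean()), "ns")


# The mean of the first two timestamps lies between them.
first_bin_mean = datetime_mean(times[:2])
print(first_bin_mean)

# Adding two datetimes has no meaning, so NumPy refuses to sum them.
try:
    times.sum()
except TypeError as err:
    print("sum is not defined for datetime64:", err)
```

The mean works because differences between timestamps are durations, and durations can be averaged and added back to a reference time. A sum has no such interpretation, which is why an error (or a missing value) seems like a reasonable outcome there.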

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import pandas as pd

ds = xr.Dataset(
    {
        'measurement': ('trial', np.arange(0, 100, 10)),
        'time': ('trial', pd.date_range("20240101T1500", "20240101T1501", periods=10)),
    },
    coords={'trial': np.arange(10)},
)
ds_agged = ds.groupby_bins('trial', 5).mean()

# 'time' variable is missing from the results, but 'measurement' is present
print(ds_agged)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

<xarray.Dataset> Size: 80B
Dimensions:      (trial_bins: 5)
Coordinates:
  * trial_bins   (trial_bins) object 40B (-0.009, 1.8] (1.8, 3.6] ... (7.2, 9.0]
Data variables:
    measurement  (trial_bins) float64 40B 5.0 25.0 45.0 65.0 85.0

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.16 | packaged by conda-forge | (main, Dec 5 2024, 14:16:10) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2025.3.1
pandas: 2.2.3
numpy: 2.1.3
scipy: 1.15.2
netCDF4: 1.7.2
pydap: 3.5.4
h5netcdf: 1.6.1
h5py: 3.13.0
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.11.0
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: 3.10.1
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: 0.24.4
sparse: 0.16.0
flox: None
numpy_groupies: None
setuptools: 75.8.0
pip: 25.0
conda: None
pytest: None
mypy: None
IPython: 8.32.0
sphinx: None

@relativistic added the bug and needs triage labels on Apr 10, 2025

welcome bot commented Apr 10, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@relativistic
Author

Okay, I did a bit more checking. So, this probably isn't technically a bug. Time data is not numeric data (pd.api.types.is_numeric_dtype(ds.time) returns False in the above example), and according to the docs non-numeric data is dropped prior to taking the mean.
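A small sketch of that dtype check, using only pandas and NumPy on arrays shaped like the ones in the original example. It shows how pandas classifies the two dtypes; whether xarray uses this exact check internally is an assumption I have not verified.

```python
import numpy as np
import pandas as pd

# Same kind of data as the two variables in the original example.
times = pd.date_range("2024-01-01T15:00", "2024-01-01T15:01", periods=10)
measurement = np.arange(0, 100, 10)

# pandas treats integer data as numeric, but not datetime64 data.
time_is_numeric = pd.api.types.is_numeric_dtype(times.dtype)
measurement_is_numeric = pd.api.types.is_numeric_dtype(measurement.dtype)

# The datetime data is recognized by a separate predicate.
time_is_datetime = pd.api.types.is_datetime64_any_dtype(times.dtype)

print("time numeric:", time_is_numeric)
print("measurement numeric:", measurement_is_numeric)
print("time datetime:", time_is_datetime)
```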

I still hold that this is surprising and inconsistent behavior. I've done some more experiments. If you apply the groupby operation directly to the time variable, it works. In other words, continuing from the example in my original post:

print(ds.time.groupby_bins('trial',5).mean())

prints what I'd expect:

<xarray.DataArray 'time' (trial_bins: 5)> Size: 40B
array(['2024-01-01T15:00:03.333333504', '2024-01-01T15:00:16.666666496',
       '2024-01-01T15:00:30.000000000', '2024-01-01T15:00:43.333333504',
       '2024-01-01T15:00:56.666666496'], dtype='datetime64[ns]')
Coordinates:
  * trial_bins  (trial_bins) object 40B (-0.009, 1.8] (1.8, 3.6] ... (7.2, 9.0]

It seems inconsistent that applying groupby_bins to the dataset drops the time variable, but when you apply groupby_bins to the time variable directly, you get an answer.
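Given that difference, one possible interim workaround is to aggregate the datetime variable on its own and attach it to the dataset result. This is only a sketch based on the behavior described above, not a recommended or official approach.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {
        "measurement": ("trial", np.arange(0, 100, 10)),
        "time": (
            "trial",
            pd.date_range("2024-01-01T15:00", "2024-01-01T15:01", periods=10),
        ),
    },
    coords={"trial": np.arange(10)},
)

# Aggregate the whole dataset; on the version reported here, 'time' is dropped.
ds_agged = ds.groupby_bins("trial", 5).mean()

# Aggregate the datetime variable separately and attach it to the result.
# If 'time' is already present, this simply overwrites it.
ds_agged["time"] = ds["time"].groupby_bins("trial", 5).mean()

print(ds_agged)
```

Both results share the same trial_bins coordinate, so the assignment lines the binned means up with the other aggregated variables.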

@dcherian removed the needs triage label on Apr 15, 2025
@dcherian added a commit to dcherian/xarray that referenced this issue on Apr 15, 2025
@dcherian linked a pull request on Apr 15, 2025 that will close this issue
@dcherian
Contributor

I worked on this a while ago and never opened a PR. See #10227. Are you able to contribute some extra tests? For example, we'd need at least one each for Dataset and DataArray.
