Is your feature request related to a problem?
We should add a wrapper for pandas.IntervalIndex. This would solve a long-standing problem around propagating "bounds" variables (CF conventions, #1475).
The CF design
CF "encoding" for intervals is to use bounds variables. There is an attribute "bounds" on the dimension coordinate, that refers to a second variable (at least 2D). Example: x has an attribute bounds that refers to x_bounds.
import numpy as np
import xarray as xr

# Left and right edges of four unit-width intervals
left = np.arange(0.5, 3.6, 1)
right = np.arange(1.5, 4.6, 1)
bounds = np.stack([left, right])  # shape (2, 4), dims ("bnds", "x")
ds = xr.Dataset(
    {"data": ("x", [1, 2, 3, 4])},
    coords={
        "x": ("x", [1, 2, 3, 4], {"bounds": "x_bounds"}),
        "x_bounds": (("bnds", "x"), bounds),
    },
)
ds
A fundamental problem with our current data model is that we lose x_bounds when we extract ds.data because there is a dimension bnds that is not shared with ds.data. Very important metadata is now lost!
We would also like to use the "bounds" to enable interval-based indexing. ds.sel(x=1.1) should give you the value from the interval that contains 1.1.
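As a rough illustration of the lookup we want, here is a minimal sketch using pandas directly, without any xarray wrapper. The closed="left" choice is an assumption made for this example (the bounds variable itself does not say which edge is closed), and the extra query labels are made up.

```python
import numpy as np
import pandas as pd

# Same edges as the example dataset: [0.5, 1.5), [1.5, 2.5), ...
left = np.arange(0.5, 3.6, 1)
right = np.arange(1.5, 4.6, 1)
data = np.array([1, 2, 3, 4])

# closed="left" is an assumption for this sketch.
index = pd.IntervalIndex.from_arrays(left, right, closed="left")

# get_indexer gives the position of the interval containing each label,
# or -1 when no interval contains it.
positions = index.get_indexer([1.1, 3.0, 10.0])

# Keep only the labels that fell inside some interval.
selected = data[positions[positions >= 0]]
print(positions, selected)
```

A wrapper index would presumably delegate ds.sel(x=1.1) to this kind of lookup.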
Pandas IntervalIndex
All the indexing is easy to implement by wrapping pandas.IntervalIndex, but there is one limitation. pd.IntervalIndex stores two pieces of information for each interval (left bound, right bound). CF stores three: left bound, right bound (see x_bounds), and a "central" value (see x). This should be OK to work around in our wrapper.
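To make the mismatch concrete, here is a small sketch. pandas derives the midpoint from the two edges, so a CF "central" value that is not the midpoint cannot be recovered from the IntervalIndex alone. The cf_centers values are hypothetical and chosen only to show the difference.

```python
import numpy as np
import pandas as pd

index = pd.IntervalIndex.from_arrays([0.5, 1.5, 2.5, 3.5], [1.5, 2.5, 3.5, 4.5])

# .mid is computed from left and right; it is not stored separately.
derived_mid = index.mid.to_numpy()

# Hypothetical CF coordinate values that sit off-centre within each cell.
cf_centers = np.array([0.6, 1.6, 2.6, 3.6])

# False here: the wrapper would have to carry these values itself.
centers_match_midpoints = bool(np.allclose(cf_centers, derived_mid))
print(derived_mid, centers_match_midpoints)
```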
Fundamental Question
To me, a core question is whether x_bounds needs to be preserved after creating an IntervalIndex.
1. If so, we need a better rule around coordinate variable propagation. In this case, the IntervalIndex would be associated with x and x_bounds. So the rule could be
"propagate all variables necessary to propagate an index associated with any of the dimensions on the extracted variable."
So when extracting ds.data we propagate all variables necessary to propagate the indexes associated with ds.data.dims, that is x, which would say "propagate x, x_bounds, and the IntervalIndex".
2. Alternatively, we could choose to drop x_bounds entirely. I interpret this approach as "decoding" the bounds variable to an interval index object. When saving to disk, we would encode the interval index in two variables. (See below)
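The decode/encode idea in the second option could look roughly like the sketch below. Both helper names are hypothetical, the central value is assumed to be the midpoint, and the closed side is assumed to be "right" since the bounds array does not record it.

```python
import numpy as np
import pandas as pd


def encode_interval_index(index):
    """Hypothetical helper: split an IntervalIndex into two on-disk arrays."""
    # Assumption: the central value is the midpoint of each interval.
    centers = index.mid.to_numpy()
    # Shape (2, n), matching the ("bnds", "x") layout of x_bounds above.
    bounds = np.stack([index.left.to_numpy(), index.right.to_numpy()])
    return centers, bounds


def decode_bounds(bounds, closed="right"):
    """Hypothetical helper: rebuild the IntervalIndex from a (2, n) array."""
    left, right = bounds
    return pd.IntervalIndex.from_arrays(left, right, closed=closed)


index = pd.IntervalIndex.from_breaks(np.arange(0.5, 5.0, 1))
centers, bounds = encode_interval_index(index)
roundtripped = decode_bounds(bounds)
print(roundtripped.equals(index))
```

Under these assumptions the round trip is lossless. A non-midpoint central value or a different closed side would need to be stored somewhere else, for example in attributes.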
Describe the solution you'd like
I've prototyped (2) (approach 1 in this notebook) following @benbovy's suggestion.
import numpy as np
import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex


class XarrayIntervalIndex(PandasIndex):
    def __init__(self, index, dim, coord_dtype):
        assert isinstance(index, pd.IntervalIndex)

        # for PandasIndex
        self.index = index
        self.dim = dim
        self.coord_dtype = coord_dtype

    @classmethod
    def from_variables(cls, variables, options):
        assert len(variables) == 1
        (dim,) = tuple(variables)
        bounds = options["bounds"]
        assert isinstance(bounds, (xr.DataArray, xr.Variable))

        # The bounds axis is whichever dimension is not the indexed one.
        (axis,) = bounds.get_axis_num(set(bounds.dims) - {dim})
        left, right = np.split(bounds.data, 2, axis=axis)
        index = pd.IntervalIndex.from_arrays(left.squeeze(), right.squeeze())
        coord_dtype = bounds.dtype

        return cls(index, dim, coord_dtype)

    def create_variables(self, variables):
        from xarray.core.indexing import PandasIndexingAdapter

        newvars = {self.dim: xr.Variable(self.dim, PandasIndexingAdapter(self.index))}
        return newvars

    def __repr__(self):
        string = f"Xarray{self.index!r}"
        return string

    def to_pandas_index(self):
        return self.index

    # Each property exposes one aspect of the intervals as a plain PandasIndex.
    @property
    def mid(self):
        return PandasIndex(self.index.mid, self.dim, self.coord_dtype)

    @property
    def left(self):
        return PandasIndex(self.index.left, self.dim, self.coord_dtype)

    @property
    def right(self):
        return PandasIndex(self.index.right, self.dim, self.coord_dtype)

ds1 = (
    ds.drop_indexes("x")
    .set_xindex("x", XarrayIntervalIndex, bounds=ds.x_bounds)
    .drop_vars("x_bounds")
)
ds1
Describe alternatives you've considered
I've tried some approaches in this notebook