Design for IntervalIndex #8005
Comments
I guess there is a more general framing of this problem.
Or more generally, should all the variables needed to construct an
Maybe this could be entirely left to the custom index whether to keep (and propagate) some coordinates or discard them after the index is created and/or after some other operation? The API of
@dcherian did you experiment further with this? I might want to take a stab at it, maybe in a draft PR.
One solution for this case is to:
To transform the indexed "x_bounds" coordinate back to its original form (for serialization), we could add a
If that makes sense, maybe
ds.sel(x=15, method="nearest")  # dispatch to pd.Index (method 'nearest' not supported for pd.IntervalIndex)
ds.sel(x_bounds=3.5)  # dispatch to pd.IntervalIndex
This solution makes it possible to fully leverage the interval index (central values + intervals) in a way that is unambiguous and compatible with the current rules of coordinate variable propagation for DataArrays.
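The two-index dispatch described in this comment can be sketched with plain pandas objects. This is only an illustration under stated assumptions: the coordinate values are made up, and the variables `x` and `x_bounds` stand in for the indexes that xarray would hold for the corresponding coordinates; it is not the xarray implementation.

```python
import pandas as pd

# Illustrative values only: central values "x" and interval edges "x_bounds".
x = pd.Index([5.0, 15.0, 25.0])
x_bounds = pd.IntervalIndex.from_breaks([0.0, 10.0, 20.0, 30.0], closed="left")

# Nearest-neighbour lookup on the central values (a plain pd.Index).
nearest_pos = x.get_indexer([14.0], method="nearest")[0]

# Containment lookup on the intervals (a pd.IntervalIndex).
interval_pos = x_bounds.get_loc(3.5)

print(nearest_pos, interval_pos)
```

Keeping the two lookups on separate coordinates means each selection has exactly one meaning: "closest central value" for `x`, "interval that contains the value" for `x_bounds`.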
Of course! We discussed this a bit in a meeting a while ago and there was broad agreement that we should consider updating our rule to propagate all indexes associated with a DataArray's dimensions. I think this would be a good experimental PR.
Personally, I think it would be nice to minimize these transformations to avoid having to teach users new concepts when writing to disk.
I prefer just
FWIW #9671 enables using an IntervalIndex directly, but there's still an impedance mismatch with CF, which stores both a central value for each interval (e.g. "time") and the edges (e.g. "time_bounds"), unlike pandas, which only stores the edges.
Is your feature request related to a problem?
We should add a wrapper for pandas.IntervalIndex; this would solve a long-standing problem around propagating "bounds" variables (CF conventions, #1475).
The CF design
CF "encoding" for intervals is to use bounds variables. There is an attribute "bounds" on the dimension coordinate that refers to a second variable (at least 2D). Example: x has an attribute bounds that refers to x_bounds.
A fundamental problem with our current data model is that we lose x_bounds when we extract ds.data, because there is a dimension bnds that is not shared with ds.data. Very important metadata is now lost!
We would also like to use the "bounds" to enable interval-based indexing. ds.sel(x=1.1) should give you the value from the appropriate interval.
Pandas IntervalIndex
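As a minimal sketch of the interval-based lookup that ds.sel(x=1.1) is meant to perform, here is the equivalent operation on a bare pandas.IntervalIndex. The edge values and the choice of closed="left" are assumptions made for illustration; CF bounds do not by themselves say which side of a cell is closed.

```python
import pandas as pd

# Hypothetical cell edges for a coordinate "x" (values are illustrative only).
left = [0.0, 1.0, 2.0]
right = [1.0, 2.0, 3.0]
index = pd.IntervalIndex.from_arrays(left, right, closed="left")

# For non-overlapping intervals, a scalar lookup returns the integer
# position of the interval that contains the value.
position = index.get_loc(1.1)
interval = index[position]

print(position, interval.left, interval.right)
```

A value that falls outside every interval raises a KeyError, so a wrapper would need to decide how to report that case to the user.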
All the indexing is easy to implement by wrapping pandas.IntervalIndex, but there is one limitation. pd.IntervalIndex saves two pieces of information for each interval (left bound, right bound). CF saves three: left bound, right bound (see x_bounds) and a "central" value (see x). This should be OK to work around in our wrapper.
Fundamental Question
To me, a core question is whether x_bounds needs to be preserved after creating an IntervalIndex.
1. If so, we need a better rule around coordinate variable propagation. In this case, the IntervalIndex would be associated with x and x_bounds. So the rule could be: when extracting ds.data, we propagate all variables necessary to propagate indexes associated with ds.data.dims, that is x, which would say "propagate x, x_bounds, and the IntervalIndex".
2. Alternatively, we could choose to drop x_bounds entirely. I interpret this approach as "decoding" the bounds variable to an interval index object. When saving to disk, we would encode the interval index in two variables. (See below)
Describe the solution you'd like
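For concreteness, the "decode on read, encode on write" idea from the second option might look like the sketch below, using only NumPy and pandas. The helper names decode_bounds and encode_bounds are hypothetical and are not part of xarray or pandas, and the interval closure is again an assumption.

```python
import numpy as np
import pandas as pd


def decode_bounds(bounds, closed="left"):
    """Hypothetical helper: (n, 2) CF-style bounds array -> pd.IntervalIndex."""
    bounds = np.asarray(bounds)
    return pd.IntervalIndex.from_arrays(bounds[:, 0], bounds[:, 1], closed=closed)


def encode_bounds(index):
    """Hypothetical helper: pd.IntervalIndex -> (n, 2) bounds array."""
    return np.stack([index.left.to_numpy(), index.right.to_numpy()], axis=1)


# Illustrative bounds; the last cell is wider than the others.
x_bounds = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 4.0]])
index = decode_bounds(x_bounds)
roundtrip = encode_bounds(index)

# The edges survive the round trip. The CF "central" value is not stored by
# pandas; index.mid is only the arithmetic midpoint of each interval.
midpoints = index.mid.to_numpy()
print(roundtrip)
print(midpoints)
```

Because pandas keeps only the edges, a wrapper that wants to write back the original central values (which need not be midpoints) would have to carry them separately.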
I've prototyped (2) (approach 1 in this notebook), following @benbovy's suggestion.
Describe alternatives you've considered
I've tried some approaches in this notebook