Description
What happened?
I ran into a large memory allocation when doing an isel on a zarr dataset with dask turned off (i.e. opened with open_dataset("mydataset.zarr", chunks=None)), which caused the Linux OOM killer to kill my process.
I used some memory profilers (memray + scalene) to dig in a little; it appears that a fancy isel on a LazilyIndexedArray ultimately causes indexing._combine_indexers to run np.broadcast_arrays(…) on a set of tuples, and the broadcast returns numpy arrays of the same shape as the underlying dataset. See the MCVE for an example isel which I think tries to allocate exabytes of RAM. (This happens not when loading the data, but during the lazy isel itself.)
The VectorizedIndexer created at the end of _combine_indexers also ends up being made of ndarrays the size of the data to be selected, which could also be large.
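To make the shape blowup concrete, here is a small standalone sketch in plain NumPy (my own variable names and toy sizes, not xarray's exact code path): converting full-axis slices into integer arrays and broadcasting them against each other yields index arrays with the shape of the whole backing array, so any step that materializes them allocates memory proportional to the full dataset rather than the selection.

import numpy as np

# Stand-in size; the real dataset uses 1_000_000 per axis, which is why the
# attempted allocation reaches exabyte scale (1e6 ** 3 elements * 8 bytes).
n = 100

# Full-axis slices converted to integer arrays, one per dimension, reshaped so
# they broadcast against each other (roughly what an outer-to-vectorized
# indexer conversion produces).
t_key = np.arange(n).reshape(n, 1, 1)
y_key = np.arange(n).reshape(1, n, 1)
z_key = np.arange(n).reshape(1, 1, n)

# np.broadcast_arrays itself returns views, but each view has the shape of the
# entire backing array...
views = np.broadcast_arrays(t_key, y_key, z_key)
print([v.shape for v in views])  # [(n, n, n)] * 3

# ...so materializing them (copying, or fancy-indexing them with another array
# indexer) allocates n**3 elements per array.
copies = [np.ascontiguousarray(v) for v in views]
print(sum(c.nbytes for c in copies))  # 3 * n**3 * 8 bytes, ~24 MB at n=100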
What did you expect to happen?
I was hoping that I could index into a large dataset cheaply before loading it (e.g. by tracking the ranges to be selected as slices). The original example and the one below both try to select a relatively small fraction of the total data.
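For comparison, a slice-based selection on the same kind of lazily indexed array returns quickly, since the slices can simply be recorded on the lazy wrapper instead of being expanded into broadcast integer arrays. This is a rough sketch of the behaviour I was expecting (simplified from the MCVE below, with the coords dropped):

import numpy as np
import xarray as xr
from xarray.core import indexing

lazy = indexing.LazilyIndexedArray(
    np.broadcast_to([1], (1_000_000, 1_000_000, 1_000_000))
)
arr = xr.DataArray(lazy, dims=("t", "y", "z"))

# Basic (slice) indexing only updates the recorded key and stays cheap;
# it is the vectorized/fancy path that triggers the huge broadcast.
arr.isel(t=slice(0, 1), y=slice(0, 10))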
Minimal Complete Verifiable Example
import xarray as xr
from xarray.core import indexing
import numpy as np
def test_dataset_indexing_memory() -> None:
    # Mimics open_dataset with a large backing array and no dask.
    big_backing_array = indexing.LazilyIndexedArray(
        np.broadcast_to([1], (1_000_000, 1_000_000, 1_000_000))
    )
    data_array = xr.DataArray(
        big_backing_array,
        coords={
            "t": np.arange(1_000_000),
            "y": np.arange(1_000_000),
            "z": np.arange(1_000_000),
        },
    )
    print("running isel")
    # I think any fancy isel will do.
    data_array.isel(t=xr.Variable(("new", "t"), [[0]]))
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
Happy to share memory profiles or other debugging info if that would be helpful. Thank you for taking a look!
Environment
INSTALLED VERSIONS
commit: c8affb3
python: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:23:34) [Clang 18.1.8 ]
python-bits: 64
OS: Darwin
OS-release: 24.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2
xarray: 2025.4.1.dev1+g070af11f
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: 1.7.2
pydap: 3.5.5
h5netcdf: 1.6.1
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.11.1
bottleneck: 1.4.2
dask: 2024.8.2
distributed: 2024.8.2
matplotlib: 3.10.1
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.1
numpy_groupies: 0.11.2
setuptools: 80.1.0
pip: 25.1
conda: 24.9.2
pytest: 8.3.5
mypy: 1.15.0
IPython: None
sphinx: None