Fancy isel on LazilyIndexedArray allocates memory of about the size of backing dataset #10311

Open
@jder

Description

What happened?

I ran into a large memory allocation when doing an isel on a zarr dataset with dask turned off (i.e. opened with open_dataset("mydataset.zarr", chunks=None)), which caused the Linux OOM killer to kill my process.

I used some memory profilers (memray + scalene) to dig in a little; it appears that a fancy isel on a LazilyIndexedArray ultimately causes indexing._combine_indexers to run np.broadcast_arrays(…) on a set of tuples, and the broadcast returns a NumPy array of the same shape as the underlying dataset. See the MCVE for an example isel which I think tries to allocate exabytes of RAM. (Not when loading the data; when doing the lazy isel.)

The VectorizedIndexer created at the end of _combine_indexers also ends up being composed of ndarrays the size of the data to be selected, which can also be large.
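The broadcasting step can be reproduced in isolation. This is a sketch, not xarray's actual code path: it assumes _combine_indexers broadcasts the per-dimension indexers against each other, and the names (t_idx, y_idx) and sizes are illustrative.

```python
import numpy as np

# Hypothetical per-dimension indexers, shaped so they broadcast against
# each other the way combined fancy indexers do.
nt, ny = 1000, 1000
t_idx = np.arange(nt).reshape(nt, 1)  # indexer along "t"
y_idx = np.arange(ny).reshape(1, ny)  # indexer along "y"

bt, by = np.broadcast_arrays(t_idx, y_idx)
# The broadcast results take the shape of the full cross product of the
# dimensions -- they scale with the dataset, not with the selection.
print(bt.shape, by.shape)  # (1000, 1000) (1000, 1000)
```

With three dimensions of size 1,000,000 (as in the MCVE below), any array materialized at that broadcast shape would be astronomically large even though the selection itself is tiny.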

What did you expect to happen?

I was hoping that I could index into a large dataset cheaply before loading it (e.g. by tracking the ranges to be selected as slices). The original example and the one below both select a relatively small fraction of the total data.

Minimal Complete Verifiable Example

import xarray as xr
from xarray.core import indexing
import numpy as np

def test_dataset_indexing_memory() -> None:
    # Mimics open_dataset with a large backing array and no dask.
    big_backing_array = indexing.LazilyIndexedArray(
        np.broadcast_to([1], (1_000_000, 1_000_000, 1_000_000))
    )
    data_array = xr.DataArray(
        big_backing_array,
        coords={
            "t": np.arange(1_000_000),
            "y": np.arange(1_000_000),
            "z": np.arange(1_000_000),
        },
    )
    print("running isel")
    # I think any fancy isel will do
    data_array.isel(t=xr.Variable(("new", "t"), [[0]]))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

Happy to share memory profiles or other debugging info if that would be helpful. Thank you for taking a look!

Environment

INSTALLED VERSIONS

commit: c8affb3
python: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:23:34) [Clang 18.1.8 ]
python-bits: 64
OS: Darwin
OS-release: 24.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2025.4.1.dev1+g070af11f
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: 1.7.2
pydap: 3.5.5
h5netcdf: 1.6.1
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.11.1
bottleneck: 1.4.2
dask: 2024.8.2
distributed: 2024.8.2
matplotlib: 3.10.1
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.1
numpy_groupies: 0.11.2
setuptools: 80.1.0
pip: 25.1
conda: 24.9.2
pytest: 8.3.5
mypy: 1.15.0
IPython: None
sphinx: None
