concat() very slow when inserting NaN into Dask arrays #9496

Open
@pschlo

What is your issue?

Given the following situation:

  • a small Dataset with a few variables and a single dimension dim1, backed by Dask
  • a large Dataset with a single variable and a single dimension dim1, backed by Dask

When I concat() them along dim1, xarray pads the variables that exist in the first Dataset but not in the second with NaN. I would expect this padding to be lazy and the call to return almost instantly, but it turns out to be very slow on my machine.

Example code:

import xarray as xr
import dask.array as da
import numpy as np


ds1 = xr.Dataset(
    data_vars=dict(
        var1=('dim1', da.arange(10, dtype=np.float64, chunks=-1)),
        var2=('dim1', da.arange(10, dtype=np.float64, chunks=-1)),
        var3=('dim1', da.arange(10, dtype=np.float64, chunks=-1)),
        var4=('dim1', da.arange(10, dtype=np.float64, chunks=-1)),
        var5=('dim1', da.arange(10, dtype=np.float64, chunks=-1)),
        var6=('dim1', da.arange(10, dtype=np.float64, chunks=-1)),
        var7=('dim1', da.arange(10, dtype=np.float64, chunks=-1))
    ),
)

ds2 = xr.Dataset(
    data_vars=dict(
        var1=('dim1', da.arange(100_000, dtype=np.float64, chunks=20_000)),
    ),
)

print(ds1)
print('var1 chunks:', ds1['var1'].chunksizes)

print()
print(ds2)
print('var1 chunks:', ds2['var1'].chunksizes)

print()
concat = xr.concat([ds1, ds2], dim='dim1')
print(concat)
print('var1 chunks:', concat['var1'].chunksizes)
print('var2 chunks:', concat['var2'].chunksizes)

Output:

<xarray.Dataset> Size: 560B
Dimensions:  (dim1: 10)
Dimensions without coordinates: dim1
Data variables:
    var1     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    var2     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    var3     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    var4     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    var5     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    var6     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    var7     (dim1) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
var1 chunks: Frozen({'dim1': (10,)})

<xarray.Dataset> Size: 800kB
Dimensions:  (dim1: 100000)
Dimensions without coordinates: dim1
Data variables:
    var1     (dim1) float64 800kB dask.array<chunksize=(20000,), meta=np.ndarray>
var1 chunks: Frozen({'dim1': (20000, 20000, 20000, 20000, 20000)})

<xarray.Dataset> Size: 6MB
Dimensions:  (dim1: 100010)
Dimensions without coordinates: dim1
Data variables:
    var1     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
    var2     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
    var3     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
    var4     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
    var5     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
    var6     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
    var7     (dim1) float64 800kB dask.array<chunksize=(10,), meta=np.ndarray>
var1 chunks: Frozen({'dim1': (10, 20000, 20000, 20000, 20000, 20000)})
var2 chunks: Frozen({'dim1': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, ...

The last output line is truncated here; in the actual output it continues with many more 10s.
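For a sense of scale, the chunk counts can be estimated with a little arithmetic. This is only a rough sketch: it assumes that every chunk of the padded variables has size 10, which the truncated output suggests but does not prove, and the variable names are illustrative.

```python
# Rough chunk-count estimate for the concatenated Dataset.
# Assumption: var2..var7 are split into chunks of 10 along dim1 throughout,
# as the truncated `chunksizes` output suggests.
small_len = 10        # length of dim1 in ds1
large_len = 100_000   # length of dim1 in ds2
pad_chunk = 10        # chunk size seen for var2 in the output
padded_vars = 6       # var2 .. var7 are missing from ds2

total_len = small_len + large_len
chunks_per_padded_var = -(-total_len // pad_chunk)  # ceiling division
total_padded_chunks = chunks_per_padded_var * padded_vars

print(total_len, chunks_per_padded_var, total_padded_chunks)
# prints: 100010 10001 60006
```

Under that assumption, each padded variable carries about 10,001 chunks, compared with 6 chunks for var1.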

The script takes about 10-20 seconds to run on my machine. Is there a reason it is this slow? I would have expected it to finish almost instantly, with the NaN chunks added lazily and only materialized upon calling compute().
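One possible workaround, sketched below, is to add the missing variables to the large Dataset explicitly as lazy all-NaN Dask arrays with a chosen chunk size before calling concat(), so that xarray has nothing left to pad. This assumes the slowdown is related to the many size-10 chunks, which is not confirmed here. fill_missing_vars is a hypothetical helper written for this example, not part of xarray, and it only handles one-dimensional floating-point variables along the given dimension.

```python
import dask.array as da
import numpy as np
import xarray as xr


def fill_missing_vars(ds, template, dim, chunks):
    """Hypothetical helper: add variables that `template` has but `ds` lacks,
    as lazy all-NaN dask arrays with an explicit chunk size along `dim`.

    Assumes the variables are 1-D along `dim` and have a float dtype
    (NaN cannot be stored in integer arrays).
    """
    n = ds.sizes[dim]
    missing = {
        name: (dim, da.full((n,), np.nan, dtype=var.dtype, chunks=chunks))
        for name, var in template.data_vars.items()
        if name not in ds.data_vars
    }
    return ds.assign(missing)


# Same shapes and chunking as in the example above.
ds1 = xr.Dataset(
    {f"var{i}": ("dim1", da.arange(10, dtype=np.float64, chunks=-1)) for i in range(1, 8)}
)
ds2 = xr.Dataset(
    {"var1": ("dim1", da.arange(100_000, dtype=np.float64, chunks=20_000))}
)

# Pad ds2 up front with 20,000-element NaN chunks, then concatenate.
ds2_filled = fill_missing_vars(ds2, ds1, dim="dim1", chunks=20_000)
result = xr.concat([ds1, ds2_filled], dim="dim1")
print("var2 chunks:", result["var2"].chunksizes)
```

With this approach the padded variables should end up with the same chunk layout as var1 instead of thousands of size-10 chunks. Whether that removes the delay entirely would need to be measured.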

Here is my output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.133.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.3-development

xarray: 2024.9.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: 1.3.0
h5py: 3.7.0
zarr: 2.12.0
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.9.0
distributed: 2024.9.0
matplotlib: 3.9.2
cartopy: None
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: 0.9.11
numpy_groupies: 0.11.2
setuptools: 74.1.2
pip: 22.0.2
conda: None
pytest: None
mypy: None
IPython: 8.27.0
sphinx: None
