Description
What is your issue?
Opening a dataset with `use_cftime=True` changes the dtype of the time coordinate from datetime64 to object. This means that `chunks='auto'` will fail, since dask can't estimate the size in bytes of variables with object dtype. However, the error is a bit confusing, since it comes from the underlying dask call and doesn't tell the user what actually caused it.
import xarray as xr

# fn: path to a netCDF file with a CF-encoded time coordinate
# Generally succeeds: time is decoded to datetime64, which dask can size
xr.open_dataset(fn, chunks='auto')
# Definitely fails: time is decoded to cftime objects, i.e. dtype object
xr.open_dataset(fn, chunks='auto', use_cftime=True)
The error is:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[46], line 1
----> 1 xr.open_dataset(fn,use_cftime=True,chunks='auto')
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/backends/api.py:617, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
610 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
611 backend_ds = backend.open_dataset(
612 filename_or_obj,
613 drop_variables=drop_variables,
614 **decoders,
615 **kwargs,
616 )
--> 617 ds = _dataset_from_backend_dataset(
618 backend_ds,
619 filename_or_obj,
620 engine,
621 chunks,
622 cache,
623 overwrite_encoded_chunks,
624 inline_array,
625 chunked_array_type,
626 from_array_kwargs,
627 drop_variables=drop_variables,
628 **decoders,
629 **kwargs,
630 )
631 return ds
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/backends/api.py:393, in _dataset_from_backend_dataset(backend_ds, filename_or_obj, engine, chunks, cache, overwrite_encoded_chunks, inline_array, chunked_array_type, from_array_kwargs, **extra_tokens)
391 ds = backend_ds
392 else:
--> 393 ds = _chunk_ds(
394 backend_ds,
395 filename_or_obj,
396 engine,
397 chunks,
398 overwrite_encoded_chunks,
399 inline_array,
400 chunked_array_type,
401 from_array_kwargs,
402 **extra_tokens,
403 )
405 ds.set_close(backend_ds._close)
407 # Ensure source filename always stored in dataset object
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/backends/api.py:357, in _chunk_ds(backend_ds, filename_or_obj, engine, chunks, overwrite_encoded_chunks, inline_array, chunked_array_type, from_array_kwargs, **extra_tokens)
355 variables = {}
356 for name, var in backend_ds.variables.items():
--> 357 var_chunks = _get_chunk(var, chunks, chunkmanager)
358 variables[name] = _maybe_chunk(
359 name,
360 var,
(...)
367 from_array_kwargs=from_array_kwargs.copy(),
368 )
369 return backend_ds._replace(variables)
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/core/dataset.py:255, in _get_chunk(var, chunks, chunkmanager)
249 chunks = dict.fromkeys(dims, chunks)
250 chunk_shape = tuple(
251 chunks.get(dim, None) or preferred_chunk_sizes
252 for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
253 )
--> 255 chunk_shape = chunkmanager.normalize_chunks(
256 chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
257 )
259 # Warn where requested chunks break preferred chunks, provided that the variable
260 # contains data.
261 if var.size:
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py:58, in DaskManager.normalize_chunks(self, chunks, shape, limit, dtype, previous_chunks)
55 """Called by open_dataset"""
56 from dask.array.core import normalize_chunks
---> 58 return normalize_chunks(
59 chunks,
60 shape=shape,
61 limit=limit,
62 dtype=dtype,
63 previous_chunks=previous_chunks,
64 )
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/dask/array/core.py:3132, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
3129 chunks = tuple("auto" if isinstance(c, str) and c != "auto" else c for c in chunks)
3131 if any(c == "auto" for c in chunks):
-> 3132 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3134 if shape is not None:
3135 chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))
File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/dask/array/core.py:3237, in auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3234 raise TypeError("dtype must be known for auto-chunking")
3236 if dtype.hasobject:
-> 3237 raise NotImplementedError(
3238 "Can not use auto rechunking with object dtype. "
3239 "We are unable to estimate the size in bytes of object data"
3240 )
3242 for x in tuple(chunks) + tuple(shape):
3243 if (
3244 isinstance(x, Number)
3245 and np.isnan(x)
3246 or isinstance(x, tuple)
3247 and np.isnan(x).any()
3248 ):
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data
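For reference, the root cause can be reproduced at the dask level without xarray or cftime involved at all; any object-dtype array hits the same check (a minimal sketch, with arbitrary array contents):

```python
import numpy as np
import dask.array as da

# dask cannot estimate the size in bytes of object elements, so
# chunks='auto' has nothing to base its chunk-size calculation on.
arr = np.array(["a", "bc", "def"], dtype=object)
da.from_array(arr, chunks="auto")  # raises the NotImplementedError above
```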
Suggestion for now: raise an explicit exception when `chunks='auto'` and `use_cftime=True` are passed at the same time. I think this should be implementable in `backends.open_dataset()` (rather than in any specific engine's `open_dataset`), since it's likely common to any opening procedure, regardless of backend?
Something like:
if chunks == 'auto' and use_cftime:
    raise NotImplementedError(
        "`use_cftime=True` changes the dtype of time variables to object, "
        "but dask cannot yet auto-chunk variables of object dtype. Manually "
        "specifying chunks (instead of using `chunks='auto'`) will avoid this error."
    )
Suggestion for later: if it's possible to estimate chunk sizes when the time coordinate holds datetime objects, it should be possible to estimate them with cftime objects as well, since whether the coordinate is stored as one or the other is unlikely to make a difference in how the other variables should be chunked. Is there maybe a way to get `conventions.decode_cf_variable()` to also return the original datetime values, to be used for chunking in place of the converted cftime objects? Or could chunking simply apply to an object-dtype 1D coordinate the same chunks it chooses for that coordinate's dimension in the non-object-dtype variables of the same dataset? (I guess this could theoretically be unstable if the object coordinate for some reason takes up a lot more space than it would if it were numeric, etc.)
(I'm working on putting together a PR for at least the exception - please let me know if there's anything I should keep in mind, especially where the exception would be most appropriate to live, whether this is a bad idea, etc.)