Caution
This package does not have a stable API. However, we do not expect the on-disk format to change in an incompatible manner.
A data loader and I/O utilities for minibatching on-disk AnnData, co-developed by Lamin and scverse.
Please refer to the documentation, in particular the API documentation.
You need to have Python 3.12 or newer installed on your system. If you don't have Python installed, we recommend installing uv.
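For example, uv can provision a suitable interpreter (one possible route; see the uv documentation for details):

uv python install 3.12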
To install the latest release of annbatch from PyPI:
pip install annbatch

We provide extras in the pyproject.toml for torch, cupy-cuda12, cupy-cuda13, and zarrs-python.
cupy provides accelerated handling of the data via preload_to_gpu once it has been read off disk; it does not need to be used in conjunction with torch.
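As a sketch only (it is an assumption here that preload_to_gpu is exposed as a constructor flag on the ZarrSparseDataset used in the data loading example below; check the API documentation for the exact signature):

ds = ZarrSparseDataset(
    batch_size=4096,
    chunk_size=32,
    preload_nchunks=256,
    preload_to_gpu=True,  # assumption: move preloaded chunks to the GPU via cupy
)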
Important
zarrs-python provides the performance boost needed for the sharded data produced by our preprocessing functions to load efficiently from a local filesystem.
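For example, to install annbatch together with a compatible version of zarrs-python:

pip install annbatch[zarrs]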
Basic preprocessing:
from annbatch import create_anndata_collection
import zarr
from pathlib import Path
# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it via our `[zarrs]` extra, i.e., `pip install annbatch[zarrs]`, to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)
create_anndata_collection(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad",
    ],
    output_path="path/to/output/collection",  # a directory containing `dataset_{i}.zarr` stores
    shuffle=True,  # shuffling is needed if you want to use chunked access
)

Data loading:
from pathlib import Path
from annbatch import ZarrSparseDataset
import anndata as ad
import zarr
# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it via our `[zarrs]` extra, i.e., `pip install annbatch[zarrs]`, to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)
ds = ZarrSparseDataset(
    batch_size=4096,  # rows per yielded minibatch
    chunk_size=32,  # contiguous rows fetched per on-disk read
    preload_nchunks=256,  # chunks fetched (and shuffled) per preload
).add_anndatas(
    [
        ad.AnnData(
            # note that you can open an AnnData file using any type of zarr store
            X=ad.io.sparse_dataset(zarr.open(p)["X"]),
            obs=ad.io.read_elem(zarr.open(p)["obs"]),
        )
        for p in Path("path/to/output/collection").glob("*.zarr")
    ],
    obs_keys="label_column",
)
# Iterate over the dataset (a drop-in replacement for torch.utils.data.DataLoader)
for batch in ds:
    ...

For usage of our loader inside torch, please see this note for more info. At a minimum, be aware that deadlocks will occur on Linux unless you pass multiprocessing_context="spawn" to the DataLoader.
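A minimal sketch of that wiring (assuming the dataset already yields full batches, so torch's automatic batching is disabled with batch_size=None):

from torch.utils.data import DataLoader

loader = DataLoader(
    ds,
    batch_size=None,  # assumption: ds already emits batches of `batch_size` rows
    num_workers=2,
    multiprocessing_context="spawn",  # avoids the Linux deadlock noted above
)
for batch in loader:
    ...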
For a deeper dive into this example, please see the in-depth section of our docs.
See the changelog.
For questions and help requests, you can reach out on the scverse Discourse. If you find a bug, please use the issue tracker.