Custom dataloader registry support #2932

Open
wants to merge 122 commits into
base: main
Commits
7088e4b
copying CZI custom dataloader into our repo
ori-kron-wis Jul 28, 2024
cc72b05
added some fixes to the custom dataloader stuff
ori-kron-wis Jul 30, 2024
46048e3
Some suggestions
canergen Jul 30, 2024
14f343d
Changes to datamodule pipeline
canergen Jul 31, 2024
17282cd
Fixed attr_dict
canergen Jul 31, 2024
a4143f5
added some fixes based on custom data loader test
ori-kron-wis Aug 1, 2024
69abc47
Changes to dataloader
canergen Aug 6, 2024
dc21a3d
copying CZI custom dataloader into our repo
ori-kron-wis Jul 28, 2024
a1098b3
added some fixes to the custom dataloader stuff
ori-kron-wis Jul 30, 2024
b07216b
Some suggestions
canergen Jul 30, 2024
a578af1
Changes to datamodule pipeline
canergen Jul 31, 2024
42434ec
Fixed attr_dict
canergen Jul 31, 2024
3d0c890
added some fixes based on custom data loader test
ori-kron-wis Aug 1, 2024
eff5b1e
Changes to dataloader
canergen Aug 6, 2024
cbdc26e
Merge remote-tracking branch 'origin/ori-2907-custom-dataloader-regis…
ori-kron-wis Aug 7, 2024
18d65a6
add changes to tests and some merging with main following custom data…
ori-kron-wis Aug 7, 2024
4fe3ee1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 7, 2024
1110966
just put the cutom dataloder2 test under remarks so hook tests will r…
ori-kron-wis Aug 7, 2024
7972bdc
fixes
ori-kron-wis Aug 7, 2024
2d86c43
additional external models fixes once there is a registry
ori-kron-wis Aug 7, 2024
3c44d86
fixed a few failed tests
ori-kron-wis Aug 11, 2024
c0889d8
fix archesmixin init and added new custom dataloader test and github …
ori-kron-wis Aug 11, 2024
8fe043c
fix again for from __future__ import annotations
ori-kron-wis Aug 11, 2024
d8cf0f6
fix for run custom dataloader in github action
ori-kron-wis Aug 11, 2024
c41e8b2
rollback
ori-kron-wis Aug 11, 2024
6ec5d4d
added label to the new githubaction for custom dataloader
ori-kron-wis Aug 11, 2024
6bce317
fix for github action for custom dataloaders
ori-kron-wis Aug 12, 2024
1f4ae9d
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
de1f30b
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
49fa01e
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
e33a935
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
48627d9
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
609094d
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
8cf3517
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
ba5a028
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
a7dc3fe
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
f3ff0f8
another fix to custom dataloder test and github action
ori-kron-wis Aug 12, 2024
083c76e
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Sep 9, 2024
70bba69
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Sep 15, 2024
8c75662
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Sep 16, 2024
b6eb2f1
Returned REGISTRY_KEYS for import, after was drop in recent merges
ori-kron-wis Sep 16, 2024
2979ea2
It is ok to drop it after scarches categorial covariates fix
ori-kron-wis Sep 16, 2024
67e9b34
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Sep 17, 2024
11fe33a
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Sep 17, 2024
4a648ff
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 17, 2024
e3831cb
moved to type checking blocks beucase of ruff updates
ori-kron-wis Sep 17, 2024
e1837bd
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Sep 26, 2024
bf4d3bf
Merge remote-tracking branch 'origin/main' into ori-2907-custom-datal…
ori-kron-wis Oct 7, 2024
2cc8ff9
updated for CZI custom dataloader test and backend
ori-kron-wis Oct 9, 2024
e62dc3a
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Oct 9, 2024
41fd877
added cellxgene-census folder as well for debug (will not be merged)
ori-kron-wis Oct 9, 2024
10ada9c
added cellxgene-census packge to run test
ori-kron-wis Oct 9, 2024
dd3649c
added torchdata packge to run test
ori-kron-wis Oct 9, 2024
c6acb5a
fixed the test workwflow
ori-kron-wis Oct 9, 2024
b35c6eb
adding the lamindb as well
ori-kron-wis Oct 10, 2024
1801604
fix the c.dataloders test
ori-kron-wis Oct 10, 2024
ed77a65
fix the c.dataloders test
ori-kron-wis Oct 10, 2024
fc831d5
fix the c.dataloders test
ori-kron-wis Oct 10, 2024
7400621
fix the c.dataloders test
ori-kron-wis Oct 10, 2024
47376ca
fix the c.dataloders test
ori-kron-wis Oct 10, 2024
f94f7fa
removed redundat functions in code base
ori-kron-wis Oct 13, 2024
962f043
Added scanvi support, including CZI datamodule fix for it
ori-kron-wis Oct 15, 2024
5c21d71
Merge remote-tracking branch 'origin/main' into ori-2907-custom-datal…
ori-kron-wis Oct 20, 2024
a8aeffe
updates from main
ori-kron-wis Dec 25, 2024
1283616
more updates from main
ori-kron-wis Dec 25, 2024
624ee72
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Dec 25, 2024
6d4f368
Merge remote-tracking branch 'origin/ori-2907-custom-dataloader-regis…
ori-kron-wis Dec 25, 2024
8ab01a4
updated related to tests
ori-kron-wis Dec 25, 2024
31e1d44
updated related to tests
ori-kron-wis Dec 25, 2024
93666fa
Running DataLoader MappedCollection
canergen Dec 30, 2024
1d1d6d3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 30, 2024
7695a8a
Fixed LaminDB dataloader
canergen Dec 31, 2024
e4d732a
Merge branch 'ori-2907-custom-dataloader-registry' of https://github.…
canergen Dec 31, 2024
a651442
LaminDB dataloader test.
canergen Dec 31, 2024
9767b8c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 31, 2024
719e740
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Dec 31, 2024
1a4c796
Merge remote-tracking branch 'origin/main' into ori-2907-custom-datal…
ori-kron-wis Jan 8, 2025
5666558
Changes for MappedCollection.
canergen Jan 8, 2025
c740dd2
Merge branch 'ori-2907-custom-dataloader-registry' of https://github.…
canergen Jan 8, 2025
61f2e27
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 8, 2025
874935b
Add other notebook for testing new dataloader
canergen Jan 9, 2025
f2c63bd
Merge branch 'ori-2907-custom-dataloader-registry' of https://github.…
canergen Jan 9, 2025
35d45c8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 9, 2025
38c670f
Merge remote-tracking branch 'origin/main' into ori-2907-custom-datal…
ori-kron-wis Jan 16, 2025
c93fc97
updates to test script
ori-kron-wis Jan 16, 2025
5045fc3
remove old test nb
ori-kron-wis Jan 16, 2025
55775f9
update test
ori-kron-wis Jan 16, 2025
7ccdf8d
update test
ori-kron-wis Jan 16, 2025
f88dc50
updated czi cdl
ori-kron-wis Jan 16, 2025
1f3ea11
updated czi cdl
ori-kron-wis Jan 16, 2025
d0ec46f
Merge remote-tracking branch 'origin/main' into ori-2907-custom-datal…
ori-kron-wis Jan 20, 2025
e304922
merge with main + updates
ori-kron-wis Feb 9, 2025
5ccd1ed
more updates
ori-kron-wis Feb 9, 2025
96a09d8
more updates
ori-kron-wis Feb 9, 2025
601d86f
more updates
ori-kron-wis Feb 9, 2025
2485bb6
pyproject update
ori-kron-wis Feb 10, 2025
538326b
merge with main
ori-kron-wis Mar 20, 2025
b7e3047
merge with main
ori-kron-wis Mar 20, 2025
7f5b17f
updates after merge with main
ori-kron-wis Mar 23, 2025
c4fe9a5
pinned zarr to work with py3.12
ori-kron-wis Mar 23, 2025
2c8bd85
updates
ori-kron-wis Mar 24, 2025
1e16399
updates
ori-kron-wis Mar 24, 2025
355cb7c
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Mar 24, 2025
2140911
added cesus dataloder
ori-kron-wis Mar 24, 2025
df86761
Merge branch 'main' into ori-2907-custom-dataloader-registry
ori-kron-wis Mar 24, 2025
cea01ad
update pyproject
ori-kron-wis Mar 24, 2025
f6cedd6
update pyproject
ori-kron-wis Mar 24, 2025
4ff6494
update pyproject
ori-kron-wis Mar 24, 2025
c3d0178
path for the setuptools bad version
ori-kron-wis Mar 24, 2025
e571361
path for the setuptools bad version
ori-kron-wis Mar 24, 2025
9c6fead
path for the setuptools bad version
ori-kron-wis Mar 24, 2025
91e99cd
path for the setuptools bad version - not related to us - revert it
ori-kron-wis Mar 25, 2025
e171fda
path for the setuptools bad version - not related to us - revert it
ori-kron-wis Mar 25, 2025
25d0530
updates
ori-kron-wis Mar 25, 2025
e455c2b
updates
ori-kron-wis Mar 25, 2025
5f51c69
Merge remote-tracking branch 'origin/main' into ori-2907-custom-datal…
ori-kron-wis Mar 25, 2025
59c9109
focus on scvi/scanvi only: added a general base model check for regis…
ori-kron-wis Mar 26, 2025
9ba0690
updates
ori-kron-wis Mar 26, 2025
780dcc6
updates
ori-kron-wis Mar 26, 2025
4e1dcaf
updates
ori-kron-wis Mar 26, 2025
0432d48
replaced the previous cellxgene_census experimental dl to the new til…
ori-kron-wis Mar 27, 2025
5b25d58
typo
ori-kron-wis Mar 27, 2025
72 changes: 72 additions & 0 deletions .github/workflows/test_linux_custom_dataloader.yml
@@ -0,0 +1,72 @@
name: test (custom dataloaders)

on:
  push:
    branches: [main, "[0-9]+.[0-9]+.x"]
  pull_request:
    branches: [main, "[0-9]+.[0-9]+.x"]
    types: [labeled, synchronize, opened]
  schedule:
    - cron: "0 10 * * *" # runs at 10:00 UTC (03:00 PST) every day
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    # if PR has label "custom_dataloader" or "all tests" or if scheduled or manually triggered
    if: >-
      (
        contains(github.event.pull_request.labels.*.name, 'custom_dataloader') ||
        contains(github.event.pull_request.labels.*.name, 'all tests') ||
        contains(github.event_name, 'schedule') ||
        contains(github.event_name, 'workflow_dispatch')
      )

    runs-on: ${{ matrix.os }}

    defaults:
      run:
        shell: bash -e {0} # -e to fail on error

    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest]
        python: ["3.12"]

    name: integration

    env:
      OS: ${{ matrix.os }}
      PYTHON: ${{ matrix.python }}

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: "pip"
          cache-dependency-path: "**/pyproject.toml"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip wheel uv
          python -m uv pip install --system "scvi-tools[tests] @ ."

      - name: Run specific custom dataloader pytest
        env:
          MPLBACKEND: agg
          PLATFORM: ${{ matrix.os }}
          DISPLAY: :42
          COLUMNS: 120
        run: |
          coverage run -m pytest -v --color=yes --custom-dataloader-tests
          coverage report

      - uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
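Note on the `--custom-dataloader-tests` flag: the workflow passes it to pytest, but the option's definition is not among the files shown on this page. A minimal sketch of how such a flag is commonly wired up in a `conftest.py` follows. The marker name `custom_dataloader`, the helper `skip_reason`, and the policy (the flag runs only the marked tests, and marked tests are skipped without it) are all assumptions for illustration, not taken from the PR.

```python
# conftest.py sketch. Marker name and skip policy are assumptions,
# not taken from the PR.
from typing import Optional

FLAG = "--custom-dataloader-tests"
MARKER = "custom_dataloader"  # hypothetical marker name


def skip_reason(is_custom: bool, run_custom: bool) -> Optional[str]:
    """Return why a test should be skipped, or None if it should run."""
    if is_custom and not run_custom:
        return f"needs {FLAG} to run"
    if run_custom and not is_custom:
        return f"skipped because {FLAG} was given"
    return None


def pytest_addoption(parser):
    # Register the command-line flag that the workflow passes to pytest.
    parser.addoption(
        FLAG,
        action="store_true",
        default=False,
        help="run only the custom dataloader tests",
    )


def pytest_collection_modifyitems(config, items):
    # Imported lazily so the helper above can be used without pytest installed.
    import pytest

    run_custom = config.getoption(FLAG)
    for item in items:
        reason = skip_reason(MARKER in item.keywords, run_custom)
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Under these assumptions, a plain `pytest` run skips tests decorated with `@pytest.mark.custom_dataloader`, while `pytest --custom-dataloader-tests` runs only those tests.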
3 changes: 0 additions & 3 deletions .github/workflows/test_linux_internet.yml
@@ -58,9 +58,6 @@ jobs:
         run: |
           python -m pip install --upgrade pip wheel uv
           python -m uv pip install --system "scvi-tools[tests] @ ."
-          python -m pip install tiledb
-          python -m pip install tiledbsoma
-          python -m pip install cellxgene-census
       - name: Run pytest
         env:
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -15,6 +15,7 @@ to [Semantic Versioning]. Full commit history is available in the
 - Add supervised module class {class}`scvi.module.base.SupervisedModuleClass`. {pr}`3237`.
 - Add get normalized function model property for any generative model {pr}`3238` and changed
     get_accessibility_estimates to get_normalized_accessibility, where needed.
+- Add support for using Lamin custom dataloaders with {class}`scvi.model.SCVI`, {pr}`2932`.
 - Add Early stopping KL warmup steps. {pr}`3262`.

 #### Fixed
2 changes: 1 addition & 1 deletion docs/tutorials/notebooks
54 changes: 41 additions & 13 deletions docs/user_guide/use_case/custom_dataloaders.md
@@ -21,8 +21,12 @@ Pros:
- Optimized for ML Workflows: If your dataset is structured as tables (rows and columns), LaminDB’s format aligns well with SCVI's expectations, potentially reducing the need for complex transformations.

```python
os.system("lamin init --storage ./test-registries")
import lamindb as ln
from scvi.dataloaders import MappedCollectionDataModule
import scvi
import os

os.system("lamin init --storage ./test-registries")

ln.setup.init(name="lamindb_instance_name", storage=save_path)

@@ -52,9 +56,10 @@ Scalability: Handles large datasets that exceed your system's memory capacity, m
```python
import cellxgene_census
import tiledbsoma as soma
from cellxgene_census.experimental.ml import experiment_dataloader
from cellxgene_census.experimental.ml.datamodule import CensusSCVIDataModule
import tiledbsoma_ml
from scvi.dataloaders import SCVIDataModule
import numpy as np
import scvi

# this test checks the local custom dataloader made by CZI and runs several tests with it
census = cellxgene_census.open_soma(census_version="stable")
@@ -66,25 +71,48 @@ obs_value_filter = (

hv_idx = np.arange(100)  # just to make it smaller and faster for debugging

# this is CZI part to be taken once all is ready
batch_keys = ["dataset_id", "assay", "suspension_type", "donor_id"]
datamodule = CensusSCVIDataModule(
census["census_data"][experiment_name],
# For HVG, we can use the highly_variable_genes function provided in cellxgene_census,
# which can compute HVGs in constant memory:
hvg_query = census["census_data"][experiment_name].axis_query(
measurement_name="RNA",
X_name="raw",
obs_query=soma.AxisQuery(value_filter=obs_value_filter),
var_query=soma.AxisQuery(coords=(list(hv_idx),)),
)

# this is CZI part to be taken once all is ready
batch_keys = ["dataset_id", "assay", "suspension_type", "donor_id"]
label_keys = ["tissue_general"]
datamodule = SCVIDataModule(
hvg_query,
layer_name="raw",
batch_size=1024,
shuffle=True,
batch_keys=batch_keys,
seed=42,
batch_column_names=batch_keys,
label_keys=label_keys,
train_size=0.9,
unlabeled_category="label_0",
dataloader_kwargs={"num_workers": 0, "persistent_workers": False},
)

# Setup the datamodule
scvi.model._scvi.SCVI.setup_datamodule(datamodule)

# We can now create the scVI model object and train it:
model = scvi.model.SCVI(
adata=None,
registry=datamodule.registry,
gene_likelihood="nb",
encode_covariates=False,
)

# basically, we should mimic everything below for any census model in scvi
adata_orig = synthetic_iid()
scvi.model.SCVI.setup_anndata(adata_orig, batch_key="batch")
model = scvi.model.SCVI(adata_orig)
model.train(
datamodule=datamodule,
max_epochs=1,
batch_size=1024,
train_size=0.9,
early_stopping=False,
)
...
```
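
Note on `shuffle=True`, `seed=42`, and `train_size=0.9` in the datamodule call above: how `SCVIDataModule` partitions cells internally is not shown in this diff. The sketch below only illustrates what a seeded 90/10 split means. It uses the standard library and a hypothetical helper, `split_indices`, that is not part of scvi-tools; the rounding rule is also an assumption.

```python
import random


def split_indices(n_obs, train_size=0.9, seed=42):
    """Shuffle range(n_obs) with a fixed seed and cut it into train/validation."""
    indices = list(range(n_obs))
    # A dedicated Random instance keeps the split reproducible and
    # leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_train = round(n_obs * train_size)  # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))
```

Calling the helper twice with the same seed gives the same partition, which is the property a fixed `seed` is meant to provide.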
Key Differences between them in terms of Custom Dataloaders:
13 changes: 7 additions & 6 deletions pyproject.toml
@@ -82,11 +82,9 @@ docs = [
 docsbuild = ["scvi-tools[docs,optional]"]

 # scvi.autotune
-autotune = ["hyperopt>=0.2", "ray[tune]","scib-metrics"]
+autotune = ["hyperopt>=0.2", "ray[tune]", "scib-metrics"]
 # scvi.hub.HubModel.pull_from_s3
 aws = ["boto3"]
-# scvi.data.cellxgene
-census = ["cellxgene-census", "numpy<2.0"]
 # scvi.hub dependencies
 hub = ["huggingface_hub", "igraph", "leidenalg", "dvc[s3]"]
 # scvi.data.add_dna_sequence
@@ -96,13 +94,15 @@ scanpy = ["scanpy>=1.10", "scikit-misc"]
 # for convenient file sharing
 file_sharing = ["pooch"]
 # for parallelization engine
-parallel = ["dask[array]>=2023.5.1,<2024.8.0"]
+parallel = ["dask[array]>=2023.5.1,<2024.8.0", "zarr<3.0.0"]
 # for supervised models interpretability
-interpretability = ["captum","shap"]
+interpretability = ["captum", "shap"]
+# for custom dataloaders
+dataloaders = ["lamindb>=1.3.0", "biomart", "bionty", "cellxgene_lamin", "cellxgene-census", "numpy<2.0", "tiledbsoma", "tiledb", "tiledbsoma_ml", "torchdata==0.9.0"]


 optional = [
-    "scvi-tools[autotune,aws,hub,file_sharing,regseq,scanpy,parallel,interpretability]"
+    "scvi-tools[autotune,aws,hub,file_sharing,regseq,scanpy,parallel,interpretability,dataloaders]"
 ]
 tutorials = [
     "cell2location",
@@ -137,6 +137,7 @@ markers = [
     "private: mark tests that uses private keys, like HF",
     "multigpu: mark tests that are used to check multi GPU performance",
     "autotune: mark tests that are used to check ray autotune capabilities",
+    "custom dataloaders: mark tests that are used to check different custom data loaders",
 ]

 [tool.ruff]
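
Note on the new `dataloaders` extra: because these packages are optional, code that relies on them may want to check for them before use. A minimal sketch using only the standard library follows. The helper `missing_modules` is hypothetical, and the import names listed are assumptions about how some of the distributions in the extra are imported.

```python
from importlib.util import find_spec

# Assumed top-level import names for some packages in the `dataloaders` extra.
DATALOADER_MODULES = [
    "lamindb",
    "cellxgene_census",
    "tiledbsoma",
    "tiledbsoma_ml",
    "torchdata",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(DATALOADER_MODULES)
if missing:
    print("Missing optional modules:", ", ".join(missing))
    print('Try: pip install "scvi-tools[dataloaders]"')
```

`find_spec` only locates a module without importing it, so the check stays cheap even when the packages are heavy.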
3 changes: 3 additions & 0 deletions src/scvi/dataloaders/__init__.py
@@ -3,6 +3,7 @@

 from ._ann_dataloader import AnnDataLoader
 from ._concat_dataloader import ConcatDataLoader
+from ._custom_dataloders import MappedCollectionDataModule, SCVIDataModule
 from ._data_splitting import (
     DataSplitter,
     DeviceBackedDataSplitter,
@@ -20,4 +21,6 @@
     "DataSplitter",
     "SemiSupervisedDataSplitter",
     "BatchDistributedSampler",
+    "MappedCollectionDataModule",
+    "SCVIDataModule",
 ]