soma-curation
is a light-weight Python package used at Phenomic to streamline the curation and management of single-cell RNA sequencing (scRNA-seq) atlases using TileDB-SOMA. It's still in its early stages, but the hope is to allow bioinformaticians and ML practitioners to organize their SOMA atlases and access their raw data a bit better. There are assumptions of raw storage organization baked into the package that mimic practices at Phenomic.
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the package
pip install soma-curation
- Define your schema:
from soma_curation.schema import load_schema
# loads the default schema
schema = load_schema()
- Organize your raw data:
# You can simulate this structure with the following commands
from soma_curation.utils import test_dummy_structure
test_dummy_structure()
It should give you a structure like this:
raw_data/
├── study_1/
│ ├── mtx/
│ │ ├── sample_1/
│ │ └── sample_2/
│ ├── cell_metadata/
│ │ ├── study_1.tsv.gz
│ └── sample_metadata/
│ │ ├── study_1.tsv.gz
└── study_2/
└── ...
- Create and use your collection:
from soma_curation.collection import MtxCollection
# For MTX files
collection = MtxCollection(
storage_directory="path/to/raw_data",
db_schema=schema
)
# Access AnnDatas from MTX files
adata = collection.get_anndata(study_name="study_1", sample_name="sample_1")
# For H5AD files
h5ad_collection = H5adCollection(
storage_directory="path/to/h5ad_files"
)
# List all H5AD files
h5ad_files = h5ad_collection.list_h5ad_files()
# Access AnnData directly from an H5AD file
adata = h5ad_collection.get_anndata(filename="file1.h5ad")
- Create a TileDB-SOMA Experiment
from soma_curation.atlas.crud import AtlasManager
# Create an atlas
am = AtlasManager(atlas_name="...", db_schema=db_schema, storage_directory="...")
am.create()
# Delete an atlas
# am.delete()
- Create a Dataset according to your schema and standardize it
from soma_curation.dataset.anndataset import AnnDataset
# Create a Phenomic Dataset
# Original anndata is stored under the `.artifact` attribute
dataset = AnnDataset(
atlas_manager=am,
collection=collection
)
dataset.standardize()
- Ingest your data into the TileDB-SOMA Experiment using traditional TileDB-SOMA syntax documented here
For detailed documentation, including API reference and usage examples, visit our documentation site.
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
This work was inspired by the TileDB-SOMA and CellxGene Census teams. We extend our gratitude to the TileDB team for their valuable feedback and support.
Below are setup instructions. If you're working in VSCode, we highly recommend installing the Python extension.
Clone this repository to your local machine:
git clone https://github.com/PhenomicAI/soma-curation.git
cd soma-curation/
You only need to create a virtual environment once.
Create and activate a virtual environment:
virtualenv venv
source venv/bin/activate
pip install ".[dev]"