`sccompPy`(Beta) - A Python package for `sccomp`

Tests differences in cell type proportions and variability from single-cell data

Cellular omics such as single-cell genomics, proteomics, and microbiomics allow the characterization of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to unveiling markers of disease progression in conditions such as cancer and pathogen infections.

For cellular omic data, no method for differential variability analysis exists, and methods for differential composition analysis only take a few fundamental data properties into account. Here we introduce sccomp, a generalised method for differential composition and variability analyses capable of jointly modelling data count distribution, compositionality, group-specific variability, and proportion mean-variability association, while being robust to outliers.

sccomp is an extensive analysis framework that allows realistic data simulation and cross-study knowledge transfer. We demonstrate that mean-variability association is ubiquitous across technologies, highlighting the inadequacy of the very popular Dirichlet-multinomial modelling and providing essential principles for differential variability analysis.

Comparison with other methods

I: Data are modelled as counts.
II: Group proportions are modelled as compositional.
III: The proportion variability is modelled as cell-type specific.
IV: Information sharing across cell types, mean–variability association.
V: Outlier detection or robustness.
VI: Differential variability analysis.

Method	Year	Model	I	II	III	IV	V	VI
sccomp	2023	Sum-constrained Beta-binomial	●	●	●	●	●	●
scCODA	2021	Dirichlet-multinomial	●	●
quasi-binom.	2021	Quasi-binomial	●		●
rlm	2021	Robust-log-linear		●			●
propeller	2021	Logit-linear + limma		●	●	●
ANCOM-BC	2020	Log-linear		●	●
corncob	2020	Beta-binomial	●		●
scDC	2019	Log-linear		●	●
dmbvs	2017	Dirichlet-multinomial	●	●
MixMC	2016	Zero-inflated Log-linear		●	●
ALDEx2	2014	Dirichlet-multinomial	●	●

Cite

Mangiola, Stefano, Alexandra J. Roth-Schulze, Marie Trussart, Enrique Zozaya-Valdés, Mengyao Ma, Zijie Gao, Alan F. Rubin, Terence P. Speed, Heejung Shim, and Anthony T. Papenfuss. 2023. “Sccomp: Robust Differential Composition and Variability Analysis for Single-Cell Data.” Proceedings of the National Academy of Sciences of the United States of America 120 (33): e2203828120. https://doi.org/10.1073/pnas.2203828120

PNAS - sccomp: Robust differential composition and variability analysis for single-cell data

Talk

sccomp tests differences in cell type proportions from single-cell data. It is robust against outliers, it models continuous and discrete factors, and capable of random-effect/intercept modelling.

Characteristics

Complex linear models with continuous and categorical covariates
Multilevel modelling, with population fixed and random effects/intercept
Modelling data from counts
Testing differences in cell-type proportionality
Testing differences in cell-type specific variability
Cell-type information share for variability adaptive shrinkage
Testing differential variability
Probabilistic outlier identification
Cross-dataset learning (hyperpriors).

Development notes:

The capability of random-effect modelling for sccompPy package is under-development.

Installation

Github

pip install git+https://github.com/MangiolaLaboratory/sccompPy.git

Prerequisites: CmdStanPy Installation

sccompPy relies on CmdStanPy, which serves as an interface to the latest version of CmdStan, a powerful tool for Bayesian modelling.

To ensure compatibility, please install and configure CmdStanPy before using sccompPy. Follow the instructions on the CmdStanPy documentation to install and set up the required dependencies.

Here we provide a demo listing some necessary steps:

pip install --upgrade cmdstanpy

import cmdstanpy
cmdstanpy.install_cmdstan()

Import `sccompy` package

import sccompPy

Load embeded dataset

import pkg_resources
import pandas as pd

data_file_path = pkg_resources.resource_filename("sccompPy", "data/count_obj.csv")

count_obj = pd.read_csv(data_file_path)
count_obj

	sample	type	phenotype	count	cell_group	proportion
0	10x_6K	benign	b_cell_macrophage_precursor_or_follicular_LTB_...	42	BM	0.008350
1	10x_6K	benign	B_cell:immature	361	B1	0.071769
2	10x_6K	benign	B_cell:immature_IGLC3_IGLC2	57	B2	0.011332
3	10x_6K	benign	B_cell:Memory_ITM2C_IGHA1_MZB1_JCHAIN	40	B3	0.007952
4	10x_6K	benign	Dendritic_CD11_CD1_high_mito	75	Dm	0.014911
...	...	...	...	...	...	...
715	SRR7244582	benign	T_cell:CD8+_GZMK_DUSP2_LYAR_CCL5	197	CD8 2	0.060727
716	SRR7244582	benign	T_cell:CD8+_non_activated	320	CD8 3	0.098644
717	SRR7244582	benign	T_cell:CD8+_PPBP_SAT1	39	CD8 4	0.012022
718	SRR7244582	benign	T_cell:CD8+_S100B	88	CD8 5	0.027127
719	SRR7244582	benign	T_cell:CD8low_TIMP1_PPBP	107	CD8 6	0.032984

720 rows × 6 columns

`sccomp_estimate` function

estimate_res = sccompPy.sccomp_estimate(
    data = count_obj,
    formula_composition = '~ 0 + type', 
    sample = 'sample',
    cell_group = 'cell_group',
    count = 'count',
    verbose = False
)

17:05:31 - cmdstanpy - INFO - CmdStan start processing
chain 1 |�[33m          �[0m| 00:00 Status
�[A

chain 1 |�[33m▉         �[0m| 00:00 Iteration:    1 / 2000 [  0%]  (Warmup)
�[A

�[A�[A
�[A

chain 1 |�[33m█▎        �[0m| 00:00 Iteration:  100 / 2000 [  5%]  (Warmup)
�[A

chain 1 |�[33m█▊        �[0m| 00:01 Iteration:  200 / 2000 [ 10%]  (Warmup)
�[A

chain 1 |�[33m██▎       �[0m| 00:01 Iteration:  300 / 2000 [ 15%]  (Warmup)
�[A

chain 1 |�[33m██▋       �[0m| 00:01 Iteration:  400 / 2000 [ 20%]  (Warmup)
�[A

chain 1 |�[33m███▏      �[0m| 00:01 Iteration:  500 / 2000 [ 25%]  (Warmup)
�[A

chain 1 |�[33m███▋      �[0m| 00:02 Iteration:  600 / 2000 [ 30%]  (Warmup)
�[A

chain 1 |�[33m████      �[0m| 00:02 Iteration:  700 / 2000 [ 35%]  (Warmup)
�[A

chain 1 |�[33m████▌     �[0m| 00:02 Iteration:  800 / 2000 [ 40%]  (Warmup)
�[A

chain 1 |�[33m█████     �[0m| 00:02 Iteration:  900 / 2000 [ 45%]  (Warmup)
�[A

chain 1 |�[34m█████▉    �[0m| 00:02 Iteration: 1001 / 2000 [ 50%]  (Sampling)
�[A

chain 1 |�[34m██████▎   �[0m| 00:03 Iteration: 1100 / 2000 [ 55%]  (Sampling)
�[A

chain 1 |�[34m██████▊   �[0m| 00:03 Iteration: 1200 / 2000 [ 60%]  (Sampling)
�[A

chain 1 |�[34m███████▎  �[0m| 00:03 Iteration: 1300 / 2000 [ 65%]  (Sampling)
�[A

chain 1 |�[34m███████▋  �[0m| 00:03 Iteration: 1400 / 2000 [ 70%]  (Sampling)
�[A

chain 1 |�[34m████████▏ �[0m| 00:03 Iteration: 1500 / 2000 [ 75%]  (Sampling)
�[A

chain 1 |�[34m████████▋ �[0m| 00:04 Iteration: 1600 / 2000 [ 80%]  (Sampling)
�[A

chain 1 |�[34m█████████ �[0m| 00:04 Iteration: 1700 / 2000 [ 85%]  (Sampling)
�[A

chain 1 |�[34m█████████▌�[0m| 00:04 Iteration: 1800 / 2000 [ 90%]  (Sampling)
�[A

chain 1 |�[34m██████████�[0m| 00:04 Sampling completed                       
chain 2 |�[34m██████████�[0m| 00:04 Sampling completed                       
chain 3 |�[34m██████████�[0m| 00:04 Sampling completed                       
chain 4 |�[34m██████████�[0m| 00:04 Sampling completed                       

                                                                                                                                                                                                                                                                                                                                


17:05:36 - cmdstanpy - INFO - CmdStan done processing.
17:05:36 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[10] is 0, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[1] is inf, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[1] is inf, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Consider re-running with show_console=True if the above output is unclear!

estimate_res.keys()

dict_keys(['fit', 'model_input', 'truncation_df2', 'sample', 'cell_group', 'count', 'formula_composition', 'formula_variability', 'noise_model'])

`sccomp_test` function

by default return_all is set to be False, set as True to include all info in results

test_res = sccompPy.sccomp_test(estimate_res, contrasts=['type[cancer] - type[benign]'], return_all=True)

Warning: These elements require backquotes: ['type[cancer]', 'type[benign]'], sccompPy will auto quote them.
Warning: These elements require backquotes: ['type[cancer]', 'type[benign]'], sccompPy will auto quote them.


/home/chzhan1/Python/SAiGENCI/sccompPy/sccompPy/utilities.py:304: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/chzhan1/Python/SAiGENCI/sccompPy/sccompPy/utilities.py:321: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

`sccomp_test` returns a `dict` when `return_all=True` where the first element - result contains the result table

Otherwise sccomp_test will be result table itself

test_res['result']

	cell_group	parameter	c_lower	c_effect	c_upper	c_pH0	FDR	N_Eff	R_k_hat	v_lower	v_effect	v_upper	v_pH0	factor	design_matrix_col	count_data
0	B1	type[benign]	0.895135	1.157385	1.390387	0.00000	0.000000	5933.55	0.999504	NaN	NaN	NaN	NaN	type	type[benign]	sample type phenotype co...
1	B1	type[cancer]	0.138356	0.489348	0.808645	0.01450	0.001500	6200.74	0.999354	NaN	NaN	NaN	NaN	type	type[cancer]	sample type phenotype co...
2	B1	type[cancer] - type[benign]	-1.076798	-0.666984	-0.268136	0.00250	0.000950	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	sample type phenotype co...
3	B2	type[benign]	0.396715	0.724280	1.036206	0.00050	0.000071	5362.59	0.999425	NaN	NaN	NaN	NaN	type	type[benign]	sample type p...
4	B2	type[cancer]	-0.438626	0.027175	0.449079	0.64200	0.087713	5983.34	0.999770	NaN	NaN	NaN	NaN	type	type[cancer]	sample type p...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
103	TM2	type[cancer]	-1.221164	-0.896173	-0.601406	0.00000	0.000000	5764.90	0.999441	NaN	NaN	NaN	NaN	type	type[cancer]	sample type ...
104	TM2	type[cancer] - type[benign]	-0.182896	0.273352	0.721949	0.21475	0.055604	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	sample type ...
105	TM3	type[benign]	-1.770492	-0.824812	0.244776	0.09325	0.017061	2063.35	0.999920	NaN	NaN	NaN	NaN	type	type[benign]	sample type ...
106	TM3	type[cancer]	-3.890775	-2.658175	-1.426470	0.00000	0.000000	1831.74	1.000070	NaN	NaN	NaN	NaN	type	type[cancer]	sample type ...
107	TM3	type[cancer] - type[benign]	-3.090491	-1.832928	-0.768404	0.00075	0.000333	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	sample type ...

108 rows × 16 columns

Use function `plot_1D_intervals` to visualise results

return_all needs to set as True when run sccomp_test

sccompPy.plot_1D_intervals(test_res)

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
README_files		README_files
sccompPy.egg-info		sccompPy.egg-info
sccompPy		sccompPy
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`sccompPy`(Beta) - A Python package for `sccomp`

Tests differences in cell type proportions and variability from single-cell data

Comparison with other methods

Cite

Talk

Characteristics

Development notes:

Installation

Github

Prerequisites: CmdStanPy Installation

Import `sccompy` package

Load embeded dataset

`sccomp_estimate` function

`sccomp_test` function

`sccomp_test` returns a `dict` when `return_all=True` where the first element - result contains the result table

Use function `plot_1D_intervals` to visualise results

About

Releases

Contributors 2

Languages

License

MangiolaLaboratory/sccompPy

Folders and files

Latest commit

History

Repository files navigation

sccompPy(Beta) - A Python package for sccomp

Tests differences in cell type proportions and variability from single-cell data

Comparison with other methods

Cite

Talk

Characteristics

Development notes:

Installation

Github

Prerequisites: CmdStanPy Installation

Import sccompy package

Load embeded dataset

sccomp_estimate function

sccomp_test function

sccomp_test returns a dict when return_all=True where the first element - result contains the result table

Use function plot_1D_intervals to visualise results

About

Resources

License

Stars

Watchers

Forks

Releases

Contributors 2

Languages

`sccompPy`(Beta) - A Python package for `sccomp`

Import `sccompy` package

`sccomp_estimate` function

`sccomp_test` function

`sccomp_test` returns a `dict` when `return_all=True` where the first element - result contains the result table

Use function `plot_1D_intervals` to visualise results