Cellular omics such as single-cell genomics, proteomics, and microbiomics allow the characterization of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to unveiling markers of disease progression in conditions such as cancer and pathogen infections.
For cellular omic data, no method for differential variability analysis exists, and methods for differential composition analysis only take a few fundamental data properties into account. Here we introduce sccomp, a generalised method for differential composition and variability analyses capable of jointly modelling data count distribution, compositionality, group-specific variability, and proportion mean-variability association, while being robust to outliers.
sccomp is an extensive analysis framework that allows realistic data simulation and cross-study knowledge transfer. We demonstrate that mean-variability association is ubiquitous across technologies, highlighting the inadequacy of the very popular Dirichlet-multinomial modelling and providing essential principles for differential variability analysis.
- I: Data are modelled as counts.
- II: Group proportions are modelled as compositional.
- III: The proportion variability is modelled as cell-type specific.
- IV: Information sharing across cell types, mean–variability association.
- V: Outlier detection or robustness.
- VI: Differential variability analysis.
Method | Year | Model | I | II | III | IV | V | VI |
---|---|---|---|---|---|---|---|---|
sccomp | 2023 | Sum-constrained Beta-binomial | ● | ● | ● | ● | ● | ● |
scCODA | 2021 | Dirichlet-multinomial | ● | ● | ||||
quasi-binom. | 2021 | Quasi-binomial | ● | ● | ||||
rlm | 2021 | Robust-log-linear | ● | ● | ||||
propeller | 2021 | Logit-linear + limma | ● | ● | ● | |||
ANCOM-BC | 2020 | Log-linear | ● | ● | ||||
corncob | 2020 | Beta-binomial | ● | ● | ||||
scDC | 2019 | Log-linear | ● | ● | ||||
dmbvs | 2017 | Dirichlet-multinomial | ● | ● | ||||
MixMC | 2016 | Zero-inflated Log-linear | ● | ● | ||||
ALDEx2 | 2014 | Dirichlet-multinomial | ● | ● |
Mangiola, Stefano, Alexandra J. Roth-Schulze, Marie Trussart, Enrique Zozaya-Valdés, Mengyao Ma, Zijie Gao, Alan F. Rubin, Terence P. Speed, Heejung Shim, and Anthony T. Papenfuss. 2023. “Sccomp: Robust Differential Composition and Variability Analysis for Single-Cell Data.” Proceedings of the National Academy of Sciences of the United States of America 120 (33): e2203828120. https://doi.org/10.1073/pnas.2203828120
PNAS - sccomp: Robust differential composition and variability analysis for single-cell data
sccomp
tests differences in cell type proportions from single-cell
data. It is robust against outliers, it models continuous and discrete
factors, and capable of random-effect/intercept modelling.
- Complex linear models with continuous and categorical covariates
- Multilevel modelling, with population fixed and random effects/intercept
- Modelling data from counts
- Testing differences in cell-type proportionality
- Testing differences in cell-type specific variability
- Cell-type information share for variability adaptive shrinkage
- Testing differential variability
- Probabilistic outlier identification
- Cross-dataset learning (hyperpriors).
The capability of random-effect modelling for sccompPy
package is under-development.
pip install git+https://github.com/MangiolaLaboratory/sccompPy.git
sccompPy
relies on CmdStanPy, which serves as an interface to the latest version of CmdStan, a powerful tool for Bayesian modelling.
To ensure compatibility, please install and configure CmdStanPy before using sccompPy
. Follow the instructions on the CmdStanPy documentation to install and set up the required dependencies.
Here we provide a demo listing some necessary steps:
pip install --upgrade cmdstanpy
import cmdstanpy
cmdstanpy.install_cmdstan()
import sccompPy
import pkg_resources
import pandas as pd
data_file_path = pkg_resources.resource_filename("sccompPy", "data/count_obj.csv")
count_obj = pd.read_csv(data_file_path)
count_obj
sample | type | phenotype | count | cell_group | proportion | |
---|---|---|---|---|---|---|
0 | 10x_6K | benign | b_cell_macrophage_precursor_or_follicular_LTB_... | 42 | BM | 0.008350 |
1 | 10x_6K | benign | B_cell:immature | 361 | B1 | 0.071769 |
2 | 10x_6K | benign | B_cell:immature_IGLC3_IGLC2 | 57 | B2 | 0.011332 |
3 | 10x_6K | benign | B_cell:Memory_ITM2C_IGHA1_MZB1_JCHAIN | 40 | B3 | 0.007952 |
4 | 10x_6K | benign | Dendritic_CD11_CD1_high_mito | 75 | Dm | 0.014911 |
... | ... | ... | ... | ... | ... | ... |
715 | SRR7244582 | benign | T_cell:CD8+_GZMK_DUSP2_LYAR_CCL5 | 197 | CD8 2 | 0.060727 |
716 | SRR7244582 | benign | T_cell:CD8+_non_activated | 320 | CD8 3 | 0.098644 |
717 | SRR7244582 | benign | T_cell:CD8+_PPBP_SAT1 | 39 | CD8 4 | 0.012022 |
718 | SRR7244582 | benign | T_cell:CD8+_S100B | 88 | CD8 5 | 0.027127 |
719 | SRR7244582 | benign | T_cell:CD8low_TIMP1_PPBP | 107 | CD8 6 | 0.032984 |
720 rows × 6 columns
estimate_res = sccompPy.sccomp_estimate(
data = count_obj,
formula_composition = '~ 0 + type',
sample = 'sample',
cell_group = 'cell_group',
count = 'count',
verbose = False
)
17:05:31 - cmdstanpy - INFO - CmdStan start processing
chain 1 |�[33m �[0m| 00:00 Status
�[A
chain 1 |�[33m▉ �[0m| 00:00 Iteration: 1 / 2000 [ 0%] (Warmup)
�[A
�[A�[A
�[A
chain 1 |�[33m█▎ �[0m| 00:00 Iteration: 100 / 2000 [ 5%] (Warmup)
�[A
chain 1 |�[33m█▊ �[0m| 00:01 Iteration: 200 / 2000 [ 10%] (Warmup)
�[A
chain 1 |�[33m██▎ �[0m| 00:01 Iteration: 300 / 2000 [ 15%] (Warmup)
�[A
chain 1 |�[33m██▋ �[0m| 00:01 Iteration: 400 / 2000 [ 20%] (Warmup)
�[A
chain 1 |�[33m███▏ �[0m| 00:01 Iteration: 500 / 2000 [ 25%] (Warmup)
�[A
chain 1 |�[33m███▋ �[0m| 00:02 Iteration: 600 / 2000 [ 30%] (Warmup)
�[A
chain 1 |�[33m████ �[0m| 00:02 Iteration: 700 / 2000 [ 35%] (Warmup)
�[A
chain 1 |�[33m████▌ �[0m| 00:02 Iteration: 800 / 2000 [ 40%] (Warmup)
�[A
chain 1 |�[33m█████ �[0m| 00:02 Iteration: 900 / 2000 [ 45%] (Warmup)
�[A
chain 1 |�[34m█████▉ �[0m| 00:02 Iteration: 1001 / 2000 [ 50%] (Sampling)
�[A
chain 1 |�[34m██████▎ �[0m| 00:03 Iteration: 1100 / 2000 [ 55%] (Sampling)
�[A
chain 1 |�[34m██████▊ �[0m| 00:03 Iteration: 1200 / 2000 [ 60%] (Sampling)
�[A
chain 1 |�[34m███████▎ �[0m| 00:03 Iteration: 1300 / 2000 [ 65%] (Sampling)
�[A
chain 1 |�[34m███████▋ �[0m| 00:03 Iteration: 1400 / 2000 [ 70%] (Sampling)
�[A
chain 1 |�[34m████████▏ �[0m| 00:03 Iteration: 1500 / 2000 [ 75%] (Sampling)
�[A
chain 1 |�[34m████████▋ �[0m| 00:04 Iteration: 1600 / 2000 [ 80%] (Sampling)
�[A
chain 1 |�[34m█████████ �[0m| 00:04 Iteration: 1700 / 2000 [ 85%] (Sampling)
�[A
chain 1 |�[34m█████████▌�[0m| 00:04 Iteration: 1800 / 2000 [ 90%] (Sampling)
�[A
chain 1 |�[34m██████████�[0m| 00:04 Sampling completed
chain 2 |�[34m██████████�[0m| 00:04 Sampling completed
chain 3 |�[34m██████████�[0m| 00:04 Sampling completed
chain 4 |�[34m██████████�[0m| 00:04 Sampling completed
17:05:36 - cmdstanpy - INFO - CmdStan done processing.
17:05:36 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[10] is 0, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[1] is inf, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[1] is inf, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Consider re-running with show_console=True if the above output is unclear!
estimate_res.keys()
dict_keys(['fit', 'model_input', 'truncation_df2', 'sample', 'cell_group', 'count', 'formula_composition', 'formula_variability', 'noise_model'])
by default return_all
is set to be False
, set as True
to include all info in results
test_res = sccompPy.sccomp_test(estimate_res, contrasts=['type[cancer] - type[benign]'], return_all=True)
Warning: These elements require backquotes: ['type[cancer]', 'type[benign]'], sccompPy will auto quote them.
Warning: These elements require backquotes: ['type[cancer]', 'type[benign]'], sccompPy will auto quote them.
/home/chzhan1/Python/SAiGENCI/sccompPy/sccompPy/utilities.py:304: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/chzhan1/Python/SAiGENCI/sccompPy/sccompPy/utilities.py:321: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sccomp_test
returns a dict
when return_all=True
where the first element - result contains the result table
Otherwise sccomp_test
will be result table itself
test_res['result']
cell_group | parameter | c_lower | c_effect | c_upper | c_pH0 | FDR | N_Eff | R_k_hat | v_lower | v_effect | v_upper | v_pH0 | factor | design_matrix_col | count_data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | B1 | type[benign] | 0.895135 | 1.157385 | 1.390387 | 0.00000 | 0.000000 | 5933.55 | 0.999504 | NaN | NaN | NaN | NaN | type | type[benign] | sample type phenotype co... |
1 | B1 | type[cancer] | 0.138356 | 0.489348 | 0.808645 | 0.01450 | 0.001500 | 6200.74 | 0.999354 | NaN | NaN | NaN | NaN | type | type[cancer] | sample type phenotype co... |
2 | B1 | type[cancer] - type[benign] | -1.076798 | -0.666984 | -0.268136 | 0.00250 | 0.000950 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | sample type phenotype co... |
3 | B2 | type[benign] | 0.396715 | 0.724280 | 1.036206 | 0.00050 | 0.000071 | 5362.59 | 0.999425 | NaN | NaN | NaN | NaN | type | type[benign] | sample type p... |
4 | B2 | type[cancer] | -0.438626 | 0.027175 | 0.449079 | 0.64200 | 0.087713 | 5983.34 | 0.999770 | NaN | NaN | NaN | NaN | type | type[cancer] | sample type p... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103 | TM2 | type[cancer] | -1.221164 | -0.896173 | -0.601406 | 0.00000 | 0.000000 | 5764.90 | 0.999441 | NaN | NaN | NaN | NaN | type | type[cancer] | sample type ... |
104 | TM2 | type[cancer] - type[benign] | -0.182896 | 0.273352 | 0.721949 | 0.21475 | 0.055604 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | sample type ... |
105 | TM3 | type[benign] | -1.770492 | -0.824812 | 0.244776 | 0.09325 | 0.017061 | 2063.35 | 0.999920 | NaN | NaN | NaN | NaN | type | type[benign] | sample type ... |
106 | TM3 | type[cancer] | -3.890775 | -2.658175 | -1.426470 | 0.00000 | 0.000000 | 1831.74 | 1.000070 | NaN | NaN | NaN | NaN | type | type[cancer] | sample type ... |
107 | TM3 | type[cancer] - type[benign] | -3.090491 | -1.832928 | -0.768404 | 0.00075 | 0.000333 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | sample type ... |
108 rows × 16 columns
return_all
needs to set as True
when run sccomp_test
sccompPy.plot_1D_intervals(test_res)