Skip to content

Mixed-effect model to test differences in cell type proportions from single-cell data, in Python

License

Notifications You must be signed in to change notification settings

MangiolaLaboratory/sccompPy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sccompPy(Beta) - A Python package for sccomp

Tests differences in cell type proportions and variability from single-cell data

Cellular omics such as single-cell genomics, proteomics, and microbiomics allow the characterization of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to unveiling markers of disease progression in conditions such as cancer and pathogen infections.

For cellular omic data, no method for differential variability analysis exists, and methods for differential composition analysis only take a few fundamental data properties into account. Here we introduce sccomp, a generalised method for differential composition and variability analyses capable of jointly modelling data count distribution, compositionality, group-specific variability, and proportion mean-variability association, while being robust to outliers.

sccomp is an extensive analysis framework that allows realistic data simulation and cross-study knowledge transfer. We demonstrate that mean-variability association is ubiquitous across technologies, highlighting the inadequacy of the very popular Dirichlet-multinomial modelling and providing essential principles for differential variability analysis.

Comparison with other methods

  • I: Data are modelled as counts.
  • II: Group proportions are modelled as compositional.
  • III: The proportion variability is modelled as cell-type specific.
  • IV: Information sharing across cell types, mean–variability association.
  • V: Outlier detection or robustness.
  • VI: Differential variability analysis.
Method Year Model I II III IV V VI
sccomp 2023 Sum-constrained Beta-binomial
scCODA 2021 Dirichlet-multinomial
quasi-binom. 2021 Quasi-binomial
rlm 2021 Robust-log-linear
propeller 2021 Logit-linear + limma
ANCOM-BC 2020 Log-linear
corncob 2020 Beta-binomial
scDC 2019 Log-linear
dmbvs 2017 Dirichlet-multinomial
MixMC 2016 Zero-inflated Log-linear
ALDEx2 2014 Dirichlet-multinomial

Cite

Mangiola, Stefano, Alexandra J. Roth-Schulze, Marie Trussart, Enrique Zozaya-Valdés, Mengyao Ma, Zijie Gao, Alan F. Rubin, Terence P. Speed, Heejung Shim, and Anthony T. Papenfuss. 2023. “Sccomp: Robust Differential Composition and Variability Analysis for Single-Cell Data.” Proceedings of the National Academy of Sciences of the United States of America 120 (33): e2203828120. https://doi.org/10.1073/pnas.2203828120

PNAS - sccomp: Robust differential composition and variability analysis for single-cell data

Talk

Watch the video

sccomp tests differences in cell type proportions from single-cell data. It is robust against outliers, it models continuous and discrete factors, and capable of random-effect/intercept modelling.

Characteristics

  • Complex linear models with continuous and categorical covariates
  • Multilevel modelling, with population fixed and random effects/intercept
  • Modelling data from counts
  • Testing differences in cell-type proportionality
  • Testing differences in cell-type specific variability
  • Cell-type information share for variability adaptive shrinkage
  • Testing differential variability
  • Probabilistic outlier identification
  • Cross-dataset learning (hyperpriors).

Development notes:

The capability of random-effect modelling for sccompPy package is under-development.

Installation

Github

pip install git+https://github.com/MangiolaLaboratory/sccompPy.git

Prerequisites: CmdStanPy Installation

sccompPy relies on CmdStanPy, which serves as an interface to the latest version of CmdStan, a powerful tool for Bayesian modelling.

To ensure compatibility, please install and configure CmdStanPy before using sccompPy. Follow the instructions on the CmdStanPy documentation to install and set up the required dependencies.

Here we provide a demo listing some necessary steps:

pip install --upgrade cmdstanpy
import cmdstanpy
cmdstanpy.install_cmdstan()

Import sccompy package

import sccompPy

Load embeded dataset

import pkg_resources
import pandas as pd

data_file_path = pkg_resources.resource_filename("sccompPy", "data/count_obj.csv")

count_obj = pd.read_csv(data_file_path)
count_obj
sample type phenotype count cell_group proportion
0 10x_6K benign b_cell_macrophage_precursor_or_follicular_LTB_... 42 BM 0.008350
1 10x_6K benign B_cell:immature 361 B1 0.071769
2 10x_6K benign B_cell:immature_IGLC3_IGLC2 57 B2 0.011332
3 10x_6K benign B_cell:Memory_ITM2C_IGHA1_MZB1_JCHAIN 40 B3 0.007952
4 10x_6K benign Dendritic_CD11_CD1_high_mito 75 Dm 0.014911
... ... ... ... ... ... ...
715 SRR7244582 benign T_cell:CD8+_GZMK_DUSP2_LYAR_CCL5 197 CD8 2 0.060727
716 SRR7244582 benign T_cell:CD8+_non_activated 320 CD8 3 0.098644
717 SRR7244582 benign T_cell:CD8+_PPBP_SAT1 39 CD8 4 0.012022
718 SRR7244582 benign T_cell:CD8+_S100B 88 CD8 5 0.027127
719 SRR7244582 benign T_cell:CD8low_TIMP1_PPBP 107 CD8 6 0.032984

720 rows × 6 columns

sccomp_estimate function

estimate_res = sccompPy.sccomp_estimate(
    data = count_obj,
    formula_composition = '~ 0 + type', 
    sample = 'sample',
    cell_group = 'cell_group',
    count = 'count',
    verbose = False
)
17:05:31 - cmdstanpy - INFO - CmdStan start processing
chain 1 |�[33m          �[0m| 00:00 Status
�[A

chain 1 |�[33m▉         �[0m| 00:00 Iteration:    1 / 2000 [  0%]  (Warmup)
�[A

�[A�[A
�[A

chain 1 |�[33m█▎        �[0m| 00:00 Iteration:  100 / 2000 [  5%]  (Warmup)
�[A

chain 1 |�[33m█▊        �[0m| 00:01 Iteration:  200 / 2000 [ 10%]  (Warmup)
�[A

chain 1 |�[33m██▎       �[0m| 00:01 Iteration:  300 / 2000 [ 15%]  (Warmup)
�[A

chain 1 |�[33m██▋       �[0m| 00:01 Iteration:  400 / 2000 [ 20%]  (Warmup)
�[A

chain 1 |�[33m███▏      �[0m| 00:01 Iteration:  500 / 2000 [ 25%]  (Warmup)
�[A

chain 1 |�[33m███▋      �[0m| 00:02 Iteration:  600 / 2000 [ 30%]  (Warmup)
�[A

chain 1 |�[33m████      �[0m| 00:02 Iteration:  700 / 2000 [ 35%]  (Warmup)
�[A

chain 1 |�[33m████▌     �[0m| 00:02 Iteration:  800 / 2000 [ 40%]  (Warmup)
�[A

chain 1 |�[33m█████     �[0m| 00:02 Iteration:  900 / 2000 [ 45%]  (Warmup)
�[A

chain 1 |�[34m█████▉    �[0m| 00:02 Iteration: 1001 / 2000 [ 50%]  (Sampling)
�[A

chain 1 |�[34m██████▎   �[0m| 00:03 Iteration: 1100 / 2000 [ 55%]  (Sampling)
�[A

chain 1 |�[34m██████▊   �[0m| 00:03 Iteration: 1200 / 2000 [ 60%]  (Sampling)
�[A

chain 1 |�[34m███████▎  �[0m| 00:03 Iteration: 1300 / 2000 [ 65%]  (Sampling)
�[A

chain 1 |�[34m███████▋  �[0m| 00:03 Iteration: 1400 / 2000 [ 70%]  (Sampling)
�[A

chain 1 |�[34m████████▏ �[0m| 00:03 Iteration: 1500 / 2000 [ 75%]  (Sampling)
�[A

chain 1 |�[34m████████▋ �[0m| 00:04 Iteration: 1600 / 2000 [ 80%]  (Sampling)
�[A

chain 1 |�[34m█████████ �[0m| 00:04 Iteration: 1700 / 2000 [ 85%]  (Sampling)
�[A

chain 1 |�[34m█████████▌�[0m| 00:04 Iteration: 1800 / 2000 [ 90%]  (Sampling)
�[A

chain 1 |�[34m██████████�[0m| 00:04 Sampling completed                       
chain 2 |�[34m██████████�[0m| 00:04 Sampling completed                       
chain 3 |�[34m██████████�[0m| 00:04 Sampling completed                       
chain 4 |�[34m██████████�[0m| 00:04 Sampling completed                       

                                                                                                                                                                                                                                                                                                                                


17:05:36 - cmdstanpy - INFO - CmdStan done processing.
17:05:36 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[10] is 0, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[1] is inf, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Exception: Exception: beta_binomial_lpmf: First prior sample size parameter[1] is inf, but must be positive finite! (in 'glm_multi_beta_binomial.stan', line 214, column 16 to line 219, column 19) (in 'glm_multi_beta_binomial.stan', line 653, column 3 to line 683, column 8)
Consider re-running with show_console=True if the above output is unclear!
estimate_res.keys()
dict_keys(['fit', 'model_input', 'truncation_df2', 'sample', 'cell_group', 'count', 'formula_composition', 'formula_variability', 'noise_model'])

sccomp_test function

by default return_all is set to be False, set as True to include all info in results

test_res = sccompPy.sccomp_test(estimate_res, contrasts=['type[cancer] - type[benign]'], return_all=True)
Warning: These elements require backquotes: ['type[cancer]', 'type[benign]'], sccompPy will auto quote them.
Warning: These elements require backquotes: ['type[cancer]', 'type[benign]'], sccompPy will auto quote them.


/home/chzhan1/Python/SAiGENCI/sccompPy/sccompPy/utilities.py:304: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/chzhan1/Python/SAiGENCI/sccompPy/sccompPy/utilities.py:321: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

sccomp_test returns a dict when return_all=True where the first element - result contains the result table

Otherwise sccomp_test will be result table itself

test_res['result']
cell_group parameter c_lower c_effect c_upper c_pH0 FDR N_Eff R_k_hat v_lower v_effect v_upper v_pH0 factor design_matrix_col count_data
0 B1 type[benign] 0.895135 1.157385 1.390387 0.00000 0.000000 5933.55 0.999504 NaN NaN NaN NaN type type[benign] sample type phenotype co...
1 B1 type[cancer] 0.138356 0.489348 0.808645 0.01450 0.001500 6200.74 0.999354 NaN NaN NaN NaN type type[cancer] sample type phenotype co...
2 B1 type[cancer] - type[benign] -1.076798 -0.666984 -0.268136 0.00250 0.000950 NaN NaN NaN NaN NaN NaN NaN NaN sample type phenotype co...
3 B2 type[benign] 0.396715 0.724280 1.036206 0.00050 0.000071 5362.59 0.999425 NaN NaN NaN NaN type type[benign] sample type p...
4 B2 type[cancer] -0.438626 0.027175 0.449079 0.64200 0.087713 5983.34 0.999770 NaN NaN NaN NaN type type[cancer] sample type p...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
103 TM2 type[cancer] -1.221164 -0.896173 -0.601406 0.00000 0.000000 5764.90 0.999441 NaN NaN NaN NaN type type[cancer] sample type ...
104 TM2 type[cancer] - type[benign] -0.182896 0.273352 0.721949 0.21475 0.055604 NaN NaN NaN NaN NaN NaN NaN NaN sample type ...
105 TM3 type[benign] -1.770492 -0.824812 0.244776 0.09325 0.017061 2063.35 0.999920 NaN NaN NaN NaN type type[benign] sample type ...
106 TM3 type[cancer] -3.890775 -2.658175 -1.426470 0.00000 0.000000 1831.74 1.000070 NaN NaN NaN NaN type type[cancer] sample type ...
107 TM3 type[cancer] - type[benign] -3.090491 -1.832928 -0.768404 0.00075 0.000333 NaN NaN NaN NaN NaN NaN NaN NaN sample type ...

108 rows × 16 columns

Use function plot_1D_intervals to visualise results

return_all needs to set as True when run sccomp_test

sccompPy.plot_1D_intervals(test_res)

png

About

Mixed-effect model to test differences in cell type proportions from single-cell data, in Python

Resources

License

Stars

Watchers

Forks

Releases

No releases published