Computational tools for cell fate engineering
CompForce ("COMPutational tools FOR Cell fate Engineering") is a platform for comparing results of computational fate determinant prediction tools. CompForce adds flexibility to existing tools, allowing users to provide their own transcriptomic data and network structures, and optionally aggregates results.
Additionally, CompForce includes various functions to aid in benchmarking results. In the latter part of this tutorial, we provide various datasets (real and synthetic) and give an example of how to take advantage of the Dyngen platform (see dyngen.dynverse.org) to simulate synthetic predictions.
At the core of CompForce is the "CompForce" class. This object stores raw (counts) expression data, corresponding metadata, optional user provided network, and the raw and pruned results. Other information optionally stored includes: species, TFs, methods run, parameters for specific methods, transitions analyzed.
-
cf_initialize : Initializes the CompForce object
-
add_transition : Specifies the transition of interest
-
pm_diffexp : Runs differential expression. Parameter 'method' can be specified as either 'ttest' or 'deseq'
-
pm_cellnet : Runs CellNet. If parameter use_cfobj_network=FALSE, will reconstruct the network based on the stored expression data
-
pm_mogrify : Runs a flexible version of Mogrify. If parameter use_cfobj_network=FALSE, will use the network from STRINGdb, based on species.
-
pm_dalessio : Runs JSD portion of the method by d'Alessio et al. 2015. If parameter all_background=FALSE, will use source samples/cells as background. Otherwise, will default to using all non-target data as background.
-
cf_RRA : Aggregates the results stored in cfobj@results. Uses Robust Rank Aggregation.
To initialize a CompForce object, provide expression data, metadata, and TFs:
- expCounts : raw counts expression matrix, genes-by-samples/cells
- sampTab : sample table (metadata), with row names matching the column names of expCounts
- TFs : TFs -- see /data for mouse and human TFs
Also provide (optional):
- network : GRN dataframe. Columns should be ordered "TF","TG","weight","type". 'type' is either 1 or -1 corresponding to activation and repression respectively
# Expression data and corresponding metadata
expDat<-utils_loadObject("data/dataset_01_expX_counts.rda")
sampTab<-utils_loadObject("data/dataset_01_sampTab.rda")
# List of mouse TFs
mmTFs<-utils_loadObject("data/mmTFs.rda")
mmTFs<-intersect(rownames(expDat),mmTFs)
# A Network
grn<-utils_loadObject("data/mouse_chipX.rda")
# Initialize
cforce<-cf_initialize(expDat,sampTab,TFs=mmTFs,description_column="celltype",network=grn)
# Add transition of interest
cforce<-add_transition(cforce,"bcell","macrophage")
# You can run any combination of the following:
cforce<-pm_diffexp(cforce,method="ttest")
cforce<-pm_diffexp(cforce,method="deseq")
cforce<-pm_cellnet(cforce)
cforce<-pm_mogrify(cforce)
cforce<-pm_dalessio(cforce)
# Alternatively, you can do this in one step like so:
# cforce<-cf_onerun(cforce,methods="all")
# To run a subset of methods: methods is any combination of c("diffexp_ttest","diffexp_deseq","cellnet","mogrify","dalessio")
cforce<-cf_RRA(cforce)
cforce@results_raw stores the raw results from running each method. cforce@results stores the cleaned results, which is presented as individual dataframes specifying 'TF' and 'rank'.
For example:
cforce@results$bcell..macrophage$CellNet
# TF rank
# Fli1 Fli1 1
# Spi1 Spi1 2
# Irf8 Irf8 3
# Nfe2 Nfe2 4
# Gata6 Gata6 5
# Rarg Rarg 6
# Rarb Rarb 7
# Lyl1 Lyl1 8
# Mef2a Mef2a 9
cforce@results$bcell..macrophage$Mogrify[1:10,]
# TF rank
# Nr0b1 Nr0b1 1
# Sox6 Sox6 2
# Gata4 Gata4 3
# Sox18 Sox18 4
# Zfp365 Zfp365 5
# Lbx1 Lbx1 6
# Neurod2 Neurod2 7
# Foxa3 Foxa3 8
# Esrrb Esrrb 9
# Hoxa11 Hoxa11 10
cforce@results$bcell..macrophage$aggregate[1:10,]
# TF RRA_Score rank
# Nfe2 Nfe2 1.074761e-06 1
# Gata6 Gata6 2.097536e-06 2
# Rarb Rarb 2.223193e-05 3
# Bmp2 Bmp2 3.952438e-04 4
# Epas1 Epas1 8.868822e-04 5
# Zbtb16 Zbtb16 1.205503e-03 6
# Nfia Nfia 2.450168e-03 7
# Nlrp3 Nlrp3 3.471862e-03 8
# Tcf7l2 Tcf7l2 3.518621e-03 9
# Cebpa Cebpa 5.702262e-03 10
The following describes steps to add in a new method to the CompForce platform locally.
- Create a new method file: Akin to "pm_cellnet.R","pm_delassio.R", and others-- this is a file that contains functions required to run the method. At the center of this file should be a main wrapper function that takes in a compforce object and a parameter specifying the transition row of interest (as defined in the compforce object). This main function should return an updated compforce object. See "pm_cellnet", "pm_dalessio", etc. for details.
- Update cf_predict.R: Update the function cf_onerun in cf_predict.R such that it calls the new method if methods=="all", or if the new method is specified.
We include in this package some tools to aid in benchmarking.
As part of the CompForce project, we benchmarked existing prediction methods on real and synthetic datasets. A description of these datasets and where to access them are as follows:
These are 15 synthetic datasets generated using the Dyngen platform. For each dataset, the following files exist:
- "model_common.." : The common Dyngen model (no simulations run yet)
- "model_wt..": The WT Dyngen model with simulations corresponding to the WT populations
- "sc_expX..": The expression data
- "sc_sampTab..": The corresponding metadata
- "grn..": The GRN used to simulate the system
- "TFs..": A list of TFs in the dataset
For nets01-05 and 11-15, a dynamic version of the dataset also exists for use in predicting and simulating dynamic, step-wise differentiation trajectories:
- "dynamic_sc_expX..": The expression data for step-wise predictions
- "dynamic_sc_sampTab..": The corresponding metadata
Finally, should users be interested in applying SingleCellNet to assess simulations, a pre-trained classifier for each model is provided in "classifiers_scexp..".
These are 3 real curated bulk datasets:
- dataset_01 : a broad dataset spanning several cell types
- dataset_02 : a blood-centric dataset
- dataset_03 : a neural-centric dataset
Each dataset has an expression matrix ("..expX..") and corresponding metadata ("..sampTab.."). Additionally, a sample GRN and list of TFs are included:
- "mousePKN..."
- "mouseTFs..."
These are 2 real curated single-cell datasets:
- dataset01 : a broad dataset spanning several cell types
- dataset02 : a neural-centric dataset
Each dataset has an expression matris ("..expX.."), corresponding metadata ("..sampTab.."), and list of TFs ("..TFs.."). Additionally each dataset has reconstructed GRNs at different edge densities.
Raw, individual gold standards for real data can be found here. These are labeled as "source_target_type.csv". Source refers to the starting cell type, target refers to the target cell type, type refers to if it is a list of positives (reported in a cocktail) or negatives (tested, and not in any cocktails).
Importantly, TF names in these lists need to be harmonized with any datasets prior to assessment, such that naming systems are in agreement.
To assess performance on synthetic datasets, we can directly simulate the results. This requires Dyngen and Tidyverse. Installation instructions can be found: https://dyngen.dynverse.org/
Here is an example of how to simulate the overexpression of a set of TFs.
Using net02 as an example (see above for the data), load in the common and WT models.
require(dyngen)
require(tidyverse)
net02_model_common<-utils_loadObject("net02_model_common_20220317.rda")
net02_model_wt<-utils_loadObject("net02_model_wt_20220317.rda")
Simulate overexpression of "B3_TF1" and "C3_TF1" for the sEndC-->sEndD transition. Three steps are involved:
# Set the perturbation type
model_oe<-set_simulation_type_overexpression(net02_model_common,c("B3_TF1","C3_TF1"))
# Set the starting state
model_oe<-set_simulation_start_state(model_oe,net02_model_wt,"sEndC")
# Generate the simulation using Dyngen
model_oe<-generate_cells(model_oe)
# plot
plot_gold_mappings(model_oe,do_facet=FALSE)
plot_gold_simulations(model_oe)
The simulations looks like:
Simulate overexpression of a series of TF sets for the sA-->sEndD transition.
# This will require assertthat
require(assertthat)
# Set up the reporter for the intermediate (or multiple intermediate states)
# Here our reporter for the intermediate state is positive expression of B1_TF1
reporters<-c(1)
names(reporters)<-c("B1_TF1")
reporters<-list(epoch2..epoch2=reporters)
# Set up the perturbation
to_perturb<-list(epoch1..epoch1=c("A4_TF1","A2_TF1"),epoch2..epoch2=c("B3_TF1","B8_TF1","D2_TF1"))
# Simulate
model_oe<-ds_simulate_steps2(net02_model_common,net02_model_wt,"sA",to_perturb,reporters,num_simulations=100L)
# plot
plot_gold_mappings(model_oe,do_facet=FALSE)
plot_gold_simulations(model_oe)
The simulations looks like:



