Commit 162b3fb

Version 1.2, released July 3 2013
1 parent 2315217 commit 162b3fb

23 files changed: +801 -4001 lines changed

README.txt

Lines changed: 37 additions & 21 deletions
@@ -1,20 +1,25 @@
 The SDREM software is described in:
+
 Linking the signaling cascades and dynamic regulatory networks controlling stress responses.
 Anthony Gitter, Miri Carmi, Naama Barkai, Ziv Bar-Joseph.
-Genome Research. doi:10.1101/gr.138628.112.
+Genome Research. 23:2, 2013.
+
+Identifying proteins controlling key disease signaling pathways.
+Anthony Gitter, Ziv Bar-Joseph.
+Bioinformatics. 29:13, 2013.

 Contact agitter@cs.cmu.edu with any questions.

 Use of this code implies the user has accepted the terms of license.txt.
 The code requires Java 5.0 or above.

-Properties files are used to specify the paramaters and input data.
+Properties files are used to specify the parameters and input data.
 See the DREM manual (http://sb.cs.cmu.edu/drem/DREMmanual.pdf) for details regarding the gene expression and
 protein-DNA binding file formats. The TF-gene priors have the same
 format as the DREM protein-DNA binding data file, but have the value 0.5
 instead of 1 for all non-zero entries (assuming uniform initial priors).

-Examples of how to run SDREM on a cluster are included. The maximum Java heap size must be
+Examples of how to run SDREM on a PBS cluster are included. The maximum Java heap size must be
 increased when working with large networks.

 Many intermediate output files are generated during each iteration of SDREM.
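The hunk above says the TF-gene priors use the DREM protein-DNA binding format, with 0.5 in place of every non-zero entry. A minimal Python sketch of that conversion follows. It is an illustration only, not part of SDREM: it assumes a tab-delimited grid with one header row and gene ids in the first column (check the DREM manual for the exact layout), and `binding_to_priors` and the ids in the example are hypothetical.

```python
import csv
import io


def binding_to_priors(binding_text, prior="0.5"):
    """Return a copy of a tab-delimited binding grid with every
    non-zero entry replaced by `prior` (uniform initial priors).

    Assumption: first row is a header, first column holds gene ids.
    """
    rows = list(csv.reader(io.StringIO(binding_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    out = [header]
    for row in body:
        gene, values = row[0], row[1:]
        # Any non-zero binding value becomes the uniform prior; zeros stay zero.
        out.append([gene] + [prior if float(v) != 0 else "0" for v in values])
    return "\n".join("\t".join(r) for r in out) + "\n"


# Made-up grid: two genes, two TFs.
example = "Gene\tTF1\tTF2\n1001\t1\t0\n1002\t0\t1\n"
priors = binding_to_priors(example)
print(priors)
```

The output can serve as a starting point for a file such as tfActivityPriors_round0.txt once the layout assumption has been confirmed.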
@@ -24,13 +29,13 @@ The SDREM executables, code, input files, output produced, and example data are
 ***********************************************
 Precomputing and storing paths (optional)
 ***********************************************
-This is an optional step that can be performed before running SDREM.
+This is a recommended optional step that can be performed before running SDREM.

-StorePaths.jar - The executable for preprocessing the network data. It searches for paths from the sources to each TF and writes them all to disk, optionally filtering them to keep only the highest confidence paths. Without doing this preprocessing, SDREM has to search for the paths many times as it iterates, which is wasteful and makes it very, very, very slow.
+StorePaths.jar - The executable for preprocessing the network data. It searches for paths from the sources to each TF and writes them all to disk, optionally filtering them to keep only the highest confidence paths. The stored files are large and can require many GBs of disk space in total. Without this preprocessing, SDREM has to search for the paths many times as it iterates, which is wasteful and makes it very, very, very slow for large networks.

-allPathsEgf.props - An example properties file for StorePaths.jar. store/filter means the user wants to enumerate paths and remove low-confidence paths (recommended for large PPI networks). The next lines define the sources, targets, and networks. Node priors are used to give weights to vertices in the network. If the user stores and filters paths, then he/she can delete the intermediate output in stored.paths.dir once the filtering step is done (just give the final directory with the filtered paths to SDREM as input).
+allPathsEgfPriors.props - An example properties file for StorePaths.jar. store/filter means the user wants to enumerate paths and remove low-confidence paths (recommended for large PPI networks to speed network orientation). The next lines define the sources, targets, and networks. Node priors are used to give weights to vertices in the network. If the user stores and filters paths, then he/she can delete the intermediate output in stored.paths.dir once the filtering step is done (give the final directory with the filtered paths to SDREM as input).

-allPathsEgf.qub - An example showing how to call StorePaths.jar. This is a submission script for a PBS cluster and the last line shows the actual command.
+allPathsEgfPriors.qub - An example showing how to call StorePaths.jar. This is a submission script for a PBS cluster and the last line shows how to call the jar.


 ***********************************************
@@ -40,20 +45,20 @@ The SDREM algorithm executable.

 sdrem.jar - The SDREM executable.

-sdrem011712egf.props - A sample SDREM properties file. model.dir is where the output will be written and where some of the input files must be. stored.paths.dir is the location of the filtered paths from the preprocessing jar if it was used. It's important to use the same settings (sources, targets, node priors, path length, etc.) for the StorePaths and SDREM properties. The rest of the parameters can usually be left at the default values and are described in the Genome Research supplement.
+sdremEgfPriors.props - A sample SDREM properties file. model.dir is where the output will be written and where some of the input files must be located. stored.paths.dir is the location of the filtered paths from StorePaths.jar if it was used. It's important to use the same settings (sources, targets, node priors, path length, etc.) for the StorePaths and SDREM properties. predict.knockdown is used to enable the knockdown effect prediction described in the SDREM Bioinformatics paper and can be commented out if this feature is not desired. The rest of the parameters can usually be left at the default values and are described in the SDREM Genome Research paper supplement.

-DREM_defaults.txt - A separate DREM properties file has to be set. It's the same format that the original DREM software uses (as described in http://sb.cs.cmu.edu/drem/DREMmanual.pdf) except the TF-gene_Interactions_File will be generated dynamically and Active_TF_influence is a new SDREM parameter.
+DREM_defaults.txt - A separate DREM properties file has to be configured. It's the same format that the original DREM software uses (as described in http://sb.cs.cmu.edu/drem/DREMmanual.pdf) except the TF-gene_Interactions_File will be generated dynamically and Active_TF_influence is a new SDREM parameter. This file must be located in the model.dir.

-sdrem011712egf.qub - An example showing how to call sdrem.jar on a PBS cluster.
+sdremEgfPriors.qub - An example showing how to call sdrem.jar on a PBS cluster.


 ***********************************************
 Modified DREM
 ***********************************************
 The DREM software (http://sb.cs.cmu.edu/drem/) that SDREM was built upon allows visualization of the
 active TFs and gene expression profiles after SDREM is run. There are several differences between the version distributed here and DREM 2.0:
-None of the new features descrbied in the DREM 2.0 manuscript are present yet (e.g. support for motif finding).
-To view the final output (assuming 10 iterations), should load 10.model as the saved model and tfActivityPriors_round9.txt as the TF-gene interactions. Use these priors instead of tfActivityPriors_round10.txt because tfActivityPriors_round10.txt are the updated priors after running the 10th round of network orientation as opposed to the file that was given to DREM as input at the start of the 10th SDREM iteration.
+None of the new features described in the DREM 2.0 manuscript are present yet (e.g. support for motif finding).
+To view the final output (assuming 10 iterations), load 10.model as the saved model and tfActivityPriors_round9.txt as the TF-gene interactions. Use these priors instead of tfActivityPriors_round10.txt because tfActivityPriors_round10.txt are the updated priors after running the 10th round of network orientation as opposed to the file that was given to DREM as input at the start of the 10th SDREM iteration.
 The split table will show the activity score for each TF at that split and the max activity score across all splits.
 Key TF Labels includes options to display TFs at each split based on activity score. If the user uses activity scores to choose which TFs to show, the slider will be used to calculate 10^X (instead of 10^-X) and all TFs with activity scores greater than 10^X at a split are shown.
 The output file 10.targetsStd can be used to help choose the activity score threshold. The last column in this file gives the max activity score for each active TF. Therefore, the user can find the minimum of these values and use it (or a value slightly less than it) as the threshold for display purposes.
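The threshold rule in the last context line above (take the minimum of the last column of 10.targetsStd, then set the slider so that 10^X sits at or just below it) can be sketched in Python. This is a rough illustration under assumptions the README does not confirm: the file is taken to be tab-delimited, any row whose last field is not numeric is treated as a header and skipped, and `min_max_activity` and the sample rows are hypothetical.

```python
import math


def min_max_activity(targets_std_text):
    """Return the smallest value found in the last column.

    Assumption: tab-delimited rows, with the max activity score of
    each active TF in the last column; non-numeric rows are skipped.
    """
    scores = []
    for line in targets_std_text.splitlines():
        fields = line.split("\t")
        if not fields[-1].strip():
            continue  # blank line or empty last field
        try:
            scores.append(float(fields[-1]))
        except ValueError:
            continue  # header or other non-numeric row
    if not scores:
        raise ValueError("no numeric scores found in the last column")
    return min(scores)


# Made-up content; the real column layout may differ.
example = (
    "TF\tweight\tmaxActivity\n"
    "1001\t0.8\t250.0\n"
    "1002\t0.6\t100.0\n"
    "1003\t0.9\t4000.0\n"
)
threshold = min_max_activity(example)
# The slider value X is used as 10^X, so X is log10 of the threshold.
slider_x = math.log10(threshold)
print(threshold, slider_x)
```

Because DREM shows TFs with scores greater than 10^X, picking X slightly below `slider_x` keeps the lowest-scoring active TF visible.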
@@ -87,11 +92,11 @@ N.model.activitiesStd - Activity information for the TFs used to determine which
 N.targets -The TFs that were selected as targets at iteration N and their target weights. These are the proteins that were used for the network orientation.
 N.targetsStd - Like N.targets but contains information about the distribution of random TF activity scores.

-itrN.out - A log file.
+itrN.out - A log file. It also prints the version of SDREM that was run.

 conflictOrientations_itrN.txt - A code that represents how each PPI was oriented. See pathEdges_iterN.txt for a human-readable version.

-nodeScores_iterN_Pathweight_10_1000.txt - A summary of the oriented network at iteration N. It gives the sources, target TFs (same ones as N.targets) and various measures of how many oriented paths use a particular protein. The '% top 1000 paths through node' column was used to select nodes in the Genome Research paper. From this file, the user can extract the sources, targets, and all other proteins that have a value >= 0.01 in this column (called the 'internal' or 'signaling' proteins). That score is saying that at least 1% of the highest confidence oriented paths go through that particular protein.
+nodeScores_iterN_Pathweight_10_1000.txt - A summary of the oriented network at iteration N. It gives the sources, target TFs (same ones as N.targets) and various measures of how many oriented paths use a particular protein. The '% top 1000 paths through node' column was used to select nodes in the SDREM Genome Research paper. From this file, the user can extract the sources, targets, and all other proteins that have a value >= 0.01 in this column (called the 'internal' or 'signaling' proteins), and they are also written as a Cytoscape-formatted file at the final iteration. That score means that at least 1% of the highest confidence oriented paths go through that particular protein.

 pathEdges_iterN.txt - The predicted PPI orientation for all edges that were used on at least one source-target path (i.e. some edges are excluded because they are not "between" the sources and targets).

@@ -101,23 +106,34 @@ tfActivityPriors_roundN.txt - The new TF-gene binding file that will be used as

 statisfiedPaths_itrN.txt.gz - Lists every oriented, satisfied path that connects a source and target and has less than 5 edges. It doesn't contain any information that couldn't be reconstructed from pathEdges_itrN.txt but is more convenient for downstream analyses that examine individual paths.

-After the final iteration SDREM writes:
+
+After the final iteration, iteration M, SDREM writes:
+postProcessing.out - A log file.
+
+topPathEdges_itrM.sif - A file that can be loaded into Cytoscape v2.8 to visualize the high-confidence paths SDREM inferred. Only edges between sources, targets, or internal nodes are shown.
+topPathNodes_itrM.noa - A file that can be loaded into Cytoscape v2.8 to annotate the nodes on the high-confidence path with their role in the network (Source, Target, or Internal).
+
 droppedTargets.txt - Targets that were present at iteration N-1 but not iteration N.
 targetsByIteration.txt - All targets at each iteration.
 newTargets.txt - Targets that were not present at iteration N-1 but are at iteration N.

+singleKnockdown_itrM.txt - File that is generated only if single or double knockdown effects were requested. The SDREM Bioinformatics paper defines the scoring metrics.
+doubleKnockdown_itrM.txt - File that is generated only if double knockdown effects were requested. The SDREM Bioinformatics paper defines the scoring metrics.
+

 ***********************************************
 Sample data
 ***********************************************
-The example property files refer to these data files, which demonstrate the expected file formats. These are human datasets, and all ids are NCBI Gene ids. DREM always requires that all proteins and genes use the same type of identifier, and it is especially important that the same types of ids are used in the PPI network and the TF columns of the TF-gene interaction files. For example, do not use gene symbols in one and ids in the other.
+The example property files refer to these data files, which demonstrate the expected file formats. These are human datasets, and all ids are NCBI Gene ids (http://www.ncbi.nlm.nih.gov/gene). DREM always requires that all proteins and genes use the same type of identifier, and it is especially important that the same types of ids are used in the PPI network and the TF columns of the TF-gene interaction files. For example, do not use gene symbols in one and ids in the other.
+
+Yarden_MCF10A_expr.txt - Sample expression data in the DREM format (see http://sb.cs.cmu.edu/drem/DREMmanual.pdf). The data are from MCF10A cells stimulated with EGF. Please cite PMID 17322878 if using this data.

-Yarden_MCF10A_expr.txt - Sample expression data in DREM format (see http://sb.cs.cmu.edu/drem/DREMmanual.pdf).
+ppi_ptm_pd_edges.txt - The human interaction network, which uses PPI from BioGRID and HPRD (pp), predicted protein-DNA binding edges (pd), and post-translational modifications from HPRD (ptm). Please cite the SDREM Bioinformatics paper and the original data sources if using this data.

-ppi_ptm_pd_edges.txt - The human interaction network, which uses PPI from BioGRID and HPRD (pp), predicted protein-DNA binding edges (pd), and post-translational modifications from HPRD (ptm).
+tfList.txt - Gene ids for the TFs that have protein-DNA interactions. This is used for precomputing paths with StorePaths.jar, and these proteins are also the set of possible random targets when SDREM builds a background distribution of TF network connectivity scores (if using precomputed paths).

-tfList.txt - Gene ids for the TFs that have protein-DNA interactions. This is used for enumerating paths and these proteins are also the set of possible random targets for when SDREM builds a background distribution of TF network connectivity scores.
+tfActivityPriors_round0.txt - A DREM-style protein-DNA binding grid (see http://sb.cs.cmu.edu/drem/DREMmanual.pdf) but with 0.5 as a prior for all interactions. This will be updated at each iteration as the network is used to refine the priors. Please cite PMIDs 22897824 and 20219943 if using this data.

-tfActivityPriors_round0.txt - A DREM-style protein-DNA binding grid (see http://sb.cs.cmu.edu/drem/DREMmanual.pdf) but with 0.5 as a prior for all interactions. This will be updated at each iteration as the network is used to refine the priors.
+sources.txt - The ids of the proteins that are used as the sources for the network orientation in all of the iterations. These two proteins were selected from the EGFR pathway.

-sources.txt - The ids of the proteins that are used as the sources for the network orientation in all of the iterations.
+egfPriors.txt - A tab-separated file that lists proteins and the prior probability of their involvement in the signaling pathway.
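The README diff above describes selecting the 'internal' or 'signaling' proteins as those with a value >= 0.01 in the '% top 1000 paths through node' column of nodeScores_iterN_Pathweight_10_1000.txt. The Python sketch below illustrates that filter under stated assumptions: the file is taken to be tab-delimited with a header row that uses that exact column name, the id column name `Protein` is a placeholder, and the sources and targets are passed in separately because the README does not say how the file marks them. `select_network_nodes` is a hypothetical helper, not part of SDREM.

```python
import csv
import io

# Column name as quoted in the README.
SCORE_COLUMN = "% top 1000 paths through node"


def select_network_nodes(node_scores_text, sources, targets,
                         id_column="Protein", cutoff=0.01):
    """Return (all_nodes, internal_nodes).

    Sources and targets are always kept; any other protein is kept
    as 'internal' only if its score is at least `cutoff`.
    """
    reader = csv.DictReader(io.StringIO(node_scores_text), delimiter="\t")
    internal = set()
    for row in reader:
        protein = row[id_column]
        if protein in sources or protein in targets:
            continue
        if float(row[SCORE_COLUMN]) >= cutoff:
            internal.add(protein)
    return set(sources) | set(targets) | internal, internal


# Made-up rows for illustration only.
example = (
    "Protein\t" + SCORE_COLUMN + "\n"
    "1001\t0.40\n"
    "1002\t0.02\n"
    "1003\t0.005\n"
    "1004\t0.01\n"
    "1005\t0.30\n"
)
all_nodes, internal_nodes = select_network_nodes(
    example, sources={"1001"}, targets={"1005"}
)
print(sorted(all_nodes), sorted(internal_nodes))
```

At the final iteration SDREM writes the same selection itself (topPathEdges_itrM.sif and topPathNodes_itrM.noa), so a filter like this is mainly useful for inspecting earlier iterations.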

StorePaths.jar

-4.93 MB
Binary file not shown.
Lines changed: 3 additions & 3 deletions
@@ -6,17 +6,17 @@ sources.file = ../../human/egf/sources.txt
 targets.file = ../../human/proteinDna/tfList.txt
 edges.file = ../../human/ppi/ppi_ptm_pd_egdes.txt
 # Node priors for certain nodes. May be left blank.
-#node.priors.file =
+node.priors.file = ../../human/egf/egfPriors.txt
 # Maximum number of edges to allow in a path
 max.path.length = 5
 # Node prior for all nodes not appearing in the node.priors.file
 default.node.prior = 0.5

 # The directory that stored all enumerated paths and from which
 # paths are read when filtering
-stored.paths.dir = ../../human/egf/storedPaths/allPathsEgf
+stored.paths.dir = ../../human/egf/storedPaths/allPathsEgfPriors

 # The number of top-scoring paths to keep after filtering
 path.enum.bound = 100000
 # The directory where the filtered paths are stored
-filtered.paths.dir = ../../human/egf/storedPaths/allPathsEgfPriors100k
+filtered.paths.dir = ../../human/egf/storedPaths/allPathsEgfPriors100k
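The README in this commit stresses that the StorePaths and SDREM properties should use the same settings (sources, targets, node priors, path length, etc.). One way to check this before a long run is sketched below in Python. It is an illustration under assumptions: the parser handles only the simple `key = value` lines and `#` comments seen in the file above (not the full Java properties syntax), and the SDREM properties file is assumed to use the same key names as the StorePaths file, which this diff does not show. `parse_props` and `mismatched_settings` are hypothetical helpers.

```python
def parse_props(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def mismatched_settings(store_props, sdrem_props, keys):
    """Return {key: (store_value, sdrem_value)} for keys whose values differ."""
    return {
        k: (store_props.get(k), sdrem_props.get(k))
        for k in keys
        if store_props.get(k) != sdrem_props.get(k)
    }


# Values for the first file are taken from the diff above; the second
# file is made up to show a mismatch.
store_text = """
# Node priors for certain nodes. May be left blank.
node.priors.file = ../../human/egf/egfPriors.txt
max.path.length = 5
default.node.prior = 0.5
"""
sdrem_text = """
node.priors.file = ../../human/egf/egfPriors.txt
max.path.length = 4
default.node.prior = 0.5
"""
shared_keys = ["node.priors.file", "max.path.length", "default.node.prior"]
diff = mismatched_settings(
    parse_props(store_text), parse_props(sdrem_text), shared_keys
)
print(diff)
```

An empty result means the listed settings agree; any entry shows the StorePaths value first and the SDREM value second.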
Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
11
#!/bin/sh
22
## Set stderr and stdout using absolute directories
3-
#PBS -o /home/agitter/human/egf/storedPaths/allPathsEgf.out
4-
#PBS -e /home/agitter/human/egf/storedPaths/allPathsEgf.err
3+
#PBS -o /home/agitter/human/egf/storedPaths/allPathsEgfPriors.out
4+
#PBS -e /home/agitter/human/egf/storedPaths/allPathsEgfPriors.err
55
## Name the job
6-
#PBS -N allPathsEgf
6+
#PBS -N allPathsEgfPriors
77
## Run on a single node and 8 cores on that node
88
#PBS -l nodes=1:ppn=8
99
## Estimated max RAM is for Java heap (2 GB per each of 8 threads) plus extra buffer
@@ -21,4 +21,4 @@ echo Queue used is $PBS_O_QUEUE
2121
echo Time is `date`
2222

2323
## Start StorePaths.jar
24-
java -Xmx16g -jar StorePaths.jar /home/agitter/human/egf/storedPaths/allPathsEgf.props
24+
java -Xmx16g -jar StorePaths.jar /home/agitter/human/egf/storedPaths/allPathsEgfPriors.props