The aim of ChemPSy (Chemical Prioritization System) is to develop an innovative approach based on several bioinformatics and biostatistics methodologies to analyze and integrate massive toxicogenomics datasets.
Specific objectives include: (1) the classification of chemicals based on transcriptional signatures, i.e. the set of genes whose expression is known to be positively or negatively altered after exposure to these compounds; (2) the association of classes with human pathologies or deleterious phenotypes, e.g. classes containing toxicants with well-known effects; (3) the prediction of novel reprotoxicants and/or endocrine disruptors based on transcriptional signature similarities with known chemicals affecting testis development and function.
##Data format

Each dataset is organized as follows:

    +-- [Species]
        +-- [Tissue]
            +--GSEXXX
                +--Experimental_conditions
                ¦   +--Condition 1
                ¦   +--Condition 2
                +--Individual_experiments
                ¦   +--GSMXXXX.CEL.gz
                ¦   +--GSMXXXX.CEL.gz
                +--GSEXXX.txt

[Species]: Binomial nomenclature for selected species (e.g. Homo sapiens for Human)
[Tissue]: Tested tissue in upper case (e.g. LIVER, KIDNEY or MCF-7, HK-2...)
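
The layout above can be checked before going further. The following is a minimal sketch in Python and is not part of ChemPSy: the function name `check_gse_layout` is hypothetical, and it only tests what the tree shows for a single GSEXXX directory (a description file named after the directory, and an Individual_experiments folder holding compressed CEL files). It does not check Experimental_conditions.

```python
from pathlib import Path


def check_gse_layout(gse_dir):
    """Return a list of problems found in one GSEXXX directory (empty if none)."""
    gse_dir = Path(gse_dir)
    problems = []

    # The description file must carry the same name as the directory.
    description = gse_dir / f"{gse_dir.name}.txt"
    if not description.is_file():
        problems.append(f"missing description file: {description.name}")

    # All expression files live in Individual_experiments, gzip-compressed.
    experiments = gse_dir / "Individual_experiments"
    if not experiments.is_dir():
        problems.append("missing directory: Individual_experiments")
    else:
        cel_files = sorted(p for p in experiments.iterdir() if p.is_file())
        if not cel_files:
            problems.append("Individual_experiments is empty")
        for cel in cel_files:
            if not cel.name.endswith(".CEL.gz"):
                problems.append(f"not a compressed CEL file: {cel.name}")

    return problems
```

Calling `check_gse_layout("Homo sapiens/LIVER/GSEXXX")` returns an empty list when the directory matches the expected organization.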
###Step 1 - Describe your dataset

To describe your dataset, use a tab-delimited .txt file with the following fields (keep the order):

Fields | Description |
---|---|
Files | CEL file full name (GSM1223.CEL.gz) |
Species | Binomial nomenclature of species where the results come from |
Strain | Species strain (e.g. Sprague-Dawley). Can be not specified * |
Gender | Animal gender(male/female). Can be not specified * |
Experiment | Experiment (e.g. in vitro, in vivo, ...). Can be not specified * |
Tissues/Cells | Tissue or cell name where the experiment is performed |
Age | Animal age. Can be not specified * |
Generation | Animal generation (for trans-generational studies). If not specified, please put 'F0'. |
ChemicalName | Chemical usual / synonym name (only one name) |
CAS | Chemical CAS number |
MESH | Chemical MESH ID |
Dose | Chemical exposure dose |
Duration | Chemical exposure duration |
Route | Chemical route. Can be not specified * |
Vehicule | Chemical vehicle. Can be not specified * |
PMID | Associated publication PubMed ID. Can be not specified * |
GSE | GEO dataset ID |
GSM | GEO profile ID |
GPL | GEO platform ID (GPL) used. Can be not specified * |
Corresponding dataset author mail. Can be not specified * | |
Paired | Paired data (Yes/No) |
Replicates | Replicate number |
Experiment type | Experimental type details (e.g. 'Expression profiling by array'). Can be not specified * |
Design | Experimental design. Can be not specified * |
Treatment protocol | Treatment protocol description. Can be not specified * |
Characteristics | Tissue/cells characteristics. Can be not specified * |
Extraction protocol | Extraction protocol description. Can be not specified * |
Link(s) | Cross-link(s) (e.g. GEO, database, personal website ...). Can be not specified * |
Data processing | Data processing description. Can be not specified * |
Sample Treated | Treated or Control sample. Can be not specified |
Associated Ctrl | Associate a unique number to your control and list all control paired with your treated sample (e.g. control1 = 1, control2=2 ..., treated_sample1 = 1,2 [this sample is paired with control 1 and 2]). |

Don't leave empty fields: use 'NA' if your field is not specified.

'*': This field is required for TOXsIgN integration.

Each line must correspond to one unique sample.

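
The last rules (no empty fields, one line per sample) lend themselves to an automated check. Below is a minimal sketch in Python, not part of ChemPSy. The function name `validate_description` is hypothetical, and the sketch assumes that the file starts with a header row and that the first column is `Files`; adapt it if your file has no header.

```python
import csv


def validate_description(path):
    """Check a tab-delimited description file; return a list of error messages."""
    errors = []
    seen_files = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, None)  # assumption: first row holds the field names
        if header is None:
            return ["file is empty"]
        for line_no, row in enumerate(reader, start=2):
            # Every line must have exactly as many fields as the header.
            if len(row) != len(header):
                errors.append(
                    f"line {line_no}: expected {len(header)} fields, found {len(row)}"
                )
                continue
            # Empty fields are not allowed; 'NA' must be used instead.
            for name, value in zip(header, row):
                if value.strip() == "":
                    errors.append(f"line {line_no}: empty field '{name}' (use 'NA')")
            # Each sample (first column, assumed to be 'Files') may appear once.
            cel_name = row[0]
            if cel_name in seen_files:
                errors.append(f"line {line_no}: sample {cel_name} listed more than once")
            seen_files.add(cel_name)
    return errors
```

An empty list means the file passed these checks; the sketch does not verify field order or field content.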
###Step 2 - Organize your data
In your GSEXXX directory, save your tab-delimited .txt file under the same name as the directory (GSEXXX.txt) and create a new folder called Individual_experiments.
Place all expression files associated with your study in this folder. Make sure that all your .CEL files are compressed. If they are not, use the following command:

    gzip *.CEL

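
If you prefer to do the compression from a script, the same operation can be sketched in Python with the standard `gzip` module. This is an illustrative alternative, not part of ChemPSy; the function name `compress_cel_files` is hypothetical. Like the `gzip` command, it replaces each original file with its compressed version.

```python
import gzip
import shutil
from pathlib import Path


def compress_cel_files(folder):
    """Gzip every uncompressed .CEL file in folder; return the names created."""
    created = []
    for cel in sorted(Path(folder).glob("*.CEL")):
        target = cel.with_name(cel.name + ".gz")
        # Stream the content so that large CEL files are not loaded in memory.
        with open(cel, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        cel.unlink()  # mirror gzip's default behaviour of removing the original
        created.append(target.name)
    return created
```

Files that already end in `.CEL.gz` are left untouched because they do not match the `*.CEL` pattern.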
###Step 3 - Create conditions and treatment.info

To create the Experimental_conditions directory and all condition sub-directories, use the CreateTreatmentInfo.sh script. This script takes no arguments but loads a configuration file: ChemPSy.ini. Either modify this file or change the configuration file loaded in the CreateTreatmentInfo.sh script:

    source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPSy_Human.ini

Next, adapt the loops to your datasets:

    #!/bin/bash
    # Load the configuration file that defines the variables used below.
    source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPSy_Human.ini
    echo "Reading config...." >&2

    # One loop per tissue or cell line: each list is split on whitespace on purpose.
    #echo "Create treatment.info files for HEPATOCYTES"
    for i in $HepatoList
    do
        python "$scriptTreatment" -p "$i" -t HEPATOCYTES -e "$fileRemove" -s True
    done

    echo "Create treatment.info files for HK-2"
    for i in $HK2List
    do
        python "$scriptTreatment" -p "$i" -t HK-2 -e "$fileRemove" -s True
    done

    echo "Create treatment.info files for ISHIKAWA_CELLS"
    for i in $IshikawaList
    do
        python "$scriptTreatment" -p "$i" -t ISHIKAWA_CELLS -e "$fileRemove" -s True
    done

    echo "Create treatment.info files for JURKAT_CELLS"
    for i in $JurkatList
    do
        python "$scriptTreatment" -p "$i" -t JURKAT_CELLS -e "$fileRemove" -s True
    done

    echo "Create treatment.info files for MCF-7"
    for i in $MCFList
    do
        python "$scriptTreatment" -p "$i" -t MCF-7 -e "$fileRemove" -s True
    done

    echo "Create treatment.info files for liver"
    for i in $LiverList
    do
        python "$scriptTreatment" -p "$i" -t LIVER -e "$fileRemove" -s False
    done

    echo "Create treatment.info files for tg"
    for i in $TgList
    do
        python "$scriptTreatment" -p "$i" -t THIGH-MUSCLE -e "$fileRemove"
    done

If no error occurs, you should obtain the following directory organization:

    +-- [Species]
        +-- [Tissue]
            +--GSEXXX
                +--Experimental_conditions
                ¦   +--Condition 1
                ¦   ¦   +--treatment.info
                ¦   +--Condition 2
                ¦   ¦   +--treatment.info
                +--Individual_experiments
                ¦   +--GSMXXXX.CEL.gz
                ¦   +--GSMXXXX.CEL.gz
                +--GSEXXX.txt

In each treatment.info file, you will find the association between a treated sample (first column) and a control sample (second column):

    003016029014.CEL.gz 003016029008.CEL.gz 0
    003016029014.CEL.gz 003016029009.CEL.gz 0
    003016029015.CEL.gz 003016029008.CEL.gz 0
    003016029015.CEL.gz 003016029009.CEL.gz 0

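
To inspect these associations, a treatment.info file can be read into a mapping from each treated sample to its controls. This is a minimal sketch in Python, not part of ChemPSy. The function name `read_treatment_info` is hypothetical, the file is assumed to be whitespace-separated as in the example above, and the third column is ignored because its meaning is not described here.

```python
from collections import defaultdict


def read_treatment_info(path):
    """Map each treated CEL file to the list of control CEL files paired with it."""
    controls_by_treated = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            treated, control = fields[0], fields[1]
            controls_by_treated[treated].append(control)
    return dict(controls_by_treated)
```

With the four lines shown above, each of the two treated samples is mapped to the same two control samples.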
##Run ChemPSy

Before running ChemPSy, make sure that your data follow the directory architecture described above and that every condition has its associated treatment.info file.
Next, run ChemPSy_data_prep.sh.
Like the previous script, it uses the same configuration file, so edit that file and/or change the path on the source line.

    #!/bin/bash
    #################################
    #       Source .ini file        #
    #################################
    echo "############################## ChemPSy ##############################"
    echo "--1-- Checking config file"
    source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPsy_processing/ChemPSy_Human.ini
    # Print the loaded settings on stderr so they can be checked before processing.
    echo "Reading config...." >&2
    echo "Reading scriptPath: $scriptPath" >&2
    echo "Reading Rscript: $Rscript" >&2
    echo "Reading dataPath: $dataPath" >&2
    echo "Reading processedPath: $processedPath" >&2

###Step 1 - Quality control

The first step of ChemPSy_data_prep.sh is a quality control. Several files are created for each condition, including a picture of each microarray.

    function step_1 {
        echo "--2-- STEP_1 process_data"
        for tissue in $tissues
        do
            echo "$tissue"
            outputT="${processedPath}${tissue}/"
            mkdir -p "$outputT"
            for gse in $gsePath
            do
                path="${dataPath}${tissue}/${gse}/"
                # Only process GSE directories that exist for this tissue.
                if [ -d "$path" ]
                then
                    output="${processedPath}${tissue}/${gse}/Experimental_conditions/"
                    mkdir -p "$output"
                    scriptA="${scriptPath}Rlauncher.sh"
                    "$scriptA" -p "$path" -t "$tissue" -o "$output" -c "$cdfpath"
                fi
            done
        done
        # Wait until no ChemPSy_ job is listed by qstat any more.
        while [ "$(qstat | grep "ChemPSy_" | wc -l)" -ne 0 ]
        do
            echo "Running --2-- STEP_1 process_data"
            sleep 7
        done
        echo "--2-- STEP_1 process_data finish"
    }

To perform the quality control, check each picture one by one and remove any microarray with 20% or more hybridization errors.
List all the samples to remove in removeCelFile.txt and run Step 3 of the Data format section.

    +-- [Species]
        +-- [Tissue]
            +--GSEXXX
                +--Experimental conditions
                    +--Condition 1
                    ¦   +--contrastmatrix.txt
                    ¦   +--normdata.txt
                    ¦   +--designmatrix.txt
                    ¦   +--qc_boxplot_afternormalization.pdf
                    ¦   +--qc_boxplot_beforenormalization.pdf
                    ¦   +--filtration.txt
                    ¦   +--log2fcchangedata.txt
                    ¦   +--qc_corrmatrix_afternormalization.pdf
                    ¦   +--mednormdata.txt
                    ¦   +--qc_image_003016029009.CEL.gz.png
                    ¦   +--qc_image_003016029015.CEL.gz.png
                    ¦   +--qc_image_003016029009.CEL.gz.png

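
Because each picture has to be reviewed by hand, it can help to list what a condition directory contains before starting. The sketch below, in Python, is not part of ChemPSy. The function name `summarize_condition` is hypothetical, and the list of expected files is taken from the tree above and assumed to be the same for every condition.

```python
from pathlib import Path

# File names copied from the directory tree above (assumed identical for all conditions).
EXPECTED_FILES = [
    "contrastmatrix.txt",
    "normdata.txt",
    "designmatrix.txt",
    "qc_boxplot_afternormalization.pdf",
    "qc_boxplot_beforenormalization.pdf",
    "filtration.txt",
    "log2fcchangedata.txt",
    "qc_corrmatrix_afternormalization.pdf",
    "mednormdata.txt",
]


def summarize_condition(condition_dir):
    """Return (missing output files, CEL files that have a QC picture to review)."""
    condition_dir = Path(condition_dir)
    missing = [
        name for name in EXPECTED_FILES if not (condition_dir / name).is_file()
    ]
    prefix, suffix = "qc_image_", ".png"
    # Strip the prefix and suffix to recover the CEL file name behind each picture.
    samples = sorted(
        picture.name[len(prefix):-len(suffix)]
        for picture in condition_dir.glob("qc_image_*.png")
    )
    return missing, samples
```

The returned sample names can be copied directly into removeCelFile.txt for the microarrays that fail the visual check.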
###Step 2 - List all conditions

This step lists all the conditions