###->22/03/2015
- Uploaded poster to doc directory as well as main.
- "The more the merrier in high-throughput biology". I think this is what I should post.
#####Some programming tips learned from the project
######Steps to import biopython on uppmax (avoids getting an import error)
- module load bioinfo-tools
- module load biopython (has many biopython versions but works without specifying any)
######Python tips
- Dictionaries are a great data structure for storing data with a (key, value) shape. They worked great for creating key lists and merging key lists with set operations. (Are they memory heavy?)
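The key-list merging mentioned above can be sketched like this (protein IDs and scores are made up for illustration):

```python
# Hypothetical sketch: merging the protein key lists of two replicate
# dictionaries with set operations.
rep1 = {"P01": 3.2, "P02": 1.1, "P03": 0.7}  # protein -> score, replicate 1
rep2 = {"P02": 1.4, "P03": 0.9, "P04": 2.0}  # protein -> score, replicate 2

common = set(rep1) & set(rep2)      # proteins seen in both replicates
everything = set(rep1) | set(rep2)  # union of all proteins seen

print(sorted(common))      # ['P02', 'P03']
print(sorted(everything))  # ['P01', 'P02', 'P03', 'P04']
```

Dictionaries trade some memory overhead (hash table slots) for O(1) average-case lookup, so for tens of thousands of proteins the cost is negligible.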
######MobaXterm
Hot! SFTP browser and various other bells and whistles!
######GitHub
Useful resource for beginners - http://guides.railsgirls.com/github/
###->20/03/2015
- Presented poster (Scilife Gamma floor 6).
- Prof. Lukas verified our results. The proteins we found to be differentially abundant actually seem to be differentially abundant! Yay!
###->18/03/2015
- Same problems with pi0 in NSAF, with some estimated pi0 values greater than 1 but none greater than 2.
- On checking q values, found 2 proteins that are differentially expressed with a q value of less than 0.01 (one each in Sample 1 versus Sample 3 and Sample 2 versus Sample 3).
- James had the idea that if the proteins were spiked in different quantities, how can we say that some proteins do not have a zero concentration in some samples? Zero is a valid concentration. So is our elimination of all the proteins that are not common between samples correct? It is intriguing to think about, but then we have to consider all the readings that we filtered out in the beginning based on their q-values. What if the q-value for the reading of that peptide was low in one sample and high in another? Something to think about.
- Interesting thing to find out: what could be done differently to improve the insensitive performance of the two spectral counting methods when compared against the results obtained from the method starting from precursor intensity?
- Completed the poster and uploaded it to the shared folder.
###->17/03/2015
Started with statistical analysis on the emPAI results.
#####Problems and possible solutions
- A lot of p value scores were 1.
- A lot of q value calculations gave a division by zero error.
- pi0 values generated from p values had an upper limit of just above 250.
- Found out there could be problems with the Excel t.test if the data was in multiple columns -> modified the mergeEmpai script to use the scipy.stats.ttest_ind() function to run the t-test in Python.
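The scipy replacement for the Excel t-test can be sketched like this (the replicate scores below are made-up emPAI values, not project data):

```python
# Sketch of the per-protein two-sample t-test using scipy.stats.ttest_ind,
# as in the modified mergeEmpai script; scores are hypothetical.
from scipy.stats import ttest_ind

sample1 = [0.42, 0.45, 0.40]  # three replicate emPAI scores, sample 1
sample3 = [0.10, 0.12, 0.11]  # three replicate emPAI scores, sample 3

t_stat, p_value = ttest_ind(sample1, sample3)
print(t_stat, p_value)
```

ttest_ind runs a two-sided independent-samples t-test and returns the t statistic and p value in one call, so each protein row needs only its two lists of replicate scores.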
- Got a similar range of pi0 values from the Python t-test, so turned to other methods.
- To see if the problem was due to not normalizing the emPAI, decided to normalize the emPAI scores by dividing each individual emPAI score by the sum of all emPAI scores from a replicate in a sample.
- Changed the empai script accordingly and ran empai -> avgEmpai -> mergeEmpai to get a new file for the t-test.
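The per-replicate normalization described above can be sketched as follows (raw emPAI values are made up):

```python
# Sketch of the normalization: each emPAI score divided by the sum of all
# emPAI scores in that replicate, so scores within a replicate sum to 1.
replicate = {"P01": 2.0, "P02": 1.0, "P03": 1.0}  # protein -> raw emPAI

total = sum(replicate.values())
normalized = {prot: score / total for prot, score in replicate.items()}

print(normalized["P01"])  # 2.0 / 4.0 = 0.5
```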
- Got the below graph for the p-value vs pi0 plot.
- On checking q values, found no values having a q value less than 0.01.
- Tried NSAF to see if the problems could be resolved.
Normalized Spectral Abundance Factor
- calculated as the number of spectral counts (SpC) identifying a protein, divided by the protein’s length (L), divided by the sum of SpC/L for all proteins in the experiment
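The NSAF definition above can be written out directly (the spectral counts and lengths here are made-up example numbers):

```python
# Minimal NSAF sketch following the definition above: (SpC / L) for each
# protein, divided by the sum of SpC / L over all proteins.
spectral_counts = {"P01": 10, "P02": 4}  # protein -> spectral count (SpC)
lengths = {"P01": 100, "P02": 200}       # protein -> length in amino acids (L)

saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
total = sum(saf.values())
nsaf = {p: saf[p] / total for p in saf}

print(nsaf["P01"])  # 0.1 / (0.1 + 0.02) ≈ 0.8333
```

By construction the NSAF values sum to 1 over all proteins in the experiment, which is what makes them comparable across runs.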
#####NSAF
######Step 1 Run to get peptide counts for each protein
awk -F"\t" '{if(sprintf("%.8f",$10)<0.01) { print $8 }}' ID_1A.tsv |awk -F, '!z[$1]++{ a[$1]=$0; } END {for (i in a) printf("%s\t%s\n",a[i], z[i])}'>1A_nsaf_fil.tsv
######Step 2 Run the python script to calculate the nsaf for 1 replicate of a sample
python nsaf
######Step 3 Run the python script to merge the nsaf of 3 replicates of a sample into one file. Replace file names
python avgEmpai
######Step 4 Run the python script to merge the nsaf of 2 samples into one file. Input from avgEmpai. Replace file names
python mergeEmpai
###->13/03/2015
Continued working on emPAI
#####emPAI
######Step 4 Count observed peptides: counted the number of unique peptides identified for each protein
awk -F"\t" '{print $8}' 1A_q_dupRem.tsv|awk -F, '!z[$1]++{a[$1]=$0;} END { for (i in a) printf("%s\t%s\n", a[i], z[i])}'>1A_countObserved.tsv
######Step 5 Calculate empai for each replicate in each run: Change the file names in the python script 'empai'. The script is given in the folder.
python empai
######Step 6 Calculate average empai for each sample: Average the empai score for 3 replicates of 1 sample. The proteins not common to all 3 replicates were discarded
python avgEmpai
######Step 7 Merge the emPAI values for 2 given runs to compare and do statistical analysis
python mergeEmpai
#####Problems
- Group decided not to average the empai values for the 3 replicates in a run, so that a t-test could be run on the individual replicate values. Changed avgEmpai accordingly.
- Changed mergeEmpai to accommodate 6 values for the 2 runs accordingly.
###->12/03/2015
Met up with Prof. Käll at Scilife to discuss the project. He recommended that we look at the log of the intensity values, calculated with the formula
(1/N) * Σ_{i=1..N} log( mean of peptide i in run 1 / mean of peptide i in run 2 ), where N = number of peptides
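The mean log-ratio formula can be sketched like this (log base 2 is an assumption, since the note does not fix a base; the peptide means are made up):

```python
# Sketch of the mean log-ratio per protein: average of log(mean intensity in
# run 1 / mean intensity in run 2) over all N peptides of the protein.
import math

run1_means = [4.0, 8.0, 2.0]  # mean intensity of peptide i in run 1
run2_means = [2.0, 2.0, 2.0]  # mean intensity of peptide i in run 2

log_ratios = [math.log2(a / b) for a, b in zip(run1_means, run2_means)]
protein_log_ratio = sum(log_ratios) / len(log_ratios)

print(protein_log_ratio)  # (1 + 2 + 0) / 3 = 1.0
```

Averaging in log space keeps a single very large or very small peptide ratio from dominating the protein-level estimate, which is presumably why the log was recommended.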
He also suggested that we try emPAI (Exponentially Modified Protein Abundance Index) as a measure of spectral counts.
Started working on calculating emPAI.
emPAI(prot) = 10^(number of observed peptides / number of observable peptides) - 1
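The emPAI formula translates directly into a one-line function:

```python
# Sketch of the emPAI formula: 10^(observed / observable) - 1 per protein.
def empai(n_observed, n_observable):
    """emPAI for one protein given its observed and observable peptide counts."""
    return 10 ** (n_observed / n_observable) - 1

print(empai(5, 10))  # 10^0.5 - 1 ≈ 2.1623
```

The subtraction of 1 makes emPAI zero when no peptides are observed, and the exponentiation undoes the roughly logarithmic relationship between observed peptide fraction and protein amount.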
#####emPAI
######Step 0
Downloaded the ftp://[email protected]/distro/fasta/iPRG2015.fasta file for the protein sequence and length.
Used the trypsin digest Python script from https://github.com/yafeng/trypsin to get a list of peptides from the digestion of the proteins in the fasta file. This script was developed by Yafeng Zhu (http://ki.se/people/yafzhu), a Ph.D. student at Karolinska.
The output list contains the protein ID in the first column and the corresponding tryptic peptides in the second column, calculated allowing 1 missed cleavage.
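A simplified sketch of the standard trypsin rule such a script implements (cleave after K or R, but not before P), with missed cleavages; this is not Yafeng Zhu's actual code:

```python
import re

def tryptic_peptides(sequence, missed_cleavages=1):
    """Simplified in-silico trypsin digest: cleave after K or R unless the
    next residue is P; include peptides with up to `missed_cleavages`."""
    # Zero-width split: after K/R (lookbehind), not before P (lookahead).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = set()
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides

print(tryptic_peptides("MKRPAK"))  # {'MK', 'RPAK', 'MKRPAK'}
```

Note that "RP" is not cleaved, so "RPAK" stays intact; with 1 missed cleavage the fully uncleaved "MKRPAK" is also emitted.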
######Step 1 Filter: get only the peptide measurements for which the q value was less than 0.01. The ID files were downloaded from ftp://ftp.peptideatlas.org/distro/id/tsv/
awk -F"\t" '{if(NR==1){print} else if(sprintf("%.8f",$10)<0.01) { print }}' ID_1A.tsv > ID_1A_qfilt.tsv
######Step 2 Remove [0-9] modification sites in peptides for comparison against digested peptides: some peptides have amino acid modifications in square brackets.
sed -i 's/\(\[[0-9]*\]\)//g' ID_1A_qfilt.tsv
######Step 3 Remove peptide duplicates: removed duplicated peptides to get the number of unique peptides identified for a protein
awk -F"\t" '!seen[$2]++' ID_1A_qfilt.tsv > 1A_q_dupRem.tsv
###->10/03/2015
*Skip header(NR==1) and remove all peptide readings with NA qvalue in the ftp://[email protected]/distro/intensity/SkylineIntensities_unique.tsv file
awk -F"\t" '{if(NR==1){print} else if(match($12,"NA")==0) { print $12 }}' SkylineIntensities_unique.tsv>inter.tsv
*Changed qvalue filter in the awk. It was not accounting for values with exponential notation (xxxE-yyy)
awk -F"\t" '{if(NR==1){print} else if(sprintf("%.8f",$12)<0.01) { print }}' inter.tsv >filtered_intensities.tsv
*Used above to refilter SkylineIntensities_unique.tsv to get correct q values.
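The awk fix above works by converting the field to a fixed-point string with sprintf before the numeric comparison; in Python, float() parses exponential notation directly, so the same filter is one comprehension (rows below are made up):

```python
# Toy sketch of the q-value filter: float() handles "3.5E-04" correctly,
# unlike a plain string comparison against "0.01".
header, *rows = [
    "Peptide\tqvalue",
    "peptide_a\t3.5E-04",
    "peptide_b\t2.0E-01",
]
kept = [r for r in rows if float(r.split("\t")[1]) < 0.01]
print(kept)  # only peptide_a passes the 0.01 cutoff
```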
- Tried to find out how to include math equations in GitHub. It needs something called MathJax. Tried to figure it out but couldn't get a handle on it. Will learn it later when I have time.
###->09/03/2015
- Read the paper "A comparative analysis of computational approaches to relative protein quantification using peptide peak intensities in label-free LC-MS proteomics experiments" by Matzke et al. for info on precursor intensity methods.
- Decided we could use an additive approach.
- Downloaded MSstats to work on a linear model. Read up on the data formats it accepts.
- Downloaded R to the machine and loaded the MSstats module.
- This is the function to use for processing the data as a preliminary step. Played around with it.
dataProcess(a, logTrans=2, normalization="equalizeMedians",nameStandards=NULL, betweenRunInterferenceScore=FALSE,fillIncompleteRows=FALSE,FeatureSelection=FALSE,lambda=seq(0,1.4,0.1),eta=0.05, address="")
###->01/03/2015
Got the local repository to work! Was much easier than anticipated.
Created tsv files with the following fields (Protein Name, Peptide Count, Protein Length, Normalized Count) using awk and Python:
- Protein Name - name of the protein as in the fasta file
- Peptide Count - number of peptides for each protein as in the tsv files downloaded from the 'id/' directory of the iprg site
- Protein Length - length of the protein in amino acids, from the fasta file on the iprg website
- Normalized Count - Peptide Count / Protein Length for each protein
#####Input: 12 tsv files from iprg site
The tsv files (ID_1A.tsv and so on) were parsed with the following awk command.
awk -F"\t" '$10<0.01 { print $8 }' ID_1A.tsv |awk -F, '!z[$1]++{ a[$1]=$0; } END {for (i in a) printf("%s\t%s\n",a[i], z[i])}'>ID_1A_fil.tsv
It extracts the protein name based on its q-value being less than 0.01, then pipes it to another awk snippet which counts occurrences per protein name, removes duplicates, and writes the output to a fileName_fil.tsv file.
Then a tiny Python script uses Biopython to parse the fasta file to get the protein length in amino acids, compare with the filtered tsv file, calculate Peptide Count / Protein Length, and write to a tsv file.
#####Output:12 tsv files with normalized spectral count
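The length-normalization step can be sketched like this; the project used Biopython's SeqIO for the FASTA parsing, but a minimal hand-rolled parser is substituted here so the example is self-contained, and the protein ID and count are made up:

```python
# Sketch: protein length from a FASTA, then Peptide Count / Protein Length.
def fasta_lengths(lines):
    """Map protein ID (first word of the header) -> sequence length."""
    lengths, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name:
            lengths[name] += len(line)
    return lengths

fasta = [">sp|P01|TEST", "MKTAYIAKQR", "QISFVK"]  # 16 residues total
peptide_counts = {"sp|P01|TEST": 4}  # hypothetical count from the filtered tsv

lengths = fasta_lengths(fasta)
normalized = {p: peptide_counts[p] / lengths[p] for p in peptide_counts}
print(normalized)  # {'sp|P01|TEST': 0.25}
```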
Trying to find how to upload files to github. Turns out you need a local repository.
###->27/02/2015
Worked on getting Crux to work. Failed miserably. Should try to change the format of the input files. Decided to work with Linux commands and Python to get a spectral count normalized with protein length. Will try to figure out other measures later.
The Swedish version of Excel on the school computers uses a decimal comma rather than a decimal point. Needed to change the files for them to read correctly. Why can't we have a global standard?
Sent and received the below mail from Lukas
We want to use something like 3 pipelines from the literature and one of our own in order to compare the results from our own pipeline with the ones from literature and then try to start from the raw data and do the identification ourselves and then run the same pipelines on that to see how the results differ. We just want to check if you are ok with this plan? Do you want us to do less literature pipelines and more of our own? do you think this plan seems reasonable?
Sure, you are free to select to implement quite a number of pipelines.
In the instructions I wrote: "You are free to select if you would like to start-out by reprocessing the raw data from scratch, or if you would like to live with the organizers processing. The real challenge in this exercise is how to think about how to combine the different quantitative levels of peptides into quantitative levels of proteins."
I mean that. That implies that the selection of pipeline is less exciting than the selection of strategy to infer quantitative levels of proteins from peptides. I do not know how I can say that more clearly. Please stop reading this email and ask more questions if you do not get this.
Also, do you have any particular software you would like to see us try? We've found quite a number of toolkits online and are trying to pick some that are representative of what people actually use. So far we are looking at Crux, OpenMS, Apex, and the Trans-Proteomic Pipeline (TPP).
Anything you like would go here. Including the option to just use the organizers results.
Another problem is handling the very large files, especially if we are going to try to use the raw data files (12 files of 1.3-1.4 GB each). Do you have any advice on this front? Is there any way to run the files without having to download them to the hard drive?
A "wget" command issued on the server where you want the data should do the trick.
Also, most of the software we've looked at require administrator rights to install on school computers, we have personal laptops but as we mentioned the file sizes are a bit too much for our laptops. Is there a way to install select software on the school computers using admin rights temporarily for the project?
Most of the software can be installed in the home directory. Which software do not allow you to do so?
Lastly, How far are we expected to have reached in the project for the comparison of experiences seminar on Tuesday?
It would be good if you could have one ready result, so that you can start to think about alternative ways to combine peptides.
Thanks -Lukas
###->26/02/2015
Selected Normalized Spectral Abundance Factor as one of the measures to calculate relative abundance. Referred to the paper below, which mentioned a software package called Crux. Crux has a tool called spectral-count which could be used to find the NSAF.
Ran into problems with the file format. Crux spectral-tools requires input files with specific format (got error msg 'q-values needed in percolator, q-ranker, barista format')
Decided to ask suggestions for software from the professor.
Another tool we looked at was OpenMS, which can be used to try one of the peak intensity methods. Ran into problems with installation. Another problem was the huge size of the mzML files. Decided to ask the professor about file sizes as well.
#####Paper
Estimating relative abundances of proteins from shotgun proteomics data - Sean McIlwain, Michael Mathews, Michael S Bereman, Edwin W Rubel, Michael J MacCoss and William Stafford Noble
Note: Found the professor's paper as the reference for the percolator algorithm.
###->24/02/2015
Met up for discussion. The team decided to each select a project pipeline from the literature and discuss on 26/02.
The following papers were circulated
- A New Approach to Evaluating Statistical Significance of Spectral Identifications by Hosein Mohimani, Sangtae Kim, and Pavel A. Pevzner
- An Automated Pipeline for High-Throughput Label-Free Quantitative Proteomics by Hendrik Weisser, Sven Nahnsen, Jonas Grossmann, Lars Nilse, Andreas Quandt, Hendrik Brauer, Marc Sturm, Erhan Kenar, Oliver Kohlbacher, Ruedi Aebersold, and Lars Malmström
###->20/02/2015
Project overview was presented.
###->19/02/2015
The project 'Differential Abundance Analysis in Label-Free Quantitative Proteomics' was allocated to the group (James Ericson, Angeliki Maraki, Anandashankar Anil). It is based on a study of LC-MS/MS data analysis conducted by the Proteome Informatics Research Group of ABRF. The presentation for the project overview was prepared.