-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathProjectNotebook
126 lines (89 loc) · 6.25 KB
/
ProjectNotebook
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
See wiki page for further updates
->01/03/2015
Got the local repository to work! Was much easier than anticipated.
Created tsv files with following fields (Protein Name, Peptide Count, Protein Length, Normalized Count) with awk and python
Protein Name - Name of protein as in fasta file
Peptide count - Number of peptides for each protein as in the tsv files downloaded from the 'id/' of the iprg site.
Protein length - length of protein in amino acids as from fasta file from iprg website
Normalized count - Peptide count/ Protein Length for each protein.
Input: 12 tsv files from iprg site
The tsv files (~ID_1A.tsv and so on) were parsed with the foll awk command.
awk -F"\t" '$10<0.01 { print $8 }' ID_1A.tsv |awk -F, '!z[$1]++{ a[$1]=$0; } END {for (i in a) printf("%s\t%s\n",a[i], z[i])}'>ID_1A_fil.tsv and
It extracts the protein name based on its qvalue being less than 0.01 then pipes it to another awk snippet which enumerates on protein name, deletes duplicates and writes the output to fileName_fil.tsv file.
Then a tiny python script using Biopython to parse the fasta file to get protein length in amino acids, compare with the filtered tsv file,
calculate PeptideCount/ProteinLength and write to a tsv file.
Output:12 tsv files with normalized spectral count
Trying to find how to upload files to github. Turns out you need a local repository.
->27/02/2015
Worked on getting Crux to work. Failed miserably. Should try to get to change format of the input files. Decided to work with
Linux commands and Python to get a Spectral count normalized with protein length. Will try to figure out out measures later.
The Swedish version of Excel on the school computers uses a decimal comma rather than a decimal point. Needed to change the files to read
Why can't we have a global standard?
Sent and received the below mail from Lukas
> We want to use something like 3 pipelines from the literature and one of
> our own in order to compare the results from our own pipeline with the ones
> from literature and then try to start from the raw data and do the
> identification ourselves and then run the same pipelines on that to see how
> the results differ. We just want to check if you are ok with this plan? Do
> you want us to do less literature pipelines and more of our own? do you
> think this plan seems reasonable?
Sure, you are free to select to implement quite a number of pipelines.
In the instructions I wrote: "You are free to select if you would like
to start-out by reprocessing the raw data from scratch, or if you
would like to live with the organizers processing. The real challenge
in this exercise is how to think about how to combine the different
quantitative levels of peptides into quantitative levels of proteins."
I mean that. That implies that the selection of pipe-line is less
exciting than the selection of strategy to infer quantative levels of
proteins from peptides. I do not know how I can say that more clearly.
Please stop reading this email and ask more questions if you do not
get this.
> Also, do you have any particular software you would like to see us try?
> We've found quite a number of toolkits online and are trying to pick some
> that are representative of what people actually use. So far we are looking
> at Crux, OpenMS, Apex, and Trans proteomic pipeline TPP.
Anything you like would go here. Including the option to just use the
organizers results.
> Another problem is handling the very large files, especially if we are going
> to try to use the raw data files (12 1,3-1,4 gig files). Do you have any
> advice on this front? Is there any way to run the files without having to
> download them to hard drive?
A "wget" command issued on the server where you want the data should
do the trick.
> Also, most of the software we've looked at require administrator rights to
> install on school computers, we have personal laptops but as we mentioned
> the file sizes are a bit too much for our laptops. Is there a way to install
> select software on the school computers using admin rights temporarily for
> the project?
Most of the software can be installed in the home directory. Which
software do not allow you to do so?
> Lastly, How far are we expected to have reached in the project for the
> comparison of experiences seminar on Tuesday?
It would be good if you could have one ready result, so that you can
start to think about alternative ways to combine peptides.
Thanks
-Lukas
->26/02/2015
Selected Normalized Spectral Abundance Factor as one of the measures to calculate the relative abundance.
The below paper was referred to which contained a software package called Crux.
Crux has a tool called spectral-count which could be used to find the NSAF.
Ran into problems with the file format. Crux spectral-tools requires input files with specific format (got error msg 'q-values needed in percolator, q-ranker, barista format')
Decided to ask suggestions for software from the professor.
Another tool we looked at was OpenMS which can be used to try one of the peak intensity methods.
Ran into problems with installation. Another problem was huge sizes of the mxMl files. Decided to ask the professor about file sizes as well.
Paper
Estimating relative abundances of proteins from shotgun proteomics data - Sean McIlwain, Michael Mathews, Michael S Bereman, Edwin W Rubel, Michael J MacCoss and William Stafford Noble
->24/02/2015
Met up for discussion. The team decided to each select a project pipeline from the literature and discuss on 26/02.
The following papers were circulated
-A New Approach to Evaluating Statistical Significance of Spectral Identifications
Hosein Mohimani, Sangtae Kim, and Pavel A. Pevzne
-An Automated Pipeline for High-Throughput Label-Free Quantitative Proteomics
Hendrik Weisser, Sven Nahnsen, Jonas Grossmann, Lars Nilse, Andreas Quandt, Hendrik Brauer,Marc Sturm, Erhan Kenar,
Oliver Kohlbacher, Ruedi Aebersold, and Lars Malmström
->20/02/2015
Project Overview was presented.
->19/02/2015
The project 'Differential Abundance Analysis in Label-Free Quantitative Proteomics' was allocated to group (James Ericson, Angeliki Maraki, aanil).
It is based on a study of LC-MS/MS data analysis conducted by the Proteome Informatics Research group of ABRF.
The presentation for the project overview was prepared.