How to prepare a protein database
Proteomics pipelines and toolkits like Philosopher rely on properly formatted protein sequence databases to correctly identify peptides. Here are some tips on how to prepare a protein database for your experiment.
To download a database from UniProt, run Philosopher from the command line by executing the following two commands:
philosopher workspace --init
philosopher database --reviewed --contam --id UP000005640
This will generate a human UniProt/Swiss-Prot database (i.e. reviewed sequences only), with common contaminants and decoys added (the default decoy prefix is rev_). If you would like to use the full UniProt proteome, including unreviewed sequences, remove the --reviewed flag.
For mouse, for example, use the proteome ID UP000000589. To find the proteome ID for other organisms, search within the UniProt proteomes.
To combine multiple proteomes, provide a comma-separated list, e.g.:
philosopher workspace --init
philosopher database --reviewed --contam --id UP000005640,UP000000625,UP000002311
This generates a database with the human, E. coli, and yeast proteomes (in the order the IDs are listed).
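If you script this step, the comma-separated ID list can be assembled programmatically. The following is a minimal Python sketch, not part of Philosopher itself: the helper name `database_command` and the ID pattern ("UP" followed by digits, inferred from the IDs shown above) are assumptions made for illustration. It only uses the flags already shown in this guide.

```python
import re

# Assumption: proteome IDs look like the ones in this guide ("UP" + digits).
PROTEOME_ID = re.compile(r"^UP\d+$")


def database_command(proteome_ids, reviewed=True, contam=True):
    """Build the argument list for `philosopher database` (hypothetical helper)."""
    bad = [p for p in proteome_ids if not PROTEOME_ID.match(p)]
    if bad:
        raise ValueError(f"unexpected proteome ID(s): {bad}")
    cmd = ["philosopher", "database"]
    if reviewed:
        cmd.append("--reviewed")
    if contam:
        cmd.append("--contam")
    # Multiple proteomes are passed as one comma-separated value, without spaces.
    cmd += ["--id", ",".join(proteome_ids)]
    return cmd


cmd = database_command(["UP000005640", "UP000000625", "UP000002311"])
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in a directory where `philosopher workspace --init` has already been run.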
To add decoys and contaminants to a custom FASTA file and format it for FragPipe/Philosopher, use the following commands:
philosopher workspace --init
philosopher database --custom <file_name> --contam
To reformat an existing database file for FragPipe, use the following commands:
philosopher workspace --init
philosopher database --annotate <file_name> --prefix <prefix>
If you need to run the --custom or the --annotate command, you may want to manually inspect the formatted file to ensure it is compatible with Philosopher. Each header should follow one of these formats (see the example for each):
- UniProt:
>sp|P02489|CRYAA_HUMAN Alpha-crystallin A chain OS=Homo sapiens OX=9606 GN=CRYAA PE=1 SV=2
- NCBI:
>NP_000385.1 alpha-crystallin A chain isoform 1 [Homo sapiens]
- ENSEMBL:
>ENSP00000291554.2 pep chromosome:GRCh38:21:43169008:43172805:1 gene:ENSG00000160202.7 transcript:ENST00000291554.6 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CRYAA description:crystallin alpha A [Source:HGNC Symbol;Acc:HGNC:2388]
Note: the protein description text (e.g. "crystallin alpha A") should not contain commas or special characters, as these may cause Philosopher to parse the entry incorrectly.
- Generic:
>P02489
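Manual inspection can be supplemented with a quick script. The sketch below classifies FASTA headers against rough patterns inferred from the example headers above. The regular expressions and the helper names `classify_header` and `ensembl_description` are assumptions made for illustration; they are not Philosopher's own parsing rules, and only commas are checked, not other special characters. The ENSEMBL example is shortened for readability.

```python
import re

# Rough patterns inferred from the example headers (assumptions, not Philosopher's rules).
# Order matters: the catch-all "generic" pattern is tried last.
PATTERNS = [
    ("UniProt", re.compile(r"^>(sp|tr)\|[A-Z0-9-]+\|\w+ .*OS=")),
    ("NCBI", re.compile(r"^>[A-Z]{2}_\d+\.\d+ .*\[.+\]$")),
    ("ENSEMBL", re.compile(r"^>ENSP\d+(\.\d+)? pep ")),
    ("generic", re.compile(r"^>\S+$")),
]


def classify_header(header):
    """Return the name of the first matching format, or None if nothing matches."""
    for name, pattern in PATTERNS:
        if pattern.match(header):
            return name
    return None


def ensembl_description(header):
    """Extract the text after 'description:' up to ' [Source:' (or end of line)."""
    m = re.search(r"description:(.*?)(?: \[Source:|$)", header)
    return m.group(1) if m else ""


examples = [
    ">sp|P02489|CRYAA_HUMAN Alpha-crystallin A chain OS=Homo sapiens OX=9606 GN=CRYAA PE=1 SV=2",
    ">NP_000385.1 alpha-crystallin A chain isoform 1 [Homo sapiens]",
    ">ENSP00000291554.2 pep gene_symbol:CRYAA description:crystallin alpha A [Source:HGNC Symbol;Acc:HGNC:2388]",
    ">P02489",
]

for h in examples:
    fmt = classify_header(h)
    # Flag ENSEMBL descriptions that contain a comma, per the note above.
    comma = fmt == "ENSEMBL" and "," in ensembl_description(h)
    print(fmt, "(comma in description)" if comma else "")
```

A header that returns `None` is worth checking by hand before running the search.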
If you are adding your own decoys, they also need to follow a specific format: each decoy must be a complete protein entry in the FASTA file, with a decoy prefix (e.g. rev_ or DECOY_) added at the very beginning of the header.
Examples of compatible decoy formats:
>rev_tr|J3KNE0|J3KNE0_HUMAN
>DECOY_tr|J3KNE0|J3KNE0_HUMAN
Examples of incompatible decoy formats:
>tr_REVERSED|J3KNE0|J3KNE0_HUMAN
>tr|fake_J3KNE0|J3KNE0_HUMAN RanBP2-like
>tr|J3KNE0_DECOY|J3KNE0_HUMAN
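The difference between the two groups can be illustrated with a short Python sketch. It assumes decoys are recognized purely by a prefix match at the start of the header, which is an assumption about how the prefix is consumed rather than a description of Philosopher's code; the helper names are hypothetical.

```python
def has_decoy_prefix(header, prefix="rev_"):
    """True if the decoy prefix comes immediately after the '>' of a FASTA header."""
    return header.startswith(">" + prefix)


def split_targets_and_decoys(headers, prefix="rev_"):
    """Partition headers into (targets, decoys) using a simple prefix match."""
    decoys = [h for h in headers if has_decoy_prefix(h, prefix)]
    targets = [h for h in headers if not has_decoy_prefix(h, prefix)]
    return targets, decoys


compatible = [
    ">rev_tr|J3KNE0|J3KNE0_HUMAN",
    ">DECOY_tr|J3KNE0|J3KNE0_HUMAN",
]
incompatible = [
    ">tr_REVERSED|J3KNE0|J3KNE0_HUMAN",
    ">tr|fake_J3KNE0|J3KNE0_HUMAN RanBP2-like",
    ">tr|J3KNE0_DECOY|J3KNE0_HUMAN",
]

print(has_decoy_prefix(compatible[0], "rev_"))
print(has_decoy_prefix(compatible[1], "DECOY_"))
# Under a prefix match, every incompatible header is counted as a target.
targets, decoys = split_targets_and_decoys(incompatible, "rev_")
print(len(targets), len(decoys))
```

Under this assumption, a decoy marker placed anywhere other than the start of the header goes unnoticed, so those entries would be treated as target proteins.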