AnnoTater
AnnoTater is a whole or partial genome functional annotation workflow built using Nextflow. It takes a set of protein-coding gene sequences (in either nucleotide or protein FASTA format) and runs InterProScan and BLAST searches against UniProt SwissProt, NCBI NR, NCBI RefSeq, OrthoDB and StringDB to provide a first-pass set of annotations for genes.
AnnoTater is constructed using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.
AnnoTater provides the following steps:
- Homology searching against specified databases using Diamond BLAST (Diamond). Supported databases include:
  - NCBI nr
  - NCBI RefSeq
  - ExPASy SwissProt
  - ExPASy TrEMBL
  - STRING database
- Execution of InterProScan
- Download databases. AnnoTater requires local copies of the databases. These can take quite a while to download and can consume large amounts of storage. Use the bash scripts in the scripts folder to retrieve and index the databases prior to using this workflow.
- Install Nextflow (>=21.10.3).
- Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (Conda is currently not supported; see docs).
- Download the pipeline and test it on a minimal dataset with a single command:
      nextflow run systemsgenetics/annotater -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
  - Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
  - If you are using singularity, the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues, please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the nf-core download command to pre-download all of the required containers before running the pipeline and to set the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
- Start running your own analysis!
      nextflow run systemsgenetics/annotater \
        -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
        --batch_size 100 \
        --input <fasta file> \
        --data_sprot <directory with swissprot diamond index> \
        --data_refseq <directory with refseq diamond index> \
        --data_ipr <directory with InterProScan data> \
        --max_cpus 10 \
        --max_memory 6GB
  - The --batch_size argument indicates the number of sequences to process in each batch.
  - If using NCBI nr, it is recommended to set a sufficiently large --max_memory value.
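To get a feel for how --batch_size translates into work units, the short shell sketch below counts the sequences in a FASTA file and estimates how many batches a given batch size would produce. It assumes batches are filled in order and that the last one may be smaller (ceiling division). That is an assumption about the pipeline, not documented behavior. The file name example.fasta and its three toy sequences are hypothetical.

```bash
# Hypothetical example input; replace with your own FASTA file.
fasta="example.fasta"
batch_size=100

# Create a tiny FASTA file so the sketch runs on its own.
printf '>seq1\nMKT\n>seq2\nMLA\n>seq3\nMGG\n' > "$fasta"

# Each FASTA record starts with a ">" header line, so counting
# those lines gives the number of sequences.
n_seqs=$(grep -c '^>' "$fasta")

# Ceiling division: the last batch may hold fewer than batch_size sequences.
n_batches=$(( (n_seqs + batch_size - 1) / batch_size ))

echo "sequences=$n_seqs batches=$n_batches"
```

Running the same count on a real input before launching gives a rough idea of how many batches to expect for a chosen batch size.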
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
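As a sketch of the -params-file route, the parameters from the example command can be written to a YAML file and passed with a single option. The file name params.yaml and all paths are placeholders, not files shipped with the pipeline. The keys mirror the command-line flags from the example run command.

```bash
# Write pipeline parameters to a YAML file.
# All paths below are hypothetical placeholders; adapt them to your system.
cat > params.yaml <<'EOF'
batch_size: 100
input: "/path/to/sequences.fasta"
data_sprot: "/path/to/swissprot_diamond_index"
data_refseq: "/path/to/refseq_diamond_index"
data_ipr: "/path/to/interproscan_data"
max_cpus: 10
max_memory: "6GB"
EOF

# Launch the pipeline with the file instead of individual flags.
# Uncomment once Nextflow and a container engine are installed;
# docker is used here only as an example profile.
# nextflow run systemsgenetics/annotater -profile docker -params-file params.yaml
```

Keeping parameters in a file like this makes a run easier to repeat and to share than a long command line.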
AnnoTater was written by the Ficklin Computational Biology Team at Washington State University. Development of AnnoTater was initially funded by the U.S. National Science Foundation (NSF) Award #1659300.
If you would like to contribute to this pipeline, please see the contributing guidelines.
AnnoTater is currently unpublished. For now, please use the GitHub URL when referencing. An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.