AnnoTater

Introduction

AnnoTater is a whole or partial genome functional annotation workflow built using Nextflow. It takes a set of protein-coding gene sequences (in either nucleotide or protein FASTA format) and runs InterProScan and BLAST searches against UniProt SwissProt, NCBI NR, NCBI RefSeq, OrthoDB and StringDB to provide a first-pass set of annotations for genes.

AnnoTater is built with Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, which makes installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.

AnnoTater provides the following steps:

  1. Homology searching against specified databases using Diamond BLAST (Diamond). Supported databases include:
    • NCBI nr
    • NCBI RefSeq
    • ExPASy SwissProt
    • ExPASy TrEMBL
    • STRING database
  2. Execution of InterProScan

Usage

  1. Download databases. AnnoTater requires the databases to be available locally. They can take quite a while to download and can consume large amounts of storage. Use the bash scripts in the scripts folder to retrieve and index the databases before running this workflow.

  2. Install Nextflow (>=21.10.3)

  3. Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (Conda is currently not supported; see docs).

  4. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run systemsgenetics/annotater -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
    • Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
    • If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the nf-core download command to pre-download all of the required containers before running the pipeline and to set the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
  5. Start running your own analysis!

    nextflow run systemsgenetics/annotater \
        -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
        --batch_size 100 \
        --input <fasta file> \
        --data_sprot <directory with swissprot diamond index> \
        --data_refseq <directory with refseq diamond index> \
        --data_ipr <directory with InterProScan data> \
        --max_cpus 10 \
        --max_memory 6GB
    
  • The --batch_size argument sets the number of sequences to process in each batch.
  • When using NCBI nr, it is recommended to set a sufficiently large --max_memory value.
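
To get a feel for how --batch_size relates to the size of an input file, the sketch below counts the sequences in a FASTA file and estimates the number of batches. It is an illustration only, not the pipeline's own code: the function name count_batches is hypothetical, and it assumes each batch holds up to --batch_size sequences, with any remainder forming a final smaller batch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate how many batches a FASTA file would yield.
# Assumption: a batch holds up to batch_size sequences, so the estimate is
# ceil(number_of_sequences / batch_size).
count_batches() {
    local fasta="$1" batch_size="$2"
    local n_seqs
    # Every FASTA record starts with a '>' header line.
    n_seqs=$(grep -c '^>' "$fasta")
    # Integer ceiling division.
    echo $(( (n_seqs + batch_size - 1) / batch_size ))
}
```

For example, calling count_batches on a FASTA file with 250 sequences and a batch size of 100 prints 3.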

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
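
As a minimal sketch of the -params-file route, the script below writes the parameters from the usage example above into a YAML file and prints the matching command. The file name params.yaml and all paths are placeholders chosen for illustration, and the docker profile is just one of the supported options; substitute your own values.

```shell
#!/usr/bin/env bash
# Write pipeline parameters to a YAML file (paths are placeholders).
cat > params.yaml <<'EOF'
batch_size: 100
input: "/path/to/proteins.fasta"
data_sprot: "/path/to/swissprot_diamond_index"
data_refseq: "/path/to/refseq_diamond_index"
data_ipr: "/path/to/interproscan_data"
max_cpus: 10
max_memory: "6GB"
EOF

# Print the command rather than running it, so it can be reviewed first.
echo "nextflow run systemsgenetics/annotater -profile docker -params-file params.yaml"
```

Keeping parameters in a file also leaves a record of the exact settings used for a given run.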

Credits

AnnoTater was written by the Ficklin Computational Biology Team at Washington State University. Development of AnnoTater was initially funded by the U.S. National Science Foundation (NSF) Award #1659300.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

AnnoTater is currently unpublished. For now, please use the GitHub URL when referencing. An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.