version 1.0 8 July 2024
Giant viruses are abundant and diverse and frequently found in environmental microbiomes. GVClass assigns taxonomy to putative giant virus contigs or metagenome assembled genomes (GVMAGs). It uses a conservative approach based on the consensus of single protein trees built from giant virus orthologous groups (GVOGs), additional Mirusvirus, Mryavirus and Poxvirus hallmark genes and cellular single copy panorthologs. Genome completeness and contamination is then estimated based on copy numbers of a larger set of genes typically conserved in single copy at order-level.
- Input is a directory that contains single contigs or MAGs as nucleic acid (fna) or proteins (faa)
- File extensions .fna or .faa
- Recommended length for assembly size is 50kb, but at least 20kb
- No special characters (".", ";", ":") in filebase name, "_" or "-" are okay
- Recommended sequence header format if faa provided: |
- Input will be checked and reformatted if necessary
- Upload you metagenome assembled genome or single contig to IMG/VR using the GVClass feature
Using containers is the recommended way of running GVClass.
- Use the provided
gvclass_apptainer.sh
script. Thequerydir
should be located under the current working directory. For testing, use theexample
dir (available in this repository) as thequerydir
.
bash gvclass_apptainer.sh <querydir> <n processes>
- Alternatively, run Apptainer directly:
PROCESSES=<number of processes, e.g. 8>
QUERYDIR=<dir with query genomes, e.g. example>
apptainer run --containall --bind $(pwd):/workdir --pwd /workdir \
docker://docker.io/doejgi/gvclass:latest \
snakemake --snakefile /gvclass/workflow/Snakefile \
-j $PROCESSES \
--use-conda \
--conda-frontend mamba \
--conda-prefix /gvclass/.snakemake/conda \
--config querydir="/workdir/$QUERYDIR" \
database_path="/gvclass/resources"
- Use the provided
gvclass_docker.sh
script. Thequerydir
should be located under the current working directory. For testing, use theexample
dir (available in this repository) as thequerydir
.
bash gvclass_docker.sh <querydir> <n processes>
- Alternatively, run Docker directly:
PROCESSES=<number of processes, e.g. 8>
QUERYDIR=<dir with query genomes, e.g. example>
docker run -v $(pwd):$(pwd) -w $(pwd) doejgi/gvclass:latest \
snakemake --snakefile /gvclass/workflow/Snakefile \
-j $PROCESSES \
--use-conda \
--conda-frontend mamba \
--conda-prefix /gvclass/.snakemake/conda \
--config querydir="$QUERYDIR" \
database_path="/gvclass/resources"
- Use the provided
gvclass_shifter
script. Thequerydir
should be located under the current working directory. For testing, use theexample
dir (available in this repository) as thequerydir
.
bash gvclass_shifter.sh <querydir> <n processes>
- Alternatively, run Shifter directly:
PROCESSES=<number of processes, e.g. 8>
QUERYDIR=<dir with query genomes, e.g. example>
shifterimg pull docker:doejgi/gvclass:latest
shifter --image=docker:doejgi/gvclass:latest \
snakemake --snakefile /gvclass/workflow/Snakefile \
-j $PROCESSES \
--use-conda \
--conda-frontend mamba \
--conda-prefix /gvclass/.snakemake/conda \
--config querydir="$QUERYDIR" \
database_path="/gvclass/resources"
- First, install a conda environment with snakemake, check here: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
- Clone the repository
git clone --recurse-submodules https://github.com/NeLLi-team/gvclass
- Activate snakemake (8.14.0) conda environment, install cython and pyrodigal
conda config --set channel_priority flexible # gvclass needs flexible priorities
pip install cython
cd gvclass/workflow/scripts/
pip install --user ./pyrodigal
cd ../../
- Test GVClass using the provided giant virus assemblies
snakemake -j 24 --use-conda --config querydir="example"
- If this completes successfully, run it using your own directory of query genomes
snakemake -j <number of processes> --use-conda --config querydir="<path to query dir>"
- Config file allows to specify options for MAFFT (default is mafft-linsi), iqtree (default) or fasttree
- fast_mode (default) can be set to False in config file, in that case single protein trees are also built for all conserved order-level marker genes
- These parameters can also be passed on the command line via the
--config
command line option. E.g.,--config querydir=example treeoption=fasttree
.
- The classification result is summarized in a tab separated file in a subdir "results" in the the query dir
- Different genetic codes are tested and evaluated based on hmmsearch using the general models
- Genetic code that yields the largest number of matches to general models with the highest average bitscore and the highest coding density is selected
- Taxonomy assignments are provided on different taxonomic levels
- To yield an assignments all nearest neighbors in GVOG phylogenetic trees have to be in agreement
- Giant virus genomes typically have less than 10 out of a set of 56 universal cellular housekeeping genes (UNI56). Higher UNI56 counts indicate cellular contamination, or giant virus sequences that are located on host contigs.
- UNI56u (unique counts), UNI56t(total counts), UNI56df (duplication factor) are provided and can be used for further quality filtering
- Giant virus genomes typically have a duplication factor of GVOG7 and GVOG9 of below 3. Higher GVOG7 duplication factors indicate the presence mixed viral populations.
- GVOG8u, GVOG4u (unique counts), GVOG8t, GVOG4t (total counts), GVOG8df (duplication factor) are provided and can be used for further quality filtering
- GVOG8df < 2 and order_dup < 1.5: low chance of representing mixed bin [high quality]
- GVOG8df 2-3 and order_dup 1.5-2: medium chance of representing mixed bin [medium quality]
- GVOG8df >3 and order_dup >3: high chance of representing mixed bin [low quality]
- GVOG8u, GVOG4u (unique counts), GVOG8t, GVOG4t (total counts), GVOG8df (duplication factor) are provided and can be used for further quality filtering
- Genome completeness estimate based on count of genes conserved in 50% of genomes of the respective Nucleocytoviricota order.
- < 30%: low completeness [low quality]
- 30-70%: medium completeness [medium quality]
- > 70% high completeness [high quality]
- Will be provided soon
https://www.nature.com/articles/s44298-024-00069-7
- Add Egoviruses and Proculoviruses
- Schulz F, Roux S, Paez-Espino D, Jungbluth S, Walsh DA, Denef VJ, McMahon KD, Konstantinidis KT, Eloe-Fadrosh EA, Kyrpides NC, Woyke T. Giant virus diversity and host interactions through global metagenomics. Nature. 2020 Feb;578(7795):432-6.
- Aylward FO, Moniruzzaman M, Ha AD, Koonin EV. A phylogenomic framework for charting the diversity and evolution of giant viruses. PLoS biology. 2021 Oct 27;19(10):e3001430.
GVClass was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.