VStrains is a de novo approach for reconstructing strains from viral quasispecies.


VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs

Table of Contents

  1. About VStrains
  2. Updates
  3. Installation
    3.1. Option 1. Quick Install
    3.2. Option 2. Manual Install
    3.3. Download & Install VStrains
  4. Running VStrains
    4.1. Quick Usage
    4.2. Support SPAdes
    4.3. Output
  5. Stand-alone binaries
  6. Experiment
  7. Citation
  8. Feedback and bug reports

About VStrains

VStrains is a de novo approach for reconstructing strains from viral quasispecies.


Updates

VStrains 1.1.0 Release (03 Feb 2023)

  • Replaced the PE link inference module with a hash-table approach that provides efficient perfect-match lookup. The new module yields consistent evaluation results and substantially decreases runtime and memory usage compared with the previous alignment-based approach.
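
To illustrate the general idea of hash-table-based perfect-match lookup, here is a minimal sketch. It is not the VStrains implementation: the function names, the choice of k, and the toy sequences are all assumptions made for illustration, and reverse-complement handling of the second mate is omitted for brevity.

```python
from collections import defaultdict

def build_kmer_index(nodes, k):
    """Map every k-mer to the set of node names that contain it."""
    index = defaultdict(set)
    for name, seq in nodes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def perfect_match_nodes(read, index, k):
    """Return the nodes that share at least one exact k-mer with the read."""
    hits = set()
    for i in range(len(read) - k + 1):
        # dict.get avoids inserting missing k-mers into the defaultdict
        hits |= index.get(read[i:i + k], set())
    return hits

def count_pe_links(read_pairs, index, k):
    """Count read pairs whose two mates hit two different nodes."""
    links = defaultdict(int)
    for fwd, rev in read_pairs:
        for u in perfect_match_nodes(fwd, index, k):
            for v in perfect_match_nodes(rev, index, k):
                if u != v:
                    links[tuple(sorted((u, v)))] += 1
    return dict(links)

# Toy data (hypothetical): two graph nodes and two read pairs.
nodes = {"n1": "ACGTACGTAA", "n2": "TTGGCCAATT"}
index = build_kmer_index(nodes, k=5)
pairs = [("ACGTACG", "GGCCAAT"), ("ACGTACG", "CGTACGT")]
links = count_pe_links(pairs, index, k=5)
print(links)
```

Each lookup is a constant-time dictionary access, which is why an exact-match index of this kind can be cheaper than aligning every read against the graph.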


Installation

VStrains requires a 64-bit Linux system or macOS and Python 3 (version 3.2 or higher).

Option 1. Quick Install (recommended)

Install (mini)conda as a lightweight package management tool. Run the following commands to initialize and set up the conda environment for VStrains:

# add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create conda environment
conda create --name VStrains-env

# activate conda environment
conda activate VStrains-env

conda install -c bioconda -c conda-forge python=3 graph-tool minimap2 numpy gfapy matplotlib

Option 2. Manual Install

Manually install dependencies:

And python modules:

Download & Install VStrains

After successfully setting up the environment and dependencies, clone VStrains into your desired location.

git clone

Install VStrains via pip:

cd VStrains; pip install .

Run the following command to ensure VStrains is correctly set up and installed:

vstrains -h

Running VStrains

VStrains supports assembly results from SPAdes (including metaSPAdes and metaviralSPAdes) and may support other graph-based assemblers in the future.

Quick Usage

usage: VStrains [-h] -a {spades} -g GFA_FILE [-p PATH_FILE] [-o OUTPUT_DIR] -fwd FWD -rve RVE

Construct full-length viral strains under de novo approach from contigs and assembly graph, currently supports

optional arguments:
  -h, --help            show this help message and exit
  -a {spades}, --assembler {spades}
                        name of the assembler used. [spades]
  -g GFA_FILE, --graph GFA_FILE
                        path to the assembly graph, (.gfa format)
  -p PATH_FILE, --path PATH_FILE
                        contig file from SPAdes (.paths format), only required for SPAdes. e.g., contigs.paths
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        path to the output directory [default: acc/]
  -fwd FWD, --fwd_file FWD
                        paired-end sequencing reads, forward strand (.fastq format)
  -rve RVE, --rve_file RVE
                        paired-end sequencing reads, reverse strand (.fastq format)

VStrains takes as input an assembly graph in Graphical Fragment Assembly (GFA) Format and associated contig information, together with the raw reads in paired-end format (e.g., forward.fastq, reverse.fastq).
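
For readers unfamiliar with GFA, the following minimal sketch reads the two record types that matter most for an assembly graph: segments (`S` lines) and links (`L` lines). The function name `parse_gfa` and the toy records are assumptions for illustration. VStrains itself lists gfapy as a dependency, so this hand-written parser only shows the file layout, not how VStrains reads its input.

```python
def parse_gfa(lines):
    """Collect segments and links from tab-separated GFA 1.0 records."""
    segments = {}  # segment name -> sequence
    links = []     # (from, from_orient, to, to_orient)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # S <name> <sequence> [optional tags...]
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
    return segments, links

# Toy records (hypothetical), not taken from a real SPAdes run.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT\tKC:i:120",
    "S\t2\tGTTA\tKC:i:95",
    "L\t1\t+\t2\t-\t2M",
]
segments, links = parse_gfa(example)
print(segments, links)
```

To read a real file, pass an open file handle, for example `parse_gfa(open("assembly_graph_after_simplification.gfa"))`.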

Support SPAdes

When running SPAdes, we recommend using the --careful option for more accurate assembly results. Do not modify any contig or node names in the SPAdes assembly results, so that they stay consistent across files. Please refer to SPAdes for further guidelines. Example usage is shown below:

# SPAdes assembler example, paired-end reads
python spades.py -1 forward.fastq -2 reverse.fastq --careful -t 16 -o output_dir

Both the assembly graph (assembly_graph_after_simplification.gfa) and the contig information (contigs.paths) can be found in the output directory after running the SPAdes assembler. Use them together with the raw reads as inputs for VStrains, and set the -a flag to spades. Example usage is shown below:

vstrains -a spades -g assembly_graph_after_simplification.gfa -p contigs.paths -o output_dir -fwd forward.fastq -rve reverse.fastq
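
When VStrains is driven from a pipeline script, the same command can be assembled programmatically. The sketch below uses only the flags documented in the usage message above. The helper name `build_vstrains_command` is hypothetical, and the file names are the placeholders from the example command.

```python
import shlex

def build_vstrains_command(gfa, paths, fwd, rve, output_dir):
    """Assemble the documented vstrains arguments into an argv list."""
    return [
        "vstrains",
        "-a", "spades",
        "-g", gfa,
        "-p", paths,
        "-o", output_dir,
        "-fwd", fwd,
        "-rve", rve,
    ]

cmd = build_vstrains_command(
    gfa="assembly_graph_after_simplification.gfa",
    paths="contigs.paths",
    fwd="forward.fastq",
    rve="reverse.fastq",
    output_dir="output_dir",
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.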


Output

VStrains stores all output files in <output_dir>, which is set by the user.

  • <output_dir>/aln/ directory contains paired-end (PE) linkage information, which is stored in pe_info and st_info.
  • <output_dir>/gfa/ directory contains iteratively simplified assembly graphs: graph_L0.gfa contains the assembly graph produced by SPAdes after Strandedness Canonization, split_graph_final.gfa contains the assembly graph after Graph Disentanglement, and graph_S_final.gfa contains the assembly graph after Contig-based Path Extraction. The rest are intermediate results. All assembly graphs are in GFA 1.0 format.
  • <output_dir>/paf/ and <output_dir>/tmp/ are temporary directories; feel free to ignore them.
  • <output_dir>/strain.fasta contains the resulting strains in .fasta format. The header for each strain has the form NODE_<strain name>_<sequence length>_<coverage>, which is compatible with the SPAdes contigs format.
  • <output_dir>/strain.paths contains the paths in the assembly graph (input GFA_FILE) corresponding to strain.fasta, which can be used with Bandage for further downstream analysis.
  • <output_dir>/vstrains.log contains the VStrains log.
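
For downstream scripts, the header layout of strain.fasta described above can be split into its fields. This is a minimal sketch that assumes the length and coverage are always the last two underscore-separated fields. The function name `parse_strain_header` and the example header are made up for illustration and do not come from a real run.

```python
def parse_strain_header(header):
    """Split 'NODE_<strain name>_<sequence length>_<coverage>' into fields."""
    body = header.lstrip(">").strip()
    prefix = "NODE_"
    if not body.startswith(prefix):
        raise ValueError(f"unexpected header: {header!r}")
    # Split from the right so a strain name containing '_' stays intact.
    name, length, coverage = body[len(prefix):].rsplit("_", 2)
    return name, int(length), float(coverage)

def read_strain_headers(lines):
    """Parse every FASTA header line (those starting with '>')."""
    return [parse_strain_header(line) for line in lines if line.startswith(">")]

# Hypothetical FASTA content for demonstration only.
fasta = [">NODE_A1_7428_1523.6", "ACGTACGT", ">NODE_B_2_7401_310.0", "TTGGCCAA"]
strains = read_strain_headers(fasta)
print(strains)
```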

Stand-alone binaries

evals/ is a wrapper script for strain-level experimental result analysis using MetaQUAST.

usage: [-h] -quast QUAST [-cs FILES [FILES ...]] [-d IDIR] -ref REF_FILE -o OUTPUT_DIR

Use MetaQUAST to evaluate assembly result

  -h, --help            show this help message and exit
  -quast QUAST, --path_to_quast QUAST
                        path to MetaQuast python script, version >= 5.2.0
  -cs FILES [FILES ...], --contig_files FILES [FILES ...]
                        contig files from different tools, separated by space
  -d IDIR, --contig_dir IDIR
                        contig files from different tools, stored in the directory, .fasta format
  -ref REF_FILE, --ref_file REF_FILE
                        ref file (single)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        output directory


Experiment

VStrains is evaluated on both simulated and real datasets under default settings. The sources of the datasets are listed below:

  1. Simulated Dataset, available at savage-benchmark (no preprocessing is required)
    • 6 Poliovirus (20,000x)
    • 10 HCV (20,000x)
    • 15 ZIKV (20,000x)
  2. Real Dataset (please refer to Supplementary Material for preprocessing the real datasets)


Citation

VStrains has been accepted at RECOMB 2023, and the manuscript is publicly available here.

If you use VStrains in your work, please cite the following publication.

Runpeng Luo and Yu Lin, VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs

Feedback and bug reports

Thanks for using VStrains. If you encounter any bugs during execution, please re-run the program with the additional -d flag and provide vstrains.log together with your use case via Issues.