microbialds/NanoVI

NanoVI

Taxonomic classification of full-length 16S rRNA Oxford Nanopore reads using variational inference.

Overview

NanoVI is a Nextflow DSL2 pipeline that performs taxonomic classification of full-length 16S ribosomal RNA gene sequences generated by Oxford Nanopore Technologies long-read sequencing. It uses variational inference to estimate species-level relative abundances from alignment likelihoods computed via CIGAR string analysis.
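To make the idea of an alignment likelihood derived from a CIGAR string concrete, here is a minimal sketch. It is not NanoVI's actual scoring code: the per-operation probabilities in `OP_PROB` are invented for illustration, and the sketch assumes extended CIGAR strings that distinguish matches (`=`) from mismatches (`X`), as minimap2 writes when run with `--eqx`.

```python
import math
import re

# One CIGAR element is a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Hypothetical per-base probabilities for each operation (illustrative values only).
OP_PROB = {"=": 0.95, "X": 0.02, "I": 0.01, "D": 0.01, "S": 0.01}


def cigar_counts(cigar):
    """Sum the lengths of each CIGAR operation in the string."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts


def alignment_log_prob(cigar, op_prob=OP_PROB):
    """Log-probability of an alignment under an independent per-base model.

    Operations without an entry in op_prob are ignored.
    """
    counts = cigar_counts(cigar)
    return sum(n * math.log(op_prob[op]) for op, n in counts.items() if op in op_prob)


if __name__ == "__main__":
    print(cigar_counts("10=1X2I"))
    print(alignment_log_prob("100="), alignment_log_prob("90=10X"))
```

Under this toy model an alignment with more mismatches or indels always receives a lower log-probability, which is the property the downstream abundance estimation relies on.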

Pipeline Workflow

Input FASTQs → Quality Filter (FastpLong) → Align (minimap2) → CIGAR Stats → Log-Probabilities → VI → Abundance Tables
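The last two steps of the workflow (Log-Probabilities → VI) can be illustrated with a small mean-field variational Bayes routine for a mixture model with a symmetric Dirichlet prior over species proportions. This is a generic textbook sketch under stated assumptions, not necessarily the update scheme NanoVI implements: `log_lik[n][k]` is assumed to hold the log-probability of read `n` given species `k`, every read is assumed to have at least one finite entry, and the `prior` and `n_iter` values are arbitrary.

```python
import math


def digamma(x):
    """Digamma function for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def estimate_abundances(log_lik, prior=1.0, n_iter=200):
    """Return the posterior mean abundance of each species.

    log_lik[n][k] is the log-probability of read n given species k.
    """
    n_species = len(log_lik[0])
    # Start from a uniform split of the reads across species.
    alpha = [prior + len(log_lik) / n_species] * n_species
    for _ in range(n_iter):
        # E[log pi_k] up to an additive constant that cancels on normalisation.
        e_log_pi = [digamma(a) for a in alpha]
        new_alpha = [prior] * n_species
        for row in log_lik:
            scores = [e + l for e, l in zip(e_log_pi, row)]
            top = max(scores)  # subtract the maximum for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            for k, w in enumerate(weights):
                new_alpha[k] += w / total  # soft assignment of the read
        alpha = new_alpha
    total_alpha = sum(alpha)
    return [a / total_alpha for a in alpha]


if __name__ == "__main__":
    reads = [[-1, -50], [-2, -60], [-1, -40], [-55, -1]]
    print(estimate_abundances(reads))
```

Each iteration softly assigns every read to the candidate species in proportion to its likelihood and the current abundance estimate, then updates the Dirichlet parameters from those assignments.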

Subcommands:

| Subcommand | Description |
|---|---|
| abundance | Classify reads and estimate taxonomic abundances |
| build-database | Build a custom reference database |
| collapse-taxonomy | Collapse abundance table to a higher taxonomic rank |
| combine-outputs | Merge per-sample abundance tables into a single matrix |

Quick Start

# Install Nextflow
curl -s https://get.nextflow.io | bash

# Run with Docker (default)
nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/database \
    --taxonomy_tsv /path/to/database/taxonomy.tsv \
    --output_dir results/

# Run on HPC with Singularity
nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/database \
    -profile singularity

# Run on SLURM cluster
nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/database \
    -profile slurm

Note for SLURM users: Before running on a cluster, review and adjust the resource allocations (CPUs, memory, and time limits) defined in conf/base.config to match your cluster's available resources and partition settings.

Input Format

NanoVI accepts input in two formats:

Samplesheet (recommended)

A CSV file with sample and fastq columns:

sample,fastq
sample1,/path/to/sample1.fastq.gz
sample2,/path/to/sample2.fastq.gz
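Before launching a run it can help to check the samplesheet locally. The sketch below reads a CSV with the `sample` and `fastq` columns described above. The extra checks (non-empty values, unique sample IDs) are assumptions about what a sensible samplesheet looks like, not documented NanoVI behaviour, and `read_samplesheet` is a hypothetical helper name.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "fastq")


def read_samplesheet(handle):
    """Parse a samplesheet and return a list of (sample, fastq) tuples."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing column(s): {', '.join(missing)}")
    rows, seen = [], set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample = (row["sample"] or "").strip()
        fastq = (row["fastq"] or "").strip()
        if not sample or not fastq:
            raise ValueError(f"line {line_no}: empty sample or fastq value")
        if sample in seen:
            raise ValueError(f"line {line_no}: duplicate sample ID {sample!r}")
        seen.add(sample)
        rows.append((sample, fastq))
    return rows


if __name__ == "__main__":
    text = "sample,fastq\nsample1,/path/to/sample1.fastq.gz\n"
    print(read_samplesheet(io.StringIO(text)))
```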

Directory

A path to a directory containing FASTQ files. Sample IDs are inferred from filenames.
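If you prefer explicit sample IDs over inferred ones, a directory can be converted into a samplesheet first. The sketch below assumes the sample ID is the filename with its FASTQ extension removed. The exact inference rule NanoVI applies is not documented here, so treat both the suffix list and the helper names as illustrative.

```python
import csv
import sys
from pathlib import Path

# Assumed set of recognised FASTQ extensions, longest first.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def sample_id(filename):
    """Strip a FASTQ extension from a filename; return None for other files."""
    for suffix in FASTQ_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def samplesheet_rows(directory):
    """Return sorted (sample, absolute_path) tuples for FASTQ files in a directory."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        sid = sample_id(path.name)
        if path.is_file() and sid:
            rows.append((sid, str(path.resolve())))
    return rows


if __name__ == "__main__":
    writer = csv.writer(sys.stdout)
    writer.writerow(["sample", "fastq"])
    writer.writerows(samplesheet_rows(sys.argv[1] if len(sys.argv) > 1 else "."))
```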

Parameters

General

| Parameter | Default | Description |
|---|---|---|
| --cmd | abundance | Subcommand to run: abundance, build-database, collapse-taxonomy, combine-outputs |
| --output_dir | results/ | Output directory |

Abundance Estimation

| Parameter | Default | Description |
|---|---|---|
| --input | required | Path to samplesheet CSV or FASTQ directory |
| --db | required | Path to NanoVI/GTDB reference database |
| --taxonomy_tsv | <db>/taxonomy.tsv | Path to taxonomy TSV |
| --kmer_size | 21 | K-mer size for minimap2 indexing |
| --N | 3 | Maximum secondary alignments per read |
| --K | 4000000000 | Minibatch size for minimap2 mapping (bytes) |
| --type | map-ont | Minimap2 preset (map-ont, map-pb, sr) |
| --min_length | 500 | Minimum read length filter (bp) |
| --max_length | 2000 | Maximum read length filter (bp) |
| --keep_counts | false | Include read count column in output |
| --keep_files | false | Retain intermediate SAM and FASTQ files |

Database Build

| Parameter | Default | Description |
|---|---|---|
| --db_name | db_custom | Name for the custom database |
| --sequences | required | Input FASTA with reference sequences |
| --seq2tax | required | Sequence-to-taxonomy mapping TSV |
| --taxonomy_list | required | Taxonomy terms list TSV |

Post-Processing

| Parameter | Default | Description |
|---|---|---|
| --input_tsv | required | Abundance table for taxonomy collapse |
| --input_dir | required | Directory with per-sample tables for combining |
| --rank | required | Taxonomic rank: species, genus, family, order, class, phylum, superkingdom |
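Collapsing to a higher rank amounts to summing the abundances of all rows that share the same label at that rank. The sketch below shows the idea on in-memory rows. The column names (`species`, `genus`, `abundance`) are assumptions about the table layout rather than the documented NanoVI schema, and `collapse` is a hypothetical helper.

```python
from collections import defaultdict


def collapse(rows, rank, value_key="abundance"):
    """Sum abundances over all rows sharing the same label at the given rank."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[rank]] += float(row[value_key])
    return dict(totals)


if __name__ == "__main__":
    table = [
        {"species": "Escherichia coli", "genus": "Escherichia", "abundance": "0.5"},
        {"species": "Escherichia albertii", "genus": "Escherichia", "abundance": "0.25"},
        {"species": "Bacillus subtilis", "genus": "Bacillus", "abundance": "0.25"},
    ]
    print(collapse(table, "genus"))
```

Because relative abundances are additive across taxa, the collapsed table still sums to the same total as the input.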

Output

The pipeline produces the following output files in --output_dir:

| File | Description |
|---|---|
| <sample>_rel-abundance.tsv | Per-sample relative abundance table |
| <sample>_rel-abundance-counts.tsv | Per-sample abundance with estimated read counts (if --keep_counts) |
| qc/<sample>_report.html.gz | FastpLong quality control report |
| pipeline_info/timeline.html | Nextflow execution timeline |
| pipeline_info/report.html | Nextflow execution report |
| pipeline_info/trace.txt | Nextflow trace file |
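The per-sample tables listed above are the inputs that the combine-outputs subcommand merges into a single matrix. As a rough illustration of that merge, the sketch below takes already-parsed tables and builds a taxon-by-sample matrix, filling taxa absent from a sample with 0.0. The in-memory representation and the `combine` helper are assumptions for illustration; the real subcommand may order or label columns differently.

```python
def combine(tables):
    """Merge {sample: {taxon: abundance}} into (header, rows).

    Taxa missing from a sample are reported as 0.0.
    """
    samples = sorted(tables)
    taxa = sorted({taxon for table in tables.values() for taxon in table})
    header = ["taxon"] + samples
    rows = [[taxon] + [tables[s].get(taxon, 0.0) for s in samples] for taxon in taxa]
    return header, rows


if __name__ == "__main__":
    header, rows = combine({
        "sample1": {"Escherichia coli": 0.6, "Bacillus subtilis": 0.4},
        "sample2": {"Escherichia coli": 1.0},
    })
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(v) for v in row))
```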

Profiles

| Profile | Description |
|---|---|
| docker | Run with Docker containers (default if no profile specified) |
| singularity | Run with Singularity containers |
| conda | Run with Conda environments |
| slurm | Submit jobs to SLURM scheduler (uses Singularity) |
| sge | Submit jobs to SGE scheduler (uses Singularity) |
| test | Minimal test dataset with reduced resources |

Combine profiles: nextflow run main.nf -profile test,docker

Building Containers

Docker:

docker build -t nanovi-python:1.0.0 -f containers/Dockerfile containers/

Singularity:

singularity build nanovi-python.sif containers/Singularity.def

Building a GTDB Reference Database

NanoVI includes a helper script to build a reference database directly from the GTDB SSU FASTA file. The script assigns one taxid per species, deduplicates identical 16S sequences, and outputs the files required by the pipeline.
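The three steps the helper script performs (one taxid per species, deduplication, output generation) can be sketched as follows. This is not the code in bin/build_gtdb_db.py: it assumes the input has already been parsed into `(species, sequence)` pairs, that duplicates are removed within a species, and that the final field of the `taxid:db_name:n` header is a per-taxid counter. All three points are guesses made for illustration.

```python
def build_reference(records, db_name="db_gtdb", min_length=900):
    """Assign taxids and drop duplicate sequences.

    records is an iterable of (species, sequence) pairs.
    Returns (fasta_entries, taxids): fasta_entries is a list of
    (header, sequence) tuples and taxids maps species -> taxid.
    """
    taxids = {}      # species name -> integer taxid, in order of first appearance
    seen = set()     # (taxid, sequence) pairs already emitted
    per_taxid = {}   # taxid -> number of sequences emitted so far
    entries = []
    for species, seq in records:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short to be a useful 16S reference
        taxid = taxids.setdefault(species, len(taxids) + 1)
        if (taxid, seq) in seen:
            continue  # identical sequence already stored for this species
        seen.add((taxid, seq))
        per_taxid[taxid] = per_taxid.get(taxid, 0) + 1
        entries.append((f"{taxid}:{db_name}:{per_taxid[taxid]}", seq))
    return entries, taxids


if __name__ == "__main__":
    demo = [("Escherichia coli", "ACGT"), ("Escherichia coli", "acgt"),
            ("Bacillus subtilis", "ACGA")]
    for header, seq in build_reference(demo, "db_gtdb_r226", min_length=0)[0]:
        print(f">{header}\n{seq}")
```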

1. Download the GTDB SSU file

Go to the GTDB data repository and download the combined bacteria + archaea SSU file:

ssu_all_rXXX.fna.gz

Replace XXX with the release number (e.g. 226, giving ssu_all_r226.fna.gz). The file is typically ~400 MB compressed.

2. Build the database

python3 bin/build_gtdb_db.py \
    --ssu ssu_all_r226.fna.gz \
    --db-name db_gtdb_r226 \
    --output-dir /path/to/db_gtdb_r226

| Option | Default | Description |
|---|---|---|
| --ssu | required | Path to the GTDB SSU FASTA (.fna or .fna.gz) |
| --db-name | db_gtdb | Label embedded in FASTA headers (e.g. db_gtdb_r226) |
| --output-dir | ./db | Directory where output files are written |
| --min-length | 900 | Minimum 16S sequence length in bp to include |

Output files:

| File | Description |
|---|---|
| species_taxid.fasta | Reference FASTA with one entry per unique 16S sequence; header format: taxid:db_name:n |
| taxonomy.tsv | Taxonomy table with one row per species, including the full lineage |

3. Index with minimap2

minimap2 -k 21 -d /path/to/db_gtdb_r226/gtdb_index.mmi \
    /path/to/db_gtdb_r226/species_taxid.fasta

Note: The minimap2 index (.mmi) is version-specific and may not be compatible across different minimap2 versions. To avoid conflicts, either build the index using the pipeline's container, or pass the FASTA file directly as --db and let the pipeline index it at runtime.

4. Run NanoVI with the GTDB database

nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/db_gtdb_r226/gtdb_index.mmi \
    --taxonomy_tsv /path/to/db_gtdb_r226/taxonomy.tsv \
    --output_dir results/

Citation

If you use NanoVI in your research, please cite:

Curiqueo C, Fuentes-Santander F, Ugalde JA. NanoVI: taxonomic classification of full-length 16S rRNA Nanopore reads using variational inference. [Journal]. [Year].

License

This project is licensed under the MIT License — see the LICENSE file for details.
