NanoVI

Taxonomic classification of full-length 16S rRNA Oxford Nanopore reads using variational inference.

Overview

NanoVI is a Nextflow DSL2 pipeline that performs taxonomic classification of full-length 16S ribosomal RNA gene sequences generated by Oxford Nanopore Technologies long-read sequencing. It uses variational inference to estimate species-level relative abundances from alignment likelihoods computed via CIGAR string analysis.
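
The idea of estimating abundances from per-read alignment likelihoods can be illustrated with a generic mean-field variational update for a mixture model. This is only a sketch under stated assumptions, not NanoVI's actual implementation: the symmetric Dirichlet prior `alpha0`, the function names, and the input layout (`log_lik[r][t]` = log-likelihood of read `r` under taxon `t`) are all hypothetical.

```python
import math

def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def estimate_abundances(log_lik, alpha0=1.0, n_iter=100):
    """Mean-field VI for a mixture with a symmetric Dirichlet(alpha0) prior.

    log_lik[r][t] is the (finite) log-likelihood of read r under taxon t.
    Returns the posterior-mean relative abundance of each taxon.
    """
    n_taxa = len(log_lik[0])
    alpha = [alpha0 + len(log_lik) / n_taxa] * n_taxa
    for _ in range(n_iter):
        # Expected log abundance of each taxon under the current Dirichlet.
        total = digamma(sum(alpha))
        e_log_pi = [digamma(a) - total for a in alpha]
        counts = [0.0] * n_taxa
        for row in log_lik:
            # Soft assignment of the read to taxa (softmax, computed stably).
            scores = [e + l for e, l in zip(e_log_pi, row)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            norm = sum(weights)
            for t, w in enumerate(weights):
                counts[t] += w / norm
        # Dirichlet update: prior pseudo-count plus expected read counts.
        alpha = [alpha0 + c for c in counts]
    total_alpha = sum(alpha)
    return [a / total_alpha for a in alpha]

# Three reads that fit taxon 0 well, one read that fits taxon 1 well.
reads = [[0.0, -20.0], [0.0, -20.0], [0.0, -20.0], [-20.0, 0.0]]
abundances = estimate_abundances(reads)
```

Reads that align almost equally well to several taxa are split between them in proportion to the current abundance estimate, which is what lets a mixture model resolve ambiguous 16S hits.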

Pipeline Workflow

Input FASTQs → Quality Filter (FastpLong) → Align (minimap2) → CIGAR Stats → Log-Probabilities → VI → Abundance Tables
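
As an illustration of the "CIGAR Stats → Log-Probabilities" steps, the sketch below counts CIGAR operations and turns them into a log-probability. The per-operation probabilities in `ASSUMED_OP_PROB` are invented for this example, and the sketch assumes extended CIGAR strings that use `=` and `X`; the error model NanoVI actually uses is not described in this README.

```python
import math
import re
from collections import Counter

# One CIGAR element is a run length followed by an operation code (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Hypothetical per-base probabilities for each operation (illustration only).
ASSUMED_OP_PROB = {"=": 0.95, "X": 0.02, "I": 0.015, "D": 0.015}

def cigar_counts(cigar):
    """Return the total number of bases covered by each CIGAR operation."""
    counts = Counter()
    for length, op in CIGAR_RE.findall(cigar):
        counts[op] += int(length)
    return counts

def log_probability(cigar):
    """Log-probability of an alignment under the assumed per-base model."""
    counts = cigar_counts(cigar)
    return sum(counts[op] * math.log(p) for op, p in ASSUMED_OP_PROB.items())

perfect = log_probability("1500=")
noisy = log_probability("1400=40X30I30D")
```

An alignment with more mismatches and indels receives a lower log-probability, so `noisy < perfect` here.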

Subcommands:

Subcommand	Description
`abundance`	Classify reads and estimate taxonomic abundances
`build-database`	Build a custom reference database
`collapse-taxonomy`	Collapse abundance table to a higher taxonomic rank
`combine-outputs`	Merge per-sample abundance tables into a single matrix

Quick Start

# Install Nextflow
curl -s https://get.nextflow.io | bash

# Run with Docker (default)
nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/database \
    --taxonomy_tsv /path/to/database/taxonomy.tsv \
    --output_dir results/

# Run on HPC with Singularity
nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/database \
    -profile singularity

# Run on SLURM cluster
nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/database \
    -profile slurm

Note for SLURM users: Before running on a cluster, review and adjust the resource allocations (CPUs, memory, and time limits) defined in conf/base.config to match your cluster's available resources and partition settings.

Input Format

NanoVI accepts input in two formats, both passed via --input: a samplesheet CSV or a directory of FASTQ files.

Samplesheet (recommended)

A CSV file with sample and fastq columns:

sample,fastq
sample1,/path/to/sample1.fastq.gz
sample2,/path/to/sample2.fastq.gz
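
Before launching a run it can help to sanity-check the samplesheet. The following is a minimal sketch, not part of the pipeline: `read_samplesheet` is a hypothetical helper, and rejecting duplicate sample names is an assumption about what a well-formed sheet looks like.

```python
import csv
import io

def read_samplesheet(handle):
    """Parse a samplesheet and return a list of (sample, fastq) tuples."""
    reader = csv.DictReader(handle)
    missing = {"sample", "fastq"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {', '.join(sorted(missing))}")
    rows, seen = [], set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample = (row["sample"] or "").strip()
        fastq = (row["fastq"] or "").strip()
        if not sample or not fastq:
            raise ValueError(f"line {line_no}: empty sample or fastq field")
        if sample in seen:
            # Assumption: sample names must be unique.
            raise ValueError(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)
        rows.append((sample, fastq))
    return rows

sheet = io.StringIO(
    "sample,fastq\n"
    "sample1,/path/to/sample1.fastq.gz\n"
    "sample2,/path/to/sample2.fastq.gz\n"
)
entries = read_samplesheet(sheet)
```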

Parameters

General

Parameter	Default	Description
`--cmd`	`abundance`	Subcommand to run: `abundance`, `build-database`, `collapse-taxonomy`, `combine-outputs`
`--output_dir`	`results/`	Output directory

Abundance Estimation

Parameter	Default	Description
`--input`	required	Path to samplesheet CSV or FASTQ directory
`--db`	required	Path to NanoVI/GTDB reference database
`--taxonomy_tsv`	`<db>/taxonomy.tsv`	Path to taxonomy TSV
`--kmer_size`	`21`	K-mer size for minimap2 indexing
`--N`	`3`	Maximum secondary alignments per read
`--K`	`4000000000`	Minibatch size for minimap2 mapping (bytes)
`--type`	`map-ont`	Minimap2 preset (`map-ont`, `map-pb`, `sr`)
`--min_length`	`500`	Minimum read length filter (bp)
`--max_length`	`2000`	Maximum read length filter (bp)
`--keep_counts`	`false`	Include read count column in output
`--keep_files`	`false`	Retain intermediate SAM and FASTQ files
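
To make the effect of `--min_length` and `--max_length` concrete, here is a small stand-alone sketch of a read-length filter. It is not how the pipeline filters reads (that step is done by FastpLong); the helper name, the four-line FASTQ assumption, and the inclusive bounds are assumptions made for illustration.

```python
import io

def filter_by_length(handle, min_length=500, max_length=2000):
    """Yield (header, sequence, quality) for reads within the length bounds.

    Assumes plain four-line FASTQ records; bounds are treated as inclusive.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        if min_length <= len(seq) <= max_length:
            yield header.rstrip("\n"), seq, qual

# Toy example with tiny bounds so the records fit on screen.
fastq = io.StringIO(
    "@short\nACG\n+\nIII\n"
    "@ok\nACGTAC\n+\nIIIIII\n"
    "@long\nACGTACGTAC\n+\nIIIIIIIIII\n"
)
kept = list(filter_by_length(fastq, min_length=5, max_length=8))
```

With the defaults of 500 and 2000 bp, reads far shorter or longer than a full-length 16S gene (roughly 1,500 bp) are discarded before alignment.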

Database Build

Parameter	Default	Description
`--db_name`	`db_custom`	Name for the custom database
`--sequences`	required	Input FASTA with reference sequences
`--seq2tax`	required	Sequence-to-taxonomy mapping TSV
`--taxonomy_list`	required	Taxonomy terms list TSV

Post-Processing

Parameter	Default	Description
`--input_tsv`	required	Abundance table for taxonomy collapse
`--input_dir`	required	Directory with per-sample tables for combining
`--rank`	required	Taxonomic rank: `species`, `genus`, `family`, `order`, `class`, `phylum`, `superkingdom`
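
Collapsing to a higher rank amounts to summing abundances over all rows that share the same label at that rank. The sketch below shows the idea on in-memory rows; the row layout (one key per rank plus an `abundance` key) is a hypothetical stand-in for the real table columns, which are not specified here.

```python
from collections import defaultdict

# Ranks accepted by --rank, as listed in the table above.
RANKS = ("species", "genus", "family", "order", "class", "phylum", "superkingdom")

def collapse(rows, rank):
    """Sum the 'abundance' of every row sharing the same label at `rank`."""
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank}")
    totals = defaultdict(float)
    for row in rows:
        totals[row[rank]] += row["abundance"]
    return dict(totals)

rows = [
    {"species": "Escherichia coli", "genus": "Escherichia", "abundance": 0.5},
    {"species": "Escherichia fergusonii", "genus": "Escherichia", "abundance": 0.25},
    {"species": "Bacteroides fragilis", "genus": "Bacteroides", "abundance": 0.25},
]
by_genus = collapse(rows, "genus")
```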

Output

The pipeline produces the following output files in --output_dir:

File	Description
`<sample>_rel-abundance.tsv`	Per-sample relative abundance table
`<sample>_rel-abundance-counts.tsv`	Per-sample abundance with estimated read counts (if `--keep_counts`)
`qc/<sample>_report.html.gz`	FastpLong quality control report
`pipeline_info/timeline.html`	Nextflow execution timeline
`pipeline_info/report.html`	Nextflow execution report
`pipeline_info/trace.txt`	Nextflow trace file
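
The per-sample tables can be merged into a single taxon-by-sample matrix, which is what `combine-outputs` is described as doing. The sketch below shows one way such a merge could work on in-memory data; the function name, the input layout, and filling missing taxa with 0.0 are assumptions, not the pipeline's documented behaviour.

```python
def combine_tables(per_sample):
    """Merge {sample: {taxon: abundance}} into (header, rows).

    Taxa absent from a sample are filled with 0.0 (an assumption).
    """
    samples = sorted(per_sample)
    taxa = sorted({taxon for table in per_sample.values() for taxon in table})
    header = ["taxon"] + samples
    rows = [
        [taxon] + [per_sample[s].get(taxon, 0.0) for s in samples]
        for taxon in taxa
    ]
    return header, rows

header, rows = combine_tables({
    "sample1": {"Escherichia coli": 0.75, "Bacteroides fragilis": 0.25},
    "sample2": {"Escherichia coli": 1.0},
})
```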

Profiles

Profile	Description
`docker`	Run with Docker containers (default if no profile specified)
`singularity`	Run with Singularity containers
`conda`	Run with Conda environments
`slurm`	Submit jobs to SLURM scheduler (uses Singularity)
`sge`	Submit jobs to SGE scheduler (uses Singularity)
`test`	Minimal test dataset with reduced resources

Combine profiles: nextflow run main.nf -profile test,docker

Building Containers

Docker:

docker build -t nanovi-python:1.0.0 -f containers/Dockerfile containers/

Singularity:

singularity build nanovi-python.sif containers/Singularity.def

Building a GTDB Reference Database

NanoVI includes a helper script to build a reference database directly from the GTDB SSU FASTA file. The script assigns one taxid per species, deduplicates identical 16S sequences, and outputs the files required by the pipeline.

1. Download the GTDB SSU file

Go to the GTDB data repository and download the combined bacteria + archaea SSU file:

ssu_all_rXXX.fna.gz

Replace XXX with the release number (e.g. 226, giving ssu_all_r226.fna.gz). The file is typically ~400 MB compressed.

2. Build the database

python3 bin/build_gtdb_db.py \
    --ssu ssu_all_r226.fna.gz \
    --db-name db_gtdb_r226 \
    --output-dir /path/to/db_gtdb_r226

Option	Default	Description
`--ssu`	required	Path to the GTDB SSU FASTA (`.fna` or `.fna.gz`)
`--db-name`	`db_gtdb`	Label embedded in FASTA headers (e.g. `db_gtdb_r226`)
`--output-dir`	`./db`	Directory where output files are written
`--min-length`	`900`	Minimum 16S sequence length in bp to include

Output files:

File	Description
`species_taxid.fasta`	Reference FASTA — one entry per unique 16S sequence, header: `taxid:db_name:n`
`taxonomy.tsv`	Taxonomy table — one row per species with full lineage
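
The core of the build step (one taxid per species, deduplication of identical sequences, a minimum-length filter) can be sketched as follows. This is not the code in `bin/build_gtdb_db.py`: the sequential integer taxids, the reading of `n` as a per-species counter, and keeping the first occurrence of a duplicated sequence are all assumptions made for illustration.

```python
def build_reference(records, db_name="db_gtdb", min_length=900):
    """Build reference entries from (species, sequence) pairs.

    Returns (taxids, entries) where taxids maps species -> integer taxid and
    entries is a list of (header, sequence) with header 'taxid:db_name:n'.
    """
    taxids = {}        # species name -> taxid (assumed: sequential integers)
    seen = set()       # sequences already emitted, for deduplication
    per_species = {}   # taxid -> number of sequences emitted so far
    entries = []
    for species, seq in records:
        seq = seq.upper()
        if len(seq) < min_length or seq in seen:
            continue  # too short, or an exact duplicate of an earlier sequence
        seen.add(seq)
        taxid = taxids.setdefault(species, len(taxids) + 1)
        n = per_species[taxid] = per_species.get(taxid, 0) + 1
        entries.append((f"{taxid}:{db_name}:{n}", seq))
    return taxids, entries

# Toy sequences with a tiny min_length so the example stays readable.
records = [
    ("s__A", "ACGTACGT"),
    ("s__A", "ACGTACGT"),   # exact duplicate, dropped
    ("s__A", "ACGTTTTT"),
    ("s__B", "GG"),         # below min_length, dropped
    ("s__B", "GGGGCCCC"),
]
taxids, entries = build_reference(records, db_name="db", min_length=4)
```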

3. Index with minimap2

minimap2 -k 21 -d /path/to/db_gtdb_r226/gtdb_index.mmi \
    /path/to/db_gtdb_r226/species_taxid.fasta

Note: A minimap2 index (.mmi) may not be compatible across minimap2 versions. To avoid conflicts, either build the index with the same minimap2 version the pipeline uses (for example, inside the pipeline's container), or pass the FASTA file directly as --db and let the pipeline index it at runtime.

4. Run NanoVI with the GTDB database

nextflow run microbialds/NanoVI \
    --cmd abundance \
    --input samplesheet.csv \
    --db /path/to/db_gtdb_r226/gtdb_index.mmi \
    --taxonomy_tsv /path/to/db_gtdb_r226/taxonomy.tsv \
    --output_dir results/

Citation

If you use NanoVI in your research, please cite:

Curiqueo C, Fuentes-Santander F, Ugalde JA. NanoVI: taxonomic classification of full-length 16S rRNA Nanopore reads using variational inference. [Journal]. [Year].

License

This project is licensed under the MIT License; see the LICENSE file for details.
