Taxonomic classification of full-length 16S rRNA Oxford Nanopore reads using variational inference.
NanoVI is a Nextflow DSL2 pipeline that performs taxonomic classification of full-length 16S ribosomal RNA gene sequences generated by Oxford Nanopore Technologies long-read sequencing. It uses variational inference to estimate species-level relative abundances from alignment likelihoods computed via CIGAR string analysis.
Input FASTQs → Quality Filter (FastpLong) → Align (minimap2) → CIGAR Stats → Log-Probabilities → VI → Abundance Tables
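The CIGAR Stats → Log-Probabilities → VI steps can be illustrated with a minimal sketch. This is not NanoVI's actual model: the function names, the per-operation probabilities, and the symmetric Dirichlet prior are all assumptions made for illustration, and the CIGAR strings are assumed to use `=`/`X` operations (as produced by minimap2 with `--eqx`). The sketch scores each alignment from its CIGAR operation counts, then runs mean-field variational updates for the mixture weights over taxa.

```python
import math
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_op_counts(cigar):
    """Sum the lengths of each CIGAR operation, e.g. '10M2I5M' -> {'M': 15, 'I': 2}."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts

def alignment_log_prob(cigar, log_p):
    """Illustrative score: sum of per-operation log-probabilities (ops missing from log_p are ignored)."""
    return sum(n * log_p[op] for op, n in cigar_op_counts(cigar).items() if op in log_p)

def digamma(x):
    """Digamma for x > 0 via the recurrence psi(x) = psi(x + 1) - 1/x plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def vi_abundances(log_lik, alpha0=1.0, n_iter=50):
    """Mean-field VI for mixture weights; log_lik[i][k] = log P(read i | taxon k)."""
    n_taxa = len(log_lik[0])
    alpha = [alpha0 + len(log_lik) / n_taxa] * n_taxa
    for _ in range(n_iter):
        psi_total = digamma(sum(alpha))
        e_log_pi = [digamma(a) - psi_total for a in alpha]
        new_alpha = [alpha0] * n_taxa
        for row in log_lik:
            scores = [e + l for e, l in zip(e_log_pi, row)]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]  # stable softmax
            norm = sum(weights)
            for k in range(n_taxa):
                new_alpha[k] += weights[k] / norm  # expected read count for taxon k
        alpha = new_alpha
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical per-operation probabilities (assumed values, not taken from NanoVI).
log_p = {"=": math.log(0.95), "X": math.log(0.02),
         "I": math.log(0.015), "D": math.log(0.015)}

# Each read has one CIGAR string per candidate taxon (taxon A, taxon B).
read_cigars = [("100=", "90=10X"), ("100=", "95=5X"), ("98=2X", "100=")]
log_lik = [[alignment_log_prob(c, log_p) for c in pair] for pair in read_cigars]
abundances = vi_abundances(log_lik)
print(abundances)
```

With two of the three toy reads aligning better to taxon A, the posterior mean abundance of A ends up larger than that of B.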
Subcommands:
| Subcommand | Description |
|---|---|
| `abundance` | Classify reads and estimate taxonomic abundances |
| `build-database` | Build a custom reference database |
| `collapse-taxonomy` | Collapse abundance table to a higher taxonomic rank |
| `combine-outputs` | Merge per-sample abundance tables into a single matrix |
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Run with Docker (default)
nextflow run microbialds/NanoVI \
--cmd abundance \
--input samplesheet.csv \
--db /path/to/database \
--taxonomy_tsv /path/to/database/taxonomy.tsv \
--output_dir results/
# Run on HPC with Singularity
nextflow run microbialds/NanoVI \
--cmd abundance \
--input samplesheet.csv \
--db /path/to/database \
-profile singularity
# Run on SLURM cluster
nextflow run microbialds/NanoVI \
--cmd abundance \
--input samplesheet.csv \
--db /path/to/database \
-profile slurm
Note for SLURM users: Before running on a cluster, review and adjust the resource allocations (CPUs, memory, and time limits) defined in conf/base.config to match your cluster's available resources and partition settings.
NanoVI accepts input in two formats:
A CSV file with sample and fastq columns:
sample,fastq
sample1,/path/to/sample1.fastq.gz
sample2,/path/to/sample2.fastq.gz
A path to a directory containing FASTQ files. Sample IDs are inferred from filenames.
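When the FASTQ files already sit in one directory, a samplesheet in the CSV format above can be generated with a short script. This is a sketch under assumptions: the helper names are hypothetical, and the rule used here for deriving a sample ID (strip a `.fastq.gz`, `.fq.gz`, `.fastq`, or `.fq` suffix) is a guess, since the README does not say how NanoVI itself infers IDs from filenames.

```python
import csv
from pathlib import Path

# Assumed FASTQ suffixes; longer suffixes are checked first.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")

def sample_id(path):
    """Return the filename without its FASTQ suffix, or None for non-FASTQ files."""
    name = Path(path).name
    for suffix in FASTQ_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None

def write_samplesheet(fastq_dir, out_csv):
    """Write a 'sample,fastq' CSV listing every FASTQ file found in fastq_dir."""
    rows = []
    for path in sorted(Path(fastq_dir).iterdir()):
        sid = sample_id(path)
        if path.is_file() and sid:
            rows.append((sid, str(path.resolve())))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq"])
        writer.writerows(rows)
    return rows
```

Calling `write_samplesheet("fastq_dir", "samplesheet.csv")` produces a file that can be passed to `--input`.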
| Parameter | Default | Description |
|---|---|---|
| `--cmd` | `abundance` | Subcommand to run: abundance, build-database, collapse-taxonomy, combine-outputs |
| `--output_dir` | `results/` | Output directory |
| Parameter | Default | Description |
|---|---|---|
| `--input` | required | Path to samplesheet CSV or FASTQ directory |
| `--db` | required | Path to NanoVI/GTDB reference database |
| `--taxonomy_tsv` | `<db>/taxonomy.tsv` | Path to taxonomy TSV |
| `--kmer_size` | `21` | K-mer size for minimap2 indexing |
| `--N` | `3` | Maximum secondary alignments per read |
| `--K` | `4000000000` | Minibatch size for minimap2 mapping (bytes) |
| `--type` | `map-ont` | Minimap2 preset (map-ont, map-pb, sr) |
| `--min_length` | `500` | Minimum read length filter (bp) |
| `--max_length` | `2000` | Maximum read length filter (bp) |
| `--keep_counts` | `false` | Include read count column in output |
| `--keep_files` | `false` | Retain intermediate SAM and FASTQ files |
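To make the alignment parameters concrete, the sketch below shows how `--type`, `--kmer_size`, `--N`, and `--K` could map onto minimap2's `-x`, `-k`, `-N`, and `-K` options. The mapping, the function name, and the thread count are assumptions for illustration; the exact command line the pipeline runs is not shown in this README. The preset is placed before `-k` because minimap2 lets later options override the values set by `-x`.

```python
def minimap2_command(reference, reads, preset="map-ont", kmer_size=21,
                     max_secondary=3, minibatch=4_000_000_000, threads=4):
    """Build an argv list for a SAM-producing minimap2 run (nothing is executed here)."""
    return [
        "minimap2",
        "-a",                      # emit SAM so CIGAR strings are available
        "-x", preset,              # assumed to correspond to --type
        "-k", str(kmer_size),      # assumed to correspond to --kmer_size
        "-N", str(max_secondary),  # assumed to correspond to --N
        "-K", str(minibatch),      # assumed to correspond to --K
        "-t", str(threads),        # hypothetical thread count
        str(reference),
        str(reads),
    ]

cmd = minimap2_command("species_taxid.fasta", "sample1.fastq.gz")
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run` without shell quoting concerns.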
| Parameter | Default | Description |
|---|---|---|
| `--db_name` | `db_custom` | Name for the custom database |
| `--sequences` | required | Input FASTA with reference sequences |
| `--seq2tax` | required | Sequence-to-taxonomy mapping TSV |
| `--taxonomy_list` | required | Taxonomy terms list TSV |
| Parameter | Default | Description |
|---|---|---|
| `--input_tsv` | required | Abundance table for taxonomy collapse |
| `--input_dir` | required | Directory with per-sample tables for combining |
| `--rank` | required | Taxonomic rank: species, genus, family, order, class, phylum, superkingdom |
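Conceptually, collapsing to a higher rank sums abundances over all taxa that share the same value at that rank, and combining outputs pivots per-sample tables into a taxa-by-sample matrix. The sketch below illustrates both ideas on in-memory data. The column names (`species`, `genus`, `abundance`) and function names are hypothetical, and the code is not the implementation behind `collapse-taxonomy` or `combine-outputs`.

```python
from collections import defaultdict

def collapse_to_rank(rows, rank):
    """Sum abundances over rows sharing the same value in the given rank column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[rank]] += float(row["abundance"])
    return dict(totals)

def combine_samples(per_sample):
    """Pivot {sample: {taxon: abundance}} into (samples, {taxon: [abundance per sample]})."""
    samples = sorted(per_sample)
    taxa = sorted({taxon for table in per_sample.values() for taxon in table})
    matrix = {taxon: [per_sample[s].get(taxon, 0.0) for s in samples] for taxon in taxa}
    return samples, matrix

# Toy species-level table with hypothetical column names.
species_table = [
    {"species": "Escherichia coli", "genus": "Escherichia", "abundance": 0.5},
    {"species": "Escherichia albertii", "genus": "Escherichia", "abundance": 0.25},
    {"species": "Bacillus subtilis", "genus": "Bacillus", "abundance": 0.25},
]
genus_table = collapse_to_rank(species_table, "genus")

samples, matrix = combine_samples({
    "sample1": genus_table,
    "sample2": {"Bacillus": 1.0},
})
print(genus_table)
print(samples, matrix)
```

Taxa absent from a sample are filled with 0.0 so every row of the matrix has one value per sample.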
The pipeline produces the following output files in --output_dir:
| File | Description |
|---|---|
| `<sample>_rel-abundance.tsv` | Per-sample relative abundance table |
| `<sample>_rel-abundance-counts.tsv` | Per-sample abundance with estimated read counts (if --keep_counts) |
| `qc/<sample>_report.html.gz` | FastpLong quality control report |
| `pipeline_info/timeline.html` | Nextflow execution timeline |
| `pipeline_info/report.html` | Nextflow execution report |
| `pipeline_info/trace.txt` | Nextflow trace file |
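For downstream scripting, the file naming pattern in the table above is enough to group result tables by sample. The following is a small sketch, assuming the tables sit directly in the output directory; the helper names are hypothetical and no assumption is made about the columns inside the TSV files.

```python
from pathlib import Path

# The longer suffix is listed first for clarity; the two never match the same file.
TABLE_SUFFIXES = ("_rel-abundance-counts.tsv", "_rel-abundance.tsv")

def sample_from_output(path):
    """Recover the sample ID from an abundance table filename, or None if it does not match."""
    name = Path(path).name
    for suffix in TABLE_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None

def index_outputs(output_dir):
    """Map each sample ID to the abundance table filenames found in output_dir."""
    tables = {}
    for path in sorted(Path(output_dir).glob("*_rel-abundance*.tsv")):
        sample = sample_from_output(path)
        if sample is not None:
            tables.setdefault(sample, []).append(path.name)
    return tables
```

For example, `index_outputs("results/")` returns a dictionary keyed by sample ID, which also works for sample names that contain underscores.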
| Profile | Description |
|---|---|
| `docker` | Run with Docker containers (default if no profile specified) |
| `singularity` | Run with Singularity containers |
| `conda` | Run with Conda environments |
| `slurm` | Submit jobs to SLURM scheduler (uses Singularity) |
| `sge` | Submit jobs to SGE scheduler (uses Singularity) |
| `test` | Minimal test dataset with reduced resources |
Combine profiles: nextflow run main.nf -profile test,docker
Docker:
docker build -t nanovi-python:1.0.0 -f containers/Dockerfile containers/
Singularity:
singularity build nanovi-python.sif containers/Singularity.def
NanoVI includes a helper script to build a reference database directly from the GTDB SSU FASTA file. The script assigns one taxid per species, deduplicates identical 16S sequences, and outputs the files required by the pipeline.
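The logic described above (one taxid per species, deduplication of identical sequences, a minimum length cutoff) can be sketched in a few lines. This is not the code in bin/build_gtdb_db.py. It assumes records arrive as already parsed `(species, sequence)` pairs, that identical sequences are collapsed within a species, and that the trailing `n` in the `taxid:db_name:n` header is a running index per taxid; all three are assumptions.

```python
def build_reference(records, db_name="db_gtdb", min_length=900):
    """Assign one taxid per species and drop short or duplicate sequences.

    records: iterable of (species_name, sequence) pairs.
    Returns (taxids, entries), where entries holds (header, sequence) tuples.
    """
    taxids = {}    # species name -> integer taxid
    seen = set()   # (taxid, sequence) pairs already emitted
    counters = {}  # taxid -> number of sequences emitted so far
    entries = []
    for species, seq in records:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short to be a near full-length 16S sequence
        taxid = taxids.setdefault(species, len(taxids) + 1)
        if (taxid, seq) in seen:
            continue  # identical sequence already stored for this species
        seen.add((taxid, seq))
        counters[taxid] = counters.get(taxid, 0) + 1
        entries.append((f"{taxid}:{db_name}:{counters[taxid]}", seq))
    return taxids, entries

# Toy records; real 16S sequences are roughly 1,500 bp.
records = [
    ("s__Species A", "ACGT" * 300),
    ("s__Species A", "acgt" * 300),  # duplicate after upper-casing
    ("s__Species A", "ACGA" * 300),
    ("s__Species B", "ACGT" * 300),
    ("s__Species B", "ACGT"),        # below the length cutoff
]
taxids, entries = build_reference(records, db_name="db_gtdb_r226")
for header, seq in entries:
    print(header, len(seq))
```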
Go to the GTDB data repository and download the combined bacteria + archaea SSU file:
ssu_all_rXXX.fna.gz
Replace XXX with the release number (e.g. 226 for release r226). The file is typically ~400 MB compressed.
python3 bin/build_gtdb_db.py \
--ssu ssu_all_r226.fna.gz \
--db-name db_gtdb_r226 \
--output-dir /path/to/db_gtdb_r226
| Option | Default | Description |
|---|---|---|
| `--ssu` | required | Path to the GTDB SSU FASTA (.fna or .fna.gz) |
| `--db-name` | `db_gtdb` | Label embedded in FASTA headers (e.g. db_gtdb_r226) |
| `--output-dir` | `./db` | Directory where output files are written |
| `--min-length` | `900` | Minimum 16S sequence length in bp to include |
Output files:
| File | Description |
|---|---|
| `species_taxid.fasta` | Reference FASTA with one entry per unique 16S sequence; header format: `taxid:db_name:n` |
| `taxonomy.tsv` | Taxonomy table with one row per species, including the full lineage |
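The `taxid:db_name:n` header makes it straightforward to inspect a built reference. The sketch below parses such headers and counts reference sequences per taxid. It assumes that `taxid` and `n` are integers and that only the database label might contain extra colons; the function names are hypothetical.

```python
from collections import Counter

def parse_reference_header(header):
    """Split 'taxid:db_name:n' (with or without a leading '>') into its three fields."""
    taxid, rest = header.strip().lstrip(">").split(":", 1)
    db_name, n = rest.rsplit(":", 1)
    return int(taxid), db_name, int(n)

def sequences_per_taxid(fasta_lines):
    """Count FASTA entries per taxid from an iterable of lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            taxid, _, _ = parse_reference_header(line)
            counts[taxid] += 1
    return counts

# Toy FASTA content with made-up sequences.
example = [
    ">1:db_gtdb_r226:1", "ACGT",
    ">1:db_gtdb_r226:2", "ACGA",
    ">2:db_gtdb_r226:1", "ACGT",
]
print(sequences_per_taxid(example))
```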
minimap2 -k 21 -d /path/to/db_gtdb_r226/gtdb_index.mmi \
/path/to/db_gtdb_r226/species_taxid.fasta
Note: The minimap2 index (.mmi) is version-specific and may not be compatible across different minimap2 versions. To avoid conflicts, either build the index using the pipeline's container, or pass the FASTA file directly as --db and let the pipeline index it at runtime.
nextflow run microbialds/NanoVI \
--cmd abundance \
--input samplesheet.csv \
--db /path/to/db_gtdb_r226/gtdb_index.mmi \
--taxonomy_tsv /path/to/db_gtdb_r226/taxonomy.tsv \
--output_dir results/
If you use NanoVI in your research, please cite:
Curiqueo C, Fuentes-Santander F, Ugalde JA. NanoVI: taxonomic classification of full-length 16S rRNA Nanopore reads using variational inference. [Journal]. [Year].
This project is licensed under the MIT License — see the LICENSE file for details.