# Database updating tutorial: adding genes

This is a tutorial for updating an existing Struo2-generated custom database.
Please first read the README for instructions on general setup of Struo2.
This tutorial will use the following:
- A small set of genes to add to the existing custom databases
- GTDB taxonomy
  - NCBI taxonomy is an alternative
- The databases generated from the database generation tutorial
  - See the ftp site for custom GTDB databases
This tutorial assumes that you've created your custom databases in the ./data/ directory.
```bash
# set the location of the database files
OUTDIR=./data/
mkdir -p $OUTDIR
```
These are the genes that you will add to the existing databases.
```bash
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genes/UniRef50_n5.tar.gz
tar -pzxvf $OUTDIR/UniRef50_n5.tar.gz --directory $OUTDIR
```
You need either the amino acid or nucleotide version of the genes.
Nucleotide is preferable, since it can be translated to amino acid via `mmseqs translatenucs`.
If only amino acid sequences are provided, then they will be reverse-translated via `mmseqs translateaa`.
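If you provide both versions, the nucleotide and amino-acid FASTA files should contain the same gene IDs. A minimal sketch of that check, using tiny stand-in files in place of the real `.fna.gz`/`.faa.gz` downloads:

```shell
# stand-in gene FASTAs (replace with your real nucleotide/amino-acid files)
printf '>g1\nATGAAA\n>g2\nATGTTT\n' | gzip > genes.fna.gz
printf '>g1\nMK\n>g2\nMF\n' | gzip > genes.faa.gz

# the headers of both files should be identical
diff <(zcat genes.fna.gz | grep '^>') <(zcat genes.faa.gz | grep '^>') \
    && echo 'headers match'
```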
Plass and linclust can be used to create a gene set from your metagenomes, which you could add to the existing custom GTDB databases.
You also need a metadata table for the new gene sequences. The following columns are required:
- `seq_uuid` - a UUID is recommended, but any unique ID should work
- `seq_orig_name` - the original name/annotation of the gene
- `domain` - taxonomic level
  - All values can be blank/empty
- `phylum` - taxonomic level
  - All values can be blank/empty
- `class` - taxonomic level
  - All values can be blank/empty
- `order` - taxonomic level
  - All values can be blank/empty
- `family` - taxonomic level
  - All values can be blank/empty
- `genus` - taxonomic level
- `species` - taxonomic level
- `taxid` - All values can be blank/empty
- `genome_name` - genome origin of the gene (e.g., assembly accession), if available
  - All values can be blank/empty
- `genome_length_bp` - total assembly length (base pairs) of the genome-of-origin
  - Useful for normalization
  - All values can be blank/empty
Note: the order of the columns in the table does not matter.
No samples table is needed, since only genes will be added.
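As a rough sketch, such a metadata table can be assembled from the gene FASTA headers. The file names (`new_genes.fna`, `metadata.tsv`) are illustrative, and a random hex string stands in for a UUID (any unique ID is accepted):

```shell
# illustrative gene FASTA; headers become seq_orig_name values
printf '>geneA hypothetical protein\nATGAAA\n>geneB transporter\nATGTTT\n' > new_genes.fna

# header row with all required columns (column order does not matter)
printf 'seq_uuid\tseq_orig_name\tdomain\tphylum\tclass\torder\tfamily\tgenus\tspecies\ttaxid\tgenome_name\tgenome_length_bp\n' > metadata.tsv

# one row per sequence; taxonomy columns are left blank here, but fill
# genus/species when known
grep '^>' new_genes.fna | sed 's/^>//' | while read -r name; do
    uuid=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
    printf '%s\t%s\t\t\t\t\t\t\t\t\t\t\n' "$uuid" "$name" >> metadata.tsv
done
gzip -f metadata.tsv   # the config points at the gzipped table
```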
The snakemake pipeline config is a bit different for updating a database versus generating a new database.
Below is an example config:
```yaml
#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- databases to update --#
# Replace "Create" with "Skip" to skip creation of any of these
# Skipping kraken2/bracken, since those are just genome-based, and we are adding genes to the database
databases:
  kraken2: Create            # <-- automatically skipped if no new genomes are provided
  bracken: Create            # <-- automatically skipped if no new genomes are provided
  genes: Create
  humann3_bowtie2: Create
  humann3_diamond: Create

#-- Input --#
#--- If just a set of gene sequences to add ---#
# If you have nucleotide/amino-acid gene sequences formatted for humann
# If translate = True, missing nuc or AA seqs will be (rev)translated from the other; else those seqs are not used
new_genes:
  amino_acid: data/UniRef50/genome_reps_filtered.faa.gz
  nucleotide: data/UniRef50/genome_reps_filtered.fna.gz
  metadata: data/genome_reps_filtered.txt.gz
  translate: True

#--- If a set of genomes to add ---#
# file listing samples and associated data
samples_file: Skip                      # <-- Not needed, since no new genomes
## column names in samples table
samples_col: 'ncbi_organism_name'       # <-- Not used, since no new genomes
accession_col: 'accession'              # <-- Not used, since no new genomes
fasta_file_path_col: 'fasta_file_path'  # <-- Not used, since no new genomes
taxID_col: 'gtdb_taxid'                 # <-- Not used, since no new genomes
taxonomy_col: 'gtdb_taxonomy'           # <-- Not used, since no new genomes

# Saved databases that will be updated
kraken2_db:
  library: tests/output/GTDBr95_n10/kraken2/library/
  taxonomy: tests/output/GTDBr95_n10/kraken2/taxonomy/
genes_db:
  genes:
    mmseqs_db: tests/output/GTDBr95_n10/genes/genes_db.tar.gz
    amino_acid: tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz
    nucleotide: tests/output/GTDBr95_n10/genes/genome_reps_filtered.fna.gz
    metadata: tests/output/GTDBr95_n10/genes/genome_reps_filtered.txt.gz
  cluster:
    mmseqs_db: tests/output/GTDBr95_n10/genes/cluster/clusters_db.tar.gz
humann_db:
  query:
    hits: tests/output/GTDBr95_n10/humann3/annotation_hits.gz
  cluster:
    reps: tests/output/GTDBr95_n10/genes/cluster/clusters_reps.faa.gz
    membership: tests/output/GTDBr95_n10/genes/cluster/clusters_membership.tsv.gz

#-- Output --#
# output location
output_dir: tests/output/GTDBr95_n10-n5/

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended (but takes longer)!
uniref_name: uniref50
# Name of the humann3 diamond database to create
## This must match naming allowed by humann3
dmnd_name: uniref50_201901.dmnd         # UniRef90 is recommended
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## Skip if annotating with UniRef50
cluster_idx: data/uniref50-90.pkl

# temporary file directory (your username will be added automatically)
tmp_dir: tmp/db_update_tmp/

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (e.g., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `temp_folder`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# for humann3, use either mmseqs or diamond (mmseqs gets priority if neither is skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >= 2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster_update: --min-seq-id 0.9 -c 0.8 -s 4.0
  humann3:
    batches: 2
    filter_existing: --min-pident 0   # any existing genes w/ identity < cutoff will be re-queried
    mmseqs_search:
      db: data/UniRef90/uniref90
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: data/uniref90_ec-filtered/uniref90_ec_filt_201901.dmnd
      run: --evalue 1e-3 --query-cover 80 --id 90 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-update
  config: update
```
See here for general notes about the config, regardless of database creation or updating.
- `kraken2_db:`, `genes_db:`, and `humann_db:` specify the locations of the existing database files
  - See the database generation tutorial for how to generate these files
- WARNING: use an `output_dir:` different from where the existing databases are located; otherwise, the existing database files may be overwritten!
See the snakemake docs for general instructions.
First, a dry run:
```bash
snakemake --use-conda -j -Fqn
```
Now, an actual run using 2 cores:
```bash
snakemake --use-conda -j 2 -F
```
See the README for running snakemake on a cluster (recommended).
See the README for details on the output.
A good quick sanity check is to compare the size of the updated databases versus the size of the original database files. The updated file sizes should be larger.
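A sketch of that comparison, using stand-in files; substitute the original and updated database paths (e.g., `tests/output/GTDBr95_n10/` vs `tests/output/GTDBr95_n10-n5/` from the example config):

```shell
# stand-in files; point these at the original vs. updated database files
mkdir -p demo/old demo/new
printf 'original gene database contents\n' > demo/old/genes.faa
printf 'original gene database contents plus the five new genes\n' > demo/new/genes.faa

old_size=$(wc -c < demo/old/genes.faa)
new_size=$(wc -c < demo/new/genes.faa)
if [ "$new_size" -ge "$old_size" ]; then
    echo "OK: updated file is larger ($new_size vs $old_size bytes)"
else
    echo "WARNING: updated file is smaller -- check the pipeline run"
fi
```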