Commit 59609e2

Merge pull request #584 from Ecogenomics/staging
Merge Staging for release of 2.4.0
2 parents 445b740 + 3e1cc69 commit 59609e2

38 files changed, +1129 -538 lines changed

Dockerfile

+16 -6
@@ -1,5 +1,6 @@
 # How to build and deploy the Docker image:
 # docker build --build-arg VER=1.2.3 --no-cache -t ecogenomic/gtdbtk:latest -t ecogenomic/gtdbtk:1.2.3 .
+# docker run -v /host/gtdbtk_io:/data -v /host/release_data:/refdata ecogenomic/gtdbtk classify_wf --genome_dir /data/genomes --out_dir /data/output
 # docker push ecogenomic/gtdbtk:latest && sudo docker push ecogenomic/gtdbtk:1.2.3
 
 FROM python:3.8-slim-bullseye
@@ -15,14 +16,19 @@ RUN apt-get update -y -m && \
 libgomp1 \
 libgsl25 \
 libgslcblas0 \
+build-essential \
+curl \
 hmmer=3.* \
 mash=2.2.* \
 prodigal=1:2.6.* \
 fasttree=2.1.* \
 unzip && \
 apt-get clean && \
 rm -rf /var/lib/apt/lists/* && \
-ln -s /usr/bin/fasttreeMP /usr/bin/FastTreeMP
+ln -s /usr/bin/fasttreeMP /usr/bin/FastTreeMP && \
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+
+ENV PATH="/root/.cargo/bin:${PATH}"
 
 # ---------------------------------------------------------------------------- #
 # ----------------------------- INSTALL PPLACER ------------------------------ #
@@ -34,11 +40,15 @@ RUN wget https://github.com/matsen/pplacer/releases/download/v1.1.alpha19/pplace
 rm -rf pplacer-Linux-v1.1.alpha19
 
 # ---------------------------------------------------------------------------- #
-# ----------------------------- INSTALL FASTANI ------------------------------ #
+# ------------------------------ INSTALL SKANI ------------------------------- #
 # ---------------------------------------------------------------------------- #
-RUN wget https://github.com/ParBLiSS/FastANI/releases/download/v1.32/fastANI-Linux64-v1.32.zip -q && \
-unzip fastANI-Linux64-v1.32.zip -d /usr/bin && \
-rm fastANI-Linux64-v1.32.zip
+
+RUN wget https://github.com/bluenote-1577/skani/archive/refs/tags/v0.2.1.tar.gz
+RUN tar -xvf v0.2.1.tar.gz
+RUN cd skani-0.2.1 && cargo install --path . --root /usr
+RUN chmod +x /usr/bin/skani
+RUN cd ../
+RUN rm -rf v0.2.1.tar.gz skani-0.2.1
 
 # ---------------------------------------------------------------------------- #
 # --------------------- SET GTDB-TK MOUNTED DIRECTORIES ---------------------- #
@@ -51,7 +61,7 @@ ENV GTDBTK_DATA_PATH="/refdata/"
 # --------------------------- INSTALL PIP PACKAGES --------------------------- #
 # ---------------------------------------------------------------------------- #
 RUN python -m pip install --upgrade pip && \
-python -m pip install gtdbtk==${VER}
+python -m pip install gtdbtk==${VER} \
 
 # ---------------------------------------------------------------------------- #
 # ---------------------------- SET THE ENTRYPOINT ---------------------------- #

README.md

+3 -2
@@ -37,8 +37,8 @@ Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDB
 
 ## ✨ New Features
 
-GTDB-Tk v2.3.0+ includes the following new features:
-- New functionality ``convert_to_species`` function to convert GTDB genome IDs to GTDB species names
+GTDB-Tk v2.4.0+ includes the following new features:
+- `FastANI` has been replaced by `skani` as the primary tool for computing Average Nucleotide Identity (ANI).Users may notice slight variations in the results compared to those obtained using `FastANI`.
 
 
 ## 📈 Performance
@@ -63,6 +63,7 @@ We strongly encourage you to cite the following 3rd party dependencies:
 
 * Matsen FA, et al. 2010. [pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree](https://www.ncbi.nlm.nih.gov/pubmed/21034504). <i>BMC Bioinformatics</i>, 11:538.
 * Jain C, et al. 2019. [High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries](https://www.nature.com/articles/s41467-018-07641-9). <i>Nat. Communications</i>, doi: 10.1038/s41467-018-07641-9.
+* Shaw J. and Yu Y.W. 2023. [Fast and robust metagenomic sequence comparison through sparse chaining with skani](https://www.nature.com/articles/s41592-023-02018-3). <i>Nature Methods</i>, 20, pages1661–1665 (2023).
 * Hyatt D, et al. 2010. [Prodigal: prokaryotic gene recognition and translation initiation site identification](https://www.ncbi.nlm.nih.gov/pubmed/20211023). <i>BMC Bioinformatics</i>, 11:119. doi: 10.1186/1471-2105-11-119.
 * Price MN, et al. 2010. [FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/). <i>PLoS One</i>, 5, e9490.
 * Eddy SR. 2011. [Accelerated profile HMM searches](https://www.ncbi.nlm.nih.gov/pubmed/22039361). <i>PLOS Comp. Biol.</i>, 7:e1002195.

docs/src/announcements.rst

+11
@@ -1,6 +1,17 @@
 Announcements
 =============
 
+GTDB-Tk 2.4.0 available
+-----------------------
+
+*April 24, 2024*
+
+* GTDB-Tk version ``2.4.0`` is now available.
+* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package (Release 220).
+`gtdbtk_r220_data.tar.gz <https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/>`_.
+
+
+
 GTDB-Tk 2.3.0 available
 -----------------------
 

docs/src/changelog.rst

+30
@@ -2,6 +2,36 @@
 Change log
 ==========
 
+
+2.4.0
+-----
+
+Bug Fixes:
+
+* (`#576 <https://github.com/Ecogenomics/GTDBTk/issues/576>`_) When all genomes fail the prodigal step in the classify_wf, The
+bac120 summary file is still produced with the all failed genomes listed as 'Unclassified'
+* (`#573 <https://github.com/Ecogenomics/GTDBTk/issues/573>`_) When running the 3 classify steps independently, a genome can be filtered out in the align
+step but still be classified in the identify step. To avoid duplication of row, the genome is classified with a warning.
+* (`#540 <https://github.com/Ecogenomics/GTDBTk/issues/540>`_) Empty files are skipped during the sketch step of Mash,
+they are then catch in the prodigal step and are returned as 'Unclassified'
+* (`#549 <https://github.com/Ecogenomics/GTDBTk/issues/549>`_) : `--force` has been modified to deal with #540. Prodigal
+wasn't returning the empty files as failed genomes, it was only skipping them. These genomes are now returned in the summary file and flagged as Unclassified.
+
+Major Changes:
+
+* FastANI has been replaced by skani as the primary tool for computing Average Nucleotide Identity (ANI).Users may notice slight variations in the results compared to those obtained using FastANI.
+* In the generated `summary.tsv` files, several columns have been renamed for clarity and consistency. The following columns have been affected:
+
+- "`fastani_reference`" column has been renamed to "`closest_genome_reference`".
+- "`fastani_reference_radius`" column has been renamed to "`closest_genome_reference_radius`".
+- "`fastani_taxonomy`" column has been renamed to "`closest_genome_taxonomy`".
+- "`fastani_ani`" column has been renamed to "`closest_genome_ani`".
+- "`fastani_af`" column has been renamed to "`closest_genome_af`".
+
+These changes have been implemented to improve the readability and understanding of the data within the `summary.tsv` files. Users should update their scripts or processes accordingly to reflect these renamed column headers.
+
+
+
 2.3.2
 -----
 
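
Note on the renamed `summary.tsv` columns: downstream scripts that still expect the pre-2.4.0 headers could translate them with a small shim like the one below. This is a minimal sketch, not part of GTDB-Tk. The old/new column pairs come from the change log entry above; the function names `upgrade_header` and `upgrade_summary` are hypothetical, and the `user_genome` column in the usage note is only an illustrative header.

```python
import csv
import io

# Old -> new column names, copied from the 2.4.0 change log entry.
RENAMED_COLUMNS = {
    "fastani_reference": "closest_genome_reference",
    "fastani_reference_radius": "closest_genome_reference_radius",
    "fastani_taxonomy": "closest_genome_taxonomy",
    "fastani_ani": "closest_genome_ani",
    "fastani_af": "closest_genome_af",
}


def upgrade_header(header):
    """Return the header with pre-2.4.0 names replaced by their 2.4.0 names.

    Hypothetical helper: names not listed in RENAMED_COLUMNS pass through unchanged.
    """
    return [RENAMED_COLUMNS.get(name, name) for name in header]


def upgrade_summary(text):
    """Rewrite only the header row of tab-separated text; data rows are untouched."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return text
    rows[0] = upgrade_header(rows[0])
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()
```

For example, `upgrade_header(["user_genome", "fastani_ani"])` yields `["user_genome", "closest_genome_ani"]`. Going the other way (reading 2.4.0 output with an old script) would only need the mapping inverted.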

docs/src/commands/ani_rep.rst

+15 -14
@@ -49,20 +49,21 @@ Output
 
 .. code-block:: text
 
-[2020-04-13 10:51:58] INFO: GTDB-Tk v1.1.0
-[2020-04-13 10:51:58] INFO: gtdbtk ani_rep --genome_dir genomes/ --out_dir ani_rep/ --cpus 70
-[2020-04-13 10:51:58] INFO: Using GTDB-Tk reference data version r89: /release89
-[2020-04-13 10:51:59] INFO: Using Mash version 2.2.2
-[2020-04-13 10:51:59] INFO: Creating Mash sketch file: ani_rep/intermediate_results/mash/gtdbtk.user_query_sketch.msh
-==> Sketching 3 of 3 (100.0%) genomes
-[2020-04-13 10:51:59] INFO: Creating Mash sketch file: ani_rep/intermediate_results/mash/gtdbtk.gtdb_ref_sketch.msh
-==> Sketching 24706 of 24706 (100.0%) genomes
-[2020-04-13 10:53:13] INFO: Calculating Mash distances.
-[2020-04-13 10:53:14] INFO: Calculating ANI with FastANI.
-==> Processing 874 of 874 (100.0%) comparisons.
-[2020-04-13 10:53:23] INFO: Summary of results saved to: ani_rep/gtdbtk.ani_summary.tsv
-[2020-04-13 10:53:23] INFO: Closest representative hits saved to: ani_rep/gtdbtk.ani_closest.tsv
-[2020-04-13 10:53:23] INFO: Done.
+[2024-03-27 16:43:25] INFO: GTDB-Tk v2.3.2
+[2024-03-27 16:43:25] INFO: gtdbtk ani_rep --batchfile genomes/500_batchfile.tsv --out_dir user_vs_reps --cpus 90
+[2024-03-27 16:43:25] INFO: Using GTDB-Tk reference data version r214: /srv/db/gtdbtk/official/release214_skani/release214
+[2024-03-27 16:43:25] INFO: Loading reference genomes.
+[2024-03-27 16:43:25] INFO: Using Mash version 2.2.2
+[2024-03-27 16:43:25] INFO: Creating Mash sketch file: user_vs_reps/intermediate_results/mash/gtdbtk.user_query_sketch.msh
+[2024-03-27 16:43:27] INFO: Completed 500 genomes in 1.42 seconds (351.61 genomes/second).
+[2024-03-27 16:43:27] INFO: Creating Mash sketch file: user_vs_reps/intermediate_results/mash/gtdbtk.gtdb_ref_sketch.msh
+[2024-03-27 16:46:55] INFO: Completed 85,205 genomes in 3.47 minutes (24,519.48 genomes/minute).
+[2024-03-27 16:46:55] INFO: Calculating Mash distances.
+[2024-03-27 16:47:37] INFO: Calculating ANI with skani v0.2.1.
+[2024-03-27 16:47:45] INFO: Completed 4,383 comparisons in 7.68 seconds (570.58 comparisons/second).
+[2024-03-27 16:47:46] INFO: Summary of results saved to: user_vs_reps/gtdbtk.ani_summary.tsv
+[2024-03-27 16:47:46] INFO: Closest representative hits saved to: user_vs_reps/gtdbtk.ani_closest.tsv
+[2024-03-27 16:47:46] INFO: Done.
 
 
 

docs/src/commands/check_install.rst

+2 -2
@@ -42,7 +42,7 @@ Output
 [2020-11-04 09:35:16] INFO: Checking that all third-party software are on the system path:
 [2020-11-04 09:35:16] INFO: |-- FastTree OK
 [2020-11-04 09:35:16] INFO: |-- FastTreeMP OK
-[2020-11-04 09:35:16] INFO: |-- fastANI OK
+[2020-11-04 09:35:16] INFO: |-- skani OK
 [2020-11-04 09:35:16] INFO: |-- guppy OK
 [2020-11-04 09:35:16] INFO: |-- hmmalign OK
 [2020-11-04 09:35:16] INFO: |-- hmmsearch OK
@@ -57,6 +57,6 @@ Output
 [2020-11-04 09:35:20] INFO: |-- msa OK
 [2020-11-04 09:35:20] INFO: |-- metadata OK
 [2020-11-04 09:35:20] INFO: |-- taxonomy OK
-[2020-11-04 09:47:36] INFO: |-- fastani OK
+[2020-11-04 09:47:36] INFO: |-- skani OK
 [2020-11-04 09:47:36] INFO: |-- mrca_red OK
 [2020-11-04 09:47:36] INFO: Done.
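
Note on the path check: the log above reports whether each third-party executable is on the system path, with `skani` now taking the place of `fastANI`. A rough standalone equivalent, useful for confirming an environment before upgrading, might look like the sketch below. It is not GTDB-Tk's own `check_install` code; the helper name `find_missing` is hypothetical, and the executable list contains only the names visible in this excerpt of the log.

```python
import shutil

# Executable names as they appear in the check_install output excerpt above.
# The real check covers more tools than are visible in this hunk.
REQUIRED_EXECUTABLES = [
    "FastTree",
    "FastTreeMP",
    "skani",
    "guppy",
    "hmmalign",
    "hmmsearch",
]


def find_missing(executables, path=None):
    """Return the names that cannot be resolved on the search path.

    Hypothetical helper: `path` defaults to the PATH environment variable,
    which is how shutil.which behaves when given path=None.
    """
    return [name for name in executables if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_EXECUTABLES)
    for name in REQUIRED_EXECUTABLES:
        print(f"|-- {name:<12} {'MISSING' if name in missing else 'OK'}")
```

An empty list from `find_missing` means every listed tool resolved; it says nothing about tool versions or about the reference data checks that appear later in the log.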
