From 9e5f985c8fedaf3672abe1a2e63f95864103f66c Mon Sep 17 00:00:00 2001 From: Wei Shen Date: Fri, 24 Nov 2023 15:01:36 +0000 Subject: [PATCH] add a new tutorial --- README.md | 2 +- doc/docs/download.md | 26 +++++++------ doc/docs/tutorial.md | 65 +++++++++++++++++++++++++++++++ doc/docs/usage.md | 16 ++++++-- example/changed_species_names.txt | 2 + taxonkit/cmd/name2taxid.go | 6 +-- 6 files changed, 98 insertions(+), 19 deletions(-) create mode 100644 example/changed_species_names.txt diff --git a/README.md b/README.md index b555fdc..b071826 100644 --- a/README.md +++ b/README.md @@ -64,7 +64,7 @@ Subcommand |F [`list`](https://bioinf.shenwei.me/taxonkit/usage/#list) |List taxonomic subtrees (TaxIds) bellow given TaxIds [`lineage`](https://bioinf.shenwei.me/taxonkit/usage/#lineage) |Query taxonomic lineage of given TaxIds [`reformat`](https://bioinf.shenwei.me/taxonkit/usage/#reformat) |Reformat lineage in canonical ranks -[`name2taxid`](https://bioinf.shenwei.me/taxonkit/usage/#name2taxid) |Convert scientific names to TaxIds +[`name2taxid`](https://bioinf.shenwei.me/taxonkit/usage/#name2taxid) |Convert taxon names to TaxIds [`filter`](https://bioinf.shenwei.me/taxonkit/usage/#filter) |Filter TaxIds by taxonomic rank range [`lca`](https://bioinf.shenwei.me/taxonkit/usage/#lca) |Compute lowest common ancestor (LCA) for TaxIds [`taxid-changelog`](https://bioinf.shenwei.me/taxonkit/usage/#taxid-changelog)|Create TaxId changelog from dump archives diff --git a/doc/docs/download.md b/doc/docs/download.md index c633ccc..2d69d1b 100644 --- a/doc/docs/download.md +++ b/doc/docs/download.md @@ -6,12 +6,10 @@ ## Current Version -- [TaxonKit v0.15.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.15.0) -[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.15.0/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.15.0) - - `taxonkit reformat`: - - For lineages with more than one node, if it fails to query TaxId 
with the parent-child pair, use the last child only. [#82](https://github.com/shenwei356/taxonkit/issues/82) - - The flag `-T/--trim` also does not add the prefix for missing ranks lower than the current rank. [#82](https://github.com/shenwei356/taxonkit/issues/82) - - New flag `-s/--miss-rank-repl-suffix` to set the suffix for estimated taxon names. [#85](https://github.com/shenwei356/taxonkit/issues/85) +- [TaxonKit v0.15.1](https://github.com/shenwei356/taxonkit/releases/tag/v0.15.1) +[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.15.1/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.15.1) + - `taxonkit name2taxid`: + - remove the restriction of name types. [#87](https://github.com/shenwei356/taxonkit/issues/87) ### Please cite @@ -28,11 +26,11 @@ OS |Arch |File, 中国镜像 |Download Count :------|:---------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -Linux |**64-bit**|[**taxonkit_linux_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_linux_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_linux_amd64.tar.gz) -Linux |**arm64** |[**taxonkit_linux_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_linux_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_linux_arm64.tar.gz) -macOS |**64-bit**|[**taxonkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_darwin_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_darwin_amd64.tar.gz) -macOS |**arm64** |[**taxonkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_darwin_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_darwin_arm64.tar.gz) -Windows|**64-bit**|[**taxonkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_windows_amd64.exe.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.0/taxonkit_windows_amd64.exe.tar.gz) +Linux |**64-bit**|[**taxonkit_linux_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_linux_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_linux_amd64.tar.gz) +Linux |**arm64** |[**taxonkit_linux_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_linux_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_linux_arm64.tar.gz) +macOS |**64-bit**|[**taxonkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_darwin_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_darwin_amd64.tar.gz) +macOS |**arm64** |[**taxonkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_darwin_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_darwin_arm64.tar.gz) +Windows|**64-bit**|[**taxonkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_windows_amd64.exe.tar.gz),
[中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.15.1/taxonkit_windows_amd64.exe.tar.gz) ## Installation @@ -153,6 +151,12 @@ All-in-one command: ## Release history +- [TaxonKit v0.15.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.15.0) +[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.15.0/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.15.0) + - `taxonkit reformat`: + - For lineages with more than one node, if it fails to query TaxId with the parent-child pair, use the last child only. [#82](https://github.com/shenwei356/taxonkit/issues/82) + - The flag `-T/--trim` also does not add the prefix for missing ranks lower than the current rank. [#82](https://github.com/shenwei356/taxonkit/issues/82) + - New flag `-s/--miss-rank-repl-suffix` to set the suffix for estimated taxon names. [#85](https://github.com/shenwei356/taxonkit/issues/85) - [TaxonKit v0.14.2](https://github.com/shenwei356/taxonkit/releases/tag/v0.14.2) [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.14.2/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.14.2) - `taxonkit filter`: diff --git a/doc/docs/tutorial.md b/doc/docs/tutorial.md index b67bfcc..1f7cd3c 100644 --- a/doc/docs/tutorial.md +++ b/doc/docs/tutorial.md @@ -182,6 +182,71 @@ where rank of the closest higher node is still lower than rank cutoff**. species Severe acute respiratory syndrome-related coronavirus strain Severe acute respiratory syndrome coronavirus 2 +## Mapping old species names to new ones + +Some species names in papers or websites might have changed. We can try querying their TaxIds via their old names +and then retrieve the new ones. 
+ + cat example/changed_species_names.txt + Lactobacillus fermentum + Mycoplasma gallinaceum + + # TaxonKit >= v0.15.1 + cat example/changed_species_names.txt \ + | taxonkit name2taxid \ + | taxonkit lineage -i 2 -n \ + | cut -f 1,4 + + Lactobacillus fermentum Limosilactobacillus fermentum + Mycoplasma gallinaceum + +Oops, there is no information for `Mycoplasma gallinaceum`. +Then we check the [taxid-changelog](https://github.com/shenwei356/taxid-changelog). + + zcat taxonkit/taxid-changelog.csv.gz \ + | csvtk grep -f name -P example/changed_species_names.txt \ + | csvtk cut -f taxid,version,change,name,rank \ + | csvtk pretty + + taxid version change name rank + ----- ---------- -------------- ----------------------- ------- + 1613 2013-02-21 NEW Lactobacillus fermentum species + 1613 2016-03-01 ABSORB Lactobacillus fermentum species + 1613 2016-03-01 CHANGE_LIN_LEN Lactobacillus fermentum species + 29556 2013-02-21 NEW Mycoplasma gallinaceum species + 29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species + 29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species + 29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species + +We can see that the names have changed. The full change history can be queried with the TaxId, 
e.g., + + taxid version change change-value name rank + ----- ---------- -------------- ------------ ------------------------- ------- + 29556 2013-02-21 NEW Mycoplasma gallinaceum species + 29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species + 29556 2020-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species + 29556 2020-09-01 CHANGE_LIN_TAX Mycoplasmopsis gallinacea species + 29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species + 29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species + 29556 2021-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species + 29556 2021-09-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species + 29556 2023-03-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species + + +Then we just use their TaxIds to retrieve the new names. **The final commands are**: + + zcat taxonkit/taxid-changelog.csv.gz \ + | csvtk grep -f name -P example/changed_species_names.txt \ + | csvtk uniq -f taxid \ + | csvtk cut -f name,taxid \ + | csvtk del-header \ + | csvtk csv2tab \ + | taxonkit lineage -i 2 -n \ + | cut -f 1,4 + + Lactobacillus fermentum Limosilactobacillus fermentum + Mycoplasma gallinaceum Mycoplasmopsis gallinacea + ## Add taxonomy information to BLAST result An blast result file `blast_result.txt`, where the second column is the accession of matched sequences. 
diff --git a/doc/docs/usage.md b/doc/docs/usage.md index 999e7c7..c3fa695 100644 --- a/doc/docs/usage.md +++ b/doc/docs/usage.md @@ -42,7 +42,7 @@ All-in-one command: ```text TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit -Version: 0.14.2 +Version: 0.15.1 Author: Wei Shen @@ -75,7 +75,7 @@ Available Commands: lca Compute lowest common ancestor (LCA) for TaxIds lineage Query taxonomic lineage of given TaxIds list List taxonomic subtrees of given TaxIds - name2taxid Convert scientific names to TaxIds + name2taxid Convert taxon names to TaxIds profile2cami Convert metagenomic profile table to CAMI format reformat Reformat lineage in canonical ranks taxid-changelog Create TaxId changelog from dump archives @@ -90,6 +90,8 @@ Flags: -j, --threads int number of CPUs. 4 is enough (default 4) --verbose print verbose information +Use "taxonkit [command] --help" for more information about a command. + ``` ## list @@ -999,11 +1001,11 @@ Examples: Usage ```text -Convert scientific names to TaxIds +Convert taxon names to TaxIds Attention: - 1. Some TaxIds share the same scientific names, e.g, Drosophila. + 1. Some TaxIds share the same names, e.g, Drosophila. These input lines are duplicated with multiple TaxIds. $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L @@ -1069,6 +1071,12 @@ Example data uncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Croceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y +1. Convert old names to new names. + + $ echo Lactobacillus fermentum | taxonkit name2taxid | taxonkit lineage -i 2 -n | cut -f 1,2,4 + Lactobacillus fermentum 1613 Limosilactobacillus fermentum + + 1. **Some TaxIds share the same scientific names**, e.g, Drosophila. 
$ echo Drosophila \ diff --git a/example/changed_species_names.txt b/example/changed_species_names.txt new file mode 100644 index 0000000..26f30ed --- /dev/null +++ b/example/changed_species_names.txt @@ -0,0 +1,2 @@ +Lactobacillus fermentum +Mycoplasma gallinaceum diff --git a/taxonkit/cmd/name2taxid.go b/taxonkit/cmd/name2taxid.go index 03f00d8..ecf0e67 100644 --- a/taxonkit/cmd/name2taxid.go +++ b/taxonkit/cmd/name2taxid.go @@ -33,12 +33,12 @@ import ( // name2taxidCmd represents the fx2tab command var name2taxidCmd = &cobra.Command{ Use: "name2taxid", - Short: "Convert scientific names to TaxIds", - Long: `Convert scientific names to TaxIds + Short: "Convert taxon names to TaxIds", + Long: `Convert taxon names to TaxIds Attention: - 1. Some TaxIds share the same scientific names, e.g, Drosophila. + 1. Some TaxIds share the same names, e.g, Drosophila. These input lines are duplicated with multiple TaxIds. $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L