fix: fix dna/rna local blast #111
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -177,3 +177,7 @@ cache | |
| *.pyc | ||
| *.html | ||
| .gradio | ||
|
|
||
| # macOS | ||
| .DS_Store | ||
| **/.DS_Store | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,5 +13,5 @@ pipeline: | |
| email: [email protected] # NCBI requires an email address | ||
| tool: GraphGen # tool name for NCBI API | ||
| use_local_blast: true # whether to use local blast for DNA search | ||
| local_blast_db: /your_path/refseq_241 # path to local BLAST database (without .nhr extension) | ||
| local_blast_db: refseq_release/refseq_release # path to local BLAST database (without .nhr extension) | ||
|
|
||
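The `local_blast_db` value is a database prefix, not a file name. A minimal pre-flight check can be sketched as follows, assuming a nucleotide database whose files sit next to the prefix (`.nhr` for a single volume, a `.nal` alias file for a multi-volume database). `blast_db_exists` is a hypothetical helper for illustration and is not part of GraphGen.

```shell
# blast_db_exists PREFIX
# Succeeds if PREFIX looks like a nucleotide BLAST database prefix.
# Hypothetical helper for illustration; not part of GraphGen.
blast_db_exists() {
    db=$1
    # Single-volume database: PREFIX.nhr; multi-volume: PREFIX.nal alias file.
    [ -f "${db}.nhr" ] || [ -f "${db}.nal" ]
}

# Example: prefix taken from the config shown above.
if blast_db_exists "refseq_release/refseq_release"; then
    echo "BLAST database found"
else
    echo "BLAST database not found; build it first (for example with build_dna_blast_db.sh)" >&2
fi
```

Checking the prefix up front gives a clearer error than letting the first search fail.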
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,9 +1,4 @@ | ||
| {"type": "text", "content": "TP53"} | ||
| {"type": "text", "content": "BRCA1"} | ||
| {"type": "text", "content": "672"} | ||
| {"type": "text", "content": "11998"} | ||
| {"type": "text", "content": "NM_000546"} | ||
| {"type": "text", "content": "NM_024140"} | ||
| {"type": "text", "content": ">query\nCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCAGCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACACTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTACCACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGAGGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGGAACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGAAGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACTAAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCACTGGATGGAGAATATTTCACCCTTCAGATCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGCCTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCACTCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCATGTTCAAGACAGAAGGGCCTGACTCAGACTGACATTCTCCACTTCTTGTTCCCCACTGACAGCCTCCCACCCCCATCTCTCCCTCCCCTGCCATTTTGGGTTTTGGGTCTTTGAACCCTTGCTTGCAATAGGTGTGCGTCAGAAGCACCCAGGACTTCCATTTGCTTTGTCCCGGGGCTCCACTGAACAAGTTGGCCTGCACTGGTGTTTTGTTGTGGGGAGGAGGATGGGGAGTAGGACATACCAGCTTAGATTTTAAGGTTTTTACTGTGAGGGATGTTTGGGAGATGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAATCAGCCACATTCTAGGTAGGGGCCCACTTCACCGTACTAACCAGGGAAGCTGTCCCTCACTGTTGAATTTTCTCTAACTTCAAGGCCCATATCTGTGAAATGCTGGCATTTGCACCTACCTCACAGAGTGCATTGTGAGGGTTAATGAAATAATGTACATCTGGCCTTGAAACCACCTTTTATTACATGGGGTCTAGAACTTGACCCCCTTGAGGGTGCTTGTTCCCTCTCCCTGTTGGTCGGTGGGTTGGTAGTTTCTACAGTTGGGCAGCTGGTTAGGTAGAGGGAGTTGTCAAGTCTCTGCTGGCCCAGCCAAACCCTGTCTGACAACCTCTTGGTGAAC
CTTAGTACCTAAAAGGAAATCTCACCCCATCCCACACCCTGGAGGATTTCATCTCTTGTATATGATGATCTGGATCCACCAAGACTTGTTTTATGCTCAGGGTCAATTTCTTTTTTCTTTTTTTTTTTTTTTTTTCTTTTTCTTTGAGACTGGGTCTCGCTTTGTTGCCCAGGCTGGAGTGGAGTGGCGTGATCTTGGCTTACTGCAGCCTTTGCCTCCCCGGCTCGAGCAGTCCTGCCTCAGCCTCCGGAGTAGCTGGGACCACAGGTTCATGCCACCATGGCCAGCCAACTTTTGCATGTTTTGTAGAGATGGGGTCTCACAGTGTTGCCCAGGCTGGTCTCAAACTCCTGGGCTCAGGCGATCCACCTGTCTCAGCCTCCCAGAGTGCTGGGATTACAATTGTGAGCCACCACGTCCAGCTGGAAGGGTCAACATCTTTTACATTCTGCAAGCACATCTGCATTTTCACCCCACCCTTCCCCTCCTTCTCCCTTTTTATATCCCATTTTTATATCGATCTCTTATTTTACAATAAAACTTTGCTGCCA"} | ||
| {"type": "text", "content": "CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCAGCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACACTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTACCACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGAGGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGGAACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGAAGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACTAAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCACTGGATGGAGAATATTTCACCCTTCAGATCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGCCTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCACTCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCATGTTCAAGACAGAAGGGCCTGACTCAGACTGACATTCTCCACTTCTTGTTCCCCACTGACAGCCTCCCACCCCCATCTCTCCCTCCCCTGCCATTTTGGGTTTTGGGTCTTTGAACCCTTGCTTGCAATAGGTGTGCGTCAGAAGCACCCAGGACTTCCATTTGCTTTGTCCCGGGGCTCCACTGAACAAGTTGGCCTGCACTGGTGTTTTGTTGTGGGGAGGAGGATGGGGAGTAGGACATACCAGCTTAGATTTTAAGGTTTTTACTGTGAGGGATGTTTGGGAGATGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAATCAGCCACATTCTAGGTAGGGGCCCACTTCACCGTACTAACCAGGGAAGCTGTCCCTCACTGTTGAATTTTCTCTAACTTCAAGGCCCATATCTGTGAAATGCTGGCATTTGCACCTACCTCACAGAGTGCATTGTGAGGGTTAATGAAATAATGTACATCTGGCCTTGAAACCACCTTTTATTACATGGGGTCTAGAACTTGACCCCCTTGAGGGTGCTTGTTCCCTCTCCCTGTTGGTCGGTGGGTTGGTAGTTTCTACAGTTGGGCAGCTGGTTAGGTAGAGGGAGTTGTCAAGTCTCTGCTGGCCCAGCCAAACCCTGTCTGACAACCTCTTGGTGAACCTTAGTAC
CTAAAAGGAAATCTCACCCCATCCCACACCCTGGAGGATTTCATCTCTTGTATATGATGATCTGGATCCACCAAGACTTGTTTTATGCTCAGGGTCAATTTCTTTTTTCTTTTTTTTTTTTTTTTTTCTTTTTCTTTGAGACTGGGTCTCGCTTTGTTGCCCAGGCTGGAGTGGAGTGGCGTGATCTTGGCTTACTGCAGCCTTTGCCTCCCCGGCTCGAGCAGTCCTGCCTCAGCCTCCGGAGTAGCTGGGACCACAGGTTCATGCCACCATGGCCAGCCAACTTTTGCATGTTTTGTAGAGATGGGGTCTCACAGTGTTGCCCAGGCTGGTCTCAAACTCCTGGGCTCAGGCGATCCACCTGTCTCAGCCTCCCAGAGTGCTGGGATTACAATTGTGAGCCACCACGTCCAGCTGGAAGGGTCAACATCTTTTACATTCTGCAAGCACATCTGCATTTTCACCCCACCCTTCCCCTCCTTCTCCCTTTTTATATCCCATTTTTATATCGATCTCTTATTTTACAATAAAACTTTGCTGCCA"} | ||
|
|
||
| {"type": "text", "content": "NG_033923"} | ||
| {"type": "text", "content": "NG_056118"} | ||
| {"type": "text", "content": ">query\nACTCAATTGTCCCAGCAGCATCTACCGAAAAGCCCCCTTGCTGTTCCTGCCAACTTGAAGCCCGGAGGCCTGCTGGGAGGAGGAATTCTAAATGACAAGTATGCCTGGAAAGCTGTGGTCCAAGGCCGTTTTTGCCGTCAGCAGGATCTCCAGAACCAAAGGGAGGACACAGCTCTTCTTAAAACTGAAGGTATTTATGGCTGACATAAAATGAGATTTGATTTGGGCAGGAAATGCGCTTATGTGTACAAAGAATAATACTGACTCCTGGCAGCAAACCAAACAAAACCAGAGTAAGGTGGAGAAAGGTAACGTGTGCCCACGGAAACAGTGGCACAATGTGTGCCTAATTCCAAAGCAGCCGTCCTGCTTAGGCCACTAGTCACGGCGGCTCTGTGATGCTGTACTCCTCAAGGATTTGAACTAATGAAAAGTAAATAAATACCAGTAAAAGTGGATTTGTAAAAAGAAAAGAAAAATGATAGGAAAAGCCCCTTTACCATATGTCAAGGGTTTATGCTG"} | ||
| {"type": "text", "content": "ACTCAATTGTCCCAGCAGCATCTACCGAAAAGCCCCCTTGCTGTTCCTGCCAACTTGAAGCCCGGAGGCCTGCTGGGAGGAGGAATTCTAAATGACAAGTATGCCTGGAAAGCTGTGGTCCAAGGCCGTTTTTGCCGTCAGCAGGATCTCCAGAACCAAAGGGAGGACACAGCTCTTCTTAAAACTGAAGGTATTTATGGCTGACATAAAATGAGATTTGATTTGGGCAGGAAATGCGCTTATGTGTACAAAGAATAATACTGACTCCTGGCAGCAAACCAAACAAAACCAGAGTAAGGTGGAGAAAGGTAACGTGTGCCCACGGAAACAGTGGCACAATGTGTGCCTAATTCCAAAGCAGCCGTCCTGCTTAGGCCACTAGTCACGGCGGCTCTGTGATGCTGTACTCCTCAAGGATTTGAACTAATGAAAAGTAAATAAATACCAGTAAAAGTGGATTTGTAAAAAGAAAAGAAAAATGATAGGAAAAGCCCCTTTACCATATGTCAAGGGTTTATGCTG"} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,8 @@ | ||
| {"type": "text", "content": "hsa-let-7a-1"} | ||
| {"type": "text", "content": "XIST regulator"} | ||
| {"type": "text", "content": "URS0000123456"} | ||
| {"type": "text", "content": "URS0000000001"} | ||
| {"type": "text", "content": "URS0000000787"} | ||
| {"type": "text", "content": "GCAGTTCTCAGCCATGACAGATGGGAGTTTCGGCCCAATTGACCAGTATTCCTTACTGATAAGAGACACTGACCATGGAGTGGTTCTGGTGAGATGACATGACCCTCGTGAAGGGGCCTGAAGCTTCATTGTGTTTGTGTATGTTTCTCTCTTCAAAAATATTCATGACTTCTCCTGTAGCTTGATAAATATGTATATTTACACACTGCA"} | ||
| {"type": "text", "content": ">query\nCUCCUUUGACGUUAGCGGCGGACGGGUUAGUAACACGUGGGUAACCUACCUAUAAGACUGGGAUAACUUCGGGAAACCGGAGCUAAUACCGGAUAAUAUUUCGAACCGCAUGGUUCGAUAGUGAAAGAUGGUUUUGCUAUCACUUAUAGAUGGACCCGCGCCGUAUUAGCUAGUUGGUAAGGUAACGGCUUACCAAGGCGACGAUACGUAGCCGACCUGAGAGGGUGAUCGGCCACACUGGAACUGAGACACGGUCCAGACUCCUACGGGAGGCAGCAGGGG"} | ||
| {"type": "text", "content": "CUCCUUUGACGUUAGCGGCGGACGGGUUAGUAACACGUGGGUAACCUACCUAUAAGACUGGGAUAACUUCGGGAAACCGGAGCUAAUACCGGAUAAUAUUUCGAACCGCAUGGUUCGAUAGUGAAAGAUGGUUUUGCUAUCACUUAUAGAUGGACCCGCGCCGUAUUAGCUAGUUGGUAAGGUAACGGCUUACCAAGGCGACGAUACGUAGCCGACCUGAGAGGGUGAUCGGCCACACUGGAACUGAGACACGGUCCAGACUCCUACGGGAGGCAGCAGGGG"} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,7 +24,8 @@ set -e | |
| # - {category}.{number}.genomic.fna.gz (genomic sequences) | ||
| # - {category}.{number}.rna.fna.gz (RNA sequences) | ||
| # | ||
| # Usage: ./build_dna_blast_db.sh [representative|complete|all] | ||
| # Usage: ./build_dna_blast_db.sh [human_mouse|representative|complete|all] | ||
| # human_mouse: Download only Homo sapiens and Mus musculus sequences (minimal, smallest) | ||
| # representative: Download genomic sequences from major categories (recommended, smaller) | ||
| # Includes: vertebrate_mammalian, vertebrate_other, bacteria, archaea, fungi | ||
| # complete: Download all complete genomic sequences from complete/ directory (very large) | ||
|
|
@@ -35,7 +36,7 @@ set -e | |
| # For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+ | ||
| # Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ | ||
|
|
||
| DOWNLOAD_TYPE=${1:-representative} | ||
| DOWNLOAD_TYPE=${1:-human_mouse} | ||
|
|
||
| # Better to use a stable DOWNLOAD_TMP name to support resuming downloads | ||
| DOWNLOAD_TMP=_downloading_dna | ||
|
|
@@ -57,8 +58,66 @@ else | |
| echo "Using date as release identifier: ${RELEASE}" | ||
| fi | ||
|
|
||
| # Function to check if a file contains target species | ||
| check_file_for_species() { | ||
| local url=$1 | ||
| local filename=$2 | ||
| local temp_file="/tmp/check_${filename//\//_}" | ||
|
|
||
| # Download first 500KB (enough to get many sequence headers) | ||
| # This should be sufficient to identify the species in most cases | ||
| if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then | ||
| # Try to decompress and check for species names | ||
| if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then | ||
| rm -f "${temp_file}" | ||
| return 0 # Contains target species | ||
| else | ||
| rm -f "${temp_file}" | ||
| return 1 # Does not contain target species | ||
| fi | ||
| else | ||
| # If partial download fails, skip this file (don't download it) | ||
| rm -f "${temp_file}" | ||
| return 1 | ||
| fi | ||
| } | ||
|
|
||
| # Download based on type | ||
| case ${DOWNLOAD_TYPE} in | ||
| human_mouse) | ||
| echo "Downloading RefSeq sequences for Homo sapiens and Mus musculus only (minimal size)..." | ||
| echo "This will check each file to see if it contains human or mouse sequences..." | ||
| category="vertebrate_mammalian" | ||
| echo "Checking files in ${category} category..." | ||
|
|
||
| # Get list of files and save to temp file to avoid subshell issues | ||
| curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \ | ||
| grep -oE 'href="[^"]*\.genomic\.fna\.gz"' | \ | ||
| sed 's/href="\(.*\)"/\1/' > /tmp/refseq_files.txt | ||
|
|
||
| file_count=0 | ||
| download_count=0 | ||
|
|
||
| while read filename; do | ||
| file_count=$((file_count + 1)) | ||
| url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}" | ||
| echo -n "[${file_count}] Checking ${filename}... " | ||
|
|
||
| if check_file_for_species "${url}" "${filename}"; then | ||
| echo "✓ contains target species, downloading..." | ||
| download_count=$((download_count + 1)) | ||
| wget -c -q --show-progress "${url}" || { | ||
| echo "Warning: Failed to download ${filename}" | ||
| } | ||
| else | ||
| echo "✗ skipping (no human/mouse data)" | ||
| fi | ||
| done < /tmp/refseq_files.txt | ||
|
|
||
| rm -f /tmp/refseq_files.txt | ||
|
Comment on lines +93 to +117
Contributor
The script uses a hardcoded temporary file, /tmp/refseq_files.txt, for the file list. The suggested change creates that file with mktemp instead:
# Get list of files and save to temp file to avoid subshell issues
file_list=$(mktemp)
curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \
grep -oE 'href="[^\"]*\.genomic\.fna\.gz"' | \
sed 's/href="\(.*\)"/\1/' > "${file_list}"
file_count=0
download_count=0
while read filename; do
file_count=$((file_count + 1))
url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}"
echo -n "[${file_count}] Checking ${filename}... "
if check_file_for_species "${url}" "${filename}"; then
echo "✓ contains target species, downloading..."
download_count=$((download_count + 1))
wget -c -q --show-progress "${url}" || {
echo "Warning: Failed to download ${filename}"
}
else
echo "✗ skipping (no human/mouse data)"
fi
done < "${file_list}"
rm -f "${file_list}"
||
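The "avoid subshell issues" remark refers to a common shell pitfall: in bash's default mode, and in most sh implementations, the loop at the end of a pipeline runs in a subshell, so counters updated inside `... | while read` are not visible after the loop. The sketch below contrasts the two forms; the file names are made up for illustration.

```shell
# Hypothetical file names, for illustration only.
list=$(mktemp)
printf '%s\n' a.genomic.fna.gz b.genomic.fna.gz > "${list}"

# Piped loop: in bash's default mode the loop body runs in a subshell,
# so the increment is not visible afterwards.
piped_count=0
cat "${list}" | while IFS= read -r filename; do
    piped_count=$((piped_count + 1))
done

# Redirected loop: runs in the current shell, so the counter survives.
file_count=0
while IFS= read -r filename; do
    file_count=$((file_count + 1))
done < "${list}"

rm -f "${list}"
echo "piped: ${piped_count}, redirected: ${file_count}"
```

Some shells (zsh, ksh, or bash with `lastpipe` enabled) run the last pipeline element in the current shell, so the piped counter can differ between environments. The redirected form behaves the same everywhere, which is why the script writes the list to a file first.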
| echo "" | ||
| echo "Summary: Checked ${file_count} files, downloaded ${download_count} files containing human or mouse sequences." | ||
| ;; | ||
| representative) | ||
| echo "Downloading RefSeq representative sequences (recommended, smaller size)..." | ||
| # Download major categories for representative coverage | ||
|
|
@@ -109,7 +168,11 @@ case ${DOWNLOAD_TYPE} in | |
| ;; | ||
| *) | ||
| echo "Error: Unknown download type '${DOWNLOAD_TYPE}'" | ||
| echo "Usage: $0 [representative|complete|all]" | ||
| echo "Usage: $0 [human_mouse|representative|complete|all]" | ||
| echo " human_mouse: Download only Homo sapiens and Mus musculus (minimal)" | ||
| echo " representative: Download major categories (recommended)" | ||
| echo " complete: Download all complete genomic sequences (very large)" | ||
| echo " all: Download all genomic sequences (extremely large)" | ||
| echo "Note: For RNA sequences, use build_rna_blast_db.sh instead" | ||
| exit 1 | ||
| ;; | ||
|
|
||
The `check_file_for_species` function uses a predictable temporary file path in `/tmp`. This can lead to race conditions and unexpected behavior if the script is run multiple times concurrently. It's a better and safer practice to use `mktemp` to create a unique temporary file. Using `trap` also simplifies cleanup logic by ensuring the temporary file is removed when the function exits.
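A minimal sketch of the `mktemp` plus `trap` approach suggested in this comment, not the change actually merged. Two details are assumptions made for illustration: the download is split into a hypothetical `fetch_prefix` helper so the check can be tried offline, and the function body is written as a subshell so the `EXIT` trap fires when the function returns without affecting the caller's traps.

```shell
# Hypothetical helper: download only the first ~500KB of a remote file.
fetch_prefix() {
    curl -s --max-time 30 --range 0-512000 "$1" -o "$2" 2>/dev/null
}

# Same check as in the script, but with a unique temporary file.
# The body is a subshell "( ... )", so the EXIT trap runs on every
# return path and does not leak into the calling shell.
check_file_for_species() (
    url=$1
    temp_file=$(mktemp) || exit 1
    trap 'rm -f "${temp_file}"' EXIT

    if ! fetch_prefix "${url}" "${temp_file}" || [ ! -s "${temp_file}" ]; then
        exit 1   # partial download failed or was empty: skip this file
    fi

    # Exit status of the function is grep's: 0 if a target species is found.
    gunzip -c "${temp_file}" 2>/dev/null | head -2000 |
        grep -qE "(Homo sapiens|Mus musculus)"
)
```

Because cleanup lives in the trap, the three separate `rm -f` calls in the original function are no longer needed, and concurrent runs cannot overwrite each other's temporary files.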