-
Notifications
You must be signed in to change notification settings - Fork 63
fix: fix dna/rna local blast #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix dna/rna local blast #111
Conversation
Summary of ChangesHello @CHERRY-ui8, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refines the local BLAST search capabilities for both DNA and RNA sequences. It introduces more robust methods for identifying molecule types and extracting sequence information from NCBI, updates configuration paths for local databases, and completely revamps the RNA BLAST database creation process to leverage RNAcentral for improved consistency and flexibility. Additionally, it provides a new option for building smaller, species-specific DNA BLAST databases and updates example inputs to align with these enhancements. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces significant enhancements to the ncbi_searcher model and refactors the local BLAST database generation scripts. Key changes in ncbi_searcher.py include a new _infer_molecule_type_detail method for more precise molecule type identification, improved fallback logic for extracting representative accessions, and a separation of metadata and sequence extraction into _extract_metadata_from_genbank and _extract_sequence_from_fasta functions for better reliability, especially for CON-type records. The build_dna_blast_db.sh script now includes a human_mouse download option, which is also the new default, allowing users to create a minimal database containing only Homo sapiens and Mus musculus sequences by partially downloading and checking files. The build_rna_blast_db.sh script has been completely refactored to download RNA sequences from RNAcentral instead of NCBI RefSeq, providing options for a full active database or specific subsets, ensuring consistency with online RNAcentral searches. Configuration files (search_dna_config.yaml, search_rna_config.yaml) were updated with new default local BLAST database paths, and example input files were modified. A uv.lock file was added, and .DS_Store entries were included in .gitignore. Review comments focused on improving the robustness and efficiency of the shell scripts, specifically recommending the use of mktemp to prevent race conditions with temporary files in build_dna_blast_db.sh, and suggesting optimizations for fetching database listings and simplifying release version parsing in build_rna_blast_db.sh.
| check_file_for_species() { | ||
| local url=$1 | ||
| local filename=$2 | ||
| local temp_file="/tmp/check_${filename//\//_}" | ||
|
|
||
| # Download first 500KB (enough to get many sequence headers) | ||
| # This should be sufficient to identify the species in most cases | ||
| if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then | ||
| # Try to decompress and check for species names | ||
| if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then | ||
| rm -f "${temp_file}" | ||
| return 0 # Contains target species | ||
| else | ||
| rm -f "${temp_file}" | ||
| return 1 # Does not contain target species | ||
| fi | ||
| else | ||
| # If partial download fails, skip this file (don't download it) | ||
| rm -f "${temp_file}" | ||
| return 1 | ||
| fi | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check_file_for_species function uses a predictable temporary file path in /tmp. This can lead to race conditions and unexpected behavior if the script is run multiple times concurrently. It's a better and safer practice to use mktemp to create a unique temporary file. Using trap also simplifies cleanup logic by ensuring the temporary file is removed when the function exits.
check_file_for_species() {
local url=$1
local filename=$2
local temp_file
temp_file=$(mktemp)
trap 'rm -f "${temp_file}"' RETURN
# Download first 500KB (enough to get many sequence headers)
# This should be sufficient to identify the species in most cases
if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
# Try to decompress and check for species names
if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then
return 0 # Contains target species
else
return 1 # Does not contain target species
fi
else
# If partial download fails, skip this file (don't download it)
return 1
fi
}| # Get list of files and save to temp file to avoid subshell issues | ||
| curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \ | ||
| grep -oE 'href="[^"]*\.genomic\.fna\.gz"' | \ | ||
| sed 's/href="\(.*\)"/\1/' > /tmp/refseq_files.txt | ||
|
|
||
| file_count=0 | ||
| download_count=0 | ||
|
|
||
| while read filename; do | ||
| file_count=$((file_count + 1)) | ||
| url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}" | ||
| echo -n "[${file_count}] Checking ${filename}... " | ||
|
|
||
| if check_file_for_species "${url}" "${filename}"; then | ||
| echo "✓ contains target species, downloading..." | ||
| download_count=$((download_count + 1)) | ||
| wget -c -q --show-progress "${url}" || { | ||
| echo "Warning: Failed to download ${filename}" | ||
| } | ||
| else | ||
| echo "✗ skipping (no human/mouse data)" | ||
| fi | ||
| done < /tmp/refseq_files.txt | ||
|
|
||
| rm -f /tmp/refseq_files.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script uses a hardcoded temporary file /tmp/refseq_files.txt. If multiple instances of this script are run simultaneously, they will interfere with each other by overwriting this file, leading to incorrect downloads or failures. You should use mktemp to create a temporary file with a unique name to avoid this race condition.
# Get list of files and save to temp file to avoid subshell issues
file_list=$(mktemp)
curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \
grep -oE 'href="[^\"]*\.genomic\.fna\.gz"' | \
sed 's/href="\(.*\)"/\1/' > "${file_list}"
file_count=0
download_count=0
while read filename; do
file_count=$((file_count + 1))
url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}"
echo -n "[${file_count}] Checking ${filename}... "
if check_file_for_species "${url}" "${filename}"; then
echo "✓ contains target species, downloading..."
download_count=$((download_count + 1))
wget -c -q --show-progress "${url}" || {
echo "Warning: Failed to download ${filename}"
}
else
echo "✗ skipping (no human/mouse data)"
fi
done < "${file_list}"
rm -f "${file_list}"Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
LGTM |
* fix: fix dna/rna local blast * Update scripts/search/build_db/build_rna_blast_db.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update scripts/search/build_db/build_rna_blast_db.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix: fix pylint problems --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This pull request refines the local BLAST search capabilities for both DNA and RNA sequences.