fix: fix dna/rna local blast #111

CHERRY-ui8 · 2025-12-04T17:02:58Z

This pull request refines the local BLAST search capabilities for both DNA and RNA sequences.

gemini-code-assist · 2025-12-04T17:03:24Z

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the local BLAST search capabilities for both DNA and RNA sequences. It introduces more robust methods for identifying molecule types and extracting sequence information from NCBI, updates configuration paths for local databases, and completely revamps the RNA BLAST database creation process to leverage RNAcentral for improved consistency and flexibility. Additionally, it provides a new option for building smaller, species-specific DNA BLAST databases and updates example inputs to align with these enhancements.

Highlights

Local BLAST Database Configuration: Updated default paths for local DNA and RNA BLAST databases in configuration files, reflecting changes in how these databases are built and named.
Improved NCBI Searcher Logic: Enhanced the ncbi_searcher.py to more accurately infer molecule types (DNA/RNA) from accession numbers and gene types. It also now uses separate methods for extracting metadata (from GenBank) and sequence data (from FASTA) for increased reliability, especially for complex records.
Refactored DNA BLAST Database Script: The build_dna_blast_db.sh script now includes a new human_mouse option, allowing users to build a smaller, more targeted DNA BLAST database containing only Homo sapiens and Mus musculus sequences. It also features a pre-download check to filter files by species.
Switched RNA BLAST Database Source to RNAcentral: The build_rna_blast_db.sh script has been completely overhauled to download and build local BLAST databases from RNAcentral instead of NCBI RefSeq RNA. This change ensures consistency with online RNAcentral searches and offers flexible options to download the full active database or specific subsets (e.g., RefSeq, Rfam, miRBase).
Updated Example Input Files: The example input files for DNA and RNA searches (search_dna_demo.jsonl, search_rna_demo.jsonl) have been revised to reflect the updated search capabilities and database structures.
Dependency Lock File Added: A new uv.lock file has been added, indicating changes in dependency management, likely for consistent environment setup.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces significant enhancements to the ncbi_searcher model and refactors the local BLAST database generation scripts. Key changes in ncbi_searcher.py include a new _infer_molecule_type_detail method for more precise molecule type identification, improved fallback logic for extracting representative accessions, and a separation of metadata and sequence extraction into _extract_metadata_from_genbank and _extract_sequence_from_fasta functions for better reliability, especially for CON-type records. The build_dna_blast_db.sh script now includes a human_mouse download option, which is also the new default, allowing users to create a minimal database containing only Homo sapiens and Mus musculus sequences by partially downloading and checking files. The build_rna_blast_db.sh script has been completely refactored to download RNA sequences from RNAcentral instead of NCBI RefSeq, providing options for a full active database or specific subsets, ensuring consistency with online RNAcentral searches. Configuration files (search_dna_config.yaml, search_rna_config.yaml) were updated with new default local BLAST database paths, and example input files were modified. A uv.lock file was added, and .DS_Store entries were included in .gitignore. Review comments focused on improving the robustness and efficiency of the shell scripts, specifically recommending the use of mktemp to prevent race conditions with temporary files in build_dna_blast_db.sh, and suggesting optimizations for fetching database listings and simplifying release version parsing in build_rna_blast_db.sh.

gemini-code-assist · 2025-12-04T17:06:46Z

scripts/search/build_db/build_dna_blast_db.sh

+check_file_for_species() {
+    local url=$1
+    local filename=$2
+    local temp_file="/tmp/check_${filename//\//_}"
+
+    # Download first 500KB (enough to get many sequence headers)
+    # This should be sufficient to identify the species in most cases
+    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
+        # Try to decompress and check for species names
+        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then
+            rm -f "${temp_file}"
+            return 0  # Contains target species
+        else
+            rm -f "${temp_file}"
+            return 1  # Does not contain target species
+        fi
+    else
+        # If partial download fails, skip this file (don't download it)
+        rm -f "${temp_file}"
+        return 1
+    fi
+}


The check_file_for_species function uses a predictable temporary file path in /tmp. This can lead to race conditions and unexpected behavior if the script is run multiple times concurrently. It's a better and safer practice to use mktemp to create a unique temporary file. Using trap also simplifies cleanup logic by ensuring the temporary file is removed when the function exits.

check_file_for_species() { local url=$1 local filename=$2 local temp_file temp_file=$(mktemp) trap 'rm -f "${temp_file}"' RETURN # Download first 500KB (enough to get many sequence headers) # This should be sufficient to identify the species in most cases if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then # Try to decompress and check for species names if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then return 0 # Contains target species else return 1 # Does not contain target species fi else # If partial download fails, skip this file (don't download it) return 1 fi }

gemini-code-assist · 2025-12-04T17:06:46Z

scripts/search/build_db/build_dna_blast_db.sh

+        # Get list of files and save to temp file to avoid subshell issues
+        curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \
+            grep -oE 'href="[^"]*\.genomic\.fna\.gz"' | \
+            sed 's/href="\(.*\)"/\1/' > /tmp/refseq_files.txt
+
+        file_count=0
+        download_count=0
+
+        while read filename; do
+            file_count=$((file_count + 1))
+            url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}"
+            echo -n "[${file_count}] Checking ${filename}... "
+
+            if check_file_for_species "${url}" "${filename}"; then
+                echo "✓ contains target species, downloading..."
+                download_count=$((download_count + 1))
+                wget -c -q --show-progress "${url}" || {
+                    echo "Warning: Failed to download ${filename}"
+                }
+            else
+                echo "✗ skipping (no human/mouse data)"
+            fi
+        done < /tmp/refseq_files.txt
+
+        rm -f /tmp/refseq_files.txt


The script uses a hardcoded temporary file /tmp/refseq_files.txt. If multiple instances of this script are run simultaneously, they will interfere with each other by overwriting this file, leading to incorrect downloads or failures. You should use mktemp to create a temporary file with a unique name to avoid this race condition.

# Get list of files and save to temp file to avoid subshell issues file_list=$(mktemp) curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \ grep -oE 'href="[^\"]*\.genomic\.fna\.gz"' | \ sed 's/href="$.*$"/\1/' > "${file_list}" file_count=0 download_count=0 while read filename; do file_count=$((file_count + 1)) url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}" echo -n "[${file_count}] Checking ${filename}... " if check_file_for_species "${url}" "${filename}"; then echo "✓ contains target species, downloading..." download_count=$((download_count + 1)) wget -c -q --show-progress "${url}" || { echo "Warning: Failed to download ${filename}" } else echo "✗ skipping (no human/mouse data)" fi done < "${file_list}" rm -f "${file_list}"

scripts/search/build_db/build_rna_blast_db.sh

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

ChenZiHong-Gavin · 2025-12-05T06:11:25Z

LGTM

* fix: fix dna/rna local blast * Update scripts/search/build_db/build_rna_blast_db.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update scripts/search/build_db/build_rna_blast_db.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix: fix pylint problems --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

fix: fix dna/rna local blast

3796c7c

gemini-code-assist bot reviewed Dec 4, 2025

View reviewed changes

CHERRY-ui8 and others added 3 commits December 5, 2025 01:08

Update scripts/search/build_db/build_rna_blast_db.sh

d6fbc2a

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update scripts/search/build_db/build_rna_blast_db.sh

d22ed41

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

fix: fix pylint problems

33eee15

ChenZiHong-Gavin merged commit 737f45d into InternScience:main Dec 5, 2025
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: fix dna/rna local blast #111

fix: fix dna/rna local blast #111

Uh oh!

CHERRY-ui8 commented Dec 4, 2025 •

edited by ChenZiHong-Gavin

Loading

Uh oh!

gemini-code-assist bot commented Dec 4, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Dec 4, 2025

Uh oh!

gemini-code-assist bot Dec 4, 2025

Uh oh!

Uh oh!

Uh oh!

ChenZiHong-Gavin commented Dec 5, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: fix dna/rna local blast #111

fix: fix dna/rna local blast #111

Uh oh!

Conversation

CHERRY-ui8 commented Dec 4, 2025 • edited by ChenZiHong-Gavin Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Dec 4, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ChenZiHong-Gavin commented Dec 5, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CHERRY-ui8 commented Dec 4, 2025 •

edited by ChenZiHong-Gavin

Loading