@CHERRY-ui8
Contributor

@CHERRY-ui8 CHERRY-ui8 commented Dec 4, 2025

This pull request refines the local BLAST search capabilities for both DNA and RNA sequences.

@gemini-code-assist
Contributor

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the local BLAST search capabilities for both DNA and RNA sequences. It introduces more robust methods for identifying molecule types and extracting sequence information from NCBI, updates configuration paths for local databases, and completely revamps the RNA BLAST database creation process to leverage RNAcentral for improved consistency and flexibility. Additionally, it provides a new option for building smaller, species-specific DNA BLAST databases and updates example inputs to align with these enhancements.

Highlights

  • Local BLAST Database Configuration: Updated default paths for local DNA and RNA BLAST databases in configuration files, reflecting changes in how these databases are built and named.
  • Improved NCBI Searcher Logic: Enhanced ncbi_searcher.py to infer molecule types (DNA/RNA) more accurately from accession numbers and gene types. It now also uses separate methods to extract metadata (from GenBank) and sequence data (from FASTA), which is more reliable, especially for complex records.
  • Refactored DNA BLAST Database Script: The build_dna_blast_db.sh script now includes a new human_mouse option, allowing users to build a smaller, more targeted DNA BLAST database containing only Homo sapiens and Mus musculus sequences. It also features a pre-download check to filter files by species.
  • Switched RNA BLAST Database Source to RNAcentral: The build_rna_blast_db.sh script has been completely overhauled to download and build local BLAST databases from RNAcentral instead of NCBI RefSeq RNA. This change ensures consistency with online RNAcentral searches and offers flexible options to download the full active database or specific subsets (e.g., RefSeq, Rfam, miRBase).
  • Updated Example Input Files: The example input files for DNA and RNA searches (search_dna_demo.jsonl, search_rna_demo.jsonl) have been revised to reflect the updated search capabilities and database structures.
  • Dependency Lock File Added: A new uv.lock file has been added, indicating changes in dependency management, likely for consistent environment setup.
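
As a rough illustration of inferring a molecule type from an accession number, here is a minimal sketch based on the standard RefSeq accession prefixes. It is not the logic in ncbi_searcher.py, which this page does not show; the function name infer_molecule_type and the "unknown" fallback are assumptions made for the example.

```shell
# Hypothetical helper (not taken from the PR): classify a RefSeq accession
# as RNA or DNA using its two-letter prefix.
infer_molecule_type() {
    local acc="$1"
    case "$acc" in
        # Transcript records: curated/predicted mRNA and non-coding RNA
        NM_*|NR_*|XM_*|XR_*) echo "RNA" ;;
        # Genomic records: chromosomes, genomic regions, contigs, scaffolds
        NC_*|NG_*|NT_*|NW_*|AC_*) echo "DNA" ;;
        # Anything else (e.g. non-RefSeq accessions) is left undecided
        *) echo "unknown" ;;
    esac
}
```

Calling `infer_molecule_type NM_000546.6` prints RNA, while `infer_molecule_type NC_000001.11` prints DNA. A real implementation would likely also consult the record's gene type, as the summary describes.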

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant enhancements to the ncbi_searcher model and refactors the local BLAST database generation scripts.

Key changes in ncbi_searcher.py include a new _infer_molecule_type_detail method for more precise molecule type identification, improved fallback logic for extracting representative accessions, and a separation of metadata and sequence extraction into _extract_metadata_from_genbank and _extract_sequence_from_fasta functions for better reliability, especially for CON-type records.

The build_dna_blast_db.sh script now includes a human_mouse download option, which is also the new default, allowing users to create a minimal database containing only Homo sapiens and Mus musculus sequences by partially downloading and checking files. The build_rna_blast_db.sh script has been completely refactored to download RNA sequences from RNAcentral instead of NCBI RefSeq, providing options for a full active database or specific subsets, ensuring consistency with online RNAcentral searches.

Configuration files (search_dna_config.yaml, search_rna_config.yaml) were updated with new default local BLAST database paths, and example input files were modified. A uv.lock file was added, and .DS_Store entries were included in .gitignore.

Review comments focused on improving the robustness and efficiency of the shell scripts, specifically recommending the use of mktemp to prevent race conditions with temporary files in build_dna_blast_db.sh, and suggesting optimizations for fetching database listings and simplifying release version parsing in build_rna_blast_db.sh.
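
For readers unfamiliar with the build step such scripts wrap, here is a minimal sketch, assuming the databases are created with BLAST+'s makeblastdb. The exact invocation used by the PR's scripts is not shown on this page, and the function name build_blast_db, the DRY_RUN switch, and any paths passed to it are hypothetical.

```shell
# Hypothetical wrapper (not taken from the PR): build a nucleotide BLAST
# database from a FASTA file, or only print the command when DRY_RUN=1.
build_blast_db() {
    local fasta="$1"
    local out="$2"
    local title="${3:-$2}"   # default the title to the output name

    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'makeblastdb -in %s -dbtype nucl -out %s -title %s\n' \
            "$fasta" "$out" "$title"
        return 0
    fi

    # Fail early with a clear message if BLAST+ is not installed
    if ! command -v makeblastdb >/dev/null 2>&1; then
        echo "makeblastdb not found in PATH" >&2
        return 127
    fi

    makeblastdb -in "$fasta" -dbtype nucl -out "$out" -title "$title"
}
```

Running it with DRY_RUN=1 first is a cheap way to check the paths before committing to a long build.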

Comment on lines +62 to +83
check_file_for_species() {
    local url=$1
    local filename=$2
    local temp_file="/tmp/check_${filename//\//_}"

    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then
            rm -f "${temp_file}"
            return 0 # Contains target species
        else
            rm -f "${temp_file}"
            return 1 # Does not contain target species
        fi
    else
        # If partial download fails, skip this file (don't download it)
        rm -f "${temp_file}"
        return 1
    fi
}

Severity: high

The check_file_for_species function uses a predictable temporary file path in /tmp. This can lead to race conditions and unexpected behavior if the script is run multiple times concurrently. It's a better and safer practice to use mktemp to create a unique temporary file. Using trap also simplifies cleanup logic by ensuring the temporary file is removed when the function exits.

check_file_for_species() {
    local url=$1
    local filename=$2
    local temp_file
    temp_file=$(mktemp)
    trap 'rm -f "${temp_file}"' RETURN
    
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then
            return 0  # Contains target species
        else
            return 1  # Does not contain target species
        fi
    else
        # If partial download fails, skip this file (don't download it)
        return 1
    fi
}
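
The same check can be tried offline. The sketch below is a variant, not the reviewer's code: it reads the first 500KB of a local gzip file instead of calling curl, and it cleans up explicitly rather than through a trap. The function name prefix_has_species and its arguments are assumptions for illustration, and it assumes the shell is not running with pipefail, since gunzip reports an error on a truncated stream.

```shell
# Hypothetical offline variant of the species check: $1 is a local .gz file,
# $2 is an extended regex of species names to look for in the headers.
prefix_has_species() {
    local src="$1"
    local pattern="$2"
    local tmp status

    # mktemp gives each call its own file, so concurrent runs cannot collide
    tmp=$(mktemp) || return 2

    # Stand-in for `curl --range 0-512000`: keep only the first 500KB
    head -c 512000 "$src" > "$tmp"

    # A truncated gzip stream still decompresses up to the cut-off point;
    # gunzip's complaint about the missing tail is discarded.
    if gunzip -c "$tmp" 2>/dev/null | head -n 2000 | grep -qE "$pattern"; then
        status=0
    else
        status=1
    fi

    rm -f "$tmp"
    return "$status"
}
```

For example, `prefix_has_species sample.fna.gz '(Homo sapiens|Mus musculus)'` succeeds only if one of the first 2000 decompressed lines names either species.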

Comment on lines +93 to +117
# Get list of files and save to temp file to avoid subshell issues
curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \
    grep -oE 'href="[^"]*\.genomic\.fna\.gz"' | \
    sed 's/href="\(.*\)"/\1/' > /tmp/refseq_files.txt

file_count=0
download_count=0

while read filename; do
    file_count=$((file_count + 1))
    url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}"
    echo -n "[${file_count}] Checking ${filename}... "

    if check_file_for_species "${url}" "${filename}"; then
        echo "✓ contains target species, downloading..."
        download_count=$((download_count + 1))
        wget -c -q --show-progress "${url}" || {
            echo "Warning: Failed to download ${filename}"
        }
    else
        echo "✗ skipping (no human/mouse data)"
    fi
done < /tmp/refseq_files.txt

rm -f /tmp/refseq_files.txt

Severity: high

The script uses a hardcoded temporary file /tmp/refseq_files.txt. If multiple instances of this script are run simultaneously, they will interfere with each other by overwriting this file, leading to incorrect downloads or failures. You should use mktemp to create a temporary file with a unique name to avoid this race condition.

        # Get list of files and save to temp file to avoid subshell issues
        file_list=$(mktemp)
        curl -s "https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/" | \
            grep -oE 'href="[^"]*\.genomic\.fna\.gz"' | \
            sed 's/href="\(.*\)"/\1/' > "${file_list}"
        
        file_count=0
        download_count=0
        
        while read filename; do
            file_count=$((file_count + 1))
            url="https://ftp.ncbi.nlm.nih.gov/refseq/release/${category}/${filename}"
            echo -n "[${file_count}] Checking ${filename}... "
            
            if check_file_for_species "${url}" "${filename}"; then
                echo "✓ contains target species, downloading..."
                download_count=$((download_count + 1))
                wget -c -q --show-progress "${url}" || {
                    echo "Warning: Failed to download ${filename}"
                }
            else
                echo "✗ skipping (no human/mouse data)"
            fi
        done < "${file_list}"
        
        rm -f "${file_list}"
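
The listing-parsing step can be exercised without network access. The following is a small sketch under stated assumptions: the helper names extract_genomic_files and count_entries are hypothetical, and the input is whatever HTML directory listing is piped in on stdin rather than the live NCBI FTP page.

```shell
# Hypothetical helper: read an HTML directory listing on stdin and print
# the names of the *.genomic.fna.gz files it links to, one per line.
extract_genomic_files() {
    grep -oE 'href="[^"]*\.genomic\.fna\.gz"' | sed 's/^href="//; s/"$//'
}

# Hypothetical helper: count non-empty lines on stdin. Reading with
# `IFS= read -r` keeps backslashes and leading whitespace intact.
count_entries() {
    local n=0
    local line
    while IFS= read -r line; do
        if [ -n "$line" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Piping a saved copy of the listing through `extract_genomic_files | count_entries` shows how many candidate files the loop would check before any download starts.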

CHERRY-ui8 and others added 3 commits December 5, 2025 01:08
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ChenZiHong-Gavin
Collaborator

LGTM

@ChenZiHong-Gavin ChenZiHong-Gavin merged commit 737f45d into InternScience:main Dec 5, 2025
3 checks passed
CHERRY-ui8 added a commit to CHERRY-ui8/GraphGen that referenced this pull request Dec 17, 2025
* fix: fix dna/rna local blast

* Update scripts/search/build_db/build_rna_blast_db.sh

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update scripts/search/build_db/build_rna_blast_db.sh

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix: fix pylint problems

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>