
Conversation

@CHERRY-ui8
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by introducing a robust multi-omics data processing and QA generation pipeline. It enables the creation of knowledge graphs and question-answer pairs from DNA, RNA, and protein sequences, leveraging specialized LLM-based extraction and generation models. Concurrently, the underlying search infrastructure has been refactored and enhanced to provide more efficient and reliable access to external biological databases, including improved local BLAST support. These changes aim to streamline the processing of complex biological data and improve the overall user experience by reducing log verbosity and optimizing data handling.

Highlights

  • Multi-omics QA Generation Pipeline: A comprehensive pipeline has been introduced for generating question-answer pairs from multi-omics data (DNA, RNA, protein). This includes new configuration files, shell scripts for execution, and example input data, streamlining the process from raw omics data to structured QA pairs.
  • Dedicated Omics KG Builders and Generators: New OmicsKGBuilder and OmicsQAGenerator classes have been added. The OmicsKGBuilder is designed for extracting entities and relationships specifically from omics data, while the OmicsQAGenerator focuses on generating QA pairs with automatic molecule-specific caption extraction, supporting various output formats like Alpaca, Sharegpt, and ChatML.
  • Enhanced Search Service with Local BLAST Integration: The search functionality has been refactored into a new SearchService that supports multiple biological databases (NCBI, UniProt, RNAcentral). This service can detect pre-searched data to skip redundant steps and includes significantly improved local BLAST integration with multi-threading, optimized commands, and robust retry mechanisms for network requests (see the retry sketch after this list).
  • Improved Local BLAST Database Building Scripts: The scripts for building local DNA, RNA, and protein BLAST databases have been updated. Enhancements include support for more species in DNA databases, flexible download options for protein databases (Swiss-Prot only or full), and advanced logic for RNAcentral databases, allowing selection of multiple subsets and merging them into a single, unified BLAST database.
  • Reduced Ray Logging Verbosity: Logging verbosity has been reduced across the system. Ray operators and searchers now default to ERROR level for console output, and Ray Data's progress bars and tensor extension casting have been disabled to provide cleaner execution logs and prevent potential conversion errors with complex data types (see the logging sketch after this list).
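
For reviewers unfamiliar with these settings, here is a minimal sketch of how this kind of verbosity reduction is typically wired up with Ray's public APIs; the exact placement and values in this PR may differ.

import logging

import ray
from ray.data import DataContext

# Route Ray's console logging to ERROR level only (assumed setup; the PR
# may instead configure individual operator/searcher loggers).
ray.init(logging_level=logging.ERROR, log_to_driver=False)

ctx = DataContext.get_current()
ctx.enable_progress_bars = False             # hide Ray Data progress bars
ctx.enable_tensor_extension_casting = False  # avoid casting errors on complex column types

Similarly, the retry behavior described for the SearchService could look like the following requests/urllib3 sketch; the session setup and parameter values are assumptions, not the PR's actual code.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient network failures when querying NCBI/UniProt/RNAcentral.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1.0,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
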
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@CHERRY-ui8 CHERRY-ui8 closed this Dec 19, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant new functionality for handling multi-omics data, including new examples, data files, and core logic for building knowledge graphs and generating QA pairs from DNA, RNA, and protein data. It also includes substantial refactoring and improvements to the search operators, database build scripts, and logging/execution engine to reduce log verbosity and improve robustness. While the new features are well-implemented, I've found some critical issues in the shell scripts for building local BLAST databases that would cause them to fail. There are also a couple of medium-severity issues related to a simplified regex and a hardcoded configuration parameter that reduces flexibility. Overall, this is a great contribution, and addressing the identified issues will make it even better.

Comment on lines +61 to 92
    # First check if file is already downloaded locally
    if check_file_downloaded "${filename}"; then
        # File already exists, check if it contains target species
        # Check both compressed and decompressed versions
        local decompressed_file="${filename%.gz}"
        if [ -f "${filename}" ]; then
            # Compressed file exists
            if gunzip -c "${filename}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        elif [ -f "${decompressed_file}" ]; then
            # Decompressed file exists
            if head -2000 "${decompressed_file}" 2>/dev/null | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        fi
    fi

    # File not downloaded yet, download first 500KB to check
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then
        # Check for: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast)
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
            rm -f "${temp_file}"
            return 0  # Contains target species
        else

critical

The function check_file_for_species appears to be broken. The function definition check_file_for_species() { ... } and local variable declarations have been removed, leaving only the function body. This will cause a "command not found" error when the script tries to call it. Additionally, the function check_file_downloaded is called but not defined anywhere in the script.

# Function to check if a file has been downloaded
check_file_downloaded() {
    local filename=$1
    local decompressed_file="${filename%.gz}"
    if [ -f "${filename}" ] || [ -f "${decompressed_file}" ]; then
        return 0
    else
        return 1
    fi
}

# Function to check if a file contains target species
check_file_for_species() {
    local url=$1
    local filename=$2
    local temp_file="/tmp/check_${filename//\//_}"

    # First check if file is already downloaded locally
    if check_file_downloaded "${filename}"; then
        # File already exists, check if it contains target species
        # Check both compressed and decompressed versions
        local decompressed_file="${filename%.gz}"
        if [ -f "${filename}" ]; then
            # Compressed file exists
            if gunzip -c "${filename}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        elif [ -f "${decompressed_file}" ]; then
            # Decompressed file exists
            if head -2000 "${decompressed_file}" 2>/dev/null | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        fi
    fi
    
    # File not downloaded yet, download first 500KB to check
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        # Check for: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast)
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
            rm -f "${temp_file}"
            return 0  # Contains target species
        else
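            # (The suggestion is truncated here by the comment's line range.
            # A hedged completion, assuming the negative path mirrors the
            # positive one: clean up the temp file, report no match, then
            # close both conditionals and the function.)
            rm -f "${temp_file}"
            return 1  # Does not contain target species
        fi
    fi

    # Could not download or inspect the file; conservatively report no match
    rm -f "${temp_file}"
    return 1
}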

Comment on lines 9 to 86
# For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+
# Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

# Better to use a stable DOWNLOAD_TMP name to support resuming downloads
DOWNLOAD_TMP=_downloading
mkdir -p ${DOWNLOAD_TMP}
cd ${DOWNLOAD_TMP}

wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/RELEASE.metalink"
echo "Downloading RELEASE.metalink..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/RELEASE.metalink"

# Extract the release name (like 2017_10 or 2017_1)
# Use sed for cross-platform compatibility (works on both macOS and Linux)
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' RELEASE.metalink | head -1)

wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/LICENSE"
echo "UniProt release: ${RELEASE}"
echo ""

# Download Swiss-Prot (always needed)
echo "Downloading uniprot_sprot.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

# Download TrEMBL only if full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
echo "Downloading uniprot_trembl.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
fi

# Download metadata files
echo "Downloading metadata files..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/reldate.txt"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/README"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/LICENSE"

cd ..

mkdir ${RELEASE}
mkdir -p ${RELEASE}
mv ${DOWNLOAD_TMP}/* ${RELEASE}
rmdir ${DOWNLOAD_TMP}

cd ${RELEASE}

echo ""
echo "Extracting files..."
gunzip uniprot_sprot.fasta.gz
gunzip uniprot_trembl.fasta.gz

cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
if [ "${DOWNLOAD_MODE}" = "full" ]; then
gunzip uniprot_trembl.fasta.gz
echo "Merging Swiss-Prot and TrEMBL..."
cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
fi

makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
echo ""
echo "Building BLAST databases..."

# Always build Swiss-Prot database
makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids -title uniprot_sprot
makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl

# Build full release database only if in full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl
fi

cd ..

echo ""
echo "BLAST databases created successfully!"
echo "Database locations:"
echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
echo ""
echo "To use these databases, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot # or uniprot_${RELEASE} or uniprot_trembl"
if [ "${DOWNLOAD_MODE}" = "sprot" ]; then
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo ""
echo "To use this database, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot"
else
echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
echo ""
echo "To use these databases, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot # or uniprot_${RELEASE} or uniprot_trembl"
fi


critical

This script will fail because it uses several undefined variables: UNIPROT_BASE, DOWNLOAD_MODE, and DOWNLOAD_TMP. These variables need to be defined for the script to execute correctly. DOWNLOAD_MODE should likely be a command-line argument, and the others should be defined at the top of the script.

# For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+
# Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

DOWNLOAD_MODE=${1:-sprot} # sprot or full
UNIPROT_BASE="ftp://ftp.uniprot.org/pub/databases/uniprot"

# Better to use a stable DOWNLOAD_TMP name to support resuming downloads
DOWNLOAD_TMP=_downloading
mkdir -p ${DOWNLOAD_TMP}
cd ${DOWNLOAD_TMP}

echo "Downloading RELEASE.metalink..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/RELEASE.metalink"

# Extract the release name (like 2017_10 or 2017_1)
# Use sed for cross-platform compatibility (works on both macOS and Linux)
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' RELEASE.metalink | head -1)

echo "UniProt release: ${RELEASE}"
echo ""

# Download Swiss-Prot (always needed)
echo "Downloading uniprot_sprot.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

# Download TrEMBL only if full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    echo "Downloading uniprot_trembl.fasta.gz..."
    wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
fi

# Download metadata files
echo "Downloading metadata files..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/reldate.txt"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/README"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/LICENSE"

cd ..

mkdir -p ${RELEASE}
mv ${DOWNLOAD_TMP}/* ${RELEASE}
rmdir ${DOWNLOAD_TMP}

cd ${RELEASE}

echo ""
echo "Extracting files..."
gunzip uniprot_sprot.fasta.gz

if [ "${DOWNLOAD_MODE}" = "full" ]; then
    gunzip uniprot_trembl.fasta.gz
    echo "Merging Swiss-Prot and TrEMBL..."
    cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
fi

echo ""
echo "Building BLAST databases..."

# Always build Swiss-Prot database
makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids -title uniprot_sprot

# Build full release database only if in full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
    makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl
fi

cd ..

echo ""
echo "BLAST databases created successfully!"
echo "Database locations:"
if [ "${DOWNLOAD_MODE}" = "sprot" ]; then
    echo "  - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo ""
    echo "To use this database, set in your config:"
    echo "  local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot"
else
    echo "  - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
    echo "  - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo "  - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
    echo ""
    echo "To use these databases, set in your config:"
    echo "  local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot  # or uniprot_${RELEASE} or uniprot_trembl"
fi
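
With the fix above, the download mode becomes a positional argument. Hypothetical invocations (the script's actual file name in the repo may differ):

bash build_protein_blast_db.sh        # default: Swiss-Prot only
bash build_protein_blast_db.sh full   # Swiss-Prot + TrEMBL + combined database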

Comment on lines +15 to 20
    def __init__(self, working_dir: str = "cache"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
            backend="kuzu", working_dir=working_dir, namespace="graph"
        )

medium

The graph_backend is now hardcoded to "kuzu". Previously, it was configurable via the __init__ parameter. This change reduces flexibility. It would be better to restore the parameter to allow users to choose a different graph backend from the configuration file.

Suggested change
    def __init__(self, working_dir: str = "cache"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
            backend="kuzu", working_dir=working_dir, namespace="graph"
        )
    def __init__(self, working_dir: str = "cache", graph_backend: str = "kuzu"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
        )
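
If the parameter is restored, callers can select the backend from configuration again; a hypothetical example (the class name here is illustrative, not necessarily the one in the PR):

# "kuzu" remains the default; other backends would work only if
# supported by init_storage.
service = BuildKGService(working_dir="cache", graph_backend="kuzu")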

elif re.fullmatch(r"[A-NR-Z0-9]{6,10}", query, re.I):
# UniProt accession IDs: 6-10 characters, must start with a letter
# Format: [A-Z][A-Z0-9]{5,9} (6-10 chars total: 1 letter + 5-9 alphanumeric)
elif re.fullmatch(r"[A-Z][A-Z0-9]{5,9}", query, re.I):

medium

The regex for validating UniProt accession IDs is a simplification and may not cover all valid formats. A more accurate regex is used elsewhere in the codebase (e.g., in multi_omics_search.py). Using the more accurate regex here would improve correctness and consistency.

Suggested change
elif re.fullmatch(r"[A-Z][A-Z0-9]{5,9}", query, re.I):
elif re.fullmatch(r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})", query, re.I):

@CHERRY-ui8 CHERRY-ui8 deleted the reduce-ray-log branch December 19, 2025 07:13