
Conversation

@CHERRY-ui8
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by introducing a robust multi-omics data processing and QA generation pipeline. It enables the creation of knowledge graphs and question-answer pairs from DNA, RNA, and protein sequences, leveraging specialized LLM-based extraction and generation models. Concurrently, the underlying search infrastructure has been refactored and enhanced to provide more efficient and reliable access to external biological databases, including improved local BLAST support. These changes aim to streamline the processing of complex biological data and improve the overall user experience by reducing log verbosity and optimizing data handling.

Highlights

  • Multi-omics QA Generation Pipeline: A comprehensive pipeline has been introduced for generating question-answer pairs from multi-omics data (DNA, RNA, protein). This includes new configuration files, shell scripts for execution, and example input data, streamlining the process from raw omics data to structured QA pairs.
  • Dedicated Omics KG Builders and Generators: New OmicsKGBuilder and OmicsQAGenerator classes have been added. The OmicsKGBuilder is designed for extracting entities and relationships specifically from omics data, while the OmicsQAGenerator focuses on generating QA pairs with automatic molecule-specific caption extraction, supporting various output formats like Alpaca, Sharegpt, and ChatML.
  • Enhanced Search Service with Local BLAST Integration: The search functionality has been refactored into a new SearchService that supports multiple biological databases (NCBI, UniProt, RNAcentral). This service can detect pre-searched data to skip redundant steps and includes significantly improved local BLAST integration with multi-threading, optimized commands, and robust retry mechanisms for network requests (see the retry sketch after this list).
  • Improved Local BLAST Database Building Scripts: The scripts for building local DNA, RNA, and protein BLAST databases have been updated. Enhancements include support for more species in DNA databases, flexible download options for protein databases (Swiss-Prot only or full), and advanced logic for RNAcentral databases, allowing selection of multiple subsets and merging them into a single, unified BLAST database.
  • Reduced Ray Logging Verbosity: Logging verbosity has been reduced across the system. Ray operators and searchers now default to ERROR level for console output, and Ray Data's progress bars and tensor extension casting have been disabled to provide cleaner execution logs and prevent potential conversion errors with complex data types (see the logging sketch after this list).
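
For reviewers unfamiliar with these settings, here is a minimal sketch of how this kind of verbosity reduction is typically wired up with Ray's public APIs; the exact placement and values in this PR may differ.

import logging

import ray
from ray.data import DataContext

# Route Ray's console logging to ERROR level only (assumed setup; the PR
# may instead configure individual operator/searcher loggers).
ray.init(logging_level=logging.ERROR, log_to_driver=False)

ctx = DataContext.get_current()
ctx.enable_progress_bars = False             # hide Ray Data progress bars
ctx.enable_tensor_extension_casting = False  # avoid casting errors on complex column types

Similarly, the retry behavior described for the SearchService could look like the following requests/urllib3 sketch; the session setup and parameter values are assumptions, not the PR's actual code.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient network failures when querying NCBI/UniProt/RNAcentral.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1.0,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
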
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@CHERRY-ui8 CHERRY-ui8 closed this Dec 19, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant new functionality for handling multi-omics data, including new examples, data files, and core logic for building knowledge graphs and generating QA pairs from DNA, RNA, and protein data. It also includes substantial refactoring and improvements to the search operators, database build scripts, and logging/execution engine to reduce log verbosity and improve robustness. While the new features are well-implemented, I've found some critical issues in the shell scripts for building local BLAST databases that would cause them to fail. There are also a couple of medium-severity issues related to a simplified regex and a hardcoded configuration parameter that reduces flexibility. Overall, this is a great contribution, and addressing the identified issues will make it even better.

Comment on lines +61 to 92
    # First check if file is already downloaded locally
    if check_file_downloaded "${filename}"; then
        # File already exists, check if it contains target species
        # Check both compressed and decompressed versions
        local decompressed_file="${filename%.gz}"
        if [ -f "${filename}" ]; then
            # Compressed file exists
            if gunzip -c "${filename}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        elif [ -f "${decompressed_file}" ]; then
            # Decompressed file exists
            if head -2000 "${decompressed_file}" 2>/dev/null | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        fi
    fi

    # File not downloaded yet, download first 500KB to check
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus)"; then
        # Check for: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast)
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
            rm -f "${temp_file}"
            return 0  # Contains target species
        else

critical

The function check_file_for_species appears to be broken. The function definition check_file_for_species() { ... } and local variable declarations have been removed, leaving only the function body. This will cause a "command not found" error when the script tries to call it. Additionally, the function check_file_downloaded is called but not defined anywhere in the script.

# Function to check if a file has been downloaded
check_file_downloaded() {
    local filename=$1
    local decompressed_file="${filename%.gz}"
    if [ -f "${filename}" ] || [ -f "${decompressed_file}" ]; then
        return 0
    else
        return 1
    fi
}

# Function to check if a file contains target species
check_file_for_species() {
    local url=$1
    local filename=$2
    local temp_file="/tmp/check_${filename//\//_}"

    # First check if file is already downloaded locally
    if check_file_downloaded "${filename}"; then
        # File already exists, check if it contains target species
        # Check both compressed and decompressed versions
        local decompressed_file="${filename%.gz}"
        if [ -f "${filename}" ]; then
            # Compressed file exists
            if gunzip -c "${filename}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        elif [ -f "${decompressed_file}" ]; then
            # Decompressed file exists
            if head -2000 "${decompressed_file}" 2>/dev/null | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        fi
    fi
    
    # File not downloaded yet, download first 500KB to check
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        # Check for: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast)
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
            rm -f "${temp_file}"
            return 0  # Contains target species
        else
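            # (The suggestion is truncated here by the comment's line range.
            # A hedged completion, assuming the negative path mirrors the
            # positive one: clean up the temp file, report no match, then
            # close both conditionals and the function.)
            rm -f "${temp_file}"
            return 1  # Does not contain target species
        fi
    fi

    # Could not download or inspect the file; conservatively report no match
    rm -f "${temp_file}"
    return 1
}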

Comment on lines 9 to 86
# For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+
# Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

# Better to use a stable DOWNLOAD_TMP name to support resuming downloads
DOWNLOAD_TMP=_downloading
mkdir -p ${DOWNLOAD_TMP}
cd ${DOWNLOAD_TMP}

wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/RELEASE.metalink"
echo "Downloading RELEASE.metalink..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/RELEASE.metalink"

# Extract the release name (like 2017_10 or 2017_1)
# Use sed for cross-platform compatibility (works on both macOS and Linux)
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' RELEASE.metalink | head -1)

wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/LICENSE"
echo "UniProt release: ${RELEASE}"
echo ""

# Download Swiss-Prot (always needed)
echo "Downloading uniprot_sprot.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

# Download TrEMBL only if full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
echo "Downloading uniprot_trembl.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
fi

# Download metadata files
echo "Downloading metadata files..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/reldate.txt"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/README"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/LICENSE"

cd ..

mkdir ${RELEASE}
mkdir -p ${RELEASE}
mv ${DOWNLOAD_TMP}/* ${RELEASE}
rmdir ${DOWNLOAD_TMP}

cd ${RELEASE}

echo ""
echo "Extracting files..."
gunzip uniprot_sprot.fasta.gz
gunzip uniprot_trembl.fasta.gz

cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
if [ "${DOWNLOAD_MODE}" = "full" ]; then
gunzip uniprot_trembl.fasta.gz
echo "Merging Swiss-Prot and TrEMBL..."
cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
fi

makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
echo ""
echo "Building BLAST databases..."

# Always build Swiss-Prot database
makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids -title uniprot_sprot
makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl

# Build full release database only if in full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl
fi

cd ..

echo ""
echo "BLAST databases created successfully!"
echo "Database locations:"
echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
echo ""
echo "To use these databases, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot # or uniprot_${RELEASE} or uniprot_trembl"
if [ "${DOWNLOAD_MODE}" = "sprot" ]; then
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo ""
echo "To use this database, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot"
else
echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
echo ""
echo "To use these databases, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot # or uniprot_${RELEASE} or uniprot_trembl"
fi


critical

This script will fail because it uses several undefined variables: UNIPROT_BASE, DOWNLOAD_MODE, and DOWNLOAD_TMP. These variables need to be defined for the script to execute correctly. DOWNLOAD_MODE should likely be a command-line argument, and the others should be defined at the top of the script.

# For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+
# Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

DOWNLOAD_MODE=${1:-sprot} # sprot or full
UNIPROT_BASE="ftp://ftp.uniprot.org/pub/databases/uniprot"

# Better to use a stable DOWNLOAD_TMP name to support resuming downloads
DOWNLOAD_TMP=_downloading
mkdir -p ${DOWNLOAD_TMP}
cd ${DOWNLOAD_TMP}

echo "Downloading RELEASE.metalink..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/RELEASE.metalink"

# Extract the release name (like 2017_10 or 2017_1)
# Use sed for cross-platform compatibility (works on both macOS and Linux)
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' RELEASE.metalink | head -1)

echo "UniProt release: ${RELEASE}"
echo ""

# Download Swiss-Prot (always needed)
echo "Downloading uniprot_sprot.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

# Download TrEMBL only if full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    echo "Downloading uniprot_trembl.fasta.gz..."
    wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
fi

# Download metadata files
echo "Downloading metadata files..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/reldate.txt"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/README"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/LICENSE"

cd ..

mkdir -p ${RELEASE}
mv ${DOWNLOAD_TMP}/* ${RELEASE}
rmdir ${DOWNLOAD_TMP}

cd ${RELEASE}

echo ""
echo "Extracting files..."
gunzip uniprot_sprot.fasta.gz

if [ "${DOWNLOAD_MODE}" = "full" ]; then
    gunzip uniprot_trembl.fasta.gz
    echo "Merging Swiss-Prot and TrEMBL..."
    cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
fi

echo ""
echo "Building BLAST databases..."

# Always build Swiss-Prot database
makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids -title uniprot_sprot

# Build full release database only if in full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
    makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl
fi

cd ..

echo ""
echo "BLAST databases created successfully!"
echo "Database locations:"
if [ "${DOWNLOAD_MODE}" = "sprot" ]; then
    echo "  - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo ""
    echo "To use this database, set in your config:"
    echo "  local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot"
else
    echo "  - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
    echo "  - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo "  - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
    echo ""
    echo "To use these databases, set in your config:"
    echo "  local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot  # or uniprot_${RELEASE} or uniprot_trembl"
fi
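
With the fix above, the download mode becomes a positional argument. Hypothetical invocations (the script's actual file name in the repo may differ):

bash build_protein_blast_db.sh        # default: Swiss-Prot only
bash build_protein_blast_db.sh full   # Swiss-Prot + TrEMBL + combined database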

Comment on lines +15 to 20
    def __init__(self, working_dir: str = "cache"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
            backend="kuzu", working_dir=working_dir, namespace="graph"
        )

medium

The graph_backend is now hardcoded to "kuzu". Previously, it was configurable via the __init__ parameter. This change reduces flexibility. It would be better to restore the parameter to allow users to choose a different graph backend from the configuration file.

Suggested change
    def __init__(self, working_dir: str = "cache"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
            backend="kuzu", working_dir=working_dir, namespace="graph"
        )
    def __init__(self, working_dir: str = "cache", graph_backend: str = "kuzu"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
        )
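
If the parameter is restored, callers can select the backend from configuration again; a hypothetical example (the class name here is illustrative, not necessarily the one in the PR):

# "kuzu" remains the default; other backends would work only if
# supported by init_storage.
service = BuildKGService(working_dir="cache", graph_backend="kuzu")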

elif re.fullmatch(r"[A-NR-Z0-9]{6,10}", query, re.I):
# UniProt accession IDs: 6-10 characters, must start with a letter
# Format: [A-Z][A-Z0-9]{5,9} (6-10 chars total: 1 letter + 5-9 alphanumeric)
elif re.fullmatch(r"[A-Z][A-Z0-9]{5,9}", query, re.I):

medium

The regex for validating UniProt accession IDs is a simplification and may not cover all valid formats. A more accurate regex is used elsewhere in the codebase (e.g., in multi_omics_search.py). Using the more accurate regex here would improve correctness and consistency.

Suggested change
elif re.fullmatch(r"[A-Z][A-Z0-9]{5,9}", query, re.I):
elif re.fullmatch(r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})", query, re.I):

@CHERRY-ui8 CHERRY-ui8 deleted the reduce-ray-log branch December 19, 2025 07:13