
Conversation

@CHERRY-ui8
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's biological data search capabilities by integrating multi-omics data types (DNA, protein, RNA). It streamlines the user experience through reorganized examples and robust backend improvements. The changes focus on enhancing the efficiency and reliability of searches against major public databases, both via API and local BLAST, and introduce a unified service for managing diverse search operations and normalizing their outputs.

Highlights

  • Multi-Omics Search Capabilities: Introduced comprehensive search functionalities for DNA, protein, and RNA sequences, enabling users to query the respective public databases: NCBI, UniProt, and RNAcentral.
  • Refactored Search Examples and Scripts: The search examples and their associated scripts and configuration files have been reorganized into dedicated subdirectories (e.g., examples/search/search_dna), improving clarity and modularity.
  • Enhanced Local BLAST Database Management: The build_dna_blast_db.sh script now supports a broader range of species (Drosophila melanogaster, Saccharomyces cerevisiae) for minimal downloads, and the build_rna_blast_db.sh script has been significantly upgraded to allow selection of multiple RNAcentral database subsets and their merging into a single BLAST database.
  • Improved Searcher Implementations: Individual searcher clients (NCBISearch, RNACentralSearch, UniProtSearch) have been enhanced with better logging, multi-threading support, timeouts for local BLAST operations, refined query type detection (see the illustrative sketch after this list), and robust retry mechanisms for API calls.
  • New Unified Search Service Operator: A new SearchService operator replaces the previous search_all.py, offering a more intelligent and centralized approach to handling multi-omics searches, including logic to normalize already processed data and integrate various search clients seamlessly.
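
As an aside on the "refined query type detection" item, the type of a raw sequence can often be guessed from its alphabet alone. The sketch below is purely illustrative; the function name and classification rules are assumptions, not code from this pull request:

    def guess_sequence_type(seq: str) -> str:
        """Rough heuristic: classify a raw sequence by the letters it uses."""
        letters = set(seq.upper()) - {"N", "-", "*"}  # ignore ambiguity, gap, and stop symbols
        if letters <= set("ACGT"):
            return "dna"
        if letters <= set("ACGU"):
            return "rna"
        if letters <= set("ACDEFGHIKLMNPQRSTVWY"):
            return "protein"
        return "unknown"

    # A/C/G/T only reads as DNA, U instead of T reads as RNA,
    # and the full 20-letter amino-acid alphabet reads as protein.
    assert guess_sequence_type("ATGGCCATTGTAATGGGCCGC") == "dna"
    assert guess_sequence_type("AUGGCCAUUGUAAUGGGCCGC") == "rna"
    assert guess_sequence_type("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ") == "protein"

A heuristic like this is ambiguous for short sequences (a peptide written only with A, C, G, and T would be misread as DNA), so it is best treated as a fallback rather than the sole signal.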


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant and valuable refactoring of the search examples and enhances the underlying searcher implementations. The introduction of the SearchService operator for Ray, along with the reorganization of example files, greatly improves the structure and usability. The enhancements to the searcher classes, including standardized logging, retry mechanisms for network robustness, and optimized local BLAST execution, are excellent additions. My review focuses on a critical issue in one of the new shell scripts and some opportunities to further optimize the local BLAST workflows for better performance by avoiding redundant data fetching. I've also included a suggestion regarding configuration design for improved maintainability.
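
The retry behaviour mentioned above follows the familiar exponential-backoff pattern. As a generic illustration only (not the implementation in this pull request; the helper name and defaults are assumptions):

    import logging
    import random
    import time

    logger = logging.getLogger(__name__)

    def retry_with_backoff(call, *, attempts=3, base_delay=1.0, retry_on=(Exception,)):
        """Invoke call() and retry failures with exponential backoff plus jitter."""
        for attempt in range(1, attempts + 1):
            try:
                return call()
            except retry_on as exc:
                if attempt == attempts:
                    raise  # out of retries: surface the last error to the caller
                delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
                logger.warning("Attempt %d/%d failed (%s); retrying in %.1fs",
                               attempt, attempts, exc, delay)
                time.sleep(delay)

    # Hypothetical usage around a flaky API call:
    # record = retry_with_backoff(lambda: searcher.get_by_accession("P12345"), attempts=5)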

local filename=$2
local temp_file="/tmp/check_${filename//\//_}"
# First check if file is already downloaded locally
if check_file_downloaded "${filename}"; then

critical

This script will fail because the function check_file_downloaded is called here but is not defined anywhere in the script. Additionally, the surrounding block of code (lines 61-101) appears to be the body of the check_file_for_species function, but it's missing the function definition wrapper check_file_for_species() { ... }, which will cause a syntax error.
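
For reference, a minimal sketch of the missing pieces, assuming check_file_downloaded only needs to test for a non-empty local copy (treating the first positional argument as the species name is likewise an assumption; the real logic may differ):

    # Assumed helper: succeed if the file already exists locally and is non-empty.
    check_file_downloaded() {
        local filename=$1
        [ -s "${filename}" ]
    }

    # Missing wrapper around the existing body (script lines 61-101).
    check_file_for_species() {
        local species=$1
        local filename=$2
        local temp_file="/tmp/check_${filename//\//_}"
        # First check if file is already downloaded locally
        if check_file_downloaded "${filename}"; then
            return 0
        fi
        # ... rest of the existing body goes here ...
    }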

Comment on lines 513 to 518
if accession:
    self.logger.debug("Local BLAST found accession: %s", accession)
    # When using local BLAST, skip sequence fetching by default (faster, fewer API calls)
    # Sequence is already known from the query, so we only need metadata
    result = self.get_by_accession(accession)
    return result

medium

The local BLAST workflow is inefficient. When a match is found, it calls get_by_accession, which eventually re-fetches the sequence from the local database using blastdbcmd. This is redundant because the original sequence is already available from the query. The comment on line 515 is also misleading, as sequence fetching is not skipped.

To optimize, you can get the metadata via get_by_accession but then overwrite the sequence in the returned dictionary with the original query sequence, avoiding a second BLAST command.

                if accession:
                    self.logger.debug("Local BLAST found accession: %s", accession)
                    # Get metadata for the accession, then add the original sequence
                    result = self.get_by_accession(accession)
                    if result:
                        # Overwrite sequence with the original query to avoid redundant fetching
                        result["sequence"] = seq
                        result["sequence_length"] = len(seq)
                    return result

Comment on lines 307 to 316
  if accession:
-     logger.debug("Local BLAST found accession: %s", accession)
-     return self.get_by_rna_id(accession)
+     self.logger.debug("Local BLAST found accession: %s", accession)
+     detailed = self.get_by_rna_id(accession)
+     if detailed:
+         return detailed
+     self.logger.info(
+         "Local BLAST found accession %s but could not retrieve metadata from API.",
+         accession
+     )
  return None

medium

The local BLAST workflow is inefficient. It finds an accession with _local_blast and then calls get_by_rna_id, which re-fetches the full record, including the sequence, from the RNAcentral API. This is unnecessary since the sequence is already available from the query.

To optimize, after getting the result from get_by_rna_id, you should overwrite the sequence and sequence_length fields in the result dictionary with the original query sequence. This avoids the unnecessary data transfer from the API.

                accession = self._local_blast(seq, threshold)
                if accession:
                    self.logger.debug("Local BLAST found accession: %s", accession)
                    detailed = self.get_by_rna_id(accession)
                    if detailed:
                        # Overwrite sequence with original query to avoid re-fetching from API
                        detailed["sequence"] = seq
                        detailed["sequence_length"] = len(seq)
                        return detailed
                    self.logger.info(
                        "Local BLAST found accession %s but could not retrieve metadata from API.",
                        accession
                    )

Comment on lines 135 to 138
  accession = self._local_blast(seq, threshold)
  if accession:
-     logger.debug("Local BLAST found accession: %s", accession)
+     self.logger.debug("Local BLAST found accession: %s", accession)
      return self.get_by_accession(accession)

medium

The local BLAST workflow for UniProt is also inefficient. It uses _local_blast to find an accession and then calls get_by_accession, which fetches the entire record from the ExPASy API, including the sequence that was already provided in the query.

To optimize this, after get_by_accession returns the metadata, you can overwrite the sequence in the result dictionary with the original query sequence.

Suggested change
      accession = self._local_blast(seq, threshold)
      if accession:
          self.logger.debug("Local BLAST found accession: %s", accession)
-         return self.get_by_accession(accession)
+         result = self.get_by_accession(accession)
+         if result:
+             # Overwrite sequence with original query to avoid re-fetching from API
+             result["sequence"] = seq
+         return result

Comment on lines 104 to 106
uniprot_params = self.search_config.get("uniprot_params", {}).copy()
# Get max_concurrent from config before passing params to constructor
max_concurrent = uniprot_params.pop("max_concurrent", None)

medium

The max_concurrent parameter, which controls the concurrency of run_concurrent, is being extracted from the searcher-specific parameters (e.g., uniprot_params). This mixes the configuration for the execution utility with the configuration for the searcher class itself, which can be confusing and less maintainable.

A clearer design would be to separate execution-related parameters from searcher-specific parameters in your configuration files.
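
For example, a layout along these lines would keep the two concerns apart (the keys and values are illustrative assumptions, not this repository's actual schema):

    # Hypothetical config shape: execution knobs sit beside, not inside,
    # the searcher-specific parameter blocks.
    search_config = {
        "execution": {
            "max_concurrent": 8,          # consumed only by run_concurrent
        },
        "uniprot_params": {               # passed straight to the searcher constructor
            "use_local_blast": True,      # assumed option, for illustration only
        },
    }

    max_concurrent = search_config.get("execution", {}).get("max_concurrent", 4)
    uniprot_params = search_config.get("uniprot_params", {}).copy()
    # No pop() is needed: uniprot_params now contains only constructor arguments.

This keeps run_concurrent settings discoverable in one place and lets each *_params block mirror its searcher's constructor exactly.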

@ChenZiHong-Gavin merged commit a17bf7f into InternScience:main on Dec 25, 2025. 3 checks passed.
@ChenZiHong-Gavin deleted the search-multi-omics branch on December 25, 2025, 14:16.