add: search multi omics #140

CHERRY-ui8 · 2025-12-24T09:12:26Z

No description provided.

gemini-code-assist · 2025-12-24T09:12:55Z

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's biological data search capabilities by integrating multi-omics data types (DNA, protein, RNA). It streamlines the user experience through reorganized examples and robust backend improvements. The changes focus on enhancing the efficiency and reliability of searches against major public databases, both via API and local BLAST, and introduces a unified service for managing diverse search operations and normalizing their outputs.

Highlights

Multi-Omics Search Capabilities: Introduced comprehensive search functionalities for DNA, protein, and RNA sequences, enabling users to query respective public databases like NCBI, UniProt, and RNAcentral.
Refactored Search Examples and Scripts: The search examples and their associated scripts and configuration files have been reorganized into dedicated subdirectories (e.g., examples/search/search_dna), improving clarity and modularity.
Enhanced Local BLAST Database Management: The build_dna_blast_db.sh script now supports a broader range of species (Drosophila melanogaster, Saccharomyces cerevisiae) for minimal downloads, and the build_rna_blast_db.sh script has been significantly upgraded to allow selection of multiple RNAcentral database subsets and their merging into a single BLAST database.
Improved Searcher Implementations: Individual searcher clients (NCBISearch, RNACentralSearch, UniProtSearch) have been enhanced with better logging, multi-threading support, timeouts for local BLAST operations, refined query type detection, and robust retry mechanisms for API calls.
New Unified Search Service Operator: A new SearchService operator replaces the previous search_all.py, offering a more intelligent and centralized approach to handling multi-omics searches, including logic to normalize already processed data and integrate various search clients seamlessly.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a significant and valuable refactoring of the search examples and enhances the underlying searcher implementations. The introduction of the SearchService operator for Ray, along with the reorganization of example files, greatly improves the structure and usability. The enhancements to the searcher classes, including standardized logging, retry mechanisms for network robustness, and optimized local BLAST execution, are excellent additions. My review focuses on a critical issue in one of the new shell scripts and some opportunities to further optimize the local BLAST workflows for better performance by avoiding redundant data fetching. I've also included a suggestion regarding configuration design for improved maintainability.

gemini-code-assist · 2025-12-24T09:15:53Z

examples/search/search_dna/build_db.sh

-    local filename=$2
-    local temp_file="/tmp/check_${filename//\//_}"
+    # First check if file is already downloaded locally
+    if check_file_downloaded "${filename}"; then


This script will fail because the function check_file_downloaded is called here but is not defined anywhere in the script. Additionally, the surrounding block of code (lines 61-101) appears to be the body of the check_file_for_species function, but it's missing the function definition wrapper check_file_for_species() { ... }, which will cause a syntax error.

gemini-code-assist · 2025-12-24T09:15:53Z

graphgen/models/searcher/db/ncbi_searcher.py

+                if accession:
+                    self.logger.debug("Local BLAST found accession: %s", accession)
+                    # When using local BLAST, skip sequence fetching by default (faster, fewer API calls)
+                    # Sequence is already known from the query, so we only need metadata
+                    result = self.get_by_accession(accession)
+                    return result


The local BLAST workflow is inefficient. When a match is found, it calls get_by_accession, which eventually re-fetches the sequence from the local database using blastdbcmd. This is redundant because the original sequence is already available from the query. The comment on line 515 is also misleading, as sequence fetching is not skipped.

To optimize, you can get the metadata via get_by_accession but then overwrite the sequence in the returned dictionary with the original query sequence, avoiding a second BLAST command.

if accession: self.logger.debug("Local BLAST found accession: %s", accession) # Get metadata for the accession, then add the original sequence result = self.get_by_accession(accession) if result: # Overwrite sequence with the original query to avoid redundant fetching result["sequence"] = seq result["sequence_length"] = len(seq) return result

gemini-code-assist · 2025-12-24T09:15:53Z

graphgen/models/searcher/db/rnacentral_searcher.py

                if accession:
-                    logger.debug("Local BLAST found accession: %s", accession)
-                    return self.get_by_rna_id(accession)
+                    self.logger.debug("Local BLAST found accession: %s", accession)
+                    detailed = self.get_by_rna_id(accession)
+                    if detailed:
+                        return detailed
+                    self.logger.info(
+                        "Local BLAST found accession %s but could not retrieve metadata from API.",
+                        accession
+                    )
+                    return None


The local BLAST workflow is inefficient. It finds an accession with _local_blast and then calls get_by_rna_id, which re-fetches the full record, including the sequence, from the RNAcentral API. This is unnecessary since the sequence is already available from the query.

To optimize, after getting the result from get_by_rna_id, you should overwrite the sequence and sequence_length fields in the result dictionary with the original query sequence. This avoids the unnecessary data transfer from the API.

accession = self._local_blast(seq, threshold) if accession: self.logger.debug("Local BLAST found accession: %s", accession) detailed = self.get_by_rna_id(accession) if detailed: # Overwrite sequence with original query to avoid re-fetching from API detailed["sequence"] = seq detailed["sequence_length"] = len(seq) return detailed self.logger.info( "Local BLAST found accession %s but could not retrieve metadata from API.", accession )

gemini-code-assist · 2025-12-24T09:15:53Z

graphgen/models/searcher/db/uniprot_searcher.py

            accession = self._local_blast(seq, threshold)
            if accession:
-                logger.debug("Local BLAST found accession: %s", accession)
+                self.logger.debug("Local BLAST found accession: %s", accession)
+                return self.get_by_accession(accession)


The local BLAST workflow for UniProt is also inefficient. It uses _local_blast to find an accession and then calls get_by_accession, which fetches the entire record from the ExPASy API, including the sequence that was already provided in the query.

To optimize this, after get_by_accession returns the metadata, you can overwrite the sequence in the result dictionary with the original query sequence.

Suggested change

accession = self._local_blast(seq, threshold)

if accession:

logger.debug("Local BLAST found accession: %s", accession)

self.logger.debug("Local BLAST found accession: %s", accession)

return self.get_by_accession(accession)

accession = self._local_blast(seq, threshold)

if accession:

self.logger.debug("Local BLAST found accession: %s", accession)

result = self.get_by_accession(accession)

if result:

# Overwrite sequence with original query to avoid re-fetching from API

result["sequence"] = seq

return result

gemini-code-assist · 2025-12-24T09:15:53Z

graphgen/operators/search/search_service.py

+                uniprot_params = self.search_config.get("uniprot_params", {}).copy()
+                # Get max_concurrent from config before passing params to constructor
+                max_concurrent = uniprot_params.pop("max_concurrent", None)


The max_concurrent parameter, which controls the concurrency of run_concurrent, is being extracted from the searcher-specific parameters (e.g., uniprot_params). This mixes the configuration for the execution utility with the configuration for the searcher class itself, which can be confusing and less maintainable.

A clearer design would be to separate execution-related parameters from searcher-specific parameters in your configuration files.

…pendencies

…ent async search wrapper and remove unnecessary search output keys

…searchers

…to search-multi-omics

…ogic of RNA and prot search)

CHERRY-ui8 added 6 commits December 24, 2025 16:47

refactor: reconstruct examples/search file

628b8bd

feat: enhance BaseSearcher

4068b95

Merge remote-tracking branch 'origin/main' into search-multi-omics

098d7da

feat: enhance searchers with multi-threading and improved error handling

8eb45a9

add: add searched example input files for DNA, protein, and RNA

03067df

refactor: replace search_all with SearchService

1c534e4

gemini-code-assist bot reviewed Dec 24, 2025

View reviewed changes

CHERRY-ui8 and others added 19 commits December 24, 2025 17:22

fix: update search operator init to use SearchService

bf74484

fix: upgrade run_concurrent to be compactible with SearchService

2764acb

fix: add undefined functions in DNA db building script

9591a23

perf: perf search_service

f231f2a

wip: add pipeline annotations

b95b168

refactor: update multi-omcis eg. input to use specific data types

f027658

refactor: update input format in README files and base_reader

d7cc2de

chore: add kv_backend and graph_backend params to search config files

9d7a9e4

refactor: implement lazy imports in BaseOperator to avoid circular de…

89e085e

…pendencies

refactor: remove unnecessary asyncio locks and thread pool in searchers

2f87057

refactor: update SearchService to use specific search classes, implem…

45f848a

…ent async search wrapper and remove unnecessary search output keys

style: fix pylint problems

4698f8b

fix: change async BaseSearcher to sync to match DNA, RNA and protein …

cc474f9

…searchers

fix: pass threshold in searcher config

1642d4b

fix: delete duplicate logger

6a21b3d

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…

013208b

…to search-multi-omics

perf: perf search service

0b116b3

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…

d8b71be

…to search-multi-omics

fix: fix pylint problems (extract sequence parsing and local search l…

dadbb1d

…ogic of RNA and prot search)

ChenZiHong-Gavin merged commit a17bf7f into InternScience:main Dec 25, 2025
3 checks passed

ChenZiHong-Gavin deleted the search-multi-omics branch December 25, 2025 14:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add: search multi omics #140

add: search multi omics #140

Uh oh!

CHERRY-ui8 commented Dec 24, 2025

Uh oh!

gemini-code-assist bot commented Dec 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Dec 24, 2025

Uh oh!

gemini-code-assist bot Dec 24, 2025

Uh oh!

gemini-code-assist bot Dec 24, 2025

Uh oh!

gemini-code-assist bot Dec 24, 2025

Uh oh!

gemini-code-assist bot Dec 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

add: search multi omics #140

add: search multi omics #140

Uh oh!

Conversation

CHERRY-ui8 commented Dec 24, 2025

Uh oh!

gemini-code-assist bot commented Dec 24, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants