add: search multi omics #140
Conversation
Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the platform's biological data search capabilities by integrating multi-omics data types (DNA, protein, RNA). It streamlines the user experience through reorganized examples and robust backend improvements. The changes focus on enhancing the efficiency and reliability of searches against major public databases, both via API and local BLAST, and introduce a unified service for managing diverse search operations and normalizing their outputs.
Code Review
This pull request introduces a significant and valuable refactoring of the search examples and enhances the underlying searcher implementations. The introduction of the SearchService operator for Ray, along with the reorganization of example files, greatly improves the structure and usability. The enhancements to the searcher classes, including standardized logging, retry mechanisms for network robustness, and optimized local BLAST execution, are excellent additions. My review focuses on a critical issue in one of the new shell scripts and some opportunities to further optimize the local BLAST workflows for better performance by avoiding redundant data fetching. I've also included a suggestion regarding configuration design for improved maintainability.
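The retry mechanism praised above can be sketched as a small decorator with exponential backoff. This is an illustrative sketch only; the names `with_retries`, `max_retries`, and `backoff` are hypothetical and not taken from the PR's actual code:

```python
import time
from functools import wraps

def with_retries(max_retries=3, backoff=1.0, exceptions=(ConnectionError, TimeoutError)):
    """Retry a flaky network call with exponential backoff (illustrative sketch)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = backoff
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise  # give up after the final attempt
                    time.sleep(delay)
                    delay *= 2  # exponential backoff between attempts
        return wrapper
    return decorator

calls = []

@with_retries(max_retries=3, backoff=0.0)
def flaky_fetch():
    # Simulate a call that fails twice before succeeding
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient network error")
    return "ok"

result = flaky_fetch()
```

A decorator keeps retry policy out of each searcher's request logic, which is one way such robustness is commonly added.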
local filename=$2
local temp_file="/tmp/check_${filename//\//_}"
# First check if file is already downloaded locally
if check_file_downloaded "$(unknown)"; then
This script will fail because the function check_file_downloaded is called here but is not defined anywhere in the script. Additionally, the surrounding block of code (lines 61-101) appears to be the body of the check_file_for_species function, but it's missing the function definition wrapper check_file_for_species() { ... }, which will cause a syntax error.
if accession:
    self.logger.debug("Local BLAST found accession: %s", accession)
    # When using local BLAST, skip sequence fetching by default (faster, fewer API calls)
    # Sequence is already known from the query, so we only need metadata
    result = self.get_by_accession(accession)
    return result
The local BLAST workflow is inefficient. When a match is found, it calls get_by_accession, which eventually re-fetches the sequence from the local database using blastdbcmd. This is redundant because the original sequence is already available from the query. The comment on line 515 is also misleading, as sequence fetching is not skipped.
To optimize, you can get the metadata via get_by_accession but then overwrite the sequence in the returned dictionary with the original query sequence, avoiding a second BLAST command.
if accession:
    self.logger.debug("Local BLAST found accession: %s", accession)
    # Get metadata for the accession, then add the original sequence
    result = self.get_by_accession(accession)
    if result:
        # Overwrite sequence with the original query to avoid redundant fetching
        result["sequence"] = seq
        result["sequence_length"] = len(seq)
    return result

if accession:
    logger.debug("Local BLAST found accession: %s", accession)
    return self.get_by_rna_id(accession)
    self.logger.debug("Local BLAST found accession: %s", accession)
    detailed = self.get_by_rna_id(accession)
    if detailed:
        return detailed
    self.logger.info(
        "Local BLAST found accession %s but could not retrieve metadata from API.",
        accession
    )
return None
The local BLAST workflow is inefficient. It finds an accession with _local_blast and then calls get_by_rna_id, which re-fetches the full record, including the sequence, from the RNAcentral API. This is unnecessary since the sequence is already available from the query.
To optimize, after getting the result from get_by_rna_id, you should overwrite the sequence and sequence_length fields in the result dictionary with the original query sequence. This avoids the unnecessary data transfer from the API.
accession = self._local_blast(seq, threshold)
if accession:
    self.logger.debug("Local BLAST found accession: %s", accession)
    detailed = self.get_by_rna_id(accession)
    if detailed:
        # Overwrite sequence with original query to avoid re-fetching from API
        detailed["sequence"] = seq
        detailed["sequence_length"] = len(seq)
        return detailed
    self.logger.info(
        "Local BLAST found accession %s but could not retrieve metadata from API.",
        accession
    )

accession = self._local_blast(seq, threshold)
if accession:
    logger.debug("Local BLAST found accession: %s", accession)
    self.logger.debug("Local BLAST found accession: %s", accession)
    return self.get_by_accession(accession)
The local BLAST workflow for UniProt is also inefficient. It uses _local_blast to find an accession and then calls get_by_accession, which fetches the entire record from the ExPASy API, including the sequence that was already provided in the query.
To optimize this, after get_by_accession returns the metadata, you can overwrite the sequence in the result dictionary with the original query sequence.
accession = self._local_blast(seq, threshold)
if accession:
    logger.debug("Local BLAST found accession: %s", accession)
    self.logger.debug("Local BLAST found accession: %s", accession)
    return self.get_by_accession(accession)

accession = self._local_blast(seq, threshold)
if accession:
    self.logger.debug("Local BLAST found accession: %s", accession)
    result = self.get_by_accession(accession)
    if result:
        # Overwrite sequence with original query to avoid re-fetching from API
        result["sequence"] = seq
        return result
uniprot_params = self.search_config.get("uniprot_params", {}).copy()
# Get max_concurrent from config before passing params to constructor
max_concurrent = uniprot_params.pop("max_concurrent", None)
The max_concurrent parameter, which controls the concurrency of run_concurrent, is being extracted from the searcher-specific parameters (e.g., uniprot_params). This mixes the configuration for the execution utility with the configuration for the searcher class itself, which can be confusing and less maintainable.
A clearer design would be to separate execution-related parameters from searcher-specific parameters in your configuration files.
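One way to realize that separation is to give execution settings their own section of the config. This is a sketch under stated assumptions: the section names `execution` and `searchers` are hypothetical, not taken from the PR's configuration files:

```python
# Hypothetical config layout separating execution settings from searcher params
search_config = {
    "execution": {"max_concurrent": 8},         # settings for run_concurrent
    "searchers": {
        "uniprot_params": {"timeout": 30},      # searcher-specific settings only
    },
}

# Execution config is read from its own section, so searcher params
# no longer need a pop() to strip out unrelated keys
max_concurrent = search_config["execution"].get("max_concurrent", 4)
uniprot_params = search_config["searchers"].get("uniprot_params", {}).copy()
```

With this layout, each searcher's parameter dict can be passed to its constructor unchanged, and concurrency tuning lives in one predictable place.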
…ent async search wrapper and remove unnecessary search output keys
…to search-multi-omics
…ogic of RNA and prot search)
No description provided.