Reduce ray log #125
Conversation
Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the platform's capabilities by introducing a multi-omics data processing and QA generation pipeline. It enables the creation of knowledge graphs and question-answer pairs from DNA, RNA, and protein sequences, leveraging specialized LLM-based extraction and generation models. Concurrently, the underlying search infrastructure has been refactored to provide more efficient and reliable access to external biological databases, including improved local BLAST support. These changes aim to streamline the processing of complex biological data and improve the overall user experience by reducing log verbosity and optimizing data handling.
Code Review
This pull request introduces significant new functionality for handling multi-omics data, including new examples, data files, and core logic for building knowledge graphs and generating QA pairs from DNA, RNA, and protein data. It also includes substantial refactoring and improvements to the search operators, database build scripts, and logging/execution engine to reduce log verbosity and improve robustness. While the new features are well-implemented, I've found some critical issues in the shell scripts for building local BLAST databases that would cause them to fail. There are also a couple of medium-severity issues related to a simplified regex and a hardcoded configuration parameter that reduces flexibility. Overall, this is a great contribution, and addressing the identified issues will make it even better.
    # First check if file is already downloaded locally
    if check_file_downloaded "$(unknown)"; then
        # File already exists, check if it contains target species
        # Check both compressed and decompressed versions
        local decompressed_file="${filename%.gz}"
        if [ -f "$(unknown)" ]; then
            # Compressed file exists
            if gunzip -c "$(unknown)" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        elif [ -f "${decompressed_file}" ]; then
            # Decompressed file exists
            if head -2000 "${decompressed_file}" 2>/dev/null | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        fi
    fi

    # File not downloaded yet, download first 500KB to check
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        # Check for: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast)
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
            rm -f "${temp_file}"
            return 0  # Contains target species
        else
The function check_file_for_species appears to be broken. The function definition check_file_for_species() { ... } and local variable declarations have been removed, leaving only the function body. This will cause a "command not found" error when the script tries to call it. Additionally, the function check_file_downloaded is called but not defined anywhere in the script.
# Function to check if a file has been downloaded
check_file_downloaded() {
    local filename=$1
    local decompressed_file="${filename%.gz}"
    if [ -f "$(unknown)" ] || [ -f "${decompressed_file}" ]; then
        return 0
    else
        return 1
    fi
}

# Function to check if a file contains target species
check_file_for_species() {
    local url=$1
    local filename=$2
    local temp_file="/tmp/check_${filename//\//_}"

    # First check if file is already downloaded locally
    if check_file_downloaded "$(unknown)"; then
        # File already exists, check if it contains target species
        # Check both compressed and decompressed versions
        local decompressed_file="${filename%.gz}"
        if [ -f "$(unknown)" ]; then
            # Compressed file exists
            if gunzip -c "$(unknown)" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        elif [ -f "${decompressed_file}" ]; then
            # Decompressed file exists
            if head -2000 "${decompressed_file}" 2>/dev/null | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
                return 0  # Contains target species
            else
                return 1  # Does not contain target species
            fi
        fi
    fi

    # File not downloaded yet, download first 500KB to check
    # Download first 500KB (enough to get many sequence headers)
    # This should be sufficient to identify the species in most cases
    if curl -s --max-time 30 --range 0-512000 "${url}" -o "${temp_file}" 2>/dev/null && [ -s "${temp_file}" ]; then
        # Try to decompress and check for species names
        # Check for: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast)
        if gunzip -c "${temp_file}" 2>/dev/null | head -2000 | grep -qE "(Homo sapiens|Mus musculus|Drosophila melanogaster|Saccharomyces cerevisiae)"; then
            rm -f "${temp_file}"
            return 0  # Contains target species
        else

# For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+
# Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

# Better to use a stable DOWNLOAD_TMP name to support resuming downloads
DOWNLOAD_TMP=_downloading
mkdir -p ${DOWNLOAD_TMP}
cd ${DOWNLOAD_TMP}

wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/RELEASE.metalink"
echo "Downloading RELEASE.metalink..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/RELEASE.metalink"

# Extract the release name (like 2017_10 or 2017_1)
# Use sed for cross-platform compatibility (works on both macOS and Linux)
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' RELEASE.metalink | head -1)

wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README"
wget -c "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/LICENSE"
echo "UniProt release: ${RELEASE}"
echo ""

# Download Swiss-Prot (always needed)
echo "Downloading uniprot_sprot.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

# Download TrEMBL only if full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    echo "Downloading uniprot_trembl.fasta.gz..."
    wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
fi

# Download metadata files
echo "Downloading metadata files..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/reldate.txt"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/README"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/LICENSE"

cd ..

mkdir ${RELEASE}
mkdir -p ${RELEASE}
mv ${DOWNLOAD_TMP}/* ${RELEASE}
rmdir ${DOWNLOAD_TMP}

cd ${RELEASE}

echo ""
echo "Extracting files..."
gunzip uniprot_sprot.fasta.gz
gunzip uniprot_trembl.fasta.gz

cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    gunzip uniprot_trembl.fasta.gz
    echo "Merging Swiss-Prot and TrEMBL..."
    cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
fi

makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
echo ""
echo "Building BLAST databases..."

# Always build Swiss-Prot database
makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids -title uniprot_sprot
makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl

# Build full release database only if in full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
    makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl
fi

cd ..

echo ""
echo "BLAST databases created successfully!"
echo "Database locations:"
echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
echo ""
echo "To use these databases, set in your config:"
echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot  # or uniprot_${RELEASE} or uniprot_trembl"
if [ "${DOWNLOAD_MODE}" = "sprot" ]; then
    echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo ""
    echo "To use this database, set in your config:"
    echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot"
else
    echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
    echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
    echo ""
    echo "To use these databases, set in your config:"
    echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot  # or uniprot_${RELEASE} or uniprot_trembl"
fi
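As a quick sanity check on the release-extraction step above, the sed expression can be exercised against a fabricated <version> tag. This is a minimal sketch; the sample metalink content and the /tmp path are made up for illustration and are not part of the PR:

```shell
# Exercise the release-extraction sed expression against a fabricated
# metalink fragment (real RELEASE.metalink files are much larger).
printf '<metalink><version>2024_06</version></metalink>\n' > /tmp/RELEASE.metalink
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' /tmp/RELEASE.metalink | head -1)
echo "Extracted release: ${RELEASE}"
```

The `head -1` guards against files that happen to contain more than one <version> tag, keeping only the first match.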
This script will fail because it uses several undefined variables: UNIPROT_BASE, DOWNLOAD_MODE, and DOWNLOAD_TMP. These variables need to be defined for the script to execute correctly. DOWNLOAD_MODE should likely be a command-line argument, and the others should be defined at the top of the script.
# For CentOS/RHEL/Fedora: sudo dnf install ncbi-blast+
# Or download from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

DOWNLOAD_MODE=${1:-sprot}  # sprot or full
UNIPROT_BASE="ftp://ftp.uniprot.org/pub/databases/uniprot"

# Better to use a stable DOWNLOAD_TMP name to support resuming downloads
DOWNLOAD_TMP=_downloading
mkdir -p ${DOWNLOAD_TMP}
cd ${DOWNLOAD_TMP}

echo "Downloading RELEASE.metalink..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/RELEASE.metalink"

# Extract the release name (like 2017_10 or 2017_1)
# Use sed for cross-platform compatibility (works on both macOS and Linux)
RELEASE=$(sed -n 's/.*<version>\([0-9]\{4\}_[0-9]\{1,2\}\)<\/version>.*/\1/p' RELEASE.metalink | head -1)
echo "UniProt release: ${RELEASE}"
echo ""

# Download Swiss-Prot (always needed)
echo "Downloading uniprot_sprot.fasta.gz..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

# Download TrEMBL only if full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    echo "Downloading uniprot_trembl.fasta.gz..."
    wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
fi

# Download metadata files
echo "Downloading metadata files..."
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/reldate.txt"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/README"
wget -c "${UNIPROT_BASE}/current_release/knowledgebase/complete/LICENSE"

cd ..
mkdir -p ${RELEASE}
mv ${DOWNLOAD_TMP}/* ${RELEASE}
rmdir ${DOWNLOAD_TMP}
cd ${RELEASE}

echo ""
echo "Extracting files..."
gunzip uniprot_sprot.fasta.gz
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    gunzip uniprot_trembl.fasta.gz
    echo "Merging Swiss-Prot and TrEMBL..."
    cat uniprot_sprot.fasta uniprot_trembl.fasta >uniprot_${RELEASE}.fasta
fi

echo ""
echo "Building BLAST databases..."
# Always build Swiss-Prot database
makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids -title uniprot_sprot
# Build full release database only if in full mode
if [ "${DOWNLOAD_MODE}" = "full" ]; then
    makeblastdb -in uniprot_${RELEASE}.fasta -out uniprot_${RELEASE} -dbtype prot -parse_seqids -title uniprot_${RELEASE}
    makeblastdb -in uniprot_trembl.fasta -out uniprot_trembl -dbtype prot -parse_seqids -title uniprot_trembl
fi

cd ..
echo ""
echo "BLAST databases created successfully!"
echo "Database locations:"
if [ "${DOWNLOAD_MODE}" = "sprot" ]; then
    echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo ""
    echo "To use this database, set in your config:"
    echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot"
else
    echo " - Combined: $(pwd)/${RELEASE}/uniprot_${RELEASE}"
    echo " - Swiss-Prot: $(pwd)/${RELEASE}/uniprot_sprot"
    echo " - TrEMBL: $(pwd)/${RELEASE}/uniprot_trembl"
    echo ""
    echo "To use these databases, set in your config:"
    echo " local_blast_db: $(pwd)/${RELEASE}/uniprot_sprot  # or uniprot_${RELEASE} or uniprot_trembl"
fi
    def __init__(self, working_dir: str = "cache"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
            backend="kuzu", working_dir=working_dir, namespace="graph"
        )
The graph_backend is now hardcoded to "kuzu". Previously, it was configurable via the __init__ parameter. This change reduces flexibility. It would be better to restore the parameter to allow users to choose a different graph backend from the configuration file.
    def __init__(self, working_dir: str = "cache"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
            backend="kuzu", working_dir=working_dir, namespace="graph"
        )

    def __init__(self, working_dir: str = "cache", graph_backend: str = "kuzu"):
        super().__init__(working_dir=working_dir, op_name="build_kg_service")
        self.llm_client: BaseLLMWrapper = init_llm("synthesizer")
        self.graph_storage: BaseGraphStorage = init_storage(
            backend=graph_backend, working_dir=working_dir, namespace="graph"
        )
        elif re.fullmatch(r"[A-NR-Z0-9]{6,10}", query, re.I):
            # UniProt accession IDs: 6-10 characters, must start with a letter
            # Format: [A-Z][A-Z0-9]{5,9} (6-10 chars total: 1 letter + 5-9 alphanumeric)
        elif re.fullmatch(r"[A-Z][A-Z0-9]{5,9}", query, re.I):
The regex for validating UniProt accession IDs is a simplification and may not cover all valid formats. A more accurate regex is used elsewhere in the codebase (e.g., in multi_omics_search.py). Using the more accurate regex here would improve correctness and consistency.
        elif re.fullmatch(r"[A-Z][A-Z0-9]{5,9}", query, re.I):
        elif re.fullmatch(r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})", query, re.I):
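To illustrate the difference, the fuller accession pattern can be exercised with grep -E, which supports the same extended regex syntax. This is a standalone sketch; the is_accession helper and the example IDs are illustrative, not part of the PR:

```shell
# Fuller UniProt accession regex from the suggestion above, anchored for
# whole-string matching (the grep -E analogue of re.fullmatch with re.I).
ACC_RE='^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$'

# Hypothetical helper for this sketch: succeeds if the argument looks like an accession.
is_accession() {
    printf '%s' "$1" | grep -qiE "${ACC_RE}"
}

is_accession "P12345"     && echo "P12345 matches"       # classic 6-char accession
is_accession "A0A023GPI8" && echo "A0A023GPI8 matches"   # 10-char accession
# "ABCDEF" passes the simplified [A-Z][A-Z0-9]{5,9} check but is rejected here:
is_accession "ABCDEF"     || echo "ABCDEF rejected"
```

The key gain over the simplified pattern is the positional digit constraints (e.g. the second character must be a digit), which rule out arbitrary all-letter strings that would otherwise be misclassified as accessions.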
No description provided.