216 changes: 216 additions & 0 deletions examples/generate/generate_omics_qa/README.md
@@ -0,0 +1,216 @@
# Multi-omics Knowledge Graph QA Generation

This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method.

## Pipeline Overview

The pipeline includes the following steps (the corresponding config nodes are sketched after the list):

1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data)
2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein); *optional if the input already contains search results*
3. **chunk**: Chunk sequences and metadata
4. **build_kg**: Extract entities and relationships to build knowledge graph
5. **partition**: Partition the knowledge graph into communities using anchor-based BFS
6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction
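
Each of these steps corresponds to a node in `omics_qa_config.yaml`. A minimal sketch of that layout, showing only ids, operators, and dependencies (see the full file in this directory for all parameters):

```yaml
nodes:
  - id: read_files        # step 1: read
    op_name: read
    dependencies: []
  - id: search_data       # step 2: search (NCBI / RNAcentral / UniProt)
    op_name: search
    dependencies: [read_files]
  - id: chunk_documents   # step 3: chunk
    op_name: chunk
    dependencies: [search_data]
  - id: build_kg          # step 4: build_kg
    op_name: build_kg
    dependencies: [chunk_documents]
  - id: partition         # step 5: partition (anchor-based BFS)
    op_name: partition
    dependencies: [build_kg]
  - id: generate          # step 6: generate (omics_qa)
    op_name: generate
    dependencies: [partition]
```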

## Key Features

- **Unified QA Generation**: A single `omics_qa` method handles DNA, RNA, and protein data
- **Automatic Caption Extraction**: Extracts and attaches molecule-specific information (DNA/RNA/protein captions) to each QA pair
- **Flexible Configuration**: Switch between DNA, RNA, and protein by changing the input file and data source
- **Anchor-based Partitioning**: Uses the molecule type (dna/rna/protein) as the anchor for BFS partitioning

## Quick Start

### 1. Configure Input Data

Edit `omics_qa_config.yaml` to set the input file path:

**For DNA:**
```yaml
input_path:
- examples/input_examples/search_dna_demo.jsonl
```

**For RNA:**
```yaml
input_path:
- examples/input_examples/search_rna_demo.jsonl
```

**For Protein:**
```yaml
input_path:
- examples/input_examples/search_protein_demo.jsonl
```

### 2. Configure Data Source

Set the appropriate data source and parameters in the `search_data` node:

**For DNA (NCBI):**
```yaml
data_sources: [ncbi]
ncbi_params:
  email: [email protected] # Required!
  tool: GraphGen
  use_local_blast: true
  local_blast_db: refseq_release/refseq_release
  blast_num_threads: 2
  max_concurrent: 5
```

**For RNA (RNAcentral):**
```yaml
data_sources: [rnacentral]
rnacentral_params:
  use_local_blast: true
  local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD
  blast_num_threads: 2
  max_concurrent: 5
```

**For Protein (UniProt):**
```yaml
data_sources: [uniprot]
uniprot_params:
  use_local_blast: true
  local_blast_db: ${RELEASE}/uniprot_sprot
  blast_num_threads: 2
  max_concurrent: 5
```

### 3. Configure Anchor Type

Set the `anchor_type` in the `partition` node to match your molecule type:

```yaml
partition:
  params:
    method: anchor_bfs
    method_params:
      anchor_type: protein # Change to "dna" or "rna" as needed
      max_units_per_community: 10
```

### 4. Run the Pipeline

```bash
./generate_omics_qa.sh
```

Or run directly with Python:

```bash
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
--output_dir cache/
```

## Input Format

### For DNA/RNA (JSONL format):
```jsonl
{"type": "text", "content": "BRCA1"}
{"type": "text", "content": ">query\nATGCGATCG..."}
{"type": "text", "content": "ATGCGATCG..."}
```

### For Protein (JSONL format):
```jsonl
{"type": "text", "content": "P01308"}
{"type": "text", "content": "insulin"}
{"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"}
```
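
You can assemble such a file by hand or from the shell. A minimal sketch for the protein case (the path `examples/input_examples/my_protein_queries.jsonl` is only an illustration, not a file shipped with the repo):

```bash
# Write a two-line protein query file in the JSONL format shown above.
cat > examples/input_examples/my_protein_queries.jsonl <<'EOF'
{"type": "text", "content": "P01308"}
{"type": "text", "content": "insulin"}
EOF
```

Point `input_path` in the config at this file (and keep `data_sources: [uniprot]`) to run the protein flow on your own queries.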

## Output Format

The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs:

### Alpaca Format:
```json
{
  "instruction": "What is the function of this protein?",
  "input": "",
  "output": "The protein functions as...",
  "dna": {...},      # DNA caption (if molecule_type is DNA)
  "rna": {...},      # RNA caption (if molecule_type is RNA)
  "protein": {...}   # Protein caption (if molecule_type is protein)
}
```

### ChatML Format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the function of this protein?",
          "dna": {...},
          "rna": {...},
          "protein": {...}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The protein functions as..."
    }
  ]
}
```

## Caption Information

The generator automatically extracts the relevant caption fields for each molecule type; a quick way to inspect them is sketched after the list:

- **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc.
- **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc.
- **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc.
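
For a quick look at the captions attached to Alpaca-format output, something like the following works. It is a sketch only: the output path under `cache/` is an assumption, so substitute wherever your run writes its QA file.

```bash
# Illustrative only: print instruction, protein name, and organism for protein-anchored QA pairs.
jq -r 'select(.protein != null) | [.instruction, .protein.protein_name, .protein.organism] | @tsv' \
  cache/output/qa.jsonl | head
```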

## Configuration Options

### Chunking Parameters
- `chunk_size`: Size for text metadata chunks (default: 1024)
- `chunk_overlap`: Overlap for text chunks (default: 100)
- `sequence_chunk_size`: Size for sequence chunks (default: 1000)
- `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100)

### Partition Parameters
- `method`: `anchor_bfs` (recommended for omics data)
- `anchor_type`: `dna`, `rna`, or `protein` (must match your data type)
- `max_units_per_community`: Maximum nodes and edges per community (default: 10)

### Generation Parameters
- `method`: `omics_qa` (unified method for DNA/RNA/Protein)
- `data_format`: `Alpaca`, `ChatML`, or `Sharegpt`
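
Put together, and using the same node-as-key shorthand as the anchor-type example above (the actual file lists these as entries under `nodes:`), the options map onto the config roughly as follows:

```yaml
chunk_documents:
  params:
    chunk_size: 1024
    chunk_overlap: 100
    sequence_chunk_size: 1000
    sequence_chunk_overlap: 100

partition:
  params:
    method: anchor_bfs
    method_params:
      anchor_type: [dna, rna, protein]  # or a single value such as protein
      max_units_per_community: 10

generate:
  params:
    method: omics_qa
    data_format: ChatML  # Alpaca, ChatML, or Sharegpt
```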

## Notes

- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
- **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein)
- **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`; a typical build command is sketched below)
- **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information
- Adjust `max_concurrent` based on your system resources and API rate limits
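
For reference, a local protein BLAST database is usually built with BLAST+'s `makeblastdb`. The command below is a sketch only (it assumes BLAST+ is installed and `uniprot_sprot.fasta` has already been downloaded); the scripts under `examples/search/build_db/` are the supported way to build these databases:

```bash
# Build a protein BLAST database from the Swiss-Prot FASTA (sketch).
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot
```

Then point `local_blast_db` in `uniprot_params` at the resulting `uniprot_sprot` prefix.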

## Examples

### Generate QA for Protein Data
1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl`
2. Set `data_sources: [uniprot]`
3. Set `anchor_type: protein`
4. Run `./generate_omics_qa.sh`

### Generate QA for DNA Data
1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl`
2. Set `data_sources: [ncbi]`
3. Set `anchor_type: dna`
4. Run `./generate_omics_qa.sh`

### Generate QA for RNA Data
1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl`
2. Set `data_sources: [rnacentral]`
3. Set `anchor_type: rna`
4. Run `./generate_omics_qa.sh`
3 changes: 3 additions & 0 deletions examples/generate/generate_omics_qa/generate_omics_qa.sh
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
--output_dir cache/
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config_searched.yaml \
--output_dir cache/
92 changes: 92 additions & 0 deletions examples/generate/generate_omics_qa/omics_qa_config.yaml
@@ -0,0 +1,92 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # three input files to generate DNA, RNA, and Protein data together
        - examples/input_examples/search_dna_demo.jsonl
        - examples/input_examples/search_rna_demo.jsonl
        - examples/input_examples/search_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    params:
      data_sources: [ncbi, rnacentral, uniprot] # Multi-omics: use all three data sources
      # DNA search parameters
      ncbi_params:
        email: [email protected] # Required for NCBI
        tool: GraphGen
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/refseq_version/refseq_version
        blast_num_threads: 2
        max_concurrent: 5
      # RNA search parameters
      rnacentral_params:
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/rnacentral_YYYYMMDD/rnacentral_YYYYMMDD
        blast_num_threads: 2
        max_concurrent: 5
      # Protein search parameters
      uniprot_params:
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/${RELEASE}/uniprot_sprot
        blast_num_threads: 2
        max_concurrent: 5

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML
73 changes: 73 additions & 0 deletions examples/generate/generate_omics_qa/omics_qa_config_searched.yaml
@@ -0,0 +1,73 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # Use pre-searched data files (skip search step)
        # The search_service will automatically detect and skip search if data already contains search results
        - examples/input_examples/searched_dna_demo.jsonl
        - examples/input_examples/searched_rna_demo.jsonl
        - examples/input_examples/searched_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    # Note: search_service will automatically detect pre-searched data and skip search,
    # but it will still normalize the data format (ensure _doc_id, content, data_source fields exist)

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML