Commit 79c008b

feat: multi-omics KG building
1 parent 02adac3 commit 79c008b

71 files changed, +4667 -60 lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -174,9 +174,13 @@ cython_debug/
 .pypirc

 cache
+cache_*
+databases/
 *.pyc
 *.html
 .gradio
+graph_kuzu*
+resources/bio-instructions/

 # macOS
 .DS_Store
Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
# Multi-omics Knowledge Graph QA Generation

This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method.

## Pipeline Overview

The pipeline includes the following steps:

1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data)
2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein) - *optional if input already contains search results*
3. **chunk**: Chunk sequences and metadata
4. **build_kg**: Extract entities and relationships to build knowledge graph
5. **partition**: Partition the knowledge graph into communities using anchor-based BFS
6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction

## Key Features

- **Unified QA Generation**: Single `omics_qa` method supports DNA, RNA, and Protein
- **Automatic Caption Extraction**: Automatically extracts and attaches molecule-specific information (dna/rna/protein captions) to each QA pair
- **Flexible Configuration**: Easy to switch between DNA, RNA, and Protein by changing input file and data source
- **Anchor-based Partitioning**: Uses molecule type as anchor for BFS partitioning (dna/rna/protein)

## Quick Start

### 1. Configure Input Data

Edit `omics_qa_config.yaml` to set the input file path:

**For DNA:**
```yaml
input_path:
  - examples/input_examples/search_dna_demo.jsonl
```

**For RNA:**
```yaml
input_path:
  - examples/input_examples/search_rna_demo.jsonl
```

**For Protein:**
```yaml
input_path:
  - examples/input_examples/search_protein_demo.jsonl
```

### 2. Configure Data Source

Set the appropriate data source and parameters in the `search_data` node:

**For DNA (NCBI):**
```yaml
data_sources: [ncbi]
ncbi_params:
  email: [email protected] # Required!
  tool: GraphGen
  use_local_blast: true
  local_blast_db: refseq_release/refseq_release
  blast_num_threads: 2
  max_concurrent: 5
```

**For RNA (RNAcentral):**
```yaml
data_sources: [rnacentral]
rnacentral_params:
  use_local_blast: true
  local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD
  blast_num_threads: 2
  max_concurrent: 5
```

**For Protein (UniProt):**
```yaml
data_sources: [uniprot]
uniprot_params:
  use_local_blast: true
  local_blast_db: ${RELEASE}/uniprot_sprot
  blast_num_threads: 2
  max_concurrent: 5
```

### 3. Configure Anchor Type

Set the `anchor_type` in the `partition` node to match your molecule type:

```yaml
partition:
  params:
    method: anchor_bfs
    method_params:
      anchor_type: protein # Change to "dna" or "rna" as needed
      max_units_per_community: 10
```
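
To make the two `method_params` more concrete, here is a minimal Python sketch of what an anchor-based BFS partition could look like. It is not GraphGen's implementation: the function name `anchor_bfs_partition`, the node/edge representation, and the rule that each node or edge counts as one "unit" are assumptions made for illustration only.

```python
from collections import deque


def anchor_bfs_partition(nodes, edges, anchor_type, max_units=10):
    """Grow one community per anchor node with a size-capped BFS.

    nodes: dict mapping node id -> node type (e.g. "protein", "gene")
    edges: list of undirected (u, v) pairs
    anchor_type: a single type or a list of types used as BFS start points
    max_units: assumed cap on nodes + edges in one community
    """
    adjacency = {node_id: [] for node_id in nodes}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    anchor_types = {anchor_type} if isinstance(anchor_type, str) else set(anchor_type)
    used = set()
    communities = []

    for anchor, node_type in nodes.items():
        if node_type not in anchor_types or anchor in used:
            continue
        comm_nodes, comm_edges = [anchor], []
        used.add(anchor)
        queue = deque([anchor])
        full = False
        while queue and not full:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor in used:
                    continue
                # Adding a neighbor costs two units: the node and its edge.
                if len(comm_nodes) + len(comm_edges) + 2 > max_units:
                    full = True
                    break
                used.add(neighbor)
                comm_nodes.append(neighbor)
                comm_edges.append((current, neighbor))
                queue.append(neighbor)
        communities.append({"nodes": comm_nodes, "edges": comm_edges})
    return communities


demo_nodes = {"P1": "protein", "G1": "gene", "F1": "function", "P2": "protein", "G2": "gene"}
demo_edges = [("P1", "G1"), ("G1", "F1"), ("P2", "G2")]
demo_communities = anchor_bfs_partition(demo_nodes, demo_edges, "protein", max_units=10)
```

In this sketch a lower `max_units` simply stops the traversal earlier, so nodes far from any anchor may end up in no community.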

### 4. Run the Pipeline

```bash
./generate_omics_qa.sh
```

Or run directly with Python:

```bash
python3 -m graphgen.run \
  --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
  --output_dir cache/
```

## Input Format

### For DNA/RNA (JSONL format):
```jsonl
{"type": "text", "content": "BRCA1"}
{"type": "text", "content": ">query\nATGCGATCG..."}
{"type": "text", "content": "ATGCGATCG..."}
```

### For Protein (JSONL format):
```jsonl
{"type": "text", "content": "P01308"}
{"type": "text", "content": "insulin"}
{"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"}
```
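
An input file in this shape can be produced with the standard library alone. The sketch below is only an illustration: the helper names `write_queries` and `read_queries` are hypothetical and not part of GraphGen, and the validation rule (every record has `"type": "text"` and a string `content`) is inferred from the examples above.

```python
import json


def write_queries(path, queries):
    """Write one {"type": "text", "content": ...} record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for query in queries:
            # json.dumps escapes the newline inside a FASTA-style query,
            # so each record still occupies exactly one line.
            handle.write(json.dumps({"type": "text", "content": query}) + "\n")


def read_queries(path):
    """Read the file back and check that each record has the expected shape."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("type") != "text" or not isinstance(record.get("content"), str):
                raise ValueError(f"line {line_no}: unexpected record {record!r}")
            records.append(record)
    return records
```

For example, `write_queries("my_queries.jsonl", ["BRCA1", ">query\nATGCGATCG"])` writes two lines, and `read_queries` returns the same two records.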

## Output Format

The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs:

### Alpaca Format:
```json
{
  "instruction": "What is the function of this protein?",
  "input": "",
  "output": "The protein functions as...",
  "dna": {...}, # DNA caption (if molecule_type is DNA)
  "rna": {...}, # RNA caption (if molecule_type is RNA)
  "protein": {...} # Protein caption (if molecule_type is protein)
}
```

### ChatML Format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the function of this protein?",
          "dna": {...},
          "rna": {...},
          "protein": {...}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The protein functions as..."
    }
  ]
}
```
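
The two layouts carry the same information, which the following sketch illustrates by converting an Alpaca-style record into the ChatML-style shape shown above. The function `alpaca_to_chatml` is a hypothetical helper, not a GraphGen API, and joining `instruction` and `input` with a newline is an assumption.

```python
CAPTION_KEYS = ("dna", "rna", "protein")


def alpaca_to_chatml(record):
    """Move the question and any molecule captions into a ChatML-style record."""
    # Keep only the caption keys that are actually present on the record.
    captions = {key: record[key] for key in CAPTION_KEYS if key in record}
    question = record["instruction"]
    if record.get("input"):
        question = f"{question}\n{record['input']}"
    return {
        "messages": [
            {"role": "user", "content": [{"text": question, **captions}]},
            {"role": "assistant", "content": record["output"]},
        ]
    }


example = alpaca_to_chatml(
    {
        "instruction": "What is the function of this protein?",
        "input": "",
        "output": "The protein functions as...",
        "protein": {"protein_name": "Insulin"},
    }
)
```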

## Caption Information

The generator automatically extracts relevant caption information based on molecule type:

- **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc.
- **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc.
- **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc.

## Configuration Options

### Chunking Parameters
- `chunk_size`: Size for text metadata chunks (default: 1024)
- `chunk_overlap`: Overlap for text chunks (default: 100)
- `sequence_chunk_size`: Size for sequence chunks (default: 1000)
- `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100)
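
As a rough illustration of how a size and an overlap interact, the sketch below splits a sequence into fixed-size windows that share `chunk_overlap` characters with their predecessor. This is an assumed sliding-window scheme for illustration; the name `chunk_sequence` is hypothetical and the actual GraphGen chunker may behave differently.

```python
def chunk_sequence(sequence, chunk_size=1000, chunk_overlap=100):
    """Split a sequence into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + chunk_size])
        if start + chunk_size >= len(sequence):
            break  # this window already reaches the end of the sequence
    return chunks


demo_sequence = "ACGT" * 625  # 2500 bases
demo_chunks = chunk_sequence(demo_sequence, chunk_size=1000, chunk_overlap=100)
```

With the defaults, consecutive windows start 900 positions apart, so the last 100 characters of one chunk are repeated at the start of the next.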

### Partition Parameters
- `method`: `anchor_bfs` (recommended for omics data)
- `anchor_type`: `dna`, `rna`, or `protein` (must match your data type)
- `max_units_per_community`: Maximum nodes and edges per community (default: 10)

### Generation Parameters
- `method`: `omics_qa` (unified method for DNA/RNA/Protein)
- `data_format`: `Alpaca`, `ChatML`, or `Sharegpt`

## Notes

- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
- **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein)
- **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`)
- **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information
- Adjust `max_concurrent` based on your system resources and API rate limits

## Examples

### Generate QA for Protein Data
1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl`
2. Set `data_sources: [uniprot]`
3. Set `anchor_type: protein`
4. Run `./generate_omics_qa.sh`

### Generate QA for DNA Data
1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl`
2. Set `data_sources: [ncbi]`
3. Set `anchor_type: dna`
4. Run `./generate_omics_qa.sh`

### Generate QA for RNA Data
1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl`
2. Set `data_sources: [rnacentral]`
3. Set `anchor_type: rna`
4. Run `./generate_omics_qa.sh`
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
  --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
  --output_dir cache/
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
  --config_file examples/generate/generate_omics_qa/omics_qa_config_searched.yaml \
  --output_dir cache/
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # three input files to generate DNA, RNA, and Protein data together
        - examples/input_examples/search_dna_demo.jsonl
        - examples/input_examples/search_rna_demo.jsonl
        - examples/input_examples/search_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    params:
      data_sources: [ncbi, rnacentral, uniprot] # Multi-omics: use all three data sources
      # DNA search parameters
      ncbi_params:
        email: [email protected] # Required for NCBI
        tool: GraphGen
        use_local_blast: true
        local_blast_db: databases/refseq_232_old/refseq_232
        blast_num_threads: 2
        max_concurrent: 5
      # RNA search parameters
      rnacentral_params:
        use_local_blast: true
        local_blast_db: databases/rnacentral_merged_20251213/rnacentral_merged_20251213
        blast_num_threads: 2
        max_concurrent: 5
      # Protein search parameters
      uniprot_params:
        use_local_blast: true
        # local_blast_db: ${RELEASE}/uniprot_sprot
        local_blast_db: databases/2025_04/uniprot_sprot
        blast_num_threads: 2
        max_concurrent: 5

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML
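
The `nodes` list above describes a dependency graph: each node names the nodes it depends on. As a hedged illustration of how such a list determines an execution order, the Python sketch below topologically sorts the node ids with the standard-library `graphlib` module (Python 3.9+). The `NODES` literal mirrors only the `id` and `dependencies` fields of the config, and `execution_order` is a hypothetical helper; how GraphGen actually schedules nodes is not shown in this commit.

```python
from graphlib import TopologicalSorter

# Only the id/dependencies fields of the config above, copied by hand.
NODES = [
    {"id": "read_files", "dependencies": []},
    {"id": "search_data", "dependencies": ["read_files"]},
    {"id": "chunk_documents", "dependencies": ["search_data"]},
    {"id": "build_kg", "dependencies": ["chunk_documents"]},
    {"id": "partition", "dependencies": ["build_kg"]},
    {"id": "generate", "dependencies": ["partition"]},
]


def execution_order(nodes):
    """Return node ids so that every node comes after its dependencies."""
    known = {node["id"] for node in nodes}
    graph = {}
    for node in nodes:
        missing = set(node["dependencies"]) - known
        if missing:
            raise ValueError(f"{node['id']} depends on unknown nodes: {sorted(missing)}")
        graph[node["id"]] = set(node["dependencies"])
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(NODES)
```

For this linear chain the order is simply the order in which the nodes are listed, from `read_files` to `generate`.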
