216 changes: 216 additions & 0 deletions examples/generate/generate_omics_qa/README.md
@@ -0,0 +1,216 @@
# Multi-omics Knowledge Graph QA Generation

This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method.

## Pipeline Overview

The pipeline includes the following steps (the corresponding config nodes are sketched after the list):

1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data)
2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein); *optional if the input already contains search results*
3. **chunk**: Chunk sequences and metadata
4. **build_kg**: Extract entities and relationships to build knowledge graph
5. **partition**: Partition the knowledge graph into communities using anchor-based BFS
6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction
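
Each of these steps corresponds to a node in `omics_qa_config.yaml`. A minimal sketch of that layout, showing only ids, operators, and dependencies (see the full file in this directory for all parameters):

```yaml
nodes:
  - id: read_files        # step 1: read
    op_name: read
    dependencies: []
  - id: search_data       # step 2: search (NCBI / RNAcentral / UniProt)
    op_name: search
    dependencies: [read_files]
  - id: chunk_documents   # step 3: chunk
    op_name: chunk
    dependencies: [search_data]
  - id: build_kg          # step 4: build_kg
    op_name: build_kg
    dependencies: [chunk_documents]
  - id: partition         # step 5: partition (anchor-based BFS)
    op_name: partition
    dependencies: [build_kg]
  - id: generate          # step 6: generate (omics_qa)
    op_name: generate
    dependencies: [partition]
```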

## Key Features

- **Unified QA Generation**: A single `omics_qa` method handles DNA, RNA, and protein data
- **Automatic Caption Extraction**: Extracts and attaches molecule-specific information (DNA/RNA/protein captions) to each QA pair
- **Flexible Configuration**: Switch between DNA, RNA, and protein by changing the input file and data source
- **Anchor-based Partitioning**: Uses the molecule type (dna/rna/protein) as the anchor for BFS partitioning

## Quick Start

### 1. Configure Input Data

Edit `omics_qa_config.yaml` to set the input file path:

**For DNA:**
```yaml
input_path:
- examples/input_examples/search_dna_demo.jsonl
```

**For RNA:**
```yaml
input_path:
- examples/input_examples/search_rna_demo.jsonl
```

**For Protein:**
```yaml
input_path:
- examples/input_examples/search_protein_demo.jsonl
```

### 2. Configure Data Source

Set the appropriate data source and parameters in the `search_data` node:

**For DNA (NCBI):**
```yaml
data_sources: [ncbi]
ncbi_params:
  email: [email protected] # Required!
  tool: GraphGen
  use_local_blast: true
  local_blast_db: refseq_release/refseq_release
  blast_num_threads: 2
  max_concurrent: 5
```

**For RNA (RNAcentral):**
```yaml
data_sources: [rnacentral]
rnacentral_params:
  use_local_blast: true
  local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD
  blast_num_threads: 2
  max_concurrent: 5
```

**For Protein (UniProt):**
```yaml
data_sources: [uniprot]
uniprot_params:
  use_local_blast: true
  local_blast_db: ${RELEASE}/uniprot_sprot
  blast_num_threads: 2
  max_concurrent: 5
```

### 3. Configure Anchor Type

Set the `anchor_type` in the `partition` node to match your molecule type:

```yaml
partition:
  params:
    method: anchor_bfs
    method_params:
      anchor_type: protein # Change to "dna" or "rna" as needed
      max_units_per_community: 10
```

### 4. Run the Pipeline

```bash
./generate_omics_qa.sh
```

Or run directly with Python:

```bash
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
--output_dir cache/
```

## Input Format

### For DNA/RNA (JSONL format):
```jsonl
{"type": "text", "content": "BRCA1"}
{"type": "text", "content": ">query\nATGCGATCG..."}
{"type": "text", "content": "ATGCGATCG..."}
```

### For Protein (JSONL format):
```jsonl
{"type": "text", "content": "P01308"}
{"type": "text", "content": "insulin"}
{"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"}
```
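
You can assemble such a file by hand or from the shell. A minimal sketch for the protein case (the path `examples/input_examples/my_protein_queries.jsonl` is only an illustration, not a file shipped with the repo):

```bash
# Write a two-line protein query file in the JSONL format shown above.
cat > examples/input_examples/my_protein_queries.jsonl <<'EOF'
{"type": "text", "content": "P01308"}
{"type": "text", "content": "insulin"}
EOF
```

Point `input_path` in the config at this file (and keep `data_sources: [uniprot]`) to run the protein flow on your own queries.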

## Output Format

The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs:

### Alpaca Format:
```json
{
  "instruction": "What is the function of this protein?",
  "input": "",
  "output": "The protein functions as...",
  "dna": {...},      # DNA caption (if molecule_type is DNA)
  "rna": {...},      # RNA caption (if molecule_type is RNA)
  "protein": {...}   # Protein caption (if molecule_type is protein)
}
```

### ChatML Format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the function of this protein?",
          "dna": {...},
          "rna": {...},
          "protein": {...}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The protein functions as..."
    }
  ]
}
```

## Caption Information

The generator automatically extracts the relevant caption fields for each molecule type; a quick way to inspect them is sketched after the list:

- **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc.
- **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc.
- **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc.
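
For a quick look at the captions attached to Alpaca-format output, something like the following works. It is a sketch only: the output path under `cache/` is an assumption, so substitute wherever your run writes its QA file.

```bash
# Illustrative only: print instruction, protein name, and organism for protein-anchored QA pairs.
jq -r 'select(.protein != null) | [.instruction, .protein.protein_name, .protein.organism] | @tsv' \
  cache/output/qa.jsonl | head
```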

## Configuration Options

### Chunking Parameters
- `chunk_size`: Size for text metadata chunks (default: 1024)
- `chunk_overlap`: Overlap for text chunks (default: 100)
- `sequence_chunk_size`: Size for sequence chunks (default: 1000)
- `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100)

### Partition Parameters
- `method`: `anchor_bfs` (recommended for omics data)
- `anchor_type`: `dna`, `rna`, or `protein` (must match your data type)
- `max_units_per_community`: Maximum nodes and edges per community (default: 10)

### Generation Parameters
- `method`: `omics_qa` (unified method for DNA/RNA/Protein)
- `data_format`: `Alpaca`, `ChatML`, or `Sharegpt`
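
Put together, and using the same node-as-key shorthand as the anchor-type example above (the actual file lists these as entries under `nodes:`), the options map onto the config roughly as follows:

```yaml
chunk_documents:
  params:
    chunk_size: 1024
    chunk_overlap: 100
    sequence_chunk_size: 1000
    sequence_chunk_overlap: 100

partition:
  params:
    method: anchor_bfs
    method_params:
      anchor_type: [dna, rna, protein]  # or a single value such as protein
      max_units_per_community: 10

generate:
  params:
    method: omics_qa
    data_format: ChatML  # Alpaca, ChatML, or Sharegpt
```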

## Notes

- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
- **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein)
- **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`; a typical build command is sketched below)
- **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information
- Adjust `max_concurrent` based on your system resources and API rate limits
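
For reference, a local protein BLAST database is usually built with BLAST+'s `makeblastdb`. The command below is a sketch only (it assumes BLAST+ is installed and `uniprot_sprot.fasta` has already been downloaded); the scripts under `examples/search/build_db/` are the supported way to build these databases:

```bash
# Build a protein BLAST database from the Swiss-Prot FASTA (sketch).
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot
```

Then point `local_blast_db` in `uniprot_params` at the resulting `uniprot_sprot` prefix.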

## Examples

### Generate QA for Protein Data
1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl`
2. Set `data_sources: [uniprot]`
3. Set `anchor_type: protein`
4. Run `./generate_omics_qa.sh`

### Generate QA for DNA Data
1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl`
2. Set `data_sources: [ncbi]`
3. Set `anchor_type: dna`
4. Run `./generate_omics_qa.sh`

### Generate QA for RNA Data
1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl`
2. Set `data_sources: [rnacentral]`
3. Set `anchor_type: rna`
4. Run `./generate_omics_qa.sh`
3 changes: 3 additions & 0 deletions examples/generate/generate_omics_qa/generate_omics_qa.sh
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
--output_dir cache/
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config_searched.yaml \
--output_dir cache/
92 changes: 92 additions & 0 deletions examples/generate/generate_omics_qa/omics_qa_config.yaml
@@ -0,0 +1,92 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # three input files to generate DNA, RNA, and Protein data together
        - examples/input_examples/search_dna_demo.jsonl
        - examples/input_examples/search_rna_demo.jsonl
        - examples/input_examples/search_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    params:
      data_sources: [ncbi, rnacentral, uniprot] # Multi-omics: use all three data sources
      # DNA search parameters
      ncbi_params:
        email: [email protected] # Required for NCBI
        tool: GraphGen
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/refseq_version/refseq_version
        blast_num_threads: 2
        max_concurrent: 5
      # RNA search parameters
      rnacentral_params:
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/rnacentral_YYYYMMDD/rnacentral_YYYYMMDD
        blast_num_threads: 2
        max_concurrent: 5
      # Protein search parameters
      uniprot_params:
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/${RELEASE}/uniprot_sprot
        blast_num_threads: 2
        max_concurrent: 5

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML
73 changes: 73 additions & 0 deletions examples/generate/generate_omics_qa/omics_qa_config_searched.yaml
@@ -0,0 +1,73 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # Use pre-searched data files (skip search step)
        # The search_service will automatically detect and skip search if data already contains search results
        - examples/input_examples/searched_dna_demo.jsonl
        - examples/input_examples/searched_rna_demo.jsonl
        - examples/input_examples/searched_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    # Note: search_service will automatically detect pre-searched data and skip search,
    # but it will still normalize the data format (ensure _doc_id, content, data_source fields exist)

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML