| 1 | +# Multi-omics Knowledge Graph QA Generation |
| 2 | + |
| 3 | +This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method. |
| 4 | + |
| 5 | +## Pipeline Overview |
| 6 | + |
| 7 | +The pipeline includes the following steps: |
| 8 | + |
| 9 | +1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data) |
| 10 | +2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein) - *optional if input already contains search results* |
| 11 | +3. **chunk**: Chunk sequences and metadata |
| 12 | +4. **build_kg**: Extract entities and relationships to build the knowledge graph |
| 13 | +5. **partition**: Partition the knowledge graph into communities using anchor-based BFS |
| 14 | +6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction |
| 15 | + |
| 16 | +## Key Features |
| 17 | + |
| 18 | +- **Unified QA Generation**: Single `omics_qa` method supports DNA, RNA, and Protein |
| 19 | +- **Automatic Caption Extraction**: Automatically extracts and attaches molecule-specific information (dna/rna/protein captions) to each QA pair |
| 20 | +- **Flexible Configuration**: Switch between DNA, RNA, and Protein by changing the input file, data source, and anchor type |
| 21 | +- **Anchor-based Partitioning**: Uses molecule type as anchor for BFS partitioning (dna/rna/protein) |
| 22 | + |
| 23 | +## Quick Start |
| 24 | + |
| 25 | +### 1. Configure Input Data |
| 26 | + |
| 27 | +Edit `omics_qa_config.yaml` to set the input file path: |
| 28 | + |
| 29 | +**For DNA:** |
| 30 | +```yaml |
| 31 | +input_path: |
| 32 | + - examples/input_examples/search_dna_demo.jsonl |
| 33 | +``` |
| 34 | + |
| 35 | +**For RNA:** |
| 36 | +```yaml |
| 37 | +input_path: |
| 38 | + - examples/input_examples/search_rna_demo.jsonl |
| 39 | +``` |
| 40 | + |
| 41 | +**For Protein:** |
| 42 | +```yaml |
| 43 | +input_path: |
| 44 | + - examples/input_examples/search_protein_demo.jsonl |
| 45 | +``` |
| 46 | + |
| 47 | +### 2. Configure Data Source |
| 48 | + |
| 49 | +Set the appropriate data source and parameters in the `search_data` node: |
| 50 | + |
| 51 | +**For DNA (NCBI):** |
| 52 | +```yaml |
| 53 | +data_sources: [ncbi] |
| 54 | +ncbi_params: |
| 55 | + email: your_email@example.com # Required! Replace with your own address |
| 56 | + tool: GraphGen |
| 57 | + use_local_blast: true |
| 58 | + local_blast_db: refseq_release/refseq_release |
| 59 | + blast_num_threads: 2 |
| 60 | + max_concurrent: 5 |
| 61 | +``` |
| 62 | + |
| 63 | +**For RNA (RNAcentral):** |
| 64 | +```yaml |
| 65 | +data_sources: [rnacentral] |
| 66 | +rnacentral_params: |
| 67 | + use_local_blast: true |
| 68 | + local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD |
| 69 | + blast_num_threads: 2 |
| 70 | + max_concurrent: 5 |
| 71 | +``` |
| 72 | + |
| 73 | +**For Protein (UniProt):** |
| 74 | +```yaml |
| 75 | +data_sources: [uniprot] |
| 76 | +uniprot_params: |
| 77 | + use_local_blast: true |
| 78 | + local_blast_db: ${RELEASE}/uniprot_sprot |
| 79 | + blast_num_threads: 2 |
| 80 | + max_concurrent: 5 |
| 81 | +``` |
| 82 | + |
| 83 | +### 3. Configure Anchor Type |
| 84 | + |
| 85 | +Set the `anchor_type` in the `partition` node to match your molecule type: |
| 86 | + |
| 87 | +```yaml |
| 88 | +partition: |
| 89 | + params: |
| 90 | + method: anchor_bfs |
| 91 | + method_params: |
| 92 | + anchor_type: protein # Change to "dna" or "rna" as needed |
| 93 | + max_units_per_community: 10 |
| 94 | +``` |
| 95 | + |
| 96 | +### 4. Run the Pipeline |
| 97 | + |
| 98 | +```bash |
| 99 | +./generate_omics_qa.sh |
| 100 | +``` |
| 101 | + |
| 102 | +Or run directly with Python: |
| 103 | + |
| 104 | +```bash |
| 105 | +python3 -m graphgen.run \ |
| 106 | + --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \ |
| 107 | + --output_dir cache/ |
| 108 | +``` |
| 109 | + |
| 110 | +## Input Format |
| 111 | + |
| 112 | +### For DNA/RNA (JSONL format): |
| 113 | +```jsonl |
| 114 | +{"type": "text", "content": "BRCA1"} |
| 115 | +{"type": "text", "content": ">query\nATGCGATCG..."} |
| 116 | +{"type": "text", "content": "ATGCGATCG..."} |
| 117 | +``` |
| 118 | + |
| 119 | +### For Protein (JSONL format): |
| 120 | +```jsonl |
| 121 | +{"type": "text", "content": "P01308"} |
| 122 | +{"type": "text", "content": "insulin"} |
| 123 | +{"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"} |
| 124 | +``` |
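The records above all share the same two-field shape, so an input file can be produced with the standard `json` module. The following is a minimal sketch; the helper name `to_jsonl` and the toy sequence are illustrative assumptions, not part of GraphGen.

```python
import json


def to_jsonl(queries):
    """Serialize each query as one {"type": "text", "content": ...} line."""
    lines = [json.dumps({"type": "text", "content": q}) for q in queries]
    return "\n".join(lines) + "\n"


# A gene name, a FASTA-style record, and a raw sequence (toy values).
dna_queries = ["BRCA1", ">query\nATGCGATCG", "ATGCGATCG"]
jsonl_text = to_jsonl(dna_queries)

# Round-trip check: every line parses back to the original query.
# The newline inside the FASTA record is escaped by json.dumps,
# so it does not break the one-record-per-line layout.
parsed = [json.loads(line)["content"] for line in jsonl_text.splitlines()]
assert parsed == dna_queries
```

Writing `jsonl_text` to a file and pointing `input_path` at it should then match the input format described here.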
| 125 | + |
| 126 | +## Output Format |
| 127 | + |
| 128 | +The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs: |
| 129 | + |
| 130 | +### Alpaca Format: |
| 131 | +```jsonc |
| 132 | +{ |
| 133 | + "instruction": "What is the function of this protein?", |
| 134 | + "input": "", |
| 135 | + "output": "The protein functions as...", |
| 136 | + "dna": {...}, // DNA caption (if molecule_type is DNA) |
| 137 | + "rna": {...}, // RNA caption (if molecule_type is RNA) |
| 138 | + "protein": {...} // Protein caption (if molecule_type is protein) |
| 139 | +} |
| 140 | +``` |
| 141 | + |
| 142 | +### ChatML Format: |
| 143 | +```json |
| 144 | +{ |
| 145 | + "messages": [ |
| 146 | + { |
| 147 | + "role": "user", |
| 148 | + "content": [ |
| 149 | + { |
| 150 | + "text": "What is the function of this protein?", |
| 151 | + "dna": {...}, |
| 152 | + "rna": {...}, |
| 153 | + "protein": {...} |
| 154 | + } |
| 155 | + ] |
| 156 | + }, |
| 157 | + { |
| 158 | + "role": "assistant", |
| 159 | + "content": "The protein functions as..." |
| 160 | + } |
| 161 | + ] |
| 162 | +} |
| 163 | +``` |
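To make the relationship between the two layouts concrete, here is a small sketch that maps an Alpaca-style record onto the ChatML layout shown above. The function `alpaca_to_chatml` is hypothetical and not part of GraphGen, the caption content is a placeholder, and the sketch assumes only the caption keys that are actually present should be carried over.

```python
CAPTION_KEYS = ("dna", "rna", "protein")


def alpaca_to_chatml(record):
    """Convert an Alpaca-style record into the ChatML layout shown above."""
    user_part = {"text": record["instruction"]}
    # Carry over whichever molecule captions the record contains.
    for key in CAPTION_KEYS:
        if key in record:
            user_part[key] = record[key]
    return {
        "messages": [
            {"role": "user", "content": [user_part]},
            {"role": "assistant", "content": record["output"]},
        ]
    }


alpaca = {
    "instruction": "What is the function of this protein?",
    "input": "",
    "output": "The protein functions as...",
    "protein": {"protein_name": "example"},  # placeholder caption
}
chatml = alpaca_to_chatml(alpaca)
```

Note that the Alpaca `input` field has no counterpart in the ChatML layout above, so this sketch drops it.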
| 164 | + |
| 165 | +## Caption Information |
| 166 | + |
| 167 | +The generator automatically extracts relevant caption information based on molecule type: |
| 168 | + |
| 169 | +- **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc. |
| 170 | +- **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc. |
| 171 | +- **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc. |
| 172 | + |
| 173 | +## Configuration Options |
| 174 | + |
| 175 | +### Chunking Parameters |
| 176 | +- `chunk_size`: Size for text metadata chunks (default: 1024) |
| 177 | +- `chunk_overlap`: Overlap for text chunks (default: 100) |
| 178 | +- `sequence_chunk_size`: Size for sequence chunks (default: 1000) |
| 179 | +- `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100) |
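As a rough illustration of how a chunk size and an overlap interact, the sketch below splits a sequence with a sliding window. It is an assumption about the general technique, not GraphGen's actual chunker; the function name `chunk_sequence` is hypothetical, and only the default values are taken from the list above.

```python
def chunk_sequence(sequence, chunk_size=1000, chunk_overlap=100):
    """Split a sequence into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + chunk_size])
        if start + chunk_size >= len(sequence):
            break  # this window already reaches the end of the sequence
    return chunks


toy_sequence = "ACGT" * 625  # 2500 bases, toy data
chunks = chunk_sequence(toy_sequence)
```

With the defaults, consecutive windows start 900 positions apart, so the last 100 characters of one chunk reappear at the start of the next.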
| 180 | + |
| 181 | +### Partition Parameters |
| 182 | +- `method`: `anchor_bfs` (recommended for omics data) |
| 183 | +- `anchor_type`: `dna`, `rna`, or `protein` (must match your data type) |
| 184 | +- `max_units_per_community`: Maximum nodes and edges per community (default: 10) |
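The idea behind anchor-based BFS can be sketched as follows: start a breadth-first search from every node whose type matches `anchor_type`, and stop growing a community once it reaches the size limit. This is a simplified illustration, not GraphGen's implementation. It counts nodes only (the real limit covers nodes and edges), and the graph representation, the function name `anchor_bfs`, and the toy node names are all assumptions.

```python
from collections import deque


def anchor_bfs(adjacency, node_types, anchor_type, max_units=10):
    """Grow one community per anchor node via breadth-first search."""
    communities = []
    assigned = set()  # nodes already placed in some community
    for anchor in adjacency:
        if node_types.get(anchor) != anchor_type or anchor in assigned:
            continue
        community = []
        queue = deque([anchor])
        seen = {anchor}
        while queue and len(community) < max_units:
            node = queue.popleft()
            if node in assigned:
                continue  # claimed by an earlier community
            community.append(node)
            assigned.add(node)
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        communities.append(community)
    return communities


# Toy graph: two protein anchors that share an organism node.
adjacency = {
    "protA": ["func1", "org"],
    "func1": ["protA"],
    "org": ["protA", "protB"],
    "protB": ["org", "func2"],
    "func2": ["protB"],
}
node_types = {"protA": "protein", "protB": "protein"}
communities = anchor_bfs(adjacency, node_types, "protein", max_units=3)
```

In this sketch an anchor that was already absorbed by an earlier community does not start its own, which is one of several reasonable design choices.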
| 185 | + |
| 186 | +### Generation Parameters |
| 187 | +- `method`: `omics_qa` (unified method for DNA/RNA/Protein) |
| 188 | +- `data_format`: `Alpaca`, `ChatML`, or `Sharegpt` |
| 189 | + |
| 190 | +## Notes |
| 191 | + |
| 192 | +- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params` |
| 193 | +- **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein) |
| 194 | +- **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`) |
| 195 | +- **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information |
| 196 | +- Adjust `max_concurrent` based on your system resources and API rate limits |
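The first two notes can be checked mechanically before a run. The sketch below works on plain Python values rather than on the YAML file, because the exact nesting of the config is not shown here. The helper `check_settings` is hypothetical, and the molecule-to-source mapping is the one used throughout this README.

```python
EXPECTED_SOURCE = {"dna": "ncbi", "rna": "rnacentral", "protein": "uniprot"}


def check_settings(anchor_type, data_sources, ncbi_params=None):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    expected = EXPECTED_SOURCE.get(anchor_type)
    if expected is None:
        problems.append(
            f"anchor_type must be one of {sorted(EXPECTED_SOURCE)}, got {anchor_type!r}"
        )
    elif expected not in data_sources:
        problems.append(
            f"anchor_type {anchor_type!r} expects data source {expected!r}, "
            f"got {list(data_sources)}"
        )
    # NCBI searches need a contact email.
    if "ncbi" in data_sources and not (ncbi_params or {}).get("email"):
        problems.append("ncbi_params.email is required for NCBI searches")
    return problems


ok = check_settings("protein", ["uniprot"])
missing_email = check_settings("dna", ["ncbi"], ncbi_params={"tool": "GraphGen"})
mismatch = check_settings("rna", ["uniprot"])
```

An empty list means neither of the two checked conditions was violated; it does not validate the rest of the configuration.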
| 197 | + |
| 198 | +## Examples |
| 199 | + |
| 200 | +### Generate QA for Protein Data |
| 201 | +1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl` |
| 202 | +2. Set `data_sources: [uniprot]` |
| 203 | +3. Set `anchor_type: protein` |
| 204 | +4. Run `./generate_omics_qa.sh` |
| 205 | + |
| 206 | +### Generate QA for DNA Data |
| 207 | +1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl` |
| 208 | +2. Set `data_sources: [ncbi]` |
| 209 | +3. Set `anchor_type: dna` |
| 210 | +4. Run `./generate_omics_qa.sh` |
| 211 | + |
| 212 | +### Generate QA for RNA Data |
| 213 | +1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl` |
| 214 | +2. Set `data_sources: [rnacentral]` |
| 215 | +3. Set `anchor_type: rna` |
| 216 | +4. Run `./generate_omics_qa.sh` |