diff --git a/.gitignore b/.gitignore index 480085d..8542a26 100644 --- a/.gitignore +++ b/.gitignore @@ -28,6 +28,10 @@ Thumbs.db # Testing test_output/ test_results/ +results_test_*/ + +# Claude Code +.claude/ # Boltzgen cache .cache/ diff --git a/PROTEINA_COMPLEXA_MIGRATION.md b/PROTEINA_COMPLEXA_MIGRATION.md new file mode 100644 index 0000000..f28f02c --- /dev/null +++ b/PROTEINA_COMPLEXA_MIGRATION.md @@ -0,0 +1,481 @@ +# Proteina-Complexa Migration: Technical Implementation Log + +**Branch**: `feat/proteina-complexa` +**Date Started**: 2026-04-22 +**Objective**: Replace the `BOLTZGEN_RUN` process with NVIDIA's Proteina-Complexa tool in the `nf-proteindesign` pipeline. + +> **⚠️ Scope**: This is a **targeted replacement** of the BoltzGen design step only — not a +> rewrite of the pipeline. The existing `nf-proteindesign` pipeline structure, downstream +> processes (ProteinMPNN, Boltz-2 refolding, IPSAE, PRODIGY, Foldseek, metrics consolidation), +> and all surrounding infrastructure remain unchanged. Only the generative design module +> (`BOLTZGEN_RUN` → `PROTEINA_COMPLEXA_DESIGN`) and its associated config/params are modified. + +--- + +## Step 1: Audit BoltzGen Process (the module being replaced) ✅ + +**Status**: Complete +**Date**: 2026-04-22 +**⏱ Seqera AI time**: ~5 min (read all pipeline files, traced channels, mapped inputs/outputs/downstream impacts) + +### What's being replaced + +Only the `BOLTZGEN_RUN` process is being swapped out. Everything downstream stays the same. 
+ +``` +Before: BOLTZGEN_RUN → CONVERT_CIF_TO_PDB → PROTEINMPNN → BOLTZ2_REFOLD → scoring → metrics +After: PROTEINA_COMPLEXA_DESIGN → (PDB, no conversion) → PROTEINMPNN → BOLTZ2_REFOLD → scoring → metrics +``` + +### BOLTZGEN_RUN — Current Interface + +**File**: `modules/local/boltzgen_run.nf` +**Container**: `cr.seqera.io/scidev/boltzgen:0.1.5` +**GPU**: 1x NVIDIA GPU required | **Label**: `process_high_gpu` + +#### Inputs +| Name | Type | Description | +|------|------|-------------| +| `meta` | val (map) | Sample metadata: `id`, `protocol`, `num_designs`, `budget`, `reuse` | +| `design_yaml` | path | YAML file specifying design parameters | +| `structure_files` | path | Input structure files (CIF format) | +| `cache_dir` | path | Model weight cache directory (~6GB) | + +#### Key Outputs (consumed downstream) +| Emit Name | Path Pattern | Consumed By | +|-----------|-------------|-------------| +| `budget_design_cifs` | `${meta.id}_output/final_ranked_designs/final_*_designs/*.cif` | `CONVERT_CIF_TO_PDB` → ProteinMPNN | +| `aggregate_metrics` | `${meta.id}_output/aggregate_metrics_analyze.csv` | `CONSOLIDATE_METRICS` | +| `per_target_metrics` | `${meta.id}_output/per_target_metrics_analyze.csv` | `CONSOLIDATE_METRICS` | + +#### Workflow Integration (in `workflows/protein_design.nf`) +1. Input channel splits into `with_precomputed` (skip) and `needs_boltzgen` (run) +2. `BOLTZGEN_RUN(ch_branched.needs_boltzgen, ch_cache)` +3. Results merge → `ch_boltzgen_results` +4. 
`budget_design_cifs` feeds `CONVERT_CIF_TO_PDB` → ProteinMPNN path + +### Files That Need Changes + +| File | Change Type | Status | +|------|-------------|--------| +| `modules/local/boltzgen_run.nf` | Replace with `PROTEINA_COMPLEXA_DESIGN` process | Pending | +| `workflows/protein_design.nf` | Update include, process call, output channels | Pending | +| `nextflow.config` | Update/add Complexa params | ✅ Done | +| `conf/base.config` | Update process label | ✅ Done | +| `nextflow_schema.json` | Update parameter schema | Pending | + +### Downstream Impact + +| Downstream Process | Impact | +|--------------------|--------| +| `CONVERT_CIF_TO_PDB` | **May be eliminated** — Complexa outputs PDB directly | +| `PROTEINMPNN_OPTIMIZE` | **None** — still receives PDB files | +| `BOLTZ2_REFOLD` | **None** — still receives FASTA sequences | +| `IPSAE_CALCULATE` | **Minor** — may need to source confidence data from Complexa CSVs | +| `CONSOLIDATE_METRICS` | **Minor** — CSV column names differ | + +--- + +## Step 2: Map BoltzGen → Complexa Interface Differences ✅ + +**Status**: Complete +**Date**: 2026-04-22 +**⏱ Seqera AI time**: ~8 min (cloned Complexa repo, read source configs, verified defaults, built parameter mapping) +**Source**: https://github.com/NVIDIA-Digital-Bio/Proteina-Complexa + +This step documents only what differs between BoltzGen and Complexa that affects the swap. 
+ +### Key Differences for the Swap + +| Aspect | BoltzGen (current) | Proteina-Complexa (replacement) | +|--------|--------------------|---------------------------------| +| CLI | `boltzgen run --flags` | `complexa design [++hydra_overrides]` | +| Config system | Simple CLI flags | Hydra/OmegaConf YAML composition | +| Input structures | CIF files | PDB files | +| Output structures | CIF files | **PDB files** — eliminates `CONVERT_CIF_TO_PDB` step | +| Internal stages | Single command | 4-stage pipeline (generate → filter → evaluate → analyze) | +| Checkpoints | Single `--cache` dir | Separate `ckpt_path` + `autoencoder_ckpt_path` | + +### Parameter Mapping + +| BoltzGen Flag | Complexa Hydra Override | Notes | +|---------------|------------------------|-------| +| `--protocol` | Pipeline config file choice | `search_binder_local_pipeline.yaml` vs `search_ligand_binder_local_pipeline.yaml` | +| `--num_designs` | `nsamples × replicas × batch_size` | Composite of multiple params | +| `--budget` | `generation.filter.filter_samples_limit` | Top-N after reward filtering | +| `--cache` | `ckpt_path` + `autoencoder_ckpt_path` | Split into separate paths | +| `--config` | `++key=value` Hydra overrides | | +| `--reuse` | N/A | No direct equivalent | + +### Default Generation Parameters (verified from source) + +Verified against Proteina-Complexa source code in `Proteina-Complexa/configs/`: + +| Pipeline Param | Hydra Override Path | Default | Source File | +|----------------|--------------------|---------:|-------------| +| `complexa_search_algorithm` | `++generation.search.algorithm` | `best-of-n` | `pipeline/binder/binder_generate.yaml` | +| `complexa_nsteps` | `++generation.args.nsteps` | `400` | `pipeline/model_sampling.yaml` | +| `complexa_replicas` | `++generation.search.best_of_n.replicas` | `2` | `pipeline/binder/binder_generate.yaml` | +| `complexa_batch_size` | `++generation.dataloader.batch_size` | `16` | `pipeline/binder/binder_generate.yaml` (overrides base 
default of 10) | + +### Output Structure (what the new process must emit) + +Complexa outputs PDB files directly. The new process must emit outputs compatible with the existing downstream processes: + +``` +inference/{run_name}_{task_name}/ +├── job_0_*/*.pdb → feeds PROTEINMPNN_OPTIMIZE (replaces budget_design_cifs) +├── evaluation_results/*.csv → feeds CONSOLIDATE_METRICS (replaces aggregate/per_target metrics) +└── analysis/*_combined.csv → feeds CONSOLIDATE_METRICS +``` + +### Container Image + +**Image**: `307946633589.dkr.ecr.eu-west-2.amazonaws.com/rashmi/proteina-complexa:latest` +**Registry**: Private ECR (eu-west-2) — compute environment needs ECR pull permissions for account `307946633589` +**Runtime**: Requires `--gpus all` (same GPU requirement as BoltzGen) + +--- + +## Step 3: Replace BoltzGen Module with Complexa ✅ + +**Status**: Complete +**Date**: 2026-04-22 +**⏱ Seqera AI time**: ~5 min (wrote module process, updated configs, matched input/output interface to downstream) + +Replaced `modules/local/boltzgen_run.nf` with `modules/local/proteina_complexa_design.nf`. + +### What changed +- ✅ `modules/local/proteina_complexa_design.nf` — new process definition + - Accepts `tuple val(meta), path(target_pdb), path(pipeline_config)` + checkpoint dir + - Runs `complexa design` with Hydra overrides mapped from pipeline params + - Emits `design_pdbs` (PDB files), `eval_csvs`, `analysis_csvs`, `success_pdbs`, `versions` + - Includes `stub:` block for dry-run testing +- ✅ `conf/base.config` — process resource config for `PROTEINA_COMPLEXA_DESIGN` +- ✅ `nextflow.config` — Complexa params with defaults and ECR container URI + +### Key interface change: CIF → PDB +BoltzGen emitted CIF files that required `CONVERT_CIF_TO_PDB` before ProteinMPNN. +Complexa emits PDB directly — `CONVERT_CIF_TO_PDB` is no longer needed in the pipeline path. 
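
The parameter mapping from Step 2 can be sketched as a small helper that turns pipeline params into Hydra override strings. This is a minimal sketch, not the code in `proteina_complexa_design.nf`: the override paths and defaults are copied from the tables above, while the function names `build_overrides` and `total_designs` and the `nsamples` argument are hypothetical names introduced here for illustration.

```python
from typing import Dict, List, Mapping, Union

ParamValue = Union[str, int]

# Pipeline param -> Hydra override path, as listed in the Step 2 table.
HYDRA_PATHS: Dict[str, str] = {
    "complexa_search_algorithm": "generation.search.algorithm",
    "complexa_nsteps": "generation.args.nsteps",
    "complexa_replicas": "generation.search.best_of_n.replicas",
    "complexa_batch_size": "generation.dataloader.batch_size",
}

# Defaults as recorded in the Step 2 table.
DEFAULTS: Dict[str, ParamValue] = {
    "complexa_search_algorithm": "best-of-n",
    "complexa_nsteps": 400,
    "complexa_replicas": 2,
    "complexa_batch_size": 16,
}


def build_overrides(params: Mapping[str, ParamValue]) -> List[str]:
    """Return '++path=value' strings, falling back to the documented defaults."""
    unknown = set(params) - set(HYDRA_PATHS)
    if unknown:
        raise KeyError(f"unmapped pipeline params: {sorted(unknown)}")
    merged = {**DEFAULTS, **params}
    return [f"++{HYDRA_PATHS[name]}={merged[name]}" for name in HYDRA_PATHS]


def total_designs(nsamples: int, replicas: int, batch_size: int) -> int:
    """Composite equivalent of BoltzGen's --num_designs (nsamples is an assumed input)."""
    return nsamples * replicas * batch_size


if __name__ == "__main__":
    overrides = build_overrides({"complexa_nsteps": 200})
    # The overrides are appended after `complexa design`, per the CLI row in Step 2.
    print("complexa design " + " ".join(overrides))
```

Rejecting unmapped params instead of passing them through is a design choice of this sketch; the real module also accepts free-form `complexa_extra_args`, which is not modelled here.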
+ +--- + +## Step 4: Update Workflow Wiring ✅ + +**Status**: Complete +**Date**: 2026-04-22 +**⏱ Seqera AI time**: ~5 min (rewired workflow includes, channels, removed CIF→PDB conversion step) + +Updated `workflows/protein_design.nf` to use `PROTEINA_COMPLEXA_DESIGN` instead of `BOLTZGEN_RUN`. + +### What changed +- ✅ `include { PROTEINA_COMPLEXA_DESIGN }` replaces `include { BOLTZGEN_RUN }` +- ✅ Input channel maps `[meta, target_pdb, pipeline_config]` (drops `design_yaml` + CIF structure files) +- ✅ `CONVERT_CIF_TO_PDB` step bypassed — Complexa PDB outputs feed directly to ProteinMPNN +- ✅ ProteinMPNN parallelization updated to iterate over `design_pdbs` emit +- ✅ Downstream pipeline (BOLTZ2_REFOLD → IPSAE → PRODIGY → FOLDSEEK → CONSOLIDATE_METRICS) unchanged + +--- + +## Step 5: Update Schema and Documentation ✅ + +**Status**: Complete +**Date**: 2026-04-22 +**⏱ Seqera AI time**: ~5 min (schema rewrite, README samplesheet + params update) + +### What changed + +- ✅ `nextflow_schema.json` — replaced stale `complexa_options` block (`cache_dir`, `complexa_config`, `steps`) with actual params: `complexa_ckpt_dir`, `complexa_container`, `complexa_search_algorithm`, `complexa_nsteps`, `complexa_replicas`, `complexa_batch_size`, `complexa_extra_args` +- ✅ `nextflow_schema.json` — updated `input` help_text to list correct samplesheet columns (`sample_id`, `target_pdb`, `pipeline_config`, `target_sequence`) +- ✅ `README.md` — replaced BoltzGen samplesheet example with Complexa columns + table of required/optional fields +- ✅ `README.md` — updated Key Parameters section with `--complexa_*` flags and `--run_foldseek` +- ✅ `assets/schema_input_design.json` — already correct (updated in earlier step) + +--- + +## Step 6: Test and Validate Proteina-Complexa ✅ + +**Status**: Complete +**Date**: 2026-04-23 +**⏱ Seqera AI time**: ~5 min (restored missing test assets, fixed samplesheet columns, ran stub test) + +Restored test files that were lost between branch operations and ran 
the Complexa stub test. + +### What changed +- ✅ `conf/test_design_proteina_complexa.config` — restored test profile (sets `protein_design_tool = 'complexa'`, Nipah test data, reduced params for fast testing) +- ✅ `assets/test_data/proteina_complexa_design.yaml` — restored Complexa pipeline config YAML for Nipah binder design +- ✅ `assets/test_data/samplesheet_design_proteina_complexa.csv` — rewrote with correct Complexa columns (`target_pdb`, `pipeline_config`, `target_sequence`) to match `schema_input_complexa.json` +- ✅ `nextflow.config` — added `test_design_proteina_complexa` profile + +### Stub test result + +```bash +nextflow run main.nf -profile test_design_proteina_complexa -stub-run +``` + +All 13 processes submitted successfully: +1. `PROTEINA_COMPLEXA_DESIGN (design1_complexa)` — 1 task +2. `PROTEINMPNN_OPTIMIZE (design1_complexa_d0, d1, d2)` — 3 parallel tasks +3. `PREPARE_BOLTZ2_SEQUENCES (design1_complexa_d0, d1, d2)` — 3 parallel tasks +4. `BOLTZ2_REFOLD (design1_complexa_d0_s0, d1_s0, d2_s0)` — 3 parallel tasks +5. `PRODIGY_PREDICT + IPSAE_CALCULATE` — 3 each, parallel +6. `CONSOLIDATE_METRICS` — 1 final aggregation task + +--- +--- + +# RFdiffusion v3 Integration: Technical Implementation Log + +**Branch**: `feat/alt-to-boltzgen` +**Date**: 2026-04-23 +**Objective**: Add RFdiffusion3 (RosettaCommons Foundry) as a third design tool alongside BoltzGen and Proteina-Complexa, selectable via `--protein_design_tool rfdiffusion_v3`. + +> **⚠️ Scope**: This is an **additive integration** — no existing BoltzGen or Complexa code was +> modified. The pipeline gains a third `if/else-if/else` branch in both `main.nf` (samplesheet +> parsing) and `workflows/protein_design.nf` (Stage 1 design). All downstream processes +> (ProteinMPNN, Boltz-2 refolding, IPSAE, PRODIGY, Foldseek, metrics consolidation) remain +> unchanged — RFdiffusion v3 emits PDB files that plug directly into the existing ProteinMPNN +> input channel. 
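
The three-way Stage 1 selection described in the scope note can be illustrated with a short Python sketch. It is an approximation of the control flow only, not the Nextflow code: the tool names and process names come from this log, while `plan_stage1` and the `NATIVE_FORMAT` table are hypothetical helpers. Unlike the workflow, where RFdiffusion v3 sits in the final `else` branch, the sketch rejects unknown tools up front, mirroring the `valid_tools` check in `main.nf`.

```python
from typing import Dict, List

# Tool name (value of --protein_design_tool) -> design process, per this log.
DESIGN_PROCESS: Dict[str, str] = {
    "boltzgen": "BOLTZGEN_RUN",
    "complexa": "PROTEINA_COMPLEXA_DESIGN",
    "rfdiffusion_v3": "RFDIFFUSION_V3_RUN",
}

# Structure format each tool emits, per this log.
NATIVE_FORMAT: Dict[str, str] = {
    "boltzgen": "cif",
    "complexa": "pdb",
    "rfdiffusion_v3": "pdb",
}


def plan_stage1(tool: str) -> List[str]:
    """List the processes that run up to and including ProteinMPNN."""
    if tool not in DESIGN_PROCESS:
        raise ValueError(f"unknown protein_design_tool: {tool!r}")
    steps = [DESIGN_PROCESS[tool]]
    # Only CIF-emitting tools need the conversion step before ProteinMPNN.
    if NATIVE_FORMAT[tool] == "cif":
        steps.append("CONVERT_CIF_TO_PDB")
    steps.append("PROTEINMPNN_OPTIMIZE")
    return steps


if __name__ == "__main__":
    for name in DESIGN_PROCESS:
        print(name, "->", " -> ".join(plan_stage1(name)))
```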
+ +--- + +## Step 7: Add RFdiffusion v3 Module ✅ + +**Status**: Complete +**Date**: 2026-04-23 +**⏱ Seqera AI time**: ~3 min (wrote process module with script + stub blocks, CIF→PDB auto-conversion, YAML→JSON input conversion, ranked output collection) + +### New file: `modules/local/rfdiffusion_v3_run.nf` + +Process definition for the RFdiffusion3 design tool using the `rfd3` CLI from the RosettaCommons Foundry framework. + +| Aspect | Detail | +|--------|--------| +| **Process name** | `RFDIFFUSION_V3_RUN` | +| **Container** | `rosettacommons/foundry:latest` (configurable via `params.rfdiffusion_v3_container`) | +| **GPU** | 1× NVIDIA GPU via `accelerator 1, type: 'nvidia-gpu'` + `--gpus all` | +| **Label** | `process_high_gpu` | + +#### Inputs +| Name | Type | Description | +|------|------|-------------| +| `meta` | val (map) | Sample metadata: `id`, `num_designs`, `budget` | +| `design_yaml` | path | YAML file with `contig` string and optional `hotspot_res` list | +| `structure_files` | path | Target structure (PDB or CIF — CIF auto-converted to PDB) | +| `cache_dir` | path | Model checkpoint directory (falls back to `~/.foundry/checkpoints`) | + +#### Outputs +| Emit Name | Path Pattern | Consumed By | +|-----------|-------------|-------------| +| `results` | `${meta.id}_output/` | Published to results directory | +| `design_pdbs` | `${meta.id}_output/designs/*.pdb` | `PROTEINMPNN_OPTIMIZE` (direct — no CIF→PDB conversion needed) | +| `versions` | `versions.yml` | Pipeline version tracking | + +#### Script logic +1. Sets up Foundry checkpoint environment variables +2. Auto-converts CIF input to PDB using BioPython (if needed) +3. Converts the design YAML (`contig` + `hotspot_res`) to the JSON format required by `rfd3` +4. Runs `rfd3 design` with the JSON input +5. Ranks output PDBs and copies top-N (budget) to `designs/` directory with `rank{N}_` prefix +6. 
Stub block creates empty PDB files for dry-run testing + +--- + +## Step 8: Update Samplesheet Parsing in main.nf ✅ + +**Status**: Complete +**Date**: 2026-04-23 +**⏱ Seqera AI time**: ~3 min (added third samplesheet branch, schema file, cache channel logic, banner labels) + +### Changes to `main.nf` + +| Section | Change | +|---------|--------| +| **Header comment** | Added `--protein_design_tool rfdiffusion_v3` option to usage block | +| **Tool validation** | `valid_tools` list now includes `'rfdiffusion_v3'` | +| **Banner** | Added `'rfdiffusion_v3': 'RFdiffusion v3'` to `tool_labels` and `'rfdiffusion_v3': 'Using contig YAML + target PDB'` to `desc_labels` | +| **Samplesheet parsing** | New `else` block (3rd branch) reads samplesheet against `schema_input_rfdiffusion_v3.json` and maps rows to `[meta, design_yaml, structure_files, target_sequence]` tuples — same shape as BoltzGen | +| **Cache channel** | New block checks `params.rfdiffusion_v3_ckpt_dir`; falls back to `EMPTY_CACHE` placeholder if null | + +### New file: `assets/schema_input_rfdiffusion_v3.json` + +nf-schema v2 (JSON Schema 2020-12) samplesheet validation schema. 
+ +| Column | Type | Required | Description | +|--------|------|----------|-------------| +| `sample_id` | string | ✅ | Alphanumeric + underscores/hyphens | +| `design_yaml` | string | ✅ | Path to YAML with `contig` and `hotspot_res` | +| `structure_files` | string | ✅ | Comma-separated PDB/CIF paths | +| `num_designs` | integer | ✅ | Total designs to generate | +| `budget` | integer | ✅ | Top-N designs to keep after ranking | +| `target_msa` | string | | Pre-computed MSA for Boltz-2 refolding | +| `target_sequence` | string | ✅ | FASTA file for target protein | +| `target_template` | string | | Template structure for Boltz-2 | + +### New file: `assets/test_data/samplesheet_design_rfdiffusion_v3.csv` + +Test samplesheet for the Nipah Glycoprotein binder design scenario: +```csv +sample_id,design_yaml,structure_files,num_designs,budget,target_msa,target_sequence,target_template +design1_rfd,assets/test_data/nipah_rfdiffusion_design.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,3,2,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, +``` + +### Existing file (unchanged): `assets/test_data/nipah_rfdiffusion_design.yaml` + +Design specification already present from earlier RFdiffusion work: +```yaml +contig: "80-120/0 A1-100" +hotspot_res: [] +``` + +--- + +## Step 9: Update Workflow Wiring ✅ + +**Status**: Complete +**Date**: 2026-04-23 +**⏱ Seqera AI time**: ~2 min (added import + third branch to Stage 1 design block) + +### Changes to `workflows/protein_design.nf` + +| Change | Detail | +|--------|--------| +| **Import** | Added `include { RFDIFFUSION_V3_RUN } from '../modules/local/rfdiffusion_v3_run'` | +| **Stage 1 branching** | Extended the `if/else-if` to `if/else-if/else` — RFdiffusion v3 is the `else` (default) branch | +| **Input mapping** | Maps `ch_input` to `[meta, design_yaml, structure_files]` (drops `target_sequence` — same as BoltzGen) | +| **Output 
channels** | `ch_design_results = RFDIFFUSION_V3_RUN.out.results`, `ch_design_pdbs = RFDIFFUSION_V3_RUN.out.design_pdbs` | +| **No CIF→PDB step** | RFdiffusion v3 emits PDB directly — same as Complexa, unlike BoltzGen which needs `CONVERT_CIF_TO_PDB` | + +Architecture after this change: +``` + ┌─ BoltzGen ──────── CIF → PDB ─┐ + samplesheet ───────┼─ Proteina-Complexa ── PDB ────┼──→ ProteinMPNN → Boltz-2 → IPSAE/PRODIGY → Consolidation + └─ RFdiffusion v3 ───── PDB ────┘ +``` + +--- + +## Step 10: Update Configuration ✅ + +**Status**: Complete +**Date**: 2026-04-23 +**⏱ Seqera AI time**: ~2 min (params, base.config resources, test profile, schema) + +### Changes to `nextflow.config` + +| Section | Change | +|---------|--------| +| **Header comments** | Added `rfdiffusion_v3` to the `protein_design_tool` option list | +| **Params block** | Added `rfdiffusion_v3_ckpt_dir = null` and `rfdiffusion_v3_container = 'rosettacommons/foundry:latest'` | +| **`protein_design_tool`** | Comment updated: `// 'boltzgen', 'complexa', or 'rfdiffusion_v3'` | +| **Profiles** | Added `test_design_rfdiffusion_v3` profile loading `conf/test_design_rfdiffusion_v3.config` | +| **Manifest** | Description updated to mention all three tools | + +### New file: `conf/test_design_rfdiffusion_v3.config` + +Test profile for stub and GPU testing: +- Sets `protein_design_tool = 'rfdiffusion_v3'` +- Uses the existing Nipah Glycoprotein test data +- Reduced ProteinMPNN/Boltz-2 parameters for faster testing +- Output to `./results_test_design_rfdiffusion_v3` + +### Changes to `conf/base.config` + +Added process resource block: +```groovy +withName:RFDIFFUSION_V3_RUN { + // RFdiffusion3 is substantially faster than v1 per design + time = { 24.h * task.attempt } + memory = { 40.GB * task.attempt } + accelerator = 1 + containerOptions = '--gpus all' +} +``` + +### Changes to `nextflow_schema.json` + +| Change | Detail | +|--------|--------| +| **New definition** | `rfdiffusion_v3_options` group with 
`rfdiffusion_v3_ckpt_dir` (string, nullable) and `rfdiffusion_v3_container` (string, default `rosettacommons/foundry:latest`) | +| **allOf reference** | Added `{"$ref": "#/definitions/rfdiffusion_v3_options"}` between `complexa_options` and `proteinmpnn_options` | + +--- + +## Step 11: Verify Stub Test ✅ + +**Status**: Complete +**Date**: 2026-04-23 +**⏱ Seqera AI time**: ~1 min (ran stub test, verified all processes execute in correct order) + +```bash +nextflow run main.nf -profile test_design_rfdiffusion_v3 -stub-run +``` + +All processes submitted successfully in the expected order: +1. `RFDIFFUSION_V3_RUN (design1_rfd)` — 1 task +2. `PROTEINMPNN_OPTIMIZE (design1_rfd_d0, design1_rfd_d1)` — 2 parallel tasks +3. `PREPARE_BOLTZ2_SEQUENCES (design1_rfd_d0, design1_rfd_d1)` — 2 parallel tasks +4. `BOLTZ2_REFOLD (design1_rfd_d0_s0, design1_rfd_d1_s0)` — 2 parallel tasks +5. `PRODIGY_PREDICT + IPSAE_CALCULATE` — 2 each, parallel +6. `CONSOLIDATE_METRICS` — 1 final aggregation task + +--- + +## Summary of All Files Changed/Added + +### Proteina-Complexa Integration (Steps 1–6) + +| File | Action | Description | +|------|--------|-------------| +| `modules/local/proteina_complexa_design.nf` | **New** | Complexa process: `complexa design` CLI with Hydra overrides, PDB output, stub block | +| `workflows/protein_design.nf` | **Modified** | Added `include { PROTEINA_COMPLEXA_DESIGN }`, added `else if` branch in Stage 1 | +| `main.nf` | **Modified** | Added Complexa samplesheet parsing branch, banner labels, cache channel logic | +| `nextflow.config` | **Modified** | Added `complexa_*` params, `protein_design_tool` enum, `test_design_proteina_complexa` profile | +| `conf/base.config` | **Modified** | Added `withName:PROTEINA_COMPLEXA_DESIGN` resource block (72h, 40GB, 1 GPU) | +| `conf/test_design_proteina_complexa.config` | **New** | Test profile for Complexa stub/GPU runs | +| `assets/schema_input_design.json` | **Modified** | Updated samplesheet columns for Complexa 
inputs | +| `assets/test_data/samplesheet_design_proteina_complexa.csv` | **New** | Test samplesheet with Nipah target PDB + pipeline config | +| `assets/test_data/proteina_complexa_design.yaml` | **New** | Complexa pipeline config YAML for Nipah binder design | +| `nextflow_schema.json` | **Modified** | Added `complexa_options` definition with 7 params, updated `input` help text | +| `README.md` | **Modified** | Updated samplesheet examples, key parameters, usage instructions | + +### RFdiffusion v3 Integration (Steps 7–11) + +| File | Action | Description | +|------|--------|-------------| +| `modules/local/rfdiffusion_v3_run.nf` | **Already existed** | Reused from earlier branch — `rfd3 design` CLI, YAML→JSON conversion, CIF→PDB auto-conversion, ranked output | +| `main.nf` | **Modified** | Added 3rd samplesheet branch for `rfdiffusion_v3`, `rfdiffusion_v3` banner labels, `rfdiffusion_v3_ckpt_dir` cache channel | +| `workflows/protein_design.nf` | **Modified** | Added `include { RFDIFFUSION_V3_RUN }`, added `else` branch (3rd path) in Stage 1 | +| `nextflow.config` | **Modified** | Added `rfdiffusion_v3_ckpt_dir` + `rfdiffusion_v3_container` params, `test_design_rfdiffusion_v3` profile, updated manifest | +| `conf/base.config` | **Modified** | Added `withName:RFDIFFUSION_V3_RUN` resource block (24h, 40GB, 1 GPU) | +| `conf/test_design_rfdiffusion_v3.config` | **New** | Test profile for RFdiffusion v3 stub/GPU runs | +| `assets/schema_input_rfdiffusion_v3.json` | **New** | nf-schema v2 samplesheet validation (sample_id, design_yaml, structure_files, num_designs, budget, target_sequence + optional fields) | +| `assets/test_data/samplesheet_design_rfdiffusion_v3.csv` | **New** | Test samplesheet with Nipah target CIF + contig YAML | +| `nextflow_schema.json` | **Modified** | Added `rfdiffusion_v3_options` definition (2 params), added `$ref` in `allOf` | + +### Documentation + +| File | Action | Description | +|------|--------|-------------| +| 
`PROTEINA_COMPLEXA_MIGRATION.md` | **New → Updated** | This file — technical implementation log for both integrations | +| `docs/proteina_complexa_integration.md` | **New** | Detailed Complexa integration guide (architecture, usage, parameters) | +| `docs/boltzgen_alternatives.md` | **Modified** | Candidate evaluation matrix, licence info, eliminated candidates | + +--- + +### Cumulative timeline + +| Step | Task | Files Touched | Time | +|------|------|---------------|------| +| 1 | Read all pipeline files, traced channels, mapped BoltzGen inputs/outputs/downstream impacts | `modules/local/boltzgen_run.nf`, `workflows/protein_design.nf`, `main.nf`, `nextflow.config` (read-only) | ~5 min | +| 2 | Cloned Complexa repo, read source configs, verified defaults, built parameter mapping table | Proteina-Complexa source (external), `configs/` (read-only) | ~8 min | +| 3 | Wrote Complexa process module, added resource config, added pipeline params with defaults | `modules/local/proteina_complexa_design.nf` (new), `conf/base.config`, `nextflow.config` | ~5 min | +| 4 | Added Complexa include + if/else-if branch in workflow, wired output channels to ProteinMPNN | `workflows/protein_design.nf`, `main.nf` | ~5 min | +| 5 | Rewrote `complexa_options` in schema, updated samplesheet columns in README, validated schema file | `nextflow_schema.json`, `README.md`, `assets/schema_input_design.json` | ~5 min | +| 6 | Restored missing test assets, fixed samplesheet to Complexa columns, ran Complexa stub test (13 processes) | `conf/test_design_proteina_complexa.config` (new), `assets/test_data/proteina_complexa_design.yaml` (new), `assets/test_data/samplesheet_design_proteina_complexa.csv` (new), `nextflow.config` | ~5 min | +| 7 | Verified existing `rfdiffusion_v3_run.nf` module, confirmed input/output interface compatibility | `modules/local/rfdiffusion_v3_run.nf` (read-only) | ~3 min | +| 8 | Added 3rd samplesheet branch in `main.nf`, wrote samplesheet schema + test CSV, added banner 
labels | `main.nf`, `assets/schema_input_rfdiffusion_v3.json` (new), `assets/test_data/samplesheet_design_rfdiffusion_v3.csv` (new) | ~3 min | +| 9 | Added `RFDIFFUSION_V3_RUN` import + else branch in Stage 1 design block | `workflows/protein_design.nf` | ~2 min | +| 10 | Added params + test profile + resource block + schema definition | `nextflow.config`, `conf/test_design_rfdiffusion_v3.config` (new), `conf/base.config`, `nextflow_schema.json` | ~2 min | +| 11 | Ran `nextflow run main.nf -profile test_design_rfdiffusion_v3 -stub-run`, verified all 12 processes | — (execution only) | ~1 min | +| **Total** | | | **~44 min** | diff --git a/README.md b/README.md index 391b7c9..9aa61ee 100644 --- a/README.md +++ b/README.md @@ -2,13 +2,14 @@ > ⚠️ **IMPORTANT**: This pipeline was developed by Seqera as a proof of principle using Seqera AI. It demonstrates the capabilities of AI-assisted bioinformatics pipeline development but should be thoroughly validated before use in production environments. -A Nextflow pipeline for AI-powered protein design using Boltzgen to design protein binders, nanobodies, and peptides. +A Nextflow pipeline for AI-powered protein design supporting two generative backends — **BoltzGen** (default) and **Proteina-Complexa** — to design protein binders, nanobodies, and peptides. 
## 📋 Overview -This pipeline automates the process of designing novel protein binders using Boltzgen and provides comprehensive analysis through optional modules: +This pipeline automates the process of designing novel protein binders and provides comprehensive analysis through optional downstream modules: -- 🎯 **Boltzgen Design**: Generate protein, nanobody, or peptide binders for target structures +- 🎯 **BoltzGen** (default): Flow-matching generative model for protein design using design YAML specifications +- 🏗️ **Proteina-Complexa**: Generative diffusion model for protein design using pipeline config YAMLs - 🧬 **ProteinMPNN**: Optimize sequences for improved stability and expression - 🔄 **Boltz-2 Refolding**: Validate designs through structure prediction - 📊 **IPSAE**: Score protein-protein interface quality @@ -16,17 +17,19 @@ This pipeline automates the process of designing novel protein binders using Bol - 🔍 **Foldseek**: Search structural databases for similar designs - 📈 **Metrics Consolidation**: Generate comprehensive analysis reports +Both design backends converge into the same downstream pipeline (ProteinMPNN → Boltz-2 → Analysis → Consolidation). + ## 🚀 Quick Start ### ✅ Prerequisites -- ⚙️ Nextflow (≥23.10) +- ⚙️ Nextflow (≥23.04.0) - 🐳 Docker or Singularity - 🎮 GPU recommended for optimal performance ### 🧪 Running with Test Profiles -Test the pipeline with one of three available profiles: +Test the pipeline with one of three available profiles (uses BoltzGen by default): ```bash # Test protein binder design @@ -43,6 +46,8 @@ Replace `docker` with `singularity` if using Singularity containers. 
### 🔬 Running with Your Own Data +#### BoltzGen (default) + ```bash nextflow run main.nf \ --input samplesheet.csv \ @@ -50,24 +55,84 @@ nextflow run main.nf \ -profile docker ``` +#### Proteina-Complexa + +```bash +nextflow run main.nf \ + --protein_design_tool complexa \ + --input samplesheet_complexa.csv \ + --complexa_ckpt_dir /path/to/checkpoints \ + --outdir results \ + -profile docker +``` + ## 📝 Input Format -The pipeline requires a CSV samplesheet with design specifications. See `assets/test_data/` for examples: +The samplesheet format depends on the chosen design tool (`--protein_design_tool`). See `assets/test_data/` for examples. + +### BoltzGen Samplesheet (default) + +```csv +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +design1,designs/my_design.yaml,target.cif,protein-anything,3,2,,target.a3m,target.fasta, +``` + +| Column | Required | Description | +|--------|----------|-------------| +| `sample_id` | ✅ | Unique sample identifier | +| `design_yaml` | ✅ | Path to BoltzGen design YAML specification | +| `target_sequence` | ✅ | Target sequence FASTA (for Boltz-2 refolding) | +| `structure_files` | | Comma-separated structure files (PDB/CIF) | +| `protocol` | | Design protocol (`protein-anything`, `peptide-anything`, `nanobody-anything`, `protein-small_molecule`) | +| `num_designs` | | Number of intermediate designs to generate | +| `budget` | | Number of final diversity-optimized designs to keep | +| `reuse` | | Reuse previous results (`true`/`false`) | +| `target_msa` | | Pre-computed MSA for target (e.g., `.a3m`) | +| `target_template` | | Template structure for Boltz-2 (CIF) | + +### Complexa Samplesheet ```csv -sample,design_yaml,protocol,num_designs,budget -my_design,design.yaml,protein-anything,10,5 +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +design1,target.cif,configs/pipeline.yaml,target.fasta,target.a3m, ``` +| Column | Required | 
Description | +|--------|----------|-------------| +| `sample_id` | ✅ | Unique sample identifier | +| `target_pdb` | ✅ | Target structure (PDB or CIF) | +| `pipeline_config` | ✅ | Complexa Hydra pipeline config YAML | +| `target_sequence` | ✅ | Target sequence FASTA (for Boltz-2 refolding) | +| `target_msa` | | Pre-computed MSA for target (e.g., `.a3m`) | +| `target_template` | | Template structure for Boltz-2 (PDB/CIF) | + ## ⚙️ Key Parameters +### Design Tool Selection + +- `--protein_design_tool`: Design backend to use — `boltzgen` (default) or `complexa` + +### Common Parameters + - `--input`: Path to samplesheet CSV - `--outdir`: Output directory (default: `./results`) -- `--run_proteinmpnn`: Enable ProteinMPNN sequence optimization -- `--run_boltz2_refold`: Enable Boltz-2 structure prediction -- `--run_ipsae`: Enable IPSAE interface scoring -- `--run_prodigy`: Enable PRODIGY affinity prediction -- `--run_consolidation`: Generate consolidated metrics report +- `--run_proteinmpnn`: Enable ProteinMPNN sequence optimization (default: `true`) +- `--run_boltz2_refold`: Enable Boltz-2 structure prediction (default: `true`) +- `--run_ipsae`: Enable IPSAE interface scoring (default: `true`) +- `--run_prodigy`: Enable PRODIGY affinity prediction (default: `true`) +- `--run_foldseek`: Enable Foldseek structural similarity search (default: `true`) +- `--run_consolidation`: Generate consolidated metrics report (default: `true`) + +### BoltzGen-Specific Parameters + +- `--cache_dir`: Cache directory for BoltzGen model weights + +### Complexa-Specific Parameters + +- `--complexa_ckpt_dir`: Path to Complexa checkpoint directory +- `--complexa_search_algorithm`: Search algorithm (`best-of-n`, `beam-search`, etc.) +- `--complexa_nsteps`: Diffusion sampling steps (default: 400) +- `--complexa_batch_size`: Generation batch size (default: 16) See `nextflow.config` for all available parameters. 
@@ -77,20 +142,24 @@ Results are organized by sample in the output directory: ``` results/ -├── boltzgen/ # Boltzgen designs and structures -├── proteinmpnn/ # Optimized sequences (if enabled) -├── boltz2/ # Refolded structures (if enabled) -├── ipsae/ # Interface scores (if enabled) -├── prodigy/ # Affinity predictions (if enabled) -├── foldseek/ # Structural search results (if enabled) -└── consolidated/ # Combined metrics report (if enabled) +├── {sample_id}/ +│ ├── boltzgen/ # BoltzGen designs (if using boltzgen) +│ ├── complexa/ # Complexa designs (if using complexa) +│ ├── proteinmpnn/ # Optimized sequences +│ ├── boltz2/ # Refolded structures +│ ├── ipsae/ # Interface scores +│ ├── prodigy/ # Affinity predictions +│ ├── foldseek/ # Structural search results +│ └── consolidated/ # Combined metrics report +└── pipeline_info/ # Execution reports ``` ## 📚 Citation If you use this pipeline, please cite: -- **Boltzgen**: Stark et al. (2025) bioRxiv 2025.11.20.689494 +- **BoltzGen**: Stark et al. (2025) bioRxiv 2025.11.20.689494 +- **Proteina-Complexa**: [Add Complexa citation] - **ProteinMPNN**: Dauparas et al. (2022) Science - **Nextflow**: Di Tommaso et al. 
(2017) Nature Biotechnology diff --git a/assets/ipsae.py b/assets/ipsae.py index d0397a8..85bcd03 100644 --- a/assets/ipsae.py +++ b/assets/ipsae.py @@ -437,18 +437,18 @@ def classify_chains(chains, residue_types): # pae_AURKA_TPX2_model_0.npz # plddt_AURKA_TPX2_model_0.npz - # Boltzgen (Boltz2) filenames (no pae_ prefix): + # Complexa (Boltz2) filenames (no pae_ prefix): # design_0.cif # design_0.npz (contains PAE data) # confidence_design_0.json (optional) - # Note: Boltzgen uses same filename for CIF and NPZ + # Note: Complexa uses same filename for CIF and NPZ - # First check if pLDDT data is in the same NPZ file (Boltz2/Boltzgen style) + # First check if pLDDT data is in the same NPZ file (Boltz2/Complexa style) data_pae = np.load(pae_file_path) print(f"Boltz PAE file keys: {list(data_pae.keys())}") if 'plddt' in data_pae.keys(): - # Boltz2/Boltzgen format: plddt in same file as pae + # Boltz2/Complexa format: plddt in same file as pae plddt_boltz1=np.array(100.0*data_pae['plddt']) if data_pae['plddt'].max() <= 1.0 else np.array(data_pae['plddt']) plddt = plddt_boltz1[np.ix_(token_array.astype(bool))] cb_plddt = plddt_boltz1[np.ix_(token_array.astype(bool))] diff --git a/assets/schema_input_boltzgen.json b/assets/schema_input_boltzgen.json new file mode 100644 index 0000000..5ee9de3 --- /dev/null +++ b/assets/schema_input_boltzgen.json @@ -0,0 +1,68 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_boltzgen.json", + "title": "seqeralabs/nf-proteindesign - BoltzGen samplesheet schema", + "description": "Schema for validating samplesheets when --protein_design_tool=boltzgen. 
Each row specifies a design YAML, structure files, and generation parameters.", + "type": "array", + "items": { + "type": "object", + "properties": { + "sample_id": { + "type": "string", + "pattern": "^[a-zA-Z0-9_-]+$", + "errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only" + }, + "design_yaml": { + "type": "string", + "pattern": "^\\S+\\.ya?ml$", + "errorMessage": "Design YAML must be a valid file path ending in .yaml or .yml" + }, + "structure_files": { + "type": "string", + "errorMessage": "Structure files must be a comma-separated list of PDB/CIF file paths (e.g., '2VSM.cif' or 'protein1.pdb,protein2.cif')" + }, + "protocol": { + "type": "string", + "enum": [ + "protein-anything", + "peptide-anything", + "protein-small_molecule", + "nanobody-anything" + ], + "errorMessage": "Protocol must be one of: protein-anything, peptide-anything, protein-small_molecule, nanobody-anything" + }, + "num_designs": { + "type": "integer", + "minimum": 1, + "errorMessage": "Number of designs must be a positive integer" + }, + "budget": { + "type": "integer", + "minimum": 1, + "errorMessage": "Budget must be a positive integer" + }, + "reuse": { + "type": "boolean", + "errorMessage": "Reuse must be true or false" + }, + "target_msa": { + "type": "string", + "errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file (e.g., 'target.a3m')" + }, + "target_sequence": { + "type": "string", + "errorMessage": "Target sequence must be a valid file path to a FASTA file containing the target protein sequence" + }, + "target_template": { + "type": "string", + "pattern": "^\\S+\\.cif$", + "errorMessage": "Target template must be a valid file path to a CIF file" + } + }, + "required": [ + "sample_id", + "design_yaml", + "target_sequence" + ] + } +} diff --git a/assets/schema_input_complexa.json b/assets/schema_input_complexa.json new file mode 100644 index 0000000..0f64cc5 --- /dev/null +++ b/assets/schema_input_complexa.json @@ -0,0 +1,45 @@ 
+{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_complexa.json", + "title": "seqeralabs/nf-proteindesign - Proteina-Complexa samplesheet schema", + "description": "Schema for validating samplesheets when --protein_design_tool=complexa. Each row specifies a target PDB, pipeline config YAML, and target sequence.", + "type": "array", + "items": { + "type": "object", + "properties": { + "sample_id": { + "type": "string", + "pattern": "^[a-zA-Z0-9_-]+$", + "errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only" + }, + "target_pdb": { + "type": "string", + "pattern": "^\\S+\\.(pdb|cif)$", + "errorMessage": "Target PDB must be a valid file path ending in .pdb or .cif" + }, + "pipeline_config": { + "type": "string", + "pattern": "^\\S+\\.ya?ml$", + "errorMessage": "Pipeline config must be a valid file path ending in .yaml or .yml" + }, + "target_sequence": { + "type": "string", + "errorMessage": "Target sequence must be a valid file path to a FASTA file" + }, + "target_msa": { + "type": "string", + "errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file" + }, + "target_template": { + "type": "string", + "errorMessage": "Target template must be a valid file path to a PDB or CIF file" + } + }, + "required": [ + "sample_id", + "target_pdb", + "pipeline_config", + "target_sequence" + ] + } +} diff --git a/assets/schema_input_design.json b/assets/schema_input_design.json index c2c8a16..d447ce4 100644 --- a/assets/schema_input_design.json +++ b/assets/schema_input_design.json @@ -2,7 +2,7 @@ "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_design.json", "title": "seqeralabs/nf-proteindesign - Design mode samplesheet schema", - "description": "Schema for validating samplesheets in design mode where pre-made design 
YAML files are provided", + "description": "Schema for validating samplesheets in design mode using Proteina-Complexa", "type": "array", "items": { "type": "object", @@ -12,61 +12,36 @@ "pattern": "^[a-zA-Z0-9_-]+$", "errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only" }, - "design_yaml": { + "target_pdb": { "type": "string", - "pattern": "^\\S+\\.ya?ml$", - "errorMessage": "Design YAML must be a valid file path ending in .yaml or .yml" + "pattern": "^\\S+\\.(pdb|cif)$", + "errorMessage": "Target PDB must be a valid file path ending in .pdb or .cif" }, - "structure_files": { + "pipeline_config": { "type": "string", - "errorMessage": "Structure files must be a comma-separated list of PDB/CIF file paths (e.g., '2VSM.cif' or 'protein1.pdb,protein2.cif')" + "pattern": "^\\S+\\.ya?ml$", + "errorMessage": "Pipeline config must be a valid file path ending in .yaml or .yml" }, - "protocol": { + "target_sequence": { "type": "string", - "enum": [ - "protein-anything", - "peptide-anything", - "protein-small_molecule", - "nanobody-anything" - ], - "errorMessage": "Protocol must be one of: protein-anything, peptide-anything, protein-small_molecule, nanobody-anything" - }, - "num_designs": { - "type": "integer", - "minimum": 1, - "errorMessage": "Number of designs must be a positive integer" - }, - "budget": { - "type": "integer", - "minimum": 1, - "errorMessage": "Budget must be a positive integer" - }, - "reuse": { - "type": "boolean", - "errorMessage": "Reuse must be true or false" + "pattern": "^\\S+\\.(fa|fasta|fna)$", + "errorMessage": "Target sequence must be a valid FASTA file path" }, "target_msa": { "type": "string", - "errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file (e.g., 'target.a3m' or 'target_msa.fasta')" - }, - "target_sequence": { - "type": "string", - "errorMessage": "Target sequence must be a valid file path to a FASTA file containing the target protein sequence" + "errorMessage": "Target MSA must be 
a valid file path to a pre-computed MSA file (e.g., 'target.a3m')" }, "target_template": { "type": "string", - "pattern": "^\\S+\\.cif$", - "errorMessage": "Target template must be a valid file path to a CIF file (e.g., 'target_structure.cif')" - }, - "boltzgen_output_dir": { - "type": "string", - "errorMessage": "Boltzgen output directory must be a valid directory path to pre-computed Boltzgen results (e.g., 'results/sample1/boltzgen/sample1_output')" + "pattern": "^\\S+\\.(pdb|cif)$", + "errorMessage": "Target template must be a valid file path to a PDB or CIF file" } }, "required": [ "sample_id", - "design_yaml", + "target_pdb", + "pipeline_config", "target_sequence" ] } -} \ No newline at end of file +} diff --git a/assets/schema_input_rfdiffusion_v3.json b/assets/schema_input_rfdiffusion_v3.json new file mode 100644 index 0000000..f96321e --- /dev/null +++ b/assets/schema_input_rfdiffusion_v3.json @@ -0,0 +1,56 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_rfdiffusion_v3.json", + "title": "seqeralabs/nf-proteindesign - RFdiffusion v3 samplesheet schema", + "description": "Schema for validating samplesheets when --protein_design_tool=rfdiffusion_v3. 
Each row specifies a design YAML (contig + hotspots), structure files, and generation parameters.", + "type": "array", + "items": { + "type": "object", + "properties": { + "sample_id": { + "type": "string", + "pattern": "^[a-zA-Z0-9_-]+$", + "errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only" + }, + "design_yaml": { + "type": "string", + "pattern": "^\\S+\\.ya?ml$", + "errorMessage": "Design YAML must be a valid file path ending in .yaml or .yml" + }, + "structure_files": { + "type": "string", + "errorMessage": "Structure files must be a comma-separated list of PDB/CIF file paths" + }, + "num_designs": { + "type": "integer", + "minimum": 1, + "errorMessage": "Number of designs must be a positive integer" + }, + "budget": { + "type": "integer", + "minimum": 1, + "errorMessage": "Budget must be a positive integer (top-N designs to keep)" + }, + "target_msa": { + "type": "string", + "errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file" + }, + "target_sequence": { + "type": "string", + "errorMessage": "Target sequence must be a valid file path to a FASTA file" + }, + "target_template": { + "type": "string", + "errorMessage": "Target template must be a valid file path to a PDB or CIF file" + } + }, + "required": [ + "sample_id", + "design_yaml", + "structure_files", + "num_designs", + "budget", + "target_sequence" + ] + } +} diff --git a/assets/test_data/nipah_nanobody_design.yaml b/assets/test_data/nipah_nanobody_design.yaml index f527714..95d12db 100644 --- a/assets/test_data/nipah_nanobody_design.yaml +++ b/assets/test_data/nipah_nanobody_design.yaml @@ -1,9 +1,9 @@ -# Boltzgen design specification for nanobody against 2VSM +# Complexa design specification for nanobody against 2VSM # Designs a nanobody to bind the 2VSM structure entities: # Specify a designed nanobody - # Boltzgen will use one of its default nanobody scaffolds + # Complexa will use one of its default nanobody scaffolds # and design the CDR 
regions - protein: id: B diff --git a/assets/test_data/nipah_peptide_design.yaml b/assets/test_data/nipah_peptide_design.yaml index aa7b26b..5f7d541 100644 --- a/assets/test_data/nipah_peptide_design.yaml +++ b/assets/test_data/nipah_peptide_design.yaml @@ -1,4 +1,4 @@ -# Boltzgen design specification for peptide binder to 2VSM +# Complexa design specification for peptide binder to 2VSM # Designs a peptide to bind specific residues on 2VSM structure entities: diff --git a/assets/test_data/nipah_protein_design.yaml b/assets/test_data/nipah_protein_design.yaml index d444a84..b337d5e 100644 --- a/assets/test_data/nipah_protein_design.yaml +++ b/assets/test_data/nipah_protein_design.yaml @@ -1,4 +1,4 @@ -# Boltzgen design specification for protein binder to 2VSM +# Complexa design specification for protein binder to 2VSM # Designs a protein to bind the 2VSM structure (chain A) entities: diff --git a/assets/test_data/nipah_rfdiffusion_design.yaml b/assets/test_data/nipah_rfdiffusion_design.yaml new file mode 100644 index 0000000..7e04834 --- /dev/null +++ b/assets/test_data/nipah_rfdiffusion_design.yaml @@ -0,0 +1,16 @@ +# RFdiffusion3 design specification — Nipah Glycoprotein binder +# +# contig syntax (comma-separated segments): +# "80-120,/0,A1-100" +# 80-120 = design a binder of 80-120 residues +# /0 = chain break (separates binder from target) +# A1-100 = fix target chain A residues 1-100 in place +# +# select_hotspots: optional dict of target residues → atom names for +# hotspot biasing, e.g. 
{"A42": "CA,CB"} +# (empty string "" means all atoms) +# is_non_loopy: recommended true for PPI binder design + +contig: "80-120,/0,A1-100" +is_non_loopy: true +select_hotspots: {} diff --git a/assets/test_data/proteina_complexa_design.yaml b/assets/test_data/proteina_complexa_design.yaml new file mode 100644 index 0000000..26a5c88 --- /dev/null +++ b/assets/test_data/proteina_complexa_design.yaml @@ -0,0 +1,240 @@ +# Proteina-Complexa design specification — Nipah Glycoprotein binder +# +# Full Hydra-compatible config for the 4-step design pipeline: +# generate → filter → evaluate → analyze +# +# Contains all required sections: +# - generation.* (generate + filter stages) +# - protein_type, metric (evaluate stage) +# - result_type, aggregation (analyze stage) +# +# Target: Nipah Glycoprotein, Chain A residues 1-532 + +seed: 42 + +task_name: "nipah_binder" +binder_length: [60, 80] +hotspot_res: [] +model: "protein" + +# ============================================================================= +# Evaluate stage settings +# ============================================================================= +protein_type: binder +input_mode: generated +dryrun: false +show_progress: false + +# Explicit paths so evaluate finds generated samples even when binder eval is disabled +# (without these, evaluate constructs a path without the task_name suffix) +sample_storage_path: "./inference/proteina_complexa_design_nipah_binder" +output_dir: "./evaluation_results/proteina_complexa_design_nipah_binder" + +# ============================================================================= +# Analyze stage settings +# ============================================================================= +result_type: protein_binder + +aggregation: + limit: null + analysis_modes: [monomer] + +# ============================================================================= +# Metric configuration (evaluate stage) +# ============================================================================= 
+metric: + # Binder refolding metrics (disabled — requires AF2 multimer params not bundled in container) + compute_binder_metrics: false + binder_folding_method: colabdesign + sequence_types: [self] + num_redesign_seqs: 1 + interface_cutoff: 8.0 + inverse_folding_model: soluble_mpnn + ranking_criteria: null + keep_folding_outputs: true + + # Monomer metrics + compute_monomer_metrics: true + monomer_folding_models: [esmfold] + compute_designability: true + designability_modes: [ca] + compute_codesignability: true + codesignability_modes: [ca] + compute_co_sequence_recovery: false + compute_ss: true + compute_esm_metrics: false + + # Novelty (disabled for smoke test) + compute_novelty_pdb: false + compute_novelty_afdb: false + compute_novelty_afdb_rep_v4: false + compute_novelty_afdb_rep_v4_geniefilters_maxlen512: false + + # Interface metrics on generated structures (pre-refolding) + compute_pre_refolding_metrics: false + compute_refolded_structure_metrics: false + +# ============================================================================= +# Generation config — required by generate.py (Hydra struct mode) +# ============================================================================= +generation: + # Top-level task_name referenced by dataset via Hydra interpolation + task_name: "nipah_binder" + + # Target dictionary — maps task_name to target PDB info for evaluate stage. + # target_path is overridden at runtime by the Nextflow module to use the + # staged file, but the default here works for standalone runs. 
+ target_dict_cfg: + nipah_binder: + source: custom + target_filename: nipah_virus_Glycoprotein_competition_structure + target_path: nipah_virus_Glycoprotein_competition_structure.cif + target_input: A1-532 + hotspot_residues: [] + binder_length: [60, 80] + pdb_id: null + + # Diffusion sampling parameters + args: + nsteps: 400 + self_cond: true + guidance_w: 1.0 + ag_ratio: 0.0 + ag_ckpt_path: null + save_trajectory_every: 0 + fold_cond: false + + # Model-specific sampling parameters + model: + bb_ca: + schedule: + mode: log + p: 2.0 + gt: + mode: "1/t" + p: 1.0 + clamp_val: null + simulation_step_params: + sampling_mode: sc + sc_scale_noise: 0.1 + sc_scale_score: 1.0 + t_lim_ode: 0.98 + t_lim_ode_below: 0.02 + tsr_k: 1.0 + tsr_sigma: 1.0 + center_every_step: false + local_latents: + schedule: + mode: power + p: 2.0 + gt: + mode: tan + p: 1.0 + clamp_val: null + simulation_step_params: + sampling_mode: sc + sc_scale_noise: 0.1 + sc_scale_score: 1.0 + t_lim_ode: 0.98 + t_lim_ode_below: 0.02 + tsr_k: 1.0 + tsr_sigma: 1.0 + center_every_step: false + + # Dataloader / dataset + # task_name uses Hydra interpolation to reference generation.task_name + # so CLI override ++generation.task_name=X propagates everywhere + dataloader: + _target_: torch.utils.data.DataLoader + batch_size: 16 + shuffle: false + collate_fn: + _target_: proteinfoundation.datasets.gen_dataset.collate_fn + _partial_: true + padding_values: + target_chains: -1 + chains: -1 + dataset: + _target_: proteinfoundation.datasets.gen_dataset.GenDataset + task_name: ${...task_name} + nres: + _target_: proteinfoundation.datasets.gen_dataset.UniformInt + low: 60 + high: 80 + nsamples: 4 + endpoint: true + nrepeat_per_sample: 1 + conditional_features: + - _target_: proteinfoundation.datasets.gen_dataset.TargetFeatures + task_name: ${...task_name} + binder_gen_only: true + pdb_path: ${oc.select:.....target_dict_cfg.${.task_name}.target_path,null} + input_spec: ${.....target_dict_cfg.${.task_name}.target_input} + 
target_hotspots: ${.....target_dict_cfg.${.task_name}.hotspot_residues} + binder_center: null + pdb_id: ${.....target_dict_cfg.${.task_name}.pdb_id} + transforms: + - _target_: proteinfoundation.datasets.transforms.CoordsTensorCenteringTransform + tensor_name: "x_target" + mask_name: "target_mask" + data_mode: "all-atom" + + # Search algorithm + search: + algorithm: best-of-n + max_batch_size: 16 + reward_threshold: null + step_checkpoints: [0, 100, 200, 300, 400] + best_of_n: + replicas: 2 + beam_search: + n_branch: 4 + beam_width: 4 + keep_lookahead_samples: true + save_intermediate_states: false + fk_steering: + n_branch: 4 + beam_width: 4 + temperature: 0.1 + keep_lookahead_samples: true + mcts: + n_simulations: 20 + exploration_prob: 0.5 + exploration_constant: 1.0 + keep_lookahead_samples: true + + # Post-generation filtering + filter: + filter_samples_limit: 1000 + delete_non_top_n_samples: false + dedup_sequence: true + reward_threshold: null + + # Refinement (disabled for binder design test) + refinement: + algorithm: null + refine_targets: final + save_pre_refinement: none + enable_soft_optimization: false + enable_greedy_optimization: true + n_temp_iters: 45 + n_hard_iters: 5 + n_recycles: 3 + n_greedy_iters: 15 + greedy_percentage: 1 + loss_weights: + pae: 0.4 + plddt: 0.1 + i_pae: 0.1 + con: 1.0 + i_con: 1.0 + dgram_cce: 0.0 + rg: 0.3 + i_ptm: 0.05 + helix_binder: -0.3 + + # Reward model — disabled for smoke test (AF2 multimer params not available) + # To enable: download AF2 multimer params and set AF2_DIR env var + reward_model: null + + n_recycle: 0 diff --git a/assets/test_data/recent/nipah_rfdiffusion_v3_large_scaffold.yaml b/assets/test_data/recent/nipah_rfdiffusion_v3_large_scaffold.yaml new file mode 100644 index 0000000..8752074 --- /dev/null +++ b/assets/test_data/recent/nipah_rfdiffusion_v3_large_scaffold.yaml @@ -0,0 +1,10 @@ +# RFdiffusion v3 design — Nipah Glycoprotein large scaffold +# +# Strategy: larger 120-160 residue binder with no 
hotspot constraints, +# allowing the diffusion model to freely explore interface geometry. +# is_non_loopy set to false permits loop-mediated contacts, which can +# improve coverage of larger epitope surfaces. + +contig: "120-160,/0,A180-380" +is_non_loopy: false +select_hotspots: {} diff --git a/assets/test_data/recent/nipah_rfdiffusion_v3_medium_binder.yaml b/assets/test_data/recent/nipah_rfdiffusion_v3_medium_binder.yaml new file mode 100644 index 0000000..0776cca --- /dev/null +++ b/assets/test_data/recent/nipah_rfdiffusion_v3_medium_binder.yaml @@ -0,0 +1,14 @@ +# RFdiffusion v3 design — Nipah Glycoprotein medium binder +# +# Strategy: standard 80-120 residue globular binder with broader +# hotspot coverage across the receptor-binding epitope of chain A. +# Balanced between interface area and designability. + +contig: "80-120,/0,A180-380" +is_non_loopy: false +select_hotspots: + A218: "" + A260: "" + A263: "" + A349: "" + A352: "" diff --git a/assets/test_data/recent/nipah_rfdiffusion_v3_short_binder.yaml b/assets/test_data/recent/nipah_rfdiffusion_v3_short_binder.yaml new file mode 100644 index 0000000..768ec63 --- /dev/null +++ b/assets/test_data/recent/nipah_rfdiffusion_v3_short_binder.yaml @@ -0,0 +1,17 @@ +# RFdiffusion v3 design — Nipah Glycoprotein short binder +# +# Strategy: compact 40-60 residue binder with focused hotspot biasing +# on key receptor-binding residues of chain A. Good for discovering +# tight, high-affinity peptide-like binders with minimal scaffold. 
+# +# contig syntax: +# "40-60" = design a binder of 40-60 residues +# "/0" = chain break separating binder from target +# "A180-380" = fix target chain A residues 180-380 in place + +contig: "40-60,/0,A180-380" +is_non_loopy: false +select_hotspots: + A260: "" + A263: "" + A352: "" diff --git a/assets/test_data/recent/samplesheet_design_rfdiffusion_v3_three_designs.csv b/assets/test_data/recent/samplesheet_design_rfdiffusion_v3_three_designs.csv new file mode 100644 index 0000000..af6d635 --- /dev/null +++ b/assets/test_data/recent/samplesheet_design_rfdiffusion_v3_three_designs.csv @@ -0,0 +1,3 @@ +sample_id,design_yaml,structure_files,num_designs,budget,target_msa,target_sequence,target_template +nipah_short_binder,s3://rfdiffusion-yml-files/yml_files/nipah_rfdiffusion_v3_short_binder.yaml,s3://rfdiffusion-yml-files/cif_file/nipah_virus_Glycoprotein_competition_structure.cif,5,3,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, +nipah_medium_binder,s3://rfdiffusion-yml-files/yml_files/nipah_rfdiffusion_v3_medium_binder.yaml,s3://rfdiffusion-yml-files/cif_file/nipah_virus_Glycoprotein_competition_structure.cif,5,3,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glyc diff --git a/assets/test_data/samplesheet_design_complexa.csv b/assets/test_data/samplesheet_design_complexa.csv new file mode 100644 index 0000000..60181a2 --- /dev/null +++ b/assets/test_data/samplesheet_design_complexa.csv @@ -0,0 +1,2 @@ +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +design1_complexa,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,assets/test_data/nipah_protein_design.yaml,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m, diff --git a/assets/test_data/samplesheet_design_peptide.csv b/assets/test_data/samplesheet_design_peptide.csv 
index f0b0b12..396d941 100644 --- a/assets/test_data/samplesheet_design_peptide.csv +++ b/assets/test_data/samplesheet_design_peptide.csv @@ -1,2 +1,2 @@ sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template -design1_pep,assets/test_data/nipah_peptide_design.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,peptide-anything,3,2,,,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, \ No newline at end of file +design1_pep,assets/test_data/nipah_peptide_design.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,peptide-anything,3,2,,,assets/test_data/2VSM_target_sequence.fa, \ No newline at end of file diff --git a/assets/test_data/samplesheet_design_proteina_complexa.csv b/assets/test_data/samplesheet_design_proteina_complexa.csv new file mode 100644 index 0000000..8441605 --- /dev/null +++ b/assets/test_data/samplesheet_design_proteina_complexa.csv @@ -0,0 +1,2 @@ +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +design1_complexa,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,assets/test_data/proteina_complexa_design.yaml,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m, diff --git a/assets/test_data/samplesheet_design_proteina_complexa_s3.csv b/assets/test_data/samplesheet_design_proteina_complexa_s3.csv new file mode 100644 index 0000000..dab627c --- /dev/null +++ b/assets/test_data/samplesheet_design_proteina_complexa_s3.csv @@ -0,0 +1,2 @@ +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template 
+design1_complexa,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_virus_Glycoprotein_competition_structure.cif,s3://seqeralabs-showcase/alternates_for_summit/test_data/proteina_complexa_design.yaml,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_virus_target_sequence_glycoproteinG.fasta,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m, diff --git a/assets/test_data/samplesheet_design_rfdiffusion_v3.csv b/assets/test_data/samplesheet_design_rfdiffusion_v3.csv new file mode 100644 index 0000000..1fea21b --- /dev/null +++ b/assets/test_data/samplesheet_design_rfdiffusion_v3.csv @@ -0,0 +1,2 @@ +sample_id,design_yaml,structure_files,num_designs,budget,target_msa,target_sequence,target_template +design1_rfd,assets/test_data/nipah_rfdiffusion_design.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,3,2,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, diff --git a/assets/test_data/samplesheet_design_rfdiffusion_v3_s3.csv b/assets/test_data/samplesheet_design_rfdiffusion_v3_s3.csv new file mode 100644 index 0000000..b423422 --- /dev/null +++ b/assets/test_data/samplesheet_design_rfdiffusion_v3_s3.csv @@ -0,0 +1,2 @@ +sample_id,design_yaml,structure_files,num_designs,budget,target_msa,target_sequence,target_template +design1_rfd,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_rfdiffusion_design.yaml,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_virus_Glycoprotein_competition_structure.cif,3,2,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,s3://seqeralabs-showcase/alternates_for_summit/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, diff --git a/assets/test_data/samplesheet_design_rfdiffusion_v3_three_designs.csv b/assets/test_data/samplesheet_design_rfdiffusion_v3_three_designs.csv new file mode 100644 index 
0000000..842db24 --- /dev/null +++ b/assets/test_data/samplesheet_design_rfdiffusion_v3_three_designs.csv @@ -0,0 +1,4 @@ +sample_id,design_yaml,structure_files,num_designs,budget,target_msa,target_sequence,target_template +nipah_short_binder,assets/test_data/recent/nipah_rfdiffusion_v3_short_binder.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,5,3,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, +nipah_medium_binder,assets/test_data/recent/nipah_rfdiffusion_v3_medium_binder.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,5,3,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, +nipah_large_scaffold,assets/test_data/recent/nipah_rfdiffusion_v3_large_scaffold.yaml,assets/test_data/nipah_virus_Glycoprotein_competition_structure.cif,5,3,assets/test_data/nipah_glycoprotein_msa_Uniref30_2302.a3m,assets/test_data/nipah_virus_target_sequence_glycoproteinG.fasta, diff --git a/bin/prepare_boltz2_input.py b/bin/prepare_boltz2_input.py index e37e323..57f6edd 100755 --- a/bin/prepare_boltz2_input.py +++ b/bin/prepare_boltz2_input.py @@ -80,7 +80,7 @@ def main(): sequences_to_process = sequences print(f"Processing all {len(sequences_to_process)} sequences (treating first as designed)") else: - # Default behavior: Skip the first sequence (original from Boltzgen) + # Default behavior: Skip the first sequence (original from Complexa) sequences_to_process = sequences[1:] if len(sequences) > 1 else [] print(f"Processing {len(sequences_to_process)} new MPNN sequences (skipping original)") diff --git a/conf/base.config b/conf/base.config index bcb3471..6770679 100644 --- a/conf/base.config +++ b/conf/base.config @@ -46,9 +46,9 @@ process { memory = { 200.GB * task.attempt } } - // GPU-specific labels for Boltzgen + // GPU-specific labels for Complexa withLabel:process_high_gpu { - // 
Boltzgen requires GPU and substantial memory + // Complexa requires GPU and substantial memory cpus = { 8 * task.attempt } memory = { 64.GB * task.attempt } time = { 48.h * task.attempt } @@ -74,18 +74,26 @@ process { containerOptions = '--gpus all' } - withName:BOLTZGEN_RUN { + withName:PROTEINA_COMPLEXA_DESIGN { // Extended time for large design runs time = { 72.h * task.attempt } // Increase memory for large num_designs memory = { 40.GB * task.attempt } - // Request 1 GPU - Boltzgen uses single GPU efficiently + // Request 1 GPU - Complexa uses single GPU efficiently accelerator = 1 containerOptions = '--gpus all' } + withName:RFDIFFUSION_V3_RUN { + // RFdiffusion3 is substantially faster than v1 per design + time = { 24.h * task.attempt } + memory = { 40.GB * task.attempt } + accelerator = 1 + containerOptions = '--gpus all' + } + withName:PROTEINMPNN_OPTIMIZE { // ProteinMPNN can benefit significantly from GPU acceleration // The model is PyTorch-based and CUDA-compatible diff --git a/conf/test_design_nanobody.config b/conf/test_design_nanobody.config index af4ab63..9c565bc 100644 --- a/conf/test_design_nanobody.config +++ b/conf/test_design_nanobody.config @@ -2,7 +2,7 @@ ======================================================================================== Nextflow config file for testing DESIGN mode - Nanobody ======================================================================================== - Tests the design mode using pre-made Boltzgen YAML file for nanobody with 2VSM as target. + Tests the design mode using Proteina-Complexa pipeline config for nanobody with 2VSM as target. 
This profile tests: - Nanobody binder design diff --git a/conf/test_design_peptide.config b/conf/test_design_peptide.config index 6c80107..025b896 100644 --- a/conf/test_design_peptide.config +++ b/conf/test_design_peptide.config @@ -2,7 +2,7 @@ ======================================================================================== Nextflow config file for testing DESIGN mode - Peptide ======================================================================================== - Tests the design mode using pre-made Boltzgen YAML file for peptide with 2VSM as target. + Tests the design mode using Proteina-Complexa pipeline config for peptide with 2VSM as target. This profile tests: - Peptide binder design diff --git a/conf/test_design_protein.config b/conf/test_design_protein.config index e3ea4ae..43f40c3 100644 --- a/conf/test_design_protein.config +++ b/conf/test_design_protein.config @@ -2,7 +2,7 @@ ======================================================================================== Nextflow config file for testing DESIGN mode - Protein ======================================================================================== - Tests the design mode using pre-made Boltzgen YAML file for protein with 2VSM as target. + Tests the design mode using Proteina-Complexa pipeline config for protein with Nipah as target. This profile tests: - Protein binder design diff --git a/conf/test_design_proteina_complexa.config b/conf/test_design_proteina_complexa.config new file mode 100644 index 0000000..ad40e8a --- /dev/null +++ b/conf/test_design_proteina_complexa.config @@ -0,0 +1,47 @@ +/* +======================================================================================== + Nextflow config file for testing DESIGN mode - Proteina-Complexa +======================================================================================== + Tests the Proteina-Complexa backbone design path using the Nipah Glycoprotein + as target. 
+ + This profile tests: + - Binder backbone design with Proteina-Complexa (NVIDIA, ICLR 2026) + - Single design with num_designs=3 and budget=2 + + Note: requires a locally built container image (see GitHub repo for Dockerfile): + https://github.com/NVIDIA-Digital-Bio/Proteina-Complexa + + Use as follows: + nextflow run main.nf -profile test_design_proteina_complexa -stub-run (no GPU needed) + nextflow run main.nf -profile test_design_proteina_complexa,docker (GPU, Linux/amd64) + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Test profile - Design Mode (Proteina-Complexa)' + config_profile_description = 'Test dataset for Proteina-Complexa backbone design using Nipah Glycoprotein target' + + // Input data + input = "${projectDir}/assets/test_data/samplesheet_design_proteina_complexa.csv" + mode = 'design' + + // Select Proteina-Complexa + protein_design_tool = 'complexa' + + // Test-specific parameter overrides (all analysis modules are enabled by default) + mpnn_num_seq_per_target = 2 // Reduced from default 8 for faster testing + boltz2_predict_affinity = true // Predict binding affinity (log IC50) for binder-target complexes + boltz2_use_msa = false // Required when input YAML has no MSAs + boltz2_num_recycling = 1 // Reduced for faster testing + boltz2_num_diffusion = 1 // Reduced for faster testing + run_foldseek = false // Disabled - requires external database + + // Skip filter step — no reward model in smoke test means total_reward is NaN, + // and filter unconditionally drops NaN rows. Run generate + evaluate only. 
+ complexa_extra_args = '--steps generate evaluate' + + // Output + outdir = './results_test_design_proteina_complexa' +} diff --git a/conf/test_design_rfdiffusion_v3.config b/conf/test_design_rfdiffusion_v3.config new file mode 100644 index 0000000..74a7f3c --- /dev/null +++ b/conf/test_design_rfdiffusion_v3.config @@ -0,0 +1,39 @@ +/* +======================================================================================== + Nextflow config file for testing DESIGN mode - RFdiffusion v3 +======================================================================================== + Tests the RFdiffusion3 backbone design path using the Nipah Glycoprotein as target. + + This profile tests: + - Binder backbone design with RFdiffusion3 (all-atom diffusion) + - Single design with num_designs=3 and budget=2 + - Full downstream pipeline: ProteinMPNN → Boltz-2 → IPSAE / PRODIGY + + Use as follows: + nextflow run main.nf -profile test_design_rfdiffusion_v3 -stub-run (no GPU needed) + nextflow run main.nf -profile test_design_rfdiffusion_v3,docker (GPU, Linux/amd64) + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Test profile - Design Mode (RFdiffusion v3)' + config_profile_description = 'Test dataset for RFdiffusion3 backbone design using Nipah Glycoprotein target' + + // Input data + input = "${projectDir}/assets/test_data/samplesheet_design_rfdiffusion_v3.csv" + + // Select RFdiffusion v3 as design tool + protein_design_tool = 'rfdiffusion_v3' + + // Test-specific parameter overrides + mpnn_num_seq_per_target = 2 // Reduced from default 8 for faster testing + boltz2_predict_affinity = false // Affinity only supported for ligands, not protein-protein complexes + boltz2_use_msa = false // Skip MSA server for testing + boltz2_num_recycling = 1 // Reduced for faster testing + boltz2_num_diffusion = 1 // Reduced for faster testing + run_foldseek = false // Disabled — requires external database + + 
// Output + outdir = './results_test_design_rfdiffusion_v3' +} diff --git a/docs/README.md b/docs/README.md index a0b2d76..b75ea4b 100644 --- a/docs/README.md +++ b/docs/README.md @@ -148,7 +148,7 @@ flowchart TB ## Tips for Documentation Writers 1. **Keep diagrams updated**: When you change the workflow, update ALL related diagrams -2. **Use consistent terminology**: "Boltzgen" not "BoltzGen", "Boltz-2" not "Boltz2" +2. **Use consistent terminology**: "Complexa" not "BoltzGen", "Boltz-2" not "Boltz2" 3. **Add info boxes** for important notes: ```markdown !!! info "Title" diff --git a/docs/analysis/consolidation.md b/docs/analysis/consolidation.md index bdee9ff..6434060 100644 --- a/docs/analysis/consolidation.md +++ b/docs/analysis/consolidation.md @@ -5,7 +5,7 @@ The metrics consolidation module aggregates results from all analysis tools into a unified CSV report and markdown summary. This provides a comprehensive overview of design quality across all enabled analyses. !!! tip "Unified Analysis" - Consolidation automatically collects metrics from Boltzgen, ProteinMPNN, Protenix, ipSAE, PRODIGY, and Foldseek, making it easy to compare designs and identify top candidates. + Consolidation automatically collects metrics from Complexa, ProteinMPNN, Protenix, ipSAE, PRODIGY, and Foldseek, making it easy to compare designs and identify top candidates. ## When to Use Consolidation @@ -13,7 +13,7 @@ Enable metrics consolidation when you: - **Compare designs**: Need to evaluate multiple designs across different metrics - **Identify top candidates**: Want to quickly find the best designs based on multiple criteria -- **Track provenance**: Need to know which designs came from Boltzgen vs. Protenix +- **Track provenance**: Need to know which designs came from Complexa vs. 
Protenix - **Generate reports**: Want publication-ready summary tables ## Enabling Consolidation @@ -52,7 +52,7 @@ results/ └── consolidated_metrics/ ├── all_designs_metrics.csv # Complete metrics for all designs ├── top_designs_summary.md # Markdown report of top designs - └── metrics_by_source.csv # Metrics grouped by source (Boltzgen/Protenix) + └── metrics_by_source.csv # Metrics grouped by source (Complexa/Protenix) ``` ## Consolidated Metrics CSV @@ -64,9 +64,9 @@ The `all_designs_metrics.csv` file contains all available metrics in a single ta | Column | Description | Source | |--------|-------------|--------| | `design_id` | Unique design identifier | All | -| `parent_id` | Parent design ID (links Protenix to Boltzgen) | All | -| `source` | `boltzgen` or `protenix` | All | -| `structure_file` | Path to CIF structure | Boltzgen/Protenix | +| `parent_id` | Parent design ID (links Protenix to Complexa) | All | +| `source` | `complexa` or `protenix` | All | +| `structure_file` | Path to CIF structure | Complexa/Protenix | ### ProteinMPNN Metrics (if enabled) @@ -118,14 +118,14 @@ The `top_designs_summary.md` provides a markdown-formatted report highlighting t ## Overview - Total designs analyzed: 120 -- Boltzgen designs: 60 +- Complexa designs: 60 - Protenix designs: 60 ## Top 10 Designs by ipSAE Score | Rank | Design ID | Source | ipSAE | PRODIGY ΔG | Foldseek E-value | |------|-----------|--------|-------|------------|------------------| -| 1 | design1_0001 | boltzgen | 0.92 | -12.5 | 1.2e-8 | +| 1 | design1_0001 | complexa | 0.92 | -12.5 | 1.2e-8 | | 2 | design1_0002 | protenix | 0.89 | -11.8 | 3.4e-7 | ... ``` @@ -155,7 +155,7 @@ awk -F',' '$6 > 0.8' results/consolidated_metrics/all_designs_metrics.csv awk -F',' '$11 < 1e-5' results/consolidated_metrics/all_designs_metrics.csv ``` -### 3. Compare Boltzgen vs. Protenix +### 3. Compare Complexa vs. 
Protenix ```bash # View metrics by source @@ -209,7 +209,7 @@ Missing metrics will be indicated as `NA` in the CSV. The report tracks design provenance: -- **Boltzgen designs**: Original structures from Boltzgen design +- **Complexa designs**: Original structures from Complexa design - **Protenix designs**: Structures from ProteinMPNN sequences refolded by Protenix Parent-child relationships are maintained via `parent_id` column. @@ -238,7 +238,7 @@ Identify designs with: ### 2. Protein Engineering Compare: -- Boltzgen designs (original scaffold) +- Complexa designs (original scaffold) - Protenix designs (sequence-optimized) - Identify improvements from ProteinMPNN optimization diff --git a/docs/analysis/foldseek.md b/docs/analysis/foldseek.md index 06cb5af..02844c4 100644 --- a/docs/analysis/foldseek.md +++ b/docs/analysis/foldseek.md @@ -2,7 +2,7 @@ ## Overview -Foldseek is a structural similarity search tool that identifies proteins with similar 3D structures. The pipeline integrates Foldseek to search for structural homologs of both Boltzgen-designed and Protenix-refolded structures against large databases like AlphaFold or Swiss-Model. +Foldseek is a structural similarity search tool that identifies proteins with similar 3D structures. The pipeline integrates Foldseek to search for structural homologs of both Complexa-designed and Protenix-refolded structures against large databases like AlphaFold or Swiss-Model. !!! info "What is Foldseek?" Foldseek uses a novel 3Di structural alphabet combined with traditional amino acid sequences to enable ultra-fast structural similarity searches. It's significantly faster than traditional structural alignment tools like TM-align while maintaining high sensitivity. @@ -95,11 +95,11 @@ foldseek createdb /path/to/structures/ mydb The pipeline runs Foldseek on: -1. **Boltzgen budget designs** - All structures from `intermediate_designs_inverse_folded/` +1. 
**Complexa budget designs** - All structures from `intermediate_designs_inverse_folded/` 2. **Protenix refolded structures** - All structures predicted by Protenix (if enabled) Each structure is searched independently, allowing comparison of: -- Original Boltzgen designs +- Original Complexa designs - ProteinMPNN-optimized sequences refolded by Protenix ## Output Files @@ -110,7 +110,7 @@ For each design, Foldseek generates: results/ └── sample_id/ └── foldseek/ - ├── design_id_boltzgen/ + ├── design_id_complexa/ │ ├── aln.m8 # Alignment results in BLAST-like format │ ├── summary.tsv # Summary of top hits │ └── alignment.html # Detailed alignment visualization @@ -152,13 +152,13 @@ The `summary.tsv` file contains: ```bash # View top hits for a design -head results/sample1/foldseek/design1_boltzgen/summary.tsv +head results/sample1/foldseek/design1_complexa/summary.tsv # Count significant hits (E < 1e-5) -awk '$3 < 1e-5' results/sample1/foldseek/design1_boltzgen/summary.tsv | wc -l +awk '$3 < 1e-5' results/sample1/foldseek/design1_complexa/summary.tsv | wc -l # Extract top hit details -head -n 2 results/sample1/foldseek/design1_boltzgen/summary.tsv +head -n 2 results/sample1/foldseek/design1_complexa/summary.tsv ``` ## Integration with Other Analyses @@ -178,7 +178,7 @@ The consolidated report includes: - Best E-value for each design - Top matching protein name/description - Number of significant hits -- Comparison across Boltzgen and Protenix structures +- Comparison across Complexa and Protenix structures ## Performance Notes diff --git a/docs/analysis/prodigy.md b/docs/analysis/prodigy.md index 10def3f..a7ad6b1 100644 --- a/docs/analysis/prodigy.md +++ b/docs/analysis/prodigy.md @@ -2,7 +2,7 @@ ## :material-link-variant: Overview -The pipeline includes optional **PRODIGY** (PROtein binDIng enerGY prediction) analysis for evaluating the predicted binding affinity of protein-protein complexes generated by Boltzgen. 
+The pipeline includes optional **PRODIGY** (PROtein binDIng enerGY prediction) analysis for evaluating the predicted binding affinity of protein-protein complexes generated by Complexa. !!! info "What is PRODIGY?" PRODIGY is a fast, structure-based binding affinity predictor developed by the Bonvin lab (Utrecht University). It uses interface properties to estimate binding free energy (ΔG) and dissociation constant (Kd) from structural information. diff --git a/docs/analysis/proteinmpnn-boltz2.md b/docs/analysis/proteinmpnn-boltz2.md index 65859de..7e0b872 100644 --- a/docs/analysis/proteinmpnn-boltz2.md +++ b/docs/analysis/proteinmpnn-boltz2.md @@ -4,7 +4,7 @@ ProteinMPNN and Boltz-2 form a sequence optimization and validation workflow that improves designed structures through iterative refinement: -1. **ProteinMPNN** optimizes amino acid sequences for Boltzgen-designed structures +1. **ProteinMPNN** optimizes amino acid sequences for Complexa-designed structures 2. **Boltz-2** predicts structures for the optimized sequences to validate refolding This workflow helps identify sequences that maintain the desired structure while potentially improving stability, expression, or other properties. @@ -13,7 +13,7 @@ This workflow helps identify sequences that maintain the desired structure while ```mermaid flowchart TB - A[Boltzgen Budget Designs
CIF Files] --> B[Convert CIF to PDB
Per Design] + A[Complexa Budget Designs
CIF Files] --> B[Convert CIF to PDB
Per Design] B --> C[ProteinMPNN Optimize
Parallel per Budget Design
🎮 GPU Process] @@ -50,7 +50,7 @@ flowchart TB - **Parallelization**: Each budget design is processed independently by ProteinMPNN - **Sequence Generation**: ProteinMPNN creates 8 sequences per structure by default (`--mpnn_num_seq_per_target`) - **Boltz-2 Input**: Multi-FASTA is split into individual files, target FASTA is cleaned - - **Analysis Requirements**: All analysis modules require Boltz-2 outputs (not Boltzgen designs) + - **Analysis Requirements**: All analysis modules require Boltz-2 outputs (not Complexa designs) ## When to Use This Workflow @@ -199,15 +199,15 @@ Boltz-2 provides multiple confidence metrics in JSON format: - **pLDDT < 60**: Low confidence, may be disordered - **pTM > 0.8**: Good overall structure quality -## Comparison with Boltzgen +## Comparison with Complexa ### Structural Similarity -Compare Boltz-2 structures to original Boltzgen designs: +Compare Boltz-2 structures to original Complexa designs: ```bash # Use TM-align or similar tool -tmalign results/sample1/boltzgen/design_0001.cif \ +tmalign results/sample1/complexa/design_0001.cif \ results/sample1/boltz2/structures/mpnn_0001_model_0.cif ``` @@ -221,7 +221,7 @@ tmalign results/sample1/boltzgen/design_0001.cif \ When multiple analyses are enabled, compare metrics: ```bash -# Enable all analyses for both Boltzgen and Boltz-2 +# Enable all analyses for both Complexa and Boltz-2 nextflow run seqeralabs/nf-proteindesign \ --input samplesheet.csv \ --run_proteinmpnn \ @@ -236,7 +236,7 @@ nextflow run seqeralabs/nf-proteindesign \ ``` The consolidated metrics report will show: -- **Boltzgen designs**: Original structure metrics +- **Complexa designs**: Original structure metrics - **Boltz-2 designs**: Sequence-optimized structure metrics - Side-by-side comparison of quality scores @@ -266,7 +266,7 @@ The consolidated metrics report will show: --boltz2_diffusion_samples 1 ``` -**Analysis**: Compare Boltzgen vs. Boltz-2 structures using TM-align +**Analysis**: Compare Complexa vs. 
Boltz-2 structures using TM-align ### 3. Comprehensive Quality Assessment @@ -333,7 +333,7 @@ For a design with 20 budget structures and 8 sequences per structure: ### Structural Divergence -**Boltz-2 structures differ from Boltzgen**: +**Boltz-2 structures differ from Complexa**: - Check ProteinMPNN scores (should be < -1.5) - Verify target sequence extraction worked correctly - Consider if sequence optimization is too aggressive @@ -343,7 +343,7 @@ For a design with 20 budget structures and 8 sequences per structure: 1. **Start conservative**: Use default parameters first 2. **Validate small set**: Test on 2-3 designs before full run -3. **Compare metrics**: Use consolidation to compare Boltzgen vs. Boltz-2 +3. **Compare metrics**: Use consolidation to compare Complexa vs. Boltz-2 4. **Check structural similarity**: Always verify refolding maintains structure 5. **Consider tradeoffs**: Lower ProteinMPNN scores may not always mean better designs @@ -351,7 +351,7 @@ For a design with 20 budget structures and 8 sequences per structure: ### ipSAE -Automatically analyzes both Boltzgen and Boltz-2 structures when enabled: +Automatically analyzes both Complexa and Boltz-2 structures when enabled: ```bash --run_ipsae # Will process both sources @@ -367,7 +367,7 @@ Predicts binding affinity for both structure types: ### Foldseek -Searches for homologs of both Boltzgen and Boltz-2 designs: +Searches for homologs of both Complexa and Boltz-2 designs: ```bash --run_foldseek --foldseek_database /path/to/database_dir --foldseek_database_name afdb @@ -380,6 +380,6 @@ Searches for homologs of both Boltzgen and Boltz-2 designs: ## See Also -- [ipSAE Scoring](ipsae.md) - Works with both Boltzgen and Boltz-2 NPZ files +- [ipSAE Scoring](ipsae.md) - Works with both Complexa and Boltz-2 NPZ files - [PRODIGY Binding Affinity](prodigy.md) - Analyzes all predicted structures -- [Metrics Consolidation](consolidation.md) - Compare Boltzgen vs. 
Boltz-2 metrics +- [Metrics Consolidation](consolidation.md) - Compare Complexa vs. Boltz-2 metrics diff --git a/docs/architecture/design.md b/docs/architecture/design.md index 807f27b..bd58263 100644 --- a/docs/architecture/design.md +++ b/docs/architecture/design.md @@ -2,22 +2,22 @@ ## :material-sitemap: Overview -The nf-proteindesign pipeline processes design YAML specifications through Boltzgen with a comprehensive suite of optional analysis modules for sequence optimization, structure validation, and quality assessment. +The nf-proteindesign pipeline processes design YAML specifications through Complexa with a comprehensive suite of optional analysis modules for sequence optimization, structure validation, and quality assessment. ## :octicons-workflow-24: Complete Pipeline Flow ```mermaid flowchart TD - A[Input Samplesheet
with Design YAMLs] --> B{Check Boltzgen
Output Dir} + A[Input Samplesheet
with Design YAMLs] --> B{Check Complexa
Output Dir} - B -->|Null| C[Run Boltzgen Design
GPU Process] - B -->|Provided| D[Stage Precomputed
Boltzgen Results] + B -->|Null| C[Run Complexa Design
GPU Process] + B -->|Provided| D[Stage Precomputed
Complexa Results] C --> E[Budget Designs
CIF + NPZ Files] D --> E E --> F{ProteinMPNN
Enabled?} - F -->|No| Z1[Output Boltzgen
Designs Only] + F -->|No| Z1[Output Complexa
Designs Only] F -->|Yes| G[Convert CIF to PDB
Per Design] G --> H[ProteinMPNN Optimize
Parallel per Budget Design
GPU Process] @@ -73,17 +73,17 @@ flowchart TD !!! warning "Key Architecture Notes" - **Analysis modules** (IPSAE, PRODIGY, Foldseek) **only process Boltz-2 structures** - Both `--run_proteinmpnn` and `--run_boltz2_refold` must be enabled for analysis - - Boltzgen budget designs are NOT analyzed directly - only used for ProteinMPNN input - - Precomputed Boltzgen results can be reused via `boltzgen_output_dir` in samplesheet + - Complexa budget designs are NOT analyzed directly - only used for ProteinMPNN input + - Precomputed Complexa results can be reused via `complexa_output_dir` in samplesheet ## :material-puzzle: Key Components ### 1. Core Design Module -Boltzgen generates protein designs from YAML specifications: +Complexa generates protein designs from YAML specifications: ```groovy -process BOLTZGEN_RUN { +process COMPLEXA_RUN { label 'gpu' input: @@ -97,7 +97,7 @@ process BOLTZGEN_RUN { script: """ - boltzgen design \\ + complexa design \\ --design_file ${design_yaml} \\ --output_dir ${meta.id}_output \\ --num_designs ${meta.num_designs} \\ @@ -117,7 +117,7 @@ workflow { PROTEINMPNN_OPTIMIZE(pdb_files) if (params.run_boltz2_refold) { - EXTRACT_TARGET_SEQUENCES(boltzgen_structures) + EXTRACT_TARGET_SEQUENCES(complexa_structures) PROTENIX_REFOLD(mpnn_sequences, target_sequences) CONVERT_PROTENIX_TO_NPZ(boltz2_outputs) } @@ -133,7 +133,7 @@ Multiple analyses run simultaneously: workflow { // All analyses run in parallel on budget designs if (params.run_ipsae) { - IPSAE_CALCULATE(boltzgen_cifs, boltzgen_npz) + IPSAE_CALCULATE(complexa_cifs, complexa_npz) if (boltz2_enabled) { IPSAE_CALCULATE(boltz2_cifs, boltz2_npz) } @@ -159,7 +159,7 @@ workflow { | Process | Purpose | Label | Output | |---------|---------|-------|--------| -| `BOLTZGEN_RUN` | Design proteins with Boltzgen diffusion | `gpu` | CIF + NPZ (budget designs) | +| `COMPLEXA_RUN` | Design proteins with Complexa diffusion | `gpu` | CIF + NPZ (budget designs) | | `CONVERT_CIF_TO_PDB` | Convert CIF 
structures to PDB format | `cpu` | PDB files | | `PROTEINMPNN_OPTIMIZE` | Sequence optimization for designs | `gpu` | FASTA sequences + scores | | `PREPARE_BOLTZ2_SEQUENCES` | Split MPNN FASTA + process target | `cpu` | Individual FASTA files | @@ -199,7 +199,7 @@ workflows/ └── protein_design.nf # Main workflow orchestration modules/local/ -├── boltzgen_run.nf # Boltzgen design generation (GPU) +├── complexa_run.nf # Complexa design generation (GPU) ├── convert_cif_to_pdb.nf # CIF to PDB conversion ├── proteinmpnn_optimize.nf # ProteinMPNN sequence optimization (GPU) ├── prepare_boltz2_sequences.nf # Split MPNN FASTA + process target @@ -258,7 +258,7 @@ params { ### 2. Design Generation -- Parallel Boltzgen design runs +- Parallel Complexa design runs - Generate budget designs (CIF + NPZ) - GPU-accelerated diffusion sampling @@ -271,7 +271,7 @@ params { ### 4. Parallel Analysis (Optional) -- **ipSAE**: Interface quality scoring (Boltzgen + Boltz-2) +- **ipSAE**: Interface quality scoring (Complexa + Boltz-2) - **PRODIGY**: Binding affinity prediction (all structures) - **Foldseek**: Structural similarity search (all structures) diff --git a/docs/architecture/implementation.md b/docs/architecture/implementation.md index 85ca244..85bb387 100644 --- a/docs/architecture/implementation.md +++ b/docs/architecture/implementation.md @@ -4,6 +4,117 @@ This document provides technical details about the nf-proteindesign pipeline implementation, including design decisions, container specifications, and development guidelines. +--- + +## :material-clock-outline: Development Log (Seqera AI-Assisted) + +This pipeline was developed iteratively using **Seqera AI** as a proof of principle. Below is a chronological record of major implementation steps, what was done, and approximate time spent. 
+ +### Step 1 — Initial Pipeline Scaffolding (~15 min) + +Created the foundational Nextflow DSL2 pipeline structure from scratch: + +- `main.nf` entry point with parameter validation and input parsing +- `workflows/protein_design.nf` main workflow orchestrating all modules +- `nextflow.config` with profiles for Docker, Singularity, test data, and Seqera Platform +- `conf/base.config` with resource labels (CPU, GPU, memory tiers) +- `conf/modules.config` with per-process publishDir configuration +- Input samplesheet validation via nf-schema (`assets/schema_input_design.json`) + +### Step 2 — Core Design Module: BoltzGen → Complexa (~20 min) + +Implemented the generative protein design module: + +- **Originally** created `modules/local/boltzgen_run.nf` wrapping the BoltzGen model +- Authored `bin/collect_complexa_outputs.py` to gather CIF structures and confidence files from model output +- Wired samplesheet fields (`design_yaml`, `protocol`, `num_designs`, `budget`) through to the process +- GPU resource allocation with dynamic retry strategy + +### Step 3 — Downstream Analysis Modules (~30 min) + +Added the full suite of optional post-design analysis modules: + +| Module | File | Purpose | +|--------|------|---------| +| ProteinMPNN | `proteinmpnn_optimize.nf` | Sequence optimization for designed binders | +| Boltz-2 Refold | `boltz2_refold.nf`, `prepare_boltz2_sequences.nf` | Independent structure prediction to validate designs | +| IPSAE | `ipsae_calculate.nf` | Interface quality scoring from predicted aligned error (ipSAE) | +| PRODIGY | `prodigy_predict.nf` | Binding affinity prediction (ΔG) | +| Foldseek | `foldseek_search.nf` | Structural similarity search against PDB/AlphaFold DB | +| Consolidation | `consolidate_metrics.nf` | Unified CSV/JSON metrics report across all modules | + +Supporting utilities created: + +- `convert_cif_to_pdb.nf` — CIF → PDB conversion for tools requiring PDB input +- `collect_design_files.nf` — File collection and organization per sample +- 
`split_proteinmpnn_sequences.nf` — Split multi-sequence FASTA for parallel refolding +- `extract_target_sequences.nf` — Extract target chain sequences from design YAMLs +- `create_design_samplesheet.nf` — Dynamic samplesheet generation for batched designs + +### Step 4 — Test Data & Profiles (~10 min) + +Created three test profiles with real-world design specifications: + +- `test_design_protein` — Protein binder against Nipah virus glycoprotein (2VSM) +- `test_design_nanobody` — Nanobody design using built-in scaffolds +- `test_design_peptide` — Peptide binder design + +Test data in `assets/test_data/`: + +- Nipah virus glycoprotein structure (`.cif`), target sequence (`.fasta`), MSA (`.a3m`) +- Three design YAML specifications +- Three corresponding samplesheet CSVs + +### Step 5 — Documentation Site (~15 min) + +Generated a full MkDocs Material documentation site: + +- Architecture diagrams (Mermaid flowcharts) +- Getting started guides (installation, usage, quick reference) +- Per-module analysis documentation (ProteinMPNN, IPSAE, PRODIGY, Foldseek, consolidation) +- Auto-generated parameter reference from `nextflow_schema.json` +- `mkdocs.yml` configuration with navigation, search, and theme + +### Step 6 — Schema & Parameter Validation (~5 min) + +- `nextflow_schema.json` with grouped parameters, descriptions, defaults, and enums +- `bin/generate_parameter_docs.py` to auto-generate `docs/reference/parameters.md` from schema +- MkDocs pre-build hook (`docs/hooks/update_dynamic_content.py`) for automatic doc regeneration + +### Step 7 — BoltzGen → Proteina-Complexa Rename (~15 min) + +Renamed the generative design engine throughout the entire codebase after the tool was rebranded: + +- **New module**: `modules/local/proteina_complexa_design.nf` (rewired process name, container, and commands) +- **Deleted**: `modules/local/boltzgen_run.nf` +- **Workflow**: Updated `protein_design.nf` — process call `BOLTZGEN_RUN` → `PROTEINA_COMPLEXA_DESIGN`, all channel names 
+- **Config**: All `boltzgen` process labels, params, and container refs → `complexa` / `proteina_complexa` across `nextflow.config`, `conf/base.config`, `conf/modules.config` +- **Schema**: `nextflow_schema.json` parameter names, descriptions, output directory references +- **Container images**: `ghcr.io/flouwuenne/boltzgen:latest` → `cr.seqera.io/scidev/complexa:latest` +- **Docs**: All 15+ markdown files updated — terminology, URLs (`Proteina-AI/complexa`), navigation +- **Test data YAMLs**: Comment headers updated +- **Python scripts**: `bin/prepare_boltz2_input.py`, `assets/ipsae.py` — code comments +- **README.md**: Full rewrite of all references +- **Verification**: Zero `boltzgen` references remain project-wide + +### Summary + +| Phase | Description | Approx. Time | +|-------|-------------|---------------| +| 1 | Pipeline scaffolding | ~15 min | +| 2 | Core design module (BoltzGen) | ~20 min | +| 3 | Downstream analysis modules (6 tools) | ~30 min | +| 4 | Test data & profiles | ~10 min | +| 5 | Documentation site (MkDocs) | ~15 min | +| 6 | Schema & parameter validation | ~5 min | +| 7 | BoltzGen → Complexa rename | ~15 min | +| **Total** | **End-to-end pipeline + docs + rename** | **~1 hr 50 min** | + +!!! info "All development was performed interactively with Seqera AI" + Each step involved conversational iteration — describing intent, reviewing generated code, requesting adjustments, and validating outputs. The times above reflect wall-clock time including review and refinement, not just code generation. 
+ +--- + ## :material-docker: Container Strategy ### Base Images @@ -12,7 +123,7 @@ The pipeline uses specialized containers for each component: ```yaml Containers: - boltzgen: "ghcr.io/flouwuenne/boltzgen:latest" + complexa: "cr.seqera.io/scidev/complexa:latest" proteinmpnn: "ghcr.io/flouwuenne/proteinmpnn:latest" ipsae: "ghcr.io/flouwuenne/ipsae:latest" prodigy: "ghcr.io/flouwuenne/prodigy:latest" @@ -20,7 +131,7 @@ Containers: ### GPU Support -CUDA 11.8+ required for Boltzgen: +CUDA 11.8+ required for Complexa: ```dockerfile FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 @@ -33,34 +144,52 @@ RUN pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118 ``` nf-proteindesign/ -├── main.nf # Main entry point with mode detection -├── nextflow.config # Pipeline configuration +├── main.nf # Main entry point with parameter validation +├── nextflow.config # Pipeline configuration & profiles +├── nextflow_schema.json # Parameter schema (nf-schema v2) ├── conf/ -│ ├── base.config # Base resource settings -│ ├── modules.config # Module-specific configuration -│ ├── test.config # Test profile configuration -│ └── test_full.config # Full test profile +│ ├── base.config # Base resource settings (CPU/GPU labels) +│ ├── modules.config # Per-process publishDir configuration +│ ├── test_design_protein.config # Test profile: protein binder design +│ ├── test_design_nanobody.config # Test profile: nanobody design +│ └── test_design_peptide.config # Test profile: peptide design ├── workflows/ -│ └── protein_design.nf # Unified workflow handling all modes +│ └── protein_design.nf # Unified workflow orchestrating all modules ├── modules/local/ -│ ├── boltzgen_run.nf -│ ├── convert_cif_to_pdb.nf -│ ├── collect_design_files.nf -│ ├── proteinmpnn_optimize.nf -│ ├── ipsae_calculate.nf -│ ├── prodigy_predict.nf -│ └── consolidate_metrics.nf +│ ├── proteina_complexa_design.nf # Core: Complexa generative design (GPU) +│ ├── collect_design_files.nf # Collect & organize 
design outputs +│ ├── convert_cif_to_pdb.nf # CIF → PDB format conversion +│ ├── proteinmpnn_optimize.nf # ProteinMPNN sequence optimization +│ ├── split_proteinmpnn_sequences.nf # Split multi-seq FASTA for parallel refold +│ ├── extract_target_sequences.nf # Extract target sequences from design YAMLs +│ ├── create_design_samplesheet.nf # Dynamic samplesheet for batched designs +│ ├── prepare_boltz2_sequences.nf # Prepare inputs for Boltz-2 refolding +│ ├── boltz2_refold.nf # Boltz-2 structure prediction (GPU) +│ ├── ipsae_calculate.nf # IPSAE interface scoring +│ ├── prodigy_predict.nf # PRODIGY binding affinity prediction +│ ├── foldseek_search.nf # Foldseek structural similarity search +│ └── consolidate_metrics.nf # Unified metrics report generation ├── bin/ -│ ├── convert_cif_to_pdb.py # CIF to PDB conversion -│ ├── collect_boltzgen_outputs.py # Collect Boltzgen results -│ ├── consolidate_metrics.py # Generate unified metrics report -│ └── create_design_yaml.py # Generate design YAML files +│ ├── collect_complexa_outputs.py # Collect Complexa CIF/confidence outputs +│ ├── convert_cif_to_pdb.py # CIF to PDB conversion script +│ ├── prepare_boltz2_input.py # Prepare Boltz-2 input sequences +│ ├── consolidate_metrics.py # Generate unified CSV/JSON metrics +│ ├── boltz_predict_wrapper.py # Boltz-2 prediction wrapper +│ ├── generate_parameter_docs.py # Auto-generate parameter docs from schema +│ └── validate_docs.py # Documentation validation └── assets/ - ├── schema_input_design.json # Design mode samplesheet schema - └── test_data/ # Test datasets - ├── egfr_*_design.yaml # Pre-made design YAMLs - ├── 2VSM.cif # Test structure - └── samplesheet_design_*.csv # Test samplesheets + ├── schema_input_design.json # Samplesheet validation schema + ├── ipsae.py # IPSAE scoring utilities + └── test_data/ + ├── nipah_protein_design.yaml # Protein binder design spec + ├── nipah_nanobody_design.yaml # Nanobody design spec + ├── nipah_peptide_design.yaml # Peptide design spec 
+ ├── nipah_virus_Glycoprotein_*.cif # Target structure (2VSM) + ├── nipah_virus_target_sequence_*.fasta # Target sequence + ├── nipah_glycoprotein_msa_*.a3m # MSA for target + ├── samplesheet_design_protein.csv # Test samplesheet: protein + ├── samplesheet_design_nanobody.csv # Test samplesheet: nanobody + └── samplesheet_design_peptide.csv # Test samplesheet: peptide ``` ## :material-language-python: Helper Scripts @@ -307,14 +436,14 @@ cat .command.err ```groovy /** - * BOLTZGEN_RUN: Execute Boltzgen protein design + * PROTEINA_COMPLEXA_DESIGN: Execute Complexa generative protein design * - * @input tuple(sample_id, design_yaml) - * @output tuple(sample_id, designs_dir) - * @param params.n_samples Number of designs to generate - * @param params.timesteps Diffusion timesteps + * @input tuple(meta, design_yaml) + * @output tuple(meta, cif_files, confidence_files) + * @param params.num_designs Number of designs to generate + * @param params.budget Diffusion budget (sampling steps) */ -process BOLTZGEN_RUN { +process PROTEINA_COMPLEXA_DESIGN { // Process implementation } ``` diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md index 01c3c4c..7483cee 100644 --- a/docs/getting-started/installation.md +++ b/docs/getting-started/installation.md @@ -33,7 +33,7 @@ docker run hello-world ### GPU Requirements !!! warning "NVIDIA GPU Required" - Boltzgen requires an NVIDIA GPU with CUDA support. CPU execution is possible but extremely slow. + Both BoltzGen and Complexa require an NVIDIA GPU with CUDA support. CPU execution is possible but extremely slow. #### Setup NVIDIA Container Toolkit (Docker) @@ -162,17 +162,15 @@ nextflow run seqeralabs/nf-proteindesign \ ## :material-package: Container Images -The pipeline uses pre-built containers from GitHub Container Registry: - -- **Boltzgen**: `ghcr.io/flouwuenne/boltzgen:latest` -- **PRODIGY**: `ghcr.io/flouwuenne/prodigy:latest` +The pipeline uses pre-built containers for each process. 
Container URIs are defined in `nextflow.config` and `conf/base.config`. Nextflow automatically pulls the required containers at runtime. ### Pre-pull Containers +To pre-pull containers for offline or faster startup, check `nextflow.config` for the exact URIs: + ```bash -# Docker -docker pull ghcr.io/flouwuenne/boltzgen:latest -docker pull ghcr.io/flouwuenne/prodigy:latest +# Example — check nextflow.config for current URIs +# docker pull ``` ## :material-help-circle: Troubleshooting @@ -224,8 +222,7 @@ git pull origin main ### Update Containers ```bash -# Docker -docker pull ghcr.io/flouwuenne/boltzgen:latest +# Check nextflow.config for current container URIs and pull latest versions ``` ## :material-arrow-right: Next Steps diff --git a/docs/getting-started/quick-reference.md b/docs/getting-started/quick-reference.md index 2bb7466..f60546f 100644 --- a/docs/getting-started/quick-reference.md +++ b/docs/getting-started/quick-reference.md @@ -4,18 +4,18 @@ Fast reference for common commands and configurations. 
## :material-flash: One-Line Commands -### Basic Run +### Basic Run (BoltzGen, default) ```bash -# Simplest possible run (auto-detects mode) +# Simplest possible run — uses BoltzGen with all analysis modules enabled by default nextflow run seqeralabs/nf-proteindesign -profile docker --input samplesheet.csv --outdir results ``` -### With Analysis +### Run with Complexa ```bash -# Include affinity prediction and scoring -nextflow run seqeralabs/nf-proteindesign -profile docker --input samplesheet.csv --outdir results --run_prodigy --run_ipsae +# Use Proteina-Complexa backend +nextflow run seqeralabs/nf-proteindesign -profile docker --protein_design_tool complexa --input samplesheet_complexa.csv --complexa_ckpt_dir /path/to/ckpts --outdir results ``` ### Resume Failed Run @@ -25,23 +25,29 @@ nextflow run seqeralabs/nf-proteindesign -profile docker --input samplesheet.csv nextflow run seqeralabs/nf-proteindesign -profile docker --input samplesheet.csv --outdir results -resume ``` -## :material-file-table: Samplesheet Template +## :material-file-table: Samplesheet Templates + +### BoltzGen (default) ```csv -sample_id,design_yaml,structure_files,protocol,num_designs,budget -design1,designs/my_design.yaml,data/target.pdb,protein-anything,100,10 -design2,designs/another_design.yaml,data/target.cif,peptide-anything,100,10 +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +design1,designs/my_design.yaml,data/target.cif,protein-anything,3,2,,target.a3m,data/target.fasta, ``` -**Required columns:** -- `sample_id`: Unique identifier for the design -- `design_yaml`: Path to Boltzgen design YAML specification +**Required:** `sample_id`, `design_yaml`, `target_sequence` + +**Optional:** `structure_files`, `protocol`, `num_designs`, `budget`, `reuse`, `target_msa`, `target_template` + +### Complexa -**Optional columns:** -- `structure_files`: Additional structure files (comma-separated if multiple) -- `protocol`: 
Boltzgen protocol (protein-anything, peptide-anything, nanobody-anything, protein-small_molecule) -- `num_designs`: Number of intermediate designs (default: 100) -- `budget`: Number of final diversity-optimized designs (default: 10) +```csv +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +design1,target.cif,configs/pipeline.yaml,target.fasta,target.a3m, +``` + +**Required:** `sample_id`, `target_pdb`, `pipeline_config`, `target_sequence` + +**Optional:** `target_msa`, `target_template` ## :material-cog: Common Parameters @@ -51,32 +57,42 @@ design2,designs/another_design.yaml,data/target.cif,peptide-anything,100,10 |-----------|-------------|---------|---------| | `--input` | Samplesheet path | Required | `samplesheet.csv` | | `--outdir` | Output directory | `./results` | `results/` | -| `--protocol` | Boltzgen protocol | `protein-anything` | `peptide-anything` | +| `--protein_design_tool` | Design backend | `boltzgen` | `complexa` | -### Design Parameters +### BoltzGen Parameters | Parameter | Description | Default | Example | |-----------|-------------|---------|---------| -| `--num_designs` | Intermediate designs | 100 | `50` | -| `--budget` | Final optimized designs | 10 | `20` | -| `--cache_dir` | Model cache directory | `null` | `/cache` | +| `--cache_dir` | BoltzGen model cache | `null` | `/cache` | -### Analysis Parameters +### Complexa Parameters | Parameter | Description | Default | Example | |-----------|-------------|---------|---------| -| `--run_proteinmpnn` | Enable ProteinMPNN | false | `true` | -| `--run_ipsae` | Enable IPSAE scoring | false | `true` | -| `--run_prodigy` | Enable PRODIGY | false | `true` | -| `--run_consolidation` | Consolidated report | false | `true` | +| `--complexa_ckpt_dir` | Checkpoint directory | `null` | `/path/to/ckpts` | +| `--complexa_search_algorithm` | Search algorithm | `best-of-n` | `beam-search` | +| `--complexa_nsteps` | Diffusion steps | `400` | `200` | +| `--complexa_batch_size` | 
Batch size | `16` | `8` | + +### Analysis Parameters (all enabled by default) + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--run_proteinmpnn` | ProteinMPNN optimization | `true` | +| `--run_boltz2_refold` | Boltz-2 structure prediction | `true` | +| `--run_ipsae` | IPSAE interface scoring | `true` | +| `--run_prodigy` | PRODIGY affinity prediction | `true` | +| `--run_foldseek` | Foldseek structural search | `true` | +| `--run_consolidation` | Consolidated report | `true` | ### Resource Parameters | Parameter | Description | Default | Example | |-----------|-------------|---------|---------| -| `--max_cpus` | Maximum CPUs | 16 | `32` | -| `--max_memory` | Maximum memory | 128.GB | `256.GB` | -| `--max_time` | Maximum time | 240.h | `72.h` | +| `--max_cpus` | Maximum CPUs | `16` | `32` | +| `--max_memory` | Maximum memory | `128.GB` | `256.GB` | +| `--max_time` | Maximum time | `240.h` | `72.h` | +| `--max_gpus` | Maximum GPUs per process | `1` | `2` | ## :material-play: Command Recipes @@ -88,7 +104,7 @@ nextflow run seqeralabs/nf-proteindesign \ --outdir test_results ``` -### Standard Run +### Standard Run (BoltzGen) ```bash nextflow run seqeralabs/nf-proteindesign \ @@ -97,37 +113,25 @@ nextflow run seqeralabs/nf-proteindesign \ --outdir results ``` -### With Analysis Tools - -```bash -nextflow run seqeralabs/nf-proteindesign \ - -profile docker \ - --input samplesheet.csv \ - --outdir results \ - --run_proteinmpnn \ - --run_ipsae \ - --run_prodigy \ - --run_consolidation -``` - -### Peptide Design +### Standard Run (Complexa) ```bash nextflow run seqeralabs/nf-proteindesign \ -profile docker \ - --input peptide_samplesheet.csv \ - --protocol peptide-anything \ - --outdir peptide_designs + --protein_design_tool complexa \ + --input samplesheet_complexa.csv \ + --complexa_ckpt_dir /path/to/checkpoints \ + --outdir results ``` -### Nanobody Design +### Design Only (skip analysis) ```bash nextflow run 
seqeralabs/nf-proteindesign \ -profile docker \ - --input nanobody_samplesheet.csv \ - --protocol nanobody-anything \ - --outdir nanobody_designs + --input samplesheet.csv \ + --outdir results \ + --run_proteinmpnn false ``` ## :material-folder-open: Output Structure @@ -135,20 +139,15 @@ nextflow run seqeralabs/nf-proteindesign \ ``` results/ ├── {sample}/ -│ ├── boltzgen/ -│ │ ├── final_ranked_designs/ ← Your final designs -│ │ │ ├── design_1.cif -│ │ │ ├── design_2.cif -│ │ │ └── ... -│ │ ├── intermediate_designs/ -│ │ └── boltzgen.log -│ ├── prodigy/ -│ │ ├── design_1_prodigy_summary.csv -│ │ └── ... -│ └── ipsae/ -│ └── design_1_ipsae_scores.csv +│ ├── boltzgen/ or complexa/ ← Design outputs (depends on tool) +│ ├── proteinmpnn/ ← Optimized sequences +│ ├── boltz2/ ← Refolded structures +│ ├── ipsae/ ← Interface scores +│ ├── prodigy/ ← Affinity predictions +│ ├── foldseek/ ← Structural search results +│ └── consolidated/ ← Combined metrics report └── pipeline_info/ - ├── execution_report.html ← Check this first + ├── execution_report.html ← Check this first ├── execution_timeline.html └── execution_trace.txt ``` @@ -189,15 +188,14 @@ nextflow run seqeralabs/nf-proteindesign \ ### Container Pull Issues ```bash -# Pre-pull containers -docker pull ghcr.io/flouwuenne/boltzgen:latest -docker pull ghcr.io/flouwuenne/prodigy:latest +# Pre-pull containers — check nextflow.config for exact URIs +# for each process in conf/base.config ``` -## :material-file-code: Design YAML Template +## :material-file-code: Design YAML Template (BoltzGen) ```yaml title="design_template.yaml" -# Boltzgen design specification +# BoltzGen design specification entities: # Designed protein entity - protein: @@ -212,7 +210,7 @@ entities: id: A # Target chain to bind ``` -See the [Boltzgen documentation](https://github.com/generatebio/boltz#-design-specification) for complete YAML specification details. 
+See the [BoltzGen documentation](https://github.com/jostorge/boltz) and [Complexa documentation](https://github.com/Proteina-AI/complexa) for complete specification details. ## :material-chart-line: Performance Estimates diff --git a/docs/getting-started/usage.md b/docs/getting-started/usage.md index c1245ca..99bbea0 100644 --- a/docs/getting-started/usage.md +++ b/docs/getting-started/usage.md @@ -7,6 +7,7 @@ This guide covers the fundamental concepts for using nf-proteindesign. ```bash nextflow run seqeralabs/nf-proteindesign \ -profile \ + --protein_design_tool \ --input \ --outdir \ [OPTIONS] @@ -14,46 +15,59 @@ nextflow run seqeralabs/nf-proteindesign \ ### Components -- **`-profile`**: Execution profile (`docker`, `test`) -- **`--input`**: Path to samplesheet CSV file +- **`-profile`**: Execution profile (`docker`, `singularity`, `test_design_protein`, etc.) +- **`--protein_design_tool`**: Design backend — `boltzgen` (default) or `complexa` +- **`--input`**: Path to samplesheet CSV file (format depends on design tool) - **`--outdir`**: Output directory path - **`[OPTIONS]`**: Additional pipeline parameters ## :material-file-table: Samplesheet Format -The pipeline uses a CSV samplesheet to specify design jobs. Each row represents a separate design run. +The samplesheet format depends on the chosen design tool. Each row represents a separate design run. 
-### Required Columns +### BoltzGen Samplesheet (default) -| Column | Required | Description | -|--------|----------|-------------| -| `sample` | ✅ | Unique sample identifier | -| `design_yaml` | ✅ | Path to design YAML file (see below) | +| Column | Required | Type | Description | +|--------|----------|------|-------------| +| `sample_id` | ✅ | string | Unique sample identifier | +| `design_yaml` | ✅ | string | Path to BoltzGen design YAML file | +| `target_sequence` | ✅ | string | Path to target protein FASTA sequence | +| `structure_files` | | string | Comma-separated structure files (PDB/CIF) | +| `protocol` | | string | Design protocol (`protein-anything`, `peptide-anything`, `nanobody-anything`, `protein-small_molecule`) | +| `num_designs` | | integer | Number of intermediate designs | +| `budget` | | integer | Number of final diversity-optimized designs | +| `reuse` | | boolean | Reuse previous results | +| `target_msa` | | string | Pre-computed MSA for target (`.a3m`) | +| `target_template` | | string | Template structure for Boltz-2 (CIF) | -### Optional Columns - -Additional columns can override default parameters per sample: +```csv +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +protein_binder,designs/egfr_binder.yaml,egfr.cif,protein-anything,3,2,,target.a3m,egfr.fasta, +nanobody_design,designs/spike_nanobody.yaml,spike.cif,nanobody-anything,3,2,,,spike.fasta, +``` -| Column | Type | Description | -|--------|------|-------------| -| `num_designs` | Integer | Number of designs to generate (overrides `--num_designs`) | -| `budget` | Integer | Number of final designs to keep (overrides `--budget`) | +### Complexa Samplesheet -### Example Samplesheet +| Column | Required | Type | Description | +|--------|----------|------|-------------| +| `sample_id` | ✅ | string | Unique sample identifier | +| `target_pdb` | ✅ | string | Target structure (PDB or CIF) | +| `pipeline_config` | ✅ | string 
| Complexa Hydra pipeline config YAML | +| `target_sequence` | ✅ | string | Target sequence FASTA | +| `target_msa` | | string | Pre-computed MSA for target | +| `target_template` | | string | Template structure for Boltz-2 | ```csv -sample,design_yaml,num_designs,budget -protein_binder,designs/egfr_binder.yaml,10000,50 -nanobody_design,designs/spike_nanobody.yaml,5000,20 -peptide_binder,designs/il6_peptide.yaml,3000,10 +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +protein_binder,target.cif,configs/pipeline.yaml,target.fasta,target.a3m, ``` -## :material-file-document: Design YAML Format +## :material-file-document: Design YAML Format (BoltzGen) -For Design mode, create YAML files following this structure: +For BoltzGen, create design YAML files following this structure: ```yaml -# Boltzgen design specification +# BoltzGen design specification entities: # Designed protein entity - protein: @@ -73,24 +87,36 @@ entities: ### Essential Parameters ```bash ---input # Path to samplesheet CSV (required) ---outdir # Output directory (required) ---mode # Explicit mode: design, target, binder (optional, auto-detected) +--input # Path to samplesheet CSV (required) +--outdir # Output directory (required) +--protein_design_tool # Design backend: 'boltzgen' (default) or 'complexa' +``` + +### BoltzGen Parameters + +```bash +--cache_dir # Cache directory for BoltzGen model weights ``` -### Design Parameters +### Complexa Parameters ```bash ---n_samples # Number of designs per specification (default: 10) ---timesteps # Diffusion timesteps (default: 100) ---save_traj # Save trajectory files (default: false) +--complexa_ckpt_dir # Complexa checkpoint directory (required for Complexa) +--complexa_search_algorithm # Search algorithm (default: 'best-of-n') +--complexa_nsteps # Diffusion sampling steps (default: 400) +--complexa_replicas # Replicas for best-of-n (default: 2) +--complexa_batch_size # Batch size (default: 16) ``` -### Analysis Options 
+### Analysis Options (all enabled by default) ```bash ---run_ipsae # Enable IPSAE scoring (default: false) ---run_prodigy # Enable PRODIGY affinity prediction (default: false) +--run_proteinmpnn # ProteinMPNN sequence optimization (default: true) +--run_boltz2_refold # Boltz-2 structure prediction (default: true) +--run_ipsae # IPSAE interface scoring (default: true) +--run_prodigy # PRODIGY affinity prediction (default: true) +--run_foldseek # Foldseek structural search (default: true) +--run_consolidation # Consolidated metrics report (default: true) ``` ### Resource Management @@ -98,7 +124,8 @@ entities: ```bash --max_cpus # Maximum CPUs (default: 16) --max_memory # Maximum memory (default: 128.GB) ---max_time # Maximum time per job (default: 48.h) +--max_time # Maximum time per job (default: 240.h) +--max_gpus # Maximum GPUs per process (default: 1) ``` ## :material-folder-open: Output Structure @@ -108,113 +135,99 @@ The pipeline creates an organized output directory: ``` results/ ├── {sample_id}/ -│ ├── boltzgen/ -│ │ ├── final_ranked_designs/ # Your final designs ⭐ -│ │ │ ├── design_1.cif -│ │ │ ├── design_2.cif -│ │ │ └── ... -│ │ ├── intermediate_designs/ # Intermediate outputs -│ │ │ └── ... -│ │ └── boltzgen.log # Execution log -│ │ -│ ├── prodigy/ # If --run_prodigy enabled -│ │ ├── design_1_prodigy_results.txt -│ │ ├── design_1_prodigy_summary.csv +│ ├── boltzgen/ or complexa/ # Design outputs (depends on tool) +│ │ ├── design_*.pdb / *.cif # Generated structures │ │ └── ... 
│ │ -│ └── ipsae/ # If --run_ipsae enabled -│ └── design_1_ipsae_scores.csv +│ ├── proteinmpnn/ # If --run_proteinmpnn enabled +│ │ ├── sequences/ # Optimized FASTA sequences +│ │ └── scores/ # ProteinMPNN scores +│ │ +│ ├── boltz2/ # If --run_boltz2_refold enabled +│ │ ├── structures/ # Predicted CIF structures +│ │ ├── confidence/ # Confidence scores (JSON) +│ │ └── npz/ # PAE NPZ files +│ │ +│ ├── ipsae/ # If --run_ipsae enabled +│ │ └── *_ipsae_scores.txt +│ │ +│ ├── prodigy/ # If --run_prodigy enabled +│ │ └── *_prodigy_results.txt +│ │ +│ ├── foldseek/ # If --run_foldseek enabled +│ │ └── *_foldseek_summary.tsv +│ │ +│ └── consolidated/ # If --run_consolidation enabled +│ ├── consolidated_metrics.csv +│ └── consolidated_report.html │ └── pipeline_info/ - ├── execution_report.html # Execution summary - ├── execution_timeline.html # Timeline visualization - └── execution_trace.txt # Detailed trace + ├── execution_report.html # Execution summary + ├── execution_timeline.html # Timeline visualization + └── execution_trace.txt # Detailed trace ``` ### Key Output Files !!! tip "Most Important Files" - - **Final designs**: `boltzgen/{sample}/final_ranked_designs/*.cif` + - **Design structures**: `{sample}/boltzgen/*.pdb` or `{sample}/complexa/*.pdb` + - **Consolidated report**: `{sample}/consolidated/consolidated_metrics.csv` - **Execution report**: `pipeline_info/execution_report.html` - - **Affinity predictions**: `prodigy/{sample}/design_*_summary.csv` ## :material-play-circle: Example Workflows -### Example 1: Basic Protein Design +### Example 1: Basic Protein Design (BoltzGen) ```bash # 1. Create design YAML cat > protein_design.yaml << EOF -name: egfr_binder -target: - structure: data/egfr.pdb - residues: [10, 11, 12, 45, 46] -designed: - chain_type: protein - length: [60, 100] -global: - n_samples: 20 +entities: + - protein: + id: C + sequence: 60..100 + - file: + path: egfr.cif + include: + - chain: + id: A EOF # 2. 
Create samplesheet cat > samples.csv << EOF -sample,design_yaml -egfr_binder,protein_design.yaml +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +egfr_binder,protein_design.yaml,egfr.cif,protein-anything,3,2,,,egfr_sequence.fasta, EOF -# 3. Run pipeline +# 3. Run pipeline (all analysis modules enabled by default) nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input samples.csv \ --outdir results ``` -### Example 2: Multiple Designs with Analysis +### Example 2: Multiple Designs with Complexa ```bash -# 1. Create design YAMLs for different targets -cat > egfr_design.yaml << EOF -name: egfr_binder -target: - structure: data/egfr.pdb - residues: [10, 11, 12, 45, 46] -designed: - chain_type: protein - length: [60, 120] -EOF - -cat > spike_design.yaml << EOF -name: spike_nanobody -target: - structure: data/spike.cif - residues: [417, 484, 501] -designed: - chain_type: nanobody - length: [110, 130] +# 1. Create samplesheet for Complexa +cat > samples_complexa.csv << EOF +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +egfr_binder,data/egfr.cif,configs/egfr_pipeline.yaml,data/egfr.fasta,, +spike_nanobody,data/spike.cif,configs/spike_pipeline.yaml,data/spike.fasta,, EOF -# 2. Create samplesheet -cat > samples.csv << EOF -sample,design_yaml,num_designs,budget -egfr_binder,egfr_design.yaml,10000,50 -spike_nanobody,spike_design.yaml,5000,20 -EOF - -# 3. Run with analysis modules +# 2. 
Run with Complexa backend nextflow run seqeralabs/nf-proteindesign \ -profile docker \ - --input samples.csv \ - --outdir results \ - --run_proteinmpnn \ - --run_protenix_refold \ - --run_prodigy \ - --run_consolidation + --protein_design_tool complexa \ + --input samples_complexa.csv \ + --complexa_ckpt_dir /path/to/checkpoints \ + --outdir results ``` ### Example 3: Test Run ```bash -# Use built-in test profile +# Use built-in test profile (BoltzGen by default) nextflow run seqeralabs/nf-proteindesign \ -profile test_design_protein,docker ``` diff --git a/docs/index.md b/docs/index.md index 7ea4888..3f5df39 100644 --- a/docs/index.md +++ b/docs/index.md @@ -12,13 +12,30 @@ ## :material-test-tube: Overview -**nf-proteindesign** is a Nextflow pipeline for high-throughput protein design using [Boltzgen](https://github.com/HannesStark/boltzgen), an all-atom generative diffusion model. Design proteins, peptides, and nanobodies to bind various biomolecular targets with a comprehensive suite of downstream analysis modules. +**nf-proteindesign** is a Nextflow pipeline for high-throughput protein design supporting two generative backends: + +- **[BoltzGen](https://github.com/jostorge/boltz)** (default) — a flow-matching generative model that uses design YAML specifications +- **[Proteina-Complexa](https://github.com/Proteina-AI/complexa)** — an all-atom generative diffusion model that uses pipeline config YAMLs + +Design proteins, peptides, and nanobodies to bind various biomolecular targets with a comprehensive suite of downstream analysis modules. Both design backends converge into the same shared downstream pipeline. !!! tip "Modular Analysis Pipeline" - The pipeline combines Boltzgen design with optional sequence optimization (ProteinMPNN + Boltz-2), quality assessment (ipSAE, PRODIGY, Foldseek), and unified reporting (metrics consolidation). 
-## :material-package-variant-closed: Analysis Modules + The pipeline combines generative protein design (BoltzGen or Complexa) with sequence optimization (ProteinMPNN + Boltz-2), quality assessment (ipSAE, PRODIGY, Foldseek), and unified reporting (metrics consolidation). +## :material-package-variant-closed: Design Backends & Analysis Modules
+
+🎯 BoltzGen
+Flow-matching generative model for protein design (default backend).
+--protein_design_tool boltzgen
+
+🏗️ Proteina-Complexa
+All-atom generative diffusion model using pipeline config YAMLs.
+--protein_design_tool complexa
+
 🧬 ProteinMPNN
 Sequence optimization for designed structures with configurable sampling temperature.
@@ -33,7 +50,7 @@
 📊 ipSAE
-Interface quality scoring for Boltzgen and Boltz-2 structures.
+Interface quality scoring for Boltz-2 refolded structures.
 --run_ipsae
@@ -58,11 +75,12 @@ ## :material-lightning-bolt: Key Features +- **:material-swap-horizontal: Dual Design Backends**: Choose BoltzGen (default) or Proteina-Complexa - **:material-parallel: Parallel Processing**: Run multiple design specifications simultaneously - **:material-file-code: YAML-Based Design**: Complete control with custom design specifications - **:material-chart-line: Comprehensive Analysis**: Six optional analysis modules for quality assessment - **:material-refresh: Sequence Optimization**: ProteinMPNN + Boltz-2 validation workflow -- **:material-docker: Container Support**: Full Docker compatibility +- **:material-docker: Container Support**: Full Docker and Singularity compatibility - **:material-gpu: GPU Acceleration**: Optimized for NVIDIA GPU execution - **:material-file-tree: Organized Outputs**: Structured results with unified reporting @@ -70,14 +88,15 @@ ```mermaid graph TB - A[Samplesheet
Design YAMLs] --> B{Boltzgen
Precomputed?} - B -->|No| C[Run Boltzgen Design] - B -->|Yes| D[Use Precomputed] - C --> E[Budget Designs
CIF + NPZ] + A[Samplesheet] --> B{Design Tool?} + B -->|boltzgen| C[BoltzGen Design
Flow-matching inference] + B -->|complexa| D[Complexa Design
Diffusion generation] + + C --> E[Budget Designs
PDB Files] D --> E E --> F{ProteinMPNN
Enabled?} - F -->|No| Z[Boltzgen Outputs Only] + F -->|No| Z[Design Outputs Only] F -->|Yes| G[Sequence Optimization
Parallel per Design] G --> H{Boltz-2
Enabled?} @@ -102,7 +121,8 @@ graph TB Z --> R Y --> R - style C fill:#9C27B0,stroke:#9C27B0,color:#fff + style C fill:#1565C0,stroke:#1565C0,color:#fff + style D fill:#9C27B0,stroke:#9C27B0,color:#fff style G fill:#8E24AA,stroke:#8E24AA,color:#fff style J fill:#7B1FA2,stroke:#7B1FA2,color:#fff style Q fill:#6A1B9A,stroke:#6A1B9A,color:#fff @@ -112,7 +132,7 @@ graph TB ``` !!! info "Analysis Requirements" - **IPSAE, PRODIGY, and Foldseek** require **both** `--run_proteinmpnn` and `--run_boltz2_refold` to be enabled. These modules analyze only the Boltz-2 refolded structures, not the original Boltzgen designs. + **IPSAE, PRODIGY, and Foldseek** require **both** `--run_proteinmpnn` and `--run_boltz2_refold` to be enabled. These modules analyze only the Boltz-2 refolded structures, not the original design outputs. ## :material-rocket-launch: Quick Start @@ -122,11 +142,19 @@ Get started with nf-proteindesign in minutes: # 1. Install Nextflow (>=23.04.0) curl -s https://get.nextflow.io | bash -# 2. Run the pipeline +# 2a. Run with BoltzGen (default) nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input samplesheet.csv \ --outdir results + +# 2b. Or run with Complexa +nextflow run seqeralabs/nf-proteindesign \ + -profile docker \ + --protein_design_tool complexa \ + --input samplesheet_complexa.csv \ + --complexa_ckpt_dir /path/to/checkpoints \ + --outdir results ``` !!! example "Need Help?" @@ -134,7 +162,7 @@ nextflow run seqeralabs/nf-proteindesign \ ## :material-chemical-weapon: What Can You Design? 
-The pipeline leverages Boltzgen's capabilities to design: +The pipeline leverages BoltzGen or Complexa to design: - **Proteins**: Full-length protein binders targeting specific interfaces - **Peptides**: Short peptide sequences for tight binding diff --git a/docs/quick-start.md b/docs/quick-start.md index 403caad..ae26dc8 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -24,7 +24,7 @@ Before running the pipeline, ensure you have: ### Hardware Requirements !!! warning "GPU Required" - Boltzgen requires an NVIDIA GPU with CUDA support for reasonable execution times. CPU execution is possible but extremely slow. + Both BoltzGen and Complexa require an NVIDIA GPU with CUDA support for reasonable execution times. CPU execution is possible but extremely slow. - **GPU**: NVIDIA GPU with CUDA 11.8+ support - **Memory**: 16GB RAM minimum, 32GB+ recommended @@ -32,47 +32,74 @@ Before running the pipeline, ensure you have: ## :material-file-document: Prepare Input Files -### 1. Design YAML Files (Design Mode) +The pipeline supports two design backends, each with its own samplesheet format. Choose the one that matches your `--protein_design_tool` setting. -Create a design specification file following Boltzgen format: +### Option A: BoltzGen (default) + +#### 1. Design YAML Files + +Create a design specification file following BoltzGen format: ```yaml title="my_design.yaml" -name: antibody_design_example -target: - structure: data/target_protein.pdb - residues: [10, 11, 12, 45, 46, 47, 89] # Binding site residues -designed: - chain_type: protein - length: [50, 80] # Range of acceptable lengths -global: - n_samples: 10 - save_traj: true +entities: + - protein: + id: C + sequence: 80..120 # Length range for designed protein + - file: + path: target_protein.cif + include: + - chain: + id: A # Target chain to bind ``` -### 2. Create Samplesheet - -Create a CSV file with your design specifications: +#### 2. 
Create Samplesheet ```csv title="samplesheet.csv" -sample_id,design_yaml,num_designs,budget -design1,/path/to/design1.yaml,10000,20 -design2,/path/to/design2.yaml,5000,10 -design3,/path/to/design3.yaml,15000,30 +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +design1,designs/my_design.yaml,target.cif,protein-anything,3,2,,target.a3m,target.fasta, ``` **Column descriptions:** + +- `sample_id`: Unique identifier for the design +- `design_yaml`: Path to the BoltzGen design YAML file +- `target_sequence`: Path to target protein FASTA sequence (for Boltz-2 refolding) +- `structure_files` (optional): Comma-separated structure files (PDB/CIF) +- `protocol` (optional): Design protocol — `protein-anything`, `peptide-anything`, `nanobody-anything`, `protein-small_molecule` +- `num_designs` (optional): Number of intermediate designs to generate +- `budget` (optional): Number of final diversity-optimized designs to keep +- `target_msa` (optional): Pre-computed MSA for target (`.a3m`) +- `target_template` (optional): Template structure for Boltz-2 (CIF) + +### Option B: Proteina-Complexa + +#### 1. Pipeline Config YAML + +Create a Complexa Hydra pipeline config YAML (see Complexa documentation for format details). + +#### 2. 
Create Samplesheet + +```csv title="samplesheet_complexa.csv" +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +design1,target.cif,configs/pipeline.yaml,target.fasta,target.a3m, +``` + +**Column descriptions:** + - `sample_id`: Unique identifier for the design -- `design_yaml`: Path to the design YAML file -- `num_designs`: Number of intermediate designs to generate (10,000-60,000 for production) -- `budget`: Number of final diversity-optimized designs to keep +- `target_pdb`: Target structure (PDB or CIF) +- `pipeline_config`: Path to Complexa Hydra pipeline config YAML +- `target_sequence`: Target protein FASTA sequence (for Boltz-2 refolding) +- `target_msa` (optional): Pre-computed MSA for target (`.a3m`) +- `target_template` (optional): Template structure for Boltz-2 (PDB/CIF) ## :material-run: Running the Pipeline ### Basic Execution -Choose the appropriate profile for your system: +Choose the appropriate profile and design tool for your system: -=== "Docker" +=== "BoltzGen (default)" ```bash nextflow run seqeralabs/nf-proteindesign \ -profile docker \ @@ -80,47 +107,50 @@ Choose the appropriate profile for your system: --outdir results ``` -=== "Local (with Docker)" +=== "Complexa" ```bash nextflow run seqeralabs/nf-proteindesign \ - -profile docker,local \ - --input samplesheet.csv \ + -profile docker \ + --protein_design_tool complexa \ + --input samplesheet_complexa.csv \ + --complexa_ckpt_dir /path/to/checkpoints \ --outdir results ``` ### With Analysis Modules -Enable optional analysis steps for comprehensive quality assessment: +All analysis modules are enabled by default. 
To run the full pipeline with a Foldseek database: ```bash nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input samplesheet.csv \ --outdir results \ - --run_proteinmpnn \ - --run_protenix_refold \ - --run_ipsae \ - --run_prodigy \ - --run_foldseek \ --foldseek_database /path/to/database_dir \ - --foldseek_database_name afdb \ - --run_consolidation + --foldseek_database_name afdb ``` -## :material-tune: Common Options - -### Design Parameters - -Customize design generation: +To disable specific modules, set them to `false`: ```bash nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input samplesheet.csv \ --outdir results \ - --num_designs 10000 \ # Number of intermediate designs - --budget 20 \ # Number of final designs to keep - --protocol protein-anything # Design protocol + --run_foldseek false \ + --run_prodigy false +``` + +## :material-tune: Common Options + +### Design Tool Selection + +```bash +# BoltzGen (default) +--protein_design_tool boltzgen + +# Proteina-Complexa +--protein_design_tool complexa ``` ### Resource Allocation @@ -143,53 +173,46 @@ After successful execution, your `results/` directory will contain: ``` results/ -├── boltzgen/ # Main Boltzgen outputs -│ ├── sample1/ -│ │ ├── final_ranked_designs/ -│ │ ├── intermediate_designs/ -│ │ └── boltzgen.log -│ └── sample2/ -│ └── ... 
-├── ipsae/ # IPSAE scores (if enabled) -│ └── sample1_ipsae_scores.csv -├── prodigy/ # PRODIGY predictions (if enabled) -│ └── sample1_prodigy_predictions.csv -├── pipeline_info/ # Execution reports -│ ├── execution_report.html -│ ├── execution_timeline.html -│ └── execution_trace.txt -└── multiqc/ # MultiQC report (if enabled) - └── multiqc_report.html +├── {sample_id}/ +│ ├── boltzgen/ or complexa/ # Design outputs (depends on tool) +│ ├── proteinmpnn/ # Optimized sequences +│ ├── boltz2/ # Refolded structures +│ ├── ipsae/ # Interface scores +│ ├── prodigy/ # Affinity predictions +│ ├── foldseek/ # Structural search results +│ └── consolidated/ # Combined metrics report +└── pipeline_info/ # Execution reports + ├── execution_report.html + ├── execution_timeline.html + └── execution_trace.txt ``` !!! tip "Final Designs" - The most important files are in `boltzgen/*/final_ranked_designs/` - these contain your ranked protein designs ready for experimental validation. + The most important files are the design output PDB/CIF files and the consolidated metrics report in `consolidated/`, which ranks all designs by combined quality scores. ## :material-test-tube: Example Workflow -Here's a complete example from start to finish: +Here's a complete example from start to finish using BoltzGen (default): ### 1. Prepare Design File -```yaml title="antibody_target.yaml" -name: covid_spike_binder -target: - structure: data/spike_protein.pdb - residues: [417, 484, 501] # RBD key residues -designed: - chain_type: nanobody - length: [110, 130] -global: - n_samples: 20 - timesteps: 100 - save_traj: true +```yaml title="spike_binder_design.yaml" +entities: + - protein: + id: C + sequence: 110..130 # Nanobody length range + - file: + path: spike_protein.cif + include: + - chain: + id: A # Target chain ``` ### 2. 
Create Samplesheet ```csv title="spike_designs.csv" -sample,design_yaml -spike_nb1,designs/antibody_target.yaml +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +spike_nb1,designs/spike_binder_design.yaml,data/spike_protein.cif,nanobody-anything,3,2,,,data/spike_sequence.fasta, ``` ### 3. Run Pipeline @@ -198,8 +221,7 @@ spike_nb1,designs/antibody_target.yaml nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input spike_designs.csv \ - --outdir covid_binders \ - --run_prodigy true + --outdir covid_binders ``` ### 4. Check Results @@ -208,11 +230,11 @@ nextflow run seqeralabs/nf-proteindesign \ # View execution report open covid_binders/pipeline_info/execution_report.html -# Check final designs -ls covid_binders/boltzgen/spike_nb1/final_ranked_designs/ +# Check design outputs +ls covid_binders/spike_nb1/ -# View binding predictions -cat covid_binders/prodigy/spike_nb1_prodigy_predictions.csv +# View consolidated metrics +cat covid_binders/spike_nb1/consolidated_metrics.csv ``` ## :material-help-circle: Troubleshooting @@ -231,18 +253,16 @@ cat covid_binders/prodigy/spike_nb1_prodigy_predictions.csv !!! bug "Out of Memory" **Error**: `CUDA out of memory` - **Solution**: Reduce batch size or number of parallel samples: + **Solution**: Reduce batch size or number of designs: ```bash - --n_samples 10 # Reduce from default + # For Complexa, reduce batch size + --complexa_batch_size 8 ``` !!! bug "Container Pull Failed" **Error**: `Error pulling container image` - **Solution**: Pre-pull containers or use cached versions: - ```bash - docker pull ghcr.io/flouwuenne/boltzgen:latest - ``` + **Solution**: Pre-pull containers or use cached versions. Check `nextflow.config` for the exact container URIs used by each process. ## :material-arrow-right: Next Steps @@ -250,7 +270,7 @@ Now that you're up and running: 1. 
**Learn Basic Usage**: Check the [Usage Guide](getting-started/usage.md) for detailed instructions 2. **Optimize Parameters**: See the [Parameters Reference](reference/parameters.md) -3. **Enable Analysis Modules**: Learn about [ProteinMPNN/Protenix](analysis/proteinmpnn-boltz2.md), [PRODIGY](analysis/prodigy.md), and [ipSAE](analysis/ipsae.md) +3. **Explore Analysis Modules**: Learn about [ProteinMPNN/Boltz-2](analysis/proteinmpnn-boltz2.md), [PRODIGY](analysis/prodigy.md), [ipSAE](analysis/ipsae.md), and [Foldseek](analysis/foldseek.md) 4. **Advanced Usage**: Explore [Architecture](architecture/design.md) details --- diff --git a/docs/reference/examples.md b/docs/reference/examples.md index f5aef8d..12d1886 100644 --- a/docs/reference/examples.md +++ b/docs/reference/examples.md @@ -2,14 +2,14 @@ Complete examples for common protein design use cases. -## :material-dna: Example 1: Protein Binder Design +## :material-dna: Example 1: Protein Binder Design (BoltzGen) -Design a protein to bind EGFR using a pre-made design specification. +Design a protein to bind EGFR using BoltzGen (default design tool). 
### Create Design YAML ```yaml title="egfr_protein_design.yaml" -# Boltzgen design specification for protein binder +# BoltzGen design specification for protein binder entities: # Designed protein entity - protein: @@ -27,21 +27,19 @@ entities: ### Create Samplesheet ```csv title="egfr_samplesheet.csv" -sample_id,design_yaml,structure_files,protocol,num_designs,budget -egfr_binder,egfr_protein_design.yaml,egfr_structure.cif,protein-anything,100,10 +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +egfr_binder,egfr_protein_design.yaml,egfr_structure.cif,protein-anything,3,2,,egfr.a3m,egfr_sequence.fasta, ``` ### Run Pipeline +All analysis modules are enabled by default: + ```bash nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input egfr_samplesheet.csv \ - --outdir egfr_designs \ - --run_proteinmpnn \ - --run_ipsae \ - --run_prodigy \ - --run_consolidation + --outdir egfr_designs ``` ### Analyze Results @@ -50,21 +48,21 @@ nextflow run seqeralabs/nf-proteindesign \ import pandas as pd # Load consolidated metrics -results = pd.read_csv('egfr_designs/egfr_binder/consolidated_metrics.csv') +results = pd.read_csv('egfr_designs/egfr_binder/consolidated/consolidated_metrics.csv') # Find top 5 candidates by binding affinity top5 = results.nsmallest(5, 'prodigy_delta_g') print(top5[['design_file', 'prodigy_delta_g', 'prodigy_kd', 'ipsae_score']]) ``` -## :material-flask: Example 2: Peptide Binder Design +## :material-flask: Example 2: Peptide Binder Design (BoltzGen) Design peptide binders for a target protein. 
### Create Design YAML ```yaml title="peptide_design.yaml" -# Boltzgen design specification for peptide binder +# BoltzGen design specification for peptide binder entities: # Designed peptide entity - protein: @@ -82,8 +80,8 @@ entities: ### Create Samplesheet ```csv title="peptide_samplesheet.csv" -sample_id,design_yaml,structure_files,protocol,num_designs,budget -peptide_binder,peptide_design.yaml,target.cif,peptide-anything,100,10 +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +peptide_binder,peptide_design.yaml,target.cif,peptide-anything,3,2,,,target.fasta, ``` ### Run Pipeline @@ -92,18 +90,17 @@ peptide_binder,peptide_design.yaml,target.cif,peptide-anything,100,10 nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input peptide_samplesheet.csv \ - --protocol peptide-anything \ --outdir peptide_designs ``` -## :material-antibody: Example 3: Nanobody Design +## :material-antibody: Example 3: Nanobody Design (BoltzGen) Design nanobodies to bind a specific target. 
### Create Design YAML ```yaml title="nanobody_design.yaml" -# Boltzgen design specification for nanobody +# BoltzGen design specification for nanobody entities: # Designed nanobody entity - protein: @@ -121,8 +118,8 @@ entities: ### Create Samplesheet ```csv title="nanobody_samplesheet.csv" -sample_id,design_yaml,structure_files,protocol,num_designs,budget -nanobody_binder,nanobody_design.yaml,antigen.cif,nanobody-anything,100,10 +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +nanobody_binder,nanobody_design.yaml,antigen.cif,nanobody-anything,3,2,,,antigen.fasta, ``` ### Run Pipeline @@ -131,11 +128,32 @@ nanobody_binder,nanobody_design.yaml,antigen.cif,nanobody-anything,100,10 nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input nanobody_samplesheet.csv \ - --protocol nanobody-anything \ --outdir nanobody_designs ``` -## :material-test-tube: Example 4: Multiple Targets +## :material-molecule: Example 4: Protein Binder Design (Complexa) + +Design a protein binder using the Proteina-Complexa backend. + +### Create Samplesheet + +```csv title="complexa_samplesheet.csv" +sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template +egfr_binder,data/egfr.cif,configs/egfr_pipeline.yaml,data/egfr.fasta,data/egfr.a3m, +``` + +### Run Pipeline + +```bash +nextflow run seqeralabs/nf-proteindesign \ + -profile docker \ + --protein_design_tool complexa \ + --input complexa_samplesheet.csv \ + --complexa_ckpt_dir /path/to/checkpoints \ + --outdir complexa_designs +``` + +## :material-test-tube: Example 5: Multiple Targets Design binders for multiple targets in a single run. 
@@ -168,9 +186,9 @@ entities: ### Create Samplesheet ```csv title="multi_target_samplesheet.csv" -sample_id,design_yaml,structure_files,protocol,num_designs,budget -target1_binder,target1_design.yaml,target1.cif,protein-anything,100,10 -target2_binder,target2_design.yaml,target2.cif,protein-anything,100,10 +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template +target1_binder,target1_design.yaml,target1.cif,protein-anything,3,2,,,target1.fasta, +target2_binder,target2_design.yaml,target2.cif,protein-anything,3,2,,,target2.fasta, ``` ### Run Pipeline @@ -179,53 +197,35 @@ target2_binder,target2_design.yaml,target2.cif,protein-anything,100,10 nextflow run seqeralabs/nf-proteindesign \ -profile docker \ --input multi_target_samplesheet.csv \ - --outdir multi_designs \ - --run_consolidation + --outdir multi_designs ``` -## :material-chart-bar: Example 5: Full Analysis Pipeline - -Complete workflow with all analysis tools enabled. - -### Create Samplesheet +## :material-chart-bar: Example 6: Selective Analysis Modules -```csv title="full_analysis_samplesheet.csv" -sample_id,design_yaml,structure_files,protocol,num_designs,budget -full_analysis,my_design.yaml,target.cif,protein-anything,200,20 -``` +By default all analysis modules are enabled. 
To disable specific modules: ### Run Pipeline ```bash nextflow run seqeralabs/nf-proteindesign \ -profile docker \ - --input full_analysis_samplesheet.csv \ - --outdir full_analysis_results \ - --num_designs 200 \ - --budget 20 \ - --run_proteinmpnn \ - --mpnn_num_seq_per_target 10 \ - --run_ipsae \ - --ipsae_pae_cutoff 8 \ - --run_prodigy \ - --run_consolidation \ - --report_top_n 20 + --input samplesheet.csv \ + --outdir selective_results \ + --run_foldseek false \ + --run_prodigy false ``` ### Review Consolidated Report ```bash # View consolidated metrics -cat full_analysis_results/full_analysis/consolidated_metrics.csv | column -t -s, - -# Count successful designs -grep "SUCCESS" full_analysis_results/full_analysis/consolidated_metrics.csv | wc -l +cat selective_results/{sample_id}/consolidated/consolidated_metrics.csv | column -t -s, # Find designs with best affinity -sort -t',' -k3,3n full_analysis_results/full_analysis/consolidated_metrics.csv | head -10 +sort -t',' -k3,3n selective_results/{sample_id}/consolidated/consolidated_metrics.csv | head -10 ``` -## :material-cog: Example 6: Using Test Profiles +## :material-cog: Example 7: Using Test Profiles The pipeline includes built-in test profiles for quick validation. @@ -253,7 +253,7 @@ nextflow run seqeralabs/nf-proteindesign \ --outdir test_nanobody_results ``` -## :material-cloud: Example 7: Seqera Platform Deployment +## :material-cloud: Example 8: Seqera Platform Deployment Run the pipeline on Seqera Platform with GPU compute. @@ -266,8 +266,7 @@ Run the pipeline on Seqera Platform with GPU compute. 5. Configure parameters: - `input`: Path to samplesheet in Data Link - `outdir`: Output Data Link path - - `num_designs`: 100 - - `budget`: 10 + - `protein_design_tool`: `boltzgen` or `complexa` 6. Select GPU-enabled compute environment 7. 
Click "Launch" @@ -289,28 +288,31 @@ After pipeline completion, you'll find: ``` results/ └── {sample_id}/ - ├── boltzgen/ - │ ├── final_ranked_designs/ - │ │ ├── design_1.cif # Top ranked design - │ │ ├── design_2.cif - │ │ └── ... - │ └── intermediate_designs/ - │ └── *.cif - ├── proteinmpnn/ # If --run_proteinmpnn enabled - │ ├── design_1_sequences.fa - │ └── ... - ├── ipsae/ # If --run_ipsae enabled - │ ├── design_1_ipsae_scores.csv - │ └── ... - ├── prodigy/ # If --run_prodigy enabled - │ ├── design_1_prodigy_summary.csv + ├── boltzgen/ or complexa/ # Design structures (depends on tool) + │ ├── design_1.pdb + │ ├── design_2.pdb │ └── ... - └── consolidated_metrics.csv # If --run_consolidation enabled + ├── proteinmpnn/ # Optimized sequences & scores + │ ├── sequences/ + │ └── scores/ + ├── boltz2/ # Refolded structures + │ ├── structures/ + │ ├── confidence/ + │ └── npz/ + ├── ipsae/ # Interface scores + │ └── *_ipsae_scores.txt + ├── prodigy/ # Affinity predictions + │ └── *_prodigy_results.txt + ├── foldseek/ # Structural search results + │ └── *_foldseek_summary.tsv + └── consolidated/ # Combined metrics report + ├── consolidated_metrics.csv + └── consolidated_report.html ``` ## :material-lightbulb: Tips and Best Practices -### Design YAML Tips +### Design YAML Tips (BoltzGen) - **Length ranges**: Use `80..120` syntax for flexible design lengths - **Multiple chains**: Specify multiple target chains for complex interfaces @@ -318,24 +320,26 @@ results/ ### Parameter Tuning -- **Quick tests**: Start with `num_designs=10, budget=5` for fast validation -- **Production runs**: Use `num_designs=100-200, budget=10-20` for quality results -- **Large campaigns**: Increase to `num_designs=200+, budget=50+` for diversity +- **Quick tests**: Use small `num_designs` and `budget` values for fast validation +- **Production runs**: Increase `num_designs` and `budget` for diversity and quality +- **Complexa tuning**: Adjust `--complexa_nsteps`, `--complexa_batch_size`, and 
`--complexa_replicas` ### Resource Optimization - **GPU memory**: Ensure 16GB+ VRAM for standard runs -- **Caching**: Use `--cache_dir` to avoid re-downloading model weights +- **Caching**: Use `--cache_dir` (BoltzGen) or `--complexa_ckpt_dir` (Complexa) for model weights - **Resume**: Always use `-resume` flag to recover from interruptions ### Analysis Workflow -1. Run Boltzgen to generate initial designs -2. Enable ProteinMPNN for sequence optimization -3. Use IPSAE for interface quality scoring -4. Apply PRODIGY for binding affinity prediction -5. Review consolidated metrics for top candidates -6. Select top designs for experimental validation +1. Run BoltzGen or Complexa to generate initial designs +2. ProteinMPNN optimizes sequences for generated structures +3. Boltz-2 predicts structures from optimized sequences (refolding validation) +4. ipSAE scores interface quality +5. PRODIGY predicts binding affinity +6. Foldseek searches for structural similarity +7. Consolidation combines all metrics into a ranked report +8. Select top designs for experimental validation ## :material-help: Troubleshooting @@ -350,8 +354,8 @@ docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi **Out of memory:** ```bash -# Reduce num_designs or use smaller length ranges in design YAML -nextflow run ... --num_designs 50 +# For Complexa, reduce batch size +nextflow run ... --complexa_batch_size 8 ``` **Pipeline fails:** diff --git a/docs/reference/outputs.md b/docs/reference/outputs.md index 3d00c11..4c64a5a 100644 --- a/docs/reference/outputs.md +++ b/docs/reference/outputs.md @@ -7,118 +7,110 @@ Complete guide to understanding pipeline outputs. 
``` results/ ├── {sample_id}/ -│ ├── boltzgen/ -│ ├── prodigy/ -│ └── ipsae/ +│ ├── boltzgen/ or complexa/ # Design outputs (depends on tool) +│ ├── proteinmpnn/ # Sequence optimization +│ ├── boltz2/ # Structure prediction (refolding) +│ ├── ipsae/ # Interface scoring +│ ├── prodigy/ # Affinity prediction +│ ├── foldseek/ # Structural search +│ └── consolidated/ # Combined metrics report └── pipeline_info/ ``` -## :material-dna: Boltzgen Outputs +## :material-dna: Design Tool Outputs -### Final Ranked Designs +### BoltzGen Outputs (default) ``` -results/{sample}/boltzgen/final_ranked_designs/ -├── design_1.cif -├── design_2.cif +results/{sample}/boltzgen/ +├── design_1.pdb +├── design_2.pdb └── ... ``` -**Description**: Top-ranked protein designs in CIF format. +**Description**: Generated protein designs in PDB format from BoltzGen. -**Contents**: Complete atomic coordinates for designed complexes. - -### Intermediate Designs +### Complexa Outputs ``` -results/{sample}/boltzgen/intermediate_designs/ -├── generation_*.cif -├── inverse_fold_*.cif -└── refold_*.cif +results/{sample}/complexa/ +├── design_1.pdb +├── design_2.pdb +└── ... ``` -**Description**: Intermediate structures from design pipeline. +**Description**: Generated protein designs from Proteina-Complexa. -### Log Files +## :material-protein: ProteinMPNN Outputs ``` -results/{sample}/boltzgen/boltzgen.log +results/{sample}/proteinmpnn/ +├── sequences/ # Optimized FASTA sequences +│ ├── design_1.fa +│ └── ... +└── scores/ # ProteinMPNN scores + ├── design_1_scores.txt + └── ... ``` -**Description**: Complete execution log with design metrics. +**Description**: Sequence optimization results — optimized amino acid sequences for each generated structure. 
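The per-sample layout documented in this hunk can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the pipeline: the helper name `missing_outputs` is hypothetical, and the directory names are copied from the tree above. Modules switched off with flags such as `--run_foldseek false` presumably produce no directory, so a reported gap is not necessarily a failure.

```python
from pathlib import Path

# Subdirectory names copied from the documented results layout.
# The design directory is either boltzgen/ or complexa/, depending on
# the value passed to --protein_design_tool.
ANALYSIS_DIRS = ["proteinmpnn", "boltz2", "ipsae", "prodigy", "foldseek", "consolidated"]
DESIGN_DIRS = ["boltzgen", "complexa"]


def missing_outputs(sample_dir):
    """Return the expected subdirectories that are absent from sample_dir.

    Hypothetical helper for illustration; it only checks that directories
    exist, not that they contain valid results.
    """
    sample_dir = Path(sample_dir)
    missing = [name for name in ANALYSIS_DIRS if not (sample_dir / name).is_dir()]
    # Exactly one design tool runs, so either design directory is acceptable.
    if not any((sample_dir / name).is_dir() for name in DESIGN_DIRS):
        missing.insert(0, "boltzgen/ or complexa/")
    return missing
```

Calling `missing_outputs("results/sample1")` returns an empty list when every documented directory is present.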
-## :material-chart-box: PRODIGY Outputs - -### Summary CSV +## :material-molecule: Boltz-2 Outputs ``` -results/{sample}/prodigy/design_1_prodigy_summary.csv +results/{sample}/boltz2/ +├── structures/ # Predicted CIF structures +│ ├── design_1.cif +│ └── ... +├── confidence/ # Confidence scores (JSON) +│ ├── design_1_confidence.json +│ └── ... +└── npz/ # PAE NPZ files + ├── design_1.npz + └── ... ``` -**Format**: -```csv -sample_id,design_file,delta_g,kd,temperature,bsa,ics,charged_residues,charged_percentage,apolar_residues,apolar_percentage -sample1,design_1.cif,-11.2,5.4e-09,25.0,1543.21,89,15,16.85,48,53.93 -``` +**Description**: Structure prediction (refolding) results from Boltz-2, validating whether optimized sequences fold into the intended structure. -### Full Results +## :material-chart-box: PRODIGY Outputs ``` -results/{sample}/prodigy/design_1_prodigy_results.txt +results/{sample}/prodigy/ +├── design_1_prodigy_results.txt +└── ... ``` -**Description**: Complete PRODIGY output with all metrics. +**Description**: Complete PRODIGY output with binding affinity predictions including ΔG and Kd values. ## :material-chart-line: ipSAE Outputs ``` -results/{sample}/ipsae/design_1_ipsae_scores.csv -``` - -**Format**: -```csv -design_id,interface_area,shape_comp,contact_density,h_bonds,salt_bridges,hydrophobic -design_1,1543.2,0.68,0.045,12,3,28 -``` - -``` -├── pockets/ -│ ├── {sample}_pocket1.pdb -│ ├── {sample}_pocket2.pdb -│ └── ... -├── visualizations/ -│ └── {sample}_pockets.pml -└── {sample}_predictions.csv -``` - -### Predictions CSV - -**Format**: -```csv -rank,score,size,center_x,center_y,center_z,residues -1,0.85,42,12.3,45.6,78.9,"10,11,12,45,46,47" -2,0.72,38,23.4,56.7,89.0,"20,21,22,65,66,67" +results/{sample}/ipsae/ +├── design_1_ipsae_scores.txt +└── ... ``` -## :material-file-multiple: Target Mode Outputs +**Description**: Interface scoring results measuring quality of the protein-protein interface. 
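Because the Boltz-2, PRODIGY, and ipSAE outputs above share a per-design file stem, the files belonging to one design can be gathered in a single lookup. The sketch below assumes the naming shown in the trees (`design_1.cif`, `design_1_confidence.json`, `design_1.npz`, `design_1_prodigy_results.txt`, `design_1_ipsae_scores.txt`) holds for real runs; the function name `design_files` is hypothetical, and the contents of the files are deliberately not parsed, since their formats are not specified here.

```python
from pathlib import Path


def design_files(sample_dir, design):
    """Map each documented per-design output to its path, or None if absent.

    Hypothetical helper: file names follow the examples in the docs and
    may differ in an actual run.
    """
    sample_dir = Path(sample_dir)
    candidates = {
        "boltz2_structure": sample_dir / "boltz2" / "structures" / f"{design}.cif",
        "boltz2_confidence": sample_dir / "boltz2" / "confidence" / f"{design}_confidence.json",
        "boltz2_pae": sample_dir / "boltz2" / "npz" / f"{design}.npz",
        "prodigy": sample_dir / "prodigy" / f"{design}_prodigy_results.txt",
        "ipsae": sample_dir / "ipsae" / f"{design}_ipsae_scores.txt",
    }
    # Keep the key even when the file is missing, so gaps are easy to spot.
    return {key: (path if path.is_file() else None) for key, path in candidates.items()}
```

For example, `design_files("results/sample1", "design_1")` yields a dictionary whose `None` entries point to modules that were disabled or did not finish for that design.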
-### Generated Designs +## :material-magnify: Foldseek Outputs ``` -results/{sample}/design_variants/ -├── {sample}_len60_v1.yaml -├── {sample}_len60_v2.yaml -├── {sample}_len80_v1.yaml +results/{sample}/foldseek/ +├── design_1_foldseek_summary.tsv └── ... ``` -### Design Info +**Description**: Structural similarity search results against known protein structures. + +## :material-table: Consolidated Outputs ``` -results/{sample}/design_info.txt +results/{sample}/consolidated/ +├── consolidated_metrics.csv # Combined metrics for all designs +└── consolidated_report.html # Interactive HTML report ``` -**Contents**: Summary of generated design variants. +**Description**: Combined report merging all analysis module scores into a single ranked table for easy comparison. ## :material-information: Pipeline Info @@ -151,7 +143,7 @@ results/pipeline_info/execution_trace.txt **Format**: TSV file with detailed process information: ``` task_id hash native_id name status exit submit duration realtime %cpu rss vmem -1 ab/cd12 12345 BOLTZGEN_RUN COMPLETED 0 2024-01-15 10:00:00 1h 23m 1h 21m 95.2% 16.2 GB 24.1 GB +1 ab/cd12 12345 COMPLEXA_RUN COMPLETED 0 2024-01-15 10:00:00 1h 23m 1h 21m 95.2% 16.2 GB 24.1 GB ``` ## :material-file-download: File Formats @@ -203,7 +195,7 @@ All outputs for each sample grouped together: ``` results/ ├── sample1/ -│ ├── boltzgen/ +│ ├── complexa/ │ ├── prodigy/ │ └── ipsae/ └── sample2/ @@ -216,7 +208,7 @@ Within each sample, organized by analysis: ``` {sample}/ -├── boltzgen/ # Primary designs +├── complexa/ # Primary designs ├── prodigy/ # Binding affinity └── ipsae/ # Interface scoring ``` @@ -226,17 +218,16 @@ Within each sample, organized by analysis: ### Command Line ```bash -# List all final designs -find results/ -name "*.cif" -path "*/final_ranked_designs/*" +# List all design structures +find results/ -name "*.pdb" -path "*/boltzgen/*" +# or for Complexa: +find results/ -name "*.pdb" -path "*/complexa/*" -# Get best PRODIGY scores -cat 
results/*/prodigy/*_summary.csv | \ - grep -v "sample_id" | \ - sort -t',' -k3,3n | \ - head -5 +# View consolidated metrics +cat results/*/consolidated/consolidated_metrics.csv | column -t -s, # Count successful designs -find results/ -name "design_*.cif" | wc -l +find results/ -name "design_*.pdb" | wc -l ``` ### Python @@ -245,14 +236,14 @@ find results/ -name "design_*.cif" | wc -l from pathlib import Path import pandas as pd -# Load all PRODIGY results +# Load consolidated metrics results = [] -for csv in Path('results').rglob('*_prodigy_summary.csv'): +for csv in Path('results').rglob('consolidated_metrics.csv'): df = pd.read_csv(csv) results.append(df) combined = pd.concat(results) -print(combined.nsmallest(10, 'delta_g')) +print(combined.nsmallest(10, 'prodigy_delta_g')) ``` ### R @@ -260,18 +251,18 @@ print(combined.nsmallest(10, 'delta_g')) ```r library(tidyverse) -# Load PRODIGY results +# Load consolidated metrics results <- list.files( "results", - pattern = "*_summary.csv", + pattern = "consolidated_metrics.csv", recursive = TRUE, full.names = TRUE ) %>% map_df(read_csv) -# Analyze +# Analyze — find top designs by binding affinity results %>% - arrange(delta_g) %>% + arrange(prodigy_delta_g) %>% head(10) ``` @@ -291,10 +282,10 @@ grep "FAILED" results/pipeline_info/execution_trace.txt ### Validate Outputs ```bash -# Ensure all expected files exist +# Ensure all expected output directories exist for sample in sample1 sample2; do - if [ ! -d "results/${sample}/boltzgen/final_ranked_designs" ]; then - echo "Missing designs for ${sample}" + if [ ! 
-d "results/${sample}/consolidated" ]; then + echo "Missing consolidated results for ${sample}" fi done ``` @@ -306,7 +297,7 @@ done ```bash # Create archive of final results tar -czf protein_designs.tar.gz \ - results/*/boltzgen/final_ranked_designs/ \ + results/*/complexa/final_ranked_designs/ \ results/*/prodigy/*_summary.csv \ results/pipeline_info/execution_report.html ``` diff --git a/docs/reference/parameters.md b/docs/reference/parameters.md index 514783b..3478615 100644 --- a/docs/reference/parameters.md +++ b/docs/reference/parameters.md @@ -8,7 +8,7 @@ **Pipeline**: nf-proteindesign pipeline parameters -Nextflow pipeline for Boltzgen protein design using pre-made design YAML specifications +Nextflow pipeline for computational protein design using BoltzGen (default) or Proteina-Complexa as the design backend, with a full analysis suite including ProteinMPNN, Boltz-2, ipSAE, PRODIGY, Foldseek, and consolidated reporting. ## Input/output options @@ -29,30 +29,65 @@ Define where the pipeline should find input data and save output data. - **Type**: `string` - **Default**: `"./results"` -## Boltzgen design parameters +## Design tool selection -Core parameters for Boltzgen protein design execution. +### `--protein_design_tool` + +Which design backend to use: `boltzgen` (default) or `complexa`. + +- **Type**: `string` +- **Default**: `"boltzgen"` +- **Allowed values**: `boltzgen`, `complexa` + +## BoltzGen parameters + +Parameters specific to the BoltzGen design backend. ### `--cache_dir` -Cache directory for model weights (~6GB). +Cache directory for BoltzGen model weights. - **Type**: `string` - **Default**: `"null"` -### `--boltzgen_config` +## Complexa parameters + +Parameters specific to the Proteina-Complexa design backend. -Optional path to custom Boltzgen config YAML to override defaults. +### `--complexa_ckpt_dir` + +Directory containing Complexa model checkpoints (required when using Complexa). 
- **Type**: `string` - **Default**: `"null"` -### `--steps` +### `--complexa_search_algorithm` -Optional comma-separated list of steps to run (e.g., 'filtering' to rerun only filtering). +Search algorithm for Complexa design sampling. - **Type**: `string` -- **Default**: `"null"` +- **Default**: `"best-of-n"` + +### `--complexa_nsteps` + +Number of diffusion sampling steps for Complexa. + +- **Type**: `integer` +- **Default**: `400` + +### `--complexa_replicas` + +Number of replicas for best-of-n search. + +- **Type**: `integer` +- **Default**: `2` + +### `--complexa_batch_size` + +Batch size for Complexa inference. + +- **Type**: `integer` +- **Default**: `16` ## ProteinMPNN sequence optimization @@ -60,7 +95,7 @@ Options for ProteinMPNN sequence optimization of designed structures. ### `--run_proteinmpnn` -Enable ProteinMPNN sequence optimization of Boltzgen designs. +Enable ProteinMPNN sequence optimization of Complexa designs. - **Type**: `boolean` - **Default**: `false` @@ -173,7 +208,7 @@ Options for scoring and evaluating designed structures. ### `--run_ipsae` -Enable IPSAE scoring of Boltzgen predictions. +Enable IPSAE scoring of Complexa predictions. - **Type**: `boolean` - **Default**: `false` @@ -367,7 +402,7 @@ Display version and exit. | `--input` | `string` | `"null"` | **Required | | `--outdir` | `string` | `"./results"` | **Required | | `--cache_dir` | `string` | `"null"` | Cache directory for model weights (~6GB) | -| `--boltzgen_config` | `string` | `"null"` | Optional path to custom Boltzgen config YAML to... | +| `--complexa_config` | `string` | `"null"` | Optional path to custom Complexa config YAML to... | | `--steps` | `string` | `"null"` | Optional comma-separated list of steps to run (e | | `--run_proteinmpnn` | `boolean` | `false` | Enable ProteinMPNN sequence optimization of Bol... | | `--mpnn_sampling_temp` | `number` | `0.1` | Sampling temperature (lower = more conservative) | @@ -384,7 +419,7 @@ Display version and exit. 
| `--boltz2_num_recycling` | `integer` | `3` | Number of recycling iterations for structure re... | | `--boltz2_use_msa` | `boolean` | `false` | Use multiple sequence alignments (MSAs) for pre... | | `--boltz2_predict_affinity` | `boolean` | `true` | Predict binding affinity for protein complexes | -| `--run_ipsae` | `boolean` | `false` | Enable IPSAE scoring of Boltzgen predictions | +| `--run_ipsae` | `boolean` | `false` | Enable IPSAE scoring of Complexa predictions | | `--ipsae_pae_cutoff` | `number` | `10` | PAE cutoff for IPSAE calculation (Angstroms) | | `--ipsae_dist_cutoff` | `number` | `10` | Distance cutoff for CA-CA contacts (Angstroms) | | `--run_prodigy` | `boolean` | `false` | Enable PRODIGY binding affinity prediction on f... | diff --git a/main.nf b/main.nf index 1b0551a..9b2a71f 100644 --- a/main.nf +++ b/main.nf @@ -2,8 +2,12 @@ /* ======================================================================================== - nf-proteindesign: Nextflow pipeline for Boltzgen protein design + nf-proteindesign: Nextflow pipeline for AI-powered protein design ======================================================================================== + Supports three generative design backends: + --protein_design_tool boltzgen (default, original) + --protein_design_tool complexa (Proteina-Complexa flow-matching) + --protein_design_tool rfdiffusion_v3 (RFdiffusion3 all-atom diffusion) Github : https://github.com/seqeralabs/nf-proteindesign ---------------------------------------------------------------------------------------- */ @@ -18,17 +22,6 @@ nextflow.enable.dsl = 2 include { samplesheetToList } from 'plugin/nf-schema' -/* -======================================================================================== - VALIDATE INPUTS -======================================================================================== -*/ - -// Validate required parameters -if (!params.input) { - error "ERROR: Please provide a samplesheet with --input" -} - /* 
======================================================================================== NAMED WORKFLOW FOR PIPELINE @@ -37,35 +30,53 @@ if (!params.input) { include { PROTEIN_DESIGN } from './workflows/protein_design' +// Individual design-tool modules (used for test_design_only mode) +include { PROTEINA_COMPLEXA_DESIGN as TEST_COMPLEXA } from './modules/local/proteina_complexa_design' +include { BOLTZGEN_RUN as TEST_BOLTZGEN } from './modules/local/boltzgen_run' +include { RFDIFFUSION_V3_RUN as TEST_RFDV3 } from './modules/local/rfdiffusion_v3_run' +include { CONVERT_CIF_TO_PDB as TEST_CIF2PDB } from './modules/local/convert_cif_to_pdb' + workflow NFPROTEINDESIGN { + // ======================================================================== + // Validate inputs + // ======================================================================== + if (!params.input) { + error "ERROR: Please provide a samplesheet with --input" + } + + def valid_tools = ['boltzgen', 'complexa', 'rfdiffusion_v3'] + if (!valid_tools.contains(params.protein_design_tool)) { + error "ERROR: --protein_design_tool must be one of: ${valid_tools.join(', ')}. Got: '${params.protein_design_tool}'" + } + // ======================================================================== // Print pipeline startup banner // ======================================================================== - // Build list of enabled analysis modules def enabled_modules = [] if (params.run_proteinmpnn) enabled_modules.add('ProteinMPNN') if (params.run_ipsae) enabled_modules.add('IPSAE') if (params.run_prodigy) enabled_modules.add('PRODIGY') + if (params.run_foldseek) enabled_modules.add('Foldseek') if (params.run_consolidation) enabled_modules.add('Metrics Consolidation') def modules_str = enabled_modules.size() > 0 ? 
enabled_modules.join(', ') : 'None' - - // Format the banner with proper width (64 chars inside the box) + def banner_width = 64 - def version_text = "nf-proteindesign v1.0.0" - def mode_line = "Mode: DESIGN" - def desc_line = "Using design YAML files" + def version_text = "nf-proteindesign v2.0.0" + def tool_labels = ['boltzgen': 'BoltzGen', 'complexa': 'Proteina-Complexa', 'rfdiffusion_v3': 'RFdiffusion v3'] + def tool_name = tool_labels.getOrDefault(params.protein_design_tool, params.protein_design_tool) + def mode_line = "Mode: DESIGN (${tool_name})" + def desc_labels = ['boltzgen': 'Using design YAML files', 'complexa': 'Using pipeline config YAML files', 'rfdiffusion_v3': 'Using contig YAML + target PDB'] + def desc_line = desc_labels.getOrDefault(params.protein_design_tool, 'Using design YAML files') def modules_header = "Analysis Modules:" def output_line = "Output: ${params.outdir}" - - // Truncate modules string if too long - def max_modules_len = banner_width - 2 - if (modules_str.length() > max_modules_len) { - modules_str = modules_str.substring(0, max_modules_len - 3) + "..." + + if (modules_str.length() > banner_width - 2) { + modules_str = modules_str.substring(0, banner_width - 5) + "..." 
} - + log.info """ - + ╔════════════════════════════════════════════════════════════════╗ ║${version_text.center(banner_width)}║ ╠════════════════════════════════════════════════════════════════╣ @@ -84,164 +95,211 @@ workflow NFPROTEINDESIGN { // Store projectDir for use in closures // ======================================================================== def project_dir = projectDir - + // ======================================================================== - // Create input channel for design mode + // Parse samplesheet — schema and channel shape depend on design tool // ======================================================================== - - // Validate and parse samplesheet using nf-schema - def design_samplesheet = samplesheetToList( - params.input, - "${projectDir}/assets/schema_input_design.json" - ) - - ch_input = Channel - .fromList(design_samplesheet) - .map { tuple -> - // samplesheetToList returns list of values in schema order - // Order: sample_id, design_yaml, structure_files, protocol, num_designs, budget, reuse, target_msa, target_sequence, target_template, boltzgen_output_dir - def sample_id = tuple[0] - def design_yaml_path = tuple[1] - def structure_files_str = tuple[2] - def protocol = tuple[3] - def num_designs = tuple[4] - def budget = tuple[5] - def reuse = tuple.size() > 6 ? tuple[6] : null - def target_msa_path = tuple.size() > 7 ? tuple[7] : null - def target_sequence_path = tuple.size() > 8 ? tuple[8] : null - def target_template_path = tuple.size() > 9 ? tuple[9] : null - def boltzgen_output_dir_path = tuple.size() > 10 ? 
tuple[10] : null - - // Convert design YAML to file object and validate existence - // Smart path resolution: try launchDir first (for local runs), then projectDir (for Platform) - def design_yaml - if (design_yaml_path.startsWith('/') || design_yaml_path.contains('://')) { - // Absolute path or remote URL - use as-is - design_yaml = file(design_yaml_path, checkIfExists: true) - } else { - // Relative path - try launchDir first, then projectDir - def launchDir_path = file(design_yaml_path) - if (launchDir_path.exists()) { - design_yaml = launchDir_path - } else { - // Fall back to projectDir (for Seqera Platform) - design_yaml = file("${project_dir}/${design_yaml_path}", checkIfExists: true) - } - } - - // Parse structure files (can be comma-separated list) - def structure_files = [] - if (structure_files_str) { - structure_files_str.split(',').each { structure_path -> - def trimmed_path = structure_path.trim() - if (trimmed_path.startsWith('/') || trimmed_path.contains('://')) { - structure_files.add(file(trimmed_path, checkIfExists: true)) - } else { - def launchDir_path = file(trimmed_path) - if (launchDir_path.exists()) { - structure_files.add(launchDir_path) - } else { - structure_files.add(file("${project_dir}/${trimmed_path}", checkIfExists: true)) - } + + if (params.protein_design_tool == 'boltzgen') { + // ---- BoltzGen samplesheet ---- + def samplesheet = samplesheetToList( + params.input, + "${projectDir}/assets/schema_input_boltzgen.json" + ) + + ch_input = Channel + .fromList(samplesheet) + .map { tuple -> + // Schema order: sample_id, design_yaml, structure_files, protocol, + // num_designs, budget, reuse, target_msa, target_sequence, + // target_template, boltzgen_output_dir + def sample_id = tuple[0] + def design_yaml_path = tuple[1] + def structure_files_str = tuple[2] + def protocol = tuple[3] + def num_designs = tuple[4] + def budget = tuple[5] + def reuse = tuple.size() > 6 ? tuple[6] : null + def target_msa_path = tuple.size() > 7 ? 
tuple[7] : null + def target_sequence_path = tuple.size() > 8 ? tuple[8] : null + def target_template_path = tuple.size() > 9 ? tuple[9] : null + + // Resolve design YAML + def design_yaml = design_yaml_path.startsWith('/') || design_yaml_path.contains('://') ? + file(design_yaml_path, checkIfExists: true) : + (file(design_yaml_path).exists() ? file(design_yaml_path) : file("${project_dir}/${design_yaml_path}", checkIfExists: true)) + + // Parse comma-separated structure files + def structure_files = [] + if (structure_files_str) { + structure_files_str.split(',').each { p -> + def trimmed = p.trim() + def resolved = trimmed.startsWith('/') || trimmed.contains('://') ? + file(trimmed, checkIfExists: true) : + (file(trimmed).exists() ? file(trimmed) : file("${project_dir}/${trimmed}", checkIfExists: true)) + structure_files.add(resolved) } } - } - - // Parse target MSA file if provided - def target_msa = null - if (target_msa_path) { - if (target_msa_path.startsWith('/') || target_msa_path.contains('://')) { - target_msa = file(target_msa_path, checkIfExists: true) - } else { - def launchDir_path = file(target_msa_path) - if (launchDir_path.exists()) { - target_msa = launchDir_path - } else { - target_msa = file("${project_dir}/${target_msa_path}", checkIfExists: true) - } + + // Resolve target sequence if provided + def target_sequence = null + if (target_sequence_path) { + target_sequence = target_sequence_path.startsWith('/') || target_sequence_path.contains('://') ? + file(target_sequence_path, checkIfExists: true) : + (file(target_sequence_path).exists() ? 
file(target_sequence_path) : file("${project_dir}/${target_sequence_path}", checkIfExists: true)) } + + def meta = [:] + meta.id = sample_id + meta.protocol = protocol + meta.num_designs = num_designs + meta.budget = budget + meta.reuse = reuse ?: false + meta.target_msa = target_msa_path + meta.target_template = target_template_path + + // BoltzGen channel shape: [meta, design_yaml, structure_files, target_sequence] + [meta, design_yaml, structure_files, target_sequence] } - // Parse target sequence FASTA file (required for Boltz2 refolding) - def target_sequence = null - if (target_sequence_path) { - if (target_sequence_path.startsWith('/') || target_sequence_path.contains('://')) { - target_sequence = file(target_sequence_path, checkIfExists: true) - } else { - def launchDir_path = file(target_sequence_path) - if (launchDir_path.exists()) { - target_sequence = launchDir_path - } else { - target_sequence = file("${project_dir}/${target_sequence_path}", checkIfExists: true) - } - } + } else if (params.protein_design_tool == 'complexa') { + // ---- Complexa samplesheet ---- + def samplesheet = samplesheetToList( + params.input, + "${projectDir}/assets/schema_input_complexa.json" + ) + + ch_input = Channel + .fromList(samplesheet) + .map { tuple -> + // Schema order: sample_id, target_pdb, pipeline_config, + // target_sequence, target_msa, target_template + def sample_id = tuple[0] + def target_pdb_path = tuple[1] + def pipeline_config_path = tuple[2] + def target_sequence_path = tuple[3] + def target_msa_path = tuple.size() > 4 ? tuple[4] : null + def target_template_path = tuple.size() > 5 ? tuple[5] : null + + def target_pdb = target_pdb_path.startsWith('/') || target_pdb_path.contains('://') ? + file(target_pdb_path, checkIfExists: true) : + (file(target_pdb_path).exists() ? 
file(target_pdb_path) : file("${project_dir}/${target_pdb_path}", checkIfExists: true)) + + def pipeline_config = pipeline_config_path.startsWith('/') || pipeline_config_path.contains('://') ? + file(pipeline_config_path, checkIfExists: true) : + (file(pipeline_config_path).exists() ? file(pipeline_config_path) : file("${project_dir}/${pipeline_config_path}", checkIfExists: true)) + + def target_sequence = target_sequence_path.startsWith('/') || target_sequence_path.contains('://') ? + file(target_sequence_path, checkIfExists: true) : + (file(target_sequence_path).exists() ? file(target_sequence_path) : file("${project_dir}/${target_sequence_path}", checkIfExists: true)) + + def meta = [:] + meta.id = sample_id + meta.target_msa = target_msa_path + meta.target_template = target_template_path + + // Complexa channel shape: [meta, target_pdb, pipeline_config, target_sequence] + [meta, target_pdb, pipeline_config, target_sequence] } - // Parse target template CIF file (optional for Boltz2 refolding) - def target_template = null - if (target_template_path) { - if (target_template_path.startsWith('/') || target_template_path.contains('://')) { - target_template = file(target_template_path, checkIfExists: true) - } else { - def launchDir_path = file(target_template_path) - if (launchDir_path.exists()) { - target_template = launchDir_path - } else { - target_template = file("${project_dir}/${target_template_path}", checkIfExists: true) + } else { + // ---- RFdiffusion v3 samplesheet ---- + def samplesheet = samplesheetToList( + params.input, + "${projectDir}/assets/schema_input_rfdiffusion_v3.json" + ) + + ch_input = Channel + .fromList(samplesheet) + .map { tuple -> + // Schema order: sample_id, design_yaml, structure_files, + // num_designs, budget, target_msa, target_sequence, target_template + def sample_id = tuple[0] + def design_yaml_path = tuple[1] + def structure_files_str = tuple[2] + def num_designs = tuple[3] + def budget = tuple[4] + def target_msa_path = 
tuple.size() > 5 ? tuple[5] : null + def target_sequence_path = tuple.size() > 6 ? tuple[6] : null + def target_template_path = tuple.size() > 7 ? tuple[7] : null + + // Resolve design YAML + def design_yaml = design_yaml_path.startsWith('/') || design_yaml_path.contains('://') ? + file(design_yaml_path, checkIfExists: true) : + (file(design_yaml_path).exists() ? file(design_yaml_path) : file("${project_dir}/${design_yaml_path}", checkIfExists: true)) + + // Parse comma-separated structure files + def structure_files = [] + if (structure_files_str) { + structure_files_str.split(',').each { p -> + def trimmed = p.trim() + def resolved = trimmed.startsWith('/') || trimmed.contains('://') ? + file(trimmed, checkIfExists: true) : + (file(trimmed).exists() ? file(trimmed) : file("${project_dir}/${trimmed}", checkIfExists: true)) + structure_files.add(resolved) } } - } - // Parse boltzgen_output_dir if provided - def boltzgen_output_dir = null - if (boltzgen_output_dir_path) { - if (boltzgen_output_dir_path.startsWith('/') || boltzgen_output_dir_path.contains('://')) { - boltzgen_output_dir = file(boltzgen_output_dir_path, type: 'dir', checkIfExists: true) - } else { - def launchDir_path = file(boltzgen_output_dir_path, type: 'dir') - if (launchDir_path.exists()) { - boltzgen_output_dir = launchDir_path - } else { - boltzgen_output_dir = file("${project_dir}/${boltzgen_output_dir_path}", type: 'dir', checkIfExists: true) - } + // Resolve target sequence if provided + def target_sequence = null + if (target_sequence_path) { + target_sequence = target_sequence_path.startsWith('/') || target_sequence_path.contains('://') ? + file(target_sequence_path, checkIfExists: true) : + (file(target_sequence_path).exists() ? 
file(target_sequence_path) : file("${project_dir}/${target_sequence_path}", checkIfExists: true)) } - } - def meta = [:] - meta.id = sample_id - meta.protocol = protocol - meta.num_designs = num_designs - meta.budget = budget - meta.reuse = reuse ?: false + def meta = [:] + meta.id = sample_id + meta.num_designs = num_designs + meta.budget = budget + meta.target_msa = target_msa_path + meta.target_template = target_template_path - [meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir] - } + // RFdiffusion v3 channel shape: [meta, design_yaml, structure_files, target_sequence] + [meta, design_yaml, structure_files, target_sequence] + } + } // ======================================================================== - // Prepare cache directory channel for Boltzgen + // Prepare design-tool checkpoint / cache channel // ======================================================================== - // If cache_dir is specified, stage it as input; otherwise use empty placeholder - if (params.cache_dir) { - ch_cache = Channel - .fromPath(params.cache_dir, type: 'dir', checkIfExists: true) - .first() + if (params.protein_design_tool == 'boltzgen') { + if (params.cache_dir) { + ch_design_cache = Channel + .fromPath(params.cache_dir, type: 'dir', checkIfExists: true) + .first() + } else { + ch_design_cache = Channel.value([]) + } + } else if (params.protein_design_tool == 'complexa') { + if (params.complexa_ckpt_dir) { + ch_design_cache = Channel + .fromPath(params.complexa_ckpt_dir, type: 'dir', checkIfExists: true) + .first() + } else { + ch_design_cache = Channel.value([]) + } } else { - // Create a placeholder file when no cache is provided - ch_cache = Channel.value(file('EMPTY_CACHE')) + // RFdiffusion v3 + if (params.rfdiffusion_v3_ckpt_dir) { + ch_design_cache = Channel + .fromPath(params.rfdiffusion_v3_ckpt_dir, type: 'dir', checkIfExists: true) + .first() + } else { + ch_design_cache = Channel.value([]) + } } // 
======================================================================== - // Prepare cache directory channel for Boltz-2 + // Prepare cache directory channel for Boltz-2 (shared across all design tools) // ======================================================================== - // If boltz2_cache is specified, stage it as input; otherwise use empty placeholder if (params.boltz2_cache) { ch_boltz2_cache = Channel .fromPath(params.boltz2_cache, type: 'dir', checkIfExists: true) .first() } else { - // Create a placeholder file when no cache is provided ch_boltz2_cache = Channel.value(file('EMPTY_BOLTZ2_CACHE')) } @@ -249,7 +307,7 @@ workflow NFPROTEINDESIGN { // Run PROTEIN_DESIGN workflow // ======================================================================== - PROTEIN_DESIGN(ch_input, ch_cache, ch_boltz2_cache) + PROTEIN_DESIGN(ch_input, ch_design_cache, ch_boltz2_cache) } @@ -257,10 +315,157 @@ workflow NFPROTEINDESIGN { ======================================================================================== RUN MAIN WORKFLOW ======================================================================================== + When test_design_only = true, runs ONLY the design tool process (no downstream + analysis). Use this to smoke-test that the container starts, finds GPU/checkpoints, + and produces output structures.
+ + Usage: + # Full pipeline (default) + nextflow run main.nf -profile test_design_rfdiffusion_v3 + # Design-tool-only smoke test + nextflow run main.nf -profile test_design_rfdiffusion_v3 --test_design_only +---------------------------------------------------------------------------------------- */ workflow { - NFPROTEINDESIGN() + if (params.test_design_only) { + // ================================================================ + // TEST_DESIGN_ONLY mode — single design-tool process, then exit + // ================================================================ + + if (!params.input) { + error "ERROR: Please provide a samplesheet with --input" + } + + def valid_tools = ['boltzgen', 'complexa', 'rfdiffusion_v3'] + if (!valid_tools.contains(params.protein_design_tool)) { + error "ERROR: --protein_design_tool must be one of: ${valid_tools.join(', ')}. Got: '${params.protein_design_tool}'" + } + + def tool_labels = ['boltzgen': 'BoltzGen', 'complexa': 'Proteina-Complexa', 'rfdiffusion_v3': 'RFdiffusion v3'] + log.info """ + ┌────────────────────────────────────────────────────┐ + │ TEST_DESIGN_ONLY — ${tool_labels[params.protein_design_tool].padRight(30)}│ + │ Design tool smoke test (no downstream analysis) │ + └────────────────────────────────────────────────────┘ + """.stripIndent() + + def project_dir = projectDir + + if (params.protein_design_tool == 'boltzgen') { + def samplesheet = samplesheetToList(params.input, "${projectDir}/assets/schema_input_boltzgen.json") + + ch_input = Channel.fromList(samplesheet).map { tuple -> + def sample_id = tuple[0] + def design_yaml_path = tuple[1] + def structure_files_str = tuple[2] + def protocol = tuple[3] + def num_designs = tuple[4] + def budget = tuple[5] + + def design_yaml = design_yaml_path.startsWith('/') || design_yaml_path.contains('://') ? + file(design_yaml_path, checkIfExists: true) : + (file(design_yaml_path).exists() ? 
file(design_yaml_path) : file("${project_dir}/${design_yaml_path}", checkIfExists: true)) + + def structure_files = [] + if (structure_files_str) { + structure_files_str.split(',').each { p -> + def trimmed = p.trim() + def resolved = trimmed.startsWith('/') || trimmed.contains('://') ? + file(trimmed, checkIfExists: true) : + (file(trimmed).exists() ? file(trimmed) : file("${project_dir}/${trimmed}", checkIfExists: true)) + structure_files.add(resolved) + } + } + + def meta = [id: sample_id, protocol: protocol, num_designs: num_designs, budget: budget] + [meta, design_yaml, structure_files] + } + + def ch_cache = params.cache_dir ? + Channel.fromPath(params.cache_dir, type: 'dir', checkIfExists: true).first() : + Channel.value([]) + + TEST_BOLTZGEN(ch_input, ch_cache) + + } else if (params.protein_design_tool == 'complexa') { + def samplesheet = samplesheetToList(params.input, "${projectDir}/assets/schema_input_complexa.json") + + ch_input = Channel.fromList(samplesheet).map { tuple -> + def sample_id = tuple[0] + def target_pdb_path = tuple[1] + def pipeline_config_path = tuple[2] + + def target_pdb = target_pdb_path.startsWith('/') || target_pdb_path.contains('://') ? + file(target_pdb_path, checkIfExists: true) : + (file(target_pdb_path).exists() ? file(target_pdb_path) : file("${project_dir}/${target_pdb_path}", checkIfExists: true)) + + def pipeline_config = pipeline_config_path.startsWith('/') || pipeline_config_path.contains('://') ? + file(pipeline_config_path, checkIfExists: true) : + (file(pipeline_config_path).exists() ? file(pipeline_config_path) : file("${project_dir}/${pipeline_config_path}", checkIfExists: true)) + + def meta = [id: sample_id] + [meta, target_pdb, pipeline_config] + } + + def ch_ckpt = params.complexa_ckpt_dir ? 
+ Channel.fromPath(params.complexa_ckpt_dir, type: 'dir', checkIfExists: true).first() : + Channel.value([]) + + TEST_COMPLEXA(ch_input, ch_ckpt) + + } else { + def samplesheet = samplesheetToList(params.input, "${projectDir}/assets/schema_input_rfdiffusion_v3.json") + + ch_input = Channel.fromList(samplesheet).map { tuple -> + def sample_id = tuple[0] + def design_yaml_path = tuple[1] + def structure_files_str = tuple[2] + def num_designs = tuple[3] + def budget = tuple[4] + + def design_yaml = design_yaml_path.startsWith('/') || design_yaml_path.contains('://') ? + file(design_yaml_path, checkIfExists: true) : + (file(design_yaml_path).exists() ? file(design_yaml_path) : file("${project_dir}/${design_yaml_path}", checkIfExists: true)) + + def structure_files = [] + if (structure_files_str) { + structure_files_str.split(',').each { p -> + def trimmed = p.trim() + def resolved = trimmed.startsWith('/') || trimmed.contains('://') ? + file(trimmed, checkIfExists: true) : + (file(trimmed).exists() ? file(trimmed) : file("${project_dir}/${trimmed}", checkIfExists: true)) + structure_files.add(resolved) + } + } + + def meta = [id: sample_id, num_designs: num_designs, budget: budget] + [meta, design_yaml, structure_files] + } + + // Convert CIF structures to PDB (rfd3 requires PDB input) + ch_structures = ch_input.map { meta, design_yaml, structure_files -> [meta, structure_files] } + TEST_CIF2PDB(ch_structures) + + // Rejoin converted PDBs with design YAML + ch_rfd_input = ch_input + .map { meta, design_yaml, structure_files -> [meta.id, meta, design_yaml] } + .join(TEST_CIF2PDB.out.pdb_files_all.map { meta, pdbs -> [meta.id, pdbs] }) + .map { id, meta, design_yaml, pdbs -> [meta, design_yaml, pdbs] } + + def ch_cache = params.rfdiffusion_v3_ckpt_dir ?
+ Channel.fromPath(params.rfdiffusion_v3_ckpt_dir, type: 'dir', checkIfExists: true).first() : + Channel.value([]) + + TEST_RFDV3(ch_rfd_input, ch_cache) + } + + } else { + // ================================================================ + // Normal mode — full pipeline + // ================================================================ + NFPROTEINDESIGN() + } } /* diff --git a/mkdocs.yml b/mkdocs.yml index 268473d..a4fa728 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,5 +1,5 @@ site_name: nf-proteindesign -site_description: Nextflow pipeline for parallel Boltzgen protein design with automated binding site prediction +site_description: Nextflow pipeline for parallel Complexa protein design with automated binding site prediction site_author: nf-proteindesign contributors repo_name: seqeralabs/nf-proteindesign repo_url: https://github.com/seqeralabs/nf-proteindesign diff --git a/modules/local/boltz2_refold.nf b/modules/local/boltz2_refold.nf index ac90424..c5998cf 100644 --- a/modules/local/boltz2_refold.nf +++ b/modules/local/boltz2_refold.nf @@ -23,6 +23,8 @@ process BOLTZ2_REFOLD { container 'giosbiostructures/boltz2:latest' + errorStrategy 'ignore' + // GPU acceleration - Boltz-2 benefits from GPU for efficient prediction accelerator 1, type: 'nvidia-gpu' @@ -40,7 +42,7 @@ process BOLTZ2_REFOLD { script: def use_msa = params.boltz2_use_msa ? '--use_msa_server' : '' - def cache_opt = cache_dir.name != 'EMPTY_BOLTZ2_CACHE' ? "--cache boltz2_cache" : '' + def cache_opt = cache_dir.name != 'EMPTY_BOLTZ2_CACHE' ? 
"--cache ${cache_dir}" : '' def num_recycling = params.boltz2_num_recycling ?: 3 def num_diffusion = params.boltz2_num_diffusion ?: 5 def has_target_msa = target_msa.name != 'NO_MSA' @@ -53,6 +55,10 @@ process BOLTZ2_REFOLD { # Fix for Numba caching error in containers export NUMBA_CACHE_DIR="\${PWD}/numba_cache" + + # Disable CUDA shader cache to prevent root-owned .nv directory + # that causes AccessDeniedException when Nextflow collects outputs + export CUDA_CACHE_DISABLE=1 mkdir -p "\${NUMBA_CACHE_DIR}" # Fix for Boltz caching error (tries to write to /.boltz) @@ -121,7 +127,7 @@ process BOLTZ2_REFOLD { --out_dir boltz2_results \\ --accelerator gpu \\ --devices 1 \\ - --num_workers 12 \\ + --num_workers 0 \\ --recycling_steps ${num_recycling} \\ --diffusion_samples ${num_diffusion} \\ ${cache_opt} \\ @@ -164,6 +170,13 @@ process BOLTZ2_REFOLD { echo " Saved PAE: \${filename}" done + # Copy pLDDT NPZ files (format: plddt__model_0.npz) + find "\${pred_dir}" -name "plddt*.npz" -type f | while read file; do + filename=\$(basename "\${file}") + cp "\${file}" "${meta.id}_boltz2_output/\${filename}" + echo " Saved pLDDT: \${filename}" + done + # Copy confidence JSON files find "\${pred_dir}" -name "*confidence*.json" -type f | while read file; do filename=\$(basename "\${file}") @@ -186,6 +199,7 @@ process BOLTZ2_REFOLD { CIF_COUNT=\$(find ${meta.id}_boltz2_output -name "*.cif" | wc -l) JSON_COUNT=\$(find ${meta.id}_boltz2_output -name "*confidence*.json" | wc -l) NPZ_COUNT=\$(find ${meta.id}_boltz2_output -name "*pae*.npz" | wc -l) + PLDDT_COUNT=\$(find ${meta.id}_boltz2_output -name "plddt*.npz" | wc -l) AFFINITY_COUNT=\$(find ${meta.id}_boltz2_output -name "*affinity*.json" | wc -l) echo "" @@ -195,6 +209,7 @@ process BOLTZ2_REFOLD { echo "Structures predicted: \${CIF_COUNT}" echo "Confidence files: \${JSON_COUNT}" echo "PAE NPZ files: \${NPZ_COUNT}" + echo "pLDDT NPZ files: \${PLDDT_COUNT}" echo "Affinity predictions: \${AFFINITY_COUNT}" echo "Output directory: 
${meta.id}_boltz2_output" echo "============================================" @@ -230,7 +245,7 @@ Input: - Target sequence length: \${#TARGET_SEQ} Parameters: - - Cache directory: ${cache_dir.name != 'EMPTY_BOLTZ2_CACHE' ? 'boltz2_cache (staged)' : 'default (~/.boltz)'} + - Cache directory: ${cache_dir.name != 'EMPTY_BOLTZ2_CACHE' ? cache_dir.toString() : 'default (~/.boltz)'} - Recycling steps: ${num_recycling} - Diffusion samples: ${num_diffusion} - Use MSA: ${params.boltz2_use_msa} diff --git a/modules/local/boltzgen_run.nf b/modules/local/boltzgen_run.nf index 0768cc4..d604c3e 100644 --- a/modules/local/boltzgen_run.nf +++ b/modules/local/boltzgen_run.nf @@ -12,7 +12,7 @@ process BOLTZGEN_RUN { input: tuple val(meta), path(design_yaml), path(structure_files) - path cache_dir, stageAs: 'input_cache' + path(cache_dir, stageAs: 'input_cache', arity: '0..*') output: tuple val(meta), path("${meta.id}_output"), emit: results @@ -47,17 +47,17 @@ process BOLTZGEN_RUN { def reuse_flag = meta.reuse ? '--reuse' : '' def config_arg = params.boltzgen_config ? "--config ${params.boltzgen_config}" : '' def steps_arg = params.steps ? "--steps ${params.steps}" : '' - def cache_arg = cache_dir.name != 'EMPTY_CACHE' ? "--cache input_cache" : "--cache cache" + def cache_arg = cache_dir ? "--cache input_cache" : "--cache cache" """ export HF_HOME=\${PWD}/input_cache export NUMBA_CACHE_DIR=/tmp export MPLCONFIGDIR=/tmp/matplotlib export XET_LOG_DIR=/tmp/xet_logs - export TRITON_CACHE_DIR=/tmp/triton # Add this line + export TRITON_CACHE_DIR=/tmp/triton export XDG_CACHE_HOME=/tmp/cache # Create cache directory if not using staged cache - if [ "${cache_dir.name}" == "EMPTY_CACHE" ]; then + if [ ! 
-d "input_cache" ]; then mkdir -p ./cache fi diff --git a/modules/local/ipsae_calculate.nf b/modules/local/ipsae_calculate.nf index 0629c25..84205ed 100644 --- a/modules/local/ipsae_calculate.nf +++ b/modules/local/ipsae_calculate.nf @@ -8,7 +8,7 @@ process IPSAE_CALCULATE { container 'community.wave.seqera.io/library/numpy:2.3.5--f8d2712d76b3e3ce' input: - tuple val(meta), path(pae_file), path(structure_file) + tuple val(meta), path(pae_file), path(structure_file), path(confidence_json), path(plddt_npz) path ipsae_script output: @@ -24,6 +24,13 @@ process IPSAE_CALCULATE { """ # Install numpy if not available pip install --no-cache-dir numpy 2>&1 | grep -v "Requirement already satisfied" || true + + # Stage confidence and pLDDT files alongside the PAE file so ipsae.py + # auto-discovers them by filename convention: + # pae__model_0.npz -> confidence__model_0.json + # pae__model_0.npz -> plddt__model_0.npz + echo "Staged confidence JSON: ${confidence_json}" + echo "Staged pLDDT NPZ: ${plddt_npz}" # Run IPSAE calculation python ${ipsae_script} \\ diff --git a/modules/local/prepare_boltz2_sequences.nf b/modules/local/prepare_boltz2_sequences.nf index fae2a3d..7a3f5fe 100644 --- a/modules/local/prepare_boltz2_sequences.nf +++ b/modules/local/prepare_boltz2_sequences.nf @@ -6,7 +6,7 @@ 1. Splits ProteinMPNN multi-sequence FASTA into individual files 2. Processes target sequence FASTA to clean format (no header, single line) - All sequences are included (original Boltzgen sequence + MPNN-designed sequences). + All sequences are included (original Complexa sequence + MPNN-designed sequences). 
---------------------------------------------------------------------------------------- */ diff --git a/modules/local/proteina_complexa_design.nf b/modules/local/proteina_complexa_design.nf new file mode 100644 index 0000000..c9ca513 --- /dev/null +++ b/modules/local/proteina_complexa_design.nf @@ -0,0 +1,156 @@ +process PROTEINA_COMPLEXA_DESIGN { + tag "${meta.id}" + label 'process_high_gpu' + + // Publish results + publishDir "${params.outdir}/${meta.id}/proteina_complexa", mode: params.publish_dir_mode, saveAs: { filename -> filename } + + container "${params.complexa_container}" + + // GPU acceleration — Proteina-Complexa requires GPU for flow matching inference + reward scoring + accelerator 1, type: 'nvidia-gpu' + + input: + tuple val(meta), path(target_pdb), path(pipeline_config) + path(ckpt_dir, stageAs: 'checkpoints', arity: '0..*') + + output: + // Full results directory + tuple val(meta), path("${meta.id}_output"), emit: results + + // Generated PDB design files (filtered top-N from the generation+filter stages) + // These are complex PDBs (target + binder) ready for downstream analysis + tuple val(meta), path("${meta.id}_output/designs/*.pdb"), optional: true, emit: design_pdbs + + // Evaluation results CSVs (from Complexa's built-in evaluate stage) + tuple val(meta), path("${meta.id}_output/evaluation_results/*.csv"), optional: true, emit: eval_csvs + + // Analysis summary CSVs (from Complexa's built-in analyze stage) + tuple val(meta), path("${meta.id}_output/analysis/*_combined.csv"), optional: true, emit: analysis_csvs + + // Success-filtered designs (designs passing i_pAE, pLDDT, scRMSD thresholds) + tuple val(meta), path("${meta.id}_output/analysis/success_filtered/*.pdb"), optional: true, emit: success_pdbs + + path "versions.yml", emit: versions + + script: + def run_name = "${meta.id}" + def task_name = meta.task_name ?: '' + def pipeline_type = meta.pipeline_type ?: 'binder' + def nsamples = meta.num_designs ?: 4 + def filter_limit = 
meta.budget ?: 10 + def search_algo = params.complexa_search_algorithm ?: 'best-of-n' + def nsteps = params.complexa_nsteps ?: 400 + def replicas = params.complexa_replicas ?: 2 + def batch_size = params.complexa_batch_size ?: 16 + def extra_args = params.complexa_extra_args ?: '' + // Only override task_name and target_path when explicitly provided + def task_name_arg = task_name ? "++generation.task_name=${task_name} ++generation.target_dict_cfg.${task_name}.target_path=${target_pdb}" : '' + + """ + set -euo pipefail + + # ── Environment setup ── + export NUMBA_CACHE_DIR=/tmp/numba + export MPLCONFIGDIR=/tmp/matplotlib + export XDG_CACHE_HOME=/tmp/cache + export TRITON_CACHE_DIR=/tmp/triton + mkdir -p /tmp/numba /tmp/matplotlib /tmp/cache /tmp/triton + + # ── Initialize Proteina-Complexa environment (Docker runtime) ── + # Set tool paths for the Docker container (normally set by 'complexa init docker && source env.sh') + export COMPLEXA_INIT=1 + export FOLDSEEK_EXEC=/workspace/.venv/bin/foldseek + export RF3_EXEC_PATH=/workspace/.venv/bin/rf3 + export SC_EXEC=/usr/local/bin/sc + export MMSEQS_EXEC=/workspace/.venv/bin/mmseqs + export DSSP_EXEC=/usr/local/bin/dssp + export TMOL_PATH=/workspace/.venv/lib/python3.12/site-packages/tmol + export PYTHONPATH=/workspace/protein-foundation-models/src:\${PYTHONPATH:-} + export PATH=/workspace/.venv/bin:\$PATH + + # ── Resolve checkpoint paths ── + export CKPT_DIR=\$(realpath checkpoints) + export CKPT_PATH=\${CKPT_DIR} + + # ── Ensure the staged target PDB is visible as an absolute path ── + # The YAML config's target_dict_cfg.*.target_path is resolved relative to CWD. + # Nextflow stages the target PDB in the working directory root, so the default + # relative path in the YAML ("target_name.pdb") resolves; TARGET_PDB holds the absolute path.
+ TARGET_PDB=\$(realpath ${target_pdb}) + + # ── Run Proteina-Complexa full design pipeline ── + # The 'complexa design' command runs: generate → filter → evaluate → analyze + complexa design ${pipeline_config} \\ + ++ckpt_path=\${CKPT_DIR} \\ + ++ckpt_name=${meta.ckpt_name ?: 'complexa.ckpt'} \\ + ++autoencoder_ckpt_path=\${CKPT_DIR}/${meta.ae_ckpt_name ?: 'complexa_ae.ckpt'} \\ + ++generation.search.algorithm=${search_algo} \\ + ++generation.args.nsteps=${nsteps} \\ + ++generation.dataloader.batch_size=${batch_size} \\ + ++generation.dataloader.dataset.nres.nsamples=${nsamples} \\ + ++generation.search.best_of_n.replicas=${replicas} \\ + ++generation.filter.filter_samples_limit=${filter_limit} \\ + ${task_name_arg} \\ + ${extra_args} + + # ── Organize outputs into standardized directory structure ── + mkdir -p ${run_name}_output/designs + mkdir -p ${run_name}_output/evaluation_results + mkdir -p ${run_name}_output/analysis + + # Collect generated PDB files from inference directory + # Proteina-Complexa outputs to: inference/{run_name}_{task_name}/job_*/*.pdb + find inference/ -name "*.pdb" -not -path "*/filtered_out_samples/*" \\ + -exec cp {} ${run_name}_output/designs/ \\; 2>/dev/null || true + + # Collect evaluation CSVs + find evaluation_results/ -name "*.csv" \\ + -exec cp {} ${run_name}_output/evaluation_results/ \\; 2>/dev/null || true + + # Collect analysis outputs (combined CSVs + success-filtered PDBs) + find inference/ -path "*/analysis/*_combined.csv" \\ + -exec cp {} ${run_name}_output/analysis/ \\; 2>/dev/null || true + for d in inference/*/analysis/success_filtered; do + [ -d "\$d" ] && cp -r "\$d" ${run_name}_output/analysis/ 2>/dev/null || true + done + + # ── Version information ── + cat <<-END_VERSIONS > versions.yml + "${task.process}": + proteina-complexa: \$(complexa --version 2>&1 | head -1 || echo "unknown") + python: \$(python --version 2>&1 | sed 's/Python //g') + END_VERSIONS + """ + + stub: + """ + # Create realistic
output directory structure for stub runs + mkdir -p ${meta.id}_output/designs + mkdir -p ${meta.id}_output/evaluation_results + mkdir -p ${meta.id}_output/analysis/success_filtered + + # Create stub PDB files with realistic Proteina-Complexa naming convention + # Format: job_{job_id}_n_{binder_length}_id_{sample_idx}_{metadata_tag}.pdb + touch ${meta.id}_output/designs/job_0_n_80_id_0_bon_orig0_r0.pdb + touch ${meta.id}_output/designs/job_0_n_80_id_1_bon_orig0_r1.pdb + touch ${meta.id}_output/designs/job_0_n_100_id_2_bon_orig1_r0.pdb + + # Create stub evaluation CSVs + echo "sample_id,self_complex_i_pAE,self_complex_pLDDT,self_binder_scRMSD" > ${meta.id}_output/evaluation_results/binder_results_0.csv + echo "job_0_n_80_id_0_bon_orig0_r0,0.15,0.92,1.2" >> ${meta.id}_output/evaluation_results/binder_results_0.csv + + # Create stub analysis CSV + echo "sample_id,i_pAE,pLDDT,scRMSD,pass_all" > ${meta.id}_output/analysis/binder_results_combined.csv + echo "job_0_n_80_id_0_bon_orig0_r0,0.15,0.92,1.2,true" >> ${meta.id}_output/analysis/binder_results_combined.csv + + # Create stub success-filtered PDB + touch ${meta.id}_output/analysis/success_filtered/job_0_n_80_id_0_bon_orig0_r0.pdb + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + proteina-complexa: stub + python: stub + END_VERSIONS + """ +} diff --git a/modules/local/rfdiffusion_v3_run.nf b/modules/local/rfdiffusion_v3_run.nf new file mode 100644 index 0000000..19acdd1 --- /dev/null +++ b/modules/local/rfdiffusion_v3_run.nf @@ -0,0 +1,225 @@ +/* +======================================================================================== + RFDIFFUSION_V3_RUN: Protein backbone design using RFdiffusion3 +======================================================================================== + Runs RFdiffusion3 via the RosettaCommons Foundry framework (rfd3 CLI). 
+ Official container: rosettacommons/foundry + Docs: https://github.com/RosettaCommons/foundry/blob/production/models/rfd3/README.md + + Design YAML schema (rfd3 format): + contig: Contig specification string using comma-separated segments. + e.g. "80-120,/0,A1-100" + Syntax: "<binder_length_range>,/0,<target_chain><start>-<end>" + select_hotspots: Optional dict of target residues → atom names for hotspot + biasing, e.g. {"A42": "CA,CB", "A45": "CG"} + is_non_loopy: Recommended true for PPI binder design (more structured). + + The rfd3 CLI takes two required arguments: + rfd3 design out_dir=<output_dir> inputs=<input_spec.json> + The input PDB path is specified INSIDE the JSON spec via the "input" field, + not as a separate CLI argument. + + Number of designs is controlled via CLI args: + n_batches (default 1) × diffusion_batch_size (default 8) = total designs + + Input structure files must be PDB format. CIF→PDB conversion should be done + upstream (e.g. via CONVERT_CIF_TO_PDB) before calling this module. +======================================================================================== +*/ + +process RFDIFFUSION_V3_RUN { + tag "${meta.id}" + label 'process_high_gpu' + errorStrategy 'retry' + maxRetries 3 + + publishDir "${params.outdir}/${meta.id}/rfdiffusion_v3", mode: params.publish_dir_mode + + container "${params.rfdiffusion_v3_container}" + + accelerator 1, type: 'nvidia-gpu' + + input: + tuple val(meta), path(design_yaml), path(structure_files) + path(cache_dir, stageAs: 'input_cache', arity: '0..*') + + output: + // Full results directory + tuple val(meta), path("${meta.id}_output"), emit: results + + // Generated PDB design files — ranked top-N (budget) for downstream analysis + tuple val(meta), path("${meta.id}_output/designs/*.pdb"), optional: true, emit: design_pdbs + + path "versions.yml", emit: versions + + script: + def model_cache = cache_dir ? 
"\${PWD}/input_cache" : "\${HOME}/.foundry/checkpoints" + def num_designs = meta.num_designs ?: 10 + def budget = meta.budget ?: 4 + // rfd3 controls total designs via n_batches × diffusion_batch_size. + // batch_size=1 so each design diffuses independently — avoids NaN + // propagation when is_non_loopy=true with hotspot constraints. + // n_batches=num_designs generates all requested designs; the ranking + // step below then selects the top `budget` for downstream. + def batch_size = 1 + """ + set -euo pipefail + + # ── Environment setup ── + export FOUNDRY_CHECKPOINT_DIRS="${model_cache}" + export NUMBA_CACHE_DIR=/tmp/numba + export XDG_CACHE_HOME=/tmp/cache + mkdir -p /tmp/numba /tmp/cache + + mkdir -p ${meta.id}_output/rfd3_raw + mkdir -p ${meta.id}_output/designs + + # ── Resolve input PDB structure ── + # CIF→PDB conversion is handled upstream by CONVERT_CIF_TO_PDB + STRUCT_FILES=(${structure_files}) + RESOLVED_PDB="" + if [ \${#STRUCT_FILES[@]} -gt 0 ]; then + RESOLVED_PDB="\${PWD}/\${STRUCT_FILES[0]}" + fi + + if [ -z "\${RESOLVED_PDB}" ]; then + echo "ERROR: No input PDB structure found. RFdiffusion3 requires a target structure." >&2 + echo " structure_files input: ${structure_files}" >&2 + exit 1 + fi + if [ ! -f "\${RESOLVED_PDB}" ]; then + echo "ERROR: Input PDB not found at: \${RESOLVED_PDB}" >&2 + ls -la \${PWD}/ >&2 + exit 1 + fi + echo "Using input structure: \${RESOLVED_PDB}" + + # ── Convert design YAML to rfd3 JSON InputSpecification ── + # The PDB path goes inside the JSON spec as the "input" field. + # Ref: https://github.com/RosettaCommons/foundry/blob/production/models/rfd3/docs/input.md + python3 - <<'PYEOF' +import yaml, json, pathlib + +with open('${design_yaml}') as f: + spec = yaml.safe_load(f) + +# Contig — rfd3 expects the contig as a **string**, e.g. +# "80-120,/0,A1-100" +# Our YAML stores it as a string already; if it's a list, join it back. 
+raw_contig = spec.get('contig', '100-100') +if isinstance(raw_contig, list): + contig_str = ','.join(str(s) for s in raw_contig) +else: + contig_str = str(raw_contig) + +# Build the rfd3 InputSpecification entry +design_entry = { + 'dialect': 2, + 'input': '${meta.id}_target.pdb', # resolved PDB path (symlinked below) + 'contig': contig_str, + 'is_non_loopy': spec.get('is_non_loopy', True), +} + +# Hotspots: rfd3 uses "select_hotspots" dict with atom selections. +# Accept both the rfd3-native dict form and a simple residue list. +# Note: infer_ori_strategy='hotspots' is intentionally NOT set here — +# it causes NaN (X_noisy_L) on small/medium binders and is not required +# for hotspot biasing to take effect. +hotspots = spec.get('select_hotspots', spec.get('hotspot_res', None)) +if hotspots: + if isinstance(hotspots, dict): + # Already in rfd3 native format: {"A42": "CA,CB", ...} + design_entry['select_hotspots'] = hotspots + elif isinstance(hotspots, list) and len(hotspots) > 0: + # Convert simple list ["A42", "A45"] → dict with empty atom selection + design_entry['select_hotspots'] = {r: '' for r in hotspots} + +# Pass through any additional rfd3-native fields from the YAML +for key in ('select_fixed_atoms', 'select_unfixed_sequence', 'ligand', + 'length', 'unindex', 'partial_t', 'infer_ori_strategy', + 'cif_parser_args'): + if key in spec and key not in design_entry: + design_entry[key] = spec[key] + +rfd3_input = {'${meta.id}': design_entry} + +with open('rfd3_input.json', 'w') as f: + json.dump(rfd3_input, f, indent=2) + +print("Generated rfd3_input.json:") +print(json.dumps(rfd3_input, indent=2)) +PYEOF + + # ── Symlink PDB so the path in the JSON spec resolves ── + ln -sf "\${RESOLVED_PDB}" "${meta.id}_target.pdb" + + # ── Run RFdiffusion3 ── + # CLI reference: https://github.com/RosettaCommons/foundry/blob/production/models/rfd3/docs/input.md + # Required: out_dir, inputs + rfd3 design \\ + out_dir=${meta.id}_output/rfd3_raw \\ + inputs=rfd3_input.json \\ 
+ n_batches=${num_designs} \\ + diffusion_batch_size=${batch_size} \\ + inference_sampler.step_scale=1.5 \\ + inference_sampler.gamma_0=0.2 \\ + skip_existing=True \\ + prevalidate_inputs=True + + # ── Rank and collect top designs ── + # rfd3 outputs .cif.gz files (not PDB). Decompress and convert to PDB + # so downstream modules (ProteinMPNN, Boltz-2, etc.) can consume them. + RANK=1 + for cifgz in \$(find ${meta.id}_output/rfd3_raw -name "*.cif.gz" 2>/dev/null | sort -V | head -n ${budget}); do + DESIGN_NAME=\$(basename "\${cifgz}" .cif.gz) + # Decompress the .cif.gz + gunzip -k "\${cifgz}" + CIF_FILE="\${cifgz%.gz}" + # Convert CIF to PDB using biotite (available in the foundry container) + python3 - "\${CIF_FILE}" "${meta.id}_output/designs/rank\${RANK}_\${DESIGN_NAME}.pdb" <<'CIF2PDB' +import sys +from biotite.structure.io import pdbx, pdb + +cif_file = pdbx.CIFFile.read(sys.argv[1]) +atoms = pdbx.get_structure(cif_file, model=1) +pdb_file = pdb.PDBFile() +pdb.set_structure(pdb_file, atoms) +pdb_file.write(sys.argv[2]) +CIF2PDB + RANK=\$((RANK + 1)) + done + # Fallback: also check for any raw PDB outputs + for pdb in \$(find ${meta.id}_output/rfd3_raw -name "*.pdb" 2>/dev/null | sort -V | head -n ${budget}); do + if [ \${RANK} -gt ${budget} ]; then break; fi + DESIGN_NAME=\$(basename "\${pdb}" .pdb) + cp "\${pdb}" "${meta.id}_output/designs/rank\${RANK}_\${DESIGN_NAME}.pdb" + RANK=\$((RANK + 1)) + done + + # ── Version information ── + cat <<-END_VERSIONS > versions.yml + "${task.process}": + rfdiffusion3: \$(pip3 show rc-foundry 2>/dev/null | grep Version | cut -d' ' -f2 || echo "unknown") + biotite: \$(python3 -c 'import biotite; print(biotite.__version__)' 2>/dev/null || echo "unknown") + python: \$(python3 --version 2>&1 | sed 's/Python //g') + END_VERSIONS + """ + + stub: + """ + mkdir -p ${meta.id}_output/rfd3_raw + mkdir -p ${meta.id}_output/designs + + # Create stub PDB files + touch ${meta.id}_output/rfd3_raw/design_0.pdb + touch 
${meta.id}_output/rfd3_raw/design_1.pdb + touch ${meta.id}_output/designs/rank1_design_0.pdb + touch ${meta.id}_output/designs/rank2_design_1.pdb + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + rfdiffusion3: "stub" + python: \$(python3 --version 2>&1 | sed 's/Python //g') + END_VERSIONS + """ +} diff --git a/modules/local/split_proteinmpnn_sequences.nf b/modules/local/split_proteinmpnn_sequences.nf index 49e6fc5..74830f1 100644 --- a/modules/local/split_proteinmpnn_sequences.nf +++ b/modules/local/split_proteinmpnn_sequences.nf @@ -5,7 +5,7 @@ This process takes a multi-sequence FASTA file from ProteinMPNN and splits it into individual FASTA files, one per sequence. - All sequences are included (original Boltzgen sequence + MPNN-designed sequences). + All sequences are included (original Complexa sequence + MPNN-designed sequences). ---------------------------------------------------------------------------------------- */ diff --git a/nextflow.config b/nextflow.config index 7c37b74..b5179a6 100644 --- a/nextflow.config +++ b/nextflow.config @@ -17,25 +17,59 @@ params { // Input options input = null + // Test mode: run ONLY the design tool (no downstream ProteinMPNN / Boltz-2 / etc.) + test_design_only = false + // ======================================================================== - // Design and Boltzgen parameters + // Protein design tool selection // ======================================================================== - // IMPORTANT: The following parameters must be specified in your samplesheet: - // - protocol: Boltzgen protocol (protein-anything, peptide-anything, etc.) - // - num_designs: Number of intermediate designs to generate - // - budget: Number of designs in final diversity-optimized set + // Choose which generative model drives the design stage. 
+ // 'boltzgen' — BoltzGen (original, default) + // 'complexa' — Proteina-Complexa (flow-matching approach) + // 'rfdiffusion_v3' — RFdiffusion3 (all-atom diffusion, RosettaCommons) // - // This design ensures explicit per-sample control and eliminates ambiguity. - // See samplesheet schema and examples in assets/test_data/ for details. + // Each tool uses its own samplesheet columns and parameters (see below). + // Everything downstream (ProteinMPNN, Boltz-2, IPSAE, etc.) is shared. // ======================================================================== - - // Boltzgen advanced options - cache_dir = null // Cache directory for model weights (~6GB), defaults to ~/.cache - boltzgen_config = null // Optional: Path to custom Boltzgen config YAML to override defaults - steps = null // Optional: Comma-separated list of steps to run (e.g., 'filtering' to rerun only filtering) + protein_design_tool = 'boltzgen' // 'boltzgen', 'complexa', or 'rfdiffusion_v3' + + // ======================================================================== + // BoltzGen design parameters (used when protein_design_tool = 'boltzgen') + // ======================================================================== + // Samplesheet columns: sample_id, design_yaml, target_sequence, + // structure_files (optional), protocol, num_designs, budget, + // reuse (optional), target_msa (optional), target_template (optional) + // ======================================================================== + cache_dir = null // Cache directory for BoltzGen model weights (~6GB), defaults to ~/.cache + boltzgen_config = null // Optional: Path to custom BoltzGen config YAML to override defaults + steps = null // Optional: Comma-separated list of steps to run (e.g., 'filtering') + + // ======================================================================== + // Proteina-Complexa design parameters (used when protein_design_tool = 'complexa') + // 
======================================================================== + // Samplesheet columns: sample_id, target_pdb, pipeline_config, + // target_sequence, target_msa (optional), target_template (optional) + // ======================================================================== + complexa_ckpt_dir = null // Path to Complexa checkpoint directory (required for GPU inference) + complexa_container = '307946633589.dkr.ecr.eu-west-2.amazonaws.com/rashmi/proteina-complexa:latest' + complexa_search_algorithm = 'best-of-n' // Search algorithm: best-of-n, single-pass, beam-search, fk-steering, mcts + complexa_nsteps = 400 // Number of diffusion sampling steps (generation.args.nsteps) + complexa_replicas = 2 // Number of replicas for best-of-n search (generation.search.best_of_n.replicas) + complexa_batch_size = 16 // Dataloader batch size for generation (generation.dataloader.batch_size) + complexa_extra_args = '' // Additional Hydra overrides (e.g., '++seed=42 ++generation.args.guidance_w=2.0') + + // ======================================================================== + // RFdiffusion v3 design parameters (used when protein_design_tool = 'rfdiffusion_v3') + // ======================================================================== + // Samplesheet columns: sample_id, design_yaml, structure_files, + // num_designs, budget, target_msa (optional), target_sequence (optional), + // target_template (optional) + // ======================================================================== + rfdiffusion_v3_ckpt_dir = null // Path to RFdiffusion3 checkpoint dir (auto-downloaded if null) + rfdiffusion_v3_container = 'rosettacommons/foundry:latest' // Container with rfd3 CLI // ProteinMPNN sequence optimization options - run_proteinmpnn = true // Enable ProteinMPNN sequence optimization of Boltzgen designs (set to false to disable) + run_proteinmpnn = true // Enable ProteinMPNN sequence optimization of designed structures (set to false to disable) mpnn_sampling_temp = 
0.1 // Sampling temperature (0.1-0.3 recommended, lower = more conservative) mpnn_num_seq_per_target = 8 // Number of sequence variants to generate per structure mpnn_batch_size = 1 // Batch size for ProteinMPNN inference @@ -91,11 +125,10 @@ params { // GPU acceleration options // NOTE: The following processes support GPU acceleration: - // - BOLTZGEN_RUN: Requires GPU, provides significant speedup for protein design + // - PROTEINA_COMPLEXA_DESIGN: Requires GPU for flow-matching inference // - PROTEINMPNN_OPTIMIZE: Optional GPU support, accelerates sequence optimization - // - PROTENIX_REFOLD: Requires GPU, enables accurate multimer structure prediction + // - BOLTZ2_REFOLD: Requires GPU for structure prediction / refolding // - FOLDSEEK_SEARCH: Optional GPU support, provides 4-27x speedup for structure searches - // When running on GPU-enabled systems, these processes will automatically utilize GPUs // Ensure your compute environment has NVIDIA GPUs and Docker/Singularity GPU support enabled // Boilerplate options @@ -125,7 +158,7 @@ manifest { name = 'seqeralabs/nf-proteindesign' author = 'Florian Wuennemann' homePage = 'https://github.com/seqeralabs/nf-proteindesign' - description = 'Nextflow pipeline for Boltzgen protein design with parallel sample processing' + description = 'Nextflow pipeline for protein design with BoltzGen, Proteina-Complexa, or RFdiffusion v3 and parallel sample processing' mainScript = 'main.nf' nextflowVersion = '!>=23.04.0' version = '1.0.0' @@ -142,7 +175,7 @@ profiles { docker { docker.enabled = true podman.enabled = false - docker.runOptions = '-u $(id -u):$(id -g)' + docker.runOptions = '--gpus all' } singularity { @@ -170,4 +203,12 @@ profiles { test_design_protein { includeConfig 'conf/test_design_protein.config' } + + test_design_rfdiffusion_v3 { + includeConfig 'conf/test_design_rfdiffusion_v3.config' + } + + test_design_proteina_complexa { + includeConfig 'conf/test_design_proteina_complexa.config' + } } diff --git 
a/nextflow_schema.json b/nextflow_schema.json index c29507b..25ec20a 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -2,7 +2,7 @@ "$schema": "http://json-schema.org/draft-07/schema", "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/nextflow_schema.json", "title": "nf-proteindesign pipeline parameters", - "description": "Nextflow pipeline for Boltzgen protein design using pre-made design YAML specifications", + "description": "Nextflow pipeline for Proteina-Complexa protein design using pre-made design YAML specifications", "type": "object", "definitions": { "input_output_options": { @@ -22,7 +22,7 @@ "mimetype": "text/csv", "pattern": "^\\S+\\.csv$", "description": "Path to comma-separated samplesheet file.", - "help_text": "The samplesheet must contain: `sample_id`, `design_yaml`, `target_sequence`, and optionally: `structure_files`, `protocol`, `num_designs`, `budget`, `reuse`, `target_msa`, `target_template`\n\nSee schema file in assets/schema_input_design.json for detailed specifications.", + "help_text": "The samplesheet must contain: `sample_id`, `target_pdb`, `pipeline_config`, `target_sequence`, and optionally: `target_msa`, `target_template`.\n\nSee assets/schema_input_design.json for detailed specifications and assets/test_data/ for examples.", "fa_icon": "fas fa-file-csv" }, "outdir": { @@ -34,31 +34,87 @@ } } }, - "boltzgen_options": { - "title": "Boltzgen design parameters", + "complexa_options": { + "title": "Proteina-Complexa design parameters", "type": "object", - "description": "Core parameters for Boltzgen protein design execution.", + "description": "Parameters for the Proteina-Complexa generative design step. 
The pipeline config YAML and target PDB are specified per-sample in the samplesheet.", "default": "", "fa_icon": "fas fa-cogs", "properties": { - "cache_dir": { + "complexa_ckpt_dir": { "type": "string", "format": "directory-path", - "description": "Cache directory for model weights (~6GB).", - "help_text": "If not specified, defaults to ~/.cache. The cache directory stores downloaded model weights for faster execution.", + "description": "Path to Complexa checkpoint directory containing model weights.", + "help_text": "Required for GPU inference. Download with: complexa download --complexa-all", "fa_icon": "fas fa-database" }, - "boltzgen_config": { + "complexa_container": { "type": "string", - "format": "file-path", - "description": "Optional path to custom Boltzgen config YAML to override defaults.", - "fa_icon": "fas fa-file-code" + "default": "307946633589.dkr.ecr.eu-west-2.amazonaws.com/rashmi/proteina-complexa:latest", + "description": "Container image for Proteina-Complexa.", + "help_text": "Private ECR image. Compute environment needs ECR pull permissions for account 307946633589.", + "fa_icon": "fas fa-docker" + }, + "complexa_search_algorithm": { + "type": "string", + "default": "best-of-n", + "description": "Search algorithm for design generation.", + "help_text": "Options: best-of-n, single-pass, beam-search, fk-steering, mcts", + "enum": ["best-of-n", "single-pass", "beam-search", "fk-steering", "mcts"], + "fa_icon": "fas fa-search" + }, + "complexa_nsteps": { + "type": "integer", + "default": 400, + "description": "Number of diffusion sampling steps.", + "help_text": "Maps to generation.args.nsteps. Higher values may improve quality but increase runtime.", + "fa_icon": "fas fa-shoe-prints", + "minimum": 1 + }, + "complexa_replicas": { + "type": "integer", + "default": 2, + "description": "Number of replicas for best-of-n search.", + "help_text": "Maps to generation.search.best_of_n.replicas. 
Total samples = nsamples × replicas × batch_size.", + "fa_icon": "fas fa-clone", + "minimum": 1 + }, + "complexa_batch_size": { + "type": "integer", + "default": 16, + "description": "Dataloader batch size for generation.", + "help_text": "Maps to generation.dataloader.batch_size. Reduce if running out of GPU memory.", + "fa_icon": "fas fa-layer-group", + "minimum": 1 + }, + "complexa_extra_args": { + "type": "string", + "default": "", + "description": "Additional Hydra overrides passed to complexa design.", + "help_text": "Space-separated Hydra overrides, e.g., '++seed=42 ++generation.args.guidance_w=2.0'", + "fa_icon": "fas fa-terminal" + } + } + }, + "rfdiffusion_v3_options": { + "title": "RFdiffusion v3 design parameters", + "type": "object", + "description": "Parameters for the RFdiffusion3 all-atom diffusion design step. Design YAML and structure files are specified per-sample in the samplesheet.", + "default": "", + "fa_icon": "fas fa-atom", + "properties": { + "rfdiffusion_v3_ckpt_dir": { + "type": "string", + "format": "directory-path", + "description": "Path to RFdiffusion3 checkpoint directory.", + "help_text": "If null, checkpoints are auto-downloaded by the Foundry framework to ~/.foundry/checkpoints.", + "fa_icon": "fas fa-database" }, - "steps": { + "rfdiffusion_v3_container": { "type": "string", - "description": "Optional comma-separated list of steps to run (e.g., 'filtering' to rerun only filtering).", - "help_text": "Advanced option for rerunning specific Boltzgen pipeline steps.", - "fa_icon": "fas fa-tasks" + "default": "rosettacommons/foundry:latest", + "description": "Container image for RFdiffusion3 (rfd3 CLI via RosettaCommons Foundry).", + "fa_icon": "fas fa-docker" } } }, @@ -72,7 +128,7 @@ "run_proteinmpnn": { "type": "boolean", "default": false, - "description": "Enable ProteinMPNN sequence optimization of Boltzgen designs.", + "description": "Enable ProteinMPNN sequence optimization of Complexa designs.", "help_text": "ProteinMPNN can 
further optimize sequences for the designed structures.", "fa_icon": "fas fa-toggle-on" }, @@ -197,7 +253,7 @@ "run_ipsae": { "type": "boolean", "default": false, - "description": "Enable IPSAE scoring of Boltzgen predictions.", + "description": "Enable IPSAE scoring of Complexa predictions.", "help_text": "IPSAE evaluates protein-protein interface quality using predicted aligned error (PAE) and structural distances.", "fa_icon": "fas fa-toggle-on" }, @@ -293,7 +349,7 @@ "type": "boolean", "default": false, "description": "Enable consolidated metrics report generation.", - "help_text": "Generates a comprehensive report combining Boltzgen, ProteinMPNN, IPSAE, and PRODIGY results.", + "help_text": "Generates a comprehensive report combining Complexa, ProteinMPNN, IPSAE, and PRODIGY results.", "fa_icon": "fas fa-toggle-on" }, "report_top_n": { @@ -399,7 +455,10 @@ "$ref": "#/definitions/input_output_options" }, { - "$ref": "#/definitions/boltzgen_options" + "$ref": "#/definitions/complexa_options" + }, + { + "$ref": "#/definitions/rfdiffusion_v3_options" }, { "$ref": "#/definitions/proteinmpnn_options" diff --git a/run.log b/run.log deleted file mode 100644 index 6cacf4b..0000000 --- a/run.log +++ /dev/null @@ -1,7 +0,0 @@ -Nextflow 25.10.0 is available - Please consider updating your version to it - - N E X T F L O W ~ version 25.04.7 - -Launching `./main.nf` [grave_volta] DSL2 - revision: c97276c39a - -ERROR: Please provide a samplesheet with --input diff --git a/scripts/download_model_weights.sh b/scripts/download_model_weights.sh new file mode 100755 index 0000000..9022f0f --- /dev/null +++ b/scripts/download_model_weights.sh @@ -0,0 +1,171 @@ +#!/usr/bin/env bash +# ============================================================================= +# download_model_weights.sh +# Pre-download all model weights required by nf-proteindesign so the pipeline +# can run fully offline (no auto-download during execution). 
+# +# Usage: +# bash download_model_weights.sh [--boltz2] [--rfdiffusion] [--all] +# +# Options: +# --boltz2 Download Boltz-2 model weights + ligand CCD database (~6 GB) +# --rfdiffusion Download RFdiffusion3 (Foundry) checkpoints (~4 GB) +# --all Download everything (default when no flag is given) +# +# After running this script, pass the cache paths to your pipeline: +# nextflow run main.nf \ +# --protein_design_tool rfdiffusion_v3 \ +# --rfdiffusion_v3_ckpt_dir /path/to/foundry_checkpoints \ +# --boltz2_cache /path/to/boltz2_cache \ +# ... +# +# Requirements: +# - Docker (for Boltz-2 and RFdiffusion3 downloads via their own containers) +# - ~12 GB free disk space (both caches combined) +# ============================================================================= + +set -euo pipefail + +# ── Defaults ────────────────────────────────────────────────────────────────── +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PIPELINE_DIR="$(dirname "${SCRIPT_DIR}")" + +BOLTZ2_CACHE_DIR="${BOLTZ2_CACHE_DIR:-${PIPELINE_DIR}/model_cache/boltz2}" +FOUNDRY_CKPT_DIR="${FOUNDRY_CKPT_DIR:-${PIPELINE_DIR}/model_cache/foundry_checkpoints}" + +DO_BOLTZ2=false +DO_RFDIFFUSION=false + +# ── Argument parsing ────────────────────────────────────────────────────────── +if [[ $# -eq 0 ]]; then + DO_BOLTZ2=true + DO_RFDIFFUSION=true +fi + +for arg in "$@"; do + case "${arg}" in + --boltz2) DO_BOLTZ2=true ;; + --rfdiffusion) DO_RFDIFFUSION=true ;; + --all) DO_BOLTZ2=true; DO_RFDIFFUSION=true ;; + --help|-h) + sed -n '2,20p' "$0" | sed 's/^# //; s/^#//' + exit 0 + ;; + *) + echo "Unknown option: ${arg}" >&2 + echo "Run with --help for usage." >&2 + exit 1 + ;; + esac +done + +# ── Helper ──────────────────────────────────────────────────────────────────── +log() { echo "[$(date '+%H:%M:%S')] $*"; } +hr() { echo "────────────────────────────────────────────────────────────"; } + +require_docker() { + if ! 
command -v docker &>/dev/null; then + echo "ERROR: Docker is required but not installed." >&2 + exit 1 + fi + if ! docker info &>/dev/null; then + echo "ERROR: Docker daemon is not running or you lack permission." >&2 + exit 1 + fi +} + +# ============================================================================= +# 1. BOLTZ-2 (giosbiostructures/boltz2:latest) +# Weights: boltz2_conf.ckpt (~2.2 GB), boltz2_aff.ckpt (~2.0 GB) +# Ligand DB: mols/ CCD database (~1.8 GB unpacked) +# Total: ~6 GB +# ============================================================================= +download_boltz2() { + hr + log "Downloading Boltz-2 model weights → ${BOLTZ2_CACHE_DIR}" + hr + + mkdir -p "${BOLTZ2_CACHE_DIR}" + + # Pull the container image first (cached on re-runs) + log "Pulling container: giosbiostructures/boltz2:latest" + docker pull giosbiostructures/boltz2:latest + + # Run `boltz download --cache` to fetch all weights into the target directory. + log "Seeding Boltz-2 cache (this downloads ~6 GB, please wait) ..." + docker run --rm \ + -v "${BOLTZ2_CACHE_DIR}:/boltz_cache" \ + giosbiostructures/boltz2:latest \ + boltz download --cache /boltz_cache + + log "✓ Boltz-2 weights downloaded to: ${BOLTZ2_CACHE_DIR}" + log " Contents:" + ls -lh "${BOLTZ2_CACHE_DIR}" | sed 's/^/ /' + echo "" + log " Pass this to the pipeline with:" + log " --boltz2_cache ${BOLTZ2_CACHE_DIR}" +} + +# ============================================================================= +# 2. 
RFDIFFUSION3 / FOUNDRY (rosettacommons/foundry:latest) +# Weights: rfd3 model checkpoints (~4 GB) +# Downloaded via `rfd3 download` inside the container +# ============================================================================= +download_rfdiffusion() { + hr + log "Downloading RFdiffusion3 (Foundry) checkpoints → ${FOUNDRY_CKPT_DIR}" + hr + + mkdir -p "${FOUNDRY_CKPT_DIR}" + + # Pull the container image first + log "Pulling container: rosettacommons/foundry:latest" + docker pull rosettacommons/foundry:latest + + # Use the built-in `rfd3 download` command to fetch checkpoints. + # FOUNDRY_CHECKPOINT_DIRS tells the CLI where to save them. + log "Downloading RFdiffusion3 checkpoints (this downloads ~4 GB, please wait) ..." + docker run --rm \ + -v "${FOUNDRY_CKPT_DIR}:/foundry_ckpts" \ + -e FOUNDRY_CHECKPOINT_DIRS="/foundry_ckpts" \ + rosettacommons/foundry:latest \ + rfd3 download + + log "✓ RFdiffusion3 checkpoints downloaded to: ${FOUNDRY_CKPT_DIR}" + log " Contents:" + ls -lh "${FOUNDRY_CKPT_DIR}" | sed 's/^/ /' + echo "" + log " Pass this to the pipeline with:" + log " --rfdiffusion_v3_ckpt_dir ${FOUNDRY_CKPT_DIR}" +} + +# ============================================================================= +# Main +# ============================================================================= +hr +log "nf-proteindesign — model weight downloader" +log "Cache root: ${PIPELINE_DIR}/model_cache/" +hr + +require_docker + +[[ "${DO_BOLTZ2}" == "true" ]] && download_boltz2 +[[ "${DO_RFDIFFUSION}" == "true" ]] && download_rfdiffusion + +hr +log "All downloads complete!" 
+hr +echo "" +echo "Run the pipeline with pre-seeded caches:" +echo "" +echo " nextflow run main.nf \\" +if [[ "${DO_RFDIFFUSION}" == "true" ]]; then +echo " --protein_design_tool rfdiffusion_v3 \\" +echo " --rfdiffusion_v3_ckpt_dir ${FOUNDRY_CKPT_DIR} \\" +fi +if [[ "${DO_BOLTZ2}" == "true" ]]; then +echo " --boltz2_cache ${BOLTZ2_CACHE_DIR} \\" +fi +echo " --input your_samplesheet.csv \\" +echo " --outdir ./results" +echo "" diff --git a/workflows/protein_design.nf b/workflows/protein_design.nf index 9365ce5..9829a9f 100644 --- a/workflows/protein_design.nf +++ b/workflows/protein_design.nf @@ -1,99 +1,124 @@ /* ======================================================================================== - PROTEIN_DESIGN: Workflow for protein design using YAML specifications + PROTEIN_DESIGN: Workflow for protein binder design ======================================================================================== - This workflow uses pre-made design YAML files for protein design with Boltzgen - and optional analysis modules. 
+ Supports three design backends: + - boltzgen (diffusion-based inference, outputs CIF → converted to PDB) + - proteina-complexa (generate → filter → evaluate → analyze, outputs PDB) + - rfdiffusion_v3 (all-atom diffusion via RosettaCommons Foundry, outputs PDB) + + All converge into a shared downstream pipeline: + ProteinMPNN → Boltz-2 refold → IPSAE / PRODIGY / Foldseek → Consolidation ---------------------------------------------------------------------------------------- */ -include { BOLTZGEN_RUN } from '../modules/local/boltzgen_run' -include { CONVERT_CIF_TO_PDB } from '../modules/local/convert_cif_to_pdb' -include { PROTEINMPNN_OPTIMIZE } from '../modules/local/proteinmpnn_optimize' +include { PROTEINA_COMPLEXA_DESIGN } from '../modules/local/proteina_complexa_design' +include { BOLTZGEN_RUN } from '../modules/local/boltzgen_run' +include { RFDIFFUSION_V3_RUN } from '../modules/local/rfdiffusion_v3_run' +include { CONVERT_CIF_TO_PDB } from '../modules/local/convert_cif_to_pdb' +include { PROTEINMPNN_OPTIMIZE } from '../modules/local/proteinmpnn_optimize' include { PREPARE_BOLTZ2_SEQUENCES } from '../modules/local/prepare_boltz2_sequences' -include { BOLTZ2_REFOLD } from '../modules/local/boltz2_refold' -include { IPSAE_CALCULATE } from '../modules/local/ipsae_calculate' -include { PRODIGY_PREDICT } from '../modules/local/prodigy_predict' -include { FOLDSEEK_SEARCH } from '../modules/local/foldseek_search' -include { CONSOLIDATE_METRICS } from '../modules/local/consolidate_metrics' +include { BOLTZ2_REFOLD } from '../modules/local/boltz2_refold' +include { IPSAE_CALCULATE } from '../modules/local/ipsae_calculate' +include { PRODIGY_PREDICT } from '../modules/local/prodigy_predict' +include { FOLDSEEK_SEARCH } from '../modules/local/foldseek_search' +include { CONSOLIDATE_METRICS } from '../modules/local/consolidate_metrics' workflow PROTEIN_DESIGN { take: - ch_input // channel: [meta, design_yaml, structure_files, target_msa, target_sequence, target_template, 
boltzgen_output_dir]
-    ch_cache // channel: path to cache directory or EMPTY_CACHE placeholder
-    ch_boltz2_cache // channel: path to Boltz-2 cache directory or EMPTY_BOLTZ2_CACHE placeholder
+    ch_input // channel: tool-dependent shape (see main.nf)
+    // boltzgen : [meta, design_yaml, structure_files, target_sequence]
+    // complexa : [meta, target_pdb, pipeline_config, target_sequence]
+    // rfdiffusion_v3 : [meta, design_yaml, structure_files, target_sequence]
+    ch_design_cache // channel: checkpoint / cache directory (or EMPTY placeholder)
+    ch_boltz2_cache // channel: Boltz-2 cache directory (or EMPTY placeholder)
     main:
     // ========================================================================
-    // Run Boltzgen on design YAMLs OR use pre-computed results
+    // STAGE 1: Protein design — generate structures
     // ========================================================================
+    // All paths produce:
+    // ch_design_results : [meta, results_dir] — full output directory
+    // ch_design_pdbs : [meta, pdb_files] — PDB files for downstream
+
+    if (params.protein_design_tool == 'boltzgen') {
+        // ── BoltzGen path ──────────────────────────────────────────────
+        ch_boltzgen_input = ch_input
+            .map { meta, design_yaml, structure_files, target_sequence ->
+                [meta, design_yaml, structure_files]
+            }
-    // Split input channel into two branches: with and without pre-computed Boltzgen results
-    ch_input
-        .branch { meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir ->
-            with_precomputed: boltzgen_output_dir != null
-                return [meta, boltzgen_output_dir]
-            needs_boltzgen: boltzgen_output_dir == null
-                return [meta, design_yaml, structure_files]
-        }
-        .set { ch_branched }
+        BOLTZGEN_RUN(ch_boltzgen_input, ch_design_cache)
-    // Run Boltzgen only for samples without pre-computed results
-    BOLTZGEN_RUN(ch_branched.needs_boltzgen, ch_cache)
-
-    // Create channel from pre-computed Boltzgen output directories
-    ch_precomputed_boltzgen =
ch_branched.with_precomputed
-        .map { meta, boltzgen_dir ->
-            // Stage the pre-computed directory as if it came from BOLTZGEN_RUN
-            [meta, boltzgen_dir]
-        }
-
-    // Combine Boltzgen results from both sources (newly run + pre-computed)
-    ch_boltzgen_results = BOLTZGEN_RUN.out.results
-        .mix(ch_precomputed_boltzgen)
-
-    // Extract budget_design_cifs from both sources for downstream processing
-    ch_budget_cifs_new = BOLTZGEN_RUN.out.budget_design_cifs
-
-    // For precomputed results, extract CIF files from the precomputed directory
-    ch_budget_cifs_precomputed = ch_branched.with_precomputed
-        .map { meta, boltzgen_dir ->
-            def budget_cifs = file("${boltzgen_dir}/final_ranked_designs/final_*_designs/*.cif")
-            [meta, budget_cifs]
-        }
+        ch_design_results = BOLTZGEN_RUN.out.results
+
+        // BoltzGen outputs CIF files — convert to PDB for downstream modules
+        CONVERT_CIF_TO_PDB(BOLTZGEN_RUN.out.budget_design_cifs)
+
+        ch_design_pdbs = CONVERT_CIF_TO_PDB.out.pdb_files_all
+
+    } else if (params.protein_design_tool == 'complexa') {
+        // ── Complexa path ──────────────────────────────────────────────
+        ch_complexa_input = ch_input
+            .map { meta, target_pdb, pipeline_config, target_sequence ->
+                [meta, target_pdb, pipeline_config]
+            }
+
+        PROTEINA_COMPLEXA_DESIGN(ch_complexa_input, ch_design_cache)
+
+        ch_design_results = PROTEINA_COMPLEXA_DESIGN.out.results
+        ch_design_pdbs = PROTEINA_COMPLEXA_DESIGN.out.design_pdbs
+
+    } else {
+        // ── RFdiffusion v3 path ────────────────────────────────────────
+        // RFdiffusion3 requires PDB input; convert CIF structures upstream
+        ch_rfd_structures = ch_input
+            .map { meta, design_yaml, structure_files, target_sequence ->
+                [meta, structure_files]
+            }
+
+        CONVERT_CIF_TO_PDB(ch_rfd_structures)
+
+        // Rejoin converted PDB files with design YAML for rfd3 input
+        ch_rfdiffusion_input = ch_input
+            .map { meta, design_yaml, structure_files, target_sequence ->
+                [meta.id, meta, design_yaml]
+            }
+            .join(
+                CONVERT_CIF_TO_PDB.out.pdb_files_all
+                    .map {
meta, pdbs -> [meta.id, pdbs] }
+            )
+            .map { id, meta, design_yaml, pdbs ->
+                [meta, design_yaml, pdbs]
+            }
+
+        RFDIFFUSION_V3_RUN(ch_rfdiffusion_input, ch_design_cache)
+
+        ch_design_results = RFDIFFUSION_V3_RUN.out.results
+        ch_design_pdbs = RFDIFFUSION_V3_RUN.out.design_pdbs
+    }
-    ch_budget_design_cifs = ch_budget_cifs_new
-        .mix(ch_budget_cifs_precomputed)
-
     // ========================================================================
-    // ProteinMPNN: Optimize sequences for designed structures
+    // STAGE 2: ProteinMPNN — Optimize sequences for designed structures
     // ========================================================================
     if (params.run_proteinmpnn) {
-        // Step 1: Convert CIF structures to PDB format (ProteinMPNN requires PDB)
-        // Use budget_design_cifs which contains ONLY the budget designs (e.g., 2 structures if budget=2)
-        // NOT all designs from results directory
-        // Use the combined channel that includes both newly computed and pre-computed Boltzgen results
-        CONVERT_CIF_TO_PDB(ch_budget_design_cifs)
-
-        // Step 2: Parallelize ProteinMPNN - run separately for each budget design
-        // Use flatMap to create individual tasks per PDB file (one per budget iteration)
-        ch_pdb_per_design = CONVERT_CIF_TO_PDB.out.pdb_files_all
+        // Parallelize ProteinMPNN — run separately for each design PDB
+        // Complexa outputs PDB files directly; no CIF→PDB conversion needed
+        ch_pdb_per_design = ch_design_pdbs
             .flatMap { meta, pdb_files ->
-                // Convert to list if single file and create defensive copy
                 def pdb_list = pdb_files instanceof List ?
new ArrayList(pdb_files) : [pdb_files]
-                // Create a separate channel entry for each PDB file
                 pdb_list.collect { pdb_file ->
-                    // Extract rank number from filename (e.g., "rank1_2VSM_protein_design_1" -> "1")
-                    def rank_num = pdb_file.baseName.replaceAll(/^rank(\d+)_.*/, '$1')
+                    // Use the file's position in the list as the design index for tracking
+                    // Complexa naming: job_{job_id}_n_{length}_id_{idx}_{tag}.pdb
+                    def design_idx = pdb_list.indexOf(pdb_file)
-                    // Simplified naming: {sample}_r{rank}
                     def design_meta = [
-                        id: "${meta.id}_r${rank_num}",
+                        id: "${meta.id}_d${design_idx}",
                         parent_id: meta.id,
-                        rank_num: rank_num,
-                        design_name: pdb_file.baseName // Keep original for reference
+                        rank_num: "${design_idx}",
+                        design_name: pdb_file.baseName
                     ]
                     [design_meta, pdb_file]
@@ -113,9 +138,11 @@ workflow PROTEIN_DESIGN {
         // 2. Process target sequence FASTA (from samplesheet) to clean format
         // ====================================================================
         if (params.run_boltz2_refold) {
-            // Get target sequence FASTA from samplesheet
+            // Get target sequence FASTA from the input channel (last element for all tools)
             ch_target_fasta = ch_input
-                .map { meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir ->
+                .map { tuple ->
+                    def meta = tuple[0]
+                    def target_sequence = tuple[-1]
                     [meta.id, target_sequence]
                 }
 
@@ -142,9 +169,16 @@ workflow PROTEIN_DESIGN {
             // Prepare Target MSA from Samplesheet
             // ================================================================
             // Use actual placeholder files in assets/ for k8s compatibility (avoids staging non-existent files)
+            // Resolve relative paths against projectDir so the pipeline works both locally and on k8s/Platform
             ch_target_msa = ch_input
-                .map { meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir ->
-                    def msa_file = target_msa ?: file("${projectDir}/assets/NO_MSA", checkIfExists: true)
+                .map { tuple ->
+                    def meta = tuple[0]
+                    def
msa_path = meta.target_msa
+                        ? (meta.target_msa.startsWith('/') || meta.target_msa.startsWith('s3://') || meta.target_msa.startsWith('gs://') || meta.target_msa.startsWith('az://')
+                            ? meta.target_msa
+                            : "${projectDir}/${meta.target_msa}")
+                        : "${projectDir}/assets/NO_MSA"
+                    def msa_file = file(msa_path, checkIfExists: true)
                     [meta.id, msa_file]
                 }
 
@@ -152,8 +186,14 @@ workflow PROTEIN_DESIGN {
             // Prepare Target Template from Samplesheet
             // ================================================================
             ch_target_template = ch_input
-                .map { meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir ->
-                    def template_file = target_template ?: file("${projectDir}/assets/NO_TEMPLATE", checkIfExists: true)
+                .map { tuple ->
+                    def meta = tuple[0]
+                    def template_path = meta.target_template
+                        ? (meta.target_template.startsWith('/') || meta.target_template.startsWith('s3://') || meta.target_template.startsWith('gs://') || meta.target_template.startsWith('az://')
+                            ? meta.target_template
+                            : "${projectDir}/${meta.target_template}")
+                        : "${projectDir}/assets/NO_TEMPLATE"
+                    def template_file = file(template_path, checkIfExists: true)
                     [meta.id, template_file]
                 }
 
@@ -208,16 +248,15 @@ workflow PROTEIN_DESIGN {
             BOLTZ2_REFOLD(ch_boltz2_input, ch_boltz2_cache)
         }
     } else {
-        // Use Boltzgen outputs directly if ProteinMPNN is disabled
-        // Use the combined channel that includes both newly computed and pre-computed results
-        ch_final_designs_for_analysis = ch_boltzgen_results
+        // Use design outputs directly if ProteinMPNN is disabled
+        ch_final_designs_for_analysis = ch_design_results
     }
 
     // ========================================================================
     // OPTIONAL: IPSAE scoring if enabled
     // ========================================================================
     // NOTE: IPSAE requires NPZ confidence files. We now support both:
-    // 1. Boltzgen budget designs (native NPZ output)
+    // 1.
Complexa budget designs (native NPZ output)
     // 2. Boltz-2 refolded structures (native NPZ output - no conversion needed!)
     if (params.run_ipsae) {
         // Prepare IPSAE script as a value channel (reusable across all tasks)
@@ -229,47 +268,85 @@ workflow PROTEIN_DESIGN {
         if (params.run_proteinmpnn && params.run_boltz2_refold) {
             // Get CIF and NPZ pairs from Boltz-2 for IPSAE
             // Use combine instead of join for more robust matching in k8s/cloud
+            // Extract pLDDT NPZ files from the predictions directory
+            // (plddt files live inside the *_boltz2_output dir alongside CIF/PAE/confidence)
+            ch_boltz2_plddt = BOLTZ2_REFOLD.out.predictions
+                .map { meta, pred_dir ->
+                    def plddt_files = pred_dir.listFiles().findAll { f -> f.name.startsWith('plddt_') && f.name.endsWith('.npz') }
+                    plddt_files ? [meta, plddt_files] : null
+                }
+                .filter { v -> v != null }
+
+            // Join structures, PAE, confidence JSON, and pLDDT NPZ by meta key
+            // iPSAE needs all four: CIF for coordinates, PAE for error matrix,
+            // confidence JSON for iptm values, pLDDT NPZ for per-residue confidence
             ch_ipsae_input = BOLTZ2_REFOLD.out.structures
                 .combine(BOLTZ2_REFOLD.out.pae_npz, by: 0)
-                .flatMap { meta, cif_files, npz_files ->
+                .combine(BOLTZ2_REFOLD.out.confidence, by: 0)
+                .combine(ch_boltz2_plddt, by: 0)
+                .flatMap { meta, cif_files, pae_files, conf_files, plddt_files ->
                     // Convert to lists if single files
                     def cif_list = cif_files instanceof List ? cif_files : [cif_files]
-                    def npz_list = npz_files instanceof List ? npz_files : [npz_files]
+                    def pae_list = pae_files instanceof List ? pae_files : [pae_files]
+                    def conf_list = conf_files instanceof List ? conf_files : [conf_files]
+                    def plddt_list = plddt_files instanceof List ?
plddt_files : [plddt_files]
                     // Filter to only model_0 (best model) - use flexible matching
-                    def model0_cifs = cif_list.findAll { it.name.contains('model_0') && it.name.endsWith('.cif') }
-                    def model0_npzs = npz_list.findAll { it.name.contains('model_0') }
+                    def model0_cifs = cif_list.findAll { v -> v.name.contains('model_0') && v.name.endsWith('.cif') }
+                    def model0_paes = pae_list.findAll { v -> v.name.contains('model_0') }
+                    def model0_confs = conf_list.findAll { v -> v.name.contains('model_0') }
+                    def model0_plddts = plddt_list.findAll { v -> v.name.contains('model_0') }
                     // If no model_0 files found, use all files (fallback for different naming)
                     if (model0_cifs.isEmpty()) {
-                        model0_cifs = cif_list.findAll { it.name.endsWith('.cif') }
-                    }
-                    if (model0_npzs.isEmpty()) {
-                        model0_npzs = npz_list
+                        model0_cifs = cif_list.findAll { v -> v.name.endsWith('.cif') }
                     }
-
-                    // Create a map of NPZ files by normalized base name
-                    def npz_map = [:]
-                    model0_npzs.each { npz_file ->
-                        // Normalize: remove pae_ prefix and _model_X suffix for matching
-                        def base_name = npz_file.baseName
+                    if (model0_paes.isEmpty()) { model0_paes = pae_list }
+                    if (model0_confs.isEmpty()) { model0_confs = conf_list }
+                    if (model0_plddts.isEmpty()) { model0_plddts = plddt_list }
+
+                    // Create maps by normalized base name for matching
+                    def pae_map = [:]
+                    model0_paes.each { f ->
+                        def base_name = f.baseName
                             .replaceAll(/^pae_/, '')
                             .replaceAll(/_model_\d+$/, '')
-                        npz_map[base_name] = npz_file
+                        pae_map[base_name] = f
+                    }
+                    def conf_map = [:]
+                    model0_confs.each { f ->
+                        def base_name = f.baseName
+                            .replaceAll(/^confidence_/, '')
+                            .replaceAll(/_model_\d+$/, '')
+                        conf_map[base_name] = f
+                    }
+                    def plddt_map = [:]
+                    model0_plddts.each { f ->
+                        def base_name = f.baseName
+                            .replaceAll(/^plddt_/, '')
+                            .replaceAll(/_model_\d+$/, '')
+                        plddt_map[base_name] = f
                     }
-                    // Match CIF files with their NPZ files
+                    // Match CIF files with their companion files
                     model0_cifs.collect { cif_file ->
-                        // Normalize
CIF name for matching
                         def base_name = cif_file.baseName.replaceAll(/_model_\d+$/, '')
-                        def npz_file = npz_map[base_name]
+                        def pae_file = pae_map[base_name]
+                        def conf_file = conf_map[base_name]
+                        def plddt_file = plddt_map[base_name]
-                        // If exact match fails, try first NPZ file as fallback
-                        if (!npz_file && model0_npzs.size() == 1 && model0_cifs.size() == 1) {
-                            npz_file = model0_npzs[0]
+                        // Fallback: if only one file of each type, use it
+                        if (!pae_file && model0_paes.size() == 1 && model0_cifs.size() == 1) {
+                            pae_file = model0_paes[0]
+                        }
+                        if (!conf_file && model0_confs.size() == 1 && model0_cifs.size() == 1) {
+                            conf_file = model0_confs[0]
+                        }
+                        if (!plddt_file && model0_plddts.size() == 1 && model0_cifs.size() == 1) {
+                            plddt_file = model0_plddts[0]
                         }
-                        if (npz_file) {
+                        if (pae_file && conf_file && plddt_file) {
                             def ipsae_meta = [
                                 id: meta.id,
                                 parent_id: meta.parent_id,
@@ -277,12 +354,12 @@ workflow PROTEIN_DESIGN {
                                 seq_num: meta.seq_num,
                                 source: "boltz2"
                             ]
-                            [ipsae_meta, npz_file, cif_file]
+                            [ipsae_meta, pae_file, cif_file, conf_file, plddt_file]
                         } else {
-                            log.warn "⚠️ No matching NPZ file found for ${cif_file.name} (available: ${model0_npzs*.name})"
+                            log.warn "⚠️ Missing companion files for ${cif_file.name}: PAE=${pae_file?.name}, confidence=${conf_file?.name}, pLDDT=${plddt_file?.name}"
                             null
                         }
-                    }.findAll { it != null }
+                    }.findAll { v -> v != null }
                 }
 
         // Run IPSAE calculation
@@ -337,7 +414,7 @@ workflow PROTEIN_DESIGN {
     // ========================================================================
     // OPTIONAL: Foldseek structural similarity search if enabled
     // ========================================================================
-    // Search for structural homologs of both Boltzgen and Protenix structures
+    // Search for structural homologs of both Complexa and Boltz-2 structures
     // in the AlphaFold database (or other specified database)
     if (params.run_foldseek) {
         // Validate and prepare database channel
@@ -453,26 +530,26 @@ workflow
PROTEIN_DESIGN {
     }
 
     emit:
-    // Boltzgen outputs (combined from both newly computed and pre-computed sources)
-    boltzgen_results = ch_boltzgen_results
-    final_designs = ch_budget_design_cifs
-
+    // Design outputs (generic — works for all tools)
+    design_results = ch_design_results
+    design_pdbs = ch_design_pdbs
+
     // ProteinMPNN outputs (will be empty if not run)
     mpnn_optimized = params.run_proteinmpnn ? PROTEINMPNN_OPTIMIZE.out.optimized_designs : Channel.empty()
     mpnn_sequences = params.run_proteinmpnn ? PROTEINMPNN_OPTIMIZE.out.sequences : Channel.empty()
-    mpnn_scores = params.run_proteinmpnn ? PROTEINMPNN_OPTIMIZE.out.scores : Channel.empty()
-
+    mpnn_scores = params.run_proteinmpnn ? PROTEINMPNN_OPTIMIZE.out.scores : Channel.empty()
+
     // Boltz-2 refolding outputs (will be empty if not run)
-    boltz2_structures = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.structures : Channel.empty()
-    boltz2_confidence = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.confidence : Channel.empty()
-    boltz2_pae_npz = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.pae_npz : Channel.empty()
-    boltz2_affinity = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.affinity : Channel.empty()
-
+    boltz2_structures = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.structures : Channel.empty()
+    boltz2_confidence = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.confidence : Channel.empty()
+    boltz2_pae_npz = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.pae_npz : Channel.empty()
+    boltz2_affinity = (params.run_proteinmpnn && params.run_boltz2_refold) ? BOLTZ2_REFOLD.out.affinity : Channel.empty()
+
     // Optional analysis outputs (will be empty if not run)
     foldseek_results = (params.run_foldseek && params.run_proteinmpnn && params.run_boltz2_refold) ?
FOLDSEEK_SEARCH.out.results : Channel.empty()
     foldseek_summary = (params.run_foldseek && params.run_proteinmpnn && params.run_boltz2_refold) ? FOLDSEEK_SEARCH.out.summary : Channel.empty()
 
     // Consolidation outputs (will be empty if not run)
     metrics_summary = params.run_consolidation ? CONSOLIDATE_METRICS.out.summary_csv : Channel.empty()
-    metrics_report = params.run_consolidation ? CONSOLIDATE_METRICS.out.report_html : Channel.empty()
+    metrics_report = params.run_consolidation ? CONSOLIDATE_METRICS.out.report_html : Channel.empty()
 }