seqeralabs · rnaidu-seqera · Apr 24, 2026 · Apr 28, 2026
diff --git a/.gitignore b/.gitignore
@@ -28,6 +28,10 @@ Thumbs.db
 # Testing
 test_output/
 test_results/
+results_test_*/
+
+# Claude Code
+.claude/
 
 # Boltzgen cache
 .cache/

diff --git a/PROTEINA_COMPLEXA_MIGRATION.md b/PROTEINA_COMPLEXA_MIGRATION.md
diff --git a/README.md b/README.md
@@ -2,31 +2,34 @@
 
 > ⚠️ **IMPORTANT**: This pipeline was developed by Seqera as a proof of principle using Seqera AI. It demonstrates the capabilities of AI-assisted bioinformatics pipeline development but should be thoroughly validated before use in production environments.
 
-A Nextflow pipeline for AI-powered protein design using Boltzgen to design protein binders, nanobodies, and peptides.
+A Nextflow pipeline for AI-powered protein design supporting two generative backends — **BoltzGen** (default) and **Proteina-Complexa** — to design protein binders, nanobodies, and peptides.
 
 ## 📋 Overview
 
-This pipeline automates the process of designing novel protein binders using Boltzgen and provides comprehensive analysis through optional modules:
+This pipeline automates the process of designing novel protein binders and provides comprehensive analysis through optional downstream modules:
 
-- 🎯 **Boltzgen Design**: Generate protein, nanobody, or peptide binders for target structures
+- 🎯 **BoltzGen** (default): Flow-matching generative model for protein design using design YAML specifications
+- 🏗️ **Proteina-Complexa**: Generative diffusion model for protein design using pipeline config YAMLs
 - 🧬 **ProteinMPNN**: Optimize sequences for improved stability and expression
 - 🔄 **Boltz-2 Refolding**: Validate designs through structure prediction
 - 📊 **IPSAE**: Score protein-protein interface quality
 - ⚡ **PRODIGY**: Predict binding affinity
 - 🔍 **Foldseek**: Search structural databases for similar designs
 - 📈 **Metrics Consolidation**: Generate comprehensive analysis reports
 
+Both design backends converge into the same downstream pipeline (ProteinMPNN → Boltz-2 → Analysis → Consolidation).
+
 ## 🚀 Quick Start
 
 ### ✅ Prerequisites
 
-- ⚙️ Nextflow (≥23.10)
+- ⚙️ Nextflow (≥23.04.0)
 - 🐳 Docker or Singularity
 - 🎮 GPU recommended for optimal performance
 
 ### 🧪 Running with Test Profiles
 
-Test the pipeline with one of three available profiles:
+Test the pipeline with one of the three available profiles (each uses BoltzGen by default):
 
 ```bash
 # Test protein binder design
@@ -43,31 +46,93 @@ Replace `docker` with `singularity` if using Singularity containers.
 
 ### 🔬 Running with Your Own Data
 
+#### BoltzGen (default)
+
 ```bash
 nextflow run main.nf \
   --input samplesheet.csv \
   --outdir results \
   -profile docker
 ```
 
+#### Proteina-Complexa
+
+```bash
+nextflow run main.nf \
+  --protein_design_tool complexa \
+  --input samplesheet_complexa.csv \
+  --complexa_ckpt_dir /path/to/checkpoints \
+  --outdir results \
+  -profile docker
+```
+
 ## 📝 Input Format
 
-The pipeline requires a CSV samplesheet with design specifications. See `assets/test_data/` for examples:
+The samplesheet format depends on the chosen design tool (`--protein_design_tool`). See `assets/test_data/` for examples.
+
+### BoltzGen Samplesheet (default)
+
+```csv
+sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template
+design1,designs/my_design.yaml,target.cif,protein-anything,3,2,,target.a3m,target.fasta,
+```
+
+| Column | Required | Description |
+|--------|----------|-------------|
+| `sample_id` | ✅ | Unique sample identifier |
+| `design_yaml` | ✅ | Path to BoltzGen design YAML specification |
+| `target_sequence` | ✅ | Target sequence FASTA (for Boltz-2 refolding) |
+| `structure_files` | | Comma-separated structure files (PDB/CIF) |
+| `protocol` | | Design protocol (`protein-anything`, `peptide-anything`, `nanobody-anything`, `protein-small_molecule`) |
+| `num_designs` | | Number of intermediate designs to generate |
+| `budget` | | Number of final diversity-optimized designs to keep |
+| `reuse` | | Reuse previous results (`true`/`false`) |
+| `target_msa` | | Pre-computed MSA for target (e.g., `.a3m`) |
+| `target_template` | | Template structure for Boltz-2 (CIF) |
+
+### Complexa Samplesheet
 
 ```csv
-sample,design_yaml,protocol,num_designs,budget
-my_design,design.yaml,protein-anything,10,5
+sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template
+design1,target.cif,configs/pipeline.yaml,target.fasta,target.a3m,
 ```
 
+| Column | Required | Description |
+|--------|----------|-------------|
+| `sample_id` | ✅ | Unique sample identifier |
+| `target_pdb` | ✅ | Target structure (PDB or CIF) |
+| `pipeline_config` | ✅ | Complexa Hydra pipeline config YAML |
+| `target_sequence` | ✅ | Target sequence FASTA (for Boltz-2 refolding) |
+| `target_msa` | | Pre-computed MSA for target (e.g., `.a3m`) |
+| `target_template` | | Template structure for Boltz-2 (PDB/CIF) |
+
 ## ⚙️ Key Parameters
 
+### Design Tool Selection
+
+- `--protein_design_tool`: Design backend to use — `boltzgen` (default) or `complexa`
+
+### Common Parameters
+
 - `--input`: Path to samplesheet CSV
 - `--outdir`: Output directory (default: `./results`)
-- `--run_proteinmpnn`: Enable ProteinMPNN sequence optimization
-- `--run_boltz2_refold`: Enable Boltz-2 structure prediction
-- `--run_ipsae`: Enable IPSAE interface scoring
-- `--run_prodigy`: Enable PRODIGY affinity prediction
-- `--run_consolidation`: Generate consolidated metrics report
+- `--run_proteinmpnn`: Enable ProteinMPNN sequence optimization (default: `true`)
+- `--run_boltz2_refold`: Enable Boltz-2 structure prediction (default: `true`)
+- `--run_ipsae`: Enable IPSAE interface scoring (default: `true`)
+- `--run_prodigy`: Enable PRODIGY affinity prediction (default: `true`)
+- `--run_foldseek`: Enable Foldseek structural similarity search (default: `true`)
+- `--run_consolidation`: Generate consolidated metrics report (default: `true`)
+
+### BoltzGen-Specific Parameters
+
+- `--cache_dir`: Cache directory for BoltzGen model weights
+
+### Complexa-Specific Parameters
+
+- `--complexa_ckpt_dir`: Path to Complexa checkpoint directory
+- `--complexa_search_algorithm`: Search algorithm (`best-of-n`, `beam-search`, etc.)
+- `--complexa_nsteps`: Diffusion sampling steps (default: 400)
+- `--complexa_batch_size`: Generation batch size (default: 16)
 
 See `nextflow.config` for all available parameters.
 
@@ -77,20 +142,24 @@ Results are organized by sample in the output directory:
 
 ```
 results/
-├── boltzgen/          # Boltzgen designs and structures
-├── proteinmpnn/       # Optimized sequences (if enabled)
-├── boltz2/            # Refolded structures (if enabled)
-├── ipsae/             # Interface scores (if enabled)
-├── prodigy/           # Affinity predictions (if enabled)
-├── foldseek/          # Structural search results (if enabled)
-└── consolidated/      # Combined metrics report (if enabled)
+├── {sample_id}/
+│   ├── boltzgen/          # BoltzGen designs (if using boltzgen)
+│   ├── complexa/          # Complexa designs (if using complexa)
+│   ├── proteinmpnn/       # Optimized sequences
+│   ├── boltz2/            # Refolded structures
+│   ├── ipsae/             # Interface scores
+│   ├── prodigy/           # Affinity predictions
+│   ├── foldseek/          # Structural search results
+│   └── consolidated/      # Combined metrics report
+└── pipeline_info/         # Execution reports
 ```
 
 ## 📚 Citation
 
 If you use this pipeline, please cite:
 
-- **Boltzgen**: Stark et al. (2025) bioRxiv 2025.11.20.689494
+- **BoltzGen**: Stark et al. (2025) bioRxiv 2025.11.20.689494
+- **Proteina-Complexa**: [Add Complexa citation]
 - **ProteinMPNN**: Dauparas et al. (2022) Science
 - **Nextflow**: Di Tommaso et al. (2017) Nature Biotechnology
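
Note on the samplesheet layouts documented in the README changes above: the two formats differ only in their column sets, so they can be generated from one helper. The following is a minimal sketch using Python's standard `csv` module. The column lists are copied from the README examples; the helper name `render_samplesheet` is hypothetical and not part of the pipeline.

```python
import csv
import io

# Column order copied from the README examples in this diff.
BOLTZGEN_COLUMNS = [
    "sample_id", "design_yaml", "structure_files", "protocol", "num_designs",
    "budget", "reuse", "target_msa", "target_sequence", "target_template",
]
COMPLEXA_COLUMNS = [
    "sample_id", "target_pdb", "pipeline_config",
    "target_sequence", "target_msa", "target_template",
]


def render_samplesheet(columns, rows):
    """Return CSV text with a header; fields missing from a row are left empty.

    Hypothetical helper: DictWriter raises ValueError on unknown keys,
    which catches misspelled column names early.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Same values as the BoltzGen example row in the README.
boltzgen_csv = render_samplesheet(BOLTZGEN_COLUMNS, [{
    "sample_id": "design1",
    "design_yaml": "designs/my_design.yaml",
    "structure_files": "target.cif",
    "protocol": "protein-anything",
    "num_designs": 3,
    "budget": 2,
    "target_msa": "target.a3m",
    "target_sequence": "target.fasta",
}])
print(boltzgen_csv)
```

Passing `COMPLEXA_COLUMNS` with `target_pdb` and `pipeline_config` keys produces the Complexa layout in the same way.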
 

diff --git a/assets/ipsae.py b/assets/ipsae.py
@@ -437,18 +437,18 @@ def classify_chains(chains, residue_types):
     # pae_AURKA_TPX2_model_0.npz
     # plddt_AURKA_TPX2_model_0.npz
 
-    # Boltzgen (Boltz2) filenames (no pae_ prefix):
+    # Complexa (Boltz2) filenames (no pae_ prefix):
     # design_0.cif
     # design_0.npz  (contains PAE data)
     # confidence_design_0.json (optional)
-    # Note: Boltzgen uses same filename for CIF and NPZ
+    # Note: Complexa uses same filename for CIF and NPZ
 
-    # First check if pLDDT data is in the same NPZ file (Boltz2/Boltzgen style)
+    # First check if pLDDT data is in the same NPZ file (Boltz2/Complexa style)
     data_pae = np.load(pae_file_path)
     print(f"Boltz PAE file keys: {list(data_pae.keys())}")
 
     if 'plddt' in data_pae.keys():
-        # Boltz2/Boltzgen format: plddt in same file as pae
+        # Boltz2/Complexa format: plddt in same file as pae
         plddt_boltz1=np.array(100.0*data_pae['plddt']) if data_pae['plddt'].max() <= 1.0 else np.array(data_pae['plddt'])
         plddt =    plddt_boltz1[np.ix_(token_array.astype(bool))]
         cb_plddt = plddt_boltz1[np.ix_(token_array.astype(bool))]
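
Note on the pLDDT handling in the hunk above: the code multiplies by 100 only when the largest value is at most 1.0, so both 0-1 and 0-100 inputs end up on a 0-100 scale. Here is a minimal sketch of that heuristic in isolation. It uses plain Python lists instead of NumPy arrays to stay dependency-free, and the function name `plddt_to_percent` is hypothetical.

```python
def plddt_to_percent(plddt):
    """Return pLDDT on a 0-100 scale, rescaling only if the input looks like 0-1.

    Mirrors the max() <= 1.0 check in the hunk above, on a plain list
    rather than a NumPy array (an assumption made for this sketch).
    """
    values = [float(v) for v in plddt]
    if max(values) <= 1.0:
        return [100.0 * v for v in values]
    return values


print(plddt_to_percent([0.5, 0.25]))   # fractional input is rescaled
print(plddt_to_percent([91.2, 67.5]))  # percentage input is returned unchanged
```

One consequence of the heuristic: a structure whose pLDDT values are already on the 0-100 scale but all at or below 1.0 would be rescaled as well.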

diff --git a/assets/schema_input_boltzgen.json b/assets/schema_input_boltzgen.json
@@ -0,0 +1,68 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_boltzgen.json",
+  "title": "seqeralabs/nf-proteindesign - BoltzGen samplesheet schema",
+  "description": "Schema for validating samplesheets when --protein_design_tool=boltzgen. Each row specifies a design YAML, structure files, and generation parameters.",
+  "type": "array",
+  "items": {
+    "type": "object",
+    "properties": {
+      "sample_id": {
+        "type": "string",
+        "pattern": "^[a-zA-Z0-9_-]+$",
+        "errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only"
+      },
+      "design_yaml": {
+        "type": "string",
+        "pattern": "^\\S+\\.ya?ml$",
+        "errorMessage": "Design YAML must be a valid file path ending in .yaml or .yml"
+      },
+      "structure_files": {
+        "type": "string",
+        "errorMessage": "Structure files must be a comma-separated list of PDB/CIF file paths (e.g., '2VSM.cif' or 'protein1.pdb,protein2.cif')"
+      },
+      "protocol": {
+        "type": "string",
+        "enum": [
+          "protein-anything",
+          "peptide-anything",
+          "protein-small_molecule",
+          "nanobody-anything"
+        ],
+        "errorMessage": "Protocol must be one of: protein-anything, peptide-anything, protein-small_molecule, nanobody-anything"
+      },
+      "num_designs": {
+        "type": "integer",
+        "minimum": 1,
+        "errorMessage": "Number of designs must be a positive integer"
+      },
+      "budget": {
+        "type": "integer",
+        "minimum": 1,
+        "errorMessage": "Budget must be a positive integer"
+      },
+      "reuse": {
+        "type": "boolean",
+        "errorMessage": "Reuse must be true or false"
+      },
+      "target_msa": {
+        "type": "string",
+        "errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file (e.g., 'target.a3m')"
+      },
+      "target_sequence": {
+        "type": "string",
+        "errorMessage": "Target sequence must be a valid file path to a FASTA file containing the target protein sequence"
+      },
+      "target_template": {
+        "type": "string",
+        "pattern": "^\\S+\\.cif$",
+        "errorMessage": "Target template must be a valid file path to a CIF file"
+      }
+    },
+    "required": [
+      "sample_id",
+      "design_yaml",
+      "target_sequence"
+    ]
+  }
+}
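
Note on the BoltzGen schema above: its rules can be exercised outside Nextflow with a few lines of standard-library Python. This is a rough sketch, not the pipeline's validator. The patterns, enum values, and required fields are copied from `schema_input_boltzgen.json`; the function name `check_boltzgen_row` and the treatment of empty CSV cells as "not provided" are assumptions.

```python
import re

# Patterns, enum values, and required fields copied from schema_input_boltzgen.json.
SAMPLE_ID = r"^[a-zA-Z0-9_-]+$"
YAML_PATH = r"^\S+\.ya?ml$"
CIF_PATH = r"^\S+\.cif$"
PROTOCOLS = {"protein-anything", "peptide-anything",
             "protein-small_molecule", "nanobody-anything"}
REQUIRED = ("sample_id", "design_yaml", "target_sequence")


def check_boltzgen_row(row):
    """Return a list of error strings for one samplesheet row (dict of str to str)."""
    # Assumption: an empty CSV cell means the field was not provided.
    errors = [f"Missing required field: {name}" for name in REQUIRED if not row.get(name)]
    if row.get("sample_id") and not re.fullmatch(SAMPLE_ID, row["sample_id"]):
        errors.append("Sample ID must be alphanumeric with underscores or hyphens only")
    if row.get("design_yaml") and not re.fullmatch(YAML_PATH, row["design_yaml"]):
        errors.append("Design YAML must be a valid file path ending in .yaml or .yml")
    if row.get("protocol") and row["protocol"] not in PROTOCOLS:
        errors.append("Protocol must be one of: " + ", ".join(sorted(PROTOCOLS)))
    for name in ("num_designs", "budget"):
        value = row.get(name, "")
        # Digits only, and at least 1, matching "type: integer, minimum: 1".
        if value and not (re.fullmatch(r"[0-9]+", value) and int(value) >= 1):
            errors.append(f"{name} must be a positive integer")
    if row.get("target_template") and not re.fullmatch(CIF_PATH, row["target_template"]):
        errors.append("Target template must be a valid file path to a CIF file")
    return errors


good_row = {"sample_id": "design1", "design_yaml": "designs/my_design.yaml",
            "protocol": "protein-anything", "num_designs": "3", "budget": "2",
            "target_sequence": "target.fasta", "target_template": ""}
print(check_boltzgen_row(good_row))
```

The example row from the README passes with no errors; a row with a space in `sample_id` or a `.txt` design file would each add one message to the returned list.
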
diff --git a/assets/schema_input_complexa.json b/assets/schema_input_complexa.json
@@ -0,0 +1,45 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_complexa.json",
+  "title": "seqeralabs/nf-proteindesign - Proteina-Complexa samplesheet schema",
+  "description": "Schema for validating samplesheets when --protein_design_tool=complexa. Each row specifies a target PDB, pipeline config YAML, and target sequence.",
+  "type": "array",
+  "items": {
+    "type": "object",
+    "properties": {
+      "sample_id": {
+        "type": "string",
+        "pattern": "^[a-zA-Z0-9_-]+$",
+        "errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only"
+      },
+      "target_pdb": {
+        "type": "string",
+        "pattern": "^\\S+\\.(pdb|cif)$",
+        "errorMessage": "Target PDB must be a valid file path ending in .pdb or .cif"
+      },
+      "pipeline_config": {
+        "type": "string",
+        "pattern": "^\\S+\\.ya?ml$",
+        "errorMessage": "Pipeline config must be a valid file path ending in .yaml or .yml"
+      },
+      "target_sequence": {
+        "type": "string",
+        "errorMessage": "Target sequence must be a valid file path to a FASTA file"
+      },
+      "target_msa": {
+        "type": "string",
+        "errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file"
+      },
+      "target_template": {
+        "type": "string",
+        "errorMessage": "Target template must be a valid file path to a PDB or CIF file"
+      }
+    },
+    "required": [
+      "sample_id",
+      "target_pdb",
+      "pipeline_config",
+      "target_sequence"
+    ]
+  }
+}
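
Note on the two schemas: they share `sample_id` and `target_sequence` but differ in the remaining required columns, so a samplesheet written for one backend will fail validation under the other. The sketch below shows one way to catch that at the header level before launching a run. How the pipeline itself selects a schema from `--protein_design_tool` is not shown in this diff, so the mapping and the function name `missing_columns` are assumptions; the column names are copied from the `required` arrays above.

```python
# Required columns copied from the "required" arrays of the two schema files.
REQUIRED_COLUMNS = {
    "boltzgen": ("sample_id", "design_yaml", "target_sequence"),
    "complexa": ("sample_id", "target_pdb", "pipeline_config", "target_sequence"),
}


def missing_columns(tool, header):
    """Return the required columns for `tool` that are absent from a CSV header."""
    if tool not in REQUIRED_COLUMNS:
        raise ValueError(f"Unknown design tool: {tool!r}")
    return [name for name in REQUIRED_COLUMNS[tool] if name not in header]


# Header taken from the Complexa samplesheet example in the README.
complexa_header = (
    "sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template"
).split(",")
print(missing_columns("complexa", complexa_header))
print(missing_columns("boltzgen", complexa_header))
```

Checking the Complexa header against the `boltzgen` entry reports `design_yaml` as missing, which is the mismatch a user would hit after forgetting to pass `--protein_design_tool complexa`.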