4 changes: 4 additions & 0 deletions .gitignore
@@ -28,6 +28,10 @@ Thumbs.db
# Testing
test_output/
test_results/
results_test_*/

# Claude Code
.claude/

# Boltzgen cache
.cache/
481 changes: 481 additions & 0 deletions PROTEINA_COMPLEXA_MIGRATION.md

Large diffs are not rendered by default.

111 changes: 90 additions & 21 deletions README.md
@@ -2,31 +2,34 @@

> ⚠️ **IMPORTANT**: This pipeline was developed by Seqera as a proof of principle using Seqera AI. It demonstrates the capabilities of AI-assisted bioinformatics pipeline development but should be thoroughly validated before use in production environments.

A Nextflow pipeline for AI-powered protein design using Boltzgen to design protein binders, nanobodies, and peptides.
A Nextflow pipeline for AI-powered protein design supporting two generative backends — **BoltzGen** (default) and **Proteina-Complexa** — to design protein binders, nanobodies, and peptides.

## 📋 Overview

This pipeline automates the process of designing novel protein binders using Boltzgen and provides comprehensive analysis through optional modules:
This pipeline automates the process of designing novel protein binders and provides comprehensive analysis through optional downstream modules:

- 🎯 **Boltzgen Design**: Generate protein, nanobody, or peptide binders for target structures
- 🎯 **BoltzGen** (default): Flow-matching generative model for protein design using design YAML specifications
- 🏗️ **Proteina-Complexa**: Generative diffusion model for protein design using pipeline config YAMLs
- 🧬 **ProteinMPNN**: Optimize sequences for improved stability and expression
- 🔄 **Boltz-2 Refolding**: Validate designs through structure prediction
- 📊 **IPSAE**: Score protein-protein interface quality
- ⚡ **PRODIGY**: Predict binding affinity
- 🔍 **Foldseek**: Search structural databases for similar designs
- 📈 **Metrics Consolidation**: Generate comprehensive analysis reports

Both design backends converge into the same downstream pipeline (ProteinMPNN → Boltz-2 → Analysis → Consolidation).

## 🚀 Quick Start

### ✅ Prerequisites

- ⚙️ Nextflow (≥23.10)
- ⚙️ Nextflow (≥23.04.0)
- 🐳 Docker or Singularity
- 🎮 GPU recommended for optimal performance

### 🧪 Running with Test Profiles

Test the pipeline with one of three available profiles:
Test the pipeline with one of three available profiles (uses BoltzGen by default):

```bash
# Test protein binder design
@@ -43,31 +46,93 @@ Replace `docker` with `singularity` if using Singularity containers.

### 🔬 Running with Your Own Data

#### BoltzGen (default)

```bash
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
-profile docker
```

#### Proteina-Complexa

```bash
nextflow run main.nf \
--protein_design_tool complexa \
--input samplesheet_complexa.csv \
--complexa_ckpt_dir /path/to/checkpoints \
--outdir results \
-profile docker
```

## 📝 Input Format

The pipeline requires a CSV samplesheet with design specifications. See `assets/test_data/` for examples:
The samplesheet format depends on the chosen design tool (`--protein_design_tool`). See `assets/test_data/` for examples.

### BoltzGen Samplesheet (default)

```csv
sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template
design1,designs/my_design.yaml,target.cif,protein-anything,3,2,,target.a3m,target.fasta,
```

| Column | Required | Description |
|--------|----------|-------------|
| `sample_id` | ✅ | Unique sample identifier |
| `design_yaml` | ✅ | Path to BoltzGen design YAML specification |
| `target_sequence` | ✅ | Target sequence FASTA (for Boltz-2 refolding) |
| `structure_files` | | Comma-separated structure files (PDB/CIF) |
| `protocol` | | Design protocol (`protein-anything`, `peptide-anything`, `nanobody-anything`, `protein-small_molecule`) |
| `num_designs` | | Number of intermediate designs to generate |
| `budget` | | Number of final diversity-optimized designs to keep |
| `reuse` | | Reuse previous results (`true`/`false`) |
| `target_msa` | | Pre-computed MSA for target (e.g., `.a3m`) |
| `target_template` | | Template structure for Boltz-2 (CIF) |

### Complexa Samplesheet

```csv
sample,design_yaml,protocol,num_designs,budget
my_design,design.yaml,protein-anything,10,5
sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template
design1,target.cif,configs/pipeline.yaml,target.fasta,target.a3m,
```

| Column | Required | Description |
|--------|----------|-------------|
| `sample_id` | ✅ | Unique sample identifier |
| `target_pdb` | ✅ | Target structure (PDB or CIF) |
| `pipeline_config` | ✅ | Complexa Hydra pipeline config YAML |
| `target_sequence` | ✅ | Target sequence FASTA (for Boltz-2 refolding) |
| `target_msa` | | Pre-computed MSA for target (e.g., `.a3m`) |
| `target_template` | | Template structure for Boltz-2 (PDB/CIF) |

## ⚙️ Key Parameters

### Design Tool Selection

- `--protein_design_tool`: Design backend to use — `boltzgen` (default) or `complexa`

### Common Parameters

- `--input`: Path to samplesheet CSV
- `--outdir`: Output directory (default: `./results`)
- `--run_proteinmpnn`: Enable ProteinMPNN sequence optimization
- `--run_boltz2_refold`: Enable Boltz-2 structure prediction
- `--run_ipsae`: Enable IPSAE interface scoring
- `--run_prodigy`: Enable PRODIGY affinity prediction
- `--run_consolidation`: Generate consolidated metrics report
- `--run_proteinmpnn`: Enable ProteinMPNN sequence optimization (default: `true`)
- `--run_boltz2_refold`: Enable Boltz-2 structure prediction (default: `true`)
- `--run_ipsae`: Enable IPSAE interface scoring (default: `true`)
- `--run_prodigy`: Enable PRODIGY affinity prediction (default: `true`)
- `--run_foldseek`: Enable Foldseek structural similarity search (default: `true`)
- `--run_consolidation`: Generate consolidated metrics report (default: `true`)

### BoltzGen-Specific Parameters

- `--cache_dir`: Cache directory for BoltzGen model weights

### Complexa-Specific Parameters

- `--complexa_ckpt_dir`: Path to Complexa checkpoint directory
- `--complexa_search_algorithm`: Search algorithm (`best-of-n`, `beam-search`, etc.)
- `--complexa_nsteps`: Diffusion sampling steps (default: 400)
- `--complexa_batch_size`: Generation batch size (default: 16)

See `nextflow.config` for all available parameters.

@@ -77,20 +142,24 @@ Results are organized by sample in the output directory:

```
results/
├── boltzgen/ # Boltzgen designs and structures
├── proteinmpnn/ # Optimized sequences (if enabled)
├── boltz2/ # Refolded structures (if enabled)
├── ipsae/ # Interface scores (if enabled)
├── prodigy/ # Affinity predictions (if enabled)
├── foldseek/ # Structural search results (if enabled)
└── consolidated/ # Combined metrics report (if enabled)
├── {sample_id}/
│ ├── boltzgen/ # BoltzGen designs (if using boltzgen)
│ ├── complexa/ # Complexa designs (if using complexa)
│ ├── proteinmpnn/ # Optimized sequences
│ ├── boltz2/ # Refolded structures
│ ├── ipsae/ # Interface scores
│ ├── prodigy/ # Affinity predictions
│ ├── foldseek/ # Structural search results
│ └── consolidated/ # Combined metrics report
└── pipeline_info/ # Execution reports
```

## 📚 Citation

If you use this pipeline, please cite:

- **Boltzgen**: Stark et al. (2025) bioRxiv 2025.11.20.689494
- **BoltzGen**: Jing et al. (2024) "Generative Modeling of Molecular Dynamics Trajectories"
- **Proteina-Complexa**: [Add Complexa citation]
- **ProteinMPNN**: Dauparas et al. (2022) Science
- **Nextflow**: Di Tommaso et al. (2017) Nature Biotechnology

8 changes: 4 additions & 4 deletions assets/ipsae.py
@@ -437,18 +437,18 @@ def classify_chains(chains, residue_types):
# pae_AURKA_TPX2_model_0.npz
# plddt_AURKA_TPX2_model_0.npz

# Boltzgen (Boltz2) filenames (no pae_ prefix):
# Complexa (Boltz2) filenames (no pae_ prefix):
# design_0.cif
# design_0.npz (contains PAE data)
# confidence_design_0.json (optional)
# Note: Boltzgen uses same filename for CIF and NPZ
# Note: Complexa uses same filename for CIF and NPZ

# First check if pLDDT data is in the same NPZ file (Boltz2/Boltzgen style)
# First check if pLDDT data is in the same NPZ file (Boltz2/Complexa style)
data_pae = np.load(pae_file_path)
print(f"Boltz PAE file keys: {list(data_pae.keys())}")

if 'plddt' in data_pae.keys():
# Boltz2/Boltzgen format: plddt in same file as pae
# Boltz2/Complexa format: plddt in same file as pae
plddt_boltz1=np.array(100.0*data_pae['plddt']) if data_pae['plddt'].max() <= 1.0 else np.array(data_pae['plddt'])
plddt = plddt_boltz1[np.ix_(token_array.astype(bool))]
cb_plddt = plddt_boltz1[np.ix_(token_array.astype(bool))]
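The scaling rule in this hunk (multiply by 100 only when no value exceeds 1.0) can be tried in isolation. The sketch below is an illustration, not part of `ipsae.py`: the function name `to_percent_scale` is hypothetical, and it uses plain Python lists instead of NumPy so that it runs without dependencies.

```python
def to_percent_scale(plddt):
    """Return pLDDT values on a 0-100 scale.

    Mirrors the heuristic in the hunk above: if no value exceeds 1.0,
    the input is assumed to be fractional (0-1) and is multiplied by 100.
    Otherwise the values are returned unchanged.
    """
    values = [float(v) for v in plddt]
    if not values:
        return values
    if max(values) <= 1.0:
        return [100.0 * v for v in values]
    return values


# Fractional input is rescaled; input already on the 0-100 scale is kept.
fractional = to_percent_scale([0.5, 0.25, 1.0])
already_scaled = to_percent_scale([50.0, 25.0, 100.0])
print(fractional)
print(already_scaled)
```

One consequence of the heuristic is that a 0-100 array whose values all happen to be at or below 1.0 would be rescaled as if it were fractional. That is unlikely for real predictions but worth knowing when testing with synthetic data.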
68 changes: 68 additions & 0 deletions assets/schema_input_boltzgen.json
@@ -0,0 +1,68 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_boltzgen.json",
"title": "seqeralabs/nf-proteindesign - BoltzGen samplesheet schema",
"description": "Schema for validating samplesheets when --protein_design_tool=boltzgen. Each row specifies a design YAML, structure files, and generation parameters.",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample_id": {
"type": "string",
"pattern": "^[a-zA-Z0-9_-]+$",
"errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only"
},
"design_yaml": {
"type": "string",
"pattern": "^\\S+\\.ya?ml$",
"errorMessage": "Design YAML must be a valid file path ending in .yaml or .yml"
},
"structure_files": {
"type": "string",
"errorMessage": "Structure files must be a comma-separated list of PDB/CIF file paths (e.g., '2VSM.cif' or 'protein1.pdb,protein2.cif')"
},
"protocol": {
"type": "string",
"enum": [
"protein-anything",
"peptide-anything",
"protein-small_molecule",
"nanobody-anything"
],
"errorMessage": "Protocol must be one of: protein-anything, peptide-anything, protein-small_molecule, nanobody-anything"
},
"num_designs": {
"type": "integer",
"minimum": 1,
"errorMessage": "Number of designs must be a positive integer"
},
"budget": {
"type": "integer",
"minimum": 1,
"errorMessage": "Budget must be a positive integer"
},
"reuse": {
"type": "boolean",
"errorMessage": "Reuse must be true or false"
},
"target_msa": {
"type": "string",
"errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file (e.g., 'target.a3m')"
},
"target_sequence": {
"type": "string",
"errorMessage": "Target sequence must be a valid file path to a FASTA file containing the target protein sequence"
},
"target_template": {
"type": "string",
"pattern": "^\\S+\\.cif$",
"errorMessage": "Target template must be a valid file path to a CIF file"
}
},
"required": [
"sample_id",
"design_yaml",
"target_sequence"
]
}
}
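As a rough illustration of what the `protocol`, `num_designs`, and `budget` constraints in this schema mean for a samplesheet row, here is a minimal sketch in plain Python. It is not how the pipeline validates input. The function name `check_generation_fields` is hypothetical, and the sketch assumes that CSV cells arrive as strings and that an empty cell means the optional column was left out.

```python
# Allowed values copied from the "protocol" enum in the schema above.
PROTOCOLS = {
    "protein-anything",
    "peptide-anything",
    "protein-small_molecule",
    "nanobody-anything",
}


def check_generation_fields(row):
    """Return a list of problems for the optional generation columns."""
    problems = []

    protocol = (row.get("protocol") or "").strip()
    if protocol and protocol not in PROTOCOLS:
        problems.append(f"protocol {protocol!r} is not one of {sorted(PROTOCOLS)}")

    # Both columns are integers with "minimum": 1 in the schema.
    for column in ("num_designs", "budget"):
        raw = (row.get(column) or "").strip()
        if not raw:
            continue  # assumption: an empty cell means "not provided"
        try:
            value = int(raw)
        except ValueError:
            problems.append(f"{column} must be an integer, got {raw!r}")
            continue
        if value < 1:
            problems.append(f"{column} must be >= 1, got {value}")

    return problems


good = {"protocol": "protein-anything", "num_designs": "3", "budget": "2"}
bad = {"protocol": "antibody", "num_designs": "0", "budget": "x"}
print(check_generation_fields(good))
print(check_generation_fields(bad))
```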
45 changes: 45 additions & 0 deletions assets/schema_input_complexa.json
@@ -0,0 +1,45 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/seqeralabs/nf-proteindesign/main/assets/schema_input_complexa.json",
"title": "seqeralabs/nf-proteindesign - Proteina-Complexa samplesheet schema",
"description": "Schema for validating samplesheets when --protein_design_tool=complexa. Each row specifies a target PDB, pipeline config YAML, and target sequence.",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample_id": {
"type": "string",
"pattern": "^[a-zA-Z0-9_-]+$",
"errorMessage": "Sample ID must be alphanumeric with underscores or hyphens only"
},
"target_pdb": {
"type": "string",
"pattern": "^\\S+\\.(pdb|cif)$",
"errorMessage": "Target PDB must be a valid file path ending in .pdb or .cif"
},
"pipeline_config": {
"type": "string",
"pattern": "^\\S+\\.ya?ml$",
"errorMessage": "Pipeline config must be a valid file path ending in .yaml or .yml"
},
"target_sequence": {
"type": "string",
"errorMessage": "Target sequence must be a valid file path to a FASTA file"
},
"target_msa": {
"type": "string",
"errorMessage": "Target MSA must be a valid file path to a pre-computed MSA file"
},
"target_template": {
"type": "string",
"errorMessage": "Target template must be a valid file path to a PDB or CIF file"
}
},
"required": [
"sample_id",
"target_pdb",
"pipeline_config",
"target_sequence"
]
}
}
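To see how the required columns and filename patterns in this schema apply to a concrete Complexa samplesheet, the following sketch re-implements those checks with the Python standard library. It is an approximation for illustration only: the helper names `check_row` and `check_samplesheet` are hypothetical, the sample rows are invented, and treating an empty CSV cell as a missing value is an assumption about how the validator handles blanks.

```python
import csv
import io
import re

# Required columns and patterns copied from schema_input_complexa.json above.
REQUIRED = ("sample_id", "target_pdb", "pipeline_config", "target_sequence")
PATTERNS = {
    "sample_id": re.compile(r"^[a-zA-Z0-9_-]+$"),
    "target_pdb": re.compile(r"^\S+\.(pdb|cif)$"),
    "pipeline_config": re.compile(r"^\S+\.ya?ml$"),
}


def check_row(row):
    """Return a list of human-readable problems for one samplesheet row."""
    problems = []
    for column in REQUIRED:
        if not (row.get(column) or "").strip():
            problems.append(f"missing required column: {column}")
    for column, pattern in PATTERNS.items():
        value = (row.get(column) or "").strip()
        if value and not pattern.search(value):
            problems.append(f"{column} does not match {pattern.pattern}")
    return problems


def check_samplesheet(text):
    """Map 1-based data row number to its problems, for rows that have any."""
    reader = csv.DictReader(io.StringIO(text))
    report = {}
    for number, row in enumerate(reader, start=1):
        problems = check_row(row)
        if problems:
            report[number] = problems
    return report


# Hypothetical samplesheet: row 1 is valid, row 2 breaks three rules.
SHEET = """sample_id,target_pdb,pipeline_config,target_sequence,target_msa,target_template
design1,target.cif,configs/pipeline.yaml,target.fasta,target.a3m,
bad id,target.xyz,configs/pipeline.yaml,,,
"""

for row_number, problems in check_samplesheet(SHEET).items():
    print(row_number, problems)
```

Unlike the BoltzGen schema, this one places no extension pattern on `target_template`, so the sketch leaves that column unchecked.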