AFM-Pisa-classifier/README.Rmd at main · philipptrepte/AFM-Pisa-classifier · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  dpi = 300,
  out.width = "100%"
)
```

# AFM-PISA-classifier

<!-- badges: start -->

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

<!-- badges: end -->

The AFM-PISA-classifier package uses AlphaFold-Multimer (AFM) predictions of the human reference set of positive and random protein interactions (hsPRS-v2, hsRRS-v2) from 'Choi, S.G., Olivet, J., Cassonnet, P. et al. *Nat Communications* (2019)' as training set to predict the likelihood of AFM predictions of interest to be true-positive interactions. For this, the AFM predicted structures are analyzed using the PDBePISA tool for the exploration of macromolecular interfaces. As training features, the PAE values from the AFM predictions and interface areas retrieved by PDBePISA are used.

## Installation

For some plots XQuartz is required, which can be downloaded from [XQuartz](https://www.xquartz.org/).
You can install the development version of AFM-PISA-classifier from [GitHub](https://github.com/philipptrepte/AFM-PISA-classifier) with:

``` r
# Install dependencies
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
BiocManager::install("Biostrings")
BiocManager::install("BiocStyle")

devtools::install_github("philipptrepte/binary-PPI-classifier")

# Install AFMpisa
devtools::install_github("philipptrepte/AFM-Pisa-classifier")

```

## Requirements

```{r requirement, message=FALSE, warning=FALSE, include=FALSE}
library(AFMpisa)
library(DT)
library(dplyr)
library(knitr)
library(stringr)
```

### AlphaFold-Multimer (AFM) predictions

Predict protein complexes with AlphaFold-Multimer using for example [AlphaFold](https://github.com/google-deepmind/alphafold) or [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb). Results for each predicted model will include '.json' files containing the PAE values.

### Protein Interfaces, surfaces and assemblies (PISA)

Use [PDBePISA](https://www.ebi.ac.uk/pdbe/pisa/) or [PisaPy](https://github.com/hocinebib/PisaPy) for batch analysis of AFM predictions to extract information on macromolecular interfaces. Results will include '[...]interfacesummary0.xml' and '[...]interfacetable.xml' files. Among others, the *interfacesummary* contains information on the 'Interface Area' and 'Number of Interface Residues', while the *interfacetable* contains information on the 'Total Surface Area', the 'Solvation Free Energy', 'H-Bonds', 'Saltbridges', 'Disulfide Bonds' and 'Complexation Significance Score (CSS)'.

### Binary PPI-classifier

Evaluation of AFM predicted structural models, requires the [binaryPPIclassifier](https://github.com/philipptrepte/binary-PPI-classifier) package. To evaluate the predicted models, you can either train a multi-adaptive support vector machine learning (maSVM) algorithm on your own training data, or use the provided models, which can be found in the `/data/maSVM_models/` folder. The provided models were generated by training the maSVM algorithm on AFM predicted structural models from 51 known interactions and 67 random protein pairs not known to interact.

## Usage

### import.afm()

The function `import.afm()` will import all .json files in the specified directory `dir` where your save the results from the AFM predictions. PAE, pLLDT and PTM values will be extracted and stored in a list.

In the following, we provide an example for the AFM predicted structural models for the NSP10-NSP16 and NSP10-NSP14 complexes.

```{r AFMresults, echo=TRUE, message=FALSE, warning=FALSE}
YOUR_AFM_RESULTS <- import.afm(dir = "data/AFM_json/")
```

#### Protein information

The list will contain a data frame providing you with an overview of the predicted protein complexes:

```{r proteinInfo, echo=TRUE}
head(YOUR_AFM_RESULTS$protein)
```

#### pLLDT

It also contains a vector with pLLDT values for every amino acid:

```{r pLLDT, echo=TRUE}
YOUR_AFM_RESULTS$plldt$`NSP10-NSP16_3cd38.result/NSP10_NSP16_3cd38_unrelaxed_rank_1_model_3_scores.json`
```

#### PAE matrix

It also contains a matrix with PAE values for every amino acid pair. Only the first 20 amino acids are shown:

```{r PAEmatrix, echo=TRUE}
YOUR_AFM_RESULTS$pae$`NSP10-NSP16_3cd38.result/NSP10_NSP16_3cd38_unrelaxed_rank_1_model_3_scores.json`[1:20,1:20]
```

#### PTM and max PAE

It also contains a data frame on the PTM and max PAE values:

```{r PTM, echo=TRUE}
YOUR_AFM_RESULTS$ptm
```

#### Number of AFM Models

And finally the number of models predicted by AFM for each protein pair:

```{r numModels, echo=TRUE, warning=FALSE}
YOUR_AFM_RESULTS$num_models
```

### import.pisa()

The function `import.pisa()` will import from all subdirectoris under the specified directory `dir` the `interfacesummary0.xml` and `interfacetable.xml` files that you saved from the PDBePISA results. Information on 'Interface Area', 'Number of Interface Residues', 'Total Surface Area', 'Solvation Free Energy', 'H-Bonds', 'Saltbridges', 'Disulfide Bonds' and 'Complexation Significance Score (CSS)' will be extracted and stored in a data frame.

In the following, we provide an example on the information extracted from the PDBePISA analyzed AFM predicted structural models of the E-M, NSP10-NSP16 and NSP10-NSP14 complexes.

```{r PISAinterface, echo=TRUE, message=FALSE, warning=FALSE}
YOUR_PISA_INTERFACE <- import.pisa(dir = "data/PDBePISA_xml/")

YOUR_PISA_INTERFACE
```

### plldt.lineplot()

This function plots the pLLDT values for each amino acid of an AFM predicted structural model. You need to specify which complex to plot `afm_complex = "NSP10-NSP16"` and which rank `afm_rank = 1` or `afm_rank = "all"`.

```{r pLLDTlineplot, echo=TRUE, warning=FALSE}
plot_grid(
  plldt.lineplot(import_afm = YOUR_AFM_RESULTS, afm_complex = "E-M", afm_rank = 'all'),
  plldt.lineplot(import_afm = YOUR_AFM_RESULTS, afm_complex = "NSP10-NSP16", afm_rank = 'all'),
  plldt.lineplot(import_afm = YOUR_AFM_RESULTS, afm_complex = "NSP14-NSP10", afm_rank = 'all'),
  ncol = 1
)

```

### pae.heatmap()

The function pae.heatmap() lets you visualize the intra- and inter-residue PAE values from your AFM predicted structural models. You can specify which AFM model to plot by providing the rank, and to only plot the kmeans clustered interface region for which you can set a pLLDT cutoff that defines which amino acids to include during kmeans clustering. When clustering the interface region, a barplot will be plotted, showing the average PAE values from the resulting eight clusters.

#### E-M complex without interface clustering

```{r PAEheatmapEM, echo=TRUE, warning=FALSE}
pae.heatmap(import_afm = YOUR_AFM_RESULTS, afm_complex = "E-M", afm_rank = 1, interface_cluster = FALSE)
```

#### E-M complex with interface clustering

Note that we used here a pLLDT of 0 \`plldt = 0' as cutoff.

```{r PAEheatmapEMclustered, echo=TRUE, warning=FALSE}
pae.heatmap(import_afm = YOUR_AFM_RESULTS, afm_complex = "E-M", afm_rank = 1, interface_cluster = TRUE, plldt = 0)
```

#### NSP10-NSP16 complex without interface clustering

```{r PAEheatmapNSP10NSP16, echo=TRUE, warning=FALSE}
pae.heatmap(import_afm = YOUR_AFM_RESULTS, afm_complex = "NSP10-NSP16", afm_rank = 1, interface_cluster = FALSE)
```

#### NSP10-NSP16 complex with interface clustering

Note that we used here a pLLDT of 50 \`plldt = 50' as cutoff.

```{r PAEheatmapNSP10NSP16clustered, echo=TRUE, warning=FALSE}
pae.heatmap(import_afm = YOUR_AFM_RESULTS, afm_complex = "NSP10-NSP16", afm_rank = 1, interface_cluster = TRUE, plldt = 50)
```

#### NSP14-NSP10 complex without interface clustering

```{r PAEheatmapNSP10NSP14, echo=TRUE, warning=FALSE}
pae.heatmap(import_afm = YOUR_AFM_RESULTS, afm_complex = "NSP14-NSP10", afm_rank = 1, interface_cluster = FALSE)
```

#### NSP14-NSP10 complex with interface clustering

Note that we used here a pLLDT of 50 \`plldt = 50' as cutoff.

```{r PAEheatmapNSP10NSP14clustered, echo=TRUE, warning=FALSE}
pae.heatmap(import_afm = YOUR_AFM_RESULTS, afm_complex = "NSP14-NSP10", afm_rank = 1, interface_cluster = TRUE, plldt = 50)
```

### pae.interface()

This function performs kmeans clustering to identify amino acid clusters with the lowest average PAE. This is equivalent to the results from the `pae.heatmap()` function with the parameter `interface_cluster = TRUE`. The function takes the resulting list from `import.afm()` as input and produces a data frame as output, which stores for the resulting eight clusters (AB cluster 1-4 and BA cluster 1-4) information on their median and mean PAE values as well as the cluster size as `number of residues protein A * number of residues protein B`. Note that we used a pLLDT cutoff of \>50 for all complexes now `plldt = 50`.

```{r interface, echo=TRUE, warning=FALSE}
YOUR_INTERFACE <- pae.interface(import_afm = YOUR_AFM_RESULTS, plldt = 50)
```

```{r interfaceTable, echo=TRUE, warning=FALSE}
colnames(YOUR_INTERFACE)

YOUR_INTERFACE %>% dplyr::select(A_length, B_length, A_protein, B_protein, complex, model, rank, pae)
```

### pae.boxplot()

The function `pae.boxplot()` will plot the lowest average PAE values from the eight clusters. Each dot represents an AFM predicted structural model, which typically predicts five models.

```{r PAEboxplot, warning=FALSE, out.width="50%", recho=TRUE}
pae.boxplot(pae_interface = YOUR_INTERFACE)
```

### afm.pisa.heatmap()

The function `afm.pisa.heatmap()` will plot for all complex, the minimum average PAE from the eight clusters, and the PDBePISA calculated solvation free energy (𝚫G), interface area and surface area for all AFM predicted structural models (typically five per complex).

```{r AFMPISAheatmap, echo=TRUE, warning=FALSE}
afm.pisa.heatmap(import_afm = YOUR_AFM_RESULTS, import_pisa = YOUR_PISA_INTERFACE)
```

## Evaluate AFM structural models for likelihood of interaction

If you provide your own reference set, follow the instructiosn of the [binaryPPIclassifier](https://github.com/philipptrepte/binary-PPI-classifier) package.

You can also predict the interaction probability of your AFM predicted structural complexes using the maSVM models provided herein, which were trained on 51 known interactions and 67 random protein pairs not known to interact (hsPRS-AF and hsRRS-AF).

### Load Models

```{r}
data("AFM_maSVM_models")
```

### Prepare data

```{r TESTSETtable, echo=TRUE, warning=FALSE}
YOUR_TEST_SET <- YOUR_PISA_INTERFACE %>%
  left_join(YOUR_INTERFACE %>%
              dplyr::select(A_protein, B_protein, complex, model, rank, pae),
            by = c("complex" = "complex", "rank" = "rank", "model" = "model")) %>%
  dplyr::mutate(deltaG = -deltaG, #invert deltaG
                pae = 40-pae) %>% #invert PAE
  # adjust columns to meet the binary-PPI-classifier input requirements
  dplyr::mutate(Donor = paste(A_protein, rank, sep = "_"),
                Donor_tag = "NA",
                Donor_protein = A_protein,
                Acceptor = paste(B_protein, rank, sep = "_"),
                Acceptor_tag = "NA",
                Acceptor_protein = B_protein,
                complex = "Covid",
                reference = "NA",
                interaction = paste0(A_protein, " + ", B_protein),
                sample = paste0(A_protein, "+", B_protein, "_"),
                orientation = paste0(Donor_tag, rank, "+", Acceptor_tag, rank)) %>%
  pivot_longer(cols = c(interfaceArea, deltaG, pae),
               names_to = "data", values_to = "score")

YOUR_TEST_SET %>% dplyr::select(complex, interaction, orientation, data, score)
```

```{r TESTSETmatrix, echo=TRUE, warning=FALSE}
TEST_MAT <- YOUR_TEST_SET %>%
  tidyr::unite(complex, interaction, sample, orientation, col = "sample", sep = ";") %>%
  tidyr::pivot_wider(names_from = data, values_from = score) %>%
  dplyr::filter(across(.cols = c('interfaceArea', 'pae'), ~!is.na(.x))) %>%
  tibble::column_to_rownames("sample") %>%
  dplyr::select(c('interfaceArea', 'pae')) %>%
  base::as.matrix()

TEST_MAT
```

### Predict Interaction Probability

```{r AFMprediction, echo=TRUE, warning=FALSE}
prediction <- data.frame()
for(i in 1:length(AFM_maSVM_models)) {
  tmp <- attr(stats::predict(AFM_maSVM_models[[i]], newdata = TEST_MAT,
                      decision.values = TRUE, probability = TRUE), "probabilities")
  tmp <- tmp %>%
    as.data.frame() %>%
    rownames_to_column("id") %>%
    tidyr::separate(col = "id",
                    into = c("complex", "interaction", "sample", "orientation"),
                    sep = ";")
  tmp <- cbind(tmp,i)
  prediction <- rbind(prediction, tmp)
  rm(tmp)
}

YOUR_AFM_PREDICTIONS <- prediction %>%
  group_by(interaction, orientation) %>%
  dplyr::summarise(probability = mean(`2`))

YOUR_AFM_PREDICTIONS
```

## Reference

[**AI-guided pipeline for protein–protein interaction drug discovery identifies a SARS-CoV-2 inhibitor**]{.underline}

Trepte P, Secker C, Olivet J, Blavier J, Kostova S, Maseko SB, Minia I, Ramos ES, Cassonnet P, Golusik S, et al (2024) AI-guided pipeline for protein–protein interaction drug discovery identifies a SARS-CoV-2 inhibitor. Mol Syst Biol: 1–30

https://doi.org/10.1038/s44320-024-00019-8

## License

Distributed under the MIT License. See `License.md` for more information.

## Contact

Philipp Trepte - [philipp.trepte\@imba.oeaw.ac.at](mailto:philipp.trepte@imba.oeaw.ac.at) - [LinkedIn](https://www.linkedin.com/in/philipp-trepte/)

AFM-PISA-classifier: https://github.com/philipptrepte/AFM-PISA-classifier

## Session Info

```{r}
sessionInfo()
```