Commit f2296d5

Final draft ML lesson with updated dataset (more similar to real dataset)
1 parent f0e9eb4 commit f2296d5

4 files changed

Lines changed: 701 additions & 683 deletions

_instructor_notes.txt

Lines changed: 14 additions & 1 deletion
@@ -1,4 +1,17 @@
1-
When creating the `intro_python` environment, also run the following line:
1+
When creating the `intro_python` environment, the following libraries need to be installed:
2+
3+
```
4+
jupyter
5+
nb_conda_kernels
6+
numpy
7+
pandas
8+
matplotlib
9+
scikit-learn
10+
seaborn
11+
python-graphviz
12+
```
13+
14+
Also run the following line:
215

316
```
417
python -m ipykernel install --user --name intro_python --display-name "intro_python"

docs/lessons/13_machine_learning.html

Lines changed: 495 additions & 474 deletions
Large diffs are not rendered by default.

lessons/13_machine_learning.qmd

Lines changed: 72 additions & 62 deletions
@@ -33,7 +33,9 @@ In this lesson, we will:
3333

3434
## Overview of lesson
3535

36-
When doing XYZ...
36+
Machine learning and AI are becoming increasingly important tools, and the life sciences are no exception. These tools can learn patterns from data without a human having to define the rules, which makes them powerful for making predictions and uncovering insights from complex datasets.
37+
38+
In this lesson, we will be using a machine learning algorithm called a **random forest classifier** to predict the cortical layer labels of cells in a spatial transcriptomics dataset. This is a great example of how machine learning can be used to make predictions and uncover insights from biological datasets. We will be using the `scikit-learn` library in Python to train our random forest classifier and evaluate its performance.
3739

3840
## Machine learning
3941

@@ -46,9 +48,33 @@ Machine learning is a subfield of artificial intelligence that focuses on traini
4648

4749
These algorithms can be used for a wide range of applications, including image recognition, natural language processing, and predictive modeling.
4850

51+
### Random forest classifiers
52+
53+
Random forests allow you to predict a categorical variable (cortical layer) based on one or more predictor variables (spatial coordinates and gene expression scores). To do so, the algorithm builds multiple decision trees, which are models that make predictions based on a series of binary decisions (`True` or `False`).
54+
55+
::: {#fig-decision_tree_example .figure}
56+
![](../img/decision_tree_example.png){width="80%"}
57+
58+
Example of a decision tree where the variables are age, weight, and smoker to predict risk level of a heart attack.<br>
59+
_Image source: [DataCamp](https://www.datacamp.com/tutorial/decision-tree-classification-python)_
60+
:::
61+
62+
These decision trees consist of decision nodes, which are the points where the data is split based on a predictor variable, and leaf nodes, which are the final predictions made by the tree.
63+
64+
The random forest algorithm generates many of these decision trees from the training data. Each tree is built on a random subset of the data and of the predictor variables, so each tree learns different patterns from the data. This randomness helps to reduce overfitting and improve the generalizability of the model.
65+
66+
Once the model has been trained, we can supply a new dataset and run it through the model. Each new data point is passed through every decision tree in the random forest, and each tree makes its own prediction. Then, a majority vote is taken across the decisions of all the trees to make the final prediction.
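
The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration of the idea: the `majority_vote` helper and the example votes are made up for this sketch. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, although the intuition is the same.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a list holding the (label, count) pair of the top label
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions from three trees for a single spot
votes = ["L2", "L3", "L2"]
final_label = majority_vote(votes)
print(final_label)
```

Two of the three hypothetical trees agree on `L2`, so that label becomes the final prediction for the spot.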
67+
68+
::: {#fig-random_forest_example .figure}
69+
![](../img/random_forest_algorithm.png){width="80%"}
70+
71+
Example of a random forest with 3 decision trees to generate a prediction based upon majority voting.<br>
72+
_Image source: [GeeksforGeeks](https://www.geeksforgeeks.org/random-forest-classifier-using-scikit-learn/)_
73+
:::
74+
4975
## Cortical layer dataset
5076

51-
To provide a real-world example of how to use machine learning in research, we will be using a synthetic [Visium HD](https://www.10xgenomics.com/platforms/visium/product-family) spatial transcriptomics dataset. In this experiment, you take a piece of tissue and lay it on a grid. Then, each spot is genomically sequenced such that you know **both the spatial location and gene expression** of each spot.
77+
To provide a real-world example of how to use machine learning in research, we will be using a synthetic [Visium HD](https://www.10xgenomics.com/platforms/visium/product-family) spatial transcriptomics dataset. In this type of experiment, you take a piece of tissue and lay it on a grid. Then, each spot on the grid is genomically sequenced such that you know **both the spatial location and gene expression** of each spot.
5278

5379
::: {#fig-visium_hd .figure}
5480
![](../img/visium.png){width=300}
@@ -62,7 +88,7 @@ _Image source: [10x Genomics](https://www.10xgenomics.com/blog/your-introduction
6288
::: column
6389
The human cortex is a great use case for this technology because the brain is divvied into layers 1-6 and a white matter layer. A good analogy would be to think of them as an onion, where the layers are stacked right on top of each other spatially (x, y coordinates).
6490

65-
The layers are relatively distinct in their spatial locations but also have genes that are expressed highly in one layer and not the others. This makes it a great use case for machine learning because we can use both the spatial location and gene expression to predict which layer a cell belongs to.
91+
The layers are relatively distinct in their spatial locations but also have genes that are expressed highly in one layer and not the others. This makes it a great use case for machine learning because we can use both the spatial location and gene expression to predict which layer a spot belongs to.
6692
:::
6793

6894
::: column
@@ -81,7 +107,7 @@ _Image source: [Rai et al. (2026)](https://www.biorxiv.org/content/10.64898/2026
81107

82108
### Cortical information
83109

84-
We have created a made-up dataset (as the data has not been published yet) based upon [this dataset](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full). Where layers are broken into 6 cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer. The dataset contains spatial coordinates of cells in the cortex, as well as the cortical layer that each cell belongs to.
110+
We have created a made-up dataset (as the data has not been published yet) based upon [this dataset](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full), in which the cortex is divided into 6 cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer.
85111

86112
```{python}
87113
#| label: tbl-load_cortical_data
@@ -96,23 +122,23 @@ from sklearn.model_selection import train_test_split
96122
from sklearn.metrics import accuracy_score, confusion_matrix
97123
98124
# Load synthetic cortical dataset
99-
df_cortical = pd.read_csv("data/synthetic_cortex_data.csv")
125+
df_cortical = pd.read_csv("data/synthetic_cortex_data_new.csv")
100126
df_cortical.head()
101127
```
102128

103129
We have the following columns in this dataset:
104130

105131
- `barcode`: A unique identifier for each spot in the dataset
106-
- `x`: The x coordinate of the cell's spatial location
107-
- `y`: The y coordinate of the cell's spatial location
108-
- `layer`: The cortical layer that the cell belongs to (L1, L2, L3, L4, L5, L6, WM)
132+
- `x`: The x coordinate of the spot's spatial location
133+
- `y`: The y coordinate of the spot's spatial location
134+
- `cortical_layer`: The cortical layer that the spot belongs to (L1, L2, L3, L4, L5, L6, WM)
109135

110-
As this is a spatial dataset, we can visualize where on the tissue each cell is located by plotting the x and y coordinates of each cell and coloring the points by the cortical layer that they belong to:
136+
As this is a spatial dataset, we can visualize where on the tissue each spot is located by plotting the x and y coordinates of each spot and coloring the points by the cortical layer that they belong to:
111137

112138
```{python}
113139
#| label: fig-cortical_layers
114-
#| fig-cap: "Spatial plot of the cortical cells colored by the cortical layer they belong to."
115-
# Plot the spatial locations of the cells colored by the cortical layer they belong to
140+
#| fig-cap: "Spatial plot of the cortical spots colored by the cortical layer they belong to."
141+
# Plot the spatial locations of the spots colored by the cortical layer they belong to
116142
sns.scatterplot(data=df_cortical,
117143
x="x", y="y",
118144
hue="cortical_layer",
@@ -129,7 +155,7 @@ plt.show()
129155

130156
So now we have a better idea of what the cross-section of the cortex looks like and where the different layers are located.
131157

132-
However, just using the x and y coordinates of each spot is not enough. You may have noticed that there apepars to be some mixing of layers near the boundaries. Luckily for us, the different cortical layers have known genes that are highly expressed in distinct layers. We can take a quick look at some canonical markers that are used to identify the different cortical layers:
158+
However, just using the x and y coordinates of each spot is not enough. You may have noticed that there appears to be some mixing of layers near the boundaries. Luckily for us, the different cortical layers have known genes that are uniquely expressed in specific layers. We can take a quick look at some of the marker genes used to identify the different cortical layers in the original paper that this dataset is based on:
133159

134160
::: {#fig-cortical_marker_genes .figure}
135161
![](../img/paper_cortical_markers.png){width=550}
@@ -138,7 +164,7 @@ Example of the spatial expression of known marker genes for each cortical layer.
138164
_Image source: [Rai et al. (2026)](https://www.biorxiv.org/content/10.64898/2026.01.12.698703v1.full)_
139165
:::
140166

141-
In the dataset, you have have noted that we also have columns: `AQP4`, `HPCAL1`, `FREM3`, `TRABD2A`, `KRT17`, and `MOBP`. These are the log-normalized expression values for those genes in each cell. Similar to the figure from above, we can visualize the expression of these marker genes in each cell (point) across the cortex to see the pattern of values across the different layers. This will give us a better idea of how we can use both the spatial location and gene expression to predict which layer a cell belongs to.
167+
In the dataset, you may have noticed that we also have the columns `AQP4`, `HPCAL1`, `FREM3`, `TRABD2A`, `KRT17`, and `MOBP`. These are the [log-normalized expression](https://hbctraining.github.io/Intro-to-scRNAseq/lessons/07_SCT_normalization.html#normalization) values for those genes in each spot. Similar to the figure above, we can visualize the expression of these marker genes in each spot (scatterplot point) across the cortex to see the pattern of values across the different layers. This will give us a better idea of how we can use both the spatial location and gene expression to predict which layer a spot belongs to.
142168

143169
```{python}
144170
#| label: fig-cortical_marker_genes
@@ -171,40 +197,21 @@ plt.show()
171197

172198
::: {.callout-note collapse="true"}
173199
# Making multiple subplots in a loop
174-
In the above code, we first initialized a plot with 2 rows and 3 columns - with the goal of plotting each of the 6 genes in our dataset. This creates an **array of plots** which we can then access to generate each of our plots.
200+
In the above code, we first initialized a plot with 2 rows and 3 columns, with the goal of plotting each of the 6 genes in our dataset. This creates a **2D array of plots** which we can then access to generate each of our plots.
175201

176-
So we could have proceeded using the `[]` indexing we have been using for matrices, but that is more complex as we would need to keep track of both rows and columns in the for loop. Instead, we can use the `flatten()` method to convert this **2D array of plots into a list of plots**, which is easier to index in the for loop.
202+
We could have proceeded using the `[row, column]` indexing we have been using for matrices, but that is more complex as we would need to keep track of both rows and columns in the for loop. Instead, we can use the `flatten()` method to convert this **2D array of plots into a 1D array (effectively a list) of plots**, which is easier to index in the for loop.
177203

178204
:::
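
The `flatten()` idea from the callout above can be sketched with a plain NumPy array standing in for the grid of plots. This is a minimal illustration, not the lesson's plotting code: the string placeholders such as `"r0c0"` are made up here to mark each panel's row and column, whereas `plt.subplots(2, 3)` would return an array of Axes objects with the same 2 x 3 shape.

```python
import numpy as np

genes = ["AQP4", "HPCAL1", "FREM3", "TRABD2A", "KRT17", "MOBP"]

# Stand-in for the 2 x 3 array of Axes returned by plt.subplots(2, 3);
# each placeholder string records the panel's row and column.
grid = np.array([["r0c0", "r0c1", "r0c2"],
                 ["r1c0", "r1c1", "r1c2"]])

# flatten() returns a 1D copy in row-major order, so a single loop
# (or zip) can pair each gene with one panel.
flat = grid.flatten()
panel_for_gene = {gene: panel for gene, panel in zip(genes, flat)}

print(flat.shape)
print(panel_for_gene["TRABD2A"])
```

Because the flattened order is row by row, the fourth gene lands in the first panel of the second row.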
179205

180206
**We will be using this synthetic dataset to train a random forest classifier to predict the cortical layer labels based on the spatial location and gene expression of each cell.**
181207

208+
## Training a random forest classifier
182209

183-
## Random forest classifiers
184-
185-
Random forests allow you to predict a categorical variable (cortical layer) based on one or more predictor variables (x and y coordinates). To do so, the algorithm builds multiple decision trees, which are models that make predictions based on a series of binary decisions (`True` or `False`).
186-
187-
::: {#fig-decision_tree_example .figure}
188-
![](../img/decision_tree_example.png){width="80%"}
189-
190-
Example of a decision tree where the variables are age, weight, and smoker to predict risk level of a heart attack.<br>
191-
_Image source: [DataCamp](https://www.datacamp.com/tutorial/decision-tree-classification-python)_
192-
:::
193-
194-
These decision trees comprise of decision nodes, which are the points where the data is split based on a predictor variable, and leaf nodes, which are the final predictions made by the tree.
195-
196-
Random forests build multiple decision trees and combine their predictions to improve accuracy and reduce overfitting. These trees are built on random subsets of the data. Then, a majority vote is taken across the final decision of all the trees to make the final prediction.
197-
198-
::: {#fig-random_forest_example .figure}
199-
![](../img/random_forest_algorithm.png){width="80%"}
200-
201-
Example of a random forest with 3 decision trees to generate a prediction based upon majority voting.<br>
202-
_Image source: [GeeksforGeeks](https://www.geeksforgeeks.org/random-forest-classifier-using-scikit-learn/)_
203-
:::
210+
Now that we have a basic understanding of how random forests work, let's dive into training one with our cortical dataset. We will be using the `RandomForestClassifier` class from the `sklearn.ensemble` module to train our model.
204211

205212
### Preparing training dataset
206213

207-
The _learning_ of machine learning comes from the fact that these algorithms must first learn patterns. This is accomplished by taking a subset of labelled data, the **training set**, to train the model. From this, the random forest classifier would learn how to predict the cortical layer of a cell based on its x, y coordinates and gene expression.
214+
The _learning_ in machine learning comes from the fact that these algorithms must first learn patterns. This is accomplished by taking a subset of labelled data, the **training set**, to train the model. From this, the random forest classifier learns how to predict the cortical layer of a spot based on its coordinates and gene expression. We also need a **test set** to evaluate the performance of the model. The test set is a subset of the labelled data that the model has not seen during training, and it is used to assess the accuracy of the model.
208215

209216
First, we are going to define the label we want to predict (`cortical_layer`) and the predictor variables we want to use to make that prediction (`x`, `y` coordinates and gene expression). These are often referred to as `y` and `X`, respectively.
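
As a rough sketch of this setup, the snippet below builds a tiny made-up table, separates it into `X` and `y`, and holds out part of it as a test set with scikit-learn's `train_test_split`. The toy values, the 25% test size, the `stratify` choice, and the `random_state` are all assumptions for illustration and may differ from what the lesson actually uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up table with the same kind of columns as the lesson's dataset
rng = np.random.default_rng(0)
n_spots = 40
toy = pd.DataFrame({
    "x": rng.uniform(0, 100, n_spots),
    "y": rng.uniform(0, 100, n_spots),
    "MOBP": rng.normal(0, 1, n_spots),
    "cortical_layer": ["L1", "WM"] * (n_spots // 2),
})

feature_cols = ["x", "y", "MOBP"]  # predictor variables
X = toy[feature_cols]
y = toy["cortical_layer"]          # label to predict

# Hold out 25% of the spots for testing; stratify keeps the proportion
# of each layer the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Stratifying on the label is one reasonable choice when some layers have far fewer spots than others, since it prevents a rare layer from ending up only in the training set or only in the test set.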
210217

@@ -256,7 +263,7 @@ rf = RandomForestClassifier(n_estimators=100,
256263
class_weight="balanced")
257264
```
258265

259-
Next, we will train the model using the `fit()` method, which takes in the predictor variables (x and y coordinates) and the target variable (cortical layer) from the training data.
266+
Next, we will train the model using the `fit()` method, which takes in the predictor variables (spatial coordinates and gene expression) and the target variable (cortical layer) from the training data.
260267

261268
```{python}
262269
#| label: train_model
@@ -267,7 +274,7 @@ rf.fit(X_train, y_train)
267274

268275
## Predict cortical layer labels
269276

270-
With this model, `rf`, we can now predict the cortical layer labels of the test dataset using the `predict()` method. Once again supplying the x and y coordinates, but this time for the prediction data instead of the training data. The model will use the patterns it learned from the training data to predict which cortical layer each unassigned cell belongs to based on its spatial location.
277+
With this model, `rf`, we can now predict the cortical layer labels of the test dataset using the `predict()` method. We once again supply the predictor variables, but this time for the **test data** instead of the training data. The model will use the patterns it learned from the training data to predict which cortical layer each spot belongs to.
271278

272279
```{python}
273280
#| label: predict_cortical_layer
@@ -290,34 +297,34 @@ It is a numpy array! So we can access the first few elements to see what the pre
290297
y_pred[0:5]
291298
```
292299

300+
So we see that the output of the `predict()` method is an array of predicted cortical layer labels for each spot in the test dataset.
293301

294302
::: {.callout-note collapse="true"}
295303
# Visualizing a _single_ decision tree in the random forest
296304

297305
```{python}
298306
#| label: visualize_decision_tree
299-
#| fig-cap: Visualization of a single decision tree from the random forest model.
300-
#| fig-width: 40
301-
#| fig-height: 15
302-
# Source - https://stackoverflow.com/a/61037626
303-
# Posted by Michael James Kali Galarnyk
304-
# Retrieved 2026-04-09, License - CC BY-SA 4.0
305-
from sklearn import tree
306-
307-
fn = df_cortical["cell_barcode"]
308-
cn = df_cortical["cortical_layer"]
309-
310-
fig, axes = plt.subplots(nrows = 1,
311-
ncols = 1,
312-
figsize = (70, 25),
313-
dpi=300)
314-
315-
tree.plot_tree(rf.estimators_[0],
316-
feature_names = fn,
317-
class_names = cn,
318-
filled = True)
319-
320-
plt.show()
307+
#| fig-cap: Visualization of a single decision tree from the random forest model (truncated to depth `max_depth`).
308+
from sklearn.tree import export_graphviz
309+
import graphviz
310+
311+
# Unique class names for cortical_layer, sorted to match the class order in rf.classes_
312+
cn = sorted(df_cortical["cortical_layer"].unique())
313+
314+
315+
dot_data = export_graphviz(rf.estimators_[0],
316+
out_file = None,
317+
feature_names = feature_cols,
318+
class_names = cn,
319+
filled = True,
320+
rounded = True,
321+
special_characters = True,
322+
impurity = False,
323+
proportion = True,
324+
max_depth = 3)
325+
326+
graph = graphviz.Source(dot_data)
327+
graph
321328
```
322329

323330
:::
@@ -343,7 +350,8 @@ Our accuracy is quite high! This tells us that our model is doing a good job at
343350
Confusion matrices are another way to evaluate the performance of a classifier. This table shows, for each true label, how many spots were assigned to each predicted label: counts on the diagonal are correct predictions, and counts off the diagonal are misclassifications.
344351

345352
```{python}
346-
#| label: calculate_confusion_matrix
353+
#| label: fig-confusion_matrix
354+
#| fig-cap: Confusion matrix to evaluate the performance of the random forest classifier on the test dataset for each cortical layer.
347355
class_names = sorted(y.unique())
348356
cm = confusion_matrix(y_test, y_pred, labels=class_names)
349357
@@ -365,6 +373,8 @@ plt.show()
365373

366374
This can help you understand which classes the model is doing well on and which classes it is struggling with. If your accuracy is low, you can look at the confusion matrix to see which classes are being misclassified and potentially adjust your model or data accordingly.
367375

376+
If we think about our dataset, it makes sense that we see the **most error in the prediction when layers are adjacent to one another**. This is because the boundary where one layer ends and the next begins is not always clear, and there may be some mixing of layers near the boundaries.
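
One way to check this observation numerically is to ask what fraction of all misclassifications fall between neighbouring layers. The sketch below uses a made-up confusion matrix (not the lesson's results) and assumes the labels are ordered from L1 through WM, which is the order `sorted()` produces for these names, so that entries one step off the diagonal correspond to adjacent layers.

```python
import numpy as np

# Layers ordered from the cortical surface inward (assumed ordering)
layers = ["L1", "L2", "L3", "L4", "L5", "L6", "WM"]

# Made-up confusion matrix: rows are true layers, columns are predicted layers
cm = np.array([
    [50,  3,  0,  0,  0,  0,  0],
    [ 2, 45,  4,  0,  0,  0,  0],
    [ 0,  5, 40,  3,  1,  0,  0],
    [ 0,  0,  4, 42,  3,  0,  0],
    [ 0,  0,  0,  2, 47,  4,  0],
    [ 0,  0,  0,  0,  3, 44,  2],
    [ 0,  0,  0,  0,  0,  1, 52],
])

# Everything off the main diagonal is a misclassification
total_errors = cm.sum() - np.trace(cm)

# Entries one step above or below the diagonal are confusions
# between neighbouring layers
adjacent_errors = np.trace(cm, offset=1) + np.trace(cm, offset=-1)

print(f"{adjacent_errors} of {total_errors} errors are between adjacent layers")
```

The same two `np.trace` calls could be applied to the `cm` array computed from the real test set, provided its labels are in anatomical order.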
377+
368378
## Next steps
369379

370380
With this model, you could try to predict the cortical layer labels of other datasets. You could also try to use different predictor variables (e.g. only gene expression or only spatial coordinates) to see how that affects the accuracy of the model's predictions.
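
The comparison of predictor sets suggested above could be set up along these lines. This is a sketch on made-up data, not the lesson's dataset: the toy table, the hypothetical `GENE_A` column, the three-layer setup, and all parameter values are assumptions chosen only to keep the example small and runnable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data: three layers stacked along the y axis, plus one
# hypothetical marker gene (GENE_A) that is higher in L2
rng = np.random.default_rng(0)
n = 300
depth = rng.uniform(0, 3, n)
layer = np.array(["L1", "L2", "L3"])[depth.astype(int)]
toy = pd.DataFrame({
    "x": rng.uniform(0, 1, n),
    "y": depth + rng.normal(0, 0.3, n),                 # noisy coordinate
    "GENE_A": (layer == "L2") + rng.normal(0, 0.5, n),  # noisy expression
    "cortical_layer": layer,
})

feature_sets = {
    "spatial only": ["x", "y"],
    "expression only": ["GENE_A"],
    "spatial + expression": ["x", "y", "GENE_A"],
}

# Use the same train/test split for every feature set so scores are comparable
train, test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["cortical_layer"]
)

scores = {}
for name, cols in feature_sets.items():
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(train[cols], train["cortical_layer"])
    predictions = model.predict(test[cols])
    scores[name] = accuracy_score(test["cortical_layer"], predictions)

print(scores)
```

Keeping the split and `random_state` fixed across the three models means any difference in accuracy comes from the choice of predictors rather than from a different sample of spots.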
