diff --git a/tutorials/vision_transformers.md b/tutorials/vision_transformers.md
index 59e77bbb..e33be289 100644
--- a/tutorials/vision_transformers.md
+++ b/tutorials/vision_transformers.md
@@ -31,7 +31,7 @@ Since the final classification is done on the class token computed in the last a
 the output will not be affected by the 14x14 channels in the last layer.
 The gradient of the output with respect to them, will be 0!
 
-We should chose any layer before the final attention block, for example:
+We should choose any layer before the final attention block, for example:
 ```python
 target_layers = [model.blocks[-1].norm1]
 ```
@@ -69,7 +69,7 @@ def reshape_transform(tensor, height=7, width=7):
 Since the swin transformer is different from ViT, it does not contains `cls_token` as present in ViT,
 therefore we will use all the 7x7 images we get from the last block of the last layer.
 
-We should chose any layer before the final attention block, for example:
+We should choose any layer before the final attention block, for example:
 ```python
 target_layers = [model.layers[-1].blocks[-1].norm1]
 ```
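
For reference (not part of the patch itself), here is a minimal sketch of how the `target_layers` and `reshape_transform` touched by this tutorial are typically wired into `pytorch-grad-cam` for a ViT. The `timm` model name, the random input tensor, and the class index are illustrative assumptions, not something the diff above prescribes.

```python
# Hedged sketch: assumes timm and pytorch-grad-cam are installed; the model
# choice, input, and target class below are illustrative only.
import torch
import timm
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget


def reshape_transform(tensor, height=14, width=14):
    # Drop the class token and reshape the remaining 14x14 tokens into a
    # CNN-style (batch, channels, height, width) activation map.
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    result = result.transpose(2, 3).transpose(1, 2)
    return result


model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

# As in the patch: pick a layer *before* the final attention block,
# so the 14x14 token activations still influence the output.
target_layers = [model.blocks[-1].norm1]

cam = GradCAM(model=model,
              target_layers=target_layers,
              reshape_transform=reshape_transform)

input_tensor = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(281)])  # arbitrary ImageNet class
```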