
Commit d607276

fix: Update numerical outputs, adjust PyTorch code examples, and correct typos across learning content files.
1 parent d6502ad commit d607276

File tree

6 files changed: +22 −22 lines changed


public/content/learn/attention-mechanism/applying-attention-weights/applying-attention-weights-content.md

Lines changed: 4 additions & 4 deletions
@@ -72,9 +72,9 @@ Think of values as the "payload" - the actual content we'll extract.
 output = attn_weights @ V
 
 print(output)
-# tensor([[2.2000, 3.2000],
-#         [2.8000, 3.8000],
-#         [2.6000, 3.6000]])
+# tensor([[2.4000, 3.4000],
+#         [3.2000, 4.2000],
+#         [2.8000, 3.8000]])
 ```
 
 **Shape transformation:**
@@ -96,7 +96,7 @@ Position 0 output:
 = [2.4, 3.4]
 ```
 
-**PyTorch output:** [2.2, 3.2] (small difference due to rounding in display)
+**PyTorch output:** [2.4, 3.4] (matches perfectly!)
 
 **What happened:**
 - Position 0 mostly retrieves from V[0] (weight 0.5)
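To check the corrected numbers by hand, here is a minimal standalone sketch. The `attn_weights` and `V` values below are assumptions (this hunk does not show them), chosen so the matrix product reproduces the updated printed output:

```python
import torch

# Assumed illustrative values, not taken from the lesson file:
# each row of attn_weights sums to 1; V holds the "payload" vectors.
attn_weights = torch.tensor([[0.5, 0.3, 0.2],
                             [0.2, 0.5, 0.3],
                             [0.4, 0.3, 0.3]])
V = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

output = attn_weights @ V  # (3, 3) @ (3, 2) -> (3, 2)
print(output)
# tensor([[2.4000, 3.4000],
#         [3.2000, 4.2000],
#         [2.8000, 3.8000]])
```

Row 0 works out to 0.5·[1, 2] + 0.3·[3, 4] + 0.2·[5, 6] = [2.4, 3.4], matching the corrected PyTorch output above.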

public/content/learn/attention-mechanism/calculating-attention-scores/calculating-attention-scores-content.md

Lines changed: 3 additions & 3 deletions
@@ -202,16 +202,16 @@ print(attn_weights)
 **After softmax (each row sums to 1):**
 ```
          Pos0   Pos1   Pos2
-Query0 [0.576, 0.212, 0.212]  ← Mostly attends to position 0
-Query1 [0.212, 0.576, 0.212]  ← Mostly attends to position 1
+Query0 [0.506, 0.186, 0.308]  ← Mostly attends to position 0
+Query1 [0.186, 0.506, 0.308]  ← Mostly attends to position 1
 Query2 [0.333, 0.333, 0.333]  ← Attends equally to all
 ```
 
 ### Understanding the Result
 
 **Position 0:**
 - Query matched Key0 best (score 2.0 before scaling)
-- After softmax: 57.6% attention to position 0
+- After softmax: 50.6% attention to position 0
 
 **Position 2:**
 - Query matched all keys equally (scores all 1.0)
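A quick way to sanity-check the corrected softmax rows is to compute softmax(scores / √d_k) directly. The raw scores and d_k below are assumptions consistent with the notes above ("score 2.0 before scaling", "scores all 1.0"); they reproduce the printed weights only approximately, since the displayed values are rounded.

```python
import torch
import torch.nn.functional as F

# Assumed raw scores (Q @ K.T): Query0 matches Key0 best,
# Query2 matches every key equally.
scores = torch.tensor([[2.0, 0.0, 1.0],
                       [0.0, 2.0, 1.0],
                       [1.0, 1.0, 1.0]])

d_k = 4  # assumed key dimension, so the scale factor sqrt(d_k) is 2
attn_weights = F.softmax(scores / d_k ** 0.5, dim=-1)

print(attn_weights)              # roughly the table above: ~[0.51, 0.19, 0.31] on row 0
print(attn_weights.sum(dim=-1))  # every row sums to 1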

public/content/learn/attention-mechanism/multi-head-attention/multi-head-attention-content.md

Lines changed: 5 additions & 5 deletions
@@ -64,7 +64,7 @@ import torch
 import torch.nn as nn
 
 # Single-head attention: One attention pattern
-single_head = nn.MultiheadAttention(embed_dim=512, num_heads=1)
+single_head = nn.MultiheadAttention(embed_dim=512, num_heads=1, batch_first=True)
 ```
 
 **With 1 head:**
@@ -74,7 +74,7 @@ single_head = nn.MultiheadAttention(embed_dim=512, num_heads=1)
 
 ```python
 # Multi-head attention: 8 parallel attention patterns!
-multi_head = nn.MultiheadAttention(embed_dim=512, num_heads=8)
+multi_head = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
 ```
 
 **With 8 heads:**
@@ -84,13 +84,13 @@ multi_head = nn.MultiheadAttention(embed_dim=512, num_heads=8)
 
 ```python
 # Test both
-x = torch.randn(10, 32, 512)  # (seq_len=10, batch=32, embed_dim=512)
+x = torch.randn(32, 10, 512)  # (batch=32, seq_len=10, embed_dim=512)
 
 single_output, _ = single_head(x, x, x)
 multi_output, _ = multi_head(x, x, x)
 
-print(f"Single head output: {single_output.shape}")  # torch.Size([10, 32, 512])
-print(f"Multi-head output: {multi_output.shape}")    # torch.Size([10, 32, 512])
+print(f"Single head output: {single_output.shape}")  # torch.Size([32, 10, 512])
+print(f"Multi-head output: {multi_output.shape}")    # torch.Size([32, 10, 512])
 ```
 
 **Same output shape!** But multi-head is more expressive.
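One practical note on the switch to `batch_first=True`: the layer also returns an attention-weight tensor, and with per-head weights you can see the 8 separate patterns. This sketch assumes a recent PyTorch release where `forward` accepts `average_attn_weights`.

```python
import torch
import torch.nn as nn

multi_head = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(32, 10, 512)  # (batch=32, seq_len=10, embed_dim=512)

# Each head works in a 512 // 8 = 64-dimensional slice of the embedding.
out, weights = multi_head(x, x, x, average_attn_weights=False)

print(out.shape)      # torch.Size([32, 10, 512])
print(weights.shape)  # torch.Size([32, 8, 10, 10]) - one 10x10 attention map per head
```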

public/content/learn/math/functions/functions-content.md

Lines changed: 6 additions & 6 deletions
@@ -53,7 +53,7 @@ For x = -1:
 
 f(-1) = 2(-1) + 3 = -2 + 3 = 1
 
-Now image a function that takes in "Cat sat on a" and returns "mat" - that function would be a lot more difficult to create, but neural networks (LLMs) can learn it.
+Now imagine a function that takes in "The cat sat on a" and returns "mat" - that function would be a lot more difficult to create, but neural networks (LLMs) can learn it.
 
 ### Example 2: Quadratic Function f(x) = x² + 2x + 1
 
@@ -109,7 +109,7 @@ Previous quadratic function will always give 9 if x=2 and nothing else.
 
 ## Code Examples
 
-Our 2 functions coded in python, if you are unfamiliar with python you can skip the code, next module will focus on python.
+Our 2 functions coded in Python, if you are unfamiliar with Python you can skip the code, next module will focus on Python.
 
 ```python
 # Linear function: f(x) = 2x + 3
@@ -348,23 +348,23 @@ def cosine_function(x):
 
 ![Trigonometric Functions](/content/learn/math/functions/trigonometric-functions.png)
 
-This is used in Rotory Positional Embeddings (RoPE) - LLM is using it to know the order of words (tokens) in the text.
+This is used in Rotary Positional Embeddings (RoPE) - LLM is using it to know the order of words (tokens) in the text.
 
 
 
 
 
 
 
 
-Functions are using in neural networks a lot: forward propagation, backward propagation, attention, activation functions, gradients, and many more.
+Functions are used in neural networks a lot: forward propagation, backward propagation, attention, activation functions, gradients, and many more.
 
 You don't need to learn them yet, just check them out.
 
 ### 1. Sigmoid Function
 
 ![Sigmoid Formula](/content/learn/math/functions/sigmoid-formula.png)
 
-**e** is a famous constant (Euler's number) used in math everywhere, it's value is approximately 2.718
+**e** is a famous constant (Euler's number) used in math everywhere, its value is approximately 2.718
 
 **f(x) = 1 / (1 + e^(-x))**
 
@@ -379,7 +379,7 @@ def sigmoid_derivative(x):
 
 ![Sigmoid Function and Derivative](/content/learn/math/functions/sigmoid-function-derivative.png)
 
-We will learn derivativers in the next lesson, but I included the images here - derivative tells you how fast the function is changing - you see that when sigmoid function is growing fastest (in the middle), the derivative value is spiking.
+We will learn derivatives in the next lesson, but I included the images here - derivative tells you how fast the function is changing - you see that when sigmoid function is growing fastest (in the middle), the derivative value is spiking.
 
 Just look at the slope of the function, if it's big (changing fast), the derivative will be big.
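For readers who want to try the sigmoid and its derivative from this hunk without opening the lesson file, here is a small self-contained sketch; it may differ from the lesson's own `sigmoid_derivative` implementation.

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); squashes any real number into (0, 1)
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    # Slope of the sigmoid at x; largest around x = 0, where the curve rises fastest
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid(0))             # 0.5
print(sigmoid_derivative(0))  # 0.25 - the "spike" in the middle of the derivative plot
print(sigmoid_derivative(5))  # ~0.0066 - the curve is nearly flat far from 0
```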

public/content/learn/neuron-from-scratch/making-a-prediction/making-a-prediction-content.md

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ input_data = torch.tensor([[1.0, 2.0]])  # New data point
 prediction = neuron(input_data)
 
 print(prediction)
-# tensor([[0.8176]]) ← Prediction!
+# tensor([[0.8581]]) ← Prediction!
 ```
 
 **Manual calculation:**
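The corrected value 0.8581 is sigmoid(1.8). The neuron's weights and bias are not shown in this hunk, so the parameters below are purely hypothetical, chosen to illustrate the manual check:

```python
import torch

# Hypothetical parameters that happen to reproduce the printed number:
w = torch.tensor([0.5, 0.6])
b = torch.tensor(0.1)
x = torch.tensor([1.0, 2.0])  # the new data point from the lesson

z = torch.dot(w, x) + b   # 0.5*1.0 + 0.6*2.0 + 0.1 = 1.8
print(torch.sigmoid(z))   # tensor(0.8581)
```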

public/content/learn/neuron-from-scratch/the-linear-step/the-linear-step-content.md

Lines changed: 3 additions & 3 deletions
@@ -77,7 +77,7 @@ z = torch.dot(w, x) + b
 # OR: z = (w * x).sum() + b
 
 print(z)
-# tensor(1.1000)
+# tensor(1.4000)
 ```
 
 **Manual calculation:**
@@ -367,12 +367,12 @@ with torch.no_grad():
 # Predict price
 predicted_price = price_neuron(house_features)
 print(predicted_price)
-# tensor([[540000.]]) ← $540,000 prediction
+# tensor([[590000.]]) ← $590,000 prediction
 
 # Manual calculation:
 # 2000×200 + 3×50000 + 10×(-1000) + 50000
 # = 400,000 + 150,000 - 10,000 + 50,000
-# = 590,000 (close to our result!)
+# = 590,000 (perfect match!)
 ```
 
 **What the weights learned:**
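The house-price correction can be verified straight from the manual calculation in the hunk: the weights 200, 50,000 and -1,000 and the bias 50,000 are stated there. The feature order below (square footage, bedrooms, age) is an assumption about what the lesson's `house_features` holds.

```python
import torch

w = torch.tensor([200.0, 50000.0, -1000.0])          # weights from the manual calculation
b = torch.tensor(50000.0)                            # bias from the manual calculation
house_features = torch.tensor([2000.0, 3.0, 10.0])   # assumed order: sqft, bedrooms, age

z = torch.dot(w, house_features) + b
print(z)  # tensor(590000.) - matches the corrected $590,000 prediction exactly
```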
