diff --git a/COURSE_STRUCTURE.md b/COURSE_STRUCTURE.md new file mode 100644 index 0000000..2a38e60 --- /dev/null +++ b/COURSE_STRUCTURE.md @@ -0,0 +1,253 @@ +# Course Structure Documentation + +## Overview + +The learning course has been reorganized to use **markdown files** stored in the `public/content/learn/` directory, following the same pattern as the blog posts. This makes it easy to manage content and add images. + +## ๐Ÿ“ File Structure + +``` +public/content/learn/ +โ”œโ”€โ”€ README.md # Documentation for content management +โ”œโ”€โ”€ math/ +โ”‚ โ”œโ”€โ”€ functions/ +โ”‚ โ”‚ โ”œโ”€โ”€ functions-content.md +โ”‚ โ”‚ โ””โ”€โ”€ [add your images here] +โ”‚ โ”œโ”€โ”€ derivatives/ +โ”‚ โ”‚ โ”œโ”€โ”€ derivatives-content.md +โ”‚ โ”‚ โ”œโ”€โ”€ derivative-graph.png +โ”‚ โ”‚ โ””โ”€โ”€ tangent-line.png +โ”‚ โ”œโ”€โ”€ vectors/ +โ”‚ โ”‚ โ”œโ”€โ”€ vectors-content.md +โ”‚ โ”‚ โ””โ”€โ”€ [images included] +โ”‚ โ”œโ”€โ”€ matrices/ +โ”‚ โ”‚ โ”œโ”€โ”€ matrices-content.md +โ”‚ โ”‚ โ””โ”€โ”€ [images included] +โ”‚ โ””โ”€โ”€ gradients/ +โ”‚ โ”œโ”€โ”€ gradients-content.md +โ”‚ โ””โ”€โ”€ [images included] +โ””โ”€โ”€ neural-networks/ + โ”œโ”€โ”€ introduction/ + โ”‚ โ”œโ”€โ”€ introduction-content.md + โ”‚ โ””โ”€โ”€ [add your images here] + โ”œโ”€โ”€ forward-propagation/ + โ”‚ โ”œโ”€โ”€ forward-propagation-content.md + โ”‚ โ””โ”€โ”€ [add your images here] + โ”œโ”€โ”€ backpropagation/ + โ”‚ โ”œโ”€โ”€ backpropagation-content.md + โ”‚ โ””โ”€โ”€ [add your images here] + โ””โ”€โ”€ training/ + โ”œโ”€โ”€ training-content.md + โ””โ”€โ”€ [add your images here] +``` + +## ๐ŸŽ“ Course Modules + +### Module 1: Mathematics Fundamentals + +1. **Functions** (`/learn/math/functions`) + - Linear functions + - Activation functions (Sigmoid, ReLU, Tanh) + - Loss functions + - Why non-linearity matters + +2. **Derivatives** (`/learn/math/derivatives`) + - What derivatives are + - Why they matter in AI + - Common derivative rules + - Practical examples with loss functions + +3. **Vectors** (`/learn/math/vectors`) + - What vectors are (magnitude and direction) + - Vector components and representation + - Vector operations (addition, scalar multiplication) + - Applications in machine learning + +4. **Matrices** (`/learn/math/matrices`) + - Matrix fundamentals + - Matrix operations (multiplication, transpose) + - Matrix transformations + - Role in neural networks + +5. **Gradients** (`/learn/math/gradients`) + - Understanding gradients + - Partial derivatives + - Gradient computation + - Gradient descent in optimization + +### Module 2: Neural Networks from Scratch + +1. **Introduction** (`/learn/neural-networks/introduction`) + - What neural networks are + - Basic architecture (input, hidden, output layers) + - How they learn + - Real-world applications + +2. **Forward Propagation** (`/learn/neural-networks/forward-propagation`) + - The forward pass process + - Weighted sums and activations + - Step-by-step numerical examples + - Matrix operations + +3. **Backpropagation** (`/learn/neural-networks/backpropagation`) + - The backpropagation algorithm + - Chain rule in action + - Gradient computation + - Common challenges (vanishing/exploding gradients) + +4. **Training & Optimization** (`/learn/neural-networks/training`) + - Gradient descent variants (SGD, mini-batch, batch) + - Advanced optimizers (Adam, RMSprop, Momentum) + - Hyperparameters and learning rate schedules + - Best practices and common pitfalls + +## ๐Ÿ› ๏ธ Technical Implementation + +### Components Created + +1. 
**LessonPage Component** (`components/lesson-page.tsx`) + - Reusable component that loads markdown content + - Handles frontmatter parsing + - Supports navigation between lessons + - Similar to blog post structure + +2. **Page Routes** (`app/learn/...`) + - Each lesson has a simple page component + - Uses `LessonPage` with configuration + - Clean and maintainable + +### How It Works + +1. **Markdown files** are stored in `public/content/learn/[category]/[lesson]/` +2. Each file has **frontmatter** with hero data (title, subtitle, tags) +3. **Images** are placed alongside the markdown files +4. **Page components** load the markdown using the `LessonPage` component +5. Images are referenced as `![alt](image.png)` and served from `/content/learn/...` + +### Example Markdown Frontmatter + +```markdown +--- +hero: + title: "Understanding Derivatives" + subtitle: "The Foundation of Neural Network Training" + tags: + - "๐Ÿ“ Mathematics" + - "โฑ๏ธ 10 min read" +--- + +# Your content here... + +![Derivative Graph](derivative-graph.png) +``` + +## ๐Ÿ“ Adding New Content + +### To Add a New Lesson: + +1. **Create folder structure:** + ```bash + mkdir -p public/content/learn/[category]/[lesson-name] + ``` + +2. **Create markdown file:** + ```bash + touch public/content/learn/[category]/[lesson-name]/[lesson-name]-content.md + ``` + +3. **Add frontmatter and content** to the markdown file + +4. **Add images** to the same folder + +5. **Create page component:** + ```tsx + // app/learn/[category]/[lesson-name]/page.tsx + import { LessonPage } from "@/components/lesson-page"; + + export default function YourLessonPage() { + return ( + + ); + } + ``` + +## ๐Ÿ–ผ๏ธ Adding Images + +### Placeholder Images Currently Referenced: + +**Math - Derivatives:** +- `derivative-graph.png` - Visual showing derivative as slope +- `tangent-line.png` - Tangent line illustration + +**Math - Functions:** +- `linear-function.png` - Linear function visualization +- `relu-function.png` - ReLU activation graph +- `function-composition.png` - Function composition diagram + +**Neural Networks - Introduction:** +- `neural-network-diagram.png` - Basic NN architecture +- `layer-types.png` - Input, hidden, output layers +- `training-process.png` - Training loop diagram +- `depth-vs-performance.png` - Network depth impact + +**Neural Networks - Forward Propagation:** +- `forward-prop-diagram.png` - Data flow diagram +- `forward-example.png` - Example calculation +- `activations-comparison.png` - Different activation functions +- `matrix-backprop.png` - Matrix operations + +**Neural Networks - Backpropagation:** +- `backprop-overview.png` - Algorithm overview +- `backprop-steps.png` - Step-by-step process +- `matrix-backprop.png` - Matrix form backprop + +**Neural Networks - Training:** +- `training-loop.png` - Training loop visualization +- `gradient-descent.png` - Gradient descent illustration +- `gd-variants.png` - GD variants comparison +- `optimizers-comparison.png` - Optimizer behaviors +- `lr-schedules.png` - Learning rate schedules +- `training-curves.png` - Loss/accuracy curves + +### To Add Your Images: + +1. Create your images (PNG or JPG recommended) +2. Place them in the appropriate lesson folder +3. They're already referenced in the markdown - just replace the placeholders! 
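
### Worked Example: A Complete Lesson Page

To make step 5 above concrete, here is a minimal sketch of a lesson page wired to the derivatives content. It assumes `contentPath` is the only required prop of `LessonPage` (the `prevLink`/`nextLink` props are optional and, per `components/lesson-page.tsx`, are auto-derived from the course structure when omitted); the `math/derivatives` value is an illustrative path matching the folder layout above:

```tsx
// app/learn/math/derivatives/page.tsx — illustrative sketch
import { LessonPage } from "@/components/lesson-page";

export default function DerivativesPage() {
  // contentPath resolves to public/content/learn/math/derivatives/derivatives-content.md
  return <LessonPage contentPath="math/derivatives" />;
}
```

The component fetches that markdown at runtime, parses the `hero` frontmatter, and renders prev/next navigation automatically.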
+ +## ๐ŸŽจ Design Features + +- **Beautiful gradient backgrounds** matching the site theme +- **Syntax highlighting** for code blocks +- **Responsive design** for mobile and desktop +- **Navigation** between lessons with prev/next buttons +- **Markdown rendering** with support for: + - Headings, paragraphs, lists + - Code blocks + - Images + - Tables + - Math formulas (using KaTeX in MarkdownRenderer) + +## ๐Ÿš€ Next Steps + +1. **Add your images** - Replace placeholder PNG files with actual visualizations +2. **Expand content** - Add more lessons or modules as needed +3. **Test on localhost** - Visit `/learn` to see the course +4. **Customize styling** - Adjust colors/gradients in the components if desired + +## ๐Ÿ“‹ Summary + +โœ… Course structure created with 9 lessons (5 math + 4 neural networks) +โœ… Markdown files in `public/content/learn/` +โœ… Reusable `LessonPage` component +โœ… Images ready for math lessons (vectors, matrices, gradients) +โœ… Navigation between lessons +โœ… Frontmatter support for hero sections +โœ… README documentation in content folder + +Your course is ready with comprehensive math fundamentals! ๐ŸŽ‰ + diff --git a/app/learn/activation-functions/relu/page.tsx b/app/learn/activation-functions/relu/page.tsx new file mode 100644 index 0000000..5daa632 --- /dev/null +++ b/app/learn/activation-functions/relu/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ReluPage() { + return ( + + ); +} + diff --git a/app/learn/activation-functions/sigmoid/page.tsx b/app/learn/activation-functions/sigmoid/page.tsx new file mode 100644 index 0000000..68e1726 --- /dev/null +++ b/app/learn/activation-functions/sigmoid/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function SigmoidPage() { + return ( + + ); +} + diff --git a/app/learn/activation-functions/silu/page.tsx b/app/learn/activation-functions/silu/page.tsx new file mode 100644 index 0000000..6d215c8 --- /dev/null +++ b/app/learn/activation-functions/silu/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function SiluPage() { + return ( + + ); +} + diff --git a/app/learn/activation-functions/softmax/page.tsx b/app/learn/activation-functions/softmax/page.tsx new file mode 100644 index 0000000..5f74f3c --- /dev/null +++ b/app/learn/activation-functions/softmax/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function SoftmaxPage() { + return ( + + ); +} + diff --git a/app/learn/activation-functions/swiglu/page.tsx b/app/learn/activation-functions/swiglu/page.tsx new file mode 100644 index 0000000..4e6656a --- /dev/null +++ b/app/learn/activation-functions/swiglu/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function SwigluPage() { + return ( + + ); +} + diff --git a/app/learn/activation-functions/tanh/page.tsx b/app/learn/activation-functions/tanh/page.tsx new file mode 100644 index 0000000..51fefa4 --- /dev/null +++ b/app/learn/activation-functions/tanh/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TanhPage() { + return ( + + ); +} + diff --git a/app/learn/attention-mechanism/applying-attention-weights/page.tsx b/app/learn/attention-mechanism/applying-attention-weights/page.tsx new file mode 100644 index 0000000..11e0f91 --- /dev/null +++ 
b/app/learn/attention-mechanism/applying-attention-weights/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ApplyingAttentionWeightsPage() { + return ( + + ); +} + diff --git a/app/learn/attention-mechanism/attention-in-code/page.tsx b/app/learn/attention-mechanism/attention-in-code/page.tsx new file mode 100644 index 0000000..0bb3f76 --- /dev/null +++ b/app/learn/attention-mechanism/attention-in-code/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function AttentionInCodePage() { + return ( + + ); +} + diff --git a/app/learn/attention-mechanism/calculating-attention-scores/page.tsx b/app/learn/attention-mechanism/calculating-attention-scores/page.tsx new file mode 100644 index 0000000..6058f59 --- /dev/null +++ b/app/learn/attention-mechanism/calculating-attention-scores/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function CalculatingAttentionScoresPage() { + return ( + + ); +} + diff --git a/app/learn/attention-mechanism/multi-head-attention/page.tsx b/app/learn/attention-mechanism/multi-head-attention/page.tsx new file mode 100644 index 0000000..2b3d895 --- /dev/null +++ b/app/learn/attention-mechanism/multi-head-attention/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function MultiHeadAttentionPage() { + return ( + + ); +} + diff --git a/app/learn/attention-mechanism/self-attention-from-scratch/page.tsx b/app/learn/attention-mechanism/self-attention-from-scratch/page.tsx new file mode 100644 index 0000000..0d31494 --- /dev/null +++ b/app/learn/attention-mechanism/self-attention-from-scratch/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function SelfAttentionFromScratchPage() { + return ( + + ); +} + diff --git a/app/learn/attention-mechanism/what-is-attention/page.tsx b/app/learn/attention-mechanism/what-is-attention/page.tsx new file mode 100644 index 0000000..799ee0f --- /dev/null +++ b/app/learn/attention-mechanism/what-is-attention/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function WhatIsAttentionPage() { + return ( + + ); +} + diff --git a/app/learn/building-a-transformer/building-a-transformer-block/page.tsx b/app/learn/building-a-transformer/building-a-transformer-block/page.tsx new file mode 100644 index 0000000..b684901 --- /dev/null +++ b/app/learn/building-a-transformer/building-a-transformer-block/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function BuildingATransformerBlockPage() { + return ( + + ); +} + diff --git a/app/learn/building-a-transformer/full-transformer-in-code/page.tsx b/app/learn/building-a-transformer/full-transformer-in-code/page.tsx new file mode 100644 index 0000000..fc0a45a --- /dev/null +++ b/app/learn/building-a-transformer/full-transformer-in-code/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function FullTransformerInCodePage() { + return ( + + ); +} + diff --git a/app/learn/building-a-transformer/rope-positional-encoding/page.tsx b/app/learn/building-a-transformer/rope-positional-encoding/page.tsx new file mode 100644 index 0000000..0e02d0b --- /dev/null +++ b/app/learn/building-a-transformer/rope-positional-encoding/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export 
default function RopePositionalEncodingPage() { + return ( + + ); +} + diff --git a/app/learn/building-a-transformer/the-final-linear-layer/page.tsx b/app/learn/building-a-transformer/the-final-linear-layer/page.tsx new file mode 100644 index 0000000..3e6c49d --- /dev/null +++ b/app/learn/building-a-transformer/the-final-linear-layer/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheFinalLinearLayerPage() { + return ( + + ); +} + diff --git a/app/learn/building-a-transformer/training-a-transformer/page.tsx b/app/learn/building-a-transformer/training-a-transformer/page.tsx new file mode 100644 index 0000000..2fe2616 --- /dev/null +++ b/app/learn/building-a-transformer/training-a-transformer/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TrainingATransformerPage() { + return ( + + ); +} + diff --git a/app/learn/building-a-transformer/transformer-architecture/page.tsx b/app/learn/building-a-transformer/transformer-architecture/page.tsx new file mode 100644 index 0000000..741ece3 --- /dev/null +++ b/app/learn/building-a-transformer/transformer-architecture/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TransformerArchitecturePage() { + return ( + + ); +} + diff --git a/app/learn/math/derivatives/page.tsx b/app/learn/math/derivatives/page.tsx new file mode 100644 index 0000000..7097874 --- /dev/null +++ b/app/learn/math/derivatives/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function DerivativesPage() { + return ( + + ); +} + diff --git a/app/learn/math/functions/page.tsx b/app/learn/math/functions/page.tsx new file mode 100644 index 0000000..669be2b --- /dev/null +++ b/app/learn/math/functions/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function FunctionsPage() { + return ( + + ); +} + diff --git a/app/learn/math/gradients/page.tsx b/app/learn/math/gradients/page.tsx new file mode 100644 index 0000000..f1c3022 --- /dev/null +++ b/app/learn/math/gradients/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function GradientsPage() { + return ( + + ); +} + diff --git a/app/learn/math/matrices/page.tsx b/app/learn/math/matrices/page.tsx new file mode 100644 index 0000000..6d72757 --- /dev/null +++ b/app/learn/math/matrices/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function MatricesPage() { + return ( + + ); +} + diff --git a/app/learn/math/vectors/page.tsx b/app/learn/math/vectors/page.tsx new file mode 100644 index 0000000..62bbdb2 --- /dev/null +++ b/app/learn/math/vectors/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function VectorsPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/architecture-of-a-network/page.tsx b/app/learn/neural-networks/architecture-of-a-network/page.tsx new file mode 100644 index 0000000..53930b5 --- /dev/null +++ b/app/learn/neural-networks/architecture-of-a-network/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ArchitectureOfANetworkPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/backpropagation-in-action/page.tsx b/app/learn/neural-networks/backpropagation-in-action/page.tsx new file mode 100644 index 
0000000..c53ce8d --- /dev/null +++ b/app/learn/neural-networks/backpropagation-in-action/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function BackpropagationInActionPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/building-a-layer/page.tsx b/app/learn/neural-networks/building-a-layer/page.tsx new file mode 100644 index 0000000..c549ff2 --- /dev/null +++ b/app/learn/neural-networks/building-a-layer/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function BuildingALayerPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/calculating-gradients/page.tsx b/app/learn/neural-networks/calculating-gradients/page.tsx new file mode 100644 index 0000000..3ad5ff1 --- /dev/null +++ b/app/learn/neural-networks/calculating-gradients/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function CalculatingGradientsPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/implementing-a-network/page.tsx b/app/learn/neural-networks/implementing-a-network/page.tsx new file mode 100644 index 0000000..0fbb5fd --- /dev/null +++ b/app/learn/neural-networks/implementing-a-network/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ImplementingANetworkPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/implementing-backpropagation/page.tsx b/app/learn/neural-networks/implementing-backpropagation/page.tsx new file mode 100644 index 0000000..42f74f1 --- /dev/null +++ b/app/learn/neural-networks/implementing-backpropagation/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ImplementingBackpropagationPage() { + return ( + + ); +} + diff --git a/app/learn/neural-networks/the-chain-rule/page.tsx b/app/learn/neural-networks/the-chain-rule/page.tsx new file mode 100644 index 0000000..2631498 --- /dev/null +++ b/app/learn/neural-networks/the-chain-rule/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheChainRulePage() { + return ( + + ); +} + diff --git a/app/learn/neuron-from-scratch/building-a-neuron-in-python/page.tsx b/app/learn/neuron-from-scratch/building-a-neuron-in-python/page.tsx new file mode 100644 index 0000000..b7967c1 --- /dev/null +++ b/app/learn/neuron-from-scratch/building-a-neuron-in-python/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function BuildingANeuronInPythonPage() { + return ( + + ); +} + diff --git a/app/learn/neuron-from-scratch/making-a-prediction/page.tsx b/app/learn/neuron-from-scratch/making-a-prediction/page.tsx new file mode 100644 index 0000000..0b65430 --- /dev/null +++ b/app/learn/neuron-from-scratch/making-a-prediction/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function MakingAPredictionPage() { + return ( + + ); +} + diff --git a/app/learn/neuron-from-scratch/the-activation-function/page.tsx b/app/learn/neuron-from-scratch/the-activation-function/page.tsx new file mode 100644 index 0000000..3fd78a2 --- /dev/null +++ b/app/learn/neuron-from-scratch/the-activation-function/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheActivationFunctionPage() { + return ( + + ); +} + diff --git 
a/app/learn/neuron-from-scratch/the-concept-of-learning/page.tsx b/app/learn/neuron-from-scratch/the-concept-of-learning/page.tsx new file mode 100644 index 0000000..2bede78 --- /dev/null +++ b/app/learn/neuron-from-scratch/the-concept-of-learning/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheConceptOfLearningPage() { + return ( + + ); +} + diff --git a/app/learn/neuron-from-scratch/the-concept-of-loss/page.tsx b/app/learn/neuron-from-scratch/the-concept-of-loss/page.tsx new file mode 100644 index 0000000..9cea839 --- /dev/null +++ b/app/learn/neuron-from-scratch/the-concept-of-loss/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheConceptOfLossPage() { + return ( + + ); +} + diff --git a/app/learn/neuron-from-scratch/the-linear-step/page.tsx b/app/learn/neuron-from-scratch/the-linear-step/page.tsx new file mode 100644 index 0000000..1a3c919 --- /dev/null +++ b/app/learn/neuron-from-scratch/the-linear-step/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheLinearStepPage() { + return ( + + ); +} + diff --git a/app/learn/neuron-from-scratch/what-is-a-neuron/page.tsx b/app/learn/neuron-from-scratch/what-is-a-neuron/page.tsx new file mode 100644 index 0000000..0d2adda --- /dev/null +++ b/app/learn/neuron-from-scratch/what-is-a-neuron/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function WhatIsANeuronPage() { + return ( + + ); +} + diff --git a/app/learn/page.tsx b/app/learn/page.tsx new file mode 100644 index 0000000..3962012 --- /dev/null +++ b/app/learn/page.tsx @@ -0,0 +1,1225 @@ +'use client'; + +import Link from "next/link"; +import { useLanguage } from "@/components/providers/language-provider"; + +export default function LearnPage() { + const { language } = useLanguage(); + + return ( +
+ {/* Hero Section */} +
+
+ +
+
+

+ + {language === 'en' ? 'Learn Everything You Need To Be An AI Researcher' : '从零开始学习AI'} + 

+

+ {language === 'en' + ? 'Master the fundamentals and publish your own papers' + : '掌握基础知识，构建你自己的神经网络'} +

+
+

+ {language === 'en' + ? 'Under active development, some parts are AI generated and not reviewed yet. In the end everything will be carefully reviewed and rewritten by humans to the highest quality.' + : '正在积极开发中，部分内容由AI生成尚未审核。最终所有内容都将由人工仔细审核和重写，确保最高质量'} +

+
+
+
+
+ + {/* Course Modules */} +
+
+
+ + {/* Math Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'Mathematics Fundamentals' : '数学基础'} +

+

+ {language === 'en' ? 'Essential math concepts for AI' : 'AI必备的数学概念'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'Functions' : 'ๅ‡ฝๆ•ฐ'} +

+ + + +
+

+ {language === 'en' + ? 'Linear, quadratic, and activation functions' + : '็บฟๆ€งใ€ไบŒๆฌกๅ’Œๆฟ€ๆดปๅ‡ฝๆ•ฐ'} +

+ + + +
+

+ 2.{language === 'en' ? 'Derivatives' : 'ๅฏผๆ•ฐ'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding rates of change and gradients' + : '็†่งฃๅ˜ๅŒ–็އๅ’Œๆขฏๅบฆ'} +

+ + + +
+

+ 3.{language === 'en' ? 'Vectors' : 'ๅ‘้‡'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding magnitude, direction, and vector operations' + : '็†่งฃๅคงๅฐใ€ๆ–นๅ‘ๅ’Œๅ‘้‡่ฟ็ฎ—'} +

+ + + +
+

+ 4.{language === 'en' ? 'Matrices' : '็Ÿฉ้˜ต'} +

+ + + +
+

+ {language === 'en' + ? 'Matrix operations and transformations' + : '็Ÿฉ้˜ต่ฟ็ฎ—ๅ’Œๅ˜ๆข'} +

+ + + +
+

+ 5.{language === 'en' ? 'Gradients' : 'ๆขฏๅบฆ'} +

+ + + +
+

+ {language === 'en' + ? 'Partial derivatives and gradient descent' + : 'ๅๅฏผๆ•ฐๅ’Œๆขฏๅบฆไธ‹้™'} +

+ +
+
+ + {/* PyTorch Fundamentals Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'PyTorch Fundamentals' : 'PyTorch基础'} +

+

+ {language === 'en' ? 'Working with tensors and PyTorch basics' : '使用张量和PyTorch基础'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'Creating Tensors' : 'ๅˆ›ๅปบๅผ ้‡'} +

+ + + +
+

+ {language === 'en' + ? 'Building blocks of deep learning' + : 'ๆทฑๅบฆๅญฆไน ็š„ๅŸบๆœฌๆž„ๅปบๅ—'} +

+ + + +
+

+ 2.{language === 'en' ? 'Tensor Addition' : 'ๅผ ้‡ๅŠ ๆณ•'} +

+ + + +
+

+ {language === 'en' + ? 'Element-wise operations on tensors' + : 'ๅผ ้‡็š„้€ๅ…ƒ็ด ่ฟ็ฎ—'} +

+ + + +
+

+ 3.{language === 'en' ? 'Matrix Multiplication' : '็Ÿฉ้˜ตไน˜ๆณ•'} +

+ + + +
+

+ {language === 'en' + ? 'The core operation in neural networks' + : '็ฅž็ป็ฝ‘็ปœไธญ็š„ๆ ธๅฟƒ่ฟ็ฎ—'} +

+ + + +
+

+ 4.{language === 'en' ? 'Transposing Tensors' : 'ๅผ ้‡่ฝฌ็ฝฎ'} +

+ + + +
+

+ {language === 'en' + ? 'Flipping dimensions and axes' + : '็ฟป่ฝฌ็ปดๅบฆๅ’Œ่ฝด'} +

+ + + +
+

+ 5.{language === 'en' ? 'Reshaping Tensors' : 'ๅผ ้‡้‡ๅก‘'} +

+ + + +
+

+ {language === 'en' + ? 'Changing tensor dimensions' + : 'ๆ”นๅ˜ๅผ ้‡็ปดๅบฆ'} +

+ + + +
+

+ 6.{language === 'en' ? 'Indexing and Slicing' : '็ดขๅผ•ๅ’Œๅˆ‡็‰‡'} +

+ + + +
+

+ {language === 'en' + ? 'Accessing and extracting tensor elements' + : '่ฎฟ้—ฎๅ’Œๆๅ–ๅผ ้‡ๅ…ƒ็ด '} +

+ + + +
+

+ 7.{language === 'en' ? 'Concatenating Tensors' : 'ๅผ ้‡ๆ‹ผๆŽฅ'} +

+ + + +
+

+ {language === 'en' + ? 'Combining multiple tensors' + : '็ป„ๅˆๅคšไธชๅผ ้‡'} +

+ + + +
+

+ 8.{language === 'en' ? 'Creating Special Tensors' : 'ๅˆ›ๅปบ็‰นๆฎŠๅผ ้‡'} +

+ + + +
+

+ {language === 'en' + ? 'Zeros, ones, identity matrices and more' + : '้›ถๅผ ้‡ใ€ๅ•ไฝๅผ ้‡ใ€ๅ•ไฝ็Ÿฉ้˜ต็ญ‰'} +

+ +
+
+ + {/* Neuron From Scratch Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'Neuron From Scratch' : '从零开始构建神经元'} +

+

+ {language === 'en' ? 'Understanding the fundamental unit of neural networks' : '理解神经网络的基本单元'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'What is a Neuron' : 'ไป€ไนˆๆ˜ฏ็ฅž็ปๅ…ƒ'} +

+ + + +
+

+ {language === 'en' + ? 'The basic building block of neural networks' + : '็ฅž็ป็ฝ‘็ปœ็š„ๅŸบๆœฌๆž„ๅปบๅ—'} +

+ + + +
+

+ 2.{language === 'en' ? 'The Linear Step' : '็บฟๆ€งๆญฅ้ชค'} +

+ + + +
+

+ {language === 'en' + ? 'Weighted sums and bias in neurons' + : '็ฅž็ปๅ…ƒไธญ็š„ๅŠ ๆƒๅ’Œๅ’Œๅ็ฝฎ'} +

+ + + +
+

+ 3.{language === 'en' ? 'The Activation Function' : 'ๆฟ€ๆดปๅ‡ฝๆ•ฐ'} +

+ + + +
+

+ {language === 'en' + ? 'Introducing non-linearity to neurons' + : 'ไธบ็ฅž็ปๅ…ƒๅผ•ๅ…ฅ้ž็บฟๆ€ง'} +

+ + + +
+

+ 4.{language === 'en' ? 'Building a Neuron in Python' : '็”จPythonๆž„ๅปบ็ฅž็ปๅ…ƒ'} +

+ + + +
+

+ {language === 'en' + ? 'Implementing a single neuron from scratch' + : 'ไปŽ้›ถๅผ€ๅง‹ๅฎž็Žฐๅ•ไธช็ฅž็ปๅ…ƒ'} +

+ + + +
+

+ 5.{language === 'en' ? 'Making a Prediction' : '่ฟ›่กŒ้ข„ๆต‹'} +

+ + + +
+

+ {language === 'en' + ? 'How a neuron processes input to output' + : '็ฅž็ปๅ…ƒๅฆ‚ไฝ•ๅค„็†่พ“ๅ…ฅๅˆฐ่พ“ๅ‡บ'} +

+ + + +
+

+ 6.{language === 'en' ? 'The Concept of Loss' : 'ๆŸๅคฑๆฆ‚ๅฟต'} +

+ + + +
+

+ {language === 'en' + ? 'Measuring prediction error' + : 'ๆต‹้‡้ข„ๆต‹่ฏฏๅทฎ'} +

+ + + +
+

+ 7.{language === 'en' ? 'The Concept of Learning' : 'ๅญฆไน ๆฆ‚ๅฟต'} +

+ + + +
+

+ {language === 'en' + ? 'How neurons adjust their parameters' + : '็ฅž็ปๅ…ƒๅฆ‚ไฝ•่ฐƒๆ•ดๅ…ถๅ‚ๆ•ฐ'} +

+ +
+
+ + {/* Activation Functions Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'Activation Functions' : '激活函数'} +

+

+ {language === 'en' ? 'Understanding different activation functions' : '理解不同的激活函数'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'ReLU' : 'ReLU'} +

+ + + +
+

+ {language === 'en' + ? 'Rectified Linear Unit - The most popular activation function' + : 'ไฟฎๆญฃ็บฟๆ€งๅ•ๅ…ƒ - ๆœ€ๆต่กŒ็š„ๆฟ€ๆดปๅ‡ฝๆ•ฐ'} +

+ + + +
+

+ 2.{language === 'en' ? 'Sigmoid' : 'Sigmoid'} +

+ + + +
+

+ {language === 'en' + ? 'The classic S-shaped activation function' + : '็ปๅ…ธ็š„Sๅฝขๆฟ€ๆดปๅ‡ฝๆ•ฐ'} +

+ + + +
+

+ 3.{language === 'en' ? 'Tanh' : 'Tanh'} +

+ + + +
+

+ {language === 'en' + ? 'Hyperbolic tangent - Zero-centered activation' + : 'ๅŒๆ›ฒๆญฃๅˆ‡ - ้›ถไธญๅฟƒๆฟ€ๆดป'} +

+ + + +
+

+ 4.{language === 'en' ? 'SiLU' : 'SiLU'} +

+ + + +
+

+ {language === 'en' + ? 'Sigmoid Linear Unit - The Swish activation' + : 'Sigmoid็บฟๆ€งๅ•ๅ…ƒ - Swishๆฟ€ๆดป'} +

+ + + +
+

+ 5.{language === 'en' ? 'SwiGLU' : 'SwiGLU'} +

+ + + +
+

+ {language === 'en' + ? 'Swish-Gated Linear Unit - Advanced activation' + : 'Swish้—จๆŽง็บฟๆ€งๅ•ๅ…ƒ - ้ซ˜็บงๆฟ€ๆดป'} +

+ + + +
+

+ 6.{language === 'en' ? 'Softmax' : 'Softmax'} +

+ + + +
+

+ {language === 'en' + ? 'Multi-class classification activation function' + : 'ๅคš็ฑปๅˆ†็ฑปๆฟ€ๆดปๅ‡ฝๆ•ฐ'} +

+ +
+
+ + {/* Neural Networks Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'Neural Networks from Scratch' : '从零开始的神经网络'} +

+

+ {language === 'en' ? 'Build neural networks from the ground up' : '从头构建神经网络'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'Architecture of a Network' : '็ฝ‘็ปœๆžถๆž„'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding neural network structure and design' + : '็†่งฃ็ฅž็ป็ฝ‘็ปœ็ป“ๆž„ๅ’Œ่ฎพ่ฎก'} +

+ + + +
+

+ 2.{language === 'en' ? 'Building a Layer' : 'ๆž„ๅปบๅฑ‚'} +

+ + + +
+

+ {language === 'en' + ? 'Constructing individual network layers' + : 'ๆž„ๅปบๅ•ไธช็ฝ‘็ปœๅฑ‚'} +

+ + + +
+

+ 3.{language === 'en' ? 'Implementing a Network' : 'ๅฎž็Žฐ็ฝ‘็ปœ'} +

+ + + +
+

+ {language === 'en' + ? 'Putting together a complete neural network' + : '็ป„่ฃ…ๅฎŒๆ•ด็š„็ฅž็ป็ฝ‘็ปœ'} +

+ + + +
+

+ 4.{language === 'en' ? 'The Chain Rule' : '้“พๅผๆณ•ๅˆ™'} +

+ + + +
+

+ {language === 'en' + ? 'Mathematical foundation of backpropagation' + : 'ๅๅ‘ไผ ๆ’ญ็š„ๆ•ฐๅญฆๅŸบ็ก€'} +

+ + + +
+

+ 5.{language === 'en' ? 'Calculating Gradients' : '่ฎก็ฎ—ๆขฏๅบฆ'} +

+ + + +
+

+ {language === 'en' + ? 'Computing derivatives for network training' + : '่ฎก็ฎ—็ฝ‘็ปœ่ฎญ็ปƒ็š„ๅฏผๆ•ฐ'} +

+ + + +
+

+ 6.{language === 'en' ? 'Backpropagation in Action' : 'ๅๅ‘ไผ ๆ’ญๅฎžๆˆ˜'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding the backpropagation algorithm' + : '็†่งฃๅๅ‘ไผ ๆ’ญ็ฎ—ๆณ•'} +

+ + + +
+

+ 7.{language === 'en' ? 'Implementing Backpropagation' : 'ๅฎž็Žฐๅๅ‘ไผ ๆ’ญ'} +

+ + + +
+

+ {language === 'en' + ? 'Coding the backpropagation algorithm from scratch' + : 'ไปŽ้›ถๅผ€ๅง‹็ผ–ๅ†™ๅๅ‘ไผ ๆ’ญ็ฎ—ๆณ•'} +

+ +
+
+ + {/* Attention Mechanism Module */} +
+
+
+ + + + +
+
+

+ {language === 'en' ? 'Attention Mechanism' : '注意力机制'} +

+

+ {language === 'en' ? 'Understanding attention and self-attention' : '理解注意力和自注意力'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'What is Attention' : 'ไป€ไนˆๆ˜ฏๆณจๆ„ๅŠ›'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding the attention mechanism' + : '็†่งฃๆณจๆ„ๅŠ›ๆœบๅˆถ'} +

+ + + +
+

+ 2.{language === 'en' ? 'Self Attention from Scratch' : 'ไปŽ้›ถๅผ€ๅง‹่‡ชๆณจๆ„ๅŠ›'} +

+ + + +
+

+ {language === 'en' + ? 'Building self-attention from the ground up' + : 'ไปŽ้›ถๅผ€ๅง‹ๆž„ๅปบ่‡ชๆณจๆ„ๅŠ›'} +

+ + + +
+

+ 3.{language === 'en' ? 'Calculating Attention Scores' : '่ฎก็ฎ—ๆณจๆ„ๅŠ›ๅˆ†ๆ•ฐ'} +

+ + + +
+

+ {language === 'en' + ? 'Computing query-key-value similarities' + : '่ฎก็ฎ—ๆŸฅ่ฏข-้”ฎ-ๅ€ผ็›ธไผผๅบฆ'} +

+ + + +
+

+ 4.{language === 'en' ? 'Applying Attention Weights' : 'ๅบ”็”จๆณจๆ„ๅŠ›ๆƒ้‡'} +

+ + + +
+

+ {language === 'en' + ? 'Using attention scores to weight values' + : 'ไฝฟ็”จๆณจๆ„ๅŠ›ๅˆ†ๆ•ฐๅŠ ๆƒๅ€ผ'} +

+ + + +
+

+ 5.{language === 'en' ? 'Multi Head Attention' : 'ๅคšๅคดๆณจๆ„ๅŠ›'} +

+ + + +
+

+ {language === 'en' + ? 'Parallel attention mechanisms' + : 'ๅนถ่กŒๆณจๆ„ๅŠ›ๆœบๅˆถ'} +

+ + + +
+

+ 6.{language === 'en' ? 'Attention in Code' : 'ๆณจๆ„ๅŠ›ไปฃ็ ๅฎž็Žฐ'} +

+ + + +
+

+ {language === 'en' + ? 'Implementing attention mechanisms in Python' + : '็”จPythonๅฎž็Žฐๆณจๆ„ๅŠ›ๆœบๅˆถ'} +

+ +
+
+ + {/* Transformer Feedforward Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'Transformer Feedforward' : 'Transformer前馈网络'} +

+

+ {language === 'en' ? 'Feedforward networks and Mixture of Experts' : '前馈网络和专家混合'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'The Feedforward Layer' : 'ๅ‰้ฆˆๅฑ‚'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding transformer feedforward networks' + : '็†่งฃTransformerๅ‰้ฆˆ็ฝ‘็ปœ'} +

+ + + +
+

+ 2.{language === 'en' ? 'What is Mixture of Experts' : 'ไป€ไนˆๆ˜ฏไธ“ๅฎถๆททๅˆ'} +

+ + + +
+

+ {language === 'en' + ? 'Introduction to MoE architecture' + : 'MoEๆžถๆž„ไป‹็ป'} +

+ + + +
+

+ 3.{language === 'en' ? 'The Expert' : 'ไธ“ๅฎถ'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding individual expert networks' + : '็†่งฃๅ•ไธชไธ“ๅฎถ็ฝ‘็ปœ'} +

+ + + +
+

+ 4.{language === 'en' ? 'The Gate' : '้—จๆŽง'} +

+ + + +
+

+ {language === 'en' + ? 'Routing and gating mechanisms in MoE' + : 'MoEไธญ็š„่ทฏ็”ฑๅ’Œ้—จๆŽงๆœบๅˆถ'} +

+ + + +
+

+ 5.{language === 'en' ? 'Combining Experts' : '็ป„ๅˆไธ“ๅฎถ'} +

+ + + +
+

+ {language === 'en' + ? 'Merging multiple expert outputs' + : 'ๅˆๅนถๅคšไธชไธ“ๅฎถ่พ“ๅ‡บ'} +

+ + + +
+

+ 6.{language === 'en' ? 'MoE in a Transformer' : 'Transformerไธญ็š„MoE'} +

+ + + +
+

+ {language === 'en' + ? 'Integrating mixture of experts in transformers' + : 'ๅœจTransformerไธญ้›†ๆˆไธ“ๅฎถๆททๅˆ'} +

+ + + +
+

+ 7.{language === 'en' ? 'MoE in Code' : 'MoEไปฃ็ ๅฎž็Žฐ'} +

+ + + +
+

+ {language === 'en' + ? 'Implementing mixture of experts in Python' + : '็”จPythonๅฎž็Žฐไธ“ๅฎถๆททๅˆ'} +

+ + + +
+

+ 8.{language === 'en' ? 'The DeepSeek MLP' : 'DeepSeek MLP'} +

+ + + +
+

+ {language === 'en' + ? 'DeepSeek\'s advanced MLP architecture' + : 'DeepSeek็š„้ซ˜็บงMLPๆžถๆž„'} +

+ +
+
+ + {/* Building a Transformer Module */} +
+
+
+ + + +
+
+

+ {language === 'en' ? 'Building a Transformer' : '构建Transformer'} +

+

+ {language === 'en' ? 'Complete transformer implementation from scratch' : '从零开始完整实现Transformer'} +

+
+
+ +
+ +
+

+ 1.{language === 'en' ? 'Transformer Architecture' : 'Transformerๆžถๆž„'} +

+ + + +
+

+ {language === 'en' + ? 'Understanding the complete transformer structure' + : '็†่งฃๅฎŒๆ•ด็š„Transformer็ป“ๆž„'} +

+ + + +
+

+ 2.{language === 'en' ? 'RoPE Positional Encoding' : 'RoPEไฝ็ฝฎ็ผ–็ '} +

+ + + +
+

+ {language === 'en' + ? 'Rotary position embeddings for transformers' + : 'Transformer็š„ๆ—‹่ฝฌไฝ็ฝฎๅตŒๅ…ฅ'} +

+ + + +
+

+ 3.{language === 'en' ? 'Building a Transformer Block' : 'ๆž„ๅปบTransformerๅ—'} +

+ + + +
+

+ {language === 'en' + ? 'Constructing individual transformer layers' + : 'ๆž„ๅปบๅ•ไธชTransformerๅฑ‚'} +

+ + + +
+

+ 4.{language === 'en' ? 'The Final Linear Layer' : 'ๆœ€็ปˆ็บฟๆ€งๅฑ‚'} +

+ + + +
+

+ {language === 'en' + ? 'Output projection and prediction head' + : '่พ“ๅ‡บๆŠ•ๅฝฑๅ’Œ้ข„ๆต‹ๅคด'} +

+ + + +
+

+ 5.{language === 'en' ? 'Full Transformer in Code' : 'ๅฎŒๆ•ดTransformerไปฃ็ '} +

+ + + +
+

+ {language === 'en' + ? 'Complete transformer implementation' + : 'ๅฎŒๆ•ด็š„Transformerๅฎž็Žฐ'} +

+ + + +
+

+ 6.{language === 'en' ? 'Training a Transformer' : '่ฎญ็ปƒTransformer'} +

+ + + +
+

+ {language === 'en' + ? 'Training process and optimization' + : '่ฎญ็ปƒ่ฟ‡็จ‹ๅ’Œไผ˜ๅŒ–'} +

+ +
+
+ +
+
+
+
+ ); +} + diff --git a/app/learn/tensors/concatenating-tensors/page.tsx b/app/learn/tensors/concatenating-tensors/page.tsx new file mode 100644 index 0000000..c3c78a4 --- /dev/null +++ b/app/learn/tensors/concatenating-tensors/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ConcatenatingTensorsPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/creating-special-tensors/page.tsx b/app/learn/tensors/creating-special-tensors/page.tsx new file mode 100644 index 0000000..662ba9c --- /dev/null +++ b/app/learn/tensors/creating-special-tensors/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function CreatingSpecialTensorsPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/creating-tensors/page.tsx b/app/learn/tensors/creating-tensors/page.tsx new file mode 100644 index 0000000..e4d6dd1 --- /dev/null +++ b/app/learn/tensors/creating-tensors/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function CreatingTensorsPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/indexing-and-slicing/page.tsx b/app/learn/tensors/indexing-and-slicing/page.tsx new file mode 100644 index 0000000..52b61e6 --- /dev/null +++ b/app/learn/tensors/indexing-and-slicing/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function IndexingAndSlicingPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/matrix-multiplication/page.tsx b/app/learn/tensors/matrix-multiplication/page.tsx new file mode 100644 index 0000000..0c502ff --- /dev/null +++ b/app/learn/tensors/matrix-multiplication/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function MatrixMultiplicationPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/reshaping-tensors/page.tsx b/app/learn/tensors/reshaping-tensors/page.tsx new file mode 100644 index 0000000..5a18358 --- /dev/null +++ b/app/learn/tensors/reshaping-tensors/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function ReshapingTensorsPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/tensor-addition/page.tsx b/app/learn/tensors/tensor-addition/page.tsx new file mode 100644 index 0000000..c8c738f --- /dev/null +++ b/app/learn/tensors/tensor-addition/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TensorAdditionPage() { + return ( + + ); +} + diff --git a/app/learn/tensors/transposing-tensors/page.tsx b/app/learn/tensors/transposing-tensors/page.tsx new file mode 100644 index 0000000..959ce8d --- /dev/null +++ b/app/learn/tensors/transposing-tensors/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TransposingTensorsPage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/combining-experts/page.tsx b/app/learn/transformer-feedforward/combining-experts/page.tsx new file mode 100644 index 0000000..34bc471 --- /dev/null +++ b/app/learn/transformer-feedforward/combining-experts/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function CombiningExpertsPage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/moe-in-a-transformer/page.tsx b/app/learn/transformer-feedforward/moe-in-a-transformer/page.tsx new file mode 100644 index 
0000000..d4ead9d --- /dev/null +++ b/app/learn/transformer-feedforward/moe-in-a-transformer/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function MoeInATransformerPage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/moe-in-code/page.tsx b/app/learn/transformer-feedforward/moe-in-code/page.tsx new file mode 100644 index 0000000..876c91d --- /dev/null +++ b/app/learn/transformer-feedforward/moe-in-code/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function MoeInCodePage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/the-deepseek-mlp/page.tsx b/app/learn/transformer-feedforward/the-deepseek-mlp/page.tsx new file mode 100644 index 0000000..7a3ccee --- /dev/null +++ b/app/learn/transformer-feedforward/the-deepseek-mlp/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheDeepseekMlpPage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/the-expert/page.tsx b/app/learn/transformer-feedforward/the-expert/page.tsx new file mode 100644 index 0000000..3046b82 --- /dev/null +++ b/app/learn/transformer-feedforward/the-expert/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheExpertPage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/the-feedforward-layer/page.tsx b/app/learn/transformer-feedforward/the-feedforward-layer/page.tsx new file mode 100644 index 0000000..38bfa34 --- /dev/null +++ b/app/learn/transformer-feedforward/the-feedforward-layer/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheFeedforwardLayerPage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/the-gate/page.tsx b/app/learn/transformer-feedforward/the-gate/page.tsx new file mode 100644 index 0000000..c3ed35c --- /dev/null +++ b/app/learn/transformer-feedforward/the-gate/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function TheGatePage() { + return ( + + ); +} + diff --git a/app/learn/transformer-feedforward/what-is-mixture-of-experts/page.tsx b/app/learn/transformer-feedforward/what-is-mixture-of-experts/page.tsx new file mode 100644 index 0000000..9cb37a7 --- /dev/null +++ b/app/learn/transformer-feedforward/what-is-mixture-of-experts/page.tsx @@ -0,0 +1,12 @@ +import { LessonPage } from "@/components/lesson-page"; + +export default function WhatIsMixtureOfExpertsPage() { + return ( + + ); +} + diff --git a/app/page.tsx b/app/page.tsx index 7353cbd..d72bf40 100644 --- a/app/page.tsx +++ b/app/page.tsx @@ -73,14 +73,12 @@ export default function Home() { {language === 'en' ? ( <> Open - Superintelligence - Lab + Superintelligence ) : ( <> ๅผ€ๆ”พ - ่ถ…็บงๆ™บ่ƒฝ - ๅฎž้ชŒๅฎค + ่ถ…็บงๆ™บ่ƒฝ )} @@ -90,18 +88,29 @@ export default function Home() { {language === 'en' ? ( <> Open - Superintelligence - Lab + Superintelligence ) : ( <> ๅผ€ๆ”พ - ่ถ…็บงๆ™บ่ƒฝ - ๅฎž้ชŒๅฎค + ่ถ…็บงๆ™บ่ƒฝ )} + + {/* Subtitle */} +
+

+ The Most Difficult Project In Human History +

+ {/* Glow effect for subtitle */} +
+ + The Most Difficult Project In Human History + +
+
{/* Enhanced decorative elements */} @@ -174,16 +183,19 @@ export default function Home() {
{/* Road to AI Researcher Project */} -
+
Learning Path
- Coming Soon + New
-

+

Zero To AI Researcher - Full Course

@@ -191,12 +203,12 @@ export default function Home() {

Open Superintelligence Lab - - Coming Soon โ†’ + + Start Learning โ†’
-
+ {/* DeepSeek Sparse Attention Project */} (null); + + const modules = getCourseModules(); + + // Auto-scroll to active lesson on mount and pathname change + useEffect(() => { + // Only scroll if we're on a lesson page (pathname starts with /learn/) + if (!pathname?.startsWith('/learn/')) { + return; + } + + // Use a small delay to ensure the DOM is fully rendered + const timer = setTimeout(() => { + if (activeLinkRef.current) { + try { + activeLinkRef.current.scrollIntoView({ + behavior: 'smooth', + block: 'center', + inline: 'nearest' + }); + console.log('Scrolled to active lesson:', pathname); + } catch (error) { + console.error('Error scrolling to active lesson:', error); + } + } else { + console.log('Active link ref not found for:', pathname); + } + }, 100); + + return () => clearTimeout(timer); + }, [pathname]); + + const NavigationContent = () => ( + <> +
+

+ {language === 'en' ? 'Course Contents' : '课程目录'} +

+

+ {language === 'en' ? 'Navigate through the lessons' : '浏览课程内容'} +

+
+ + + +
+ + + + {language === 'en' ? 'Course Home' : '课程首页'} + 
+ + ); + + return ( + <> + {/* Mobile Toggle Button */} + + + {/* Mobile Overlay */} + {isOpen && ( +
setIsOpen(false)} + /> + )} + + {/* Mobile Sidebar */} + + + {/* Desktop Sidebar */} + + + ); +} + diff --git a/components/lesson-page.tsx b/components/lesson-page.tsx new file mode 100644 index 0000000..5e195d9 --- /dev/null +++ b/components/lesson-page.tsx @@ -0,0 +1,237 @@ +'use client'; + +import Link from "next/link"; +import { usePathname } from "next/navigation"; +import { useLanguage } from "@/components/providers/language-provider"; +import { MarkdownRenderer } from "@/components/markdown-renderer"; +import { CourseNavigation } from "@/components/course-navigation"; +import { useEffect, useState } from "react"; +import { getAdjacentLessons } from "@/lib/course-structure"; + +interface HeroData { + title: string; + subtitle: string; + tags: string[]; +} + +interface LessonPageProps { + contentPath: string; + prevLink?: { href: string; label: string }; + nextLink?: { href: string; label: string }; +} + +export function LessonPage({ contentPath, prevLink, nextLink }: LessonPageProps) { + const { language } = useLanguage(); + const pathname = usePathname(); + const [markdownContent, setMarkdownContent] = useState(''); + const [heroData, setHeroData] = useState(null); + const [isLoading, setIsLoading] = useState(true); + + // Auto-determine next/prev links from course structure if not provided + const adjacentLessons = getAdjacentLessons(pathname); + const effectivePrevLink = prevLink || (adjacentLessons.prev ? { + href: adjacentLessons.prev.href, + label: `โ† ${language === 'en' ? 'Previous' : 'ไธŠไธ€่ฏพ'}: ${language === 'en' ? adjacentLessons.prev.title : adjacentLessons.prev.titleZh}` + } : undefined); + + const effectiveNextLink = nextLink || (adjacentLessons.next ? { + href: adjacentLessons.next.href, + label: `${language === 'en' ? 'Next' : 'ไธ‹ไธ€่ฏพ'}: ${language === 'en' ? 
adjacentLessons.next.title : adjacentLessons.next.titleZh} โ†’` + } : undefined); + + useEffect(() => { + const fetchMarkdownContent = async () => { + try { + const response = await fetch(`/content/learn/${contentPath}/${contentPath.split('/').pop()}-content.md`); + const content = await response.text(); + + // Parse frontmatter + const frontmatterMatch = content.match(/^---\n([\s\S]*?)\n---\n([\s\S]*)$/); + if (frontmatterMatch) { + const frontmatterContent = frontmatterMatch[1]; + const markdownBody = frontmatterMatch[2]; + + // Default hero data + const heroData: HeroData = { + title: "", + subtitle: "", + tags: [] + }; + + // Extract values from frontmatter + const lines = frontmatterContent.split('\n'); + let currentKey = ''; + let currentArray: string[] = []; + + for (const line of lines) { + const trimmedLine = line.trim(); + if (trimmedLine.startsWith('hero:')) continue; + + if (trimmedLine.includes(':')) { + const [key, ...valueParts] = trimmedLine.split(':'); + const value = valueParts.join(':').trim().replace(/^["']|["']$/g, ''); + + switch (key.trim()) { + case 'title': + heroData.title = value; + break; + case 'subtitle': + heroData.subtitle = value; + break; + case 'tags': + currentKey = 'tags'; + currentArray = []; + break; + } + } else if (trimmedLine.startsWith('- ')) { + if (currentKey === 'tags') { + const tagValue = trimmedLine.substring(2).replace(/^["']|["']$/g, ''); + currentArray.push(tagValue); + } + } else if (trimmedLine === '' && currentArray.length > 0) { + if (currentKey === 'tags') { + heroData.tags = currentArray; + currentArray = []; + currentKey = ''; + } + } + } + + // Handle final array + if (currentArray.length > 0 && currentKey === 'tags') { + heroData.tags = currentArray; + } + + setHeroData(heroData); + setMarkdownContent(markdownBody); + } else { + setMarkdownContent(content); + } + } catch (error) { + console.error('Failed to fetch markdown content:', error); + setMarkdownContent('# Error loading content\n\nFailed to load the lesson content.'); + } finally { + setIsLoading(false); + } + }; + + fetchMarkdownContent(); + }, [contentPath]); + + if (isLoading) { + return ( +
+
+
+

Loading lesson...

+
+
+ ); + } + + return ( + <> + {/* Course Navigation Sidebar */} + + + {/* Main Content with Sidebar Offset */} +
+ {/* Hero Section */} +
+
+ +
+
+ {/* Back to Course */} + + + + + {language === 'en' ? 'Back to Course' : '่ฟ”ๅ›ž่ฏพ็จ‹'} + + +
+

+ + {heroData?.title || 'Lesson'} + +

+

+ {heroData?.subtitle || ''} +

+ + {/* Tags */} + {heroData?.tags && heroData.tags.length > 0 && ( +
+ {heroData.tags.map((tag, index) => ( + + {index > 0 && โ€ข} + {tag} + + ))} +
+ )} +
+
+
+
+ + {/* Main Content */} +
+
+
+
+
+ +
+
+ + {/* Navigation */} +
+ {effectivePrevLink ? ( + + + + + {effectivePrevLink.label} + + ) : ( +
+ )} + + {effectiveNextLink ? ( + + {effectiveNextLink.label} + + + + + ) : ( + + {language === 'en' ? 'Course Complete! ๐ŸŽ‰' : '่ฏพ็จ‹ๅฎŒๆˆ๏ผ๐ŸŽ‰'} + + + + + )} +
+
+
+
+
+ + ); +} + diff --git a/components/navigation.tsx b/components/navigation.tsx index 9e3312c..3962191 100644 --- a/components/navigation.tsx +++ b/components/navigation.tsx @@ -30,6 +30,12 @@ export function Navigation({ }: NavigationProps) {
+ + {language === 'en' ? 'Learn' : 'ๅญฆไน '} + ', lw=3, color='white')) + + # Bottom explanation + ax.text(7, 2.5, 'Weight gets closer to optimal value (1.0)', + fontsize=26, color='#3B82F6', ha='center', fontweight='bold') + ax.text(7, 1.8, 'Loss decreases from 2.5 โ†’ 0.05', + fontsize=26, color='#10B981', ha='center', fontweight='bold') + ax.text(7, 1, 'Learning = Automatic weight adjustment!', + fontsize=24, color='#94A3B8', ha='center', style='italic') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'neuron-from-scratch/the-concept-of-learning/learning-process.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_prediction_flow(): + """Making a prediction flow diagram""" + fig, ax = plt.subplots(figsize=(14, 7)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 7) + ax.axis('off') + + # Title + ax.text(7, 6.5, 'Forward Pass: Making a Prediction', + fontsize=36, fontweight='bold', color='white', ha='center') + + steps = ['Input', 'Linear\n(wยทx+b)', 'Activation\n(ReLU)', 'Output'] + values = ['[1, 2]', '0.9', '0.9', '0.9'] + colors = ['#3B82F6', '#F59E0B', '#10B981', '#8B5CF6'] + + for i, (step, val, color) in enumerate(zip(steps, values, colors)): + x = 1 + i*3.5 + + # Box + box = patches.FancyBboxPatch((x, 3.5), 2, 1.5, + boxstyle="round,pad=0.1", + edgecolor='white', facecolor=color, linewidth=3) + ax.add_patch(box) + + # Step name + ax.text(x+1, 5.3, step, fontsize=22, fontweight='bold', color='white', ha='center', va='top') + + # Value + ax.text(x+1, 4, val, fontsize=28, fontweight='bold', color='white', ha='center', va='center') + + # Arrow + if i < len(steps) - 1: + ax.annotate('', xy=(x+2.5, 4.25), xytext=(x+2.2, 4.25), + arrowprops=dict(arrowstyle='->', lw=4, color='white')) + + # Bottom note + ax.text(7, 2, 'Data flows forward through the network', + fontsize=26, color='#94A3B8', ha='center', style='italic') + ax.text(7, 1.3, 'Input โ†’ Transform โ†’ Activate โ†’ Prediction', + fontsize=24, color='#94A3B8', ha='center') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'neuron-from-scratch/making-a-prediction/prediction-flow.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_neuron_code_visual(): + """Building a neuron code visualization""" + fig, ax = plt.subplots(figsize=(14, 8)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 8) + ax.axis('off') + + # Title + ax.text(7, 7.5, 'Neuron Components in Code', + fontsize=36, fontweight='bold', color='white', ha='center') + + components = [ + ('nn.Linear()', 'Weights & Bias', '#3B82F6'), + ('nn.ReLU()', 'Activation', '#F59E0B'), + ('forward()', 'Computation', '#10B981'), + ('backward()', 'Learning', '#8B5CF6'), + ] + + y_start = 6 + for i, (code, desc, color) in enumerate(components): + y = y_start - i*1.4 + + # Code box + box = patches.FancyBboxPatch((2, y), 4, 0.9, + boxstyle="round,pad=0.1", + edgecolor='white', facecolor=color, linewidth=2) + ax.add_patch(box) + ax.text(4, y+0.45, code, fontsize=26, fontweight='bold', color='white', ha='center', va='center', + family='monospace') + + # Description + ax.text(7, y+0.45, 'โ†’', fontsize=32, color='white', ha='center') + ax.text(9.5, y+0.45, desc, fontsize=24, color='white', ha='left') + + # Bottom note + ax.text(7, 0.8, 'PyTorch handles all the complexity!', + fontsize=26, color='#94A3B8', ha='center', fontweight='bold') + + fig.patch.set_facecolor('#1E293B') + 
ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'neuron-from-scratch/building-a-neuron-in-python/neuron-code.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +# ============================================================================ +# NEURAL NETWORKS IMAGES +# ============================================================================ + +def create_network_layers(): + """Network architecture layers visualization""" + fig, ax = plt.subplots(figsize=(14, 9)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 9) + ax.axis('off') + + # Title + ax.text(7, 8.5, 'Neural Network Architecture', + fontsize=38, fontweight='bold', color='white', ha='center') + + layers = [ + ('Input\nLayer', 784, '#3B82F6', 1.5), + ('Hidden\nLayer 1', 128, '#10B981', 4.5), + ('Hidden\nLayer 2', 64, '#F59E0B', 7.5), + ('Output\nLayer', 10, '#8B5CF6', 10.5), + ] + + for name, size, color, x in layers: + # Draw neurons + num_display = min(size, 8) + y_start = 6 - (num_display * 0.4) + + for i in range(num_display): + y = y_start + i*0.8 + circle = plt.Circle((x, y), 0.25, color=color, ec='white', linewidth=2) + ax.add_patch(circle) + + if i == num_display - 1 and size > num_display: + ax.text(x, y-0.6, '...', fontsize=24, color=color, ha='center') + + # Label + ax.text(x, 7.5, name, fontsize=22, color='white', ha='center', fontweight='bold') + ax.text(x, 2, f'{size}', fontsize=20, color='#94A3B8', ha='center') + + # Connections + if x < 10: + for i in range(min(3, num_display)): + for j in range(min(3, num_display)): + y1 = y_start + i*0.8 + y2 = y_start + j*0.8 + ax.plot([x+0.25, x+3-0.25], [y1, y2], 'white', alpha=0.2, linewidth=1) + + # Bottom note + ax.text(7, 1, 'Each layer transforms data: 784 โ†’ 128 โ†’ 64 โ†’ 10', + fontsize=24, color='#94A3B8', ha='center', fontweight='bold') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'neural-networks/architecture-of-a-network/network-layers.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_layer_structure(): + """Single layer structure""" + fig, ax = plt.subplots(figsize=(14, 7)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 7) + ax.axis('off') + + # Title + ax.text(7, 6.5, 'Layer = Multiple Neurons in Parallel', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Input + ax.text(2, 5.5, 'Input (3)', fontsize=24, color='white', ha='center') + for i in range(3): + circle = plt.Circle((2, 4.5-i*0.8), 0.3, color='#3B82F6', ec='white', linewidth=2) + ax.add_patch(circle) + + # Neurons in layer + ax.text(7, 5.5, 'Layer (4 neurons)', fontsize=24, color='white', ha='center') + for i in range(4): + circle = plt.Circle((7, 5-i), 0.35, color='#10B981', ec='white', linewidth=3) + ax.add_patch(circle) + + # Connections from all inputs + for j in range(3): + ax.plot([2.3, 6.65], [4.5-j*0.8, 5-i], 'white', alpha=0.3, linewidth=1.5) + + # Output + ax.text(12, 5.5, 'Output (4)', fontsize=24, color='white', ha='center') + for i in range(4): + circle = plt.Circle((12, 5-i), 0.3, color='#8B5CF6', ec='white', linewidth=2) + ax.add_patch(circle) + ax.plot([7.35, 11.7], [5-i, 5-i], 'white', alpha=0.4, linewidth=2) + + # Note + ax.text(7, 1.5, 'Each neuron receives ALL inputs', + fontsize=26, color='#94A3B8', ha='center', fontweight='bold') + ax.text(7, 0.8, 'nn.Linear(3, 4) creates this layer', + fontsize=24, color='#94A3B8', ha='center', style='italic') + + fig.patch.set_facecolor('#1E293B') + 
ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'neural-networks/building-a-layer/layer-structure.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +# Create all neuron and network images +print("Creating neuron-from-scratch images...") +create_linear_step_visual() +create_activation_comparison() +create_loss_visual() +create_learning_process() +create_prediction_flow() +create_neuron_code_visual() + +print("Creating neural-networks images...") +create_network_layers() +create_layer_structure() + +print("Part 1 complete! Run generate_all_missing_images_part2.py for attention/transformer images...") + diff --git a/generate_all_missing_images_part2.py b/generate_all_missing_images_part2.py new file mode 100644 index 0000000..d7d5845 --- /dev/null +++ b/generate_all_missing_images_part2.py @@ -0,0 +1,543 @@ +import matplotlib.pyplot as plt +import matplotlib.patches as patches +import numpy as np + +# Set style +plt.rcParams['font.family'] = 'sans-serif' +plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica'] + +BASE_PATH = '/Users/vukrosic/AI Science Projects/open-superintelligence-lab-github-io/public/content/learn/' + +# ============================================================================ +# ATTENTION MECHANISM IMAGES +# ============================================================================ + +def create_attention_concept(): + """What is attention: concept visualization""" + fig, ax = plt.subplots(figsize=(14, 9)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 9) + ax.axis('off') + + # Title + ax.text(7, 8.5, 'Attention: Focus on Relevant Parts', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Sentence + sentence = "The cat sat on the mat" + words = sentence.split() + + # Word boxes with attention highlights + ax.text(7, 7.5, 'Query: "What did the cat do?"', fontsize=26, color='#F59E0B', ha='center', fontweight='bold') + + # Attention weights + attention = [0.1, 0.6, 0.2, 0.05, 0.02, 0.03] # "cat" and "sat" most important + + x_start = 2 + y_pos = 5.5 + + for i, (word, attn) in enumerate(zip(words, attention)): + alpha = 0.3 + attn * 0.7 # Scale alpha by attention + size = 1 + attn * 1.5 + color_intensity = int(255 * attn) + + # Box with size based on attention + box = patches.FancyBboxPatch((x_start + i*1.8, y_pos), 1.5, 0.8+attn, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#10B981' if attn > 0.3 else '#3B82F6', + linewidth=2+attn*4) + ax.add_patch(box) + ax.text(x_start + i*1.8 + 0.75, y_pos + 0.4 + attn/2, word, + fontsize=18+attn*20, fontweight='bold', color='white', ha='center', va='center') + + # Attention weight below + ax.text(x_start + i*1.8 + 0.75, y_pos - 0.4, f'{attn:.0%}', + fontsize=18, color='#94A3B8', ha='center') + + # Explanation + ax.text(7, 3.5, '"cat" (60%) and "sat" (20%) are most relevant', + fontsize=28, color='#10B981', ha='center', fontweight='bold') + ax.text(7, 2.8, 'Other words get less attention', + fontsize=24, color='#94A3B8', ha='center') + + ax.text(7, 1.5, 'Attention weights sum to 100%', + fontsize=24, color='#94A3B8', ha='center', style='italic') + ax.text(7, 0.8, 'Model learns which words to focus on!', + fontsize=22, color='#94A3B8', ha='center') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'attention-mechanism/what-is-attention/attention-concept.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_qkv_visual(): + 
"""Query, Key, Value visualization""" + fig, ax = plt.subplots(figsize=(14, 9)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 9) + ax.axis('off') + + # Title + ax.text(7, 8.5, 'Query, Key, Value Mechanism', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Input + input_box = patches.FancyBboxPatch((6, 7.5), 2, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#94A3B8', linewidth=2) + ax.add_patch(input_box) + ax.text(7, 7.8, 'Input', fontsize=24, fontweight='bold', color='white', ha='center', va='center') + + # Split to Q, K, V + components = [ + ('Query', 'What am I\nlooking for?', '#10B981', 2), + ('Key', 'What do I\ncontain?', '#F59E0B', 7), + ('Value', 'What info\ndo I have?', '#8B5CF6', 12), + ] + + for name, desc, color, x in components: + # Arrow from input + ax.annotate('', xy=(x+0.5, 6.2), xytext=(7, 7.3), + arrowprops=dict(arrowstyle='->', lw=3, color=color)) + + # Component box + box = patches.FancyBboxPatch((x, 4.8), 2, 1.2, + boxstyle="round,pad=0.1", + edgecolor='white', facecolor=color, linewidth=3) + ax.add_patch(box) + ax.text(x+1, 5.8, name, fontsize=26, fontweight='bold', color='white', ha='center', va='center') + ax.text(x+1, 5.2, desc, fontsize=18, color='white', ha='center', va='center') + + # Attention computation + ax.text(7, 3.5, '1. Q ร— K โ†’ Scores', fontsize=24, color='#94A3B8', ha='center') + ax.text(7, 3, '2. Softmax โ†’ Weights', fontsize=24, color='#94A3B8', ha='center') + ax.text(7, 2.5, '3. Weights ร— V โ†’ Output', fontsize=24, color='#94A3B8', ha='center') + + # Output + output_box = patches.FancyBboxPatch((5.5, 1), 3, 0.8, + boxstyle="round,pad=0.1", + edgecolor='white', facecolor='#3B82F6', linewidth=3) + ax.add_patch(output_box) + ax.text(7, 1.4, 'Attention Output', fontsize=26, fontweight='bold', color='white', ha='center', va='center') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'attention-mechanism/what-is-attention/qkv-mechanism.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_attention_scores_matrix(): + """Attention scores matrix visualization""" + fig, ax = plt.subplots(figsize=(12, 10)) + ax.set_xlim(0, 12) + ax.set_ylim(0, 10) + ax.axis('off') + + # Title + ax.text(6, 9.5, 'Attention Score Matrix', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Create attention matrix visualization + size = 5 + scores = np.random.rand(size, size) + scores = scores / scores.sum(axis=1, keepdims=True) # Normalize rows + + box_size = 1 + x_start = 2.5 + y_start = 7.5 + + # Row labels (Query positions) + for i in range(size): + ax.text(x_start - 0.7, y_start - i*1.1 + 0.5, f'Q{i}', + fontsize=20, color='#10B981', ha='center', fontweight='bold') + + # Column labels (Key positions) + for j in range(size): + ax.text(x_start + j*1.1 + 0.5, y_start + 0.7, f'K{j}', + fontsize=20, color='#F59E0B', ha='center', fontweight='bold') + + # Draw matrix + for i in range(size): + for j in range(size): + val = scores[i, j] + color_intensity = val + color = plt.cm.viridis(color_intensity) + + rect = patches.FancyBboxPatch((x_start + j*1.1, y_start - i*1.1), box_size, box_size, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor=color, linewidth=1) + ax.add_patch(rect) + ax.text(x_start + j*1.1 + 0.5, y_start - i*1.1 + 0.5, f'{val:.2f}', + fontsize=16, fontweight='bold', color='white', ha='center', va='center') + + # Note + ax.text(6, 1.5, 'Each row shows where one position attends', + fontsize=24, 
color='#94A3B8', ha='center', fontweight='bold') + ax.text(6, 0.9, 'Darker = Higher attention', + fontsize=22, color='#94A3B8', ha='center', style='italic') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'attention-mechanism/calculating-attention-scores/attention-matrix.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_multi_head_visualization(): + """Multi-head attention visualization""" + fig, ax = plt.subplots(figsize=(14, 8)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 8) + ax.axis('off') + + # Title + ax.text(7, 7.5, 'Multi-Head Attention: 8 Heads in Parallel', + fontsize=34, fontweight='bold', color='white', ha='center') + + # Input + input_box = patches.FancyBboxPatch((6, 6.5), 2, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#3B82F6', linewidth=3) + ax.add_patch(input_box) + ax.text(7, 6.8, 'Input', fontsize=24, fontweight='bold', color='white', ha='center', va='center') + + # 8 heads + num_heads = 8 + colors = plt.cm.tab10(np.linspace(0, 1, num_heads)) + + y_start = 5 + for i in range(num_heads): + x = 1.5 + i*1.5 + + # Arrow from input + ax.plot([7, x+0.4], [6.4, y_start+0.6], 'white', alpha=0.3, linewidth=2) + + # Head box + box = patches.FancyBboxPatch((x, y_start - 0.3), 0.8, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor=colors[i], linewidth=2) + ax.add_patch(box) + ax.text(x+0.4, y_start, f'H{i+1}', fontsize=18, fontweight='bold', color='white', ha='center', va='center') + + # Arrow to concat + ax.plot([x+0.4, 7], [y_start-0.5, 3.2], 'white', alpha=0.3, linewidth=2) + + # Concatenate + concat_box = patches.FancyBboxPatch((5, 2.5), 4, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#F59E0B', linewidth=3) + ax.add_patch(concat_box) + ax.text(7, 2.8, 'Concatenate Heads', fontsize=24, fontweight='bold', color='white', ha='center', va='center') + + # Output projection + ax.annotate('', xy=(7, 1.5), xytext=(7, 2.3), + arrowprops=dict(arrowstyle='->', lw=4, color='white')) + + output_box = patches.FancyBboxPatch((6, 0.5), 2, 0.8, + boxstyle="round,pad=0.1", + edgecolor='white', facecolor='#8B5CF6', linewidth=3) + ax.add_patch(output_box) + ax.text(7, 0.9, 'Output', fontsize=26, fontweight='bold', color='white', ha='center', va='center') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'attention-mechanism/multi-head-attention/multi-head-visual.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_self_attention_visual(): + """Self-attention concept""" + fig, ax = plt.subplots(figsize=(14, 8)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 8) + ax.axis('off') + + # Title + ax.text(7, 7.5, 'Self-Attention: Sequence Attends to Itself', + fontsize=34, fontweight='bold', color='white', ha='center') + + words = ['The', 'cat', 'sat'] + positions = [3, 7, 11] + + for i, (word, x) in enumerate(zip(words, positions)): + # Word box + box = patches.FancyBboxPatch((x-0.8, 5.5), 1.6, 0.8, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#3B82F6', linewidth=3) + ax.add_patch(box) + ax.text(x, 5.9, word, fontsize=28, fontweight='bold', color='white', ha='center', va='center') + + # Show attention connections + for j, (word2, x2) in enumerate(zip(words, positions)): + if i != j: + # Attention line + alpha = 0.5 if abs(i-j) == 1 else 0.2 + ax.plot([x, x2], [5.3, 5.3], 'c-', alpha=alpha, linewidth=2+alpha*4) + 
ax.plot([x, x2], [5.3, 5.3], 'co', markersize=8, alpha=alpha) + + # Explanation + ax.text(7, 3.8, 'Each word attends to ALL words (including itself)', + fontsize=26, color='#94A3B8', ha='center', fontweight='bold') + ax.text(7, 3.2, '"cat" learns from "The" and "sat" for context', + fontsize=24, color='#10B981', ha='center') + ax.text(7, 2.6, 'Q, K, V all come from the same sequence!', + fontsize=22, color='#94A3B8', ha='center', style='italic') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'attention-mechanism/self-attention-from-scratch/self-attention-concept.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +# ============================================================================ +# TRANSFORMER IMAGES +# ============================================================================ + +def create_transformer_architecture_diagram(): + """Full transformer architecture""" + fig, ax = plt.subplots(figsize=(12, 14)) + ax.set_xlim(0, 12) + ax.set_ylim(0, 14) + ax.axis('off') + + # Title + ax.text(6, 13.5, 'Transformer Architecture', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Input + box = patches.FancyBboxPatch((4, 12.5), 4, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#3B82F6', linewidth=2) + ax.add_patch(box) + ax.text(6, 12.85, 'Input Tokens', fontsize=22, fontweight='bold', color='white', ha='center', va='center') + + # Embeddings + y = 11.5 + box = patches.FancyBboxPatch((4, y), 4, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#6366F1', linewidth=2) + ax.add_patch(box) + ax.text(6, y+0.35, 'Embeddings + Positions', fontsize=20, fontweight='bold', color='white', ha='center', va='center') + ax.plot([6, 6], [y+0.8, y+1.4], 'white', linewidth=3) + + # Transformer blocks (N times) + for block_idx in range(3): + y_block = 10 - block_idx*3 + + # Block container + block_box = patches.FancyBboxPatch((3, y_block-2.5), 6, 2.3, + boxstyle="round,pad=0.1", + edgecolor='cyan', facecolor='#1E293B', linewidth=2, linestyle='--') + ax.add_patch(block_box) + ax.text(9.2, y_block - 1.3, f'Block {block_idx+1}', fontsize=18, color='cyan', ha='left') + + # Multi-head attention + box1 = patches.FancyBboxPatch((4, y_block-0.7), 4, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#10B981', linewidth=2) + ax.add_patch(box1) + ax.text(6, y_block-0.4, 'Multi-Head Attention', fontsize=18, fontweight='bold', color='white', ha='center', va='center') + + # FFN + box2 = patches.FancyBboxPatch((4, y_block-1.9), 4, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#F59E0B', linewidth=2) + ax.add_patch(box2) + ax.text(6, y_block-1.6, 'Feed-Forward', fontsize=18, fontweight='bold', color='white', ha='center', va='center') + + # Arrows + ax.plot([6, 6], [y_block-0.1, y_block-1.3], 'white', linewidth=2) + + if block_idx < 2: + ax.plot([6, 6], [y_block-2.6, y_block-3.2], 'white', linewidth=2) + + # Output head + y_out = 1 + box = patches.FancyBboxPatch((4, y_out), 4, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#8B5CF6', linewidth=2) + ax.add_patch(box) + ax.text(6, y_out+0.35, 'Output Projection', fontsize=20, fontweight='bold', color='white', ha='center', va='center') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'building-a-transformer/transformer-architecture/transformer-diagram.png', + dpi=150, facecolor='#1E293B', 
bbox_inches='tight', pad_inches=0.3) + plt.close() + +def create_transformer_block_diagram(): + """Transformer block internal structure""" + fig, ax = plt.subplots(figsize=(14, 10)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 10) + ax.axis('off') + + # Title + ax.text(7, 9.5, 'Transformer Block Components', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Input + y = 8.5 + box = patches.FancyBboxPatch((5.5, y), 3, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#3B82F6', linewidth=2) + ax.add_patch(box) + ax.text(7, y+0.35, 'Input', fontsize=24, fontweight='bold', color='white', ha='center', va='center') + + # Attention sub-block + y = 7 + ax.text(3, y+1, '1. Attention Sub-block', fontsize=22, color='#10B981', ha='left', fontweight='bold') + + box = patches.FancyBboxPatch((4, y), 6, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#10B981', linewidth=2) + ax.add_patch(box) + ax.text(7, y+0.3, 'Multi-Head Attention', fontsize=20, fontweight='bold', color='white', ha='center', va='center') + + # Add & Norm + y = 6 + box = patches.FancyBboxPatch((4.5, y), 5, 0.5, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#6366F1', linewidth=2) + ax.add_patch(box) + ax.text(7, y+0.25, 'Add & Norm (Residual)', fontsize=18, color='white', ha='center', va='center') + + # FFN sub-block + y = 4.8 + ax.text(3, y+1, '2. FFN Sub-block', fontsize=22, color='#F59E0B', ha='left', fontweight='bold') + + box = patches.FancyBboxPatch((4, y), 6, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#F59E0B', linewidth=2) + ax.add_patch(box) + ax.text(7, y+0.3, 'Feed-Forward Network', fontsize=20, fontweight='bold', color='white', ha='center', va='center') + + # Add & Norm + y = 3.8 + box = patches.FancyBboxPatch((4.5, y), 5, 0.5, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#6366F1', linewidth=2) + ax.add_patch(box) + ax.text(7, y+0.25, 'Add & Norm (Residual)', fontsize=18, color='white', ha='center', va='center') + + # Output + y = 2.5 + box = patches.FancyBboxPatch((5.5, y), 3, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#8B5CF6', linewidth=2) + ax.add_patch(box) + ax.text(7, y+0.35, 'Output', fontsize=24, fontweight='bold', color='white', ha='center', va='center') + + # Note + ax.text(7, 1.2, 'Attention โ†’ Add&Norm โ†’ FFN โ†’ Add&Norm', + fontsize=24, color='#94A3B8', ha='center', fontweight='bold') + ax.text(7, 0.5, 'Residual connections help gradients flow!', + fontsize=22, color='#94A3B8', ha='center', style='italic') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'building-a-transformer/building-a-transformer-block/block-diagram.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +# ============================================================================ +# MOE IMAGES +# ============================================================================ + +def create_moe_routing(): + """MoE routing visualization""" + fig, ax = plt.subplots(figsize=(14, 10)) + ax.set_xlim(0, 14) + ax.set_ylim(0, 10) + ax.axis('off') + + # Title + ax.text(7, 9.5, 'Mixture of Experts: Sparse Routing', + fontsize=36, fontweight='bold', color='white', ha='center') + + # Input token + token_box = patches.FancyBboxPatch((6, 8.5), 2, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#3B82F6', linewidth=3) + ax.add_patch(token_box) + ax.text(7, 8.85, 'Token', fontsize=24, fontweight='bold', color='white', 
ha='center', va='center') + + # Router + ax.annotate('', xy=(7, 7.5), xytext=(7, 8.3), + arrowprops=dict(arrowstyle='->', lw=3, color='white')) + + router_box = patches.FancyBboxPatch((5.5, 6.8), 3, 0.6, + boxstyle="round,pad=0.05", + edgecolor='white', facecolor='#F59E0B', linewidth=2) + ax.add_patch(router_box) + ax.text(7, 7.1, 'Router', fontsize=22, fontweight='bold', color='white', ha='center', va='center') + + # 8 Experts + num_experts = 8 + expert_colors = ['#10B981', '#EF4444', '#94A3B8', '#94A3B8', '#94A3B8', '#10B981', '#94A3B8', '#94A3B8'] + active = [True, False, False, False, False, True, False, False] + + y_experts = 5 + for i in range(num_experts): + x = 1.5 + i*1.5 + + # Expert box + box = patches.FancyBboxPatch((x, y_experts), 0.9, 0.7, + boxstyle="round,pad=0.05", + edgecolor='white' if active[i] else '#4B5563', + facecolor=expert_colors[i], + linewidth=3 if active[i] else 1, + alpha=1.0 if active[i] else 0.3) + ax.add_patch(box) + ax.text(x+0.45, y_experts+0.35, f'E{i}', fontsize=18, fontweight='bold', color='white', ha='center', va='center') + + # Connection from router + alpha = 1.0 if active[i] else 0.15 + linewidth = 3 if active[i] else 1 + ax.plot([7, x+0.45], [6.7, y_experts+0.8], color=expert_colors[i] if active[i] else '#4B5563', + alpha=alpha, linewidth=linewidth) + + # Output + ax.text(7, 3.5, 'Top-2 Experts Selected: E0 (60%) + E5 (40%)', + fontsize=26, color='#10B981', ha='center', fontweight='bold') + + output_box = patches.FancyBboxPatch((5, 2.3), 4, 0.8, + boxstyle="round,pad=0.1", + edgecolor='white', facecolor='#8B5CF6', linewidth=3) + ax.add_patch(output_box) + ax.text(7, 2.7, 'Combined Output', fontsize=24, fontweight='bold', color='white', ha='center', va='center') + + ax.text(7, 1, 'Only 2 of 8 experts activated (sparse!)', + fontsize=24, color='#94A3B8', ha='center', style='italic') + + fig.patch.set_facecolor('#1E293B') + ax.set_facecolor('#1E293B') + plt.tight_layout() + plt.savefig(BASE_PATH + 'transformer-feedforward/what-is-mixture-of-experts/moe-routing.png', + dpi=150, facecolor='#1E293B', bbox_inches='tight', pad_inches=0.3) + plt.close() + +# Create all images +print("Creating attention mechanism images...") +create_attention_concept() +create_qkv_visual() +create_attention_scores_matrix() +create_multi_head_visualization() +create_self_attention_visual() + +print("Creating transformer images...") +create_transformer_architecture_diagram() +create_transformer_block_diagram() + +print("Creating MoE images...") +create_moe_routing() + +print("\nโœ… All missing images created successfully!") + diff --git a/lib/course-structure.tsx b/lib/course-structure.tsx new file mode 100644 index 0000000..3ad030d --- /dev/null +++ b/lib/course-structure.tsx @@ -0,0 +1,391 @@ +export interface LessonItem { + title: string; + titleZh: string; + href: string; +} + +export interface ModuleData { + title: string; + titleZh: string; + icon: React.ReactNode; + lessons: LessonItem[]; +} + +export const getCourseModules = (): ModuleData[] => [ + { + title: "Mathematics Fundamentals", + titleZh: "ๆ•ฐๅญฆๅŸบ็ก€", + icon: ( + + + + ), + lessons: [ + { + title: "Functions", + titleZh: "ๅ‡ฝๆ•ฐ", + href: "/learn/math/functions" + }, + { + title: "Derivatives", + titleZh: "ๅฏผๆ•ฐ", + href: "/learn/math/derivatives" + }, + { + title: "Vectors", + titleZh: "ๅ‘้‡", + href: "/learn/math/vectors" + }, + { + title: "Matrices", + titleZh: "็Ÿฉ้˜ต", + href: "/learn/math/matrices" + }, + { + title: "Gradients", + titleZh: "ๆขฏๅบฆ", + href: "/learn/math/gradients" + } + ] + }, + { 
+ title: "PyTorch Fundamentals", + titleZh: "PyTorchๅŸบ็ก€", + icon: ( + + + + ), + lessons: [ + { + title: "Creating Tensors", + titleZh: "ๅˆ›ๅปบๅผ ้‡", + href: "/learn/tensors/creating-tensors" + }, + { + title: "Tensor Addition", + titleZh: "ๅผ ้‡ๅŠ ๆณ•", + href: "/learn/tensors/tensor-addition" + }, + { + title: "Matrix Multiplication", + titleZh: "็Ÿฉ้˜ตไน˜ๆณ•", + href: "/learn/tensors/matrix-multiplication" + }, + { + title: "Transposing Tensors", + titleZh: "ๅผ ้‡่ฝฌ็ฝฎ", + href: "/learn/tensors/transposing-tensors" + }, + { + title: "Reshaping Tensors", + titleZh: "ๅผ ้‡้‡ๅก‘", + href: "/learn/tensors/reshaping-tensors" + }, + { + title: "Indexing and Slicing", + titleZh: "็ดขๅผ•ๅ’Œๅˆ‡็‰‡", + href: "/learn/tensors/indexing-and-slicing" + }, + { + title: "Concatenating Tensors", + titleZh: "ๅผ ้‡ๆ‹ผๆŽฅ", + href: "/learn/tensors/concatenating-tensors" + }, + { + title: "Creating Special Tensors", + titleZh: "ๅˆ›ๅปบ็‰นๆฎŠๅผ ้‡", + href: "/learn/tensors/creating-special-tensors" + } + ] + }, + { + title: "Neuron From Scratch", + titleZh: "ไปŽ้›ถๅผ€ๅง‹ๆž„ๅปบ็ฅž็ปๅ…ƒ", + icon: ( + + + + ), + lessons: [ + { + title: "What is a Neuron", + titleZh: "ไป€ไนˆๆ˜ฏ็ฅž็ปๅ…ƒ", + href: "/learn/neuron-from-scratch/what-is-a-neuron" + }, + { + title: "The Linear Step", + titleZh: "็บฟๆ€งๆญฅ้ชค", + href: "/learn/neuron-from-scratch/the-linear-step" + }, + { + title: "The Activation Function", + titleZh: "ๆฟ€ๆดปๅ‡ฝๆ•ฐ", + href: "/learn/neuron-from-scratch/the-activation-function" + }, + { + title: "Building a Neuron in Python", + titleZh: "็”จPythonๆž„ๅปบ็ฅž็ปๅ…ƒ", + href: "/learn/neuron-from-scratch/building-a-neuron-in-python" + }, + { + title: "Making a Prediction", + titleZh: "่ฟ›่กŒ้ข„ๆต‹", + href: "/learn/neuron-from-scratch/making-a-prediction" + }, + { + title: "The Concept of Loss", + titleZh: "ๆŸๅคฑๆฆ‚ๅฟต", + href: "/learn/neuron-from-scratch/the-concept-of-loss" + }, + { + title: "The Concept of Learning", + titleZh: "ๅญฆไน ๆฆ‚ๅฟต", + href: "/learn/neuron-from-scratch/the-concept-of-learning" + } + ] + }, + { + title: "Activation Functions", + titleZh: "ๆฟ€ๆดปๅ‡ฝๆ•ฐ", + icon: ( + + + + ), + lessons: [ + { + title: "ReLU", + titleZh: "ReLU", + href: "/learn/activation-functions/relu" + }, + { + title: "Sigmoid", + titleZh: "Sigmoid", + href: "/learn/activation-functions/sigmoid" + }, + { + title: "Tanh", + titleZh: "Tanh", + href: "/learn/activation-functions/tanh" + }, + { + title: "SiLU", + titleZh: "SiLU", + href: "/learn/activation-functions/silu" + }, + { + title: "SwiGLU", + titleZh: "SwiGLU", + href: "/learn/activation-functions/swiglu" + }, + { + title: "Softmax", + titleZh: "Softmax", + href: "/learn/activation-functions/softmax" + } + ] + }, + { + title: "Neural Networks from Scratch", + titleZh: "ไปŽ้›ถๅผ€ๅง‹็š„็ฅž็ป็ฝ‘็ปœ", + icon: ( + + + + ), + lessons: [ + { + title: "Architecture of a Network", + titleZh: "็ฝ‘็ปœๆžถๆž„", + href: "/learn/neural-networks/architecture-of-a-network" + }, + { + title: "Building a Layer", + titleZh: "ๆž„ๅปบๅฑ‚", + href: "/learn/neural-networks/building-a-layer" + }, + { + title: "Implementing a Network", + titleZh: "ๅฎž็Žฐ็ฝ‘็ปœ", + href: "/learn/neural-networks/implementing-a-network" + }, + { + title: "The Chain Rule", + titleZh: "้“พๅผๆณ•ๅˆ™", + href: "/learn/neural-networks/the-chain-rule" + }, + { + title: "Calculating Gradients", + titleZh: "่ฎก็ฎ—ๆขฏๅบฆ", + href: "/learn/neural-networks/calculating-gradients" + }, + { + title: "Backpropagation in Action", + titleZh: "ๅๅ‘ไผ ๆ’ญๅฎžๆˆ˜", + href: 
"/learn/neural-networks/backpropagation-in-action" + }, + { + title: "Implementing Backpropagation", + titleZh: "ๅฎž็Žฐๅๅ‘ไผ ๆ’ญ", + href: "/learn/neural-networks/implementing-backpropagation" + } + ] + }, + { + title: "Attention Mechanism", + titleZh: "ๆณจๆ„ๅŠ›ๆœบๅˆถ", + icon: ( + + + + + ), + lessons: [ + { + title: "What is Attention", + titleZh: "ไป€ไนˆๆ˜ฏๆณจๆ„ๅŠ›", + href: "/learn/attention-mechanism/what-is-attention" + }, + { + title: "Self Attention from Scratch", + titleZh: "ไปŽ้›ถๅผ€ๅง‹่‡ชๆณจๆ„ๅŠ›", + href: "/learn/attention-mechanism/self-attention-from-scratch" + }, + { + title: "Calculating Attention Scores", + titleZh: "่ฎก็ฎ—ๆณจๆ„ๅŠ›ๅˆ†ๆ•ฐ", + href: "/learn/attention-mechanism/calculating-attention-scores" + }, + { + title: "Applying Attention Weights", + titleZh: "ๅบ”็”จๆณจๆ„ๅŠ›ๆƒ้‡", + href: "/learn/attention-mechanism/applying-attention-weights" + }, + { + title: "Multi Head Attention", + titleZh: "ๅคšๅคดๆณจๆ„ๅŠ›", + href: "/learn/attention-mechanism/multi-head-attention" + }, + { + title: "Attention in Code", + titleZh: "ๆณจๆ„ๅŠ›ไปฃ็ ๅฎž็Žฐ", + href: "/learn/attention-mechanism/attention-in-code" + } + ] + }, + { + title: "Transformer Feedforward", + titleZh: "Transformerๅ‰้ฆˆ็ฝ‘็ปœ", + icon: ( + + + + ), + lessons: [ + { + title: "The Feedforward Layer", + titleZh: "ๅ‰้ฆˆๅฑ‚", + href: "/learn/transformer-feedforward/the-feedforward-layer" + }, + { + title: "What is Mixture of Experts", + titleZh: "ไป€ไนˆๆ˜ฏไธ“ๅฎถๆททๅˆ", + href: "/learn/transformer-feedforward/what-is-mixture-of-experts" + }, + { + title: "The Expert", + titleZh: "ไธ“ๅฎถ", + href: "/learn/transformer-feedforward/the-expert" + }, + { + title: "The Gate", + titleZh: "้—จๆŽง", + href: "/learn/transformer-feedforward/the-gate" + }, + { + title: "Combining Experts", + titleZh: "็ป„ๅˆไธ“ๅฎถ", + href: "/learn/transformer-feedforward/combining-experts" + }, + { + title: "MoE in a Transformer", + titleZh: "Transformerไธญ็š„MoE", + href: "/learn/transformer-feedforward/moe-in-a-transformer" + }, + { + title: "MoE in Code", + titleZh: "MoEไปฃ็ ๅฎž็Žฐ", + href: "/learn/transformer-feedforward/moe-in-code" + }, + { + title: "The DeepSeek MLP", + titleZh: "DeepSeek MLP", + href: "/learn/transformer-feedforward/the-deepseek-mlp" + } + ] + }, + { + title: "Building a Transformer", + titleZh: "ๆž„ๅปบTransformer", + icon: ( + + + + ), + lessons: [ + { + title: "Transformer Architecture", + titleZh: "Transformerๆžถๆž„", + href: "/learn/building-a-transformer/transformer-architecture" + }, + { + title: "RoPE Positional Encoding", + titleZh: "RoPEไฝ็ฝฎ็ผ–็ ", + href: "/learn/building-a-transformer/rope-positional-encoding" + }, + { + title: "Building a Transformer Block", + titleZh: "ๆž„ๅปบTransformerๅ—", + href: "/learn/building-a-transformer/building-a-transformer-block" + }, + { + title: "The Final Linear Layer", + titleZh: "ๆœ€็ปˆ็บฟๆ€งๅฑ‚", + href: "/learn/building-a-transformer/the-final-linear-layer" + }, + { + title: "Full Transformer in Code", + titleZh: "ๅฎŒๆ•ดTransformerไปฃ็ ", + href: "/learn/building-a-transformer/full-transformer-in-code" + }, + { + title: "Training a Transformer", + titleZh: "่ฎญ็ปƒTransformer", + href: "/learn/building-a-transformer/training-a-transformer" + } + ] + } +]; + +// Get all lessons as a flat array +export const getAllLessons = (): LessonItem[] => { + const modules = getCourseModules(); + return modules.flatMap(module => module.lessons); +}; + +// Get next and previous lessons for a given href +export const getAdjacentLessons = (currentHref: string) => { + 
const allLessons = getAllLessons(); + const currentIndex = allLessons.findIndex(lesson => lesson.href === currentHref); + + if (currentIndex === -1) { + return { prev: null, next: null }; + } + + const prev = currentIndex > 0 ? allLessons[currentIndex - 1] : null; + const next = currentIndex < allLessons.length - 1 ? allLessons[currentIndex + 1] : null; + + return { prev, next }; +}; + diff --git a/public/content/learn/README.md b/public/content/learn/README.md new file mode 100644 index 0000000..b44ccd9 --- /dev/null +++ b/public/content/learn/README.md @@ -0,0 +1,93 @@ +# Course Content Structure + +This directory contains markdown files and images for the AI/ML course lessons. + +## Directory Structure + +``` +learn/ +โ”œโ”€โ”€ math/ +โ”‚ โ”œโ”€โ”€ derivatives/ +โ”‚ โ”‚ โ”œโ”€โ”€ derivatives-content.md +โ”‚ โ”‚ โ”œโ”€โ”€ derivative-graph.png (placeholder - add your image here) +โ”‚ โ”‚ โ””โ”€โ”€ tangent-line.png (placeholder - add your image here) +โ”‚ โ””โ”€โ”€ functions/ +โ”‚ โ”œโ”€โ”€ functions-content.md +โ”‚ โ”œโ”€โ”€ linear-function.png (add your image here) +โ”‚ โ”œโ”€โ”€ relu-function.png (add your image here) +โ”‚ โ””โ”€โ”€ function-composition.png (add your image here) +โ””โ”€โ”€ neural-networks/ + โ”œโ”€โ”€ introduction/ + โ”‚ โ”œโ”€โ”€ introduction-content.md + โ”‚ โ”œโ”€โ”€ neural-network-diagram.png (add your image here) + โ”‚ โ”œโ”€โ”€ layer-types.png (add your image here) + โ”‚ โ”œโ”€โ”€ training-process.png (add your image here) + โ”‚ โ””โ”€โ”€ depth-vs-performance.png (add your image here) + โ”œโ”€โ”€ forward-propagation/ + โ”‚ โ”œโ”€โ”€ forward-propagation-content.md + โ”‚ โ”œโ”€โ”€ forward-prop-diagram.png (add your image here) + โ”‚ โ”œโ”€โ”€ forward-example.png (add your image here) + โ”‚ โ”œโ”€โ”€ activations-comparison.png (add your image here) + โ”‚ โ””โ”€โ”€ matrix-backprop.png (add your image here) + โ”œโ”€โ”€ backpropagation/ + โ”‚ โ”œโ”€โ”€ backpropagation-content.md + โ”‚ โ”œโ”€โ”€ backprop-overview.png (add your image here) + โ”‚ โ”œโ”€โ”€ backprop-steps.png (add your image here) + โ”‚ โ””โ”€โ”€ matrix-backprop.png (add your image here) + โ””โ”€โ”€ training/ + โ”œโ”€โ”€ training-content.md + โ”œโ”€โ”€ training-loop.png (add your image here) + โ”œโ”€โ”€ gradient-descent.png (add your image here) + โ”œโ”€โ”€ gd-variants.png (add your image here) + โ”œโ”€โ”€ optimizers-comparison.png (add your image here) + โ”œโ”€โ”€ lr-schedules.png (add your image here) + โ””โ”€โ”€ training-curves.png (add your image here) +``` + +## How to Add Images + +1. Place your PNG/JPG images in the corresponding lesson folder +2. Reference them in the markdown using: + ```markdown + ![Alt Text](image-name.png) + ``` +3. The images will be served from `/content/learn/[lesson-path]/[image-name]` + +## Markdown Frontmatter Format + +Each lesson markdown file should start with frontmatter: + +```markdown +--- +hero: + title: "Lesson Title" + subtitle: "Lesson Subtitle" + tags: + - "๐Ÿ“ Category" + - "โฑ๏ธ Reading Time" +--- + +# Your content here... +``` + +## Adding New Lessons + +1. Create a new folder under the appropriate category +2. Add a `{folder-name}-content.md` file +3. Add your images +4. 
Create a page component in `app/learn/[category]/[lesson-name]/page.tsx`: + +```tsx +import { LessonPage } from "@/components/lesson-page"; + +export default function YourLessonPage() { + return ( + + ); +} +``` + diff --git a/public/content/learn/activation-functions/relu/relu-content.md b/public/content/learn/activation-functions/relu/relu-content.md new file mode 100644 index 0000000..be18afd --- /dev/null +++ b/public/content/learn/activation-functions/relu/relu-content.md @@ -0,0 +1,339 @@ +--- +hero: + title: "ReLU" + subtitle: "Rectified Linear Unit - The Most Popular Activation Function" + tags: + - "โšก Activation Functions" + - "โฑ๏ธ 10 min read" +--- + +ReLU is the **most widely used** activation function in deep learning. It's simple, fast, and works incredibly well! + +## The Formula + +**ReLU(x) = max(0, x)** + +That's it! If the input is negative, output 0. If positive, output the input unchanged. + +![ReLU Graph](/content/learn/activation-functions/relu/relu-graph.png) + +```yaml +Input < 0 โ†’ Output = 0 +Input โ‰ฅ 0 โ†’ Output = Input + +Examples: +ReLU(-5) = 0 +ReLU(-1) = 0 +ReLU(0) = 0 +ReLU(3) = 3 +ReLU(10) = 10 +``` + +## How It Works + +**Example:** + +```python +import torch +import torch.nn as nn + +# Create ReLU activation +relu = nn.ReLU() + +# Test with different values +x = torch.tensor([-3.0, -1.0, 0.0, 2.0, 5.0]) +output = relu(x) + +print(output) +# tensor([0., 0., 0., 2., 5.]) +``` + +**Manual calculation:** + +```yaml +Input: [-3.0, -1.0, 0.0, 2.0, 5.0] + โ†“ โ†“ โ†“ โ†“ โ†“ +ReLU: max(0,-3) max(0,-1) max(0,0) max(0,2) max(0,5) + โ†“ โ†“ โ†“ โ†“ โ†“ +Output: [0.0, 0.0, 0.0, 2.0, 5.0] +``` + +![ReLU Example](/content/learn/activation-functions/relu/relu-example.png) + +**The rule:** Negative numbers get "zeroed out", positive numbers pass through unchanged. + +## In Code (Simple Implementation) + +You can implement ReLU yourself: + +```python +import torch + +def relu(x): + """Simple ReLU implementation""" + return torch.maximum(torch.tensor(0.0), x) + +# Test it +x = torch.tensor([-2.0, 3.0, -1.0, 4.0]) +output = relu(x) +print(output) +# tensor([0., 3., 0., 4.]) +``` + +Or even simpler with element-wise operations: + +```python +def relu_simple(x): + """Even simpler ReLU""" + return x * (x > 0) # Multiply by boolean mask + +x = torch.tensor([-2.0, 3.0, -1.0, 4.0]) +output = relu_simple(x) +print(output) +# tensor([0., 3., 0., 4.]) +``` + +## Why ReLU is Amazing + +### 1. Simple and Fast + +```yaml +Computation: Just one comparison! + if x > 0: return x + else: return 0 + +No expensive operations: + โœ“ No exponentials (unlike sigmoid/tanh) + โœ“ No divisions + โœ“ Just comparison and selection +``` + +### 2. Solves Vanishing Gradient Problem + +For positive values, gradient is always 1: + +```python +import torch + +x = torch.tensor([5.0], requires_grad=True) +y = torch.relu(x) +y.backward() + +print(x.grad) # tensor([1.]) +# Gradient is 1 for positive inputs! +``` + +**Why this matters:** + +```yaml +Sigmoid/Tanh: gradients get very small (vanishing) +ReLU: gradient is 1 for positive inputs + +Result: Faster training, deeper networks possible! +``` + +### 3. Creates Sparsity + +ReLU zeros out negative values, creating sparse activations: + +![ReLU Network](/content/learn/activation-functions/relu/relu-network.png) + +```python +# Example: network layer output +layer_output = torch.tensor([-2.1, 3.5, -0.8, 1.2, -1.5]) +activated = torch.relu(layer_output) + +print(activated) +# tensor([0.0, 3.5, 0.0, 1.2, 0.0]) + +# 60% of activations are zero! 
+sparsity = (activated == 0).sum().item() / activated.numel() +print(f"Sparsity: {sparsity:.1%}") +# Output: Sparsity: 60.0% +``` + +**Benefits of sparsity:** + +```yaml +Sparse networks: + โœ“ More efficient (many zeros) + โœ“ Better generalization + โœ“ Easier to interpret + โœ“ Faster computation +``` + +## Using ReLU in PyTorch + +### Method 1: As a Layer + +```python +import torch.nn as nn + +# Create a neural network with ReLU +model = nn.Sequential( + nn.Linear(10, 20), + nn.ReLU(), # โ† ReLU activation + nn.Linear(20, 5), + nn.ReLU(), # โ† Another ReLU + nn.Linear(5, 1) +) +``` + +### Method 2: As a Function + +```python +import torch +import torch.nn.functional as F + +x = torch.randn(5, 10) + +# Apply ReLU directly +output = F.relu(x) + +# Same as +output = torch.relu(x) +``` + +### Method 3: Manual Implementation + +```python +# In your custom forward pass +def forward(self, x): + x = self.linear1(x) + x = torch.relu(x) # Apply ReLU + x = self.linear2(x) + return x +``` + +## Practical Example: Multi-Layer Network + +```python +import torch +import torch.nn as nn + +# 3-layer network with ReLU +class SimpleNet(nn.Module): + def __init__(self): + super().__init__() + self.fc1 = nn.Linear(784, 256) # Input layer + self.fc2 = nn.Linear(256, 128) # Hidden layer + self.fc3 = nn.Linear(128, 10) # Output layer + + def forward(self, x): + x = self.fc1(x) + x = torch.relu(x) # ReLU after layer 1 + + x = self.fc2(x) + x = torch.relu(x) # ReLU after layer 2 + + x = self.fc3(x) + # No ReLU on output layer! + return x + +# Test it +model = SimpleNet() +input_data = torch.randn(32, 784) # Batch of 32 +output = model(input_data) + +print(output.shape) # torch.Size([32, 10]) +``` + +## The Dying ReLU Problem + +**Issue:** Sometimes neurons can get "stuck" outputting only zeros. + +```python +# Neuron with large negative bias +weights = torch.randn(10) +bias = torch.tensor(-100.0) # Very negative! + +# Forward pass +x = torch.randn(10) +linear_output = x @ weights + bias +activated = torch.relu(linear_output) + +print(linear_output) # tensor(-98.5) - always negative! +print(activated) # tensor(0.) - always zero! +``` + +**Why this happens:** + +```yaml +1. Neuron produces negative output +2. ReLU makes it zero +3. Gradient for negative inputs is also zero +4. Neuron never updates โ†’ stuck at zero forever! + +Solution: Use variants like Leaky ReLU or careful initialization +``` + +## ReLU Variants + +### Leaky ReLU + +Allows small negative values: + +```python +import torch.nn as nn + +# Standard ReLU +relu = nn.ReLU() +print(relu(torch.tensor(-1.0))) # tensor(0.) 
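+# For x < 0 the gradient of standard ReLU is 0 as well, so a neuron stuck there
+# gets no learning signal (the "dying ReLU" problem above). Leaky ReLU keeps a
+# small slope for negative inputs so gradients can still flow.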
+ +# Leaky ReLU (small slope for negatives) +leaky_relu = nn.LeakyReLU(negative_slope=0.01) +print(leaky_relu(torch.tensor(-1.0))) # tensor(-0.0100) +``` + +**Formula:** + +```yaml +LeakyReLU(x) = max(0.01x, x) + +For x < 0: output = 0.01 * x (small negative) +For x โ‰ฅ 0: output = x (unchanged) +``` + +## Key Takeaways + +โœ“ **Simple formula:** max(0, x) + +โœ“ **Fast:** Just comparison, no complex math + +โœ“ **Solves vanishing gradients:** Gradient is 1 for positive values + +โœ“ **Creates sparsity:** Zeros out negative activations + +โœ“ **Most popular:** Default choice for hidden layers + +โœ“ **Watch out for:** Dying ReLU (neurons stuck at zero) + +**Quick Reference:** + +```python +# Using ReLU +import torch +import torch.nn as nn +import torch.nn.functional as F + +# Method 1: Module +relu_layer = nn.ReLU() +output = relu_layer(x) + +# Method 2: Functional +output = F.relu(x) + +# Method 3: Direct +output = torch.relu(x) + +# Method 4: Manual +output = torch.maximum(torch.tensor(0.0), x) +``` + +**When to use ReLU:** +- โœ“ Hidden layers in CNNs +- โœ“ Hidden layers in feedforward networks +- โœ“ Default activation for most architectures +- โœ— NOT for output layer (use softmax/sigmoid/linear instead) + +**Remember:** ReLU is simple but powerful. It's the workhorse of modern deep learning! ๐ŸŽ‰ diff --git a/public/content/learn/activation-functions/relu/relu-example.png b/public/content/learn/activation-functions/relu/relu-example.png new file mode 100644 index 0000000..188ac1b Binary files /dev/null and b/public/content/learn/activation-functions/relu/relu-example.png differ diff --git a/public/content/learn/activation-functions/relu/relu-graph.png b/public/content/learn/activation-functions/relu/relu-graph.png new file mode 100644 index 0000000..db0625a Binary files /dev/null and b/public/content/learn/activation-functions/relu/relu-graph.png differ diff --git a/public/content/learn/activation-functions/relu/relu-network.png b/public/content/learn/activation-functions/relu/relu-network.png new file mode 100644 index 0000000..d577e3c Binary files /dev/null and b/public/content/learn/activation-functions/relu/relu-network.png differ diff --git a/public/content/learn/activation-functions/sigmoid/sigmoid-classification.png b/public/content/learn/activation-functions/sigmoid/sigmoid-classification.png new file mode 100644 index 0000000..ef2d10b Binary files /dev/null and b/public/content/learn/activation-functions/sigmoid/sigmoid-classification.png differ diff --git a/public/content/learn/activation-functions/sigmoid/sigmoid-content.md b/public/content/learn/activation-functions/sigmoid/sigmoid-content.md new file mode 100644 index 0000000..9700aa0 --- /dev/null +++ b/public/content/learn/activation-functions/sigmoid/sigmoid-content.md @@ -0,0 +1,357 @@ +--- +hero: + title: "Sigmoid" + subtitle: "The Classic S-shaped Activation Function" + tags: + - "โšก Activation Functions" + - "โฑ๏ธ 10 min read" +--- + +Sigmoid is a smooth, S-shaped function that **squashes any input to a value between 0 and 1**. Perfect for probabilities! + +## The Formula + +**ฯƒ(x) = 1 / (1 + eโปหฃ)** + +The output is always between 0 and 1, making it ideal for binary classification! 
+ +![Sigmoid Graph](/content/learn/activation-functions/sigmoid/sigmoid-graph.png) + +```yaml +Input โ†’ -โˆž โ†’ Output โ†’ 0 +Input = 0 โ†’ Output = 0.5 +Input โ†’ +โˆž โ†’ Output โ†’ 1 + +Key property: Output is always in (0, 1) +``` + +## How It Works + +**Example:** + +```python +import torch +import torch.nn as nn + +# Create sigmoid activation +sigmoid = nn.Sigmoid() + +# Test with different values +x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0]) +output = sigmoid(x) + +print(output) +# tensor([0.0067, 0.2689, 0.5000, 0.7311, 0.9933]) +``` + +![Sigmoid Example](/content/learn/activation-functions/sigmoid/sigmoid-example.png) + +**Manual calculation (for x = 2):** + +```yaml +ฯƒ(2) = 1 / (1 + eโปยฒ) + = 1 / (1 + 0.1353) + = 1 / 1.1353 + = 0.881 + +Result: ~0.88 or 88% probability +``` + +## The S-Shape Explained + +```yaml +Large negative input (x = -10): + eโปโฝโปยนโฐโพ = eยนโฐ = 22026 (huge!) + ฯƒ(x) = 1 / (1 + 22026) โ‰ˆ 0.00005 + โ†’ Output near 0 + +Zero input (x = 0): + eโปโฐ = 1 + ฯƒ(x) = 1 / (1 + 1) = 0.5 + โ†’ Output exactly 0.5 + +Large positive input (x = 10): + eโปยนโฐ = 0.000045 (tiny!) + ฯƒ(x) = 1 / (1 + 0.000045) โ‰ˆ 0.99995 + โ†’ Output near 1 +``` + +## Binary Classification + +Sigmoid's killer application: **predicting probabilities for binary classification**! + +![Sigmoid Classification](/content/learn/activation-functions/sigmoid/sigmoid-classification.png) + +**Example:** + +```python +import torch +import torch.nn as nn + +# Binary classification model +class BinaryClassifier(nn.Module): + def __init__(self): + super().__init__() + self.linear = nn.Linear(10, 1) # 10 features โ†’ 1 output + self.sigmoid = nn.Sigmoid() + + def forward(self, x): + logits = self.linear(x) + probabilities = self.sigmoid(logits) + return probabilities + +# Test +model = BinaryClassifier() +x = torch.randn(5, 10) # 5 samples, 10 features each +probs = model(x) + +print(probs) +# tensor([[0.7234], +# [0.3421], +# [0.8956], +# [0.1234], +# [0.6543]], grad_fn=) + +# Convert to predictions +predictions = (probs > 0.5).float() +print(predictions) +# tensor([[1.], # Class 1 (prob > 0.5) +# [0.], # Class 0 (prob < 0.5) +# [1.], +# [0.], +# [1.]]) +``` + +**What happened:** + +```yaml +Model output (logit): 2.5 + โ†“ +Sigmoid: 1/(1 + eโปยฒยทโต) = 0.92 + โ†“ +0.92 > 0.5 โ†’ Predict Class 1! +``` + +## In Code (Simple Implementation) + +```python +import torch + +def sigmoid(x): + """Simple sigmoid implementation""" + return 1 / (1 + torch.exp(-x)) + +# Test it +x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0]) +output = sigmoid(x) +print(output) +# tensor([0.1192, 0.2689, 0.5000, 0.7311, 0.8808]) +``` + +## Using Sigmoid in PyTorch + +### Method 1: As a Layer + +```python +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(10, 5), + nn.ReLU(), + nn.Linear(5, 1), + nn.Sigmoid() # โ† Output layer for binary classification +) +``` + +### Method 2: As a Function + +```python +import torch +import torch.nn.functional as F + +x = torch.randn(5, 1) +output = F.sigmoid(x) # or torch.sigmoid(x) +``` + +### Method 3: Combined with Loss (BCE) + +```python +import torch +import torch.nn as nn + +# Binary Cross Entropy already includes sigmoid! +criterion = nn.BCEWithLogitsLoss() # Sigmoid + BCE + +# Model outputs raw logits (no sigmoid) +logits = model(x) +loss = criterion(logits, targets) # Sigmoid applied internally! +``` + +## The Vanishing Gradient Problem + +Sigmoid's main weakness: **gradients vanish for large inputs**! 
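+
+Why this happens: the sigmoid derivative is **σ'(x) = σ(x) · (1 - σ(x))**, which peaks at just 0.25 (at x = 0) and shrinks toward zero as |x| grows. A quick sketch to see the numbers:
+
+```python
+import torch
+
+# Evaluate the sigmoid derivative at a few points
+x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
+s = torch.sigmoid(x)
+print(s * (1 - s))  # σ'(x) = σ(x)(1 - σ(x))
+# tensor([4.5396e-05, 1.0499e-01, 2.5000e-01, 1.0499e-01, 4.5396e-05])
+```
+
+The demo below shows the same effect through autograd: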
+ +```python +import torch + +# Large input +x = torch.tensor([10.0], requires_grad=True) +y = torch.sigmoid(x) +y.backward() + +print(f"Input: {x.item()}") +print(f"Output: {y.item():.6f}") +print(f"Gradient: {x.grad.item():.6f}") +# Gradient: 0.000045 โ† Very small! +``` + +**Why this is bad:** + +```yaml +Gradient too small โ†’ + Slow learning โ†’ + Deep networks struggle โ†’ + ReLU became more popular! + +This is why ReLU replaced sigmoid in hidden layers. +``` + +**When sigmoid gradients vanish:** + +```yaml +For x = -10 or x = 10: + Output is ~0 or ~1 (saturated) + Gradient โ‰ˆ 0 (flat region) + Learning stops! + +For x near 0: + Output around 0.5 (steep region) + Gradient maximum (~0.25) + Learning is good here +``` + +## Practical Examples + +### Example 1: Email Spam Detector + +```python +import torch +import torch.nn as nn + +class SpamDetector(nn.Module): + def __init__(self, num_features): + super().__init__() + self.fc1 = nn.Linear(num_features, 64) + self.fc2 = nn.Linear(64, 32) + self.fc3 = nn.Linear(32, 1) + self.sigmoid = nn.Sigmoid() + + def forward(self, x): + x = torch.relu(self.fc1(x)) + x = torch.relu(self.fc2(x)) + x = self.fc3(x) + probability = self.sigmoid(x) # Sigmoid at end! + return probability + +# Predict +email_features = torch.randn(1, 100) +spam_probability = model(email_features) + +if spam_probability > 0.5: + print(f"SPAM (confidence: {spam_probability.item():.2%})") +else: + print(f"NOT SPAM (confidence: {1-spam_probability.item():.2%})") +``` + +### Example 2: Medical Diagnosis + +```python +# Patient features โ†’ Disease probability +patient = torch.randn(1, 50) # 50 medical features +probability = model(patient) + +print(f"Disease probability: {probability.item():.1%}") +# Output: Disease probability: 23.4% + +if probability > 0.7: + print("High risk - recommend testing") +elif probability > 0.3: + print("Medium risk - monitor") +else: + print("Low risk") +``` + +## Sigmoid vs ReLU + +```yaml +Sigmoid: + โœ“ Outputs 0 to 1 (probabilities) + โœ“ Smooth, differentiable everywhere + โœ“ Great for binary classification OUTPUT + โœ— Vanishing gradients for large |x| + โœ— Slow computation (exponential) + โœ— NOT zero-centered + +ReLU: + โœ“ Fast (simple comparison) + โœ“ No vanishing gradient for x > 0 + โœ“ Creates sparsity + โœ— Outputs 0 to โˆž (not probabilities) + โœ— Dying ReLU problem + โœ— NOT smooth at x = 0 +``` + +**When to use each:** + +```yaml +Use Sigmoid for: + โœ“ Binary classification output layer + โœ“ When you need probabilities + โœ“ Gates in LSTM/GRU + +Use ReLU for: + โœ“ Hidden layers + โœ“ Convolutional layers + โœ“ Most modern architectures +``` + +## Key Takeaways + +โœ“ **S-shaped curve:** Smooth transition from 0 to 1 + +โœ“ **Formula:** ฯƒ(x) = 1 / (1 + eโปหฃ) + +โœ“ **Output range:** Always between 0 and 1 + +โœ“ **Perfect for probabilities:** Binary classification output + +โœ“ **Vanishing gradients:** Problem in deep networks + +โœ“ **Mostly for output:** ReLU used in hidden layers instead + +**Quick Reference:** + +```python +# Using sigmoid +import torch +import torch.nn as nn +import torch.nn.functional as F + +# Method 1: Module +sigmoid_layer = nn.Sigmoid() +output = sigmoid_layer(x) + +# Method 2: Functional +output = F.sigmoid(x) + +# Method 3: Direct +output = torch.sigmoid(x) + +# Method 4: Manual +output = 1 / (1 + torch.exp(-x)) + +# For binary classification with loss +criterion = nn.BCEWithLogitsLoss() # Includes sigmoid! +``` + +**Remember:** Sigmoid for the output, ReLU for the hidden layers! 
๐ŸŽ‰ diff --git a/public/content/learn/activation-functions/sigmoid/sigmoid-example.png b/public/content/learn/activation-functions/sigmoid/sigmoid-example.png new file mode 100644 index 0000000..d2e0dd4 Binary files /dev/null and b/public/content/learn/activation-functions/sigmoid/sigmoid-example.png differ diff --git a/public/content/learn/activation-functions/sigmoid/sigmoid-graph.png b/public/content/learn/activation-functions/sigmoid/sigmoid-graph.png new file mode 100644 index 0000000..e1c7411 Binary files /dev/null and b/public/content/learn/activation-functions/sigmoid/sigmoid-graph.png differ diff --git a/public/content/learn/activation-functions/silu/silu-content.md b/public/content/learn/activation-functions/silu/silu-content.md new file mode 100644 index 0000000..276bd25 --- /dev/null +++ b/public/content/learn/activation-functions/silu/silu-content.md @@ -0,0 +1,375 @@ +--- +hero: + title: "SiLU" + subtitle: "Sigmoid Linear Unit - The Swish Activation" + tags: + - "โšก Activation Functions" + - "โฑ๏ธ 10 min read" +--- + +SiLU (also called Swish) is a **smooth** alternative to ReLU. It's ReLU but with a smooth curve instead of a hard cutoff! + +## The Formula + +**SiLU(x) = x ยท ฯƒ(x) = x ยท sigmoid(x)** + +Simply multiply the input by its sigmoid! This creates a smooth, non-linear function. + +![SiLU Graph](/content/learn/activation-functions/silu/silu-graph.png) + +```yaml +For large negative x: + sigmoid(x) โ‰ˆ 0 + SiLU(x) = x ยท 0 โ‰ˆ 0 + +For x = 0: + sigmoid(0) = 0.5 + SiLU(0) = 0 ยท 0.5 = 0 + +For large positive x: + sigmoid(x) โ‰ˆ 1 + SiLU(x) = x ยท 1 โ‰ˆ x +``` + +## How It Works + +**Example:** + +```python +import torch +import torch.nn as nn + +# Create SiLU activation +silu = nn.SiLU() + +# Test with different values +x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0]) +output = silu(x) + +print(output) +# tensor([-0.2384, -0.2689, 0.0000, 0.7311, 1.7616]) +``` + +**Manual calculation (for x = 2):** + +```yaml +SiLU(2) = 2 ยท sigmoid(2) + = 2 ยท (1 / (1 + eโปยฒ)) + = 2 ยท 0.881 + = 1.762 + +Notice: Not just 2 (like ReLU), but close! +``` + +## The Smooth Advantage + +Unlike ReLU, SiLU is **smooth everywhere** and allows small negative values: + +![SiLU vs ReLU](/content/learn/activation-functions/silu/silu-vs-relu.png) + +**Example comparison:** + +```python +import torch + +x = torch.tensor([-2.0, -1.0, -0.5, 0.0, 1.0, 2.0]) + +# ReLU: hard cutoff +relu_out = torch.relu(x) +print("ReLU:", relu_out) +# tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 2.0000]) + +# SiLU: smooth transition +silu_out = torch.nn.functional.silu(x) +print("SiLU:", silu_out) +# tensor([-0.2384, -0.2689, -0.1887, 0.0000, 0.7311, 1.7616]) +``` + +**Key differences:** + +```yaml +ReLU: + x < 0 โ†’ Output = 0 (hard cutoff) + x > 0 โ†’ Output = x (straight line) + NOT smooth at x = 0 + +SiLU: + x < 0 โ†’ Small negative values (smooth) + x > 0 โ†’ Nearly linear (smooth) + Smooth everywhere! +``` + +## Why SiLU is Better Than ReLU + +### 1. Smooth Gradients + +```python +import torch + +x = torch.tensor([0.0], requires_grad=True) + +# ReLU gradient at x=0 is undefined (jump) +# SiLU gradient at x=0 is smooth (0.5) +y = torch.nn.functional.silu(x) +y.backward() + +print(x.grad) # tensor([0.5000]) +# Smooth gradient! +``` + +### 2. No Dying Neurons + +```python +# Neuron that would "die" with ReLU +x = torch.tensor([-5.0], requires_grad=True) + +# ReLU would output 0 with gradient 0 +relu_out = torch.relu(x) +print(relu_out) # tensor([0.]) โ† Dead! 
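+# ReLU's gradient for x < 0 is exactly 0, so this neuron would never update.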
+ +# SiLU allows gradient flow +silu_out = torch.nn.functional.silu(x) +print(silu_out) # tensor([-0.0337]) โ† Small but not zero! + +# Gradient still flows +silu_out.backward() +print(x.grad) # tensor([0.0030]) โ† Can still learn! +``` + +### 3. Better Performance + +Recent research shows SiLU **outperforms ReLU** in many tasks, especially in vision transformers and modern architectures! + +## In Code (Simple Implementation) + +```python +import torch + +def silu(x): + """Simple SiLU implementation""" + return x * torch.sigmoid(x) + +# Test it +x = torch.tensor([-1.0, 0.0, 1.0, 2.0]) +output = silu(x) +print(output) +# tensor([-0.2689, 0.0000, 0.7311, 1.7616]) + +# Verify against PyTorch +print(torch.nn.functional.silu(x)) +# tensor([-0.2689, 0.0000, 0.7311, 1.7616]) โ† Same! +``` + +## Using SiLU in PyTorch + +### Method 1: As a Layer + +```python +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(10, 20), + nn.SiLU(), # โ† SiLU activation + nn.Linear(20, 5), + nn.SiLU(), # โ† Another SiLU + nn.Linear(5, 1) +) +``` + +### Method 2: As a Function + +```python +import torch.nn.functional as F + +x = torch.randn(5, 10) +output = F.silu(x) +``` + +## Practical Example: Vision Transformer + +SiLU is used in many modern architectures like EfficientNet and Vision Transformers: + +```python +import torch +import torch.nn as nn + +class ModernBlock(nn.Module): + def __init__(self, dim): + super().__init__() + self.norm = nn.LayerNorm(dim) + self.fc1 = nn.Linear(dim, dim * 4) + self.fc2 = nn.Linear(dim * 4, dim) + self.silu = nn.SiLU() # โ† SiLU instead of ReLU! + + def forward(self, x): + residual = x + x = self.norm(x) + x = self.fc1(x) + x = self.silu(x) # Smooth activation + x = self.fc2(x) + return x + residual + +# Test +block = ModernBlock(dim=128) +x = torch.randn(32, 128) # Batch of 32 +output = block(x) +print(output.shape) # torch.Size([32, 128]) +``` + +## SiLU vs Other Activations + +```yaml +SiLU (Swish): + โœ“ Smooth everywhere (no hard cutoff) + โœ“ No dying neurons + โœ“ Better performance than ReLU + โœ“ Self-gated (uses its own sigmoid) + โœ— Slightly slower than ReLU + โœ— More computation (sigmoid) + +ReLU: + โœ“ Fastest (simple comparison) + โœ“ Simple to understand + โœ— Not smooth at x=0 + โœ— Dying neuron problem + โœ— Hard cutoff at zero + +Tanh: + โœ“ Zero-centered + โœ“ Smooth + โœ— Vanishing gradients + โœ— Slower than both +``` + +## Where SiLU is Used + +**Modern architectures using SiLU:** +- EfficientNet (image classification) +- Vision Transformers (ViT) +- Some language models +- Mobile-optimized networks + +**Example from research:** + +```yaml +Study: "Searching for Activation Functions" (Google Brain, 2017) +Finding: Swish/SiLU outperformed ReLU on ImageNet +Result: Adopted in many modern architectures + +Performance gain: ~0.6-0.9% accuracy improvement +``` + +## Practical Example: EfficientNet-style Block + +```python +import torch +import torch.nn as nn + +class MBConvBlock(nn.Module): + """Mobile Inverted Bottleneck with SiLU""" + def __init__(self, in_channels, out_channels, expand_ratio=4): + super().__init__() + hidden_dim = in_channels * expand_ratio + + self.expand_conv = nn.Conv2d(in_channels, hidden_dim, 1) + self.depthwise_conv = nn.Conv2d(hidden_dim, hidden_dim, 3, + padding=1, groups=hidden_dim) + self.project_conv = nn.Conv2d(hidden_dim, out_channels, 1) + self.silu = nn.SiLU() # โ† SiLU for smooth activation + + def forward(self, x): + # Expand + out = self.expand_conv(x) + out = self.silu(out) # SiLU + + # Depthwise + out = 
self.depthwise_conv(out) + out = self.silu(out) # SiLU + + # Project + out = self.project_conv(out) + return out + +# Test +block = MBConvBlock(32, 64) +x = torch.randn(1, 32, 56, 56) # Image: batch, channels, H, W +output = block(x) +print(output.shape) # torch.Size([1, 64, 56, 56]) +``` + +## The Self-Gating Mechanism + +SiLU is "self-gated" - it uses its own sigmoid as a gate: + +```python +import torch + +x = torch.tensor([2.0]) + +# SiLU gates itself +sigmoid_gate = torch.sigmoid(x) # 0.881 +output = x * sigmoid_gate # 2.0 * 0.881 = 1.762 + +print(f"Input: {x.item()}") +print(f"Gate: {sigmoid_gate.item():.3f}") +print(f"Output: {output.item():.3f}") + +# Input: 2.0 +# Gate: 0.881 +# Output: 1.762 +``` + +**What this means:** + +```yaml +The input controls its own "gate": + - Large positive x โ†’ gate โ‰ˆ 1 โ†’ mostly pass through + - Large negative x โ†’ gate โ‰ˆ 0 โ†’ mostly blocked + - Small x โ†’ partial gating (smooth) + +This self-regulation makes SiLU effective! +``` + +## Key Takeaways + +โœ“ **Formula:** x ยท sigmoid(x) + +โœ“ **Smooth:** No hard cutoff like ReLU + +โœ“ **Self-gated:** Uses its own sigmoid as a gate + +โœ“ **Better than ReLU:** Improved performance in many tasks + +โœ“ **No dying neurons:** Always has gradient flow + +โœ“ **Modern choice:** Used in EfficientNet, ViT, and more + +**Quick Reference:** + +```python +# Using SiLU +import torch +import torch.nn as nn +import torch.nn.functional as F + +# Method 1: Module +silu_layer = nn.SiLU() +output = silu_layer(x) + +# Method 2: Functional +output = F.silu(x) + +# Method 3: Manual +output = x * torch.sigmoid(x) + +# Also known as Swish +swish = nn.SiLU() # Same thing! +``` + +**When to use SiLU:** +- โœ“ Modern CNN architectures +- โœ“ Vision transformers +- โœ“ When you want better performance than ReLU +- โœ“ Mobile/efficient networks + +**Remember:** SiLU is the smooth, modern upgrade to ReLU! ๐ŸŽ‰ diff --git a/public/content/learn/activation-functions/silu/silu-graph.png b/public/content/learn/activation-functions/silu/silu-graph.png new file mode 100644 index 0000000..3cb874d Binary files /dev/null and b/public/content/learn/activation-functions/silu/silu-graph.png differ diff --git a/public/content/learn/activation-functions/silu/silu-vs-relu.png b/public/content/learn/activation-functions/silu/silu-vs-relu.png new file mode 100644 index 0000000..3c6ef33 Binary files /dev/null and b/public/content/learn/activation-functions/silu/silu-vs-relu.png differ diff --git a/public/content/learn/activation-functions/softmax/softmax-classification.png b/public/content/learn/activation-functions/softmax/softmax-classification.png new file mode 100644 index 0000000..57b6983 Binary files /dev/null and b/public/content/learn/activation-functions/softmax/softmax-classification.png differ diff --git a/public/content/learn/activation-functions/softmax/softmax-content.md b/public/content/learn/activation-functions/softmax/softmax-content.md new file mode 100644 index 0000000..32b1bd6 --- /dev/null +++ b/public/content/learn/activation-functions/softmax/softmax-content.md @@ -0,0 +1,411 @@ +--- +hero: + title: "Softmax" + subtitle: "Multi-class Classification Activation Function" + tags: + - "โšก Activation Functions" + - "โฑ๏ธ 10 min read" +--- + +Softmax converts raw model outputs (logits) into **probabilities that sum to 1**. Perfect for multi-class classification! + +## The Formula + +**Softmax(xแตข) = exp(xแตข) / ฮฃ exp(xโฑผ)** + +For each element: +1. Take exponential (e^x) +2. 
Divide by sum of all exponentials
+
+This ensures all outputs are positive and sum to exactly 1!
+
+## How It Works
+
+![Softmax Transformation](/content/learn/activation-functions/softmax/softmax-transformation.png)
+
+**Example:**
+
+```python
+import torch
+import torch.nn as nn
+
+# Raw model outputs (logits)
+logits = torch.tensor([2.0, 1.0, 0.1])
+
+# Apply softmax
+softmax = nn.Softmax(dim=0)
+probabilities = softmax(logits)
+
+print(probabilities)
+# tensor([0.6590, 0.2424, 0.0986])
+
+print(probabilities.sum())
+# tensor(1.0000)  ← Sums to 1!
+```
+
+**Manual calculation:**
+
+```yaml
+Step 1: Exponentiate each value
+  exp(2.0) = 7.389
+  exp(1.0) = 2.718
+  exp(0.1) = 1.105
+
+Step 2: Sum all exponentials
+  Sum = 7.389 + 2.718 + 1.105 = 11.212
+
+Step 3: Divide each by sum
+  7.389 / 11.212 = 0.659  (65.9%)
+  2.718 / 11.212 = 0.242  (24.2%)
+  1.105 / 11.212 = 0.099  (9.9%)
+
+Result: [0.659, 0.242, 0.099]
+Verification: 0.659 + 0.242 + 0.099 = 1.0 ✓
+```
+
+## Multi-Class Classification
+
+Softmax's main use: **predicting probabilities across multiple classes**!
+
+![Softmax Classification](/content/learn/activation-functions/softmax/softmax-classification.png)
+
+**Example:**
+
+```python
+import torch
+import torch.nn as nn
+
+# 10-class classification model
+class MultiClassifier(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.fc1 = nn.Linear(784, 128)  # Input layer
+        self.fc2 = nn.Linear(128, 64)   # Hidden layer
+        self.fc3 = nn.Linear(64, 10)    # Output: 10 classes
+        self.softmax = nn.Softmax(dim=1)
+
+    def forward(self, x):
+        x = torch.relu(self.fc1(x))
+        x = torch.relu(self.fc2(x))
+        logits = self.fc3(x)
+        probabilities = self.softmax(logits)  # ← Softmax!
+        return probabilities
+
+# Test
+model = MultiClassifier()
+batch = torch.randn(5, 784)  # 5 images
+probs = model(batch)
+
+print(probs.shape)  # torch.Size([5, 10])
+print(probs[0])     # First image probabilities
+# tensor([0.0823, 0.1245, 0.0567, 0.3421, 0.0912,
+#         0.0734, 0.1823, 0.0234, 0.0156, 0.0085])
+
+print(probs[0].sum())  # tensor(1.0000)  ← Sums to 1!
+
+# Get predictions
+predictions = torch.argmax(probs, dim=1)
+print(predictions)  # tensor([3, 7, 2, 3, 1])
+# Class indices with highest probability
+```
+
+## Why Exponential?
+
+The exponential makes softmax **sensitive to large values**:
+
+```python
+import torch
+
+# Small difference in logits
+logits1 = torch.tensor([1.0, 1.1, 1.2])
+probs1 = torch.softmax(logits1, dim=0)
+print(probs1)
+# tensor([0.3006, 0.3322, 0.3672])
+# Similar probabilities
+
+# Large difference in logits
+logits2 = torch.tensor([1.0, 2.0, 3.0])
+probs2 = torch.softmax(logits2, dim=0)
+print(probs2)
+# tensor([0.0900, 0.2447, 0.6652])
+# Clear winner!
+
+# Huge difference
+logits3 = torch.tensor([1.0, 5.0, 10.0])
+probs3 = torch.softmax(logits3, dim=0)
+print(probs3)
+# tensor([0.0000, 0.0067, 0.9933])
+# Dominant class!
+```
+
+**What happened:**
+
+```yaml
+exp() amplifies differences:
+
+Small logits [1.0, 1.1, 1.2]:
+  exp → [2.7, 3.0, 3.3]
+  Difference is small → similar probabilities
+
+Large logits [1.0, 5.0, 10.0]:
+  exp → [2.7, 148, 22026]
+  Difference is HUGE → one dominates
+```
+
+## In Code (Simple Implementation)
+
+```python
+import torch
+
+def softmax(x):
+    """Simple softmax implementation"""
+    exp_x = torch.exp(x)
+    return exp_x / exp_x.sum()
+
+# Test it
+logits = torch.tensor([2.0, 1.0, 0.5])
+output = softmax(logits)
+print(output)
+# tensor([0.6285, 0.2312, 0.1402])
+print(output.sum())
+# tensor(1.0000)  ← Sums to 1!
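+
+# A note on numerical stability (illustrative sketch): exp() overflows for very
+# large logits, so a common trick is to subtract the max first. The result is
+# identical because the exp(max) factor cancels in numerator and denominator.
+def softmax_stable(x):
+    exp_x = torch.exp(x - x.max())
+    return exp_x / exp_x.sum()
+
+print(softmax_stable(torch.tensor([1000.0, 1001.0, 1002.0])))
+# tensor([0.0900, 0.2447, 0.6652])  (the naive version above returns nan here)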
+``` + +## Using Softmax in PyTorch + +### Method 1: As a Layer + +```python +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(784, 128), + nn.ReLU(), + nn.Linear(128, 10), + nn.Softmax(dim=1) # โ† Softmax on output +) +``` + +### Method 2: As a Function + +```python +import torch.nn.functional as F + +logits = torch.randn(32, 10) # Batch of 32, 10 classes +probs = F.softmax(logits, dim=1) # Softmax across classes + +print(probs.shape) # torch.Size([32, 10]) +print(probs.sum(dim=1)) # All 1.0 +``` + +### Method 3: Combined with Loss (CrossEntropy) + +**Important:** PyTorch's `CrossEntropyLoss` includes softmax! + +```python +import torch +import torch.nn as nn + +# CrossEntropy already has softmax! +criterion = nn.CrossEntropyLoss() + +# Model outputs raw logits (NO softmax) +logits = model(x) +loss = criterion(logits, targets) # Softmax applied internally! + +# DON'T do this: +# probs = F.softmax(logits, dim=1) # โ† Wrong! +# loss = criterion(probs, targets) # โ† Applies softmax twice! +``` + +## Temperature Scaling + +You can control softmax "confidence" with temperature: + +```python +import torch + +logits = torch.tensor([2.0, 1.0, 0.5]) + +# Normal softmax (temperature = 1) +probs_normal = torch.softmax(logits, dim=0) +print(probs_normal) +# tensor([0.6364, 0.2341, 0.1295]) + +# Low temperature (sharper, more confident) +probs_sharp = torch.softmax(logits / 0.5, dim=0) +print(probs_sharp) +# tensor([0.8360, 0.1131, 0.0508]) + +# High temperature (softer, less confident) +probs_soft = torch.softmax(logits / 2.0, dim=0) +print(probs_soft) +# tensor([0.4750, 0.3107, 0.2143]) +``` + +**Effect of temperature:** + +```yaml +T < 1 (low): + - Sharper probabilities + - More confident predictions + - Winner takes more + +T > 1 (high): + - Softer probabilities + - Less confident predictions + - More uniform distribution + +T = 1: + - Standard softmax +``` + +## Practical Example: Image Classification + +```python +import torch +import torch.nn as nn + +class ImageClassifier(nn.Module): + def __init__(self, num_classes=1000): + super().__init__() + self.features = nn.Sequential( + nn.Conv2d(3, 64, 3), + nn.ReLU(), + nn.MaxPool2d(2), + # ... more layers ... + ) + self.classifier = nn.Sequential( + nn.Linear(512, 256), + nn.ReLU(), + nn.Linear(256, num_classes) + # NO softmax here if using CrossEntropyLoss! + ) + + def forward(self, x): + x = self.features(x) + x = x.view(x.size(0), -1) # Flatten + logits = self.classifier(x) + return logits # Return logits, not probabilities! + +# For inference, apply softmax manually +model = ImageClassifier() +image = torch.randn(1, 3, 224, 224) + +with torch.no_grad(): + logits = model(image) + probs = torch.softmax(logits, dim=1) + + # Get top-5 predictions + top5_probs, top5_indices = torch.topk(probs, 5, dim=1) + + print("Top 5 predictions:") + for i in range(5): + print(f"Class {top5_indices[0, i]}: {top5_probs[0, i]:.1%}") +``` + +## Softmax Across Different Dimensions + +```python +import torch + +# Batch of logits +logits = torch.tensor([[2.0, 1.0, 0.5], + [0.8, 2.1, 1.3]]) # 2 samples, 3 classes + +# Softmax across classes (dim=1) +probs = torch.softmax(logits, dim=1) +print(probs) +# tensor([[0.6364, 0.2341, 0.1295], +# [0.1899, 0.6841, 0.1260]]) + +print(probs.sum(dim=1)) # tensor([1., 1.]) +# Each row sums to 1! + +# Softmax across samples (dim=0) - unusual! 
+probs_dim0 = torch.softmax(logits, dim=0) +print(probs_dim0.sum(dim=0)) # tensor([1., 1., 1.]) +# Each column sums to 1 +``` + +**Rule:** Use `dim=1` for batch processing (softmax across classes for each sample)! + +## Common Mistakes + +### โŒ Mistake 1: Softmax Before CrossEntropyLoss + +```python +# WRONG - softmax applied twice! +logits = model(x) +probs = torch.softmax(logits, dim=1) +loss = nn.CrossEntropyLoss()(probs, targets) # โ† ERROR! + +# CORRECT - CrossEntropy includes softmax +logits = model(x) +loss = nn.CrossEntropyLoss()(logits, targets) # โ† Correct! +``` + +### โŒ Mistake 2: Wrong Dimension + +```python +# Logits shape: (batch_size, num_classes) +logits = torch.randn(32, 10) + +# WRONG - softmax across batch +probs = torch.softmax(logits, dim=0) # โ† Each class sums to 1 (weird!) + +# CORRECT - softmax across classes +probs = torch.softmax(logits, dim=1) # โ† Each sample sums to 1 +``` + +## Key Takeaways + +โœ“ **Converts to probabilities:** All outputs between 0 and 1 + +โœ“ **Sums to 1:** All probabilities add up to exactly 1 + +โœ“ **Multi-class:** For 3+ classes (cat, dog, bird, etc.) + +โœ“ **Amplifies differences:** exp() makes large logits dominate + +โœ“ **CrossEntropy includes it:** Don't apply softmax before loss! + +โœ“ **Use dim=1:** For batch processing (softmax per sample) + +**Quick Reference:** + +```python +# Using softmax +import torch +import torch.nn as nn +import torch.nn.functional as F + +# Method 1: Module +softmax_layer = nn.Softmax(dim=1) +probs = softmax_layer(logits) + +# Method 2: Functional (most common) +probs = F.softmax(logits, dim=1) + +# Method 3: Direct +probs = torch.softmax(logits, dim=1) + +# For training with CrossEntropyLoss +criterion = nn.CrossEntropyLoss() # Includes softmax! +loss = criterion(logits, targets) # Don't softmax first! + +# For inference +with torch.no_grad(): + logits = model(x) + probs = F.softmax(logits, dim=1) + prediction = torch.argmax(probs, dim=1) +``` + +**When to use Softmax:** +- โœ“ Multi-class classification output (3+ classes) +- โœ“ When you need probability distribution +- โœ“ Attention mechanisms +- โœ— Binary classification (use sigmoid instead) +- โœ— Regression (use linear output) + +**Remember:** Softmax for multi-class, Sigmoid for binary! 
๐ŸŽ‰ diff --git a/public/content/learn/activation-functions/softmax/softmax-transformation.png b/public/content/learn/activation-functions/softmax/softmax-transformation.png new file mode 100644 index 0000000..5dc59c5 Binary files /dev/null and b/public/content/learn/activation-functions/softmax/softmax-transformation.png differ diff --git a/public/content/learn/activation-functions/swiglu/glu-variants.png b/public/content/learn/activation-functions/swiglu/glu-variants.png new file mode 100644 index 0000000..bf7ed1f Binary files /dev/null and b/public/content/learn/activation-functions/swiglu/glu-variants.png differ diff --git a/public/content/learn/activation-functions/swiglu/swiglu-architecture.png b/public/content/learn/activation-functions/swiglu/swiglu-architecture.png new file mode 100644 index 0000000..5330534 Binary files /dev/null and b/public/content/learn/activation-functions/swiglu/swiglu-architecture.png differ diff --git a/public/content/learn/activation-functions/swiglu/swiglu-content.md b/public/content/learn/activation-functions/swiglu/swiglu-content.md new file mode 100644 index 0000000..120ee40 --- /dev/null +++ b/public/content/learn/activation-functions/swiglu/swiglu-content.md @@ -0,0 +1,315 @@ +--- +hero: + title: "SwiGLU" + subtitle: "Swish-Gated Linear Unit - Advanced Activation" + tags: + - "โšก Activation Functions" + - "โฑ๏ธ 10 min read" +--- + +SwiGLU is a **gated activation function** used in state-of-the-art language models like LLaMA and PaLM. It's more complex than ReLU but much more powerful! + +## The Concept: Gating + +**Gating = One path controls another path** + +Think of it like a smart light switch - one signal decides how much of another signal gets through! + +![SwiGLU Architecture](/content/learn/activation-functions/swiglu/swiglu-architecture.png) + +## The Formula + +**SwiGLU(x) = SiLU(Wโ‚(x)) โŠ™ V(x)** + +Where: +- `Wโ‚(x)` = first linear transformation +- `SiLU()` = activation (swish) +- `V(x)` = second linear transformation (gate) +- `โŠ™` = element-wise multiplication + +**In plain English:** +1. Split input into two paths +2. Apply SiLU to first path +3. Keep second path as-is +4. Multiply them together element-wise + +## How It Works + +**Example:** + +```python +import torch +import torch.nn as nn + +class SwiGLU(nn.Module): + def __init__(self, dim): + super().__init__() + self.W1 = nn.Linear(dim, dim) + self.V = nn.Linear(dim, dim) + self.silu = nn.SiLU() + + def forward(self, x): + # Path 1: Linear + SiLU + gate = self.silu(self.W1(x)) + + # Path 2: Linear only + value = self.V(x) + + # Multiply together + output = gate * value + return output + +# Test +swiglu = SwiGLU(dim=128) +x = torch.randn(32, 128) # Batch of 32 +output = swiglu(x) + +print(output.shape) # torch.Size([32, 128]) +``` + +**Manual calculation (simplified):** + +```yaml +Input x = [1.0, 2.0, 3.0] + +Path 1 (Gate): + W1(x) = [-0.5, 2.0, 1.0] + SiLU(W1(x)) = [-0.19, 1.76, 0.73] + +Path 2 (Value): + V(x) = [0.8, -1.2, 2.0] + +Element-wise multiply: + [-0.19 * 0.8, 1.76 * -1.2, 0.73 * 2.0] + = [-0.15, -2.11, 1.46] + +The gate controls how much of value passes through! +``` + +## Why SwiGLU is Powerful + +### 1. Gating Mechanism + +```python +# Gating allows selective information flow +gate = torch.tensor([0.1, 0.5, 0.9]) # Low, medium, high gates +value = torch.tensor([5.0, 5.0, 5.0]) # Same values + +output = gate * value +print(output) +# tensor([0.5, 2.5, 4.5]) + +# Gate controls how much gets through! +``` + +### 2. 
Double the Parameters (More Capacity) + +```yaml +Regular FFN: + Linear(dim, 4*dim) โ†’ ReLU โ†’ Linear(4*dim, dim) + Parameters: dim*4*dim + 4*dim*dim = 8*dimยฒ + +SwiGLU: + Two parallel linears + gating + Parameters: Slightly more (~1.5x FFN) + +But: Better performance despite similar size! +``` + +### 3. Smooth Activation (SiLU) + +Using SiLU instead of ReLU provides smooth gradients! + +## The GLU Family + +![GLU Variants](/content/learn/activation-functions/swiglu/glu-variants.png) + +All GLU variants follow the same pattern: + +```yaml +GLU: ฯƒ(W(x)) โŠ™ V(x) โ† Sigmoid gate +ReGLU: ReLU(W(x)) โŠ™ V(x) โ† ReLU gate +GEGLU: GELU(W(x)) โŠ™ V(x) โ† GELU gate +SwiGLU: SiLU(W(x)) โŠ™ V(x) โ† SiLU gate (best!) +``` + +**Performance ranking (empirical):** + +```yaml +Best: SwiGLU โ‰ˆ GEGLU +Good: ReGLU +Original: GLU +``` + +## Using SwiGLU in Transformers + +SwiGLU is used in the feedforward network (FFN) of transformers: + +```python +import torch +import torch.nn as nn + +class SwiGLUFFN(nn.Module): + """Feedforward network with SwiGLU""" + def __init__(self, dim, hidden_dim=None): + super().__init__() + if hidden_dim is None: + hidden_dim = int(dim * 8/3) # Adjusted for gating + + self.W1 = nn.Linear(dim, hidden_dim, bias=False) + self.V = nn.Linear(dim, hidden_dim, bias=False) + self.W2 = nn.Linear(hidden_dim, dim, bias=False) + self.silu = nn.SiLU() + + def forward(self, x): + # SwiGLU activation + gate = self.silu(self.W1(x)) + value = self.V(x) + hidden = gate * value + + # Project back + output = self.W2(hidden) + return output + +# Example usage in transformer block +class TransformerBlock(nn.Module): + def __init__(self, dim): + super().__init__() + self.attention = nn.MultiheadAttention(dim, num_heads=8) + self.ffn = SwiGLUFFN(dim) # โ† SwiGLU FFN + self.norm1 = nn.LayerNorm(dim) + self.norm2 = nn.LayerNorm(dim) + + def forward(self, x): + # Attention block + x = x + self.attention(self.norm1(x), self.norm1(x), self.norm1(x))[0] + + # FFN block with SwiGLU + x = x + self.ffn(self.norm2(x)) + return x +``` + +## Where SwiGLU is Used + +**Major models using SwiGLU:** +- **LLaMA** (Meta's language model) +- **PaLM** (Google's language model) +- **GPT-J** (EleutherAI) +- Many other modern LLMs + +**Why they chose SwiGLU:** + +```yaml +Research findings: + - Better performance than standard FFN + - Improved training stability + - Smoother optimization + - State-of-the-art results + +Trade-off: Slightly more parameters, but worth it! 
+``` + +## Practical Example: LLaMA-style FFN + +```python +import torch +import torch.nn as nn + +class LLaMAFFN(nn.Module): + """FFN from LLaMA (uses SwiGLU)""" + def __init__(self, dim=4096, hidden_dim=11008): + super().__init__() + self.gate_proj = nn.Linear(dim, hidden_dim, bias=False) # W1 + self.up_proj = nn.Linear(dim, hidden_dim, bias=False) # V + self.down_proj = nn.Linear(hidden_dim, dim, bias=False) # W2 + self.silu = nn.SiLU() + + def forward(self, x): + # SwiGLU + gate = self.silu(self.gate_proj(x)) + up = self.up_proj(x) + hidden = gate * up + + # Project back down + output = self.down_proj(hidden) + return output + +# Test +ffn = LLaMAFFN(dim=512, hidden_dim=1376) # Smaller for demo +x = torch.randn(2, 10, 512) # Batch=2, seq_len=10, dim=512 +output = ffn(x) + +print(output.shape) # torch.Size([2, 10, 512]) +``` + +## Implementation Tips + +### Efficient Implementation + +```python +import torch +import torch.nn as nn + +class EfficientSwiGLU(nn.Module): + """Efficient SwiGLU with combined projection""" + def __init__(self, dim, hidden_dim): + super().__init__() + # Combine W1 and V into single matrix for efficiency + self.combined = nn.Linear(dim, hidden_dim * 2, bias=False) + self.down = nn.Linear(hidden_dim, dim, bias=False) + self.silu = nn.SiLU() + + def forward(self, x): + # Single matrix multiply, then split + combined = self.combined(x) + gate, value = combined.chunk(2, dim=-1) + + # SwiGLU + hidden = self.silu(gate) * value + output = self.down(hidden) + return output +``` + +## Key Takeaways + +โœ“ **Gated activation:** One path controls another + +โœ“ **Formula:** SiLU(Wโ‚(x)) โŠ™ V(x) + +โœ“ **State-of-the-art:** Used in LLaMA, PaLM, and modern LLMs + +โœ“ **Better than FFN:** Outperforms standard ReLU-based networks + +โœ“ **Smooth:** Thanks to SiLU activation + +โœ“ **More parameters:** But worth it for performance + +**Quick Reference:** + +```python +# Basic SwiGLU implementation +class SwiGLU(nn.Module): + def __init__(self, dim, hidden_dim): + super().__init__() + self.W1 = nn.Linear(dim, hidden_dim) + self.V = nn.Linear(dim, hidden_dim) + self.W2 = nn.Linear(hidden_dim, dim) + + def forward(self, x): + gate = torch.nn.functional.silu(self.W1(x)) + value = self.V(x) + hidden = gate * value + return self.W2(hidden) + +# Usage +swiglu = SwiGLU(dim=512, hidden_dim=2048) +output = swiglu(input_tensor) +``` + +**When to use SwiGLU:** +- โœ“ Transformer feedforward networks +- โœ“ Large language models +- โœ“ When you want state-of-the-art performance +- โœ“ Modern architectures + +**Remember:** SwiGLU is the advanced gating mechanism powering modern LLMs! ๐ŸŽ‰ diff --git a/public/content/learn/activation-functions/tanh/tanh-content.md b/public/content/learn/activation-functions/tanh/tanh-content.md new file mode 100644 index 0000000..648d448 --- /dev/null +++ b/public/content/learn/activation-functions/tanh/tanh-content.md @@ -0,0 +1,323 @@ +--- +hero: + title: "Tanh" + subtitle: "Hyperbolic Tangent - Zero-centered Activation" + tags: + - "โšก Activation Functions" + - "โฑ๏ธ 10 min read" +--- + +Tanh (hyperbolic tangent) is like Sigmoid's **zero-centered cousin**. It squashes inputs to the range **[-1, 1]** instead of [0, 1]. 
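+
+A quick way to see the difference (a minimal sketch, assuming PyTorch as in the rest of this course):
+
+```python
+import torch
+
+x = torch.linspace(-5, 5, 5)   # [-5.0, -2.5, 0.0, 2.5, 5.0]
+
+print(torch.sigmoid(x))        # all outputs squeezed into (0, 1)
+print(torch.tanh(x))           # all outputs squeezed into (-1, 1), centered at 0
+```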
+ +## The Formula + +**tanh(x) = (eหฃ - eโปหฃ) / (eหฃ + eโปหฃ)** + +Or equivalently: **tanh(x) = 2ยทฯƒ(2x) - 1** (scaled and shifted sigmoid) + +![Tanh Graph](/content/learn/activation-functions/tanh/tanh-graph.png) + +```yaml +Input โ†’ -โˆž โ†’ Output โ†’ -1 +Input = 0 โ†’ Output = 0 +Input โ†’ +โˆž โ†’ Output โ†’ +1 + +Key property: Output is always in (-1, 1) +Zero-centered! (unlike sigmoid) +``` + +## How It Works + +**Example:** + +```python +import torch +import torch.nn as nn + +# Create tanh activation +tanh = nn.Tanh() + +# Test with different values +x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0]) +output = tanh(x) + +print(output) +# tensor([-0.9640, -0.7616, 0.0000, 0.7616, 0.9640]) +``` + +**Manual calculation:** + +```yaml +Input: [-2.0, -1.0, 0.0, 1.0, 2.0] + โ†“ โ†“ โ†“ โ†“ โ†“ +Tanh: -0.96 -0.76 0.00 0.76 0.96 + โ†“ โ†“ โ†“ โ†“ โ†“ +Range: All values between -1 and 1 +``` + +## The Zero-Centered Advantage + +**This is tanh's superpower:** outputs are centered around zero! + +![Tanh vs Sigmoid](/content/learn/activation-functions/tanh/tanh-vs-sigmoid.png) + +**Example:** + +```python +import torch + +x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0]) + +# Tanh: zero-centered +tanh_out = torch.tanh(x) +print(tanh_out.mean()) +# tensor(0.0000) โ† Mean is zero! + +# Sigmoid: NOT zero-centered +sigmoid_out = torch.sigmoid(x) +print(sigmoid_out.mean()) +# tensor(0.5000) โ† Mean is 0.5 +``` + +**Why zero-centered is better:** + +```yaml +Zero-centered (tanh): + โœ“ Gradients can be positive or negative + โœ“ Faster convergence + โœ“ More stable training + โœ“ Better for hidden layers + +Not zero-centered (sigmoid): + โœ— All gradients have same sign + โœ— Slower learning + โœ— Zig-zag optimization path +``` + +## In Code (Simple Implementation) + +```python +import torch + +def tanh_manual(x): + """Manual tanh implementation""" + exp_x = torch.exp(x) + exp_neg_x = torch.exp(-x) + return (exp_x - exp_neg_x) / (exp_x + exp_neg_x) + +# Test it +x = torch.tensor([-1.0, 0.0, 1.0]) +output = tanh_manual(x) +print(output) +# tensor([-0.7616, 0.0000, 0.7616]) + +# Verify against PyTorch +print(torch.tanh(x)) +# tensor([-0.7616, 0.0000, 0.7616]) โ† Same! +``` + +## Using Tanh in PyTorch + +### Method 1: As a Layer + +```python +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(10, 20), + nn.Tanh(), # โ† Tanh activation + nn.Linear(20, 5), + nn.Tanh(), # โ† Another tanh + nn.Linear(5, 1) +) +``` + +### Method 2: As a Function + +```python +import torch +import torch.nn.functional as F + +x = torch.randn(5, 10) +output = F.tanh(x) # or torch.tanh(x) +``` + +## Practical Example: RNN/LSTM + +Tanh is commonly used in recurrent neural networks: + +```python +import torch +import torch.nn as nn + +class SimpleRNN(nn.Module): + def __init__(self, input_size, hidden_size): + super().__init__() + self.hidden_size = hidden_size + self.i2h = nn.Linear(input_size, hidden_size) + self.h2h = nn.Linear(hidden_size, hidden_size) + + def forward(self, x, hidden): + # Combine input and hidden state + combined = self.i2h(x) + self.h2h(hidden) + + # Apply tanh + new_hidden = torch.tanh(combined) # โ† Tanh here! + return new_hidden + +# Initialize +rnn = SimpleRNN(input_size=10, hidden_size=20) +x = torch.randn(5, 10) # 5 samples +h = torch.zeros(5, 20) # Initial hidden state + +# Forward pass +new_h = rnn(x, h) +print(new_h.shape) # torch.Size([5, 20]) +print(new_h.min(), new_h.max()) +# All values between -1 and 1! 
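+
+# A small extension (a sketch, not part of the original example): unroll the
+# same cell over a few timesteps - tanh keeps the hidden state bounded in
+# (-1, 1) no matter how many steps we take.
+seq = torch.randn(3, 5, 10)   # 3 timesteps, batch of 5, input size 10
+h_t = torch.zeros(5, 20)
+for t in range(seq.size(0)):
+    h_t = rnn(seq[t], h_t)
+print(h_t.min(), h_t.max())   # still between -1 and 1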
+``` + +## Tanh vs Sigmoid vs ReLU + +```yaml +Tanh: + โœ“ Zero-centered (best for hidden layers) + โœ“ Output range: [-1, 1] + โœ“ Smooth gradient + โœ— Vanishing gradient problem + โœ— Slower than ReLU (exponentials) + +Sigmoid: + โœ“ Output range: [0, 1] (probabilities) + โœ“ Smooth gradient + โœ— NOT zero-centered + โœ— Vanishing gradient problem + โœ— Slower than ReLU + +ReLU: + โœ“ Fast (no exponentials) + โœ“ No vanishing gradient for x > 0 + โœ“ Creates sparsity + โœ— NOT smooth at zero + โœ— Dying ReLU problem + โœ— NOT zero-centered +``` + +**When to use each:** + +```yaml +Hidden layers: + Modern: ReLU (fastest, works well) + Classical: Tanh (zero-centered) + Rarely: Sigmoid (not zero-centered) + +Output layer: + Binary classification: Sigmoid + Multi-class: Softmax + Regression: None (linear) + +RNN/LSTM: + Gates: Sigmoid + State update: Tanh +``` + +## The Vanishing Gradient Problem + +Like sigmoid, tanh suffers from vanishing gradients: + +```python +import torch + +# Large input +x = torch.tensor([5.0], requires_grad=True) +y = torch.tanh(x) +y.backward() + +print(f"Output: {y.item():.6f}") # 0.999909 +print(f"Gradient: {x.grad.item():.6f}") # 0.000181 +# Gradient is tiny! +``` + +**Why this happens:** + +```yaml +For large |x|: + Output saturates (near -1 or +1) + Gradient becomes very small + Learning slows down + +This is why ReLU replaced tanh in most modern networks! +``` + +## Relationship to Sigmoid + +Tanh is actually just a rescaled sigmoid: + +```python +import torch + +x = torch.tensor([0.5, 1.0, 1.5]) + +# Tanh +tanh_output = torch.tanh(x) + +# Same as scaled sigmoid +sigmoid_output = 2 * torch.sigmoid(2*x) - 1 + +print(tanh_output) +# tensor([0.4621, 0.7616, 0.9051]) + +print(sigmoid_output) +# tensor([0.4621, 0.7616, 0.9051]) + +# They're the same! +``` + +**Mathematical relationship:** + +```yaml +tanh(x) = 2ยทsigmoid(2x) - 1 + +Proof: + sigmoid(x) gives [0, 1] + 2ยทsigmoid(2x) gives [0, 2] + 2ยทsigmoid(2x) - 1 gives [-1, 1] โ† tanh range! +``` + +## Key Takeaways + +โœ“ **S-shaped curve:** Like sigmoid but zero-centered + +โœ“ **Output range:** Always between -1 and 1 + +โœ“ **Zero-centered:** Better than sigmoid for hidden layers + +โœ“ **Formula:** (eหฃ - eโปหฃ) / (eหฃ + eโปหฃ) + +โœ“ **Common in RNNs:** Used in LSTM/GRU cells + +โœ“ **Vanishing gradients:** Mostly replaced by ReLU in modern networks + +**Quick Reference:** + +```python +# Using tanh +import torch +import torch.nn as nn +import torch.nn.functional as F + +# Method 1: Module +tanh_layer = nn.Tanh() +output = tanh_layer(x) + +# Method 2: Functional +output = F.tanh(x) + +# Method 3: Direct +output = torch.tanh(x) + +# Method 4: Manual +output = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x)) +``` + +**Remember:** Tanh is zero-centered sigmoid. Use it for RNN states, but ReLU is faster for feedforward! 
๐ŸŽ‰
diff --git a/public/content/learn/activation-functions/tanh/tanh-graph.png b/public/content/learn/activation-functions/tanh/tanh-graph.png
new file mode 100644
index 0000000..e50256c
Binary files /dev/null and b/public/content/learn/activation-functions/tanh/tanh-graph.png differ
diff --git a/public/content/learn/activation-functions/tanh/tanh-vs-sigmoid.png b/public/content/learn/activation-functions/tanh/tanh-vs-sigmoid.png
new file mode 100644
index 0000000..ca6dc6c
Binary files /dev/null and b/public/content/learn/activation-functions/tanh/tanh-vs-sigmoid.png differ
diff --git a/public/content/learn/attention-mechanism/applying-attention-weights/applying-attention-weights-content.md b/public/content/learn/attention-mechanism/applying-attention-weights/applying-attention-weights-content.md
new file mode 100644
index 0000000..04ecf2c
--- /dev/null
+++ b/public/content/learn/attention-mechanism/applying-attention-weights/applying-attention-weights-content.md
@@ -0,0 +1,93 @@
+---
+hero:
+  title: "Applying Attention Weights"
+  subtitle: "Combining Values with Attention"
+  tags:
+    - "๐ŸŽฏ Attention"
+    - "โฑ๏ธ 8 min read"
+---
+
+After calculating attention weights, we use them to create a **weighted combination of values**!
+
+## The Final Step
+
+**Output = Attention_Weights ร— Values**
+
+```python
+import torch
+
+# Attention weights (from softmax)
+attn_weights = torch.tensor([[0.5, 0.3, 0.2],   # Position 0 attends to...
+                             [0.1, 0.7, 0.2],   # Position 1 attends to...
+                             [0.4, 0.3, 0.3]])  # Position 2 attends to...
+
+# Values (what information each position has)
+V = torch.tensor([[1.0, 2.0],   # Position 0 value
+                  [3.0, 4.0],   # Position 1 value
+                  [5.0, 6.0]])  # Position 2 value
+
+# Apply attention
+output = attn_weights @ V
+
+print(output)
+# tensor([[2.4000, 3.4000],
+#         [3.2000, 4.2000],
+#         [2.8000, 3.8000]])
+```
+
+**Manual calculation for position 0:**
+
+```yaml
+Position 0 output:
+  = 0.5 ร— [1.0, 2.0] + 0.3 ร— [3.0, 4.0] + 0.2 ร— [5.0, 6.0]
+  = [0.5, 1.0] + [0.9, 1.2] + [1.0, 1.2]
+  = [2.4, 3.4]
+
+This is a weighted average!
+```
+
+## Complete Attention
+
+```python
+import torch
+import torch.nn.functional as F
+
+def attention(Q, K, V):
+    """Complete attention mechanism"""
+    # 1. Compute scores
+    d_k = Q.size(-1)
+    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
+
+    # 2. Softmax to get weights
+    attn_weights = F.softmax(scores, dim=-1)
+
+    # 3. Apply to values
+    output = attn_weights @ V
+
+    return output, attn_weights
+
+# Test
+Q = torch.randn(1, 5, 64)
+K = torch.randn(1, 5, 64)
+V = torch.randn(1, 5, 64)
+
+output, weights = attention(Q, K, V)
+print(output.shape)  # torch.Size([1, 5, 64])
+```
+
+## Key Takeaways
+
+โœ“ **Final step:** Multiply attention weights by values
+
+โœ“ **Weighted average:** Combines information by relevance
+
+โœ“ **Output:** Context-aware representation
+
+**Quick Reference:**
+
+```python
+# Attention output
+output = attention_weights @ V
+```
+
+**Remember:** Attention weights select which values to use!
๐ŸŽ‰ diff --git a/public/content/learn/attention-mechanism/attention-in-code/attention-in-code-content.md b/public/content/learn/attention-mechanism/attention-in-code/attention-in-code-content.md new file mode 100644 index 0000000..0d312ed --- /dev/null +++ b/public/content/learn/attention-mechanism/attention-in-code/attention-in-code-content.md @@ -0,0 +1,96 @@ +--- +hero: + title: "Attention in Code" + subtitle: "Complete Attention Implementation" + tags: + - "๐ŸŽฏ Attention" + - "โฑ๏ธ 10 min read" +--- + +Here's the complete, production-ready attention implementation! + +## Full Implementation + +```python +import torch +import torch.nn as nn +import torch.nn.functional as F + +class ScaledDotProductAttention(nn.Module): + def __init__(self, dropout=0.1): + super().__init__() + self.dropout = nn.Dropout(dropout) + + def forward(self, Q, K, V, mask=None): + # Q, K, V: (batch, heads, seq_len, head_dim) + + d_k = Q.size(-1) + + # Compute attention scores + scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5) + + # Apply mask if provided + if mask is not None: + scores = scores.masked_fill(mask == 0, float('-inf')) + + # Softmax + attn_weights = F.softmax(scores, dim=-1) + attn_weights = self.dropout(attn_weights) + + # Apply to values + output = attn_weights @ V + + return output, attn_weights + +# Use it +attention = ScaledDotProductAttention() +Q = torch.randn(2, 8, 10, 64) # batch=2, heads=8, seq=10, dim=64 +K = torch.randn(2, 8, 10, 64) +V = torch.randn(2, 8, 10, 64) + +output, weights = attention(Q, K, V) +print(output.shape) # torch.Size([2, 8, 10, 64]) +``` + +## With Masking + +```python +# Create causal mask (for autoregressive models) +def create_causal_mask(seq_len): + mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) + return mask == 0 # True where we CAN attend + +mask = create_causal_mask(5) +print(mask) +# tensor([[ True, False, False, False, False], +# [ True, True, False, False, False], +# [ True, True, True, False, False], +# [ True, True, True, True, False], +# [ True, True, True, True, True]]) + +# Position 0 can only attend to position 0 +# Position 1 can attend to positions 0, 1 +# etc. +``` + +## PyTorch Implementation + +```python +# Using PyTorch's built-in +attention = nn.MultiheadAttention(embed_dim=512, num_heads=8) + +x = torch.randn(10, 32, 512) # (seq, batch, embed) +output, attn_weights = attention(x, x, x) + +print(output.shape) # torch.Size([10, 32, 512]) +``` + +## Key Takeaways + +โœ“ **Complete function:** Q, K, V โ†’ Output + +โœ“ **Masking:** Controls what can attend to what + +โœ“ **PyTorch built-in:** Use `nn.MultiheadAttention` + +**Remember:** Attention is just a few lines of code! 
๐ŸŽ‰ diff --git a/public/content/learn/attention-mechanism/calculating-attention-scores/attention-matrix.png b/public/content/learn/attention-mechanism/calculating-attention-scores/attention-matrix.png new file mode 100644 index 0000000..1b708b6 Binary files /dev/null and b/public/content/learn/attention-mechanism/calculating-attention-scores/attention-matrix.png differ diff --git a/public/content/learn/attention-mechanism/calculating-attention-scores/calculating-attention-scores-content.md b/public/content/learn/attention-mechanism/calculating-attention-scores/calculating-attention-scores-content.md new file mode 100644 index 0000000..11d565e --- /dev/null +++ b/public/content/learn/attention-mechanism/calculating-attention-scores/calculating-attention-scores-content.md @@ -0,0 +1,124 @@ +--- +hero: + title: "Calculating Attention Scores" + subtitle: "Computing Query-Key-Value Similarities" + tags: + - "๐ŸŽฏ Attention" + - "โฑ๏ธ 10 min read" +--- + +Attention scores measure **how much each position should attend to every other position**! + +![Attention Matrix](/content/learn/attention-mechanism/calculating-attention-scores/attention-matrix.png) + +## The Formula + +**Score = Q ร— Kแต€ / โˆšd** + +Where: +- Q = Query matrix +- K = Key matrix +- d = dimension size +- โˆšd = scaling factor + +```python +import torch +import torch.nn.functional as F + +# Query and Key +Q = torch.randn(1, 10, 64) # (batch, seq_len, dim) +K = torch.randn(1, 10, 64) + +# Compute scores +scores = Q @ K.transpose(-2, -1) # (1, 10, 10) +scores = scores / (64 ** 0.5) # Scale by โˆšd + +# Convert to probabilities +attn_weights = F.softmax(scores, dim=-1) + +print(attn_weights.shape) # torch.Size([1, 10, 10]) +print(attn_weights[0, 0].sum()) # tensor(1.0) โ† Sums to 1! +``` + +## Step-by-Step Example + +```python +import torch +import torch.nn.functional as F + +# Simple example: 3 positions, 4-dim embeddings +Q = torch.tensor([[1.0, 0.0, 1.0, 0.0], + [0.0, 1.0, 0.0, 1.0], + [1.0, 1.0, 0.0, 0.0]]) # (3, 4) + +K = torch.tensor([[1.0, 0.0, 1.0, 0.0], + [0.0, 1.0, 0.0, 1.0], + [0.5, 0.5, 0.5, 0.5]]) # (3, 4) + +# 1. Dot product +scores = Q @ K.T # (3, 3) +print("Raw scores:") +print(scores) + +# 2. Scale +d_k = 4 +scaled_scores = scores / (d_k ** 0.5) +print("\\nScaled scores:") +print(scaled_scores) + +# 3. Softmax +attn_weights = F.softmax(scaled_scores, dim=-1) +print("\\nAttention weights:") +print(attn_weights) +# Each row sums to 1! +``` + +## Why Scaling? + +```yaml +Without scaling (โˆšd): + Large dot products โ†’ large scores + Softmax saturates โ†’ gradients vanish + +With scaling: + Controlled scores + Stable softmax + Better gradients +``` + +## Attention Matrix + +```python +# The attention matrix shows who attends to whom +attn_matrix = torch.softmax(Q @ K.T / (d ** 0.5), dim=-1) + +print(attn_matrix) +# Pos 0 Pos 1 Pos 2 +# Pos 0 [[0.5, 0.2, 0.3], โ† Position 0 attends to all positions +# Pos 1 [0.1, 0.7, 0.2], โ† Position 1 mostly attends to itself +# Pos 2 [0.4, 0.3, 0.3]] โ† Position 2 attends evenly +``` + +## Key Takeaways + +โœ“ **Scores:** Measure similarity (dot product) + +โœ“ **Scaling:** Divide by โˆšd for stability + +โœ“ **Softmax:** Convert to probabilities + +โœ“ **Matrix:** Shows all attention connections + +**Quick Reference:** + +```python +# Compute attention scores +scores = Q @ K.transpose(-2, -1) +scores = scores / (d_k ** 0.5) +attn_weights = F.softmax(scores, dim=-1) + +# Apply to values +output = attn_weights @ V +``` + +**Remember:** Scores tell us where to pay attention! 
๐ŸŽ‰ diff --git a/public/content/learn/attention-mechanism/multi-head-attention/multi-head-attention-content.md b/public/content/learn/attention-mechanism/multi-head-attention/multi-head-attention-content.md new file mode 100644 index 0000000..ee3d9c9 --- /dev/null +++ b/public/content/learn/attention-mechanism/multi-head-attention/multi-head-attention-content.md @@ -0,0 +1,87 @@ +--- +hero: + title: "Multi-Head Attention" + subtitle: "Multiple Attention Mechanisms in Parallel" + tags: + - "๐ŸŽฏ Attention" + - "โฑ๏ธ 10 min read" +--- + +Multi-head attention runs **multiple attention mechanisms in parallel**, each focusing on different aspects! + +![Multi-Head Visual](/content/learn/attention-mechanism/multi-head-attention/multi-head-visual.png) + +## The Idea + +Instead of one attention: +- Run 8 (or more) attention heads in parallel +- Each head learns different patterns +- Concatenate and project outputs + +```python +import torch +import torch.nn as nn + +# Single-head attention +single_head = nn.MultiheadAttention(embed_dim=512, num_heads=1) + +# Multi-head attention (8 heads) +multi_head = nn.MultiheadAttention(embed_dim=512, num_heads=8) + +x = torch.randn(10, 32, 512) # (seq_len, batch, embed_dim) +output, attn_weights = multi_head(x, x, x) + +print(output.shape) # torch.Size([10, 32, 512]) +``` + +## Implementation + +```python +class MultiHeadAttention(nn.Module): + def __init__(self, embed_dim, num_heads): + super().__init__() + self.num_heads = num_heads + self.head_dim = embed_dim // num_heads + + self.q_linear = nn.Linear(embed_dim, embed_dim) + self.k_linear = nn.Linear(embed_dim, embed_dim) + self.v_linear = nn.Linear(embed_dim, embed_dim) + self.out_linear = nn.Linear(embed_dim, embed_dim) + + def forward(self, x): + batch_size, seq_len, embed_dim = x.size() + + # Project and split into heads + Q = self.q_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim) + K = self.k_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim) + V = self.v_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim) + + # Transpose for attention + Q = Q.transpose(1, 2) # (batch, heads, seq, head_dim) + K = K.transpose(1, 2) + V = V.transpose(1, 2) + + # Attention for each head + scores = Q @ K.transpose(-2, -1) / (self.head_dim ** 0.5) + attn = F.softmax(scores, dim=-1) + output = attn @ V + + # Concatenate heads + output = output.transpose(1, 2).contiguous() + output = output.view(batch_size, seq_len, embed_dim) + + # Final projection + output = self.out_linear(output) + + return output +``` + +## Key Takeaways + +โœ“ **Multiple heads:** Each learns different patterns + +โœ“ **Parallel:** All heads run simultaneously + +โœ“ **Standard:** 8 heads is common + +**Remember:** More heads = more ways to pay attention! 
๐ŸŽ‰ diff --git a/public/content/learn/attention-mechanism/multi-head-attention/multi-head-visual.png b/public/content/learn/attention-mechanism/multi-head-attention/multi-head-visual.png new file mode 100644 index 0000000..b0678a6 Binary files /dev/null and b/public/content/learn/attention-mechanism/multi-head-attention/multi-head-visual.png differ diff --git a/public/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-concept.png b/public/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-concept.png new file mode 100644 index 0000000..e76075a Binary files /dev/null and b/public/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-concept.png differ diff --git a/public/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-from-scratch-content.md b/public/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-from-scratch-content.md new file mode 100644 index 0000000..eecf67f --- /dev/null +++ b/public/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-from-scratch-content.md @@ -0,0 +1,99 @@ +--- +hero: + title: "Self Attention from Scratch" + subtitle: "Building Self-Attention from the Ground Up" + tags: + - "๐ŸŽฏ Attention" + - "โฑ๏ธ 10 min read" +--- + +Let's build self-attention from scratch - the core of transformers! + +![Self-Attention Concept](/content/learn/attention-mechanism/self-attention-from-scratch/self-attention-concept.png) + +## Complete Implementation + +```python +import torch +import torch.nn as nn +import torch.nn.functional as F + +class SelfAttention(nn.Module): + def __init__(self, embed_dim): + super().__init__() + self.embed_dim = embed_dim + + # Linear projections for Q, K, V + self.query = nn.Linear(embed_dim, embed_dim) + self.key = nn.Linear(embed_dim, embed_dim) + self.value = nn.Linear(embed_dim, embed_dim) + + def forward(self, x): + # x: (batch, seq_len, embed_dim) + + # Project to Q, K, V + Q = self.query(x) + K = self.key(x) + V = self.value(x) + + # Compute attention scores + scores = Q @ K.transpose(-2, -1) + scores = scores / (self.embed_dim ** 0.5) + + # Softmax + attn_weights = F.softmax(scores, dim=-1) + + # Apply to values + output = attn_weights @ V + + return output + +# Test +attention = SelfAttention(embed_dim=64) +x = torch.randn(2, 10, 64) # Batch=2, seq=10, dim=64 +output = attention(x) +print(output.shape) # torch.Size([2, 10, 64]) +``` + +## Step-by-Step Example + +```python +import torch +import torch.nn.functional as F + +# Input: 3 words, 4-dim embeddings +x = torch.tensor([[1.0, 0.0, 1.0, 0.0], + [0.0, 1.0, 0.0, 1.0], + [1.0, 1.0, 0.0, 0.0]]) + +# Create Q, K, V projections +W_q = torch.randn(4, 4) +W_k = torch.randn(4, 4) +W_v = torch.randn(4, 4) + +# Compute Q, K, V +Q = x @ W_q +K = x @ W_k +V = x @ W_v + +# Attention scores +scores = Q @ K.T / (4 ** 0.5) +attn_weights = F.softmax(scores, dim=-1) + +# Output +output = attn_weights @ V + +print(output.shape) # torch.Size([3, 4]) +``` + +## Key Takeaways + +โœ“ **Self-attention:** Sequence attends to itself + +โœ“ **Q, K, V:** All come from same input + +โœ“ **Complete implementation:** ~20 lines of code + +โœ“ **Foundation:** Core of transformers + +**Remember:** Self-attention is simpler than it looks! 
๐ŸŽ‰ diff --git a/public/content/learn/attention-mechanism/what-is-attention/attention-concept.png b/public/content/learn/attention-mechanism/what-is-attention/attention-concept.png new file mode 100644 index 0000000..914b992 Binary files /dev/null and b/public/content/learn/attention-mechanism/what-is-attention/attention-concept.png differ diff --git a/public/content/learn/attention-mechanism/what-is-attention/qkv-mechanism.png b/public/content/learn/attention-mechanism/what-is-attention/qkv-mechanism.png new file mode 100644 index 0000000..16998a7 Binary files /dev/null and b/public/content/learn/attention-mechanism/what-is-attention/qkv-mechanism.png differ diff --git a/public/content/learn/attention-mechanism/what-is-attention/what-is-attention-content.md b/public/content/learn/attention-mechanism/what-is-attention/what-is-attention-content.md new file mode 100644 index 0000000..aced9c9 --- /dev/null +++ b/public/content/learn/attention-mechanism/what-is-attention/what-is-attention-content.md @@ -0,0 +1,197 @@ +--- +hero: + title: "What is Attention" + subtitle: "Understanding the Attention Mechanism" + tags: + - "๐ŸŽฏ Attention" + - "โฑ๏ธ 10 min read" +--- + +Attention lets the model **focus on relevant parts** of the input, just like how you focus on important words when reading! + +![Attention Concept](/content/learn/attention-mechanism/what-is-attention/attention-concept.png) + +## The Core Idea + +**Attention = Weighted average based on relevance** + +Instead of treating all inputs equally, attention: +1. Calculates how relevant each input is +2. Weights inputs by relevance +3. Combines them into output + +```yaml +Without attention: + All words matter equally + "The cat sat on the mat" + โ†’ All words get same weight + +With attention: + Important words matter more + "The CAT sat on the mat" + โ†’ "cat" gets higher weight +``` + +## Simple Example + +```python +import torch +import torch.nn.functional as F + +# Input sequence (3 words, each 4-dim embedding) +sequence = torch.tensor([[0.1, 0.2, 0.3, 0.4], # word 1 + [0.5, 0.6, 0.7, 0.8], # word 2 + [0.9, 1.0, 1.1, 1.2]]) # word 3 + +# Attention scores (how important each word is) +attention_weights = torch.tensor([0.1, 0.3, 0.6]) # word 3 most important + +# Weighted average +output = torch.zeros(4) +for i, weight in enumerate(attention_weights): + output += weight * sequence[i] + +print(output) +# Mostly influenced by word 3 (weight 0.6) +``` + +## Query, Key, Value + +![QKV Mechanism](/content/learn/attention-mechanism/what-is-attention/qkv-mechanism.png) + +Attention uses three concepts: + +```yaml +Query (Q): "What am I looking for?" +Key (K): "What do I contain?" +Value (V): "What information do I have?" + +Process: +1. Compare Query with all Keys โ†’ scores +2. Convert scores to weights (softmax) +3. Weighted sum of Values +``` + +**Example:** + +```python +import torch +import torch.nn.functional as F + +# Query: what we're looking for +query = torch.tensor([1.0, 0.0, 1.0]) + +# Keys: what each position contains +keys = torch.tensor([[1.0, 0.0, 1.0], # Similar to query! + [0.0, 1.0, 0.0], # Different + [1.0, 0.0, 0.8]]) # Somewhat similar + +# Values: actual information +values = torch.tensor([[10.0, 20.0], + [30.0, 40.0], + [50.0, 60.0]]) + +# 1. Compute attention scores (dot product) +scores = keys @ query +print("Scores:", scores) +# tensor([2.0000, 0.0000, 1.8000]) + +# 2. Convert to probabilities +weights = F.softmax(scores, dim=0) +print("Weights:", weights) +# tensor([0.5308, 0.0874, 0.3818]) + +# 3. 
Weighted sum of values +output = torch.zeros(2) +for i, weight in enumerate(weights): + output += weight * values[i] + +print("Output:", output) +# Mostly from value 0 (weight 0.53) +``` + +## Why Attention is Powerful + +```yaml +Before attention (RNNs): + Process sequence left-to-right + Hard to remember distant info + Slow (sequential) + +With attention (Transformers): + Look at ALL positions at once + Direct connections everywhere + Fast (parallel) + +Result: Better at long sequences! +``` + +## Self-Attention + +**Self-attention: Sequence attends to itself** + +```python +# Sentence: "The cat sat" +# Each word attends to all words + +"The" attends to: The(0.3), cat(0.2), sat(0.5) +"cat" attends to: The(0.4), cat(0.4), sat(0.2) +"sat" attends to: The(0.1), cat(0.6), sat(0.3) + +# Each word builds context from others! +``` + +## Basic Implementation + +```python +import torch +import torch.nn as nn +import torch.nn.functional as F + +class SimpleAttention(nn.Module): + def __init__(self, embed_dim): + super().__init__() + self.query = nn.Linear(embed_dim, embed_dim) + self.key = nn.Linear(embed_dim, embed_dim) + self.value = nn.Linear(embed_dim, embed_dim) + + def forward(self, x): + # x shape: (batch, seq_len, embed_dim) + + # Compute Q, K, V + Q = self.query(x) + K = self.key(x) + V = self.value(x) + + # Attention scores + scores = Q @ K.transpose(-2, -1) + scores = scores / (Q.size(-1) ** 0.5) # Scale + + # Attention weights + attn_weights = F.softmax(scores, dim=-1) + + # Weighted values + output = attn_weights @ V + + return output + +# Test +attention = SimpleAttention(embed_dim=64) +x = torch.randn(1, 10, 64) # Batch=1, seq_len=10, dim=64 +output = attention(x) +print(output.shape) # torch.Size([1, 10, 64]) +``` + +## Key Takeaways + +โœ“ **Attention:** Weighted average by relevance + +โœ“ **Q, K, V:** Query, Key, Value mechanism + +โœ“ **Self-attention:** Sequence attends to itself + +โœ“ **Parallel:** Processes all positions at once + +โœ“ **Transformers:** Built entirely on attention + +**Remember:** Attention lets models focus on what matters! ๐ŸŽ‰ diff --git a/public/content/learn/building-a-transformer/building-a-transformer-block/block-diagram.png b/public/content/learn/building-a-transformer/building-a-transformer-block/block-diagram.png new file mode 100644 index 0000000..b2c13c5 Binary files /dev/null and b/public/content/learn/building-a-transformer/building-a-transformer-block/block-diagram.png differ diff --git a/public/content/learn/building-a-transformer/building-a-transformer-block/building-a-transformer-block-content.md b/public/content/learn/building-a-transformer/building-a-transformer-block/building-a-transformer-block-content.md new file mode 100644 index 0000000..cf2350b --- /dev/null +++ b/public/content/learn/building-a-transformer/building-a-transformer-block/building-a-transformer-block-content.md @@ -0,0 +1,142 @@ +--- +hero: + title: "Building a Transformer Block" + subtitle: "Creating the Core Transformer Component" + tags: + - "๐Ÿค– Transformers" + - "โฑ๏ธ 10 min read" +--- + +A transformer block is the **repeatable unit** that makes transformers work! + +![Block Diagram](/content/learn/building-a-transformer/building-a-transformer-block/block-diagram.png) + +## The Structure + +**Transformer Block = Attention + FFN + Normalization + Residuals** + +```python +import torch +import torch.nn as nn + +class TransformerBlock(nn.Module): + def __init__(self, d_model, n_heads, d_ff, dropout=0.1): + super().__init__() + + # 1. 
Multi-head attention + self.attention = nn.MultiheadAttention( + embed_dim=d_model, + num_heads=n_heads, + dropout=dropout, + batch_first=True + ) + + # 2. Feed-forward network + self.ffn = nn.Sequential( + nn.Linear(d_model, d_ff), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(d_ff, d_model), + nn.Dropout(dropout) + ) + + # 3. Layer normalization + self.norm1 = nn.LayerNorm(d_model) + self.norm2 = nn.LayerNorm(d_model) + + # 4. Dropout + self.dropout = nn.Dropout(dropout) + + def forward(self, x, mask=None): + # Attention sub-block + attn_out, _ = self.attention(x, x, x, attn_mask=mask) + x = self.norm1(x + self.dropout(attn_out)) # Residual + Norm + + # FFN sub-block + ffn_out = self.ffn(x) + x = self.norm2(x + ffn_out) # Residual + Norm + + return x + +# Create and test +block = TransformerBlock(d_model=512, n_heads=8, d_ff=2048) +x = torch.randn(32, 10, 512) # (batch, seq, embed) +output = block(x) +print(output.shape) # torch.Size([32, 10, 512]) +``` + +## The Flow + +```yaml +Input + โ†“ +Multi-Head Attention + โ†“ +Add & Norm (residual connection) + โ†“ +Feed-Forward Network + โ†“ +Add & Norm (residual connection) + โ†“ +Output (same shape as input!) +``` + +## Residual Connections + +**Why residual connections matter:** + +```python +# Without residual +output = layer(x) + +# With residual +output = x + layer(x) # Add input back! + +# This helps gradients flow during backprop +``` + +## Stacking Blocks + +```python +class Transformer(nn.Module): + def __init__(self, vocab_size, d_model=512, n_heads=8, + n_layers=6, d_ff=2048): + super().__init__() + + self.embedding = nn.Embedding(vocab_size, d_model) + + # Stack N transformer blocks + self.blocks = nn.ModuleList([ + TransformerBlock(d_model, n_heads, d_ff) + for _ in range(n_layers) + ]) + + self.ln_f = nn.LayerNorm(d_model) + self.head = nn.Linear(d_model, vocab_size) + + def forward(self, x): + x = self.embedding(x) + + # Pass through all blocks + for block in self.blocks: + x = block(x) + + x = self.ln_f(x) + logits = self.head(x) + + return logits + +model = Transformer(vocab_size=50000, n_layers=12) +``` + +## Key Takeaways + +โœ“ **Core component:** Attention + FFN + Norm + Residuals + +โœ“ **Repeatable:** Stack many blocks + +โœ“ **Same shape:** Input and output dimensions match + +โœ“ **Self-contained:** Each block is independent + +**Remember:** Transformers are just stacked blocks! ๐ŸŽ‰ diff --git a/public/content/learn/building-a-transformer/full-transformer-in-code/full-transformer-in-code-content.md b/public/content/learn/building-a-transformer/full-transformer-in-code/full-transformer-in-code-content.md new file mode 100644 index 0000000..2f7e9b2 --- /dev/null +++ b/public/content/learn/building-a-transformer/full-transformer-in-code/full-transformer-in-code-content.md @@ -0,0 +1,143 @@ +--- +hero: + title: "Full Transformer in Code" + subtitle: "Complete Implementation from Scratch" + tags: + - "๐Ÿค– Transformers" + - "โฑ๏ธ 15 min read" +--- + +Let's build a complete, working transformer from scratch! 
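+
+Before diving into the full code, it helps to see the core computation every block below is built around: scaled dot-product attention as a standalone helper. This is a minimal sketch; the `MultiHeadAttention` class in the full implementation inlines the same math.
+
+```python
+import math
+import torch
+import torch.nn.functional as F
+
+def scaled_dot_product_attention(Q, K, V, mask=None):
+    # Scores measure how much each query position should look at each key position
+    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
+    if mask is not None:
+        scores = scores.masked_fill(mask == 0, float('-inf'))
+    weights = F.softmax(scores, dim=-1)   # each row sums to 1
+    return weights @ V                    # weighted mix of the values
+
+Q = K = V = torch.randn(2, 8, 10, 64)     # (batch, heads, seq, head_dim)
+print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([2, 8, 10, 64])
+```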
+ +## Complete Implementation + +```python +import torch +import torch.nn as nn +import torch.nn.functional as F +import math + +class MultiHeadAttention(nn.Module): + def __init__(self, d_model, n_heads): + super().__init__() + self.d_model = d_model + self.n_heads = n_heads + self.head_dim = d_model // n_heads + + self.q_linear = nn.Linear(d_model, d_model) + self.k_linear = nn.Linear(d_model, d_model) + self.v_linear = nn.Linear(d_model, d_model) + self.out_linear = nn.Linear(d_model, d_model) + + def forward(self, x, mask=None): + batch_size, seq_len, d_model = x.size() + + # Project and split into heads + Q = self.q_linear(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2) + K = self.k_linear(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2) + V = self.v_linear(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2) + + # Attention + scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim) + if mask is not None: + scores = scores.masked_fill(mask == 0, float('-inf')) + + attn = F.softmax(scores, dim=-1) + output = attn @ V + + # Concatenate heads + output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model) + output = self.out_linear(output) + + return output + +class FeedForward(nn.Module): + def __init__(self, d_model, d_ff, dropout=0.1): + super().__init__() + self.net = nn.Sequential( + nn.Linear(d_model, d_ff), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(d_ff, d_model), + nn.Dropout(dropout) + ) + + def forward(self, x): + return self.net(x) + +class TransformerBlock(nn.Module): + def __init__(self, d_model, n_heads, d_ff, dropout=0.1): + super().__init__() + self.attention = MultiHeadAttention(d_model, n_heads) + self.ffn = FeedForward(d_model, d_ff, dropout) + self.norm1 = nn.LayerNorm(d_model) + self.norm2 = nn.LayerNorm(d_model) + self.dropout = nn.Dropout(dropout) + + def forward(self, x, mask=None): + # Attention + attn_out = self.attention(x, mask) + x = self.norm1(x + self.dropout(attn_out)) + + # FFN + ffn_out = self.ffn(x) + x = self.norm2(x + self.dropout(ffn_out)) + + return x + +class Transformer(nn.Module): + def __init__(self, vocab_size, d_model=512, n_heads=8, + n_layers=6, d_ff=2048, max_seq_len=512, dropout=0.1): + super().__init__() + + # Embeddings + self.token_emb = nn.Embedding(vocab_size, d_model) + self.pos_emb = nn.Embedding(max_seq_len, d_model) + self.dropout = nn.Dropout(dropout) + + # Transformer blocks + self.blocks = nn.ModuleList([ + TransformerBlock(d_model, n_heads, d_ff, dropout) + for _ in range(n_layers) + ]) + + # Output + self.ln_f = nn.LayerNorm(d_model) + self.head = nn.Linear(d_model, vocab_size) + + def forward(self, x): + batch, seq_len = x.size() + + # Embeddings + positions = torch.arange(seq_len, device=x.device).unsqueeze(0) + x = self.token_emb(x) + self.pos_emb(positions) + x = self.dropout(x) + + # Transformer blocks + for block in self.blocks: + x = block(x) + + # Output + x = self.ln_f(x) + logits = self.head(x) + + return logits + +# Create GPT-style model +model = Transformer(vocab_size=50000, n_layers=12, d_model=768) + +# Test +tokens = torch.randint(0, 50000, (2, 64)) +logits = model(tokens) +print(logits.shape) # torch.Size([2, 64, 50000]) +``` + +## Key Takeaways + +โœ“ **Complete:** All components together + +โœ“ **Production-ready:** Real implementation + +โœ“ **Flexible:** Easy to modify + +**Remember:** You just built a transformer! 
๐ŸŽ‰
diff --git a/public/content/learn/building-a-transformer/rope-positional-encoding/rope-positional-encoding-content.md b/public/content/learn/building-a-transformer/rope-positional-encoding/rope-positional-encoding-content.md
new file mode 100644
index 0000000..df6b22c
--- /dev/null
+++ b/public/content/learn/building-a-transformer/rope-positional-encoding/rope-positional-encoding-content.md
@@ -0,0 +1,84 @@
+---
+hero:
+  title: "RoPE Positional Encoding"
+  subtitle: "Rotary Position Embeddings"
+  tags:
+    - "๐Ÿค– Transformers"
+    - "โฑ๏ธ 10 min read"
+---
+
+RoPE (Rotary Position Embedding) is a modern way to encode position information in transformers!
+
+## The Problem
+
+Transformers don't know word order without position information!
+
+```yaml
+"Dog bites man" vs "Man bites dog"
+โ†’ Without positions, looks the same to transformer!
+
+Need to add position information!
+```
+
+## How RoPE Works
+
+```python
+import torch
+import torch.nn as nn
+
+class RotaryPositionalEmbedding(nn.Module):
+    def __init__(self, dim, max_seq_len=2048):
+        super().__init__()
+        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
+        self.register_buffer('inv_freq', inv_freq)
+
+    def forward(self, x):
+        seq_len = x.size(1)
+        t = torch.arange(seq_len, device=x.device).type_as(self.inv_freq)
+        freqs = torch.outer(t, self.inv_freq)
+        emb = torch.cat((freqs, freqs), dim=-1)
+
+        cos_emb = emb.cos()
+        sin_emb = emb.sin()
+
+        return cos_emb, sin_emb
+
+def apply_rope(x, cos, sin):
+    """Apply rotary embeddings (rotate-half form)"""
+    # Split the feature dim in half to match how cos/sin were built above
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    # Rotate pairs: (x1, x2) -> (-x2, x1), then mix with cos/sin
+    rotated_half = torch.cat((-x2, x1), dim=-1)
+    return x * cos + rotated_half * sin
+
+# Use it
+rope = RotaryPositionalEmbedding(dim=64)
+x = torch.randn(1, 10, 64)
+cos, sin = rope(x)
+x_with_pos = apply_rope(x, cos, sin)
+```
+
+## Why RoPE is Better
+
+```yaml
+Old way (learned embeddings):
+  - Fixed max sequence length
+  - Doesn't generalize to longer sequences
+
+RoPE:
+  โœ“ Works for any sequence length
+  โœ“ Relative positions encoded
+  โœ“ Better extrapolation
+  โœ“ Used in LLaMA, GPT-NeoX
+```
+
+## Key Takeaways
+
+โœ“ **Rotary:** Encodes position via rotation
+
+โœ“ **Relative:** Captures relative positions
+
+โœ“ **Modern:** Used in latest LLMs
+
+**Remember:** RoPE is the modern way to handle positions! ๐ŸŽ‰
diff --git a/public/content/learn/building-a-transformer/the-final-linear-layer/the-final-linear-layer-content.md b/public/content/learn/building-a-transformer/the-final-linear-layer/the-final-linear-layer-content.md
new file mode 100644
index 0000000..3b9e08d
--- /dev/null
+++ b/public/content/learn/building-a-transformer/the-final-linear-layer/the-final-linear-layer-content.md
@@ -0,0 +1,61 @@
+---
+hero:
+  title: "The Final Linear Layer"
+  subtitle: "From Hidden States to Predictions"
+  tags:
+    - "๐Ÿค– Transformers"
+    - "โฑ๏ธ 8 min read"
+---
+
+The final linear layer projects transformer outputs to vocabulary logits for prediction!
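+
+One common refinement, used in GPT-2 and many other LLMs (though not required by the code below), is **weight tying**: the output projection reuses the token-embedding matrix instead of learning a separate `(vocab_size, d_model)` weight. A minimal sketch, with layer names chosen here just for illustration:
+
+```python
+import torch.nn as nn
+
+d_model, vocab_size = 768, 50000
+token_emb = nn.Embedding(vocab_size, d_model)           # (vocab_size, d_model)
+lm_head = nn.Linear(d_model, vocab_size, bias=False)    # weight: (vocab_size, d_model)
+
+# Tie the weights: both layers now share one matrix, saving ~38M parameters
+# at this size; in practice this often helps language-model quality too.
+lm_head.weight = token_emb.weight
+```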
+ +## Language Model Head + +```python +import torch +import torch.nn as nn + +class LMHead(nn.Module): + def __init__(self, d_model, vocab_size): + super().__init__() + self.ln = nn.LayerNorm(d_model) + self.linear = nn.Linear(d_model, vocab_size, bias=False) + + def forward(self, x): + x = self.ln(x) + logits = self.linear(x) + return logits + +# Use it +lm_head = LMHead(d_model=768, vocab_size=50000) +hidden_states = torch.randn(32, 128, 768) # (batch, seq, dim) +logits = lm_head(hidden_states) + +print(logits.shape) # torch.Size([32, 128, 50000]) +# For each position: 50000 logits (one per vocab token) +``` + +## Complete Forward Pass + +```python +# Input tokens โ†’ Embeddings โ†’ Transformer โ†’ LM Head โ†’ Logits + +input_ids = torch.randint(0, 50000, (1, 10)) +embeddings = embedding_layer(input_ids) +hidden_states = transformer_blocks(embeddings) +logits = lm_head(hidden_states) + +# Get next token prediction +next_token_logits = logits[:, -1, :] # Last position +next_token = torch.argmax(next_token_logits, dim=-1) +``` + +## Key Takeaways + +โœ“ **Final layer:** Hidden states โ†’ vocabulary logits + +โœ“ **Large:** Often biggest layer (vocab_size is huge) + +โœ“ **Shared weights:** Often tied with embedding matrix + +**Remember:** Final layer converts understanding to predictions! ๐ŸŽ‰ diff --git a/public/content/learn/building-a-transformer/training-a-transformer/training-a-transformer-content.md b/public/content/learn/building-a-transformer/training-a-transformer/training-a-transformer-content.md new file mode 100644 index 0000000..2eab24e --- /dev/null +++ b/public/content/learn/building-a-transformer/training-a-transformer/training-a-transformer-content.md @@ -0,0 +1,83 @@ +--- +hero: + title: "Training a Transformer" + subtitle: "How to Train Language Models" + tags: + - "๐Ÿค– Transformers" + - "โฑ๏ธ 10 min read" +--- + +Training transformers involves next-token prediction and lots of data! + +## The Training Objective + +**Goal: Predict the next token given previous tokens** + +```python +import torch +import torch.nn as nn + +# Training data +input_tokens = torch.tensor([[1, 2, 3, 4]]) # Input +target_tokens = torch.tensor([[2, 3, 4, 5]]) # Targets (shifted by 1) + +# Model forward +logits = model(input_tokens) # (1, 4, vocab_size) + +# Loss: Cross entropy +criterion = nn.CrossEntropyLoss() +loss = criterion( + logits.view(-1, vocab_size), # Flatten + target_tokens.view(-1) # Flatten +) +``` + +## Complete Training Loop + +```python +import torch +import torch.optim as optim + +def train_step(model, batch, optimizer, criterion): + # Get input and target (shifted) + input_ids = batch[:, :-1] + targets = batch[:, 1:] + + # Forward + logits = model(input_ids) + + # Loss + loss = criterion( + logits.reshape(-1, logits.size(-1)), + targets.reshape(-1) + ) + + # Backward + optimizer.zero_grad() + loss.backward() + + # Update + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step() + + return loss.item() + +# Training +model = Transformer(vocab_size=50000) +optimizer = optim.AdamW(model.parameters(), lr=3e-4) +criterion = nn.CrossEntropyLoss() + +for epoch in range(num_epochs): + for batch in dataloader: + loss = train_step(model, batch, optimizer, criterion) +``` + +## Key Takeaways + +โœ“ **Next-token prediction:** Core training task + +โœ“ **Shift targets:** Input[:-1] โ†’ Target[1:] + +โœ“ **Cross entropy:** Standard loss for LMs + +**Remember:** Training is just next-token prediction! 
๐ŸŽ‰ diff --git a/public/content/learn/building-a-transformer/transformer-architecture/transformer-architecture-content.md b/public/content/learn/building-a-transformer/transformer-architecture/transformer-architecture-content.md new file mode 100644 index 0000000..c907fdf --- /dev/null +++ b/public/content/learn/building-a-transformer/transformer-architecture/transformer-architecture-content.md @@ -0,0 +1,147 @@ +--- +hero: + title: "Transformer Architecture" + subtitle: "Understanding the Transformer Model" + tags: + - "๐Ÿค– Transformers" + - "โฑ๏ธ 12 min read" +--- + +The Transformer is the architecture behind GPT, BERT, and modern LLMs. It's built entirely on attention! + +![Transformer Diagram](/content/learn/building-a-transformer/transformer-architecture/transformer-diagram.png) + +## The Big Picture + +**Transformer = Encoder + Decoder (or just one)** + +```yaml +Input Text + โ†“ +Embedding + Positional Encoding + โ†“ +N ร— Transformer Blocks: + - Multi-Head Attention + - Feed-Forward Network + - Layer Normalization + - Residual Connections + โ†“ +Output Logits +``` + +## Basic Transformer Block + +```python +import torch +import torch.nn as nn + +class TransformerBlock(nn.Module): + def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1): + super().__init__() + + # Multi-head attention + self.attention = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout) + + # Feedforward network + self.ff = nn.Sequential( + nn.Linear(embed_dim, ff_dim), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(ff_dim, embed_dim), + nn.Dropout(dropout) + ) + + # Layer normalization + self.norm1 = nn.LayerNorm(embed_dim) + self.norm2 = nn.LayerNorm(embed_dim) + + def forward(self, x): + # Attention block + attn_out, _ = self.attention(x, x, x) + x = self.norm1(x + attn_out) # Residual connection + + # Feedforward block + ff_out = self.ff(x) + x = self.norm2(x + ff_out) # Residual connection + + return x + +# Test +block = TransformerBlock(embed_dim=512, num_heads=8, ff_dim=2048) +x = torch.randn(10, 32, 512) # (seq, batch, embed) +output = block(x) +print(output.shape) # torch.Size([10, 32, 512]) +``` + +## Complete Transformer + +```python +class Transformer(nn.Module): + def __init__(self, vocab_size, embed_dim=512, num_heads=8, + num_layers=6, ff_dim=2048, max_seq_len=5000): + super().__init__() + + # Embeddings + self.token_embedding = nn.Embedding(vocab_size, embed_dim) + self.pos_embedding = nn.Embedding(max_seq_len, embed_dim) + + # Transformer blocks + self.blocks = nn.ModuleList([ + TransformerBlock(embed_dim, num_heads, ff_dim) + for _ in range(num_layers) + ]) + + # Output layer + self.ln_f = nn.LayerNorm(embed_dim) + self.head = nn.Linear(embed_dim, vocab_size, bias=False) + + def forward(self, x): + batch, seq_len = x.size() + + # Token + position embeddings + positions = torch.arange(seq_len, device=x.device).unsqueeze(0) + x = self.token_embedding(x) + self.pos_embedding(positions) + + # Apply transformer blocks + for block in self.blocks: + x = block(x.transpose(0, 1)).transpose(0, 1) + + # Output projection + x = self.ln_f(x) + logits = self.head(x) + + return logits + +# Create transformer +model = Transformer(vocab_size=50000, num_layers=12) +``` + +## Key Components + +```yaml +1. Embeddings: + - Token embeddings (vocabulary) + - Positional embeddings (position info) + +2. Transformer Blocks (repeated N times): + - Multi-head attention + - Feedforward network + - Layer normalization + - Residual connections + +3. 
Output: + - Final layer norm + - Linear projection to vocabulary +``` + +## Key Takeaways + +โœ“ **Self-attention based:** No recurrence, no convolution + +โœ“ **Parallel:** Processes entire sequence at once + +โœ“ **Scalable:** Stack more blocks for more capacity + +โœ“ **Powerful:** Powers GPT, BERT, LLaMA + +**Remember:** Transformers are just stacked attention blocks! ๐ŸŽ‰ diff --git a/public/content/learn/building-a-transformer/transformer-architecture/transformer-diagram.png b/public/content/learn/building-a-transformer/transformer-architecture/transformer-diagram.png new file mode 100644 index 0000000..7d52a53 Binary files /dev/null and b/public/content/learn/building-a-transformer/transformer-architecture/transformer-diagram.png differ diff --git a/public/content/learn/math/derivatives/derivative-graph.png b/public/content/learn/math/derivatives/derivative-graph.png new file mode 100644 index 0000000..e69de29 diff --git a/public/content/learn/math/derivatives/derivatives-content.md b/public/content/learn/math/derivatives/derivatives-content.md new file mode 100644 index 0000000..6e1b3eb --- /dev/null +++ b/public/content/learn/math/derivatives/derivatives-content.md @@ -0,0 +1,627 @@ +--- +hero: + title: "Understanding Derivatives" + subtitle: "The Foundation of Neural Network Training" + tags: + - "๐Ÿ“ Mathematics" + - "โฑ๏ธ 10 min read" +--- + +**[video coming soon]** + +## What are Derivatives? + +A **derivative** measures how a function changes as its input changes. + +### Intuitive Understanding + +Think of driving a car: + + + + + +- Your position is a function of time: `position(t)` + +- Your speed is the derivative of position: `speed = d(position)/dt` + +- Speed tells you how fast your position is changing + +If `x` goes from 3 to 4, does `f(x)`, that is `y`, change fast, eg. 6 to 40 or slower, eg. 6 to 7 + +**Derivative tells us the instantaneous rate of change of a function at any point.** + +### Mathematical Definition + +The derivative of `f(x)` at point `x` is: + +``` +f'(x) = lim[hโ†’0] (f(x+h) - f(x)) / h +``` + +### Visual Representation + + + +Here we have linearly growing function. + +Derivative is always 3 for any `x` value, which means that in the original function, growth of `y` at any point is 3x (if you increase `x` by 1, `y` will increase by 3, check it). + +![Linear Function Derivative](/content/learn/math/derivatives/linear-function-derivative.png) + +Here you can see that as `y` grows faster and faster in original function (square functions grow very fast). + +Derivative shows this accelerating growth, you can notice that derivative is increasing (linearly) - which means the growth is accelerating. + +![Quadratic Function Derivative](/content/learn/math/derivatives/quadratic-function-derivative.png) + +In previous example derivative was always 3, which meant that function is always consistantly growing by 3 times `x`. + +Here, on the other hand, the growth is growing. + +## Common Derivative Rules + +You will never calculate derivatives manually, but researcher needs to understand how it works. + +### 1. Power Rule + +If `f(x) = xโฟ`, then `f'(x) = nxโฟโปยน` + +So just put the exponent in front of the variable (or multiply with the number in front) and reduce exponent by 1. 
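
If you ever want to sanity-check a derivative rule, a quick numerical approximation works well. Below is a minimal sketch (the cubic and the sample points are just illustrative assumptions) that compares a finite-difference estimate against what the power rule predicts; the worked examples follow.

```python
# Numerical sanity check of the power rule: f(x) = x^3  =>  f'(x) = 3x^2
def f(x):
    return x**3

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [1.0, 2.0, 3.0]:
    approx = numerical_derivative(f, x)
    exact = 3 * x**2  # power rule result
    print(f"x={x}: numerical = {approx:.4f}, power rule = {exact:.4f}")
# Both columns should agree to several decimal places.
```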
+ +For `f(x) = xยณ`, derivative is `f'(x) = 3xยฒ` + +For `f(x) = 4xยณ`, derivative is `f'(x) = 4*3xยฒ = 12xยฒ` + +#### Step-by-Step Examples + +**Example 1:** `f(x) = xยฒ` + + + + + +- Using power rule: `f'(x) = 2x^(2-1) = 2xยน = 2x` + +- Verification: `f'(x) = 2x` + +**Example 2:** `f(x) = xยณ` + +- Using power rule: `f'(x) = 3x^(3-1) = 3xยฒ` + +- Verification: `f'(x) = 3xยฒ` + +**Example 3:** `f(x) = xโด` + +- Using power rule: `f'(x) = 4x^(4-1) = 4xยณ` + +- Verification: `f'(x) = 4xยณ` + +**Example 4:** `f(x) = โˆšx = x^(1/2)` + +- Using power rule: `f'(x) = (1/2)x^((1/2)-1) = (1/2)x^(-1/2) = 1/(2โˆšx)` + +- Verification: `f'(x) = 1/(2โˆšx)` + +**Example 5:** `f(x) = 1/x = x^(-1)` + +- Using power rule: `f'(x) = (-1)x^(-1-1) = (-1)x^(-2) = -1/xยฒ` + +- Verification: `f'(x) = -1/xยฒ` + + + +### 2. Constant Multiple Rule + +If `f(x) = cยทg(x)`, then `f'(x) = cยทg'(x)` + +#### Step-by-Step Examples + +**Example:** `f(x) = 5xยฒ` + +**Step 1:** Identify the constant and the function + +- Constant: `c = 5` + +- Function: `g(x) = xยฒ` + +**Step 2:** Find `g'(x)` + +- `g'(x) = 2x` (using power rule) + +**Step 3:** Apply constant multiple rule + +- `f'(x) = cยทg'(x) = 5ยท(2x) = 10x` - I showed this in the power rule as well. + +**Verification:** + +- `f(x) = 5xยฒ` + +- `f'(x) = 10x` โœ“ + +**Example:** `f(x) = -3xยณ` + +**Step 1:** Identify the constant and the function + +- Constant: `c = -3` + + + +- Function: `g(x) = xยณ` + +**Step 2:** Find `g'(x)` + +- `g'(x) = 3xยฒ` (using power rule) + +**Step 3:** Apply constant multiple rule + +- `f'(x) = cยทg'(x) = (-3)ยท(3xยฒ) = -9xยฒ` + +**Verification:** + +- `f(x) = -3xยณ` + +- `f'(x) = -9xยฒ` โœ“ + + + +### 3. Sum Rule + +If `f(x) = g(x) + h(x)`, then `f'(x) = g'(x) + h'(x)` + +#### Step-by-Step Examples + +**Example:** `f(x) = xยฒ + 3x` + +**Step 1:** Identify the functions + +- `g(x) = xยฒ` + +- `h(x) = 3x` + +**Step 2:** Find individual derivatives + +- `g'(x) = 2x` (power rule) + +- `h'(x) = 3` (constant multiple rule: 3ยท1 = 3) + +**Step 3:** Apply sum rule + +- `f'(x) = g'(x) + h'(x) = 2x + 3` + +**Verification:** + +- `f(x) = xยฒ + 3x` + +- `f'(x) = 2x + 3` โœ“ + +**Example:** `f(x) = xยณ + 2xยฒ + 5x + 1` + +**Step 1:** Identify the functions + +- `g(x) = xยณ` + +- `h(x) = 2xยฒ` + +- `i(x) = 5x` + +- `j(x) = 1` + +**Step 2:** Find individual derivatives + +- `g'(x) = 3xยฒ` (power rule) + +- `h'(x) = 4x` (constant multiple rule: 2ยท2x = 4x) + +- `i'(x) = 5` (constant multiple rule: 5ยท1 = 5) + +- `j'(x) = 0` (constant rule) + +**Step 3:** Apply sum rule + +- `f'(x) = g'(x) + h'(x) + i'(x) + j'(x) = 3xยฒ + 4x + 5 + 0 = 3xยฒ + 4x + 5` + +**Verification:** + +- `f(x) = xยณ + 2xยฒ + 5x + 1` + +- `f'(x) = 3xยฒ + 4x + 5` โœ“ + + + +### 4. 
Product Rule + +If `f(x) = g(x)ยทh(x)`, then `f'(x) = g'(x)ยทh(x) + g(x)ยทh'(x)` + +#### Step-by-Step Examples + +**Example:** `f(x) = xยฒ(x + 1)` + +**Step 1:** Identify the functions + +- `g(x) = xยฒ` + +- `h(x) = x + 1` + +**Step 2:** Find individual derivatives + +- `g'(x) = 2x` (power rule) + +- `h'(x) = 1` (sum rule: derivative of x is 1, derivative of 1 is 0) + +**Step 3:** Apply product rule + +- `f'(x) = g'(x)ยทh(x) + g(x)ยทh'(x)` + +- `f'(x) = (2x)ยท(x + 1) + (xยฒ)ยท(1)` + +- `f'(x) = 2x(x + 1) + xยฒ` + + + +- `f'(x) = 2xยฒ + 2x + xยฒ` + +- `f'(x) = 3xยฒ + 2x` + +**Verification by expanding first:** + +- `f(x) = xยฒ(x + 1) = xยณ + xยฒ` + +- `f'(x) = 3xยฒ + 2x` โœ“ + +**Example:** `f(x) = (2x + 3)(xยฒ - 1)` + +**Step 1:** Identify the functions + +- `g(x) = 2x + 3` + +- `h(x) = xยฒ - 1` + +**Step 2:** Find individual derivatives + +- `g'(x) = 2` (sum rule: derivative of 2x is 2, derivative of 3 is 0) + +- `h'(x) = 2x` (sum rule: derivative of xยฒ is 2x, derivative of -1 is 0) + +**Step 3:** Apply product rule + +- `f'(x) = g'(x)ยทh(x) + g(x)ยทh'(x)` + +- `f'(x) = (2)ยท(xยฒ - 1) + (2x + 3)ยท(2x)` + +- `f'(x) = 2(xยฒ - 1) + (2x + 3)(2x)` + +- `f'(x) = 2xยฒ - 2 + 4xยฒ + 6x` + +- `f'(x) = 6xยฒ + 6x - 2` + + + +### 5. Chain Rule + +If `f(x) = g(h(x))`, then `f'(x) = g'(h(x))ยทh'(x)` + +#### Step-by-Step Examples + +**Example:** `f(x) = (xยฒ + 1)ยณ` + +**Step 1:** Identify the inner and outer functions + +- Inner function: `h(x) = xยฒ + 1` + +- Outer function: `g(u) = uยณ` (where `u = h(x)`) + +**Step 2:** Find individual derivatives + +- `h'(x) = 2x` (sum rule: derivative of xยฒ is 2x, derivative of 1 is 0) + +- `g'(u) = 3uยฒ` (power rule) + +**Step 3:** Apply chain rule + +- `f'(x) = g'(h(x))ยทh'(x)` + +- `f'(x) = 3(h(x))ยฒยท(2x)` + +- `f'(x) = 3(xยฒ + 1)ยฒยท(2x)` + +- `f'(x) = 6x(xยฒ + 1)ยฒ` + +**Verification by expanding first:** + +- `f(x) = (xยฒ + 1)ยณ = (xยฒ + 1)(xยฒ + 1)(xยฒ + 1)` + +- Expanding: `f(x) = xโถ + 3xโด + 3xยฒ + 1` + +- `f'(x) = 6xโต + 12xยณ + 6x = 6x(xโด + 2xยฒ + 1) = 6x(xยฒ + 1)ยฒ` โœ“ + +**Example:** `f(x) = โˆš(xยฒ + 4)` + +**Step 1:** Identify the inner and outer functions + +- Inner function: `h(x) = xยฒ + 4` + +- Outer function: `g(u) = โˆšu = u^(1/2)` (where `u = h(x)`) + +**Step 2:** Find individual derivatives + +- `h'(x) = 2x` (sum rule: derivative of xยฒ is 2x, derivative of 4 is 0) + +- `g'(u) = (1/2)u^(-1/2) = 1/(2โˆšu)` (power rule) + +**Step 3:** Apply chain rule + + + + + +- `f'(x) = g'(h(x))ยทh'(x)` + +- `f'(x) = (1/(2โˆš(xยฒ + 4)))ยท(2x)` + +- `f'(x) = 2x/(2โˆš(xยฒ + 4))` + +- `f'(x) = x/โˆš(xยฒ + 4)` + +--- + +## Derivatives of Neural Network Functions + +### 1. Sigmoid Function + +![Sigmoid Formula](/content/learn/math/derivatives/sigmoid-formula.png) + +``` +f(x) = 1 / (1 + e^(-x)) +``` + +#### Step-by-Step Derivative Calculation + +To find the derivative of sigmoid, we'll use the quotient rule and chain rule. + +Usually you will ChatGPT sigmoid derivative, but let's see how it's derived. 
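
Before the algebra, here is a quick numerical check (a minimal sketch, with a few arbitrary sample points) that the two forms the derivation ends with, `e^(-x) / (1 + e^(-x))²` and `f(x)(1 - f(x))`, really are the same function:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Evaluate both forms of the sigmoid derivative at a few points.
for x in [-2.0, 0.0, 2.0]:
    form_a = math.exp(-x) / (1 + math.exp(-x))**2  # e^(-x) / (1 + e^(-x))^2
    form_b = sigmoid(x) * (1 - sigmoid(x))         # f(x)(1 - f(x))
    print(f"x={x}: {form_a:.6f} vs {form_b:.6f}")   # identical values
```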
+ +**Step 1:** Rewrite the function + +- `f(x) = 1 / (1 + e^(-x))` + +- Let `u = 1 + e^(-x)`, so `f(x) = 1/u` + +**Step 2:** Apply quotient rule + +- `f'(x) = (0ยทu - 1ยทu') / uยฒ = -u' / uยฒ` + +**Step 3:** Find `u'` using chain rule + +- `u = 1 + e^(-x)` + +- `u' = 0 + e^(-x) ยท (-1) = -e^(-x)` + +**Step 4:** Substitute back + +- `f'(x) = -(-e^(-x)) / (1 + e^(-x))ยฒ` + +- `f'(x) = e^(-x) / (1 + e^(-x))ยฒ` + +**Step 5:** Simplify + +- `f'(x) = e^(-x) / (1 + e^(-x))ยฒ` + +- `f'(x) = [e^(-x) / (1 + e^(-x))] ยท [1 / (1 + e^(-x))]` + +- `f'(x) = [1 / (1 + e^(-x))] ยท [e^(-x) / (1 + e^(-x))]` + +- `f'(x) = f(x) ยท [e^(-x) / (1 + e^(-x))]` + +**Step 6:** Further simplification + +- Notice that `e^(-x) / (1 + e^(-x)) = 1 - 1/(1 + e^(-x)) = 1 - f(x)` + +- Therefore: `f'(x) = f(x) ยท (1 - f(x))` + +**Final Result:** `f'(x) = f(x)(1 - f(x))` + +--- + +## Chain Rule + +Chain rule is how neural networks learn (backpropagation). + +### Mathematical Statement + +If `y = f(g(x))`, then `dy/dx = (dy/dg) ร— (dg/dx)` + +### Neural Network Application + +In neural networks, we often have functions like: `f(x) = activation(linear_transformation(x))` + +### Step-by-Step Chain Rule Example + +**Example:** Neural Network Layer with Sigmoid Activation + +**Given:** + +- Linear transformation: `z = 2x + 1` + +- Activation function: `ฯƒ(z) = 1/(1 + e^(-z))` + +- Composite function: `f(x) = ฯƒ(2x + 1)` + +**Step 1:** Identify inner and outer functions + +- Inner function: `h(x) = 2x + 1` + +- Outer function: `g(z) = ฯƒ(z) = 1/(1 + e^(-z))` + +**Step 2:** Find individual derivatives + +- `h'(x) = 2` (derivative of 2x + 1) + +- `g'(z) = ฯƒ(z)(1 - ฯƒ(z))` (sigmoid derivative) + +**Step 3:** Apply chain rule + +- `f'(x) = g'(h(x)) ยท h'(x)` + +- `f'(x) = ฯƒ(2x + 1)(1 - ฯƒ(2x + 1)) ยท 2` + +- `f'(x) = 2ฯƒ(2x + 1)(1 - ฯƒ(2x + 1))` + +**Step 4:** Calculate at specific point `(x = 1)` + +**Step 4a:** Calculate `h(1)` + +- `h(1) = 2(1) + 1 = 3` + +**Step 4b:** Calculate `ฯƒ(3)` + + + + + +- `ฯƒ(3) = 1/(1 + e^(-3)) = 1/(1 + 0.050) = 1/1.050 โ‰ˆ 0.953` + +**Step 4c:** Calculate `ฯƒ'(3)` + +- `ฯƒ'(3) = ฯƒ(3)(1 - ฯƒ(3)) = 0.953(1 - 0.953) = 0.953(0.047) โ‰ˆ 0.045` + +**Step 4d:** Apply chain rule + +- `f'(1) = ฯƒ'(3) ยท h'(1) = 0.045 ยท 2 = 0.090` + +**Final Answer:** `f'(1) โ‰ˆ 0.090` + +--- + +## Partial Derivatives + +When we have functions of multiple variables, we use **partial derivatives**. 
+ +### Definition + +For `f(x, y)`, the partial derivative with respect to `x` is: +``` +โˆ‚f/โˆ‚x = lim[hโ†’0] (f(x+h, y) - f(x, y)) / h +``` + +### Example: Linear Function + +`f(x, y) = 2x + 3y + 1` + +#### Step-by-Step Partial Derivative Calculation + +**Finding โˆ‚f/โˆ‚x (partial derivative with respect to x):** + +**Step 1:** Treat `y` as a constant + +- `f(x, y) = 2x + 3y + 1` + +- When taking โˆ‚f/โˆ‚x, we treat `y` as constant, so `3y + 1` is constant + +**Step 2:** Differentiate with respect to `x` + +- `โˆ‚f/โˆ‚x = โˆ‚/โˆ‚x(2x) + โˆ‚/โˆ‚x(3y) + โˆ‚/โˆ‚x(1)` + +- `โˆ‚f/โˆ‚x = 2 + 0 + 0 = 2` + +**Finding โˆ‚f/โˆ‚y (partial derivative with respect to y):** + +**Step 1:** Treat `x` as a constant + +- `f(x, y) = 2x + 3y + 1` + +- When taking โˆ‚f/โˆ‚y, we treat `x` as constant, so `2x + 1` is constant + +**Step 2:** Differentiate with respect to `y` + +- `โˆ‚f/โˆ‚y = โˆ‚/โˆ‚y(2x) + โˆ‚/โˆ‚y(3y) + โˆ‚/โˆ‚y(1)` + +- `โˆ‚f/โˆ‚y = 0 + 3 + 0 = 3` + +**Final Results:** + +- `โˆ‚f/โˆ‚x = 2` + +- `โˆ‚f/โˆ‚y = 3` + +#### Hand Calculation Examples + +**Example:** Find partial derivatives at `(x, y) = (1, 2)` + +- `โˆ‚f/โˆ‚x = 2` (constant, doesn't depend on x or y) + +- `โˆ‚f/โˆ‚y = 3` (constant, doesn't depend on x or y) + +**Example:** Find partial derivatives at `(x, y) = (5, -1)` + +- `โˆ‚f/โˆ‚x = 2` (still constant) + +- `โˆ‚f/โˆ‚y = 3` (still constant) + + + +### Example: Quadratic Function + +`f(x, y) = xยฒ + 2xy + yยฒ` + +#### Step-by-Step Partial Derivative Calculation + +**Finding โˆ‚f/โˆ‚x (partial derivative with respect to x):** + +**Step 1:** Treat `y` as a constant + +- `f(x, y) = xยฒ + 2xy + yยฒ` + +- When taking โˆ‚f/โˆ‚x, we treat `y` as constant + +**Step 2:** Differentiate with respect to `x` + +- `โˆ‚f/โˆ‚x = โˆ‚/โˆ‚x(xยฒ) + โˆ‚/โˆ‚x(2xy) + โˆ‚/โˆ‚x(yยฒ)` + +- `โˆ‚f/โˆ‚x = 2x + 2y + 0 = 2x + 2y` + +**Finding โˆ‚f/โˆ‚y (partial derivative with respect to y):** + +**Step 1:** Treat `x` as a constant + +- `f(x, y) = xยฒ + 2xy + yยฒ` + +- When taking โˆ‚f/โˆ‚y, we treat `x` as constant + +**Step 2:** Differentiate with respect to `y` + +- `โˆ‚f/โˆ‚y = โˆ‚/โˆ‚y(xยฒ) + โˆ‚/โˆ‚y(2xy) + โˆ‚/โˆ‚y(yยฒ)` + +- `โˆ‚f/โˆ‚y = 0 + 2x + 2y = 2x + 2y` + +**Final Results:** + +- `โˆ‚f/โˆ‚x = 2x + 2y` + +- `โˆ‚f/โˆ‚y = 2x + 2y` + +#### Hand Calculation Examples + +**Example:** Find partial derivatives at `(x, y) = (1, 2)` + +**Step 1:** Calculate โˆ‚f/โˆ‚x + +- `โˆ‚f/โˆ‚x = 2(1) + 2(2) = 2 + 4 = 6` + +**Step 2:** Calculate โˆ‚f/โˆ‚y + +- `โˆ‚f/โˆ‚y = 2(1) + 2(2) = 2 + 4 = 6` + +**Example:** Find partial derivatives at `(x, y) = (3, -1)` + +**Step 1:** Calculate โˆ‚f/โˆ‚x + +- `โˆ‚f/โˆ‚x = 2(3) + 2(-1) = 6 - 2 = 4` + +**Step 2:** Calculate โˆ‚f/โˆ‚y + + + + + +- `โˆ‚f/โˆ‚y = 2(3) + 2(-1) = 6 - 2 = 4` \ No newline at end of file diff --git a/public/content/learn/math/derivatives/linear-function-derivative.png b/public/content/learn/math/derivatives/linear-function-derivative.png new file mode 100644 index 0000000..313cdce Binary files /dev/null and b/public/content/learn/math/derivatives/linear-function-derivative.png differ diff --git a/public/content/learn/math/derivatives/quadratic-function-derivative.png b/public/content/learn/math/derivatives/quadratic-function-derivative.png new file mode 100644 index 0000000..4e76795 Binary files /dev/null and b/public/content/learn/math/derivatives/quadratic-function-derivative.png differ diff --git a/public/content/learn/math/derivatives/sigmoid-formula.png b/public/content/learn/math/derivatives/sigmoid-formula.png new file mode 100644 index 
0000000..7c44d2b Binary files /dev/null and b/public/content/learn/math/derivatives/sigmoid-formula.png differ diff --git a/public/content/learn/math/functions/cubic-quartic-functions.png b/public/content/learn/math/functions/cubic-quartic-functions.png new file mode 100644 index 0000000..84dd6e1 Binary files /dev/null and b/public/content/learn/math/functions/cubic-quartic-functions.png differ diff --git a/public/content/learn/math/functions/exponential-functions-log-scale.png b/public/content/learn/math/functions/exponential-functions-log-scale.png new file mode 100644 index 0000000..96000da Binary files /dev/null and b/public/content/learn/math/functions/exponential-functions-log-scale.png differ diff --git a/public/content/learn/math/functions/exponential-functions.png b/public/content/learn/math/functions/exponential-functions.png new file mode 100644 index 0000000..54257fa Binary files /dev/null and b/public/content/learn/math/functions/exponential-functions.png differ diff --git a/public/content/learn/math/functions/functions-content.md b/public/content/learn/math/functions/functions-content.md new file mode 100644 index 0000000..4faf670 --- /dev/null +++ b/public/content/learn/math/functions/functions-content.md @@ -0,0 +1,416 @@ +--- +hero: + title: "Mathematical Functions" + subtitle: "Building Blocks of Neural Networks" + tags: + - "๐Ÿ“ Mathematics" + - "โฑ๏ธ 12 min read" +--- + +Functions are the foundation of neural networks. + +## What is a Function? + +In simple terms, function is like a machine that takes something in and gives something back out. More formally, a **function** is a mathematical relationship that **maps inputs to outputs**. + + + +## Simple Examples + +### Example 1: Linear Function f(x) = 2x + 3 + +This is a function that takes any number x and returns 2x + 3. + +![Linear Function](/content/learn/math/functions/linear-function.png) + +Let's calculate f(x) for different values step by step: + +For x = 1: + + + + + +f(1) = 2(1) + 3 = 2 + 3 = 5 + + + +Don't confuse `f(1)` and `2(1)`. `f(1)` means passing 1 into function f, and `2(1)` mean `2*1`. + +For x = 0: + + + + + +f(0) = 2(0) + 3 = 0 + 3 = 3 + +For x = -1: + + + + + +f(-1) = 2(-1) + 3 = -2 + 3 = 1 + +Now image a function that takes in "Cat sat on a" and returns "mat" - that function would be a lot more difficult to create, but neural networks (LLMs) can learn it. + +### Example 2: Quadratic Function f(x) = xยฒ + 2x + 1 + +![Quadratic Function](/content/learn/math/functions/quadratic-function.png) + +Let's calculate f(x) for different values step by step: + +For x = 2: + + + + + +f(2) = (2)ยฒ + 2(2) + 1 = 4 + 4 + 1 = 9 + +For x = 0: + + + + + +f(0) = (0)ยฒ + 2(0) + 1 = 0 + 0 + 1 = 1 + +For x = -1: + + + + + +f(-1) = (-1)ยฒ + 2(-1) + 1 = 1 - 2 + 1 = 0 + +## Mathematical Definition of a Function + +A function **f: A โ†’ B** maps every element in set A to **exactly one** element in set B. + +Previous quadratic function will always give 9 if x=2 and nothing else. + +## Notation + + + + + +**f(x) = y** (read as "f of x equals y") + + + +**x** is the input (independent variable) + + + +**y** is the output (dependent variable) - it depends on x + +## Code Examples + +Our 2 functions coded in python, if you are unfamiliar with python you can skip the code, next module will focus on python. 
+ +```python +# Linear function: f(x) = 2x + 3 +def linear_function(x): + return 2 * x + 3 + +# Test the function +print(f"f(1) = {linear_function(1)}") # Output: f(1) = 5 +print(f"f(0) = {linear_function(0)}") # Output: f(0) = 3 +print(f"f(-1) = {linear_function(-1)}") # Output: f(-1) = 1 + +# Quadratic function: f(x) = xยฒ + 2x + 1 +def quadratic_function(x): + return x**2 + 2*x + 1 + +# Test the function +print(f"f(2) = {quadratic_function(2)}") # Output: f(2) = 9 +print(f"f(0) = {quadratic_function(0)}") # Output: f(0) = 1 +print(f"f(-1) = {quadratic_function(-1)}") # Output: f(-1) = 0 +``` + +## Types of Functions + +### 1. Linear Functions + +Linear functions have the form: **f(x) = mx + b** + +Where: + + + + + +**m** is the slope (how steep the line is) + + + +**b** is the y-intercept (where the line crosses the y-axis) + +Let's draw it + +![Linear Functions Comparison](/content/learn/math/functions/linear-functions-comparison.png) + +Blue line: 2x + 1 + + + + + +2 is the slope, meaning that if you move by 1 on x axis, y will go up by 2 + + + +y or f(x) - it's the same + + + +1 is the value on y coordinate where the blue line will cross it (y-intercept), at x=0 - see it for yourself, blue line should pass through x=0 and y=1 spot + +### 2. Polynomial Functions + +Functions with powers of x: **f(x) = aโ‚™xโฟ + aโ‚™โ‚‹โ‚xโฟโปยน + ... + aโ‚x + aโ‚€** + +**Hand Calculation Examples** + +**Example: f(x) = xยณ - 3xยฒ + 2x + 1** + +Let's calculate f(x) for different values step by step: + +For x = 1: + + + + + +f(1) = (1)ยณ - 3(1)ยฒ + 2(1) + 1 + + + +f(1) = 1 - 3(1) + 2 + 1 + + + +f(1) = 1 - 3 + 2 + 1 + + + +f(1) = 1 + +For x = 2: + + + + + +f(2) = (2)ยณ - 3(2)ยฒ + 2(2) + 1 + + + +f(2) = 8 - 3(4) + 4 + 1 + + + +f(2) = 8 - 12 + 4 + 1 + + + +f(2) = 1 + +For x = 0: + + + + + +f(0) = (0)ยณ - 3(0)ยฒ + 2(0) + 1 + + + +f(0) = 0 - 0 + 0 + 1 + + + +f(0) = 1 + +**Example: f(x) = xโด - 4xยฒ + 3** + +Let's calculate f(x) for different values step by step: + +For x = 1: + + + + + +f(1) = (1)โด - 4(1)ยฒ + 3 + + + +f(1) = 1 - 4(1) + 3 + + + +f(1) = 1 - 4 + 3 + + + +f(1) = 0 + +For x = 2: + + + + + +f(2) = (2)โด - 4(2)ยฒ + 3 + + + +f(2) = 16 - 4(4) + 3 + + + +f(2) = 16 - 16 + 3 + + + +f(2) = 3 + +For x = 0: + + + + + +f(0) = (0)โด - 4(0)ยฒ + 3 + + + +f(0) = 0 - 0 + 3 + + + +f(0) = 3 + +```python +# Polynomial function examples +def cubic_function(x): + return x**3 - 3*x**2 + 2*x + 1 + +def quartic_function(x): + return x**4 - 4*x**2 + 3 +``` + +![Cubic and Quartic Functions](/content/learn/math/functions/cubic-quartic-functions.png) + +Just look at it - it seems interesting, no need to master it yet. + +### 3. Exponential Functions + +Functions with constant base raised to variable power: **f(x) = aหฃ** + +```python +# Exponential function examples +def exponential_function(x): + return 2**x + +def exponential_e(x): + return np.exp(x) +``` + +![Exponential Functions](/content/learn/math/functions/exponential-functions.png) + +Careful! The y axis is exponential. + +If we make it linear, it looks like this: + +![Exponential Functions Linear Scale](/content/learn/math/functions/exponential-functions-log-scale.png) + + + + + +### 4. 
Trigonometric Functions + +Functions based on angles and periodic behavior + +```python +# Trigonometric function examples +def sine_function(x): + return np.sin(x) + +def cosine_function(x): + return np.cos(x) +``` + +![Trigonometric Functions](/content/learn/math/functions/trigonometric-functions.png) + +This is used in Rotory Positional Embeddings (RoPE) - LLM is using it to know the order of words (tokens) in the text. + + + + + + + +Functions are using in neural networks a lot: forward propagation, backward propagation, attention, activation functions, gradients, and many more. + +You don't need to learn them yet, just check them out. + +### 1. Sigmoid Function + +![Sigmoid Formula](/content/learn/math/functions/sigmoid-formula.png) + +**e** is a famous constant (Euler's number) used in math everywhere, it's value is approximately 2.718 + +**f(x) = 1 / (1 + e^(-x))** + +```python +def sigmoid(x): + return 1 / (1 + np.exp(-x)) + +def sigmoid_derivative(x): + s = sigmoid(x) + return s * (1 - s) +``` + +![Sigmoid Function and Derivative](/content/learn/math/functions/sigmoid-function-derivative.png) + +We will learn derivativers in the next lesson, but I included the images here - derivative tells you how fast the function is changing - you see that when sigmoid function is growing fastest (in the middle), the derivative value is spiking. + +Just look at the slope of the function, if it's big (changing fast), the derivative will be big. + +### 2. ReLU (Rectified Linear Unit) + +**f(x) = max(0, x)** + +```python +def relu(x): + return np.maximum(0, x) + +def relu_derivative(x): + return (x > 0).astype(float) +``` + +![ReLU Function and Derivative](/content/learn/math/functions/relu-function-derivative.png) + +### 3. Tanh Function + +![Tanh Formula](/content/learn/math/functions/tanh-formula.png) + +**f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))** + +```python +def tanh(x): + return np.tanh(x) + +def tanh_derivative(x): + return 1 - np.tanh(x)**2 +``` + +![Tanh Function and Derivative](/content/learn/math/functions/tanh-function-derivative.png) + +**Congratulations! 
You finished functions for neural networks lesson!** \ No newline at end of file diff --git a/public/content/learn/math/functions/linear-function.png b/public/content/learn/math/functions/linear-function.png new file mode 100644 index 0000000..5f3f96e Binary files /dev/null and b/public/content/learn/math/functions/linear-function.png differ diff --git a/public/content/learn/math/functions/linear-functions-comparison.png b/public/content/learn/math/functions/linear-functions-comparison.png new file mode 100644 index 0000000..ff85273 Binary files /dev/null and b/public/content/learn/math/functions/linear-functions-comparison.png differ diff --git a/public/content/learn/math/functions/quadratic-function.png b/public/content/learn/math/functions/quadratic-function.png new file mode 100644 index 0000000..044ca30 Binary files /dev/null and b/public/content/learn/math/functions/quadratic-function.png differ diff --git a/public/content/learn/math/functions/relu-function-derivative.png b/public/content/learn/math/functions/relu-function-derivative.png new file mode 100644 index 0000000..50aaf68 Binary files /dev/null and b/public/content/learn/math/functions/relu-function-derivative.png differ diff --git a/public/content/learn/math/functions/sigmoid-formula.png b/public/content/learn/math/functions/sigmoid-formula.png new file mode 100644 index 0000000..49e818c Binary files /dev/null and b/public/content/learn/math/functions/sigmoid-formula.png differ diff --git a/public/content/learn/math/functions/sigmoid-function-derivative.png b/public/content/learn/math/functions/sigmoid-function-derivative.png new file mode 100644 index 0000000..ef46024 Binary files /dev/null and b/public/content/learn/math/functions/sigmoid-function-derivative.png differ diff --git a/public/content/learn/math/functions/tanh-formula.png b/public/content/learn/math/functions/tanh-formula.png new file mode 100644 index 0000000..3deb79a Binary files /dev/null and b/public/content/learn/math/functions/tanh-formula.png differ diff --git a/public/content/learn/math/functions/tanh-function-derivative.png b/public/content/learn/math/functions/tanh-function-derivative.png new file mode 100644 index 0000000..93818db Binary files /dev/null and b/public/content/learn/math/functions/tanh-function-derivative.png differ diff --git a/public/content/learn/math/functions/trigonometric-functions.png b/public/content/learn/math/functions/trigonometric-functions.png new file mode 100644 index 0000000..d071d33 Binary files /dev/null and b/public/content/learn/math/functions/trigonometric-functions.png differ diff --git a/public/content/learn/math/gradients/derivatives-tangent-lines.png b/public/content/learn/math/gradients/derivatives-tangent-lines.png new file mode 100644 index 0000000..b314af1 Binary files /dev/null and b/public/content/learn/math/gradients/derivatives-tangent-lines.png differ diff --git a/public/content/learn/math/gradients/gradient-surface-plot.png b/public/content/learn/math/gradients/gradient-surface-plot.png new file mode 100644 index 0000000..172f20e Binary files /dev/null and b/public/content/learn/math/gradients/gradient-surface-plot.png differ diff --git a/public/content/learn/math/gradients/gradients-content.md b/public/content/learn/math/gradients/gradients-content.md new file mode 100644 index 0000000..f0efb7b --- /dev/null +++ b/public/content/learn/math/gradients/gradients-content.md @@ -0,0 +1,166 @@ +--- +hero: + title: "Gradients" + subtitle: "How Neural Networks Learn Through Gradient Descent" + tags: + - 
"๐Ÿ“ Mathematics" + - "โฑ๏ธ 14 min read" +--- + +**[video coming soon]** + +Welcome! This guide will walk you through the concept of gradients. We'll start with the familiar idea of a derivative and build up to understanding how gradients make neural networks learn. + +**Prerequisites:** Check out previous 3 lessons: Functions, Derivatives & Vectors + +--- + +## Step 1: From Line Slope (Derivative) To Surface Slope (Gradient) + +Let's start with what you know. For a simple function like `f(x) = xยฒ`, the derivative `f'(x) = 2x` gives you the slope of the curve at any point `x`. So for `x=3`, derivative is `2*3=6`. That means as you increase `x` but a tiny bit, `f(x) = xยฒ` will increase by 6. + +At `x=4`, derivative is `2*4=8`, so at that point `f(x) = xยฒ` is increasing by 8x. + + + + + +Notice that I say "if you increase x by a bit, `f(x) = xยฒ` will increase by 6" and I don't say "if you increase x by 1", because increasing x by 1 (from 3 to 4 in this case) is a lot and by that point derivative (rate of change) will go from 6 to 8. + +On this image you can see that the red slope at `x=3` is smaller than thes green slope at `x=4`. + +![Derivatives with Tangent Lines](/content/learn/math/gradients/derivatives-tangent-lines.png) + +In this case, if you increase `x=3` by 1, derivative will go from 6 to 8. So that's why we say "if you increase `x=3` by a tiny bit, `f(x) = xยฒ` will increase by 6". + +But what if our function has multiple inputs, like `f(x, y) = xยฒ + yยฒ`? + + + + + +This function doesn't describe a line; it describes a 3D surface, like a bowl landscape. If you're standing at any point `(x, y)` on this surface, what is "the" slope? + +![Gradient Surface Plot](/content/learn/math/gradients/gradient-surface-plot.png) + +There isn't just one. There's a slope if you take a step in the x-direction, a different slope if you step in the y-direction, and another for every other direction in between. + +To handle this, we use **partial derivatives**. + +- **Partial Derivative with respect to x (โˆ‚f/โˆ‚x):** This is the slope if you only move in the x-direction. You treat y as a constant. For `f(x, y) = xยฒ + yยฒ`, the partial derivative `โˆ‚f/โˆ‚x = 2x` - remember the rule for a constant that stands alone, constants become 0 in the derivative, and since we treat y as a constant, `+ yยฒ` will ecome `+ 0`. + +- **Partial Derivative with respect to y (โˆ‚f/โˆ‚y):** This is the slope if you only move in the y-direction. You treat x as a constant. For `f(x, y) = xยฒ + yยฒ`, the partial derivative `โˆ‚f/โˆ‚y = 2y`. + +Now we have two slopes, one for each axis. The **gradient** is simply a way to package all these partial derivatives together. + +**Definition:** The gradient is a vector that contains all the partial derivatives of a function. It's denoted by `โˆ‡f` (pronounced "nabla f" or "del f"). + +For our function `f(x, y)`, the gradient is: + +``` +โˆ‡f = [ โˆ‚f/โˆ‚x, โˆ‚f/โˆ‚y ] = [ 2x, 2y ] +``` + + + +## Step 2: What the Gradient Vector Tells Us + +So, the gradient is a vector (think of it as an arrow). What do the direction and length of this arrow mean? + +This is the most important intuition to grasp. + +### 1. The Direction of the Gradient + +The gradient vector at any point `(x, y)` points in the direction of the **steepest possible ascent**. + +Imagine you're standing on a mountainside. If you look around, there are many ways to take a step. One direction leads straight uphill, another leads straight downhill, and others traverse the mountain at a constant elevation. 
The gradient is an arrow painted on the ground at your feet that points directly up the steepest path from where you are. + +### 2. The Magnitude (Length) of the Gradient + +The length of the gradient vector tells you **how steep** that steepest path is. + + + + + +- A **long gradient vector** means the slope is very steep. A small step will result in a large change in elevation. + +- A **short gradient vector** means the slope is gentle. The terrain is nearly flat. + +- A **zero-length gradient vector** (i.e., [0, 0]) means you are at a flat spotโ€”either a peak, a valley bottom, or a flat plateau. + + + +## Step 3: A Concrete Example + +Let's go back to our bowl function, `f(x, y) = xยฒ + yยฒ`, and its gradient, `โˆ‡f = [2x, 2y]`. The minimum of this function is clearly at `(0, 0)`. + +Let's calculate the gradient at a specific point, say `(3, 1)`. + +``` +โˆ‡f(3, 1) = [ 2 3, 2 1 ] = [6, 2] +``` + +This vector `[6, 2]` is an arrow that points "6 units in the x-direction and 2 units in the y-direction." This is an arrow pointing up and to the right, away from the minimum at `(0, 0)`. This makes perfect sense! From the point `(3, 1)`, the steepest way up the bowl is away from the bottom. + +What about the point `(-2, -2)`? + +``` +โˆ‡f(-2, -2) = [ 2 -2, 2 -2 ] = [-4, -4] +``` + +This vector points down and to the left, again, away from the bottom of the bowl at `(0, 0)`. + + + +## Step 4: Visualizing the Gradient Field + +Let's visualize this. The image below shows a contour plot of our function `f(x, y) = xยฒ + yยฒ`. Think of this as a topographic map. The lines connect points of equal "elevation." The arrows represent the gradient vectors at various points. + +Notice two crucial properties in the visualization: + +- **Direction:** The arrows always point from a lower contour line to a higher one (from blue to yellow). They show the path of steepest ascent. + +- **Orthogonality:** The gradient vectors are always perpendicular to the contour lines. To go straight uphill, you must walk at a right angle to the path of "no elevation change." + +When you run this, you will see a visual representation of everything we've discussed. + + + +## Step 5: The "Why": Gradients and Machine Learning + +This is where gradients become incredibly powerful. In machine learning, we define a **loss function** (or **cost function**). This function measures how "wrong" our model's predictions are. The inputs to this function are the model's parameters (its weights and biases), and the output is a single number representing the total error. + +Our goal is to **find the set of parameters that minimizes the error**. + +This is the exact same problem as finding the lowest point in a valley! + +The algorithm used to do this is called **Gradient Descent**. Here's how it works: + + + + + +1. **Start Somewhere:** Initialize the model's parameters to random values. (This is like dropping a hiker at a random spot on the mountain). + +2. **Find the Way Down:** Calculate the gradient of the loss function at your current location. The gradient points straight uphill. + +3. **Take a Step Downhill:** To go downhill, simply move in the direction of the **negative gradient**. We update our parameters by taking a small step in that opposite direction. + +4. **Repeat:** Go back to step 2. Keep calculating the gradient and taking small steps downhill until you reach the bottom of the valley, where the gradient is zero. + +This is the core mechanic of how neural networks "learn." 
They are constantly calculating the gradient of their error and adjusting their internal parameters to move in the direction that reduces that error. + +## Key Takeaways + + + + + +- A **gradient** is a vector of partial derivatives that generalizes the concept of slope to functions with multiple inputs. + +- **Direction:** The gradient vector points in the direction of the steepest ascent. + +- **Magnitude:** Its length represents how steep that ascent is. + +- **Optimization:** The negative gradient points in the direction of steepest descent, which is the key to finding the minimum of a function using Gradient Descent. \ No newline at end of file diff --git a/public/content/learn/math/matrices/matrices-content.md b/public/content/learn/math/matrices/matrices-content.md new file mode 100644 index 0000000..ba1f675 --- /dev/null +++ b/public/content/learn/math/matrices/matrices-content.md @@ -0,0 +1,102 @@ +--- +hero: + title: "Matrices" + subtitle: "Operations and Transformations for Neural Networks" + tags: + - "๐Ÿ“ Mathematics" + - "โฑ๏ธ 12 min read" +--- + +**[video coming soon]** + +**Level:** Beginner โ†’ Intermediate. + +--- + +## 1. What is a matrix? + +A matrix is a rectangular array of numbers arranged in rows and columns. We write an `(m x n)` matrix as: + +![Matrix Notation](/content/learn/math/matrices/matrix-notation.png) + + + + + +`(m)` is the number of rows, `(n)` the number of columns. + +If `(m=n)` the matrix is **square**. + +**Why matrices?** They represent neural network weights, linear transformations, systems of linear equations, data tables, graphs, and more. + + + +## 2. Notation and basic examples + +**Entries:** `(A_ij)` is element in row `(i)`, column `(j)`. + +**Row vector:** 1ร—n, **column vector:** mร—1. + +### Example matrices + +We will use these 2 matrices below. + +![Matrix Example](/content/learn/math/matrices/matrix-example.png) + +## 3. Step-by-step matrix operations + +### 3.1 Addition and subtraction (elementwise) + +Only for matrices of the same size. Add corresponding elements. + +**Example:** `(A+B)` + +![Matrix Addition](/content/learn/math/matrices/matrix-addition.png) + +### 3.2 Scalar multiplication + +Multiply each element by the scalar. For `(2A)`: + +![Scalar Multiplication Matrix](/content/learn/math/matrices/scalar-multiplication-matrix.png) + +### 3.3 Matrix multiplication + +You do a dot product of a row of th first matrix with the column of the second matrix and write result at the position where that row and column intercept. + +If `(A)` is `(m x p)` and `(B)` is `(p x n)`, then `(AB)` is `(m x n)`. Multiply rows of `(A)` by columns of `(B)` and sum. + +**Example:** multiply the two 2ร—2 matrices above. + +![Matrix Multiplication Steps](/content/learn/math/matrices/matrix-multiplication-steps.png) + +**Important:** Matrix multiplication is generally **not commutative**: `(AB is not equal to BA)` in general. + +## 4. Key matrix transformations and properties + +### 4.1 Transpose + +![Matrix Transpose](/content/learn/math/matrices/matrix-transpose.png) + +### 4.2 Determinant (square matrices) + +![Matrix Determinant](/content/learn/math/matrices/matrix-determinant.png) + +### 4.3 Inverse (when it exists) + +![Matrix Inverse Formula](/content/learn/math/matrices/matrix-inverse-formula.png) + +### 4.4 Rank + +The **rank** is the dimension of the column space (or row space). If rank = n for an `(n x n)` matrix, it's **full rank** and **invertible**. 
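
To tie the operations above together in code, here is a minimal NumPy sketch. The two 2ร—2 matrices are my own small examples (the lesson's matrices live in the images), chosen only to show the calls:

```python
import numpy as np

# Two small example matrices (assumed values, just for illustration)
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A + B)                     # elementwise addition
print(2 * A)                     # scalar multiplication
print(A @ B)                     # matrix multiplication (rows of A with columns of B)
print(A @ B - B @ A)             # generally non-zero: AB is not BA
print(A.T)                       # transpose
print(np.linalg.det(A))          # determinant (-2.0 here, so A is invertible)
print(np.linalg.inv(A))          # inverse (exists because det != 0)
print(np.linalg.matrix_rank(A))  # rank (2, so A is full rank)
```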
### 4.5 Special matrices (common types) + +![Special Matrices](/content/learn/math/matrices/special-matrices.png) + +## 5. Common pitfalls and tips + +- Remember matrix multiplication order matters. + +- Watch dimensions carefully (columns of the left matrix must equal rows of the right matrix). + +- Numerical stability: beware near-singular matrices (determinant ≈ 0). \ No newline at end of file diff --git a/public/content/learn/math/matrices/matrix-addition.png b/public/content/learn/math/matrices/matrix-addition.png new file mode 100644 index 0000000..e9c3d24 Binary files /dev/null and b/public/content/learn/math/matrices/matrix-addition.png differ diff --git a/public/content/learn/math/matrices/matrix-determinant.png b/public/content/learn/math/matrices/matrix-determinant.png new file mode 100644 index 0000000..6042242 Binary files /dev/null and b/public/content/learn/math/matrices/matrix-determinant.png differ diff --git a/public/content/learn/math/matrices/matrix-example.png b/public/content/learn/math/matrices/matrix-example.png new file mode 100644 index 0000000..0c8c266 Binary files /dev/null and b/public/content/learn/math/matrices/matrix-example.png differ diff --git a/public/content/learn/math/matrices/matrix-inverse-formula.png b/public/content/learn/math/matrices/matrix-inverse-formula.png new file mode 100644 index 0000000..02b037d Binary files /dev/null and b/public/content/learn/math/matrices/matrix-inverse-formula.png differ diff --git a/public/content/learn/math/matrices/matrix-multiplication-steps.png b/public/content/learn/math/matrices/matrix-multiplication-steps.png new file mode 100644 index 0000000..28dcfb8 Binary files /dev/null and b/public/content/learn/math/matrices/matrix-multiplication-steps.png differ diff --git a/public/content/learn/math/matrices/matrix-notation.png b/public/content/learn/math/matrices/matrix-notation.png new file mode 100644 index 0000000..be5e26d Binary files /dev/null and b/public/content/learn/math/matrices/matrix-notation.png differ diff --git a/public/content/learn/math/matrices/matrix-transpose.png b/public/content/learn/math/matrices/matrix-transpose.png new file mode 100644 index 0000000..e7cb7b9 Binary files /dev/null and b/public/content/learn/math/matrices/matrix-transpose.png differ diff --git a/public/content/learn/math/matrices/scalar-multiplication-matrix.png b/public/content/learn/math/matrices/scalar-multiplication-matrix.png new file mode 100644 index 0000000..d36a8ab Binary files /dev/null and b/public/content/learn/math/matrices/scalar-multiplication-matrix.png differ diff --git a/public/content/learn/math/matrices/special-matrices.png b/public/content/learn/math/matrices/special-matrices.png new file mode 100644 index 0000000..eefc9ca Binary files /dev/null and b/public/content/learn/math/matrices/special-matrices.png differ diff --git a/public/content/learn/math/vectors/scalar-multiplication.png b/public/content/learn/math/vectors/scalar-multiplication.png new file mode 100644 index 0000000..0600354 Binary files /dev/null and b/public/content/learn/math/vectors/scalar-multiplication.png differ diff --git a/public/content/learn/math/vectors/simple-vector.png b/public/content/learn/math/vectors/simple-vector.png new file mode 100644 index 0000000..9345ff1 Binary files /dev/null and b/public/content/learn/math/vectors/simple-vector.png differ diff --git a/public/content/learn/math/vectors/vector-addition.png b/public/content/learn/math/vectors/vector-addition.png new file mode 100644 index 0000000..df1fb8f Binary files /dev/null and
b/public/content/learn/math/vectors/vector-addition.png differ diff --git a/public/content/learn/math/vectors/vector-angle.png b/public/content/learn/math/vectors/vector-angle.png new file mode 100644 index 0000000..14a6fd3 Binary files /dev/null and b/public/content/learn/math/vectors/vector-angle.png differ diff --git a/public/content/learn/math/vectors/vectors-content.md b/public/content/learn/math/vectors/vectors-content.md new file mode 100644 index 0000000..4541bde --- /dev/null +++ b/public/content/learn/math/vectors/vectors-content.md @@ -0,0 +1,189 @@ +--- +hero: + title: "Vectors" + subtitle: "Magnitude, Direction, and Vector Operations" + tags: + - "๐Ÿ“ Mathematics" + - "โฑ๏ธ 15 min read" +--- + +**[video comingn soon]** + +Welcome! This guide will introduce you to vectors, which are fundamental objects in mathematics, physics, and computer science. We'll explore what they are and how to work with them, focusing on the concepts, not the code. + +--- + +## Step 1: What is a Vector? + +At its core, a **vector** is a mathematical object that has both **magnitude** (length or size) and **direction**. + + + + + +Think about the difference between "speed" and "velocity." + +- **Speed** is a single number (a scalar), like 50 km/h. It only tells you the magnitude. + +- **Velocity** is a vector, like 50 km/h north. It tells you both the magnitude (50 km/h) and the direction (north). + +We represent vectors as a list of numbers called **components**. For example, in a 2D plane, a vector `v` can be written as: + +``` +v = [x, y] +``` + +This notation means "start at the origin (0,0), move x units along the horizontal axis, and y units along the vertical axis." The arrow drawn from the origin to that point (x, y) is the vector. + +**Examples:** + +- `v = [3, 4]` represents an arrow pointing to the coordinate (3, 4). + +- `u = [-2, 1]` represents an arrow pointing to the coordinate (-2, 1). + +![Simple Vector](/content/learn/math/vectors/simple-vector.png) + + + +## Step 2: The Two Core Properties: Magnitude and Direction + +Every vector is defined by these two properties. + +### Magnitude (Length) + +The **magnitude** of a vector is its length. It's often written with double bars, like `||v||`. We can calculate it using the Pythagorean theorem. For a 2D vector `v = [x, y]`, the formula is: + +``` +||v|| = โˆš(xยฒ + yยฒ) +``` + +For a 3D vector `w = [x, y, z]`, it's a natural extension: `||w|| = โˆš(xยฒ + yยฒ + zยฒ)`. + +**Example:** +For `v = [3, 4]`: +``` +||v|| = โˆš(3ยฒ + 4ยฒ) = โˆš(9 + 16) = โˆš25 = 5 +``` +The length of the vector [3, 4] is 5 units. + +### Direction (Unit Vectors) + +How can we describe only the direction of a vector, ignoring its length? We use a **unit vector**. A unit vector is any vector that has a magnitude of exactly 1. + +To find the unit vector of any given vector, you simply divide the vector by its own magnitude. This scales the vector down to a length of 1 while preserving its direction. The unit vector is often denoted with a "hat," like `รป`. + +``` +รป = v / ||v|| +``` + +**Example:** +For `v = [3, 4]`, we know `||v|| = 5`. +The unit vector `รป` is: +``` +รป = [3, 4] / 5 = [3/5, 4/5] = [0.6, 0.8] +``` +This new vector [0.6, 0.8] points in the exact same direction as [3, 4], but its length is 1. + + + +## Step 3: Vector Arithmetic + +We can perform operations on vectors to combine or modify them. + +### Vector Addition + +**Geometrically**, adding two vectors `u + v` means placing the tail of vector `v` at the tip of vector `u`. 
The resulting vector, `w`, is the arrow drawn from the original starting point to the tip of the second vector. + +**Mathematically**, we just add the corresponding components: +If `u = [xโ‚, yโ‚]` and `v = [xโ‚‚, yโ‚‚]`, then: +``` +u + v = [xโ‚ + xโ‚‚, yโ‚ + yโ‚‚] +``` + +![Vector Addition](/content/learn/math/vectors/vector-addition.png) + +### Scalar Multiplication + +Multiplying a vector by a regular number (a **scalar**) changes its magnitude but not its direction (unless the scalar is negative, in which case the direction is reversed). + +If `k` is a scalar and `v = [x, y]`, then: +``` +k v = [kx, k*y] +``` + + +**Examples:** + +- `2 * v` doubles the vector's length. + +- `0.5 * v` halves the vector's length. + +- `-1 * v` flips the vector to point in the opposite direction. + +![Scalar Multiplication](/content/learn/math/vectors/scalar-multiplication.png) + + + + + + + +## Step 4: The Dot Product + +The **dot product** is a way of multiplying two vectors that results in a single number (a scalar). It is one of the most important vector operations. + +**Intuition:** The dot product tells you how much two vectors align or point in the same direction. + +- **Large positive dot product:** The vectors point in very similar directions. + +- **Dot product is zero:** The vectors are perpendicular (orthogonal) to each other. + +- **Large negative dot product:** The vectors point in generally opposite directions. + +**Calculation:** To calculate the dot product, you multiply the corresponding components and then add the results. +If `u = [xโ‚, yโ‚]` and `v = [xโ‚‚, yโ‚‚]`, the dot product `u ยท v` is: + +``` +u ยท v = (xโ‚ xโ‚‚) + (yโ‚ yโ‚‚) +``` + +### Geometric Meaning & Finding Angles +The dot product also has a powerful geometric definition: + +``` +u ยท v = ||u|| ||v|| cos(ฮธ) +``` + +where `ฮธ` (theta) is the angle between the two vectors. We can rearrange this formula to find the angle between any two vectors! + +``` +cos(ฮธ) = (u ยท v) / (||u|| * ||v||) +``` + +This is an incredibly useful property, allowing us to calculate angles in any number of dimensions. + +![Vector Angle](/content/learn/math/vectors/vector-angle.png) + +## Step 5: Neural Networks: + +Every input, hidden state, and output is a vector. + +- A single image, sound, or sentence is converted into a vector of numbers that captures its features. + +- Each neuron operates on these vectors โ€” combining them through dot products, matrix multiplications, and nonlinear activations to extract patterns. + +- When you train a neural network, you're really adjusting weight vectors so that the model transforms input vectors into desired output vectors. + + + +### ๐Ÿ’ฌ In Large Language Models (LLMs): + +LLMs represent words, sentences, and even abstract concepts as high-dimensional vectors (embeddings). + +- The vector for a word like "king" is close to "queen" in this space because their meanings are similar. + +- Attention mechanisms compute dot products between vectors to measure how related words are in context โ€” that's how the model "focuses" on relevant information. + +- The entire reasoning process of an LLM โ€” understanding, summarizing, generating โ€” happens through transformations of these vectors. 
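
As a small illustration of the dot-product idea above, here is a minimal sketch that computes magnitudes, the dot product, and the angle for the example vectors from this lesson; the tiny "embedding" vectors at the end are made-up numbers purely for illustration:

```python
import math

def magnitude(v):
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return dot(u, v) / (magnitude(u) * magnitude(v))

v = [3, 4]   # example vector from this lesson
u = [-2, 1]  # example vector from this lesson

print(magnitude(v))                            # 5.0
print([x / magnitude(v) for x in v])           # unit vector [0.6, 0.8]
print(dot(u, v))                               # (-2)(3) + (1)(4) = -2
print(math.degrees(math.acos(cosine(u, v))))   # angle between u and v in degrees

# Toy "embeddings" (made-up numbers): similar directions give a cosine near 1
king, queen = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2]
print(cosine(king, queen))                     # close to 1.0
```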
+ +**By understanding vectors, you understand how neural networks think, learn, and represent meaning.** \ No newline at end of file diff --git a/public/content/learn/neural-networks/architecture-of-a-network/architecture-of-a-network-content.md b/public/content/learn/neural-networks/architecture-of-a-network/architecture-of-a-network-content.md new file mode 100644 index 0000000..e8d76a0 --- /dev/null +++ b/public/content/learn/neural-networks/architecture-of-a-network/architecture-of-a-network-content.md @@ -0,0 +1,283 @@ +--- +hero: + title: "Architecture of a Network" + subtitle: "Understanding Neural Network Structure and Design" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 10 min read" +--- + +A neural network's **architecture** is its structure - how many layers, how many neurons, and how they connect! + +![Network Layers](/content/learn/neural-networks/architecture-of-a-network/network-layers.png) + +## Basic Architecture + +**Typical neural network has three parts:** + +1. **Input Layer:** Receives the data +2. **Hidden Layers:** Process and transform +3. **Output Layer:** Makes the prediction + +```yaml +Input Layer โ†’ Hidden Layer 1 โ†’ Hidden Layer 2 โ†’ Output Layer + (784) (128) (64) (10) +``` + +## Example Architecture + +```python +import torch +import torch.nn as nn + +class SimpleNet(nn.Module): + def __init__(self): + super().__init__() + # Input layer โ†’ Hidden layer 1 + self.fc1 = nn.Linear(784, 128) + + # Hidden layer 1 โ†’ Hidden layer 2 + self.fc2 = nn.Linear(128, 64) + + # Hidden layer 2 โ†’ Output layer + self.fc3 = nn.Linear(64, 10) + + def forward(self, x): + # Layer 1 + x = torch.relu(self.fc1(x)) + + # Layer 2 + x = torch.relu(self.fc2(x)) + + # Output layer (no activation for logits) + x = self.fc3(x) + + return x + +model = SimpleNet() +print(model) +``` + +**Architecture diagram:** + +```yaml +Input: 784 features (28ร—28 image flattened) + โ†“ +Linear(784 โ†’ 128) + ReLU + โ†“ +Linear(128 โ†’ 64) + ReLU + โ†“ +Linear(64 โ†’ 10) [logits for 10 classes] + โ†“ +Output: 10 class scores +``` + +## Layer Sizes + +**How to choose layer sizes:** + +```yaml +Input layer: + Size = number of features + Example: 28ร—28 image = 784 + +Hidden layers: + Start wide, gradually narrow + Common pattern: 512 โ†’ 256 โ†’ 128 + Or: Stay same size + +Output layer: + Size = number of outputs + Classification: number of classes + Regression: usually 1 +``` + +**Example patterns:** + +```python +# Pattern 1: Funnel (wide to narrow) +model = nn.Sequential( + nn.Linear(784, 512), + nn.ReLU(), + nn.Linear(512, 256), + nn.ReLU(), + nn.Linear(256, 10) +) + +# Pattern 2: Uniform (same size) +model = nn.Sequential( + nn.Linear(100, 100), + nn.ReLU(), + nn.Linear(100, 100), + nn.ReLU(), + nn.Linear(100, 1) +) + +# Pattern 3: Bottleneck (narrow middle) +model = nn.Sequential( + nn.Linear(784, 128), + nn.ReLU(), + nn.Linear(128, 32), # Bottleneck + nn.ReLU(), + nn.Linear(32, 128), + nn.ReLU(), + nn.Linear(128, 784) +) +``` + +## Depth vs Width + +**Depth = number of layers** +**Width = neurons per layer** + +```python +# Deep but narrow +deep_narrow = nn.Sequential( + nn.Linear(10, 20), + nn.ReLU(), + nn.Linear(20, 20), + nn.ReLU(), + nn.Linear(20, 20), + nn.ReLU(), + nn.Linear(20, 20), + nn.ReLU(), + nn.Linear(20, 1) +) # 5 layers, 20 neurons each + +# Shallow but wide +shallow_wide = nn.Sequential( + nn.Linear(10, 1000), + nn.ReLU(), + nn.Linear(1000, 1) +) # 2 layers, 1000 neurons +``` + +**Trade-offs:** + +```yaml +Deep networks: + โœ“ Learn hierarchical features + โœ“ More expressive 
+ โœ— Harder to train + โœ— Gradient problems + +Wide networks: + โœ“ More parameters per layer + โœ“ Easier to train + โœ— Less feature hierarchy + โœ— More memory +``` + +## Common Architectures + +### Fully Connected (Dense) + +```python +# Every neuron connects to every neuron in next layer +fc_net = nn.Sequential( + nn.Linear(784, 256), + nn.ReLU(), + nn.Linear(256, 128), + nn.ReLU(), + nn.Linear(128, 10) +) +``` + +### Convolutional (CNN) + +```python +# For images +cnn = nn.Sequential( + nn.Conv2d(3, 32, 3), + nn.ReLU(), + nn.MaxPool2d(2), + nn.Conv2d(32, 64, 3), + nn.ReLU(), + nn.Flatten(), + nn.Linear(64*6*6, 10) +) +``` + +## Counting Parameters + +```python +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(10, 20), # 10ร—20 + 20 = 220 params + nn.ReLU(), # 0 params + nn.Linear(20, 5) # 20ร—5 + 5 = 105 params +) + +# Count total parameters +total_params = sum(p.numel() for p in model.parameters()) +print(f"Total parameters: {total_params}") +# Output: 325 +``` + +## Practical Example: MNIST Classifier + +```python +import torch.nn as nn + +class MNISTNet(nn.Module): + def __init__(self): + super().__init__() + self.network = nn.Sequential( + # Input: 28ร—28 = 784 + nn.Linear(784, 128), + nn.ReLU(), + nn.Dropout(0.2), + + nn.Linear(128, 64), + nn.ReLU(), + nn.Dropout(0.2), + + # Output: 10 classes (digits 0-9) + nn.Linear(64, 10) + ) + + def forward(self, x): + # Flatten image + x = x.view(-1, 784) + # Forward pass + return self.network(x) + +model = MNISTNet() + +# Count parameters +params = sum(p.numel() for p in model.parameters()) +print(f"Parameters: {params:,}") +``` + +## Key Takeaways + +โœ“ **Three parts:** Input โ†’ Hidden โ†’ Output + +โœ“ **Layer sizes:** Input (features), Hidden (variable), Output (targets) + +โœ“ **Depth:** Number of layers + +โœ“ **Width:** Neurons per layer + +โœ“ **More layers:** More complex patterns + +โœ“ **Design choice:** Many valid architectures + +**Quick Reference:** + +```python +# Basic architecture template +model = nn.Sequential( + nn.Linear(input_size, hidden1_size), + nn.ReLU(), + nn.Linear(hidden1_size, hidden2_size), + nn.ReLU(), + nn.Linear(hidden2_size, output_size) +) + +# Count parameters +total = sum(p.numel() for p in model.parameters()) +``` + +**Remember:** Architecture is like a blueprint - it defines your network's structure! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/architecture-of-a-network/network-layers.png b/public/content/learn/neural-networks/architecture-of-a-network/network-layers.png new file mode 100644 index 0000000..b7b9395 Binary files /dev/null and b/public/content/learn/neural-networks/architecture-of-a-network/network-layers.png differ diff --git a/public/content/learn/neural-networks/backpropagation-in-action/backpropagation-in-action-content.md b/public/content/learn/neural-networks/backpropagation-in-action/backpropagation-in-action-content.md new file mode 100644 index 0000000..f72db77 --- /dev/null +++ b/public/content/learn/neural-networks/backpropagation-in-action/backpropagation-in-action-content.md @@ -0,0 +1,114 @@ +--- +hero: + title: "Backpropagation in Action" + subtitle: "Seeing Gradients Flow Through Networks" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 8 min read" +--- + +Let's see backpropagation in action with real examples! 
+ +## Watching Gradients + +```python +import torch +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(2, 3), + nn.ReLU(), + nn.Linear(3, 1) +) + +x = torch.tensor([[1.0, 2.0]]) +y_true = torch.tensor([[5.0]]) + +# Forward +y_pred = model(x) +loss = (y_pred - y_true) ** 2 + +# Backward +loss.backward() + +# See gradients +for name, param in model.named_parameters(): + print(f"{name}:") + print(f" Value: {param.data}") + print(f" Gradient: {param.grad}") + print() +``` + +## Gradient Flow Example + +```python +import torch + +# Three-step computation +x = torch.tensor([2.0], requires_grad=True) +y = x ** 2 # y = xยฒ +z = y + 3 # z = y + 3 +loss = z ** 2 # loss = zยฒ + +# Backward +loss.backward() + +print(f"x = {x.item()}") +print(f"y = {y.item()}") +print(f"z = {z.item()}") +print(f"loss = {loss.item()}") +print(f"\\ndloss/dx = {x.grad.item()}") + +# Manual chain rule: +# dloss/dx = dloss/dz ร— dz/dy ร— dy/dx +# = 2z ร— 1 ร— 2x +# = 2(7) ร— 1 ร— 2(2) +# = 14 ร— 4 = 56 โœ“ +``` + +## Training with Backprop + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +model = nn.Linear(1, 1) +optimizer = optim.SGD(model.parameters(), lr=0.01) +criterion = nn.MSELoss() + +# Data: y = 2x +X = torch.tensor([[1.0], [2.0], [3.0], [4.0]]) +y = torch.tensor([[2.0], [4.0], [6.0], [8.0]]) + +# Train +for epoch in range(50): + # Forward + pred = model(X) + loss = criterion(pred, y) + + # Backward + optimizer.zero_grad() + loss.backward() + + # Update + optimizer.step() + + if epoch % 10 == 0: + print(f"Epoch {epoch}, Loss: {loss.item():.4f}") + +print(f"Learned weight: {model.weight.item():.2f}") # ~2.0 +print(f"Learned bias: {model.bias.item():.2f}") # ~0.0 +``` + +## Key Takeaways + +โœ“ **Backprop:** Computes gradients efficiently + +โœ“ **Chain rule:** Multiplies gradients backwards + +โœ“ **Automatic:** PyTorch handles it + +โœ“ **Essential:** Makes training possible + +**Remember:** Backprop = automatic gradient calculation through layers! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/backpropagation/backpropagation-content.md b/public/content/learn/neural-networks/backpropagation/backpropagation-content.md new file mode 100644 index 0000000..e2ece30 --- /dev/null +++ b/public/content/learn/neural-networks/backpropagation/backpropagation-content.md @@ -0,0 +1,379 @@ +--- +hero: + title: "Backpropagation" + subtitle: "The Algorithm That Enables Learning" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 18 min read" +--- + +# Backpropagation + +## What is Backpropagation? + +Backpropagation (short for "backward propagation of errors") is the algorithm used to **calculate gradients** of the loss function with respect to the weights. It works backward through the network, computing how much each weight contributed to the error. + +Think of it as **tracing blame backward** through the network! 
+ +![Backpropagation Overview](backprop-overview.png) + +## Why It Matters + +Without backpropagation: +- โŒ We couldn't efficiently train deep neural networks +- โŒ Would need to compute millions of partial derivatives manually +- โŒ Training would take forever + +With backpropagation: +- โœ… Efficiently computes all gradients in one backward pass +- โœ… Uses the chain rule to reuse computations +- โœ… Makes deep learning practical + +## The Core Idea + +The key insight is the **chain rule** from calculus: + +``` +If y = f(g(x)), then: +dy/dx = (dy/dg) ร— (dg/dx) +``` + +In a neural network with multiple layers, we chain these derivatives together: + +``` +โˆ‚L/โˆ‚wโฝยนโพ = (โˆ‚L/โˆ‚aโฝยณโพ) ร— (โˆ‚aโฝยณโพ/โˆ‚aโฝยฒโพ) ร— (โˆ‚aโฝยฒโพ/โˆ‚aโฝยนโพ) ร— (โˆ‚aโฝยนโพ/โˆ‚wโฝยนโพ) +``` + +## The Backpropagation Process + +### Step 1: Forward Pass +First, do a forward pass to get the prediction and cache all intermediate values: + +```python +# Forward pass (saving values for backprop) +z1 = W1 @ X + b1 +a1 = relu(z1) # Cache z1, a1 + +z2 = W2 @ a1 + b2 +a2 = sigmoid(z2) # Cache z2, a2 (prediction) + +# Compute loss +loss = (a2 - y)**2 # MSE loss +``` + +### Step 2: Output Layer Gradient +Calculate gradient at the output: + +```python +# For MSE loss: L = (ลท - y)ยฒ +dL_da2 = 2 * (a2 - y) + +# Gradient through sigmoid +da2_dz2 = a2 * (1 - a2) # sigmoid derivative + +# Combine using chain rule +dL_dz2 = dL_da2 * da2_dz2 +``` + +### Step 3: Propagate Backward +For each layer (from output to input): + +```python +# Gradients for layer 2 weights and bias +dL_dW2 = dL_dz2 @ a1.T +dL_db2 = dL_dz2 + +# Gradient flowing to previous layer +dL_da1 = W2.T @ dL_dz2 + +# Gradient through ReLU +da1_dz1 = (z1 > 0).astype(float) # ReLU derivative +dL_dz1 = dL_da1 * da1_dz1 + +# Gradients for layer 1 weights and bias +dL_dW1 = dL_dz1 @ X.T +dL_db1 = dL_dz1 +``` + +### Step 4: Update Weights +Use gradients to update parameters: + +```python +# Gradient descent +learning_rate = 0.01 + +W1 -= learning_rate * dL_dW1 +b1 -= learning_rate * dL_db1 +W2 -= learning_rate * dL_dW2 +b2 -= learning_rate * dL_db2 +``` + +![Backprop Steps](backprop-steps.png) + +## Detailed Example + +Let's work through a concrete example with numbers. + +### Setup +``` +Input: x = 2 +Target: y = 1 + +Network: +- Layer 1: 1 neuron, ReLU + W1 = 0.5, b1 = 0.1 +- Layer 2: 1 neuron, Sigmoid + W2 = 0.8, b2 = 0.2 + +Loss: MSE = (ลท - y)ยฒ +``` + +### Forward Pass +``` +Layer 1: +z1 = 0.5(2) + 0.1 = 1.1 +a1 = ReLU(1.1) = 1.1 + +Layer 2: +z2 = 0.8(1.1) + 0.2 = 1.08 +a2 = sigmoid(1.08) = 0.746 + +Loss: +L = (0.746 - 1)ยฒ = 0.0645 +``` + +### Backward Pass + +**Output Layer:** +``` +dL/da2 = 2(0.746 - 1) = -0.508 + +sigmoid'(z2) = a2(1 - a2) + = 0.746(1 - 0.746) = 0.189 + +dL/dz2 = -0.508 ร— 0.189 = -0.096 + +dL/dW2 = dL/dz2 ร— a1 = -0.096 ร— 1.1 = -0.106 +dL/db2 = dL/dz2 = -0.096 +``` + +**Hidden Layer:** +``` +dL/da1 = W2 ร— dL/dz2 + = 0.8 ร— (-0.096) = -0.077 + +ReLU'(z1) = 1 (since z1 = 1.1 > 0) + +dL/dz1 = -0.077 ร— 1 = -0.077 + +dL/dW1 = dL/dz1 ร— x = -0.077 ร— 2 = -0.154 +dL/db1 = dL/dz1 = -0.077 +``` + +### Update Weights (ฮฑ = 0.1) +``` +W1_new = 0.5 - 0.1(-0.154) = 0.515 +b1_new = 0.1 - 0.1(-0.077) = 0.108 +W2_new = 0.8 - 0.1(-0.106) = 0.811 +b2_new = 0.2 - 0.1(-0.096) = 0.210 +``` + +The weights moved in the direction to reduce the loss! 
โœ… + +## Activation Function Derivatives + +### ReLU +```python +def relu_derivative(z): + return (z > 0).astype(float) + +# Examples: +relu'(-1) = 0 +relu'(0) = 0 +relu'(1) = 1 +``` + +### Sigmoid +```python +def sigmoid_derivative(a): + # a is the sigmoid output + return a * (1 - a) + +# Examples: +# If sigmoid(z) = 0.7, then sigmoid'(z) = 0.7 ร— 0.3 = 0.21 +``` + +### Tanh +```python +def tanh_derivative(a): + # a is the tanh output + return 1 - a**2 + +# Examples: +# If tanh(z) = 0.5, then tanh'(z) = 1 - 0.25 = 0.75 +``` + +### Softmax (special case) +```python +# For softmax with cross-entropy loss, the gradient simplifies to: +dL/dz = a - y # where a is softmax output, y is one-hot label +``` + +## Loss Function Gradients + +### Mean Squared Error (MSE) +```python +# L = (ลท - y)ยฒ +dL/dลท = 2(ลท - y) +``` + +### Binary Cross-Entropy +```python +# L = -[y log(ลท) + (1-y)log(1-ลท)] +dL/dลท = -(y/ลท) + (1-y)/(1-ลท) + +# Simplified with sigmoid: dL/dz = ลท - y +``` + +### Categorical Cross-Entropy +```python +# L = -ฮฃ yแตข log(ลทแตข) +dL/dลทแตข = -yแตข/ลทแตข + +# Simplified with softmax: dL/dzแตข = ลทแตข - yแตข +``` + +## Matrix Form (Batch Processing) + +For a batch of examples: + +```python +# Forward pass +Z1 = X @ W1.T + b1 # (batch_size, hidden_dim) +A1 = relu(Z1) + +Z2 = A1 @ W2.T + b2 # (batch_size, output_dim) +A2 = sigmoid(Z2) + +# Loss (averaged over batch) +L = ((A2 - Y)**2).mean() + +# Backward pass +dL_dZ2 = (A2 - Y) / batch_size +dL_dW2 = dL_dZ2.T @ A1 +dL_db2 = dL_dZ2.sum(axis=0) + +dL_dA1 = dL_dZ2 @ W2 +dL_dZ1 = dL_dA1 * (Z1 > 0) +dL_dW1 = dL_dZ1.T @ X +dL_db1 = dL_dZ1.sum(axis=0) +``` + +![Matrix Backprop](matrix-backprop.png) + +## Common Challenges + +### 1. Vanishing Gradients + +**Problem:** Gradients become very small in deep networks + +``` +# With sigmoid, if all gradients are < 1: +grad = 0.25 ร— 0.25 ร— 0.25 ร— ... โ†’ โ‰ˆ 0 +``` + +**Solutions:** +- Use ReLU instead of sigmoid/tanh +- Batch normalization +- Residual connections (skip connections) +- Careful weight initialization + +### 2. Exploding Gradients + +**Problem:** Gradients become very large + +``` +# If weights are > 1: +grad = 2 ร— 2 ร— 2 ร— ... โ†’ โˆž +``` + +**Solutions:** +- Gradient clipping +- Smaller learning rate +- Better weight initialization + +### 3. Dead ReLU + +**Problem:** ReLU neurons output 0 for all inputs (gradient always 0) + +**Solutions:** +- Use Leaky ReLU or ELU +- Lower learning rate +- Better initialization + +## Computational Efficiency + +Why backpropagation is efficient: + +1. **Reuses Computations** + ``` + โˆ‚L/โˆ‚wโฝยนโพ needs โˆ‚L/โˆ‚aโฝยฒโพ + โˆ‚L/โˆ‚wโฝยฒโพ also needs โˆ‚L/โˆ‚aโฝยฒโพ + โ†’ Compute once, use twice! + ``` + +2. **One Backward Pass** + - Forward: O(n) operations + - Backward: O(n) operations + - Total: O(2n) โ‰ˆ O(n) + +3. **Automatic Differentiation** + - Modern frameworks (PyTorch, TensorFlow) do this automatically + - Just specify the loss, backprop is automatic! + +## PyTorch Example + +Here's how easy it is with PyTorch: + +```python +import torch +import torch.nn as nn + +# Define network +model = nn.Sequential( + nn.Linear(2, 3), + nn.ReLU(), + nn.Linear(3, 1), + nn.Sigmoid() +) + +# Forward pass +x = torch.tensor([[2.0, 3.0]]) +y = torch.tensor([[1.0]]) +y_pred = model(x) + +# Compute loss +loss = ((y_pred - y)**2).mean() + +# Backward pass (automatic!) 
+loss.backward() + +# Gradients are computed automatically +for name, param in model.named_parameters(): + print(f"{name}: {param.grad}") +``` + +## Key Takeaways + +โœ… Backpropagation efficiently computes gradients using the chain rule +โœ… It works backward from output to input layer +โœ… Each layer computes: gradients for weights + gradients for previous layer +โœ… Modern frameworks automate this process +โœ… Understanding it helps with debugging and designing better networks + +## What's Next? + +Now that we know how to compute gradients, we need to learn how to **use them effectively** to train neural networks. That's where **optimization algorithms** come in! + +Let's explore training and optimization next! ๐Ÿš€ + diff --git a/public/content/learn/neural-networks/building-a-layer/building-a-layer-content.md b/public/content/learn/neural-networks/building-a-layer/building-a-layer-content.md new file mode 100644 index 0000000..074071a --- /dev/null +++ b/public/content/learn/neural-networks/building-a-layer/building-a-layer-content.md @@ -0,0 +1,170 @@ +--- +hero: + title: "Building a Layer" + subtitle: "Creating Layers of Neurons" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 8 min read" +--- + +A layer is a collection of neurons that process inputs together. It's the fundamental unit of neural networks! + +![Layer Structure](/content/learn/neural-networks/building-a-layer/layer-structure.png) + +## What is a Layer? + +**Layer = Multiple neurons working in parallel** + +```python +import torch.nn as nn + +# Single neuron +neuron = nn.Linear(10, 1) # 10 inputs โ†’ 1 output + +# Layer of 5 neurons +layer = nn.Linear(10, 5) # 10 inputs โ†’ 5 outputs + +# Each output is from a different neuron! +``` + +## Creating a Layer + +```python +import torch +import torch.nn as nn + +# Create layer: 3 inputs โ†’ 4 outputs +layer = nn.Linear(in_features=3, out_features=4) + +# Test +x = torch.tensor([[1.0, 2.0, 3.0]]) # 1 sample, 3 features +output = layer(x) + +print(output.shape) # torch.Size([1, 4]) +print(output) +# tensor([[0.234, -1.123, 0.567, 2.134]], grad_fn=) +# 4 different outputs! 
+``` + +**What happened:** + +```yaml +4 neurons, each with: + - 3 weights (one per input) + - 1 bias + +Total parameters: 4ร—(3+1) = 16 parameters + +Each neuron computes: + neuron1: w1ยทx + b1 + neuron2: w2ยทx + b2 + neuron3: w3ยทx + b3 + neuron4: w4ยทx + b4 +``` + +## Layer with Activation + +```python +class LayerWithActivation(nn.Module): + def __init__(self, in_features, out_features): + super().__init__() + self.linear = nn.Linear(in_features, out_features) + self.activation = nn.ReLU() + + def forward(self, x): + return self.activation(self.linear(x)) + +# Use it +layer = LayerWithActivation(10, 20) +x = torch.randn(32, 10) # Batch of 32 +output = layer(x) + +print(output.shape) # torch.Size([32, 20]) +``` + +## Multiple Layers + +```python +# Stack layers together +model = nn.Sequential( + nn.Linear(784, 256), + nn.ReLU(), + + nn.Linear(256, 128), + nn.ReLU(), + + nn.Linear(128, 10) +) + +# Each layer transforms the data +x = torch.randn(1, 784) +print(x.shape) # torch.Size([1, 784]) + +x = model[0](x) # First linear +print(x.shape) # torch.Size([1, 256]) + +x = model[1](x) # ReLU +print(x.shape) # torch.Size([1, 256]) + +x = model[2](x) # Second linear +print(x.shape) # torch.Size([1, 128]) +``` + +## Custom Layer + +```python +class CustomLayer(nn.Module): + def __init__(self, in_dim, out_dim): + super().__init__() + self.linear = nn.Linear(in_dim, out_dim) + self.norm = nn.BatchNorm1d(out_dim) + self.activation = nn.ReLU() + self.dropout = nn.Dropout(0.2) + + def forward(self, x): + x = self.linear(x) + x = self.norm(x) + x = self.activation(x) + x = self.dropout(x) + return x + +# Use custom layer +layer = CustomLayer(100, 50) +x = torch.randn(32, 100) +output = layer(x) +print(output.shape) # torch.Size([32, 50]) +``` + +## Key Takeaways + +โœ“ **Layer = Multiple neurons:** Process inputs in parallel + +โœ“ **nn.Linear(in, out):** Creates a layer + +โœ“ **Add activation:** After linear transformation + +โœ“ **Stack layers:** Build deep networks + +โœ“ **Custom layers:** Combine multiple operations + +**Quick Reference:** + +```python +# Basic layer +layer = nn.Linear(input_dim, output_dim) + +# Layer with activation +layer = nn.Sequential( + nn.Linear(in_dim, out_dim), + nn.ReLU() +) + +# Multi-layer network +model = nn.Sequential( + nn.Linear(784, 128), + nn.ReLU(), + nn.Linear(128, 10) +) +``` + +**Remember:** Layers are just multiple neurons working together! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/building-a-layer/layer-structure.png b/public/content/learn/neural-networks/building-a-layer/layer-structure.png new file mode 100644 index 0000000..882eef3 Binary files /dev/null and b/public/content/learn/neural-networks/building-a-layer/layer-structure.png differ diff --git a/public/content/learn/neural-networks/calculating-gradients/calculating-gradients-content.md b/public/content/learn/neural-networks/calculating-gradients/calculating-gradients-content.md new file mode 100644 index 0000000..f9a1d82 --- /dev/null +++ b/public/content/learn/neural-networks/calculating-gradients/calculating-gradients-content.md @@ -0,0 +1,99 @@ +--- +hero: + title: "Calculating Gradients" + subtitle: "Understanding Gradient Computation" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 8 min read" +--- + +Gradients tell us **which direction** to adjust weights to reduce loss! + +## What is a Gradient? 
+ +**Gradient = Rate of change of loss with respect to a parameter** + +```python +import torch + +# Simple function: loss = wยฒ +w = torch.tensor([3.0], requires_grad=True) +loss = w ** 2 + +# Calculate gradient +loss.backward() + +print(f"Weight: {w.item()}") +print(f"Loss: {loss.item()}") +print(f"Gradient: {w.grad.item()}") + +# Gradient = 2w = 2ร—3 = 6 +# This tells us: increasing w increases loss +``` + +## Computing Gradients in PyTorch + +```python +import torch +import torch.nn as nn + +# Model +model = nn.Linear(3, 1) + +# Data +x = torch.tensor([[1.0, 2.0, 3.0]]) +y_true = torch.tensor([[5.0]]) + +# Forward pass +y_pred = model(x) +loss = (y_pred - y_true) ** 2 + +# Compute gradients +loss.backward() + +# Check gradients +print("Weight gradients:", model.weight.grad) +print("Bias gradient:", model.bias.grad) +``` + +## Gradient Descent Update + +```python +# Manual gradient descent +learning_rate = 0.01 + +with torch.no_grad(): + for param in model.parameters(): + # Update: param = param - lr * gradient + param -= learning_rate * param.grad + + # Reset gradient + param.grad.zero_() +``` + +## Key Takeaways + +โœ“ **Gradient:** Direction and magnitude of change + +โœ“ **`.backward()`:** Computes all gradients + +โœ“ **Automatic:** PyTorch calculates for you + +โœ“ **Update rule:** param -= lr * gradient + +**Quick Reference:** + +```python +# Compute gradients +loss.backward() + +# Access gradients +param.grad + +# Zero gradients +optimizer.zero_grad() +# or +param.grad.zero_() +``` + +**Remember:** Gradients point the way to better weights! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/forward-propagation/forward-propagation-content.md b/public/content/learn/neural-networks/forward-propagation/forward-propagation-content.md new file mode 100644 index 0000000..fdc24ff --- /dev/null +++ b/public/content/learn/neural-networks/forward-propagation/forward-propagation-content.md @@ -0,0 +1,303 @@ +--- +hero: + title: "Forward Propagation" + subtitle: "How Data Flows Through Neural Networks" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 13 min read" +--- + +# Forward Propagation + +## What is Forward Propagation? + +Forward propagation is the process of passing input data through the neural network to get an output (prediction). It's called **"forward"** because data moves in one direction: + +``` +Input Layer โ†’ Hidden Layers โ†’ Output Layer +``` + +This is how neural networks make predictions! + +![Forward Propagation Flow](forward-prop-diagram.png) + +## The Process Step by Step + +### Step 1: Input Layer +Receive the input features + +```python +# Example: Image of handwritten digit +x = [0.5, 0.8, 0.3, ...] # Pixel values +``` + +### Step 2: Weighted Sum +For each neuron in the next layer, calculate: + +``` +z = wโ‚xโ‚ + wโ‚‚xโ‚‚ + ... + wโ‚™xโ‚™ + b +``` + +Or in matrix form: +``` +Z = WX + b +``` + +Where: +- `W` = weight matrix +- `X` = input vector +- `b` = bias vector + +### Step 3: Activation Function +Apply non-linear activation: + +``` +a = ฯƒ(z) # e.g., ReLU(z) or sigmoid(z) +``` + +### Step 4: Repeat +Use the outputs as inputs for the next layer, repeat steps 2-3 until reaching the output layer. + +## Mathematical Representation + +For a network with L layers: + +``` +Layer 1: aโฝยนโพ = ฯƒ(Wโฝยนโพx + bโฝยนโพ) +Layer 2: aโฝยฒโพ = ฯƒ(Wโฝยฒโพaโฝยนโพ + bโฝยฒโพ) +... +Layer L: aโฝแดธโพ = ฯƒ(Wโฝแดธโพaโฝแดธโปยนโพ + bโฝแดธโพ) +``` + +The final output `aโฝแดธโพ` is our prediction! 
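+
+The same recurrence can also be written as a short loop. Here is a minimal sketch (a simplification: it applies ReLU at every layer, whereas the output layer usually gets its own activation, as discussed later; the shapes are arbitrary and chosen only for illustration):
+
+```python
+import numpy as np
+
+def relu(z):
+    return np.maximum(0, z)
+
+def forward(x, weights, biases):
+    """Compute a⁽ᴸ⁾ by repeatedly applying a⁽ˡ⁾ = σ(W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾)."""
+    a = x
+    for W, b in zip(weights, biases):
+        z = W @ a + b   # weighted sum for this layer
+        a = relu(z)     # activation
+    return a            # final layer output = the prediction
+
+# Tiny example with random parameters: 4 inputs → 5 hidden → 2 outputs
+x = np.random.randn(4)
+weights = [np.random.randn(5, 4), np.random.randn(2, 5)]
+biases = [np.random.randn(5), np.random.randn(2)]
+print(forward(x, weights, biases))
+```
+
+Each pass through the loop matches one line of the equations above: a weighted sum followed by an activation.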
+ +## Simple Example: 2-Layer Network + +Let's walk through a tiny network: + +**Network Architecture:** +- Input: 2 features +- Hidden layer: 3 neurons (ReLU) +- Output: 1 neuron (Sigmoid) + +### Given: +``` +Input: x = [2, 3] + +Hidden layer weights: +Wโฝยนโพ = [[0.5, 0.3], + [0.2, 0.8], + [0.1, 0.6]] + +Hidden layer bias: bโฝยนโพ = [0.1, 0.2, 0.3] + +Output layer weights: Wโฝยฒโพ = [[0.4, 0.5, 0.6]] +Output layer bias: bโฝยฒโพ = [0.1] +``` + +### Step-by-Step Calculation: + +**Hidden Layer (Layer 1):** + +Neuron 1: +``` +zโ‚โฝยนโพ = 0.5(2) + 0.3(3) + 0.1 = 2.0 +aโ‚โฝยนโพ = ReLU(2.0) = 2.0 +``` + +Neuron 2: +``` +zโ‚‚โฝยนโพ = 0.2(2) + 0.8(3) + 0.2 = 3.0 +aโ‚‚โฝยนโพ = ReLU(3.0) = 3.0 +``` + +Neuron 3: +``` +zโ‚ƒโฝยนโพ = 0.1(2) + 0.6(3) + 0.3 = 2.3 +aโ‚ƒโฝยนโพ = ReLU(2.3) = 2.3 +``` + +Hidden layer output: `aโฝยนโพ = [2.0, 3.0, 2.3]` + +**Output Layer (Layer 2):** +``` +zโฝยฒโพ = 0.4(2.0) + 0.5(3.0) + 0.6(2.3) + 0.1 = 3.68 +aโฝยฒโพ = sigmoid(3.68) โ‰ˆ 0.975 +``` + +**Final Prediction: 0.975** (97.5% probability for class 1) + +![Example Network](forward-example.png) + +## Matrix Operations (Vectorized) + +For efficiency, we compute for all neurons at once: + +### Layer 1: +```python +import numpy as np + +# Input +X = np.array([2, 3]) + +# Layer 1 +W1 = np.array([[0.5, 0.3], + [0.2, 0.8], + [0.1, 0.6]]) +b1 = np.array([0.1, 0.2, 0.3]) + +Z1 = W1 @ X + b1 # Matrix multiplication +A1 = np.maximum(0, Z1) # ReLU + +# Layer 2 +W2 = np.array([[0.4, 0.5, 0.6]]) +b2 = np.array([0.1]) + +Z2 = W2 @ A1 + b2 +A2 = 1 / (1 + np.exp(-Z2)) # Sigmoid + +print(f"Prediction: {A2[0]:.3f}") +# Output: Prediction: 0.975 +``` + +## Batch Processing + +In practice, we process **multiple examples** simultaneously: + +```python +# Batch of 3 examples +X = np.array([[2, 3], + [1, 4], + [3, 2]]) # Shape: (3, 2) + +# Forward pass +Z1 = X @ W1.T + b1 # Broadcasting handles bias +A1 = np.maximum(0, Z1) + +Z2 = A1 @ W2.T + b2 +A2 = 1 / (1 + np.exp(-Z2)) + +print(A2.shape) # (3, 1) - predictions for 3 examples +``` + +## Activation Functions in Action + +Different activation functions transform data differently: + +### ReLU +```python +def relu(z): + return np.maximum(0, z) + +# Keeps positive values, zeros out negative +relu([-2, -1, 0, 1, 2]) # [0, 0, 0, 1, 2] +``` + +### Sigmoid +```python +def sigmoid(z): + return 1 / (1 + np.exp(-z)) + +# Squashes to (0, 1) +sigmoid([-2, 0, 2]) # [0.119, 0.5, 0.881] +``` + +### Tanh +```python +def tanh(z): + return np.tanh(z) + +# Squashes to (-1, 1) +tanh([-2, 0, 2]) # [-0.964, 0, 0.964] +``` + +![Activation Functions](activations-comparison.png) + +## Common Patterns + +### Classification (Softmax Output) +For multi-class classification, use softmax in the output layer: + +```python +def softmax(z): + exp_z = np.exp(z - np.max(z)) # Numerical stability + return exp_z / exp_z.sum() + +# Example: 3-class classification +logits = np.array([2.0, 1.0, 0.1]) +probs = softmax(logits) +# [0.659, 0.242, 0.099] - probabilities sum to 1 +``` + +### Regression (Linear Output) +For regression, no activation in output layer: + +```python +# Final layer for regression +output = W_last @ a_last + b_last +# No activation - can output any real number +``` + +## Key Properties + +### Deterministic +Same input + same weights = same output every time + +### Differentiable +We can compute gradients (needed for backpropagation) + +### Composable +Output of one layer is input to next - function composition + +### Efficient +Matrix operations are highly optimized (GPUs!) 
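+
+To see why the efficiency point matters in practice, here is a rough sketch (array sizes are arbitrary, picked only for this comparison) contrasting a plain Python loop with a single vectorized matrix multiplication for one layer's weighted sums:
+
+```python
+import numpy as np
+import time
+
+X = np.random.randn(1000, 512)   # batch of 1000 examples, 512 features
+W = np.random.randn(256, 512)    # 256 neurons in this layer
+b = np.random.randn(256)
+
+# Loop version: one neuron and one example at a time
+start = time.time()
+Z_loop = np.empty((1000, 256))
+for i in range(1000):
+    for j in range(256):
+        Z_loop[i, j] = W[j] @ X[i] + b[j]
+loop_time = time.time() - start
+
+# Vectorized version: one matrix multiplication for the whole batch
+start = time.time()
+Z_vec = X @ W.T + b
+vec_time = time.time() - start
+
+print(f"Loop:       {loop_time:.4f}s")
+print(f"Vectorized: {vec_time:.4f}s")
+print("Same result:", np.allclose(Z_loop, Z_vec))
+```
+
+On typical hardware the vectorized version is usually dramatically faster, which is why the implementations in this lesson all use matrix form.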
+ +## Debugging Forward Pass + +Common issues and solutions: + +### 1. Shape Mismatches +```python +# Check shapes at each layer +print(f"Input shape: {X.shape}") +print(f"W1 shape: {W1.shape}") +print(f"Z1 shape: {Z1.shape}") +``` + +### 2. Numerical Overflow +```python +# For sigmoid/softmax, use numerical stability tricks +# Bad: exp(x) / sum(exp(x)) +# Good: exp(x - max(x)) / sum(exp(x - max(x))) +``` + +### 3. Wrong Activation +```python +# Make sure you use the right activation for each layer +# Hidden: ReLU, Tanh +# Output (classification): Sigmoid (binary), Softmax (multi-class) +# Output (regression): None (linear) +``` + +## Implementation Tips + +โœ… Use vectorized operations (NumPy/PyTorch) +โœ… Process data in batches for efficiency +โœ… Cache intermediate values (needed for backprop) +โœ… Add assertions to check shapes +โœ… Normalize inputs for stable training + +## What We've Learned + +๐ŸŽฏ Forward propagation transforms inputs into predictions +๐ŸŽฏ It's a series of weighted sums + activations +๐ŸŽฏ Matrix operations make it efficient +๐ŸŽฏ Different activations serve different purposes +๐ŸŽฏ The process is deterministic and differentiable + +## Next Steps + +Forward propagation gets us predictions, but how does the network **learn**? That's where **backpropagation** comes in! It calculates how to adjust the weights to improve predictions. + +Let's dive into backpropagation next! ๐ŸŽ“ + diff --git a/public/content/learn/neural-networks/implementing-a-network/implementing-a-network-content.md b/public/content/learn/neural-networks/implementing-a-network/implementing-a-network-content.md new file mode 100644 index 0000000..51d57bb --- /dev/null +++ b/public/content/learn/neural-networks/implementing-a-network/implementing-a-network-content.md @@ -0,0 +1,215 @@ +--- +hero: + title: "Implementing a Network" + subtitle: "Building Complete Neural Networks in PyTorch" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 10 min read" +--- + +Let's build complete, working neural networks from scratch! + +## Simple Feedforward Network + +```python +import torch +import torch.nn as nn + +class FeedForwardNet(nn.Module): + def __init__(self, input_size, hidden_size, output_size): + super().__init__() + self.fc1 = nn.Linear(input_size, hidden_size) + self.fc2 = nn.Linear(hidden_size, output_size) + + def forward(self, x): + x = torch.relu(self.fc1(x)) + x = self.fc2(x) + return x + +# Create network +model = FeedForwardNet(input_size=784, hidden_size=128, output_size=10) + +# Test +x = torch.randn(32, 784) +output = model(x) +print(output.shape) # torch.Size([32, 10]) +``` + +## Complete Training Pipeline + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +# 1. Define model +class Net(nn.Module): + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(10, 20), + nn.ReLU(), + nn.Linear(20, 1) + ) + + def forward(self, x): + return self.layers(x) + +# 2. Create model, loss, optimizer +model = Net() +criterion = nn.MSELoss() +optimizer = optim.Adam(model.parameters(), lr=0.001) + +# 3. Training loop +def train(model, X_train, y_train, epochs=100): + for epoch in range(epochs): + # Forward + predictions = model(X_train) + loss = criterion(predictions, y_train) + + # Backward + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if epoch % 20 == 0: + print(f"Epoch {epoch}, Loss: {loss.item():.4f}") + + return model + +# 4. 
Train +X = torch.randn(100, 10) +y = torch.randn(100, 1) +trained_model = train(model, X, y) +``` + +## Multi-Layer Deep Network + +```python +class DeepNet(nn.Module): + def __init__(self): + super().__init__() + self.layer1 = nn.Linear(784, 512) + self.layer2 = nn.Linear(512, 256) + self.layer3 = nn.Linear(256, 128) + self.layer4 = nn.Linear(128, 10) + + self.dropout = nn.Dropout(0.2) + + def forward(self, x): + x = torch.relu(self.layer1(x)) + x = self.dropout(x) + + x = torch.relu(self.layer2(x)) + x = self.dropout(x) + + x = torch.relu(self.layer3(x)) + x = self.dropout(x) + + x = self.layer4(x) + return x + +model = DeepNet() +``` + +## Complete MNIST Example + +```python +import torch +import torch.nn as nn +import torch.optim as optim +from torch.utils.data import DataLoader, TensorDataset + +class MNISTNet(nn.Module): + def __init__(self): + super().__init__() + self.network = nn.Sequential( + nn.Linear(784, 128), + nn.ReLU(), + nn.Dropout(0.2), + nn.Linear(128, 64), + nn.ReLU(), + nn.Dropout(0.2), + nn.Linear(64, 10) + ) + + def forward(self, x): + x = x.view(-1, 784) # Flatten + return self.network(x) + +# Create model +model = MNISTNet() +criterion = nn.CrossEntropyLoss() +optimizer = optim.Adam(model.parameters(), lr=0.001) + +# Training function +def train_epoch(model, dataloader, criterion, optimizer): + model.train() + total_loss = 0 + + for batch_x, batch_y in dataloader: + # Forward + outputs = model(batch_x) + loss = criterion(outputs, batch_y) + + # Backward + optimizer.zero_grad() + loss.backward() + optimizer.step() + + total_loss += loss.item() + + return total_loss / len(dataloader) + +# Evaluation function +def evaluate(model, dataloader): + model.eval() + correct = 0 + total = 0 + + with torch.no_grad(): + for batch_x, batch_y in dataloader: + outputs = model(batch_x) + predictions = torch.argmax(outputs, dim=1) + correct += (predictions == batch_y).sum().item() + total += batch_y.size(0) + + return correct / total +``` + +## Key Takeaways + +โœ“ **Structure:** Define model as `nn.Module` + +โœ“ **Forward:** Implement `forward()` method + +โœ“ **Training:** Forward โ†’ loss โ†’ backward โ†’ update + +โœ“ **Complete pipeline:** Model + criterion + optimizer + +**Quick Reference:** + +```python +# Define +class MyNet(nn.Module): + def __init__(self): + super().__init__() + self.layers = nn.Sequential(...) + + def forward(self, x): + return self.layers(x) + +# Train +model = MyNet() +optimizer = optim.Adam(model.parameters()) +criterion = nn.CrossEntropyLoss() + +for epoch in range(epochs): + pred = model(x) + loss = criterion(pred, y) + optimizer.zero_grad() + loss.backward() + optimizer.step() +``` + +**Remember:** You can now build any neural network! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/implementing-backpropagation/implementing-backpropagation-content.md b/public/content/learn/neural-networks/implementing-backpropagation/implementing-backpropagation-content.md new file mode 100644 index 0000000..289bdee --- /dev/null +++ b/public/content/learn/neural-networks/implementing-backpropagation/implementing-backpropagation-content.md @@ -0,0 +1,97 @@ +--- +hero: + title: "Implementing Backpropagation" + subtitle: "Coding the Backward Pass" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 10 min read" +--- + +Backpropagation is how neural networks **learn**. It calculates gradients for all weights efficiently! + +## The Algorithm + +**Backpropagation:** +1. Forward pass: Compute predictions +2. Compute loss +3. 
Backward pass: Compute gradients (chain rule) +4. Update weights + +```python +import torch +import torch.nn as nn + +model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1)) +optimizer = torch.optim.SGD(model.parameters(), lr=0.01) +criterion = nn.MSELoss() + +# Training step +def train_step(x, y): + # 1. Forward pass + predictions = model(x) + + # 2. Compute loss + loss = criterion(predictions, y) + + # 3. Backward pass (backpropagation!) + optimizer.zero_grad() + loss.backward() + + # 4. Update weights + optimizer.step() + + return loss.item() + +# Train +x = torch.randn(32, 10) +y = torch.randn(32, 1) +loss = train_step(x, y) +print(f"Loss: {loss:.4f}") +``` + +## Manual Backpropagation + +```python +import torch + +# Simple network: y = w2 * relu(w1 * x) +x = torch.tensor([2.0], requires_grad=True) +w1 = torch.tensor([0.5], requires_grad=True) +w2 = torch.tensor([1.5], requires_grad=True) + +# Forward +z1 = w1 * x +a1 = torch.relu(z1) +y = w2 * a1 + +# Target +target = torch.tensor([3.0]) +loss = (y - target) ** 2 + +# Backward (automatic) +loss.backward() + +print(f"dL/dw1: {w1.grad.item()}") +print(f"dL/dw2: {a1.item()}") +``` + +## Key Takeaways + +โœ“ **Backprop:** Efficiently computes all gradients + +โœ“ **Chain rule:** Applied automatically by PyTorch + +โœ“ **Three steps:** forward โ†’ backward โ†’ update + +โœ“ **`.backward()`:** Does all the work! + +**Quick Reference:** + +```python +# Standard training step +optimizer.zero_grad() # Clear old gradients +loss.backward() # Compute gradients +optimizer.step() # Update weights +``` + +**Remember:** Backpropagation = automatic gradient calculation! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/introduction/introduction-content.md b/public/content/learn/neural-networks/introduction/introduction-content.md new file mode 100644 index 0000000..206e239 --- /dev/null +++ b/public/content/learn/neural-networks/introduction/introduction-content.md @@ -0,0 +1,207 @@ +--- +hero: + title: "Introduction to Neural Networks" + subtitle: "Building Intelligent Systems from Scratch" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 15 min read" +--- + +# Introduction to Neural Networks + +## What is a Neural Network? + +A neural network is a **computational model** inspired by the way biological neural networks in the human brain work. It consists of interconnected nodes (neurons) organized in layers that process information. + +Think of it as a **function approximator** that learns patterns from data! + +![Neural Network Architecture](neural-network-diagram.png) + +## The Biological Inspiration + +Just like neurons in your brain: +- Receive signals from multiple sources (dendrites) +- Process the information (cell body) +- Fire a signal if threshold is exceeded (axon) + +Artificial neurons work similarly: +- Receive weighted inputs from previous layer +- Sum them up and add bias +- Apply activation function +- Send output to next layer + +## Basic Architecture + +A typical neural network has **three types of layers**: + +### Input Layer +- Receives the raw data (features) +- One neuron per feature +- No computation happens here + +**Example:** For a 28x28 grayscale image: 784 input neurons (28 ร— 28) + +### Hidden Layer(s) +- Performs computations +- Extracts features from the data +- Can have multiple hidden layers (deep learning!) 
+ +**The more layers:** +- More complex patterns can be learned +- But also harder to train + +### Output Layer +- Produces the final prediction +- Number of neurons depends on the task: + - 1 neuron: binary classification or regression + - N neurons: N-class classification + +![Layer Types](layer-types.png) + +## How Does a Single Neuron Work? + +Each neuron performs a simple operation: + +``` +1. Weighted Sum: z = wโ‚xโ‚ + wโ‚‚xโ‚‚ + ... + wโ‚™xโ‚™ + b +2. Activation: a = ฯƒ(z) +3. Output: a becomes input to next layer +``` + +**Example calculation:** +``` +Inputs: x = [2, 3] +Weights: w = [0.5, 0.3] +Bias: b = 0.1 + +Step 1: z = 0.5(2) + 0.3(3) + 0.1 = 2.0 +Step 2: a = ReLU(2.0) = 2.0 +Step 3: Output = 2.0 +``` + +## The Learning Process + +Neural networks learn through **supervised learning**: + +### 1. Initialize +Start with random weights and biases + +### 2. Forward Pass +Pass data through the network to get predictions + +### 3. Calculate Loss +Measure how wrong the predictions are + +``` +Loss = (prediction - actual)ยฒ +``` + +### 4. Backward Pass (Backpropagation) +Calculate gradients: how much each weight contributed to the error + +### 5. Update Weights +Adjust weights in the direction that reduces loss + +``` +w_new = w_old - learning_rate ร— gradient +``` + +### 6. Repeat +Do this for many iterations (epochs) until the model performs well + +![Training Process](training-process.png) + +## Types of Neural Networks + +### Feedforward Neural Networks (FNN) +- Information flows in one direction: input โ†’ hidden โ†’ output +- Used for: tabular data, simple classification + +### Convolutional Neural Networks (CNN) +- Specialized for image data +- Uses filters to detect features +- Used for: computer vision, image classification + +### Recurrent Neural Networks (RNN) +- Has memory of previous inputs +- Used for: time series, text, speech + +### Transformers +- Attention mechanism +- Used for: language models (GPT, BERT), machine translation + +## Real-World Applications + +| Domain | Application | Network Type | +|--------|------------|--------------| +| ๐Ÿ–ผ๏ธ Computer Vision | Image classification, object detection | CNN | +| ๐Ÿ’ฌ NLP | Chatbots, translation, text generation | Transformer | +| ๐ŸŽต Audio | Speech recognition, music generation | RNN, Transformer | +| ๐ŸŽฎ Gaming | Game AI, reinforcement learning | Deep Q-Networks | +| ๐Ÿฅ Healthcare | Disease diagnosis, drug discovery | CNN, FNN | +| ๐Ÿ’ฐ Finance | Fraud detection, stock prediction | FNN, LSTM | + +## Why Neural Networks Work + +### Universal Approximation Theorem +With enough neurons and the right activation functions, a neural network can approximate **any continuous function**! + +### Feature Learning +Unlike traditional ML, neural networks **automatically learn** the important features from raw data. No manual feature engineering needed! + +### Scalability +Neural networks get better with: +- More data +- More compute +- Better architectures + +![Network Depth vs Performance](depth-vs-performance.png) + +## Key Components Summary + +| Component | Purpose | +|-----------|---------| +| **Weights (w)** | Parameters to learn, control signal strength | +| **Bias (b)** | Shifts the activation function | +| **Activation Function** | Introduces non-linearity | +| **Loss Function** | Measures prediction error | +| **Optimizer** | Updates weights to minimize loss | + +## Challenges and Solutions + +### 1. 
Overfitting +**Problem:** Model memorizes training data +**Solution:** Dropout, regularization, more data + +### 2. Vanishing Gradients +**Problem:** Gradients become too small in deep networks +**Solution:** ReLU activation, batch normalization, skip connections + +### 3. Slow Training +**Problem:** Takes too long to converge +**Solution:** Better optimizers (Adam), GPU acceleration, batch processing + +### 4. Need Lots of Data +**Problem:** Neural networks are data-hungry +**Solution:** Transfer learning, data augmentation, synthetic data + +## Getting Started Checklist + +Before building your first neural network, you should understand: + +- โœ… Linear algebra (matrices, vectors) +- โœ… Calculus (derivatives, chain rule) +- โœ… Probability basics +- โœ… Programming (Python recommended) +- โœ… Framework basics (PyTorch or TensorFlow) + +## What's Next? + +Now that you understand the basics, we'll dive deeper into: + +1. **Forward Propagation** - How data flows through the network +2. **Backpropagation** - How the network learns +3. **Training & Optimization** - How to train networks effectively + +Let's continue the journey! ๐Ÿš€ + diff --git a/public/content/learn/neural-networks/the-chain-rule/the-chain-rule-content.md b/public/content/learn/neural-networks/the-chain-rule/the-chain-rule-content.md new file mode 100644 index 0000000..a90a5bc --- /dev/null +++ b/public/content/learn/neural-networks/the-chain-rule/the-chain-rule-content.md @@ -0,0 +1,132 @@ +--- +hero: + title: "The Chain Rule" + subtitle: "The Math Behind Backpropagation" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 8 min read" +--- + +The chain rule is how we calculate gradients through multiple layers. It's the secret sauce of backpropagation! + +## The Basic Idea + +**Chain rule: Multiply gradients as you go backwards through layers** + +```yaml +If y = f(g(x)), then: +dy/dx = (dy/dg) ร— (dg/dx) + +In words: Multiply the gradients of each function +``` + +## Simple Example + +```python +import torch + +# y = (x + 2)ยฒ +x = torch.tensor([3.0], requires_grad=True) + +# Break it down: +# g = x + 2 +# y = gยฒ + +g = x + 2 +y = g ** 2 + +# Backward pass +y.backward() + +print(f"x = {x.item()}") +print(f"g = {g.item()}") +print(f"y = {y.item()}") +print(f"dy/dx = {x.grad.item()}") + +# Manual: +# dy/dg = 2g = 2ร—5 = 10 +# dg/dx = 1 +# dy/dx = 10ร—1 = 10 โœ“ +``` + +## In Neural Networks + +```python +import torch +import torch.nn as nn + +# Two-layer network +model = nn.Sequential( + nn.Linear(1, 1), # Layer 1 + nn.ReLU(), + nn.Linear(1, 1) # Layer 2 +) + +x = torch.tensor([[2.0]]) +y_true = torch.tensor([[10.0]]) + +# Forward +y_pred = model(x) +loss = (y_pred - y_true) ** 2 + +# Backward (chain rule applied automatically!) +loss.backward() + +# Gradients computed through both layers +for name, param in model.named_parameters(): + print(f"{name}: gradient = {param.grad}") +``` + +**What happens:** + +```yaml +Forward: + x โ†’ Layer1 โ†’ ReLU โ†’ Layer2 โ†’ prediction โ†’ loss + +Backward (chain rule): + dloss/dprediction โ†’ dLayer2 โ†’ dReLU โ†’ dLayer1 โ†’ dx + +Each gradient multiplies with the next! +``` + +## Why It Works + +```yaml +Loss depends on layer 2 output +Layer 2 output depends on ReLU output +ReLU output depends on layer 1 output +Layer 1 output depends on weights + +So: Loss depends on weights (through chain)! 
+ +Chain rule connects them: +dLoss/dWeight = dLoss/dOutput ร— dOutput/dWeight +``` + +## PyTorch Does It For You + +```python +import torch + +# Complex computation +x = torch.tensor([2.0], requires_grad=True) +y = ((x ** 2 + 3) * torch.sin(x)) ** 3 + +# PyTorch applies chain rule automatically! +y.backward() + +print(f"Gradient: {x.grad.item()}") +# Calculated using chain rule through all operations! +``` + +## Key Takeaways + +โœ“ **Chain rule:** Multiply gradients backwards + +โœ“ **Backpropagation:** Applies chain rule through network + +โœ“ **Automatic:** PyTorch does it for you + +โœ“ **Essential:** Makes training deep networks possible + +**Remember:** Chain rule lets us train deep networks by connecting all the gradients! ๐ŸŽ‰ diff --git a/public/content/learn/neural-networks/training/training-content.md b/public/content/learn/neural-networks/training/training-content.md new file mode 100644 index 0000000..5bb2d2a --- /dev/null +++ b/public/content/learn/neural-networks/training/training-content.md @@ -0,0 +1,581 @@ +--- +hero: + title: "Training & Optimization" + subtitle: "Making Neural Networks Learn Effectively" + tags: + - "๐Ÿง  Neural Networks" + - "โฑ๏ธ 16 min read" +--- + +# Training & Optimization + +## The Training Process + +Training a neural network is an iterative process of adjusting weights to minimize the loss function. The goal is to find the optimal set of parameters that make accurate predictions on both **training and unseen data**. + +![Training Loop](training-loop.png) + +## Gradient Descent: The Foundation + +Gradient descent is the fundamental optimization algorithm: + +``` +1. Start with random weights +2. Calculate loss on data +3. Compute gradients (how to adjust weights) +4. Update weights in opposite direction of gradient +5. Repeat until convergence +``` + +### Mathematical Formula + +``` +ฮธ_new = ฮธ_old - ฮฑ ยท โˆ‡L(ฮธ) +``` + +Where: +- `ฮธ` = parameters (weights and biases) +- `ฮฑ` = learning rate +- `โˆ‡L` = gradient of loss + +Think of it as **rolling down a hill** to find the lowest point (minimum loss)! + +![Gradient Descent](gradient-descent.png) + +## Variants of Gradient Descent + +### 1. Batch Gradient Descent + +Uses **entire dataset** for each update: + +```python +for epoch in range(num_epochs): + # Compute loss on ALL data + predictions = forward_pass(X_train, weights) + loss = compute_loss(predictions, y_train) + + # Compute gradients using all data + gradients = backward_pass(X_train, y_train, weights) + + # Single update per epoch + weights -= learning_rate * gradients +``` + +**Pros:** +- โœ… Stable updates +- โœ… Guaranteed convergence (for convex problems) + +**Cons:** +- โŒ Very slow for large datasets +- โŒ Requires entire dataset in memory +- โŒ Can get stuck in local minima + +### 2. Stochastic Gradient Descent (SGD) + +Updates weights after **each training example**: + +```python +for epoch in range(num_epochs): + # Shuffle data + indices = np.random.permutation(len(X_train)) + + for i in indices: + # Use single example + x, y = X_train[i], y_train[i] + + prediction = forward_pass(x, weights) + loss = compute_loss(prediction, y) + gradients = backward_pass(x, y, weights) + + # Update after each example + weights -= learning_rate * gradients +``` + +**Pros:** +- โœ… Much faster iterations +- โœ… Can escape local minima (noise helps!) +- โœ… Works with large datasets + +**Cons:** +- โŒ Noisy updates +- โŒ Can oscillate around minimum +- โŒ Harder to parallelize + +### 3. 
Mini-Batch Gradient Descent โญ (Most Popular) + +Best of both worlds! Uses **small batches** (32, 64, 128, 256): + +```python +batch_size = 64 + +for epoch in range(num_epochs): + # Shuffle data + indices = np.random.permutation(len(X_train)) + + for i in range(0, len(X_train), batch_size): + # Get batch + batch_indices = indices[i:i+batch_size] + X_batch = X_train[batch_indices] + y_batch = y_train[batch_indices] + + # Forward pass on batch + predictions = forward_pass(X_batch, weights) + loss = compute_loss(predictions, y_batch) + + # Backward pass on batch + gradients = backward_pass(X_batch, y_batch, weights) + + # Update weights + weights -= learning_rate * gradients +``` + +**Pros:** +- โœ… Good balance between speed and stability +- โœ… Efficient GPU utilization +- โœ… More stable than SGD +- โœ… Faster than batch GD + +**Cons:** +- โŒ One more hyperparameter (batch size) + +![GD Variants Comparison](gd-variants.png) + +## Advanced Optimizers + +### 1. Momentum ๐Ÿƒ + +Accumulates a **velocity** term to accelerate in consistent directions: + +```python +velocity = 0 +beta = 0.9 # momentum coefficient + +for epoch in range(num_epochs): + gradients = compute_gradients() + + # Update velocity + velocity = beta * velocity + (1 - beta) * gradients + + # Update weights using velocity + weights -= learning_rate * velocity +``` + +**Why it works:** +- Accelerates in valleys +- Dampens oscillations +- Helps escape plateaus + +**Analogy:** A ball rolling down a hill gains momentum! + +### 2. RMSprop + +Adapts learning rate **per parameter** based on recent gradients: + +```python +cache = 0 +beta = 0.9 + +for epoch in range(num_epochs): + gradients = compute_gradients() + + # Update cache (exponential moving average of squared gradients) + cache = beta * cache + (1 - beta) * gradients**2 + + # Update weights with adaptive learning rate + weights -= learning_rate * gradients / (np.sqrt(cache) + 1e-8) +``` + +**Why it works:** +- Different learning rates for each parameter +- Larger steps for parameters with small gradients +- Smaller steps for parameters with large gradients + +**Great for:** Recurrent neural networks + +### 3. Adam (Adaptive Moment Estimation) โญ (Most Popular) + +Combines **momentum** and **RMSprop**: + +```python +m = 0 # First moment (mean) +v = 0 # Second moment (variance) +beta1 = 0.9 +beta2 = 0.999 + +for epoch in range(num_epochs): + gradients = compute_gradients() + + # Update moments + m = beta1 * m + (1 - beta1) * gradients + v = beta2 * v + (1 - beta2) * gradients**2 + + # Bias correction + m_hat = m / (1 - beta1**epoch) + v_hat = v / (1 - beta2**epoch) + + # Update weights + weights -= learning_rate * m_hat / (np.sqrt(v_hat) + 1e-8) +``` + +**Why it works:** +- Combines best of momentum and RMSprop +- Adaptive learning rates +- Bias correction for early iterations +- Works well in practice + +**Default choice** for most deep learning tasks! + +![Optimizers Comparison](optimizers-comparison.png) + +## Learning Rate Strategies + +### 1. Fixed Learning Rate +```python +learning_rate = 0.001 # Constant throughout training +``` + +Simple but often suboptimal. + +### 2. Learning Rate Decay + +Gradually reduce learning rate: + +```python +# Step decay +initial_lr = 0.01 +drop_rate = 0.5 +epochs_drop = 10 + +lr = initial_lr * (drop_rate ** (epoch // epochs_drop)) + +# Exponential decay +lr = initial_lr * np.exp(-decay_rate * epoch) + +# 1/t decay +lr = initial_lr / (1 + decay_rate * epoch) +``` + +### 3. 
Learning Rate Scheduling + +```python +# Cosine annealing +import math + +def cosine_schedule(epoch, total_epochs, lr_max, lr_min=0): + return lr_min + 0.5 * (lr_max - lr_min) * ( + 1 + math.cos(math.pi * epoch / total_epochs) + ) +``` + +### 4. Warm-up + Decay + +```python +def lr_schedule(epoch, warmup_epochs=5, initial_lr=0.001): + if epoch < warmup_epochs: + # Linear warm-up + return initial_lr * (epoch / warmup_epochs) + else: + # Cosine decay + return cosine_schedule( + epoch - warmup_epochs, + total_epochs - warmup_epochs, + initial_lr + ) +``` + +![Learning Rate Schedules](lr-schedules.png) + +## Key Hyperparameters + +### 1. Learning Rate (ฮฑ) + +**Most important hyperparameter!** + +```python +# Too high: divergence +lr = 1.0 # Loss explodes โŒ + +# Too low: very slow training +lr = 0.00001 # Takes forever โŒ + +# Just right: fast and stable +lr = 0.001 # Good starting point โœ… +``` + +**Finding the right learning rate:** +- Start with 0.001 or 0.0001 +- Use learning rate finder +- Monitor training loss + +### 2. Batch Size + +```python +# Small batches (8-32) +# + More noise โ†’ can escape local minima +# - Slower, less stable + +# Medium batches (64-128) โญ +# + Good balance +# + Efficient GPU usage + +# Large batches (256-1024) +# + Faster training (fewer updates) +# + More stable +# - Can lead to poor generalization +# - Requires more memory +``` + +**Rule of thumb:** Start with 32 or 64 + +### 3. Number of Epochs + +```python +# Too few epochs +epochs = 5 # Underfitting โŒ + +# Too many epochs +epochs = 1000 # Overfitting โŒ + +# Use early stopping โœ… +best_loss = float('inf') +patience = 10 +counter = 0 + +for epoch in range(max_epochs): + val_loss = validate() + + if val_loss < best_loss: + best_loss = val_loss + counter = 0 + save_model() + else: + counter += 1 + + if counter >= patience: + print("Early stopping!") + break +``` + +### 4. Optimizer Parameters + +```python +# Adam parameters +optimizer = Adam( + learning_rate=0.001, # Step size + beta1=0.9, # Momentum decay (usually 0.9) + beta2=0.999, # RMSprop decay (usually 0.999) + epsilon=1e-8 # Numerical stability +) + +# SGD with momentum +optimizer = SGD( + learning_rate=0.01, + momentum=0.9 # Usually 0.9 or 0.95 +) +``` + +## Training Best Practices + +### 1. Data Preparation +```python +# Normalize inputs +X = (X - X.mean()) / X.std() + +# Or use min-max scaling +X = (X - X.min()) / (X.max() - X.min()) +``` + +### 2. Weight Initialization +```python +# Xavier/Glorot initialization (for sigmoid/tanh) +W = np.random.randn(n_in, n_out) * np.sqrt(1 / n_in) + +# He initialization (for ReLU) +W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in) +``` + +### 3. Regularization +```python +# L2 regularization (weight decay) +loss = mse_loss + lambda_reg * np.sum(weights**2) + +# Dropout (randomly zero out neurons) +if training: + mask = (np.random.rand(*activations.shape) > dropout_rate) + activations = activations * mask / (1 - dropout_rate) +``` + +### 4. Batch Normalization +```python +# Normalize activations in each layer +z_norm = (z - z.mean()) / np.sqrt(z.var() + epsilon) +z_scaled = gamma * z_norm + beta # Learnable parameters +``` + +### 5. 
Monitoring Training + +```python +history = { + 'train_loss': [], + 'val_loss': [], + 'train_acc': [], + 'val_acc': [] +} + +for epoch in range(num_epochs): + # Training + train_loss, train_acc = train_epoch() + history['train_loss'].append(train_loss) + history['train_acc'].append(train_acc) + + # Validation + val_loss, val_acc = validate() + history['val_loss'].append(val_loss) + history['val_acc'].append(val_acc) + + # Check for overfitting + if val_loss > train_loss * 1.2: + print("Warning: Possible overfitting!") +``` + +![Training Curves](training-curves.png) + +## Common Issues and Solutions + +### 1. Loss Not Decreasing +**Problem:** Loss stays constant or increases + +**Solutions:** +- โœ… Check learning rate (try 0.001, 0.0001) +- โœ… Verify data preprocessing +- โœ… Check for bugs in forward/backward pass +- โœ… Try different weight initialization + +### 2. Training Loss Decreases, Validation Loss Increases +**Problem:** Overfitting + +**Solutions:** +- โœ… Add regularization (L2, dropout) +- โœ… Reduce model complexity +- โœ… Get more training data +- โœ… Use data augmentation +- โœ… Early stopping + +### 3. Loss Explodes (NaN) +**Problem:** Numerical instability + +**Solutions:** +- โœ… Lower learning rate +- โœ… Use gradient clipping +- โœ… Check for division by zero +- โœ… Use batch normalization + +### 4. Training Too Slow +**Problem:** Takes forever to converge + +**Solutions:** +- โœ… Increase learning rate +- โœ… Use Adam instead of SGD +- โœ… Increase batch size +- โœ… Use GPU/TPU acceleration + +## Complete Training Example + +```python +import numpy as np + +# Hyperparameters +learning_rate = 0.001 +batch_size = 64 +num_epochs = 100 +patience = 10 + +# Initialize optimizer +m = v = 0 +beta1, beta2 = 0.9, 0.999 + +# Training loop +best_val_loss = float('inf') +patience_counter = 0 + +for epoch in range(num_epochs): + # Shuffle training data + indices = np.random.permutation(len(X_train)) + + epoch_loss = 0 + num_batches = 0 + + # Mini-batch training + for i in range(0, len(X_train), batch_size): + # Get batch + batch_idx = indices[i:i+batch_size] + X_batch = X_train[batch_idx] + y_batch = y_train[batch_idx] + + # Forward pass + y_pred = forward(X_batch, weights) + loss = compute_loss(y_pred, y_batch) + + # Backward pass + grads = backward(X_batch, y_batch, weights) + + # Adam optimizer + m = beta1 * m + (1 - beta1) * grads + v = beta2 * v + (1 - beta2) * grads**2 + m_hat = m / (1 - beta1**(epoch+1)) + v_hat = v / (1 - beta2**(epoch+1)) + + # Update weights + weights -= learning_rate * m_hat / (np.sqrt(v_hat) + 1e-8) + + epoch_loss += loss + num_batches += 1 + + # Validation + val_loss = validate(X_val, y_val, weights) + + # Early stopping + if val_loss < best_val_loss: + best_val_loss = val_loss + save_weights(weights) + patience_counter = 0 + else: + patience_counter += 1 + + if patience_counter >= patience: + print(f"Early stopping at epoch {epoch}") + break + + # Print progress + avg_train_loss = epoch_loss / num_batches + print(f"Epoch {epoch}: Train Loss = {avg_train_loss:.4f}, " + f"Val Loss = {val_loss:.4f}") +``` + +## Key Takeaways + +โœ… Gradient descent is the foundation of neural network training +โœ… Mini-batch GD provides the best balance of speed and stability +โœ… Adam is the go-to optimizer for most tasks +โœ… Learning rate is the most important hyperparameter +โœ… Monitor both training and validation metrics +โœ… Use regularization to prevent overfitting +โœ… Early stopping saves time and prevents overfitting + +## Congratulations! 
๐ŸŽ‰ + +You've completed the Neural Networks from Scratch course! You now understand: + +- The mathematical foundations (derivatives, functions) +- How neural networks process information (forward propagation) +- How they learn (backpropagation) +- How to train them effectively (optimization) + +**Next steps:** +- Implement a neural network from scratch in Python +- Try different architectures (CNN, RNN, Transformer) +- Work on real projects and datasets +- Explore advanced topics (attention mechanisms, GANs, etc.) + +Keep learning and building! ๐Ÿš€ + diff --git a/public/content/learn/neuron-from-scratch/building-a-neuron-in-python/building-a-neuron-in-python-content.md b/public/content/learn/neuron-from-scratch/building-a-neuron-in-python/building-a-neuron-in-python-content.md new file mode 100644 index 0000000..b6a5c21 --- /dev/null +++ b/public/content/learn/neuron-from-scratch/building-a-neuron-in-python/building-a-neuron-in-python-content.md @@ -0,0 +1,312 @@ +--- +hero: + title: "Building a Neuron in Python" + subtitle: "Implementing a Neuron from Scratch" + tags: + - "๐Ÿง  Neuron" + - "โฑ๏ธ 10 min read" +--- + +Let's build a complete, working neuron from scratch using pure Python and PyTorch! + +![Neuron Code](/content/learn/neuron-from-scratch/building-a-neuron-in-python/neuron-code.png) + +## Simple Neuron Class + +**Example:** + +```python +import torch +import torch.nn as nn + +class Neuron(nn.Module): + def __init__(self, num_inputs): + super().__init__() + self.linear = nn.Linear(num_inputs, 1) + self.activation = nn.Sigmoid() + + def forward(self, x): + # Linear step + z = self.linear(x) + + # Activation + output = self.activation(z) + + return output + +# Create neuron with 3 inputs +neuron = Neuron(num_inputs=3) + +# Make prediction +x = torch.tensor([[1.0, 2.0, 3.0]]) +prediction = neuron(x) + +print(prediction) +# tensor([[0.6789]], grad_fn=) +``` + +## Complete Training Example + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +# Create neuron +neuron = Neuron(num_inputs=2) + +# Training data (AND gate) +X = torch.tensor([[0.0, 0.0], + [0.0, 1.0], + [1.0, 0.0], + [1.0, 1.0]]) + +y = torch.tensor([[0.0], + [0.0], + [0.0], + [1.0]]) + +# Loss and optimizer +criterion = nn.BCELoss() +optimizer = optim.SGD(neuron.parameters(), lr=0.5) + +# Training loop +for epoch in range(1000): + # Forward pass + predictions = neuron(X) + + # Calculate loss + loss = criterion(predictions, y) + + # Backward pass + optimizer.zero_grad() + loss.backward() + + # Update weights + optimizer.step() + + if epoch % 200 == 0: + print(f"Epoch {epoch}, Loss: {loss.item():.4f}") + +# Test the trained neuron +print("\\nTrained neuron predictions:") +with torch.no_grad(): + for i, (input_vals, target_val) in enumerate(zip(X, y)): + pred = neuron(input_vals.unsqueeze(0)) + print(f"{input_vals.tolist()} โ†’ {pred.item():.3f} (target: {target_val.item()})") +``` + +## From Scratch (No nn.Linear) + +Build a neuron with just tensors: + +```python +import torch + +class ManualNeuron: + def __init__(self, num_inputs): + # Initialize weights and bias randomly + self.weights = torch.randn(num_inputs, requires_grad=True) + self.bias = torch.randn(1, requires_grad=True) + + def forward(self, x): + # Linear step: wยทx + b + z = torch.dot(self.weights, x) + self.bias + + # Activation: sigmoid + output = 1 / (1 + torch.exp(-z)) + + return output + + def parameters(self): + return [self.weights, self.bias] + +# Create and test +neuron = ManualNeuron(num_inputs=3) +x = 
torch.tensor([1.0, 2.0, 3.0]) +output = neuron.forward(x) + +print(output) +# tensor([0.7234], grad_fn=) +``` + +## Training From Scratch + +```python +import torch + +# Manual neuron (from above) +neuron = ManualNeuron(num_inputs=2) + +# Training data +X = torch.tensor([[1.0, 2.0], + [2.0, 3.0], + [3.0, 4.0]]) +y = torch.tensor([0.0, 0.0, 1.0]) + +learning_rate = 0.1 + +# Training loop +for epoch in range(100): + total_loss = 0 + + for i in range(len(X)): + # Forward pass + prediction = neuron.forward(X[i]) + + # Loss (MSE) + loss = (prediction - y[i]) ** 2 + total_loss += loss.item() + + # Backward pass + loss.backward() + + # Update weights manually + with torch.no_grad(): + for param in neuron.parameters(): + param -= learning_rate * param.grad + param.grad.zero_() + + if epoch % 20 == 0: + print(f"Epoch {epoch}, Loss: {total_loss:.4f}") + +# Test +print("\\nPredictions after training:") +for i in range(len(X)): + pred = neuron.forward(X[i]) + print(f"Input: {X[i].tolist()}, Prediction: {pred.item():.3f}, Target: {y[i].item()}") +``` + +## Complete Neuron with All Features + +```python +import torch +import torch.nn as nn + +class CompleteNeuron(nn.Module): + def __init__(self, num_inputs, activation='relu'): + super().__init__() + self.linear = nn.Linear(num_inputs, 1) + + # Choose activation + if activation == 'relu': + self.activation = nn.ReLU() + elif activation == 'sigmoid': + self.activation = nn.Sigmoid() + elif activation == 'tanh': + self.activation = nn.Tanh() + else: + self.activation = nn.Identity() # No activation + + def forward(self, x): + z = self.linear(x) + output = self.activation(z) + return output + + def get_weights(self): + return self.linear.weight.data + + def get_bias(self): + return self.linear.bias.data + +# Create neurons with different activations +relu_neuron = CompleteNeuron(3, activation='relu') +sigmoid_neuron = CompleteNeuron(3, activation='sigmoid') + +x = torch.tensor([[1.0, 2.0, 3.0]]) + +print("ReLU:", relu_neuron(x)) +print("Sigmoid:", sigmoid_neuron(x)) +``` + +## Real-World Application + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +# House price predictor +class HousePriceNeuron(nn.Module): + def __init__(self): + super().__init__() + # 3 features: size, bedrooms, age + self.linear = nn.Linear(3, 1) + # No activation (regression) + + def forward(self, features): + price = self.linear(features) + return price + +# Training data +houses = torch.tensor([[1500.0, 3.0, 10.0], # [size, bedrooms, age] + [2000.0, 4.0, 5.0], + [1200.0, 2.0, 15.0], + [1800.0, 3.0, 8.0]]) + +prices = torch.tensor([[300000.0], # Actual prices + [450000.0], + [250000.0], + [380000.0]]) + +# Create and train +model = HousePriceNeuron() +criterion = nn.MSELoss() +optimizer = optim.SGD(model.parameters(), lr=0.0000001) + +# Train +for epoch in range(500): + predictions = model(houses) + loss = criterion(predictions, prices) + + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if epoch % 100 == 0: + print(f"Epoch {epoch}, Loss: {loss.item():.2f}") + +# Predict new house +new_house = torch.tensor([[1600.0, 3.0, 12.0]]) +predicted_price = model(new_house) +print(f"\\nPredicted price: ${predicted_price.item():,.0f}") +``` + +## Key Takeaways + +โœ“ **Building blocks:** Linear layer + activation function + +โœ“ **From scratch:** Can build with just tensors + +โœ“ **PyTorch way:** Use `nn.Module` and `nn.Linear` + +โœ“ **Training:** Forward โ†’ loss โ†’ backward โ†’ update + +โœ“ **Flexible:** Choose different activations for 
different tasks
+
+**Quick Reference:**
+
+```python
+# Simple neuron
+class Neuron(nn.Module):
+    def __init__(self, num_inputs):
+        super().__init__()
+        self.linear = nn.Linear(num_inputs, 1)
+        self.activation = nn.ReLU()
+
+    def forward(self, x):
+        return self.activation(self.linear(x))
+
+# Training
+model = Neuron(num_inputs=5)
+optimizer = optim.SGD(model.parameters(), lr=0.01)
+
+for epoch in range(epochs):
+    pred = model(x)
+    loss = criterion(pred, y)
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
+```
+
+**Remember:** You just built a neuron from scratch! This is the foundation of all neural networks! 🎉
diff --git a/public/content/learn/neuron-from-scratch/building-a-neuron-in-python/neuron-code.png b/public/content/learn/neuron-from-scratch/building-a-neuron-in-python/neuron-code.png
new file mode 100644
index 0000000..e09a80b
Binary files /dev/null and b/public/content/learn/neuron-from-scratch/building-a-neuron-in-python/neuron-code.png differ
diff --git a/public/content/learn/neuron-from-scratch/making-a-prediction/making-a-prediction-content.md b/public/content/learn/neuron-from-scratch/making-a-prediction/making-a-prediction-content.md
new file mode 100644
index 0000000..5829a28
--- /dev/null
+++ b/public/content/learn/neuron-from-scratch/making-a-prediction/making-a-prediction-content.md
@@ -0,0 +1,220 @@
+---
+hero:
+  title: "Making a Prediction"
+  subtitle: "Using a Neuron for Forward Pass"
+  tags:
+    - "🧠 Neuron"
+    - "⏱️ 8 min read"
+---
+
+Now that we understand neurons, let's use one to **make predictions**! This is called the **forward pass**.
+
+![Prediction Flow](/content/learn/neuron-from-scratch/making-a-prediction/prediction-flow.png)
+
+## The Forward Pass
+
+**Forward pass = Input → Linear → Activation → Output**
+
+**Example:**
+
+```python
+import torch
+import torch.nn as nn
+
+# Create a trained neuron (pretend it's already trained)
+neuron = nn.Sequential(
+    nn.Linear(2, 1),
+    nn.Sigmoid()
+)
+
+# Set trained weights manually (normally learned from data)
+with torch.no_grad():
+    neuron[0].weight = nn.Parameter(torch.tensor([[0.5, 0.8]]))
+    neuron[0].bias = nn.Parameter(torch.tensor([-0.3]))
+
+# Make a prediction
+input_data = torch.tensor([[1.0, 2.0]])  # New data point
+prediction = neuron(input_data)
+
+print(prediction)
+# tensor([[0.8581]]) ← Prediction!
+```
+
+**Manual calculation:**
+
+```yaml
+Input:   [1.0, 2.0]
+Weights: [0.5, 0.8]
+Bias:    -0.3
+
+Step 1: Linear
+  z = (1.0×0.5) + (2.0×0.8) + (-0.3)
+    = 0.5 + 1.6 - 0.3
+    = 1.8
+
+Step 2: Activation (Sigmoid)
+  output = 1 / (1 + e⁻¹·⁸)
+         = 1 / (1 + 0.165)
+         = 0.858
+
+Prediction: 0.858 or 85.8% probability
+```
+
+## Batch Predictions
+
+Process multiple samples at once:
+
+```python
+import torch
+import torch.nn as nn
+
+neuron = nn.Sequential(
+    nn.Linear(3, 1),
+    nn.ReLU()
+)
+
+# Batch of 5 samples, 3 features each
+batch = torch.tensor([[1.0, 2.0, 3.0],
+                      [2.0, 3.0, 4.0],
+                      [0.5, 1.0, 1.5],
+                      [3.0, 2.0, 1.0],
+                      [1.5, 2.5, 3.5]])
+
+# Make predictions for all samples
+predictions = neuron(batch)
+
+print(predictions.shape)  # torch.Size([5, 1])
+print(predictions)
+# tensor([[...],
+#         [...],
+#         [...],
+#         [...],
+#         [...]])  ← 5 predictions!
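+
+# Added illustration (not in the original lesson): each row of `predictions`
+# is the output for the matching row of `batch`, so we can read them out one by one.
+for i, p in enumerate(predictions.squeeze(1)):
+    print(f"Sample {i}: {p.item():.3f}")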
+``` + +## Real-World Example: Binary Classification + +```python +import torch +import torch.nn as nn + +# Spam detector neuron +class SpamNeuron(nn.Module): + def __init__(self, num_features): + super().__init__() + self.linear = nn.Linear(num_features, 1) + self.sigmoid = nn.Sigmoid() + + def forward(self, email_features): + # Linear step + logit = self.linear(email_features) + + # Activation (probability) + probability = self.sigmoid(logit) + + return probability + +# Create and use +spam_detector = SpamNeuron(num_features=100) + +# New email features +email = torch.randn(1, 100) + +# Predict +spam_probability = spam_detector(email) +print(f"Spam probability: {spam_probability.item():.1%}") + +if spam_probability > 0.5: + print("Prediction: SPAM") +else: + print("Prediction: NOT SPAM") +``` + +## Step-by-Step Prediction + +```python +import torch + +# Input +x = torch.tensor([3.0, 2.0]) + +# Learned parameters +w = torch.tensor([0.4, 0.6]) +b = torch.tensor(0.2) + +# Step 1: Weighted sum +print("Inputs:", x) +print("Weights:", w) + +products = x * w +print("Products:", products) +# tensor([1.2, 1.2]) + +weighted_sum = products.sum() + b +print("Sum + bias:", weighted_sum) +# tensor(2.6) + +# Step 2: Activation +activated = torch.relu(weighted_sum) +print("After ReLU:", activated) +# tensor(2.6) + +# Final prediction +print(f"\\nPrediction: {activated.item()}") +``` + +**Output:** + +```yaml +Inputs: tensor([3., 2.]) +Weights: tensor([0.4, 0.6]) +Products: tensor([1.2, 1.2]) +Sum + bias: tensor(2.6) +After ReLU: tensor(2.6) + +Prediction: 2.6 +``` + +## Inference Mode + +When making predictions (not training), use `torch.no_grad()`: + +```python +import torch + +model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid()) + +# For prediction (inference) +with torch.no_grad(): + input_data = torch.randn(1, 10) + prediction = model(input_data) + print(prediction) + +# Why? Saves memory (doesn't track gradients) +``` + +## Key Takeaways + +โœ“ **Forward pass:** Input โ†’ Linear โ†’ Activation โ†’ Output + +โœ“ **Batch processing:** Handle multiple samples at once + +โœ“ **Inference mode:** Use `torch.no_grad()` when not training + +โœ“ **Prediction:** Just run the forward pass! + +**Quick Reference:** + +```python +# Single prediction +output = model(input_data) + +# Batch predictions +outputs = model(batch_data) + +# Inference mode (no gradients) +with torch.no_grad(): + prediction = model(new_data) +``` + +**Remember:** Making predictions is just running the forward pass! 
๐ŸŽ‰ diff --git a/public/content/learn/neuron-from-scratch/making-a-prediction/prediction-flow.png b/public/content/learn/neuron-from-scratch/making-a-prediction/prediction-flow.png new file mode 100644 index 0000000..a28e6d5 Binary files /dev/null and b/public/content/learn/neuron-from-scratch/making-a-prediction/prediction-flow.png differ diff --git a/public/content/learn/neuron-from-scratch/the-activation-function/activation-comparison.png b/public/content/learn/neuron-from-scratch/the-activation-function/activation-comparison.png new file mode 100644 index 0000000..7843c05 Binary files /dev/null and b/public/content/learn/neuron-from-scratch/the-activation-function/activation-comparison.png differ diff --git a/public/content/learn/neuron-from-scratch/the-activation-function/the-activation-function-content.md b/public/content/learn/neuron-from-scratch/the-activation-function/the-activation-function-content.md new file mode 100644 index 0000000..11fda60 --- /dev/null +++ b/public/content/learn/neuron-from-scratch/the-activation-function/the-activation-function-content.md @@ -0,0 +1,243 @@ +--- +hero: + title: "The Activation Function" + subtitle: "Adding Non-Linearity to Neurons" + tags: + - "๐Ÿง  Neuron" + - "โฑ๏ธ 8 min read" +--- + +The activation function is what makes neural networks **powerful**. Without it, you'd just have fancy linear regression! + +![Activation Comparison](/content/learn/neuron-from-scratch/the-activation-function/activation-comparison.png) + +## Why We Need Activation Functions + +**Without activation:** No matter how many layers, it's still just linear! + +```python +import torch +import torch.nn as nn + +# Network WITHOUT activation functions +model_linear = nn.Sequential( + nn.Linear(10, 20), + # No activation! + nn.Linear(20, 5), + # No activation! + nn.Linear(5, 1) +) + +# This is mathematically equivalent to: +model_simple = nn.Linear(10, 1) + +# Same power as single layer! +``` + +**With activation:** Non-linear transformations โ†’ complex patterns! + +```python +# Network WITH activation functions +model_nonlinear = nn.Sequential( + nn.Linear(10, 20), + nn.ReLU(), # โ† Non-linearity! + nn.Linear(20, 5), + nn.ReLU(), # โ† Non-linearity! + nn.Linear(5, 1) +) + +# This can learn complex patterns! +``` + +**The difference:** + +```yaml +Without activation: + Layer 1: y = W1x + b1 + Layer 2: z = W2y + b2 + = W2(W1x + b1) + b2 + = W2W1x + W2b1 + b2 + = W3x + b3 โ† Still just linear! + +With activation: + Layer 1: y = ReLU(W1x + b1) + Layer 2: z = ReLU(W2y + b2) + โ† Non-linear! Can learn curves, boundaries, etc. 
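+
+# Added note: the collapse above is explicit if you set W3 = W2·W1 and
+# b3 = W2·b1 + b2. The ReLU between the layers is exactly what prevents
+# this merge, which is why the non-linearity adds power.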
+``` + +## Common Activation Functions + +### ReLU (Most Popular) + +```python +import torch + +def relu(x): + return torch.maximum(torch.tensor(0.0), x) + +x = torch.tensor([-1.0, 0.0, 1.0, 2.0]) +print(relu(x)) +# tensor([0., 0., 1., 2.]) +``` + +```yaml +ReLU(x) = max(0, x) + +Properties: + โœ“ Fast (simple comparison) + โœ“ No vanishing gradient + โœ“ Creates sparsity + +Use: Hidden layers +``` + +### Sigmoid (For Probabilities) + +```python +def sigmoid(x): + return 1 / (1 + torch.exp(-x)) + +x = torch.tensor([-2.0, 0.0, 2.0]) +print(sigmoid(x)) +# tensor([0.1192, 0.5000, 0.8808]) +``` + +```yaml +ฯƒ(x) = 1 / (1 + eโปหฃ) + +Properties: + โœ“ Outputs [0, 1] + โœ“ Smooth + โœ— Vanishing gradients + +Use: Binary classification output +``` + +### Tanh (Zero-Centered) + +```python +x = torch.tensor([-1.0, 0.0, 1.0]) +print(torch.tanh(x)) +# tensor([-0.7616, 0.0000, 0.7616]) +``` + +```yaml +tanh(x) = (eหฃ - eโปหฃ) / (eหฃ + eโปหฃ) + +Properties: + โœ“ Outputs [-1, 1] + โœ“ Zero-centered + โœ— Vanishing gradients + +Use: RNN cells +``` + +## Where Activation Goes + +**After the linear step, before the next layer:** + +```python +import torch +import torch.nn as nn + +class SingleNeuron(nn.Module): + def __init__(self): + super().__init__() + self.linear = nn.Linear(3, 1) + self.activation = nn.ReLU() + + def forward(self, x): + # Step 1: Linear (weighted sum) + z = self.linear(x) + + # Step 2: Activation (non-linearity) + output = self.activation(z) + + return output + +# Test +neuron = SingleNeuron() +x = torch.tensor([[1.0, 2.0, 3.0]]) +output = neuron(x) +print(output) +``` + +## Practical Example + +```python +import torch +import torch.nn as nn + +# Temperature prediction neuron +# Inputs: [humidity, pressure, wind_speed] +weather = torch.tensor([[65.0, 1013.0, 10.0]]) + +# Create neuron +temp_neuron = nn.Sequential( + nn.Linear(3, 1), + nn.ReLU() # Activation ensures non-negative temperature +) + +prediction = temp_neuron(weather) +print(f"Predicted temperature: {prediction.item():.1f}ยฐF") +``` + +## Choosing the Right Activation + +```yaml +Hidden layers: + Default: ReLU + Modern: SiLU/GELU + Classical: Tanh + +Output layer (depends on task): + Binary classification: Sigmoid + Multi-class: Softmax + Regression: None (linear) +``` + +**Example network:** + +```python +import torch.nn as nn + +model = nn.Sequential( + nn.Linear(10, 20), + nn.ReLU(), # Hidden layer activation + nn.Linear(20, 10), + nn.ReLU(), # Hidden layer activation + nn.Linear(10, 1), + nn.Sigmoid() # Output activation for binary classification +) +``` + +## Key Takeaways + +โœ“ **Activation adds non-linearity:** Makes networks powerful + +โœ“ **Applied after linear step:** Linear โ†’ Activation โ†’ Next layer + +โœ“ **Different types:** ReLU, Sigmoid, Tanh, etc. + +โœ“ **Choose based on task:** Hidden vs output, type of problem + +โœ“ **Without activation:** Multiple layers = single layer (useless!) + +**Quick Reference:** + +```python +# After linear transformation +z = linear(x) + +# Apply activation +output = activation(z) + +# Common activations +torch.relu(z) # ReLU +torch.sigmoid(z) # Sigmoid +torch.tanh(z) # Tanh +F.silu(z) # SiLU +F.gelu(z) # GELU +``` + +**Remember:** Linear step computes, activation function decides! 
๐ŸŽ‰ diff --git a/public/content/learn/neuron-from-scratch/the-concept-of-learning/learning-process.png b/public/content/learn/neuron-from-scratch/the-concept-of-learning/learning-process.png new file mode 100644 index 0000000..f5e7623 Binary files /dev/null and b/public/content/learn/neuron-from-scratch/the-concept-of-learning/learning-process.png differ diff --git a/public/content/learn/neuron-from-scratch/the-concept-of-learning/the-concept-of-learning-content.md b/public/content/learn/neuron-from-scratch/the-concept-of-learning/the-concept-of-learning-content.md new file mode 100644 index 0000000..08a8604 --- /dev/null +++ b/public/content/learn/neuron-from-scratch/the-concept-of-learning/the-concept-of-learning-content.md @@ -0,0 +1,234 @@ +--- +hero: + title: "The Concept of Learning" + subtitle: "How Neurons Adjust Their Weights" + tags: + - "๐Ÿง  Neuron" + - "โฑ๏ธ 8 min read" +--- + +Learning is the process of **adjusting weights to reduce loss**. The neuron literally learns from mistakes! + +![Learning Process](/content/learn/neuron-from-scratch/the-concept-of-learning/learning-process.png) + +## What Does "Learning" Mean? + +**Learning = Automatically adjusting weights to make better predictions** + +```yaml +Before learning: + Weights: Random + Predictions: Bad + Loss: High + +After learning: + Weights: Optimized + Predictions: Good + Loss: Low +``` + +## The Learning Process + +**Step-by-step:** + +1. Make prediction (forward pass) +2. Calculate loss (how wrong?) +3. Calculate gradients (which direction to adjust?) +4. Update weights (move in right direction) +5. Repeat! + +**Example:** + +```python +import torch +import torch.nn as nn + +# Model +model = nn.Linear(1, 1) + +# Training data +x = torch.tensor([[1.0], [2.0], [3.0]]) +y = torch.tensor([[2.0], [4.0], [6.0]]) # y = 2x + +# Loss function +criterion = nn.MSELoss() + +# Optimizer (handles weight updates) +optimizer = torch.optim.SGD(model.parameters(), lr=0.01) + +# Training loop +for epoch in range(100): + # 1. Forward pass + predictions = model(x) + + # 2. Calculate loss + loss = criterion(predictions, y) + + # 3. Backward pass (calculate gradients) + optimizer.zero_grad() + loss.backward() + + # 4. Update weights + optimizer.step() + + if epoch % 20 == 0: + print(f"Epoch {epoch}, Loss: {loss.item():.4f}") + +# After training +print(f"Learned weight: {model.weight.item():.2f}") # Should be close to 2.0 +print(f"Learned bias: {model.bias.item():.2f}") # Should be close to 0.0 +``` + +## Gradient Descent + +**The algorithm that powers learning:** + +```yaml +Current weight: w = 0.5 +Loss: high + +Gradient: โˆ‚Loss/โˆ‚w = -2.3 + Negative gradient โ†’ loss decreases if we INCREASE w + +Update: + w_new = w - learning_rate ร— gradient + w_new = 0.5 - 0.01 ร— (-2.3) + w_new = 0.5 + 0.023 + w_new = 0.523 + +Result: Loss is now lower! +``` + +## Learning Rate + +**Learning rate controls how big each step is:** + +```python +# Too small: slow learning +optimizer = torch.optim.SGD(model.parameters(), lr=0.0001) +# Takes forever to learn! + +# Just right: good learning +optimizer = torch.optim.SGD(model.parameters(), lr=0.01) +# Learns efficiently + +# Too large: unstable learning +optimizer = torch.optim.SGD(model.parameters(), lr=10.0) +# Might overshoot and never converge! 
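+
+# Rough numeric sketch (added for illustration, not part of the original example):
+# each update moves the weight by lr * gradient, so for the same gradient
+# the three settings above take very different step sizes.
+grad = 2.0
+for lr in (0.0001, 0.01, 10.0):
+    print(f"lr={lr}: step = {lr * grad}")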
+``` + +**Effect of learning rate:** + +```yaml +lr = 0.001 (small): + Small weight updates + Slow but stable + Many epochs needed + +lr = 0.01 (medium): + Moderate updates + Good balance + Converges reasonably + +lr = 1.0 (large): + Large weight updates + Fast but unstable + Might oscillate or diverge +``` + +## Simple Learning Example + +```python +import torch + +# True relationship: y = 3x + 1 +x_train = torch.tensor([1.0, 2.0, 3.0, 4.0]) +y_train = torch.tensor([4.0, 7.0, 10.0, 13.0]) + +# Model (start with random weights) +w = torch.tensor([0.5], requires_grad=True) +b = torch.tensor([0.0], requires_grad=True) + +learning_rate = 0.01 + +# Train for 100 steps +for step in range(100): + # Prediction + y_pred = w * x_train + b + + # Loss + loss = ((y_pred - y_train) ** 2).mean() + + # Backpropagation + loss.backward() + + # Update weights + with torch.no_grad(): + w -= learning_rate * w.grad + b -= learning_rate * b.grad + + # Reset gradients + w.grad.zero_() + b.grad.zero_() + + if step % 20 == 0: + print(f"Step {step}: w={w.item():.2f}, b={b.item():.2f}, loss={loss.item():.4f}") + +print(f"\\nLearned: y = {w.item():.2f}x + {b.item():.2f}") +# Should be close to: y = 3x + 1 +``` + +## What the Neuron Learns + +```python +# Example: Learning to classify + +# Initially (random weights): +prediction = neuron([1.0, 2.0]) # 0.34 (wrong!) +actual = 1.0 +loss = high + +# After seeing examples: +# The neuron learns that: +# - Feature 1 with value > 0.5 โ†’ usually class 1 +# - Feature 2 with value > 1.0 โ†’ usually class 1 +# So it adjusts weights accordingly + +# Finally (trained weights): +prediction = neuron([1.0, 2.0]) # 0.98 (correct!) +actual = 1.0 +loss = low +``` + +## Key Takeaways + +โœ“ **Learning = Adjusting weights:** Based on errors + +โœ“ **Goal:** Minimize loss + +โœ“ **Gradient descent:** The learning algorithm + +โœ“ **Learning rate:** Controls step size + +โœ“ **Automatic:** PyTorch calculates gradients for you! + +**Quick Reference:** + +```python +# Training loop +for epoch in range(num_epochs): + # Forward pass + predictions = model(inputs) + + # Calculate loss + loss = criterion(predictions, targets) + + # Backward pass + optimizer.zero_grad() + loss.backward() + + # Update weights + optimizer.step() +``` + +**Remember:** Learning is just: predict โ†’ measure error โ†’ adjust โ†’ repeat! ๐ŸŽ‰ diff --git a/public/content/learn/neuron-from-scratch/the-concept-of-loss/loss-function.png b/public/content/learn/neuron-from-scratch/the-concept-of-loss/loss-function.png new file mode 100644 index 0000000..a76146d Binary files /dev/null and b/public/content/learn/neuron-from-scratch/the-concept-of-loss/loss-function.png differ diff --git a/public/content/learn/neuron-from-scratch/the-concept-of-loss/the-concept-of-loss-content.md b/public/content/learn/neuron-from-scratch/the-concept-of-loss/the-concept-of-loss-content.md new file mode 100644 index 0000000..fc8106b --- /dev/null +++ b/public/content/learn/neuron-from-scratch/the-concept-of-loss/the-concept-of-loss-content.md @@ -0,0 +1,229 @@ +--- +hero: + title: "The Concept of Loss" + subtitle: "Measuring How Wrong Your Model Is" + tags: + - "๐Ÿง  Neuron" + - "โฑ๏ธ 8 min read" +--- + +Loss tells you **how wrong** your model's predictions are. Lower loss = better model! + +![Loss Function](/content/learn/neuron-from-scratch/the-concept-of-loss/loss-function.png) + +## What is Loss? + +**Loss = Difference between prediction and actual answer** + +Think of it like a score in golf - **lower is better**! 
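+
+For a single prediction, the squared-error form used in the example below is:
+
+```yaml
+loss = (prediction - actual)²
+
+Squaring keeps the loss positive and punishes big mistakes
+much more than small ones.
+```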
+ +**Example:** + +```python +import torch + +# Actual answer (ground truth) +actual = torch.tensor([1.0]) + +# Model's prediction +prediction = torch.tensor([0.7]) + +# Loss: how far off? +loss = (prediction - actual) ** 2 # Squared difference +print(loss) +# tensor([0.0900]) + +# Closer prediction +better_prediction = torch.tensor([0.95]) +better_loss = (better_prediction - actual) ** 2 +print(better_loss) +# tensor([0.0025]) โ† Much lower! Better! +``` + +**Manual calculation:** + +```yaml +Actual: 1.0 +Prediction: 0.7 +Difference: 0.7 - 1.0 = -0.3 +Squared: (-0.3)ยฒ = 0.09 +Loss: 0.09 + +Better prediction: 0.95 +Difference: 0.95 - 1.0 = -0.05 +Squared: (-0.05)ยฒ = 0.0025 +Loss: 0.0025 โ† Much better! +``` + +## Common Loss Functions + +### Mean Squared Error (MSE) + +For regression (predicting numbers): + +```python +import torch +import torch.nn as nn + +# Multiple predictions +predictions = torch.tensor([2.5, 3.1, 4.8]) +actual = torch.tensor([2.0, 3.0, 5.0]) + +# MSE Loss +mse_loss = nn.MSELoss() +loss = mse_loss(predictions, actual) + +print(loss) +# tensor(0.1000) + +# Manual: ((2.5-2)ยฒ + (3.1-3)ยฒ + (4.8-5)ยฒ) / 3 +# = (0.25 + 0.01 + 0.04) / 3 +# = 0.1 +``` + +### Binary Cross Entropy (BCE) + +For binary classification (yes/no): + +```python +# Predictions (probabilities) +predictions = torch.tensor([0.9, 0.2, 0.7]) + +# Actual labels (0 or 1) +labels = torch.tensor([1.0, 0.0, 1.0]) + +# BCE Loss +bce_loss = nn.BCELoss() +loss = bce_loss(predictions, labels) + +print(loss) +# Low loss because predictions are close to labels! +``` + +### Cross Entropy Loss + +For multi-class classification: + +```python +# Raw logits (before softmax) +logits = torch.tensor([[2.0, 1.0, 0.1]]) + +# Actual class (class 0) +target = torch.tensor([0]) + +# Cross Entropy (includes softmax) +ce_loss = nn.CrossEntropyLoss() +loss = ce_loss(logits, target) + +print(loss) +# Lower loss because logits[0]=2.0 is highest! +``` + +## Why We Minimize Loss + +**Goal of training: Make loss as small as possible!** + +```yaml +High loss: + Model is very wrong + Predictions far from truth + Need to adjust weights + +Low loss: + Model is accurate + Predictions close to truth + Weights are good! + +Training: + Start: High loss (random weights) + Process: Adjust weights to reduce loss + End: Low loss (trained model) +``` + +## Practical Example + +```python +import torch +import torch.nn as nn + +# Simple model +model = nn.Sequential( + nn.Linear(2, 1), + nn.Sigmoid() +) + +# Data +inputs = torch.tensor([[1.0, 2.0]]) +target = torch.tensor([[1.0]]) # Actual answer + +# Forward pass +prediction = model(inputs) +print(f"Prediction: {prediction.item():.3f}") + +# Calculate loss +loss_fn = nn.BCELoss() +loss = loss_fn(prediction, target) +print(f"Loss: {loss.item():.3f}") + +# Interpretation +if loss < 0.1: + print("Great! Model is accurate") +elif loss < 0.5: + print("OK, but needs improvement") +else: + print("Bad! Model needs more training") +``` + +## Loss Guides Learning + +```python +# Loss tells us which direction to adjust weights + +# Current prediction vs target +prediction = 0.3 +target = 1.0 +loss = (prediction - target) ** 2 # 0.49 + +# If we increase weight: +# prediction becomes 0.6 +# loss becomes (0.6 - 1.0)ยฒ = 0.16 โ† Better! + +# If we decrease weight: +# prediction becomes 0.1 +# loss becomes (0.1 - 1.0)ยฒ = 0.81 โ† Worse! + +# So we should INCREASE the weight! 
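+
+# Added check (reusing `prediction` and `target` from above): verify the two
+# hypothetical adjustments numerically.
+loss_if_increased = (0.6 - target) ** 2   # ≈ 0.16 -> lower loss, better
+loss_if_decreased = (0.1 - target) ** 2   # ≈ 0.81 -> higher loss, worse
+print(loss_if_increased, loss_if_decreased)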
+```
+
+## Key Takeaways
+
+✓ **Loss = Error:** Measures how wrong predictions are
+
+✓ **Lower is better:** Training minimizes loss
+
+✓ **Different types:** MSE, BCE, CrossEntropy for different tasks
+
+✓ **Guides learning:** Loss tells us how to adjust weights
+
+✓ **Always positive:** Loss is never negative
+
+**Quick Reference:**
+
+```python
+# MSE (regression)
+loss = nn.MSELoss()(predictions, targets)
+
+# BCE (binary classification)
+loss = nn.BCELoss()(predictions, targets)
+
+# CrossEntropy (multi-class)
+loss = nn.CrossEntropyLoss()(logits, targets)
+
+# Training loop
+for epoch in range(100):
+    prediction = model(x)
+    loss = loss_fn(prediction, y)
+    # ... backprop and update ...
+```
+
+**Remember:** Loss is your compass - it guides the model to better predictions! 🎉
diff --git a/public/content/learn/neuron-from-scratch/the-linear-step/linear-step-visual.png b/public/content/learn/neuron-from-scratch/the-linear-step/linear-step-visual.png
new file mode 100644
index 0000000..1973635
Binary files /dev/null and b/public/content/learn/neuron-from-scratch/the-linear-step/linear-step-visual.png differ
diff --git a/public/content/learn/neuron-from-scratch/the-linear-step/the-linear-step-content.md b/public/content/learn/neuron-from-scratch/the-linear-step/the-linear-step-content.md
new file mode 100644
index 0000000..051bb5a
--- /dev/null
+++ b/public/content/learn/neuron-from-scratch/the-linear-step/the-linear-step-content.md
@@ -0,0 +1,307 @@
+---
+hero:
+  title: "The Linear Step"
+  subtitle: "Weighted Sum - The Core Computation"
+  tags:
+    - "🧠 Neuron"
+    - "⏱️ 8 min read"
+---
+
+The linear step is where the **magic begins** - it's how a neuron combines its inputs using weights!
+
+![Linear Step Visual](/content/learn/neuron-from-scratch/the-linear-step/linear-step-visual.png)
+
+## The Formula
+
+**z = w₁x₁ + w₂x₂ + w₃x₃ + ... + b**
+
+Or in vector form: **z = w · x + b**
+
+This is called the **weighted sum** or **linear combination**.
+
+## Breaking It Down
+
+**Example:**
+
+```python
+import torch
+
+# Inputs (features)
+x = torch.tensor([2.0, 3.0, 1.5])
+
+# Weights (learned parameters)
+w = torch.tensor([0.5, -0.3, 0.8])
+
+# Bias (learned parameter)
+b = torch.tensor(0.1)
+
+# Linear step: weighted sum
+z = torch.dot(w, x) + b
+# OR: z = (w * x).sum() + b
+
+print(z)
+# tensor(1.4000)
+```
+
+**Manual calculation:**
+
+```yaml
+Step 1: Multiply each input by its weight
+  2.0 × 0.5  = 1.0
+  3.0 × -0.3 = -0.9
+  1.5 × 0.8  = 1.2
+
+Step 2: Sum all products
+  1.0 + (-0.9) + 1.2 = 1.3
+
+Step 3: Add bias
+  1.3 + 0.1 = 1.4
+
+Result: z = 1.4
+```
+
+## Why "Linear"?
+
+It's called linear because the relationship between inputs and output is a **straight line**!
+
+```python
+# If you double an input, the contribution doubles
+x1 = torch.tensor([2.0])
+w1 = torch.tensor([0.5])
+
+contribution1 = w1 * x1
+print(contribution1)  # tensor([1.])
+
+# Double the input
+x2 = torch.tensor([4.0])
+contribution2 = w1 * x2
+print(contribution2)  # tensor([2.]) ← Exactly double!
+```
+
+**Linear properties:**
+
+```yaml
+f(x + y) = f(x) + f(y)   ← Additive
+f(2x) = 2·f(x)           ← Scalable
+
+This makes it predictable and stable!
+```
+
+## What Each Component Does
+
+### Weights: The Learnable Parameters
+
+Weights determine **which inputs matter**:
+
+```python
+# Positive weight → input increases output
+w_positive = 0.8
+x = 5.0
+contribution = w_positive * x  # 4.0 ← Boosts output!
+
+# Negative weight → input decreases output
+w_negative = -0.8
+contribution = w_negative * x  # -4.0 ← Reduces output!
+
+# Small weight → input barely matters
+w_small = 0.01
+contribution = w_small * x  # 0.05 ← Tiny effect
+
+# Large weight → input matters a lot
+w_large = 10.0
+contribution = w_large * x  # 50.0 ← Huge effect!
+```
+
+### Bias: The Threshold Adjuster
+
+Bias shifts the decision boundary:
+
+```python
+import torch
+
+x = torch.tensor([1.0, 1.0])
+w = torch.tensor([1.0, 1.0])
+
+# No bias
+z_no_bias = torch.dot(w, x)
+print(z_no_bias)  # tensor(2.)
+
+# Positive bias (easier to activate)
+b_positive = 5.0
+z_positive = torch.dot(w, x) + b_positive
+print(z_positive)  # tensor(7.) ← Higher!
+
+# Negative bias (harder to activate)
+b_negative = -5.0
+z_negative = torch.dot(w, x) + b_negative
+print(z_negative)  # tensor(-3.) ← Lower!
+```
+
+**What bias does:**
+
+```yaml
+Positive bias:
+  Makes neuron more likely to "fire"
+  Shifts decision boundary down
+
+Negative bias:
+  Makes neuron less likely to "fire"
+  Shifts decision boundary up
+
+No bias:
+  Decision boundary passes through origin
+```
+
+## Using nn.Linear in PyTorch
+
+PyTorch provides `nn.Linear` to do this automatically:
+
+```python
+import torch
+import torch.nn as nn
+
+# Create linear layer: 3 inputs → 1 output
+linear = nn.Linear(in_features=3, out_features=1)
+
+# Input batch: 5 samples, 3 features each
+x = torch.randn(5, 3)
+
+# Apply linear transformation
+z = linear(x)
+
+print(z.shape)  # torch.Size([5, 1])
+
+# What it does internally:
+# z = x @ linear.weight.T + linear.bias
+```
+
+## Multiple Outputs
+
+You can have multiple output neurons:
+
+```python
+import torch
+import torch.nn as nn
+
+# 3 inputs → 5 outputs (5 neurons)
+linear = nn.Linear(3, 5)
+
+x = torch.tensor([[1.0, 2.0, 3.0]])  # 1 sample
+
+z = linear(x)
+print(z)
+# tensor([[0.234, -1.123, 0.567, 2.134, -0.876]])
+# 5 different outputs (one per neuron)!
+
+# Each output has its own weights:
+print(linear.weight.shape)  # torch.Size([5, 3])
+# 5 neurons × 3 weights each
+
+print(linear.bias.shape)  # torch.Size([5])
+# 5 biases (one per neuron)
+```
+
+## Real-World Example
+
+```python
+import torch
+import torch.nn as nn
+
+# House price prediction
+# Inputs: [size_sqft, bedrooms, age_years]
+house_features = torch.tensor([[2000.0, 3.0, 10.0]])
+
+# Create linear layer
+price_neuron = nn.Linear(3, 1)
+
+# Manually set weights (usually learned from data)
+with torch.no_grad():
+    price_neuron.weight = nn.Parameter(torch.tensor([[200.0, 50000.0, -1000.0]]))
+    price_neuron.bias = nn.Parameter(torch.tensor([50000.0]))
+
+# Predict price
+predicted_price = price_neuron(house_features)
+print(predicted_price)
+# tensor([[590000.]]) ← $590,000 prediction
+
+# Manual calculation:
+# 2000×200 + 3×50000 + 10×(-1000) + 50000
+# = 400,000 + 150,000 - 10,000 + 50,000
+# = 590,000 (matches the printed prediction!)
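+
+# Added breakdown (illustrative): per-feature contributions from the weights
+# we set above (size, bedrooms, age), plus the bias.
+contributions = price_neuron.weight * house_features
+print(contributions)                            # roughly [400000., 150000., -10000.]
+print(contributions.sum() + price_neuron.bias)  # roughly 590000.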
+``` + +**What the weights learned:** + +```yaml +Weight for size: 200 โ†’ Each sq ft adds $200 +Weight for bedrooms: 50,000 โ†’ Each bedroom adds $50k +Weight for age: -1,000 โ†’ Each year reduces price by $1k +Bias: 50,000 โ†’ Base price of $50k +``` + +## Matrix Form + +For a batch, the linear step is matrix multiplication: + +```python +# Batch of 3 samples +X = torch.tensor([[1.0, 2.0], + [3.0, 4.0], + [5.0, 6.0]]) # Shape: (3, 2) + +# Weights for 1 output neuron +W = torch.tensor([[0.5], + [0.3]]) # Shape: (2, 1) + +b = torch.tensor([0.1]) + +# Linear step as matrix multiplication +Z = X @ W + b + +print(Z) +# tensor([[1.2000], +# [2.8000], +# [4.4000]]) +``` + +**Matrix form:** + +```yaml +Z = XW + b + +Where: + X: (batch_size, input_features) + W: (input_features, output_features) + b: (output_features,) + Z: (batch_size, output_features) +``` + +## Key Takeaways + +โœ“ **Linear step:** Weighted sum of inputs plus bias + +โœ“ **Formula:** z = ฮฃ(wแตขxแตข) + b + +โœ“ **Weights:** Determine importance of each input + +โœ“ **Bias:** Shifts the output + +โœ“ **PyTorch:** Use `nn.Linear(in, out)` + +โœ“ **Matrix form:** Efficient for batches + +**Quick Reference:** + +```python +# Manual linear step +z = (weights * inputs).sum() + bias + +# Using PyTorch +linear = nn.Linear(input_dim, output_dim) +z = linear(x) + +# What it does: +# z = x @ linear.weight.T + linear.bias +``` + +**Remember:** The linear step is just multiply โ†’ sum โ†’ add bias. Simple but powerful! ๐ŸŽ‰ diff --git a/public/content/learn/neuron-from-scratch/what-is-a-neuron/biological-vs-artificial.png b/public/content/learn/neuron-from-scratch/what-is-a-neuron/biological-vs-artificial.png new file mode 100644 index 0000000..ebd1b2c Binary files /dev/null and b/public/content/learn/neuron-from-scratch/what-is-a-neuron/biological-vs-artificial.png differ diff --git a/public/content/learn/neuron-from-scratch/what-is-a-neuron/neuron-parts.png b/public/content/learn/neuron-from-scratch/what-is-a-neuron/neuron-parts.png new file mode 100644 index 0000000..1ae788a Binary files /dev/null and b/public/content/learn/neuron-from-scratch/what-is-a-neuron/neuron-parts.png differ diff --git a/public/content/learn/neuron-from-scratch/what-is-a-neuron/simple-neuron.png b/public/content/learn/neuron-from-scratch/what-is-a-neuron/simple-neuron.png new file mode 100644 index 0000000..08ca705 Binary files /dev/null and b/public/content/learn/neuron-from-scratch/what-is-a-neuron/simple-neuron.png differ diff --git a/public/content/learn/neuron-from-scratch/what-is-a-neuron/what-is-a-neuron-content.md b/public/content/learn/neuron-from-scratch/what-is-a-neuron/what-is-a-neuron-content.md new file mode 100644 index 0000000..69f1e39 --- /dev/null +++ b/public/content/learn/neuron-from-scratch/what-is-a-neuron/what-is-a-neuron-content.md @@ -0,0 +1,267 @@ +--- +hero: + title: "What is a Neuron" + subtitle: "The Basic Building Block of Neural Networks" + tags: + - "๐Ÿง  Neuron" + - "โฑ๏ธ 8 min read" +--- + +A neuron is the **fundamental building block** of neural networks. Just like biological neurons in your brain, artificial neurons process inputs and produce outputs! 
+ +## Biological vs Artificial + +![Biological vs Artificial](/content/learn/neuron-from-scratch/what-is-a-neuron/biological-vs-artificial.png) + +**Biological neuron:** +- Receives signals through dendrites +- Processes in cell body +- Sends output through axon + +**Artificial neuron:** +- Receives numerical inputs +- Processes with math (multiply, sum, activate) +- Outputs a single number + +**Both:** Transform multiple inputs into one output! + +## The Five Parts of a Neuron + +![Neuron Parts](/content/learn/neuron-from-scratch/what-is-a-neuron/neuron-parts.png) + +### 1. **Inputs** (xโ‚, xโ‚‚, xโ‚ƒ, ...) + +The data fed into the neuron: + +```python +inputs = [2.0, 3.0, 1.0] +``` + +**Real examples:** +- Pixel values from an image +- Features of a house (size, bedrooms, age) +- Word embeddings + +### 2. **Weights** (wโ‚, wโ‚‚, wโ‚ƒ, ...) + +How important each input is: + +```python +weights = [0.5, -0.3, 0.8] +``` + +**What weights mean:** +- Positive weight โ†’ input increases output +- Negative weight โ†’ input decreases output +- Large |weight| โ†’ input is important +- Small weight โ†’ input matters less + +### 3. **Multiply** (inputs ร— weights) + +Each input gets multiplied by its weight: + +```python +products = [2.0 ร— 0.5, 3.0 ร— -0.3, 1.0 ร— 0.8] + = [1.0, -0.9, 0.8] +``` + +### 4. **Sum** (ฮฃ) + +Add all products together, plus a bias: + +```python +sum_total = 1.0 + (-0.9) + 0.8 + bias + = 0.9 + 0 # assuming bias = 0 + = 0.9 +``` + +### 5. **Activation Function** + +Apply non-linearity (like ReLU, sigmoid, etc.): + +```python +output = ReLU(0.9) = 0.9 # Positive, so unchanged +``` + +## The Complete Formula + +**Output = Activation(ฮฃ(weights ยท inputs) + bias)** + +Or in math notation: +**y = f(wโ‚xโ‚ + wโ‚‚xโ‚‚ + wโ‚ƒxโ‚ƒ + ... + b)** + +Where: +- `x` = inputs +- `w` = weights +- `b` = bias +- `f` = activation function + +## Simple Example + +![Simple Neuron](/content/learn/neuron-from-scratch/what-is-a-neuron/simple-neuron.png) + +**Example:** + +```python +import torch + +# Inputs +x = torch.tensor([2.0, 3.0, 1.0]) + +# Weights +w = torch.tensor([0.5, -0.3, 0.8]) + +# Bias +b = torch.tensor(0.0) + +# Step 1: Multiply +products = x * w +print(products) +# tensor([ 1.0000, -0.9000, 0.8000]) + +# Step 2: Sum +weighted_sum = products.sum() + b +print(weighted_sum) +# tensor(0.9000) + +# Step 3: Activation (ReLU) +output = torch.relu(weighted_sum) +print(output) +# tensor(0.9000) +``` + +**Manual calculation:** + +```yaml +Step 1: Multiply each input by its weight + 2 ร— 0.5 = 1.0 + 3 ร— -0.3 = -0.9 + 1 ร— 0.8 = 0.8 + +Step 2: Sum everything + bias + 1.0 + (-0.9) + 0.8 + 0 = 0.9 + +Step 3: Apply activation (ReLU) + ReLU(0.9) = max(0, 0.9) = 0.9 + +Final output: 0.9 +``` + +## Why Do We Need Neurons? + +### They Learn Patterns + +Neurons adjust their weights to recognize patterns: + +```python +# Neuron learning to detect "cat" in images +# After training: +weights = [0.8, # whiskers โ†’ high weight (important!) + 0.9, # pointy ears โ†’ high weight + 0.1, # background โ†’ low weight (not important) + -0.5] # dog features โ†’ negative (opposite!) + +# When it sees a cat image: +cat_features = [1.0, 1.0, 0.2, 0.0] # Has whiskers, ears +output = sum(cat_features * weights) + bias +# = 0.8 + 0.9 + 0.02 + 0 = 1.72 +# โ†’ High output = "Yes, cat!" 
+ +# When it sees a dog image: +dog_features = [0.0, 0.0, 0.3, 1.0] # No whiskers/ears, has dog features +output = sum(dog_features * weights) + bias +# = 0 + 0 + 0.03 + -0.5 = -0.47 +# โ†’ Low output = "No, not cat" +``` + +## Single Neuron Can Be Powerful + +Even one neuron can solve problems: + +**Example: AND gate** + +```python +import torch + +def and_gate(x1, x2): + """Neuron implementing AND logic""" + w1, w2 = 1.0, 1.0 + bias = -1.5 + + # Weighted sum + z = x1 * w1 + x2 * w2 + bias + + # Activation (step function) + output = 1.0 if z > 0 else 0.0 + return output + +# Truth table +print(and_gate(0, 0)) # 0 (False AND False = False) +print(and_gate(0, 1)) # 0 (False AND True = False) +print(and_gate(1, 0)) # 0 (True AND False = False) +print(and_gate(1, 1)) # 1 (True AND True = True) +``` + +**How it works:** + +```yaml +Inputs: (1, 1) + 1ร—1 + 1ร—1 + (-1.5) = 0.5 > 0 โ†’ Output 1 โœ“ + +Inputs: (0, 1) + 0ร—1 + 1ร—1 + (-1.5) = -0.5 < 0 โ†’ Output 0 โœ“ + +Inputs: (1, 0) + 1ร—1 + 0ร—1 + (-1.5) = -0.5 < 0 โ†’ Output 0 โœ“ + +Inputs: (0, 0) + 0ร—1 + 0ร—1 + (-1.5) = -1.5 < 0 โ†’ Output 0 โœ“ +``` + +## Many Neurons = Network + +```yaml +Single neuron: + Limited power + Can learn simple patterns + +Multiple neurons: + Combined power + Can learn complex patterns + Each neuron specializes in something + +Example: Image classification + Neuron 1: Detects edges + Neuron 2: Detects curves + Neuron 3: Detects textures + ... + Together: Recognize objects! +``` + +## Key Takeaways + +โœ“ **Neuron = Processor:** Takes inputs, produces output + +โœ“ **Three operations:** Multiply, Sum, Activate + +โœ“ **Weights are key:** They determine what the neuron learns + +โœ“ **Bias shifts:** Adjusts the threshold + +โœ“ **Activation adds non-linearity:** Makes networks powerful + +โœ“ **Building block:** Many neurons = neural network + +**The formula:** + +```yaml +Output = Activation(ฮฃ(weights ร— inputs) + bias) + +In code: + output = activation(torch.sum(weights * inputs) + bias) + +Or with linear layer: + output = activation(nn.Linear(inputs)) +``` + +**Remember:** A neuron is just multiply โ†’ sum โ†’ activate! Everything else builds on this! 
๐ŸŽ‰ diff --git a/public/content/learn/tensors/concatenating-tensors/concat-dim0.png b/public/content/learn/tensors/concatenating-tensors/concat-dim0.png new file mode 100644 index 0000000..bb5e622 Binary files /dev/null and b/public/content/learn/tensors/concatenating-tensors/concat-dim0.png differ diff --git a/public/content/learn/tensors/concatenating-tensors/concat-dim1.png b/public/content/learn/tensors/concatenating-tensors/concat-dim1.png new file mode 100644 index 0000000..0f6ef2d Binary files /dev/null and b/public/content/learn/tensors/concatenating-tensors/concat-dim1.png differ diff --git a/public/content/learn/tensors/concatenating-tensors/concat-rules.png b/public/content/learn/tensors/concatenating-tensors/concat-rules.png new file mode 100644 index 0000000..83e6e02 Binary files /dev/null and b/public/content/learn/tensors/concatenating-tensors/concat-rules.png differ diff --git a/public/content/learn/tensors/concatenating-tensors/concatenating-tensors-content.md b/public/content/learn/tensors/concatenating-tensors/concatenating-tensors-content.md new file mode 100644 index 0000000..29f768a --- /dev/null +++ b/public/content/learn/tensors/concatenating-tensors/concatenating-tensors-content.md @@ -0,0 +1,419 @@ +--- +hero: + title: "Concatenating Tensors" + subtitle: "Combining Multiple Tensors" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 9 min read" +--- + +Concatenation lets you **join multiple tensors together** along a specific dimension. Think of it like gluing pieces together! + +## The Basic Idea + +**Concatenation = Joining tensors end-to-end along one dimension** + +You can join tensors: +- **Vertically** (stack rows on top of each other) +- **Horizontally** (place side by side) +- **Along any dimension** + +## Concatenating Along Dimension 0 (Rows) + +Stack tensors **vertically** - adding more rows: + +![Concat Dimension 0](/content/learn/tensors/concatenating-tensors/concat-dim0.png) + +**Example:** + +```python +import torch + +A = torch.tensor([[1, 2, 3], + [4, 5, 6]]) # Shape: (2, 3) + +B = torch.tensor([[7, 8, 9], + [10, 11, 12]]) # Shape: (2, 3) + +# Concatenate along dimension 0 (rows) +result = torch.cat([A, B], dim=0) + +print(result) +# tensor([[ 1, 2, 3], +# [ 4, 5, 6], +# [ 7, 8, 9], +# [10, 11, 12]]) + +print(result.shape) # torch.Size([4, 3]) +``` + +**What happened:** + +```yaml +A: (2, 3) โ†’ 2 rows, 3 columns +B: (2, 3) โ†’ 2 rows, 3 columns + +Concatenate rows: 2 + 2 = 4 rows +Columns stay same: 3 columns + +Result: (4, 3) +``` + +**Visual breakdown:** + +```yaml +[[1, 2, 3], โ† From A + [4, 5, 6], โ† From A + [7, 8, 9], โ† From B + [10, 11, 12]] โ† From B +``` + +## Concatenating Along Dimension 1 (Columns) + +Join tensors **horizontally** - adding more columns: + +![Concat Dimension 1](/content/learn/tensors/concatenating-tensors/concat-dim1.png) + +**Example:** + +```python +import torch + +A = torch.tensor([[1, 2], + [3, 4]]) # Shape: (2, 2) + +B = torch.tensor([[5, 6, 7], + [8, 9, 10]]) # Shape: (2, 3) + +# Concatenate along dimension 1 (columns) +result = torch.cat([A, B], dim=1) + +print(result) +# tensor([[ 1, 2, 5, 6, 7], +# [ 3, 4, 8, 9, 10]]) + +print(result.shape) # torch.Size([2, 5]) +``` + +**What happened:** + +```yaml +A: (2, 2) โ†’ 2 rows, 2 columns +B: (2, 3) โ†’ 2 rows, 3 columns + +Rows stay same: 2 rows +Concatenate columns: 2 + 3 = 5 columns + +Result: (2, 5) +``` + +**Visual breakdown:** + +```yaml +[[1, 2, 5, 6, 7], + [3, 4, 8, 9, 10]] + โ†‘โ†‘โ†‘ โ†‘โ†‘โ†‘โ†‘โ†‘โ†‘โ†‘ + From A From B +``` + +## The Concatenation Rules + 
+![Concat Rules](/content/learn/tensors/concatenating-tensors/concat-rules.png) + +**Rule:** All dimensions EXCEPT the concatenation dimension must match! + +### โœ“ Valid Examples + +```python +# Concatenate dim=0: columns must match +A = torch.randn(2, 3) # (2, 3) +B = torch.randn(4, 3) # (4, 3) - same 3 columns โœ“ +result = torch.cat([A, B], dim=0) # (6, 3) + +# Concatenate dim=1: rows must match +C = torch.randn(5, 2) # (5, 2) +D = torch.randn(5, 7) # (5, 7) - same 5 rows โœ“ +result = torch.cat([C, D], dim=1) # (5, 9) +``` + +### โœ— Invalid Examples + +```python +# Different column counts - can't stack rows! +A = torch.randn(2, 3) +B = torch.randn(2, 4) # Different columns +# torch.cat([A, B], dim=0) # ERROR! 3 โ‰  4 + +# Different row counts - can't join columns! +C = torch.randn(3, 5) +D = torch.randn(2, 5) # Different rows +# torch.cat([C, D], dim=1) # ERROR! 3 โ‰  2 +``` + +**Quick check:** + +```yaml +Concatenating dim=0 (vertical): + โœ“ (2,3) + (4,3) โ†’ (6,3) โ† columns match (3) + โœ— (2,3) + (2,4) โ†’ ERROR โ† columns don't match + +Concatenating dim=1 (horizontal): + โœ“ (5,2) + (5,7) โ†’ (5,9) โ† rows match (5) + โœ— (3,5) + (2,5) โ†’ ERROR โ† rows don't match +``` + +## Stack: Creating a New Dimension + +`torch.stack()` is different - it **creates a new dimension**: + +![Stack Visual](/content/learn/tensors/concatenating-tensors/stack-visual.png) + +**Example:** + +```python +import torch + +A = torch.tensor([[1, 2], [3, 4]]) # (2, 2) +B = torch.tensor([[5, 6], [7, 8]]) # (2, 2) +C = torch.tensor([[9, 10], [11, 12]]) # (2, 2) + +# Stack creates NEW dimension +stacked = torch.stack([A, B, C], dim=0) + +print(stacked.shape) # torch.Size([3, 2, 2]) +# 3 matrices, each 2ร—2 + +print(stacked) +# tensor([[[ 1, 2], +# [ 3, 4]], +# +# [[ 5, 6], +# [ 7, 8]], +# +# [[ 9, 10], +# [11, 12]]]) +``` + +**Key difference:** + +```yaml +cat([A, B], dim=0): + (2, 3) + (2, 3) โ†’ (4, 3) โ† Adds to existing dimension + +stack([A, B], dim=0): + (2, 3) + (2, 3) โ†’ (2, 2, 3) โ† Creates NEW dimension +``` + +**For stack, all tensors must have EXACTLY the same shape!** + +## Multiple Tensors at Once + +You can concatenate more than 2 tensors: + +```python +import torch + +A = torch.ones(2, 3) +B = torch.ones(1, 3) * 2 +C = torch.ones(3, 3) * 3 + +# Concatenate all three +result = torch.cat([A, B, C], dim=0) + +print(result) +# tensor([[1., 1., 1.], +# [1., 1., 1.], +# [2., 2., 2.], +# [3., 3., 3.], +# [3., 3., 3.], +# [3., 3., 3.]]) + +print(result.shape) # torch.Size([6, 3]) +# 2 + 1 + 3 = 6 rows +``` + +**Breakdown:** + +```yaml +A: 2 rows +B: 1 row +C: 3 rows + +Total: 2 + 1 + 3 = 6 rows +``` + +## Practical Examples + +### Example 1: Combining Train and Test Data + +```python +import torch + +# Training data: 100 samples +train_data = torch.randn(100, 10) + +# Test data: 20 samples +test_data = torch.randn(20, 10) + +# Combine into full dataset +full_data = torch.cat([train_data, test_data], dim=0) + +print(full_data.shape) # torch.Size([120, 10]) +# 100 + 20 = 120 samples +``` + +### Example 2: Concatenating Features + +```python +import torch + +# Original features: 5 samples, 3 features each +original_features = torch.randn(5, 3) + +# New features: 5 samples, 2 new features +new_features = torch.randn(5, 2) + +# Combine features horizontally +combined = torch.cat([original_features, new_features], dim=1) + +print(combined.shape) # torch.Size([5, 5]) +# 5 samples, 3 + 2 = 5 features +``` + +### Example 3: Creating Batches with Stack + +```python +import torch + +# Three separate 
samples +sample1 = torch.randn(28, 28) +sample2 = torch.randn(28, 28) +sample3 = torch.randn(28, 28) + +# Stack into a batch +batch = torch.stack([sample1, sample2, sample3], dim=0) + +print(batch.shape) # torch.Size([3, 28, 28]) +# 3 samples in the batch +``` + +### Example 4: Building Sequences + +```python +import torch + +# Word embeddings for a sentence +# Each word is a 100-dim vector +word1 = torch.randn(100) +word2 = torch.randn(100) +word3 = torch.randn(100) +word4 = torch.randn(100) + +# Stack into sequence +sentence = torch.stack([word1, word2, word3, word4], dim=0) + +print(sentence.shape) # torch.Size([4, 100]) +# 4 words, 100-dim embedding each +``` + +## Cat vs Stack + +The key difference between `cat` and `stack`: + +```python +import torch + +A = torch.tensor([[1, 2], [3, 4]]) # (2, 2) +B = torch.tensor([[5, 6], [7, 8]]) # (2, 2) + +# CAT: Joins along existing dimension +cat_result = torch.cat([A, B], dim=0) +print(cat_result.shape) # torch.Size([4, 2]) + +# STACK: Creates new dimension +stack_result = torch.stack([A, B], dim=0) +print(stack_result.shape) # torch.Size([2, 2, 2]) +``` + +**When to use which:** + +```yaml +Use cat() when: + - Adding more samples to a batch + - Extending features + - Combining datasets + - Tensors can have different sizes in concat dimension + +Use stack() when: + - Creating a batch from individual samples + - All tensors have SAME shape + - Want to add a new dimension +``` + +## Common Gotchas + +### โŒ Gotcha 1: Shape Mismatch + +```python +A = torch.randn(2, 3) +B = torch.randn(2, 4) + +# This will ERROR! +# torch.cat([A, B], dim=0) # 3 โ‰  4 +``` + +### โŒ Gotcha 2: Wrong Dimension + +```python +A = torch.randn(2, 3) +B = torch.randn(2, 3) + +# This will ERROR! +# torch.cat([A, B], dim=2) # Only dims 0 and 1 exist! +``` + +### โŒ Gotcha 3: Forgetting List Brackets + +```python +A = torch.randn(2, 3) +B = torch.randn(2, 3) + +# This will ERROR! +# torch.cat(A, B, dim=0) # Missing [ ] + +# Correct: +torch.cat([A, B], dim=0) # โœ“ +``` + +## Key Takeaways + +โœ“ **cat() joins along existing dimension:** Extends that dimension + +โœ“ **stack() creates new dimension:** All tensors must have same shape + +โœ“ **Other dimensions must match:** Can't concatenate incompatible shapes + +โœ“ **dim=0 is vertical:** Stacks rows (more samples) + +โœ“ **dim=1 is horizontal:** Joins columns (more features) + +โœ“ **Use list brackets:** `torch.cat([A, B, C], dim=0)` + +**Quick Reference:** + +```python +# Concatenate (extends existing dimension) +torch.cat([A, B], dim=0) # Stack vertically (more rows) +torch.cat([A, B], dim=1) # Join horizontally (more columns) +torch.cat([A, B, C], dim=0) # Multiple tensors + +# Stack (creates new dimension) +torch.stack([A, B], dim=0) # New dimension at front +torch.stack([A, B], dim=1) # New dimension at position 1 + +# Split (opposite of concatenate) +torch.split(tensor, 2, dim=0) # Split into chunks of size 2 +torch.chunk(tensor, 3, dim=0) # Split into 3 chunks +``` + +**Remember:** `cat()` extends, `stack()` creates! 
๐ŸŽ‰ diff --git a/public/content/learn/tensors/concatenating-tensors/stack-visual.png b/public/content/learn/tensors/concatenating-tensors/stack-visual.png new file mode 100644 index 0000000..61c589e Binary files /dev/null and b/public/content/learn/tensors/concatenating-tensors/stack-visual.png differ diff --git a/public/content/learn/tensors/creating-special-tensors/arange-linspace.png b/public/content/learn/tensors/creating-special-tensors/arange-linspace.png new file mode 100644 index 0000000..2623e3a Binary files /dev/null and b/public/content/learn/tensors/creating-special-tensors/arange-linspace.png differ diff --git a/public/content/learn/tensors/creating-special-tensors/creating-special-tensors-content.md b/public/content/learn/tensors/creating-special-tensors/creating-special-tensors-content.md new file mode 100644 index 0000000..659453a --- /dev/null +++ b/public/content/learn/tensors/creating-special-tensors/creating-special-tensors-content.md @@ -0,0 +1,501 @@ +--- +hero: + title: "Creating Special Tensors" + subtitle: "Zeros, Ones, Identity Matrices and More" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 10 min read" +--- + +Instead of manually typing out every value, PyTorch provides quick ways to create common tensor patterns. These are incredibly useful! + +## Zeros and Ones + +The most basic special tensors: filled with all 0s or all 1s. + +![Zeros and Ones](/content/learn/tensors/creating-special-tensors/zeros-ones.png) + +### Creating Zeros + +**Example:** + +```python +import torch + +# Create 2ร—3 matrix of zeros +zeros = torch.zeros(2, 3) + +print(zeros) +# tensor([[0., 0., 0.], +# [0., 0., 0.]]) + +print(zeros.shape) # torch.Size([2, 3]) +``` + +**More examples:** + +```python +# 1D tensor of zeros +torch.zeros(5) +# tensor([0., 0., 0., 0., 0.]) + +# 3D tensor of zeros +torch.zeros(2, 3, 4) +# tensor([[[0., 0., 0., 0.], +# [0., 0., 0., 0.], +# [0., 0., 0., 0.]], +# [[0., 0., 0., 0.], +# [0., 0., 0., 0.], +# [0., 0., 0., 0.]]]) +``` + +### Creating Ones + +**Example:** + +```python +import torch + +# Create 2ร—3 matrix of ones +ones = torch.ones(2, 3) + +print(ones) +# tensor([[1., 1., 1.], +# [1., 1., 1.]]) + +print(ones.shape) # torch.Size([2, 3]) +``` + +**When to use:** + +```yaml +zeros(): + - Initialize weights to zero + - Create padding + - Initialize bias terms + +ones(): + - Create masks (all True) + - Initialize certain layers + - Multiply by constant values +``` + +## Identity Matrix + +An identity matrix has 1s on the diagonal, 0s everywhere else: + +![Identity Matrix](/content/learn/tensors/creating-special-tensors/identity-matrix.png) + +**Example:** + +```python +import torch + +# Create 4ร—4 identity matrix +identity = torch.eye(4) + +print(identity) +# tensor([[1., 0., 0., 0.], +# [0., 1., 0., 0.], +# [0., 0., 1., 0.], +# [0., 0., 0., 1.]]) +``` + +**Properties:** + +```yaml +torch.eye(n) creates: + - n ร— n square matrix + - 1s on diagonal (where row = column) + - 0s everywhere else + +Special property: + A @ eye(n) = A (multiplying by identity doesn't change A) +``` + +**More examples:** + +```python +# 3ร—3 identity +I = torch.eye(3) +print(I) +# tensor([[1., 0., 0.], +# [0., 1., 0.], +# [0., 0., 1.]]) + +# Test the property: A @ I = A +A = torch.randn(3, 3) +result = A @ I +print(torch.allclose(A, result)) # True! +``` + +## Random Tensors + +Random tensors are crucial for initializing neural network weights! 
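+
+One practical detail worth adding here (it is not covered in the examples below): random tensors change on every call, so seed the generator when you need reproducible values.
+
+```python
+import torch
+
+torch.manual_seed(42)      # fix the random state
+a = torch.rand(2, 3)
+
+torch.manual_seed(42)      # same seed -> same "random" values
+b = torch.rand(2, 3)
+
+print(torch.equal(a, b))   # True
+```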
+ +![Random Tensors](/content/learn/tensors/creating-special-tensors/random-tensors.png) + +### torch.rand() - Uniform Distribution + +Creates random values **uniformly distributed between 0 and 1**: + +```python +import torch + +# Random values in [0, 1) +random_uniform = torch.rand(2, 3) + +print(random_uniform) +# tensor([[0.2347, 0.8723, 0.4512], +# [0.6234, 0.1156, 0.9901]]) + +# All values are between 0 and 1 +``` + +**When to use:** + +```yaml +Good for: + - Dropout masks + - Random sampling [0, 1) + - Probabilities +``` + +### torch.randn() - Normal Distribution + +Creates random values from a **normal (Gaussian) distribution** with mean 0 and standard deviation 1: + +```python +import torch + +# Random values from normal distribution +random_normal = torch.randn(2, 3) + +print(random_normal) +# tensor([[-0.5234, 1.2301, -1.1142], +# [ 0.0832, -0.7329, 0.4501]]) + +# Values can be negative or positive +# Most values are close to 0 +``` + +**When to use:** + +```yaml +BEST for: + - Weight initialization (most common!) + - Adding noise to data + - Sampling from Gaussian +``` + +**This is the most common way to initialize neural network weights!** + +### torch.randint() - Random Integers + +Creates random **integers** in a specified range: + +```python +import torch + +# Random integers from 0 to 9 (10 excluded) +random_ints = torch.randint(0, 10, (2, 3)) + +print(random_ints) +# tensor([[3, 7, 1], +# [9, 2, 5]]) + +# All values are integers between 0 and 9 +``` + +**More examples:** + +```python +# Random integers from 1 to 6 (dice roll) +dice = torch.randint(1, 7, (10,)) +print(dice) +# tensor([4, 2, 6, 1, 3, 5, 2, 4, 6, 1]) + +# Random integers for class labels +labels = torch.randint(0, 5, (100,)) # 100 labels, classes 0-4 +``` + +## Range Tensors + +Create sequences of numbers automatically! + +![Arange and Linspace](/content/learn/tensors/creating-special-tensors/arange-linspace.png) + +### torch.arange() - Step by Fixed Amount + +Creates a sequence with a fixed step size (like Python's `range`): + +```python +import torch + +# From 0 to 10, step by 2 (10 not included!) +seq = torch.arange(0, 10, 2) + +print(seq) +# tensor([0, 2, 4, 6, 8]) +``` + +**More examples:** + +```python +# Default start is 0, default step is 1 +torch.arange(5) +# tensor([0, 1, 2, 3, 4]) + +# Specify start and end +torch.arange(3, 8) +# tensor([3, 4, 5, 6, 7]) + +# Use decimals +torch.arange(0, 1, 0.2) +# tensor([0.0000, 0.2000, 0.4000, 0.6000, 0.8000]) +``` + +**Pattern:** + +```yaml +torch.arange(start, end, step) + - Starts at 'start' + - Stops BEFORE 'end' + - Increments by 'step' +``` + +### torch.linspace() - N Evenly Spaced Values + +Creates N values evenly spaced between start and end: + +```python +import torch + +# 5 values evenly spaced from 0 to 1 +seq = torch.linspace(0, 1, 5) + +print(seq) +# tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000]) +``` + +**More examples:** + +```python +# 10 points from -1 to 1 +torch.linspace(-1, 1, 10) +# tensor([-1.0000, -0.7778, -0.5556, -0.3333, -0.1111, +# 0.1111, 0.3333, 0.5556, 0.7778, 1.0000]) + +# Great for creating x-axis for plotting +x = torch.linspace(0, 10, 100) # 100 points from 0 to 10 +``` + +**Key difference:** + +```yaml +arange(0, 10, 2): + - You specify the STEP (2) + - Result: [0, 2, 4, 6, 8] + - End NOT included + +linspace(0, 10, 5): + - You specify the COUNT (5 values) + - Result: [0.0, 2.5, 5.0, 7.5, 10.0] + - End IS included! 
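+
+# Added arithmetic check: linspace spacing = (end - start) / (count - 1)
+#   (10 - 0) / (5 - 1) = 2.5  -> matches [0.0, 2.5, 5.0, 7.5, 10.0]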
+``` + +## Creating "Like" Tensors + +Create new tensors matching another tensor's shape: + +![Like Tensors](/content/learn/tensors/creating-special-tensors/like-tensors.png) + +**Example:** + +```python +import torch + +# Original tensor +x = torch.tensor([[1, 2, 3], + [4, 5, 6]]) + +# Create zeros with same shape +zeros = torch.zeros_like(x) +print(zeros) +# tensor([[0, 0, 0], +# [0, 0, 0]]) + +# Create ones with same shape +ones = torch.ones_like(x) +print(ones) +# tensor([[1, 1, 1], +# [1, 1, 1]]) + +# Create random with same shape +random = torch.randn_like(x.float()) # Must be float for randn +print(random.shape) # torch.Size([2, 3]) +``` + +**When to use:** + +```yaml +zeros_like(): + - Reset gradients + - Create zero-initialized tensors matching input + +ones_like(): + - Create masks + - Initialize to constant + +randn_like(): + - Add noise matching shape + - Initialize weights +``` + +## Practical Examples + +### Example 1: Weight Initialization + +```python +import torch + +# Input dimension: 784 (28ร—28 image flattened) +# Output dimension: 10 (10 classes) +input_dim = 784 +output_dim = 10 + +# Initialize weights with small random values +weights = torch.randn(input_dim, output_dim) * 0.01 + +# Initialize bias to zeros +bias = torch.zeros(output_dim) + +print(f"Weights shape: {weights.shape}") # (784, 10) +print(f"Bias shape: {bias.shape}") # (10,) +``` + +### Example 2: Creating a Mask + +```python +import torch + +# Data batch +data = torch.randn(5, 10) + +# Create mask: first 3 samples are valid, last 2 are padding +mask = torch.zeros(5, dtype=torch.bool) +mask[:3] = True + +print(mask) +# tensor([ True, True, True, False, False]) + +# Apply mask +valid_data = data[mask] +print(valid_data.shape) # torch.Size([3, 10]) +``` + +### Example 3: Creating Training Data + +```python +import torch + +batch_size = 32 +sequence_length = 50 +embedding_dim = 128 + +# Input sequences (random for demo) +inputs = torch.randn(batch_size, sequence_length, embedding_dim) + +# Labels (random class indices) +labels = torch.randint(0, 10, (batch_size,)) + +# Attention mask (all ones = all valid) +attention_mask = torch.ones(batch_size, sequence_length) + +print(f"Inputs: {inputs.shape}") # (32, 50, 128) +print(f"Labels: {labels.shape}") # (32,) +print(f"Mask: {attention_mask.shape}") # (32, 50) +``` + +## Full vs Empty + +Create tensors without initializing values (faster but contains garbage): + +```python +import torch + +# Create empty tensor (uninitialized - garbage values) +empty = torch.empty(2, 3) +print(empty) +# tensor([[3.6893e+19, 1.5414e-19, 3.0818e-41], +# [0.0000e+00, 0.0000e+00, 0.0000e+00]]) +# Random garbage values! + +# Create full tensor (fill with specific value) +sevens = torch.full((2, 3), 7) +print(sevens) +# tensor([[7, 7, 7], +# [7, 7, 7]]) +``` + +**When to use empty:** + +```yaml +torch.empty(): + - When you'll immediately overwrite all values + - Slightly faster than zeros/ones + - WARNING: Contains random garbage! + +torch.full(): + - Fill with any constant value + - Like ones() but more flexible +``` + +## Key Takeaways + +โœ“ **zeros() and ones():** All 0s or all 1s + +โœ“ **eye():** Identity matrix (diagonal 1s) + +โœ“ **rand():** Random [0, 1) uniform + +โœ“ **randn():** Random normal distribution (best for weights!) 
+ +โœ“ **randint():** Random integers + +โœ“ **arange():** Sequence with step (end excluded) + +โœ“ **linspace():** N evenly spaced values (end included) + +โœ“ **_like():** Match another tensor's shape + +**Quick Reference:** + +```python +# Zeros and ones +torch.zeros(3, 4) # 3ร—4 matrix of zeros +torch.ones(2, 5) # 2ร—5 matrix of ones + +# Identity +torch.eye(5) # 5ร—5 identity matrix + +# Random +torch.rand(3, 3) # Uniform [0, 1) +torch.randn(3, 3) # Normal (ฮผ=0, ฯƒ=1) +torch.randint(0, 10, (3, 3)) # Random integers [0, 10) + +# Sequences +torch.arange(0, 10, 2) # [0, 2, 4, 6, 8] +torch.linspace(0, 1, 5) # [0.00, 0.25, 0.50, 0.75, 1.00] + +# Like another tensor +x = torch.randn(2, 3) +torch.zeros_like(x) # Zeros with shape (2, 3) +torch.ones_like(x) # Ones with shape (2, 3) +torch.randn_like(x) # Random with shape (2, 3) + +# Fill with value +torch.full((2, 3), 7) # All 7s +``` + +**Remember:** Use `torch.randn()` for weight initialization - it's the standard! ๐ŸŽ‰ diff --git a/public/content/learn/tensors/creating-special-tensors/identity-matrix.png b/public/content/learn/tensors/creating-special-tensors/identity-matrix.png new file mode 100644 index 0000000..2629523 Binary files /dev/null and b/public/content/learn/tensors/creating-special-tensors/identity-matrix.png differ diff --git a/public/content/learn/tensors/creating-special-tensors/like-tensors.png b/public/content/learn/tensors/creating-special-tensors/like-tensors.png new file mode 100644 index 0000000..ad509b3 Binary files /dev/null and b/public/content/learn/tensors/creating-special-tensors/like-tensors.png differ diff --git a/public/content/learn/tensors/creating-special-tensors/random-tensors.png b/public/content/learn/tensors/creating-special-tensors/random-tensors.png new file mode 100644 index 0000000..6a87dc0 Binary files /dev/null and b/public/content/learn/tensors/creating-special-tensors/random-tensors.png differ diff --git a/public/content/learn/tensors/creating-special-tensors/zeros-ones.png b/public/content/learn/tensors/creating-special-tensors/zeros-ones.png new file mode 100644 index 0000000..fe84c6c Binary files /dev/null and b/public/content/learn/tensors/creating-special-tensors/zeros-ones.png differ diff --git a/public/content/learn/tensors/creating-tensors/3d-tensor.png b/public/content/learn/tensors/creating-tensors/3d-tensor.png new file mode 100644 index 0000000..e3c1d2d Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/3d-tensor.png differ diff --git a/public/content/learn/tensors/creating-tensors/creating-from-data.png b/public/content/learn/tensors/creating-tensors/creating-from-data.png new file mode 100644 index 0000000..2ea1afa Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/creating-from-data.png differ diff --git a/public/content/learn/tensors/creating-tensors/creating-tensors-content.md b/public/content/learn/tensors/creating-tensors/creating-tensors-content.md new file mode 100644 index 0000000..738133d --- /dev/null +++ b/public/content/learn/tensors/creating-tensors/creating-tensors-content.md @@ -0,0 +1,703 @@ +--- +hero: + title: "Creating Tensors" + subtitle: "Building Blocks of Deep Learning" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 15 min read" +--- + +Tensors are the fundamental data structure in deep learning. Everything you work with in neural networks - images, text, audio, weights, gradients - is represented as tensors. + +## What is a Tensor? + +A **tensor** is a multi-dimensional array of numbers. 
Think of it as a container that can hold data in different dimensions: + +- **0D Tensor (Scalar)**: A single number โ†’ `5` +- **1D Tensor (Vector)**: An array of numbers โ†’ `[1, 2, 3, 4]` +- **2D Tensor (Matrix)**: A table of numbers โ†’ `[[1, 2], [3, 4], [5, 6]]` +- **3D+ Tensor**: Multiple matrices stacked together โ†’ `[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]` + +Let me show you exactly what these look like: + +**0D Tensor (Scalar)** - Just a number, no brackets needed: +``` +5 +``` + +**1D Tensor (Vector)** - One set of brackets `[ ]`: +``` +[1, 2, 3, 4, 5] +``` + +**2D Tensor (Matrix)** - Two sets of brackets `[[ ]]`, one for each row: +``` +[[1, 2, 3], + [4, 5, 6], + [7, 8, 9]] +``` + +**3D Tensor** - Three sets of brackets `[[[ ]]]`, multiple matrices: +``` +[[[1, 2], [[[5, 6], + [3, 4]], [7, 8]]] +``` + +In PyTorch and other deep learning frameworks, tensors are similar to NumPy arrays but with superpowers - they can run on GPUs and automatically compute gradients! + +## The Bracket Rule: How to Count Dimensions + +**Simple Rule:** Count the number of opening brackets `[` at the start of your data! + +**Examples:** + +```python +# 0D Tensor (Scalar) - NO brackets +5 # 0 dimensions + +# 1D Tensor (Vector) - ONE opening bracket [ +[1, 2, 3] # 1 dimension + +# 2D Tensor (Matrix) - TWO opening brackets [[ +[[1, 2], # 2 dimensions + [3, 4]] + +# 3D Tensor - THREE opening brackets [[[ +[[[1, 2], # 3 dimensions + [3, 4]], + [[5, 6], + [7, 8]]] +``` + +**Pro Tip:** When you create a tensor, look at the left edge of your data. Count the `[` symbols stacked up - that's your number of dimensions! + +```python +import torch + +# Let's verify this rule +scalar = torch.tensor(5) # 0 brackets โ†’ ndim = 0 +print(scalar.ndim) # Output: 0 + +vector = torch.tensor([1, 2, 3]) # 1 bracket โ†’ ndim = 1 +print(vector.ndim) # Output: 1 + +matrix = torch.tensor([[1, 2], [3, 4]]) # 2 brackets โ†’ ndim = 2 +print(matrix.ndim) # Output: 2 + +tensor_3d = torch.tensor([[[1, 2]], [[3, 4]]]) # 3 brackets โ†’ ndim = 3 +print(tensor_3d.ndim) # Output: 3 +``` + +![Tensor Dimensions](/content/learn/tensors/creating-tensors/tensor-dimensions.png) + +## Understanding Tensor Dimensions + +### 0D Tensor (Scalar) + +A scalar is just a single number. + +![Scalar Tensor](/content/learn/tensors/creating-tensors/scalar-tensor.png) + +**Example:** + +```python +import torch + +# Creating a scalar tensor +scalar = torch.tensor(5) + +print(scalar) # Output: tensor(5) +print(scalar.shape) # Output: torch.Size([]) +print(scalar.ndim) # Output: 0 (zero dimensions) +``` + +**What happens here?** + +When you write `torch.tensor(5)`: +1. You pass the number `5` to PyTorch +2. PyTorch creates a tensor object that holds this single value +3. The shape is `[]` (empty brackets) because there are no dimensions +4. `ndim` is `0` because it's just a single number, not an array + +Think of it like putting a single marble in a special container - the marble is your number `5`, and the container is the tensor. + +**Real-world use:** Learning rate, loss value, accuracy score + +**More Examples:** + +```python +temperature = torch.tensor(36.5) # Body temperature +score = torch.tensor(95) # Test score + +print(temperature.ndim) # Output: 0 +print(score.ndim) # Output: 0 +``` + +### 1D Tensor (Vector) + +A vector is an array of numbers, like a list. 
+ +![Vector Tensor](/content/learn/tensors/creating-tensors/vector-tensor.png) + +**Example 1:** Simple vector + +```python +import torch + +# Creating a 1D tensor (vector) +vector = torch.tensor([1, 2, 3, 4, 5]) + +print(vector) # Output: tensor([1, 2, 3, 4, 5]) +print(vector.shape) # Output: torch.Size([5]) +print(vector.ndim) # Output: 1 +``` + +**What happens here?** + +When you write `torch.tensor([1, 2, 3, 4, 5])`: +1. You pass a **Python list** (notice the square brackets `[ ]`) to PyTorch +2. PyTorch sees the list has 5 numbers +3. It creates a 1D tensor with 5 elements in a row +4. The shape is `[5]` meaning "one dimension with 5 elements" +5. `ndim` is `1` because there's one dimension (length) + +**Visual breakdown of the brackets:** +```python +[1, 2, 3, 4, 5] +โ†‘ โ†‘ +One opening and one closing bracket = 1D tensor +``` + +**Think of it like:** A row of 5 boxes, each holding one number. + +**Example 2:** Accessing elements + +```python +vector = torch.tensor([10, 20, 30, 40, 50]) + +# Access individual elements (0-indexed) +print(vector[0]) # Output: tensor(10) +print(vector[2]) # Output: tensor(30) +print(vector[-1]) # Output: tensor(50) (last element) + +# Access a slice +print(vector[1:4]) # Output: tensor([20, 30, 40]) +``` + +**Real-world use:** Word embeddings, feature vectors, time series data + +### 2D Tensor (Matrix) + +A matrix is a table of numbers with rows and columns. + +![Matrix Tensor](/content/learn/tensors/creating-tensors/matrix-tensor.png) + +**Example 1:** Creating a matrix + +```python +import torch + +# Creating a 2D tensor (matrix) +matrix = torch.tensor([[1, 2, 3, 4], + [5, 6, 7, 8], + [9, 10, 11, 12]]) + +print(matrix) +# Output: +# tensor([[ 1, 2, 3, 4], +# [ 5, 6, 7, 8], +# [ 9, 10, 11, 12]]) + +print(matrix.shape) # Output: torch.Size([3, 4]) + # 3 rows, 4 columns +print(matrix.ndim) # Output: 2 +``` + +**What happens here?** + +When you write `torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])`: +1. You pass a **nested Python list** (list inside a list!) +2. The outer brackets `[ ]` represent the matrix itself +3. Each inner bracket `[ ]` represents one row +4. PyTorch counts: 3 inner lists = 3 rows, each has 4 numbers = 4 columns +5. The shape is `[3, 4]` meaning "3 rows, 4 columns" +6. `ndim` is `2` because there are two dimensions (rows and columns) + +**Visual breakdown of the brackets:** +```python +[[1, 2, 3, 4], โ† Row 0 (first row) + [5, 6, 7, 8], โ† Row 1 (second row) + [9, 10, 11, 12]] โ† Row 2 (third row) +โ†‘โ†‘ โ†‘ โ†‘ +โ”‚โ”‚ โ”‚ โ””โ”€ Inner closing bracket (end of row) +โ”‚โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€ Outer opening bracket (start of matrix) +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€Outer closing bracket (end of matrix) + +Two levels of brackets = 2D tensor +``` + +**Think of it like:** A table with 3 rows and 4 columns, like a spreadsheet. 
+ +**Remember:** Shape is always `[ROWS, COLUMNS]` + +**Example 2:** Accessing rows and columns + +```python +matrix = torch.tensor([[1, 2, 3], + [4, 5, 6], + [7, 8, 9]]) + +# Access a single element [row, column] +print(matrix[0, 0]) # Output: tensor(1) +print(matrix[1, 2]) # Output: tensor(6) +print(matrix[2, 1]) # Output: tensor(8) + +# Access entire row +print(matrix[0]) # Output: tensor([1, 2, 3]) +print(matrix[1]) # Output: tensor([4, 5, 6]) + +# Access entire column +print(matrix[:, 0]) # Output: tensor([1, 4, 7]) +print(matrix[:, 1]) # Output: tensor([2, 5, 8]) +``` + +**Real-world use:** Grayscale images, batch of word embeddings, weight matrices + +### 3D Tensor + +A 3D tensor is multiple matrices stacked together. Think of it as a cube of numbers. + +![3D Tensor](/content/learn/tensors/creating-tensors/3d-tensor.png) + +**Example 1:** Creating a 3D tensor + +```python +import torch + +# Creating a 3D tensor (2 matrices, each 3x4) +tensor_3d = torch.tensor([[[1, 2, 3, 4], + [5, 6, 7, 8], + [9, 10, 11, 12]], + + [[13, 14, 15, 16], + [17, 18, 19, 20], + [21, 22, 23, 24]]]) + +print(tensor_3d.shape) # Output: torch.Size([2, 3, 4]) + # 2 matrices, each with 3 rows and 4 columns +print(tensor_3d.ndim) # Output: 3 +``` + +**What happens here?** + +When you write `torch.tensor([[[...], [...]], [[...], [...]]])`: +1. You have **three levels of nested lists** (lists inside lists inside lists!) +2. The outermost brackets `[ ]` represent the whole 3D tensor +3. Each middle-level bracket `[ ]` represents one matrix +4. Each innermost bracket `[ ]` represents one row in a matrix +5. PyTorch counts: 2 middle lists = 2 matrices, each has 3 inner lists = 3 rows, each row has 4 numbers = 4 columns +6. The shape is `[2, 3, 4]` meaning "2 matrices, each 3 rows ร— 4 columns" +7. `ndim` is `3` because there are three dimensions + +**Visual breakdown of the brackets:** +```python +[ โ† Outermost opening (start of 3D tensor) + [ โ† First matrix opening + [1, 2, 3, 4], โ† Row 0 of matrix 0 + [5, 6, 7, 8], โ† Row 1 of matrix 0 + [9, 10, 11, 12] โ† Row 2 of matrix 0 + ], โ† First matrix closing + + [ โ† Second matrix opening + [13, 14, 15, 16], โ† Row 0 of matrix 1 + [17, 18, 19, 20], โ† Row 1 of matrix 1 + [21, 22, 23, 24] โ† Row 2 of matrix 1 + ] โ† Second matrix closing +] โ† Outermost closing (end of 3D tensor) + +Three levels of brackets = 3D tensor +``` + +**Think of it like:** A stack of 2 pages, where each page is a table (matrix) with 3 rows and 4 columns. + +**Understanding shape (2, 3, 4):** + +- **First dimension (2)**: Number of matrices (or "depth") +- **Second dimension (3)**: Number of rows in each matrix +- **Third dimension (4)**: Number of columns in each matrix + +```python +# Access the first matrix +print(tensor_3d[0]) +# Output: +# tensor([[ 1, 2, 3, 4], +# [ 5, 6, 7, 8], +# [ 9, 10, 11, 12]]) + +# Access the second matrix +print(tensor_3d[1]) +# Output: +# tensor([[13, 14, 15, 16], +# [17, 18, 19, 20], +# [21, 22, 23, 24]]) + +# Access specific element [matrix, row, column] +print(tensor_3d[0, 1, 2]) # Output: tensor(7) +print(tensor_3d[1, 2, 3]) # Output: tensor(24) +``` + +**Real-world use:** RGB images (height, width, 3 color channels), video frames, batch of images + +## Creating Tensors from Different Data Types + +PyTorch provides multiple ways to create tensors from existing data. 
+ +![Creating from Data](/content/learn/tensors/creating-tensors/creating-from-data.png) + +### From Python Lists + +**Example 1:** 1D tensor from list + +```python +import torch + +# Create from Python list +python_list = [1, 2, 3, 4, 5] +tensor = torch.tensor(python_list) + +print(tensor) # Output: tensor([1, 2, 3, 4, 5]) +print(type(tensor)) # Output: +``` + +**Example 2:** 2D tensor from nested lists + +```python +# Create 2D tensor from nested list +nested_list = [[1, 2, 3], + [4, 5, 6], + [7, 8, 9]] + +tensor_2d = torch.tensor(nested_list) + +print(tensor_2d) +# Output: +# tensor([[1, 2, 3], +# [4, 5, 6], +# [7, 8, 9]]) + +print(tensor_2d.shape) # Output: torch.Size([3, 3]) +``` + +**Example 3:** 3D tensor from deeply nested lists + +```python +# Create 3D tensor (2 matrices, each 2x3) +deep_list = [[[1, 2, 3], + [4, 5, 6]], + + [[7, 8, 9], + [10, 11, 12]]] + +tensor_3d = torch.tensor(deep_list) + +print(tensor_3d.shape) # Output: torch.Size([2, 2, 3]) +``` + +### From NumPy Arrays + +If you're working with NumPy arrays, you can easily convert them to tensors. + +**Example 1:** Converting NumPy array to tensor + +```python +import torch +import numpy as np + +# Create NumPy array +np_array = np.array([1, 2, 3, 4, 5]) + +# Convert to PyTorch tensor +tensor = torch.from_numpy(np_array) + +print(np_array) # Output: [1 2 3 4 5] +print(tensor) # Output: tensor([1, 2, 3, 4, 5]) +``` + +**Example 2:** 2D NumPy array to tensor + +```python +# Create 2D NumPy array +np_matrix = np.array([[1, 2, 3], + [4, 5, 6]]) + +# Convert to tensor +tensor_from_np = torch.from_numpy(np_matrix) + +print(tensor_from_np) +# Output: +# tensor([[1, 2, 3], +# [4, 5, 6]]) + +print(tensor_from_np.shape) # Output: torch.Size([2, 3]) +``` + +**Important Note:** `torch.from_numpy()` shares memory with the original NumPy array, so changes to one affect the other! + +```python +np_array = np.array([1, 2, 3]) +tensor = torch.from_numpy(np_array) + +# Modify NumPy array +np_array[0] = 999 + +print(np_array) # Output: [999 2 3] +print(tensor) # Output: tensor([999, 2, 3]) +# They share memory! +``` + +### From Other Tensors + +**Example:** Creating a new tensor with the same shape + +```python +# Create original tensor +x = torch.tensor([[1, 2], + [3, 4]]) + +# Create new tensor with same shape (but different values) +y = torch.tensor([[5, 6], + [7, 8]]) + +print(x.shape) # Output: torch.Size([2, 2]) +print(y.shape) # Output: torch.Size([2, 2]) +``` + +## Specifying Data Types + +Tensors can hold different types of numbers. Choosing the right data type is important for memory efficiency and computation speed. + +![Data Types](/content/learn/tensors/creating-tensors/data-types.png) + +### Common Data Types + +- `torch.int32` or `torch.int`: 32-bit integers (4 bytes per number) +- `torch.int64` or `torch.long`: 64-bit integers (8 bytes per number) +- `torch.float32` or `torch.float`: 32-bit floating point (4 bytes per number) **[Most Common]** +- `torch.float64` or `torch.double`: 64-bit floating point (8 bytes per number) +- `torch.bool`: Boolean values (True/False) + +**Example 1:** Creating tensors with specific data types + +```python +import torch + +# Integer tensor (int32) +int_tensor = torch.tensor([1, 2, 3], dtype=torch.int32) +print(int_tensor) # Output: tensor([1, 2, 3], dtype=torch.int32) +print(int_tensor.dtype) # Output: torch.int32 + +# Float tensor (float32) - Most common for neural networks! 
+float_tensor = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32) +print(float_tensor) # Output: tensor([1., 2., 3.]) +print(float_tensor.dtype) # Output: torch.float32 + +# Boolean tensor +bool_tensor = torch.tensor([True, False, True], dtype=torch.bool) +print(bool_tensor) # Output: tensor([ True, False, True]) +print(bool_tensor.dtype) # Output: torch.bool +``` + +**Example 2:** Default data type behavior + +```python +# PyTorch infers the data type from your input + +# Integers โ†’ int64 (by default) +x = torch.tensor([1, 2, 3]) +print(x.dtype) # Output: torch.int64 + +# Floats โ†’ float32 (by default) +y = torch.tensor([1.0, 2.0, 3.0]) +print(y.dtype) # Output: torch.float32 + +# Mixed integers and floats โ†’ float32 +z = torch.tensor([1, 2.0, 3]) +print(z) # Output: tensor([1., 2., 3.]) +print(z.dtype) # Output: torch.float32 +``` + +**Example 3:** Converting between data types + +```python +# Create integer tensor +int_tensor = torch.tensor([1, 2, 3]) +print(int_tensor.dtype) # Output: torch.int64 + +# Convert to float +float_tensor = int_tensor.float() +print(float_tensor) # Output: tensor([1., 2., 3.]) +print(float_tensor.dtype) # Output: torch.float32 +``` + +**Example 4:** Why data type matters + +```python +# Memory usage comparison +large_int64 = torch.ones(1000000, dtype=torch.int64) +large_int32 = torch.ones(1000000, dtype=torch.int32) + +print(f"int64 tensor: {large_int64.element_size() * large_int64.nelement() / 1e6} MB") +# Output: 8.0 MB (8 bytes per element) + +print(f"int32 tensor: {large_int32.element_size() * large_int32.nelement() / 1e6} MB") +# Output: 4.0 MB (4 bytes per element) + +# int32 uses half the memory! +``` + +## Practical Examples + +### Example 1: Creating a Batch of Data + +In deep learning, we often process multiple examples at once (a "batch"). + +```python +import torch + +# Create 3 examples, each with 3 features +# Example: [height, weight, age] +batch = torch.tensor([[170, 65, 25], + [180, 80, 30], + [165, 55, 22]], + dtype=torch.float32) + +print("Batch shape:", batch.shape) +# Output: Batch shape: torch.Size([3, 3]) +# 3 people, 3 features each + +# Access all heights (first column) +all_heights = batch[:, 0] +print(f"Heights: {all_heights}") +# Output: Heights: tensor([170., 180., 165.]) + +print(f"Average height: {all_heights.mean():.1f}cm") +# Output: Average height: 171.7cm +``` + +### Example 2: Creating RGB Image Data + +A tiny 2x2 RGB color image (3 color channels). + +```python +import torch + +# Define a 2x2 RGB image +# Each pixel has [Red, Green, Blue] values +image_rgb = [ + [[255, 0, 0], [0, 255, 0]], # Red, Green pixels + [[0, 0, 255], [255, 255, 0]] # Blue, Yellow pixels +] + +rgb_tensor = torch.tensor(image_rgb, dtype=torch.float32) + +print("Shape:", rgb_tensor.shape) +# Output: Shape: torch.Size([2, 2, 3]) +# 2 height, 2 width, 3 color channels + +# Access the red channel of all pixels +red_channel = rgb_tensor[:, :, 0] +print(f"Red channel:\n{red_channel}") +# Output: +# tensor([[255., 0.], +# [ 0., 255.]]) +``` + +## Common Mistakes and How to Fix Them + +### Mistake 1: Shape Mismatch + +```python +# โŒ Wrong: Inconsistent row lengths +try: + wrong_tensor = torch.tensor([[1, 2, 3], + [4, 5]]) # Second row too short! 
+except: + print("Error: All rows must have the same length") + +# โœ… Correct: All rows same length +correct_tensor = torch.tensor([[1, 2, 3], + [4, 5, 6]]) +print(correct_tensor.shape) # Output: torch.Size([2, 3]) +``` + +### Mistake 2: Forgetting Dimension Order + +```python +# For images, be careful about dimension order! + +# โŒ Wrong order: (channels, height, width) +# This might cause errors in some operations +wrong_order = torch.rand(3, 224, 224) + +# โœ… PyTorch usually expects: (batch, channels, height, width) +correct_batch = torch.rand(1, 3, 224, 224) # 1 image, 3 channels, 224x224 + +# โœ… For a single image: (channels, height, width) +single_image = torch.rand(3, 224, 224) +``` + +## Quick Reference + +### Creating Tensors + +```python +# From list +torch.tensor([1, 2, 3]) + +# From NumPy +torch.from_numpy(np_array) + +# With specific dtype +torch.tensor([1, 2], dtype=torch.float32) +``` + +### Checking Tensor Properties + +```python +tensor = torch.tensor([[1, 2], [3, 4]]) + +tensor.shape # Shape: torch.Size([2, 2]) +tensor.size() # Same as .shape +tensor.ndim # Number of dimensions: 2 +tensor.dtype # Data type: torch.int64 +tensor.numel() # Total number of elements: 4 +``` + +### Data Type Conversion + +```python +tensor.float() # Convert to float32 +tensor.int() # Convert to int32 +tensor.long() # Convert to int64 +tensor.double() # Convert to float64 +tensor.bool() # Convert to boolean +``` + +## Why Tensors Matter for Neural Networks + +- **Images**: RGB images are 3D tensors (height ร— width ร— 3 channels) +- **Batches**: Neural networks process multiple examples at once (batch dimension) +- **Text**: Word embeddings are 2D tensors (sequence length ร— embedding dimension) +- **Weights**: Model parameters are tensors that get updated during training + +**Example: A batch of images** +```python +# Shape: (batch_size, channels, height, width) +batch_of_images = torch.rand(32, 3, 224, 224) +# 32 images, 3 color channels (RGB), 224ร—224 pixels + +print(f"Batch shape: {batch_of_images.shape}") +# Output: Batch shape: torch.Size([32, 3, 224, 224]) +``` + +**Congratulations! 
You now understand how to create and work with tensors!** ๐ŸŽ‰ diff --git a/public/content/learn/tensors/creating-tensors/data-types.png b/public/content/learn/tensors/creating-tensors/data-types.png new file mode 100644 index 0000000..2c3450e Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/data-types.png differ diff --git a/public/content/learn/tensors/creating-tensors/matrix-tensor.png b/public/content/learn/tensors/creating-tensors/matrix-tensor.png new file mode 100644 index 0000000..6a974d2 Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/matrix-tensor.png differ diff --git a/public/content/learn/tensors/creating-tensors/scalar-tensor.png b/public/content/learn/tensors/creating-tensors/scalar-tensor.png new file mode 100644 index 0000000..f7adbf4 Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/scalar-tensor.png differ diff --git a/public/content/learn/tensors/creating-tensors/tensor-dimensions.png b/public/content/learn/tensors/creating-tensors/tensor-dimensions.png new file mode 100644 index 0000000..0683845 Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/tensor-dimensions.png differ diff --git a/public/content/learn/tensors/creating-tensors/vector-tensor.png b/public/content/learn/tensors/creating-tensors/vector-tensor.png new file mode 100644 index 0000000..d6ab661 Binary files /dev/null and b/public/content/learn/tensors/creating-tensors/vector-tensor.png differ diff --git a/public/content/learn/tensors/indexing-and-slicing/basic-indexing.png b/public/content/learn/tensors/indexing-and-slicing/basic-indexing.png new file mode 100644 index 0000000..df956b1 Binary files /dev/null and b/public/content/learn/tensors/indexing-and-slicing/basic-indexing.png differ diff --git a/public/content/learn/tensors/indexing-and-slicing/indexing-and-slicing-content.md b/public/content/learn/tensors/indexing-and-slicing/indexing-and-slicing-content.md new file mode 100644 index 0000000..a0d970a --- /dev/null +++ b/public/content/learn/tensors/indexing-and-slicing/indexing-and-slicing-content.md @@ -0,0 +1,500 @@ +--- +hero: + title: "Indexing and Slicing" + subtitle: "Accessing and Extracting Tensor Elements" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 10 min read" +--- + +Indexing and slicing let you access and extract specific parts of tensors. Think of it like selecting specific pages from a book or specific rows from a spreadsheet! + +## The Basics: Indexing Starts at 0 + +**Important:** In Python and PyTorch, counting starts at **0**, not 1! + +![Basic Indexing](/content/learn/tensors/indexing-and-slicing/basic-indexing.png) + +**Example:** + +```python +import torch + +v = torch.tensor([10, 20, 30, 40, 50]) + +print(v[0]) # Output: tensor(10) โ† First element +print(v[2]) # Output: tensor(30) โ† Third element +print(v[4]) # Output: tensor(50) โ† Fifth element +``` + +**Manual breakdown:** + +```yaml +v = [10, 20, 30, 40, 50] + โ†‘ โ†‘ โ†‘ โ†‘ โ†‘ + [0] [1] [2] [3] [4] + +v[0] โ†’ 10 +v[1] โ†’ 20 +v[2] โ†’ 30 +``` + +**Key rule:** First element is `[0]`, second is `[1]`, third is `[2]`, and so on! 
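
Because counting starts at 0, a tensor with 5 elements has valid indices 0 through 4. Here is a minimal sketch of what happens if you go one past the end (the exact wording of the error can vary between PyTorch versions):

```python
import torch

v = torch.tensor([10, 20, 30, 40, 50])

print(v[4])    # tensor(50) ← index 4 is the last valid index
# print(v[5])  # IndexError: index 5 is out of bounds for dimension 0 with size 5
```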
+ +## Negative Indexing + +You can count **backwards from the end** using negative indices: + +![Negative Indexing](/content/learn/tensors/indexing-and-slicing/negative-indexing.png) + +**Example:** + +```python +import torch + +v = torch.tensor([10, 20, 30, 40, 50]) + +print(v[-1]) # Output: tensor(50) โ† Last element +print(v[-2]) # Output: tensor(40) โ† Second from end +print(v[-5]) # Output: tensor(10) โ† Fifth from end (first!) +``` + +**How it works:** + +```yaml +Positive: [0] [1] [2] [3] [4] +Values: 10 20 30 40 50 +Negative: [-5] [-4] [-3] [-2] [-1] + +v[-1] = 50 (last) +v[-2] = 40 (second from last) +v[-3] = 30 (third from last) +``` + +**Useful trick:** `v[-1]` always gets the last element, no matter the size! + +## Matrix Indexing (2D) + +For matrices, use `[row, column]`: + +![Matrix Indexing](/content/learn/tensors/indexing-and-slicing/matrix-indexing.png) + +**Example:** + +```python +import torch + +A = torch.tensor([[10, 20, 30, 40], + [50, 60, 70, 80], + [90, 100, 110, 120]]) + +print(A[0, 0]) # Output: tensor(10) โ† Top-left +print(A[1, 2]) # Output: tensor(70) โ† Row 1, Col 2 +print(A[2, 3]) # Output: tensor(120) โ† Bottom-right +print(A[-1, -1]) # Output: tensor(120) โ† Also bottom-right! +``` + +**Manual breakdown:** + +```yaml + Col 0 Col 1 Col 2 Col 3 +Row 0: 10 20 30 40 +Row 1: 50 60 70 80 +Row 2: 90 100 110 120 + +A[1, 2] โ†’ Row 1, Column 2 โ†’ 70 +A[0, 3] โ†’ Row 0, Column 3 โ†’ 40 +``` + +**Pattern:** `[row, column]` always - row first, column second! + +## Slicing: Getting Multiple Elements + +Slicing uses the syntax `[start:end]` where **end is NOT included**! + +![Slicing Basics](/content/learn/tensors/indexing-and-slicing/slicing-basics.png) + +**Example:** + +```python +import torch + +v = torch.tensor([10, 20, 30, 40, 50, 60]) + +print(v[1:4]) # Output: tensor([20, 30, 40]) +print(v[0:3]) # Output: tensor([10, 20, 30]) +print(v[3:6]) # Output: tensor([40, 50, 60]) +``` + +**Manual breakdown:** + +```yaml +v = [10, 20, 30, 40, 50, 60] + [0] [1] [2] [3] [4] [5] + +v[1:4] gets indices: 1, 2, 3 (stops BEFORE 4) + โ†’ [20, 30, 40] + +v[0:3] gets indices: 0, 1, 2 + โ†’ [10, 20, 30] +``` + +**Critical:** `v[1:4]` gets elements at positions 1, 2, and 3. It does NOT include position 4! + +## Slicing Shortcuts + +You can omit start or end: + +```python +import torch + +v = torch.tensor([10, 20, 30, 40, 50, 60]) + +print(v[:3]) # Output: tensor([10, 20, 30]) โ† From start to 3 +print(v[3:]) # Output: tensor([40, 50, 60]) โ† From 3 to end +print(v[:]) # Output: tensor([10, 20, 30, 40, 50, 60]) โ† Everything! +``` + +**What they mean:** + +```yaml +v[:3] โ†’ v[0:3] โ†’ Start at 0, stop before 3 +v[3:] โ†’ v[3:6] โ†’ Start at 3, go to end +v[:] โ†’ v[0:6] โ†’ All elements (copy) +``` + +## Matrix Slicing + +Slicing works in 2D too! 
+
![Matrix Slicing](/content/learn/tensors/indexing-and-slicing/matrix-slicing.png)

**Example:**

```python
import torch

A = torch.tensor([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]])

# Get a sub-matrix
print(A[1:3, 1:3])
# Output:
# tensor([[ 6,  7],
#         [10, 11]])

# Get entire row 2
print(A[2, :])
# Output: tensor([9, 10, 11, 12])

# Get entire column 2
print(A[:, 2])
# Output: tensor([3, 7, 11, 15])
```

**Manual breakdown:**

```yaml
A[1:3, 1:3] means:
- Rows 1 to 3 (not including 3) → rows 1, 2
- Cols 1 to 3 (not including 3) → cols 1, 2

Result:
[[6, 7],
 [10, 11]]

A[2, :] means:
- Row 2
- All columns (:)
→ [9, 10, 11, 12]

A[:, 2] means:
- All rows (:)
- Column 2
→ [3, 7, 11, 15]
```

**Remember:** `:` means "all" (all rows or all columns)

## Step Slicing

Add a **step** to skip elements: `[start:end:step]`

![Step Slicing](/content/learn/tensors/indexing-and-slicing/step-slicing.png)

**Example:**

```python
import torch

v = torch.tensor([0, 10, 20, 30, 40, 50, 60, 70])

print(v[::2])    # Output: tensor([0, 20, 40, 60]) ← Every 2nd
print(v[1::2])   # Output: tensor([10, 30, 50, 70]) ← Start 1, every 2nd
print(v[::3])    # Output: tensor([0, 30, 60]) ← Every 3rd

# Unlike NumPy, PyTorch does NOT allow a negative step like v[::-1]
# Use torch.flip() to reverse instead:
print(torch.flip(v, dims=[0]))
# Output: tensor([70, 60, 50, 40, 30, 20, 10, 0]) ← Reversed!
```

**How it works:**

```yaml
v[::2]  → Start at 0, take every 2nd element
        → Indices: 0, 2, 4, 6
        → Values: [0, 20, 40, 60]

v[1::2] → Start at 1, take every 2nd element
        → Indices: 1, 3, 5, 7
        → Values: [10, 30, 50, 70]

torch.flip(v, dims=[0]) → Reverse along dimension 0
        → Values: [70, 60, 50, 40, 30, 20, 10, 0]
```

**Cool trick:** `torch.flip(v, dims=[0])` reverses any tensor! (PyTorch slicing rejects a negative step, so NumPy's `v[::-1]` shortcut raises an error here.)

## Multiple Elements at Once

You can use lists to select specific indices:

```python
import torch

v = torch.tensor([10, 20, 30, 40, 50])

# Select indices 0, 2, 4
indices = torch.tensor([0, 2, 4])
result = v[indices]

print(result)  # Output: tensor([10, 30, 50])
```

**For matrices:**

```python
import torch

A = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Get specific rows
rows = torch.tensor([0, 2])
result = A[rows]

print(result)
# Output:
# tensor([[1, 2, 3],
#         [7, 8, 9]])
```

## Practical Example: Batch Processing

```python
import torch

# Batch of 5 samples, each with 3 features
batch = torch.tensor([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0],
                      [7.0, 8.0, 9.0],
                      [10.0, 11.0, 12.0],
                      [13.0, 14.0, 15.0]])

# Get first 3 samples
first_three = batch[:3]
print(first_three)
# tensor([[ 1.,  2.,  3.],
#         [ 4.,  5.,  6.],
#         [ 7.,  8.,  9.]])

# Get last 2 samples
last_two = batch[-2:]
print(last_two)
# tensor([[10., 11., 12.],
#         [13., 14., 15.]])

# Get all samples, but only first 2 features
first_two_features = batch[:, :2]
print(first_two_features)
# tensor([[ 1.,  2.],
#         [ 4.,  5.],
#         [ 7.,  8.],
#         [10., 11.],
#         [13., 14.]])
```

**What happened:**

```yaml
batch[:3]    → First 3 rows (samples 0, 1, 2)
batch[-2:]   → Last 2 rows (samples 3, 4)
batch[:, :2] → All rows, first 2 columns (features 0, 1)
```

## Modifying with Indexing

You can change values using indexing:

```python
import torch

v = torch.tensor([10, 20, 30, 40, 50])

# Change single element
v[2] = 999
print(v)  # tensor([ 10,  20, 999,  40,  50])

# Change slice
v[0:2] = torch.tensor([100, 200])
print(v)  # tensor([100, 200, 999,  40,  50])

# Set all to same value
v[:] = 0
print(v)  # tensor([0, 0, 0, 0, 0])
```

## 3D
Indexing

For 3D tensors (like batches of images):

```python
import torch

# 2 batches, 3 rows, 4 columns
tensor_3d = torch.randn(2, 3, 4)

# Get first batch
first_batch = tensor_3d[0]  # Shape: (3, 4)

# Get element from second batch, row 1, col 2
element = tensor_3d[1, 1, 2]  # Single value

# Get all batches, row 0, all columns
slice_3d = tensor_3d[:, 0, :]  # Shape: (2, 4)
```

**Pattern:** `[batch, row, col]` for 3D tensors

## Common Patterns

### Get First/Last Row

```python
A = torch.randn(5, 3)

first_row = A[0]    # or A[0, :]
last_row = A[-1]    # or A[-1, :]
```

### Get First/Last Column

```python
A = torch.randn(5, 3)

first_col = A[:, 0]
last_col = A[:, -1]
```

### Get Main Diagonal

```python
A = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

diagonal = torch.diag(A)
print(diagonal)  # tensor([1, 5, 9])
```

### Skip Every Other Row

```python
A = torch.randn(10, 3)

every_other_row = A[::2]  # Rows 0, 2, 4, 6, 8
```

## Common Gotchas

### ❌ Gotcha 1: End Index Not Included

```python
v = torch.tensor([10, 20, 30, 40, 50])

# v[1:4] gets indices 1, 2, 3 (NOT 4!)
print(v[1:4])  # tensor([20, 30, 40])

# To include index 4, use v[1:5]
print(v[1:5])  # tensor([20, 30, 40, 50])
```

### ❌ Gotcha 2: Slicing Creates a View

```python
v = torch.tensor([1, 2, 3, 4, 5])
slice_v = v[1:4]

# Modifying slice also modifies original!
slice_v[0] = 999

print(v)        # tensor([  1, 999,   3,   4,   5])
print(slice_v)  # tensor([999,   3,   4])

# Use .clone() for a copy
slice_copy = v[1:4].clone()
slice_copy[0] = 100
print(v)  # tensor([  1, 999,   3,   4,   5]) ← Unchanged!
```

### ❌ Gotcha 3: Integer vs Slice

```python
A = torch.randn(3, 4)

# Integer index reduces dimensions
row = A[0]     # Shape: (4,) ← 1D tensor

# Slice keeps dimensions
row = A[0:1]   # Shape: (1, 4) ← Still 2D!
```

## Key Takeaways

✓ **Indexing starts at 0:** First element is `[0]`, not `[1]`

✓ **Negative indexing:** `-1` is last, `-2` is second from last

✓ **Slicing:** `[start:end]` - end is NOT included!

✓ **Colon means all:** `A[:, 2]` = all rows, column 2

✓ **Step:** `[::2]` = every 2nd element; use `torch.flip()` to reverse (negative steps aren't supported)

✓ **Views not copies:** Slicing creates views - use `.clone()` for copies

**Quick Reference:**

```python
# Basic indexing
v[0]       # First element
v[-1]      # Last element
A[1, 2]    # Row 1, column 2

# Slicing
v[1:4]     # Elements 1, 2, 3
v[:3]      # First 3 elements
v[3:]      # From index 3 to end
v[:]       # All elements

# 2D slicing
A[1:3, 2:4]   # Rows 1-2, columns 2-3
A[0, :]       # First row
A[:, 0]       # First column

# Step slicing
v[::2]                   # Every 2nd element
torch.flip(v, dims=[0])  # Reversed (no negative-step slicing in PyTorch)
```

**Congratulations!** You now know how to access any part of any tensor! This is essential for data processing and neural networks.
๐ŸŽ‰ diff --git a/public/content/learn/tensors/indexing-and-slicing/matrix-indexing.png b/public/content/learn/tensors/indexing-and-slicing/matrix-indexing.png new file mode 100644 index 0000000..1e1469c Binary files /dev/null and b/public/content/learn/tensors/indexing-and-slicing/matrix-indexing.png differ diff --git a/public/content/learn/tensors/indexing-and-slicing/matrix-slicing.png b/public/content/learn/tensors/indexing-and-slicing/matrix-slicing.png new file mode 100644 index 0000000..251ae83 Binary files /dev/null and b/public/content/learn/tensors/indexing-and-slicing/matrix-slicing.png differ diff --git a/public/content/learn/tensors/indexing-and-slicing/negative-indexing.png b/public/content/learn/tensors/indexing-and-slicing/negative-indexing.png new file mode 100644 index 0000000..315b5c8 Binary files /dev/null and b/public/content/learn/tensors/indexing-and-slicing/negative-indexing.png differ diff --git a/public/content/learn/tensors/indexing-and-slicing/slicing-basics.png b/public/content/learn/tensors/indexing-and-slicing/slicing-basics.png new file mode 100644 index 0000000..f025e12 Binary files /dev/null and b/public/content/learn/tensors/indexing-and-slicing/slicing-basics.png differ diff --git a/public/content/learn/tensors/indexing-and-slicing/step-slicing.png b/public/content/learn/tensors/indexing-and-slicing/step-slicing.png new file mode 100644 index 0000000..1cc240f Binary files /dev/null and b/public/content/learn/tensors/indexing-and-slicing/step-slicing.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/all-positions.png b/public/content/learn/tensors/matrix-multiplication/all-positions.png new file mode 100644 index 0000000..5286bea Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/all-positions.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/dot-product-steps.png b/public/content/learn/tensors/matrix-multiplication/dot-product-steps.png new file mode 100644 index 0000000..142e6ea Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/dot-product-steps.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/dot-product.png b/public/content/learn/tensors/matrix-multiplication/dot-product.png new file mode 100644 index 0000000..1bdac61 Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/dot-product.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/elementwise-vs-matmul.png b/public/content/learn/tensors/matrix-multiplication/elementwise-vs-matmul.png new file mode 100644 index 0000000..8aa78df Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/elementwise-vs-matmul.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/matrix-multiplication-content.md b/public/content/learn/tensors/matrix-multiplication/matrix-multiplication-content.md new file mode 100644 index 0000000..1fd711c --- /dev/null +++ b/public/content/learn/tensors/matrix-multiplication/matrix-multiplication-content.md @@ -0,0 +1,380 @@ +--- +hero: + title: "Matrix Multiplication" + subtitle: "The Core Operation in Neural Networks" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 10 min read" +--- + +Matrix multiplication is THE most important operation in deep learning. Unlike addition, it's **not element-wise** - it combines rows and columns in a special way. 
+ +## The Key Difference + +**Addition:** Add each position separately +**Multiplication:** Combine entire rows with entire columns + +Let's build up to matrix multiplication step by step! + +## Step 1: The Dot Product + +Before matrices, let's understand the **dot product** - multiplying two vectors: + +![Dot Product](/content/learn/tensors/matrix-multiplication/dot-product.png) + +**Example:** + +```python +import torch + +a = torch.tensor([2, 3, 4]) +b = torch.tensor([1, 2, 3]) + +# Dot product +result = torch.dot(a, b) + +print(result) # Output: tensor(20) +``` + +**Manual calculation:** + +```yaml +Step 1: Multiply corresponding elements +2 ร— 1 = 2 +3 ร— 2 = 6 +4 ร— 3 = 12 + +Step 2: Add them all up +2 + 6 + 12 = 20 + +Result: 20 +``` + +**Key insight:** Dot product = multiply pairs, then sum everything. + +![Dot Product Steps](/content/learn/tensors/matrix-multiplication/dot-product-steps.png) + +## Step 2: Matrix @ Matrix + +Matrix multiplication uses dot products repeatedly! The `@` operator means "matrix multiply": + +![Simple Matrix Multiplication](/content/learn/tensors/matrix-multiplication/simple-matmul.png) + +**Example:** + +```python +import torch + +A = torch.tensor([[1, 2], + [3, 4]]) + +B = torch.tensor([[5, 6], + [7, 8]]) + +result = A @ B # @ means matrix multiply + +print(result) +# Output: +# tensor([[19, 22], +# [43, 50]]) +``` + +**How does this work?** Each position in the result is a dot product! + +## Computing One Position: The Rule + +**To get result[row, col]:** +1. Take the **row** from matrix A +2. Take the **column** from matrix B +3. Compute their **dot product** + +![Step by Step](/content/learn/tensors/matrix-multiplication/step-by-step.png) + +**Manual calculation for position [0, 0]:** + +```yaml +Take row 0 from A: [1, 2] +Take column 0 from B: [5, 7] + +Dot product: +(1 ร— 5) + (2 ร— 7) = 5 + 14 = 19 + +Result[0, 0] = 19 +``` + +**Manual calculation for position [0, 1]:** + +```yaml +Take row 0 from A: [1, 2] +Take column 1 from B: [6, 8] + +Dot product: +(1 ร— 6) + (2 ร— 8) = 6 + 16 = 22 + +Result[0, 1] = 22 +``` + +**Manual calculation for position [1, 0]:** + +```yaml +Take row 1 from A: [3, 4] +Take column 0 from B: [5, 7] + +Dot product: +(3 ร— 5) + (4 ร— 7) = 15 + 28 = 43 + +Result[1, 0] = 43 +``` + +**Manual calculation for position [1, 1]:** + +```yaml +Take row 1 from A: [3, 4] +Take column 1 from B: [6, 8] + +Dot product: +(3 ร— 6) + (4 ร— 8) = 18 + 32 = 50 + +Result[1, 1] = 50 +``` + +**Complete result:** + +```yaml +[[19, 22], + [43, 50]] +``` + +![All Positions](/content/learn/tensors/matrix-multiplication/all-positions.png) + +## The Shape Rule + +**Not all matrices can be multiplied!** The shapes must be compatible: + +![Shape Rule](/content/learn/tensors/matrix-multiplication/shape-rule.png) + +**The rule:** `(m, n) @ (n, p) = (m, p)` + +The **inner dimensions must match**! + +### โœ“ Valid Examples + +```python +# Example 1 +A = torch.randn(3, 4) # 3 rows, 4 columns +B = torch.randn(4, 2) # 4 rows, 2 columns +result = A @ B # Works! โ†’ (3, 2) + +# Example 2 +A = torch.randn(5, 10) +B = torch.randn(10, 7) +result = A @ B # Works! โ†’ (5, 7) + +# Example 3 +A = torch.randn(2, 3) +B = torch.randn(3, 3) +result = A @ B # Works! โ†’ (2, 3) +``` + +**Why these work:** + +```yaml +Example 1: (3, 4) @ (4, 2) = (3, 2) โœ“ 4 = 4 +Example 2: (5, 10) @ (10, 7) = (5, 7) โœ“ 10 = 10 +Example 3: (2, 3) @ (3, 3) = (2, 3) โœ“ 3 = 3 +``` + +### โœ— Invalid Examples + +```python +# Example 1 - WILL ERROR! 
+A = torch.randn(3, 4) +B = torch.randn(5, 2) +# result = A @ B # Error! 4 โ‰  5 + +# Example 2 - WILL ERROR! +A = torch.randn(2, 7) +B = torch.randn(3, 5) +# result = A @ B # Error! 7 โ‰  3 +``` + +**Why these fail:** + +```yaml +Example 1: (3, 4) @ (5, 2) โœ— 4 โ‰  5 (can't match rows with columns) +Example 2: (2, 7) @ (3, 5) โœ— 7 โ‰  3 (dimensions incompatible) +``` + +## Vector @ Matrix + +A common pattern in neural networks is multiplying a vector by a matrix: + +![Vector @ Matrix](/content/learn/tensors/matrix-multiplication/vector-matrix.png) + +**Example:** + +```python +import torch + +# Input vector (like data going into a layer) +x = torch.tensor([1, 2, 3]) # Shape: (3,) + +# Weight matrix +W = torch.tensor([[4, 5], + [6, 7], + [8, 9]]) # Shape: (3, 2) + +result = x @ W + +print(result) # Output: tensor([40, 46]) +print(result.shape) # Shape: (2,) +``` + +**Manual calculation:** + +```yaml +Position [0]: +Take vector: [1, 2, 3] +Take column 0: [4, 6, 8] +Dot product: (1ร—4) + (2ร—6) + (3ร—8) = 4 + 12 + 24 = 40 + +Position [1]: +Take vector: [1, 2, 3] +Take column 1: [5, 7, 9] +Dot product: (1ร—5) + (2ร—7) + (3ร—9) = 5 + 14 + 27 = 46 + +Result: [40, 46] +``` + +**This is exactly what happens in a neural network layer!** + +## Practical Example: Neural Network Layer + +Here's a realistic example of matrix multiplication in action: + +```python +import torch + +# Batch of 2 samples, each with 3 features +inputs = torch.tensor([[1.0, 2.0, 3.0], + [4.0, 5.0, 6.0]]) # Shape: (2, 3) + +# Weight matrix: 3 inputs โ†’ 4 outputs +weights = torch.tensor([[0.1, 0.2, 0.3, 0.4], + [0.5, 0.6, 0.7, 0.8], + [0.9, 1.0, 1.1, 1.2]]) # Shape: (3, 4) + +# Forward pass +outputs = inputs @ weights # Shape: (2, 4) + +print(outputs) +# tensor([[3.2000, 3.8000, 4.4000, 5.0000], +# [7.7000, 9.2000, 10.7000, 12.2000]]) +``` + +**What happened:** + +```yaml +Shape: (2, 3) @ (3, 4) = (2, 4) + โ†“ โ†“ โ†“ + 2 samples โ†’ 4 outputs per sample + 3 features each +``` + +Each of the 2 input samples got transformed into 4 output values. This is how neural networks transform data! + +![Neural Network Layer](/content/learn/tensors/matrix-multiplication/neural-network.png) + +## Matrix @ Vector + +You can also multiply matrix @ vector (different from vector @ matrix): + +```python +import torch + +A = torch.tensor([[1, 2, 3], + [4, 5, 6]]) # Shape: (2, 3) + +v = torch.tensor([1, 2, 3]) # Shape: (3,) + +result = A @ v + +print(result) # Output: tensor([14, 32]) +print(result.shape) # Shape: (2,) +``` + +**Manual calculation:** + +```yaml +Row 0: [1, 2, 3] ยท [1, 2, 3] = 1 + 4 + 9 = 14 +Row 1: [4, 5, 6] ยท [1, 2, 3] = 4 + 10 + 18 = 32 + +Result: [14, 32] +``` + +## Common Mistakes + +### โŒ Mistake 1: Using * instead of @ + +```python +A = torch.tensor([[1, 2], [3, 4]]) +B = torch.tensor([[5, 6], [7, 8]]) + +wrong = A * B # Element-wise multiplication! โŒ +right = A @ B # Matrix multiplication! โœ“ + +print("Wrong:", wrong) +# tensor([[ 5, 12], +# [21, 32]]) + +print("Right:", right) +# tensor([[19, 22], +# [43, 50]]) +``` + +**Visual comparison:** + +![Element-wise vs Matrix Multiplication](/content/learn/tensors/matrix-multiplication/elementwise-vs-matmul.png) + +### โŒ Mistake 2: Wrong shape order + +```python +A = torch.randn(3, 4) +B = torch.randn(5, 3) + +# result = A @ B # Error! 4 โ‰  5 + +# Fix: Either change order or transpose +result = B @ A # Works! 
(5, 3) @ (3, 4) = (5, 4) +``` + +## Key Takeaways + +โœ“ **Dot product:** Multiply pairs, then sum + +โœ“ **Matrix multiply:** Each result position = dot product of row ร— column + +โœ“ **Shape rule:** `(m, n) @ (n, p) = (m, p)` - inner dimensions must match! + +โœ“ **Use @:** For matrix multiplication (not `*`) + +โœ“ **Common in ML:** Input @ Weights = Output + +**Quick Reference:** + +```python +# Dot product (1D ร— 1D) +torch.dot(torch.tensor([1, 2]), torch.tensor([3, 4])) # = 11 + +# Vector @ Matrix (transforms vector) +torch.tensor([1, 2]) @ torch.tensor([[1, 2], [3, 4]]) # = [7, 10] + +# Matrix @ Vector (applies to rows) +torch.tensor([[1, 2], [3, 4]]) @ torch.tensor([1, 2]) # = [5, 11] + +# Matrix @ Matrix (transforms matrix) +torch.tensor([[1, 2], [3, 4]]) @ torch.tensor([[5, 6], [7, 8]]) +# = [[19, 22], [43, 50]] +``` + +**Remember:** Every neural network layer uses matrix multiplication to transform data. You've just learned the most important operation in deep learning! ๐ŸŽ‰ diff --git a/public/content/learn/tensors/matrix-multiplication/neural-network.png b/public/content/learn/tensors/matrix-multiplication/neural-network.png new file mode 100644 index 0000000..fe59f7a Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/neural-network.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/shape-rule.png b/public/content/learn/tensors/matrix-multiplication/shape-rule.png new file mode 100644 index 0000000..2677781 Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/shape-rule.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/simple-matmul.png b/public/content/learn/tensors/matrix-multiplication/simple-matmul.png new file mode 100644 index 0000000..9ff39b0 Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/simple-matmul.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/step-by-step.png b/public/content/learn/tensors/matrix-multiplication/step-by-step.png new file mode 100644 index 0000000..02ffe77 Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/step-by-step.png differ diff --git a/public/content/learn/tensors/matrix-multiplication/vector-matrix.png b/public/content/learn/tensors/matrix-multiplication/vector-matrix.png new file mode 100644 index 0000000..8801543 Binary files /dev/null and b/public/content/learn/tensors/matrix-multiplication/vector-matrix.png differ diff --git a/public/content/learn/tensors/reshaping-tensors/auto-dimension.png b/public/content/learn/tensors/reshaping-tensors/auto-dimension.png new file mode 100644 index 0000000..20e58e5 Binary files /dev/null and b/public/content/learn/tensors/reshaping-tensors/auto-dimension.png differ diff --git a/public/content/learn/tensors/reshaping-tensors/basic-reshape.png b/public/content/learn/tensors/reshaping-tensors/basic-reshape.png new file mode 100644 index 0000000..10dd8f9 Binary files /dev/null and b/public/content/learn/tensors/reshaping-tensors/basic-reshape.png differ diff --git a/public/content/learn/tensors/reshaping-tensors/batch-reshape.png b/public/content/learn/tensors/reshaping-tensors/batch-reshape.png new file mode 100644 index 0000000..023ef7f Binary files /dev/null and b/public/content/learn/tensors/reshaping-tensors/batch-reshape.png differ diff --git a/public/content/learn/tensors/reshaping-tensors/flatten-visual.png b/public/content/learn/tensors/reshaping-tensors/flatten-visual.png new file mode 100644 index 
0000000..8dbdb2c Binary files /dev/null and b/public/content/learn/tensors/reshaping-tensors/flatten-visual.png differ diff --git a/public/content/learn/tensors/reshaping-tensors/reshape-rules.png b/public/content/learn/tensors/reshaping-tensors/reshape-rules.png new file mode 100644 index 0000000..6d2710e Binary files /dev/null and b/public/content/learn/tensors/reshaping-tensors/reshape-rules.png differ diff --git a/public/content/learn/tensors/reshaping-tensors/reshaping-tensors-content.md b/public/content/learn/tensors/reshaping-tensors/reshaping-tensors-content.md new file mode 100644 index 0000000..1ce5e23 --- /dev/null +++ b/public/content/learn/tensors/reshaping-tensors/reshaping-tensors-content.md @@ -0,0 +1,477 @@ +--- +hero: + title: "Reshaping Tensors" + subtitle: "Changing Tensor Dimensions" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 10 min read" +--- + +Reshaping lets you change how data is organized **without changing the actual values**. Same data, different shape! + +## The Basic Idea + +Reshaping reorganizes elements into a new structure. Think of it like rearranging books on shelves - same books, different arrangement! + +![Basic Reshape](/content/learn/tensors/reshaping-tensors/basic-reshape.png) + +**Example:** + +```python +import torch + +# 1D tensor with 6 elements +v = torch.tensor([1, 2, 3, 4, 5, 6]) +print(v.shape) # torch.Size([6]) + +# Reshape to 2D: 2 rows, 3 columns +matrix = v.reshape(2, 3) +print(matrix) +# tensor([[1, 2, 3], +# [4, 5, 6]]) +print(matrix.shape) # torch.Size([2, 3]) +``` + +**What happened:** + +```yaml +Original: [1, 2, 3, 4, 5, 6] โ†’ Shape: (6,) + +Reshaped: [[1, 2, 3], + [4, 5, 6]] โ†’ Shape: (2, 3) + +Same 6 elements, new organization! +``` + +## The Golden Rule + +**Total number of elements must stay the same!** + +```yaml +6 elements can become: +โœ“ (6,) - 1D with 6 elements +โœ“ (2, 3) - 2ร—3 = 6 elements +โœ“ (3, 2) - 3ร—2 = 6 elements +โœ“ (1, 6) - 1ร—6 = 6 elements +โœ— (2, 4) - 2ร—4 = 8 elements (ERROR!) +``` + +## Common Reshape Patterns + +### Pattern 1: 1D โ†’ 2D + +```python +import torch + +v = torch.tensor([1, 2, 3, 4, 5, 6]) + +# Make it 2ร—3 +matrix = v.reshape(2, 3) +print(matrix) +# tensor([[1, 2, 3], +# [4, 5, 6]]) + +# Make it 3ร—2 +matrix = v.reshape(3, 2) +print(matrix) +# tensor([[1, 2], +# [3, 4], +# [5, 6]]) +``` + +### Pattern 2: 2D โ†’ Different 2D + +```python +import torch + +A = torch.tensor([[1, 2, 3], + [4, 5, 6]]) # Shape: (2, 3) + +B = A.reshape(3, 2) +print(B) +# tensor([[1, 2], +# [3, 4], +# [5, 6]]) # Shape: (3, 2) +``` + +## Flattening: Any Dimension โ†’ 1D + +Flattening converts any tensor into a single row: + +![Flatten Visual](/content/learn/tensors/reshaping-tensors/flatten-visual.png) + +**Example:** + +```python +import torch + +matrix = torch.tensor([[1, 2, 3], + [4, 5, 6]]) + +# Method 1: flatten() +flat = matrix.flatten() +print(flat) # tensor([1, 2, 3, 4, 5, 6]) + +# Method 2: reshape(-1) +flat = matrix.reshape(-1) +print(flat) # tensor([1, 2, 3, 4, 5, 6]) + +# Method 3: view(-1) +flat = matrix.view(-1) +print(flat) # tensor([1, 2, 3, 4, 5, 6]) +``` + +**How it reads:** + +```yaml +Matrix: +[[1, 2, 3], + [4, 5, 6]] + +Flattens row by row: +Row 0: [1, 2, 3] +Row 1: [4, 5, 6] + +Result: [1, 2, 3, 4, 5, 6] +``` + +## Using -1: Automatic Dimension + +Use `-1` to let PyTorch figure out one dimension automatically! 
+ +![Auto Dimension](/content/learn/tensors/reshaping-tensors/auto-dimension.png) + +**Example:** + +```python +import torch + +t = torch.arange(12) # [0, 1, 2, ..., 11] - 12 elements + +# You specify columns, PyTorch figures out rows +print(t.reshape(-1, 3)) # (?, 3) โ†’ (4, 3) +# tensor([[ 0, 1, 2], +# [ 3, 4, 5], +# [ 6, 7, 8], +# [ 9, 10, 11]]) + +# You specify rows, PyTorch figures out columns +print(t.reshape(3, -1)) # (3, ?) โ†’ (3, 4) +# tensor([[ 0, 1, 2, 3], +# [ 4, 5, 6, 7], +# [ 8, 9, 10, 11]]) + +# Just -1 means flatten +print(t.reshape(-1)) # (12,) +``` + +**How it works:** + +```yaml +12 elements, reshape(-1, 3): +โ†’ 12 รท 3 = 4 rows +โ†’ Result: (4, 3) + +12 elements, reshape(2, -1): +โ†’ 12 รท 2 = 6 columns +โ†’ Result: (2, 6) +``` + +**Important:** Only ONE -1 allowed per reshape! + +## Squeeze & Unsqueeze + +These add or remove dimensions of size 1: + +![Squeeze Unsqueeze](/content/learn/tensors/reshaping-tensors/squeeze-unsqueeze.png) + +### Unsqueeze: Add a Dimension + +```python +import torch + +v = torch.tensor([1, 2, 3]) # Shape: (3,) + +# Add dimension at position 0 +v_unsqueezed = v.unsqueeze(0) +print(v_unsqueezed.shape) # torch.Size([1, 3]) +print(v_unsqueezed) +# tensor([[1, 2, 3]]) + +# Add dimension at position 1 +v_unsqueezed = v.unsqueeze(1) +print(v_unsqueezed.shape) # torch.Size([3, 1]) +print(v_unsqueezed) +# tensor([[1], +# [2], +# [3]]) +``` + +### Squeeze: Remove Dimensions of Size 1 + +```python +import torch + +t = torch.tensor([[[1, 2, 3]]]) # Shape: (1, 1, 3) + +# Remove all size-1 dimensions +squeezed = t.squeeze() +print(squeezed.shape) # torch.Size([3]) +print(squeezed) # tensor([1, 2, 3]) + +# Remove specific dimension +t2 = torch.randn(1, 5, 1, 3) # Shape: (1, 5, 1, 3) +squeezed = t2.squeeze(0) # Remove dimension 0 +print(squeezed.shape) # torch.Size([5, 1, 3]) +``` + +**When to use:** + +```yaml +Unsqueeze: When you need to match shapes for operations + (3,) + unsqueeze(1) โ†’ (3, 1) for broadcasting + +Squeeze: When you want to remove extra dimensions + (1, 5, 1) + squeeze() โ†’ (5,) cleaner shape +``` + +## Reshape vs View + +Both change shape, but there's a difference: + +```python +import torch + +t = torch.tensor([[1, 2], [3, 4]]) + +# reshape() - always works, may copy data +r = t.reshape(4) # Works! + +# view() - faster but requires contiguous memory +v = t.view(4) # Works if contiguous! +``` + +**Key difference:** + +```yaml +.reshape(): + - Always works + - May create a copy if needed + - Safer choice + +.view(): + - Faster (no copy) + - Only works on contiguous tensors + - May fail with error +``` + +**When to use which:** +- Use `.reshape()` by default (safer) +- Use `.view()` if you know tensor is contiguous and want speed + +## Practical Example: Batch Processing + +![Batch Reshape](/content/learn/tensors/reshaping-tensors/batch-reshape.png) + +```python +import torch + +# 3 images, each 2ร—2 pixels +images = torch.tensor([[[1, 2], [3, 4]], + [[5, 6], [7, 8]], + [[9, 10], [11, 12]]]) + +print(images.shape) # torch.Size([3, 2, 2]) + +# Flatten each image for neural network +batch = images.reshape(3, -1) +print(batch) +# tensor([[ 1, 2, 3, 4], +# [ 5, 6, 7, 8], +# [ 9, 10, 11, 12]]) + +print(batch.shape) # torch.Size([3, 4]) +# 3 samples, 4 features each - ready for neural network! 
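
# Reshaping only reorganizes the values, so we can always go back
# to the original image shape (a quick sanity check):
restored = batch.reshape(3, 2, 2)
print(torch.equal(restored, images))  # True - same data, original shape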
+``` + +**What happened:** + +```yaml +Original: (3, 2, 2) + - 3 images + - Each image is 2ร—2 + +Reshaped: (3, 4) + - 3 samples + - Each sample has 4 features (flattened image) +``` + +## Reshaping Rules + +![Reshape Rules](/content/learn/tensors/reshaping-tensors/reshape-rules.png) + +### โœ“ Valid Reshapes + +```python +# 12 elements can be reshaped many ways +t = torch.arange(12) # 12 elements + +t.reshape(3, 4) # โœ“ 3ร—4 = 12 +t.reshape(2, 6) # โœ“ 2ร—6 = 12 +t.reshape(1, 12) # โœ“ 1ร—12 = 12 +t.reshape(2, 2, 3) # โœ“ 2ร—2ร—3 = 12 +``` + +### โœ— Invalid Reshapes + +```python +t = torch.arange(12) # 12 elements + +# t.reshape(3, 5) # โœ— 3ร—5 = 15 โ‰  12 - ERROR! +# t.reshape(2, 7) # โœ— 2ร—7 = 14 โ‰  12 - ERROR! +``` + +## Real-World Examples + +### Example 1: Preparing Data for Linear Layer + +```python +import torch + +# Batch of 32 images, each 28ร—28 pixels +images = torch.randn(32, 28, 28) + +# Flatten for fully connected layer +flattened = images.reshape(32, -1) +print(flattened.shape) # torch.Size([32, 784]) +# 32 samples, 784 features (28ร—28) + +# Now ready for: output = linear_layer(flattened) +``` + +### Example 2: Converting Model Output + +```python +import torch + +# Model outputs 100 predictions, need 10ร—10 grid +predictions = torch.randn(100) + +# Reshape to grid +grid = predictions.reshape(10, 10) +print(grid.shape) # torch.Size([10, 10]) +``` + +### Example 3: Adding Batch Dimension + +```python +import torch + +# Single sample +sample = torch.randn(28, 28) +print(sample.shape) # torch.Size([28, 28]) + +# Add batch dimension for model +batched = sample.unsqueeze(0) +print(batched.shape) # torch.Size([1, 28, 28]) +# Now it looks like a batch of 1 sample +``` + +## Common Patterns + +### Pattern: Flatten Batch + +```python +batch = torch.randn(32, 3, 224, 224) # 32 images, 3 channels, 224ร—224 +flat = batch.reshape(32, -1) # (32, 150528) +``` + +### Pattern: Split into Batches + +```python +data = torch.arange(100) +batches = data.reshape(10, 10) # 10 batches of 10 samples +``` + +### Pattern: Match Dimensions for Broadcasting + +```python +a = torch.randn(5, 3) # (5, 3) +b = torch.randn(3) # (3,) + +# Add dimension to b for broadcasting +b = b.unsqueeze(0) # (1, 3) +result = a + b # Works! (5, 3) + (1, 3) +``` + +## Common Gotchas + +### โŒ Gotcha 1: Element Count Mismatch + +```python +t = torch.arange(12) # 12 elements + +# This will ERROR! +# t.reshape(3, 5) # 15 โ‰  12 +``` + +### โŒ Gotcha 2: Too Many -1 + +```python +t = torch.arange(12) + +# This will ERROR! +# t.reshape(-1, -1) # Can't infer both dimensions! +``` + +### โŒ Gotcha 3: View on Non-Contiguous Tensor + +```python +t = torch.randn(3, 4) +t_t = t.T # Transpose makes it non-contiguous + +# This might ERROR! +# v = t_t.view(12) + +# Use reshape instead: +r = t_t.reshape(12) # Works! 
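
# Or make a contiguous copy first - then view() works as well:
v = t_t.contiguous().view(12)  # Works!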
+``` + +## Key Takeaways + +โœ“ **Same data, new shape:** Reshaping reorganizes elements without changing values + +โœ“ **Element count must match:** Total elements before = total elements after + +โœ“ **Use -1 for auto:** Let PyTorch figure out one dimension + +โœ“ **Flatten with reshape(-1):** Any tensor โ†’ 1D + +โœ“ **Unsqueeze adds, squeeze removes:** Manage dimensions of size 1 + +โœ“ **reshape() is safer:** Use reshape() by default, view() for speed + +**Quick Reference:** + +```python +# Basic reshape +t.reshape(2, 3) # Specific shape +t.reshape(-1, 3) # Auto rows, 3 columns +t.reshape(-1) # Flatten to 1D + +# Flatten +t.flatten() # Always returns 1D +t.reshape(-1) # Also flattens +t.view(-1) # Flatten (if contiguous) + +# Add/remove dimensions +t.unsqueeze(0) # Add dimension at position 0 +t.unsqueeze(1) # Add dimension at position 1 +t.squeeze() # Remove all size-1 dimensions +t.squeeze(0) # Remove specific dimension + +# Alternative (view is faster but less safe) +t.view(2, 3) # Like reshape, but needs contiguous tensor +``` + +**Remember:** Reshaping doesn't change the data, only how it's organized! ๐ŸŽ‰ diff --git a/public/content/learn/tensors/reshaping-tensors/squeeze-unsqueeze.png b/public/content/learn/tensors/reshaping-tensors/squeeze-unsqueeze.png new file mode 100644 index 0000000..fd2652e Binary files /dev/null and b/public/content/learn/tensors/reshaping-tensors/squeeze-unsqueeze.png differ diff --git a/public/content/learn/tensors/tensor-addition/broadcasting-scalar-vector.png b/public/content/learn/tensors/tensor-addition/broadcasting-scalar-vector.png new file mode 100644 index 0000000..335ac14 Binary files /dev/null and b/public/content/learn/tensors/tensor-addition/broadcasting-scalar-vector.png differ diff --git a/public/content/learn/tensors/tensor-addition/matrix-addition.png b/public/content/learn/tensors/tensor-addition/matrix-addition.png new file mode 100644 index 0000000..2caabe8 Binary files /dev/null and b/public/content/learn/tensors/tensor-addition/matrix-addition.png differ diff --git a/public/content/learn/tensors/tensor-addition/scalar-addition.png b/public/content/learn/tensors/tensor-addition/scalar-addition.png new file mode 100644 index 0000000..9e26f20 Binary files /dev/null and b/public/content/learn/tensors/tensor-addition/scalar-addition.png differ diff --git a/public/content/learn/tensors/tensor-addition/step-by-step-addition.png b/public/content/learn/tensors/tensor-addition/step-by-step-addition.png new file mode 100644 index 0000000..37320f3 Binary files /dev/null and b/public/content/learn/tensors/tensor-addition/step-by-step-addition.png differ diff --git a/public/content/learn/tensors/tensor-addition/tensor-addition-content.md b/public/content/learn/tensors/tensor-addition/tensor-addition-content.md new file mode 100644 index 0000000..7432f00 --- /dev/null +++ b/public/content/learn/tensors/tensor-addition/tensor-addition-content.md @@ -0,0 +1,290 @@ +--- +hero: + title: "Tensor Addition" + subtitle: "Element-wise Operations on Tensors" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 8 min read" +--- + +Tensor addition is one of the most fundamental operations in deep learning. It's simple: **add corresponding elements together**. 
+ +## The Basic Rule + +**When you add two tensors, you add each position separately (element-wise).** + +Think of it like adding two shopping lists item by item: +- First item + First item +- Second item + Second item +- Third item + Third item + +## Scalar Addition + +Adding two single numbers: + +![Scalar Addition](/content/learn/tensors/tensor-addition/scalar-addition.png) + +**Example:** + +```python +import torch + +a = torch.tensor(5) +b = torch.tensor(3) +result = a + b + +print(result) # Output: tensor(8) +``` + +**Manual calculation:** +```yaml +5 + 3 = 8 +``` + +Simple! Just like regular math. + +## Vector Addition + +Adding arrays of numbers, **element by element**: + +![Vector Addition](/content/learn/tensors/tensor-addition/vector-addition.png) + +**Example:** + +```python +import torch + +a = torch.tensor([10, 20, 30]) +b = torch.tensor([5, 15, 25]) +result = a + b + +print(result) # Output: tensor([15, 35, 55]) +``` + +**Manual calculation:** +```yaml +Position 0: 10 + 5 = 15 +Position 1: 20 + 15 = 35 +Position 2: 30 + 25 = 55 + +Result: [15, 35, 55] +``` + +![Step by Step Addition](/content/learn/tensors/tensor-addition/step-by-step-addition.png) + +**Key insight:** Each position is independent. We add `[0]` with `[0]`, `[1]` with `[1]`, `[2]` with `[2]`. + +## Matrix Addition + +Same rule applies to matrices - add corresponding positions: + +![Matrix Addition](/content/learn/tensors/tensor-addition/matrix-addition.png) + +**Example:** + +```python +import torch + +a = torch.tensor([[10, 20, 30], + [15, 25, 35]]) + +b = torch.tensor([[5, 10, 15], + [8, 12, 18]]) + +result = a + b + +print(result) +# Output: +# tensor([[15, 30, 45], +# [23, 37, 53]]) +``` + +**Manual calculation:** +```yaml +Position [0, 0]: 10 + 5 = 15 +Position [0, 1]: 20 + 10 = 30 +Position [0, 2]: 30 + 15 = 45 +Position [1, 0]: 15 + 8 = 23 +Position [1, 1]: 25 + 12 = 37 +Position [1, 2]: 35 + 18 = 53 + +Result: +[[15, 30, 45], + [23, 37, 53]] +``` + +## Broadcasting: Adding a Scalar to a Vector + +What if you want to add a single number to every element in a vector? PyTorch automatically "broadcasts" the scalar: + +![Broadcasting](/content/learn/tensors/tensor-addition/broadcasting-scalar-vector.png) + +**Example:** + +```python +import torch + +vector = torch.tensor([10, 20, 30]) +scalar = 5 + +result = vector + scalar + +print(result) # Output: tensor([15, 25, 35]) +``` + +**What happens behind the scenes:** + +PyTorch automatically expands `5` to `[5, 5, 5]` and then adds: + +```yaml +[10, 20, 30] + 5 + โ†“ +[10, 20, 30] + [5, 5, 5] (automatic broadcast) + โ†“ +[15, 25, 35] +``` + +**Manual calculation:** +```yaml +10 + 5 = 15 +20 + 5 = 25 +30 + 5 = 35 + +Result: [15, 25, 35] +``` + +This works because adding the same number to every position makes sense! + +## Addition Rules + +### Quick Reminder: What is Shape? + +- **Shape** tells you the dimensions and size of your tensor +- Written as `(rows, columns)` for 2D, or `(size,)` for 1D + +**Examples:** +```yaml +5 โ†’ Shape: () (scalar - no dimensions) +[1, 2, 3] โ†’ Shape: (3,) (1D - 3 elements) +[[1, 2], โ†’ Shape: (3, 2) (2D - 3 rows, 2 columns) - last shape number is the inner most tensor dimension + [3, 4], + [5, 6]] +[[[...], โ†’ Shape: (2, 3, 5) (3D - 2 matrices, 3 rows, 5 columns) + [...], + [...]], + [[...], + [...], + [...]]] + +...and so on for higher dimensions +``` + +Now let's use this to understand addition rules! 
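If you'd like to verify these shapes yourself, here is a quick sketch - build each tensor and print its `.shape`:

```python
import torch

scalar = torch.tensor(5)
vector = torch.tensor([1, 2, 3])
matrix = torch.tensor([[1, 2], [3, 4], [5, 6]])

print(scalar.shape)  # torch.Size([])     - no dimensions
print(vector.shape)  # torch.Size([3])    - 3 elements
print(matrix.shape)  # torch.Size([3, 2]) - 3 rows, 2 columns
```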
+ +### โœ“ Rule 1: Same Shapes Work + +Tensors must have the **same shape** to be added: + +```python +a = torch.tensor([1, 2, 3]) # Shape: (3,) +b = torch.tensor([4, 5, 6]) # Shape: (3,) +result = a + b # Works! โœ“ + +print(result) # tensor([5, 7, 9]) +``` + +### โœ“ Rule 2: Broadcasting Works + +A scalar can be added to any tensor: + +```python +a = torch.tensor([1, 2, 3]) # Shape: (3,) +b = 10 # Scalar +result = a + b # Works! โœ“ + +print(result) # tensor([11, 12, 13]) +``` + +### โœ— Rule 3: Different Shapes Don't Work + +You **cannot** add tensors with different shapes: + +```python +a = torch.tensor([1, 2, 3]) # Shape: (3,) +b = torch.tensor([4, 5]) # Shape: (2,) + +# This will cause an ERROR! โœ— +# result = a + b +# RuntimeError: The size of tensor a (3) must match the size of tensor b (2) +``` + +**Why?** PyTorch doesn't know how to match up the elements: +- Should position `[0]` add to `[0]`? Yes. +- Should position `[1]` add to `[1]`? Yes. +- Should position `[2]` add to...? There's no `[2]` in the second tensor! โŒ + +## Real-World Example: Adjusting Image Brightness + +Imagine you have a small 2ร—2 grayscale image (values 0-255): + +```python +import torch + +# Original image (darker) +image = torch.tensor([[100, 150], + [120, 180]], dtype=torch.float32) + +# Make it brighter by adding 50 to all pixels +brightness_increase = 50 +brighter_image = image + brightness_increase + +print("Original image:") +print(image) +# tensor([[100., 150.], +# [120., 180.]]) + +print("\nBrighter image:") +print(brighter_image) +# tensor([[150., 200.], +# [170., 230.]]) +``` + +**Manual calculation:** +```yaml +Original: Add 50: Result: +[[100, 150] + 50 โ†’ [[150, 200] + [120, 180]] [170, 230]] + +Each pixel becomes 50 points brighter! +``` + +This is exactly how image editing software makes images brighter - it adds a value to every pixel. + +## Key Takeaways + +โœ“ **Element-wise:** Addition happens position by position + +โœ“ **Same shapes:** Tensors must have identical shapes (or use broadcasting) + +โœ“ **Broadcasting:** Scalars are automatically added to every element + +โœ“ **Independent:** Each position is added separately - no mixing between positions + +**Quick Reference:** + +```python +# Scalar + Scalar +torch.tensor(5) + torch.tensor(3) # = 8 + +# Vector + Vector (same size) +torch.tensor([1, 2]) + torch.tensor([3, 4]) # = [4, 6] + +# Vector + Scalar (broadcasting) +torch.tensor([1, 2, 3]) + 10 # = [11, 12, 13] + +# Matrix + Matrix (same shape) +torch.tensor([[1, 2], [3, 4]]) + torch.tensor([[5, 6], [7, 8]]) +# = [[6, 8], [10, 12]] +``` + +**Congratulations!** You now understand tensor addition. This same element-wise principle applies to subtraction, multiplication, and division too! 
๐ŸŽ‰ diff --git a/public/content/learn/tensors/tensor-addition/vector-addition.png b/public/content/learn/tensors/tensor-addition/vector-addition.png new file mode 100644 index 0000000..824aaa9 Binary files /dev/null and b/public/content/learn/tensors/tensor-addition/vector-addition.png differ diff --git a/public/content/learn/tensors/transposing-tensors/matrix-transpose.png b/public/content/learn/tensors/transposing-tensors/matrix-transpose.png new file mode 100644 index 0000000..047999e Binary files /dev/null and b/public/content/learn/tensors/transposing-tensors/matrix-transpose.png differ diff --git a/public/content/learn/tensors/transposing-tensors/square-transpose.png b/public/content/learn/tensors/transposing-tensors/square-transpose.png new file mode 100644 index 0000000..eeadb57 Binary files /dev/null and b/public/content/learn/tensors/transposing-tensors/square-transpose.png differ diff --git a/public/content/learn/tensors/transposing-tensors/transpose-detailed.png b/public/content/learn/tensors/transposing-tensors/transpose-detailed.png new file mode 100644 index 0000000..ea15a43 Binary files /dev/null and b/public/content/learn/tensors/transposing-tensors/transpose-detailed.png differ diff --git a/public/content/learn/tensors/transposing-tensors/transposing-tensors-content.md b/public/content/learn/tensors/transposing-tensors/transposing-tensors-content.md new file mode 100644 index 0000000..aad25f3 --- /dev/null +++ b/public/content/learn/tensors/transposing-tensors/transposing-tensors-content.md @@ -0,0 +1,393 @@ +--- +hero: + title: "Transposing Tensors" + subtitle: "Flipping Dimensions and Axes" + tags: + - "๐Ÿ”ข Tensors" + - "โฑ๏ธ 10 min read" +--- + +Transposing is like **flipping** a tensor - rows become columns, and columns become rows. It's simple but incredibly useful! + +## The Basic Idea + +**Transpose = Swap rows and columns** + +Think of it like rotating a table 90 degrees. The first row becomes the first column, the second row becomes the second column, and so on. + +## Vector Transpose + +When you transpose a vector, you change it from horizontal to vertical (or vice versa): + +![Vector Transpose](/content/learn/tensors/transposing-tensors/vector-transpose.png) + +**Example:** + +```python +import torch + +# Horizontal vector (row) +v = torch.tensor([1, 2, 3, 4]) +print(v.shape) # torch.Size([4]) + +# Transpose to vertical (column) +v_t = v.T +print(v_t) +# tensor([[1], +# [2], +# [3], +# [4]]) +print(v_t.shape) # torch.Size([4, 1]) +``` + +**Manual visualization:** + +```yaml +Before: [1, 2, 3, 4] โ†’ Shape: (4,) + +After: [[1], + [2], + [3], + [4]] โ†’ Shape: (4, 1) +``` + +## Matrix Transpose + +This is where transpose really shines! 
Rows become columns, columns become rows:

![Matrix Transpose](/content/learn/tensors/transposing-tensors/matrix-transpose.png)

**Example:**

```python
import torch

# Original matrix: 2 rows, 3 columns
A = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

print(A.shape)  # torch.Size([2, 3])

# Transpose: 3 rows, 2 columns
A_T = A.T

print(A_T)
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])

print(A_T.shape)  # torch.Size([3, 2])
```

**Manual calculation:**

```yaml
Original (2×3):
[[1, 2, 3],
 [4, 5, 6]]

Transpose (3×2):
[[1, 4],   ← First column becomes first row
 [2, 5],   ← Second column becomes second row
 [3, 6]]   ← Third column becomes third row
```

## How Elements Move

Here's exactly what happens to each element during transpose:

![Transpose Detailed](/content/learn/tensors/transposing-tensors/transpose-detailed.png)

**The pattern:** Position `[i, j]` → Position `[j, i]`

**Example tracking specific elements:**

```yaml
Original position → Transposed position

[0, 0]: value 1 → [0, 0]: value 1 (stays in place)
[0, 1]: value 2 → [1, 0]: value 2 (row 0, col 1 → row 1, col 0)
[0, 2]: value 3 → [2, 0]: value 3
[1, 0]: value 4 → [0, 1]: value 4
[1, 1]: value 5 → [1, 1]: value 5 (stays in place)
[1, 2]: value 6 → [2, 1]: value 6
```

**Key rule:** Just swap the two indices! `[i, j]` becomes `[j, i]`

## Square Matrix Transpose

Square matrices (same number of rows and columns) have a special property:

![Square Transpose](/content/learn/tensors/transposing-tensors/square-transpose.png)

**Example:**

```python
import torch

A = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

print(A.shape)  # torch.Size([3, 3])

A_T = A.T
print(A_T)
# tensor([[1, 4, 7],
#         [2, 5, 8],
#         [3, 6, 9]])

print(A_T.shape)  # torch.Size([3, 3])
```

**What happens:**

```yaml
Original:        Transposed:
[[1, 2, 3],      [[1, 4, 7],
 [4, 5, 6],  →    [2, 5, 8],
 [7, 8, 9]]       [3, 6, 9]]

Diagonal (1, 5, 9) stays in place!
Everything else flips across the diagonal.
```

**The diagonal stays put:** Elements where row = column don't move!

## Shape Changes

The shape always flips:

```python
# Examples of shape changes
original_shape = (2, 3)
transposed_shape = (3, 2)

original_shape = (5, 7)
transposed_shape = (7, 5)

original_shape = (4, 4)  # Square
transposed_shape = (4, 4)  # Still square!
```

**Quick reference:**

```yaml
(2, 3) → (3, 2)
(5, 1) → (1, 5)
(10, 20) → (20, 10)
(n, m) → (m, n)  ← General pattern
```

## Why Do We Transpose?

The most common reason: **making shapes compatible for matrix multiplication!**

![Why Transpose](/content/learn/tensors/transposing-tensors/why-transpose.png)

**Example:**

```python
import torch

A = torch.randn(2, 3)  # Shape: (2, 3)
B = torch.randn(4, 3)  # Shape: (4, 3) - same number of columns as A

# Without transpose - doesn't work
# result = A @ B  # Error! (2, 3) @ (4, 3) - inner dimensions 3 ≠ 4

# With transpose - works!
+result = A @ B.T # (2,3) @ (3,4) = (2,4) โœ“ + +print(result.shape) # torch.Size([2, 4]) +``` + +**Real example with actual values:** + +```python +import torch + +# Two data samples with 3 features each +X = torch.tensor([[1.0, 2.0, 3.0], + [4.0, 5.0, 6.0]]) # Shape: (2, 3) + +# Weight matrix: 3 inputs, 2 outputs (we want this orientation) +W = torch.tensor([[0.1, 0.2], + [0.3, 0.4], + [0.5, 0.6]]) # Shape: (3, 2) + +# This works! +output = X @ W # (2, 3) @ (3, 2) = (2, 2) +print(output) +# tensor([[2.2000, 2.8000], +# [4.9000, 6.4000]]) + +# But if W was stored transposed... +W_stored = W.T # Shape: (2, 3) + +# We need to transpose it back +output = X @ W_stored.T # (2, 3) @ (3, 2) = (2, 2) +print(output) # Same result! +``` + +## Practical Examples + +### Example 1: Computing Dot Products + +```python +import torch + +# Two vectors +a = torch.tensor([1, 2, 3]) +b = torch.tensor([4, 5, 6]) + +# Can't use @ directly on 1D tensors for matrix multiply +# But we can reshape and transpose! + +a_col = a.reshape(-1, 1) # Column vector (3, 1) +b_row = b.reshape(1, -1) # Row vector (1, 3) + +# Outer product +outer = a_col @ b_row # (3, 1) @ (1, 3) = (3, 3) +print(outer) +# tensor([[ 4, 5, 6], +# [ 8, 10, 12], +# [12, 15, 18]]) + +# Inner product (dot product) +inner = b_row @ a_col # (1, 3) @ (3, 1) = (1, 1) +print(inner) # tensor([[32]]) +``` + +### Example 2: Batch Matrix Transpose + +```python +import torch + +# Batch of 3 matrices, each 2ร—4 +batch = torch.randn(3, 2, 4) + +# Transpose last two dimensions +batch_T = batch.transpose(-2, -1) # Now (3, 4, 2) + +print(batch.shape) # torch.Size([3, 2, 4]) +print(batch_T.shape) # torch.Size([3, 4, 2]) + +# Each of the 3 matrices got transposed individually! +``` + +### Example 3: Neural Network Weights + +```python +import torch + +# In neural networks, weights are often stored transposed +# for computational efficiency + +batch_size = 32 +input_features = 10 +output_features = 5 + +# Input batch +X = torch.randn(batch_size, input_features) # (32, 10) + +# Weights stored as (input, output) for efficiency +W = torch.randn(input_features, output_features) # (10, 5) + +# Forward pass - works directly! +output = X @ W # (32, 10) @ (10, 5) = (32, 5) โœ“ + +# If weights were stored as (output, input) instead... +W_alt = torch.randn(output_features, input_features) # (5, 10) + +# Need to transpose +output = X @ W_alt.T # (32, 10) @ (10, 5) = (32, 5) โœ“ +``` + +## Common Gotchas + +### โŒ Gotcha 1: 1D Tensors Don't Change Much + +```python +v = torch.tensor([1, 2, 3]) +v_t = v.T + +print(torch.equal(v, v_t)) # True! +# 1D tensors look the same after transpose! +``` + +To actually change a 1D tensor, reshape it first: + +```python +v = torch.tensor([1, 2, 3]) +v_col = v.reshape(-1, 1) # Column vector + +print(v.shape) # torch.Size([3]) +print(v_col.shape) # torch.Size([3, 1]) +``` + +### โŒ Gotcha 2: Transpose Creates a View + +```python +A = torch.tensor([[1, 2], [3, 4]]) +A_T = A.T + +# Modifying A_T also modifies A! 
+A_T[0, 0] = 999 + +print(A) +# tensor([[999, 2], +# [ 3, 4]]) + +# Use .clone() if you want a copy +A_T_copy = A.T.clone() +A_T_copy[0, 0] = 42 +# A is unchanged +``` + +## Key Takeaways + +โœ“ **Transpose swaps rows and columns:** `[i, j]` โ†’ `[j, i]` + +โœ“ **Shape flips:** `(m, n)` โ†’ `(n, m)` + +โœ“ **Main use:** Making shapes compatible for matrix multiplication + +โœ“ **Diagonal stays:** In square matrices, diagonal elements don't move + +โœ“ **Use `.T`:** Simple and clean syntax in PyTorch + +**Quick Reference:** + +```python +# Basic transpose +A = torch.tensor([[1, 2, 3], [4, 5, 6]]) +A_T = A.T # Shape: (2,3) โ†’ (3,2) + +# For 3D+ tensors, specify dimensions +B = torch.randn(5, 10, 20) +B_T = B.transpose(1, 2) # Swap dimensions 1 and 2 +# Shape: (5, 10, 20) โ†’ (5, 20, 10) + +# Transpose last two dimensions (common in batch operations) +C = torch.randn(8, 4, 6) +C_T = C.transpose(-2, -1) # or C.transpose(1, 2) +# Shape: (8, 4, 6) โ†’ (8, 6, 4) +``` + +**Remember:** Transposing is just flipping! Rows โ†’ Columns, Columns โ†’ Rows. That's it! ๐ŸŽ‰ diff --git a/public/content/learn/tensors/transposing-tensors/vector-transpose.png b/public/content/learn/tensors/transposing-tensors/vector-transpose.png new file mode 100644 index 0000000..d079e05 Binary files /dev/null and b/public/content/learn/tensors/transposing-tensors/vector-transpose.png differ diff --git a/public/content/learn/tensors/transposing-tensors/why-transpose.png b/public/content/learn/tensors/transposing-tensors/why-transpose.png new file mode 100644 index 0000000..ad240c1 Binary files /dev/null and b/public/content/learn/tensors/transposing-tensors/why-transpose.png differ diff --git a/public/content/learn/transformer-feedforward/combining-experts/combining-experts-content.md b/public/content/learn/transformer-feedforward/combining-experts/combining-experts-content.md new file mode 100644 index 0000000..1e5e806 --- /dev/null +++ b/public/content/learn/transformer-feedforward/combining-experts/combining-experts-content.md @@ -0,0 +1,63 @@ +--- +hero: + title: "Combining Experts" + subtitle: "Weighted Combination of Expert Outputs" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 8 min read" +--- + +After routing, we combine expert outputs using router weights! + +## Combining Formula + +**Output = ฮฃ (router_weight_i ร— expert_i(x))** + +```python +import torch + +# Router selected experts 2 and 5 with weights +expert_indices = [2, 5] +expert_weights = [0.6, 0.4] + +# Expert outputs +expert_2_output = torch.tensor([1.0, 2.0, 3.0]) +expert_5_output = torch.tensor([4.0, 5.0, 6.0]) + +# Weighted combination +final_output = 0.6 * expert_2_output + 0.4 * expert_5_output +print(final_output) +# tensor([2.2000, 3.2000, 4.2000]) +``` + +## Complete MoE Forward + +```python +def moe_forward(x, experts, router): + # Get routing decisions + weights, indices = router(x, top_k=2) + + # Combine expert outputs + output = torch.zeros_like(x) + + for i in range(len(experts)): + # Mask for tokens using this expert + expert_mask = (indices == i).any(dim=-1) + + if expert_mask.any(): + expert_out = experts[i](x[expert_mask]) + expert_weight = weights[expert_mask][(indices[expert_mask] == i).any(dim=-1)] + output[expert_mask] += expert_weight.unsqueeze(-1) * expert_out + + return output +``` + +## Key Takeaways + +โœ“ **Weighted sum:** Combine based on router weights + +โœ“ **Sparse:** Only use selected experts + +โœ“ **Efficient:** Skip unused experts + +**Remember:** Combining is just weighted averaging! 
๐ŸŽ‰ diff --git a/public/content/learn/transformer-feedforward/moe-in-a-transformer/moe-in-a-transformer-content.md b/public/content/learn/transformer-feedforward/moe-in-a-transformer/moe-in-a-transformer-content.md new file mode 100644 index 0000000..c1209b7 --- /dev/null +++ b/public/content/learn/transformer-feedforward/moe-in-a-transformer/moe-in-a-transformer-content.md @@ -0,0 +1,63 @@ +--- +hero: + title: "MoE in a Transformer" + subtitle: "Integrating Mixture of Experts" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 10 min read" +--- + +MoE replaces the standard FFN in transformer blocks with a sparse expert layer! + +## MoE Transformer Block + +```python +import torch.nn as nn + +class MoETransformerBlock(nn.Module): + def __init__(self, d_model, n_heads, num_experts=8): + super().__init__() + + # Attention (same as standard) + self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True) + + # MoE instead of FFN + self.moe = MixtureOfExperts(d_model, num_experts) + + # Normalization + self.norm1 = nn.LayerNorm(d_model) + self.norm2 = nn.LayerNorm(d_model) + + def forward(self, x): + # Attention + attn_out, _ = self.attention(x, x, x) + x = self.norm1(x + attn_out) + + # MoE (replaces FFN) + moe_out = self.moe(x) + x = self.norm2(x + moe_out) + + return x +``` + +## Key Difference + +```yaml +Standard Transformer: + Attention โ†’ FFN โ†’ Output + +MoE Transformer: + Attention โ†’ MoE โ†’ Output + โ†‘ + (Sparse expert routing) +``` + +## Key Takeaways + +โœ“ **Drop-in replacement:** MoE replaces FFN + +โœ“ **Same interface:** Input/output shapes unchanged + +โœ“ **More capacity:** Many experts, sparse activation + +**Remember:** MoE makes transformers bigger without more compute! ๐ŸŽ‰ diff --git a/public/content/learn/transformer-feedforward/moe-in-code/moe-in-code-content.md b/public/content/learn/transformer-feedforward/moe-in-code/moe-in-code-content.md new file mode 100644 index 0000000..be9f0b0 --- /dev/null +++ b/public/content/learn/transformer-feedforward/moe-in-code/moe-in-code-content.md @@ -0,0 +1,88 @@ +--- +hero: + title: "MoE in Code" + subtitle: "Complete MoE Implementation" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 10 min read" +--- + +Complete, working Mixture of Experts implementation! 
+

## Full MoE Layer

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    def __init__(self, d_model, num_experts=8, top_k=2, d_ff=None):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        if d_ff is None:
            d_ff = 4 * d_model

        # Create experts
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model)
            )
            for _ in range(num_experts)
        ])

        # Router
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        x_flat = x.view(-1, d_model)

        # Route
        router_logits = self.router(x_flat)
        router_probs = F.softmax(router_logits, dim=-1)

        # Top-k
        top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Apply experts
        output = torch.zeros_like(x_flat)

        for expert_idx in range(self.num_experts):
            for k in range(self.top_k):
                # Tokens that picked this expert in routing slot k
                token_mask = (top_k_indices[:, k] == expert_idx)

                if token_mask.any():
                    expert_input = x_flat[token_mask]
                    expert_output = self.experts[expert_idx](expert_input)

                    # Weight by the router probability for this slot
                    output[token_mask] += top_k_probs[token_mask, k].unsqueeze(-1) * expert_output

        output = output.view(batch_size, seq_len, d_model)
        return output

# Test
moe = MixtureOfExperts(d_model=512, num_experts=8, top_k=2)
x = torch.randn(2, 10, 512)
output = moe(x)
print(output.shape)  # torch.Size([2, 10, 512])
```

## Key Takeaways

✓ **Complete implementation:** Runnable end-to-end reference

✓ **Routing:** Each token to top-k experts

✓ **Efficient:** Sparse computation

**Remember:** MoE is routing + expert combination! 🎉 diff --git a/public/content/learn/transformer-feedforward/the-deepseek-mlp/the-deepseek-mlp-content.md b/public/content/learn/transformer-feedforward/the-deepseek-mlp/the-deepseek-mlp-content.md new file mode 100644 index 0000000..67686d0 --- /dev/null +++ b/public/content/learn/transformer-feedforward/the-deepseek-mlp/the-deepseek-mlp-content.md @@ -0,0 +1,83 @@ +--- +hero: + title: "The DeepSeek MLP" + subtitle: "DeepSeek's Efficient MoE Design" + tags: + - "🔀 MoE" + - "⏱️ 10 min read" +---

DeepSeek-MoE uses an efficient MLP design that reduces parameters while maintaining performance!

## DeepSeek MoE Architecture

**Key innovation: Shared expert + Routed experts**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoE(nn.Module):
    def __init__(self, d_model, num_experts=64, top_k=6):
        super().__init__()

        # Shared expert (always active)
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.SiLU(),
            nn.Linear(d_model * 4, d_model)
        )

        # Routed experts
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_model // 4),  # Smaller!
+ nn.SiLU(), + nn.Linear(d_model // 4, d_model) + ) + for _ in range(num_experts) + ]) + + # Router + self.router = nn.Linear(d_model, num_experts) + self.top_k = top_k + + def forward(self, x): + # Shared expert (all tokens) + shared_out = self.shared_expert(x) + + # Route to top-k experts + router_logits = self.router(x) + router_probs = F.softmax(router_logits, dim=-1) + top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1) + + # Combine routed experts + routed_out = self.route_and_combine(x, top_k_probs, top_k_indices) + + # Final output + output = shared_out + routed_out + return output +``` + +## Why It's Efficient + +```yaml +Standard MoE: + 64 experts ร— (d_model โ†’ 4*d_model โ†’ d_model) + = 64 ร— 8dยฒ parameters + +DeepSeek MoE: + 1 shared ร— 8dยฒ parameters + + 64 routed ร— 0.5dยฒ parameters (smaller experts!) + = Much fewer parameters! +``` + +## Key Takeaways + +โœ“ **Shared expert:** Always active for all tokens + +โœ“ **Smaller routed experts:** More efficient + +โœ“ **Better performance:** Despite fewer parameters + +**Remember:** DeepSeek MoE is efficient MoE! ๐ŸŽ‰ diff --git a/public/content/learn/transformer-feedforward/the-expert/the-expert-content.md b/public/content/learn/transformer-feedforward/the-expert/the-expert-content.md new file mode 100644 index 0000000..7e1f7d7 --- /dev/null +++ b/public/content/learn/transformer-feedforward/the-expert/the-expert-content.md @@ -0,0 +1,77 @@ +--- +hero: + title: "The Expert" + subtitle: "Individual Expert Networks in MoE" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 8 min read" +--- + +An expert is a **specialized feedforward network** in the Mixture of Experts architecture! + +## Expert Structure + +```python +import torch +import torch.nn as nn + +class Expert(nn.Module): + def __init__(self, d_model, d_ff): + super().__init__() + self.net = nn.Sequential( + nn.Linear(d_model, d_ff), + nn.SiLU(), # Modern activation + nn.Linear(d_ff, d_model) + ) + + def forward(self, x): + return self.net(x) + +# Create expert +expert = Expert(d_model=512, d_ff=2048) +x = torch.randn(10, 512) +output = expert(x) +print(output.shape) # torch.Size([10, 512]) +``` + +## Multiple Experts + +```python +num_experts = 8 + +experts = nn.ModuleList([ + Expert(d_model=512, d_ff=2048) + for _ in range(num_experts) +]) + +# Each expert specializes in different patterns! +# Expert 0: Maybe handles technical text +# Expert 1: Maybe handles conversational text +# Expert 2: Maybe handles code +# etc. +``` + +## Expert Specialization + +```yaml +During training: + - Router learns which expert for which pattern + - Experts specialize automatically + - No manual assignment needed! + +Result: + - Expert 1: Good at math + - Expert 2: Good at grammar + - Expert 3: Good at facts + - etc. +``` + +## Key Takeaways + +โœ“ **Expert = FFN:** Same structure as standard feedforward + +โœ“ **Specialized:** Each learns different patterns + +โœ“ **Independent:** Trained separately via routing + +**Remember:** Experts are specialized sub-networks! 
๐ŸŽ‰ diff --git a/public/content/learn/transformer-feedforward/the-feedforward-layer/the-feedforward-layer-content.md b/public/content/learn/transformer-feedforward/the-feedforward-layer/the-feedforward-layer-content.md new file mode 100644 index 0000000..bc1d5c4 --- /dev/null +++ b/public/content/learn/transformer-feedforward/the-feedforward-layer/the-feedforward-layer-content.md @@ -0,0 +1,43 @@ +--- +hero: + title: "The Feedforward Layer" + subtitle: "FFN in Transformer Blocks" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 8 min read" +--- + +The feedforward network (FFN) in transformers processes each position independently! + +## Structure + +```python +import torch.nn as nn + +class FeedForward(nn.Module): + def __init__(self, d_model, d_ff, dropout=0.1): + super().__init__() + self.net = nn.Sequential( + nn.Linear(d_model, d_ff), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(d_ff, d_model), + nn.Dropout(dropout) + ) + + def forward(self, x): + return self.net(x) + +# Typical: d_ff = 4 ร— d_model +ffn = FeedForward(d_model=512, d_ff=2048) +``` + +## Key Takeaways + +โœ“ **Two layers:** Expand then compress + +โœ“ **Position-wise:** Same FFN for each position + +โœ“ **Standard ratio:** d_ff = 4 ร— d_model + +**Remember:** FFN adds capacity after attention! ๐ŸŽ‰ diff --git a/public/content/learn/transformer-feedforward/the-gate/the-gate-content.md b/public/content/learn/transformer-feedforward/the-gate/the-gate-content.md new file mode 100644 index 0000000..dd611c1 --- /dev/null +++ b/public/content/learn/transformer-feedforward/the-gate/the-gate-content.md @@ -0,0 +1,56 @@ +--- +hero: + title: "The Gate" + subtitle: "Router Network in Mixture of Experts" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 8 min read" +--- + +The gate (router) decides **which experts each token should use**! + +## Router Implementation + +```python +import torch +import torch.nn as nn +import torch.nn.functional as F + +class Router(nn.Module): + def __init__(self, d_model, num_experts): + super().__init__() + self.gate = nn.Linear(d_model, num_experts) + + def forward(self, x, top_k=2): + # x: (batch, seq, d_model) + + # Compute routing scores + router_logits = self.gate(x) + router_probs = F.softmax(router_logits, dim=-1) + + # Select top-k experts + top_k_probs, top_k_indices = torch.topk(router_probs, top_k, dim=-1) + + # Normalize + top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True) + + return top_k_probs, top_k_indices + +# Use it +router = Router(d_model=512, num_experts=8) +x = torch.randn(2, 10, 512) +probs, indices = router(x, top_k=2) + +print(probs.shape) # torch.Size([2, 10, 2]) +print(indices.shape) # torch.Size([2, 10, 2]) +``` + +## Key Takeaways + +โœ“ **Router:** Selects which experts to use + +โœ“ **Top-K:** Usually top-2 experts per token + +โœ“ **Learnable:** Router weights are trained + +**Remember:** The gate is the traffic controller! 
๐ŸŽ‰ diff --git a/public/content/learn/transformer-feedforward/what-is-mixture-of-experts/moe-routing.png b/public/content/learn/transformer-feedforward/what-is-mixture-of-experts/moe-routing.png new file mode 100644 index 0000000..efc3589 Binary files /dev/null and b/public/content/learn/transformer-feedforward/what-is-mixture-of-experts/moe-routing.png differ diff --git a/public/content/learn/transformer-feedforward/what-is-mixture-of-experts/what-is-mixture-of-experts-content.md b/public/content/learn/transformer-feedforward/what-is-mixture-of-experts/what-is-mixture-of-experts-content.md new file mode 100644 index 0000000..92a71f2 --- /dev/null +++ b/public/content/learn/transformer-feedforward/what-is-mixture-of-experts/what-is-mixture-of-experts-content.md @@ -0,0 +1,107 @@ +--- +hero: + title: "What is Mixture of Experts" + subtitle: "Sparse Expert Models Explained" + tags: + - "๐Ÿ”€ MoE" + - "โฑ๏ธ 10 min read" +--- + +Mixture of Experts (MoE) uses **multiple specialized sub-networks (experts)** and routes inputs to the most relevant ones! + +![MoE Routing](/content/learn/transformer-feedforward/what-is-mixture-of-experts/moe-routing.png) + +## The Core Idea + +Instead of one big feedforward network: +- Have many smaller expert networks +- Route each token to top-K experts +- Combine expert outputs + +```yaml +Traditional FFN: + All tokens โ†’ Same FFN โ†’ Output + +MoE: + Token 1 โ†’ Expert 2 + Expert 5 โ†’ Output + Token 2 โ†’ Expert 1 + Expert 3 โ†’ Output + Token 3 โ†’ Expert 2 + Expert 7 โ†’ Output + +Each token uses different experts! +``` + +## Simple Example + +```python +import torch +import torch.nn as nn + +class SimpleMoE(nn.Module): + def __init__(self, d_model, num_experts=8): + super().__init__() + + # Multiple expert networks + self.experts = nn.ModuleList([ + nn.Sequential( + nn.Linear(d_model, d_model * 4), + nn.ReLU(), + nn.Linear(d_model * 4, d_model) + ) + for _ in range(num_experts) + ]) + + # Router (chooses which experts to use) + self.router = nn.Linear(d_model, num_experts) + + def forward(self, x): + # x: (batch, seq, d_model) + + # Router scores + router_logits = self.router(x) + router_probs = F.softmax(router_logits, dim=-1) + + # Get top-2 experts + top_k_probs, top_k_indices = torch.topk(router_probs, k=2, dim=-1) + + # Route to experts + output = torch.zeros_like(x) + for i in range(len(self.experts)): + # Find tokens routed to this expert + mask = (top_k_indices == i).any(dim=-1) + if mask.any(): + expert_output = self.experts[i](x[mask]) + output[mask] += expert_output * top_k_probs[mask, (top_k_indices[mask] == i).argmax(dim=-1)].unsqueeze(-1) + + return output +``` + +## Why MoE? + +```yaml +Benefits: + โœ“ Huge capacity (many parameters) + โœ“ Efficient (only use few experts per token) + โœ“ Specialization (experts learn different patterns) + +Trade-offs: + โœ— Complex training + โœ— Load balancing needed + โœ— More memory +``` + +## Used In + +- Switch Transformer +- DeepSeek-MoE +- Mixtral +- GPT-4 (rumored) + +## Key Takeaways + +โœ“ **Multiple experts:** Specialized sub-networks + +โœ“ **Sparse routing:** Each token uses few experts + +โœ“ **Scalable:** Add experts without much compute cost + +**Remember:** MoE = specialized experts for different patterns! ๐ŸŽ‰