Hyper-Accel · JaeoneLim · Feb 10, 2026
@@ -64,7 +64,12 @@ As mentioned earlier, TPU was designed specifically for AI operations. The bigge
 
 The TPU executes this matrix multiplication efficiently with a special unit called a systolic array, which general-purpose processors (CPUs) do not have. The term "systolic" derives from "systole," the contraction phase of the heartbeat. Just as the heart beats rhythmically and sends blood to every part of the body, data moves rhythmically and regularly between the computational units of the array while the operations are performed, hence the name. A systolic array optimizes data flow and maximizes parallel processing, which makes it efficient for large-scale operations such as matrix multiplication. The animation below shows how a systolic array performs a matrix multiplication.
 
-![Systolic array visualization](systolic_array.gif)
+{{< rawhtml >}}
+<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
+  <source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
+  Your browser does not support the video tag.
+</video>
+{{< /rawhtml >}}
-{{< rawhtml >}}
-<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
-  <source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
-  Your browser does not support the video tag.
-</video>
-{{< /rawhtml >}}
+[Watch the systolic array animation (MP4)](/posts/TPU-deep-dive/systolic_array.mp4)
 
 Next, to show more concretely why the systolic array is effective, let's compare how a general-purpose processor and the TPU's systolic array each carry out the computation.
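
To make the rhythmic data movement described above concrete, here is a minimal Python sketch of a systolic array multiplying two small matrices. It is an illustration under stated assumptions, not the TPU's actual design: the function name `systolic_matmul` is hypothetical, and the sketch uses an output-stationary dataflow (each processing element keeps one output value), whereas real hardware may use a different dataflow. Rows of `a` enter from the left and columns of `b` enter from the top, each skewed by one cycle per row or column, so matching operands meet at the right processing element on the right cycle.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Returns the product and the number of cycles the simulated array needed.
    """
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"

    acc = [[0] * m for _ in range(n)]    # accumulator held in each processing element (PE)
    a_reg = [[0] * m for _ in range(n)]  # operands flowing left to right
    b_reg = [[0] * m for _ in range(n)]  # operands flowing top to bottom

    cycles = n + m + k - 2  # the last operand pair reaches the bottom-right PE on the final cycle
    for t in range(cycles):
        # Shift every row of a_reg one PE to the right, then feed the left edge.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i  # row i is delayed by i cycles (input skew)
            a_reg[i][0] = a[i][s] if 0 <= s < k else 0

        # Shift every column of b_reg one PE down, then feed the top edge.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j  # column j is delayed by j cycles (input skew)
            b_reg[0][j] = b[s][j] if 0 <= s < k else 0

        # Every PE performs one multiply-accumulate per cycle, all in parallel.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, cycles


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, cycles)
```

Each operand is read from memory once and then reused as it passes from neighbor to neighbor, which is the property that lets the array keep many multiply-accumulate units busy without a matching number of memory accesses.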
 

@@ -61,7 +61,13 @@ To understand the TPU architecture, first, regarding the background of the TPU's development
 
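 To execute this matrix multiplication efficiently, the TPU uses a special unit called a systolic array, which is not found in general-purpose processors (CPUs). "Systolic" comes from "systole," the contraction phase of the heart. The name was inspired by the way data moves rhythmically and regularly between compute units within the array while operations are performed, much as the heart beats regularly and sends blood to each part of the body. A systolic array optimizes data flow and maximizes parallel processing, making it efficient for large-scale operations such as matrix multiplication. The process by which a systolic array performs matrix multiplication is shown in the animation below.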
 
-![Systolic array visualization](systolic_array.gif)
+{{< rawhtml >}}
+<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
+  <source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
-  <source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
+  <source src="systolic_array.mp4" type="video/mp4">
+  Your browser does not support video playback.
+</video>
+{{< /rawhtml >}}
-{{< rawhtml >}}
-<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
-  <source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
-  Your browser does not support video playback.
-</video>
-{{< /rawhtml >}}
+<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
+  <source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
+  Your browser does not support video playback.
+</video>
+
 Next, to explain the effect of the systolic array more concretely, let's compare how a general-purpose processor computes with how the TPU computes using its systolic array.
 
 ![CPU VS TPU](cpuvstpu.webp)

@@ -235,7 +235,7 @@ Then what tasks can LPU proceed faster with? From the perspective of LLM inferen
 
 **Speculative Decoding**
 
-![Speculative decoding](gifs/speculative-decoding.gif)
+![Speculative decoding](images/speculative-decoding-workflow.jpg)
 
 One recent trend in LLM serving is **Speculative Decoding**. As models grow and computation takes longer, a small, fast model (**Draft Model**), either distilled from the existing model (**Target Model**) or trained to behave like it, quickly generates the upcoming part of a sentence in advance, and the Target Model then verifies it in parallel. Groq's LPU clusters can be used to run the small Draft Model here, because the LPU generates tokens very quickly on small models. From an overall perspective, the roles of the LPU and GPU clusters can be divided as follows:
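
The draft-then-verify loop described above can be sketched in a few lines of Python. This is a simplified greedy variant under stated assumptions, not Groq's or any serving framework's implementation: the `Model` type, `speculative_step`, `generate`, and the two toy models are all hypothetical names introduced for illustration. Here a "model" is just a function that maps a token prefix to its single most likely next token, and the verification loop stands in for what would be one batched forward pass of the Target Model on real hardware.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1) The small Draft Model proposes k tokens one after another.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The Target Model checks every proposed position. The checks are
    #    independent of each other, so real systems batch them in parallel.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the target's token at the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    accepted.append(target(ctx))  # every draft token matched: one extra target token
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate tokens with repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models for demonstration only: the target counts upward modulo 10,
# and the draft agrees with it except after token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=8))
```

In this greedy variant the output is identical to what the Target Model would produce on its own; the Draft Model only changes how many tokens are settled per Target Model pass, which is why a fast draft accelerator can help overall latency.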
 

@@ -236,7 +236,7 @@ The LPU's static scheduling is precisely deep learning and LLMs, and among them
 
 **Speculative Decoding**
 
-![Speculative decoding](gifs/speculative-decoding.gif)
+![Speculative decoding](images/speculative-decoding-workflow.jpg)
 
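 One of the recent trends in LLM serving is **Speculative Decoding**. Because computation takes longer as model sizes grow, a small, fast model (**Draft Model**), distilled from the existing model (**Target Model**) or trained to behave similarly, quickly generates the later part of a sentence in advance, and the Target Model verifies it in parallel. Groq's LPU clusters can be used here to run the small Draft Model, because the LPU boasts overwhelming token generation speed on small models. From an overall perspective, the roles of the LPU and GPU clusters can be divided as follows.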