Merged
7 changes: 6 additions & 1 deletion content/posts/TPU-deep-dive/index.en.md
@@ -64,7 +64,12 @@ As mentioned earlier, TPU was designed specifically for AI operations. The bigge

The TPU uses a special unit called a systolic array, which is not found in general-purpose processors (CPUs), to execute this matrix multiplication efficiently. The term "systolic" is derived from "systole," the contraction phase of the heartbeat. Just as the heart beats rhythmically and sends blood to every part of the body, data moves rhythmically and regularly between the computational units in the array as operations are performed, hence the name. A systolic array optimizes data flow and maximizes parallelism, which makes it efficient for large-scale operations such as matrix multiplication. The animation below visualizes how a systolic array performs matrix multiplication.

-![Systolic array visualization](systolic_array.gif)
+{{< rawhtml >}}
+<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
+<source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
+Your browser does not support the video tag.
+</video>
+{{< /rawhtml >}}
Comment on lines +67 to +72
Copilot AI Feb 10, 2026

{{< rawhtml >}} shortcode is used here, but the repository doesn’t contain a layouts/shortcodes/rawhtml.html (and Goldmark unsafe HTML isn’t enabled in hugo.yaml). As-is, Hugo will fail to render the page or strip the <video> tag. Add the rawhtml shortcode (or enable markup.goldmark.renderer.unsafe: true and embed the <video> directly).

Suggested change
-{{< rawhtml >}}
-<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
-<source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
-Your browser does not support the video tag.
-</video>
-{{< /rawhtml >}}
+[Watch the systolic array animation (MP4)](/posts/TPU-deep-dive/systolic_array.mp4)

Next, to show more concretely why the systolic array is effective, let's compare how a general-purpose processor carries out the computation with how the TPU's systolic array does.
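
As an aside to the systolic-array paragraph in this hunk, the "rhythmic" data movement can be made concrete with a small simulation. The following is a minimal sketch, not code from the post or the PR: it assumes an output-stationary array (each processing element accumulates one output value), which is chosen here only because it is the simplest variant to simulate and is not necessarily the dataflow a real TPU uses. The function name `systolic_matmul` and the register names are hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A x B.

    Assumed layout (illustrative only): PE (i, j) accumulates C[i][j].
    Rows of A enter from the left edge and move right one PE per cycle;
    columns of B enter from the top edge and move down one PE per cycle.
    Row i and column j are delayed by i and j cycles so that matching
    operands A[i][s] and B[s][j] meet in PE (i, j) on the same cycle.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(n)]  # A operand currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B operand currently held by each PE
    cycles = n + m + k - 2               # cycles until the last operands meet

    for t in range(cycles):
        # Shift A operands one PE to the right (right to left, so old values survive).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        # Shift B operands one PE down.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed skewed inputs at the edges; zeros pad before and after the data.
        for i in range(n):
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every PE performs one multiply-accumulate per cycle, all in parallel.
        for i in range(n):
            for j in range(m):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this sketch each operand is read from memory once and then handed from neighbor to neighbor, which is the data-reuse property the paragraph attributes to the systolic array; for the 2x2 example the whole product finishes in `n + m + k - 2 = 4` cycles.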

8 changes: 7 additions & 1 deletion content/posts/TPU-deep-dive/index.ko.md
@@ -61,7 +61,13 @@ TPU 구조를 이해하기 위해서는 먼저 TPU가 개발된 배경에 대해

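To execute this matrix multiplication efficiently, the TPU uses a special unit called a systolic array, which is not found in general-purpose processors (CPUs). "Systolic" comes from "systole," the contraction phase of the heart. The name was inspired by the way data moves rhythmically and regularly between the compute units in the array as operations are performed, much as the heart beats regularly to send blood to each part of the body. A systolic array optimizes data flow and maximizes parallelism, making it efficient for large-scale operations such as matrix multiplication. The animation below shows how a systolic array carries out matrix multiplication.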

-![Systolic array visualization](systolic_array.gif)
+{{< rawhtml >}}
+<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
+<source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
Copilot AI Feb 10, 2026

The <source> uses an absolute path (/posts/TPU-deep-dive/systolic_array.mp4). This is brittle if the site is ever hosted under a subpath or if Hugo’s output path casing differs; it also diverges from the relative asset links used elsewhere in the post bundle. Prefer a relative URL to the page bundle resource (e.g., systolic_array.mp4) or generate the URL via Hugo helpers/shortcode so it’s always correct.

Suggested change
-<source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
+<source src="systolic_array.mp4" type="video/mp4">

+브라우저가 비디오 재생을 지원하지 않습니다.
+</video>
+{{< /rawhtml >}}
Comment on lines +64 to +69
Copilot AI Feb 10, 2026

{{< rawhtml >}} shortcode is used here, but the repository doesn’t contain a layouts/shortcodes/rawhtml.html (and Goldmark unsafe HTML isn’t enabled in hugo.yaml). As-is, Hugo will fail to render the page or strip the <video> tag. Add the rawhtml shortcode (or enable markup.goldmark.renderer.unsafe: true and embed the <video> directly).

Suggested change
-{{< rawhtml >}}
-<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
-<source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
-브라우저가 비디오 재생을 지원하지 않습니다.
-</video>
-{{< /rawhtml >}}
+<video autoplay loop muted playsinline width="100%" style="max-width: 100%; border-radius: 8px;">
+<source src="/posts/TPU-deep-dive/systolic_array.mp4" type="video/mp4">
+브라우저가 비디오 재생을 지원하지 않습니다.
+</video>

Next, to explain the effect of the systolic array more concretely, let's compare how a general-purpose processor performs the computation with how the TPU does it using its systolic array.

![CPU VS TPU](cpuvstpu.webp)
Binary file removed content/posts/TPU-deep-dive/systolic_array.gif
Binary file modified content/posts/how-GPU-works/images/01-gpu-hopper.png
Binary file modified content/posts/how-GPU-works/images/02-rise-of-nvidia.png
Binary file modified content/posts/how-GPU-works/images/04-cui-gui.png
Binary file modified content/posts/how-GPU-works/images/05-3d-games.png
Binary file modified content/posts/how-GPU-works/images/06-graphics-pipeline.png
Binary file modified content/posts/how-GPU-works/images/07-fx-arch.png
Binary file modified content/posts/how-GPU-works/images/08-tesla-arch.png
Binary file modified content/posts/how-GPU-works/images/09-pre-gpgpu.png
Binary file modified content/posts/how-GPU-works/images/10-SL-CUDA.png
Binary file modified content/posts/how-GPU-works/images/11-hopper-full.png
Binary file modified content/posts/how-GPU-works/images/12-hopper-sm.png
Binary file modified content/posts/how-GPU-works/images/13-cuda-pm.png
Binary file modified content/posts/how-GPU-works/images/14-thdblk-alloc.png
Binary file modified content/posts/how-GPU-works/images/15-code-ex.png
Binary file modified content/posts/how-GPU-works/images/16-thread-group.png
Binary file modified content/posts/how-GPU-works/images/17-schd-single.png
Binary file modified content/posts/how-GPU-works/images/18-schd-double.png
Binary file modified content/posts/how-GPU-works/images/19-schd-triple.png
Binary file modified content/posts/how-GPU-works/images/gpu-cuda-logo.png
Binary file modified content/posts/lpu-deep-dive/gifs/gpu_example.gif
Binary file modified content/posts/lpu-deep-dive/gifs/lpu_example.gif
Binary file modified content/posts/lpu-deep-dive/images/agentic_ai.png
Binary file modified content/posts/lpu-deep-dive/images/ai_giga_factory.png
Binary file modified content/posts/lpu-deep-dive/images/all_reduce.png
Binary file modified content/posts/lpu-deep-dive/images/auto_regressive.png
Binary file modified content/posts/lpu-deep-dive/images/dram_vs_sram.png
Binary file modified content/posts/lpu-deep-dive/images/gpu_disaggregation.png
Binary file modified content/posts/lpu-deep-dive/images/gpu_memory_hierarchy.png
Binary file modified content/posts/lpu-deep-dive/images/gpu_sync.jpg
Binary file modified content/posts/lpu-deep-dive/images/groq_logo.jpg
Binary file modified content/posts/lpu-deep-dive/images/remove_ctrl_logic.png
Binary file modified content/posts/lpu-deep-dive/images/roofline_comparison.png
Binary file modified content/posts/lpu-deep-dive/images/roofline_concept.jpg
Binary file modified content/posts/lpu-deep-dive/images/rubin_cpx_platform.png
Binary file modified content/posts/lpu-deep-dive/images/tp&pp.jpg
Binary file modified content/posts/lpu-deep-dive/images/warp_scheduling.png
2 changes: 1 addition & 1 deletion content/posts/lpu-deep-dive/index.en.md
@@ -235,7 +235,7 @@ Then what tasks can LPU proceed faster with? From the perspective of LLM inferen

**Speculative Decoding**

-![Speculative decoding](gifs/speculative-decoding.gif)
+![Speculative decoding](images/speculative-decoding-workflow.jpg)

One recent trend in LLM serving is **Speculative Decoding**. As models grow and each generation step takes longer, a small, fast model (**Draft Model**), either distilled from the existing model (**Target Model**) or trained to behave like it, quickly drafts the next several tokens in advance, and the Target Model then verifies them in parallel. Groq's LPU clusters can be used for the small Draft Model's computation here, because the LPU offers very high token generation speed on small models. From an overall perspective, the roles of the LPU/GPU clusters can be divided as follows:
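
The draft-then-verify loop described in the paragraph above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the post's or Groq's implementation: it uses the greedy variant (a drafted token is accepted only if it equals the Target Model's own greedy choice), it models each "model" as a plain Python function from a token prefix to the next token, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # 1. The Draft Model proposes k tokens one after another (cheap but sequential).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The Target Model checks each proposed position. On real hardware these
    #    checks are a single batched forward pass; here they are a plain loop.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens after prompt using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def toy_target(seq: List[int]) -> int:
    """Toy Target Model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy Draft Model: agrees with the target except right after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


tokens = generate(toy_target, toy_draft, [3], 8)
```

Because a mismatch is always replaced by the Target Model's own token, the output of this greedy variant is identical to decoding with the Target Model alone; the Draft Model only changes how many target passes are needed, which is why a fast draft device can speed up the whole pipeline.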

2 changes: 1 addition & 1 deletion content/posts/lpu-deep-dive/index.ko.md
@@ -236,7 +236,7 @@ LPU의 정적 스케줄링(static scheduling)은 바로 딥러닝과 LLM, 그중

**Speculative Decoding**

-![Speculative decoding](gifs/speculative-decoding.gif)
+![Speculative decoding](images/speculative-decoding-workflow.jpg)

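One recent trend in LLM serving is **Speculative Decoding**. Because computation takes longer as model sizes grow, a small, fast model (**Draft Model**), distilled from the existing model (**Target Model**) or trained to behave similarly, quickly generates the latter part of a sentence in advance, and the Target Model verifies it in parallel. Groq's LPU clusters can be used here for the small Draft Model's computation, because the LPU boasts overwhelming token generation speed on small models. From an overall perspective, the roles of the LPU/GPU clusters can be divided as follows.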
