An analysis of aiter, AMD ROCm's kernel library #51
Conversation
This commit introduces a new blog post titled "AITER Analysis: How AMD Doubled ROCm Inference Performance," authored by Minho Park. The post details the architecture and kernel strategies of AITER, highlighting its significant performance enhancements for AMD GPUs. It includes comprehensive benchmarks, an overview of supported operations, and the multi-backend strategy, along with images illustrating the concepts discussed. The content is available in both English and Korean.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a comprehensive blog post on AMD's AI Tensor Engine for ROCm (AITER). The post provides an in-depth analysis of how AITER significantly boosts inference performance on AMD GPUs, covering its architectural design, its diverse kernel backend strategies, and practical integration methods. It highlights the impact of software optimization on hardware performance, aiming to inform the community about AMD's advances in the AI accelerator ecosystem.
Code Review
This pull request introduces new English and Korean markdown posts with a detailed analysis of AMD's AITER (AI Tensor Engine for ROCm). The posts cover its architecture, its multi-backend kernel strategies (Triton, CK, HIP, ASM), its JIT compilation pipeline, and performance benchmarks demonstrating significant inference speedups on AMD GPUs. The review comments point out several style-guide violations and areas for improvement: inconsistent capitalization of technical terms, overly long sentences that hurt readability, frontmatter issues (incorrect image alt text, an improperly formatted summary field, and a missing required description field), and missing language tags in code blocks.
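The multi-backend strategy mentioned above can be pictured as a dispatch function keyed on operator characteristics. This is a hypothetical sketch in plain Python: the four backend names come from the review, but the `pick_backend` function and its selection rules are illustrative assumptions, not AITER's actual heuristics.

```python
# Hypothetical sketch of multi-backend kernel dispatch. The backends
# (Triton, CK, HIP, ASM) are the ones named in the post; the rules
# below are invented for illustration only.

def pick_backend(op: str, head_dim: int = 0) -> str:
    """Choose a kernel backend based on operator characteristics."""
    if op == "gemm":
        return "ck"  # template-based Composable Kernel for dense GEMM
    if op in ("mla_decode", "mha_prefill"):
        # Attention hot paths: hand-tuned assembly when the shape matches
        # a pre-built head_dim variant, otherwise fall back to Triton.
        return "asm" if head_dim in (64, 128) else "triton"
    if op == "fused_moe":
        return "hip"  # custom HIP C++ kernel
    return "triton"   # Triton as the general-purpose fallback

print(pick_backend("gemm"))                      # ck
print(pick_backend("mla_decode", head_dim=128))  # asm
print(pick_backend("softmax"))                   # triton
```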
- Fix frontmatter: correct alt text, convert summary from list to string, add description field
- Split long intro paragraph into shorter sentences for readability
- Add `text` language tags to bare code blocks (CUDA→HIP table, register types, head_dim combinations)
- Fix heading capitalization: quantization → Quantization
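The frontmatter fix in the first bullet might look roughly like this. The field names (summary, description) come from the commit message; the values shown are invented placeholders, not the post's actual text.

```yaml
# Hypothetical "after" state: summary is a plain string rather than a
# YAML list, and the previously missing description field is present.
title: "AITER Analysis: How AMD Doubled ROCm Inference Performance"
summary: "How the AITER kernel library boosts inference throughput on AMD GPUs."
description: "An analysis of AITER's architecture, backends, and benchmarks."
```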
content/posts/rocm-aiter/index.ko.md
> Semi Analysis is a well-known research firm in the semiconductor industry. It runs the [InferenceX](https://inferencex.semianalysis.com) benchmark, which measures and compares the real-world inference performance of major GPUs.
> According to the [InferenceX v2](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs) report published in February 2026, the SGLang performance of AMD's MI300X improved **nearly 2x** between December 2025 and January 2026. At the center of this improvement was a kernel library called **AI Tensor Engine for ROCm (AITER)**.
Since this is early in the post, it would be good to mention in what respect the performance doubled (throughput, latency, ...).
Done! I've made it explicit, both in the intro and in the benchmark table header, that the figures refer to throughput.
> | Block-scale **General Matrix Multiplication (GEMM)** | **2x** |
> | Block-scale Fused **Mixture of Experts (MoE)** | **3x** |
> | MLA Decode | **17x** |
> | **Multi-Head Attention (MHA)** Prefill | **14x** |
Do the 2x, 3x, 17x, and 14x improvements here refer to gains in throughput?
Yes, that's right — they are throughput figures. I've changed the table header from "성능 향상" (performance improvement) to "Throughput 향상" (throughput improvement).
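The "Nx" figures discussed above read as plain throughput ratios: how many times higher the tokens/sec rate is after the change. A minimal sketch, with invented placeholder numbers rather than real benchmark data:

```python
# Illustrative only: a "2x" entry means throughput (e.g. tokens/sec)
# doubled between the two measurements. Numbers below are made up.

def speedup(old_tps: float, new_tps: float) -> float:
    """Throughput ratio: how many times faster the new kernel is."""
    return new_tps / old_tps

print(f"{speedup(1200.0, 2400.0):.0f}x")  # -> 2x
```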
> ### Assembly (ASM)
|  |
Done! I've added a white background to the MLA architecture image (via a `<figure>` tag) so that it remains readable in dark mode as well.
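A minimal sketch of that dark-mode fix, assuming a plain `<figure>` wrapper with an inline white background; the file path and alt text are placeholders, not the post's actual markup:

```html
<!-- Placeholder src/alt. The inline style forces a white backdrop so a
     diagram with a transparent background stays readable in dark mode. -->
<figure style="background-color: #ffffff; padding: 8px;">
  <img src="images/mla-architecture.webp" alt="MLA architecture diagram" />
</figure>
```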
Also, the file content/posts/rocm-aiter/images/aiter-mla-header.webp does not appear to be used anywhere in the post!
- Clarify '2x improvement' refers to throughput (intro + benchmark table)
- Add white background to MLA architecture image for dark mode visibility
- Remove unused aiter-mla-header.webp image
Done!
YoungHoonJun
left a comment
Thanks for the revisions! It's interesting how the library picks among four backends depending on operator characteristics...! I think it's a great post.
Great work, LGTM!
