An analysis of aiter, AMD ROCm's kernel library. #51

Merged
mino-park7 merged 3 commits into main from feat/rocm-aiter
Mar 24, 2026

Conversation

@mino-park7
Contributor

This commit introduces a new blog post titled "AITER Analysis: How AMD Doubled ROCm Inference Performance," authored by Minho Park. The post details the architecture and kernel strategies of AITER, highlighting its significant performance enhancements for AMD GPUs. It includes comprehensive benchmarks, operational support, and a multi-backend strategy, along with various images to illustrate the concepts discussed. The content is available in both English and Korean.

@mino-park7 mino-park7 self-assigned this Mar 17, 2026
@mino-park7 mino-park7 requested a review from a team as a code owner March 17, 2026 04:58
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive blog post that delves into AMD's AI Tensor Engine for ROCm (AITER). The post provides an in-depth analysis of how AITER significantly boosts inference performance on AMD GPUs, covering its architectural design, diverse kernel backend strategies, and practical integration methods. It highlights the impact of software optimization on hardware performance, aiming to inform and educate the community on AMD's advancements in the AI accelerator ecosystem.

Highlights

  • New Blog Post: Introduced a new blog post titled "AITER Analysis: How AMD Doubled ROCm Inference Performance" by Minho Park, available in both English and Korean.
  • AITER Architecture and Performance: The post details the architecture, kernel strategies, and significant performance enhancements of AITER (AI Tensor Engine for ROCm) for AMD GPUs, including comprehensive benchmarks.
  • Multi-Backend Strategy: Explained AITER's multi-backend approach utilizing Triton, Composable Kernel (CK), HIP, and Assembly (ASM) for optimal performance across different operations.
  • JIT Compilation and Integration: Described AITER's JIT compilation pipeline for kernel caching and its seamless integration into inference frameworks like vLLM and SGLang via environment variables.
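As a rough illustration of the environment-variable integration mentioned above, the toggle might look like the sketch below. Hedged: `VLLM_ROCM_USE_AITER` is the switch I believe recent ROCm builds of vLLM check; verify the exact name against your vLLM version's documentation, and the SGLang equivalent is not shown because I am less sure of its name.

```shell
# Sketch only: enabling AITER-backed kernels when serving on ROCm.
# The variable name below is an assumption to verify against your
# vLLM version's docs before relying on it.
export VLLM_ROCM_USE_AITER=1

# Sanity-check that the flag is visible to the serving process.
echo "AITER enabled: ${VLLM_ROCM_USE_AITER}"
```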
Changelog
  • content/posts/rocm-aiter/index.en.md
    • Added a new English blog post analyzing AMD's AITER kernel library.
  • content/posts/rocm-aiter/index.ko.md
    • Added a new Korean blog post analyzing AMD's AITER kernel library.
Activity
  • No human activity has been recorded on this pull request yet.
Kernels run so fast,
Doubling speed, a grand feat,
Code optimized, swift.


Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces new English and Korean markdown posts that provide a detailed analysis of AMD's AITER (AI Tensor Engine for ROCm), covering its architecture, multi-backend kernel strategies (Triton, CK, HIP, ASM), JIT compilation pipeline, and performance benchmarks demonstrating significant inference speedups on AMD GPUs. The review comments point out several style-guide violations and areas for improvement: inconsistent capitalization of technical terms, overly long sentences that hurt readability, frontmatter issues (incorrect image alt text, a summary field formatted as a list instead of a string, and a missing required description field), and missing language tags in code blocks.

- Fix frontmatter: correct alt text, convert summary from list to string, add description field
- Split long intro paragraph into shorter sentences for readability
- Add `text` language tags to bare code blocks (CUDA→HIP table, register types, head_dim combinations)
- Fix heading capitalization: quantization → Quantization
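The frontmatter fixes in the first bullet might look roughly like this. This is a hypothetical sketch: the summary and description strings below are placeholders I made up, not the post's actual frontmatter.

```yaml
# Hypothetical Hugo frontmatter sketch — field values are placeholders.
title: "AITER Analysis: How AMD Doubled ROCm Inference Performance"
# summary as a single string, not a YAML list
summary: "How AMD's AITER kernel library roughly doubled ROCm inference throughput."
# the required description field the review asked for
description: "An analysis of AITER's architecture, kernel backends, JIT pipeline, and benchmarks."
```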
Contributor

@YoungHoonJun left a comment


I really enjoyed reading the post, Minho!


Semi Analysis is a well-known research firm in the semiconductor industry. It runs the [InferenceX](https://inferencex.semianalysis.com) benchmark, which measures and compares the inference performance of major GPUs.

According to the [InferenceX v2](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs) report published in February 2026, the SGLang performance of the AMD MI300X improved by **nearly 2x** between December 2025 and January 2026. At the center of this improvement was a kernel library called the **AI Tensor Engine for ROCm (AITER)**.
Contributor


Since this is early in the post, it would be good to mention in what respect performance doubled (throughput, latency, ...).

Contributor Author


Done! I noted in both the intro and the benchmark table header that the figures refer to throughput.

| Operation | Throughput Improvement |
| --- | --- |
| Block-scale **General Matrix Multiplication (GEMM)** | **2x** |
| Block-scale Fused **Mixture of Experts (MoE)** | **3x** |
| MLA Decode | **17x** |
| **Multi-Head Attention (MHA)** Prefill | **14x** |
Contributor


Just to confirm: do the 2x, 3x, 17x, and 14x improvements here refer to throughput?

Contributor Author


Yes, that's right, they are throughput figures. I changed the table header from 'Performance Improvement' to 'Throughput Improvement'.


### Assembly (ASM)

![MLA layer structure — the key target AITER optimizes with ASM (Source: AMD ROCm Blog)](./images/mla-architecture.png)
Contributor


This image is hard to see in dark mode, as shown below. A white background would help!

(Screenshot: the image rendered in dark mode)

Contributor Author


Done! I wrapped the MLA architecture image in a <figure> tag with a white background so it renders well in dark mode too.
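The fix might look something like the sketch below. Hedged: the inline style is an assumption for illustration, not the post's actual markup.

```html
<!-- White background so the diagram stays legible in dark mode -->
<figure style="background: #ffffff; padding: 1rem;">
  <img src="./images/mla-architecture.png"
       alt="MLA layer structure — the key target AITER optimizes with ASM (Source: AMD ROCm Blog)">
</figure>
```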

@YoungHoonJun
Contributor

Also, the file content/posts/rocm-aiter/images/aiter-mla-header.webp does not appear to be used anywhere in the post!

- Clarify '2x improvement' refers to throughput (intro + benchmark table)
- Add white background to MLA architecture image for dark mode visibility
- Remove unused aiter-mla-header.webp image
@mino-park7
Contributor Author

Done! I deleted the aiter-mla-header.webp file.

@mino-park7 mino-park7 requested review from a team and YoungHoonJun March 23, 2026 06:47
Contributor

@YoungHoonJun left a comment


Thanks for the fixes! It's interesting that AITER picks among four backends depending on operator characteristics...! I think it's a great post.

Great work, LGTM!

@mino-park7 mino-park7 merged commit 949c2cb into main Mar 24, 2026
1 check passed
@mino-park7 mino-park7 deleted the feat/rocm-aiter branch March 24, 2026 00:55