[ROCm] Prefill performance optimization for embedding models #1102
liaocz wants to merge 1 commit into
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/2 · P2/0 · P3/0
Blocking Issues: P1
Checklist Violations (17 fail / 93 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/2 · P2/0 · P3/0
Blocking Issues: P1
Checklist Violations (4 fail / 82 total): General Principles Checklist, RTP-LLM Checklist
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/1 · P3/1
lgtm ready to ci
Non-blocking Suggestions: P2, P3
Checklist Violations (8 fail / 88 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/0 · P3/1
Blocking Issues: P1
Non-blocking Suggestions: P3
Checklist Violations (7 fail / 93 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/0 · P3/0
lgtm ready to ci
Checklist Violations (1 fail / 56 total): General Principles Checklist
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/1 · P3/1
Blocking Issues: P1
Non-blocking Suggestions: P2, P3
Checklist Violations (8 fail / 93 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/0 · P3/0
Blocking Issues: P1
Checklist Violations (6 fail / 88 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/2 · P2/0 · P3/0
Blocking Issues: P1
Checklist Violations (8 fail / 56 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/1 · P3/0
Blocking Issues: P1
Non-blocking Suggestions: P2
Checklist Violations (7 fail / 97 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/3 · P3/1
lgtm ready to ci
Non-blocking Suggestions: P2, P3
Checklist Violations (7 fail / 81 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/2 · P3/0
lgtm ready to ci
Non-blocking Suggestions: P2
Checklist Violations (9 fail / 97 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/2 · P3/0
lgtm ready to ci
Non-blocking Suggestions: P2
Checklist Violations (8 fail / 97 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/1 · P3/0
lgtm ready to ci
Non-blocking Suggestions: P2
Checklist Violations (3 fail / 60 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
q_size = fmha_impl.head_num * fmha_impl.head_dim
cu_seqlens_q = fmha_params.cu_seqlens_q
cu_seqlens_k = fmha_params.cu_seqlens_k
if cu_seqlens_q.device != q.device:
None of this seems reasonable? It should all be handled in the preparation stage.
Fixed; this is now handled in the init stage.
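For readers following this thread: variable-length attention calls such as flash_attn_varlen_func describe sequence boundaries through cumulative sequence lengths (the cu_seqlens_q and cu_seqlens_k values in the diff). A minimal pure-Python sketch of building that layout once, up front, is shown here. It assumes the per-sequence lengths are known at preparation time, and `build_cu_seqlens` is a hypothetical helper, not a function from this PR.

```python
from itertools import accumulate
from typing import List, Sequence


def build_cu_seqlens(seq_lens: Sequence[int]) -> List[int]:
    """Return cumulative offsets with a leading 0 (len(seq_lens) + 1 entries)."""
    if any(n < 0 for n in seq_lens):
        raise ValueError("sequence lengths must be non-negative")
    # Sequence i occupies rows cu[i]:cu[i + 1] of the packed token dimension.
    return [0, *accumulate(seq_lens)]


# Hypothetical batch of three sequences with 3, 5 and 2 tokens.
cu_seqlens = build_cu_seqlens([3, 5, 2])
```

In a real implementation the result would be converted to an int32 tensor on the same device as the query once during preparation, so that the forward pass needs no per-call device check.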
"""
packed_qkv = None

if isinstance(fmha_input, torch.Tensor) and fmha_input.dim() == 2:
These checks also seem unreasonable? The input should only ever be packed QKV.
AI Code Review - PR #1102
Status: LGTM
Summary: P0/0 · P1/0 · P2/1 · P3/2
lgtm ready to ci
Non-blocking Suggestions: P2, P3
Checklist Violations (2 fail / 60 total): Python Static-First Checklist
… prefill

When kv_cache is None (embedding model), skip the fusedQKV transpose kernel and KV cache write by splitting packed QKV directly and calling flash_attn_varlen_func. This avoids unnecessary GPU buffer allocation, transpose, and cache write operations. The optimization is fully contained in the ROCm attention implementation with no changes to model-layer code or C++ kernels.
AI Code Review - PR #1102
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/1 · P3/0
Blocking Issues: P1
Non-blocking Suggestions: P2
Checklist Violations (10 fail / 81 total): General Principles Checklist, RTP-LLM Checklist, Python Static-First Checklist
Skip the fused QKV transpose, KV cache write, and Q buffer allocation for embedding models on ROCm.
Embedding models (task_type != LANGUAGE_MODEL) do not use a KV cache. This change adds an embedding_fast_path in FusedRopeKVCachePrefillOp and AiterPrefillImplAsm that avoids those three steps.
Instead, the in-place RoPE'd packed QKV buffer is returned directly and split in the Python attention layer before the call to flash_attn_varlen_func.
The normal LLM path (kv_cache present) is unaffected.
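The split described above can be illustrated with a small pure-Python sketch. It assumes each token row of the packed buffer is laid out as [Q | K | V], that Q is head_num * head_dim wide (matching the q_size expression in the diff), and that K and V are each kv_head_num * head_dim wide. `qkv_split_sizes`, `split_packed_row`, and `kv_head_num` are hypothetical names used for illustration only and are not taken from the PR.

```python
from typing import List, Sequence, Tuple


def qkv_split_sizes(head_num: int, kv_head_num: int, head_dim: int) -> Tuple[int, int, int]:
    """Widths of the Q, K and V slices inside one packed QKV row."""
    q_size = head_num * head_dim      # same formula as q_size in the diff
    kv_size = kv_head_num * head_dim  # assumption: K and V share this width
    return q_size, kv_size, kv_size


def split_packed_row(
    row: Sequence[float], sizes: Tuple[int, int, int]
) -> Tuple[List[float], List[float], List[float]]:
    """Slice one token's packed [Q | K | V] row into its three parts."""
    q_size, k_size, v_size = sizes
    if len(row) != q_size + k_size + v_size:
        raise ValueError("row width does not match the expected packed QKV layout")
    q = list(row[:q_size])
    k = list(row[q_size:q_size + k_size])
    v = list(row[q_size + k_size:])
    return q, k, v


# Hypothetical shapes: 4 query heads, 2 KV heads, head_dim 8 -> row width 64.
sizes = qkv_split_sizes(head_num=4, kv_head_num=2, head_dim=8)
q, k, v = split_packed_row(list(range(64)), sizes)
```

On a real 2-D tensor of shape (total_tokens, q_size + 2 * kv_size), the same widths could be passed to `torch.split(packed_qkv, list(sizes), dim=-1)`, with each piece then viewed per head before the attention call. The exact reshaping used in the PR is not shown in this conversation.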