[webgpu] Register GQA based on graph capture #26384

qjia7 · 2025-10-22T09:58:06Z

This pull request conditionally registers GQA with total_sequence_length either on the GPU or not, depending on whether graph capture is enabled. It resolves the issue where a MemcpyToHost is generated when graph capture is enabled (see #25868). This is the last piece of functionality needed to support graph capture in the WebGPU EP in ORT.
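The conditional registration can be pictured roughly as follows. This is a simplified Python sketch, not the ORT C++ code: `build_gqa_kernel_def`, the dictionary layout, and the input index constant are hypothetical. They only illustrate the idea that the total_sequence_length input is requested in CPU memory unless graph capture is enabled.

```python
# Assumed position of total_sequence_length among the GQA inputs (illustrative).
TOTAL_SEQUENCE_LENGTH_INPUT = 6


def build_gqa_kernel_def(graph_capture_enabled: bool) -> dict:
    """Describe a GQA kernel registration (hypothetical structure)."""
    kernel_def = {"op": "GroupQueryAttention", "cpu_inputs": []}
    if not graph_capture_enabled:
        # Without graph capture the kernel reads the value on the host,
        # so the input is requested in CPU memory.
        kernel_def["cpu_inputs"].append(TOTAL_SEQUENCE_LENGTH_INPUT)
    # With graph capture the input stays on the GPU, so no device-to-host
    # copy has to be inserted in front of the node.
    return kernel_def
```

Under these assumptions, `build_gqa_kernel_def(True)` lists no CPU inputs, which is the property that avoids the extra copy node.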

The main changes ensure that when graph capture is enabled, sequence length information is read from GPU buffers instead of CPU memory, and shader code generation adapts accordingly. This enables more efficient execution and compatibility with graph-captured models.
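As a rough illustration of how shader generation might adapt (not the actual ORT code), the Python sketch below emits a different WGSL line depending on whether graph capture is enabled. The function name `total_seq_len_wgsl`, the uniform `uniforms.total_sequence_length`, and the `seqlen_k` binding as used in the WGSL are assumptions made for the example, as is the convention that `seqlen_k` stores the total sequence length minus one.

```python
def total_seq_len_wgsl(graph_capture_enabled: bool) -> str:
    """Return a WGSL line that defines total_sequence_length.

    Hypothetical helper: all WGSL identifiers are illustrative, not ORT's.
    """
    if graph_capture_enabled:
        # Read the value from a GPU storage buffer when the shader runs, so
        # the host never needs to copy it back.
        # Assumption: seqlen_k holds (total sequence length - 1).
        return "let total_sequence_length = u32(seqlen_k[0]) + 1u;"
    # Otherwise the host already knows the value and passes it as a uniform.
    return "let total_sequence_length = uniforms.total_sequence_length;"
```

The design point is that the branch is taken at shader-generation time, so the generated shader for the graph-capture path contains no dependency on a host-side value.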

In this PR, we still get the total sequence length from the seqlen_k tensor rather than from total_seqlen_tensor, to stay consistent with other parts of the code. In the next PR, we can refactor all places to use total_seqlen_tensor directly instead of seqlen_k when graph capture is enabled.

[webgpu] Register GQA based on graph capture

d08c52c

qjia7 requested review from fs-eire, guschmue and sushraja-msft October 22, 2025 10:13

guschmue added the ep:WebGPU ort-web webgpu provider label Oct 22, 2025
