add att backend #1165

hiworldwzj · 2026-01-05T05:21:26Z

No description provided.

gemini-code-assist · 2026-01-05T05:21:56Z

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural improvement by implementing a modular and extensible attention backend system. It abstracts the underlying attention kernel implementations, allowing for easy integration of various optimized attention methods (like Triton, FA3, or FlashInfer) for both prefill and decode operations. The existing Triton-based ALiBi attention kernels have been refactored into this new common framework, and the core model logic has been updated to utilize this abstraction. This change enhances the flexibility and maintainability of the attention mechanism within the system, preparing it for future performance optimizations and diverse hardware support.

Highlights

Modular Attention Backend System: Introduced abstract base classes (BaseAttBackend, BasePrefillAttState, BaseDecodeAttState) to define a standardized interface for attention operations during prefill and decode phases, enabling future extensibility.
Triton Backend Implementation: Provided a concrete implementation (TritonAttBackend, TritonPrefillAttState, TritonDecodeAttState) for Triton-based attention, specifically for ALiBi-enabled models, moving the logic to a common location.
Refactored Triton Kernels: Moved existing Triton attention kernels (e.g., context_flashattention_nopad, token_flashattention_nopad) from model-specific directories (like Bloom) to a common basemodel/triton_kernel/alibi_att location, promoting reusability across models.
Core Model Integration: Integrated the new attention backend system into the BaseModel and InferStateInfo to dynamically manage prefill and decode attention states, decoupling attention logic from specific kernel implementations.
CLI Argument for Backend Selection: Added new command-line arguments (--llm_prefill_att_backend, --llm_decode_att_backend, --llm_kv_type) to allow users to select different attention backends and KV types, preparing for diverse optimization options.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a valuable abstraction for attention backends, which significantly improves the modularity and extensibility of the codebase. The refactoring of Triton-based alibi attention kernels into a common, reusable component is a commendable architectural enhancement. However, I've identified a critical bug in the new Triton backend implementation where the value tensor (v) is not correctly passed to the attention kernel, which will lead to incorrect outputs. Additionally, I've provided several recommendations to enhance code clarity, consistency, and maintainability, such as using a single attribute for the attention backend, improving type hints, and providing more descriptive error messages.

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/common/basemodel/attention/triton_backend.py

+        token_attention_fwd(
+            q,
+            k,
+            k,


There appears to be a critical bug here. The token_attention_fwd function is being called with k for both the key and value arguments. The v tensor, which is available in this method's scope, is being ignored. This will result in incorrect attention calculations. You should pass the v tensor as the value argument.

Suggested change

k,

v,

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/common/basemodel/attention/base_att.py

+        k: torch.tensor,
+        v: torch.tensor,


For consistency with q and common Python type hinting practices, it's better to use torch.Tensor for type hints instead of torch.tensor. torch.Tensor refers to the tensor class, whereas torch.tensor is a factory function for creating tensors.

Suggested change

k: torch.tensor,

v: torch.tensor,

k: torch.Tensor,

v: torch.Tensor,

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/common/basemodel/attention/base_att.py

+        alloc_func=torch.empty,
+        use_alibi=False,
+    ) -> torch.Tensor:
+        pass


For consistency with prefill_att, this abstract method should also raise a NotImplementedError. This makes it more explicit that subclasses must provide an implementation.

Suggested change

pass

raise NotImplementedError("not impl")

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/common/basemodel/attention/triton_backend.py

+        k: torch.tensor,
+        v: torch.tensor,


The type hints for k and v should be torch.Tensor to be consistent with q and standard typing practices. torch.tensor is a factory function, not the type class.

Suggested change

k: torch.tensor,

v: torch.tensor,

k: torch.Tensor,

v: torch.Tensor,

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/common/basemodel/attention/triton_backend.py

+        if use_alibi:
+            return self._alibi_prefill_att(q=q, k=k, v=v, layer_weight=layer_weight, out=out, alloc_func=alloc_func)
+        else:
+            raise NotImplementedError("error")


The error message "error" is not very descriptive. Consider providing a more informative message to aid in debugging, such as indicating that non-alibi attention is not implemented.

Suggested change

raise NotImplementedError("error")

raise NotImplementedError("Non-alibi attention is not implemented for the Triton backend.")

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/models/bloom/layer_infer/transformer_layer_infer.py

+        kv: torch.Tensor,
+        infer_state: InferStateInfo,
+        layer_weight: BloomTransformerLayerWeight,
+        out=None,


The out parameter is no longer used in this method since the output tensor is now managed by the attention backend. Removing it from the method signature would improve code clarity.

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/models/bloom/layer_infer/transformer_layer_infer.py


    def _token_attention_kernel(
-        self, q, infer_state: InferStateInfo, layer_weight: BloomTransformerLayerWeight, out=None
+        self, q: torch.Tensor, infer_state: InferStateInfo, layer_weight: BloomTransformerLayerWeight, out=None


The out parameter is unused in this function and can be removed from the signature to simplify the code.

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/models/bloom/model.py

+    def _init_att_backend(self):
+        self.prefill_att_backend = TritonAttBackend()
+        self.decode_att_backend = TritonAttBackend()
+        return


This _init_att_backend method overrides the implementation in the base class TpPartBaseModel but is nearly identical (it only lacks the assertion). This override is redundant and can be removed to improve maintainability and reduce code duplication.

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/server/api_cli.py

+    parser.add_argument(
+        "--llm_kv_type",
+        type=str,
+        choices=[None, ""],
+        default=None,
+        help="""kv type used in llm""",


The choices for the --llm_kv_type argument include an empty string "". This is unconventional and could be confusing. If the empty string is not a valid, meaningful option, it should be removed. If it is intentional, adding a comment to the help text explaining its purpose would be beneficial.

gemini-code-assist · 2026-01-05T05:24:01Z

lightllm/server/core/objs/start_args_type.py

    enable_flashinfer_decode: bool = field(default=False)
+    llm_prefill_att_backend: str = field(default=None, metadata={"choices": [None, "triton", "fa3", "flashinfer"]})
+    llm_decode_att_backend: str = field(default=None, metadata={"choices": [None, "triton", "fa3", "flashinfer"]})
+    llm_kv_type: str = field(default=None, metadata={"choices": [None, ""]})


The choices metadata for the llm_kv_type field includes an empty string "". This is unusual and may cause confusion. If "" is not a valid value, consider removing it from the list of choices.

gemini-code-assist bot reviewed Jan 5, 2026

View reviewed changes

hiworldwzj force-pushed the wzj_att branch 2 times, most recently from f714442 to 90b8084 Compare January 6, 2026 07:06

wangzaijun and others added 26 commits January 9, 2026 01:58

add att backend impl

bfea738

fix alibi att backend

5bdb1b3

fix llama triton att backend

5c08346

fix att

56046f0

fix att

2504086

fix att

1268a77

add att_control params

e88f8aa

fix int8kv

a750c1a

fix

96ed733

add new int8kv dequant triton kernel

5fd9e1b

fix int8kv prefill attention kernel

1d4571a

fix unittest

0e628f2

fix int8kv att backend.

c87265c

fix

928caf5

fix diverse att.

b357568

fix

06936fd

add int4kv.

b5b9c39

add int4 kernel

9dcc357

fix all

f40edd3

fix unit test

9f32562

fix unit test

784f233

fix

4d769c2

fix all

e1283ef

fix all

1501f15

add int4kv backend

4904c46

fix

1300773

wangzaijun and others added 27 commits January 9, 2026 02:01

fix

3948f9e

fix

0480621

fix

3456f16

fix

6cdc5d0

fix

ecc5a5c

fix

1b65f22

fix

ee08fdc

fix

c915ae4

fix

dc44255

fix

7e7a1f2

fix

88d9320

fix all

1bb5b0b

remove mode

df7447d

fix

8a3a097

fix

ee32234

fix

39ba5f3

fix mode

df57d95

fix mode.

11b774f

remove max_len_in_batch

80c6f2d

remove max_len_in_batch

73f0165

fix cuda graph.

2910d2b

fix cuda graph

b56a903

fix

c3786bc

fix

c25b032

fix

c59997e

fix

4a29d64

fix

55a277b

hiworldwzj force-pushed the wzj_att branch from e602dcd to 55a277b Compare January 9, 2026 02:02

wangzaijun added 2 commits January 9, 2026 02:35

fix

47f2deb

fix.

aa05013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add att backend #1165

add att backend #1165

hiworldwzj commented Jan 5, 2026

Uh oh!

gemini-code-assist bot commented Jan 5, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

gemini-code-assist bot Jan 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	raise NotImplementedError("error")
	raise NotImplementedError("Non-alibi attention is not implemented for the Triton backend.")

add att backend #1165

Are you sure you want to change the base?

add att backend #1165

Conversation

hiworldwzj commented Jan 5, 2026

Uh oh!

gemini-code-assist bot commented Jan 5, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants