
Conversation

@hiworldwzj
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural improvement by implementing a modular and extensible attention backend system. It abstracts the underlying attention kernel implementations, allowing for easy integration of various optimized attention methods (like Triton, FA3, or FlashInfer) for both prefill and decode operations. The existing Triton-based ALiBi attention kernels have been refactored into this new common framework, and the core model logic has been updated to utilize this abstraction. This change enhances the flexibility and maintainability of the attention mechanism within the system, preparing it for future performance optimizations and diverse hardware support.

Highlights

  • Modular Attention Backend System: Introduced abstract base classes (BaseAttBackend, BasePrefillAttState, BaseDecodeAttState) to define a standardized interface for attention operations during the prefill and decode phases, enabling future extensibility (a rough interface sketch follows this list).
  • Triton Backend Implementation: Provided a concrete implementation (TritonAttBackend, TritonPrefillAttState, TritonDecodeAttState) for Triton-based attention, specifically for ALiBi-enabled models, moving the logic to a common location.
  • Refactored Triton Kernels: Moved existing Triton attention kernels (e.g., context_flashattention_nopad, token_flashattention_nopad) from model-specific directories (like Bloom) to a common basemodel/triton_kernel/alibi_att location, promoting reusability across models.
  • Core Model Integration: Integrated the new attention backend system into the BaseModel and InferStateInfo to dynamically manage prefill and decode attention states, decoupling attention logic from specific kernel implementations.
  • CLI Argument for Backend Selection: Added new command-line arguments (--llm_prefill_att_backend, --llm_decode_att_backend, --llm_kv_type) to allow users to select different attention backends and KV types, preparing for diverse optimization options.
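
To make the abstraction more concrete, the following is a minimal illustrative sketch of what such a backend interface could look like. It is not the code from this pull request; the method names and parameters (prefill_att, decode_att, alloc_func, use_alibi) are assumptions pieced together from the review comments below.

import torch


class BaseAttBackend:
    """Sketch of an abstract attention backend interface (illustrative only)."""

    def prefill_att(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        out=None,
        alloc_func=torch.empty,
        use_alibi: bool = False,
    ) -> torch.Tensor:
        # Concrete backends (Triton, FA3, FlashInfer, ...) implement the prefill attention path here.
        raise NotImplementedError("prefill_att must be implemented by a concrete backend")

    def decode_att(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        out=None,
        alloc_func=torch.empty,
        use_alibi: bool = False,
    ) -> torch.Tensor:
        # Decode-phase (one new token per request) attention; same contract as prefill_att.
        raise NotImplementedError("decode_att must be implemented by a concrete backend")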

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a valuable abstraction for attention backends, which significantly improves the modularity and extensibility of the codebase. The refactoring of Triton-based alibi attention kernels into a common, reusable component is a commendable architectural enhancement. However, I've identified a critical bug in the new Triton backend implementation where the value tensor (v) is not correctly passed to the attention kernel, which will lead to incorrect outputs. Additionally, I've provided several recommendations to enhance code clarity, consistency, and maintainability, such as using a single attribute for the attention backend, improving type hints, and providing more descriptive error messages.

token_attention_fwd(
    q,
    k,
    k,

critical

There appears to be a critical bug here. The token_attention_fwd function is being called with k for both the key and value arguments. The v tensor, which is available in this method's scope, is being ignored. This will result in incorrect attention calculations. You should pass the v tensor as the value argument.

Suggested change
- k,
+ v,
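
To illustrate why this is more than a cosmetic issue, here is a small self-contained check (hypothetical test code, not part of this pull request) built on a plain PyTorch reference implementation: substituting k for v changes the attention output, so the bug is straightforward to catch with a regression test.

import torch


def reference_attention(q, k, v):
    # Plain scaled dot-product attention over [batch, heads, seq, head_dim] tensors.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v


q, k, v = (torch.randn(1, 2, 4, 8) for _ in range(3))
correct = reference_attention(q, k, v)
buggy = reference_attention(q, k, k)  # mirrors the call above that passes k as the value tensor
assert not torch.allclose(correct, buggy)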

Comment on lines +51 to +85
k: torch.tensor,
v: torch.tensor,

medium

For consistency with q and common Python type hinting practices, it's better to use torch.Tensor for type hints instead of torch.tensor. torch.Tensor refers to the tensor class, whereas torch.tensor is a factory function for creating tensors.

Suggested change
- k: torch.tensor,
- v: torch.tensor,
+ k: torch.Tensor,
+ v: torch.Tensor,
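
For readers less familiar with the distinction, a short illustrative snippet:

import torch

x = torch.tensor([1.0, 2.0])        # torch.tensor is a factory function that builds a tensor
assert isinstance(x, torch.Tensor)  # torch.Tensor is the class, which is what annotations should name


def scale(t: torch.Tensor, factor: float) -> torch.Tensor:
    return t * factor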

    alloc_func=torch.empty,
    use_alibi=False,
) -> torch.Tensor:
    pass

medium

For consistency with prefill_att, this abstract method should also raise a NotImplementedError. This makes it more explicit that subclasses must provide an implementation.

Suggested change
- pass
+ raise NotImplementedError("not impl")
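
An alternative worth noting (sketched below with a hypothetical class, not the PR's actual code) is to combine the NotImplementedError with abc.abstractmethod, which turns a missing override into an instantiation-time failure rather than a call-time one.

from abc import ABC, abstractmethod

import torch


class AttBackendSketch(ABC):
    # Hypothetical base class for illustration: a subclass that forgets to override
    # decode_att cannot even be instantiated, so the mistake surfaces early.
    @abstractmethod
    def decode_att(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError("decode_att must be implemented by a concrete backend")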

Comment on lines 26 to 27
k: torch.tensor,
v: torch.tensor,

medium

The type hints for k and v should be torch.Tensor to be consistent with q and standard typing practices. torch.tensor is a factory function, not the type class.

Suggested change
- k: torch.tensor,
- v: torch.tensor,
+ k: torch.Tensor,
+ v: torch.Tensor,

if use_alibi:
    return self._alibi_prefill_att(q=q, k=k, v=v, layer_weight=layer_weight, out=out, alloc_func=alloc_func)
else:
    raise NotImplementedError("error")

medium

The error message "error" is not very descriptive. Consider providing a more informative message to aid in debugging, such as indicating that non-alibi attention is not implemented.

Suggested change
- raise NotImplementedError("error")
+ raise NotImplementedError("Non-alibi attention is not implemented for the Triton backend.")

kv: torch.Tensor,
infer_state: InferStateInfo,
layer_weight: BloomTransformerLayerWeight,
out=None,

medium

The out parameter is no longer used in this method since the output tensor is now managed by the attention backend. Removing it from the method signature would improve code clarity.


def _token_attention_kernel(
-     self, q, infer_state: InferStateInfo, layer_weight: BloomTransformerLayerWeight, out=None
+     self, q: torch.Tensor, infer_state: InferStateInfo, layer_weight: BloomTransformerLayerWeight, out=None

medium

The out parameter is unused in this function and can be removed from the signature to simplify the code.

Comment on lines 40 to 43
def _init_att_backend(self):
    self.prefill_att_backend = TritonAttBackend()
    self.decode_att_backend = TritonAttBackend()
    return

medium

This _init_att_backend method overrides the implementation in the base class TpPartBaseModel but is nearly identical (it only lacks the assertion). This override is redundant and can be removed to improve maintainability and reduce code duplication.

Comment on lines 353 to 358
parser.add_argument(
    "--llm_kv_type",
    type=str,
    choices=[None, ""],
    default=None,
    help="""kv type used in llm""",

medium

The choices for the --llm_kv_type argument include an empty string "". This is unconventional and could be confusing. If the empty string is not a valid, meaningful option, it should be removed. If it is intentional, adding a comment to the help text explaining its purpose would be beneficial.
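
One way to apply this, sketched below under the assumption that no concrete kv types are defined yet (the valid values are not visible in this diff), is to drop the placeholder choices entirely and let the help text explain the default behaviour:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--llm_kv_type",
    type=str,
    default=None,
    # Omitting `choices` until real kv types exist avoids the confusing empty-string option;
    # meaningful values can be added to `choices` later.
    help="KV cache type used by the llm attention backends (None selects the backend default).",
)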

enable_flashinfer_decode: bool = field(default=False)
llm_prefill_att_backend: str = field(default=None, metadata={"choices": [None, "triton", "fa3", "flashinfer"]})
llm_decode_att_backend: str = field(default=None, metadata={"choices": [None, "triton", "fa3", "flashinfer"]})
llm_kv_type: str = field(default=None, metadata={"choices": [None, ""]})

medium

The choices metadata for the llm_kv_type field includes an empty string "". This is unusual and may cause confusion. If "" is not a valid value, consider removing it from the list of choices.

@hiworldwzj force-pushed the wzj_att branch 2 times, most recently from f714442 to 90b8084 on January 6, 2026 at 07:06
wangzaijun added 2 commits January 9, 2026 02:35