
Conversation

@cyx-6 (Collaborator) commented Oct 16, 2025

📌 Description

This PR bumps the tvm-ffi version and updates the FlashInfer codebase accordingly.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Fixed TGVGemmRunner output orientation to return results in expected shape without unnecessary transposition.
  • Chores

    • Updated Apache TVM FFI dependency version range from exact pre-release pin to flexible minor version constraint for broader compatibility.

cyx-6 added 7 commits October 16, 2025 20:18

This PR fixes the dev container after flashinfer-ai#1880.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @cyx-6, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on adapting the FlashInfer codebase to a newer version of the tvm-ffi library. This involves a significant refactoring of how TensorView objects are accessed in C++ CUDA kernels, moving from direct member access to more robust method-based interactions. Additionally, the Python binding for a GEMM operation was slightly adjusted for correctness.

Highlights

  • TVM FFI Version Bump: The apache-tvm-ffi dependency has been updated from version 0.1.0b15 to 0.1.0b19 across the project's build configurations and requirements.
  • TensorView API Modernization: The FlashInfer codebase has been updated to reflect changes in the tvm-ffi API, transitioning from direct member access like ->shape[idx] and ->data to method calls such as .size(idx) and .data_ptr() for TensorView objects. This change is pervasive across numerous CUDA C++ source files; a sketch of the pattern follows below.
  • Gemm Output Transposition Adjustment: The tgv_gemm function in flashinfer/gemm.py was modified to return the output tensor directly, removing an unnecessary transposition (.t()) operation.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/nightly-release.yml
    • .github/workflows/release.yml
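
The pattern described in the highlights above can be illustrated with a minimal sketch. Everything below is invented for illustration (function and variable names, the template parameter standing in for tvm-ffi's TensorView, and the device-struct field); only the accessor names (ndim(), size(i), data_ptr(), device()) come from this PR's summaries.

```cpp
// Minimal sketch of the access-pattern migration; not code from the diff.
// TensorT stands in for tvm-ffi's TensorView, assumed to expose
// ndim(), size(i), data_ptr(), and device() as described in this PR.
#include <cuda_runtime.h>
#include <cstdint>

template <typename TensorT>
void launch_example(TensorT q, TensorT o) {
  // Old style (removed by this PR): pointer-like field access.
  //   int64_t num_qo_heads = q->shape[1];
  //   float*  q_data       = static_cast<float*>(q->data);
  //   cudaSetDevice(q->device.device_id);
  // New style: TensorView passed by value and queried through public accessors.
  int64_t num_qo_heads = q.size(1);                  // was q->shape[1]
  int64_t head_dim     = q.size(q.ndim() - 1);       // was q->shape[q->ndim - 1]
  auto* q_data = static_cast<float*>(q.data_ptr());  // was q->data
  auto* o_data = static_cast<float*>(o.data_ptr());  // was o->data
  // Assumption: device() returns a DLDevice-like struct with a device_id field.
  cudaSetDevice(q.device().device_id);
  // ... configure grid/block sizes and launch the CUDA kernel on this device ...
  (void)num_qo_heads; (void)head_dim; (void)q_data; (void)o_data;
}
```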

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request primarily bumps the tvm-ffi version and updates the codebase to align with the new API. The changes are mostly mechanical, such as replacing -> with . for member access and updating method calls like shape to size and data to data_ptr(). The changes are largely correct and consistent. However, I've identified one critical issue in csrc/xqa/utils.cuh where a change was incorrectly applied to a custom struct, which will likely cause a compilation failure. Please see the specific comment for details.
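
To make the flagged failure mode concrete, here is a purely hypothetical sketch; the struct and function below are invented, and the actual code in csrc/xqa/utils.cuh is not reproduced here. The point is that a blanket shape-to-size() rewrite only works on types that actually provide the new accessors.

```cpp
// Hypothetical illustration of why a mechanical rewrite can break on custom types.
struct MyDims {        // a plain helper struct, not a tvm-ffi TensorView
  long shape[3];
  void* data;
};

long num_rows(const MyDims& d) {
  // A blanket "shape[i] -> size(i)" rewrite would turn this into `d.size(0)`,
  // which fails to compile because MyDims has no size() member.
  return d.shape[0];   // field access must stay as-is for custom structs
}
```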

@yzh119 (Collaborator) commented Oct 18, 2025

Let's defer it till the stable release of tvm-ffi

@yzh119 (Collaborator) commented Oct 21, 2025

@cyx-6 would you mind bumping to v0.1.0 stable release?

@coderabbitai bot (Contributor) commented Oct 21, 2025

Caution: Review failed. Failed to post review comments.

Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

A large-scale refactor updates tensor access patterns across CUDA and Python code from pointer-based field syntax (->data, ->shape) to accessor methods (data_ptr(), size()). Additionally, the apache-tvm-ffi dependency constraint is relaxed from an exact pre-release pin (0.1.0b15) to a flexible version range (>=0.1, <0.2).

Changes

  • Dependency Version Constraints (.github/workflows/nightly-release.yml, .github/workflows/release.yml, pyproject.toml, requirements.txt, flashinfer-cubin/pyproject.toml, flashinfer-jit-cache/pyproject.toml): Updated apache-tvm-ffi from the exact pre-release pin ==0.1.0b15 to the semantic range >=0.1,<0.2, relaxing version constraints.
  • CUDA Attention & Decode Kernels (csrc/batch_attention.cu, csrc/batch_decode.cu, csrc/batch_decode_mla_*.cu, csrc/single_decode.cu): Replaced pointer-based tensor access (->data, ->shape[i], ->device) with accessor methods (data_ptr(), size(i), device()). Updated stream retrieval and device setup.
  • CUDA Prefill & MLA Kernels (csrc/batch_prefill*.cu, csrc/batch_mla*.cu, csrc/single_prefill*.cu): Migrated tensor field access from pointer syntax to value-based accessors (size(), stride(), data_ptr()), updated device/stream handling, and adjusted parameter passing to kernel invocations.
  • CUDA GEMM & Matrix Operations (csrc/gemm_groupwise_sm*.cu, csrc/group_gemm*.cu, csrc/tgv_gemm.cu, csrc/bmm_fp8.cu, csrc/fp4_gemm_cutlass*.cu, csrc/fp8_gemm_cutlass.cu): Replaced shape indexing with size() calls, device field access with device() methods, and raw data pointers with data_ptr() throughout matrix operation kernels.
  • CUDA Utility Kernels (csrc/norm.cu, csrc/rope.cu, csrc/sampling.cu, csrc/renorm.cu, csrc/quantization.cu, csrc/cascade.cu, csrc/page.cu, csrc/pod.cu): Updated tensor accessors across normalization, rotary position embedding, sampling, and utility kernels to use size(), stride(), data_ptr(), and device() instead of raw pointer fields.
  • CUDA Kernel Launchers (FMHA/SDPA) (csrc/blackwell_fmha_plan.cu, csrc/cudnn_sdpa_kernel_launcher.cu, csrc/fmha_cutlass_sm100.cu): Migrated pointer-based tensor access to accessor methods for tensor metadata, device setup, and kernel parameter passing.
  • CUDA MoE & Fusion Operations (csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu, csrc/trtllm_fused_moe_kernel_launcher.cu): Updated MoE kernel launchers to use size(), dtype(), data_ptr(), and device() accessors for input validation and kernel invocation.
  • TensorRT-LLM CUDA Bindings (csrc/trtllm_*.cu, csrc/nv_internal/tensorrt_llm/thop/*.cpp): Replaced pointer-based tensor field access with value-based accessors across allreduce, FMHA, GEMM runners, and quantization operations for the TensorRT-LLM integration.
  • CUDA XQA Kernels (csrc/xqa/xqa_wrapper.cu, csrc/cutlass_mla.cu): Updated tensor data and device access to use data_ptr() and device() instead of pointer fields in XQA attention kernels.
  • TVM FFI Utilities & Headers (csrc/tvm_ffi_utils.h, csrc/batch_mla_config.jinja): Updated accessor macros and template code to use the public tensor API (ndim(), size(i), stride(), dtype(), data_ptr(), device()) instead of direct member access; a rough illustrative helper follows this list.
  • CUDA NVSHMEM & vLLM Bindings (csrc/nvshmem_binding.cu, csrc/vllm_custom_all_reduce.cu): Migrated tensor access in communication and allreduce kernels from pointer members to accessor methods.
  • Python Tensor Access (flashinfer/jit/attention/utils.py, flashinfer/jit/activation.py): Updated generated code templates and Python wrapper code to use data_ptr() and size() accessors for tensor data retrieval.
  • Functional Change (flashinfer/gemm.py): Modified TGVGemmRunner.forward() to return the output tensor c directly instead of c.t(), changing the output orientation from transposed to native.
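
As a rough illustration of what the TVM FFI Utilities cohort implies, a validation helper written against the public accessors might look like the following. This is not the actual macro or helper in csrc/tvm_ffi_utils.h; the name, signature, and error format are invented, and only the accessor names come from the summary above.

```cpp
// Hypothetical shape-check helper using only the public accessors listed above.
#include <cstdint>
#include <sstream>
#include <stdexcept>

template <typename TensorLike>
void check_2d_shape(const TensorLike& t, int64_t rows, int64_t cols, const char* name) {
  // ndim()/size(i) replace the old direct reads of ->ndim / ->shape[i].
  if (t.ndim() != 2 || t.size(0) != rows || t.size(1) != cols) {
    std::ostringstream oss;
    oss << name << ": expected a (" << rows << ", " << cols << ") tensor, got ndim = "
        << t.ndim();
    throw std::runtime_error(oss.str());
  }
}
```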

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Rationale: While the diff touches 70+ files, the changes are highly repetitive and follow consistent patterns: systematic replacement of pointer-based tensor access (->shape[i], ->data, ->device) with accessor methods (size(i), data_ptr(), device()). The uniformity significantly reduces cognitive load: a reviewer can validate the pattern once and apply it across files. However, the CUDA-heavy nature demands careful attention to pointer semantics, correct accessor usage, and proper device/stream management. The presence of optional value handling (value().data_ptr()) and occasional stride calculations (stride(n)) introduces minor complexity. The single functional change in gemm.py and the dependency updates are straightforward.
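
For the optional-value handling mentioned in the rationale, a minimal hedged sketch follows. The wrapper type and function are invented; the only assumption taken from this walkthrough is that optional tensor arguments are unwrapped with value() before calling data_ptr().

```cpp
// Hypothetical handling of an optional tensor argument; not copied from the diff.
// OptTensor stands in for an Optional<TensorView>-like wrapper assumed to expose
// has_value() and value(), with value() returning something that has data_ptr().
template <typename OptTensor>
float* maybe_bias_ptr(const OptTensor& maybe_bias) {
  if (!maybe_bias.has_value()) return nullptr;  // no bias supplied
  return static_cast<float*>(maybe_bias.value().data_ptr());
}
```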

Poem

🐰 Hopping through pointers with glee,
We've migrated to .data_ptr()!
No more ->shape[0], just size(0) please,
These accessors make refactors a breeze.
From 0.1.0b15 to ranges so wide,
Our tensor APIs now have modern pride. 🚀

Pre-merge checks and finishing touches

❌ Failed checks (3 warnings)
  • Title Check (⚠️ Warning): The title "Bump tvm ffi version" is specific and directly references a core aspect of the changeset: the apache-tvm-ffi dependency update from ==0.1.0b15 to >=0.1,<0.2. However, the changeset encompasses substantially more than version bumping. The overwhelming majority of changes involve systematic refactoring of CUDA and Python code to migrate from pointer-based tensor field access (->data, ->shape, ->device) to modern accessor APIs (data_ptr(), size(), device(), stride(), dtype()). The title captures the dependency change but obscures the primary implementation work required for API compatibility.
  • Description Check (⚠️ Warning): The PR description is largely incomplete against the provided template. While the 📌 Description section is present, it contains only a single minimal sentence ("This PR bumps the tvm-ffi version and update the flashinfer code base.") without explaining what specific changes were made, why they are necessary, or their impact. The 🔍 Related Issues section is entirely empty, despite PR comments from yzh119 requesting a bump to the v0.1.0 stable release, which appears to be relevant context. All items in the 🚀 Pull Request Checklist are unchecked, indicating pre-commit hooks may not have been run and tests may not have been verified. Resolution: Expand the Description section to detail the scope of changes (e.g., mention the widespread tensor accessor API migration), link any related issues from the PR discussion (particularly the stable-release request), and verify the pre-commit and test checklist items before merging. A more complete description would help reviewers quickly understand both the dependency update and the extensive codebase refactoring required.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 17.04%, which is insufficient; the required threshold is 80.00%. Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.

@cyx-6 (Collaborator, Author) commented Oct 21, 2025

Closed in favor of #1960.

@cyx-6 closed this Oct 21, 2025
yzh119 added a commit that referenced this pull request Oct 24, 2025

## 📌 Description

This PR bumps tvm-ffi to the stable version 0.1.0 and updates the FlashInfer codebase accordingly.

## 🔍 Related Issues

#1939

## Summary by CodeRabbit

* **Chores**
  * Relaxed build dependency pins for apache-tvm-ffi and setuptools across project configs; removed installation of multiple build packages from the nightly CI step.
* **Refactor**
  * Modernized internal CUDA/tensor access patterns to a consistent accessor API across many modules.
* **Bug Fixes**
  * GEMM runner now returns the output tensor in the correct (non-transposed) orientation.

---------

Co-authored-by: Zihao Ye <[email protected]>
Co-authored-by: yzh119 <[email protected]>