feat: Add backend='auto' to mm_fp4 and enable autotune for backend='cudnn' #1979
Merged
Changes from all commits (19 commits)
b1fc962  Add first draft of mm_fp4 backend  (bkryu)
d71943f  Add second draft of mm_fp4 backend  (bkryu)
e3b68dd  Add third draft of mm_fp4 backend -- no autotune  (bkryu)
34f3eca  Add 4th draft of mm_fp4 backend -- enable cross-backend autotune on a…  (bkryu)
7def078  Add 5th draft of mm_fp4 backend -- enable cudnn autotune  (bkryu)
0a4560d  Refactor test_mm_fp4.py  (bkryu)
7551fde  Address comments  (bkryu)
da049f5  Rebase main  (bkryu)
b685e9e  Cleanup  (bkryu)
aa40f5e  Cleanup  (bkryu)
17a1f28  Correctly apply common check  (bkryu)
b69cadf  Decorator allow cases when no common_check  (bkryu)
8b4bda8  Address comments about code redundancy and cleanliness  (bkryu)
24d17d3  Address Jimmy's comment on moving problem size check to backend-speci…  (bkryu)
837b290  Final cleanup  (bkryu)
72eaf90  Add comments in response to feedback  (bkryu)
0be7217  Don't reinvent get_cuda_version  (bkryu)
2f15fc4  Address comments  (bkryu)
fe2070b  Updated test_decorators to include some heuristic backend  (bkryu)
Conversations
Is this try-except block trying to verify whether each backend supports the input configuration? I'm wondering whether it would be a good idea to expose the _is_problem_size_supported API in the decorator so that we can avoid doing something like this and instead make a direct call to query it, e.g.:
flashinfer.gemm.mm_fp4.is_problem_size_supported(args...)
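To make the two flows concrete, here is a minimal sketch; argument lists are kept abstract rather than guessed, and is_problem_size_supported is the query being proposed here, not an existing function:

```python
import flashinfer

# Current flow: probe support by attempting the call and catching failures.
def supported_via_try(backend, *args, **kwargs):
    try:
        flashinfer.gemm.mm_fp4(*args, backend=backend, **kwargs)
        return True
    except Exception:
        return False

# Proposed flow: ask the op directly, without launching any kernel.
# (Hypothetical API, as suggested in this thread.)
def supported_via_query(backend, *args, **kwargs):
    return flashinfer.gemm.mm_fp4.is_problem_size_supported(
        *args, backend=backend, **kwargs
    )
```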
This is an interesting suggestion, but I also wonder whether it is a common use case. Essentially what we are doing here is reverse-engineering which backends are supported for my inputs.
For a framework running end-to-end inference, this would not be a use case, so I'd say we probably don't need to worry about it. If anything, we should advocate using the "auto" backend and deliver good backend-selection accuracy, which could be done by enabling autotuning.
For a benchmarking script like the current one, I can see the use case for is_problem_size_supported, but I'd claim that the current flow is not too bad, because the exception is raised very early, before any kernels are launched.
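The advocated flow would look roughly like the sketch below; it assumes flashinfer's autotune context manager and the backend="auto" path this PR adds, with mm_fp4's arguments passed through rather than spelled out:

```python
import flashinfer
from flashinfer.autotuner import autotune  # assumed import path

def run(*mm_fp4_args, **mm_fp4_kwargs):
    # Warm-up: let the autotuner profile candidate backends for the
    # real problem sizes; the winning choice is cached.
    with autotune(True):
        flashinfer.gemm.mm_fp4(*mm_fp4_args, backend="auto", **mm_fp4_kwargs)

    # Steady state: replay the cached selection with no tuning overhead.
    return flashinfer.gemm.mm_fp4(*mm_fp4_args, backend="auto", **mm_fp4_kwargs)
```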
Actually, when I designed the decorator I had exactly in mind what Jimmy is nudging at: calling is_problem_size_supported before we even run the operation.
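Roughly, the shape I had in mind is sketched below; all names here are illustrative stand-ins for the discussion, not the decorator as it exists in this PR:

```python
from typing import Callable, Dict, Optional

def backend_requirement(checks: Dict[str, Callable[..., bool]],
                        common_check: Optional[Callable[..., bool]] = None):
    """Illustrative decorator: each backend registers a problem-size check."""
    def wrap(fn):
        def is_problem_size_supported(backend: str, *args, **kwargs) -> bool:
            # Shared constraints first, then the backend-specific check.
            if common_check is not None and not common_check(*args, **kwargs):
                return False
            return checks[backend](*args, **kwargs)

        def runner(*args, backend: str = "auto", **kwargs):
            return fn(*args, backend=backend, **kwargs)

        # Expose the query on the op itself, so callers can ask about
        # support without running anything.
        runner.is_problem_size_supported = is_problem_size_supported
        return runner
    return wrap
```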
I can see the initial appeal of this idea, and being able to call a support check seems nice at first glance, but I am still not able to see the use case.
If we hypothetically supported this, how would we expect frameworks to use it? Would they have a preferred ordering of backends in mind, check them one by one, and pick the first one that succeeds? And then, if no backend is supported, maybe fall back to a non-FlashInfer choice?
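i.e., something like this sketch, where the backend names are illustrative and is_problem_size_supported remains the hypothetical query from above:

```python
import flashinfer

def pick_backend(preferred, *args, **kwargs):
    # Walk the framework's preferred backend order, e.g.
    # ("cudnn", "cutlass"), and pick the first supported one.
    for backend in preferred:
        if flashinfer.gemm.mm_fp4.is_problem_size_supported(
            *args, backend=backend, **kwargs
        ):
            return backend
    return None  # no supported backend: fall back to a non-FlashInfer path
```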
My stance is:
However, if I am missing something and there is a real need for this, I would advocate for a follow-up PR, just to avoid scope creep in this PR. The size of this PR has grown quite a bit since its inception and is becoming a bit daunting.
Yes, that is not a problem.
Just to provide context for the use case -- I think is_problem_size_supported by itself is not as big of a use case, but the complete picture, backend + problem size, is:
A DLFW may not want to use auto, and may want to avoid any checks. They could have determined for themselves which backend is most performant (perhaps using auto initially) and bake in which backends they want. They can then call the API with skip_check=True to avoid the overhead of any checks. CC @elfiegg
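A sketch of that deployment flow; skip_check is the flag proposed here (hypothetical today), the autotune import path is assumed, and "cudnn" stands in for whichever backend the framework bakes in:

```python
import flashinfer
from flashinfer.autotuner import autotune  # assumed import path

def choose_backend_offline(*args, **kwargs):
    # Offline, once: run with backend="auto" under autotuning to learn
    # which backend wins for the framework's shapes.
    with autotune(True):
        flashinfer.gemm.mm_fp4(*args, backend="auto", **kwargs)

def steady_state(*args, **kwargs):
    # Online: bake in the winner; skip_check=True is the proposed flag
    # for skipping per-call support checks (not an existing option).
    return flashinfer.gemm.mm_fp4(*args, backend="cudnn",
                                  skip_check=True, **kwargs)
```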