doc(kernels): update kernels integration documentation #42277
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LysandreJik left a comment
It's a good start! I'm inviting @MekkCyber and @danieldk to also contribute here wrt the overall integration
Small additional comment: imo we should try to have a good, coherent doc here, and cross link it everywhere else where it makes sense (guides/docs like performance, optimization, etc)
> ## Supported Operations
>
> The `kernels` library provides optimized implementations for:
Theoretically we'll go in a direction where most layers can be replaced by kernels, so this list is bound to be outdated quite quickly
cc @MekkCyber and @danieldk, IMO spending some time on this doc to really showcase the kernels integration in transformers would be really worthwhile
> ## Important Notes
>
> - **No Unkernelization**: Once kernels are enabled, they cannot be disabled during the session
It should be possible to support this. In vanilla kernels, you can kernelize with an empty mapping and the model will use the original implementations again.
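A rough sketch of what that could look like, assuming the `use_kernel_mapping` context manager from the `kernels` package (the exact keyword arguments here are an assumption on my part):

```python
from kernels import kernelize, use_kernel_mapping

# Sketch only: re-kernelizing while an empty mapping is active leaves no
# layers mapped to Hub kernels, restoring the original PyTorch forwards.
# `inherit_mapping=False` is assumed so the empty mapping fully replaces
# the registered one instead of extending it.
with use_kernel_mapping({}, inherit_mapping=False):
    model = kernelize(model)
```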
> - **No Unkernelization**: Once kernels are enabled, they cannot be disabled during the session
> - **Lazy Loading**: Kernels are downloaded and cached only when needed
> - **Backward Compatibility**: Models work identically with or without kernels enabled
> - **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs
Maybe worth mentioning that some kernels require specific compute capabilities (e.g. Hopper/Blackwell)?
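For reference, readers can check what their GPU reports with plain PyTorch (illustrative snippet, not part of the PR; Hopper reports compute capability 9.0, Blackwell 10.0 or higher):

```python
import torch

# Some Hub kernels only ship binaries for newer architectures
# (e.g. Hopper = compute capability 9.0, Blackwell = 10.0+).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
```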
> - **Kernels Library**: [github.com/huggingface/kernels](https://github.com/huggingface/kernels)
> - **Community Kernels**: [huggingface.co/kernels-community](https://huggingface.co/kernels-community)
> - **API Reference**: See `KernelConfig` documentation for advanced configuration options
Maybe also worth linking out to https://huggingface.co/docs/kernels/index and perhaps kernel-builder?
MekkCyber left a comment
Thanks a lot @mfuntowicz ! This was very much needed
> **Popular Kernel Repositories:**
> - [`kernels-community/flash-attn`](https://huggingface.co/kernels-community/flash-attn2) - Flash attention implementations
we can add vllm-flash-attn3 too I think
> - **kernels** package: `pip install kernels`
> - **Recommended**: `kernels>=0.10.2` for XPU support
nit: might be understood incorrectly as meaning the specified version is for XPU support only
Suggested change:

> - **kernels** package: `pip install kernels`
> - **Recommended**: Use `kernels>=0.10.2` to ensure support for all backends.
> ### Built-in Kernels
>
> Transformers includes built-in CUDA kernels for specific models:
>
> - **Falcon Mamba**: Selective scan operations with layer normalization fusion
>   - Located in: `transformers.kernels.falcon_mamba`
This will be removed here: #41664
> - **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs
Suggested change:

> - **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs; and Metal kernels require Apple Silicon devices
> model = AutoModelForCausalLM.from_pretrained(
>     "meta-llama/Llama-3.2-1B-Instruct",
>     attn_implementation="kernels-community/flash-attn",
>     device_map="cuda"
> )
It was renamed recently
Suggested change:

> model = AutoModelForCausalLM.from_pretrained(
>     "meta-llama/Llama-3.2-1B-Instruct",
>     attn_implementation="kernels-community/flash-attn2",
>     device_map="cuda"
> )
> 3. **Replaces forward methods**: Swaps standard PyTorch operations with optimized kernels
> 4. **Maintains compatibility**: Ensures identical outputs while improving performance
not really sure about the 4th point, since with kernels comes some non-determinism
Co-authored-by: Mohamed Mekkouri <[email protected]>
stevhliu left a comment
Thanks for kicking this off, I think we can polish this a bit further!
- the Key Benefits and Supported Operations list interrupts the narrative flow and gets in the way of the learning path. I think this may be better at the end in a Reference section or even a link to where users can find all the supported ops (may not scale well when the list grows)
- it'd be nice to integrate Key Benefits into the intro paragraph of the doc to emphasize its benefits, instead of a list
- it would also make more sense to move the Requirements higher up so users know upfront
- I think it'd flow better if we string the Advanced Features together with the content in the Quick Start. Currently it feels a bit disjointed and doesn't really build off of or connect to what comes before
- may be useful to include an example that shows you how to find out which kernels are loaded
- maybe organize it like this:

# Kernels
intro paragraph about what it is and the key benefits
requirements and installation
## Enabling kernels
different ways of specifying kernels in `from_pretrained`
`use_kernels=True`
`attn_implementation` for attention kernels
`KernelConfig` for device-specific kernels
## Mode configuration
switching between inference, training, and torch.compile kernels
## Automatic kernel replacement
explanation about how it works
> ### Basic Usage
>
> Enable kernels when loading any model:
Clarify that this means the optimized kernels are automatically pulled in and used
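Something like the following could make that explicit (a sketch reusing the model id quoted elsewhere in this PR and the `use_kernels=True` flag mentioned later in this review):

```python
from transformers import AutoModelForCausalLM

# With use_kernels=True, matching optimized kernels are downloaded from the
# Hub on first use, cached, and swapped in automatically for supported layers.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    device_map="cuda",
)
```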
> ### Kernel Sources
>
> Kernels are distributed through Hugging Face Hub repositories in the format:
It'd be nice to add some examples showing these different formats.
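For instance, the repositories already referenced in this thread follow the `<org>/<repo>` pattern; a short illustration (repo names taken from the comments above, not an exhaustive list):

```python
from transformers import AutoModelForCausalLM

# Kernel repos are plain Hub repository ids, e.g.:
#   "kernels-community/flash-attn2"       - FlashAttention-2 builds
#   "kernels-community/vllm-flash-attn3"  - FlashAttention-3 builds (per the review above)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    attn_implementation="kernels-community/flash-attn2",
    device_map="cuda",
)
```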
> Kernels support different operational modes:
>
> - **Mode.INFERENCE**: Optimized for inference workloads (batch size optimization, reduced memory)
An example showing how to toggle these modes on/off would also be useful. Or is this done automatically when you do model.eval() and model.train()?
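A hedged sketch of explicit mode selection with the `kernels` package (assuming `kernelize` and the `Mode` flags behave as in the kernels docs; whether this also happens implicitly on `model.train()`/`model.eval()` is exactly the open question):

```python
from kernels import kernelize, Mode

# Re-kernelize the already-loaded model for a specific workload.
model = kernelize(model, mode=Mode.INFERENCE)                      # inference-only kernels
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)  # training + torch.compile-safe kernels
```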
> Specify different kernel implementations per device:
>
> ```python
> kernel_config = KernelConfig(
> ```
It'd be nice to show what you do with the kernel_config once it's defined, for more context
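Agreed; a sketch of what that could look like, assuming the config is passed to `from_pretrained` (the `kernel_config` keyword name is my assumption based on the outline proposed above):

```python
from transformers import AutoModelForCausalLM

# Hand the per-device kernel selection to from_pretrained so the chosen
# kernels are applied when the model is loaded.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    kernel_config=kernel_config,  # defined in the quoted snippet above
    device_map="cuda",
)
```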
Add some more content to the kernels integration documentation in Transformers.