# doc(kernels): update kernels integration documentation #42277
# Hugging Face Transformers Kernels

The Transformers kernels integration provides high-performance, optimized kernel implementations for common transformer operations. By leveraging specialized CUDA, Triton, ROCm, Metal, and XPU kernels from the Hugging Face Hub, you can significantly accelerate model inference and training with minimal code changes.
## Key Benefits

- **Drop-in Performance Boost**: Enable optimized kernels with a single `use_kernels=True` flag
- **Hub-Based Distribution**: Access community-maintained kernels directly from the Hugging Face Hub
- **Multi-Backend Support**: CUDA, ROCm, Metal, and XPU backends for different hardware platforms
- **Mode-Aware Optimization**: Automatically switches between training and inference optimizations
- **Zero Code Changes**: Existing models automatically benefit from kernel acceleration
- **Customizable**: Override default kernels with your own implementations via `KernelConfig`
## Supported Operations

The `kernels` library provides optimized implementations for:

### Normalization Layers
- **RMSNorm**: Root Mean Square Layer Normalization

### Activation Functions
- **FastGELU, NewGELU, QuickGELU**: GELU variants
- **SiLU**: Sigmoid Linear Unit
- **GeluTanh**: GELU with tanh approximation

### MLP and MoE Layers
- **MLP**: Standard multi-layer perceptron layers
- **MegaBlocksMoeMLP**: Optimized Mixture-of-Experts implementations
- **Llama4TextMoe**: Llama 4 MoE layers

### Attention Mechanisms
- **Flash Attention**: Fast attention implementations
- **MultiScaleDeformableAttention**: For vision transformers
- **Custom Attention**: Load community kernels for specialized attention patterns

### Specialized Operations
- **Mamba Selective Scan**: Built-in optimized kernels for Mamba SSM models
- **Causal Convolution 1D**: Efficient causal convolutions
## Quick Start

### Basic Usage

Enable kernels when loading any model. With `use_kernels=True`, the matching optimized kernels are automatically pulled from the Hub and used in place of the default implementations:

> **Member:** Clarify that this means the optimized kernels are automatically pulled in and used.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    device_map="cuda"
)

# The model now uses the optimized kernels automatically
input_ids = tokenizer("Hello!", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=50)
```
### Custom Attention Kernels

Use specialized attention implementations:

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    attn_implementation="kernels-community/flash-attn",
    device_map="cuda"
)
```

> **Contributor** (on lines +64 to +68): It was renamed recently. *(suggested change)*
## How It Works

### Automatic Kernel Replacement

When `use_kernels=True`, the library performs the following steps (see the sketch after this list):

1. **Scans the model**: Identifies layers that have optimized kernel implementations
2. **Loads kernels from the Hub**: Downloads and caches kernels from Hugging Face repositories
3. **Replaces forward methods**: Swaps standard PyTorch operations with optimized kernels
4. **Maintains compatibility**: Preserves model behavior while improving performance, though optimized kernels are not guaranteed to be bit-identical to the PyTorch implementations

> **Contributor** (on lines +79 to +80): Not really sure about the 4th point, since with kernels comes some non-determinism.
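For reference, the `kernels` library exposes this mechanism directly. A minimal sketch of kernelizing a model by hand, assuming the library's `kernelize`/`Mode` API:

```python
import torch
from kernels import Mode, kernelize
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.float16
).to("cuda")

# Replace the forward methods of supported layers with Hub kernels in place
model = kernelize(model, mode=Mode.INFERENCE)
```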
### Mode-Aware Optimization

Kernels automatically adapt to your training/inference workflow:

```python
model.eval()   # Uses inference-optimized kernels
model.train()  # Switches to training-optimized kernels
```
### Kernel Sources

Kernels are distributed through Hugging Face Hub repositories and referenced in the following formats:

> **Member:** It'd be nice to add some examples showing these different formats.

- `org/repo:layer_name` - Latest version
- `org/[email protected]:layer_name` - Specific version
- Semantic versioning constraints are supported (see the examples below)
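For instance, both documented forms can be written as plain source strings (the repository and layer names here are illustrative):

```python
# Latest version of the layer from the repository
"kernels-community/cuda-norm:FastRMSNorm"

# Pinned to a specific release
"kernels-community/[email protected]:FastRMSNorm"
```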
**Popular Kernel Repositories:**

- [`kernels-community/flash-attn`](https://huggingface.co/kernels-community/flash-attn) - Flash attention implementations
- [`kernels-community/megablocks`](https://huggingface.co/kernels-community/megablocks) - MoE optimizations
- [`kernels-community/moe`](https://huggingface.co/kernels-community/moe) - Llama 4 MoE layers
- [`kernels-community/liger_kernels`](https://huggingface.co/kernels-community/liger_kernels) - RMSNorm, activation functions

> **Contributor** (on lines +98 to +99): we can add …
## Requirements

- **kernels** package: `pip install kernels`
- **Recommended**: `kernels>=0.10.2`, which is also the minimum version for XPU support

> **Contributor** (on lines +106 to +108): nit: might be understood incorrectly that the specified version is for XPU support only. *(suggested change)*
## Advanced Features

### Training and Inference Modes

Kernels support different operational modes (see the sketch after this list):

- **Mode.INFERENCE**: Optimized for inference workloads (batch size optimization, reduced memory)
- **Mode.TRAINING**: Optimized for training (gradient computation, mixed precision)
- **Mode.TORCH_COMPILE**: Compatible with `torch.compile` for JIT optimization

> **Member:** An example showing how to toggle these modes on/off would also be useful. Or is this done automatically when you do …
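In Transformers the mode follows `model.train()`/`model.eval()` automatically; when using the `kernels` library directly, modes can be selected explicitly. A sketch, assuming the library's `kernelize`/`Mode` API, where modes combine with `|`:

```python
from kernels import Mode, kernelize

# Inference-optimized kernels only
model = kernelize(model, mode=Mode.INFERENCE)

# Training kernels that remain torch.compile-compatible
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
```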
### Device-Specific Kernels

Specify different kernel implementations per device (the repository and layer names below are illustrative):

```python
from transformers import KernelConfig  # exposed by the kernels integration

kernel_config = KernelConfig(
    kernel_mapping={
        "RMSNorm": {
            "cuda": "kernels-community/cuda-norm:FastRMSNorm",
            "rocm": "kernels-community/rocm-norm:RocmRMSNorm",
            "xpu": "kernels-community/xpu-norm:XpuRMSNorm"
        }
    }
)
```

> **Member:** It'd be nice to show what you do with the …
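The config is then passed when loading the model; a sketch, assuming `from_pretrained` accepts a `kernel_config` argument alongside `use_kernels=True`:

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    kernel_config=kernel_config,  # assumed keyword; see the KernelConfig docs
    device_map="cuda",
)
```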
### Built-in Kernels

Transformers includes built-in CUDA kernels for specific models:

- **Falcon Mamba**: Selective scan operations with layer normalization fusion, located in `transformers.kernels.falcon_mamba`

> **Contributor** (on lines +135 to +140): This will be removed here: #41664
## Important Notes

- **No Unkernelization**: Once kernels are enabled, they cannot be disabled during the session

> **Member:** It should be possible to support this. In vanilla kernels, you can …

- **Lazy Loading**: Kernels are downloaded and cached only when needed
- **Backward Compatibility**: Models work identically with or without kernels enabled
- **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs, ROCm kernels require AMD GPUs, and XPU kernels require Intel GPUs; some kernels additionally require a minimum compute capability (see the check below)

> **Member:** Maybe worth mentioning that some kernels require specific compute capabilities (e.g. Hopper/Blackwell)?

> **Contributor** (on lines +147 to +148): *(suggested change)*
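As a quick sanity check for such requirements, you can query the GPU's compute capability with plain PyTorch before enabling a kernel that documents a minimum (e.g. 9.0 for Hopper):

```python
import torch

# Compute capability as a (major, minor) tuple, e.g. (9, 0) for Hopper
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
```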
## Troubleshooting

### Import Errors

If you see `ModuleNotFoundError: No module named 'kernels'`, install the package:

```bash
pip install kernels
```
### Disabling the use of `kernels` globally

You can disable the use of `kernels` globally by setting the environment variable `USE_HUB_KERNELS` to `0`, `OFF`, or `NO`.
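For example, to disable Hub kernels for a single run (the script name is illustrative):

```bash
USE_HUB_KERNELS=0 python generate.py   # OFF or NO work as well
```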
### Device Compatibility

Not all kernels support all devices. Check the kernel repository documentation for device support.
## Additional Resources

- **Kernels Library**: [github.com/huggingface/kernels](https://github.com/huggingface/kernels)
- **Community Kernels**: [huggingface.co/kernels-community](https://huggingface.co/kernels-community)
- **API Reference**: See the `KernelConfig` documentation for advanced configuration options

> **Member** (on lines +169 to +171): Maybe also worth linking out to https://huggingface.co/docs/kernels/index and perhaps kernel-builder?

> Theoretically we'll go in a direction where most layers can be replaced by kernels, so this list is bound to be outdated quite quickly.

> cc @MekkCyber and @danieldk, IMO spending some time on this doc to really showcase the `kernels` integration in transformers would be really worthwhile.
kernelsintegration in transformers would be really worhtwhile