# Hugging Face Transformers Kernels

The Transformers Kernels integration provides high-performance, optimized kernel implementations for common transformer operations. By leveraging specialized CUDA, Triton, ROCm, Metal and XPU kernels from the Hugging Face Hub, you can significantly accelerate model inference and training with minimal code changes.

## Key Benefits

- **Drop-in Performance Boost**: Enable optimized kernels with a single `use_kernels=True` flag
- **Hub-Based Distribution**: Access community-maintained kernels directly from Hugging Face Hub
- **Multi-Backend Support**: CUDA, ROCm, Metal, and XPU backends for different hardware platforms
- **Mode-Aware Optimization**: Automatically switches between training and inference optimizations
- **Zero Code Changes**: Existing models automatically benefit from kernel acceleration
- **Customizable**: Override default kernels with your own implementations via `KernelConfig`

## Supported Operations

The `kernels` library provides optimized implementations for:

### Normalization Layers
- **RMSNorm**: Root Mean Square Layer Normalization

### Activation Functions
- **FastGELU, NewGELU, QuickGELU**: GELU variants
- **SiLU**: Sigmoid Linear Unit
- **GeluTanh**: GELU with Tanh approximation

### MLP and MoE Layers
- **MLP**: Standard Multi-Layer Perceptron layers
- **MegaBlocksMoeMLP**: Optimized Mixture-of-Experts implementations
- **Llama4TextMoe**: Llama 4 MoE layers

### Attention Mechanisms
- **Flash Attention**: Fast attention implementations
- **MultiScaleDeformableAttention**: For vision transformers
- **Custom Attention**: Load community kernels for specialized attention patterns

### Specialized Operations
- **Mamba Selective Scan**: Built-in optimized kernels for Mamba SSM models
- **Causal Convolution 1D**: Efficient causal convolutions
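
Many of these kernels can also be fetched from the Hub and used directly through the standalone `kernels` package. The sketch below assumes the `kernels` package's `get_kernel` API and the `kernels-community/activation` repository; the functions a kernel exposes depend on the repository you load.

```python
import torch
from kernels import get_kernel

# Download (and cache) a kernel repository from the Hub.
activation = get_kernel("kernels-community/activation")

x = torch.randn((16, 16), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # fused GELU written into the preallocated output
```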

## Quick Start

### Basic Usage

Enable kernels when loading any model. With `use_kernels=True`, the matching optimized kernels are automatically downloaded from the Hub and used in place of the default layer implementations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    device_map="cuda"
)

# The model now uses optimized kernels automatically
input_ids = tokenizer("Optimized kernels are", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=50)
```

### Custom Attention Kernels

Use specialized attention implementations:

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    attn_implementation="kernels-community/flash-attn2",
    device_map="cuda"
)
```

## How It Works

### Automatic Kernel Replacement

When `use_kernels=True`, the library:

1. **Scans the model**: Identifies layers that have optimized kernel implementations
2. **Loads kernels from Hub**: Downloads and caches kernels from Hugging Face repositories
3. **Replaces forward methods**: Swaps standard PyTorch operations with optimized kernels
4. **Maintains compatibility**: Preserves model behavior while improving performance, though optimized kernels may introduce small numerical differences and some non-determinism
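
If you want to sanity-check compatibility on your own model, a simple approach is to compare outputs with and without kernels. A minimal sketch, where the model name and tolerances are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")

baseline = AutoModelForCausalLM.from_pretrained(name, device_map="cuda")
kernelized = AutoModelForCausalLM.from_pretrained(name, use_kernels=True, device_map="cuda")

with torch.no_grad():
    out_baseline = baseline(**inputs).logits
    out_kernelized = kernelized(**inputs).logits

# Kernels may introduce small numerical differences, so compare with a tolerance.
print(torch.allclose(out_baseline, out_kernelized, atol=1e-2, rtol=1e-2))
```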

### Mode-Aware Optimization

Kernels automatically adapt to your training/inference workflow:

```python
model.eval() # Uses inference-optimized kernels
model.train() # Switches to training-optimized kernels
```

### Kernel Sources

Kernels are distributed through Hugging Face Hub repositories in the format:

- `org/repo:layer_name` - Latest version
- `org/[email protected]:layer_name` - Specific version
- Supports semantic versioning constraints
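
For example, a mapping entry can reference a layer in either form (the repository, version, and layer names below are placeholders that only illustrate the format):

```python
# Latest published version of a layer from a Hub kernel repository
"kernels-community/example-norm:ExampleRMSNorm"

# The same layer pinned to a specific version
"kernels-community/[email protected]:ExampleRMSNorm"
```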

**Popular Kernel Repositories:**
- [`kernels-community/flash-attn2`](https://huggingface.co/kernels-community/flash-attn2) - Flash attention implementations
- [`kernels-community/megablocks`](https://huggingface.co/kernels-community/megablocks) - MoE optimizations
- [`kernels-community/moe`](https://huggingface.co/kernels-community/moe) - Llama 4 MoE layers
- [`kernels-community/liger_kernels`](https://huggingface.co/kernels-community/liger_kernels) - RMSNorm, activation functions

## Requirements

- **kernels** package: `pip install kernels`
- **Recommended**: Use `kernels>=0.10.2` to ensure support for all backends

## Advanced Features

### Training and Inference Modes

Kernels support different operational modes:

- **Mode.INFERENCE**: Optimized for inference workloads (batch size optimization, reduced memory)
- **Mode.TRAINING**: Optimized for training (gradient computation, mixed precision)
- **Mode.TORCH_COMPILE**: Compatible with `torch.compile` for JIT optimization
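
In Transformers, the active mode follows `model.train()` and `model.eval()` automatically, as shown above. When using the underlying `kernels` library directly, a mode can also be requested explicitly; a rough sketch, assuming the `kernels` package's `kernelize` and `Mode` API:

```python
from kernels import Mode, kernelize

# Request kernels that are both training-friendly and torch.compile-compatible;
# modes can be combined with the | operator.
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
```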

### Device-Specific Kernels

Specify different kernel implementations per device:

```python
kernel_config = KernelConfig(
    kernel_mapping={
        "RMSNorm": {
            "cuda": "kernels-community/cuda-norm:FastRMSNorm",
            "rocm": "kernels-community/rocm-norm:RocmRMSNorm",
            "xpu": "kernels-community/xpu-norm:XpuRMSNorm"
        }
    }
)
```
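
A hedged sketch of how the config would then be applied, assuming `KernelConfig` is importable from `transformers` and that `from_pretrained` accepts a `kernel_config` argument alongside `use_kernels=True` (check the `KernelConfig` API reference for your version):

```python
from transformers import AutoModelForCausalLM, KernelConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    kernel_config=kernel_config,  # the device-specific mapping defined above
    device_map="cuda",
)
```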

### Built-in Kernels

Transformers includes built-in CUDA kernels for specific models:

- **Falcon Mamba**: Selective scan operations with layer normalization fusion
- Located in: `transformers.kernels.falcon_mamba`

## Important Notes

- **No Unkernelization**: Once kernels are enabled, they cannot be disabled during the session
- **Lazy Loading**: Kernels are downloaded and cached only when needed
- **Backward Compatibility**: Models work identically with or without kernels enabled
- **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs (some kernels additionally target specific compute capabilities such as Hopper or Blackwell); ROCm kernels require AMD GPUs; XPU kernels require Intel GPUs; and Metal kernels require Apple Silicon devices

## Troubleshooting

### Import Errors

If you see `ModuleNotFoundError: No module named 'kernels'`:

```bash
pip install kernels
```

### Disabling the use of `kernels` globally

You can disable the use of `kernels` globally by setting the environment variable `USE_HUB_KERNELS=0|OFF|NO`.
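
For example, in a bash shell:

```bash
export USE_HUB_KERNELS=0
```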

### Device Compatibility

Not all kernels support all devices. Check the kernel repository documentation for device support.

## Additional Resources

- **Kernels Library**: [github.com/huggingface/kernels](https://github.com/huggingface/kernels)
- **Community Kernels**: [huggingface.co/kernels-community](https://huggingface.co/kernels-community)
- **API Reference**: See `KernelConfig` documentation for advanced configuration options