Conversation

@mfuntowicz (Member) commented:

Add some more content to the kernels integration in Transformers.

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik (Member) left a comment:

It's a good start! I'm inviting @MekkCyber and @danieldk to also contribute here wrt the overall integration

Small additional comment: imo we should try to have a good, coherent doc here, and cross-link it everywhere else where it makes sense (guides/docs like performance, optimization, etc.)

Comment on lines +14 to +16
## Supported Operations

The `kernels` library provides optimized implementations for:

Member:

Theoretically we'll go in a direction where most layers can be replaced by kernels, so this list is bound to be outdated quite quickly

Member:

cc @MekkCyber and @danieldk, IMO spending some time on this doc to really showcase the kernels integration in transformers would be really worthwhile


## Important Notes

- **No Unkernelization**: Once kernels are enabled, they cannot be disabled during the session

Member:

It should be possible to support this. In vanilla kernels, you can kernelize with an empty mapping and the model will use the original implementations again.
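
A minimal sketch of that idea, assuming an already-loaded `model` and the `kernels` package's `kernelize` / `use_kernel_mapping` API (treat the exact behavior of an empty mapping here as an assumption to verify):

```python
from kernels import kernelize, use_kernel_mapping

# Sketch: re-run kernelize while an empty mapping is active, so no layer has a
# kernel registered and the original forward implementations are used again.
with use_kernel_mapping({}):
    model = kernelize(model)
```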

- **No Unkernelization**: Once kernels are enabled, they cannot be disabled during the session
- **Lazy Loading**: Kernels are downloaded and cached only when needed
- **Backward Compatibility**: Models work identically with or without kernels enabled
- **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs

Member:

Maybe worth mentioning that some kernels require specific compute capabilities (e.g. Hopper/Blackwell)?
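
For illustration only (not part of the doc yet), such a note could point readers to a plain PyTorch capability check:

```python
import torch

# Hypothetical capability check: Hopper reports compute capability (9, 0),
# Blackwell (10, 0) or (12, 0); older GPUs may lack features some kernels need.
major, minor = torch.cuda.get_device_capability()
if major < 9:
    print("This GPU may not support kernels that require Hopper/Blackwell features.")
```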

Comment on lines +169 to +171
- **Kernels Library**: [github.com/huggingface/kernels](https://github.com/huggingface/kernels)
- **Community Kernels**: [huggingface.co/kernels-community](https://huggingface.co/kernels-community)
- **API Reference**: See `KernelConfig` documentation for advanced configuration options

Member:

Maybe also worth linking out to https://huggingface.co/docs/kernels/index and perhaps kernel-builder?

@MekkCyber (Contributor) left a comment:

Thanks a lot @mfuntowicz! This was very much needed.

Comment on lines +98 to +99
**Popular Kernel Repositories:**
- [`kernels-community/flash-attn`](https://huggingface.co/kernels-community/flash-attn2) - Flash attention implementations

Contributor:

we can add `vllm-flash-attn3` too, I think

Comment on lines +106 to +108
- **kernels** package: `pip install kernels`
- **Recommended**: `kernels>=0.10.2` for XPU support

Contributor:

nit: this might be understood incorrectly, as if the specified version were needed for XPU support only

Suggested change:

Before:
- **kernels** package: `pip install kernels`
- **Recommended**: `kernels>=0.10.2` for XPU support

After:
- **kernels** package: `pip install kernels`
- **Recommended**: Use `kernels>=0.10.2` to ensure support for all backends.

Comment on lines +135 to +140
### Built-in Kernels

Transformers includes built-in CUDA kernels for specific models:

- **Falcon Mamba**: Selective scan operations with layer normalization fusion
- Located in: `transformers.kernels.falcon_mamba`

Contributor:

This will be removed here: #41664

Comment on lines +147 to +148
- **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs

Contributor:

Suggested change:

Before:
- **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs

After:
- **Hardware Requirements**: CUDA kernels require compatible NVIDIA GPUs; ROCm requires AMD GPUs; XPU requires Intel GPUs; and Metal kernels require Apple Silicon devices

Comment on lines +64 to +68
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    attn_implementation="kernels-community/flash-attn",
    device_map="cuda"
)

Contributor:

It was renamed recently

Suggested change:

Before:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    attn_implementation="kernels-community/flash-attn",
    device_map="cuda"
)

After:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    attn_implementation="kernels-community/flash-attn2",
    device_map="cuda"
)

Comment on lines +79 to +80
3. **Replaces forward methods**: Swaps standard PyTorch operations with optimized kernels
4. **Maintains compatibility**: Ensures identical outputs while improving performance

Contributor:

not really sure about the 4th point, since with kernels comes some non-determinism
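
If it helps, here is a hedged sketch of what "compatible outputs" could mean in practice; `model_with_kernels`, `model_without_kernels`, and `inputs` are placeholder names, not part of the doc:

```python
import torch

# Hypothetical check: with kernels enabled, expect outputs to match the eager
# implementation within a tolerance rather than bit-for-bit, since kernels may
# reorder floating-point operations.
with torch.no_grad():
    reference = model_without_kernels(**inputs).logits
    kernelized = model_with_kernels(**inputs).logits

torch.testing.assert_close(kernelized, reference, rtol=1e-3, atol=1e-3)
```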

@stevhliu (Member) left a comment:

Thanks for kicking this off, I think we can polish this a bit further!

- the Key Benefits and Supported Operations lists interrupt the narrative flow and get in the way of the learning path. I think they may be better at the end in a Reference section, or even as a link to where users can find all the supported ops (a list may not scale well as it grows)
- it'd be nice to integrate Key Benefits into the intro paragraph of the doc to emphasize the benefits, instead of a list
- it would also make more sense to move the Requirements higher up so users know upfront
- I think it'd flow better if we string the Advanced Features together with the content in the Quick Start. Currently it feels a bit disjointed and doesn't really build off of or connect to what comes before
- it may be useful to include an example that shows how to find out which kernels are loaded
- maybe organize it like this:
# Kernels

intro paragraph about what it is and the key benefits
requirements and installation

## Enabling kernels
different ways of specifying kernels in `from_pretrained`
`use_kernels=True`
`attn_implementation` for attention kernels
`KernelConfig` for device-specific kernels

## Mode configuration
switching between inference, training, and torch.compile kernels

## Automatic kernel replacement
explanation about how it works


### Basic Usage

Enable kernels when loading any model:

Member:

Clarify that this means the optimized kernels are automatically pulled in and used
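
For example, something along these lines (a sketch based on the `use_kernels=True` flag discussed in this PR; details to be confirmed against the final doc):

```python
from transformers import AutoModelForCausalLM

# Sketch: with use_kernels=True, compatible layers are swapped for optimized
# kernel implementations pulled from the Hub (and cached) at load time.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    device_map="cuda",
)
```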


### Kernel Sources

Kernels are distributed through Hugging Face Hub repositories in the format:

Member:

It'd be nice to add some examples showing these different formats.


Kernels support different operational modes:

- **Mode.INFERENCE**: Optimized for inference workloads (batch size optimization, reduced memory)

Member:

An example showing how to toggle these modes on/off would also be useful. Or is this done automatically when you do model.eval() and model.train()?
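
For reference, a minimal sketch of what such an example could look like with the `kernels` package's `kernelize` / `Mode` API, assuming an already-loaded `model` (exact names and defaults are assumptions to verify):

```python
from kernels import Mode, kernelize

# Pick kernels tuned for inference...
model = kernelize(model, mode=Mode.INFERENCE)

# ...or for training, optionally combined with torch.compile-compatible kernels.
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
```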

Specify different kernel implementations per device:

```python
kernel_config = KernelConfig(
```

Member:

It'd be nice to show what you do with the kernel_config once it's defined, for more context
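
For instance, something like this (a sketch only; the exact `KernelConfig` arguments and the `kernel_config` kwarg in `from_pretrained` should be checked against the final API):

```python
from transformers import AutoModelForCausalLM

# Hypothetical sketch: pass the kernel_config defined above to from_pretrained
# together with use_kernels so the device-specific kernels are applied at load time.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    use_kernels=True,
    kernel_config=kernel_config,
    device_map="cuda",
)
```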
