Commit 803d9c9

update
1 parent 6f6095e commit 803d9c9

11 files changed: +451 -457 lines changed

docs/source/en/_toctree.yml

Lines changed: 8 additions & 19 deletions
@@ -92,17 +92,15 @@
9292
title: Response parsing
9393
title: Chat with models
9494
- sections:
95+
- local: serialization
96+
title: Export to production
97+
- local: transformers_as_backend
98+
title: Inference backends
9599
- local: serving
96-
title: Serving LLMs, VLMs, and other chat-based models
97-
- local: jan
98-
title: Jan
99-
- local: cursor
100-
title: Cursor
101-
- local: tiny_agents
102-
title: Tiny-Agents CLI and MCP tools
103-
- local: open_webui
104-
title: Open WebUI
105-
title: Serving
100+
title: Serve CLI
101+
- local: serve-cli-integrations
102+
title: Serve CLI integrations
103+
title: Deployment
106104
- sections:
107105
- local: perf_torch_compile
108106
title: torch.compile
@@ -117,8 +115,6 @@
117115
title: Agents
118116
- local: tools
119117
title: Tools
120-
- local: transformers_as_backend
121-
title: Inference server backends
122118
- local: continuous_batching
123119
title: Continuous Batching
124120
title: Inference
@@ -225,13 +221,6 @@
225221
- local: kernel_doc/overview
226222
title: Kernels in transformers
227223
title: Kernels
228-
- isExpanded: false
229-
sections:
230-
- local: serialization
231-
title: ONNX
232-
- local: executorch
233-
title: ExecuTorch
234-
title: Export to production
235224
- isExpanded: false
236225
sections:
237226
- sections:

docs/source/en/cursor.md

Lines changed: 0 additions & 41 deletions
This file was deleted.

docs/source/en/executorch.md

Lines changed: 0 additions & 33 deletions
This file was deleted.

docs/source/en/jan.md

Lines changed: 0 additions & 32 deletions
This file was deleted.

docs/source/en/model_doc/audioflamingo3.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ rendered properly in your Markdown viewer.
1414
1515
-->
1616

17-
*This model was released on 2025-07-10 and added to Hugging Face Transformers on 2025-11-11.*
17+
*This model was released on 2025-07-10 and added to Hugging Face Transformers on 2025-11-17.*
1818

1919
# Audio Flamingo 3
2020

docs/source/en/open_webui.md

Lines changed: 0 additions & 23 deletions
This file was deleted.

docs/source/en/serialization.md

Lines changed: 65 additions & 54 deletions
@@ -14,87 +14,98 @@ rendered properly in your Markdown viewer.
1414
1515
-->
1616

17-
# ONNX
17+
# Export to production
1818

19-
[ONNX](http://onnx.ai) is an open standard that defines a common set of operators and a file format to represent deep learning models in different frameworks, including PyTorch and TensorFlow. When a model is exported to ONNX, the operators construct a computational graph (or *intermediate representation*) which represents the flow of data through the model. Standardized operators and data types makes it easy to switch between frameworks.
19+
Export Transformers' models to different formats for optimized runtimes and devices. Deploy the same model to cloud providers or run it on mobile and edge devices. You don't need to rewrite the model from scratch for each deployment environment. Freely deploy across any inference ecosystem.
2020

21-
The [Optimum](https://huggingface.co/docs/optimum/index) library exports a model to ONNX with configuration objects which are supported for [many architectures](https://huggingface.co/docs/optimum/exporters/onnx/overview) and can be easily extended. If a model isn't supported, feel free to make a [contribution](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) to Optimum.
21+
## ExecuTorch
2222

23-
The benefits of exporting to ONNX include the following.
23+
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) runs PyTorch models on mobile and edge devices. It exports a model into a graph of standardized operators, compiles the graph into an ExecuTorch program, and executes it on the target device. The runtime is lightweight and calculates the execution plan ahead of time.
2424

25-
- [Graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization) for improving inference.
26-
- Use the [`~optimum.onnxruntime.ORTModel`] API to run a model with [ONNX Runtime](https://onnxruntime.ai/).
27-
- Use [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines) for ONNX models.
25+
Install [Optimum ExecuTorch](https://huggingface.co/docs/optimum-executorch/en/index) from source.
2826

29-
Export a Transformers model to ONNX with the Optimum CLI or the `optimum.onnxruntime` module.
27+
```bash
28+
git clone https://github.com/huggingface/optimum-executorch.git
29+
cd optimum-executorch
30+
pip install '.[dev]'
31+
```
3032

31-
## Optimum CLI
33+
Export a Transformers model to ExecuTorch with the CLI tool.
34+
35+
```bash
36+
optimum-cli export executorch \
37+
--model "Qwen/Qwen3-8B" \
38+
--task "text-generation" \
39+
--recipe "xnnpack" \
40+
--use_custom_sdpa \
41+
--use_custom_kv_cache \
42+
--qlinear 8da4w \
43+
--qembedding 8w \
44+
--output_dir="qwen3_8b"
45+
```
3246

33-
Run the command below to install Optimum and the [exporters](https://huggingface.co/docs/optimum/exporters/overview) module.
47+
Run the following command to view all export options.
3448

3549
```bash
36-
pip install optimum-onnx
50+
optimum-cli export executorch --help
3751
```
3852

39-
> [!TIP]
40-
> Refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) guide for all available arguments or with the command below.
41-
>
42-
> ```bash
43-
> optimum-cli export onnx --help
44-
> ```
53+
## ONNX
54+
55+
[ONNX](http://onnx.ai) is a shared language for describing models from different frameworks. It represents models as a graph of standardized operators with well-defined types, shapes, and metadata. Models serialize into compact protobuf files that you can deploy across optimized runtimes and engines.
4556

46-
Set the `--model` argument to export a PyTorch model from the Hub.
57+
[Optimum ONNX](https://huggingface.co/docs/optimum-onnx/index) exports models to ONNX with configuration objects. It supports many [architectures](https://huggingface.co/docs/optimum-onnx/onnx/overview) and is easily extendable. Export models through the CLI tool or programmatically.
58+
59+
Install [Optimum ONNX](https://huggingface.co/docs/optimum-onnx/index).
4760

4861
```bash
49-
optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/
62+
uv pip install optimum-onnx
5063
```
5164

52-
You should see logs indicating the progress and showing where the resulting `model.onnx` is saved.
53-
54-
```text
55-
Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
56-
-[✓] ONNX model output names match reference model (start_logits, end_logits)
57-
- Validating ONNX Model output "start_logits":
58-
-[✓] (2, 16) matches (2, 16)
59-
-[✓] all values close (atol: 0.0001)
60-
- Validating ONNX Model output "end_logits":
61-
-[✓] (2, 16) matches (2, 16)
62-
-[✓] all values close (atol: 0.0001)
63-
The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx
64-
```
65+
### optimum-cli
6566

66-
For local models, make sure the model weights and tokenizer files are saved in the same directory, for example `local_path`. Pass the directory to the `--model` argument and use `--task` to indicate the [task](https://huggingface.co/docs/optimum/exporters/task_manager) a model can perform. If `--task` isn't provided, the model architecture without a task-specific head is used.
67+
Specify the model to export with the `--model` argument and pass the output directory as the last positional argument.
6768

6869
```bash
69-
optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/
70+
optimum-cli export onnx --model Qwen/Qwen3-8B Qwen/Qwen3-8b-onnx/
7071
```
7172

72-
The `model.onnx` file can be deployed with any [accelerator](https://onnx.ai/supported-tools.html#deployModel) that supports ONNX. The example below demonstrates loading and running a model with ONNX Runtime.
73+
Run the following command to view all available arguments or refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model) guide for more details.
7374

74-
```python
75-
>>> from transformers import AutoTokenizer
76-
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
75+
```bash
76+
optimum-cli export onnx --help
77+
```
7778

78-
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
79-
>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")
80-
>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
81-
>>> outputs = model(**inputs)
79+
To export a local model, save the weights and tokenizer files in the same directory. Pass the directory path to the `--model` argument and use the `--task` argument to specify the [task](https://huggingface.co/docs/optimum/exporters/task_manager#transformers). If you don't provide `--task`, the system auto-infers it from the model or uses an architecture without a task-specific head.
80+
81+
```bash
82+
optimum-cli export onnx --model path/to/local/model --task text-generation Qwen/Qwen3-8b-onnx/
8283
```
8384

84-
## optimum.onnxruntime
85+
Deploy the model with any [runtime](https://onnx.ai/supported-tools.html#deployModel) that supports ONNX, including ONNX Runtime.
8586

86-
The `optimum.onnxruntime` module supports programmatically exporting a Transformers model. Instantiate a [`~optimum.onnxruntime.ORTModel`] for a task and set `export=True`. Use [`~OptimizedModel.save_pretrained`] to save the ONNX model.
87+
```py
88+
from transformers import AutoTokenizer
89+
from optimum.onnxruntime import ORTModelForCausalLM
8790

88-
```python
89-
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
90-
>>> from transformers import AutoTokenizer
91+
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8b-onnx")
92+
model = ORTModelForCausalLM.from_pretrained("Qwen/Qwen3-8b-onnx")
93+
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
94+
outputs = model.generate(**inputs)
95+
print(tokenizer.batch_decode(outputs))
96+
```
9197

92-
>>> model_checkpoint = "distilbert/distilbert-base-uncased-distilled-squad"
93-
>>> save_directory = "onnx/"
98+
### optimum.onnxruntime
9499

95-
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
96-
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
100+
Export Transformers' models programmatically with Optimum ONNX. Instantiate a [`~optimum.onnxruntime.ORTModel`] with a model and set `export=True`. Save the ONNX model with [`~optimum.onnxruntime.ORTModel.save_pretrained`].
97101

98-
>>> ort_model.save_pretrained(save_directory)
99-
>>> tokenizer.save_pretrained(save_directory)
100-
```
102+
```py
103+
from optimum.onnxruntime import ORTModelForCausalLM
104+
from transformers import AutoTokenizer
105+
106+
ort_model = ORTModelForCausalLM.from_pretrained("Qwen/Qwen3-8b", export=True)
107+
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8b")
108+
109+
ort_model.save_pretrained("onnx/")
110+
tokenizer.save_pretrained("onnx/")
111+
```
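
The `optimum-cli export onnx` invocations added to `serialization.md` above share one shape: `--model`, an optional `--task`, then the output directory as the final positional argument. The sketch below only assembles that argument list for scripting. `build_onnx_export_command` is a hypothetical helper, not part of Optimum, and it never runs the export itself.

```python
import shlex
from typing import List, Optional


def build_onnx_export_command(
    model: str, output_dir: str, task: Optional[str] = None
) -> List[str]:
    """Assemble the argument list for `optimum-cli export onnx`.

    Hypothetical helper for illustration; it builds the command but does not run it.
    """
    command = ["optimum-cli", "export", "onnx", "--model", model]
    if task is not None:
        # --task is optional; the docs above describe what happens when it is omitted.
        command += ["--task", task]
    # The output directory is the trailing positional argument.
    command.append(output_dir)
    return command


# Hub model, mirroring the first export example in the diff.
hub_command = build_onnx_export_command("Qwen/Qwen3-8B", "Qwen/Qwen3-8b-onnx/")

# Local model with an explicit task, mirroring the second example.
local_command = build_onnx_export_command(
    "path/to/local/model", "Qwen/Qwen3-8b-onnx/", task="text-generation"
)

print(shlex.join(hub_command))
print(shlex.join(local_command))
```

With `optimum-onnx` installed, either list could be passed to `subprocess.run(command, check=True)` to run the export. Building the list first keeps the command easy to log and inspect.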
