docs/source/en/serving.md (24 additions, 2 deletions)
@@ -383,6 +383,30 @@ transformers serve \
    --attn_implementation "sdpa"
```

### Quantization

`transformers serve` is compatible with all [quantization methods](https://huggingface.co/docs/transformers/main/quantization/overview) supported in transformers. Quantization can significantly reduce memory usage and improve inference speed. There are two main workflows: serving pre-quantized models and quantizing on the fly.

#### Pre-quantized models

For models that are already quantized (e.g., with GPTQ, AWQ, or bitsandbytes), simply choose a quantized model name when serving. Make sure to install the required libraries listed in the quantization documentation.

> [!TIP]
> Pre-quantized models generally provide the best balance of performance and accuracy.
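
For example, a pre-quantized checkpoint can be pinned and served directly. The repository name below is only an illustration; substitute any pre-quantized model you have access to:

```sh
# Illustrative repo id; swap in the pre-quantized checkpoint you actually want to serve
transformers serve --force-model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ
```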

#### On-the-fly quantization

If you want to quantize a model at runtime, specify the `--quantization` flag in the CLI. Note that not all quantization methods support on-the-fly conversion; the full list of supported methods is available in the quantization [overview](https://huggingface.co/docs/transformers/main/quantization/overview).

Currently, `transformers serve` only supports two of these methods: `bnb-4bit` and `bnb-8bit`.

For example, to enable 4-bit quantization with bitsandbytes, pass `--quantization bnb-4bit`:

```sh
transformers serve --quantization bnb-4bit
```
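
The 8-bit variant works the same way and can be combined with the other serving flags covered in this guide (a sketch; adjust the options to your setup):

```sh
# 8-bit bitsandbytes quantization, combined with an explicit attention backend
transformers serve \
    --quantization bnb-8bit \
    --attn_implementation "sdpa"
```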

### Performance tips

- Use an efficient attention backend when available:
@@ -397,6 +421,4 @@ transformers serve \
- `--dtype {bfloat16|float16}` typically improves throughput and memory use vs. `float32`
- `--load_in_4bit`/`--load_in_8bit` can reduce memory footprint for LoRA setups (this bullet is deleted by the change, matching the removal of the corresponding CLI flags below)
- `--force-model <repo_id>` avoids per-request model hints and helps produce stable, repeatable runs
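
Putting these tips together, a launch command might look like the following (the model id and option values are illustrative, not required):

```sh
# Illustrative example: pin one model, use bf16 weights, and pick an efficient attention backend
transformers serve \
    --force-model Qwen/Qwen2.5-0.5B-Instruct \
    --dtype bfloat16 \
    --attn_implementation "sdpa"
```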
src/transformers/cli/serve.py (15 additions, 29 deletions)
@@ -377,14 +377,10 @@ def __init__(
        help="Which attention implementation to use; you can run --attn_implementation=flash_attention_2, in which case you must install this manually by running `pip install flash-attn --no-build-isolation`."
    ),
] = None,
- load_in_8bit: Annotated[
-     bool, typer.Option(help="Whether to use 8 bit precision for the base model - works only with LoRA.")
- ] = False,
- load_in_4bit: Annotated[
-     bool, typer.Option(help="Whether to use 4 bit precision for the base model - works only with LoRA.")