### Large Language Model Pruning

To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1/N:M).

- Retrain-free

[The retrain-free algorithm](https://arxiv.org/abs/2204.09656) is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layers in the Multi-Layer Perceptron (MLP), it achieves 20% sparsity per layer while keeping the accuracy loss below 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs.

For a quick and efficient start with the retrain-free algorithm, please refer to the API instructions in [Retrain-free Pruning API](#Retrain-free-Pruning-API).

- SparseGPT

[The SparseGPT algorithm](https://arxiv.org/abs/2301.00774) is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1 and N:M, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. It can achieve up to 50% sparsity on models larger than 10 billion parameters while maintaining an accuracy loss of less than 1%; larger models tend to be less sensitive to sparsity. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs.

For a smooth start with the SparseGPT algorithm, please refer to the API instructions in [SparseGPT Pruning API](#SparseGPT-Pruning-API).

### Pruning Support Matrix
## Get Started with Pruning API

Neural Compressor `Pruning` API is defined under `neural_compressor.training`, which takes a user-defined configuration object as input.
Users can pass customized training/evaluation functions to `Pruning` in various scenarios.
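
In every workflow the overall flow is the same: describe one or more pruners as dict-like configs, wrap them in a `WeightPruningConfig`, and hand that object to the entry point of the chosen workflow (training-aware, retrain-free, or SparseGPT). A minimal sketch of the configuration step is shown below; all field values are illustrative and simply mirror the examples in the following subsections.

```python
from neural_compressor.training import WeightPruningConfig

config = WeightPruningConfig(
    [
        # one pruner; the values here are illustrative and mirror the examples below
        {"op_names": [".*"], "excluded_op_names": ["lm_head"], "pattern": "1x1", "target_sparsity": 0.5},
    ],
    start_step=1,   # global settings applied to all pruners
    end_step=300,
)
```
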
### Training-aware pruning API

The following section exemplifies how to use hooks in a user-provided training function to perform model pruning. Through the pruning API, multiple pruner objects are supported in a single `Pruning` object to enable layer-specific configurations, and a default set is used as a complement.

- Step 1: Define a dict-like configuration in your training code. Usually only 5-7 configuration items need to be identified. For customized pruning, a configuration template is shown below:
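
A minimal sketch of such a template is given here; the field names mirror the retrain-free and SparseGPT examples later in this document, while the `snip_momentum` criterion, the `4x1` and `2:4` patterns, the module names, and all values are illustrative assumptions rather than required settings.

```python
pruning_configs = [
    {  # pruner 1: structured block pruning for most layers
        "pruning_type": "snip_momentum",  # assumed training-aware criterion
        "pattern": "4x1",
        "op_names": ["layer1.*"],  # modules this pruner applies to
        "excluded_op_names": ["classifier"],  # modules that would not be pruned
        "target_sparsity": 0.8,
        "start_step": 0,
        "end_step": 10000,
        "pruning_scope": "global",
        "pruning_frequency": 250,
    },
    {  # pruner 2: layer-specific override with an N:M pattern
        "op_names": ["layer2.*"],
        "pattern": "2:4",
        "target_sparsity": 0.5,
    },
]
```
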
- Step 2: Enable pruning functionalities

[**Experimental option**] Modify model and optimizer.

In the case mentioned above, the pruning process is driven by pre-defined hooks in Neural Compressor. Users need to place these hooks inside the training function, as in the sketch below.
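
A minimal sketch of such a hook-instrumented training loop, assuming a standard PyTorch setup (the `model`, `optimizer`, `train_dataloader`, and `num_train_epochs` names are illustrative and come from the user's own training code):

```python
from neural_compressor.training import prepare_compression, WeightPruningConfig

config = WeightPruningConfig(pruning_configs)
compression_manager = prepare_compression(model, config)  # wrap the model for pruning
compression_manager.callbacks.on_train_begin()

for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        compression_manager.callbacks.on_step_begin(step)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        compression_manager.callbacks.on_before_optimizer_step()
        optimizer.step()
        compression_manager.callbacks.on_after_optimizer_step()
        compression_manager.callbacks.on_step_end()
        optimizer.zero_grad()

compression_manager.callbacks.on_train_end()
```
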
### Retrain-free Pruning API

- Step 1: Define a dict-like configuration in your training code. Usually only 5-7 configuration items need to be identified.

If the names of the layers to be pruned are known, you can create a pruning_config manually by entering the relevant information. This allows for greater customization and control over the pruning process.

```python
pruning_configs = [
    {  # config of a single pruner
        "pruning_type": "retrain_free",
        "pruning_scope": "global",
        "op_names": [".fc", ".mlp"],  # MLP layer names
        "start_step": 1,
        "end_step": 300,  # set end_step for few-shot pruning.
        "excluded_op_names": ["lm_head"],  # A list of modules that would not be pruned.
        "target_sparsity": 0.2,  # Target sparsity ratio of modules.
        "pruning_frequency": 50,  # Frequency of applying pruning.
    }
]
```

If you are uncertain about the names of the linear modules within the model's MLP, or prefer a simplified approach to setting up pruning, you can use a module that automatically generates the config:

```python
# auto config
from neural_compressor.compression.pruner import parse_auto_slim_config

pruning_configs = []
auto_configs = parse_auto_slim_config(
    model,
    ffn2_sparsity=args.target_sparsity,  # e.g. 0.2
    mha_sparsity=0,
    pruning_scope="global",
    pruning_type="retrain_free",
)
pruning_configs += auto_configs
```

- Step 2: Enable pruning functionalities

The process itself is quite straightforward: by passing the prepared config and the calibration dataset, the pruning process can be carried out automatically with a single API call.

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

configs = WeightPruningConfig(
    pruning_configs,
    target_sparsity=args.target_sparsity,  # global setting for all pruners (optional)
    pattern=args.pruning_pattern,
    start_step=pruning_start,
    end_step=pruning_end,
)

pruning = prepare_pruning(model, configs, dataloader=train_dataloader)  # modify the model and complete the pruning
```

### SparseGPT Pruning API

- Step 1: Define a dict-like configuration in your training code. Usually only 3-5 configuration items need to be identified, for example:

```python
pruning_configs = [
    {  # example pruner
        "pruning_type": "sparse_gpt",
        "op_names": [".*"],  # Prunes all linear modules by default.
        "excluded_op_names": ["lm_head", "embed_out"],  # A list of modules that would not be pruned.
        "target_sparsity": 0.5,  # Target sparsity ratio of modules.
        "pattern": "1x1",  # Default pruning pattern.
    }
]
```

- Step 2: Enable pruning functionalities

By providing the pruning config, the calibration dataset, and the device to run on, the pruning process can be executed automatically with a simple API call.

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

configs = WeightPruningConfig(
    pruning_configs,
    target_sparsity=args.target_sparsity,  # global setting for all pruners
)

pruning = prepare_pruning(model, configs, dataloader=train_dataloader)  # modify the model and complete the pruning
```

"Experimental" annotation means these examples codes are ready but pruning results are under improvements. Please don't hesitate to try these codes with different configurations to get better pruning results!
362
473
474
+
- Language Modeling

Sparsity is effectively implemented through various pruning patterns in causal language modeling (CLM) tasks; see the [Language-modeling examples](../../../examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager).

- Text Classification

Sparsity is implemented in different pruning patterns for the MRPC and SST-2 tasks; see the [Text-classification examples](../../examples/pytorch/nlp/huggingface_models/text-classification/pruning/eager).

[1] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
[2] Zafrir, Ofir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. "Prune once for all: Sparse pre-trained language models." arXiv preprint arXiv:2111.05754 (2021).
[3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.
[4] Frantar, E. and Alistarh, D., SparseGPT: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv.org/abs/2301.00774.
The dataset will be downloaded automatically from the Hugging Face Datasets Hub.
See more about loading a [Hugging Face dataset](https://huggingface.co/docs/datasets/loading_datasets.html).
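
For reference, loading a dataset from the Hub looks like the sketch below; the dataset name is purely illustrative, since the example scripts download the dataset they need automatically.

```python
from datasets import load_dataset

# Illustrative only: the pruning example scripts handle dataset loading themselves.
calib_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(calib_dataset[0])
```
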
<br />
# Run Examples
Intel® Neural Compressor provides support for pruning and model slimming operations in Large Language Models (LLMs) without the need for retraining.

Through experimental verification, it has been observed that pruning the Multi-Layer Perceptron (MLP) layers using a channel-wise pattern can achieve a sparsity level of 10%-20%. This pruning technique speeds up inference while keeping the accuracy drop below 1%; see the [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py).

The 1x1 and N:M pruning patterns are supported through the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py). It is possible to prune models of up to 70B parameters within two hours, achieving 40%-50% sparsity in both the Multi-Head Attention (MHA) and MLP layers. For models of 7B parameters and above, the drop in accuracy is less than 1%.

Pruning scripts are available for LLM sparse models such as GPT-J, BLOOM, OPT, and LLaMA; sparse models can be obtained by modifying the pruning parameters. See the [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).

<br />
## Retrain-free Results
The last token accuracy for channel pruning using [the retrain-free scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_gptj_pruning.sh) is presented in the following table.
| Model | Calibration dataset | Evaluation dataset | Sparsity pattern | Over MLP block sparsity | Element-wise/matmul, Gemm, conv ratio | Dense last token accuracy | Sparse last token accuracy | Relative drop |
<br />
## SparseGPT Results
The last word accuracy of the 1x1 pattern sparse model using [the SparseGPT script](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_llm_sparsegpt.sh) is shown in the following table.

| Model | Task | Calibration dataset | Evaluation dataset | Sparsity | Precision | Dense last word accuracy | Sparse last word accuracy | Relative drop |
[1] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.
[2] Frantar, E. and Alistarh, D., SparseGPT: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv.org/abs/2301.00774.