
Commit 2956db2

refine pruner docs (#1256)
Signed-off-by: Zhang, Weiwei1 <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 10bff1a commit 2956db2

4 files changed: +270 −124 lines changed

docs/source/pruning.md

Lines changed: 123 additions & 4 deletions
@@ -243,6 +243,24 @@ Regularization is a technique that discourages learning a more complex model and
</a>
</div>

### Large Language Model Pruning

To efficiently prune Large Language Models (LLMs), we have implemented two post-training pruning methods that use different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1 / N:M).

- Retrain-free

[The retrain-free algorithm](https://arxiv.org/abs/2204.09656) is a lightweight method that uses mask retrieval and rearrangement techniques within the Transformer architecture. By combining channel pruning with sparse model slimming of the linear layers in the Multi-Layer Perceptron (MLP), it achieves 20% sparsity per layer with an accuracy loss of less than 1%. The algorithm seamlessly supports popular models such as GPT, OPT, LLaMA, and BLOOM. Its ability to improve model efficiency while preserving accuracy makes it a valuable pruning approach for LLMs.

For a quick and efficient start with the retrain-free algorithm, please refer to the [Retrain-free Pruning API](#Retrain-free-Pruning-API) instructions.

- SparseGPT

[The SparseGPT algorithm](https://arxiv.org/abs/2301.00774) is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1 and N:M, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. With this method, models larger than 10 billion parameters can reach up to 50% sparsity while keeping the accuracy loss below 1%; in general, larger models are less sensitive to sparsity. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs.

For a quick start with the SparseGPT algorithm, please refer to the [SparseGPT Pruning API](#SparseGPT-Pruning-API) instructions.

### Pruning Support Matrix

@@ -255,12 +273,11 @@ Regularization is a technique that discourages learning a more complex model and
## Get Started with Pruning API

The Neural Compressor `Pruning` API is defined under `neural_compressor.training` and takes a user-defined configuration object as input.
Users can pass customized training/evaluation functions to `Pruning` to cover various scenarios.


### Training-aware pruning API
The following section shows how to use hooks in a user-provided training function to perform model pruning. Through the pruning API, multiple pruner objects are supported within a single Pruning object, enabling layer-specific configurations, while a default set serves as a complement.

- Step 1: Define a dict-like configuration in your training code. Usually only 5-7 configuration items need to be identified. For customized pruning, a configuration template is shown below:
@@ -297,7 +314,7 @@ The following section exemplifies how to use hooks in user pass-in training func
]
```

- Step 2: Enable pruning functionalities

[**Experimental option**] Modify the model and optimizer.

@@ -345,7 +362,101 @@ The following section exemplifies how to use hooks in user pass-in training func
compression_manager.callbacks.on_train_end()
```

In the case mentioned above, the pruning process can be carried out by the pre-defined hooks in Neural Compressor. Users need to place these hooks inside the training function.
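
Below is a condensed sketch of where these hooks typically sit inside a standard PyTorch training loop. It is based on the `prepare_compression` entry point and the callbacks referenced above; `model`, `optimizer`, `lr_scheduler`, `train_dataloader`, `num_train_epochs`, and `pruning_configs` are placeholders for your own training objects and settings:

```python
from neural_compressor.training import WeightPruningConfig, prepare_compression

config = WeightPruningConfig(pruning_configs)
compression_manager = prepare_compression(model, config)  # attach pruning callbacks to the model
compression_manager.callbacks.on_train_begin()

for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        compression_manager.callbacks.on_step_begin(step)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        compression_manager.callbacks.on_before_optimizer_step()
        optimizer.step()
        compression_manager.callbacks.on_after_optimizer_step()
        lr_scheduler.step()
        optimizer.zero_grad()
        compression_manager.callbacks.on_step_end()

compression_manager.callbacks.on_train_end()
```
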
### Retrain-free Pruning API
- Step 1: Define a dict-like configuration in your training code. Usually only 5-7 configuration items need to be identified.

If the names of the layers to be pruned are known, you can create a pruning_config manually by entering the relevant information. This allows for greater customization and control over the pruning process.

```python
pruning_configs = [
    {  # config of a single pruner
        "pruning_type": "retrain_free",
        "pruning_scope": "global",
        "op_names": [".fc", ".mlp"],  # MLP layer names
        "start_step": 1,
        "end_step": 300,  # set end_step for few-shot pruning
        "excluded_op_names": ["lm_head"],  # A list of modules that would not be pruned.
        "target_sparsity": 0.2,  # Target sparsity ratio of modules.
        "pruning_frequency": 50,  # Frequency of applying pruning.
        "pattern": "channelx1",  # Default pruning pattern.
    },
]
```

If you are uncertain about the names of the linear modules within the model's MLP, or prefer a simplified way to set up pruning, you can use a helper module that automatically generates the config:

```python
# auto config
from neural_compressor.compression.pruner import parse_auto_slim_config

pruning_configs = []
auto_configs = parse_auto_slim_config(
    model,
    ffn2_sparsity=args.target_sparsity,  # e.g. 0.2
    mha_sparsity=0,
    pruning_scope="global",
    pruning_type="retrain_free",
)
pruning_configs += auto_configs
```

- Step 2: Enable pruning functionalities

The process itself is straightforward: by passing the prepared config and the calibration dataset, the pruning can be carried out automatically with a simple API call.

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

configs = WeightPruningConfig(
    pruning_configs,
    target_sparsity=args.target_sparsity,  # global setting for all pruners (optional)
    pattern=args.pruning_pattern,
    start_step=pruning_start,
    end_step=pruning_end,
)
# Alternatively, rely solely on the per-pruner settings: configs = WeightPruningConfig(pruning_configs)

pruning = prepare_pruning(model, configs, dataloader=train_dataloader)  # modify the model and complete the pruning
```


### SparseGPT Pruning API
- Step 1: Define a dict-like configuration in your training code. Usually only 3-5 configuration items need to be identified, for example:

```python
pruning_configs = [
    {  # example pruner
        "pruning_type": "sparse_gpt",
        "op_names": [".*"],  # Prunes all linear modules by default.
        "excluded_op_names": ["lm_head", "embed_out"],  # A list of modules that would not be pruned.
        "target_sparsity": 0.5,  # Target sparsity ratio of modules.
        "pattern": "1x1",  # Default pruning pattern.
    }
]
```

- Step 2: Enable pruning functionalities

By providing the pruning config, the calibration dataset, and the desired device (e.g. a specific CUDA card), the pruning process can be executed automatically with a simple API call.

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

configs = WeightPruningConfig(
    pruning_configs,
    target_sparsity=args.target_sparsity,  # global setting for all pruners
    pattern=args.pruning_pattern,  # e.g. 1x1 / 2:4
)
# Alternatively, rely solely on the per-pruner settings: configs = WeightPruningConfig(pruning_configs)

# for example: device = "cuda:1"
pruning = prepare_pruning(
    model, configs, dataloader=train_dataloader, device=device
)  # modify the model and complete the pruning
```
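
Once `prepare_pruning` returns, the model passed in already holds the sparsified weights. Assuming a Hugging Face `transformers` model and a hypothetical `args.output_dir`, the sparse checkpoint can then be saved with the usual utilities:

```python
# `model` was modified in place by prepare_pruning; persist the sparse checkpoint.
model.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
```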


## Examples
@@ -360,6 +471,10 @@ The pruning technique is validated on typical models across various domains (in

"Experimental" annotation means these example codes are ready, but the pruning results are still being improved. Please don't hesitate to try these codes with different configurations to get better pruning results!

- Language Modeling

Sparsity is effectively implemented through various pruning patterns in Causal language modeling (CLM) tasks. [Language-modeling examples](../../examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager).

- Text Classification

Sparsity is implemented with different pruning patterns for the MRPC and SST-2 tasks. [Text-classification examples](../../examples/pytorch/nlp/huggingface_models/text-classification/pruning/eager).
@@ -395,3 +510,7 @@ For more details, please refer to [HPO document](../../neural_compressor/compres
[1] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.

[2] Zafrir, Ofir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. "Prune once for all: Sparse pre-trained language models." arXiv preprint arXiv:2111.05754 (2021).

[3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.

[4] Frantar, E. and Alistarh, D., 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. URL: https://arxiv.org/abs/2301.00774.

examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md

Lines changed: 20 additions & 9 deletions
@@ -22,17 +22,23 @@ pip install -r examples/pytorch/nlp/huggingface_models/language-modeling/pruning
The dataset will be downloaded automatically from the datasets Hub.
See more about loading [huggingface dataset](https://huggingface.co/docs/datasets/loading_datasets.html)

<br />

# Run Examples

Intel® Neural Compressor provides support for pruning and model slimming operations in Large Language Models (LLMs) without the need for retraining.

Through experimental verification, it has been observed that pruning the Multi-Layer Perceptron (MLP) layers using a channel-wise pattern can achieve a sparsity level of 10%-20%. This pruning technique speeds up inference while keeping the accuracy drop below 1%. [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py).

The 1x1 and N:M pruning patterns are supported through the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py). It is possible to prune models up to 70B in size within two hours, achieving 40%-50% sparsity in both the Multi-Head Attention (MHA) and MLP layers; for models of 7B and above, the accuracy drop is less than 1%.

Pruning scripts are available for LLM sparse models such as GPT-J, BLOOM, OPT, and LLaMA; a sparse model can be obtained by modifying the pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).
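
For reference, a rough sketch of driving the same SparseGPT flow through the Python API described in `docs/source/pruning.md` is shown below; `model`, `train_dataloader`, and the argument values are placeholders rather than the exact settings used by the scripts:

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

# Placeholder config mirroring the SparseGPT Pruning API section of the docs.
configs = WeightPruningConfig(
    [
        {
            "pruning_type": "sparse_gpt",
            "op_names": [".*"],  # prune all linear modules by default
            "excluded_op_names": ["lm_head", "embed_out"],
        }
    ],
    target_sparsity=0.5,
    pattern="1x1",  # or an N:M pattern such as "2:4"
)
# `device` selects the card used for calibration, e.g. "cuda:0".
pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device="cuda:0")
```
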
<br />

## Retrain-free Results

The last token accuracy for channel pruning using [the retrain-free scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_gptj_pruning.sh) is presented in the following table.

| Model | Calibration dataset | Evaluation dataset | Sparsity pattern | Over MLP block sparsity | Element-wise/matmul, Gemm, conv ratio | Dense last token accuracy | Sparse last token accuracy | Relative drop |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| EleutherAI/gpt-j-6b | lambada | lambada | channelx1 | 0.1999 | 0.1242 | 0.7917 | 0.8038 | +1.50% |
@@ -64,7 +70,9 @@ The last word acc of the channel-wise sparse model is shown in the following tab

<br />

## SparseGPT Results

The last word accuracy of the 1x1-pattern sparse model produced by [the sparseGPT script](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_llm_sparsegpt.sh) is shown in the following table.

| Model | Task | Calibration dataset | Evaluation dataset | Sparsity | Precision | Dense last word accuracy | Sparse last word accuracy | Relative drop |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| EleutherAI/gpt-j-6b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6831 | 0.6911 | +1.17% |
@@ -78,12 +86,15 @@ The last word acc of the 1x1 pattern sparse model using the sparseGPT algorithm
| bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.5764 | 0.5575 | -3.28% |
| bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.5723 | 0.5513 | -3.67% |
| decapoda-research/llama-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 50% | FP32 | 0.7627 | 0.7584 | -0.56% |
| decapoda-research/llama-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 50% | BF16 | 0.7601 | 0.7545 | -0.74% |


## References

[1] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.

[2] Frantar, E. and Alistarh, D., 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. URL: https://arxiv.org/abs/2301.00774.


