### Large Language Model Pruning

To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1/N:M).

- Retrain-free

[The retrain-free algorithm](https://arxiv.org/abs/2204.09656) is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layers in the Multi-Layer Perceptron (MLP), it achieves 20% sparsity per layer while keeping the accuracy loss below 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs.

For a quick and efficient start with the retrain-free algorithm, please refer to the API instructions in [Retrain-free Pruning API](#Retrain-free-Pruning-API).

- SparseGPT

[The SparseGPT algorithm](https://arxiv.org/abs/2301.00774) is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1 and N:M, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. It can achieve up to 50% sparsity on models larger than 10 billion parameters while maintaining an accuracy loss of less than 1%; larger models tend to be less sensitive to sparsity. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs.

For a smooth start with the SparseGPT algorithm, please refer to the API instructions in [SparseGPT Pruning API](#SparseGPT-Pruning-API).

### Pruning Support Matrix
## Get Started with Pruning API

Neural Compressor `Pruning` API is defined under `neural_compressor.training`, which takes a user-defined configuration object as input.
Users can pass customized training/evaluation functions to `Pruning` in various scenarios.
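
In every workflow the overall flow is the same: describe one or more pruners as dict-like configs, wrap them in a `WeightPruningConfig`, and hand that object to the entry point of the chosen workflow (training-aware, retrain-free, or SparseGPT). A minimal sketch of the configuration step is shown below; all field values are illustrative and simply mirror the examples in the following subsections.

```python
from neural_compressor.training import WeightPruningConfig

config = WeightPruningConfig(
    [
        # one pruner; the values here are illustrative and mirror the examples below
        {"op_names": [".*"], "excluded_op_names": ["lm_head"], "pattern": "1x1", "target_sparsity": 0.5},
    ],
    start_step=1,   # global settings applied to all pruners
    end_step=300,
)
```
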
### Training-aware pruning API

The following section exemplifies how to use hooks in a user-provided training function to perform model pruning. Through the pruning API, multiple pruner objects are supported in a single `Pruning` object to enable layer-specific configurations, and a default set is used as a complement.

- Step 1: Define a dict-like configuration in your training code. Usually only 5-7 configuration items need to be identified. For customized pruning, a configuration template is shown below:
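
A minimal sketch of such a template is given here; the field names mirror the retrain-free and SparseGPT examples later in this document, while the `snip_momentum` criterion, the `4x1` and `2:4` patterns, the module names, and all values are illustrative assumptions rather than required settings.

```python
pruning_configs = [
    {  # pruner 1: structured block pruning for most layers
        "pruning_type": "snip_momentum",  # assumed training-aware criterion
        "pattern": "4x1",
        "op_names": ["layer1.*"],  # modules this pruner applies to
        "excluded_op_names": ["classifier"],  # modules that would not be pruned
        "target_sparsity": 0.8,
        "start_step": 0,
        "end_step": 10000,
        "pruning_scope": "global",
        "pruning_frequency": 250,
    },
    {  # pruner 2: layer-specific override with an N:M pattern
        "op_names": ["layer2.*"],
        "pattern": "2:4",
        "target_sparsity": 0.5,
    },
]
```
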
- Step 2: Enable pruning functionalities

[**Experimental option**] Modify model and optimizer.

In the case mentioned above, the pruning process is driven by pre-defined hooks in Neural Compressor. Users need to place these hooks inside the training function, as in the sketch below.
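
A minimal sketch of such a hook-instrumented training loop, assuming a standard PyTorch setup (the `model`, `optimizer`, `train_dataloader`, and `num_train_epochs` names are illustrative and come from the user's own training code):

```python
from neural_compressor.training import prepare_compression, WeightPruningConfig

config = WeightPruningConfig(pruning_configs)
compression_manager = prepare_compression(model, config)  # wrap the model for pruning
compression_manager.callbacks.on_train_begin()

for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        compression_manager.callbacks.on_step_begin(step)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        compression_manager.callbacks.on_before_optimizer_step()
        optimizer.step()
        compression_manager.callbacks.on_after_optimizer_step()
        compression_manager.callbacks.on_step_end()
        optimizer.zero_grad()

compression_manager.callbacks.on_train_end()
```
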
### Retrain-free Pruning API

- Step 1: Define a dict-like configuration in your training code. Usually only 5-7 configuration items need to be identified.

If the names of the layers to be pruned are known, you can create a pruning_config manually by entering the relevant information. This allows for greater customization and control over the pruning process.

```python
pruning_configs = [
    {  # config of a single pruner
        "pruning_type": "retrain_free",
        "pruning_scope": "global",
        "op_names": [".fc", ".mlp"],  # MLP layer names
        "start_step": 1,
        "end_step": 300,  # set end_step for few-shot pruning.
        "excluded_op_names": ["lm_head"],  # A list of modules that would not be pruned.
        "target_sparsity": 0.2,  # Target sparsity ratio of modules.
        "pruning_frequency": 50,  # Frequency of applying pruning.
    }
]
```

If you are uncertain about the names of the linear modules within the model's MLP, or prefer a simplified approach to setting up pruning, you can use a module that automatically generates the config:

```python
# auto config
from neural_compressor.compression.pruner import parse_auto_slim_config

pruning_configs = []
auto_configs = parse_auto_slim_config(
    model,
    ffn2_sparsity=args.target_sparsity,  # e.g. 0.2
    mha_sparsity=0,
    pruning_scope="global",
    pruning_type="retrain_free",
)
pruning_configs += auto_configs
```

- Step 2: Enable pruning functionalities

The process itself is quite straightforward: by passing the prepared config and the calibration dataset, the pruning process can be carried out automatically with a single API call.

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

configs = WeightPruningConfig(
    pruning_configs,
    target_sparsity=args.target_sparsity,  # global setting for all pruners (optional)
    pattern=args.pruning_pattern,
    start_step=pruning_start,
    end_step=pruning_end,
)

pruning = prepare_pruning(model, configs, dataloader=train_dataloader)  # modify the model and complete the pruning
```

### SparseGPT Pruning API

- Step 1: Define a dict-like configuration in your training code. Usually only 3-5 configuration items need to be identified, for example:

```python
pruning_configs = [
    {  # example pruner
        "pruning_type": "sparse_gpt",
        "op_names": [".*"],  # Prunes all linear modules by default.
        "excluded_op_names": ["lm_head", "embed_out"],  # A list of modules that would not be pruned.
        "target_sparsity": 0.5,  # Target sparsity ratio of modules.
        "pattern": "1x1",  # Default pruning pattern.
    }
]
```

- Step 2: Enable pruning functionalities

By providing the pruning config, the calibration dataset, and the device to run on, the pruning process can be executed automatically with a simple API call.

```python
from neural_compressor.training import prepare_pruning, WeightPruningConfig

configs = WeightPruningConfig(
    pruning_configs,
    target_sparsity=args.target_sparsity,  # global setting for all pruners
)

pruning = prepare_pruning(model, configs, dataloader=train_dataloader)  # modify the model and complete the pruning
```

"Experimental" annotation means these examples codes are ready but pruning results are under improvements. Please don't hesitate to try these codes with different configurations to get better pruning results!
362
473
474
+
- Language Modeling

Sparsity is effectively implemented through various pruning patterns in causal language modeling (CLM) tasks; see the [Language-modeling examples](../../../examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager).

- Text Classification

Sparsity is implemented in different pruning patterns for the MRPC and SST-2 tasks; see the [Text-classification examples](../../examples/pytorch/nlp/huggingface_models/text-classification/pruning/eager).

[1] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
[2] Zafrir, Ofir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. "Prune once for all: Sparse pre-trained language models." arXiv preprint arXiv:2111.05754 (2021).
[3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.
[4] Frantar, E. and Alistarh, D., SparseGPT: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv.org/abs/2301.00774.
The dataset will be downloaded automatically from the Hugging Face Datasets Hub.
See more about loading a [Hugging Face dataset](https://huggingface.co/docs/datasets/loading_datasets.html).
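
For reference, loading a dataset from the Hub looks like the sketch below; the dataset name is purely illustrative, since the example scripts download the dataset they need automatically.

```python
from datasets import load_dataset

# Illustrative only: the pruning example scripts handle dataset loading themselves.
calib_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(calib_dataset[0])
```
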
<br />
# Run Examples
Intel® Neural Compressor provides support for pruning and model slimming operations in Large Language Models (LLMs) without the need for retraining.

Through experimental verification, it has been observed that pruning the Multi-Layer Perceptron (MLP) layers using a channel-wise pattern can achieve a sparsity level of 10%-20%. This pruning technique speeds up inference while keeping the accuracy drop below 1%; see the [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py).

The 1x1 and N:M pruning patterns are supported through the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py). It is possible to prune models of up to 70B parameters within two hours, achieving 40%-50% sparsity in both the Multi-Head Attention (MHA) and MLP layers. For models of 7B parameters and above, the drop in accuracy is less than 1%.

Pruning scripts are available for LLM sparse models such as GPT-J, BLOOM, OPT, and LLaMA; sparse models can be obtained by modifying the pruning parameters. See the [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).

<br />
## Retrain-free Results
The last token accuracy for channel pruning using [the retrain-free scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_gptj_pruning.sh) is presented in the following table.
| Model | Calibration dataset | Evaluation dataset | Sparsity pattern | Over MLP block sparsity | Element-wise/matmul, Gemm, conv ratio | Dense last token accuracy | Sparse last token accuracy | Relative drop |
<br />
## SparseGPT Results
The last word accuracy of the 1x1 pattern sparse model using [the SparseGPT script](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_llm_sparsegpt.sh) is shown in the following table.

| Model | Task | Calibration dataset | Evaluation dataset | Sparsity | Precision | Dense last word accuracy | Sparse last word accuracy | Relative drop |
[1] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.
[2] Frantar, E. and Alistarh, D., SparseGPT: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv.org/abs/2301.00774.