Commit bb91d73

Update Puzzle Compression Tutorial (#493)

## What does this PR do?

**Type of change:** Documentation

**Overview:** Updated the tutorial with more details on how to choose the required config parameters and added MMLU evaluation.

Signed-off-by: Liana Mikaelyan <[email protected]>

1 parent 25b4aed commit bb91d73

2 files changed: +33 -8 lines changed


examples/compress/README.md

Lines changed: 31 additions & 8 deletions
@@ -1,8 +1,15 @@
# Compress Algorithm Tutorial

This tutorial demonstrates how to compress large language models using the compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
+The goal of the algorithm is to find optimal modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture.
+The supported modifications are:

-In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model by searching for the optimal `ffn_intermediate_size` across MLP layers and `attention op/noop`. This results in a heterogeneous architecture while reducing GPU memory usage from 113 GiB to 96 GiB (15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
+- `ffn_intermediate_size`: different FFN intermediate sizes
+- `attention op/noop`: complete removal of attention layers
+
+To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the combination of layer modifications that best satisfies the target requirements.
+
+In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.

## Environment

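The MIP stage described in the added introduction is, at its core, a constrained selection problem: pick one modification per block so that the summed accuracy score is as high as possible while the summed memory cost stays under the target. The toy sketch below (not part of the tutorial or of modelopt, with made-up per-block numbers and a brute-force search in place of a real mixed-integer solver) only illustrates what that search optimizes.

```python
# Toy illustration of the selection problem the MIP stage solves.
# Numbers are made up; modelopt uses a real mixed-integer solver, not brute force.
from itertools import product

# Per-block candidate modifications: (description, memory cost in GiB, accuracy score).
blocks = [
    [("attn + ffn_14336", 3.5, 1.00), ("attn + ffn_11520", 3.1, 0.99), ("no_op + ffn_11520", 2.4, 0.95)],
    [("attn + ffn_14336", 3.5, 1.00), ("attn + ffn_9984", 2.9, 0.98), ("no_op + ffn_3072", 1.2, 0.90)],
    [("attn + ffn_14336", 3.5, 1.00), ("no_op + ffn_14336", 2.8, 0.97)],
]
target_memory_gib = 8.5

best = None
for combo in product(*blocks):
    memory = sum(c[1] for c in combo)
    score = sum(c[2] for c in combo)
    if memory <= target_memory_gib and (best is None or score > best[0]):
        best = (score, memory, [c[0] for c in combo])

print(best)  # highest-scoring per-block combination that fits the memory budget
```
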
@@ -13,7 +20,11 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg

1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.

-Let's first shoot for 32% GPU memory reduction setting `target_memory = 78_000` GiB.
+**_NOTE:_**
+How to choose `intermediate_size_list`?
+The list specifies the candidate FFN sizes that we wish to search over. It is recommended to choose several pruning sizes (e.g. 15%, 20%, or 30% of the original). Note that the values must be hardware-friendly (e.g. divisible by a power of two) to avoid issues with tensor operations in subsequent steps.
+
+Let's first aim for a 32% GPU memory reduction by setting `target_memory = 78_000`. This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.

2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

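To make the note above about `intermediate_size_list` concrete, here is one hypothetical way to derive candidate sizes: start from the model's original FFN intermediate size (14336 for Llama-3.1-8B) and round a few pruning fractions to a hardware-friendly multiple. This helper is not part of modelopt, and the granularity of 256 is only an assumption; the point is simply that each candidate ends up a round, divisible size.

```python
# Hypothetical helper for choosing intermediate_size_list; not part of modelopt.
original_ffn = 14336                        # FFN intermediate size of Llama-3.1-8B-Instruct
keep_fractions = [0.85, 0.80, 0.70, 0.50]   # i.e. prune ~15%, 20%, 30%, 50%
granularity = 256                           # assumed hardware-friendly multiple

intermediate_size_list = sorted(
    {round(original_ffn * f / granularity) * granularity for f in keep_fractions},
    reverse=True,
)
print(intermediate_size_list)  # [12288, 11520, 9984, 7168]
```

Values produced this way (or any similar hardware-friendly set) would then go into `intermediate_size_list` in the YAML config from step 1.
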
@@ -23,7 +34,7 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

-3. Run the compression script.
+3. Run the compression script.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
@@ -42,7 +53,7 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg
[2025-11-02 12:52:34] Compress Progress 8/8: compression pipeline completed (multi-gpu)
```

-This will generate the following network architecture (see `log.txt`):
+Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review:

```bash
...
@@ -96,12 +107,12 @@ In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://hugg

30% GPU memory reduction leads to nearly 5% regression in token_accuracy_top_10 metric (0.898 / 0.942). Let's rerun MIP search aiming for 15% memory reduction.

-## Re-run MIP Search with different memory constraints
+## Re-run MIP Search with different constraints

-If you want to try different memory constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag.
+If you want to try different constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag.
This assumes pruning, replacement library building, NAS scoring, and subblock stats calculation have already been completed.

-Set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.
+For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress"
@@ -151,7 +162,7 @@ validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True)
Average losses = {'lm_loss': 1.2425934937782586, 'token_accuracy_top_1': 0.703862190246582, 'token_accuracy_top_5': 0.8954982757568359, 'token_accuracy_top_10': 0.9336576461791992
```

-On the other hand, if you set `target_memory: 28_000`, you would observe that for some layers the intermediate FFN size starts to reduce (see `log.txt`):
+On the other hand, if you set `target_memory: 28_000`, you'll observe that the intermediate FFN sizes are significantly reduced in certain layers (see `log.txt` for details):

```bash
block_5: attention no_op ffn intermediate_11520
@@ -166,6 +177,18 @@ block_13: attention no_op ffn intermediate_11520
block_14: attention no_op ffn intermediate_3072
```

+## Evaluation
+
+Once the model is ready, you can evaluate it using [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on a subset of [MMLU](https://huggingface.co/datasets/cais/mmlu).
+
+```bash
+lm_eval --model hf \
+    --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
+    --tasks mmlu_humanities \
+    --num_fewshot 5 \
+    --batch_size 4
+```
+
## Advanced usage
171194
Modify `path/to/Llama-3_1-8B yaml` file for advanced compression scenarios.
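
As a quick, optional way to summarize the per-block architecture lines shown earlier (e.g. `block_5: attention no_op ffn intermediate_11520`), a short script like the hypothetical one below can count how many attention layers were removed and which FFN sizes were kept. It only assumes the log format shown in this tutorial and that the output was saved to `log.txt`.

```python
# Hypothetical summary of the per-block architecture lines captured in log.txt.
# Assumes lines of the form "block_5: attention no_op ffn intermediate_11520".
import re
from collections import Counter

pattern = re.compile(r"block_(\d+): attention (\S+) ffn intermediate_(\d+)")
ffn_sizes, removed_attention = Counter(), 0

with open("log.txt") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            removed_attention += m.group(2) == "no_op"
            ffn_sizes[int(m.group(3))] += 1

print(f"attention layers removed (no_op): {removed_attention}")
print(f"FFN intermediate sizes kept: {dict(ffn_sizes)}")
```
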

examples/pruning/README.md

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,8 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar

</div>

+For more advanced pruning strategies, such as the [Puzzle methodology](https://arxiv.org/pdf/2411.19146), please see the [Puzzle pruning example](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/feature/compress/examples/compress).
+
## Pre-Requisites

For Minitron pruning for Megatron-LM / NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.07`) which has all the dependencies installed.
