
Commit f19cc9d

Add LLMs quantization model list and recipes (#1504)

Signed-off-by: chensuyue <[email protected]>

1 parent 7634409 commit f19cc9d

File tree

7 files changed: +48 −8 lines changed

README.md (+4 −3)

@@ -5,12 +5,12 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet)</h3>

 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-2.4-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-2.4.1-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)

-[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/README.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
+[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[LLMs Recipes](./docs/source/llm_recipes.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)

 ---
 <div align="left">
@@ -72,8 +72,9 @@ q_model = fit(
 <tr>
 <td colspan="2" align="center"><a href="./docs/source/design.md#architecture">Architecture</a></td>
 <td colspan="2" align="center"><a href="./docs/source/design.md#workflow">Workflow</a></td>
+<td colspan="1" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
+<td colspan="1" align="center"><a href="./docs/source/llm_recipes.md">LLMs Recipes</a></td>
 <td colspan="2" align="center"><a href="examples/README.md">Examples</a></td>
-<td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
 </tr>
 </tbody>
 <thead>

conda_meta/basic/meta.yaml (+1 −1)

@@ -1,4 +1,4 @@
-{% set version = "2.4" %}
+{% set version = "2.4.1" %}
 {% set buildnumber = 0 %}
 package:
   name: neural-compressor

conda_meta/neural_insights/meta.yaml (+1 −1)

@@ -1,4 +1,4 @@
-{% set version = "2.4" %}
+{% set version = "2.4.1" %}
 {% set buildnumber = 0 %}
 package:
   name: neural-insights

conda_meta/neural_solution/meta.yaml (+1 −1)

@@ -1,4 +1,4 @@
-{% set version = "2.4" %}
+{% set version = "2.4.1" %}
 {% set buildnumber = 0 %}
 package:
   name: neural-solution

docs/source/llm_recipes.md (+27, new file)

@@ -0,0 +1,27 @@
+LLM Quantization Models and Recipes
+---
+
+Intel® Neural Compressor supports advanced large language model (LLM) quantization techniques, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
+and has verified a list of LLMs on 4th Gen Intel® Xeon® Scalable Processors (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/),
+[Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch), and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
+This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with less than 1% accuracy loss.
+
+> Notes:
+> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and the evaluation functions are provided by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
+> - The model list is continuously updated; expect to find more LLMs here in the future.
+
+## IPEX key models
+| Models                    | SQ INT8 | WOQ INT8 | WOQ INT4 |
+|:-------------------------:|:-------:|:--------:|:--------:|
+| EleutherAI/gpt-j-6b       |    ✔    |    ✔     |    ✔     |
+| facebook/opt-1.3b         |    ✔    |    ✔     |    ✔     |
+| facebook/opt-30b          |    ✔    |    ✔     |    ✔     |
+| meta-llama/Llama-2-7b-hf  |    ✔    |    ✔     |    ✔     |
+| meta-llama/Llama-2-13b-hf |    ✔    |    ✔     |    ✔     |
+| meta-llama/Llama-2-70b-hf |    ✔    |    ✔     |    ✔     |
+| tiiuae/falcon-40b         |    ✔    |    ✔     |    ✔     |
+
+**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.md).**
+> Notes:
+> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
+> - WOQ INT4 recipes will be published soon.
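
For orientation on what these recipes drive, here is a minimal sketch of running SmoothQuant through the Intel Neural Compressor 2.x Python API. It is not code from this commit; `model` and `calib_dataloader` are hypothetical placeholders for a PyTorch LLM and a user-prepared calibration dataloader.

```python
# Minimal SmoothQuant sketch (not part of this commit).
# Assumes `model` is a PyTorch LLM and `calib_dataloader` a calibration
# dataloader, both prepared by the user.
from neural_compressor import PostTrainingQuantConfig, quantization

# SmoothQuant migrates activation outliers into the weights before static
# INT8 quantization; `alpha` balances how much difficulty is shifted from
# activations to weights (the recipes above tune this per model).
conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}}
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save("./saved_results")  # persist the quantized model
```

Weight-only quantization goes through the same `fit` path with `approach="weight_only"`; its per-algorithm knobs mirror the `--woq_*` flags used in the example script below.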

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_benchmark.sh (+13 −1)

@@ -79,6 +79,10 @@ function run_benchmark {
         model_name_or_path="facebook/opt-125m"
         approach="weight_only"
         extra_cmd=$extra_cmd" --woq_algo GPTQ"
+    elif [ "${topology}" = "opt_125m_woq_gptq_debug_int4" ]; then
+        model_name_or_path="facebook/opt-125m"
+        approach="weight_only"
+        extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_scheme asym --woq_group_size 128 --gptq_use_max_length --gptq_debug"
     elif [ "${topology}" = "opt_125m_woq_teq" ]; then
         model_name_or_path="facebook/opt-125m"
         approach="weight_only"
@@ -98,13 +102,21 @@ function run_benchmark {
     elif [ "${topology}" = "gpt_j_ipex_sq" ]; then
         model_name_or_path="EleutherAI/gpt-j-6b"
         extra_cmd=$extra_cmd" --ipex --sq --alpha 1.0"
-    elif [ "${topology}" = "gpt_j_woq_rtn" ]; then
+    elif [ "${topology}" = "gpt_j_woq_rtn_int4" ]; then
         model_name_or_path="EleutherAI/gpt-j-6b"
         approach="weight_only"
         extra_cmd=$extra_cmd" --woq_algo RTN --woq_bits 4 --woq_group_size 128 --woq_scheme asym --woq_enable_mse_search"
+    elif [ "${topology}" = "gpt_j_woq_gptq_debug_int4" ]; then
+        model_name_or_path="EleutherAI/gpt-j-6b"
+        approach="weight_only"
+        extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_group_size 128 --woq_scheme asym --gptq_use_max_length --gptq_debug"
     elif [ "${topology}" = "falcon_7b_sq" ]; then
         model_name_or_path="tiiuae/falcon-7b-instruct"
         extra_cmd=$extra_cmd" --sq --alpha 0.5"
+    elif [ "${topology}" = "falcon_7b_woq_gptq_debug_int4" ]; then
+        model_name_or_path="tiiuae/falcon-7b-instruct"
+        approach="weight_only"
+        extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_group_size 128 --woq_scheme asym --gptq_use_max_length --gptq_debug"
     fi

     python -u run_clm_no_trainer.py \
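
The new `*_woq_gptq_debug_int4` topologies forward GPTQ weight-only flags to `run_clm_no_trainer.py`. As a rough orientation, and assuming the INC 2.x `PostTrainingQuantConfig` weight-only interface rather than code from this commit, those flags map onto a config along these lines (`model` and `calib_dataloader` are placeholders):

```python
# Hedged sketch of an INC 2.x weight-only GPTQ config; not code from this
# commit, and simplified relative to the example script.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",           # approach="weight_only" in the script
    op_type_dict={
        ".*": {                       # apply to all matching op types
            "weight": {
                "bits": 4,            # --woq_bits 4
                "group_size": 128,    # --woq_group_size 128
                "scheme": "asym",     # --woq_scheme asym
                "algorithm": "GPTQ",  # --woq_algo GPTQ
            }
        }
    },
)
# --gptq_use_max_length and --gptq_debug are conveniences handled inside the
# example script (calibration sequence-length handling and a debug path),
# not keys in this config sketch.
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```

In practice these paths would be selected through the script's `--topology` switch, e.g. `--topology=gpt_j_woq_gptq_debug_int4`, alongside its other benchmark flags.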

neural_compressor/version.py (+1 −1)

@@ -15,4 +15,4 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques."""
-__version__ = "2.4"
+__version__ = "2.4.1"
