
Commit 4e45d57

ptq english
1 parent c6d9c10 commit 4e45d57

8 files changed: +134 -125 lines changed


docs/faq/onnx.mdx

+63-55
@@ -6,46 +6,45 @@ type: explainer
:::note
As NVIDIA has shifted its focus to [`TorchTensorrt`](https://github.com/pytorch/TensorRT), [`torch2trt`](https://github.com/NVIDIA-AI-IOT/torch2trt) is no longer actively maintained, and TensorRT's official ONNX support has become first-class. In all of our known practices, TorchPipe can completely replace `torch2trt` through static ONNX composition, dynamic ONNX, pre-generated TensorRT models, and other methods.
:::

## Torch to ONNX Conversion

The framework prioritizes dynamic `batch`, or static `batch` with `batchsize==1`. In practice, some models cannot be converted to dynamic shapes, or the conversion is error-prone. We also support [**loading multiple models with different static batch sizes at the same time**](../Intra-node/schedule#single_node_combine) to simulate dynamic shapes. The following instructions mainly apply to exporting models with a dynamic batch size.

:::caution Exporting Dynamic Batch Size Models
- The following operation makes dynamic batch size unavailable: ``x.view(int(x.size(0)), -1)``. Check whether the model hardcodes the batch dimension, for example ``x.view(int(x.size(0)), -1, 1, 1)`` or ``x.reshape(int(x.size(0)), -1, 1, 1)``; such code may break dynamic batch size after the model is converted to ONNX. Note that in Transformer-like networks the batch dimension is not necessarily dimension 0.
- When the batch dimension is specified as dynamic, older TensorRT versions handle it less well and generate more redundant operators. For example, ``x.view(x.size(0), -1)`` introduces Gather and other operators into the ONNX graph to compute the first dimension of x. It can be rewritten as ``x = x.view(-1, int(x.size(1)*x.size(2)*x.size(3)))`` or ``x = torch.flatten(x, 1)``. This step is optional.
- For some models (TensorRT 8.5.1, LSTM, and Transformer), making both the batch and non-batch dimensions dynamic may consume more resources (see the timings below):
- For LayerNorm layers and Transformer-like networks with dynamic batch size, opset>=17 and TensorRT>=8.6.1 are recommended.

```bash
# When both batch and non-batch dimensions are dynamic, inference takes 9 ms (input size optShapes=input:1x1000x80,mask:1x1x1000):
/opt/tensorrt/bin/trtexec --onnx=test_fp32.onnx --shapes=input:1x1000x80,mask:1x1x1000 --workspace=64000 \
--minShapes=input:1x20x80,mask:1x1x20 \
--optShapes=input:1x1000x80,mask:1x1x1000 \
--maxShapes=input:4x2000x80,mask:4x1x2000

# With batchsize==1 fixed, it takes only 4.6 ms:
/opt/tensorrt/bin/trtexec --onnx=test_fp32.onnx --shapes=input:1x1000x80,mask:1x1x1000 --workspace=64000 \
--minShapes=input:1x20x80,mask:1x1x20 \
--optShapes=input:1x1000x80,mask:1x1x1000 \
--maxShapes=input:1x2000x80,mask:1x1x2000
```
In this case, it is recommended to **discretize only one of the dimensions**.

:::
:::tip Best Practices
- Whenever possible, keep the batch dimension as dimension 0 with its default length (i.e., -1), so that redundant operators can be removed.
- Use onnx-simplify for optimization.
- [A smaller optimization range usually means faster speed and lower resource consumption](https://github.com/NVIDIA/TensorRT/issues/1166#issuecomment-815551064).
:::

After modifying the network, you can use the following code to convert the PyTorch model to an ONNX model:

```python
x = torch.randn(1,*input_shape).cuda()
@@ -60,8 +59,8 @@ torch.onnx.export(torch_model,
                  onnx_save_path,
                  opset_version=17,
                  do_constant_folding=True,
                  input_names=["input"],  # input name
                  output_names=[f"output_{i}" for i in range(out_size)],  # output names
                  dynamic_axes=out)

import onnx
@@ -73,59 +72,68 @@ model_simp, check = onnx_simplifier.simplify(onnx_model, check_n = 0)
onnx.save(model_simp, onnx_save_path)
```
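The `dynamic_axes` dictionary passed as `out` above is defined in a part of the file that this hunk does not show. Purely as an illustration (the construction below is an assumption, not taken from the original file), a typical `dynamic_axes` for a single-input model with a dynamic batch dimension could look like:

```python
# Hypothetical sketch: mark dimension 0 of the input and of every output as
# dynamic, so the exported ONNX model accepts arbitrary batch sizes.
out_size = 1  # assumed number of model outputs
out = {"input": {0: "batch_size"}}
for i in range(out_size):
    out[f"output_{i}"] = {0: "batch_size"}
```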

<details><summary>`torchpipe.utils.models.onnx_export` (effective from 0.3.2b1)</summary>

- This tool converts a PyTorch model to an ONNX model and saves it locally. It only supports a single input.
- It supports dynamic batch and applies onnx-simplify optimization.

```python
def onnx_export(model: Union[torch.nn.Module, torch.jit.ScriptModule, torch.jit.ScriptFunction], onnx_path, input = None, opset = 17):
```
:::tip Parameters
- **model** - PyTorch model.
- **onnx_path** - Path to save the ONNX model.
- **input** - Model input. Defaults to torch.randn(1,3,224,224) if not set.
- **opset** - ONNX opset version.
:::

<details><summary>Example Code</summary>

```python
import os, tempfile
from torchvision import models
import torch
import torchpipe

## export onnx
m = models.resnet50(weights=None).eval()
onnx_path = os.path.join(tempfile.gettempdir(), f"resnet50.onnx")
torchpipe.utils.models.onnx_export(m, onnx_path, torch.randn(1, 3, 224, 224), opset=17)
```
</details>
</details>
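To sanity-check that the exported file really accepts a dynamic batch, a quick verification sketch can be used. This is not part of the original example; it assumes `onnxruntime` is installed and reuses the path from the code above:

```python
import os, tempfile
import numpy as np
import onnxruntime as ort

onnx_path = os.path.join(tempfile.gettempdir(), "resnet50.onnx")
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
for bs in (1, 4):  # both batch sizes should run if the batch dimension is dynamic
    outputs = sess.run(None, {input_name: np.random.randn(bs, 3, 224, 224).astype(np.float32)})
    print(bs, [o.shape for o in outputs])
```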

### Explanation of Conversion Failures

When converting from torch to ONNX, conversion failures are common. The following approaches can help:

- Keep dynamic dimensions dynamic. For example, for YOLOX:

```python
x = x.view(int(x.size(0)), -1, 1, 1)
# Change to:
x = x.flatten(1).unsqueeze(2).unsqueeze(2)

x = x.view(int(x.size(0)), -1)
# Change to:
x = x.view(-1, int(x.size(1)*x.size(2)*x.size(3)))
```

- Change boolean values to float:

```python
tgt_padding_mask = (tgt_in == self.eos_id)
# Change to:
tgt_padding_mask = (tgt_in == self.eos_id).float()
```

- Simplify the model with [onnx-simplify](#onnx-smi).
- Try different versions, since failures are often version-related:
  - Use the latest versions whenever possible, e.g. onnx opset >= 14 and TensorRT >= 8.2.
  - For TensorRT 7, onnx 1.9.0 with onnx opset = 11 is recommended.
- Try converting the model with trtexec, as in the example below:

```bash

@@ -137,33 +145,33 @@ tgt_padding_mask = (tgt_in == self.eos_id).float()
--saveEngine=test_fp32.trt
```

## ONNX Related Tools

### [onnx-simplify](https://github.com/daquexian/onnx-simplifier) {#onnx-smi}

A tool for simplifying the model structure:

```bash
pip install onnx onnxsim
onnxsim input.onnx output.onnx
```

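Besides the command line, onnxsim can also be called from Python. A minimal sketch (the file names are placeholders), mirroring the simplify call used in the export snippet earlier:

```python
import onnx
import onnxsim

model = onnx.load("input.onnx")
model_simp, ok = onnxsim.simplify(model)  # returns (simplified model, validation flag)
assert ok, "simplified model failed validation"
onnx.save(model_simp, "output.onnx")
```
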
### [netron](https://github.com/lutzroeder/netron)

A tool for visualizing ONNX models.

Run `pip install netron`, then `netron [FILE]` or `netron.start('[FILE]')`.

### [ONNX GraphSurgeon](https://github.com/NVIDIA/TensorRT/tree/master/tools/onnx-graphsurgeon)

ONNX GraphSurgeon is a tool released by NVIDIA as part of TensorRT for modifying ONNX graph structures.
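As a rough illustration (assuming the package is installed via `pip install onnx_graphsurgeon` and `model.onnx` is a placeholder path), a typical edit round trip looks like:

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))  # ONNX protobuf -> editable graph
# ... inspect or modify graph.nodes / graph.inputs / graph.outputs here ...
graph.cleanup().toposort()                       # drop dangling nodes and re-sort
onnx.save(gs.export_onnx(graph), "model_modified.onnx")
```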

### [Polygraphy](https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy)

Polygraphy is NVIDIA's official tool for testing TensorRT and ONNX models. It provides model conversion, supports debugging FP16 precision loss, and allows specific layers to be excluded from FP16.


## Reference Links
- [PyTorch to ONNX Conversion Tutorial](https://zhuanlan.zhihu.com/p/498425043)
- [Modifying and Debugging ONNX Models](https://zhuanlan.zhihu.com/p/516920606)
- [TensorRT Tutorial | Based on version 8.6.1](https://www.bilibili.com/video/BV1jj411Z7wG/?spm_id_from=333.999.0.0&vd_source=c31de98543aa977b5899e24bdd5d8f89)
- [Quantization Tutorial](https://github.com/NVIDIA/TensorRT/tree/release/8.6/quickstart/quantization_tutorial)

docs/introduction.md

+1-1
@@ -11,7 +11,7 @@ To enhance the peak throughput of deep learning serving, various challenges must

There are some industry practices, such as [triton inference server](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models), [Alimama high_service(in chinese)](https://mp.weixin.qq.com/s/Fd2GNXqO3wl3FrA7Wli3jA), and [Meituan Vision GPU Inference Service Deployment Architecture Optimization Practice(in chinese)](https://zhuanlan.zhihu.com/p/605094862).

One common complaint from users of the Triton Inference Server is that in a system with multiple intertwined nodes, a lot of business logic needs to be completed on the client side and then called through RPC to the server, which can be cumbersome. For performance reasons, unconventional methods such as shared memory, ensemble, and [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) must be considered.

To address these issues, TorchPipe provides a thread-safe function interface for the PyTorch frontend and a fine-grained backend extension for users, by delving into PyTorch's C++ calculation backend and CUDA stream management, as well as modeling domain-specific languages for multiple nodes.
