
Commit f5ce8de: add notes (1 parent: 8b58eca)


44 files changed: +1878, -0 lines
Lines changed: 19 additions & 0 deletions
---
title: course materials
excerpt: A roundup of course resources on machine learning and parallel computing, including MLSys courses, GPU parallel-programming courses, and high-performance-computing lab links from CMU, EPFL, the University of Washington, and other well-known schools.
---

## Deep learning
### mlsys
- [CS 723 Topics on ML Systems Spring 2023](https://parsa.epfl.ch/course-info/cs723/index.php?page=schedule.php)
- [ML for ML Systems (cse599m, courses.cs.washington.edu)](https://courses.cs.washington.edu/courses/cse599m/23sp/)
- [15-849: Machine Learning Systems cs.cmu](https://www.cs.cmu.edu/~zhihaoj2/15-849/)
- [15-884: Machine Learning Systems cs.cmu](https://catalyst.cs.cmu.edu/15-884-mlsys-sp21/)
- [cs294-ai-sys-sp22 Machine Learning Systems ucbrise](https://ucbrise.github.io/cs294-ai-sys-sp22/)

## parallel computing (GPU)
- [CS380 GPU and GPGPU Programming (vccvisualization.org)](https://vccvisualization.org/CS380_GPU_and_GPGPU_Programming/)

## Lab link
- [Architecture Lab for Creative High-performance Energy-efficient Machines (alchem.cs.purdue.edu)](https://alchem.cs.purdue.edu/index.html)
Lines changed: 27 additions & 0 deletions
---
title: flash attention
tags: [Flash Attention, Transformer, GPU Optimization]
excerpt: Flash Attention technology explained, including parallelization strategies, work-partition optimization, supported head dimensions, and FlashAttention-2's fused kernels, matrix tiling, causal masking, and other core optimization techniques.
---

# Flash Attention
## [Flash Attention](https://openreview.net/pdf?id=H4DqfPSibmx)
- parallelism
  - FlashAttention parallelizes over the batch size and the number of heads.
  - FlashAttention-2 additionally parallelizes over the sequence-length dimension, which helps with long sequences (where the batch size or the number of heads is small).
- better work partition
  - The goal is to reduce the amount of synchronization and communication between different warps.
  - FlashAttention splits K and V across 4 warps while keeping Q accessible by all warps.
  - FlashAttention-2 splits Q across 4 warps while keeping K and V accessible by all warps.
  ![image](https://github.com/zhangjun/zhangjun.github.io/assets/1312389/703cdb9d-927b-4316-85ad-58d380b9478d)
- supported head dimensions up to 256, plus MQA and GQA

## FlashAttention-2 optimizations
<img width="657" alt="image" src="https://github.com/zhangjun/zhangjun.github.io/assets/1312389/2cadba46-f1e4-4b4b-a2bc-aaa0c0061b4d">

Key FlashAttention-2 optimizations (the tiling idea is sketched in code after this list):
- Fused kernel with matrix tiling
- Causal masking
- Reducing non-matmul computation
- Pipeline scheduling with asynchronous loads and double buffering
- Layout swizzle
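
To make the tiling idea concrete, here is a minimal NumPy sketch (not the CUDA kernel) of the online-softmax tiling that the fused kernel relies on. The function name `tiled_attention`, the block size, and the correctness check are illustrative choices, not part of the FlashAttention code.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Naive NumPy sketch of the online-softmax tiling used by FlashAttention.

    Processes K/V in blocks and keeps running softmax statistics (row max `m`
    and normalizer `l`), so the full N x N score matrix is never materialized.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator

    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        s = (Q @ Kb.T) * scale                 # scores for this K/V block
        m_new = np.maximum(m, s.max(axis=1))   # updated row max
        p = np.exp(s - m_new[:, None])         # block softmax numerator
        correction = np.exp(m - m_new)         # rescale previous partial sums
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        m = m_new

    return out / l[:, None]

# Check against the reference (fully materialized) attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
s = Q @ K.T / np.sqrt(64)
ref = np.exp(s - s.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
print(np.abs(tiled_attention(Q, K, V) - ref).max())
```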
Lines changed: 3 additions & 0 deletions
---
title: deep learning & llm
---
Lines changed: 3 additions & 0 deletions
---
title: multi head attention
---
Lines changed: 108 additions & 0 deletions
---
title: norm
tags: [Normalization, Deep Learning, PyTorch]
excerpt: A detailed look at normalization methods in deep learning, covering the principles and implementations of Batch Norm, Layer Norm, Instance Norm, and Group Norm, with PyTorch code examples that help clarify where each normalization strategy applies.
---

## Normalization methods

![](https://img-blog.csdnimg.cn/20210319221042880.png)

### Batch Norm
Batch Norm normalizes along the channel dimension, yielding C pairs of statistics (μ, σ). For an input of shape [N, H, W, C], the mean and variance are computed over [N, H, W] for each channel and used to normalize that channel.
```python
import numpy as np
import torch
import torch.nn as nn
from einops import rearrange

# Stack a list of HWC images into a (b, h, w, c) array.
image = [np.random.randn(30, 40, 3) for _ in range(16)]
image = rearrange(image, 'b h w c -> b h w c')

# Manual batch norm: statistics over (N, H, W) for each channel.
image_ = rearrange(image, 'b h w c -> (b h w) c')
mean = rearrange(image_.mean(axis=0), 'c -> 1 1 1 c')
std = rearrange(image_.std(axis=0), 'c -> 1 1 1 c')
y_ = (image - mean) / std

b, h, w, c = image.shape
bn = nn.BatchNorm2d(c, eps=1e-10, affine=False, track_running_stats=False)
# BatchNorm2d expects NCHW input, so move channels to dim 1 and back afterwards.
y = bn(torch.from_numpy(image).permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

print('diff={}\n'.format(torch.abs(y - torch.from_numpy(y_)).max()))
```
### Layer Norm
Layer Norm computes statistics per sample, yielding N pairs of (μ, σ). For an input of shape [N, H, W, C], the mean and variance are computed over [H, W, C] for each sample and used to normalize that sample.
```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

layer_norm = nn.LayerNorm([c, h, w], eps=1e-12, elementwise_affine=False)
y = layer_norm(x)

# Manual layer norm: statistics over (C, H, W) for each sample.
x_ = rearrange(x, 'b c h w -> (h w c) b')
mean = rearrange(x_.mean(dim=0), 'b -> b 1 1 1')
# LayerNorm uses the biased (population) variance, so unbiased=False.
std = rearrange(x_.std(dim=0, unbiased=False), 'b -> b 1 1 1')

y_ = (x - mean) / std

print('diff={}\n'.format(torch.abs(y - y_).max()))
```
### Instance Norm
Instance Norm computes statistics per sample and per channel over [H, W], yielding N×C pairs of (μ, σ).
```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

instance_norm = nn.InstanceNorm2d(c, eps=1e-12, affine=False, track_running_stats=False)
y = instance_norm(x)

# Manual instance norm: statistics over (H, W) for each sample and channel.
x_ = rearrange(x, 'b c h w -> b c (h w)')
mean = rearrange(x_.mean(dim=2), 'b c -> b c 1 1')
# InstanceNorm uses the biased (population) variance, so unbiased=False.
std = rearrange(x_.std(dim=2, unbiased=False), 'b c -> b c 1 1')

y_ = (x - mean) / std

print('diff={}\n'.format(torch.abs(y - y_).max()))
```
### Group Norm
Group Norm splits the C channels into G groups and computes statistics over [C/G, H, W] for each sample, yielding N×G pairs of (μ, σ).
```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn((6, 6, 20, 20))
b, c, h, w = x.shape
group_num = 3
n = c // group_num  # channels per group = 2

group_norm = nn.GroupNorm(group_num, c, eps=1e-12, affine=False)
y = group_norm(x)

# Manual group norm: statistics over (C/G, H, W) for each sample and group.
x_ = rearrange(x, 'b (g n) h w -> b g (n h w)', g=group_num)  # [6, 3, 2*20*20]
mean = rearrange(x_.mean(dim=2), 'b g -> b g 1')  # [6, 3, 1]
# GroupNorm uses the biased (population) variance, so unbiased=False.
std = rearrange(x_.std(dim=2, unbiased=False), 'b g -> b g 1')

y_ = (x_ - mean) / std
y_ = rearrange(y_, 'b g (n h w) -> b (g n) h w', g=group_num, h=h, w=w)

print('diff={}\n'.format(torch.abs(y - y_).max()))
```
Lines changed: 34 additions & 0 deletions
---
title: quantization
tags: [Quantization, LLM, Optimization]
excerpt: A survey of quantization techniques for large language models, covering SmoothQuant, AWQ, LLM.int8, GPTQ, ZeroQuant, LUT-GEMM, SparseGPT, and other advanced methods, as well as weight-only quantization for inference optimization.
---

## LLM quantization
https://zhuanlan.zhihu.com/p/616969812
- [SmoothQuant](https://github.com/mit-han-lab/smoothquant)
- Outlier Suppression
  - [Outlier Suppression](https://arxiv.org/abs/2209.13325)
  - [Outlier Suppression+](https://arxiv.org/abs/2304.09145)
- [AWQ](https://arxiv.org/abs/2306.00978)
  - Protects salient weights through per-channel scaling based on activation and weight magnitudes.
- [LLM.int8](https://arxiv.org/abs/2208.07339)
  - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  - https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
  - https://github.com/timdettmers/bitsandbytes
- [GPTQ](https://arxiv.org/pdf/2210.17323.pdf)
  - [GPTQ: model quantization on a budget](https://zhuanlan.zhihu.com/p/616969812)
  - No greedy ordering: when optimizing W, columns are quantized in a fixed order.
  - Weight updates for different columns of W are mutually independent, so they are applied in batches (lazy batch updates, controlled by the group_size parameter).
  - Numerical-stability improvements.
  - [4-bit LLM Quantization with GPTQ](https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html)
- [ZeroQuant](https://arxiv.org/abs/2206.01861)
  - https://github.com/microsoft/DeepSpeed/pull/2217
  - [ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation](https://arxiv.org/abs/2303.08302)
- [LUT-GEMM](https://arxiv.org/pdf/2206.09557.pdf)
  - [LUT-GEMM: efficient inference for large-scale generative language models with LUT-based quantized matrix multiplication](https://blog.csdn.net/weixin_42764932/article/details/131230429?spm=1001.2014.3001.5501)
- [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf)
- weight only (a minimal quantize/dequantize sketch follows this list)
  - [Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production](https://arxiv.org/pdf/2211.10017.pdf)
  - [TVM weight-only](https://github.com/apache/tvm/pull/15111)
  - [cutlass_fpA_intB_gemm](https://github.com/tlc-pack/cutlass_fpA_intB_gemm/pull/1/files)
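
As a concrete reference for the weight-only entries above, here is a minimal per-output-channel absmax int8 quantize/dequantize sketch in plain PyTorch. The helper names and the per-row granularity are illustrative assumptions, not the API of any of the libraries linked above.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Per-output-channel absmax quantization of a [out, in] weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_weight_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight_int8(q, scale)
print('max abs error:', (w - w_hat).abs().max().item())

# In a weight-only GEMM (e.g. fpA_intB), activations stay in fp16/fp32 and the
# int8 weights are dequantized on the fly inside the kernel.
x = torch.randn(8, 4096)
y = x @ w_hat.t()   # reference result; a fused kernel avoids materializing w_hat
```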
Lines changed: 12 additions & 0 deletions
---
title: sparse on nvgpu
tags: [Sparse Computing, GPU, NVIDIA]
excerpt: NVIDIA GPU sparse computing technology overview, including efficient GPU kernel implementations for N:M sparse weights, Apex N:M sparse support, structured sparsity optimization on Tensor Cores, and related papers and open source project resources.
---

## Sparse on Nvidia GPU
- [Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning](https://proceedings.mlsys.org/paper_files/paper/2023/file/4552cedd396a308320209f75f56a5ad5-Paper-mlsys2023.pdf)
- [Apex N:M sparse](https://github.com/NVIDIA/apex/pull/1631)
- [Sparse GPU Kernels for Deep Learning](https://arxiv.org/pdf/2006.10901.pdf)
- [Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision](https://seal.ece.ucsb.edu/sites/default/files/publications/vector_sparse_transformer_camera_ready_.pdf)
- [N:M Fine-grained Structured Sparse Neural Networks](https://github.com/aojunzz/NM-sparsity)
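
For reference, a minimal PyTorch sketch (not the Apex or cuSPARSELt implementation) of magnitude-based 2:4 structured pruning, i.e. keeping the 2 largest-magnitude weights in every group of 4 along each row; the helper name and the simple reshape-based grouping are illustrative.

```python
import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every group of 4 (last dim)."""
    out, cols = w.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 sparsity"
    groups = w.reshape(out, cols // 4, 4)
    # Indices of the top-2 magnitudes inside each group of 4.
    topk = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return (groups * mask).reshape(out, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
# Every group of 4 consecutive weights now has exactly 2 non-zeros, which is
# the fine-grained structured pattern Ampere sparse Tensor Cores accelerate.
print((w_sparse.reshape(8, 4, 4) != 0).sum(dim=-1))
```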
Lines changed: 33 additions & 0 deletions
---
title: stable diffusion optimization
tags: [Stable Diffusion, Optimization, PaddlePaddle]
excerpt: Stable Diffusion inference optimization explained, covering Flash Attention, norm fusion, mixed-layout computation, and inference memory optimization, reaching 0.76 s per 512×512 image and beating TensorRT by 7.9%.
---

## stable diffusion
Common fused norm patterns (a naive reference of the last pattern is sketched below):
- norm+act
- add_bias_norm
- add_norm_act
- add_bias_add_norm_act
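
A naive PyTorch reference (not Paddle's fused kernel) of the add_bias_add_norm_act pattern: bias add, residual add, GroupNorm, then SiLU, written as one function so the intermediate tensors that a fused kernel avoids are easy to see. The function name, group count, and choice of SiLU are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def add_bias_add_norm_act(x, bias, residual, num_groups=32, eps=1e-5):
    """Reference for the fused pattern: (x + bias + residual) -> GroupNorm -> SiLU.

    A fused kernel computes this in a single pass over memory; this reference
    materializes every intermediate, which is exactly the traffic fusion removes.
    """
    h = x + bias + residual                    # add_bias + add
    h = F.group_norm(h, num_groups, eps=eps)   # norm (no affine params here)
    return F.silu(h)                           # act

x = torch.randn(2, 320, 64, 64)       # NCHW feature map from a U-Net block
bias = torch.randn(1, 320, 1, 1)
residual = torch.randn(2, 320, 64, 64)
y = add_bias_add_norm_act(x, bias, residual)
print(y.shape)
```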

## paddle stable diffusion optimization

Running Stable Diffusion inference on PaddlePaddle reaches 68.2 iters/s for 512×512 image generation, producing an image in 0.76 s. That is 4x the inference speed of Diffusers (PyTorch) and 7.9% faster than the best TensorRT configuration, while using only 43% of TensorRT's GPU memory.

- Flash Attention

PaddlePaddle has invested heavily in large-model inference optimization and supports high-performance inference for a variety of common Transformer-style structures. For Stable Diffusion inference, the high-performance Flash Attention kernel integrated into Paddle decomposes the softmax in attention into tiled partial computations, greatly reducing the number of GPU-memory accesses made by self-attention and cross-attention during inference, which speeds up inference while also reducing memory usage.

- Norm fusion

Norm ops are used throughout Stable Diffusion's U-Net, mainly LayerNorm and GroupNorm. As batch-reduction operations, LayerNorm and GroupNorm fuse well with neighboring elementwise and activation ops, eliminating the memory traffic between ops. Paddle fuses LayerNorm and GroupNorm with their neighbors in four different patterns, fusing 93 norm structures in total for a 3% inference-performance gain.
![image](https://github.com/zhangjun/zhangjun.github.io/assets/1312389/45db9b78-cdc4-4f0d-b098-93a4e7b2c40f)

- Mixed-layout computation

By matching and optimizing the tensor layouts in the model, different layouts can be used to eliminate and merge the transpose operations in the U-Net, improving inference speed while also lowering runtime memory usage. In total 32 transpose operations are removed, yielding a 3-4% inference speedup and reducing overall memory usage by about 19%.
![image](https://github.com/zhangjun/zhangjun.github.io/assets/1312389/40b74c4d-7c6e-44d4-9a17-5741aea70632)

- Inference memory optimization

Workspace-reuse techniques during inference.
Lines changed: 136 additions & 0 deletions
---
title: transformer
tags: [Transformer, Deep Learning, Attention]
excerpt: A deep dive into the Transformer architecture, covering the Encoder and Decoder structure, the Multi-Head Attention mechanism, the Position-Wise Feed-Forward Network, and a complete TensorFlow implementation.
---

![image](https://user-images.githubusercontent.com/1312389/228570903-35442ec1-3064-4331-8e0b-c922ea1b806d.png)

- The Transformer Encoder consists of 6 identical layers, each with 2 sub-layers (a sketch of the scaled dot-product attention at the core of these sub-layers follows this list):
  - Multi-Head Self-Attention
  - Position-Wise Feed-Forward Network
- The Decoder also consists of 6 identical layers, each with 3 sub-layers:
  - Multi-Head Self-Attention
  - Multi-Head Context-Attention
  - Position-Wise Feed-Forward Network
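
Both attention sub-layers reduce to scaled dot-product attention. A minimal TensorFlow version, in the same style as the Encoder/Decoder snippets below (the MultiHeadAttention wrapper that splits heads and projects Q, K, V is assumed to be defined elsewhere), is:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with optional masking."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += (mask * -1e9)              # masked positions get ~0 probability
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights
```

Multi-head attention then just projects Q, K, V, splits them into heads, applies this function per head, and concatenates the results.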

The BERT code is essentially this same encoder stack.
The left half of the Transformer is made of 6 such encoder units; now stack the EncoderLayers: `embedding-->pos_encoding-->dropout-->EncoderLayers`

<details>
<summary>Encoder code</summary>

```python
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding,
                                                self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        # Add the token embeddings and the positional encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)
```
</details>
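
The Encoder above assumes a positional_encoding helper. A common sinusoidal implementation, shown here as a self-contained sketch rather than the exact helper used with this code, is:

```python
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    """Standard sinusoidal positional encoding, shape (1, position, d_model)."""
    pos = np.arange(position)[:, np.newaxis]   # (position, 1)
    i = np.arange(d_model)[np.newaxis, :]      # (1, d_model)
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even indices: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd indices: cos
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)
```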

The right half of the Transformer is made of 6 such decoder units; now stack the DecoderLayers. Unlike the Encoder, the Decoder also returns a dictionary holding each layer's attention_weights: `embedding-->pos_encoding-->dropout-->DecoderLayers`
<details>
<summary>Decoder code</summary>

```python
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        # Store each layer's attention weights in a dictionary.
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights
```
</details>

Assemble the Transformer class: `Encoder-->Decoder-->final_layer`. The final_layer is a fully connected layer whose output size matches the target vocab size.
<details>
<summary>Transformer code</summary>

```python
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask,
             look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights
```
</details>
