
Commit f5ce8de: add notes (1 parent: 8b58eca)


44 files changed: +1878, -0 lines
Lines changed: 19 additions & 0 deletions
---
title: course materials
excerpt: A roundup of course resources on machine learning and parallel computing, including MLSys courses, GPU parallel-programming courses, and high-performance-computing lab links from CMU, EPFL, the University of Washington, and other well-known schools.
---

## Deep learning
### mlsys
- [CS 723 Topics on ML Systems Spring 2023](https://parsa.epfl.ch/course-info/cs723/index.php?page=schedule.php)
- [ML for ML Systems (cse599m, courses.cs.washington.edu)](https://courses.cs.washington.edu/courses/cse599m/23sp/)
- [15-849: Machine Learning Systems cs.cmu](https://www.cs.cmu.edu/~zhihaoj2/15-849/)
- [15-884: Machine Learning Systems cs.cmu](https://catalyst.cs.cmu.edu/15-884-mlsys-sp21/)
- [cs294-ai-sys-sp22 Machine Learning Systems ucbrise](https://ucbrise.github.io/cs294-ai-sys-sp22/)

## parallel computing (GPU)
- [CS380 GPU and GPGPU Programming (vccvisualization.org)](https://vccvisualization.org/CS380_GPU_and_GPGPU_Programming/)

## Lab link
- [Architecture Lab for Creative High-performance Energy-efficient Machines (alchem.cs.purdue.edu)](https://alchem.cs.purdue.edu/index.html)
Lines changed: 27 additions & 0 deletions
---
title: flash attention
tags: [Flash Attention, Transformer, GPU Optimization]
excerpt: Flash Attention technology explained, including parallelization strategies, work-partition optimization, supported head dimensions, and FlashAttention-2's fused kernels, matrix tiling, causal masking, and other core optimization techniques.
---

# Flash Attention
## [Flash Attention](https://openreview.net/pdf?id=H4DqfPSibmx)
- parallelism
  - FlashAttention parallelizes over the batch size and the number of heads.
  - FlashAttention-2 additionally parallelizes over the sequence-length dimension, which helps with long sequences (where the batch size or the number of heads is small).
- better work partition
  - The goal is to reduce the amount of synchronization and communication between different warps.
  - FlashAttention splits K and V across 4 warps while keeping Q accessible by all warps.
  - FlashAttention-2 splits Q across 4 warps while keeping K and V accessible by all warps.
  ![image](https://github.com/zhangjun/zhangjun.github.io/assets/1312389/703cdb9d-927b-4316-85ad-58d380b9478d)
- supported head dimensions up to 256, plus MQA and GQA

## FlashAttention-2 optimizations
<img width="657" alt="image" src="https://github.com/zhangjun/zhangjun.github.io/assets/1312389/2cadba46-f1e4-4b4b-a2bc-aaa0c0061b4d">

Key FlashAttention-2 optimizations (the tiling idea is sketched in code after this list):
- Fused kernel with matrix tiling
- Causal masking
- Reducing non-matmul computation
- Pipeline scheduling with asynchronous loads and double buffering
- Layout swizzle
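
To make the tiling idea concrete, here is a minimal NumPy sketch (not the CUDA kernel) of the online-softmax tiling that the fused kernel relies on. The function name `tiled_attention`, the block size, and the correctness check are illustrative choices, not part of the FlashAttention code.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Naive NumPy sketch of the online-softmax tiling used by FlashAttention.

    Processes K/V in blocks and keeps running softmax statistics (row max `m`
    and normalizer `l`), so the full N x N score matrix is never materialized.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator

    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        s = (Q @ Kb.T) * scale                 # scores for this K/V block
        m_new = np.maximum(m, s.max(axis=1))   # updated row max
        p = np.exp(s - m_new[:, None])         # block softmax numerator
        correction = np.exp(m - m_new)         # rescale previous partial sums
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        m = m_new

    return out / l[:, None]

# Check against the reference (fully materialized) attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
s = Q @ K.T / np.sqrt(64)
ref = np.exp(s - s.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
print(np.abs(tiled_attention(Q, K, V) - ref).max())
```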
Lines changed: 3 additions & 0 deletions
---
title: deep learning & llm
---
Lines changed: 3 additions & 0 deletions
---
title: multi head attention
---
Lines changed: 108 additions & 0 deletions
---
title: norm
tags: [Normalization, Deep Learning, PyTorch]
excerpt: A detailed look at normalization methods in deep learning, covering the principles and implementations of Batch Norm, Layer Norm, Instance Norm, and Group Norm, with PyTorch code examples that help clarify where each normalization strategy applies.
---

## Normalization methods

![](https://img-blog.csdnimg.cn/20210319221042880.png)

### Batch Norm
Batch Norm normalizes along the channel dimension, yielding C pairs of statistics (μ, σ). For an input of shape [N, H, W, C], the mean and variance are computed over [N, H, W] for each channel and used to normalize that channel.
```python
import numpy as np
import torch
import torch.nn as nn
from einops import rearrange

# Stack a list of HWC images into a (b, h, w, c) array.
image = [np.random.randn(30, 40, 3) for _ in range(16)]
image = rearrange(image, 'b h w c -> b h w c')

# Manual batch norm: statistics over (N, H, W) for each channel.
image_ = rearrange(image, 'b h w c -> (b h w) c')
mean = rearrange(image_.mean(axis=0), 'c -> 1 1 1 c')
std = rearrange(image_.std(axis=0), 'c -> 1 1 1 c')
y_ = (image - mean) / std

b, h, w, c = image.shape
bn = nn.BatchNorm2d(c, eps=1e-10, affine=False, track_running_stats=False)
# BatchNorm2d expects NCHW input, so move channels to dim 1 and back afterwards.
y = bn(torch.from_numpy(image).permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

print('diff={}\n'.format(torch.abs(y - torch.from_numpy(y_)).max()))
```
### Layer Norm
Layer Norm computes statistics per sample, yielding N pairs of (μ, σ). For an input of shape [N, H, W, C], the mean and variance are computed over [H, W, C] for each sample and used to normalize that sample.
```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

layer_norm = nn.LayerNorm([c, h, w], eps=1e-12, elementwise_affine=False)
y = layer_norm(x)

# Manual layer norm: statistics over (C, H, W) for each sample.
x_ = rearrange(x, 'b c h w -> (h w c) b')
mean = rearrange(x_.mean(dim=0), 'b -> b 1 1 1')
# LayerNorm uses the biased (population) variance, so unbiased=False.
std = rearrange(x_.std(dim=0, unbiased=False), 'b -> b 1 1 1')

y_ = (x - mean) / std

print('diff={}\n'.format(torch.abs(y - y_).max()))
```
### Instance Norm
Instance Norm computes statistics per sample and per channel over [H, W], yielding N×C pairs of (μ, σ).
```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

instance_norm = nn.InstanceNorm2d(c, eps=1e-12, affine=False, track_running_stats=False)
y = instance_norm(x)

# Manual instance norm: statistics over (H, W) for each sample and channel.
x_ = rearrange(x, 'b c h w -> b c (h w)')
mean = rearrange(x_.mean(dim=2), 'b c -> b c 1 1')
# InstanceNorm uses the biased (population) variance, so unbiased=False.
std = rearrange(x_.std(dim=2, unbiased=False), 'b c -> b c 1 1')

y_ = (x - mean) / std

print('diff={}\n'.format(torch.abs(y - y_).max()))
```
### Group Norm
Group Norm splits the C channels into G groups and computes statistics over [C/G, H, W] for each sample, yielding N×G pairs of (μ, σ).
```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn((6, 6, 20, 20))
b, c, h, w = x.shape
group_num = 3
n = c // group_num  # channels per group = 2

group_norm = nn.GroupNorm(group_num, c, eps=1e-12, affine=False)
y = group_norm(x)

# Manual group norm: statistics over (C/G, H, W) for each sample and group.
x_ = rearrange(x, 'b (g n) h w -> b g (n h w)', g=group_num)  # [6, 3, 2*20*20]
mean = rearrange(x_.mean(dim=2), 'b g -> b g 1')  # [6, 3, 1]
# GroupNorm uses the biased (population) variance, so unbiased=False.
std = rearrange(x_.std(dim=2, unbiased=False), 'b g -> b g 1')

y_ = (x_ - mean) / std
y_ = rearrange(y_, 'b g (n h w) -> b (g n) h w', g=group_num, h=h, w=w)

print('diff={}\n'.format(torch.abs(y - y_).max()))
```
Lines changed: 34 additions & 0 deletions
---
title: quantization
tags: [Quantization, LLM, Optimization]
excerpt: A survey of quantization techniques for large language models, covering SmoothQuant, AWQ, LLM.int8, GPTQ, ZeroQuant, LUT-GEMM, SparseGPT, and other advanced methods, as well as weight-only quantization for inference optimization.
---

## LLM quantization
https://zhuanlan.zhihu.com/p/616969812
- [SmoothQuant](https://github.com/mit-han-lab/smoothquant)
- Outlier Suppression
  - [Outlier Suppression](https://arxiv.org/abs/2209.13325)
  - [Outlier Suppression+](https://arxiv.org/abs/2304.09145)
- [AWQ](https://arxiv.org/abs/2306.00978)
  - Protects salient weights through per-channel scaling based on activation and weight magnitudes.
- [LLM.int8](https://arxiv.org/abs/2208.07339)
  - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  - https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
  - https://github.com/timdettmers/bitsandbytes
- [GPTQ](https://arxiv.org/pdf/2210.17323.pdf)
  - [GPTQ: model quantization on a budget](https://zhuanlan.zhihu.com/p/616969812)
  - No greedy ordering: when optimizing W, columns are quantized in a fixed order.
  - Weight updates for different columns of W are mutually independent, so they are applied in batches (lazy batch updates, controlled by the group_size parameter).
  - Numerical-stability improvements.
  - [4-bit LLM Quantization with GPTQ](https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html)
- [ZeroQuant](https://arxiv.org/abs/2206.01861)
  - https://github.com/microsoft/DeepSpeed/pull/2217
  - [ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation](https://arxiv.org/abs/2303.08302)
- [LUT-GEMM](https://arxiv.org/pdf/2206.09557.pdf)
  - [LUT-GEMM: efficient inference for large-scale generative language models with LUT-based quantized matrix multiplication](https://blog.csdn.net/weixin_42764932/article/details/131230429?spm=1001.2014.3001.5501)
- [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf)
- weight only (a minimal quantize/dequantize sketch follows this list)
  - [Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production](https://arxiv.org/pdf/2211.10017.pdf)
  - [TVM weight-only](https://github.com/apache/tvm/pull/15111)
  - [cutlass_fpA_intB_gemm](https://github.com/tlc-pack/cutlass_fpA_intB_gemm/pull/1/files)
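
As a concrete reference for the weight-only entries above, here is a minimal per-output-channel absmax int8 quantize/dequantize sketch in plain PyTorch. The helper names and the per-row granularity are illustrative assumptions, not the API of any of the libraries linked above.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Per-output-channel absmax quantization of a [out, in] weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_weight_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight_int8(q, scale)
print('max abs error:', (w - w_hat).abs().max().item())

# In a weight-only GEMM (e.g. fpA_intB), activations stay in fp16/fp32 and the
# int8 weights are dequantized on the fly inside the kernel.
x = torch.randn(8, 4096)
y = x @ w_hat.t()   # reference result; a fused kernel avoids materializing w_hat
```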
Lines changed: 12 additions & 0 deletions
---
title: sparse on nvgpu
tags: [Sparse Computing, GPU, NVIDIA]
excerpt: NVIDIA GPU sparse computing technology overview, including efficient GPU kernel implementations for N:M sparse weights, Apex N:M sparse support, structured sparsity optimization on Tensor Cores, and related papers and open source project resources.
---

## Sparse on Nvidia GPU
- [Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning](https://proceedings.mlsys.org/paper_files/paper/2023/file/4552cedd396a308320209f75f56a5ad5-Paper-mlsys2023.pdf)
- [Apex N:M sparse](https://github.com/NVIDIA/apex/pull/1631)
- [Sparse GPU Kernels for Deep Learning](https://arxiv.org/pdf/2006.10901.pdf)
- [Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision](https://seal.ece.ucsb.edu/sites/default/files/publications/vector_sparse_transformer_camera_ready_.pdf)
- [N:M Fine-grained Structured Sparse Neural Networks](https://github.com/aojunzz/NM-sparsity)
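
For reference, a minimal PyTorch sketch (not the Apex or cuSPARSELt implementation) of magnitude-based 2:4 structured pruning, i.e. keeping the 2 largest-magnitude weights in every group of 4 along each row; the helper name and the simple reshape-based grouping are illustrative.

```python
import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every group of 4 (last dim)."""
    out, cols = w.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 sparsity"
    groups = w.reshape(out, cols // 4, 4)
    # Indices of the top-2 magnitudes inside each group of 4.
    topk = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return (groups * mask).reshape(out, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
# Every group of 4 consecutive weights now has exactly 2 non-zeros, which is
# the fine-grained structured pattern Ampere sparse Tensor Cores accelerate.
print((w_sparse.reshape(8, 4, 4) != 0).sum(dim=-1))
```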
Lines changed: 33 additions & 0 deletions
---
title: stable diffusion optimization
tags: [Stable Diffusion, Optimization, PaddlePaddle]
excerpt: Stable Diffusion inference optimization explained, covering Flash Attention, norm fusion, mixed-layout computation, and inference memory optimization, reaching 0.76 s per 512×512 image and beating TensorRT by 7.9%.
---

## stable diffusion
Common fused norm patterns (a naive reference of the last pattern is sketched below):
- norm+act
- add_bias_norm
- add_norm_act
- add_bias_add_norm_act
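
A naive PyTorch reference (not Paddle's fused kernel) of the add_bias_add_norm_act pattern: bias add, residual add, GroupNorm, then SiLU, written as one function so the intermediate tensors that a fused kernel avoids are easy to see. The function name, group count, and choice of SiLU are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def add_bias_add_norm_act(x, bias, residual, num_groups=32, eps=1e-5):
    """Reference for the fused pattern: (x + bias + residual) -> GroupNorm -> SiLU.

    A fused kernel computes this in a single pass over memory; this reference
    materializes every intermediate, which is exactly the traffic fusion removes.
    """
    h = x + bias + residual                    # add_bias + add
    h = F.group_norm(h, num_groups, eps=eps)   # norm (no affine params here)
    return F.silu(h)                           # act

x = torch.randn(2, 320, 64, 64)       # NCHW feature map from a U-Net block
bias = torch.randn(1, 320, 1, 1)
residual = torch.randn(2, 320, 64, 64)
y = add_bias_add_norm_act(x, bias, residual)
print(y.shape)
```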

## paddle stable diffusion optimization

Running Stable Diffusion inference on PaddlePaddle reaches 68.2 iters/s for 512×512 image generation, producing an image in 0.76 s. That is 4x the inference speed of Diffusers (PyTorch) and 7.9% faster than the best TensorRT configuration, while using only 43% of TensorRT's GPU memory.

- Flash Attention

PaddlePaddle has invested heavily in large-model inference optimization and supports high-performance inference for a variety of common Transformer-style structures. For Stable Diffusion inference, the high-performance Flash Attention kernel integrated into Paddle decomposes the softmax in attention into tiled partial computations, greatly reducing the number of GPU-memory accesses made by self-attention and cross-attention during inference, which speeds up inference while also reducing memory usage.

- Norm fusion

Norm ops are used throughout Stable Diffusion's U-Net, mainly LayerNorm and GroupNorm. As batch-reduction operations, LayerNorm and GroupNorm fuse well with neighboring elementwise and activation ops, eliminating the memory traffic between ops. Paddle fuses LayerNorm and GroupNorm with their neighbors in four different patterns, fusing 93 norm structures in total for a 3% inference-performance gain.
![image](https://github.com/zhangjun/zhangjun.github.io/assets/1312389/45db9b78-cdc4-4f0d-b098-93a4e7b2c40f)

- Mixed-layout computation

By matching and optimizing the tensor layouts in the model, different layouts can be used to eliminate and merge the transpose operations in the U-Net, improving inference speed while also lowering runtime memory usage. In total 32 transpose operations are removed, yielding a 3-4% inference speedup and reducing overall memory usage by about 19%.
![image](https://github.com/zhangjun/zhangjun.github.io/assets/1312389/40b74c4d-7c6e-44d4-9a17-5741aea70632)

- Inference memory optimization

Workspace-reuse techniques during inference.
Lines changed: 136 additions & 0 deletions
---
title: transformer
tags: [Transformer, Deep Learning, Attention]
excerpt: A deep dive into the Transformer architecture, covering the Encoder and Decoder structure, the Multi-Head Attention mechanism, the Position-Wise Feed-Forward Network, and a complete TensorFlow implementation.
---

![image](https://user-images.githubusercontent.com/1312389/228570903-35442ec1-3064-4331-8e0b-c922ea1b806d.png)

- The Transformer Encoder consists of 6 identical layers, each with 2 sub-layers (a sketch of the scaled dot-product attention at the core of these sub-layers follows this list):
  - Multi-Head Self-Attention
  - Position-Wise Feed-Forward Network
- The Decoder also consists of 6 identical layers, each with 3 sub-layers:
  - Multi-Head Self-Attention
  - Multi-Head Context-Attention
  - Position-Wise Feed-Forward Network
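
Both attention sub-layers reduce to scaled dot-product attention. A minimal TensorFlow version, in the same style as the Encoder/Decoder snippets below (the MultiHeadAttention wrapper that splits heads and projects Q, K, V is assumed to be defined elsewhere), is:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with optional masking."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += (mask * -1e9)              # masked positions get ~0 probability
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights
```

Multi-head attention then just projects Q, K, V, splits them into heads, applies this function per head, and concatenates the results.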

The BERT code is essentially this same encoder stack.
The left half of the Transformer is made of 6 such encoder units; now stack the EncoderLayers: `embedding-->pos_encoding-->dropout-->EncoderLayers`

<details>
<summary>Encoder code</summary>

```python
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding,
                                                self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        # Add the token embeddings and the positional encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)
```
</details>
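
The Encoder above assumes a positional_encoding helper. A common sinusoidal implementation, shown here as a self-contained sketch rather than the exact helper used with this code, is:

```python
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    """Standard sinusoidal positional encoding, shape (1, position, d_model)."""
    pos = np.arange(position)[:, np.newaxis]   # (position, 1)
    i = np.arange(d_model)[np.newaxis, :]      # (1, d_model)
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even indices: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd indices: cos
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)
```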

The right half of the Transformer is made of 6 such decoder units; now stack the DecoderLayers. Unlike the Encoder, the Decoder also returns a dictionary holding each layer's attention_weights: `embedding-->pos_encoding-->dropout-->DecoderLayers`
<details>
<summary>Decoder code</summary>

```python
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        # Store each layer's attention weights in a dictionary.
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights
```
</details>

Assemble the Transformer class: `Encoder-->Decoder-->final_layer`. The final_layer is a fully connected layer whose output size matches the target vocab size.
<details>
<summary>Transformer code</summary>

```python
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask,
             look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights
```
</details>
