Merged
114 commits
c609215
add att backend impl
Jan 5, 2026
395a593
fix alibi att backend
Jan 5, 2026
1d794aa
fix llama triton att backend
Jan 5, 2026
5120d01
fix att
Jan 5, 2026
7f05167
fix att
Jan 5, 2026
7011330
fix att
Jan 5, 2026
08f4336
add att_control params
hiworldwzj Jan 5, 2026
d6a4405
fix int8kv
hiworldwzj Jan 5, 2026
86e1f59
fix
Jan 6, 2026
7e45e69
add new int8kv dequant triton kernel
Jan 6, 2026
f44623e
fix int8kv prefill attention kernel
Jan 6, 2026
fa0f512
fix unittest
Jan 6, 2026
650db11
fix int8kv att backend.
hiworldwzj Jan 6, 2026
cc55263
fix
hiworldwzj Jan 6, 2026
2ab9a1e
fix diverse att.
hiworldwzj Jan 6, 2026
70cbed2
fix
hiworldwzj Jan 6, 2026
b0abacf
add int4kv.
hiworldwzj Jan 6, 2026
41ecffd
add int4 kernel
hiworldwzj Jan 6, 2026
ccaaa6a
fix all
hiworldwzj Jan 6, 2026
9961261
fix unit test
hiworldwzj Jan 6, 2026
06e2369
fix unit test
hiworldwzj Jan 6, 2026
ef80334
fix
hiworldwzj Jan 6, 2026
bddb2cf
fix all
Jan 7, 2026
e9f4462
fix all
Jan 7, 2026
61e4f24
add int4kv backend
Jan 7, 2026
deba727
fix
Jan 7, 2026
8ef04b4
add fa3
Jan 7, 2026
eb5b94a
add fa3
Jan 7, 2026
abdc928
fix
Jan 7, 2026
a2be571
fix
Jan 7, 2026
ae5bb45
fix
Jan 7, 2026
c5b0c0a
fix
Jan 7, 2026
bd825dc
fix
Jan 7, 2026
b7ce3f3
fix memmanager
Jan 7, 2026
15afeb5
fix memmanager
Jan 7, 2026
c323ead
fix all
Jan 7, 2026
8ca694a
add fp8 flashattention
Jan 7, 2026
f1c5e4a
fix llama
Jan 7, 2026
5203fd3
fix
Jan 7, 2026
002c1f7
fix
Jan 7, 2026
6f3a71c
fix
Jan 7, 2026
415cecb
fix
Jan 7, 2026
4bcf148
fix
Jan 7, 2026
dd8ee1d
fix all
Jan 7, 2026
b0a59bb
fix
Jan 7, 2026
120f28d
fix all
Jan 7, 2026
2371f7e
fix all
Jan 7, 2026
a4a7614
fix flashinfer
Jan 7, 2026
a330e9b
fix
Jan 9, 2026
724a600
remove chatglm2
hiworldwzj Jan 7, 2026
412d6e0
fix
hiworldwzj Jan 7, 2026
ae87183
fix
hiworldwzj Jan 7, 2026
9f504c6
fix
hiworldwzj Jan 7, 2026
031ed6f
fix
hiworldwzj Jan 7, 2026
3b3ad51
fix
hiworldwzj Jan 7, 2026
292d961
add triton mla decode.
hiworldwzj Jan 7, 2026
e383e29
remove
hiworldwzj Jan 7, 2026
ec503f7
fix deepseek
hiworldwzj Jan 7, 2026
af8cd0c
fix
hiworldwzj Jan 7, 2026
198cc9b
add triton mla prefill
Jan 8, 2026
aa012be
add flashinfer mla decode
Jan 8, 2026
8ee933c
fix
Jan 8, 2026
6e7af99
fix
Jan 8, 2026
24588d2
fix
Jan 8, 2026
3fc8084
fix all
Jan 8, 2026
e78b346
fix
Jan 8, 2026
f6d8508
fix
Jan 8, 2026
4aa45e1
fix
Jan 8, 2026
7b24ace
fix
Jan 8, 2026
e129535
fix
Jan 8, 2026
f4fc09e
fix
Jan 8, 2026
51b9b93
fix
Jan 8, 2026
f76dfb1
fix
Jan 8, 2026
e2018d6
fix
Jan 8, 2026
df348ee
fix
Jan 8, 2026
6a637e5
fix
Jan 8, 2026
e0f14b6
fix
Jan 8, 2026
53bad88
fix
Jan 8, 2026
6c3428b
fix
Jan 8, 2026
9e0d4cb
fix
Jan 8, 2026
ff10912
fix
Jan 8, 2026
35e8a8e
fix all
Jan 8, 2026
69a100b
remove mode
Jan 8, 2026
aaa3531
fix
Jan 8, 2026
d7faae0
fix
Jan 8, 2026
9b64be6
fix
Jan 8, 2026
93f87cf
fix mode
Jan 8, 2026
0852f64
fix mode.
Jan 8, 2026
8f59c77
remove max_len_in_batch
Jan 8, 2026
d193c60
remove max_len_in_batch
Jan 8, 2026
0c6818d
fix cuda graph.
Jan 8, 2026
31fea47
fix cuda graph
Jan 8, 2026
4edf70e
fix
hiworldwzj Jan 8, 2026
9d2cb3a
fix
hiworldwzj Jan 8, 2026
a6a8540
fix
hiworldwzj Jan 8, 2026
7670c4d
fix
hiworldwzj Jan 8, 2026
1d4884d
fix
Jan 9, 2026
289a369
fix
Jan 9, 2026
6aee5fe
fix.
Jan 9, 2026
2d9705f
fix
Jan 9, 2026
4a4961f
fix
Jan 9, 2026
d9bcf6c
fix
Jan 9, 2026
5db566a
fix
Jan 9, 2026
c89bcd1
fix
Jan 9, 2026
730fd50
fix
Jan 9, 2026
b503525
fix
Jan 9, 2026
a1b85a7
fix
Jan 9, 2026
fab3365
fix
Jan 9, 2026
3c400b8
fix
Jan 9, 2026
2528c4c
fix
Jan 9, 2026
b50d74c
fix
Jan 9, 2026
c17eed1
fix
Jan 9, 2026
74b8f5b
fix
Jan 9, 2026
f4b982c
fix
Jan 9, 2026
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -10,4 +10,4 @@ repos:
rev: 6.1.0
hooks:
- id: flake8
args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231']
args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231, F541']
28 changes: 2 additions & 26 deletions docs/CN/source/tutorial/api_server_args_zh.rst
@@ -183,22 +183,6 @@ PD disaggregation mode parameters
When set to True, --nccl_host must equal config_server_host and --nccl_port must be unique for the config_server;
do not use the same nccl_port for different inference nodes, as this is a serious error

Attention type selection parameters
------------------------------------

.. option:: --mode

Model inference mode; multiple values can be specified:

* ``triton_int8kv``: Use int8 to store the kv cache, which can increase token capacity; uses a triton kernel
* ``ppl_int8kv``: Use int8 to store the kv cache; uses the fast ppl kernel
* ``ppl_fp16``: Use the fast ppl fp16 decode attention kernel
* ``triton_flashdecoding``: Flashdecoding mode for long contexts; currently supports llama, llama2, and qwen
* ``triton_gqa_attention``: Fast kernel for models that use GQA
* ``triton_gqa_flashdecoding``: Fast flashdecoding kernel for models that use GQA
* ``triton_fp8kv``: Use float8 to store the kv cache; currently only used for deepseek2

Read the source code to confirm the specific modes supported by each model

Scheduling parameters
---------------------
@@ -327,17 +311,9 @@ Attention type selection parameters

The inference backend will use microbatch overlap mode for decoding

.. option:: --enable_flashinfer_prefill

The inference backend will use flashinfer's attention kernel for prefill

.. option:: --enable_flashinfer_decode

The inference backend will use flashinfer's attention kernel for decoding

.. option:: --enable_fa3
.. option:: --llm_kv_type

The inference backend will use the fa3 attention kernel for prefill and decoding
The data type used by the inference backend to store the kv cache; allowed values are "None", "int8kv", "int4kv", "fp8kv"
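
For example, a minimal launch sketch that stores the kv cache as int8 (the model path is a placeholder and the surrounding flags mirror the deployment tutorials):

.. code-block:: bash

    # Sketch: store the kv cache as int8 to fit more tokens in GPU memory.
    # /path/to/model is a placeholder; adjust --tp to the number of GPUs available.
    python -m lightllm.server.api_server \
        --port 8088 \
        --model_dir /path/to/model \
        --tp 8 \
        --llm_kv_type int8kv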

.. option:: --disable_cudagraph

36 changes: 24 additions & 12 deletions docs/CN/source/tutorial/deepseek_deployment.rst
@@ -33,12 +33,14 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter description:**
- `LOADWORKER=18`: Model loading thread count, improves loading speed
- `--tp 8`: Tensor parallelism degree, using 8 GPUs
- `--enable_fa3`: Enable Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: Use Flash Attention 3.0 for prefill
- `--llm_decode_att_backend fa3`: Use Flash Attention 3.0 for decode
- `--port 8088`: Service port

1.2 Single-node DP + EP mode (Data Parallel + Expert Parallel)
@@ -55,13 +57,15 @@
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter description:**
- `MOE_MODE=EP`: Set expert parallelism mode
- `--tp 8`: Tensor parallelism degree
- `--dp 8`: Data parallelism degree, usually set to the same value as tp
- `--enable_fa3`: Enable Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: Use Flash Attention 3.0 for prefill
- `--llm_decode_att_backend fa3`: Use Flash Attention 3.0 for decode

**Optional optimization parameters:**
- `--enable_prefill_microbatch_overlap`: Enable prefill microbatch overlap
@@ -85,7 +89,8 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -101,7 +106,8 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -129,7 +135,8 @@ LightLLM supports the following deployment modes:
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -146,7 +153,8 @@ LightLLM supports the following deployment modes:
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -195,7 +203,8 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
--host $host \
--port 8019 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
@@ -219,7 +228,8 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
--host $host \
--port 8121 \
--nccl_port 12322 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
@@ -287,7 +297,8 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
--tp 8 \
--dp 8 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--config_server_host $config_server_host \
--config_server_port 60088
@@ -306,7 +317,8 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
--nccl_port 12322 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--config_server_host $config_server_host \
--config_server_port 60088
# To enable microbatch overlap, uncomment the following lines
8 changes: 5 additions & 3 deletions docs/CN/source/tutorial/multi_level_cache_deployment.rst
@@ -66,7 +66,8 @@ LightLLM's multi-level cache system uses a layered design:
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
@@ -81,7 +82,7 @@ LightLLM's multi-level cache system uses a layered design:
- ``--model_dir``: Model file path, supports a local path or a HuggingFace model name
- ``--tp 8``: Tensor parallelism degree, using 8 GPUs for model inference
- ``--graph_max_batch_size 500``: Maximum CUDA Graph batch size; affects throughput and GPU memory usage
- ``--enable_fa3``: Enable Flash Attention 3.0 to speed up attention computation; you can also switch to the flashinfer backend for better performance
- ``--llm_prefill_att_backend fa3``: Use Flash Attention 3.0 to speed up attention computation; you can also switch to the flashinfer backend for better performance
- ``--mem_fraction 0.88``: Fraction of GPU memory to use; 0.88 or below is recommended

CPU cache parameters
@@ -130,7 +131,8 @@ CPU cache parameters
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
3 changes: 2 additions & 1 deletion docs/CN/source/tutorial/reasoning_parser.rst
@@ -32,7 +32,8 @@ DeepSeek-R1
--model_dir /path/to/DeepSeek-R1 \
--reasoning_parser deepseek-r1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

DeepSeek-V3
~~~~~~~~~~~
29 changes: 0 additions & 29 deletions docs/EN/source/tutorial/api_server_args_zh.rst
@@ -183,23 +183,6 @@ Different Parallel Mode Setting Parameters
When set to True, --nccl_host must equal config_server_host and --nccl_port must be unique for the config_server;
do not use the same nccl_port for different inference nodes, as this is a serious error

Attention Type Selection Parameters
------------------------------------

.. option:: --mode

Model inference mode; multiple values can be specified:

* ``triton_int8kv``: Use int8 to store the kv cache, which can increase token capacity; uses a triton kernel
* ``ppl_int8kv``: Use int8 to store the kv cache; uses the fast ppl kernel
* ``ppl_fp16``: Use the fast ppl fp16 decode attention kernel
* ``triton_flashdecoding``: Flashdecoding mode for long contexts; currently supports llama, llama2, and qwen
* ``triton_gqa_attention``: Fast kernel for models that use GQA
* ``triton_gqa_flashdecoding``: Fast flashdecoding kernel for models that use GQA
* ``triton_fp8kv``: Use float8 to store the kv cache; currently only used for deepseek2

Read the source code to confirm the specific modes supported by each model

Scheduling Parameters
---------------------

@@ -325,18 +308,6 @@ Performance Optimization Parameters
.. option:: --enable_decode_microbatch_overlap

The inference backend will use microbatch overlap mode for decoding

.. option:: --enable_flashinfer_prefill

The inference backend will use flashinfer's attention kernel for prefill

.. option:: --enable_flashinfer_decode

The inference backend will use flashinfer's attention kernel for decoding

.. option:: --enable_fa3

The inference backend will use fa3 attention kernel for prefill and decoding
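
For reference, the switches deleted above are superseded in the updated tutorials by explicit backend selectors; a minimal sketch is shown below (the model path is a placeholder, and the fa3 values simply mirror the deployment examples):

.. code-block:: bash

    # Sketch: choose the attention backend per phase instead of the removed
    # --enable_fa3 / --enable_flashinfer_prefill / --enable_flashinfer_decode flags.
    # /path/DeepSeek-R1 is a placeholder path.
    python -m lightllm.server.api_server \
        --port 8088 \
        --model_dir /path/DeepSeek-R1 \
        --tp 8 \
        --llm_prefill_att_backend fa3 \
        --llm_decode_att_backend fa3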

.. option:: --disable_cudagraph

33 changes: 21 additions & 12 deletions docs/EN/source/tutorial/deepseek_deployment.rst
@@ -33,12 +33,13 @@ Suitable for deploying DeepSeek-R1 model on a single H200 node.
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter Description:**
- `LOADWORKER=18`: Model loading thread count, improves loading speed
- `--tp 8`: Tensor parallelism, using 8 GPUs
- `--enable_fa3`: Enable Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: Use Flash Attention 3.0 for prefill
- `--port 8088`: Service port

1.2 Single node DP + EP Mode (Data Parallel + Expert Parallel)
@@ -55,13 +56,13 @@ Suitable for expert parallelism deployment of MoE models like DeepSeek-V2/V3.
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter Description:**
- `MOE_MODE=EP`: Set expert parallelism mode
- `--tp 8`: Tensor parallelism
- `--dp 8`: Data parallelism, usually set to the same value as tp
- `--enable_fa3`: Enable Flash Attention 3.0

**Optional Optimization Parameters:**
- `--enable_prefill_microbatch_overlap`: Enable prefill microbatch overlap
@@ -85,7 +86,8 @@ Suitable for deployment across multiple H200/H100 nodes.
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -101,7 +103,8 @@ Suitable for deployment across multiple H200/H100 nodes.
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -129,7 +132,8 @@ Suitable for deploying MoE models across multiple nodes.
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -146,7 +150,8 @@ Suitable for deploying MoE models across multiple nodes.
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -195,7 +200,8 @@ PD (Prefill-Decode) disaggregation mode separates prefill and decode stages for
--host $host \
--port 8019 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip

@@ -216,7 +222,8 @@ PD (Prefill-Decode) disaggregation mode separates prefill and decode stages for
--host $host \
--port 8121 \
--nccl_port 12322 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
@@ -284,7 +291,8 @@ Supports multiple PD Master nodes, providing better load balancing and high avai
--tp 8 \
--dp 8 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--config_server_host $config_server_host \
--config_server_port 60088
@@ -303,7 +311,8 @@ Supports multiple PD Master nodes, providing better load balancing and high avai
--nccl_port 12322 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--config_server_host $config_server_host \
--config_server_port 60088
# if you want to enable microbatch overlap, you can uncomment the following lines
8 changes: 5 additions & 3 deletions docs/EN/source/tutorial/multi_level_cache_deployment.rst
@@ -66,7 +66,8 @@ Suitable for most scenarios, significantly increasing cache capacity while maint
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
@@ -81,7 +82,7 @@ Basic Parameters
- ``--model_dir``: Model file path, supports local path or HuggingFace model name
- ``--tp 8``: Tensor parallelism degree, using 8 GPUs for model inference
- ``--graph_max_batch_size 500``: CUDA Graph maximum batch size, affects throughput and memory usage
- ``--enable_fa3``: Enable Flash Attention 3.0 to improve attention computation speed. You can also switch to flashinfer backend for better performance
- ``--llm_prefill_att_backend fa3``: Use Flash Attention 3.0 to improve attention computation speed. You can also switch to the flashinfer backend for better performance (see the sketch after this list)
- ``--mem_fraction 0.88``: GPU memory usage ratio, recommended to set to 0.88 or below
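
A minimal sketch of switching to the flashinfer backend, assuming ``flashinfer`` is an accepted value for both backend options (the rest of the command mirrors the example above):

.. code-block:: bash

    # Sketch: swap fa3 for the flashinfer attention backend.
    # Assumes "flashinfer" is a valid value for both options; verify the accepted
    # values in the server's argument documentation before relying on it.
    python -m lightllm.server.api_server --port 8088 \
        --model_dir /path/to/Qwen3-235B-A22B \
        --tp 8 \
        --graph_max_batch_size 500 \
        --llm_prefill_att_backend flashinfer \
        --llm_decode_att_backend flashinfer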

CPU Cache Parameters
@@ -130,7 +131,8 @@ Suitable for ultra-long text or extremely high-concurrency scenarios, providing
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
3 changes: 2 additions & 1 deletion docs/EN/source/tutorial/reasoning_parser.rst
@@ -32,7 +32,8 @@ DeepSeek-R1
--model_dir /path/to/DeepSeek-R1 \
--reasoning_parser deepseek-r1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

DeepSeek-V3
~~~~~~~~~~~