ColossalAI vs. vLLM Benchmark #5513
Hey @zzb610, (I'll use English to reply so that anyone can participate in the discussion)
I benchmarked llama-7b inference performance on an A100 40G using the colossal-infer branch of ColossalAI (the main branch does not run for me):
https://github.com/hpcaitech/ColossalAI/tree/feature/colossal-infer/colossalai/inference
The ColossalAI and vLLM versions were:
colossalai 21e1e36
vllm 0.3.0
The results I obtained are as follows.
Could anyone explain what causes ColossalAI's inference performance to surpass vLLM when bs > 32? Is it flash_decoding_attention? KVCacheManager? RequestHandler?
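For context on what the vLLM side of such a per-batch-size throughput measurement can look like, here is a minimal sketch assuming the vLLM 0.3.x offline `LLM` / `SamplingParams` API; the model path, prompt, batch sizes, and output length are placeholders chosen for illustration, not the settings used for the results above. A matching loop (same prompt length and same fixed output length) would be needed on the colossal-infer side to keep the comparison apples-to-apples.

```python
# Rough sketch (not the script used for the numbers above) of measuring
# vLLM decoding throughput at several batch sizes with the 0.3.x offline API.
# Model path, prompt, batch sizes, and output length are illustrative placeholders.
import time

from vllm import LLM, SamplingParams

MODEL_PATH = "huggyllama/llama-7b"   # placeholder llama-7b checkpoint
BATCH_SIZES = [8, 16, 32, 64, 128]
OUTPUT_LEN = 128

llm = LLM(model=MODEL_PATH, dtype="float16")
# Fixed-length generation so every request decodes the same number of tokens.
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

prompt = "Hello, my name is"
for bs in BATCH_SIZES:
    prompts = [prompt] * bs
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"bs={bs:3d}  {gen_tokens / elapsed:8.1f} generated tokens/s")
```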