# Available Models

More profiling metrics coming soon!

## Text Generation Models

### [Cohere for AI: Command R](https://huggingface.co/collections/CohereForAI/c4ai-command-r-plus-660ec4c34f7a69c50ce7f7b9)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`c4ai-command-r-plus`](https://huggingface.co/CohereForAI/c4ai-command-r-plus) | 8x a40 (2 nodes, 4 a40/node) | 412 tokens/s | 541 tokens/s |
| [`c4ai-command-r-plus-08-2024`](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024) | 8x a40 (2 nodes, 4 a40/node) | - tokens/s | - tokens/s |
| [`c4ai-command-r-08-2024`](https://huggingface.co/CohereForAI/c4ai-command-r-08-2024) | 8x a40 (2 nodes, 4 a40/node) | - tokens/s | - tokens/s |
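
The "Suggested resource allocation" column maps onto engine parallelism settings: GPUs within a node are typically sharded with tensor parallelism, and multi-node allocations add pipeline parallelism across nodes. Below is a minimal sketch of that mapping, assuming a vLLM-style engine (model name taken from the table above; exact flags depend on the deployment):

```python
from vllm import LLM, SamplingParams

# Hedged sketch: "8x a40 (2 nodes, 4 a40/node)" interpreted as tensor
# parallelism within each node and pipeline parallelism across nodes.
# Multi-node pipeline parallelism typically also requires a Ray cluster.
llm = LLM(
    model="CohereForAI/c4ai-command-r-plus",
    tensor_parallel_size=4,    # 4x A40 per node
    pipeline_parallel_size=2,  # 2 nodes
)

outputs = llm.generate(
    ["Write one sentence about tensor parallelism."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```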

### [Code Llama](https://huggingface.co/collections/meta-llama/code-llama-family-661da32d0a9d678b6f55b933)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`CodeLlama-70b-hf`](https://huggingface.co/meta-llama/CodeLlama-70b-hf) | 4x a40 | - tokens/s | - tokens/s |
| [`CodeLlama-70b-Instruct-hf`](https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf) | 4x a40 | - tokens/s | - tokens/s |

### [Databricks: DBRX](https://huggingface.co/collections/databricks/dbrx-6601c0852a0cdd3c59f71962)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`dbrx-instruct`](https://huggingface.co/databricks/dbrx-instruct) | 8x a40 (2 nodes, 4 a40/node) | 107 tokens/s | 904 tokens/s |

### [Google: Gemma 2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`gemma-2-27b`](https://huggingface.co/google/gemma-2-27b) | 2x a40 | - tokens/s | - tokens/s |
| [`gemma-2-27b-it`](https://huggingface.co/google/gemma-2-27b-it) | 2x a40 | - tokens/s | - tokens/s |

### [Meta: Llama 2](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b)

| Variant | Suggested resource allocation |
| :----------:| :----------:|
| [`Llama-2-70b-hf`](https://huggingface.co/meta-llama/Llama-2-70b-hf) | 4x a40 |
| [`Llama-2-70b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 4x a40 |

### [Meta: Llama 3](https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Meta-Llama-3-70B`](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | 4x a40 | 81 tokens/s | 618 tokens/s |
| [`Meta-Llama-3-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 4x a40 | 301 tokens/s | 660 tokens/s |

### [Meta: Llama 3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Meta-Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 4x a40 | - tokens/s | - tokens/s |
| [`Meta-Llama-3.1-405B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) | 32x a40 (8 nodes, 4 a40/node) | - tokens/s | - tokens/s |

### [Meta: Llama 3.2](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Llama-3.2-1B`](https://huggingface.co/meta-llama/Llama-3.2-1B) | 1x a40 | - tokens/s | - tokens/s |
| [`Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Llama-3.2-3B`](https://huggingface.co/meta-llama/Llama-3.2-3B) | 1x a40 | - tokens/s | - tokens/s |
| [`Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 1x a40 | - tokens/s | - tokens/s |

### [Mistral AI: Mistral](https://huggingface.co/mistralai)

| Variant (Mistral) | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 1x a40 | - tokens/s | - tokens/s |
| [`Mistral-7B-Instruct-v0.1`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | 1x a40 | - tokens/s | - tokens/s |
| [`Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 1x a40 | - tokens/s | - tokens/s |
| [`Mistral-7B-v0.3`](https://huggingface.co/mistralai/Mistral-7B-v0.3) | 1x a40 | - tokens/s | - tokens/s |
| [`Mistral-7B-Instruct-v0.3`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 1x a40 | - tokens/s | - tokens/s |
| [`Mistral-Large-Instruct-2407`](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) | 8x a40 (2 nodes, 4 a40/node) | - tokens/s | - tokens/s |
| [`Mistral-Large-Instruct-2411`](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411) | 8x a40 (2 nodes, 4 a40/node) | - tokens/s | - tokens/s |

### [Mistral AI: Mixtral](https://huggingface.co/mistralai)

| Variant (Mixtral) | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Mixtral-8x7B-Instruct-v0.1`](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 4x a40 | 222 tokens/s | 1543 tokens/s |
| [`Mixtral-8x22B-v0.1`](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | 8x a40 (2 nodes, 4 a40/node) | 145 tokens/s | 827 tokens/s |
| [`Mixtral-8x22B-Instruct-v0.1`](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) | 8x a40 (2 nodes, 4 a40/node) | 95 tokens/s | 803 tokens/s |

### [Microsoft: Phi 3](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Phi-3-medium-128k-instruct`](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct) | 2x a40 | - tokens/s | - tokens/s |

### [Aaditya Ura: Llama3-OpenBioLLM](https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Llama3-OpenBioLLM-70B`](https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B) | 4x a40 | - tokens/s | - tokens/s |

### [Nvidia: Llama-3.1-Nemotron](https://huggingface.co/collections/nvidia/llama-31-nemotron-70b-670e93cd366feea16abc13d8)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Llama-3.1-Nemotron-70B-Instruct-HF`](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) | 4x a40 | - tokens/s | - tokens/s |

### [Qwen: Qwen2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Qwen2.5-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-3B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-14B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 2x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-72B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | 4x a40 | - tokens/s | - tokens/s |

### [Qwen: Qwen2.5-Math](https://huggingface.co/collections/Qwen/qwen25-math-66eaa240a1b7d5ee65f1da3e)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Qwen2.5-Math-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-Math-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) | 1x a40 | - tokens/s | - tokens/s |
| [`Qwen2.5-Math-72B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) | 4x a40 | - tokens/s | - tokens/s |

### [Qwen: Qwen2.5-Coder](https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Qwen2.5-Coder-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) | 1x a40 | - tokens/s | - tokens/s |

### [Qwen: QwQ](https://huggingface.co/collections/Qwen/qwq-674762b79b75eac01735070a)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`QwQ-32B-Preview`](https://huggingface.co/Qwen/QwQ-32B-Preview) | 2x a40 | - tokens/s | - tokens/s |
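
Any of the text generation models above can be queried as a chat model once it is running. A minimal sketch, assuming the server exposes an OpenAI-compatible endpoint (the base URL below is a placeholder for whatever host and port your launched server reports):

```python
from openai import OpenAI

# Placeholder endpoint; a self-hosted server typically needs no real API key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Briefly, what is speculative decoding?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```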

## Vision Language Models

### [LLaVa-1.5](https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 1x a40 | - tokens/s | - tokens/s |
| [`llava-1.5-13b-hf`](https://huggingface.co/llava-hf/llava-1.5-13b-hf) | 1x a40 | - tokens/s | - tokens/s |

### [LLaVa-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) | 1x a40 | - tokens/s | - tokens/s |
| [`llava-v1.6-34b-hf`](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) | 2x a40 | - tokens/s | - tokens/s |

### [Microsoft: Phi 3](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Phi-3-vision-128k-instruct`](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) | 2x a40 | - tokens/s | - tokens/s |

### [Meta: Llama 3.2](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Llama-3.2-11B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | 2x a40 | - tokens/s | - tokens/s |
| [`Llama-3.2-11B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) | 2x a40 | - tokens/s | - tokens/s |
| [`Llama-3.2-90B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision) | 8x a40 (2 nodes, 4 a40/node) | - tokens/s | - tokens/s |
| [`Llama-3.2-90B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | 8x a40 (2 nodes, 4 a40/node) | - tokens/s | - tokens/s |

**NOTE**: `MllamaForConditionalGeneration` does not currently support pipeline parallelism. To save memory, the maximum number of concurrent requests is reduced and enforce-eager mode is enabled.
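
In engine terms, the note above roughly corresponds to launch settings like the following (a sketch assuming vLLM; the request cap of 32 is an illustrative value, not the deployment's actual setting):

```python
from vllm import LLM

# Sketch of the Llama 3.2 Vision constraints described above: no pipeline
# parallelism, a reduced concurrent-request cap, and eager-mode execution.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    tensor_parallel_size=2,  # "2x a40" from the table
    max_num_seqs=32,         # illustrative reduced request cap
    enforce_eager=True,      # skip CUDA graph capture to save memory
)
```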

### [Mistral: Pixtral](https://huggingface.co/mistralai)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Pixtral-12B-2409`](https://huggingface.co/mistralai/Pixtral-12B-2409) | 1x a40 | - tokens/s | - tokens/s |
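
Vision language models accept images alongside text in the message content. A minimal sketch, again assuming an OpenAI-compatible endpoint (the base URL and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-1.5-13b-hf",
    messages=[{
        "role": "user",
        # Multimodal content: one text part and one image part.
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```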

## Text Embedding Models

### [Liang Wang: e5](https://huggingface.co/intfloat)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`e5-mistral-7b-instruct`](https://huggingface.co/intfloat/e5-mistral-7b-instruct) | 1x a40 | - tokens/s | - tokens/s |
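
Embedding models return vectors rather than text, so they are queried through the embeddings route instead of chat completions. A minimal sketch under the same OpenAI-compatible-endpoint assumption:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="e5-mistral-7b-instruct",
    input=["first passage to embed", "second passage to embed"],
)
# One embedding vector per input string.
print(len(result.data), len(result.data[0].embedding))
```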

## Reward Modeling Models

### [Qwen: Qwen2.5-Math](https://huggingface.co/collections/Qwen/qwen25-math-66eaa240a1b7d5ee65f1da3e)

| Variant | Suggested resource allocation | Avg prompt throughput | Avg generation throughput |
| :----------:| :----------:| :----------:| :----------:|
| [`Qwen2.5-Math-RM-72B`](https://huggingface.co/Qwen/Qwen2.5-Math-RM-72B) | 4x a40 | - tokens/s | - tokens/s |