Merge pull request #112 from VectorInstitute/bugfix/multinode
Misc small features and bug fixes:
- Fixed the multi-node launch GPU placement group issue: the `--exclusive` option is needed in the Slurm script, and the compilation config needs to stay at 0
- Set environment variables in the generated Slurm script instead of in the helper to ensure reusability
- Replaced `python3.10 -m vllm.entrypoints.openai.api_server` with `vllm serve` to support custom chat templates
- Added additional launch options: `--exclude` for excluding certain nodes, `--node-list` for targeting a specific list of nodes, and `--bind` for binding additional directories (illustrated in the sketch below)
- Added the remaining vLLM engine argument short/long name mappings for robustness
- Added notes to the README to capture some gotchas
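A rough sketch of the new launch options, assuming the usual `vec-inf launch` entrypoint. The flag names come from this PR, but the model names, node names, and argument formats below are illustrative placeholders, not confirmed CLI syntax.

```bash
# Skip known-bad nodes when scheduling the server job (node names are placeholders)
vec-inf launch Meta-Llama-3.1-8B-Instruct --exclude gpu047,gpu048

# Or pin the job to a specific set of nodes
vec-inf launch Meta-Llama-3.1-8B-Instruct --node-list gpu051,gpu052

# Bind an additional directory into the job environment, e.g. for custom chat templates
vec-inf launch Mistral-7B-Instruct-v0.3 --bind /scratch/chat_templates
```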
-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](vec_inf/client/config.py) for details.
+**NOTE**
+* There are other parameters that can be added to the config but are not shown in this example; check [`ModelConfig`](vec_inf/client/config.py) for details.
+* Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments. The default parallel size for any type of parallelization is 1, so none of the sizes are set explicitly in this example.
+* For GPU partitions with non-Ampere architectures, e.g. `rtx6000` and `t4v2`, BF16 isn't supported. For models that have BF16 as the default dtype, use FP16 instead when running on a non-Ampere GPU, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.
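The FP16 fallback noted above maps onto the standard vLLM `--dtype` engine argument. With `vllm serve` as the entrypoint (as of this PR), the underlying server command would look roughly like the sketch below; the model weights path is a placeholder assumption, not a confirmed location.

```bash
# Rough shape of the generated server command on a non-Ampere partition (e.g. rtx6000);
# the weights path is a placeholder.
vllm serve /model-weights/Meta-Llama-3.1-8B-Instruct --dtype float16
```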
#### Other commands
@@ -161,7 +165,7 @@ Once the inference server is ready, you can start sending in inference requests.
-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. the Mistral family. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.
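As a hedged illustration of the second option, `vllm serve` (now used under the hood) accepts a chat template file directly; the model name and template path below are placeholders.

```bash
# Serve a Mistral-family model with a custom Jinja chat template;
# both the model identifier and the template path are illustrative placeholders.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --chat-template /path/to/mistral_chat_template.jinja
```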
## SSH tunnel from your local device
If you want to run inference from your local device, you can open an SSH tunnel to your cluster environment like the following:
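A minimal sketch, assuming the vLLM server listens on port 8080 on a compute node that is reachable through the cluster's login node; the hostnames, username, and port below are placeholders.

```bash
# Forward local port 8080 to the compute node running the server,
# going through the cluster's login node (all names and ports are placeholders).
ssh -N -L 8080:gpu-node-name:8080 username@cluster.login.node

# The server is then reachable locally, e.g.:
# curl http://localhost:8080/v1/models
```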