feat: add vLLM scale-out deployment with nginx load balancing #109
maryamtahhan wants to merge 2 commits into
Fix incorrect default container image for vllm_bench that caused "Exec format error" on x86_64 servers.

Problem:
- Default was ARM image: quay.io/mtahhan/vllm:arm-base-cpu
- Embedding tests failed on x86_64 EC2 with "Exec format error"
- Error: "image platform (linux/arm64/v8) does not match (linux/amd64)"

Root Cause:
- vllm_bench config (line 98) used ARM-only image as default
- Introduced in commit 5a494c8 during inventory restructure
- Controller architecture (Mac ARM) is irrelevant - containers run on remote targets (EC2 x86_64)
- LLM tests unaffected - they use guidellm with multi-arch images

Fix:
- Change default to match vLLM server image (x86_64 compatible)
- Old: quay.io/mtahhan/vllm:arm-base-cpu
- New: docker.io/vllm/vllm-openai-cpu:v0.18.0
- Same image as DUT ensures version consistency

Impact:
- Embedding tests now work on x86_64 servers (AWS EC2)
- Users can still override with VLLM_BENCH_CONTAINER_IMAGE env var
- No impact on LLM tests (different benchmark tool)

Tested:
- EC2 x86_64 instances (DUT + Load Generator)
- Baseline and latency tests execute successfully

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
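The override behaviour described under "Impact" can be sketched as follows. This is only an illustration in Python, not the repository's Ansible code: the function name `resolve_bench_image` is hypothetical, and the only details taken from the commit message are the environment variable name and the new default image.

```python
import os

# New default from the commit message (x86_64-compatible vLLM image).
DEFAULT_BENCH_IMAGE = "docker.io/vllm/vllm-openai-cpu:v0.18.0"


def resolve_bench_image(env=None):
    """Return the benchmark image, preferring the user's override.

    Hypothetical helper: an empty or unset VLLM_BENCH_CONTAINER_IMAGE
    falls back to the default image.
    """
    env = os.environ if env is None else env
    return env.get("VLLM_BENCH_CONTAINER_IMAGE") or DEFAULT_BENCH_IMAGE
```

Because the default is only a fallback, a user targeting ARM hosts could presumably still point the variable at an ARM image without changing the configuration.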
@maryamtahhan make sure to also update the MLflow integration and the multi-instance results conversion (to CSV) to reflect this new feature.
Adds turnkey Ansible automation for deploying multiple vLLM instances on a single DUT with configurable nginx load balancing. Enables performance testing at scale with flexible configuration of instance count (1-10), cores per instance (8/16/32), SMT, prefix caching, and load balancing policies (round-robin/least-conn/ip-hash). Includes comprehensive documentation, example inventory, and integration with existing GuideLLM benchmark playbooks.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
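For readers unfamiliar with how the three policies map onto nginx, here is a minimal Python sketch that renders an `upstream` block for a given instance count and policy. It is not the template shipped in this PR: the upstream name `vllm_backends`, the loopback address, and the base port 8001 are assumptions made for illustration. The `least_conn` and `ip_hash` directives are standard nginx; round-robin is nginx's default and needs no directive.

```python
# Map the policy names used in the PR description to nginx directives.
POLICY_DIRECTIVES = {
    "round-robin": None,          # nginx default, no directive required
    "least-conn": "least_conn;",
    "ip-hash": "ip_hash;",
}


def render_upstream(instances, policy="round-robin",
                    base_port=8001, name="vllm_backends"):
    """Render an nginx upstream block (hypothetical names and ports)."""
    if not 1 <= instances <= 10:
        raise ValueError("instance count must be between 1 and 10")
    if policy not in POLICY_DIRECTIVES:
        raise ValueError(f"unknown load balancing policy: {policy}")

    lines = [f"upstream {name} {{"]
    directive = POLICY_DIRECTIVES[policy]
    if directive:
        lines.append(f"    {directive}")
    # One backend per vLLM instance, on consecutive ports.
    for i in range(instances):
        lines.append(f"    server 127.0.0.1:{base_port + i};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_upstream(3, "least-conn"))
```

Note that `ip_hash` pins each client address to one backend, so a single load generator would likely send all of its traffic to one vLLM instance under that policy.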
44c0408 to 4d50b81