[CORE] Implement Worker Controller for Dynamic Model Loading and Inference, Reducing Cold Start Latency of Workers #15
[Core] Introduce Worker Controller for Dynamic Model Orchestration
Description
This PR introduces a Worker Controller architecture that decouples worker lifecycle management from the execution loop. The controller maintains a pool of pre-warmed "dummy" workers that can be dynamically assigned to new engines, so models can be loaded and unloaded efficiently without process restarts.
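For intuition, here is a minimal, self-contained sketch of the pre-warmed-pool idea. The class and method names below (`DummyWorker`, `WorkerController.acquire`/`release`) are illustrative assumptions, not the actual API added by this PR:

```python
# Illustrative sketch only: the classes and methods below are hypothetical,
# not the API introduced by this PR.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DummyWorker:
    """Pre-warmed worker placeholder: the process and CUDA context are already
    initialized, but no model weights are loaded yet."""
    worker_id: int
    model: Optional[str] = None  # None means the worker is idle

    def load_model(self, model_name: str) -> None:
        # In the real architecture this would load weights into the existing
        # process, avoiding a full cold start.
        self.model = model_name

    def unload_model(self) -> None:
        # Free the weights but keep the process alive for reuse.
        self.model = None


@dataclass
class WorkerController:
    """Keeps a pool of pre-warmed workers and assigns them to engines on demand."""
    pool: List[DummyWorker] = field(default_factory=list)

    def acquire(self, model_name: str) -> DummyWorker:
        # Reuse an idle pre-warmed worker instead of spawning a new process.
        worker = next(w for w in self.pool if w.model is None)
        worker.load_model(model_name)
        return worker

    def release(self, worker: DummyWorker) -> None:
        worker.unload_model()  # the worker returns to the warm pool


controller = WorkerController(pool=[DummyWorker(i) for i in range(2)])
worker = controller.acquire("facebook/opt-125m")  # no process restart needed
controller.release(worker)                        # stays warm for the next model
```

The point of the design is that model swaps reuse an already-initialized process, so only weight loading remains on the critical path.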
Key Components
1. Worker Controller (`worker_controller.py`)
2. Proxy Executor (`proxy_executor.py`)
3. Remote Executor (`remote_executor.py`)

Worker & Engine Customization
To support this architecture, the following core components were enhanced:
- `vllm/worker_controller/worker/gpu_worker.py`: Enhanced Worker to support dynamic `load_model` and `unload_model` operations without process restart.
- `vllm/worker_controller/worker/model_runner.py`: Custom `ModelRunner` tailored for the controller architecture.
- `vllm/worker_controller/engine/core.py`: Custom `EngineCore` implementation with added `load_model` RPC hooks.
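As a rough illustration of what load/unload RPC hooks can look like, here is a hedged sketch; the class names and dispatch mechanism below are assumptions for exposition and will differ from the real `EngineCore`/`Worker` code in this PR:

```python
# Hypothetical sketch of an RPC dispatch layer for dynamic model (un)loading.
# The real EngineCore/Worker classes in this PR will differ in detail.
from typing import Any, Callable, Dict, Optional


class SwappableWorker:
    """Stands in for the enhanced GPU worker: it can swap models in place."""

    def __init__(self) -> None:
        self.model_name: Optional[str] = None

    def load_model(self, model_name: str) -> str:
        self.model_name = model_name          # weight loading would happen here
        return f"loaded {model_name}"

    def unload_model(self) -> str:
        previous, self.model_name = self.model_name, None
        return f"unloaded {previous}"         # the process keeps running


class EngineCoreStub:
    """Routes named RPCs to worker methods, including load/unload hooks."""

    def __init__(self, worker: SwappableWorker) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {
            "load_model": worker.load_model,
            "unload_model": worker.unload_model,
        }

    def handle_rpc(self, method: str, *args: Any) -> Any:
        return self._handlers[method](*args)


engine = EngineCoreStub(SwappableWorker())
print(engine.handle_rpc("load_model", "Qwen/Qwen3-0.6B"))   # -> loaded Qwen/Qwen3-0.6B
print(engine.handle_rpc("unload_model"))                    # -> unloaded Qwen/Qwen3-0.6B
```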
Documentation

- `vllm/worker_controller/README.md` containing a detailed architecture overview and Mermaid diagrams illustrating the flow.

Testing Steps
New test scripts were added to verify the lifecycle and parallel capabilities:
- Lifecycle Verification (`inference.py`): Load -> Inference -> Unload -> Load (see the sketch below).
- Parallel Inference (`test_parallel_inference.py`): Runs two models (`facebook/opt-125m` and `Qwen/Qwen3-0.6B`) simultaneously.
- Dynamic Allocation (`test_dynamic_allocation.py`): Exercises dynamic worker allocation with `facebook/opt-125m` and `Qwen/Qwen3-0.6B`, then with `facebook/opt-125m`, then with `Qwen/Qwen3-0.6B`.

FIX #xxxx
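For reference, a pseudo-test outline of the Load -> Inference -> Unload -> Load cycle described above, reusing the illustrative `WorkerController`/`DummyWorker` from the first sketch (again hypothetical, not the actual `inference.py`):

```python
# Pseudo-test outline of the lifecycle exercised by inference.py.
# `controller` is an instance of the illustrative WorkerController sketched
# earlier in this description, not one of the PR's actual test fixtures.

def run_lifecycle(controller) -> None:
    # Load: attach a model to a pre-warmed worker.
    worker = controller.acquire("facebook/opt-125m")
    assert worker.model == "facebook/opt-125m"

    # Inference would run here against the loaded model.

    # Unload: release the model; the worker process stays warm.
    controller.release(worker)
    assert worker.model is None

    # Load again: the same warm worker can host a different model,
    # which is the cold-start saving this PR targets.
    worker = controller.acquire("Qwen/Qwen3-0.6B")
    assert worker.model == "Qwen/Qwen3-0.6B"
```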