
Conversation

tangkenyi2001 (Collaborator) commented Dec 4, 2025

[Core] Introduce Worker Controller for Dynamic Model Orchestration

Description

This PR introduces a Worker Controller architecture that decouples worker lifecycle management from the execution loop. The controller maintains a pool of pre-warmed "dummy" workers that can be dynamically assigned to new engines, enabling efficient model loading and unloading without process restarts.

Key Components

1. Worker Controller (worker_controller.py)

  • Role: Orchestrates the lifecycle of engines and workers.
  • Functionality: Manages a pool of pre-warmed "dummy" workers and dynamically assigns them to new engines upon request, facilitating faster start-up times.
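
To make the pool-and-assign idea concrete, here is a minimal, hypothetical sketch of the bookkeeping involved. The DummyWorker and WorkerController names and methods below are illustrative stand-ins, not the classes actually defined in worker_controller.py.

```python
# Hypothetical sketch of a pre-warmed worker pool with dynamic assignment.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DummyWorker:
    """A pre-warmed worker slot: an id, a GPU index, and its current assignment."""
    worker_id: int
    gpu_id: int
    assigned_engine: Optional[str] = None


@dataclass
class WorkerController:
    """Keeps a pool of idle workers and hands them out to engines on demand."""
    pool: List[DummyWorker] = field(default_factory=list)

    def prewarm(self, num_workers: int) -> None:
        # The real controller would spawn worker processes up front; this
        # sketch only records the slots so the assignment logic can be shown.
        self.pool = [DummyWorker(i, i) for i in range(num_workers)]

    def assign(self, engine_id: str, num_workers: int) -> List[DummyWorker]:
        idle = [w for w in self.pool if w.assigned_engine is None]
        if len(idle) < num_workers:
            raise RuntimeError("not enough pre-warmed workers available")
        chosen = idle[:num_workers]
        for w in chosen:
            w.assigned_engine = engine_id
        return chosen

    def release(self, engine_id: str) -> None:
        # Return the engine's workers to the idle pool without killing them.
        for w in self.pool:
            if w.assigned_engine == engine_id:
                w.assigned_engine = None


controller = WorkerController()
controller.prewarm(4)
print([w.worker_id for w in controller.assign("engine-0", 2)])  # [0, 1]
controller.release("engine-0")
```

Because workers are only marked idle rather than torn down on release, a later engine can reuse them without paying the process start-up cost, which is the point of the pre-warmed pool.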

2. Proxy Executor (proxy_executor.py)

  • Role: Acts as the communication bridge.
  • Functionality: Maintains persistent RPC connections to worker processes and broadcasts commands (load, unload, inference) to the assigned workers.
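
A rough, hypothetical sketch of the fan-out pattern described above, using multiprocessing queues as a stand-in for the PR's actual RPC transport; the ProxyExecutor shape below is an assumption, not the real proxy_executor.py interface.

```python
# Hypothetical sketch: broadcast a command to all assigned workers and gather replies.
import multiprocessing as mp


def worker_loop(cmd_queue, result_queue):
    # Each worker blocks on its own command queue and reports results on a
    # shared result queue; a "stop" command ends the loop.
    while True:
        cmd, payload = cmd_queue.get()
        if cmd == "stop":
            break
        result_queue.put(f"worker handled {cmd}({payload})")


class ProxyExecutor:
    """Holds one command queue per worker and fans commands out to all of them."""

    def __init__(self, num_workers):
        self.result_queue = mp.Queue()
        self.cmd_queues = [mp.Queue() for _ in range(num_workers)]
        self.procs = [
            mp.Process(target=worker_loop, args=(q, self.result_queue))
            for q in self.cmd_queues
        ]
        for p in self.procs:
            p.start()

    def broadcast(self, cmd, payload=None):
        # Send the same command to every assigned worker, then collect replies.
        for q in self.cmd_queues:
            q.put((cmd, payload))
        return [self.result_queue.get() for _ in self.cmd_queues]

    def shutdown(self):
        for q in self.cmd_queues:
            q.put(("stop", None))
        for p in self.procs:
            p.join()


if __name__ == "__main__":
    executor = ProxyExecutor(num_workers=2)
    print(executor.broadcast("load_model", "facebook/opt-125m"))
    executor.shutdown()
```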

3. Remote Executor (remote_executor.py)

  • Role: Interface within the API Server.
  • Functionality: Runs inside the API Server process and forwards execution requests to the Proxy Executor via IPC queues.
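
For illustration, a small, hypothetical sketch of the forwarding pattern: the API-server-side object pushes a request onto one queue and waits for the reply on another. The RemoteExecutor.execute signature here is a guess at the shape of the interface, not the actual remote_executor.py API.

```python
# Hypothetical sketch of request forwarding between the API server and the proxy side.
import queue
import threading


class RemoteExecutor:
    """Lives in the API server process; relays calls over a pair of IPC queues."""

    def __init__(self, request_queue, response_queue):
        self.request_queue = request_queue
        self.response_queue = response_queue

    def execute(self, method, **kwargs):
        # Forward the call to the proxy side, then block for its reply.
        self.request_queue.put((method, kwargs))
        return self.response_queue.get()


# A toy "proxy side" running in a thread so the example works in one process;
# in the real design this would be the Proxy Executor in another process.
req_q, resp_q = queue.Queue(), queue.Queue()


def fake_proxy():
    method, kwargs = req_q.get()
    resp_q.put(f"proxy executed {method} with {kwargs}")


threading.Thread(target=fake_proxy, daemon=True).start()
executor = RemoteExecutor(req_q, resp_q)
print(executor.execute("load_model", model="facebook/opt-125m"))
```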

Worker & Engine Customization

To support this architecture, the following core components were enhanced:

  • vllm/worker_controller/worker/gpu_worker.py: Enhanced Worker to support dynamic load_model and unload_model operations without a process restart (a hedged sketch of this swap pattern follows this list).
  • vllm/worker_controller/worker/model_runner.py: Custom ModelRunner tailored for the controller architecture.
  • vllm/worker_controller/engine/core.py: Custom EngineCore implementation with added load_model RPC hooks.
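
The load/unload-without-restart behavior mentioned for gpu_worker.py typically comes down to dropping all references to the old model and letting the CUDA caching allocator release its memory. The sketch below is an assumption about that general pattern, using Hugging Face transformers as a stand-in for the real vLLM model loading path; it is not the PR's actual Worker code.

```python
# Hypothetical sketch of swapping models inside one long-lived worker process.
import gc

import torch
from transformers import AutoModelForCausalLM


class SwappableWorker:
    """Toy worker that can load and unload a model without restarting its process."""

    def __init__(self, device="cuda:0"):
        self.device = device
        self.model = None

    def load_model(self, model_name):
        # Load weights onto the GPU assigned to this worker.
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(self.device)

    def unload_model(self):
        # Drop all references, then let the allocator return the GPU memory,
        # so the same process can host a different model next.
        self.model = None
        gc.collect()
        torch.cuda.empty_cache()


worker = SwappableWorker()
worker.load_model("facebook/opt-125m")
worker.unload_model()
worker.load_model("Qwen/Qwen3-0.6B")
```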

Documentation

  • Added vllm/worker_controller/README.md containing a detailed architecture overview and Mermaid diagrams illustrating the flow.

Testing Steps

New test scripts were added to verify the lifecycle and parallel capabilities:

  1. Lifecycle Verification (inference.py):

    • Verifies the full lifecycle: Load -> Inference -> Unload -> Load (a rough outline of this check appears after this list).
    • Ensures GPU resources are correctly managed and released between cycles.
  2. Parallel Inference (test_parallel_inference.py):

    • Loads two distinct models (facebook/opt-125m and Qwen/Qwen3-0.6B) simultaneously.
    • Verifies parallel execution on 2 GPUs.
  3. Dynamic Allocation (test_dynamic_allocation.py):

    • Loads facebook/opt-125m and Qwen/Qwen3-0.6B.
    • Unloads facebook/opt-125m.
    • Reloads Qwen/Qwen3-0.6B.
    • Runs inference to stress-test dynamic allocation stability.
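
For reference, a hypothetical outline of what the lifecycle check in inference.py might look like. The controller API used here (load_model / generate / unload_model) and the 64 MB tolerance are assumptions for illustration, not the script's actual contents.

```python
# Hypothetical outline of the Load -> Inference -> Unload -> Load check.
import torch


def gpu_mem_mb():
    return torch.cuda.memory_allocated() / 2**20


def run_lifecycle(controller, model_name, prompt):
    baseline = gpu_mem_mb()

    controller.load_model(model_name)       # Load
    print(controller.generate(prompt))      # Inference
    controller.unload_model(model_name)     # Unload

    # The point of the test: memory should return close to the baseline
    # before the model is loaded a second time.
    assert gpu_mem_mb() <= baseline + 64, "GPU memory was not released"

    controller.load_model(model_name)       # Load again
    print(controller.generate(prompt))
```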

FIX #xxxx


github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀
