[Feature] Enable disaggregated prefill functionality for v0 #658

Open

jianzs wants to merge 7 commits into main from zhengsj/pd_for_v0

Conversation

jianzs (Contributor) commented Apr 25, 2025

This PR fixes the non-functional implementation of disaggregated prefill. The improvements include:

  • Model Runner
    • Invoke the need_recv method in the model runner to enable kv cache reception on the decoding node.
  • Connector
    • Add MLA support in the connector (see the sketch after this list):
      • Use kv cache dimensions to determine whether MLA is used (a 4D tensor indicates MLA enabled, a 5D tensor indicates MLA disabled)
    • Implement request-unit-based kv cache transfer:
      • Address potential request reordering between prefill and decode nodes
      • Generate unique request IDs using request_id from vLLM request information, as required by LLMDataDist
    • Consolidate key and value cache transfer into a single tensor
  • Model
    • Add a config attribute to the CustomDeepSeekV2Model class to expose the model hidden size to the connector

Note: This implementation currently supports only 1P1D and requires identical parallel configurations for both prefill and decode instances. All instances must run on the same machine.
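
As a rough illustration of two of the connector changes above (MLA detection from kv cache rank, and packing key and value caches into one tensor), here is a minimal sketch. This is not the PR's actual code; the tensor layouts noted in the comments and the helper names `is_mla_cache` / `pack_kv` / `unpack_kv` are assumptions for illustration.

```python
import torch


def is_mla_cache(kv_cache: torch.Tensor) -> bool:
    # Per the PR description: a 4-D kv cache tensor indicates MLA is
    # enabled, while a 5-D tensor indicates MLA is disabled. The exact
    # dimension meanings below are assumed, not taken from the PR.
    if kv_cache.dim() == 4:
        return True  # e.g. a single compressed latent cache
    if kv_cache.dim() == 5:
        return False  # e.g. separate key/value planes stacked in dim 0
    raise ValueError(f"unexpected kv cache rank: {kv_cache.dim()}")


def pack_kv(key_cache: torch.Tensor, value_cache: torch.Tensor) -> torch.Tensor:
    # "Consolidate key and value cache transfer into a single tensor":
    # stacking both caches lets a single transfer move them together.
    return torch.stack((key_cache, value_cache), dim=0)


def unpack_kv(packed: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # The receiving (decode) side splits the stacked tensor back out.
    return packed[0], packed[1]
```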

jianzs force-pushed the zhengsj/pd_for_v0 branch 5 times, most recently from 0a9721c to 3b56fdd on April 26, 2025 at 01:15
jianzs (Contributor Author) commented Apr 27, 2025

@Yikun @ganyi1996ppo The test cases in the CI are failing to retrieve the device IPs. It appears that the npu-smi or hccn_tool tools are not available in the CI environment. I would like to know if there is a way to obtain the device IPs.

Yikun (Collaborator) commented Apr 27, 2025

npu-smi info already dumps the info here:
https://github.com/vllm-project/vllm-ascend/actions/runs/14676173763/job/41193168602#step:3:1

I'm not sure whether the hccn tools work or not; you could add a command to test it.

The CI infra is based on GitHub + k8s; each job runs in a pod with devices mounted. I'm not sure what the device IP means here?

jianzs (Contributor Author) commented Apr 28, 2025

> npu-smi info already dumps the info here: https://github.com/vllm-project/vllm-ascend/actions/runs/14676173763/job/41193168602#step:3:1
>
> I'm not sure whether the hccn tools work or not; you could add a command to test it.
>
> The CI infra is based on GitHub + k8s; each job runs in a pod with devices mounted. I'm not sure what the device IP means here?

Got it. The device IP refers to the IP address of the network interface card on the NPU.

Comment on lines 470 to 479
```python
for ip_offset in range(world_size):
    cmd = [
        HCCN_TOOL_PATH, '-i', f'{npu_start_idx + ip_offset}', '-ip', '-g'
    ]
    device_ip_info = subprocess.run(cmd,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE,
                                    universal_newlines=True)
    device_ip = re.match(r'ipaddr:(.*)\n', device_ip_info.stdout).group(1)
    device_ip_list.append(device_ip)
```
jianzs (Contributor Author) commented:
@wangxiyuan @ganyi1996ppo @Yikun Disaggregated prefill requires obtaining the device IP for direct d2d transmission. Obtaining the device IP relies on a component called hccn_tool, typically located at /usr/local/Ascend/driver/tools/hccn_tool. However, CI indicates that the tool cannot be found. Does anyone know why? I think this is crucial for implementing test cases for disaggregated prefill.

A collaborator commented:
/usr/local/Ascend/driver/tools/hccn_tool is not included in the CI container by default. I'll ask the Infra team to add it later.
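
For reference, the device-IP lookup in the snippet above amounts to invoking hccn_tool per device and parsing its `ipaddr:` output (the output format is inferred from the regex in the snippet). Below is a minimal sketch with an explicit availability check, useful in environments like the current CI image where the tool is missing; the helper name and error handling are illustrative, not the PR's code.

```python
import os
import re
import subprocess

# Default install location mentioned in the comment above.
HCCN_TOOL_PATH = "/usr/local/Ascend/driver/tools/hccn_tool"


def get_device_ip(device_id: int) -> str:
    # Fail with a clear message when the driver tools are not installed,
    # as is currently the case in the CI containers.
    if not os.path.exists(HCCN_TOOL_PATH):
        raise RuntimeError(
            f"{HCCN_TOOL_PATH} not found; is the Ascend driver installed?")
    result = subprocess.run(
        [HCCN_TOOL_PATH, "-i", str(device_id), "-ip", "-g"],
        capture_output=True,
        text=True,
        check=True,
    )
    # hccn_tool prints a line of the form "ipaddr:x.x.x.x".
    match = re.search(r"ipaddr:(\S+)", result.stdout)
    if match is None:
        raise RuntimeError(f"could not parse device IP from {result.stdout!r}")
    return match.group(1)
```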

jianzs force-pushed the zhengsj/pd_for_v0 branch from fcff28f to 50721c6 on April 28, 2025 at 15:25
jianzs force-pushed the zhengsj/pd_for_v0 branch 2 times, most recently from 1bf9e49 to 809b434 on April 29, 2025 at 02:46
jianzs force-pushed the zhengsj/pd_for_v0 branch 6 times, most recently from e746e26 to e3de138 on May 1, 2025 at 07:31
jianzs added 7 commits May 2, 2025 10:01
This commit fixes the non-functional implementation of disaggregated prefill. The improvements include:

- Model Runner
    - Invoke `need_recv` method in the model runner to enable kv cache reception on the decoding node.
- Connector
    - Add MLA support in connector:
        - Use kv cache dimensions to determine whether MLA is used (a 4D tensor indicates MLA enabled, a 5D tensor indicates MLA disabled)
    - Implement request unit-based kv cache transfer:
        - Address potential request reordering between prefill and decode nodes
        - Generate unique request IDs using `request_id` from vLLM request information, as required by LLMDataDist
    - Consolidate key and value cache transfer into a single tensor
- Model
    - Add `config` attribute to `CustomDeepSeekV2Model` class to expose model hidden size to connector

Note: This implementation currently supports only a 1P1D configuration and requires identical parallel configurations for both prefill and decode nodes.

Signed-off-by: Jade Zheng <[email protected]>
jianzs force-pushed the zhengsj/pd_for_v0 branch from 3347e7c to e16e389 on May 2, 2025 at 02:17