[Feature] Enable disaggregated prefill functionality for v0 #658

Open

jianzs wants to merge 7 commits into main from zhengsj/pd_for_v0

Conversation

jianzs (Contributor) commented Apr 25, 2025

This PR fixes the non-functional implementation of disaggregated prefill. The improvements include:

  • Model Runner
    • Invoke the need_recv method in the model runner to enable kv cache reception on the decoding node.
  • Connector
    • Add MLA support in the connector (see the sketch after this list):
      • Use kv cache dimensions to determine whether MLA is used (a 4D tensor indicates MLA enabled, a 5D tensor indicates MLA disabled)
    • Implement request-unit-based kv cache transfer:
      • Address potential request reordering between prefill and decode nodes
      • Generate unique request IDs using request_id from vLLM request information, as required by LLMDataDist
    • Consolidate key and value cache transfer into a single tensor
  • Model
    • Add a config attribute to the CustomDeepSeekV2Model class to expose the model hidden size to the connector

Note: This implementation currently supports only 1P1D and requires identical parallel configurations for both prefill and decode instances. All instances must run on the same machine.
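
As a rough illustration of two of the connector changes above (MLA detection from kv cache rank, and packing key and value caches into one tensor), here is a minimal sketch. This is not the PR's actual code; the tensor layouts noted in the comments and the helper names `is_mla_cache` / `pack_kv` / `unpack_kv` are assumptions for illustration.

```python
import torch


def is_mla_cache(kv_cache: torch.Tensor) -> bool:
    # Per the PR description: a 4-D kv cache tensor indicates MLA is
    # enabled, while a 5-D tensor indicates MLA is disabled. The exact
    # dimension meanings below are assumed, not taken from the PR.
    if kv_cache.dim() == 4:
        return True  # e.g. a single compressed latent cache
    if kv_cache.dim() == 5:
        return False  # e.g. separate key/value planes stacked in dim 0
    raise ValueError(f"unexpected kv cache rank: {kv_cache.dim()}")


def pack_kv(key_cache: torch.Tensor, value_cache: torch.Tensor) -> torch.Tensor:
    # "Consolidate key and value cache transfer into a single tensor":
    # stacking both caches lets a single transfer move them together.
    return torch.stack((key_cache, value_cache), dim=0)


def unpack_kv(packed: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # The receiving (decode) side splits the stacked tensor back out.
    return packed[0], packed[1]
```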

jianzs force-pushed the zhengsj/pd_for_v0 branch 5 times, most recently from 0a9721c to 3b56fdd on April 26, 2025 at 01:15
jianzs (Contributor Author) commented Apr 27, 2025

@Yikun @ganyi1996ppo The test cases in the CI are failing to retrieve the device IPs. It appears that the npu-smi or hccn_tool tools are not available in the CI environment. I would like to know if there is a way to obtain the device IPs.

Yikun (Collaborator) commented Apr 27, 2025

npu-smi info already dumps the info here:
https://github.com/vllm-project/vllm-ascend/actions/runs/14676173763/job/41193168602#step:3:1

I'm not sure whether the hccn tools work or not; you could add a command to test it.

The CI infra is based on GitHub + k8s; each job runs in a pod with devices mounted. I'm not sure what the device IP means here?

jianzs (Contributor Author) commented Apr 28, 2025

> npu-smi info already dumps the info here: https://github.com/vllm-project/vllm-ascend/actions/runs/14676173763/job/41193168602#step:3:1
>
> I'm not sure whether the hccn tools work or not; you could add a command to test it.
>
> The CI infra is based on GitHub + k8s; each job runs in a pod with devices mounted. I'm not sure what the device IP means here?

Got it. The device IP refers to the IP address of the network interface card on the NPU.

Comment on lines 470 to 479
```python
for ip_offset in range(world_size):
    cmd = [
        HCCN_TOOL_PATH, '-i', f'{npu_start_idx + ip_offset}', '-ip', '-g'
    ]
    device_ip_info = subprocess.run(cmd,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE,
                                    universal_newlines=True)
    device_ip = re.match(r'ipaddr:(.*)\n', device_ip_info.stdout).group(1)
    device_ip_list.append(device_ip)
```
jianzs (Contributor Author) commented:
@wangxiyuan @ganyi1996ppo @Yikun Disaggregated prefill requires obtaining the device IP for direct d2d transmission. Obtaining the device IP relies on a component called hccn_tool, typically located at /usr/local/Ascend/driver/tools/hccn_tool. However, CI indicates that the tool cannot be found. Does anyone know why? I think this is crucial for implementing test cases for disaggregated prefill.

A collaborator commented:
/usr/local/Ascend/driver/tools/hccn_tool is not included in the CI container by default. I'll ask the Infra team to add it later.
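
For reference, the device-IP lookup in the snippet above amounts to invoking hccn_tool per device and parsing its `ipaddr:` output (the output format is inferred from the regex in the snippet). Below is a minimal sketch with an explicit availability check, useful in environments like the current CI image where the tool is missing; the helper name and error handling are illustrative, not the PR's code.

```python
import os
import re
import subprocess

# Default install location mentioned in the comment above.
HCCN_TOOL_PATH = "/usr/local/Ascend/driver/tools/hccn_tool"


def get_device_ip(device_id: int) -> str:
    # Fail with a clear message when the driver tools are not installed,
    # as is currently the case in the CI containers.
    if not os.path.exists(HCCN_TOOL_PATH):
        raise RuntimeError(
            f"{HCCN_TOOL_PATH} not found; is the Ascend driver installed?")
    result = subprocess.run(
        [HCCN_TOOL_PATH, "-i", str(device_id), "-ip", "-g"],
        capture_output=True,
        text=True,
        check=True,
    )
    # hccn_tool prints a line of the form "ipaddr:x.x.x.x".
    match = re.search(r"ipaddr:(\S+)", result.stdout)
    if match is None:
        raise RuntimeError(f"could not parse device IP from {result.stdout!r}")
    return match.group(1)
```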

jianzs force-pushed the zhengsj/pd_for_v0 branch from fcff28f to 50721c6 on April 28, 2025 at 15:25
jianzs force-pushed the zhengsj/pd_for_v0 branch 2 times, most recently from 1bf9e49 to 809b434 on April 29, 2025 at 02:46
jianzs force-pushed the zhengsj/pd_for_v0 branch 6 times, most recently from e746e26 to e3de138 on May 1, 2025 at 07:31
jianzs added 7 commits May 2, 2025 10:01
This commit fixes the non-functional implementation of disaggregated prefill. The improvements include:

- Model Runner
    - Invoke `need_recv` method in the model runner to enable kv cache reception on the decoding node.
- Connector
    - Add MLA support in connector:
        - Use kv cache dimensions to determine whether MLA is used (a 4D tensor indicates MLA enabled, a 5D tensor indicates MLA disabled)
    - Implement request unit-based kv cache transfer:
        - Address potential request reordering between prefill and decode nodes
        - Generate unique request IDs using `request_id` from vLLM request information, as required by LLMDataDist
    - Consolidate key and value cache transfer into a single tensor
- Model
    - Add `config` attribute to `CustomDeepSeekV2Model` class to expose model hidden size to connector

Note: This implementation currently supports only a 1P1D configuration and requires identical parallel configurations for both prefill and decode nodes.

Signed-off-by: Jade Zheng <[email protected]>
jianzs force-pushed the zhengsj/pd_for_v0 branch from 3347e7c to e16e389 on May 2, 2025 at 02:17