[Feature] Enable disaggregated prefill functionality for v0 #658
base: main
@Yikun @ganyi1996ppo The test cases in CI fail to retrieve the device IPs. It appears that the npu-smi or hccn_tool tools are not available in the CI environment. Is there another way to obtain the device IPs?
I'm not sure whether the hccn tools work there; you could add a command to the CI job to test it. The infra CI is based on GitHub + k8s, and each job runs on a pod with the device mounted. What does the device IP mean?
Got it. The device IP refers to the IP address of the network interface card on the NPU.
for ip_offset in range(world_size):
    cmd = [
        HCCN_TOOL_PATH, '-i', f'{npu_start_idx + ip_offset}', '-ip', '-g'
    ]
    device_ip_info = subprocess.run(cmd,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE,
                                    universal_newlines=True)
    device_ip = re.match(r'ipaddr:(.*)\n', device_ip_info.stdout).group(1)
    device_ip_list.append(device_ip)
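One caveat with the snippet above: `re.match` only matches at the start of the string and returns `None` when the pattern is absent, so `.group(1)` raises `AttributeError` if `hccn_tool` is missing or prints something unexpected. Below is a minimal sketch of a more defensive parser. It assumes the output contains a line of the form `ipaddr:<address>`, which is inferred from the regex above rather than from tool documentation, and `parse_device_ip` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed output shape, inferred from the regex in the reviewed code:
#   ipaddr:192.168.100.101
#   netmask:255.255.255.0
_IPADDR_RE = re.compile(r"^ipaddr:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_device_ip(stdout: str) -> Optional[str]:
    """Return the address from an `ipaddr:` line, or None if no such line exists."""
    found = _IPADDR_RE.search(stdout)
    return found.group(1) if found else None
```

The caller can then raise a descriptive error, or skip the device, when the result is `None` instead of failing inside the regex call.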
@wangxiyuan @ganyi1996ppo @Yikun Disaggregated prefill needs the device IP for direct d2d transmission. To obtain the device IP, it relies on a component called hccn_tool, which is typically located at /usr/local/Ascend/driver/tools/hccn_tool. However, CI indicates that the tool cannot be found. Does anyone know why? I think this is crucial for implementing test cases for disaggregated prefill.
/usr/local/Ascend/driver/tools/hccn_tool is not included in the CI container by default. I'll ask the Infra team to add it later.
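Until the tool is present in the CI image, tests that depend on it could check for it up front. This is a minimal sketch, not part of the PR: `hccn_tool_available` is a hypothetical helper, and the default path is the one quoted in this thread.

```python
import os

# Path quoted in the discussion above; adjust if the driver is installed elsewhere.
HCCN_TOOL_PATH = "/usr/local/Ascend/driver/tools/hccn_tool"


def hccn_tool_available(path: str = HCCN_TOOL_PATH) -> bool:
    """Return True if `path` is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

A test module could use the result as a skip condition, for example with `pytest.mark.skipif`, so the disaggregated prefill tests are reported as skipped rather than failed on runners without the tool.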
This commit fixes the non-functional implementation of disaggregated prefill. The improvements include:

- Model Runner
  - Invoke the `need_recv` method in the model runner to enable kv cache reception on the decoding node.
- Connector
  - Add MLA support in the connector:
    - Use the kv cache dimensions to determine whether MLA is used (a 4D tensor indicates MLA enabled, a 5D tensor indicates MLA disabled)
  - Implement request unit-based kv cache transfer:
    - Address potential request reordering between prefill and decode nodes
    - Generate unique request IDs using `request_id` from vLLM request information to meet the LLMDataDist requirement
    - Consolidate key and value cache transfer into a single tensor
- Model
  - Add a `config` attribute to the `CustomDeepSeekV2Model` class to expose the model hidden size to the connector

Note: This implementation currently supports only 1P1D configuration and requires identical parallel configurations for both prefill and decode nodes.

Signed-off-by: Jade Zheng <[email protected]>
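The MLA check described in the commit message can be illustrated with a small helper. This is a sketch of the stated rule only (4D means MLA, 5D means no MLA), not the connector's actual code; `uses_mla` is a hypothetical name, and the meaning of the individual dimensions is deliberately left out because the commit message does not state it.

```python
from typing import Sequence


def uses_mla(kv_cache_shape: Sequence[int]) -> bool:
    """Infer MLA from the rank of a per-layer kv cache tensor.

    Per the commit message: a 4D tensor means MLA is enabled,
    a 5D tensor means it is disabled. Other ranks are rejected
    rather than guessed.
    """
    rank = len(kv_cache_shape)
    if rank == 4:
        return True
    if rank == 5:
        return False
    raise ValueError(f"unexpected kv cache rank {rank}")
```

With a PyTorch tensor the call would be `uses_mla(kv_cache.shape)`. Raising on any other rank makes a layout change surface as an error instead of a silently wrong transfer.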
This PR fixes the non-functional implementation of disaggregated prefill. The improvements include:

- Invoke the `need_recv` method in the model runner to enable kv cache reception on the decoding node.
- Generate unique request IDs using `request_id` from vLLM request information to meet the LLMDataDist requirement.
- Add a `config` attribute to the `CustomDeepSeekV2Model` class to expose the model hidden size to the connector.

Note: This implementation currently supports only 1P1D and requires identical parallel configurations for both prefill and decode instances. All instances must run on the same machine.
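The commit message mentions that requests may be reordered between the prefill and decode nodes. One general way to tolerate that is to key received kv cache data by request ID rather than by arrival order. The sketch below illustrates that idea only; it is not the connector's implementation, and `KVCacheInbox` and its methods are hypothetical names.

```python
from typing import Any, Dict


class KVCacheInbox:
    """Holds received kv cache payloads until the matching request asks for them."""

    def __init__(self) -> None:
        self._pending: Dict[str, Any] = {}

    def put(self, request_id: str, payload: Any) -> None:
        # A repeated ID would make the lookup ambiguous, so reject it.
        if request_id in self._pending:
            raise KeyError(f"duplicate request_id {request_id!r}")
        self._pending[request_id] = payload

    def take(self, request_id: str) -> Any:
        # Raises KeyError if nothing has arrived for this request yet.
        return self._pending.pop(request_id)
```

Because lookups go through the request ID, the decode side can consume payloads in a different order from the one in which the prefill side produced them, which is also why the IDs have to be unique.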