Description:
I've encountered an issue where NPUs are currently not shareable across multiple Pods. I initially suspected it might be a scheduler issue, but after further investigation, I found that the problem lies in the Node Annotations. The annotation hami.io/node-register-Ascend910B looks like this:
hami.io/node-register-Ascend910B: '[{"id":"Ascend910B-0","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":1,"id":"Ascend910B-1","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":2,"id":"Ascend910B-2","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":3,"id":"Ascend910B-3","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":4,"id":"Ascend910B-4","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":5,"id":"Ascend910B-5","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":6,"id":"Ascend910B-6","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":7,"id":"Ascend910B-7","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true}]'
In each entry, "count": 1 is set, meaning each NPU instance can only be allocated to one Pod at a time, which prevents resource sharing.
Upon checking the code, I found that this value (Count: 1) is indeed hard-coded. However, there is already a VDeviceCount() method in manager.go that computes the appropriate value:
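For context, here is a minimal sketch of the logic that method is described as implementing; the type, field, and template names below are illustrative, not the actual definitions in HAMi's manager.go:

```go
package main

import "fmt"

// Illustrative types only; the real definitions in HAMi's manager.go differ.
type template struct {
	Name   string
	Memory int64 // MiB
}

type AscendManager struct {
	memoryTotal int64      // total device memory in MiB (65536 for the Ascend910B above)
	templates   []template // vNPU templates the device supports
}

// VDeviceCount returns the maximum number of virtual devices a physical NPU
// can be split into: total memory divided by the memory of the smallest template.
func (m *AscendManager) VDeviceCount() int {
	if len(m.templates) == 0 {
		return 1
	}
	smallest := m.templates[0].Memory
	for _, t := range m.templates[1:] {
		if t.Memory < smallest {
			smallest = t.Memory
		}
	}
	if smallest <= 0 {
		return 1
	}
	return int(m.memoryTotal / smallest)
}

func main() {
	m := &AscendManager{
		memoryTotal: 65536, // matches devmem in the annotation above
		templates: []template{
			{Name: "small", Memory: 8192},  // illustrative template sizes
			{Name: "medium", Memory: 16384},
		},
	}
	fmt.Println(m.VDeviceCount()) // 8
}
```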
This function already implements the expected logic: it calculates the maximum allocatable count as the total available memory divided by the memory of the smallest template. Updating the Count field to Count: int32(ps.mgr.VDeviceCount()) should resolve the issue by correctly reflecting the maximum shareable capacity.
Proposed Fix: Replace Count: 1 with Count: int32(ps.mgr.VDeviceCount()) in the device-registration code, so the count reflects the maximum number of virtual devices the available resources allow (see the sketch below).
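A rough sketch of the change at the registration site, continuing the sketch above; plugin, ps.mgr, and the deviceInfo fields are reconstructed from the annotation and are not the exact HAMi types:

```go
// deviceInfo mirrors the fields visible in the annotation above; the real
// registration type in the HAMi device plugin may be named and shaped differently.
type deviceInfo struct {
	Index   int    `json:"index"`
	ID      string `json:"id"`
	Count   int32  `json:"count"`
	Devmem  int32  `json:"devmem"`
	Devcore int32  `json:"devcore"`
	Type    string `json:"type"`
	Health  bool   `json:"health"`
}

type plugin struct {
	mgr *AscendManager // manager from the sketch above
}

func (ps *plugin) registerDevices(ids []string) []deviceInfo {
	devs := make([]deviceInfo, 0, len(ids))
	for idx, id := range ids {
		devs = append(devs, deviceInfo{
			Index: idx,
			ID:    id,
			// Previously hard-coded to 1, which prevented sharing;
			// the proposed fix uses the computed virtual-device count instead.
			Count:   int32(ps.mgr.VDeviceCount()),
			Devmem:  65536,
			Devcore: 20,
			Type:    "Ascend910B",
			Health:  true,
		})
	}
	return devs
}
```

With a change along these lines, each Ascend910B entry in the node annotation would report whatever count VDeviceCount() computes instead of 1, so the scheduler could bind that many Pods to the same physical NPU.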