Description:
I've encountered an issue where NPUs are currently not shareable across multiple Pods. I initially suspected it might be a scheduler issue, but after further investigation, I found that the problem lies in the Node Annotations. The annotation hami.io/node-register-Ascend910B looks like this:
hami.io/node-register-Ascend910B: '[{"id":"Ascend910B-0","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":1,"id":"Ascend910B-1","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":2,"id":"Ascend910B-2","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":3,"id":"Ascend910B-3","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":4,"id":"Ascend910B-4","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":5,"id":"Ascend910B-5","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":6,"id":"Ascend910B-6","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},{"index":7,"id":"Ascend910B-7","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true}]'
In each entry, "count": 1 is set, meaning each NPU instance can only be allocated to one Pod at a time, which prevents resource sharing.
Upon checking the code, I found that this value (Count: 1) is indeed hard-coded. However, there is already a VDeviceCount() method in manager.go that computes the appropriate value:
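For context, here is a minimal sketch of the logic that method is described as implementing; the type, field, and template names below are illustrative, not the actual definitions in HAMi's manager.go:

```go
package main

import "fmt"

// Illustrative types only; the real definitions in HAMi's manager.go differ.
type template struct {
	Name   string
	Memory int64 // MiB
}

type AscendManager struct {
	memoryTotal int64      // total device memory in MiB (65536 for the Ascend910B above)
	templates   []template // vNPU templates the device supports
}

// VDeviceCount returns the maximum number of virtual devices a physical NPU
// can be split into: total memory divided by the memory of the smallest template.
func (m *AscendManager) VDeviceCount() int {
	if len(m.templates) == 0 {
		return 1
	}
	smallest := m.templates[0].Memory
	for _, t := range m.templates[1:] {
		if t.Memory < smallest {
			smallest = t.Memory
		}
	}
	if smallest <= 0 {
		return 1
	}
	return int(m.memoryTotal / smallest)
}

func main() {
	m := &AscendManager{
		memoryTotal: 65536, // matches devmem in the annotation above
		templates: []template{
			{Name: "small", Memory: 8192},  // illustrative template sizes
			{Name: "medium", Memory: 16384},
		},
	}
	fmt.Println(m.VDeviceCount()) // 8
}
```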
This function already implements the expected logic: it calculates the maximum allocatable count as the total available memory divided by the memory of the smallest template. Updating the Count field to Count: int32(ps.mgr.VDeviceCount()) should resolve the issue by correctly reflecting the maximum shareable capacity.
Proposed Fix: Replace Count: 1 with Count: int32(ps.mgr.VDeviceCount()) in the device-registration code, so the count reflects the maximum number of virtual devices the available resources allow (see the sketch below).
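A rough sketch of the change at the registration site, continuing the sketch above; plugin, ps.mgr, and the deviceInfo fields are reconstructed from the annotation and are not the exact HAMi types:

```go
// deviceInfo mirrors the fields visible in the annotation above; the real
// registration type in the HAMi device plugin may be named and shaped differently.
type deviceInfo struct {
	Index   int    `json:"index"`
	ID      string `json:"id"`
	Count   int32  `json:"count"`
	Devmem  int32  `json:"devmem"`
	Devcore int32  `json:"devcore"`
	Type    string `json:"type"`
	Health  bool   `json:"health"`
}

type plugin struct {
	mgr *AscendManager // manager from the sketch above
}

func (ps *plugin) registerDevices(ids []string) []deviceInfo {
	devs := make([]deviceInfo, 0, len(ids))
	for idx, id := range ids {
		devs = append(devs, deviceInfo{
			Index: idx,
			ID:    id,
			// Previously hard-coded to 1, which prevented sharing;
			// the proposed fix uses the computed virtual-device count instead.
			Count:   int32(ps.mgr.VDeviceCount()),
			Devmem:  65536,
			Devcore: 20,
			Type:    "Ascend910B",
			Health:  true,
		})
	}
	return devs
}
```

With a change along these lines, each Ascend910B entry in the node annotation would report whatever count VDeviceCount() computes instead of 1, so the scheduler could bind that many Pods to the same physical NPU.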