
[BUG] NPU Cannot Be Shared Across Multiple Pods #5

Open

Nimbus318 opened this issue Oct 30, 2024 · 0 comments

Description:

I've encountered an issue where NPUs currently cannot be shared across multiple Pods. I initially suspected a scheduler problem, but after further investigation I found that the cause lies in the node annotations. The hami.io/node-register-Ascend910B annotation looks like this:

hami.io/node-register-Ascend910B: '[
  {"id":"Ascend910B-0","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":1,"id":"Ascend910B-1","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":2,"id":"Ascend910B-2","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":3,"id":"Ascend910B-3","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":4,"id":"Ascend910B-4","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":5,"id":"Ascend910B-5","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":6,"id":"Ascend910B-6","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true},
  {"index":7,"id":"Ascend910B-7","count":1,"devmem":65536,"devcore":20,"type":"Ascend910B","health":true}
]'
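For reference, each entry appears to serialize a per-device struct along these lines (a hypothetical reconstruction; the type and field names are inferred from the JSON keys above, not taken from the actual source):

// DeviceInfo mirrors one entry of the node-register annotation (inferred).
type DeviceInfo struct {
    Index   uint   `json:"index,omitempty"` // omitted for device 0 above
    ID      string `json:"id"`
    Count   int32  `json:"count"`   // how many Pods may share this device
    Devmem  int32  `json:"devmem"`  // device memory, presumably in MiB
    Devcore int32  `json:"devcore"` // device compute cores
    Type    string `json:"type"`
    Health  bool   `json:"health"`
}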

Every entry sets "count": 1, meaning each NPU can be allocated to only one Pod at a time, which rules out any sharing.

Checking the code confirmed that this value (Count: 1) is hard-coded. However, there is already a method in manager.go that implements the needed calculation:

func (am *AscendManager) VDeviceCount() int {
    // No vNPU templates configured: the device cannot be virtualized,
    // so it can only be handed out whole.
    if len(am.config.Templates) == 0 {
        return 1
    }
    // Otherwise, the maximum number of virtual devices is the total
    // allocatable memory divided by the memory of the first (smallest) template.
    return int(am.config.MemoryAllocatable / am.config.Templates[0].Memory)
}

This method provides exactly the logic needed: it computes the maximum allocatable count by dividing the total available memory by the memory of the smallest template. Setting the Count field to int32(ps.mgr.VDeviceCount()) should therefore resolve the issue by reporting the device's true shareable capacity.
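As a quick sanity check of that division, a minimal runnable sketch (65536 matches the devmem reported in the annotation above; the 16384 MiB smallest-template memory is a hypothetical value):

package main

import "fmt"

func main() {
    memoryAllocatable := int64(65536) // total device memory in MiB, per the annotation
    smallestTemplate := int64(16384)  // hypothetical smallest vNPU template, in MiB

    // Mirrors VDeviceCount: total allocatable memory / smallest template memory.
    fmt.Println(int(memoryAllocatable / smallestTemplate)) // prints 4
}

With those numbers the device would register with "count": 4 and could be shared by up to four Pods.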

Proposed Fix: Replace Count: 1 with Count: int32(ps.mgr.VDeviceCount()) in the registration code, so that the reported count reflects the maximum number of virtual devices the available resources can support.
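For illustration only, the registration site after the change might look roughly like this, reusing the hypothetical DeviceInfo sketch above; everything except the Count expression is an assumption about the surrounding code, which I haven't verified:

registered = append(registered, DeviceInfo{
    ID:      dev.ID,                       // e.g. "Ascend910B-0"
    Count:   int32(ps.mgr.VDeviceCount()), // was the hard-coded Count: 1
    Devmem:  65536,
    Devcore: 20,
    Type:    "Ascend910B",
    Health:  true,
})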
