
fix: fix mig-partition timeout to ensure the service completes #5971

Merged: 1 commit merged into master from lihl/fix-mig-on-a100 on Mar 8, 2025

Conversation

@henryli001 (Contributor) commented Feb 28, 2025

What type of PR is this?

What this PR does / why we need it:
The mig-partition service is only given 30 seconds to restart, which is not sufficient for the service to complete the first time the restart operation is executed. This causes the provisioning process to keep retrying (up to 100 times). A retry may eventually succeed because earlier attempts have already MIG-enabled some of the GPUs; if not, provisioning is re-attempted from scratch. Even when the cluster does get provisioned successfully, the whole provisioning process ends up taking multiple hours in local experiments.

This behavior was captured on the 8-card A100 GPU SKUs such as https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndma100v4-series?tabs=sizebasic for both Azure Linux and Ubuntu AKS. In existing experiments, the first attempt of systemctl restart mig-partition takes around 2 minutes or more on both Azure Linux and Ubuntu. This change therefore increases the timeout of the mig-partition service itself to 5 minutes (the systemd default is 90 seconds). The logic that performs systemctl restart is also modified to time out at 5 minutes for the mig-partition service specifically, while maintaining the original 30 seconds for other systemd services.

Which issue(s) this PR fixes:

Requirements:

Special notes for your reviewer:

Release note:

none

@cameronmeissner (Collaborator) commented:

@henryli001 friendly reminder to please avoid including internal ADO links within PR descriptions/comments

# Parameterize the timeout value here to assign longer timeout for mig-partition service to complete
# while keeping the original timeout value for other scenarios
service=$1
if [ "$service" = "mig-partition" ]; then
Collaborator:

I'd prefer bubbling the timeout parameter up to the top-level from systemctl_restart instead of needing to pivot based on service name

@henryli001 (author):

refactored the implementation
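As an illustration only, a minimal sketch of what bubbling the timeout up to the caller could look like; the function name, parameters, retry counts, and the containerd example below are hypothetical and may differ from the actual AgentBaker helpers:

# Hypothetical sketch: the caller supplies the timeout instead of the
# helper pivoting on the service name.
systemctl_restart_with_timeout() {
    local retries=$1 wait_sleep=$2 timeout_secs=$3 svcname=$4
    local i
    for ((i = 1; i <= retries; i++)); do
        timeout "$timeout_secs" systemctl daemon-reload
        timeout "$timeout_secs" systemctl restart "$svcname" && return 0
        sleep "$wait_sleep"
    done
    return 1
}

# 30 seconds for ordinary services, 300 seconds (5 minutes) for mig-partition.
systemctl_restart_with_timeout 5 10 30 containerd
systemctl_restart_with_timeout 5 10 300 mig-partition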

@henryli001 force-pushed the lihl/fix-mig-on-a100 branch 2 times, most recently from 1c1c22e to 74a88c6 on March 6, 2025 03:10
@henryli001 force-pushed the lihl/fix-mig-on-a100 branch from 74a88c6 to 7807e1c on March 6, 2025 08:05

github-actions bot commented Mar 6, 2025

No changes to cached containers or packages on Windows VHDs

@ganeshkumarashok (Contributor) commented:

Just noting that the PR description may not apply to Ubuntu nodes. I've not noticed MIG enablement in AKS Ubuntu nodes take "multiple hours in local experiments" when I've provisioned them. Curious why there's such a difference between OSes.

@@ -5,6 +5,7 @@ Description=Apply MIG configuration on Nvidia A100 GPU
Restart=on-failure
ExecStartPre=/usr/bin/nvidia-smi -mig 1
ExecStart=/bin/bash /opt/azure/containers/mig-partition.sh ${GPU_INSTANCE_PROFILE}
TimeoutSec=300
Contributor:

What's the default timeout when this is not added?

@henryli001 (author):

The default timeout value for a normal systemd service is 90s.
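As a quick sanity check (not part of the PR), the effective and default timeouts can be inspected with systemctl:

# With TimeoutSec=300 in the unit, both values should report 5min.
systemctl show mig-partition.service -p TimeoutStartUSec -p TimeoutStopUSec

# Manager-wide default (90s unless overridden in /etc/systemd/system.conf).
systemctl show -p DefaultTimeoutStartUSec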

@ganeshkumarashok (Contributor) commented:

Thanks for making this fix. Were you able to check whether this PR fixed the issue?

@henryli001 (author) commented:

> Just noting that the PR description may not apply to Ubuntu nodes. I've not noticed MIG enablement in AKS Ubuntu nodes take "multiple hours in local experiments" when I've provisioned them. Curious why there's such a difference between OSes.

The issue was also reproduced on Ubuntu 22.04 using VM size Standard_ND96asr_v4.

Dumping the failure message here:
[{"code":"VMExtensionProvisioningError","target":"0","message":"VM has reported a failure when processing extension 'vmssCSE' (publisher 'Microsoft.Azure.Extensions' and type 'CustomScript'). Error message: 'Enable failed: failed to execute command: command terminated with exit status=124\n[stdout]\n{ \"ExitCode\": \"124\", \"Output\": \"ontrol process exited, code=killed, status=15/TERM\\nMar 07 01:07:11 aks-nodepool1-12368757-vmss000000 systemd[1]: mig-partition.service: Failed with result 'signal'.\\nMar 07 01:07:11 aks-nodepool1-12368757-vmss000000 systemd[1]: Stopped Apply MIG configuration on Nvidia A100 GPU.\\nMar 07 01:07:11 aks-nodepool1-12368757-vmss000000 systemd[1]: mig-partition.service: Consumed 9.650s CPU time.\\n

@henryli001 (author) commented:

> Thanks for making this fix. Were you able to check whether this PR fixed the issue?

Customized AgentBaker E2E using an Azure Linux AKS 3.0 image and was able to successfully deploy a MIG node; the node rebooted as expected and eventually became Ready. Not able to test on Ubuntu since I don't have read permission on the subscription that hosts the Ubuntu test image.

@henryli001 (author) commented:

> Thanks for making this fix. Were you able to check whether this PR fixed the issue?

Was also able to run the test against Ubuntu using the pipeline, and the MIG node got provisioned successfully. After the node became Ready following the reboot expected on A100, nvidia-smi shows the MIG devices correctly.


github-actions bot commented Mar 7, 2025

No changes to cached containers or packages on Windows VHDs

@ganeshkumarashok (Contributor) left a comment:

Thanks Henry, this lgtm!

@henryli001 enabled auto-merge (squash) March 7, 2025 23:40
@henryli001 merged commit 52eea29 into master Mar 8, 2025
20 checks passed
@henryli001 deleted the lihl/fix-mig-on-a100 branch March 8, 2025 00:39