
fix: fix mig-partition timeout to ensure the service completes #5971

Merged: 1 commit merged into master from lihl/fix-mig-on-a100 on Mar 8, 2025

Conversation

@henryli001 (Contributor) commented Feb 28, 2025

What type of PR is this?

What this PR does / why we need it:
The mig-partition service is only given 30 seconds to restart, which is not sufficient for the service to complete the first time the restart operation is executed. This causes the provisioning process to keep retrying (up to 100 times). A retry may eventually succeed because earlier attempts have already MIG-enabled some of the GPUs; if not, provisioning is re-attempted from scratch. Even when the cluster does get provisioned successfully, the whole provisioning process ends up taking multiple hours in local experiments.

This behavior was captured on the 8-card A100 GPU SKUs such as https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndma100v4-series?tabs=sizebasic for both Azure Linux and Ubuntu AKS. In existing experiments, the first attempt of systemctl restart mig-partition takes around 2 minutes or more on both Azure Linux and Ubuntu. This change therefore increases the timeout of the mig-partition service itself to 5 minutes (the systemd default is 90 seconds). The logic that performs systemctl restart is also modified to time out at 5 minutes for the mig-partition service specifically, while maintaining the original 30 seconds for other systemd services.

Which issue(s) this PR fixes:

Requirements:

Special notes for your reviewer:

Release note:

none

@cameronmeissner (Collaborator) commented:

@henryli001 friendly reminder to please avoid including internal ADO links within PR descriptions/comments

# Parameterize the timeout value here to assign longer timeout for mig-partition service to complete
# while keeping the original timeout value for other scenarios
service=$1
if [ "$service" = "mig-partition" ]; then
Collaborator:

I'd prefer bubbling the timeout parameter up to the top-level from systemctl_restart instead of needing to pivot based on service name

@henryli001 (author):

refactored the implementation
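As an illustration only, a minimal sketch of what bubbling the timeout up to the caller could look like; the function name, parameters, retry counts, and the containerd example below are hypothetical and may differ from the actual AgentBaker helpers:

# Hypothetical sketch: the caller supplies the timeout instead of the
# helper pivoting on the service name.
systemctl_restart_with_timeout() {
    local retries=$1 wait_sleep=$2 timeout_secs=$3 svcname=$4
    local i
    for ((i = 1; i <= retries; i++)); do
        timeout "$timeout_secs" systemctl daemon-reload
        timeout "$timeout_secs" systemctl restart "$svcname" && return 0
        sleep "$wait_sleep"
    done
    return 1
}

# 30 seconds for ordinary services, 300 seconds (5 minutes) for mig-partition.
systemctl_restart_with_timeout 5 10 30 containerd
systemctl_restart_with_timeout 5 10 300 mig-partition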

@henryli001 force-pushed the lihl/fix-mig-on-a100 branch 2 times, most recently from 1c1c22e to 74a88c6 on March 6, 2025 03:10
@henryli001 force-pushed the lihl/fix-mig-on-a100 branch from 74a88c6 to 7807e1c on March 6, 2025 08:05

github-actions bot commented Mar 6, 2025

No changes to cached containers or packages on Windows VHDs

@ganeshkumarashok (Contributor) commented:

Just noting that the PR description may not apply to Ubuntu nodes. I've not noticed MIG enablement in AKS Ubuntu nodes take "multiple hours in local experiments" when I've provisioned them. Curious why there's such a difference between OSes.

@@ -5,6 +5,7 @@ Description=Apply MIG configuration on Nvidia A100 GPU
Restart=on-failure
ExecStartPre=/usr/bin/nvidia-smi -mig 1
ExecStart=/bin/bash /opt/azure/containers/mig-partition.sh ${GPU_INSTANCE_PROFILE}
TimeoutSec=300
Contributor:

What's the default timeout when this is not added?

@henryli001 (author):

The default timeout value for a normal systemd service is 90s.
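As a quick sanity check (not part of the PR), the effective and default timeouts can be inspected with systemctl:

# With TimeoutSec=300 in the unit, both values should report 5min.
systemctl show mig-partition.service -p TimeoutStartUSec -p TimeoutStopUSec

# Manager-wide default (90s unless overridden in /etc/systemd/system.conf).
systemctl show -p DefaultTimeoutStartUSec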

@ganeshkumarashok (Contributor) commented:

Thanks for making this fix. Were you able to check whether this PR fixed the issue?

@henryli001 (author) commented:

> Just noting that the PR description may not apply to Ubuntu nodes. I've not noticed MIG enablement in AKS Ubuntu nodes take "multiple hours in local experiments" when I've provisioned them. Curious why there's such a difference between OSes.

The issue was also reproduced on Ubuntu 22.04 using VM size Standard_ND96asr_v4.

Dumping the failure message here:
[{"code":"VMExtensionProvisioningError","target":"0","message":"VM has reported a failure when processing extension 'vmssCSE' (publisher 'Microsoft.Azure.Extensions' and type 'CustomScript'). Error message: 'Enable failed: failed to execute command: command terminated with exit status=124\n[stdout]\n{ \"ExitCode\": \"124\", \"Output\": \"ontrol process exited, code=killed, status=15/TERM\\nMar 07 01:07:11 aks-nodepool1-12368757-vmss000000 systemd[1]: mig-partition.service: Failed with result 'signal'.\\nMar 07 01:07:11 aks-nodepool1-12368757-vmss000000 systemd[1]: Stopped Apply MIG configuration on Nvidia A100 GPU.\\nMar 07 01:07:11 aks-nodepool1-12368757-vmss000000 systemd[1]: mig-partition.service: Consumed 9.650s CPU time.\\n

@henryli001 (author) commented:

> Thanks for making this fix. Were you able to check whether this PR fixed the issue?

Customized AgentBaker E2E using an Azure Linux AKS 3.0 image and was able to successfully deploy a MIG node; the node rebooted as expected and eventually became Ready. Not able to test on Ubuntu since I don't have read permission on the subscription that hosts the Ubuntu test image.

@henryli001 (author) commented:

> Thanks for making this fix. Were you able to check whether this PR fixed the issue?

Was also able to run the test against Ubuntu using the pipeline, and the MIG node got provisioned successfully. After the node became Ready following the reboot expected on A100, nvidia-smi shows the MIG devices correctly.


github-actions bot commented Mar 7, 2025

No changes to cached containers or packages on Windows VHDs

@ganeshkumarashok (Contributor) left a comment:

Thanks Henry, this lgtm!

@henryli001 enabled auto-merge (squash) March 7, 2025 23:40
@henryli001 merged commit 52eea29 into master Mar 8, 2025
20 checks passed
@henryli001 deleted the lihl/fix-mig-on-a100 branch March 8, 2025 00:39