fix: fix mig-partition timeout to ensure the service completes #5971
@henryli001 friendly reminder to please avoid including internal ADO links within PR descriptions/comments
# Parameterize the timeout value here to assign longer timeout for mig-partition service to complete
# while keeping the original timeout value for other scenarios
service=$1
if [ "$service" = "mig-partition" ]; then
I'd prefer bubbling the timeout parameter up to the top-level from systemctl_restart instead of needing to pivot based on service name
refactored the implementation
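For illustration, a minimal sketch of what passing the timeout down from the caller, rather than branching on the service name, could look like. The helper names `retry_with_timeout` and `restart_service` and the 5-second sleep between attempts are hypothetical and not taken from this PR; only the retry count of up to 100 and the 30-second and 5-minute timeouts come from the PR description.

```shell
#!/bin/bash
# Hypothetical sketch, not the PR's actual code: the helper names and the
# 5-second sleep between attempts are assumptions made for illustration.

# Run a command up to $1 times, sleeping $2 seconds between attempts and
# bounding each attempt to $3 seconds with coreutils timeout(1).
retry_with_timeout() {
    local retries=$1 wait_sleep=$2 timeout_sec=$3
    shift 3
    local attempt
    for attempt in $(seq 1 "$retries"); do
        if timeout "$timeout_sec" "$@"; then
            return 0
        fi
        if [ "$attempt" -lt "$retries" ]; then
            sleep "$wait_sleep"
        fi
    done
    return 1
}

# The caller supplies the timeout; the helper never inspects the service name.
restart_service() {
    local timeout_sec=$1 service=$2
    retry_with_timeout 100 5 "$timeout_sec" systemctl restart "$service"
}
```

With this shape, the MIG path could call `restart_service 300 mig-partition` while every other caller keeps passing 30, so the special case lives at the call site instead of inside the helper.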
No changes to cached containers or packages on Windows VHDs
Just noting that the PR description may not apply to Ubuntu nodes. I haven't seen MIG enablement on AKS Ubuntu nodes take "multiple hours in local experiments" when I've provisioned them. Curious why there's such a difference between OSes.
@@ -5,6 +5,7 @@ Description=Apply MIG configuration on Nvidia A100 GPU
Restart=on-failure
ExecStartPre=/usr/bin/nvidia-smi -mig 1
ExecStart=/bin/bash /opt/azure/containers/mig-partition.sh ${GPU_INSTANCE_PROFILE}
TimeoutSec=300
What's the default timeout when this is not added?
The default timeout value for a normal systemd service is 90s.
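One way to confirm which start timeout systemd actually applies to a unit is to query it with `systemctl show`. This is a minimal sketch; the function name `show_start_timeout` is hypothetical and not part of this PR.

```shell
#!/bin/bash
# Hypothetical helper: print the effective start timeout of a systemd unit.
# TimeoutSec= in a unit file is shorthand that sets both TimeoutStartSec=
# and TimeoutStopSec=; systemctl exposes the start value as TimeoutStartUSec.
show_start_timeout() {
    local unit=$1
    systemctl show --property=TimeoutStartUSec --value "$unit"
}
```

Running `show_start_timeout mig-partition` on a node with the updated unit file should report the 5-minute value rather than the default.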
Thanks for making this fix. Were you able to check whether this PR fixed the issue?
The issue was also reproduced on Ubuntu 22.04 using VM size Standard_ND96asr_v4. Dumping the failure message here:
Customized the AgentBaker E2E to use an Azure Linux AKS 3.0 image and was able to successfully deploy a MIG node; the node rebooted as expected and eventually became ready. Not able to test on Ubuntu since I don't have read permission on the sub that hosts the Ubuntu test image.
Able to run the test against Ubuntu using the pipeline, and the MIG node was provisioned successfully. After the node became ready following a reboot, which is expected on A100, nvidia-smi shows the MIG devices correctly.
Thanks Henry, this lgtm!
What type of PR is this?
What this PR does / why we need it:
The mig-partition service is given only 30 seconds to restart, which is not sufficient for the service to complete the first time the restart is executed. This causes the provisioning process to keep retrying (up to 100 times). A retry may eventually succeed, since the previous attempts to enable MIG mode have already MIG-enabled some GPUs; if not, provisioning is re-attempted. Even when the cluster is eventually provisioned successfully, the entire provisioning process ended up taking multiple hours in local experiments.
This behavior was captured on the 8-card A100 GPU SKUs such as https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndma100v4-series?tabs=sizebasic for both Azure Linux and Ubuntu AKS. In existing experiments, the first attempt of systemctl restart mig-partition took around 2 minutes or more on both Azure Linux and Ubuntu. This change therefore increases the timeout of the mig-partition service itself to 5 minutes, since the systemd default timeout is 90 seconds. The logic that performs systemctl restart is also modified to time out after 5 minutes for the mig-partition service specifically, while maintaining the original 30 seconds for other systemd services.
Which issue(s) this PR fixes:
Requirements:
Special notes for your reviewer:
Release note: