
Install AMD GPU Kernel drivers if required #5875

Open · wants to merge 28 commits into base: master

fix cse (1901fc8)
Azure Pipelines / Agentbaker E2E failed Feb 27, 2025 in 23m 8s

Build #20250227.2 had test failures


Tests

  • Failed: 3 (4.11%)
  • Passed: 70 (95.89%)
  • Other: 0 (0.00%)
  • Total: 73

Annotations

Check failure on line 9631 in Build log


azure-pipelines / Agentbaker E2E

Build log #L9631

Bash exited with code '1'.

Check failure on line 1 in Test_Ubuntu2204_AirGap_NonAnonymousACR


Test_Ubuntu2204_AirGap_NonAnonymousACR

Failed
Raw output
    cluster.go:269: cluster abe2e-kubenet-nonanonpull-airgap-b9a80 already exists in rg abe2e-westus3
    cluster.go:123: node resource group: MC_abe2e-westus3_abe2e-kubenet-nonanonpull-airgap-b9a80_westus3
    cluster.go:134: using private acr "privateace2enonanonpullwestus3" isAnonyomusPull true
    aks_model.go:208: Creating private Azure Container Registry privateace2enonanonpullwestus3 in rg abe2e-westus3
    aks_model.go:338: Checking if private Azure Container Registry cache rules are correct in rg abe2e-westus3
    aks_model.go:353: Private ACR cache is correct
    aks_model.go:217: Private ACR already exists at id /subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/abe2e-westus3/providers/Microsoft.ContainerRegistry/registries/privateace2enonanonpullwestus3, skipping creation
    aks_model.go:72: Adding network settings for airgap cluster abe2e-kubenet-nonanonpull-airgap-b9a80 in rg MC_abe2e-westus3_abe2e-kubenet-nonanonpull-airgap-b9a80_westus3
    aks_model.go:156: Checking if private endpoint for private container registry is in rg MC_abe2e-westus3_abe2e-kubenet-nonanonpull-airgap-b9a80_westus3
    aks_model.go:197: Private Endpoint already exists with ID: /subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-nonanonpull-airgap-b9a80_westus3/providers/Microsoft.Network/privateEndpoints/PE-for-ABE2ETests
    aks_model.go:165: Private Endpoint already exists, skipping creation
    aks_model.go:108: updated cluster abe2e-kubenet-nonanonpull-airgap-b9a80 subnet with airgap settings
    cluster.go:205: assigning ACR-Pull role to a0b1a1cd-db2f-4ffd-b22f-d91d13bec140
    kube.go:364: Creating daemonset debug-mariner-tolerated with image privateace2enonanonpullwestus3.azurecr.io/cbl-mariner/base/core:2.0
    kube.go:364: Creating daemonset debugnonhost-mariner-tolerated with image privateace2enonanonpullwestus3.azurecr.io/cbl-mariner/base/core:2.0
    kube.go:85: waiting for pod app=debug-mariner-tolerated  in "default" namespace to be ready
    kube.go:106: time before timeout: 14m43.868294178s
    kube.go:268: {
          "Name": "debug-mariner-tolerated-f7vfc",
          "Namespace": "default",
          "Containers": [
            {
              "Name": "mariner",
              "Image": "privateace2enonanonpullwestus3.azurecr.io/cbl-mariner/base/core:2.0",
              "Ports": null
            }
          ],
          "Conditions": null,
          "Phase": "Pending",
          "StartTime": "2025-02-26T22:10:48Z",
          "Events": [
            {
              "Reason": "FailedToRetrieveImagePullSecret",
              "Message": "Unable to retrieve some image pull secrets (acr-secret-code2); attempting to pull the image may not succeed.",
              "Count": 1355,
              "LastTimestamp": "2025-02-27T03:05:53Z"
            }
          ],
          "Logs": "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"container \\\"mariner\\\" in pod \\\"debug-mariner-tolerated-f7vfc\\\" is waiting to start: trying and failing to pull image\",\"reason\":\"BadRequest\",\"code\":400}\n"
        }
    kube.go:106: time before timeout: 9m43.867558684s
    kube.go:268: {
          "Name": "debug-mariner-tolerated-f7vfc",
          "Namespace": "default",
          "Containers": [
            {
              "Name": "mariner",
              "Image": "privateace2enonanonpullwestus3.azurecr.io/cbl-mariner/base/core:2.0",
              "Ports": null
            }
          ],
          "Conditions": null,
          "Phase": "Pending",
          "StartTime": "2025-02-26T22:10:48Z",
          "Events": [
            {
              "Reason": "FailedToRetrieveImagePullSecret",
              "Message": "Unable to retrieve some image pull secrets (acr-secret-code2); attempting to pull the image may not succeed.",
              "Count": 1378,
              "LastTimestamp": "2025-02-27T03:11:01Z"
            }
          ],
       

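The dump above repeats the same FailedToRetrieveImagePullSecret event over a thousand times. For quick triage of such dumps, a small illustrative helper (not part of the e2e suite; it assumes only the JSON shape printed by the harness above, and the sample below is abridged from the actual dump) can reduce one to its phase and most-repeated event:

```python
import json

def summarize_pod_dump(dump: str):
    """Return (phase, reason-of-most-repeated-event) from a pod-status dump."""
    pod = json.loads(dump)
    events = pod.get("Events") or []
    # The event with the highest Count is the one that kept the pod stuck.
    worst = max(events, key=lambda e: e.get("Count", 0), default=None)
    return pod["Phase"], worst and worst["Reason"]

# Abridged copy of the kube.go:268 output above.
dump = """{
  "Name": "debug-mariner-tolerated-f7vfc",
  "Phase": "Pending",
  "Events": [
    {"Reason": "FailedToRetrieveImagePullSecret",
     "Message": "Unable to retrieve some image pull secrets (acr-secret-code2); attempting to pull the image may not succeed.",
     "Count": 1355}
  ]
}"""
print(summarize_pod_dump(dump))
```

Applied to this failure, the takeaway is that the pod never left Pending because the acr-secret-code2 pull secret could not be retrieved, so the private ACR pull never had a chance to succeed.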
Check failure on line 1 in Test_Ubuntu2204_GPUNoDriver


Test_Ubuntu2204_GPUNoDriver

Failed
Raw output
    azure.go:501: creating VMSS uish-2025-02-27-ubuntu2204gpunodriver in resource group MC_abe2e-westus3_abe2e-kubenet-322d3_westus3
    azure.go:514: created VMSS uish-2025-02-27-ubuntu2204gpunodriver in resource group MC_abe2e-westus3_abe2e-kubenet-322d3_westus3
    exec.go:190: SSH Instructions: (VM will be automatically deleted after the test finishes, set KEEP_VMSS=true to preserve it or pause the test with a breakpoint before the test finishes)
        ========================
        az account set --subscription 8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
        az aks get-credentials --resource-group abe2e-westus3 --name abe2e-kubenet-322d3 --overwrite-existing
        kubectl exec -it debug-mariner-tolerated-swglt -- bash -c "chroot /proc/1/root /bin/bash -c 'ssh -i sshkey102240109 -o PasswordAuthentication=no -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=5 [email protected]'"
    scenario_helpers_test.go:146: vmss uish-2025-02-27-ubuntu2204gpunodriver creation succeeded
    kube.go:147: waiting for node uish-2025-02-27-ubuntu2204gpunodriver to be ready
    kube.go:170: node uish-2025-02-27-ubuntu2204gpunodriver000000 is ready. Taints: [{"key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true","effect":"NoSchedule"}] Conditions: [{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2025-02-27T03:03:14Z","lastTransitionTime":"2025-02-27T03:03:14Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2025-02-27T03:03:14Z","lastTransitionTime":"2025-02-27T03:03:14Z","reason":"KubeletHasNoDiskPressure","message":"kubelet has no disk pressure"},{"type":"PIDPressure","status":"False","lastHeartbeatTime":"2025-02-27T03:03:14Z","lastTransitionTime":"2025-02-27T03:03:14Z","reason":"KubeletHasSufficientPID","message":"kubelet has sufficient PID available"},{"type":"Ready","status":"True","lastHeartbeatTime":"2025-02-27T03:03:14Z","lastTransitionTime":"2025-02-27T03:03:14Z","reason":"KubeletReady","message":"kubelet is posting ready status"}]
    scenario_helpers_test.go:101: Choosing the private ACR "privateacre2ewestus3" for the vm validation
    pod.go:18: creating pod "uish-2025-02-27-ubuntu2204gpunodriver000000-test-pod"
    kube.go:85: waiting for pod  metadata.name=uish-2025-02-27-ubuntu2204gpunodriver000000-test-pod in "default" namespace to be ready
    kube.go:106: time before timeout: 9m49.253335098s
    kube.go:268: {
          "Name": "uish-2025-02-27-ubuntu2204gpunodriver000000-test-pod",
          "Namespace": "default",
          "Containers": [
            {
              "Name": "mariner",
              "Image": "mcr.microsoft.com/cbl-mariner/busybox:2.0",
              "Ports": [
                {
                  "containerPort": 80,
                  "protocol": "TCP"
                }
              ]
            }
          ],
          "Conditions": null,
          "Phase": "Pending",
          "StartTime": null,
          "Events": [
            {
              "Reason": "FailedScheduling",
              "Message": "0/43 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 10 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 32 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }. preemption: 0/43 nodes are available: 43 Preemption is not helpful for scheduling.",
              "Count": 0,
              "LastTimestamp": null
            },
            {
              "Reason": "FailedScheduling",
              "Message": "0/44 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 10 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 33 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }. preemption: 0/44 nodes are available: 44 Preemption is not helpful for scheduling.",
              "Count": 0,
      

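Both GPUNoDriver failures stall on essentially the same FailedScheduling message. As an illustrative aid (not part of the e2e suite; the parsing assumes only the message format quoted above), a few lines of Python can tally the per-reason node counts, which makes it obvious that the dominant blocker is the node.kubernetes.io/network-unavailable taint rather than the newly created VMSS node:

```python
def tally_scheduling_failures(message: str) -> dict:
    """Tally per-reason node counts from a kube-scheduler FailedScheduling message."""
    # Drop the "0/43 nodes are available:" prefix and the trailing
    # "preemption: ..." clause, then split the per-reason segments.
    body = message.split("available:", 1)[1].split(". preemption")[0]
    tallies: dict = {}
    for part in body.split(", "):
        count, _, reason = part.strip().partition(" node(s) ")
        tallies[reason] = tallies.get(reason, 0) + int(count)
    return tallies

# Message copied from the Test_Ubuntu2204_GPUNoDriver output above.
msg = ("0/43 nodes are available: 1 node(s) didn't match Pod's node "
       "affinity/selector, 10 node(s) had untolerated taint "
       "{node.cloudprovider.kubernetes.io/uninitialized: true}, "
       "32 node(s) had untolerated taint "
       "{node.kubernetes.io/network-unavailable: }. preemption: 0/43 nodes "
       "are available: 43 Preemption is not helpful for scheduling.")
print(tally_scheduling_failures(msg))
```

Read this way, 32 of 43 nodes are tainted network-unavailable, which points at cluster networking state rather than the GPU driver change under test.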
Check failure on line 1 in Test_Ubuntu2404Gen2_GPUNoDriver


Test_Ubuntu2404Gen2_GPUNoDriver

Failed
Raw output
    vhd.go:211: finding the latest image version for 2404gen2containerd, 
    azure.go:412: found the latest image version for 2404gen2containerd, 1.1740597228.1051
    vhd.go:224: found the latest image version for 2404gen2containerd, /subscriptions/c4c3550e-a965-4993-a50c-628fd38cd3e1/resourceGroups/aksvhdtestbuildrg/providers/Microsoft.Compute/galleries/PackerSigGalleryEastUS/images/2404gen2containerd/versions/1.1740597228.1051
    azure.go:501: creating VMSS ahw6-2025-02-27-ubuntu2404gen2gpunodriver in resource group MC_abe2e-westus3_abe2e-kubenet-322d3_westus3
    azure.go:514: created VMSS ahw6-2025-02-27-ubuntu2404gen2gpunodriver in resource group MC_abe2e-westus3_abe2e-kubenet-322d3_westus3
    exec.go:190: SSH Instructions: (VM will be automatically deleted after the test finishes, set KEEP_VMSS=true to preserve it or pause the test with a breakpoint before the test finishes)
        ========================
        az account set --subscription 8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
        az aks get-credentials --resource-group abe2e-westus3 --name abe2e-kubenet-322d3 --overwrite-existing
        kubectl exec -it debug-mariner-tolerated-swglt -- bash -c "chroot /proc/1/root /bin/bash -c 'ssh -i sshkey10224026 -o PasswordAuthentication=no -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectTimeout=5 [email protected]'"
    scenario_helpers_test.go:146: vmss ahw6-2025-02-27-ubuntu2404gen2gpunodriver creation succeeded
    kube.go:147: waiting for node ahw6-2025-02-27-ubuntu2404gen2gpunodriver to be ready
    kube.go:170: node ahw6-2025-02-27-ubuntu2404gen2gpunodriver000000 is ready. Taints: [{"key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true","effect":"NoSchedule"}] Conditions: [{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2025-02-27T03:03:52Z","lastTransitionTime":"2025-02-27T03:03:52Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2025-02-27T03:03:52Z","lastTransitionTime":"2025-02-27T03:03:52Z","reason":"KubeletHasNoDiskPressure","message":"kubelet has no disk pressure"},{"type":"PIDPressure","status":"False","lastHeartbeatTime":"2025-02-27T03:03:52Z","lastTransitionTime":"2025-02-27T03:03:52Z","reason":"KubeletHasSufficientPID","message":"kubelet has sufficient PID available"},{"type":"Ready","status":"True","lastHeartbeatTime":"2025-02-27T03:03:52Z","lastTransitionTime":"2025-02-27T03:03:52Z","reason":"KubeletReady","message":"kubelet is posting ready status"}]
    scenario_helpers_test.go:101: Choosing the private ACR "privateacre2ewestus3" for the vm validation
    pod.go:18: creating pod "ahw6-2025-02-27-ubuntu2404gen2gpunodriver000000-test-pod"
    kube.go:85: waiting for pod  metadata.name=ahw6-2025-02-27-ubuntu2404gen2gpunodriver000000-test-pod in "default" namespace to be ready
    kube.go:106: time before timeout: 9m3.846523656s
    kube.go:268: {
          "Name": "ahw6-2025-02-27-ubuntu2404gen2gpunodriver000000-test-pod",
          "Namespace": "default",
          "Containers": [
            {
              "Name": "mariner",
              "Image": "mcr.microsoft.com/cbl-mariner/busybox:2.0",
              "Ports": [
                {
                  "containerPort": 80,
                  "protocol": "TCP"
                }
              ]
            }
          ],
          "Conditions": null,
          "Phase": "Pending",
          "StartTime": null,
          "Events": [
            {
              "Reason": "FailedScheduling",
              "Message": "0/52 nodes are available: 4 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 42 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/52 nodes are available: 52 Preemption is not helpful for scheduling.",
              "Count": 0,
              "LastTimestamp": nul