
K8sRunLauncher deployment on Azure AKS - intermittent dropped connections when calling create_namespaced_job causing failed jobs #28314

Open
tnschneider opened this issue Mar 7, 2025 · 1 comment
Labels
deployment: k8s (Related to deploying Dagster to Kubernetes) · type: bug (Something isn't working)

Comments

@tnschneider

What's the issue?

We are running a deployment with the K8sRunLauncher on Azure AKS. Intermittently, a Dagster job will fail because of a dropped connection when calling the API to create a K8s job.

Sometimes the connection drops almost immediately; other times it hangs for several minutes before dropping, occasionally long enough for the run to be killed by the run monitor.

After the first job fails to start, retries always work.

If I restart the daemon while a job appears stuck in the starting state, the job will run, but that doesn't fix anything going forward; the next job run is just as likely to fail.

Looking at the source code, it seems the call to create_namespaced_job will not retry in the event of a dropped connection; retrying on that error may be an easy fix, if it works (see the sketch below).
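
For illustration, here is a minimal sketch of that fix using the raw Kubernetes Python client, not the actual dagster-k8s implementation. It assumes the fix is simply to catch urllib3.exceptions.ProtocolError (the error in the stack trace below) around the creation call; the helper name, the "dagster" namespace, and the example Job are hypothetical. The 409 check guards against the case where the Job was actually created server-side before the connection dropped.

import time

import urllib3
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def create_job_with_connection_retries(batch_api, namespace, job_body, max_attempts=3):
    # Hypothetical helper (not part of dagster-k8s): retry the creation
    # call when urllib3 reports the connection was dropped mid-request.
    for attempt in range(1, max_attempts + 1):
        try:
            # _request_timeout=(connect, read) seconds also bounds the
            # multi-minute hangs described above; the values are arbitrary.
            return batch_api.create_namespaced_job(
                namespace=namespace, body=job_body, _request_timeout=(5, 60)
            )
        except urllib3.exceptions.ProtocolError:
            # The error from the stack trace: the remote end closed the
            # connection without a response.
            if attempt == max_attempts:
                raise
            time.sleep(2**attempt)  # simple exponential backoff
        except ApiException as e:
            # If the POST reached the API server before the connection
            # dropped, the retry can return 409 Conflict; the Job already
            # exists, so treat it as success.
            if e.status == 409 and attempt > 1:
                return None
            raise

if __name__ == "__main__":
    config.load_incluster_config()  # use config.load_kube_config() outside a pod
    batch_api = client.BatchV1Api()
    # A trivial example Job; the launcher would build the real V1Job body.
    job_body = client.V1Job(
        metadata=client.V1ObjectMeta(name="example-job"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name="main", image="busybox", command=["true"])],
                )
            )
        ),
    )
    create_job_with_connection_retries(batch_api, "dagster", job_body)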

I cross-posted on the Kubernetes forum, where another user reports the same problem on Azure:
https://discuss.kubernetes.io/t/intermittent-dropped-connections-to-kubernetes-api-from-within-pod-dagster/31747

Dagster version: 1.8.11
K8s version: 1.30.6
Cloud: Azure AKS
OS: linux amd64

Stack trace:

urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
  File "/usr/local/lib/python3.12/site-packages/dagster/_daemon/run_coordinator/queued_run_coordinator_daemon.py", line 389, in _dequeue_run
    instance.run_launcher.launch_run(LaunchRunContext(dagster_run=run, workspace=workspace))
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/launcher.py", line 294, in launch_run
    self._launch_k8s_job_with_args(job_name, args, run)
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/launcher.py", line 275, in _launch_k8s_job_with_args
    self._api_client.create_namespaced_job_with_retries(body=job, namespace=namespace)
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 1007, in create_namespaced_job_with_retries
    k8s_api_retry_creation_mutation(
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 192, in k8s_api_retry_creation_mutation
    fn()
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 1008, in <lambda>
    lambda: self.batch_api.create_namespaced_job(body=body, namespace=namespace),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 172, in request
    r = self.pool_manager.request(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/_request_methods.py", line 143, in request
    return self.request_encode_body(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/_request_methods.py", line 278, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/poolmanager.py", line 443, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 516, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/local/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"

What did you expect to happen?

Jobs should never fail to launch due to a dropped connection to the K8s API.

How to reproduce?

No response

Dagster version

1.8.11

Deployment type

Dagster Helm chart

Deployment details

We did not deploy directly using Helm; we used the output of the Helm chart and customized it.

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@tnschneider added the type: bug (Something isn't working) label Mar 7, 2025
@g-maxbeaudoin

We've been experiencing the same issue for the past 2 weeks or so.

Dagster version: 1.10.3
K8s version: 1.29.13
Cloud: Azure AKS
OS: linux amd64

@garethbrickman added the deployment: k8s (Related to deploying Dagster to Kubernetes) label Mar 11, 2025