
K8sRunLauncher deployment on Azure AKS - intermittent dropped connections when calling create_namespaced_job causing failed jobs #28314

Open
tnschneider opened this issue Mar 7, 2025 · 1 comment
Labels
deployment: k8s (Related to deploying Dagster to Kubernetes) · type: bug (Something isn't working)

Comments

@tnschneider

What's the issue?

We are running a deployment with the K8sRunLauncher on Azure AKS. Intermittently, a Dagster job will fail because of a dropped connection when calling the API to create a K8s job.

Sometimes the connection drops almost immediately; other times it hangs for several minutes before dropping, occasionally long enough for the run to be killed by the run monitor.

After the first job fails to start, retries always work.

If I restart the daemon while a job appears stuck in the starting state, the job will run, but that doesn't fix anything going forward; the next job run is just as likely to fail.

Looking at the source code, it seems the call to create_namespaced_job will not retry in the event of a dropped connection; retrying on that error may be an easy fix, if it works (see the sketch below).
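
For illustration, here is a minimal sketch of that fix using the raw Kubernetes Python client, not the actual dagster-k8s implementation. It assumes the fix is simply to catch urllib3.exceptions.ProtocolError (the error in the stack trace below) around the creation call; the helper name, the "dagster" namespace, and the example Job are hypothetical. The 409 check guards against the case where the Job was actually created server-side before the connection dropped.

import time

import urllib3
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def create_job_with_connection_retries(batch_api, namespace, job_body, max_attempts=3):
    # Hypothetical helper (not part of dagster-k8s): retry the creation
    # call when urllib3 reports the connection was dropped mid-request.
    for attempt in range(1, max_attempts + 1):
        try:
            # _request_timeout=(connect, read) seconds also bounds the
            # multi-minute hangs described above; the values are arbitrary.
            return batch_api.create_namespaced_job(
                namespace=namespace, body=job_body, _request_timeout=(5, 60)
            )
        except urllib3.exceptions.ProtocolError:
            # The error from the stack trace: the remote end closed the
            # connection without a response.
            if attempt == max_attempts:
                raise
            time.sleep(2**attempt)  # simple exponential backoff
        except ApiException as e:
            # If the POST reached the API server before the connection
            # dropped, the retry can return 409 Conflict; the Job already
            # exists, so treat it as success.
            if e.status == 409 and attempt > 1:
                return None
            raise

if __name__ == "__main__":
    config.load_incluster_config()  # use config.load_kube_config() outside a pod
    batch_api = client.BatchV1Api()
    # A trivial example Job; the launcher would build the real V1Job body.
    job_body = client.V1Job(
        metadata=client.V1ObjectMeta(name="example-job"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name="main", image="busybox", command=["true"])],
                )
            )
        ),
    )
    create_job_with_connection_retries(batch_api, "dagster", job_body)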

I cross-posted on the Kubernetes forum, where another user reports the same problem on Azure:
https://discuss.kubernetes.io/t/intermittent-dropped-connections-to-kubernetes-api-from-within-pod-dagster/31747

Dagster version: 1.8.11
K8s version: 1.30.6
Cloud: Azure AKS
OS: linux amd64

Stack trace:

urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
  File "/usr/local/lib/python3.12/site-packages/dagster/_daemon/run_coordinator/queued_run_coordinator_daemon.py", line 389, in _dequeue_run
    instance.run_launcher.launch_run(LaunchRunContext(dagster_run=run, workspace=workspace))
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/launcher.py", line 294, in launch_run
    self._launch_k8s_job_with_args(job_name, args, run)
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/launcher.py", line 275, in _launch_k8s_job_with_args
    self._api_client.create_namespaced_job_with_retries(body=job, namespace=namespace)
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 1007, in create_namespaced_job_with_retries
    k8s_api_retry_creation_mutation(
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 192, in k8s_api_retry_creation_mutation
    fn()
  File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 1008, in <lambda>
    lambda: self.batch_api.create_namespaced_job(body=body, namespace=namespace),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 172, in request
    r = self.pool_manager.request(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/_request_methods.py", line 143, in request
    return self.request_encode_body(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/_request_methods.py", line 278, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/poolmanager.py", line 443, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 516, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/local/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"

What did you expect to happen?

Jobs should never fail to launch due to a dropped connection to the K8s API.

How to reproduce?

No response

Dagster version

1.8.11

Deployment type

Dagster Helm chart

Deployment details

We did not deploy directly using Helm; we used the output of the Helm chart and customized it.

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@tnschneider added the type: bug (Something isn't working) label Mar 7, 2025
@g-maxbeaudoin

We've been experiencing the same issue for the past 2 weeks or so.

Dagster version: 1.10.3
K8s version: 1.29.13
Cloud: Azure AKS
OS: linux amd64

@garethbrickman added the deployment: k8s (Related to deploying Dagster to Kubernetes) label Mar 11, 2025