What's the issue?
We are running a deployment with the K8sRunLauncher on Azure AKS. Intermittently, a Dagster job will fail because of a dropped connection when calling the API to create a K8s job.
Sometimes the connection drops almost immediately; other times it hangs for several minutes before dropping, or long enough for the run to be killed by the run monitor.
After the first launch fails, the retries always work.
If I restart the daemon while a job appears stuck in the starting state, that job will run, but it doesn't fix anything going forward: the next job run is just as likely to fail.
Looking at the source code, it seems the call to create_namespaced_job will not retry in the event of a dropped connection; adding a retry there may be an easy fix if it works. I cross-posted on the Kubernetes forum, and another user is having the same problem on Azure: https://discuss.kubernetes.io/t/intermittent-dropped-connections-to-kubernetes-api-from-within-pod-dagster/31747
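For illustration, here is a minimal sketch of the kind of retry that could wrap the job-creation call, assuming the fix is simply to treat a dropped connection (urllib3.exceptions.ProtocolError) as retryable. The function name create_job_with_connection_retries and its parameters are hypothetical, not part of dagster_k8s; one wrinkle is that a dropped connection does not tell you whether the Job was actually created server-side, so the retry also has to tolerate a 409 "already exists" response:

```python
# Hypothetical sketch, not the dagster_k8s implementation: retry Job creation
# when the connection to the K8s API is dropped before a response arrives.
import time

import urllib3
from kubernetes import client
from kubernetes.client.rest import ApiException


def create_job_with_connection_retries(
    batch_api: client.BatchV1Api,
    body,
    namespace: str,
    attempts: int = 3,
    backoff_seconds: float = 2.0,
):
    last_error = None
    for attempt in range(attempts):
        try:
            return batch_api.create_namespaced_job(body=body, namespace=namespace)
        except (urllib3.exceptions.ProtocolError, ConnectionResetError) as e:
            # The connection dropped before we got a response; the Job may or
            # may not exist server-side, so retry and handle 409 below.
            last_error = e
        except ApiException as e:
            if e.status == 409:
                # "AlreadyExists": a previous, apparently failed attempt actually
                # created the Job, so treat the launch as successful.
                return None
            raise
        time.sleep(backoff_seconds * (attempt + 1))
    raise last_error
```

Something equivalent inside create_namespaced_job_with_retries / k8s_api_retry_creation_mutation would presumably avoid the failure shown in the stack trace below.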
Dagster version: 1.8.11
K8s version: 1.30.6
Cloud: Azure AKS
OS: linux amd64
Stack trace:
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
File "/usr/local/lib/python3.12/site-packages/dagster/_daemon/run_coordinator/queued_run_coordinator_daemon.py", line 389, in _dequeue_run
instance.run_launcher.launch_run(LaunchRunContext(dagster_run=run, workspace=workspace))
File "/usr/local/lib/python3.12/site-packages/dagster_k8s/launcher.py", line 294, in launch_run
self._launch_k8s_job_with_args(job_name, args, run)
File "/usr/local/lib/python3.12/site-packages/dagster_k8s/launcher.py", line 275, in _launch_k8s_job_with_args
self._api_client.create_namespaced_job_with_retries(body=job, namespace=namespace)
File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 1007, in create_namespaced_job_with_retries
k8s_api_retry_creation_mutation(
File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 192, in k8s_api_retry_creation_mutation
fn()
File "/usr/local/lib/python3.12/site-packages/dagster_k8s/client.py", line 1008, in <lambda>
lambda: self.batch_api.create_namespaced_job(body=body, namespace=namespace),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
return self.create_namespaced_job_with_http_info(namespace, body, **kwargs) # noqa: E501
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
return self.api_client.call_api(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 391, in request
return self.rest_client.POST(url,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 279, in POST
return self.request("POST", url,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 172, in request
r = self.pool_manager.request(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/_request_methods.py", line 143, in request
return self.request_encode_body(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/_request_methods.py", line 278, in request_encode_body
return self.urlopen(method, url, **extra_kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/poolmanager.py", line 443, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 474, in increment
raise reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 516, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/http/client.py", line 1428, in getresponse
response.begin()
File "/usr/local/lib/python3.12/http/client.py", line 331, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/http/client.py", line 300, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
What did you expect to happen?
Jobs never fail to launch due to a dropped connection to the K8s API.
How to reproduce?
No response
Dagster version
1.8.11
Deployment type
Dagster Helm chart
Deployment details
We did not deploy directly using Helm; we used the rendered output of the Helm chart and customized it.
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.