Jobs with many parallel tasks cause GKE cluster failures #96
Logs from running sleep-echo: errors.txt
Some ideas after discussing with @hlapp:
We're certainly eagerly watching logs, and I know we've encountered issues in other projects when expecting streaming connections (e.g. RabbitMQ in lando) to be held open indefinitely. So after the easy/obvious steps of upgrading dependencies and k8s itself, it's probably worth looking at the watch/log connections.
I tried running the … I also ran the …
So there's certainly a possibility that all of the troubles are due to a cluster upgrade happening in the middle of a job - not at all related to the workload. A surprise cluster upgrade certainly explains:
Great detective work; it seems like the auto-upgrade is causing this.
The docs say:
And when creating a cluster on the CLI, I see:
I suggest we disable auto-upgrade when creating these clusters, in addition to adding retry logic.
I think there are two different "master upgrades" that display the same way in the Google Cloud Console.
They both show the following message in
There are commands to perform the software upgrade via the command line or through the Google Cloud Console. My assumption is that the second kind of master upgrade is Google moving the master to a bigger server, since workflows seem to run fine after this upgrade. I haven't been able to find a flag or command to perform this upgrade manually.
I see - so it's not necessarily due to a software version upgrade; it may be responding to the number of pods and migrating to a larger master. So disabling the automatic software upgrade wouldn't prevent GKE from migrating to a bigger master. Do you have a screenshot of where you see that message?
Thanks. Your findings match up with the behavior described in this issue, and it seems to be the designed/intended behavior: hashicorp/terraform-provider-google#3385. And that makes sense - the master is dynamic and may be unavailable while it's scaling up to manage more nodes. That can certainly happen on other cloud platforms, so we should handle it better :)
I came across some documentation that seems related to "Upgrading cluster master": https://github.com/GoogleCloudPlatform/gke-stateful-applications-demo#deployment-steps It refers to the k8s master as the
I found that you can see this status in … I tried starting a cluster with 6 small nodes and only saw failures due to JavaScript taking too long.
I added the calrissian … The …
Running the full 24-sample exome workflow failed even with the above tweaks.
I was able to look at logging around this error via the Google Cloud Console. Side thought:
Attempting to summarize the different types of errors seen in the Google Cloud error logs.
Some of these exceptions seem to be side effects of others. For example, Lines 132 to 134 in a249143:
Typically w.stream blocks until we run w.stop() inside wait_for_completion. It looks like that when we lose connection to the k8s server, it does return (or throws an exception). This in turn causes an early call to PodMonitor.cleanup() that eventually results in the ValueError above.
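For reference, here is a minimal sketch of that watch pattern using the standard kubernetes Python client (illustrative only, not calrissian's actual wait_for_completion code): w.stream() is expected to block until w.stop() is called, but a dropped connection to the API server can end the stream early, so cleanup code after the loop runs sooner than intended.

```python
# Illustrative sketch only, assuming the kubernetes Python client.
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

w = watch.Watch()
# w.stream() normally blocks here, yielding events until w.stop() is called.
for event in w.stream(core.list_namespaced_pod, namespace="default"):
    pod = event["object"]
    if pod.status.phase in ("Succeeded", "Failed"):
        # Expected exit path: we decide the watch is done.
        w.stop()
# If the connection to the API server drops, the stream can end without
# w.stop() ever being called, and any cleanup below runs prematurely.
```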
Researched these errors and tried to find the source/cause of each:
Should be addressed in k8s.py: anywhere we make an API call to Kubernetes, check for errors and possibly retry.
Could be caused by Lines 99 to 101 in 2ed7854:
The sys.stdout and sys.stderr streams are reopened as tee subprocesses, one of which could be blocking (see the sketch after this list).
Seems to be downstream of an earlier failure in concat-gz-files. I believe that script needs to set pipefail: https://github.com/bespin-workflows/exomeseq-gatk4/blob/develop/tools/concat-gz-files.cwl
Errors collecting output usually indicate an earlier failure. We should make sure early tools are returning failure codes when they fail (concat-gz-output: pipefail) and revisit this.
This is odd. The expressions are simple, but they're taking over 20s to run?
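To illustrate the tee redirection noted a few items up, here is a hypothetical sketch (not the code at Lines 99 to 101): stdout and stderr are rewired through tee subprocesses, so if either tee process stops draining its pipe, writes to sys.stdout or sys.stderr can block once the pipe buffer fills.

```python
# Hypothetical sketch (not calrissian's actual code) of the kind of
# redirection described above: stdout/stderr are rewired through `tee`
# subprocesses so output is both displayed and written to a log file.
# If a tee process stops draining its pipe, writes to sys.stdout/sys.stderr
# will block once the pipe buffer fills.
import subprocess
import sys

tee_out = subprocess.Popen(["tee", "stdout.log"], stdin=subprocess.PIPE, text=True)
tee_err = subprocess.Popen(["tee", "stderr.log"], stdin=subprocess.PIPE, text=True)

sys.stdout = tee_out.stdin  # all print() output now flows through the tee subprocess
sys.stderr = tee_err.stdin

print("this line goes to both the console and stdout.log")
```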
I'm investigating this currently and have implemented some retry logic with tenacity. I've temporarily disabled deleting pods in an attempt to sort out the noise. I've found that in the sleep-echo workflow of 175 parallel tasks, the workflow doesn't finish because some of the pods get deleted as part of the node upgrade.
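A minimal sketch of the kind of tenacity-based retry wrapper mentioned above (illustrative only; the function name and parameters are assumptions, not calrissian's actual code):

```python
# Illustrative sketch: retry transient Kubernetes API errors with backoff
# instead of failing the whole workflow when the master is briefly unavailable.
from kubernetes import client
from kubernetes.client.rest import ApiException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(ApiException),
    wait=wait_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(5),
    reraise=True,
)
def read_pod(core_api: client.CoreV1Api, name: str, namespace: str):
    # Any API server hiccup (e.g. during a master upgrade or resize) raises
    # ApiException; tenacity retries the call with backoff before giving up.
    return core_api.read_namespaced_pod(name=name, namespace=namespace)
```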
Initially this was baffling, but on further thought, Kubernetes resources such as Deployments, Jobs, and ReplicaSets are responsible for re-creating their pods if those pods are deleted manually. So calrissian should be no different: if the pod is deleted out from under it, calrissian should be responsible for re-submitting that pod. I considered looking into pod disruption budgets, restartPolicy, or disabling auto-scaling, but at the end of the day, raw pods are not expected to be resilient. They're the lowest-level object, and we should either act as a manager of pods or use a Job if we want to delegate that responsibility. We decided in #23 to submit pods rather than jobs since it seemed a better fit.
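As a rough illustration of the "manager of pods" option (a hypothetical helper with assumed names, not an implementation in calrissian): if the watch reports the pod was DELETED before it completed, re-create it from the same manifest.

```python
# Hypothetical sketch: submit a pod, and if the watch reports it was DELETED
# before it completed (e.g. its node was drained during an upgrade), act like
# a controller and re-create it from the same manifest.
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

def run_pod_with_resubmit(pod_manifest: dict, namespace: str = "default"):
    name = pod_manifest["metadata"]["name"]
    core.create_namespaced_pod(namespace=namespace, body=pod_manifest)
    w = watch.Watch()
    for event in w.stream(core.list_namespaced_pod, namespace=namespace,
                          field_selector=f"metadata.name={name}"):
        pod = event["object"]
        if pod.status.phase in ("Succeeded", "Failed"):
            w.stop()  # normal completion
        elif event["type"] == "DELETED":
            # Pod vanished before finishing: re-submit it rather than failing.
            core.create_namespaced_pod(namespace=namespace, body=pod_manifest)
```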
When testing a moderately parallelized job on GKE (24 exome samples, 48 files), the cluster API becomes unresponsive and triggers failures in the workflow. It's not failing because of the computations.
This is easily reproducible with a simple but highly parallel workflow that generates 175 parallel containers: https://gist.github.com/dleehr/afdcde15aef9d727fd5226beddef126d
The above sleep-echo workflow doesn't seem to cause problems on a development cluster (Docker Desktop for Mac), so I'm surprised it can overwhelm a GKE cluster.
Some ideas:
The workflow is run with --parallel. While running, docker stats reported over 2000 PIDs for the calrissian container that orchestrates the workflow. This PID count includes processes and kernel threads. It's not clear that this is a problem, but it may be a symptom.