Scheduler distributes tasks unevenly #5051
Replies: 6 comments 10 replies
-
The red row in your first screenshot indicates that the worker has not submitted a heartbeat for ~5700s. Typically this means the worker is either corrupt or dead. If the custom function you're executing is a C/C++ function, similar behaviour can be observed if the function doesn't release the GIL. Generally speaking, I would be curious whether you have the chance to collect any logs of the dead worker; that depends on your setup. You might want to consider setting the …
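If the worker process is still alive and reachable, one way to collect recent logs from the client side is a sketch like the one below (the scheduler address is a placeholder; a worker that is truly dead, or stuck holding the GIL, will not respond, in which case you would need the log files from wherever the worker writes them, e.g. the EMR/YARN container logs):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Ask every still-responsive worker for its recent log records.
logs = client.get_worker_logs()
for worker_addr, records in logs.items():
    print(worker_addr)
    for record in records:
        print("   ", record)
```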
-
This would also be a good use case for a ProcessPoolExecutor. I raised an issue at #5059 to create an instructional example.
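As a hedged sketch of that idea (this is not the example from #5059; the module name solver_tasks.py, the solve placeholder, and the scheduler address are made up): wrapping the GIL-holding solver call in a ProcessPoolExecutor inside the task moves the blocking work into a child process, so the worker's event loop and heartbeats stay responsive. Note that the wrapped function has to be importable on the workers (i.e. live in an installed module rather than an interactive session), because ProcessPoolExecutor pickles it by reference into the child process.

```python
# solver_tasks.py -- hypothetical module installed on every Dask worker
from concurrent.futures import ProcessPoolExecutor


def solve(params):
    # Placeholder for the real native-code solver call (e.g. an OR-Tools
    # solve), which may hold the GIL until it finishes.
    return sum(params)


def solve_in_subprocess(params):
    # The GIL-holding work runs in a child process; the Dask worker thread
    # only waits on the result, so heartbeats and the event loop keep going.
    # Spawning one process per task is simple but not free -- a long-lived
    # pool would be a reasonable optimization for many small solves.
    with ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(solve, params).result()
```

Usage would then look roughly like:

```python
from dask.distributed import Client
from solver_tasks import solve_in_subprocess

client = Client("tcp://scheduler-address:8786")  # placeholder address
futures = client.map(solve_in_subprocess, [[1, 2], [3, 4], [5, 6]])
results = client.gather(futures)
```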
…On Wed, Jul 14, 2021 at 5:57 AM Florian Jetter ***@***.***> wrote:
I am not familiar with OR-Tools, but from their docs I assume there is some solver beneath the surface running native code, which explains the unresponsive-loop problem. It is a rather frequent problem with Python C extensions that they do not release the GIL properly. (Their docs mention commercial solvers such as Gurobi or CPLEX, or open-source solvers such as SCIP, GLPK, or Google's GLOP and the award-winning CP-SAT.)
I would suggest reaching out to whichever solver you are using and asking whether they can somehow release the GIL safely to improve threaded applications.
This is in general not necessarily a problem, but it will lead to some instabilities. While the library is holding the GIL on a worker, that worker is effectively locked down and will appear to the outside world as if it were unresponsive / dead. If you see a worker being "dead" for a while while running your optimizer, it may actually mean that the optimization hasn't finished yet and the GIL is still locked. It may of course also be something else, but it is difficult to tell.
But I see some random workers logging the message below (pulled from S3; executing on EMR), and it appears on multiple workers:
distributed.core - INFO - Event loop was unresponsive in Worker for 27.94s. This is often cause…
If you expect the "unresponsive" worker message in this case, you can increase the config option distributed.admin.tick.limit to something like 60s, which should silence the warning, but your cluster may still appear unresponsive (see the sketch below).
What is the default or current value of distributed.scheduler.worker-ttl?
By default, this is turned off, which is why you do not see it kicking in.
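For reference, a minimal sketch of setting both options mentioned above (the values are illustrative; note that they need to be in effect where the scheduler and workers start, e.g. via the distributed YAML config or DASK_* environment variables on the cluster nodes, not only on the client):

```python
import dask

dask.config.set({
    # Tolerate longer GIL-holding stretches before the
    # "Event loop was unresponsive" warning fires.
    "distributed.admin.tick.limit": "60s",
    # Have the scheduler drop workers that haven't sent a heartbeat for
    # 5 minutes (off by default at the time of this thread).
    "distributed.scheduler.worker-ttl": "300s",
})
```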
-
We're working on improving this here: #5063.
I'll probably put up an example on examples.dask.org in a week or so as well.
…On Thu, Jul 15, 2021 at 1:54 AM musabqamri123 ***@***.***> wrote:
Can we use a ProcessPoolExecutor on a distributed cluster, or is it only for a single machine?
I want to run this on a distributed cluster.
-
I made an instructional example here and used this question as motivation (I hope you don't mind): https://www.youtube.com/watch?v=vF2VItVU5zg
I'm going to hold off on publishing this example until the next release is out (I ran into and fixed a couple of bugs while doing this), but everything should be ready if you want to install from git main.
-
Dask releases every two weeks. The next release is planned for next Friday.
…On Fri, Jul 16, 2021 at 1:22 PM musabqamri123 ***@***.***> wrote:
Thanks, will try to install from git. By the way, when is your next release planned?
-
Hi,
I am using the latest version of Dask. There is no version mismatch.
I see that tasks are distributed unevenly between the workers. Even though other workers are free, all tasks are given to a specific one, resulting in a hang.
In the attached screenshot it is worker 46.
A similar hang issue occurs intermittently, and when we retire those workers manually, execution resumes.
https://stackoverflow.com/questions/68117963/dask-workers-taking-lot-of-time-even-if-cpu-is-available-hangs