Scheduler distributes task un-evenly #5051

musabqamri123 · 2021-07-12T10:30:14Z

musabqamri123
Jul 12, 2021

Hi,
I am using latest version of dask. There is no version mismatch.
I see tasks are distributed unevenly between the workers. Even though other workers are free all tasks are given to a specific one resulting in hang.
In the attached screenshot worker - 46.

Similar hang issue is intermittently and when we retire those workers manually. Execution resumed

https://stackoverflow.com/questions/68117963/dask-workers-taking-lot-of-time-even-if-cpu-is-available-hangs

musabqamri123 · 2021-07-12T12:23:07Z

musabqamri123
Jul 12, 2021
Author

Worker 37 has 21 tasks

1 reply

fjetter Jul 12, 2021
Maintainer

This image actually looks perfectly fine. We're not trying to balance the number of tasks but rather the occupancy of workers, i.e. how long we expect them to be busy

fjetter · 2021-07-12T14:20:41Z

fjetter
Jul 12, 2021
Maintainer

The red row on your first screen shot indicates that the worker has not submitted heartbeat for ~5700s. Typically this means the worker is either corrupt or dead. If the custom function you're executing is a C/C++ function a similar behaviour might be observed if the function doesn't release the GIL. Generally speaking the last seen metrics (right most column) is relatively large on your second screenshot. for healthy clusters I would assume this to be in the few-seconds-range if there are no GIL issues.

I would be curious if you have the chance to collect any logs of the dead worker, that depends on your setup.

You might want to consider setting the distributed.scheduler.worker-ttl variable to something large-ish (e.g. "10min") which restarts the worker if it hasn't submitted a heart beat in that time. This obviously only masks the underlying issue but may resolve your "stuck cluster" problem for now.

4 replies

musabqamri123 Jul 12, 2021
Author

Thanks.

I am trying to solve some optimization problem using ORTools in custom code which might take few minutes(5-10) to execute. But i dont see any issues in running the same code without dask(multi core machine using multiprocessing)
I was not able to collect/open logs for the hanged workers(was getting 500 server error).

But i see some random workers log as below from s3(Executing on EMR) and seen on multiple workers

distributed.core - INFO - Event loop was unresponsive in Worker for 27.94s.  
This is often caused by long-running GIL-holding functions or moving large chunks of data. 
This can cause timeouts and instability.

Given the above condition that it might take few min, which might result in unresponsiveness. How do i handle it so that it doesnt hang

What is the default or current value of distributed.scheduler.worker-ttl ?
I can set this as max time that would take to execute custom code.

fjetter Jul 14, 2021
Maintainer

I am not familiar with OR Tools but from their docs I assume that there is some solver beneath the surface running native code which explains the unresponsive loop problem. It is a rather frequent problem with python C-extensions to not release the GLI properly.

Gurobi or CPLEX, or open-source solvers such as SCIP, GLPK, or Google's GLOP and award-winning CP-SAT.

I would suggest reaching out to whatever solver you are using and ask them if they can somehow release the GIL safely to improve threaded applications. Maybe this is also a problem of OR tools itself, I don't know where and how the Python integration to the solvers is maintained so you can also start opening an issue there and see where they will point you to.

This is in general not necessarily a problem but will lead to some instabilities. While the library is holding the GIL on a worker, this worker is effectively locked down and will appear to the world as if it was not responsive / dead. If you see a worker being "dead" for a while running your optimizer, it may actually mean that the optimization hasn't finished, yet, and the GIL is still locked. It may of course also be something else but it is difficult to tell.

But i see some random workers log as below from s3(Executing on EMR) and seen on multiple workers
distributed.core - INFO - Event loop was unresponsive in Worker for 27.94s. This is often cause

If you expect the "unresponsive" workers message in this case you can increase the config option distributed.admin.tick.limit to something like 60s which should silence the warning but your cluster may still appear unresponsive.

What is the default or current value of distributed.scheduler.worker-ttl ?

By default, this is turned off which is why you do not see it kicking in.

musabqamri123 Jul 26, 2021
Author

Hi.. I am seeing another issue resulting in hang . I have set worker timeout but still its waiting infinitely on future
Workers might be failing because of memory but atleast it should fail after certain retries
Also i dont see any GIL related logs.

i see tasks as processing in status page of dashboard.
But info link doesn't show any worker processing

musabqamri123 Jul 26, 2021
Author

Any help is appreciated

mrocklin · 2021-07-14T15:24:53Z

mrocklin
Jul 14, 2021
Maintainer

This would also be a good use case to use a ProcessPoolExecutor. I raised an issue at #5059 to create an instructional example

…

On Wed, Jul 14, 2021 at 5:57 AM Florian Jetter ***@***.***> wrote: I am not familiar with OR Tools but from their docs I assume that there is some solver beneath the surface running native code which explains the unresponsive loop problem. It is a rather frequent problem with python C-extensions to not release the GLI properly. Gurobi or CPLEX, or open-source solvers such as SCIP, GLPK, or Google's GLOP and award-winning CP-SAT. I would suggest reaching out to whatever solver you are using and ask them if they can somehow release the GIL safely to improve threaded applications. This is in general not necessarily a problem but will lead to some instabilities. While the library is holding the GIL on a worker, this worker is effectively locked down and will appear to the world as if it was not responsive / dead. If you see a worker being "dead" for a while running your optimizer, it may actually mean that the optimization hasn't finished, yet, and the GIL is still locked. It may of course also be something else but it is difficult to tell. But i see some random workers log as below from s3(Executing on EMR) and seen on multiple workers distributed.core - INFO - Event loop was unresponsive in Worker for 27.94s. This is often cause If you expect the "unresponsive" workers message in this case you can increase the config option distributed.admin.tick.limit to something like 60s which should silence the warning but your cluster may still appear unresponsive. What is the default or current value of distributed.scheduler.worker-ttl ? By default, this is turned off which is why you do not see it kicking in. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#5051 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTFDSOHYZLPBFVZM7ITTXVUTHANCNFSM5AGRX4HQ> .

1 reply

musabqamri123 Jul 15, 2021
Author

Can we use ProcessPoolExecutor on distributed cluster or its only for single machine?
I want to run this on distributed cluster.

mrocklin · 2021-07-15T14:51:56Z

mrocklin
Jul 15, 2021
Maintainer

We're working on improving this here: #5063 I'll probably put up an example on examples.dask.org in a week or so as well

…

On Thu, Jul 15, 2021 at 1:54 AM musabqamri123 ***@***.***> wrote: Can we use ProcessPoolExecutor on distributed cluster or its only for single machine? I want to run this on distributed cluster. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#5051 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTD5HFXFN3F4S7OLXBDTX2O4TANCNFSM5AGRX4HQ> .

0 replies

mrocklin · 2021-07-16T14:23:35Z

mrocklin
Jul 16, 2021
Maintainer

I made an instructional example here and used this question as motivation (I hope that you don't mind). https://www.youtube.com/watch?v=vF2VItVU5zg

I'm going to hold off on publishing this example until the next release is out (I ran into and fixed a couple of bugs while doing this) but everything should be ready if you want to install from git main pip install git+https://github.com/dask/distributed@main

4 replies

musabqamri123 Jul 16, 2021
Author

Thanks.. will try to install from git.. BTW when is your next release planned ?

musabqamri123 Jul 20, 2021
Author

Hi .. I tried to install latest using git and followed your video

getting an error on running it.
I am trying to upload different modules of code to worker using egg file. Here is the exception that i get in worker

Code :

`class AddProcessPool(WorkerPlugin):
workers = None

def __init__(self, workers):
    self.workers = workers

def setup(self, worker):
    executor = ProcessPoolExecutor(max_workers=self.workers)
    worker.executors['processes'] = executor`

client.register_worker_plugin(AddProcessPool(10))

with dask.annotate(executors='processes', retries=3):

ml is the module which i am trying to load

distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x956\x00\x00\x00\x00\x00\x00\x00\x8c\x0cml.dask_base\x94\x8c\x0eAddProcessPool\x94\x93\x94)\x81\x94}\x94\x8c\x07workers\x94K<sb.'
Traceback (most recent call last):
File "/home/hadoop/miniconda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 75, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'ml'
distributed.utils - ERROR - No module named 'ml'
Traceback (most recent call last):
File "/home/hadoop/miniconda/lib/python3.7/site-packages/distributed/utils.py", line 638, in log_errors
yield
File "/home/hadoop/miniconda/lib/python3.7/site-packages/distributed/worker.py", line 2696, in plugin_add
plugin = pickle.loads(plugin)
File "/home/hadoop/miniconda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 75, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'ml'

mrocklin Jul 20, 2021
Maintainer

Unfortunately I don't know enough about your situation to know what's going on and be able to help. I recommend trying this without using your custom egg file, and if that works try to see what the difference is between the system that work and the system that doesn't work. Good luck!

musabqamri123 Jul 20, 2021
Author

actually same code works if i dont use WorkerPlugin/ProcessPool.

mrocklin · 2021-07-16T20:24:29Z

mrocklin
Jul 16, 2021
Maintainer

Dask releases every two weeks. The next release is planned for next Friday.

…

On Fri, Jul 16, 2021 at 1:22 PM musabqamri123 ***@***.***> wrote: Thanks.. will try to install from git.. BTW when is your next release planned ? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#5051 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTHIJ562QMVCTU532RLTYCIHNANCNFSM5AGRX4HQ> .

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler distributes task un-evenly #5051

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 10 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Scheduler distributes task un-evenly #5051

musabqamri123 Jul 12, 2021

Replies: 6 comments · 10 replies

musabqamri123 Jul 12, 2021 Author

fjetter Jul 12, 2021 Maintainer

fjetter Jul 12, 2021 Maintainer

musabqamri123 Jul 12, 2021 Author

fjetter Jul 14, 2021 Maintainer

musabqamri123 Jul 26, 2021 Author

musabqamri123 Jul 26, 2021 Author

mrocklin Jul 14, 2021 Maintainer

musabqamri123 Jul 15, 2021 Author

mrocklin Jul 15, 2021 Maintainer

mrocklin Jul 16, 2021 Maintainer

musabqamri123 Jul 16, 2021 Author

musabqamri123 Jul 20, 2021 Author

mrocklin Jul 20, 2021 Maintainer

musabqamri123 Jul 20, 2021 Author

mrocklin Jul 16, 2021 Maintainer

musabqamri123
Jul 12, 2021

Replies: 6 comments 10 replies

musabqamri123
Jul 12, 2021
Author

fjetter Jul 12, 2021
Maintainer

fjetter
Jul 12, 2021
Maintainer

musabqamri123 Jul 12, 2021
Author

fjetter Jul 14, 2021
Maintainer

musabqamri123 Jul 26, 2021
Author

musabqamri123 Jul 26, 2021
Author

mrocklin
Jul 14, 2021
Maintainer

musabqamri123 Jul 15, 2021
Author

mrocklin
Jul 15, 2021
Maintainer

mrocklin
Jul 16, 2021
Maintainer

musabqamri123 Jul 16, 2021
Author

musabqamri123 Jul 20, 2021
Author

mrocklin Jul 20, 2021
Maintainer

musabqamri123 Jul 20, 2021
Author

mrocklin
Jul 16, 2021
Maintainer