Expose squeue polling interval to SlurmExecutor parameter and allow setting via env variable #1143
erjel wants to merge 1 commit into scalableminds:master
Hi @erjel, thank you for your contribution! Before talking about your proposed solution, I would like to understand the problem a bit better.
How many datasets do you downsample in parallel? There should only be one
By "number of concurrent downsampling jobs" you mean the number of datasets being concurrently downsampled, right?
How many squeue requests are we talking about, and what interval do you want to configure to mitigate the issue? Thank you!
Hi,
we regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets with the cluster-tools Python API. In particular, we get the error message:
Our cluster team traced the error down to the SLURM controller being overwhelmed by the number of `squeue` requests. Technically, we could reduce the number of concurrent downsampling jobs, but that would negatively impact overall cluster utilization as well as throughput.
Alternatively, we searched for `squeue` commands in the cluster-tools API. We noticed the line `self.executor.get_pending_tasks()` in `file_wait_thread.py`. It seems you already implemented a polling throttle there via the `interval` parameter, but the parameter is never exposed to `ClusterExecutor` or `SlurmExecutor`, so it cannot be used to reduce the number of `squeue` calls.
Therefore, I would like to propose a change where `SlurmExecutor` users can set the polling interval (in seconds) programmatically in their Python program or, alternatively, via an environment variable.
I am happy to make any additional changes to this pull request and add documentation if necessary.
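To illustrate the idea, here is a minimal, self-contained sketch. It is not the actual cluster-tools code: `resolve_polling_interval`, `PollingThread`, and the default of 2 seconds are hypothetical names and values chosen for the example, and the precedence (explicit argument first, then `SLURM_QUEUE_CHECK_INTERVAL`, then the default) is an assumption about how the two settings could reasonably interact.

```python
import os
import threading
from typing import Callable, Optional

# Hypothetical default; the real default in cluster-tools may differ.
DEFAULT_INTERVAL_SECONDS = 2.0
ENV_VAR = "SLURM_QUEUE_CHECK_INTERVAL"


def resolve_polling_interval(explicit: Optional[float] = None) -> float:
    """Pick the interval: explicit argument, then env variable, then default."""
    if explicit is not None:
        interval = float(explicit)
    elif ENV_VAR in os.environ:
        interval = float(os.environ[ENV_VAR])
    else:
        interval = DEFAULT_INTERVAL_SECONDS
    if interval <= 0:
        raise ValueError(f"Polling interval must be positive, got {interval}")
    return interval


class PollingThread(threading.Thread):
    """Hypothetical stand-in for FileWaitThread: calls `poll` once per interval."""

    def __init__(
        self, poll: Callable[[], None], interval: Optional[float] = None
    ) -> None:
        super().__init__(daemon=True)
        self.poll = poll
        self.interval = resolve_polling_interval(interval)
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait returns False on timeout and True once stop() was called,
        # so the loop polls every `interval` seconds until it is stopped.
        while not self._stop_event.wait(self.interval):
            self.poll()

    def stop(self) -> None:
        self._stop_event.set()
```

With this kind of precedence, CLI-only users could export `SLURM_QUEUE_CHECK_INTERVAL=30` before starting a job, while Python users could pass the interval directly as an argument.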
Best wishes,
Eric
Issues:
- Expose `FileWaitThread`'s `interval` parameter to `SlurmExecutor`
- Allow setting `SLURM_QUEUE_CHECK_INTERVAL` via environment variable to provide the same functionality in a CLI-only setting.
Todos:
Make sure to delete unnecessary points or to check all before merging: