"Broken jobs were found in the job queue" error spam #299

sminnee · 2020-06-02T05:12:40Z

I have queuedjobs set up on a site with Raygun error logging.

If a job breaks (which reports an error via Raygun) then roughly once an hour I will get a subsequent message "Broken jobs were found in the job queue".

Because this leads to a Raygun notification, this gets quite spammy, especially on a weekend. Since the site in question recreates jobs periodically anyway, and the broken job is benign, this is doubly so.

A few thoughts about how to address this; one or more of these might be useful.

Add a config option to decide on whether "Broken jobs were found in the job queue" errors should be thrown
Add a facility where broken jobs can be automatically retried
Lower the frequency of such alerts – a daily alert to go and clean up jobs might be more useful.
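
For illustration, the first and third options could be combined into a single throttle. This is only a sketch of the idea in Python, not queuedjobs code: `BrokenJobAlertThrottle`, `enabled` and `min_interval` are hypothetical names, and a real implementation would need to persist the last-sent timestamp between cron runs.

```python
from datetime import datetime, timedelta
from typing import Optional


class BrokenJobAlertThrottle:
    """Hypothetical helper: allow at most one broken-jobs alert per interval."""

    def __init__(self, enabled: bool = True,
                 min_interval: timedelta = timedelta(days=1)) -> None:
        self.enabled = enabled            # assumed on/off config option
        self.min_interval = min_interval  # assumed minimum gap between alerts
        self._last_sent: Optional[datetime] = None

    def should_alert(self, broken_count: int, now: datetime) -> bool:
        # Nothing to report, or alerts switched off entirely.
        if not self.enabled or broken_count == 0:
            return False
        # An alert was already sent within the interval: stay quiet.
        if self._last_sent is not None and now - self._last_sent < self.min_interval:
            return False
        self._last_sent = now
        return True


# One broken job, health check running hourly for two days.
throttle = BrokenJobAlertThrottle(min_interval=timedelta(days=1))
start = datetime(2020, 6, 2, 5, 0)
hourly_checks = [throttle.should_alert(1, start + timedelta(hours=h)) for h in range(48)]
alerts_sent = sum(hourly_checks)  # one alert per day instead of one per hour
```

With a one-day interval the 48 hourly checks produce two alerts rather than 48; setting `enabled=False` suppresses them completely.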

It would be interesting to hear whether other deployments of queuedjobs have this issue.

If it turns out these facilities already exist then I would suggest that we address this ticket by updating docs, as I couldn't see mention of this in the docs.

micschk · 2020-06-02T13:29:11Z

These 'broken jobs' messages once used up around 1000 euros of SMS budget overnight on a critical system for which I had temporarily set up an SMS error handler... :-)

I think currently every cron-run checks & outputs these alerts so if you're running one or even multiple threads each minute this can result in a lot of alerts.

Instead of outputting these alerts periodically or with a lower (configurable) frequency, wouldn't it make sense to just output an alert only once (per broken job)?
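
A rough sketch of what "alert once per broken job" could look like, written in Python purely for illustration. `OncePerJobNotifier` and the integer job IDs are hypothetical and not part of queuedjobs; since each cron run is a separate process, the set of already-reported IDs would have to be stored somewhere persistent (for example a flag on the job record).

```python
from typing import Iterable, List, Set


class OncePerJobNotifier:
    """Hypothetical helper: report each broken job ID only the first time it is seen."""

    def __init__(self) -> None:
        self._already_reported: Set[int] = set()

    def new_broken_jobs(self, broken_job_ids: Iterable[int]) -> List[int]:
        # Keep only IDs that have not been reported on an earlier check.
        fresh = sorted(set(broken_job_ids) - self._already_reported)
        self._already_reported.update(fresh)
        return fresh


notifier = OncePerJobNotifier()
first_run = notifier.new_broken_jobs([7])      # job 7 is new, so it is reported
second_run = notifier.new_broken_jobs([7, 9])  # only job 9 is new
third_run = notifier.new_broken_jobs([7, 9])   # nothing new, no alert needed
```

An alert would then be sent only when the returned list is non-empty, so a job that stays broken generates a single notification regardless of how often the check runs.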

sminnee · 2020-06-03T07:50:31Z

Generally speaking a job will have broken because of an error, and that error will have been passed to whatever system you have in place for error handling. So I don't think "notify once" is needed; if you disabled it entirely you would end up with the functionality you seek.

micschk · 2020-06-03T08:03:15Z

Which would ideally be the case indeed. But job failure is often caused by running out of memory, or by the job getting stuck on something and being restarted/stopped at some point by the runner; in those cases error handling tends not to (always) get executed. I think that's the reason for the job-health checking being in place(?).

So for me it is important to get notified of 'failed' jobs (via e-mail/sms), just not every minute.
Also we don't set up Raygun/Sentry on every system so relying on a third party for notifications would be less desirable.

michalkleiner · 2020-06-03T09:32:12Z

An example for us is checking for potential composer package updates within CWP, where it's a part of the default recipe. The task there in some circumstances fails on insufficient memory, possibly due to a bug in the checker, who knows. Unscheduling/deleting the job is not a solution as it always gets recreated by dev/build.

chillu · 2020-06-09T03:40:17Z

Duplicate of #24?

sminnee · 2020-06-09T03:55:58Z

Closely related but I believe “broken jobs” and “stalled jobs” have different messages

mfendeksilverstripe · 2020-06-14T19:56:15Z

My general feedback (based on multiple projects):

email notifications are not that useful (for both stalled and broken jobs)
instead we rely on Raygun reporting
checking queue health is really useful as it applies automatic resume attempts for stalled jobs
to further reduce the number of broken jobs we have to deal with, we use an automatic retry system for broken jobs; I added this system to the feature review PR
this is very useful for jobs that may break but where the error can be safely ignored, e.g. jobs that trigger third-party requests (request failure) or embargo publish of multiple localisations of the same page (DB deadlock)
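
To illustrate the general shape of such an automatic retry, here is a minimal Python sketch. It is not the code from the PR mentioned above: the `Job` class, the `retries` counter, the `max_retries` limit and the status strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Job:
    """Hypothetical stand-in for a queued job record."""
    job_id: int
    status: str = "Broken"  # assumed status names: "Broken", "Waiting"
    retries: int = 0        # assumed counter of automatic retries so far


def requeue_broken(jobs: Iterable[Job], max_retries: int = 3) -> List[Job]:
    """Put broken jobs back in the queue until they exhaust their retries.

    Returns the jobs that are still broken after max_retries attempts,
    i.e. the only ones a human needs to be alerted about.
    """
    needs_attention = []
    for job in jobs:
        if job.status != "Broken":
            continue
        if job.retries < max_retries:
            job.retries += 1
            job.status = "Waiting"  # scheduled to run again
        else:
            needs_attention.append(job)
    return needs_attention


flaky = Job(job_id=1)                 # e.g. a third-party request that timed out
exhausted = Job(job_id=2, retries=3)  # has already used up its retries
still_broken = requeue_broken([flaky, exhausted])
```

In this sketch `flaky` goes back to "Waiting" with its retry count incremented, while `exhausted` is returned for manual follow-up, so alerts are raised only for jobs that keep failing.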

maxime-rainville added the affects/v4, complexity/low, impact/medium and type/enhancement labels on Jun 3, 2020
