issue(graphmanager): Issue when waiting on already scheduled jobs #53

@curita

Description

Situation

  1. A task was due to run.
  2. The request to schedule it failed and was retried a couple of times (apparently because of a ReadTimeout error).
  3. Despite the timeouts, the job was nonetheless scheduled on the server side.
  4. Eventually, one of the retried scheduling requests gets a response.
  5. The response says the job is already scheduled.
  6. Because of this, the returned job_id is None (see BaseScript._schedule_job() and Task.run()).
  7. This None job_id is then tracked among the running jobs (see GraphManager.run_pending_jobs()).
  8. GraphManager waits for it to finish (see GraphManager.check_running_jobs()).
  9. Since None is not an actual job key, it never finishes, so GraphManager gets stuck and never moves on to the next task.
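
The steps above can be sketched with a simplified model (hypothetical names, not the actual shub_workflow code):

```python
# Simplified model of the stuck-wait behaviour (hypothetical names;
# not the real shub_workflow implementation).

class MiniGraphManager:
    def __init__(self, schedule):
        self.schedule = schedule    # scheduling call: returns a job key, or None
        self.running_jobs = []

    def run_pending_jobs(self, tasks):
        for task in tasks:
            job_id = self.schedule(task)
            # Bug: a None job_id ("already scheduled") is tracked as if
            # it were a real job key.
            self.running_jobs.append(job_id)

    def check_running_jobs(self, is_finished):
        # None is never a real job key, so it never tests as finished,
        # and the manager keeps waiting on it forever.
        self.running_jobs = [j for j in self.running_jobs if not is_finished(j)]
        return self.running_jobs
```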

Logs

Affected jobs:

2023-10-16 05:51:07.032000 INFO [shub_workflow.graph] Job discovery (265529/71/10349) still running
2023-10-16 05:53:07.597000 INFO [shub_workflow.graph] Job "graph_manager/discovery" (265529/71/10349) finished
2023-10-16 05:53:07.598000 INFO [shub_workflow.graph.task] Will start task "graph_manager/consumers"
2023-10-16 05:54:07.726000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:54:07.727000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 05:56:07.918000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:56:07.918000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 05:58:08.104000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:58:08.104000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:00:08.300000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:00:08.301000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:02:08.490000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:02:08.490000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:04:08.671000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:04:08.671000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:06:08.858000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:06:08.858000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:08:10.015000 ERROR [shub_workflow.script] py:hcfmanager.py already scheduled
2023-10-16 06:10:10.333000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:12:10.659000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:14:10.986000 INFO [shub_workflow.graph] Job consumers (None) still running

Proposal

I'm not sure what the best way to address this is. On one hand, it's a transient issue in ScrapyCloud/JobQ, so the root cause is external to shub_workflow. On the other hand, this API has been unstable for quite a while, so it could be worth handling, at least to prevent jobs from getting stuck.

I think that if self.run_job() returns None in GraphManager, we should stop the waiting loop for that task, because we won't be tracking its associated job ID either way.
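
A minimal sketch of that guard (run_job and running_jobs are names from the issue; the rest is hypothetical):

```python
# Sketch of the proposed guard: skip tracking when scheduling returned
# None, so check_running_jobs() can never block on a phantom key.

def run_pending_jobs(manager, tasks):
    started = []
    for task in tasks:
        job_id = manager.run_job(task)
        if job_id is None:
            # "Already scheduled": we have no key to wait on, so log it
            # and move on instead of tracking None forever.
            print(f"Task {task!r} reported as already scheduled; not tracking it")
            continue
        manager.running_jobs.append(job_id)
        started.append(job_id)
    return started
```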

I also wonder if it would be possible to fetch the job ID of the already scheduled job, so GraphManager could recover from this situation without having to be restarted manually.
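
For the recovery idea, something like this might work, assuming the python-scrapinghub client (the TASK_ID/FLOW_ID tags come from the logs above; the helper name and lookup strategy are hypothetical, not an existing shub_workflow API):

```python
def find_already_scheduled_job(project, task_id, flow_id):
    """Try to recover the key of the job that JobQ reported as already
    scheduled, by searching for the workflow tags that shub_workflow
    attaches when scheduling (as seen in the logs above)."""
    tags = [f"TASK_ID={task_id}", f"FLOW_ID={flow_id}"]
    for state in ("running", "pending"):
        for job in project.jobs.iter(state=state, has_tag=tags):
            return job["key"]
    return None
```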
