issue(graphmanager): Issue when waiting on already scheduled jobs #53

@curita

Description

Situation

  1. A task was due to run.
  2. The request to schedule it failed and was retried a couple of times (apparently because of a ReadTimeout error).
  3. Despite the timeouts, the job was nonetheless scheduled on the server side.
  4. Eventually, one of the retried scheduling requests gets a response.
  5. The response says the job is already scheduled.
  6. Because of this, the returned job_id is None (see BaseScript._schedule_job() and Task.run()).
  7. This None job_id is then tracked among the running jobs (see GraphManager.run_pending_jobs()).
  8. GraphManager waits for it to finish (see GraphManager.check_running_jobs()).
  9. Since None is not an actual job key, it never finishes, so GraphManager gets stuck and never moves on to the next task.
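
The steps above can be sketched with a simplified model (hypothetical names, not the actual shub_workflow code):

```python
# Simplified model of the stuck-wait behaviour (hypothetical names;
# not the real shub_workflow implementation).

class MiniGraphManager:
    def __init__(self, schedule):
        self.schedule = schedule    # scheduling call: returns a job key, or None
        self.running_jobs = []

    def run_pending_jobs(self, tasks):
        for task in tasks:
            job_id = self.schedule(task)
            # Bug: a None job_id ("already scheduled") is tracked as if
            # it were a real job key.
            self.running_jobs.append(job_id)

    def check_running_jobs(self, is_finished):
        # None is never a real job key, so it never tests as finished,
        # and the manager keeps waiting on it forever.
        self.running_jobs = [j for j in self.running_jobs if not is_finished(j)]
        return self.running_jobs
```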

Logs

Affected jobs:

2023-10-16 05:51:07.032000 INFO [shub_workflow.graph] Job discovery (265529/71/10349) still running
2023-10-16 05:53:07.597000 INFO [shub_workflow.graph] Job "graph_manager/discovery" (265529/71/10349) finished
2023-10-16 05:53:07.598000 INFO [shub_workflow.graph.task] Will start task "graph_manager/consumers"
2023-10-16 05:54:07.726000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:54:07.727000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 05:56:07.918000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:56:07.918000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 05:58:08.104000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:58:08.104000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:00:08.300000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:00:08.301000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:02:08.490000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:02:08.490000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:04:08.671000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:04:08.671000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:06:08.858000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:06:08.858000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:08:10.015000 ERROR [shub_workflow.script] py:hcfmanager.py already scheduled
2023-10-16 06:10:10.333000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:12:10.659000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:14:10.986000 INFO [shub_workflow.graph] Job consumers (None) still running

Proposal

I'm not sure what the best way to address this is. On one hand, it's a transient issue in ScrapyCloud/JobQ, so the root cause is external to shub_workflow. On the other hand, this API has been unstable for quite a while, so it could be worth handling, at least to prevent jobs from getting stuck.

I think that if self.run_job() returns None in GraphManager, we should stop the waiting loop for that task, because we won't be tracking its associated job ID either way.
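
A minimal sketch of that guard (run_job and running_jobs are names from the issue; the rest is hypothetical):

```python
# Sketch of the proposed guard: skip tracking when scheduling returned
# None, so check_running_jobs() can never block on a phantom key.

def run_pending_jobs(manager, tasks):
    started = []
    for task in tasks:
        job_id = manager.run_job(task)
        if job_id is None:
            # "Already scheduled": we have no key to wait on, so log it
            # and move on instead of tracking None forever.
            print(f"Task {task!r} reported as already scheduled; not tracking it")
            continue
        manager.running_jobs.append(job_id)
        started.append(job_id)
    return started
```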

I also wonder if it would be possible to fetch the job ID of the already scheduled job, so GraphManager could recover from this situation without having to be restarted manually.
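
For the recovery idea, something like this might work, assuming the python-scrapinghub client (the TASK_ID/FLOW_ID tags come from the logs above; the helper name and lookup strategy are hypothetical, not an existing shub_workflow API):

```python
def find_already_scheduled_job(project, task_id, flow_id):
    """Try to recover the key of the job that JobQ reported as already
    scheduled, by searching for the workflow tags that shub_workflow
    attaches when scheduling (as seen in the logs above)."""
    tags = [f"TASK_ID={task_id}", f"FLOW_ID={flow_id}"]
    for state in ("running", "pending"):
        for job in project.jobs.iter(state=state, has_tag=tags):
            return job["key"]
    return None
```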
