Suggesting a new way to schedule requests #119

Open
botzill opened this issue Apr 1, 2018 · 4 comments · May be fixed by #269

@botzill

botzill commented Apr 1, 2018

Hi.

The current approach of adding new requests when the spider is idle works well, but I think we can improve it. Here is my idea:

Imagine that we configured our spider to handle a high load (for example):

CONCURRENT_REQUESTS = 100
CONCURRENT_ITEMS = 200
DOWNLOAD_DELAY = 0.15

(I know it's not ideal to make so many requests, but there are cases where we can.)
Now, according to the docs, https://doc.scrapy.org/en/latest/topics/signals.html#spider-idle, the idle state means the spider has no further:

  • requests waiting to be downloaded
  • requests scheduled
  • items being processed in the item pipeline
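
For reference, the part of RedisMixin that currently reacts to this signal looks roughly like the following (a simplified sketch, not the exact source):

from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class RedisMixin(object):
    # .... existing code

    def setup_redis(self, crawler=None):
        # .... existing code
        # The whole refill mechanism is driven by the idle signal.
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def spider_idle(self):
        # Only fires once downloads, scheduler and item pipeline are all empty.
        self.schedule_next_requests()  # pull the next batch from Redis
        raise DontCloseSpider  # keep the spider alive while waiting for more URLs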

Why do we need to wait until the items in the pipeline have been processed? There may be DB inserts and other work that slows the pipeline down, but we don't need to wait for that; we can process new requests in the meantime. Currently the spider waits for everything to drain and only then adds a new batch of requests. My suggestion is to have a task that runs every x seconds, checks the scheduler queue size, and adds new requests even if some are already queued. Example (prototype code):

from twisted.internet import task


class RedisMixin(object):
    # .... existing code

    def setup_redis(self, crawler=None):
        # .... existing code
        # Periodically check the scheduler size instead of waiting for spider_idle.
        self.task = task.LoopingCall(self.check_scheduler_size, crawler)
        self.task.start(60)  # interval in seconds; could be exposed as a setting

    def check_scheduler_size(self, crawler):
        queue_size = len(crawler.engine.slot.scheduler)

        if queue_size <= crawler.settings.getint('MIN_QUEUE_SIZE'):
            # Queue is running low: fetch the next batch from Redis right away.
            self.schedule_next_requests()
            # Some logs if needed
        else:
            # Do nothing, we already have enough requests in the queue.
            # Some logs if needed
            pass

    # .... existing code

This way we always keep some requests in the queue so the spider does not go idle (we can still keep the idle handling as a fallback). The spider stays busy and finishes sooner, and at the same time we maintain a reasonable number of requests in the queue while fetching new batches from the DB.
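
The only new configuration would be something like this (setting names are placeholders, not existing options):

# settings.py -- hypothetical names, just to illustrate the idea
MIN_QUEUE_SIZE = 500        # refill when fewer requests than this are scheduled
QUEUE_CHECK_INTERVAL = 60   # seconds between scheduler-size checks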

Let me know what you think about this approach. I can contribute a PR.

Thx.

@rmax
Owner

rmax commented Apr 1, 2018

Hi, thank you for your input.

A while ago I was thinking about something similar but couldn't pursue the implementation. It would be great if you could go ahead with the PR.

@gsusI

gsusI commented Sep 1, 2019

Hey! Quick question: was this ever implemented? Thank you!

@asad-haider

Hey!
Has anyone implemented this feature? I have been using Scrapy Redis for the last few months and I have run into this problem. Some spiders take a long time to crawl all their URLs, and Scrapy Redis keeps waiting for the current batch to be completely crawled before processing new URLs, which slows down large-scale crawls.

@NiuBlibing

NiuBlibing commented Feb 8, 2023

I wrote a temporary patch for it.

@NiuBlibing linked a pull request (#269) on Feb 8, 2023 that will close this issue.