
Is there a way to stop the spider checking duplicates with Redis? #242

Open
milkeasd opened this issue Apr 2, 2022 · 7 comments

Comments

@milkeasd

milkeasd commented Apr 2, 2022

My spider is extremely slow when run with scrapy-redis, because there is a big delay between the slave and the master. I want to reduce the communication to just fetching the start_urls periodically, or only when all start_urls are done. Is there any way to do so?

Moreover, I want to stop the duplicate check to reduce the number of connections.

But I can't change DUPEFILTER_CLASS to the Scrapy default one; it raises an error.

Is there any other way to stop the duplicate check?

Or are there any ideas that could help speed up the process?

Thanks

@LuckyPigeon
Collaborator

@Germey Any ideas?

@LuckyPigeon
Collaborator

@milkeasd
Could you provide related code files?

@LuckyPigeon
Collaborator

LuckyPigeon commented Apr 3, 2022

The way I see it, letting developers customize their communication rules and adding a disable option for DUPEFILTER_CLASS could be two great features.

@LuckyPigeon
Collaborator

LuckyPigeon commented Apr 8, 2022

@Germey
Collaborator

Germey commented Apr 9, 2022

@milkeasd Could you please provide your code or some sample code?

@sify21

sify21 commented Jun 7, 2024

@LuckyPigeon it doesn't work. Setting DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter" reports this error:

builtins.AttributeError: type object 'BaseDupeFilter' has no attribute 'from_spider'

Maybe there should be a custom BaseDupeFilter in scrapy-redis that, like RFPDupeFilter, implements:

def from_spider(cls, spider):
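A minimal sketch of such a class, assuming it lives in the crawling project itself (the NoopDupeFilter name and myproject.dupefilters path are illustrative, not part of scrapy-redis):

```python
# myproject/dupefilters.py -- illustrative sketch, not shipped with scrapy-redis
from scrapy.dupefilters import BaseDupeFilter


class NoopDupeFilter(BaseDupeFilter):
    """Never marks a request as seen, effectively disabling filtering."""

    @classmethod
    def from_spider(cls, spider):
        # scrapy-redis's scheduler builds its dupefilter through this hook;
        # plain BaseDupeFilter lacks it, hence the AttributeError above.
        return cls()

    def request_seen(self, request):
        # Always report "not seen" so every request gets scheduled.
        return False
```

It would then be wired up with DUPEFILTER_CLASS = "myproject.dupefilters.NoopDupeFilter" in settings.py.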

From Scrapy's docs: https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class

You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
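For completeness, the per-request alternative the docs recommend looks like this (a hypothetical spider; the URL and callback names are placeholders):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # dont_filter=True bypasses the dupefilter for this one request,
        # leaving duplicate checking enabled everywhere else.
        yield scrapy.Request(
            response.urljoin("/status"),
            callback=self.parse_status,
            dont_filter=True,
        )

    def parse_status(self, response):
        self.logger.info("fetched %s", response.url)
```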

@HairlessVillager
Contributor

Hi, everyone! I've made a little change in scrapy_redis.scheduler.Scheduler, which may be helpful for this issue. Feel free to use it and comment. 🥰
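The change itself isn't reproduced in the thread, but a sketch of that kind of scheduler tweak, assuming the enqueue_request() hook of scrapy-redis's Scheduler (the NoDupeScheduler name and module path are illustrative), might look like:

```python
# myproject/scheduler.py -- illustrative sketch of skipping the dupefilter
from scrapy_redis.scheduler import Scheduler


class NoDupeScheduler(Scheduler):
    def enqueue_request(self, request):
        # Push straight to the Redis queue without consulting the dupefilter,
        # saving one Redis round-trip per request.
        if self.stats:
            self.stats.inc_value("scheduler/enqueued/redis", spider=self.spider)
        self.queue.push(request)
        return True
```

Setting SCHEDULER = "myproject.scheduler.NoDupeScheduler" in settings.py would then route scheduling through it and bypass duplicate checks entirely.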
