RedisSpidey combines Spidey and Redis to enable efficient distributed crawling and web scraping. Multiple RedisSpidey instances can run in parallel, all consuming from the same Redis queue, and scraped data can be pushed back to Redis queues for distributed post-processing.
- Distributed Crawling: run multiple crawler instances, all listening to the same queue, for efficient distributed crawling.
- RedisPipeline: push crawled data back to Redis queues for distributed post-processing.
```shell
npm install spidey-redis
```
RedisSpidey supports all Spidey options in addition to the following specific options.
| Configuration | Type | Description | Default | Required |
|---|---|---|---|---|
| redisUrl | string | Redis URL, such as `redis://localhost:6379` | null | Yes |
| urlsKey | string | Redis input queue name, such as `urls:queue` | null | Yes |
| dataKey | string | Redis output queue name, such as `data:queue` | null | Yes, if using RedisPipeline |
| sleepDelay | number | Time to wait for new items when the queue is empty | 5000ms | No |
```typescript
import { RedisSpidey, RedisPipeline } from 'spidey-redis';

class AmazonSpidey extends RedisSpidey {
  constructor() {
    super({
      // spidey options ...
      redisUrl: 'redis://localhost:6379',
      // Input queue
      urlsKey: 'amazon:urls',
      // Output queue
      dataKey: 'amazon:data',
      // Redis pipeline to push crawled data to the data queue
      pipelines: [RedisPipeline],
    });
  }
}
```
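The crawler above consumes URLs from the `amazon:urls` list and, via `RedisPipeline`, pushes results to `amazon:data`. One way to seed the input queue and inspect the output is with `redis-cli`; the exact list commands RedisSpidey uses internally are not documented here, so treat the commands below as a sketch of working with plain Redis lists, not as part of the RedisSpidey API. The URL is a placeholder.

```shell
# Seed the input queue with a URL to crawl (the key matches urlsKey above)
redis-cli LPUSH amazon:urls "https://www.amazon.com/dp/EXAMPLE"

# Block until a scraped item appears on the output queue (the key matches dataKey)
redis-cli BRPOP amazon:data 0
```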
RedisSpidey makes it straightforward to scale web scraping across machines: add more instances to increase throughput, and let Redis handle work distribution and collection of scraped data for post-processing.
Spidey is licensed under the MIT License.