Hello,

I'm working on a project that needs both the memory efficiency of scrapy-redis and the features of Scrapy's SitemapSpider, and I think scrapy-redis should offer a dedicated implementation for this, because sitemap spiders often keep a lot of data in memory.

Specifically, here is SitemapSpider._parse_sitemap (source):
def _parse_sitemap(self, response):
    if response.url.endswith('/robots.txt'):
        for url in sitemap_urls_from_robots(response.text, base_url=response.url):
            yield Request(url, callback=self._parse_sitemap)
    else:
        body = self._get_sitemap_body(response)
        if body is None:
            logger.warning("Ignoring invalid sitemap: %(response)s",
                           {'response': response}, extra={'spider': self})
            return

        s = Sitemap(body)
        if s.type == 'sitemapindex':
            for loc in iterloc(s, self.sitemap_alternate_links):
                if any(x.search(loc) for x in self._follow):
                    yield Request(loc, callback=self._parse_sitemap)
        elif s.type == 'urlset':
            for loc in iterloc(s):
                for r, c in self._cbs:
                    if r.search(loc):
                        yield Request(loc, callback=c)
                        break
In this method a Sitemap object is created for every sitemap body, and as the next snippet shows, an lxml element tree is stored in self._root.
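For context, this is roughly what the Sitemap constructor in scrapy.utils.sitemap does; treat it as a simplified sketch, since the exact code depends on your Scrapy version:

# Simplified sketch of scrapy.utils.sitemap.Sitemap (details may differ per Scrapy version)
import lxml.etree

class Sitemap(object):
    """Parses sitemap (type=urlset) and sitemap index (type=sitemapindex) documents."""

    def __init__(self, xmltext):
        xmlp = lxml.etree.XMLParser(recover=True, remove_comments=True, resolve_entities=False)
        # The whole document is parsed eagerly; the resulting element tree stays
        # referenced by the instance for as long as the Sitemap object is alive.
        self._root = lxml.etree.fromstring(xmltext, parser=xmlp)
        rt = self._root.tag
        self.type = rt.split('}', 1)[1] if '}' in rt else rt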
Now, this is fine for most sitemaps, which contain a dozen to a few hundred URLs. But if you have to deal with huge sitemaps (for example autobidmaster.com/sitemap_index.xml.gz), you'll soon find that even with scrapy-redis the process keeps growing, because the entire Sitemap object is not garbage collected until all requests from that sitemap have been yielded and handled: the suspended _parse_sitemap generator keeps a reference to s (and therefore to s._root) the whole time.
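To see why the generator matters, here's a tiny self-contained illustration (not Scrapy code): locals of a suspended generator, like s inside _parse_sitemap, stay alive until the generator is exhausted or closed.

# Toy example: a generator frame keeps its locals referenced while suspended.
def producer():
    big = bytearray(100 * 1024 * 1024)  # ~100 MB, stands in for Sitemap._root
    for i in range(1000):
        yield i  # 'big' stays referenced by the generator frame at every yield

gen = producer()
next(gen)    # the 100 MB buffer is now allocated and kept alive
# As long as something keeps pulling items from gen, 'big' cannot be collected;
# only after exhaustion (or gen.close()) does the frame go away and free it.
gen.close()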
To make this clearer, I'll show the heap size, body size, Sitemap object size, s._root size and response URL while crawling autobidmaster.com/robots.txt. I made a simple Scrapy project that uses scrapy-redis; here's the spider:
import sys
import logging
import os

from scrapy.spiders import SitemapSpider
from scrapy.spiders.sitemap import regex, iterloc
from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots
from scrapy.http import Request, XmlResponse

logger = logging.getLogger(__name__)


class TestSitemapSpider(SitemapSpider):
    name = 'test_sitemap_spider'
    sitemap_urls = [
        'http://www.autobidmaster.com/robots.txt?',
    ]

    def parse(self, response):
        # Read VmSize from /proc/<pid>/status and convert it to MB
        _proc_status = '/proc/%d/status' % os.getpid()
        _scale = {'kB': 1024.0, 'mB': 1024.0 * 1024.0, 'KB': 1024.0, 'MB': 1024.0 * 1024.0}
        with open(_proc_status) as t:
            v = t.read()
        i = v.index('VmSize:')
        v = v[i:].split(None, 3)
        heap_size = float(v[1]) * _scale[v[2]] / 1000 / 1000
        print(heap_size, response.url)

    def _parse_sitemap(self, response):
        if response.url.endswith('/robots.txt'):
            for url in sitemap_urls_from_robots(response.text, base_url=response.url):
                yield Request(url, callback=self._parse_sitemap)
        else:
            body = self._get_sitemap_body(response)
            if body is None:
                logger.warning("Ignoring invalid sitemap: %(response)s",
                               {'response': response}, extra={'spider': self})
                return
            s = Sitemap(body)
            # Get heap memory (VmSize in MB) from /proc/<pid>/status
            _proc_status = '/proc/%d/status' % os.getpid()
            _scale = {'kB': 1024.0, 'mB': 1024.0 * 1024.0, 'KB': 1024.0, 'MB': 1024.0 * 1024.0}
            with open(_proc_status) as t:
                v = t.read()
            i = v.index('VmSize:')
            v = v[i:].split(None, 3)
            heap_size = float(v[1]) * _scale[v[2]] / 1000 / 1000
            # Get an estimate of the memory held by s._root
            root_memory = 0
            for child in s._root.getchildren():
                for childs_child in child.getchildren():
                    root_memory += sys.getsizeof(childs_child)
            print(heap_size, sys.getsizeof(body), sys.getsizeof(s), root_memory, response.url)
            if s.type == 'sitemapindex':
                for loc in iterloc(s, self.sitemap_alternate_links):
                    if any(x.search(loc) for x in self._follow):
                        yield Request(loc, callback=self._parse_sitemap)
            elif s.type == 'urlset':
                for loc in iterloc(s):
                    for r, c in self._cbs:
                        if r.search(loc):
                            yield Request(loc, callback=c)
                            break
And here's the output (the relevant parts, with heap size in MB and the rest in bytes):
As you can see, the memory is not freed while scraping. I think this can be solved elegantly by storing the URLs in Redis until all sitemaps have been scraped, and only then starting to crawl those URLs; that releases s from memory and frees quite a lot of heap.
This worked in my project, and I think scrapy-redis should ship an implementation of it. I'll make a branch ASAP, but I wanted to show you where the issue is first and get some feedback, as I have some questions:
If I implement this as a RedisSitemapSpider (similar to the existing RedisSpider and RedisCrawlSpider), what would be the best way to store the requests in Redis? I think a new key, "%spidername%:sitemap_urls", should be added, and the URLs extracted from the sitemaps should be stored either in "%spidername%:start_urls" or directly in "%spidername%:requests".
In other words, the RedisSitemapSpider would take a sitemap URL, extract all site links and push them into Redis under some key, then move on to the next sitemap and repeat the process without keeping the previous sitemap in memory. Here is a rough sketch of what I mean:
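A minimal sketch of that idea, assuming redis-py and the key naming above; RedisSitemapSpider and the key names are illustrative only, not an existing scrapy-redis API:

# Sketch only: push extracted sitemap URLs into Redis instead of yielding Requests.
import redis
from scrapy.spiders import SitemapSpider
from scrapy.spiders.sitemap import iterloc
from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots
from scrapy.http import Request


class RedisSitemapSpider(SitemapSpider):  # hypothetical class name
    name = 'my_sitemap_spider'
    sitemap_urls = ['http://www.example.com/robots.txt']
    redis_url = 'redis://localhost:6379/0'  # assumption: plain redis-py connection

    def _parse_sitemap(self, response):
        server = redis.from_url(self.redis_url)
        if response.url.endswith('/robots.txt'):
            for url in sitemap_urls_from_robots(response.text, base_url=response.url):
                yield Request(url, callback=self._parse_sitemap)
        else:
            body = self._get_sitemap_body(response)
            if body is None:
                return
            s = Sitemap(body)
            if s.type == 'sitemapindex':
                # Sitemap indexes are small, so following them directly is fine.
                for loc in iterloc(s, self.sitemap_alternate_links):
                    yield Request(loc, callback=self._parse_sitemap)
            elif s.type == 'urlset':
                # Instead of yielding Requests here (which keeps 's' alive until
                # they are all scheduled), push the plain URLs into Redis.
                # (The sitemap_rules / self._cbs filtering is omitted for brevity.)
                for loc in iterloc(s):
                    server.lpush('%s:start_urls' % self.name, loc)
            # 's' goes out of scope as soon as this callback finishes, so the
            # lxml tree can be garbage collected before the URLs are crawled.

Since scrapy-redis's RedisSpider already reads its start URLs from the <name>:start_urls key by default, the pushed URLs could then be consumed by a regular RedisSpider, or, in a combined implementation, by the sitemap spider itself once all sitemaps have been processed.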