ajax_crawler

A flexible web crawler based on Scrapy for fetching most of Ajax or other various types of web pages.

Easy to use: To customize a new web crawler, You just need to write a config file and run.

Usage

Edit A Config File In The 'Configs' Directory

cd configs
touch xxx.cfg
vim xxx.cfg

like this

[xxx] # crawler name, should be the same as config file name.
allowed_domains = dianping.com # domain name, can be a list.
start_urls = http://www.dianping.com/search/category/2/45/g152 # start url, should be a certain url.
list_url_pattern = .*category/2/45/g152[p\d]* # list url pattern # list url patern, you can use regular expressions here.
list_restrict_xpaths = '<<//div[@class="page"]//a/@href>>' # list restrict xpaths, we use this to find item urls.
list_content = list,item # decide what kind of content you can find in the list restrict xpaths.
item_url_pattern = .*shop/\d+ # item url patter, you can use regular expressions here.
item_restrict_xpaths = <<//div[@class="tit"]>> # item restrict xpaths, we use this to find item contents.
item_content = name,address,region,intro,phone_num,cover_image,hours,sport # decide what field names can find in the item_restrict_xpaths.
#item_incremental = yes # decide this crawler should be incremental (should use cache)
item_name_xpaths = <<//h1[@class="shop-title"]/text()>> # we can find item content in the item field xpaths
item_address_xpaths = <<//span[@itemprop="street-address"]/text()>>
item_region_xpaths = <<//span[@class="region"]/text()>>
item_phone_num_xpaths = <<//span[@itemprop="tel"]/text()>>
item_cover_image_xpaths = <<//img[@itemprop="photo"]/@src>>
item_hours_xpaths = <<//div[@class="desc-info"]//ul/li/span[@class="J_full-cont"]/text()>>
item_sport_xpaths = "羽毛球" # also can be a certain string
download_delay = 5 # downlaod delay to reduce crawling frequency
#js_parser = on # decide whether to use WebKit parsing js and rerendering web pages

Then Just Run The Crawler

scrapy crawl xxx

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
configs		configs
crawler		crawler
LICENSE		LICENSE
README.md		README.md
check_json_file.py		check_json_file.py
requirements.txt		requirements.txt
run_crawler.sh		run_crawler.sh
scrapy.cfg		scrapy.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ajax_crawler

Usage

About

Releases

Packages

Languages

License

heartfly/ajax_crawler

Folders and files

Latest commit

History

Repository files navigation

ajax_crawler

Usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages