ajax_crawler

A flexible web crawler based on Scrapy for fetching Ajax-heavy pages as well as many other kinds of web pages.

Easy to use: to customize a new web crawler, you just need to write a config file and run it.

Usage

  • Edit a config file in the `configs` directory
cd configs
touch xxx.cfg
vim xxx.cfg

For example:

[xxx] # crawler name; should be the same as the config file name
allowed_domains = dianping.com # domain name; can be a list
start_urls = http://www.dianping.com/search/category/2/45/g152 # start URL; should be a concrete URL
list_url_pattern = .*category/2/45/g152[p\d]* # list URL pattern; regular expressions are allowed
list_restrict_xpaths = <<//div[@class="page"]//a/@href>> # list restrict XPaths, used to find item URLs
list_content = list,item # what kinds of content can be found under the list restrict XPaths
item_url_pattern = .*shop/\d+ # item URL pattern; regular expressions are allowed
item_restrict_xpaths = <<//div[@class="tit"]>> # item restrict XPaths, used to find item contents
item_content = name,address,region,intro,phone_num,cover_image,hours,sport # field names to extract under the item restrict XPaths
#item_incremental = yes # whether this crawler should be incremental (i.e. use a cache)
item_name_xpaths = <<//h1[@class="shop-title"]/text()>> # item content is extracted via these per-field XPaths
item_address_xpaths = <<//span[@itemprop="street-address"]/text()>>
item_region_xpaths = <<//span[@class="region"]/text()>>
item_phone_num_xpaths = <<//span[@itemprop="tel"]/text()>>
item_cover_image_xpaths = <<//img[@itemprop="photo"]/@src>>
item_hours_xpaths = <<//div[@class="desc-info"]//ul/li/span[@class="J_full-cont"]/text()>>
item_sport_xpaths = "羽毛球" # can also be a literal string (here: "badminton")
download_delay = 5 # download delay to reduce crawling frequency
#js_parser = on # whether to use WebKit to run JavaScript and re-render pages
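The two `*_url_pattern` values are ordinary regular expressions that decide which discovered links are treated as list pages versus item pages. A quick sanity check of the patterns above (the sample URLs are illustrative, not taken from a live crawl):

```python
import re

# Patterns copied from the config above.
list_url_pattern = re.compile(r".*category/2/45/g152[p\d]*")
item_url_pattern = re.compile(r".*shop/\d+")

# Illustrative URLs following the site's URL scheme.
list_url = "http://www.dianping.com/search/category/2/45/g152p2"
item_url = "http://www.dianping.com/shop/501425"

print(bool(list_url_pattern.match(list_url)))  # True: paginated list page
print(bool(item_url_pattern.match(item_url)))  # True: shop detail page
print(bool(item_url_pattern.match(list_url)))  # False: list page is not an item
```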
  • Then just run the crawler
scrapy crawl xxx
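The config file uses INI syntax, so the crawler can presumably load it with Python's standard `configparser` (an assumption; the project's actual loading code isn't shown here). A minimal sketch, including the inline `#` comments used above, which `configparser` only strips when `inline_comment_prefixes` is set:

```python
import configparser

# A trimmed, hypothetical version of the config shown above.
cfg_text = r"""
[xxx]
allowed_domains = dianping.com # domain name
start_urls = http://www.dianping.com/search/category/2/45/g152
item_url_pattern = .*shop/\d+
download_delay = 5 # download delay to reduce crawling frequency
"""

# Without inline_comment_prefixes, "# ..." would stay part of the value.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(cfg_text)

section = parser["xxx"]
print(section["allowed_domains"])        # dianping.com
print(section.getint("download_delay"))  # 5
```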
