qorbani/webscraper-wikipedia

Web Scraper for Wikipedia Pages

Collects the following data sources from Wikipedia:

  • timezone
    • Source: List of tz database time zones
    • Actual Fields:
      • CountryCode
      • Coordinates
      • TimeZone
      • Comments
      • UTC offset
      • UTC DST offset
      • Notes
    • Extended Fields:
      • CountryCodeISO
      • CountryCodeISOWiki
      • Link
      • OffsetUTCMinutes
      • OffsetUTCDSTMinutes
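
The extended offset fields can presumably be derived from the scraped offset strings. The following is a minimal sketch, assuming OffsetUTCMinutes is the signed number of minutes represented by UTC offset and that offsets look like `+05:30`. The helper name `offsetToMinutes` is hypothetical and not part of this module:

```javascript
// Hypothetical helper, not part of the module's documented API.
// Converts a UTC offset string such as "+05:30" into signed minutes.
function offsetToMinutes(offset) {
  // The source table may use the Unicode minus sign (U+2212); normalize it to ASCII "-".
  const normalized = String(offset).trim().replace(/\u2212/g, '-');
  const match = /^([+-])(\d{2}):?(\d{2})$/.exec(normalized);
  if (!match) return null; // unrecognized format

  const minutes = Number(match[2]) * 60 + Number(match[3]);
  if (minutes === 0) return 0; // avoid returning -0 for "-00:00"
  return match[1] === '-' ? -minutes : minutes;
}

console.log(offsetToMinutes('+05:30'));
console.log(offsetToMinutes('\u221203:00'));
```

Returning `null` for unparseable input keeps bad rows visible instead of silently treating them as UTC.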

The module provides the following sections for each data source:

  • scraper: Parses the source page from Wikipedia and passes the result to a callback.
  • exporter: Uses the scraper to collect data and exposes different export types, such as JSON or CSV.
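
The split between the two sections can be sketched as follows. This is an illustration only: `stubScraper`, `exportAs`, and the `(err, rows)` callback shape are assumptions made for the example, not the module's actual API (see ./examples/ and ./generators/ for the real calls). The stub hands back one fixed sample row instead of parsing Wikipedia:

```javascript
// Hypothetical stand-in for a scraper: delivers fixed sample rows to the callback.
function stubScraper(callback) {
  const rows = [{ CountryCode: 'AD', TimeZone: 'Europe/Andorra' }];
  callback(null, rows);
}

// Hypothetical exporter: collects rows from a scraper and serializes them.
function exportAs(scraper, format, done) {
  scraper((err, rows) => {
    if (err) return done(err);
    if (format === 'json') return done(null, JSON.stringify(rows, null, 2));
    if (format === 'csv') {
      const headers = Object.keys(rows[0] || {});
      // Quote every value and double any embedded quotes.
      const quote = (v) => '"' + String(v == null ? '' : v).replace(/"/g, '""') + '"';
      const lines = [headers.map(quote).join(',')];
      for (const row of rows) {
        lines.push(headers.map((h) => quote(row[h])).join(','));
      }
      return done(null, lines.join('\n'));
    }
    return done(new Error('Unsupported format: ' + format));
  });
}

exportAs(stubScraper, 'csv', (err, text) => {
  if (err) throw err;
  console.log(text);
});
```

Keeping serialization out of the scraper means a new export type only touches the exporter.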

Installation

Install using npm:

$ npm install webscraper-wikipedia

Usage

Please refer to ./examples/*.js for scraper samples and ./generators/*.js for exporter samples.

Data

Generated data is located in the ./data/ folder. To refresh it, run the generators located in the ./generators/ folder. For instance, to refresh the timezone data, use the following command:

$ cd ./generators/ ; node timezone.js

To do list

Enhance the following sources:

  • timezone
    • Convert Coordinates from ISO 6709 to longitude and latitude
    • Fill in missing fields based on linked time zones
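
One possible approach to the coordinate conversion is sketched below. It assumes the Coordinates field uses the compact sign-degrees-minutes form with optional seconds (`±DDMM±DDDMM` or `±DDMMSS±DDDMMSS`), and it does not cover every form ISO 6709 allows. The function name `parseIso6709` is hypothetical:

```javascript
// Hypothetical helper: converts compact ISO 6709 strings such as "+4230+00131"
// into decimal degrees. Only the ±DDMM[SS]±DDDMM[SS] forms are handled.
function parseIso6709(coord) {
  // Normalize a Unicode minus sign (U+2212) to ASCII "-" before matching.
  const normalized = String(coord).trim().replace(/\u2212/g, '-');
  const pattern = /^([+-])(\d{2})(\d{2})(\d{2})?([+-])(\d{3})(\d{2})(\d{2})?$/;
  const match = pattern.exec(normalized);
  if (!match) return null; // unrecognized format

  // degrees + minutes/60 + seconds/3600, negated for south or west.
  const toDecimal = (sign, deg, min, sec) => {
    const value = Number(deg) + Number(min) / 60 + Number(sec || 0) / 3600;
    return sign === '-' ? -value : value;
  };

  return {
    latitude: toDecimal(match[1], match[2], match[3], match[4]),
    longitude: toDecimal(match[5], match[6], match[7], match[8]),
  };
}

console.log(parseIso6709('+4230+00131'));
```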

Add new sources:

  • countries - List of countries in ISO 3166-1
