kelson42/wikimedia_wp1_hitcounter

[ARCHIVED] Scripts to get the aggregated page views for all articles of a WM project (before the pageview API was released)
Note: This requires a lot of disk space. The raw data uses about 1 GB
per day of data; the output for 40 days of data takes about 16 GB.

1. Download the raw data
If you're running on Tool Labs, call link-month.sh with each year and month
you want to index. This will create links in source/ for each of the files in
that month.
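For example, to index June 2015 the call would look something like the
following (the argument form shown here is an assumption; check link-month.sh
for the exact arguments it expects):
	sh link-month.sh 2015 06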

Otherwise, download the hourly pagecounts-*.gz files from
http://dumps.wikimedia.org/other/pagecounts-raw/ (the projectcounts files
aren't needed).
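As a rough sketch of what that download can look like (the year/month
directory layout and the pagecounts-YYYYMMDD-HHMMSS.gz file names are
assumptions; verify them against the directory listing on
dumps.wikimedia.org), one day of hourly files for 1 June 2015 could be
fetched with:
	# Fetch the hourly pagecounts files for one day into source/.
	# Directory layout and file-name pattern are assumed; adjust as needed.
	mkdir -p source
	wget --no-parent --recursive --level=1 --no-directories \
	     --directory-prefix=source \
	     --accept 'pagecounts-20150601-*.gz' \
	     http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-06/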

2. Make the list of average daily hitcounts, which will be written to
the file hitcounts.raw.gz. Run
	sh make-raw.sh

On Tool Labs, the grid engine should be used:
	jsub -cwd -j y ./make-raw.sh
This will send the output of the command to ~/make-raw.out; you can monitor
this file with
	tail -f ~/make-raw.out
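
You can also check whether the grid engine job is still queued or running
with qstat. Once make-raw.sh has finished, a quick sanity check is to peek at
the first few lines of the output (the column layout is whatever make-raw.sh
emits; this is only a spot check, not a format specification):
	# Show your grid engine jobs.
	qstat
	# Peek at the first lines of the generated average daily hitcounts.
	zcat hitcounts.raw.gz | head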
