NOTE: This scraper no longer works because maven-repository.com is down.
This Python script scrapes Maven releases from maven-repository.com and
forwards them to Kafka. The scraper script requires four arguments:
start_date, topic, bootstrap_servers and sleep_time. It scrapes all
releases back to start_date and pushes them to the Kafka topic running
on bootstrap_servers. It then repeats every sleep_time seconds, with
start_date set to the date of the latest release it has seen, i.e. it
scrapes incremental updates on Maven releases.
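The incremental behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in scraper.py: the names scrape_incrementally, fetch_since and publish are hypothetical, and the max_rounds and sleep parameters exist only so the loop can be exercised without waiting.

```python
import time
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def scrape_incrementally(fetch_since, publish, start_date, sleep_time,
                         max_rounds=None, sleep=time.sleep):
    """Repeatedly publish releases newer than start_date.

    fetch_since(start_date) is a hypothetical callable returning release
    dicts with a "date" field; publish(release) forwards one release.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        releases = list(fetch_since(start_date))
        for release in releases:
            publish(release)
        if releases:
            # The next round starts from the newest release seen so far.
            start_date = max(
                datetime.strptime(r["date"], DATE_FORMAT) for r in releases
            )
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(sleep_time)
    return start_date
```

With max_rounds left at None the loop runs until the process is stopped, which matches the "keep repeating" behaviour described above.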
Install all dependencies:
python3 -m venv venv
. ./venv/bin/activate
pip install requests BeautifulSoup4 kafka-python
usage: Scrape Maven releases to Kafka. [-h]
start_date topic bootstrap_servers
sleep_time
For example:
python scraper.py '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
This will scrape back to 2019-06-24 14:05:50 (plus incremental updates)
and push the releases to cf_mvn_releases located at localhost:29092.
Incremental updates are checked every 60 seconds.
Note: start_date must be in %Y-%m-%d %H:%M:%S format. Multiple
bootstrap servers should be comma-separated. Sleep time is in seconds.
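As an illustration of these argument rules, the four positional arguments could be validated with argparse along these lines. This is a sketch under assumptions: the helper names parse_date, parse_servers and build_parser are hypothetical and are not taken from scraper.py.

```python
import argparse
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_date(value):
    """Convert start_date to a datetime, rejecting any other format."""
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"start_date must match {DATE_FORMAT!r}, got {value!r}"
        )


def parse_servers(value):
    """Split a comma-separated list of host:port pairs."""
    return [s.strip() for s in value.split(",") if s.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="Scrape Maven releases to Kafka.")
    parser.add_argument("start_date", type=parse_date)
    parser.add_argument("topic")
    parser.add_argument("bootstrap_servers", type=parse_servers)
    parser.add_argument("sleep_time", type=int)  # seconds between checks
    return parser
```

Parsing up front means a malformed date or sleep time is reported before any scraping starts.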
Data will be sent in the following format:
{
"groupId": "com.g2forge.alexandria",
"artifactId": "alexandria",
"version": "0.0.9",
"date": "2019-06-24 14:42:49"
}
Build and run with Docker:
docker build -t mvn-scraper .
docker run mvn-scraper '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
Or run the prebuilt image:
docker run wzorgdrager/mvn-scraper '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
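For reference, a record in the JSON format listed earlier could be built and sent with kafka-python roughly as follows. This is a sketch, not the code in scraper.py: encode_release and publish are hypothetical names, and the tuple layout passed to publish is an assumption.

```python
import json
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def encode_release(group_id, artifact_id, version, released):
    """Serialise one release as UTF-8 encoded JSON."""
    record = {
        "groupId": group_id,
        "artifactId": artifact_id,
        "version": version,
        "date": released.strftime(DATE_FORMAT),
    }
    return json.dumps(record).encode("utf-8")


def publish(releases, topic, bootstrap_servers):
    """Send (group_id, artifact_id, version, released) tuples to Kafka."""
    # Imported here so encode_release can be used without kafka-python.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for release in releases:
        producer.send(topic, value=encode_release(*release))
    producer.flush()  # block until buffered records are delivered
```

For example, publish([("com.g2forge.alexandria", "alexandria", "0.0.9", datetime(2019, 6, 24, 14, 42, 49))], "cf_mvn_releases", ["localhost:29092"]) would send the sample record, assuming a broker is reachable at that address.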