This project has two sub-projects: a Scrapy project that performs the web crawling, and a Django project that displays the crawling results. The data is saved in a MySQL database.
sudo apt install mysql-server
sudo apt-get install libmysqlclient-dev
pip install -r requirements.txt
create database webcrawl;
cd webcrawlUI/webcrawlApp/
python manage.py makemigrations webcrawlUI
python manage.py migrate webcrawlUI
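For the migrations to run, the Django project must be able to reach the MySQL database created above. A minimal sketch of the DATABASES setting is shown below, assuming the database name webcrawl and the mysqlclient driver; the user, password, and host are placeholders for your own environment.

    # Sketch of the MySQL connection in the Django settings module
    # (credentials and host are assumptions; adjust to your environment).
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": "webcrawl",
            "USER": "root",               # assumed user
            "PASSWORD": "your-password",  # assumed password
            "HOST": "127.0.0.1",
            "PORT": "3306",
        }
    }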
5. Select the database in MySQL and generate the entries using the MySQL dump file provided inside the webCrawl directory.
This step is optional. It loads the data from a pre-existing MySQL dump file; otherwise you have to run the spiders manually to create the data.
use webcrawl;
source webcrawl_db.sql
cd webcrawlUI/webCrawl
scrapyd
python setup.py bdist_egg
The scrapyd-deploy command is sometimes not recognized. In that case, give the full path of the executable from your Python environment.
python /home/kiran/kiran/webCrawlProject/WebCrawl/venv/bin/scrapyd-deploy local-target -p webCrawl
{"node_name": "kiran-Inspiron-7591", "status": "ok", "project": "webCrawl", "version": "1624835529", "spiders": 3}
cd webcrawlUI
python manage.py runserver
cd webcrawlUI/webCrawl
Start the scrapyd server and execute the crawling script. The crawling process runs immediately when the script is started and then runs every day at 1 AM. The script can be modified as needed; a sketch of how such a script could work follows the commands below.
scrapyd
python webcrawl.py
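The real webcrawl.py is the source of truth for the schedule; the sketch below only illustrates one way such a script could work: trigger every spider through scrapyd's schedule.json endpoint on startup and again every day at 1 AM. The spider names and the use of the requests library are assumptions.

    # Hypothetical sketch of a scheduling script like webcrawl.py.
    # Assumes scrapyd is running on localhost:6800 and the project was
    # deployed as "webCrawl"; the spider names below are placeholders.
    import datetime
    import time
    import requests

    SCRAPYD_URL = "http://localhost:6800/schedule.json"
    PROJECT = "webCrawl"
    SPIDERS = ["states", "google_search", "website"]  # assumed spider names

    def run_all_spiders():
        for spider in SPIDERS:
            # scrapyd queues the job and returns a job id on success
            response = requests.post(SCRAPYD_URL,
                                     data={"project": PROJECT, "spider": spider})
            print(spider, response.json())

    def seconds_until(hour):
        now = datetime.datetime.now()
        next_run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += datetime.timedelta(days=1)
        return (next_run - now).total_seconds()

    if __name__ == "__main__":
        run_all_spiders()                  # crawl immediately on start
        while True:
            time.sleep(seconds_until(1))   # wait until the next 1 AM
            run_all_spiders()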
There are three web crawlers (spiders) inside the directory "webcrawlUI/webCrawl/webCrawl/spiders":
This spider creates the state and city names in the database by reading Wikipedia pages. The links to Wikipedia are provided in the file "webcrawlUI/files/states.txt". The code has to be changed if parsing fails because of a change in the Wikipedia page layout.
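As an illustration only, a spider of this kind could be structured as below; the spider name, the one-URL-per-line format of states.txt, and the CSS selectors are assumptions, and the real parsing logic depends on the current Wikipedia layout.

    # Sketch of a states/cities spider (name and selectors are assumptions).
    import scrapy

    class StatesSpider(scrapy.Spider):
        name = "states"  # assumed spider name

        def start_requests(self):
            # assumes one Wikipedia URL per line in states.txt
            with open("webcrawlUI/files/states.txt") as f:
                for url in (line.strip() for line in f if line.strip()):
                    yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # example: pull city names out of the wikitable rows on the page;
            # an item pipeline would write these items into the database
            for row in response.css("table.wikitable tr"):
                city = row.css("td a::text").get()
                if city:
                    yield {"state_url": response.url, "city": city}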
This spider sends a Google search for each city/Gemeinde (municipality) in the database and saves the top matching links into the weblinks database table. This spider is tricky to run: Google sometimes blocks the crawler, and it sometimes fails because of cookies. Make sure to run this crawler slowly, with a random delay between requests.
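A slow, randomized crawl can be enforced through standard Scrapy settings, either in settings.py or in the spider's custom_settings. The sketch below is only a suggested starting point; the class name, spider name, and concrete values are assumptions.

    # Sketch of throttling settings for the Google-search spider
    # (class name and concrete values are assumptions).
    import scrapy

    class GoogleSearchSpider(scrapy.Spider):
        name = "google_search"  # assumed spider name
        custom_settings = {
            "DOWNLOAD_DELAY": 10,              # base delay of 10 s between requests
            "RANDOMIZE_DOWNLOAD_DELAY": True,  # vary the delay by 0.5x-1.5x
            "CONCURRENT_REQUESTS": 1,          # one request at a time
            "AUTOTHROTTLE_ENABLED": True,      # back off further on slow responses
            "COOKIES_ENABLED": False,          # sidestep some cookie failures
        }
        # start_requests / parse omitted; only the throttling settings are shown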
This is the main crawler. It visits each of the Gemeinde websites, recursively crawls them (3 levels deep), and saves the HTML from each response body into the webdata table. The logic in this code can be modified for better performance in the future.
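A rough sketch of the recursive crawl is shown below. DEPTH_LIMIT is the standard Scrapy setting for capping recursion depth; the spider name is an assumption, and reading the start URLs from the weblinks table is omitted.

    # Sketch of the main recursive crawler (names are assumptions).
    import scrapy

    class WebsiteSpider(scrapy.Spider):
        name = "website"  # assumed spider name
        custom_settings = {"DEPTH_LIMIT": 3}  # stop following links after 3 levels
        # start_requests (reading the Gemeinde URLs from the weblinks table)
        # is omitted in this sketch

        def parse(self, response):
            # store the raw HTML of this page; a pipeline writes it to webdata
            yield {"url": response.url, "html": response.text}
            # follow links on the page; Scrapy enforces DEPTH_LIMIT
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)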
NOTE: In order to run the spiders manually, the commented code at the end of each spider file has to be uncommented (see the sketch below).
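The commented block at the bottom of each spider file typically looks something like the sketch below; the exact settings in the real files may differ. Uncommenting it lets you run the file directly with python.

    # Sketch of a manual-run block (the real commented code may differ).
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    if __name__ == "__main__":
        process = CrawlerProcess(get_project_settings())
        process.crawl(StatesSpider)  # replace with the spider class defined in that file
        process.start()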
1. The dynamic search field in the UI is broken: it does not autocomplete/recommend all of the city names. However, if a name is entered correctly, hitting the search button will display the result.
For any issues please write to me: [email protected]