This project was developed for the Big Data course held by Prof. Carra Damiano at the University of Verona, as part of the Master's degree in Computer Science Engineering.
The aim of the project is to provide a guide to building, preparing, and running a Virtual Machine (VM) on a Microsoft Azure cluster. The main objective is then a step-by-step guide on how to use the VM to run Spark code that builds an inverted index.
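As an illustration of what an inverted index is, here is a minimal sketch in plain Python. It is not the project's Spark code: the function names and the sample documents are hypothetical, and the tokenization rule (lowercased `\w+` matches) is an assumption. The two stages roughly correspond to what a Spark job would express with RDD operations such as `flatMap` (emit word/page pairs) and `groupByKey` (collect the pages for each word).

```python
import re
from collections import defaultdict

# Assumed tokenization rule: runs of word characters, lowercased.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in the document."""
    return [(word, doc_id) for word in set(tokenize(text))]


def build_inverted_index(documents):
    """Group the emitted pairs by word, as a shuffle/reduce step would."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word, owner in map_phase(doc_id, text):
            index[word].add(owner)
    # Sort the page lists so the result is deterministic.
    return {word: sorted(ids) for word, ids in index.items()}


if __name__ == "__main__":
    # Hypothetical sample input: page name -> already extracted text.
    docs = {
        "page1.html": "Verona is a city in Italy",
        "page2.html": "Italy has many cities",
    }
    for word, pages in sorted(build_inverted_index(docs).items()):
        print(word, pages)
```

Each word maps to the sorted list of pages that contain it, so looking up a term is a single dictionary access. On a cluster the same grouping would be distributed across workers instead of held in one in-memory dictionary.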
The data set used is a static HTML dump of the Italian Wikipedia pages. The processing scripts work on any type of '''.html''' page, but they are tailored to Wikipedia pages.
The processing scripts are not meant to be the best or most efficient possible. They simply filter and process the pages to remove useless material.
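To give an idea of what such a cleaning step can look like, the following sketch extracts the visible text from an HTML page using only Python's standard `html.parser` module. It is an assumption about the kind of filtering involved, not the project's actual scripts: the class name, the helper `html_to_text`, and the choice to skip only `<script>` and `<style>` content are all hypothetical.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is never useful for indexing.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the content of skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><h1>Verona</h1><p>A city in <b>Italy</b>.</p></body></html>"
    )
    print(html_to_text(sample))
```

A Wikipedia-specific version would presumably also drop navigation menus, footers, and similar page furniture, which is where tailoring to one site's markup comes in.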
TODO
- Generalize and upgrade the processing scripts.
- Make it work with a system shared between multiple VMs in Azure.