Developed by Simone Torrisi, a Computer Science student at the University of Catania
The goal of this project is to analyze real-time questions from Stack Overflow and cluster them based on the title, body, and tags associated with each question. The results are then displayed on dashboards.
You can get more information by visiting the docs, Kafka, and Spark directories.
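As a rough illustration of the clustering step, the following is a minimal Java sketch of how titles, bodies, and tags could be merged into one text column and grouped with K-Means from Spark MLlib. The input path, column names, number of hash features, and k = 5 are assumptions for the example only and are not taken from this repository, which may use a different feature pipeline or algorithm.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class QuestionClusteringSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("question-clustering-sketch")
                .getOrCreate();

        // Hypothetical input: questions with "title", "body" and "tags" columns,
        // merged into a single "text" column for feature extraction.
        Dataset<Row> questions = spark.read().json("questions.json")
                .withColumn("text", concat_ws(" ", col("title"), col("body"), col("tags")));

        // Tokenize the text and hash the tokens into term-frequency vectors.
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        HashingTF hashingTF = new HashingTF()
                .setInputCol("words").setOutputCol("features").setNumFeatures(4096);

        // Cluster the vectors with K-Means; k = 5 is an arbitrary example value.
        KMeans kmeans = new KMeans()
                .setK(5).setSeed(42L)
                .setFeaturesCol("features").setPredictionCol("cluster");

        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{tokenizer, hashingTF, kmeans});
        PipelineModel model = pipeline.fit(questions);

        // Attach a cluster id to every question.
        model.transform(questions).select("title", "cluster").show(20, false);

        spark.stop();
    }
}
```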
- Centralized service: Zookeeper
- Data Ingestion: Kafka Connect 2.4.1 with Java 11
- Data Streaming: Apache Kafka and Spark Streaming
- Data Processing: Apache Spark 3.0.0 and Spark MLlib with Java 11
- Data Indexing: Elasticsearch 7.8.0
- Data Visualization: Kibana 7.8.0
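As an example of how the streaming components connect, the sketch below reads the question stream from Kafka with Spark Structured Streaming in Java and prints the incoming records to the console. The broker address `kafkaserver:9092` and the topic name `questions` are assumptions, and the project may instead rely on the DStream-based Spark Streaming API or different settings.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToSparkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-to-spark-sketch")
                .getOrCreate();

        // Subscribe to the (assumed) "questions" topic on the (assumed) broker.
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafkaserver:9092")
                .option("subscribe", "questions")
                .load();

        // Kafka delivers key and value as binary; cast the value to a string
        // before any further parsing or feature extraction.
        Dataset<Row> questions = stream.selectExpr("CAST(value AS STRING) AS json");

        // For illustration only: write the raw records to the console.
        StreamingQuery query = questions.writeStream()
                .format("console")
                .outputMode("append")
                .start();

        query.awaitTermination();
    }
}
```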
- Apache Kafka: download it from here and put the tgz file into the Kafka/Setup directory.
- Apache Spark: download it from here and put the tgz file into the Spark/Setup directory.
To perform the initial setup, the script `initial-setup.sh` has to be executed in the main directory.
There are two options:
- Using the bash command: `bash initial-setup.sh`
- Making the script executable with `chmod +x initial-setup.sh` and then running `./initial-setup.sh`
After the previous step is completed, the project can be started by running `docker-compose up`.