This project sets up a data engineering practice environment on local WSL, with VS Code IDE integration.
Language Support
- Java
- Scala
- Python - TBD
Big Data Stack
- Hadoop
- Hive
- Spark
- Kafka
- HBase - TBD
- WSL installation with Ubuntu 20
- VS Code installation on Windows
- Install Remote WSL extension
- Go to the project directory in WSL, e.g. `/home/$USER/Codebases/bigdata-dev-env-setup`, and run `code .`. This downloads the VS Code server into WSL and allows remote development from Windows (see the command sketch after the extensions list below).
- Java Extension pack
- Scala Metals
- Python extension pack
- Additional extensions and settings
    - Bracket Pair Colorizer
    - Format on save and autosave
    - A shortcut key binding to convert selections to uppercase/lowercase
    - Noctis theme: https://marketplace.visualstudio.com/items?itemName=liviuschera.noctis
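The WSL/VS Code steps above can be driven entirely from the Ubuntu shell. Below is a minimal sketch; the extension IDs are the usual Marketplace identifiers and are assumptions here rather than something this repo pins.

```bash
# Open the repo in VS Code from WSL; the first run downloads the VS Code server into WSL
cd /home/$USER/Codebases/bigdata-dev-env-setup
code .

# Install the extensions listed above from the command line
# (IDs are the common Marketplace identifiers -- verify them in the Marketplace)
code --install-extension vscjava.vscode-java-pack          # Java Extension Pack
code --install-extension scalameta.metals                  # Scala (Metals)
code --install-extension ms-python.python                  # Python
code --install-extension CoenraadS.bracket-pair-colorizer  # Bracket Pair Colorizer
code --install-extension liviuschera.noctis                # Noctis theme
```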
- JDK 8 and 11 (Java 8 is needed mainly for Hive, which does not run on Java 11, see exasol/hadoop-etl-udfs#59; Java 11 is required for the VS Code Java language server to work). A sketch for installing both JDKs side by side follows this list.
- MySQL (used as the Hive metastore database and as a general-purpose RDBMS)
- Hive
- Spark
- Kafka
- HBase
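Both JDKs can coexist on Ubuntu. A minimal sketch, assuming the stock OpenJDK packages (the setup scripts may fetch the JDKs differently):

```bash
# Install both JDKs from the Ubuntu repositories (assumption: stock OpenJDK packages are acceptable)
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk openjdk-11-jdk

# Hive/Hadoop tooling gets Java 8 via JAVA_HOME...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# ...while the VS Code Java language server can be pointed at Java 11 separately
# (e.g. through the Java extension's Java home setting), so the two do not conflict.
```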
The setup uses the directories below for package downloads, installation, and data:
* DOWNLOAD_DIR="/home/$USER/bigdata-downloads"
* INSTALL_DIR="/home/$USER/bigdata-installation"
* DATA_DIR="/home/$USER/bigdata-data"

The following daemons run locally:
- mysql server
- hive metastore
- spark master
- spark worker
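For reference, a rough sketch of starting these daemons by hand, assuming Spark and Hive were unpacked under INSTALL_DIR with their default tarball names (the actual run-services.sh may differ in paths and options):

```bash
INSTALL_DIR="/home/$USER/bigdata-installation"

# MySQL, which backs the Hive metastore
sudo service mysql start

# Hive metastore (assumes hive-site.xml is already configured for the MySQL metastore)
export HIVE_HOME="$INSTALL_DIR/apache-hive-3.1.2-bin"        # assumed unpack location
nohup "$HIVE_HOME/bin/hive" --service metastore > /tmp/hive-metastore.log 2>&1 &

# Spark standalone master plus one worker
export SPARK_HOME="$INSTALL_DIR/spark-3.1.2-bin-hadoop3.2"   # assumed unpack location
"$SPARK_HOME/sbin/start-master.sh"
"$SPARK_HOME/sbin/start-worker.sh" spark://localhost:7077
```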
- The idea is to run as few daemons as possible, so the Hadoop services (YARN, HDFS) are not started; the local file system is used instead.
- Spark with Hive: Spark 3.1.2 works with Hive 3.1.2 (a quick verification sketch follows this list).
- Hive on Spark: Hive needs Spark 2.3.3 to use Spark as its execution engine (this does not work right now).
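One way to check the Spark-with-Hive integration is to run a statement through spark-sql against the MySQL-backed metastore. A sketch, reusing the assumed install locations from above (it also assumes hive-site.xml is available in $SPARK_HOME/conf):

```bash
INSTALL_DIR="/home/$USER/bigdata-installation"
export SPARK_HOME="$INSTALL_DIR/spark-3.1.2-bin-hadoop3.2"   # assumed unpack location

# Creates a table in the Hive metastore and lists it; table data lands on the
# local file system since HDFS is not running in this setup.
"$SPARK_HOME/bin/spark-sql" -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); SHOW TABLES;"
```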
To set everything up and then start the services, run:

- `bash setup-all.sh`
- `bash run-services.sh`