Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.
- Hadoop_Setting_up_a_Single_Node_Cluster.ipynb
Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
- Hadoop_single_node_cluster_setup_Python.ipynb
Set up a single-node Hadoop cluster on Google Colab using Python
- Hadoop_minicluster.ipynb
Deploy a test Hadoop cluster with a single command and no configuration required.
- Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb
Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
- Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb
Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run another Java ML example.
- Run_Spark_on_Google_Colab.ipynb
Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version
- Spark_Standalone_Architecture_on_Google_Colab.ipynb
Explore the Spark architecture through the immersive experience of deploying a standalone setup.
- MapReduce_Primer_HelloWorld.ipynb
A MapReduce Primer with “Hello, World!”
- MapReduce_Primer_HelloWorld_bash.ipynb
A MapReduce Primer with “Hello, World!” in Bash, in just a few lines of code
- mapreduce_with_bash.ipynb An introduction to MapReduce using MapReduce Streaming and bash to create mapper and reducer
- simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
- mrjob_wordcount.ipynb A simple MapReduce job with mrjob
- Hadoop_spilling.ipynb Hadoop spilling explained
- PySpark_On_Google_Colab.ipynb
Explore the inner workings of PySpark on Google Colab
- PySpark_miscellanea.ipynb
Tips, tricks, and insights related to PySpark.
- demoSparkSQLPython.ipynb Basic PySpark demo
- ngrams_with_pyspark.ipynb
Basic example of n-grams extraction with PySpark
- generate_data_with_Faker.ipynb
Fake It Till You Make It: Generate Test Data with Faker. Create customizable fake data for testing and development using the Faker library. Useful for populating databases, simulating user activity, or prototyping applications without relying on real data.
- Encoding+dataframe+columns.ipynb
DataFrame Column Encoding with PySpark and Parquet Format
- Apache_Sedona_with_PySpark.ipynb
Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab
- GutenbergBooks.ipynb
Explore and download books from the Project Gutenberg collection.
- TestDFSio.ipynb Demo of TestDFSio for benchmarking Hadoop clusters
- Unicode.ipynb
Exploring Unicode categories
- polynomial_regression.ipynb
Worked out example of polynomial regression with numpy and matplotlib
- downloadSpark.ipynb
How to download and verify the Spark distribution
- docker_for_beginners.md Docker for beginners: an introduction to the world of containers
- Terraform for beginners.md Getting started with Terraform
- Terraform in 5 minutes A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management
- online_resources.md Online resources for learning Big Data
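The π-estimation notebooks above rely on a classic Monte Carlo idea: sample random points in the unit square and count the fraction that lands inside the quarter circle. A minimal pure-Python sketch of that idea (not the notebooks' Spark code, which distributes the sampling with an RDD):

```python
# Monte Carlo estimate of pi: the fraction of random points (x, y) in the
# unit square with x^2 + y^2 <= 1 approximates pi/4.
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    rng = random.Random(seed)  # seeded for reproducibility
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

In the Spark version, the per-sample test becomes the function passed to a parallelized map, and the final count is a reduce; only the orchestration changes, not the math.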
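The MapReduce notebooks (Streaming, bash, mrjob) all revolve around the same two-phase pattern. As a plain-Python sketch of that pattern, independent of Hadoop, a word count looks like this:

```python
# Word count expressed as explicit map and reduce phases, mirroring what
# Hadoop Streaming does with a mapper and reducer reading stdin/stdout.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    # Shuffle/sort step: group identical keys, then sum their counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["hello world", "hello mapreduce"]
pairs = [kv for line in lines for kv in mapper(line)]
print(dict(reducer(pairs)))  # → {'hello': 2, 'mapreduce': 1, 'world': 1}
```

The `sorted(...)` call stands in for Hadoop's shuffle phase; in a real cluster, that grouping happens across machines between the map and reduce stages.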
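For the polynomial regression topic, the standard numpy approach is a least-squares fit via `np.polyfit`. A short sketch under assumed synthetic data (not the notebook's actual dataset):

```python
# Fit a quadratic to noisy samples of y = 2x^2 - x + 1 using numpy.polyfit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 * x**2 - x + 1 + rng.normal(scale=0.5, size=x.shape)

# polyfit returns coefficients highest-degree first: [a, b, c] for ax^2+bx+c
coeffs = np.polyfit(x, y, deg=2)
y_fit = np.polyval(coeffs, x)
print(np.round(coeffs, 2))
```

Plotting `x` against both `y` and `y_fit` with matplotlib reproduces the usual scatter-plus-curve figure from such worked examples.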
Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file of successful executions is named action_log.txt (see also: Google Colab vs. GitHub Ubuntu Runner).
Current status:
The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps/Platform Engineering circles.