Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.
- Hadoop_Setting_up_a_Single_Node_Cluster.ipynb
Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
- Hadoop_single_node_cluster_setup_Python.ipynb
Set up a single-node Hadoop cluster on Google Colab using Python
- Hadoop_minicluster.ipynb
Deploy a test Hadoop cluster with a single command and no configuration required.
- Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb
Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
- Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb
Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run another Java ML example.
- Run_Spark_on_Google_Colab.ipynb
Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version
- Spark_Standalone_Architecture_on_Google_Colab.ipynb
Explore the Spark architecture through the immersive experience of deploying a standalone setup.
- MapReduce_Primer_HelloWorld.ipynb
A MapReduce Primer with “Hello, World!”
- MapReduce_Primer_HelloWorld_bash.ipynb
A MapReduce Primer with “Hello, World!” in Bash, in just a few lines of code
- mapreduce_with_bash.ipynb An introduction to MapReduce using MapReduce Streaming and bash to create mapper and reducer
- simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
- mrjob_wordcount.ipynb A simple MapReduce job with mrjob
- Hadoop_spilling.ipynb Hadoop spilling explained
- PySpark_On_Google_Colab.ipynb
Explore the inner workings of PySpark on Google Colab
- PySpark_miscellanea.ipynb
Tips, tricks, and insights related to PySpark.
- demoSparkSQLPython.ipynb Basic PySpark demo
- ngrams_with_pyspark.ipynb
Basic example of n-grams extraction with PySpark
- generate_data_with_Faker.ipynb
Fake It Till You Make It: Generate Test Data with Faker. Create customizable fake data for testing and development using the Faker library. Useful for populating databases, simulating user activity, or prototyping applications without relying on real data.
- Encoding+dataframe+columns.ipynb
DataFrame Column Encoding with PySpark and Parquet Format
- Apache_Sedona_with_PySpark.ipynb
Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab
- GutenbergBooks.ipynb
Explore and download books from the Project Gutenberg collection.
- TestDFSio.ipynb Demo of TestDFSio for benchmarking Hadoop clusters
- Unicode.ipynb
Exploring Unicode categories
- polynomial_regression.ipynb
Worked out example of polynomial regression with numpy and matplotlib
- downloadSpark.ipynb
How to download and verify the Spark distribution
- docker_for_beginners.md Docker for beginners: an introduction to the world of containers
- Terraform for beginners.md Getting started with Terraform
- Terraform in 5 minutes A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management
- online_resources.md Online resources for learning Big Data
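The π-estimation notebooks above rely on a classic Monte Carlo idea: sample random points in the unit square and count the fraction that lands inside the quarter circle. A minimal pure-Python sketch of that idea (not the notebooks' Spark code, which distributes the sampling with an RDD):

```python
# Monte Carlo estimate of pi: the fraction of random points (x, y) in the
# unit square with x^2 + y^2 <= 1 approximates pi/4.
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    rng = random.Random(seed)  # seeded for reproducibility
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

In the Spark version, the per-sample test becomes the function passed to a parallelized map, and the final count is a reduce; only the orchestration changes, not the math.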
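The MapReduce notebooks (Streaming, bash, mrjob) all revolve around the same two-phase pattern. As a plain-Python sketch of that pattern, independent of Hadoop, a word count looks like this:

```python
# Word count expressed as explicit map and reduce phases, mirroring what
# Hadoop Streaming does with a mapper and reducer reading stdin/stdout.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    # Shuffle/sort step: group identical keys, then sum their counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["hello world", "hello mapreduce"]
pairs = [kv for line in lines for kv in mapper(line)]
print(dict(reducer(pairs)))  # → {'hello': 2, 'mapreduce': 1, 'world': 1}
```

The `sorted(...)` call stands in for Hadoop's shuffle phase; in a real cluster, that grouping happens across machines between the map and reduce stages.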
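For the polynomial regression topic, the standard numpy approach is a least-squares fit via `np.polyfit`. A short sketch under assumed synthetic data (not the notebook's actual dataset):

```python
# Fit a quadratic to noisy samples of y = 2x^2 - x + 1 using numpy.polyfit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 * x**2 - x + 1 + rng.normal(scale=0.5, size=x.shape)

# polyfit returns coefficients highest-degree first: [a, b, c] for ax^2+bx+c
coeffs = np.polyfit(x, y, deg=2)
y_fit = np.polyval(coeffs, x)
print(np.round(coeffs, 2))
```

Plotting `x` against both `y` and `y_fit` with matplotlib reproduces the usual scatter-plus-curve figure from such worked examples.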
Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file of successful executions is named action_log.txt (see also: Google Colab vs. GitHub Ubuntu Runner).
Current status:
The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps/Platform Engineering circles.