Authors: Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
Spark has two machine learning packages:
- `spark.mllib`: the original ML API, based on the RDD API (in maintenance mode since Spark 2.0)
- `spark.ml`: the newer API, based on DataFrames -> focuses on O(n) scale-out, so it can scale to massive amounts of data
- Transformer: accepts a DataFrame as input and returns a new DataFrame with one or more columns appended to it -> `.transform()` method
- Estimator: fits parameters from your DataFrame -> `.fit()` method -> returns a `Model`, which is a transformer
- Pipeline: organizes a series of transformers and estimators into a single model. Pipelines are estimators, but the output of `pipeline.fit()` is a `PipelineModel`, a transformer
Many algorithms in Spark require that all the input features be contained within a single vector column in your DataFrame.
`VectorAssembler` transformer: takes a list of input columns and creates a new DataFrame with an additional column (`features`) -> combines the values of those input columns into a single vector.
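A minimal sketch of that pattern, assuming a DataFrame `df` with hypothetical numeric columns (`bedrooms`, `bathrooms`, `sqft`):

```python
from pyspark.ml.feature import VectorAssembler

# Illustrative column names; swap in your own numeric input columns
vecAssembler = VectorAssembler(inputCols=["bedrooms", "bathrooms", "sqft"],
                               outputCol="features")
vecDF = vecAssembler.transform(df)  # returns a new DataFrame with a single "features" vector column
```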
- Estimators learn parameters from data -> `estimator_name.fit()` method
- eagerly evaluated (i.e., they kick off Spark jobs), whereas transformers are lazily evaluated
- examples of estimators: `Imputer`, `DecisionTreeClassifier`, and `RandomForestRegressor` (see the sketch below)
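A hedged sketch of the estimator -> model pattern, assuming hypothetical `trainDF`/`testDF` DataFrames that already contain a `features` vector and a `price` label column:

```python
from pyspark.ml.regression import LinearRegression

# Illustrative column names
lr = LinearRegression(featuresCol="features", labelCol="price")
lrModel = lr.fit(trainDF)            # eager: launches Spark jobs, returns a LinearRegressionModel
predDF = lrModel.transform(testDF)   # the fitted model is a transformer (lazily evaluated)
```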
- Pipelines specify the stages for the data to pass through, in order
- provide better code reusability and organization
- In Spark, `Pipelines` are estimators; `PipelineModels` (fitted `Pipelines`) are transformers (see the sketch below)
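A minimal sketch, reusing the hypothetical `vecAssembler` and `lr` stages from the snippets above:

```python
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[vecAssembler, lr])   # the Pipeline itself is an estimator
pipelineModel = pipeline.fit(trainDF)            # fitting returns a PipelineModel (a transformer)
predDF = pipelineModel.transform(testDF)
```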
- Common approach for handling categorical features: `StringIndexer` and `OneHotEncoder`
- “How does the `StringIndexer` handle new categories that appear in the test data set, but not in the training data set?” There is a `handleInvalid` parameter. Options are `skip` (filter out rows with invalid data), `error` (throw an error), or `keep` (put invalid data in a special additional bucket, at index `numLabels`). See the sketch below.
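A sketch of that common pattern, with a hypothetical categorical column `property_type`:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Illustrative column names
indexer = StringIndexer(inputCol="property_type",
                        outputCol="property_type_index",
                        handleInvalid="skip")    # or "error" / "keep"
encoder = OneHotEncoder(inputCols=["property_type_index"],
                        outputCols=["property_type_ohe"])
```

Both stages would typically be added to the `Pipeline` alongside the `VectorAssembler`.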
- similar to writing DataFrames: `model.write().save(path)`
- optionally: `model.write().overwrite().save(path)` to overwrite data in the path
- when loading, you need to specify the type of model you are loading back in -> always put your transformers/estimators into a `Pipeline`, so you only ever load a `PipelineModel`:
from pyspark.ml import PipelineModel
savedPipelineModel = PipelineModel.load(pipelinePath)
- `.setParallelism(N)` on the cross-validator: fit up to N models in parallel
- put the cross-validator inside the pipeline instead of putting the pipeline inside the cross-validator, so the non-tuned pipeline stages (e.g. the `VectorAssembler`) are fit only once rather than once per fold (see the sketch below)
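A hedged sketch of both ideas, reusing the hypothetical `vecAssembler`, `trainDF`, and `price` label from earlier:

```python
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor(featuresCol="features", labelCol="price")
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 5])
             .addGrid(rf.numTrees, [10, 100])
             .build())
evaluator = RegressionEvaluator(labelCol="price", metricName="rmse")

cv = (CrossValidator(estimator=rf, evaluator=evaluator,
                     estimatorParamMaps=paramGrid, numFolds=3)
      .setParallelism(4))                      # fit up to 4 models in parallel

# Cross-validator placed inside the pipeline: vecAssembler is fit only once, not once per fold
pipeline = Pipeline(stages=[vecAssembler, cv])
pipelineModel = pipeline.fit(trainDF)
```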
Ensure that you can reproduce and track the model's performance
Important aspects:
- Library versioning
- Data evolution
- Order of execution
- Parallel operations (nondeterministic outputs)
Having industry-wide standards for managing models is important so they can be easily shared with partners
MLflow: an open source platform that helps developers reproduce and share experiments, manage models, and more
- Logging API that is agnostic to the libraries and environments that actually do the training
- Organized around the concept of runs (executions of DS code)
- Runs are aggregated into experiments
- Can log:
- parameters
- metrics
- artifacts
- metadata
- models
import mlflow
import mlflow.spark

with mlflow.start_run(run_name="your-run-name") as run:
    mlflow.log_param(...)          # log a single hyperparameter as a key/value pair
    mlflow.spark.log_model(...)    # log the fitted Spark ML model (e.g. a PipelineModel)
    mlflow.log_metrics(...)        # log a dict of evaluation metrics
    mlflow.log_artifact(...)       # log an arbitrary file (plot, data sample, etc.)
- Batch: more efficient per data point -> accumulates less overhead when amortized across all predictions made
- Streaming: nice trade-off between throughput and latency
- Real-time: prioritizes latency over throughput and generates predictions in a few milliseconds
MLlib is not intended for real-time model inference (computing predictions in under 50 ms)
The majority of use cases for deploying ML models. Run a regular job to generate predictions and save the results to a table, database, data lake, etc., for downstream consumption.
MLlib's `model.transform()`: applies the model in parallel to all partitions of your DataFrame (see the sketch below).
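A sketch of a typical batch scoring job; the paths and table name are hypothetical:

```python
from pyspark.ml import PipelineModel

# Illustrative locations
pipelineModel = PipelineModel.load("/models/pipeline_model")   # or mlflow.spark.load_model(...)
inputDF = spark.read.parquet("/data/new_listings")             # new data to score
predDF = pipelineModel.transform(inputDF)                      # applied in parallel per partition
predDF.write.mode("append").saveAsTable("predictions")         # table for downstream consumers
```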
Questions to keep in mind:
- How frequently will you generate predictions? -> latency vs. throughput trade-off
- How often will you retrain the model? MLlib doesn't support online updates or warm restarts
- How will you version the model? Use the MLflow Model Registry to keep track of models and control how they transition between the Staging, Production, and Archived stages
Structured Streaming can continuously perform inference on incoming data
- more costly than batch
- benefit of generating predictions more frequently
- more complicated to maintain and monitor
- offer lower latency (but if you need really low-latency predictions, you'll need to export your model out of Spark)
Use `spark.readStream` rather than `spark.read` and change the source of the data. We need to define the schema a priori when working with streaming data (see the sketch below).
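A hedged sketch, assuming the fitted `pipelineModel` from the batch example and a hypothetical directory of incoming Parquet files:

```python
# Streaming sources need the schema defined up front (illustrative DDL string)
schema = "bedrooms DOUBLE, bathrooms DOUBLE, sqft DOUBLE"

streamDF = (spark.readStream
                 .schema(schema)
                 .parquet("/data/streaming_input"))        # hypothetical source directory

streamPredDF = pipelineModel.transform(streamDF)           # same transform() call as in batch

query = (streamPredDF.writeStream
                     .format("parquet")                    # sink choice is illustrative
                     .option("checkpointLocation", "/tmp/checkpoint")
                     .outputMode("append")
                     .start("/data/streaming_predictions"))
```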
- ONNX (Open Neural Network Exchange): open standard for machine learning interoperability
- Other third-party libraries integrate with Spark and are convenient to deploy in real-time scenarios: XGBoost and H2O.ai's Sparkling Water
- XGBoost is not technically part of MLlib, but the XGBoost4J-Spark library allows you to integrate distributed XGBoost into your MLlib pipelines
- ease of deployment: train your MLlib pipeline, then extract the XGBoost model and save it as a non-Spark model for serving in Python
- UDF = user-defined function
- Build a scikit-learn/TensorFlow model on a single machine (on a subset of your data)
- Perform distributed inference on the entire data set using Spark
- Spark 3.0: `mapInPandas()` -> apply a scikit-learn model to a Spark DataFrame in parallel (see the sketch below)
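A hedged sketch of `mapInPandas()`, assuming `sklearn_model` is a scikit-learn model already trained on the driver and the same hypothetical feature columns as above:

```python
from typing import Iterator
import pandas as pd

def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each element is a pandas DataFrame holding one batch of a partition
    for pdf in iterator:
        pdf["prediction"] = sklearn_model.predict(pdf[["bedrooms", "bathrooms", "sqft"]])
        yield pdf

predDF = df.mapInPandas(predict,
                        schema="bedrooms DOUBLE, bathrooms DOUBLE, sqft DOUBLE, prediction DOUBLE")
```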
- Joblib: a set of tools to provide lightweight pipelining in Python
- The joblibspark package provides a Spark backend to distribute tasks on a Spark cluster (see the sketch below)
pip install joblibspark
from joblibspark import register_spark
register_spark() # Register Spark backend
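A sketch of distributing a scikit-learn grid search over Spark with the joblib Spark backend; the training data `X`, `y` and the parameter grid are illustrative:

```python
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

register_spark()  # register the Spark backend with joblib

gscv = GridSearchCV(RandomForestRegressor(), {"max_depth": [2, 5, 10]}, cv=3)

with parallel_backend("spark", n_jobs=3):
    gscv.fit(X, y)   # each candidate model fit runs as a Spark task
```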
- Hyperopt: a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions
- Main ways to scale Hyperopt with Apache Spark:
  - Use single-machine Hyperopt with a distributed training algorithm (e.g. MLlib)
  - Use distributed Hyperopt with single-machine training algorithms via the `SparkTrials` class (see the sketch below)
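A minimal sketch of the second approach with `SparkTrials`; the objective here is a toy quadratic standing in for a real single-machine train-and-evaluate function:

```python
from hyperopt import SparkTrials, fmin, hp, tpe

def objective(x):
    # Stand-in for training a single-machine model and returning its validation loss
    return (x - 1) ** 2

spark_trials = SparkTrials(parallelism=4)   # run up to 4 trials concurrently as Spark tasks
best = fmin(fn=objective,
            space=hp.uniform("x", -5, 5),
            algo=tpe.suggest,
            max_evals=16,
            trials=spark_trials)
```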
- Koalas: an open source library that implements the pandas DataFrame API on top of Apache Spark. Replace any `pd` logic in your code with `ks` (Koalas) -> scale up your analyses with pandas without needing to entirely rewrite your codebase in PySpark.
- `kdf.to_spark()` = switch to using Spark APIs
- `kdf.to_pandas()` = switch to using the pandas API
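A small sketch of the pandas-style workflow on Spark; the path and columns are illustrative:

```python
import databricks.koalas as ks

kdf = ks.read_csv("/data/listings.csv")                  # pandas-style API, Spark execution
kdf["total_rooms"] = kdf["bedrooms"] + kdf["bathrooms"]  # familiar pandas syntax

sdf = kdf.to_spark()    # switch to Spark DataFrame APIs
pdf = kdf.to_pandas()   # collect to a single-machine pandas DataFrame
```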