The second cloud computing applications course covered applications that are run on clouds for the processing of big data. It dealt topics that included storage, processing, parallelism, distribution, consensus and scalability and the key benefits and limitations of the various applications running on clouds.
Apache Spark is designed to be used with iterative algorithms and allows interactive data exploration/mining, things that traditional frameworks such as Hadoop MapReduce are not good for. Hadoop requires repeated access to to HDFS and offers no optimisation to data caching and data transfers and parallelism is limited to within each iteration. Acyclic data flows are inefficient for applications that repeatedly reuse a working set of data as is required when running iterative algorithms and when using interactive data mining tools. Spark aims to enhance programmability and is written on top of Scala and offers a modified interactive Scala interpreter. Multiple frameworks have been built on Spark, including Pregel (GraphX), Hive (SparkSQL) and Mllib (machine learning).
Spark runs on Apache Mesos or YARN in order to share resources with Hadoop and other applications. It can read from any Hadoop input source. The Spark scheduler uses Dryad-like directed acyclic graphs (DAGs) and pipeline functions and has cache-aware work reuse, locality awareness and partitioning awareness.
Spark uses Resilient Distributed Datasets (RDDs) to allow applications to keep working sets in memory for efficient reuse, rather than writing repeatedly to HDFS. Pieces of RDDs can be distributed across the nodes being used. RDDs are immutable, partitioned collections of objects that are created through parallel transformations e.g. map, filter, groupBy and joins and can be cached for efficient reuse. The queries done on RDDs are performed using lazy evaluations, meaning that they are only evaluated on actions e.g. count, output. RDDs achieve fault tolerance by maintaining lineage information that can be used to reconstruct lost partitions by re-performing failed computations.
HortonWorks is an example of a cloud distro that packages together multiple connected tools. It uses HDFS and connects this to a variety of processing tools through YARN. The processing tools include MapReduce Pig, Hive etc. Users can do interactive data analysis through Apache Zeppelin and allows for the management of Hadoop clusters through Ambari. Zeppelin can be used with Python, Spark and Scala and works in conjunction with user views generated by Tez, Hive, Pig and offers capacity scheduler views and file views (for managing, browsing and uploading to HDFS). Cloudera and MapR are alternatives to HortonWorks that are slightly more conservative in their offerings compared to HortonWorks.
Apache Hadoop HDFS is a large scale distributed storage system. An important part of HDFS is fault tolerance, which is useful as storage devices can fail. The probability of a storage device failing increases as we scale up to a greater number of storage devices. HDFS replicates files to achieve fault tolerance. Using HDFS allows for massive throughput that scales with the number of attached hard drives. It is optimised for reads, sequential writes and appends.
On HDFS, files are split into contiguous chunks of 16-64MB, with each chunk being replicated 2-3 times. One of the copies is placed on a different rack to improve fault tolerance. Applications communicate with HDFS through a Name node that has all the chunk mappings (metadata). Name nodes communicate with Data nodes that are responsible for handling the files given to them. Multiple APIs exist including Java, Python, C as well as a CLI. A HTTP browser can be used to browse the files of a HDFS instance.
Mesos was built to be a scalable global resource manager for data centres and YARN was built to scale Hadoop. YARN has a scheduler, NodeManagers and ApplicationsManagers. The ResourceTrackerService handles the membership of the system.
Week 2 Spark MapReduce, CAP Theorem, Distributed Key-Value Store, Scalable Database, Publish-Subscribe Queues
MapReduce aims to provide a framework for users to define functions and provides automatic parallelism, fault tolerance, I/O scheduling and status monitoring. It helps to solve some of the problems that would otherwise occur with Traditional Programming Models, such as deadlock, large overhead from comm. mismanagement, inability to load balance and having a framework under which it is difficult to write code. MapReduce uses distributed storage and pushes computations to where the relevant storage is. Distributed File systems is distributed storage where we have a global namespace and examples include Google GFS and Hadoop HDFS.
MapReduce is a programming model that originally comes from functional programming languages such as LISP. Many data processing problems can be phrased as a series of map and reduce functions. Map functions perform functions on key-value pairs in a dataset to create a new list of intermediate values. A reduce function takes intermediate values for a particular key and creates a set of merged output values (often just one). Example uses include word counting, Pi estimation, image smoothing, PageRank. The disadvantage of MapReduce is that there are restrictions on the set of problems that are solvable with this paradigm.
Consistency is an important concept in distributed systems. CAP theorem is the idea that you can have just two of Consistency, Availability and Partition tolerance. It is often decided that data centres should weaken consistency for faster response. Distributed computing gives rise to these problem because it operates over an unreliable network with latency and limited bandwidth, the network may not be secure, the topology may change, there may be multiple administrators, transportation has costs and the network/devices may not be homogenous. In a distributed system implemented over an unreliable network where nodes may fail, there is no way to determine whether a message has been lost or just arbitrarily delayed. There can be T-connected consistency, where system is consistent when there are no partitions, but on partitioning stale data may be returned. Once the partition heals there is then a time limit on how long before consistency is restored. Cloud services have different guaranteed properties which relate to CAP.
The ACID model is used for transactions and includes Atomicity (operations run once or as if once), Consistency (state is correct), Isolation (invisible concurrency), Durability (committed transactions persist). ACID is helpful as it means the developer does not need to worry about transactions ending up in a partially completed state and transactions don't see other partially completed concurrent transactions. However, using this model on the cloud can lead to unacceptable performance, partly due to poor scalability, so instead the BASE convention is often used. It stands for Basically Available Soft-State Services with Eventual Consistency. BASE involves executing transactions in a way that is more concurrent and less rigid than ACID, but with weaker guarantees on consistency. Basically available means that you have a fast response even if some replicas are slow or crash. The Soft-State Service means that you have no durable memory. The eventual consistency means that 'optimistic' answers can be sent to clients. Paxos and Zookeeper are used in BASE systems and were covered in the first two courses.
Where as Cassandra uses disk storage (covered in earlier course) Redis is an in-memory key-value store and stands for REmote DIctionary Server. Its data model is similar to a Python dictionary and supports not only strings, but also abstract data types such as lists of strings, sets of strings, sorted sets of string and hashes where keys and values are strings. Redis has a very fast response time as everything is in memory and it uses non-blocking I/O. Examples uses include using it as a session store and for logging.
Apache HBase is a distributed column-oriented data store built on top of HDFS and can provide storage for Hadoop Distributed Computing. Data is organised into tables, rows and columns. It provides fast record lookup, has support for record-level insertion and has support for updates - these are not features of HDFS on its own. A table is a sparse, distributed, persistent and a multidimensional sorted map. The map is indexed using a row key, column key and a timestamp. HBase has a schema consisting of multiple tables, each table has a set of column families and the columns are dynamic columns.
Spark SQL allows for Structed Data Processing in Apache Spark and is built on top of the RDD data abstraction. Spark can read data from HDFS, Hive tables, JSON, etc. and this data can be queried using SQL. A DataSet is a distributed collection of data built on top of RDDs and provides the benefits of both RDDs and Spark's SQL optimised execution engine. The Datasets can be constructed from JVM objects and the acted on using functional transformations such as map, flatMap etc. and the API is available in Scala and Java (not Python due to lack of typing). DataFrame is a Dataset organised into named columns and can be constructed from structured data files, Hive tables, external databased or existing RDDs.
Apache Kafka is used for streaming data. It takes data from one or more producers, processes it in a Kafka cluster and then sends that out to one or more consumers. Kafka is a distributed, partitioned, replicated publish-subscribe system and provides a commit log service. It maintains feeds of messages in categories called topics and each server is called a Broker and communicates using TCP. Its characteristics are that it is highly scalable, there are strong guarantees about messages (strictly ordered, persistent data) and it is distributed so that there is replication and partitioning (for fault tolerance). Each partition is replicated across a number of servers and has a leader with zero or more followers. The leaders coordinate read and write requests and ZooKeeper is used to keep servers consistent. Consumers can belong to Consumer Groups, which coordinate with each other to determine which consumer consumes from another consumer.
Streaming is used when we have a stream of events that flow into a system and we wish to have a real time view of this data. Example use cases driving this include real-time search, high frequency trading and social networks. The processing system that is used for this must be able to keep up with the event rate or disregard events gracefully, also called load shedding. Hadoop MapReduce was not able to handle this type of real-time stream processing. Cloud Streaming Engines include Apache Storm, Twitter Heron and Apache Flink.
There has been a rise in demand for for real-time data processing. Storm consists of Topologies (graph of spouts and bolts), Streams (unbounded sequence of incoming tuples), Spouts (input source) and Bolts (processes data). Stream processing tries to do things in-memory for greater performance. The Lambda Architecture involves sending data to both real-time and batch pipelines, to benefit from both real-time views from streaming and the strong guarantees from batch processing. Druid can be used to facilitate the combining of these two pipelines and make this process appear seamless to data users.
Apache Storm has guaranteed data processing (load shedding is configurable), horizontal scalability, fault tolerance and uses a higher level abstraction than message passing. Storm consists of a master node (Nimbus), cluster coordinator (Zookeeper) and worker processes (Supervisor). Spouts and bolts execute the same number of tasks and each of tasks can be spread across multiple nodes in the cluster. Tuples can be emitted to different tasks based on user programming and pre-defined strategies include shuffle grouping (random), fields grouping (consistent hashing), all grouping (to all tasks) and global grouping. Yahoo! started using stream processing due to memory costs going down, the nature of data collection and the desire the have real time views of incoming data.
Guaranteed message processing comes in three flavours: none, at-least-once using tuple trees, anchoring and spout replay, exactly-once like Hadoop or Puma. Storm uses tuple trees. Tuples that are outputted by a spout become the root of a new tree and as it is processed in various ways new nodes are added to the tree. If the tuple tree is not completed within a specified timeout, then the spout tuple is replayed. Storm uses ACKer tasks to keep track of tuple progress. Users can override the execute function to decide when to ACK the tuple. With at-least-once processing, if there is a failure then an event may be processed twice, but that requires that you have a spout that supports replay and not all messaging infrastructure allows this. Trident is used for providing exactly-once processing. State must be stored in order to achieve this and the choice of how to do this is left up to the user. The user writes an execute function that takes a TridentTuple and a TridentCollector and creates a TridentTopology.
Traditional streaming can try to achieve fault tolerance by using lambda architecture to create redundancy. Storm processes records at-least-once and with Trident exactly-once, but this can become slow. Spark Streaming uses discretized stream processing, which means creating small batches out of the incoming data. The batch size is configurable, but the window is limited to about 0.5s at the lowest and if lower latency is required a different system may be a better option. DStream sources can include Kafka, HDFS, flume, Akka Actors, files, and sockets. Every time Spark processes a micro batch it calls the updateState function in each node, which keeps state in an RDD, allowing for replaying on failure. Spark Streaming supports a rich ecosystem of big data tools, Spark SQL, Spark ML, Spark GraphX, SparkR. The disadvantage is that it is not true streaming as really just a form of batch processing.
Graphs can be stored in a graph database, which contain associative data sets that typically do not require join operators. There are different ways to represent a graphs and the optimal choice may depend on the nature of the graph e.g. how sparse it is, which operations are required. Graph computing has two basic operations, which are fusion where information is aggregated to a vertex from neighbours and diffusion where information is propagated from a vertex to its neighbours. Graph algorithms and uses include Page Rank for web, Shortest Path for transportation routes, Connected Components for citation relationships and Clustering Techniques for social networks. The increasing size of graphs that are being processed has transformed graph processing into a big data problem, which motivates the use of distributed computing strategies to handle them.
MapReduce is inefficient for graph processing because state must be stored at each stage of graph algorithms and it will produce too much inter-stage communication. Supercomputers can be used, but they are not truly scalable and aren't good for fault tolerance. Google's Pregel uses a C++ API with application code that subclasses the Vertex class and writes a compute method. It uses a master/worker model and persistent data is put into distributed storage and temporary data is stored on local disk. The master partitions input (e.g. vertices) to each worker and instructs works to perform supersteps. Fault tolerance is achieved through checkpointing, failure detection and recovery. Apache Giraph is a graph processing framework. It uses Zookeeper to coordinate and synchronise work between workers. Spark GraphX allows for the use of many parallelised graph algorithms. RDDs can be constructed of both the vertices and the edges of a graph and these can be combined into a GraphX graph.
Machine learning (ML) and data mining tools can be used for analysing, mining and summarising large data sets. They can be used to extract knowledge from past data and predict trends in future data. Machine learning is a subset of AI and has many applications including information retrieval, statistics, biology, linear algebra, marketing and sales. Examples of machine learning and data mining algorithms include collaborative filtering, clustering techniques, classification algorithms, association rules and frequent patter mining.
Apache Mahout is an open source project that can be used to perform distributed machine learning. It is designed to be as fast and efficient as possible for the algorithms it uses. Example algorithms include collaborative filtering, clustering, classification and frequent pattern mining. It can be used run the K-means algorithm, which is used for classification of data using a given number centroids K. Mahout can be used for Naïve Bayes classification, where it is given training dataset that is used to build a classification model. Frequency Pattern Mining involves using a set of items contained in multiple item-sets and finding which combination of items commonly appear together.
The Spark library can be used for ML through MLlib. It designed for ease-of-use and is scalable. Examples of applications included in the library are classification and regression (linear models, Naïve Bayes, decision trees), clustering (k-means, Gaussian mixture, power iteration clustering) and dimensionality reduction (singular value decomposition, principle component analysis).
Deep Learning involves using neural networks. An example application of deep learning is object recognition from images, where the system is trained on a training set of labelled images, which is an example of supervised learning. The error rate on object recognition has been gradually improving over time due to deep learning techniques. Other examples are training a deep learning network to play Go, perform natural language processing and performing speech recognition. The need for scalability for deep learning arises from the use of big data, having big multi-layers models and having increasingly complex models. The portion of deep learning that usually takes the most time is the training phase. Parallelising the training of the model can be done by distributing the data across multiple systems.
Different systems exist for simplifying the process of doing distributed deep learning. TensorFlow is one such system. TensorFlow is created by Google and offers a relatively simple API. Computational Network Toolkit (CNTK) is Microsoft's deep learning system. There is also Project Adam, MxNet, Caffe-on-Spark by Yahoo! and Yadan by Facebook. The pros of the systems are that they are model agnostic and offer horizontal scaling. The cons are that they are communication heavy and asynchronous models can be noisy. Another method of distributing the computation is to replicate all the data across the systems and have different systems train different parts of the model. The advantage of this approach is that it can be faster than data parallelism and it is not noisy. The disadvantage is that how well it works depends on the model architecture and there is a lack of horizontal scaling. Using sparse updates is another solution, where only selected, more important nodes are updated when the model is trained. Finally, a technique for faster convergence can be used, where there is fewer iterations, which is achieved by using non-back-propagation algorithms.