2015 01-17 Lambda Architecture with Apache Spark, NextML Conference


Transcript of 2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

Page 1:

Learn more about Advanced Analytics at http://www.alpinenow.com

Lambda Architecture with DB Tsai [email protected] Machine Learning Engineering Lead @ Alpine Data Labs Next.ML Conference Jan 17, 2015

Page 2:

•  Batch layer: manages the complete dataset, an immutable, append-only set of raw data, using a distributed processing system.

•  Speed layer: processes data in a streaming fashion with low latency, providing real-time views from the most recent data.

•  Serving layer: stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.

Lambda Architecture
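The way the three layers fit together can be sketched in a few lines of plain Python (hypothetical view names and numbers; the point is that a query merges the batch view with the real-time view):

```python
# Minimal sketch: the serving layer answers queries by merging the
# precomputed batch view with the incremental real-time view.
batch_view = {"page_views": 1_000_000}   # produced by the batch layer
realtime_view = {"page_views": 42}       # produced by the speed layer

def query(metric):
    # Serving layer: combine both views at query time.
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("page_views"))  # 1000042
```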

Page 3:

Lambda Architecture

https://www.mapr.com/developercentral/lambda-architecture

Page 4:

•  Traditionally, different technologies are used in the batch layer and the speed layer.

•  If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala.

•  This very quickly becomes a maintenance nightmare.

Traditional Lambda Architecture

Page 5:

Unified Development Framework

Page 6:

Batch Layer

•  Empowers users to iterate through the data by utilizing the in-memory cache.

•  Logistic regression runs up to 100x faster than Hadoop M/R when the data fits in memory.

•  We're able to train exact models without doing any approximation.

Page 7:

Apache Spark: Utilizing the In-memory Cache for M/R Jobs

•  Iterative algorithms scan through the data on each iteration.

•  With Spark, the data is cached in memory after the first iteration.

•  Quasi-Newton methods enhance the in-memory benefits.

[Benchmark figure: on 150m rows, runtime drops from 921s to 97s with in-memory caching]

Page 8:

Speed Layer

•  An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

•  Spark Streaming receives streaming input and divides the data into batches, which are then processed by the Spark engine.

•  As a result, developers can maintain the same Java/Scala code in the batch and speed layers.

Page 9:

MapReduce Review

•  MapReduce – Simplified Data Processing on Large Clusters, 2004.

•  Scales Linearly

•  Data Locality

•  Fault Tolerance in Data and Computation

Page 10:

Hard Disks Failures from Google’s 2007 Study

•  1.7% of disks failed in the first year of their life.

•  Three-year-old disks were failing at a rate of 8.6%.

•  For a hypothetical eight-disk server, the probability that no disk fails in the first year is (1 - 0.017)^8 ≈ 87%.

•  The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved with commodity hardware.
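As a sanity check on the figure above, assuming independent failures at the 1.7% first-year rate, the survival probability for eight disks works out to roughly 87%:

```python
# Probability that none of 8 independent disks fails in the first year,
# given the 1.7% per-disk annual failure rate from the Google study.
p_fail = 0.017
p_all_survive = (1 - p_fail) ** 8
print(round(p_all_survive, 3))  # 0.872
```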

Page 11:

Hadoop MapReduce Review

•  Mapper: Loads the data and emits a set of key-value pairs.

•  Reducer: Collects the key-value pairs with the same key, processes them, and outputs the result.

•  Combiner: Can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.

•  Good: Built-in fault tolerance, scalable, and production-proven in industry.

•  Bad: Optimized for disk I/O without leveraging memory well; iterative algorithms go through disk I/O again and again; the primitive API is not easy or clean to develop against.
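The mapper/reducer/combiner roles above can be imitated for a word count in plain Python (a sketch of the programming model, not Hadoop's actual API):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def combiner(pairs):
    # Combine pairs locally before the shuffle to cut network traffic.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def reducer(key, values):
    # Collect all values for one key and output the final count.
    return (key, sum(values))

lines = ["to be or not to be"]

# Shuffle: group each key's values together across all map outputs.
shuffled = defaultdict(list)
for line in lines:
    for key, value in combiner(mapper(line)):
        shuffled[key].append(value)

counts = dict(reducer(k, vs) for k, vs in shuffled.items())
print(counts["to"], counts["be"])  # 2 2
```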

Page 12:

Spark MapReduce

•  Spark also uses MapReduce as its programming model, but with much richer APIs in Java, Scala, and Python.

•  With Scala's expressive APIs, 5-10x less code.

•  Not just a distributed computation framework, Spark provides several pre-built components empowering users to implement applications faster and more easily:
   - Spark Streaming
   - Spark SQL
   - MLlib (machine learning)
   - GraphX (graph processing)

Page 13:

Hadoop M/R vs Spark M/R

•  Hadoop

•  Spark

Page 14:

Supervised Learning

•  Binary classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.

•  Multiclass classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET).

•  Regression: linear least squares (SGD), lasso (SGD + soft-thresholding), ridge regression (SGD), decision trees, and random forests (Spark 1.2).

•  Currently, the regularization in the linear models penalizes all the weights, including the intercept, which is not desired in some use cases. Alpine has a GLMNET implementation using OWL-QN that can exactly reproduce the results of R's GLMNET package at scale; we're in the process of merging it into the MLlib community.

Page 15:

Unsupervised Learning

•  K-Means
•  Collaborative filtering (ALS)
•  SVD
•  PCA
•  Feature extraction and transformation

http://spark.apache.org/docs/1.2.0/mllib-guide.html

Page 16:

Resilient Distributed Datasets (RDDs)

•  An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

•  RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.

•  RDDs can be cached in memory or on disk.

Page 17:

RDD Persistence/Cache

•  An RDD can be persisted using its persist() or cache() methods. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

•  A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.

Page 18:

RDD Operations - two types of operations

•  Transformations: Create a new dataset from an existing one. They are lazy, in that they do not compute their results right away. By default, each transformed RDD may be recomputed each time you run an action on it. You may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. (Note: after transformations, the dataset can be imbalanced across executors; this can be addressed with repartition.)

•  Actions: Return a value to the driver program after running a computation on the dataset.
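A plain-Python analogy for the laziness described above, with generators standing in for transformations and materialization standing in for an action (illustrative only, not Spark's API):

```python
# Transformations build a recipe; nothing runs until an "action".
data = range(1, 6)                             # stand-in for an RDD of 1..5
doubled = (x * 2 for x in data)                # lazy, like map(lambda x: x * 2)
big = (x for x in doubled if x > 4)            # lazy, like filter(...)

# "Action": forces evaluation, like collect().
result = list(big)
print(result)  # [6, 8, 10]

# Like an uncached RDD, an exhausted generator cannot be reused;
# materializing the result (caching) makes repeated access cheap.
cached = result                                # stand-in for cache()
print(sum(cached), len(cached))  # 24 3
```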

Page 19:

Transformations

•  map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.

•  filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.

•  flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

•  mapPartitions(func) - Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

http://spark.apache.org/docs/latest/programming-guide.html#transformations
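Plain-Python equivalents of these four transformations over an ordinary list (illustrative; in Spark they run distributed across partitions):

```python
from itertools import chain

data = ["hello world", "hello spark"]

mapped = [len(s) for s in data]                       # map(func)
filtered = [s for s in data if "spark" in s]          # filter(func)
flat = list(chain.from_iterable(s.split() for s in data))  # flatMap(func)

def summarize(partition):
    # mapPartitions(func): func sees a whole partition's iterator at once,
    # e.g. to amortize per-partition setup cost.
    yield sum(len(s) for s in partition)

partitions = [data[:1], data[1:]]                     # pretend 2 partitions
per_partition = [next(summarize(p)) for p in partitions]

print(mapped, filtered, flat, per_partition)
```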

Page 20:

Actions

•  reduce(func) - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

•  collect() - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

•  count(), first(), take(n), saveAsTextFile(path), etc.

http://spark.apache.org/docs/latest/programming-guide.html#actions
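The same actions sketched with Python built-ins; note that the reduce function here is commutative and associative, as required above:

```python
from functools import reduce

data = [3, 1, 4, 1, 5]

total = reduce(lambda a, b: a + b, data)  # reduce(func): order-independent
collected = list(data)                    # collect(): bring data to the driver
print(total)           # 14
print(collected[0])    # first() -> 3
print(len(collected))  # count() -> 5
print(collected[:2])   # take(2) -> [3, 1]
```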

Page 21:

Computing the mean of data
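The code on the following slides is not preserved in this transcript; one plausible sketch of the idea in plain Python is to reduce to a (sum, count) pair and divide at the end:

```python
from functools import reduce

data = [1.0, 2.0, 3.0, 4.0]

# Aggregate (sum, count) pairs; this form parallelizes well because the
# combine step is commutative and associative.
s, n = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
              ((x, 1) for x in data))
mean = s / n
print(mean)  # 2.5
```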

Page 22:

Page 23:

Page 24:

Lab 1)

Page 25:

Spark Streaming: Discretized Streams

•  DStream is the basic abstraction provided by Spark Streaming over Spark's RDDs.

•  Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs internally.
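A plain-Python model of that relationship, with a list of micro-batches standing in for a DStream of RDDs (illustrative only):

```python
# A DStream modeled as a sequence of micro-batches (each a plain list,
# standing in for one RDD per interval).
dstream = [["a", "bb"], ["ccc"], ["dd", "e"]]

def dstream_map(stream, func):
    # An operation on a DStream applies to every underlying batch/RDD.
    return [[func(x) for x in batch] for batch in stream]

lengths = dstream_map(dstream, len)
print(lengths)  # [[1, 2], [3], [2, 1]]
```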

Page 26:

Word Count in Batch Processing

Page 27:

Word Count in Streaming Processing
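The slide's code is not preserved in the transcript; in plain Python, the stateless micro-batch idea looks like this (each batch is counted independently):

```python
from collections import Counter

# Each micro-batch is counted on its own; no state carries across batches.
batches = [["hello world"], ["hello spark", "hello streaming"]]

for i, batch in enumerate(batches):
    words = [w for line in batch for w in line.split()]
    print(i, dict(Counter(words)))
```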

Page 28:

Lab 2)

Page 29:

Lab 2)

•  Need another bash shell in Docker to run Netcat as a data server.

•  In production, people often use Kafka as the data server.

•  docker ps // to find the current container ID

•  docker exec -it <container ID> bash // to launch a new shell

Page 30:

Lab 2)

Page 31:

UpdateStateByKey Operation

The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information.

•  Define the state - The state can be of an arbitrary data type.

•  Define the state update function - Specify with a function how to update the state using the previous state and the new values from the input stream.
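The two steps above (define the state, define the update function) can be sketched in plain Python for running word counts, mimicking updateStateByKey's semantics rather than its API:

```python
def update_state(new_values, prev_state):
    # State: a running count per key. prev_state is None for unseen keys.
    return (prev_state or 0) + sum(new_values)

state = {}
# Each interval delivers (key -> list of new values).
batches = [{"a": [1, 1]}, {"a": [1], "b": [1]}]

for batch in batches:
    for key, values in batch.items():
        state[key] = update_state(values, state.get(key))

print(state)  # {'a': 3, 'b': 1}
```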

Page 32:

UpdateStateByKey Operation

Page 33:

Computing the Mean of Streaming Data

•  The current sum and count at time t have to be accessible at time (t + 1) to compute the new mean of the stream.

•  Without updateStateByKey, the operations at time t and (t + 1) are independent.

•  A checkpoint directory has to be configured so that the state persists across time.
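Following the bullets above, the state carried from time t to (t + 1) is a (sum, count) pair; a plain-Python sketch of the update (checkpointing omitted):

```python
def update_mean_state(new_values, state):
    # State: (running_sum, running_count) carried across intervals.
    s, n = state or (0.0, 0)
    return (s + sum(new_values), n + len(new_values))

state = None
for batch in [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]:
    state = update_mean_state(batch, state)
    s, n = state
    print(s / n)  # prints 1.5, then 2.0, then 3.5
```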

Page 34:

Computing the Mean of Streaming Data

Page 35:

Page 36:

Page 37:

Lab 3)

Page 38:

Online Learning Example

Page 39:

Thank you.