2015 01-17 Lambda Architecture with Apache Spark, NextML Conference


Transcript of 2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

Page 1:

Learn more about Advanced Analytics at http://www.alpinenow.com

Lambda Architecture with DB Tsai [email protected] Machine Learning Engineering Lead @ Alpine Data Labs Next.ML Conference Jan 17, 2015

Page 2:

•  Batch layer: manages the complete dataset, an immutable, append-only set of raw data, using a distributed processing system.

•  Speed layer: processes data in a streaming fashion with low latency, providing real-time views from the most recent data.

•  Serving layer: stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.

Lambda Architecture
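The way the three layers fit together can be sketched in a few lines of plain Python (hypothetical view names and numbers; the point is that a query merges the batch view with the real-time view):

```python
# Minimal sketch: the serving layer answers queries by merging the
# precomputed batch view with the incremental real-time view.
batch_view = {"page_views": 1_000_000}   # produced by the batch layer
realtime_view = {"page_views": 42}       # produced by the speed layer

def query(metric):
    # Serving layer: combine both views at query time.
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("page_views"))  # 1000042
```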

Page 3:

Lambda Architecture

https://www.mapr.com/developercentral/lambda-architecture

Page 4:

•  Traditionally, different technologies are used in the batch layer and the speed layer.

•  If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala.

•  This very quickly becomes a maintenance nightmare.

Traditional Lambda Architecture

Page 5:

Unified Development Framework

Page 6:

Batch Layer

•  Empowers users to iterate through the data by utilizing the in-memory cache.

•  Logistic regression runs up to 100x faster than Hadoop M/R when the data fits in memory.

•  We're able to train exact models without doing any approximation.

Page 7:

Apache Spark: Utilizing the In-memory Cache for M/R Jobs

•  Iterative algorithms scan through the data on each iteration.

•  With Spark, the data is cached in memory after the first iteration.

•  Quasi-Newton methods enhance the in-memory benefits.

[Benchmark figure: on 150m rows, runtime drops from 921s to 97s with in-memory caching]

Page 8:

Speed Layer

•  An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

•  Spark Streaming receives streaming input and divides the data into batches, which are then processed by the Spark engine.

•  As a result, developers can maintain the same Java/Scala code in the batch and speed layers.

Page 9:

MapReduce Review

•  MapReduce – Simplified Data Processing on Large Clusters, 2004.

•  Scales Linearly

•  Data Locality

•  Fault Tolerance in Data and Computation

Page 10:

Hard Disks Failures from Google’s 2007 Study

•  1.7% of disks failed in the first year of their life.

•  Three-year-old disks were failing at a rate of 8.6%.

•  For a hypothetical eight-disk server, the probability that no disk fails in the first year is (1 - 0.017)^8 ≈ 87%.

•  The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved with commodity hardware.
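As a sanity check on the figure above, assuming independent failures at the 1.7% first-year rate, the survival probability for eight disks works out to roughly 87%:

```python
# Probability that none of 8 independent disks fails in the first year,
# given the 1.7% per-disk annual failure rate from the Google study.
p_fail = 0.017
p_all_survive = (1 - p_fail) ** 8
print(round(p_all_survive, 3))  # 0.872
```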

Page 11:

Hadoop MapReduce Review

•  Mapper: Loads the data and emits a set of key-value pairs.

•  Reducer: Collects the key-value pairs with the same key, processes them, and outputs the result.

•  Combiner: Can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.

•  Good: Built-in fault tolerance, scalable, and production-proven in industry.

•  Bad: Optimized for disk I/O without leveraging memory well; iterative algorithms go through disk I/O again and again; the primitive API is not easy or clean to develop against.
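The mapper/reducer/combiner roles above can be imitated for a word count in plain Python (a sketch of the programming model, not Hadoop's actual API):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def combiner(pairs):
    # Combine pairs locally before the shuffle to cut network traffic.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def reducer(key, values):
    # Collect all values for one key and output the final count.
    return (key, sum(values))

lines = ["to be or not to be"]

# Shuffle: group each key's values together across all map outputs.
shuffled = defaultdict(list)
for line in lines:
    for key, value in combiner(mapper(line)):
        shuffled[key].append(value)

counts = dict(reducer(k, vs) for k, vs in shuffled.items())
print(counts["to"], counts["be"])  # 2 2
```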

Page 12:

Spark MapReduce

•  Spark also uses MapReduce as its programming model, but with much richer APIs in Java, Scala, and Python.

•  With Scala's expressive APIs, 5-10x less code.

•  Not just a distributed computation framework, Spark provides several pre-built components empowering users to implement applications faster and more easily:
   - Spark Streaming
   - Spark SQL
   - MLlib (machine learning)
   - GraphX (graph processing)

Page 13:

Hadoop M/R vs Spark M/R

•  Hadoop

•  Spark

Page 14:

Supervised Learning

•  Binary classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.

•  Multiclass classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET).

•  Regression: linear least squares (SGD), lasso (SGD + soft-thresholding), ridge regression (SGD), decision trees, and random forests (Spark 1.2).

•  Currently, the regularization in the linear models penalizes all the weights, including the intercept, which is not desired in some use cases. Alpine has a GLMNET implementation using OWL-QN that can exactly reproduce the results of R's GLMNET package at scale; we're in the process of merging it into the MLlib community.

Page 15:

Unsupervised Learning

•  K-Means
•  Collaborative filtering (ALS)
•  SVD
•  PCA
•  Feature extraction and transformation

http://spark.apache.org/docs/1.2.0/mllib-guide.html

Page 16:

Resilient Distributed Datasets (RDDs)

•  An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

•  RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.

•  RDDs can be cached in memory or on disk.

Page 17:

RDD Persistence/Cache

•  An RDD can be persisted using its persist() or cache() methods. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

•  A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.

Page 18:

RDD Operations - two types of operations

•  Transformations: Create a new dataset from an existing one. They are lazy, in that they do not compute their results right away. By default, each transformed RDD may be recomputed each time you run an action on it. You may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. (Note: after transformations, the dataset can be imbalanced across executors; this can be addressed with repartition.)

•  Actions: Return a value to the driver program after running a computation on the dataset.
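A plain-Python analogy for the laziness described above, with generators standing in for transformations and materialization standing in for an action (illustrative only, not Spark's API):

```python
# Transformations build a recipe; nothing runs until an "action".
data = range(1, 6)                             # stand-in for an RDD of 1..5
doubled = (x * 2 for x in data)                # lazy, like map(lambda x: x * 2)
big = (x for x in doubled if x > 4)            # lazy, like filter(...)

# "Action": forces evaluation, like collect().
result = list(big)
print(result)  # [6, 8, 10]

# Like an uncached RDD, an exhausted generator cannot be reused;
# materializing the result (caching) makes repeated access cheap.
cached = result                                # stand-in for cache()
print(sum(cached), len(cached))  # 24 3
```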

Page 19:

Transformations

•  map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.

•  filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.

•  flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

•  mapPartitions(func) - Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

http://spark.apache.org/docs/latest/programming-guide.html#transformations
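Plain-Python equivalents of these four transformations over an ordinary list (illustrative; in Spark they run distributed across partitions):

```python
from itertools import chain

data = ["hello world", "hello spark"]

mapped = [len(s) for s in data]                       # map(func)
filtered = [s for s in data if "spark" in s]          # filter(func)
flat = list(chain.from_iterable(s.split() for s in data))  # flatMap(func)

def summarize(partition):
    # mapPartitions(func): func sees a whole partition's iterator at once,
    # e.g. to amortize per-partition setup cost.
    yield sum(len(s) for s in partition)

partitions = [data[:1], data[1:]]                     # pretend 2 partitions
per_partition = [next(summarize(p)) for p in partitions]

print(mapped, filtered, flat, per_partition)
```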

Page 20:

Actions

•  reduce(func) - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

•  collect() - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

•  count(), first(), take(n), saveAsTextFile(path), etc.

http://spark.apache.org/docs/latest/programming-guide.html#actions
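The same actions sketched with Python built-ins; note that the reduce function here is commutative and associative, as required above:

```python
from functools import reduce

data = [3, 1, 4, 1, 5]

total = reduce(lambda a, b: a + b, data)  # reduce(func): order-independent
collected = list(data)                    # collect(): bring data to the driver
print(total)           # 14
print(collected[0])    # first() -> 3
print(len(collected))  # count() -> 5
print(collected[:2])   # take(2) -> [3, 1]
```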

Page 21:

Computing the mean of data
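The code on the following slides is not preserved in this transcript; one plausible sketch of the idea in plain Python is to reduce to a (sum, count) pair and divide at the end:

```python
from functools import reduce

data = [1.0, 2.0, 3.0, 4.0]

# Aggregate (sum, count) pairs; this form parallelizes well because the
# combine step is commutative and associative.
s, n = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
              ((x, 1) for x in data))
mean = s / n
print(mean)  # 2.5
```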

Page 22:

Page 23:

Page 24:

Lab 1)

Page 25:

Spark Streaming: Discretized Streams

•  DStream is the basic abstraction provided by Spark Streaming over Spark's RDDs.

•  Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs internally.
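A plain-Python model of that relationship, with a list of micro-batches standing in for a DStream of RDDs (illustrative only):

```python
# A DStream modeled as a sequence of micro-batches (each a plain list,
# standing in for one RDD per interval).
dstream = [["a", "bb"], ["ccc"], ["dd", "e"]]

def dstream_map(stream, func):
    # An operation on a DStream applies to every underlying batch/RDD.
    return [[func(x) for x in batch] for batch in stream]

lengths = dstream_map(dstream, len)
print(lengths)  # [[1, 2], [3], [2, 1]]
```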

Page 26:

Word Count in Batch Processing

Page 27:

Word Count in Streaming Processing
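The slide's code is not preserved in the transcript; in plain Python, the stateless micro-batch idea looks like this (each batch is counted independently):

```python
from collections import Counter

# Each micro-batch is counted on its own; no state carries across batches.
batches = [["hello world"], ["hello spark", "hello streaming"]]

for i, batch in enumerate(batches):
    words = [w for line in batch for w in line.split()]
    print(i, dict(Counter(words)))
```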

Page 28:

Lab 2)

Page 29:

Lab 2)

•  Need another bash shell in Docker to run Netcat as a data server.

•  In production, people often use Kafka as the data server.

•  docker ps // to find the current container ID

•  docker exec -it <container ID> bash // to launch a new shell

Page 30:

Lab 2)

Page 31:

UpdateStateByKey Operation

The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information.

•  Define the state - The state can be of an arbitrary data type.

•  Define the state update function - Specify with a function how to update the state using the previous state and the new values from the input stream.
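The two steps above (define the state, define the update function) can be sketched in plain Python for running word counts, mimicking updateStateByKey's semantics rather than its API:

```python
def update_state(new_values, prev_state):
    # State: a running count per key. prev_state is None for unseen keys.
    return (prev_state or 0) + sum(new_values)

state = {}
# Each interval delivers (key -> list of new values).
batches = [{"a": [1, 1]}, {"a": [1], "b": [1]}]

for batch in batches:
    for key, values in batch.items():
        state[key] = update_state(values, state.get(key))

print(state)  # {'a': 3, 'b': 1}
```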

Page 32:

UpdateStateByKey Operation

Page 33:

Computing the Mean of Streaming Data

•  The current sum and count at time t have to be accessible at time (t + 1) to compute the new mean of the stream.

•  Without updateStateByKey, the operations at time t and (t + 1) are independent.

•  A checkpoint directory has to be configured so that the state persists across time.
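Following the bullets above, the state carried from time t to (t + 1) is a (sum, count) pair; a plain-Python sketch of the update (checkpointing omitted):

```python
def update_mean_state(new_values, state):
    # State: (running_sum, running_count) carried across intervals.
    s, n = state or (0.0, 0)
    return (s + sum(new_values), n + len(new_values))

state = None
for batch in [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]:
    state = update_mean_state(batch, state)
    s, n = state
    print(s / n)  # prints 1.5, then 2.0, then 3.5
```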

Page 34:

Computing the Mean of Streaming Data

Page 35:

Page 36:

Page 37:

Lab 3)

Page 38:

Online Learning Example

Page 39:

Thank you.