Lec11: Spark Analytics
avid.cs.umass.edu/courses/645/s2018/lectures/Spark.pdf

Analytics in Spark, by Yanlei Diao and Tim Hunter. Slides courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig.

Transcript:

Page 1:

Analytics in Spark

Yanlei Diao, Tim Hunter

Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Page 2:

Outline

1. A brief history of Big Data and Spark
2. Technical summary of Spark
3. Unified analytics
4. What does unification bring?
5. Structured programming with DataFrames
6. Machine Learning with MLlib

Page 3:

2009: State-of-the-art in Big Data

Apache Hadoop
•  Open source: HDFS, HBase, MapReduce, Hive
•  Large scale, flexible data processing engine
•  Batch computation (e.g., tens of minutes to hours)

New requirements emerging
•  Iterative computations, e.g., machine learning
•  Interactive computations, e.g., ad-hoc analytics

Page 4:

Prior MapReduce Model

The MapReduce model for clusters transforms data flowing from stable storage to stable storage, e.g., Hadoop:

[Diagram: Input → Map tasks → Reduce tasks → Output, with both input and output residing in stable storage.]

Page 5:

How to Support ML Apps Better?

[Diagram: iterative and interactive applications running on the Spark core (RDD API).]

Page 6:

How to Support Iterative Apps

[Diagram: iterative and interactive applications on the Spark core (RDD API); the key idea is to share data between stages via memory.]

Page 7:

How to Support Interactive Apps

[Diagram: iterative and interactive applications on the Spark core (RDD API); the key idea is to cache data in memory.]

Page 8:

Motivation for Spark

Acyclic data flow (MapReduce) is a powerful abstraction, but it is not efficient for apps that repeatedly reuse a working set of data:
•  iterative algorithms (many in machine learning)
•  interactive data mining tools (R, Excel, Python)

Spark makes working sets a first-class concept to efficiently support these apps.
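To make this concrete, here is a minimal PySpark sketch (not from the slides) of keeping a working set in memory across iterations; the input path, parsing, and update rule are purely illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="working-set-sketch")

# Parse the dataset once and pin it in memory (the path is a placeholder).
points = (sc.textFile("hdfs://.../points.txt")
            .map(lambda line: [float(v) for v in line.split(",")])
            .cache())

# An iterative loop reuses the cached RDD; only the first pass pays the cost
# of reading and parsing the input from stable storage.
total = 0.0
for _ in range(10):
    total += points.map(lambda p: p[0]).sum()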

Page 9:

2. Technical Summary of Spark

Page 10:

What is Apache Spark?

1)  Parallel execution engine for big data
    •  Implements the BSP (Bulk Synchronous Parallel) model
2)  Data abstraction: Resilient Distributed Datasets (RDDs)
    •  Sets of objects partitioned & distributed across a cluster
    •  Stored in RAM or on disk
3)  Automatic recovery based on the lineage of bulk transformations

Page 11:

1) Bulk Synchronous Parallel (BSP) Model

In this classic model for designing parallel algorithms, computation proceeds in a series of supersteps:
•  Concurrent computation: parallel processes perform local computation
•  Communication: processes exchange data
•  Barrier synchronization: when a process reaches the barrier, it waits until all other processes have reached the same barrier
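As a rough illustration (not from the slides), here is how a simple PySpark job lines up with the BSP phases; the word data is made up.

from pyspark import SparkContext

sc = SparkContext(appName="bsp-sketch")
words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# Superstep 1 (concurrent computation): each task maps its partition locally.
pairs = words.map(lambda w: (w, 1))

# Communication + barrier: reduceByKey shuffles records by key, and the next
# stage cannot start until every task of the previous stage has finished.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Superstep 2 computes on the shuffled data; the action returns results to
# the driver.
print(counts.collect())   # e.g. [('a', 3), ('b', 2), ('c', 1)]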

Page 12:

Spark, as a BSP System

[Diagram: a job runs as a sequence of stages (super-steps); within each stage, tasks (processors) operate on an RDD's partitions in parallel, and a shuffle between stages produces the next RDD.]

Page 13:

Spark, as a BSP System

[Diagram: stages (super-steps) of parallel tasks (processors) over RDDs, separated by a shuffle.]

•  All tasks in the same stage run the same operation, with single-threaded, deterministic execution
•  Datasets (RDDs) are immutable
•  The barrier is implicit in data dependencies, such as grouping data by key

Page 14:

2) Programming Model

Resilient distributed datasets (RDDs)
•  Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
•  Partitioning can be based on a key in each record (using hash or range partitioning)
•  Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
•  Can be cached across parallel operations

Restricted shared variables
•  Accumulators, broadcast variables (see the sketch below)
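A small, hedged PySpark sketch of the two restricted shared variables: an accumulator (tasks can only add to it; the driver reads it) and a broadcast variable (a read-only copy shipped to every worker). The log format and scoring function are invented for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="shared-vars-sketch")

bad_lines = sc.accumulator(0)                    # tasks can only add to it
lookup = sc.broadcast({"ERROR": 3, "WARN": 2})   # tasks can only read it

def score(line):
    if "\t" not in line:
        bad_lines.add(1)
        return 0
    level = line.split("\t")[0]
    return lookup.value.get(level, 0)

logs = sc.parallelize(["ERROR\tdisk", "WARN\tretry", "malformed"])
total = logs.map(score).sum()

print(total)            # 5
print(bad_lines.value)  # 1 (accumulator value is read back at the driver)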

Page 15:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
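For readers working in Python, a rough PySpark equivalent of the Scala snippet above might look like this (assuming an existing SparkContext named sc; the HDFS path is a placeholder):

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2])
cachedMsgs = messages.cache()

# Interactive queries reuse the in-memory cache after the first action.
cachedMsgs.filter(lambda m: "foo" in m).count()
cachedMsgs.filter(lambda m: "bar" in m).count()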

Page 16:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                 // cached RDD

cachedMsgs.filter(_.contains("foo")).count    // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads an HDFS block (Block 1-3), applies the transformations, and caches its partition of the messages in memory (Cache 1-3). Subsequent parallel operations run against the caches, and results flow back to the driver.]

Result: full-text search of Wikipedia in < 1 sec (vs. 20 sec for on-disk data)

Page 17:

RDD Operations

Transformations (define a new RDD):
map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a result to the driver):
reduce, collect, count, save, lookupKey, …

http://spark.apache.org/docs/latest/programming-guide.html

Page 18:

Transformations(defineanewRDD)

map(func):Returnanewdistributeddatasetformedbypassingeachelementofthesourcethroughafunctionfunc. filter(func):Returnanewdatasetformedbyselectingthoseelementsofthesourceonwhichfuncreturnstrue.

flatMap(func):Similartomap,buteachinputitemcanbemappedto0ormoreoutputitems(sofuncshouldreturnaSeqratherthanasingleitem). mapPartitions(func):Similartomap,butrunsseparatelyoneachpartition(block)oftheRDD,sofuncmustbeoftypeIterator<T>=>Iterator<U>whenrunningonanRDDoftypeT.

sample:Sampleafractionfractionofthedata,withorwithoutreplacement,usingagivenrandomnumbergeneratorseed.

union(otherDataset):Returnanewdatasetthatcontainstheunionoftheelementsinthesourcedatasetandtheargument.

intersection(otherDataset):ReturnanewRDDthatistheintersectionofelementsinthesourcedatasetandtheargument.

distinct:Returnanewdatasetthatcontainsthedistinctelementsofthesourcedataset.

groupByKey:Whencalledonadatasetof(K,V)pairs,returnsadatasetof(K,Iterable<V>)pairs.Note:Toperformanaggregation(such

asasumoraverage)overeachkey,usingreduceByKeyoraggregateByKeywillyieldmuchbetterperformance. reduceByKey(func):Whencalledonadatasetof(K,V)pairs,returnsadatasetof(K,V)pairswherethevaluesforeachkeyareaggregatedusingthegivenreducefunctionfunc,whichmustbeoftype(V,V)=>V.

sort([ascending]):Whencalledonadatasetof(K,V)pairswhereKimplementsOrdered,returnsadatasetof(K,V)pairssortedbykeysinascendingordescendingorder.join(otherDataset):Whencalledondatasetsoftype(K,V)and(K,W),returnsadatasetof(K,(V,W))pairswithallpairsofelementsforeachkey.cogroup(otherDataset):Whencalledondatasetsoftype(K,V)and(K,W),returnsadatasetof(K,(Iterable<V>,Iterable<W>))tuples.
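A short PySpark sketch (not from the slides) exercising a few of these transformations on made-up data:

from pyspark import SparkContext

sc = SparkContext(appName="transformations-sketch")

raw = sc.parallelize(["sales,30", "eng,40", "eng,20"])
depts = sc.parallelize([("sales", "NYC"), ("eng", "SF")])

# map: parse each line into a (dept, age) pair.
ages = raw.map(lambda s: (s.split(",")[0], int(s.split(",")[1])))

# filter: keep only records with age >= 21.
adults = ages.filter(lambda kv: kv[1] >= 21)

# reduceByKey: aggregate values per key (here, the max age per department).
oldest = adults.reduceByKey(lambda a, b: max(a, b))

# join: combine two (K, V) datasets on their keys.
print(oldest.join(depts).collect())
# e.g. [('sales', (30, 'NYC')), ('eng', (40, 'SF'))]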

Page 19:

Actions (return a result to the driver)

count(): Return the number of elements in the dataset.
collect(): Return all the elements of the dataset as an array at the driver program.
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
take(n): Return an array with the first n elements of the dataset.
takeSample(n): Return an array with a random sample of num elements of the dataset.
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
saveAsObjectFile(path): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
lookupKey, …
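And a companion PySpark sketch for a few of the actions; unlike transformations, each of these triggers execution and returns a value to the driver (or writes output). The dataset and output path are illustrative only.

from pyspark import SparkContext

sc = SparkContext(appName="actions-sketch")
nums = sc.parallelize(range(1, 101))

print(nums.count())                      # 100
print(nums.take(3))                      # [1, 2, 3]
print(nums.reduce(lambda a, b: a + b))   # 5050 (func is commutative/associative)

# collect() brings the whole dataset to the driver; use only on small data.
small = nums.filter(lambda n: n % 25 == 0).collect()   # [25, 50, 75, 100]

# saveAsTextFile writes one text file per partition (path is a placeholder).
nums.saveAsTextFile("hdfs://.../nums-out")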

Page 20:

3) RDD Fault Tolerance

RDDs maintain lineage (like logical logging in ARIES) that can be used to reconstruct lost partitions.

Ex:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
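A hedged PySpark mirror of the example above, using toDebugString() to inspect the recorded lineage (assumes an existing SparkContext named sc; the path is a placeholder):

cachedMsgs = (sc.textFile("hdfs://...")
                .filter(lambda line: "error" in line)
                .map(lambda line: line.split("\t")[2])
                .cache())

# toDebugString() returns a description of this RDD and its dependencies:
# the HDFS-backed RDD, the filtered RDD, and the mapped RDD feeding the
# cached result. Spark replays this chain to rebuild any lost partition.
print(cachedMsgs.toDebugString())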

Page 21:

Benefits of the RDD Model

•  Consistency is easy due to immutability (no updates)
•  Fault tolerance is inexpensive (log lineage rather than replicating/checkpointing data)
•  Locality-aware scheduling of tasks on partitions
•  Despite being restricted, the model seems applicable to a broad variety of applications

Page 22:

3. Apache Spark’s Path to Unification

Page 23:

Apache Spark’s Path to Unification

Unified engine across data workloads and data sources

SQL, Streaming, ML, Graph, Batch, …

Page 24:

DataFrame API

A DataFrame is logically equivalent to a relational table.

Operators are mostly relational, with additional ones for statistical analysis, e.g., quantile, std, skew.

Popularized by R and Python/pandas, the languages of choice for data scientists.

Page 25:

Computing the average age per department with the RDD API:

pdata.map(lambda x: (x.dept, [x.age, 1])) \
     .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
     .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
     .collect()

The same computation with the DataFrame API:

data.groupBy("dept").avg("age")
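As a runnable sketch (with a toy in-memory dataset standing in for data/pdata), the DataFrame version end to end might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-avg-age").getOrCreate()

data = spark.createDataFrame(
    [("eng", 25), ("eng", 35), ("sales", 40)],
    ["dept", "age"],
)

# Declarative aggregation; Catalyst plans the same map/shuffle/aggregate
# pipeline that the hand-written RDD code spells out explicitly.
data.groupBy("dept").avg("age").show()
# Expected output (row order may vary):
# +-----+--------+
# | dept|avg(age)|
# +-----+--------+
# |  eng|    30.0|
# |sales|    40.0|
# +-----+--------+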

Page 26:

DataFrames, a Unifying Abstraction

Make DataFrames declarative; unify DataFrames and SQL.

DataFrames and SQL share the same
•  query optimizer, and
•  execution engine

Tightly integrated with the rest of Spark
•  The ML library takes DataFrames as input & output
•  Easily convert RDDs ↔ DataFrames

[Diagram: Python, Java/Scala, and R DataFrames all compile to the same logical plan, which is then optimized and executed by the common engine.]

Every optimization automatically applies to SQL and to Scala, Python and R DataFrames.
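A small sketch of what this unification means in practice: the same aggregation written with the DataFrame API and as SQL goes through the same optimizer, which explain() makes visible. The data here is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-plans").getOrCreate()

df = spark.createDataFrame([("eng", 25), ("sales", 40)], ["dept", "age"])
df.createOrReplaceTempView("people")

# Both queries print essentially the same physical plan.
df.groupBy("dept").avg("age").explain()
spark.sql("SELECT dept, AVG(age) FROM people GROUP BY dept").explain()

# Converting between RDDs and DataFrames is a one-liner in each direction.
rdd = df.rdd                       # DataFrame -> RDD of Row objects
df2 = spark.createDataFrame(rdd)   # RDD of Rows -> DataFrame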

Page 27:

Today’s Apache Spark

Iterative, interactive, batch, and streaming applications

Spark core (DataFrames API, Catalyst, RDD API)

[Diagram: on top of the Spark core sit Spark SQL (SQL), Structured Streaming, MLlib, GraphFrames, and SparkR.]
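To illustrate the earlier point that the ML library takes DataFrames as input and output, here is a minimal MLlib pipeline sketch on toy data (illustrative only, not from the slides):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)],
    ["x", "label"],
)

# Assemble feature columns into a vector, then fit a linear regression.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# transform() returns a DataFrame with a "prediction" column appended.
model.transform(train).select("x", "label", "prediction").show()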