
Analytics in Spark

Yanlei Diao, Tim Hunter

Slides courtesy of Ion Stoica, Matei Zaharia, and Brooke Wenig

Outline

1. A brief history of Big Data and Spark
2. Technical summary of Spark
3. Unified analytics
4. What does unification bring?
5. Structured programming with DataFrames
6. Machine Learning with MLlib

2009: State of the Art in Big Data: Apache Hadoop

• Open source: HDFS, HBase, MapReduce, Hive
• Large-scale, flexible data processing engine
• Batch computation (e.g., tens of minutes to hours)

New requirements were emerging:
• Iterative computations, e.g., machine learning
• Interactive computations, e.g., ad-hoc analytics


Prior MapReduce Model

The MapReduce model for clusters transforms data flowing from stable storage to stable storage, e.g., Hadoop:

[Figure: Input → Map tasks → Reduce tasks → Output]
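To make the dataflow concrete, here is a toy sketch of the model over plain Scala collections. This is our own illustration, not Hadoop's API: the object and function names are hypothetical, and a real Hadoop job reads from and writes to stable storage such as HDFS.

object ToyMapReduce {
  // Map phase: emit (word, 1) for every word in a line.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.map(word => (word, 1))

  // Reduce phase: sum all counts observed for one key.
  def reducer(key: String, values: Seq[Int]): (String, Int) =
    (key, values.sum)

  def main(args: Array[String]): Unit = {
    val input   = Seq("the quick brown fox", "the lazy dog")  // stand-in for stable storage
    val mapped  = input.flatMap(mapper)                       // map tasks
    val grouped = mapped.groupBy(_._1)                        // shuffle: group by key
    val reduced = grouped.map { case (k, kvs) => reducer(k, kvs.map(_._2)) }
    reduced.foreach(println)                                  // e.g. (the,2), (fox,1), ...
  }
}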

How to Support ML Apps Better?

Iterative and interactive applications both run on the Spark core (RDD API):

• Iterative apps: share data between stages via memory
• Interactive apps: cache data in memory

Motivation for Spark

Acyclic data flow (MapReduce) is a powerful abstraction, but it is not efficient for apps that repeatedly reuse a working set of data:
• iterative algorithms (many in machine learning)
• interactive data mining tools (R, Excel, Python)

Spark makes working sets a first-class concept to efficiently support these apps, as sketched below.
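A minimal sketch of an iterative algorithm that benefits from an in-memory working set: gradient descent for a one-variable linear model. The data points, learning rate, and app name here are made up for illustration; the point is that cache() keeps the working set in memory across iterations.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-demo").setMaster("local[*]"))

    // Toy (x, y) points; a real job would load its working set from HDFS.
    val points = sc.parallelize(Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2))).cache()

    var w = 0.0
    for (_ <- 1 to 100) {
      // Every iteration rescans the cached points instead of re-reading storage.
      val grad = points.map { case (x, y) => 2.0 * (w * x - y) * x }.mean()
      w -= 0.1 * grad
    }
    println(s"learned w = $w")  // converges near 2.0 on this toy data
    sc.stop()
  }
}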

2. Technical Summary of Spark

What is Apache Spark?

1) Parallel execution engine for big data
• Implements the BSP (Bulk Synchronous Parallel) model

2) Data abstraction: Resilient Distributed Datasets (RDDs)
• Sets of objects partitioned & distributed across a cluster
• Stored in RAM or on disk

3) Automatic recovery based on the lineage of bulk transformations

1) The Bulk Synchronous Parallel (BSP) Model

In this classic model for designing parallel algorithms, computation proceeds in a series of supersteps:
• Concurrent computation: parallel processes perform local computation
• Communication: processes exchange data
• Barrier synchronization: when a process reaches the barrier, it waits until all other processes have reached the same barrier

Spark, as a BSP System

[Figure: two stages (super-steps), each a set of tasks (processors) operating on an RDD in parallel, separated by a shuffle]

• All tasks in the same stage run the same operation, with single-threaded, deterministic execution
• The dataset is immutable
• The barrier is implicit in a data dependency, such as grouping data by key
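A small sketch of how this plays out in code, assuming an existing SparkContext sc: a shuffle operation such as reduceByKey marks the stage boundary, i.e., the BSP barrier.

// Stage 1: narrow, per-partition work; all tasks run the same operation.
val pairs = sc.parallelize(Seq("a b", "a c", "b c"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// The shuffle behind reduceByKey is the barrier: stage 2 starts only
// after every stage-1 task has finished writing its shuffle output.
val counts = pairs.reduceByKey(_ + _)

counts.collect()  // Array((a,2), (b,2), (c,2))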

2) Programming Model

Resilient distributed datasets (RDDs):
• Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
• Can be cached across parallel operations

Restricted shared variables:
• Accumulators, broadcast variables (see the sketch below)
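A minimal sketch of both kinds of restricted shared variable, assuming an existing SparkContext sc; the blacklist contents are made up.

// Broadcast variable: a read-only value shipped once to each worker.
val blacklist = sc.broadcast(Set("spam", "junk"))

// Accumulator: workers only add to it; the driver reads the total.
val dropped = sc.longAccumulator("dropped")

val words = sc.parallelize(Seq("news", "spam", "mail", "junk"))
val kept = words.filter { w =>
  val bad = blacklist.value.contains(w)
  if (bad) dropped.add(1)
  !bad
}

kept.collect()          // Array(news, mail)
println(dropped.value)  // 2; reliable here because the action above has run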

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

Example: Log Mining (execution)

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                 // cached RDD

cachedMsgs.filter(_.contains("foo")).count    // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to workers and collects results; each worker reads one block of the file (Block 1-3) and caches its partition of messages (Cache 1-3)]

Result: full-text search of Wikipedia in < 1 sec (vs. 20 sec for on-disk data)

RDD Operations

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a result to the driver): reduce, collect, count, save, lookup(key), …

http://spark.apache.org/docs/latest/programming-guide.html

Transformations (define a new RDD)

• map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
• filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
• sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
• union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
• intersection(otherDataset): Return a new RDD that is the intersection of elements in the source dataset and the argument.
• distinct(): Return a new dataset that contains the distinct elements of the source dataset.
• groupByKey(): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance (see the sketch below).
• reduceByKey(func): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
• sortByKey([ascending]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
• join(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
• cogroup(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
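A short sketch of the groupByKey-vs-reduceByKey note above, assuming an existing SparkContext sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every individual value across the network before summing.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates map-side, so far less data is shuffled.
val viaReduce = pairs.reduceByKey(_ + _)

viaReduce.collect()  // Array((a,4), (b,2))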

Actions (return a result to the driver)

• count(): Return the number of elements in the dataset.
• collect(): Return all the elements of the dataset as an array at the driver program.
• reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
• take(n): Return an array with the first n elements of the dataset.
• takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset.
• saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
• saveAsObjectFile(path): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
• lookup(key), …
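A quick sketch exercising a few of these actions, assuming an existing SparkContext sc; the output path is made up.

val nums = sc.parallelize(1 to 100)

nums.count()                                 // 100
nums.reduce(_ + _)                           // 5050; + is commutative and associative
nums.take(3)                                 // Array(1, 2, 3)
nums.takeSample(withReplacement = false, 5)  // 5 random elements
nums.saveAsTextFile("counts-out")            // one text file per partition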

3) RDD Fault Tolerance

RDDs maintain lineage (like logical logging in ARIES) that can be used to reconstruct lost partitions.

Ex:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
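One way to see the lineage Spark records is toDebugString, which prints an RDD's dependency chain; a minimal sketch, assuming an existing SparkContext sc:

val cachedMsgs = sc.textFile("hdfs://...")
  .filter(_.contains("error"))
  .map(_.split('\t')(2))
  .cache()

// Prints the chain of RDDs back to the input file; if a cached
// partition is lost, Spark replays exactly this chain to rebuild it.
println(cachedMsgs.toDebugString)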

Benefits of the RDD Model

• Consistency is easy due to immutability (no updates)
• Fault tolerance is inexpensive (log lineage rather than replicating/checkpointing data)
• Locality-aware scheduling of tasks on partitions
• Despite being restricted, the model seems applicable to a broad variety of applications

3. Apache Spark's Path to Unification

Unified engine across data workloads and data sources: SQL, streaming, ML, graph, batch, …

DataFrame API

• A DataFrame is logically equivalent to a relational table
• Operators are mostly relational, with additional ones for statistical analysis, e.g., quantile, std, skew
• Popularized by R and Python/pandas, the languages of choice for data scientists

Computing the average age per department with the RDD API (Python):

pdata.map(lambda x: (x.dept, [x.age, 1])) \
     .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
     .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
     .collect()

versus with the DataFrame API:

data.groupBy("dept").avg("age")
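A self-contained sketch of the DataFrame version in Scala; the department rows and app name are made up stand-ins for the slide's `data`.

import org.apache.spark.sql.SparkSession

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the `data` DataFrame on the slide.
    val data = Seq(("eng", 30), ("eng", 40), ("sales", 50)).toDF("dept", "age")

    data.groupBy("dept").avg("age").show()
    // +-----+--------+
    // | dept|avg(age)|
    // +-----+--------+
    // |  eng|    35.0|
    // |sales|    50.0|
    // +-----+--------+

    spark.stop()
  }
}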

DataFrames, a Unifying Abstraction

Make the DataFrame declarative, and unify DataFrame and SQL.

DataFrame and SQL share the same
• query optimizer, and
• execution engine

Tightly integrated with the rest of Spark:
• the ML library takes DataFrames as input & output
• easy conversion between RDDs ↔ DataFrames (see the sketch below)

[Figure: Python, Java/Scala, and R DataFrames all compile to the same logical plan, which drives a single execution engine]

Every optimization therefore automatically applies to SQL and to the Scala, Python, and R DataFrames.
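A tiny sketch of the RDD ↔ DataFrame round trip mentioned above, assuming an existing SparkSession spark; the column names are made up.

import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("eng", 30), ("sales", 50)))

val df   = rdd.toDF("dept", "age")  // RDD -> DataFrame: gains a schema and Catalyst optimization
val back = df.rdd                   // DataFrame -> RDD[Row]: back to untyped records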

Today's Apache Spark

Iterative, interactive, batch, and streaming applications

[Figure: the Spark stack: Spark core (DataFrames API, Catalyst, RDD API) underneath Spark SQL, Structured Streaming, MLlib, GraphFrames, and SparkR]