Scrap Your MapReduce - Apache Spark


Transcript of Scrap Your MapReduce - Apache Spark

Page 1: Scrap Your MapReduce - Apache Spark

Lightning-fast cluster computing

Rahul Kavale([email protected])

Unmesh Joshi([email protected])

Page 2: Scrap Your MapReduce - Apache Spark


Some properties of “Big Data”

• Big data is inherently immutable, meaning it is not supposed to be updated once generated.

• The operations are mostly coarse-grained when it comes to writes.

• Commodity hardware makes more sense for the storage and computation of such enormous data; hence the data is distributed across a cluster of many such machines.

• This distributed nature makes programming complicated.

Page 3: Scrap Your MapReduce - Apache Spark


A brush-up on Hadoop concepts

Distributed Storage => HDFS

Cluster Manager => YARN

Fault tolerance => achieved via replication

Job scheduling => Scheduler in YARN

Mapper

Reducer

Combiner

Page 4: Scrap Your MapReduce - Apache Spark

http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif

Page 5: Scrap Your MapReduce - Apache Spark


Map Reduce Programming Model

Page 6: Scrap Your MapReduce - Apache Spark

https://twitter.com/francesc/status/507942534388011008

Page 7: Scrap Your MapReduce - Apache Spark

http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop

Page 8: Scrap Your MapReduce - Apache Spark


http://www.slideshare.net/JimArgeropoulos/hadoop-101-32661121

Page 9: Scrap Your MapReduce - Apache Spark


MapReduce pain points

• Considerable latency

• Only Map and Reduce phases available

• Non-trivial to test

• Results in complex workflows

• Not suitable for iterative processing

Page 10: Scrap Your MapReduce - Apache Spark


Immutability and MapReduce model

• HDFS storage is immutable, or at best append-only.

• The MapReduce model fails to exploit the immutable nature of the data.

• Intermediate results are persisted to disk, resulting in a huge amount of IO and causing a serious performance hit.

Page 11: Scrap Your MapReduce - Apache Spark


Wouldn’t it be very nice if we could have

• Low latency

• A programmer-friendly programming model

• Unified ecosystem

• Fault tolerance and other typical distributed system properties

• Easily testable code

• Of course open source :)

Page 12: Scrap Your MapReduce - Apache Spark


What is Apache Spark

• A cluster computing engine

• Abstracts the storage and cluster management

• Unified interfaces to data

• API in Scala, Python, Java, R*

Page 13: Scrap Your MapReduce - Apache Spark


Where does it fit in the existing Big Data ecosystem

http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html

Page 14: Scrap Your MapReduce - Apache Spark


Why should you care about Apache Spark

• Abstracts the underlying storage

• Abstracts cluster management

• Easy programming model

• Very easy to test the code

• Highly performant

Page 15: Scrap Your MapReduce - Apache Spark


• Petabyte sort record

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Page 16: Scrap Your MapReduce - Apache Spark


• Offers in memory caching of data

• Specialized Applications

• GraphX for graph processing

• Spark Streaming

• MLlib for machine learning

• Spark SQL

• Data exploration via Spark-Shell

Page 17: Scrap Your MapReduce - Apache Spark


Programming model for Apache Spark

Page 18: Scrap Your MapReduce - Apache Spark


Word Count example

val file = sc.textFile("input path") // sc is the SparkContext

val counts = file.flatMap(line => line.split(" "))

                 .map(word => (word, 1))

                 .reduceByKey((a, b) => a + b)

counts.saveAsTextFile("destination path")
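For intuition, the same pipeline can be sketched on an ordinary Scala collection (hypothetical input, no Spark required), since the RDD operations mirror the standard collection API, with groupBy plus a local sum standing in for reduceByKey:

```scala
// Word count over a plain Scala collection, mirroring the RDD pipeline above.
val lines = Seq("a rose is a rose", "spark is fast")

val counts = lines
  .flatMap(line => line.split(" "))    // split each line into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .groupBy { case (word, _) => word }  // local stand-in for the shuffle
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// counts contains: "a" -> 2, "rose" -> 2, "is" -> 2, "spark" -> 1, "fast" -> 1
```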

Page 19: Scrap Your MapReduce - Apache Spark


Comparing example with MapReduce

Page 20: Scrap Your MapReduce - Apache Spark


Spark Shell Demo

• SparkContext

• RDD

• RDD operations

Page 21: Scrap Your MapReduce - Apache Spark


RDD

• RDD stands for Resilient Distributed Dataset.

• The basic abstraction in Spark

Page 22: Scrap Your MapReduce - Apache Spark


• The equivalent of a distributed collection.

• The interface makes the distributed nature of the underlying data transparent.

• An RDD is immutable.

• Can be created by:

• parallelising a collection,

• transforming an existing RDD with a transformation function,

• reading from a persistent data store like HDFS.

Page 23: Scrap Your MapReduce - Apache Spark


RDDs are lazily evaluated

An RDD has two types of operations

• Transformations

Build up a DAG of transformations to be applied to the RDD

Do not evaluate anything

• Actions

Evaluate the DAG of transformations
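The transformation/action split can be imitated with Scala's lazy collection views (a plain-Scala sketch, not the Spark API): mapping over a view only records the step, and nothing runs until a terminal operation forces the result.

```scala
// Lazy-evaluation sketch: a collection view stands in for an RDD.
var evaluations = 0

val pipeline = (1 to 5).view.map { n =>
  evaluations += 1  // side effect lets us observe when work actually happens
  n * 2
}

val before = evaluations   // still 0: the "transformation" only built a plan
val result = pipeline.sum  // the "action": forces evaluation of the view
val after  = evaluations   // now 5: each element was evaluated once
```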

Page 24: Scrap Your MapReduce - Apache Spark


RDD operations

Transformations

map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]

filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]

flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]

sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)

union() : (RDD[T],RDD[T]) ⇒ RDD[T]

join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]

groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]

reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]

partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
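For intuition, groupByKey and reduceByKey have direct analogues on a local Scala pair collection (a sketch of the semantics only, not the Spark implementation):

```scala
// Local analogues of the pair transformations listed above.
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// groupByKey: RDD[(K, V)] => RDD[(K, Seq[V])]
val grouped = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

// reduceByKey(_ + _): RDD[(K, V)] => RDD[(K, V)]
val reduced = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
```

In Spark, reduceByKey additionally combines values within each partition before the shuffle, which is why it is usually preferred over groupByKey followed by a reduction.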

Page 25: Scrap Your MapReduce - Apache Spark


Actions

count() : RDD[T] ⇒ Long

collect() : RDD[T] ⇒ Seq[T]

reduce(f : (T,T) ⇒ T) : RDD[T] ⇒ T

lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)

save(path : String) : Outputs RDD to a storage system, e.g., HDFS

Page 26: Scrap Your MapReduce - Apache Spark


Job Execution

Page 27: Scrap Your MapReduce - Apache Spark


Spark Execution in Context of YARN

http://kb.cnblogs.com/page/198414/

Page 28: Scrap Your MapReduce - Apache Spark


Fault tolerance via lineage

MappedRDD

FilteredRDD

FlatMappedRDD

MappedRDD

HadoopRDD

Page 29: Scrap Your MapReduce - Apache Spark


Testing
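One reason Spark code is easy to test: the per-record logic passed to transformations can be extracted into ordinary functions and exercised on local collections, with no cluster and no SparkContext. A hypothetical sketch (the log format and function name are made up for illustration):

```scala
// Per-record logic extracted as a plain function; in Spark it would be
// passed to rdd.filter(isError), but here we exercise it on a local Seq.
def isError(line: String): Boolean = line.startsWith("ERROR")

val log = Seq("INFO start", "ERROR disk full", "INFO done")
val errors = log.filter(isError)
// errors == Seq("ERROR disk full")
```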

Page 30: Scrap Your MapReduce - Apache Spark


Why is Spark more performant than MapReduce

Page 31: Scrap Your MapReduce - Apache Spark


Reduced IO

• No disk IO between phases, since the phases themselves are pipelined

• No network IO involved unless a shuffle is required

Page 32: Scrap Your MapReduce - Apache Spark


No Mandatory Shuffle

• Programs are not bound to only map and reduce phases

• No mandatory shuffle-and-sort step required

Page 33: Scrap Your MapReduce - Apache Spark


In memory caching of data

• Optional in-memory caching

• The DAG engine can apply optimisations because, by the time an action is called, it knows the full set of transformations to be applied

Page 34: Scrap Your MapReduce - Apache Spark


Questions?

Page 35: Scrap Your MapReduce - Apache Spark


Thank You!