
In-memory data processing Apache Spark

Pelle Jakovits

31 October 2018, Tartu


Outline

• Disk based vs In-Memory data processing

• Introduction to Apache Spark– Resilient Distributed Datasets (RDD)

– RDD actions and transformations

– Fault tolerance

– Frameworks powered by Spark

• Advantages & Disadvantages


Memory vs Disk based processing

• In Hadoop MapReduce all input, intermediate and output data must be written to disk

• Even if the data is significantly reduced, it cannot be kept in memory between the Map and Reduce tasks

• Hadoop MapReduce is not suitable for all types of algorithms

– Iterative algorithms, graph processing, machine learning
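The cost of disk-based iteration can be sketched in plain Python (no Spark involved; the file name and the toy update step are made up for illustration). Each "MapReduce iteration" rereads the whole dataset from disk and writes it back, while the in-memory version parses it only once:

```python
import json
import os
import tempfile

def disk_iterations(path, n):
    # Each iteration reads the whole dataset from disk and writes it back,
    # like a chain of MapReduce jobs exchanging data through HDFS.
    for _ in range(n):
        with open(path) as f:
            data = json.load(f)
        data = [x + 1 for x in data]  # toy update step
        with open(path, "w") as f:
            json.dump(data, f)
    with open(path) as f:
        return json.load(f)

def memory_iterations(path, n):
    # Load once, keep the working set in memory between iterations.
    with open(path) as f:
        data = json.load(f)
    for _ in range(n):
        data = [x + 1 for x in data]
    return data

path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    json.dump([1, 2, 3], f)
print(disk_iterations(path, 5))  # [6, 7, 8]
```

Both produce the same result, but the disk version pays one full read and one full write per iteration, which is exactly the overhead iterative algorithms suffer under Hadoop MapReduce.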


In-Memory data processing frameworks

• Goal is to support computationally complex applications which can benefit from keeping intermediate data in memory

• Keep data in memory between data processing operations

• Input & Output are disk based file storage systems like HDFS


In-Memory data processing

• Data must fit into the collective memory of the cluster

• Should still support keeping data in disk

– when it would not fit into memory

– for fault tolerance

• Fault tolerance is more complicated

– The whole application is affected when data is only kept in memory

– In Hadoop, input data is replicated in HDFS and readily available; only the last Map or Reduce task needs to be repeated


Apache Spark

• MapReduce-like & in-memory data processing framework

• From Map & Reduce -> Map, Join, Co-group, Filter, Distinct, Union, Sample, ReduceByKey, etc

• Directed acyclic graph (DAG) task execution engine

– Users have more control over the data processing execution flow

• Uses the Resilient Distributed Dataset (RDD) abstraction

– Input data is loaded into RDDs

– RDD transformations and user defined functions are applied to define data processing applications


Apache Spark

• More than just a replacement for MapReduce

– Spark works with Scala, Java, Python and R

– Extended with built-in tools for SQL queries, stream processing, ML and graph processing

• Integrated with Hadoop Yarn and HDFS

• Included in many public cloud platforms alongside Hadoop MapReduce

– IBM cloud, Amazon AWS, Google Cloud, Microsoft Azure


Hadoop MapReduce vs Spark

Time per iteration (s):

– Logistic Regression: Hadoop 110, Spark 0.96

– K-Means Clustering: Hadoop 155, Spark 4.1

Source: Introduction to Spark – Patrick Wendell, Databricks

Resilient Distributed Datasets

• Collections of data objects

• Distributed across cluster

• Stored in RAM or Disk

• Immutable/Read-only

• Built through parallel transformations

• Automatically rebuilt on failures

Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html

Structure of RDDs

• Contains a number of rows

• Rows are divided into partitions

• Partitions are distributed between nodes in the cluster

• Each row is a tuple of records, similar to Apache Pig

• Can contain nested data structures
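The partitioning described above can be sketched in plain Python (this is not the Spark API; `partition` is a hypothetical helper). Rows are hash-partitioned by key, and the resulting partitions are what Spark distributes between cluster nodes:

```python
def partition(rows, num_partitions):
    # Hash-partition rows by their first field, the way an RDD's rows are
    # split into partitions that can live on different cluster nodes.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[0]) % num_partitions].append(row)
    return parts

rows = [("hi", 1), ("bye", 2), ("hi", 3)]
parts = partition(rows, 2)
# Rows sharing a key always land in the same partition, which is what
# makes per-key operations like reduceByKey possible without a full shuffle.
```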

Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html

Spark DAG execution flow

Source: http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/

Spark in Java

• A lot of additional boilerplate code related to data and function types

• There are different classes for each Tuple length (Tuple2, … , Tuple9):

Tuple2 pair = new Tuple2(a, b);

pair._1 // => a

pair._2 // => b

• In Java 8 you can use lambda functions:

JavaPairRDD<String, Integer> counts = pairs.reduceByKey( (a, b) -> a + b );

• But in older Java you must use predefined function interfaces:

– Function, Function2, Function3

– FlatMapFunction

– PairFunction


Java 7 Example - WordCount

JavaRDD<String> lines = ctx.textFile(input_folder);

JavaRDD<String> words = lines.flatMap(

new FlatMapFunction<String, String>() {

public Iterable<String> call(String line) {

return Arrays.asList(line.split(" "));

}});

JavaPairRDD<String, Integer> ones = words.mapToPair(

new PairFunction<String, String, Integer>() {

public Tuple2<String, Integer> call(String word) {

return new Tuple2<String, Integer>(word, 1);

}});

JavaPairRDD<String, Integer> counts = ones.reduceByKey(

new Function2<Integer, Integer, Integer>(){

public Integer call(Integer i1, Integer i2) {

return i1 + i2;

}});

Java 8 Example - WordCount

JavaRDD<String> lines = ctx.textFile(input_folder);

JavaRDD<String> words = lines.flatMap(

line -> Arrays.asList(line.split(" ")).iterator()

);

JavaPairRDD<String, Integer> pairs = words.mapToPair(

word -> new Tuple2<String, Integer>(word, 1)

);

JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(

(x, y) -> x + y

);


Python example - WordCount

• Word count in Spark's Python API:

lines = spark.textFile(input_folder)

words = lines.flatMap(lambda line: line.split() )

pairs = words.map(lambda word: (word, 1) )

wordCounts = pairs.reduceByKey(lambda a, b: a + b )


RDD operations

• Actions

– Creating RDD’s

– Storing RDD’s

– Extracting data from RDD

• Transformations

– Restructure or transform RDDs into new RDDs

– Apply user defined functions
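The split between lazy transformations and executing actions can be sketched in plain Python (a toy class, not the Spark API). Transformations only record work to be done; the action actually runs the recorded pipeline:

```python
class ToyRDD:
    # Minimal sketch of the action/transformation split: map() and
    # filter() only record operations, collect() executes them.
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops

    def map(self, f):      # transformation: returns a new lazy ToyRDD
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, f):   # transformation: returns a new lazy ToyRDD
        return ToyRDD(self._data, self._ops + (("filter", f),))

    def collect(self):     # action: runs the recorded operations
        out = list(self._data)
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

squares = ToyRDD([1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16]
```

Note that building `squares` does no computation at all; everything happens inside `collect()`, which mirrors Spark's lazy evaluation.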


RDD Actions


Loading Data

• Local data directly from memory:

dataset = [1, 2, 3, 4, 5]

slices = 5  # Number of partitions

distData = sc.parallelize(dataset, slices)

• External data from HDFS or the local file system:

input = sc.textFile("file.txt")

input = sc.textFile("directory/*.txt")

input = sc.textFile("hdfs://xxx:9000/path/file")


Storing data

counts.saveAsTextFile("hdfs://...");

counts.saveAsObjectFile("hdfs://...");

counts.saveAsHadoopFile(

"testfile.seq",

Text.class,

LongWritable.class,

SequenceFileOutputFormat.class

);


Extracting data from RDD

• Extract data out of distributed RDD object into driver program memory:

– collect() – Retrieve the whole RDD content as a list

– first() – Take the first element from the RDD

– take(n) – Take the first n elements from the RDD as a list


Broadcast

• Share data with every node in the Spark cluster; it can then be accessed inside Spark functions:

broadcastVar = sc.broadcast([1992, "gray", "bear"])

result = input.map(lambda line: weight_first_bc(line, broadcastVar))

• You don't have to use broadcast if the data is very small. This would also work:

globalVar = [1992, "gray", "bear"]

result = input.map(lambda line: weight_first_bc(line, globalVar))

• However, this is inefficient when the data passed along is larger (> 1 MB)

• Spark uses a BitTorrent-like protocol (TorrentBroadcast) to optimize broadcast data distribution


Other actions

• reduce(func) – Apply an aggregation function to all tuples in the RDD

• count() – Count the number of elements in the RDD

• countByKey() – Count the values for each unique key
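The countByKey() semantics can be sketched in plain Python (not the Spark API; `count_by_key` is a hypothetical stand-in): given (key, value) tuples, count how many tuples each key appears in.

```python
from collections import Counter

def count_by_key(pairs):
    # countByKey() semantics: number of occurrences of each key
    # in a collection of (key, value) tuples.
    return Counter(key for key, _ in pairs)

pairs = [("hi", 1), ("bye", 1), ("hi", 1)]
print(count_by_key(pairs))  # Counter({'hi': 2, 'bye': 1})
```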


RDD Transformations


Map

• Applies a user-defined function to every tuple in the RDD

• From the WordCount example, using a lambda function:

pairs = words.map(lambda word: (word, 1))

• Using a separately defined function:

def toPair(word):

pair = (word, 1)

return pair

pairs = words.map(toPair)


Map transformation

• pairs = words.map(lambda word: (word, 1))


FlatMap

• Similar to Map - applied to each tuple in RDD

• But can result in multiple output tuples

• From the Python WordCount example:

words = file.flatMap(lambda line: line.split())

• The user-defined function has to return a list (or another iterable)

• Each element in the returned list results in a new tuple inside the resulting RDD
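The flatMap semantics described above can be sketched in plain Python (not the Spark API; `flat_map` is a hypothetical helper): the function is applied to each element, and the returned lists are flattened so every list element becomes its own output tuple.

```python
def flat_map(func, elements):
    # flatMap semantics: apply func to each element, then flatten the
    # returned lists into a single sequence of output elements.
    return [item for element in elements for item in func(element)]

lines = ["hello world", "hello spark"]
words = flat_map(lambda line: line.split(), lines)
print(words)  # ['hello', 'world', 'hello', 'spark']
```

Compare with map, which would have produced two lists (`[['hello', 'world'], ['hello', 'spark']]`) instead of four words.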


FlatMap transformation

• words = lines.flatMap( lambda line: line.split() )


GroupBy & GroupByKey

• Restructure the RDD by grouping all the values inside the RDD

• Such restructuring is inefficient and should be avoided if possible

– It is better to use reduceByKey or aggregateByKey, which automatically apply an aggregation function on the grouped data

• The groupByKey operation uses the first value inside each RDD tuple as the grouping key


GroupByKey transformation

• Groups RDD by key, and results in a nested RDD

wordCounts = pairs.groupByKey()


ReduceByKey

• Groups all tuples in RDD by the first field in the tuple

• Applies a user defined aggregation function to all tuples inside a group

• Outputs a single tuple for each group

• From the Python WordCount example:

pairs = words.map(lambda word: (word, 1) )

wordCounts = pairs.reduceByKey(lambda a, b: a + b )


ReduceByKey


• reduceByKey() is equivalent to groupByKey() followed by a nested reduce(UDF) applied to each group

wordCounts = pairs.reduceByKey(lambda a, b: a + b)
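That relationship can be sketched in plain Python (not the Spark API; `group_by_key` and `reduce_by_key` are hypothetical helpers): grouping first, then reducing each group's value list with the user-defined function.

```python
from collections import defaultdict
from functools import reduce

def group_by_key(pairs):
    # groupByKey semantics: collect all values sharing a key into a list.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(func, pairs):
    # reduceByKey semantics: groupByKey followed by a per-group reduce.
    return {k: reduce(func, vs) for k, vs in group_by_key(pairs).items()}

pairs = [("hello", 1), ("world", 1), ("hello", 1)]
print(reduce_by_key(lambda a, b: a + b, pairs))  # {'hello': 2, 'world': 1}
```

In real Spark, reduceByKey is more efficient than this sketch suggests, since partial aggregation happens inside each partition before any data is shuffled.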


Working with Keys

• When using *ByKey transformations, Spark expects the RDD to contain (Key, Value) tuples

• If the input RDD contains longer tuples, then we need to restructure the RDD using a map() operation:

data = sc.parallelize([("hi", 1, "file1"), ("bye", 3, "file2")])

pairs = data.map(lambda t: (t[0], (t[1], t[2])))

sums = pairs.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1]))

output = sums.collect()

for (key, value) in output:

    print(key, ",", value)


Other transformations

• sample(withReplacement, fraction, seed)

• distinct([numTasks])

• union(otherDataset)

• filter(func)

• join(otherDataset, [numTasks]) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

• cogroup(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
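The join semantics above can be sketched in plain Python (not the Spark API; `join` and the sample datasets are made up for illustration): for each key present in both datasets, emit every (V, W) combination.

```python
def join(left, right):
    # join semantics for (K, V) and (K, W) datasets:
    # emit (K, (V, W)) for every pairing of values that share a key.
    right_by_key = {}
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen")]
print(join(users, orders))  # [(1, ('alice', 'book')), (1, ('alice', 'pen'))]
```

Note that key 2 produces no output because it has no match on the right side, matching inner-join behavior; cogroup would instead return the two value lists per key, including empty ones.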


Persisting/Caching data

• Spark uses Lazy evaluation

• Intermediate RDD's may be discarded to optimize memory consumption

• To force Spark to keep any intermediate data in memory, we can use:

– lineLengths.persist(StorageLevel.MEMORY_ONLY);

– Forces the RDD to be cached in memory after the first time it is computed

• NB! Caching should be used when an RDD is accessed multiple times!
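The effect of caching can be sketched in plain Python (not the Spark API; `expensive` and the counter are made up for illustration). An unpersisted lazy pipeline is recomputed on every pass, while a persisted one is materialized once:

```python
calls = {"n": 0}

def expensive(x):
    # Stand-in for a costly transformation; counts how often it runs.
    calls["n"] += 1
    return x * x

data = [1, 2, 3]

def uncached():
    # Like an unpersisted RDD: a lazy pipeline that is recomputed
    # every time an action consumes it.
    return (expensive(x) for x in data)

list(uncached())
list(uncached())
print(calls["n"])  # 6 -- recomputed on every pass

calls["n"] = 0
cached = [expensive(x) for x in data]  # like persist(): materialize once
list(cached)
list(cached)
print(calls["n"])  # 3 -- computed once, later passes reuse the result
```

This is why caching only pays off when an RDD is accessed more than once: a single pass costs the same either way.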


Persistence levels

• DISK_ONLY

• MEMORY_ONLY

• MEMORY_AND_DISK

• MEMORY_ONLY_SER

– Stores data in serialized form, which is more space-efficient

– Uses more CPU for serialization and deserialization

• MEMORY_ONLY_2

– Replicate data on 2 executors


Fault tolerance

• Faults are inevitable when running distributed applications in large clusters and repeating long-running tasks can be costly

• Fault recovery is more complicated for In-memory frameworks

– In Spark only the initial input data is replicated on HDFS

– In Hadoop MapReduce, data is replicated in HDFS, so failed tasks can easily be repeated

• Checkpointing is typically used for long running in-memory distributed applications

– Processes periodically store their memory into disk storage

– Can affect the efficiency of the application


Spark Lineage

• Lineage is the history of RDDs

• Spark keeps track of each RDD partition's lineage

– What functions were applied to produce it

– Which input data partitions were involved

• Rebuild lost RDD partitions according to lineage, using the latest still available partitions

• No performance cost if nothing fails (in comparison to checkpointing)
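Lineage-based recovery can be sketched in plain Python (not the Spark API; `build_partition` and the toy transformations are made up for illustration): as long as the input partition and the recorded transformations survive, a lost partition can simply be recomputed.

```python
def build_partition(input_partition, lineage):
    # Rebuild an RDD partition by re-applying its recorded
    # transformations (the lineage) to the still-available input.
    data = list(input_partition)
    for func in lineage:
        data = [func(x) for x in data]
    return data

input_partition = [1, 2, 3]                    # still available (e.g. in HDFS)
lineage = [lambda x: x + 1, lambda x: x * 10]  # recorded transformations

partition = build_partition(input_partition, lineage)
# ... the node holding this partition fails; its memory is lost ...
recovered = build_partition(input_partition, lineage)
print(recovered)  # [20, 30, 40]
```

Nothing extra is stored during normal execution beyond the lineage itself, which is why this approach is free when no failures occur, unlike periodic checkpointing.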


Lineage

Source: Glenn K. Lockwood, Advanced Technologies at NERSC/LBNL


Apache Spark built-in extensions

• Spark SQL – Seamlessly mix SQL queries with Spark programs

– Similar to Pig and Hive

• Spark Streaming – Apply Spark on Streaming data

• Structured Streaming – Higher level abstraction for streaming applications

• MLlib - Machine learning library

• GraphX - Spark's API for graphs and graph-parallel computation

• SparkR – Utilize Spark in R scripts


Advantages of Spark

• Much faster than Hadoop when the data fits into memory

– This also benefits all higher-level frameworks built on Spark or Hadoop MapReduce

• Support for more programming languages

– Scala, Java, Python and R

• Has a lot of built-in extensions

– DataFrames, SQL, R, ML, Streaming, Graph processing

• It is constantly being updated

• Well suited for computationally complex algorithms processing medium-to-large scale data


Disadvantages of Spark

• What if data does not fit into the memory?

• Hard to keep track of how (well) the data is distributed

• Working in Java requires a lot of boilerplate code

• Saving as text files can be very slow


Conclusion

• RDDs offer a reasonably simple and efficient programming model for a broad range of applications

• Spark provides more data manipulation operations than just Map and Reduce.

• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage

• Provides definite speedup when data fits into the collective memory

• Very large development community which has resulted in creation of many integrated tools for different types of applications


That’s all for this week

• Next week’s practice session

– Processing data with Apache Spark in Python

– Focus on RDD transformations

• Next week’s lecture

– SQL abstraction for distributed data processing

• HiveQL

• Spark SQL
