Distributed Systems
Distributed computation with Spark
Abraham Bernstein, Ph.D.
Course material based on:
- Slides by Reynold Xin, Tudor Lapusan
- Some additions by Johannes Schneider
Distributed Systems, Univ. of Zurich, Fall 2009 31/10/16
How good is Map/Reduce?
• Abstraction
  • Simple?
• Automatic distribution of (data and) tasks
• Be platform agnostic
• Performance
Map/Reduce is not so simple…
• Not easy to program directly in Map/Reduce
• Most real applications require multiple steps...• Iterative algorithms (eg. PageRank): 10’s of steps
• Analytics query (eg. count & top K): 2-5
ÞEach step one map and reduce class
ÞBoilerplate code, spaghetti like…
Higher level frameworks
• Simpler to use than Map/Reduce
• Examples: HiveQL, Pig, Spark
• Built on top of Hadoop
  • Use at least some parts of Hadoop
• (often can) generate Map/Reduce jobs
Spark
• Simpler to program
  • Nicer syntax: no explicit map/reduce classes
• Faster execution
• How? Two key points:
  • Generalized directed acyclic graphs (DAGs) of computation
  • Faster data sharing: don't write intermediate results to disk
• How to achieve fault tolerance if data is held in RAM? ⇒ RDDs
Spark Ecosystem
• Under development (Spark released 2014)
• This course: Spark Core Engine only
Resilient Distributed Dataset (RDD)
• Collection of (data) elements
  • Held on disk or in RAM
• Can be distributed on different nodes
• Programmer can “persist/cache” RDDs
  • Kept in memory for faster access
  • System can remove (delete) them from RAM if it needs the space
• RDDs are immutable
  • Transformations create new RDDs from old ones
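Immutability can be illustrated without Spark: a transformation builds a new dataset and leaves its source untouched. A minimal plain-Java sketch (class and method names are hypothetical, not Spark API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ImmutableTransform {
    // A "transformation": derives a new list, never mutates the input.
    static List<String> toUpper(List<String> source) {
        return source.stream()
                     .map(String::toUpperCase)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rdd1 = List.of("error", "ok"); // immutable, like an RDD
        List<String> rdd2 = toUpper(rdd1);          // new dataset derived from the old
        System.out.println(rdd1);                   // [error, ok] -- unchanged
        System.out.println(rdd2);                   // [ERROR, OK]
    }
}
```

Because the old dataset still exists unchanged, the system can always re-derive the new one from it, which is exactly what makes RDD recomputation possible.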
Operations on RDD
• Transformations
  • f(RDD) => RDD
  • Lazy evaluation: not computed immediately
• Actions
  • Trigger computation
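Java streams show the same lazy/eager split: intermediate operations (like Spark transformations) only describe the work, and nothing runs until a terminal operation (like a Spark action). A sketch using a counter to make the evaluation order visible (names hypothetical):

```java
import java.util.stream.Stream;

public class LazyDemo {
    static int calls = 0;

    static long[] run() {
        calls = 0;
        // "Transformation": declared, but nothing is evaluated yet
        Stream<String> filtered = Stream.of("Error a", "ok", "Error b")
                .filter(s -> { calls++; return s.contains("Error"); });
        long callsBefore = calls;   // still 0: filter has not run
        long n = filtered.count();  // "Action": triggers the whole pipeline
        return new long[]{callsBefore, n, calls};
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 0 2 3
    }
}
```

Laziness lets the engine see the whole pipeline before executing it, which is what allows Spark to build and optimize the DAG of computation.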
Transformations and Actions
• E.g. map: RDD of elements of type T => RDD of elements of type U
Reminder: Java Syntax
• Assign a function to a variable
• Pass functions as parameters
• Functional interfaces

interface FlatMapFunction<T, R> {   // T: argument type, R: return element type
    Iterable<R> call(T t);
}

FlatMapFunction<String, String> myFunc =
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    };

myFunc.call("This is first.");  // => "This", "is", "first."

public void flatMapSet(FlatMapFunction<String, String> mapper) { … }
flatMapSet(myFunc);
Example: Count lines with the word “Error” in a file
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
public class SimpleApp {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("YOUR_SPARK_HOME/log.txt");
JavaRDD<String> linesWithError =
    lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("Error"); }
    });
long nLines = linesWithError.count();
System.out.println("Lines with Errors: " + nLines);
}
}
Example: log.txt
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12
11:04 | Task 3 done
Wordcount in Spark
• The map function operates per data item/“line”
• flatMap flattens the per-item results into one list
Example input: "This is first." / "This second."

After flatMap:      This, is, first., This, second.
After mapToPair:    (This,1), (is,1), (first.,1), (This,1), (second.,1)
After reduceByKey:  (This,2), (is,1), (first.,1), (second.,1)
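The same flatMap / pair-up / reduce-by-key dataflow can be sketched with plain Java streams (no Spark dependency; `count` is a hypothetical helper, not Spark API):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCount {
    static Map<String, Long> count(Stream<String> lines) {
        return lines
            // flatMap: split each line and flatten into one stream of words
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // groupingBy + counting plays the role of mapToPair + reduceByKey
            .collect(Collectors.groupingBy(Function.identity(),
                                           Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            count(Stream.of("This is first.", "This second."));
        System.out.println(counts.get("This")); // 2
    }
}
```

In Spark the same steps run distributed over partitions; the reduce-by-key step additionally shuffles pairs so that equal keys meet on the same node.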
RDDs creation
• Create an initial RDD from some data
  • E.g. from HDFS: "hdfs://myFile.txt"

lines = sc.textFile("hdfs://myFile.txt")
RDDs during computation
lines = sc.textFile(...)
linesWithError = lines.filter(new Function<String, Boolean>() {…});
linesWithError.count();
Example: Count error messages with “SQL”, “php”,…
JavaRDD<String> linesWithError = lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("Error"); }
});

JavaRDD<List<String>> messages = linesWithError.map(new Function<String, List<String>>() {
    public List<String> call(String s) { return Arrays.asList(s.split("\\|")); } // "|" must be escaped in a regex
});
messages.cache();

JavaRDD<List<String>> msgsSQL = messages.filter(…s.contains("SQL")…);
long nSQLMsgs = msgsSQL.count();
JavaRDD<List<String>> msgsPHP = messages.filter(…s.contains("php")…);
long nPHPMsgs = msgsPHP.count();
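The point of `messages.cache()` is that both counts reuse one materialization instead of re-parsing the log twice. A plain-Java sketch of that idea as a memoized supplier (class and method names hypothetical):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CacheDemo {
    // Memoize: compute once, reuse on every later access (like rdd.cache()).
    static <T> Supplier<T> cached(Supplier<T> expensive) {
        return new Supplier<T>() {
            T value;
            boolean done;
            public T get() {
                if (!done) { value = expensive.get(); done = true; }
                return value;
            }
        };
    }

    static long[] run() {
        AtomicInteger computations = new AtomicInteger();
        Supplier<List<String>> messages = cached(() -> {
            computations.incrementAndGet(); // the "expensive" parsing step
            return List.of("Error SQL Syntax", "Error php 12");
        });
        long sql = messages.get().stream().filter(s -> s.contains("SQL")).count();
        long php = messages.get().stream().filter(s -> s.contains("php")).count();
        return new long[]{sql, php, computations.get()};
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 1 1 1
    }
}
```

Without the cache, each action would replay the whole transformation chain; in Spark that can mean re-reading the file from HDFS for every count.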
Example: Count error messages with “SQL”, “php”,…
lines = sc.textFile("hdfs://...")

RDD 1:
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12
11:04 | Task 3 done

RDD 2 (after filter "Error"):
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12

RDD 3 (after split on "|"):
10:00, Error SQL Syntax, Task 1 done
11:02, Worker added, Error php 12

RDD 4 (after filter "SQL"): Error SQL Syntax
RDD 5 (after filter "php"): Error php 12
Example: Directed Acyclic Graph
• Dependencies among RDDs form a directed acyclic graph:
  RDD 1 → RDD 2 → RDD 3 → RDD 4 (SQL messages)
  RDD 3 → RDD 5 (php messages)
Directed Acyclic Graph: Map/Reduce vs Spark
• Dependencies: between map/reduce results vs. between RDDs
RDD Recreation
• Automatically recompute (parts of) an RDD if it is lost
  • Due to deletion/removal of the RDD by the system (to free RAM)
  • Due to a fault, e.g. the crash of a machine
• Track the transformations and the (parts of) RDDs used in them
  • Start from the last RDD stored on disk (checkpoint)
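The recomputation idea above can be sketched in a few lines: each dataset remembers its parent and the transformation that produced it, so a lost result is replayed from the lineage rather than replicated. A minimal plain-Java model (all names hypothetical, not Spark internals):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LineageDemo {
    // Minimal lineage node: parent + transformation are enough to rebuild data.
    static class Dataset {
        final Dataset parent;                             // null for a checkpointed source
        final Function<List<String>, List<String>> transform;
        List<String> data;                                // may be dropped to free RAM

        Dataset(List<String> source) {
            this.parent = null; this.transform = null; this.data = source;
        }
        Dataset(Dataset parent, Function<List<String>, List<String>> t) {
            this.parent = parent; this.transform = t;
        }

        List<String> materialize() {
            if (data == null)                             // lost or never computed:
                data = transform.apply(parent.materialize()); // replay the lineage
            return data;
        }
    }

    static int run() {
        Dataset source = new Dataset(List.of("Error SQL", "ok", "Error php"));
        Dataset errors = new Dataset(source, d ->
            d.stream().filter(s -> s.contains("Error")).collect(Collectors.toList()));
        errors.materialize();
        errors.data = null;                               // simulate eviction / node crash
        return errors.materialize().size();               // recomputed from the parent
    }

    public static void main(String[] args) {
        System.out.println(run()); // 2
    }
}
```

Real Spark tracks lineage per partition and walks back only as far as the nearest cached or checkpointed ancestor, so recovery cost stays proportional to the lost work.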