Ghislain Fourny, Big Data for Engineers, Spring 2020 (lecture slide transcript)
YARN
YARN has three main responsibilities: scheduling, application management, and monitoring.
YARN architecture (figure): a single ResourceManager coordinates the cluster, and one NodeManager runs on every node. Each application is driven by its own Application Master, which requests containers from the ResourceManager; the containers then run on the nodes under the control of the NodeManagers.
Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))

rdd1 contains:
Value
Hello, World!
Hello, there!

val rdd2 = rdd1.flatMap(value => value.split(" "))

rdd2 contains:
Value
Hello,
World!
Hello,
there!

rdd2.countByValue()

The result:
Key     Value
Hello,  2
there!  1
World!  1
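The same three steps can be reproduced locally in plain Python (a simulation of the RDD semantics shown above, not Spark itself):

```python
from collections import Counter

data = ["Hello, World!", "Hello, there!"]                 # parallelize
words = [w for value in data for w in value.split(" ")]   # flatMap
counts = Counter(words)                                   # countByValue
print(counts)   # Counter({'Hello,': 2, 'World!': 1, 'there!': 1})
```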
Transformation (figure): at the logical layer, a transformation maps RDD 1 to RDD 2; at the physical layer, it is executed in parallel as Task 1, Task 2, Task 3, and Task 4.
Spreading tasks over executors (figure): Tasks 1 through 11 are spread over Executor 1 through Executor 4.
Spreading tasks over cores (figure): Tasks 1 through 11 are spread over the cores of Executor 1 and Executor 2. Each executor has two cores (Core 1, Core 2) and its own memory, so the 11 tasks run on 4 cores in total.
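The assignment sketched in this figure can be simulated in plain Python (a toy sketch, not Spark's actual scheduler; `assign_tasks` is a made-up name):

```python
from math import ceil

def assign_tasks(num_tasks, num_cores):
    """Toy round-robin assignment of tasks to cores (illustration only)."""
    slots = {core: [] for core in range(num_cores)}
    for task in range(num_tasks):
        slots[task % num_cores].append(task)
    return slots

slots = assign_tasks(11, 4)   # 2 executors x 2 cores = 4 cores
# Each core runs its tasks sequentially, so at most ceil(11 / 4) = 3 waves.
print(ceil(11 / 4))  # 3
print(slots[0])      # [0, 4, 8]
```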
Sequence of (parallelizable) transformations (logical layer)

A stage is a sequence of transformations (Transformation → Transformation → Transformation) that can be executed together, without a shuffle in between.
Spreading a stage over cores (figure): the tasks of a stage (Tasks 1 through 11) are spread over the cores (Core 1, Core 2) of Executor 1 and Executor 2, each executor with its own memory.
Most important parameters

spark-submit --num-executors 42 --executor-cores 2 --executor-memory 3G my-application.jar
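With these example values, the cluster-wide footprint follows directly; a quick sanity check in plain Python (just the arithmetic, not a Spark API):

```python
num_executors = 42
executor_cores = 2
executor_memory_gb = 3

# One task can run per core, so cores bound the parallelism.
max_parallel_tasks = num_executors * executor_cores
total_memory_gb = num_executors * executor_memory_gb

print(max_parallel_tasks)  # 84
print(total_memory_gb)     # 126
```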
Job as sequence of stages

Stage 1 → shuffle → Stage 2 → shuffle → Stage 3

At each shuffle, Spark waits for the completion of the previous stage: the next stage can only start once all tasks of the previous stage have finished.
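The stage structure can be illustrated with a toy splitter (a sketch; `WIDE`, `split_into_stages`, and the op names are assumptions for illustration, not Spark internals): a new stage starts after every wide, shuffle-inducing transformation.

```python
# Wide transformations trigger a shuffle and therefore end a stage.
WIDE = {"reduceByKey", "groupByKey", "sortByKey", "join"}

def split_into_stages(ops):
    """Group a linear chain of transformations into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:          # shuffle: close the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

ops = ["map", "filter", "reduceByKey", "map", "groupByKey", "map"]
print(split_into_stages(ops))
# [['map', 'filter', 'reduceByKey'], ['map', 'groupByKey'], ['map']]
```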
Spark and Python (PySpark)
rdd = spark.sparkContext.textFile('hdfs:///dataset.txt')
rdd2 = rdd.filter(lambda l: "Spark" in l)
rdd3 = rdd2.map(lambda l: (count(l), l))          # count: a helper assumed by the slide, left as-is
rdd4 = rdd3.reduceByKey(lambda l1, l2: l1 + l2)
result = rdd4.take(10)
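Since running this requires a Spark cluster and HDFS, the same dataflow can be simulated locally in plain Python (a sketch: the in-memory list stands in for the HDFS file, and `len` stands in for the slide's unspecified count helper):

```python
lines = ["Spark is fast", "Hello world", "I like Spark"]

# filter: keep lines mentioning Spark
kept = [l for l in lines if "Spark" in l]

# map: key each line (len used as a stand-in for the slide's count helper)
pairs = [(len(l), l) for l in kept]

# reduceByKey: combine values that share a key
by_key = {}
for k, v in pairs:
    by_key[k] = by_key[k] + v if k in by_key else v

result = list(by_key.items())[:10]   # take(10)
print(result)
```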
Spark and Python (PySpark): with JSON
rdd = spark.sparkContext.textFile('hdfs:///dataset.json')
rdd2 = rdd.map(lambda l: json.loads(l))        # parse each line: a map (the slide's parseJSON filter would leave strings); requires import json
rdd3 = rdd2.filter(lambda l: l['key'] == 0)    # == for comparison, not =
rdd4 = rdd3.map(lambda l: (l['key'], l['otherfield']))
result = rdd4.countByKey()
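Again, the same dataflow as a local sketch in plain Python (the JSON lines and field values are invented for illustration):

```python
import json
from collections import Counter

lines = [
    '{"key": 0, "otherfield": "a"}',
    '{"key": 1, "otherfield": "b"}',
    '{"key": 0, "otherfield": "c"}',
]

records = [json.loads(l) for l in lines]                  # parse
matching = [r for r in records if r['key'] == 0]          # filter
pairs = [(r['key'], r['otherfield']) for r in matching]   # map to (key, value)
result = Counter(k for k, _ in pairs)                     # countByKey
print(result)   # Counter({0: 2})
```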
DataFrames
df = spark.read.json('hdfs:///dataset.json')
df.createOrReplaceTempView("dataset")
df2 = spark.sql("SELECT * FROM dataset "
                "WHERE guess = target "
                "ORDER BY target ASC, country DESC, date DESC")   # SQL runs on the session (spark.sql), not on the DataFrame
result = df2.take(10)
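The ORDER BY clause sorts by several keys with mixed directions. The same ordering can be sketched in plain Python with successive stable sorts (toy rows with invented field values):

```python
rows = [
    {"target": 1, "country": "CH", "date": "2020-01-01"},
    {"target": 1, "country": "DE", "date": "2020-03-01"},
    {"target": 0, "country": "CH", "date": "2020-02-01"},
]

# ORDER BY target ASC, country DESC, date DESC:
# sort on each key from least to most significant, relying on the
# stability of Python's sort to mix ascending and descending directions.
rows.sort(key=lambda r: r["date"], reverse=True)     # date DESC
rows.sort(key=lambda r: r["country"], reverse=True)  # country DESC
rows.sort(key=lambda r: r["target"])                 # target ASC
print([r["target"] for r in rows])  # [0, 1, 1]
```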
Spark SQL
Schema inference
dataset.csv:
foo,bar
1,true
2,true
3,false
4,true
5,true
6,false
7,true

Inferred DataFrame schema:
foo (integer): 1, 2, 3, 4, 5, 6, 7
bar (boolean): true, true, false, true, true, false, true
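A toy version of this inference in plain Python (a sketch; Spark's actual inference handles many more cases, and `infer_type` is a made-up name):

```python
def infer_type(values):
    """Infer a column type from CSV strings (toy version: boolean, integer, else string)."""
    if all(v in ("true", "false") for v in values):
        return "boolean"
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        return "string"

foo = ["1", "2", "3", "4", "5", "6", "7"]
bar = ["true", "true", "false", "true", "true", "false", "true"]
print(infer_type(foo))  # integer
print(infer_type(bar))  # boolean
```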
Schema inference

dataset.json:
{ "foo" : 1, "bar" : true }
{ "foo" : 2, "bar" : true }
{ "foo" : 3, "bar" : false }
{ "foo" : 4, "bar" : true }
{ "foo" : 5, "bar" : true }
{ "foo" : 6, "bar" : false }
{ "foo" : 7, "bar" : true }

Inferred DataFrame schema:
foo (integer): 1, 2, 3, 4, 5, 6, 7
bar (boolean): true, true, false, true, true, false, true
DataFrames (with logical transformations)
df = spark.read.json('hdfs:///dataset.json')
df2 = df.filter(df['name'] == 'Einstein')        # == for comparison, not =
df3 = df.orderBy(asc("theory"), desc("date"))    # orderBy on DataFrames; asc/desc come from pyspark.sql.functions
df4 = df.select('year')
result = df4.take(10)
Available types

Numbers: Byte, Short, Integer, Long, Float, Double, Decimal
Other atomics: String, Boolean, Binary, Timestamp, Date
Structured: Array, Struct, Map
Type mapping
DataFrame type   Java type
ByteType         byte
ShortType        short
IntegerType      int
LongType         long
FloatType        float
DoubleType       double
BooleanType      boolean
StringType       String
DecimalType      java.math.BigDecimal
TimestampType    java.sql.Timestamp
DateType         java.sql.Date
BinaryType       byte[]
ArrayType        java.util.List
MapType          java.util.Map
StructType       Row
Other data formats
df = spark.read.json("hdfs:///dataset.json")
df = spark.read.parquet("hdfs:///dataset.parquet")
df = spark.read.csv("hdfs:///dir/*.csv")
df = spark.read.text("hdfs:///dataset[0-7].txt")
df = spark.read.jdbc("jdbc:postgresql://localhost/test?user=fred&password=secret",
...)
df = spark.read.format("avro").load("hdfs:///dataset.avro")
...
Your own schema
Person schema:
First (String), Last (String), Picture (byte[]), Birthday (Date), Flag (boolean)

Dataset<Person>: the schema is statically known.
Spark SQL
SQL Brush-Up!
Dealing with nestedness: arrays

SELECT Last, EXPLODE(Countries)
FROM input

Input:
First (String)   Last (String)   Countries (Array of Strings)
Albert           Einstein        [ "D", "I", "CH", "A", "BE", "US" ]
Srinivasa        Ramanujan       [ "IN", "UK" ]
Kurt             Gödel           [ "CZ", "A", "US" ]
Leonhard         Euler           [ "CH", "RU" ]

Output:
Last (String)    Countries (String)
Einstein D
Einstein I
Einstein CH
Einstein A
Einstein BE
Einstein US
Ramanujan IN
Ramanujan UK
Gödel CZ
Gödel A
Gödel US
Euler CH
Euler RU
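EXPLODE produces one output row per array element. In plain Python this is a nested comprehension (a sketch over the same toy data):

```python
input_rows = [
    {"Last": "Einstein", "Countries": ["D", "I", "CH", "A", "BE", "US"]},
    {"Last": "Ramanujan", "Countries": ["IN", "UK"]},
    {"Last": "Gödel", "Countries": ["CZ", "A", "US"]},
    {"Last": "Euler", "Countries": ["CH", "RU"]},
]

# One output row per (row, array element) pair, like EXPLODE
exploded = [(r["Last"], c) for r in input_rows for c in r["Countries"]]
print(len(exploded))   # 13
print(exploded[0])     # ('Einstein', 'D')
```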
Dealing with nestedness: objects

SELECT Name.First, Name.Last
FROM input

Input:
Name (Object)                                     Countries (Int)
{ "First" : "Albert", "Last" : "Einstein" }       6
{ "First" : "Srinivasa", "Last" : "Ramanujan" }   2
{ "First" : "Kurt", "Last" : "Gödel" }            3
{ "First" : "John", "Last" : "Nash" }             1
{ "First" : "Alan", "Last" : "Turing" }           1
{ "First" : "Leonhard", "Last" : "Euler" }        2

Output:
First (String)   Last (String)
Albert Einstein
Srinivasa Ramanujan
Kurt Gödel
John Nash
Alan Turing
Leonhard Euler
Limits of DataFrames: Heterogeneity
{ "foo" : 1, "bar" : true}{ "foo" : 2, "bar" : true}{ "foo" : [3, 4], "bar" : false}{ "foo" : 4, "bar" : true}{ "foo" : 5, "bar" : true}{ "foo" : 6, "bar" : false}{ "foo" : 7, "bar" : true}
foostring"1""2""[3, 4]""4""5""6""7"
barbooleantruetruefalsetruetruefalsetrue
dataset.json DataFrame
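This fallback can be sketched in plain Python (toy logic, not Spark's actual rules; `widen` is a made-up name):

```python
import json

lines = [
    '{"foo": 1, "bar": true}',
    '{"foo": 2, "bar": true}',
    '{"foo": [3, 4], "bar": false}',
]

def widen(values):
    """Toy type inference: fall back to string when value types disagree."""
    types = {type(v).__name__ for v in values}
    return types.pop() if len(types) == 1 else "string"

records = [json.loads(l) for l in lines]
foo_type = widen([r["foo"] for r in records])
bar_type = widen([r["bar"] for r in records])
print(foo_type)  # string  (int and list mixed)
print(bar_type)  # bool    (all booleans)
```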