Big Data Processing with Spark and AWS EMR @ glomex 17.10.2016 Michael Ludwig

Our Architecture

Our Use Cases

Billing Pre-Aggregations

Interactive Big Data

Spark components

Spark 1.6, PySpark, spark-submit, DataFrames, SparkSQL, UDFs, Accumulators

Example: SparkSQL
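
The slide's own code did not survive extraction. As a stand-in, here is a minimal sketch of a Spark 1.6 PySpark job that combines the components listed earlier (DataFrames, SparkSQL, a UDF, and an accumulator). The table name `events`, the columns `country` and `event_type`, and the helper names are hypothetical and are not taken from the talk.

```python
"""Sketch of a Spark 1.6 PySpark job: DataFrame, SparkSQL, UDF, accumulator."""


def normalize_country(code):
    """Plain Python helper that is registered as a SQL UDF further down."""
    if code is None:
        return "UNKNOWN"
    code = code.strip().upper()
    return code if code else "UNKNOWN"


# Hypothetical queries; table and column names are assumptions.
NORMALIZE_SQL = """
    SELECT norm_country(country) AS country
    FROM events
    WHERE event_type = 'view'
"""
COUNT_SQL = """
    SELECT country, COUNT(*) AS views
    FROM views
    GROUP BY country
    ORDER BY views DESC
"""


def run_job(input_path):
    # Imported lazily so the helper above can be used without a Spark install.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StringType

    sc = SparkContext(appName="sparksql-example")
    sqlContext = SQLContext(sc)

    # DataFrame from JSON lines; Spark infers the schema.
    events = sqlContext.read.json(input_path)

    # Accumulator: count records that carry no country value.
    missing_country = sc.accumulator(0)

    def count_missing(row):
        if row.country is None:
            missing_country.add(1)

    events.foreach(count_missing)

    # SparkSQL: expose the DataFrame as a table and the helper as a UDF.
    events.registerTempTable("events")
    sqlContext.registerFunction("norm_country", normalize_country, StringType())

    # Two steps: apply the UDF first, then aggregate on the plain column.
    sqlContext.sql(NORMALIZE_SQL).registerTempTable("views")
    rows = sqlContext.sql(COUNT_SQL).collect()

    missing = missing_country.value  # only readable on the driver
    sc.stop()
    return rows, missing
```

A driver script would call `run_job("s3://...")` with its own input location and be launched through `spark-submit`.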

EMR Cluster Startup

AWS Web Console

AWS CLI

AWS SDKs (Python, Java, JS etc.)

Startup parameters
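
The parameters shown on this slide are not in the transcript. The sketch below shows what a throw-away cluster request could look like through the Python SDK (boto3 `run_job_flow`). The release label, instance types, region, cluster name and S3 locations are assumptions for illustration, not the values used at glomex.

```python
def build_cluster_request(name, log_uri, script_uri, core_nodes=2, bid_price=None):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    master_group = {
        "Name": "master",
        "InstanceRole": "MASTER",
        "InstanceType": "m3.xlarge",  # assumed instance type
        "InstanceCount": 1,
        "Market": "ON_DEMAND",
    }
    core_group = {
        "Name": "core",
        "InstanceRole": "CORE",
        "InstanceType": "m3.xlarge",  # assumed instance type
        "InstanceCount": core_nodes,
        "Market": "ON_DEMAND",
    }
    if bid_price is not None:
        # Spot instances: the API expects the bid as a string.
        core_group["Market"] = "SPOT"
        core_group["BidPrice"] = "%.3f" % bid_price

    return {
        "Name": name,
        "ReleaseLabel": "emr-4.7.2",  # assumed release of the Spark 1.6 line
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}],
        "Instances": {
            "InstanceGroups": [master_group, core_group],
            # Throw-away cluster: terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def start_cluster(request, region="eu-west-1"):
    """Submit the request; needs boto3 and AWS credentials. Region is assumed."""
    import boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Keeping the request a plain dictionary makes it easy to review or diff before anything is started. Passing `bid_price` switches only the core group to the spot market.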

Spot prices
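
The price chart from this slide is lost. One way to look at current spot prices programmatically is the EC2 `describe_spot_price_history` call; the sketch below picks the availability zone with the lowest most recent price. The region, instance type and look-back window are assumptions, and `cheapest_zone` is a hypothetical helper, not something shown in the talk.

```python
import datetime


def cheapest_zone(price_records):
    """Return (zone, price) for the lowest most-recent spot price.

    Each record is expected to look like an entry of the 'SpotPriceHistory'
    list returned by EC2: AvailabilityZone, SpotPrice (string), Timestamp.
    """
    if not price_records:
        raise ValueError("no spot price records given")

    # Keep only the newest record per availability zone.
    latest = {}
    for rec in price_records:
        zone = rec["AvailabilityZone"]
        if zone not in latest or rec["Timestamp"] > latest[zone]["Timestamp"]:
            latest[zone] = rec

    zone, rec = min(latest.items(), key=lambda kv: float(kv[1]["SpotPrice"]))
    return zone, float(rec["SpotPrice"])


def fetch_spot_prices(instance_type="m3.xlarge", region="eu-west-1", hours=6):
    """Fetch recent spot prices; needs boto3 and AWS credentials."""
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    start = datetime.datetime.utcnow() - datetime.timedelta(hours=hours)
    # Only the first page of results is read here; long windows may paginate.
    response = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start,
    )
    return response["SpotPriceHistory"]
```

Calling `cheapest_zone(fetch_spot_prices())` gives a starting point for choosing a bid. How far above the current price to bid is a separate decision that this sketch does not make.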

Cluster Interaction

YARN Manager

Monitoring: Spark UI

Monitoring: Ganglia on EMR

Error Troubleshooting

Summary

EMR
- Easy cluster startup and configuration
- Throw-away, isolated clusters
- No big upfront investments needed

Spark
- Best framework to get started with big data
- Big community & fast development
- Local development easy

Backup

- TODO

EMR Access URLs

RDD, DataFrame and DataSet

Spark Cluster

In-Memory Computation

Operations

- placeholder

Sample Transformations

RDD Lineage

RDD DAG
