Big Data Processing with Spark and AWS EMR @ glomex
17.10.2016
Michael Ludwig
Our Architecture
Our Use Cases
Billing Pre-Aggregations
Interactive Big Data
Spark components
Spark 1.6, PySpark, spark-submit, DataFrames, SparkSQL, UDFs, Accumulators
Example: SparkSQL
EMR Cluster Startup
AWS Web Console, AWS CLI, AWS SDKs (Python, Java, JS etc.)
Startup parameters
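The parameter list shown on this slide did not survive extraction. As a rough sketch only, assuming the cluster is started through the Python SDK (boto3), the startup parameters could be assembled as below. The helper name `build_cluster_request`, the release label, instance types, counts and bid price are all illustrative assumptions, not settings from the talk; the dictionary keys follow the request shape of boto3's EMR `run_job_flow` call.

```python
def build_cluster_request(name, core_count, bid_price=None):
    """Assemble keyword arguments for boto3's EMR run_job_flow call.

    Hypothetical helper: every concrete value below is a placeholder
    chosen for illustration, not a setting taken from the slides.
    """
    core_group = {
        "Name": "core",
        "InstanceRole": "CORE",
        "InstanceType": "m3.xlarge",  # assumed instance type
        "InstanceCount": core_count,
        "Market": "ON_DEMAND",
    }
    if bid_price is not None:
        # Ask for spot instances; BidPrice is passed as a string (USD per hour).
        core_group["Market"] = "SPOT"
        core_group["BidPrice"] = str(bid_price)

    master_group = {
        "Name": "master",
        "InstanceRole": "MASTER",
        "InstanceType": "m3.xlarge",  # assumed instance type
        "InstanceCount": 1,
        "Market": "ON_DEMAND",
    }

    return {
        "Name": name,
        # Assumed label; pick the EMR release that ships the Spark version you need.
        "ReleaseLabel": "emr-4.7.2",
        "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}],
        "Instances": {
            "InstanceGroups": [master_group, core_group],
            # Keep the cluster running for interactive use after steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and AWS credentials configured, the result could be passed on as `boto3.client("emr").run_job_flow(**build_cluster_request("adhoc", 4, bid_price="0.10"))`. Keeping the request as a plain dictionary makes it easy to review or version the parameters before any cluster is actually started.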
Spot prices
Cluster Interaction
YARN Manager
Monitoring: Spark UI
Monitoring: Ganglia on EMR
Error Troubleshooting
Summary
- EMR
  - Easy cluster startup and configuration
  - Throw-away, isolated clusters
  - No big upfront investments needed
- Spark
  - Best framework to get started with big data
  - Big community & fast development
  - Local development easy
Backup
EMR Access URLs
RDD, DataFrame and Dataset
Spark Cluster
In-Memory Computation
Operations
Sample Transformations
RDD Lineage
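The diagrams for the transformation and lineage slides are not part of this transcript. To illustrate the idea in a form that runs without a cluster, here is a toy model in plain Python. It is not Spark's API: `ToyRDD` and its methods are hypothetical names that only mimic how transformations are recorded lazily as lineage and replayed when an action is called.

```python
class ToyRDD:
    """Plain-Python stand-in for an RDD; a teaching toy, not Spark code."""

    def __init__(self, source, lineage=None):
        self._source = source           # the original input data
        self._lineage = lineage or []   # recorded (name, function) steps

    def map(self, fn):
        # Transformation: nothing is computed, the step is only recorded.
        return ToyRDD(self._source, self._lineage + [("map", fn)])

    def filter(self, fn):
        # Transformation: returns a new ToyRDD with a longer lineage.
        return ToyRDD(self._source, self._lineage + [("filter", fn)])

    def lineage(self):
        # Names of the recorded steps, oldest first.
        return [name for name, _ in self._lineage]

    def collect(self):
        # Action: replay the recorded steps over the source data.
        data = iter(self._source)
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


numbers = ToyRDD(range(10))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.lineage())   # ['filter', 'map']
print(even_squares.collect())   # [0, 4, 16, 36, 64]
```

Because each object keeps only its source and the recorded steps, a result can always be recomputed from the lineage, which is the property Spark relies on to rebuild lost partitions.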
RDD DAG