Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed...
Transcript of Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed...
1 Telefónica Digital
Experiences with Spark @Telefónica
Daniel Tapiador & Ignacio Blasco 12th June 2014 – BCN Spark meetup
2 Telefónica Digital
Outline
• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)
3 Telefónica Digital
Outline
• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)
4 Telefónica Digital
What is Spark? (I)
• Distributed data processing framework/system • Aims at making data analytics fast to run (100x) and
to write • Works in memory and/or disk • Fills the gap for near-real time (in memory) apps • Scales out to big data sizes • Several language bindings (Scala, Python and Java)
5 Telefónica Digital
What is Spark? (II)
• Fully integrated with the Hadoop ecosystem § Supports any Hadoop input format => can read from
HDFS, Hive, Impala, Hbase, etc • Higher level (and richer) interface than MapReduce
§ Generalize MR to support new apps in same engine § General task DAG + data sharing
• Interactive shells for scala and python for exploratory work
6 Telefónica Digital
Original Niche
• Originally developed for: § Iterative algorithms § Interactive data mining
7 Telefónica Digital
Evolution (I)
MapReduce
Pregel
Dremel
GraphLab Storm
Giraph
Drill Tez
Impala
S4 …
Specialized systems (iterative, interactive and"
streaming apps)
General batch"processing
• Has aimed ever since at evolving to a much more complete framework
8 Telefónica Digital
Evolution (II)
9 Telefónica Digital
Motivation for a change (I)
• Column orientation (Parquet and ORC). Ratio 3.5x • Fast in-memory serialization (Kryo)
• Largest input (per day) is 153 GB (bz2 - text).
• Auxiliary tables sizes: § 4 GB (text – no compression) § 950 MB (text – no compression) § 12 MB (text – no compression)
• Processing time: § Something in between 5h – 15h
10 Telefónica Digital
Motivation for a change (II)
• Only batch oriented • Maintainability
§ Tedious to add/modify functionality
• Complexity (more LoC) • Needs explicit Orchestration • Testing somehow tedious with
a lot of Integration Tests
• Stability
• In memory processing • Exploration
§ Prototyping, PoC, etc.
• Speed of development • Orchestration natively done in
the code. • Testing
• v1.0.0 out but still a lot to stabilize
11 Telefónica Digital
Motivation for a change (III)
0
5000
10000
15000
20000
25000
30000
35000
C++ / Hadoop Streaming Scala / Spark
LoC (Model)
LoC (Model)
0 2000 4000 6000 8000
10000 12000
Java / Hadoop Scala / Spark
LoC (ETL)
LoC (ETL)
0
5
10
15
C++ / Hadoop Streaming
Scala / Spark
Model Development Time (in MM)
Model Development Time (in MM)
0 1 2 3 4 5
Java / Hadoop Scala / Spark
ETL Development Time (in MM)
ETL Development Time (in MM)
12 Telefónica Digital
Outline
• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)
13 Telefónica Digital
Spark Programming Model (I)
• Resilient Distributed Datasets (RDDs) § Distributed collections of objects that can be cached
in memory across cluster nodes § Manipulated through various parallel operators § Automatically rebuilt on failure (some checkpointing
as well) • RDDs can be spilled to disk if needed and/or kept in
memory (de)serialized • RDDs partitioning can be explicitly stated • Spark context (sc) is the entry point
§ Can run locally (single or multicore)
14 Telefónica Digital
Spark Programming Model (II)
• Shared variables (immutable) • Broadcast variables
§ Read-only variable cached on each machine § Efficient broadcast algorithms to reduce
communication cost • Accumulators
§ Variables that are only added to through associative operations
§ Only driver program can read their value
15 Telefónica Digital
RDD Operations
• RDD operations include transformations and actions: § map, filter, flatMap, sample, union, distinct,
groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian, …
§ reduce, collect, count, first, take, takeSample, countByKey, foreach, saveAsTextFile, saveAsSequenceFile, …
§ persist/cache and repartitions
16 Telefónica Digital
RDD Persistence • Partitions lost are recomputed using the original
transformations • Each RDD can be stored with a different storage level
17 Telefónica Digital
Memory tuning
• Three considerations: § Amount of memory used by the objects § The cost of accessing those objects § Overhead of garbage collection
• Java objects are fast to access but consume 2-5x more space (than raw data inside) § Prefer array of objects and primitive types § Avoid nested structures § Kryo serializer § …
18 Telefónica Digital
Outline
• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)
19 Telefónica Digital
Spark Ecosystem (I)
• Unify…
20 Telefónica Digital
MLBase • Mllib
§ Distributed low-level machine learning library written against Spark
§ Maintained by Spark core developers § Algorithms for classification, regression, clustering
and collaborative filtering (more to come)
• MLI § High level ML abstractions § MLTable and LocalMatrix
• ML Optimizer § Model selection automation § Solves a search problem over feature extractors and
ML algorithms
21 Telefónica Digital
Spark Streaming motivation
• Processing the same data in live streams as well as batch post-processing
• Existing frameworks cannot do both § 100s of MB/s with low latency § TBs of data with high latency
• Extremely painful to maintain • Mutable state lost if node fails (in traditional model)
22 Telefónica Digital
Spark Streaming • Runs a streaming computation as a series of very
small, deterministic batch jobs • Chop up the live stream into
batches of X seconds • Each batch of data is processed
using DStream + RDD operations § countByWindow, reduceByWindow, slice, window,
countByValue, countByValueAndWindow, …
• Combine live data streams with historical data
• Batch sizes as low as ½ s • Input sources
§ Kafka, HDFS, Flume, Akka actors, Raw TCP sockets, custom implementaions and RDDs pushed as a stream
23 Telefónica Digital
Shark
• Interactive SQL queries + unification § GENERATE Kmeans(tweet_locations) AS TABLE
tweet_clusters • Key points added (leveraged by Spark):
§ Cached tables § In memory column orientation (3-20x reduction in size) § Data co-partitioning, fully distributed sort, dynamic join
algorithm selection based on the data, partition pruning using range statistics, etc
• Early development stage
24 Telefónica Digital
• Approximate query processing § Sampling module when ingesting § Online sample selection based on query’s latency and accuracy § Parallel query execution with appropriate error and confidence
bounds
BlinkDB
25 Telefónica Digital
GraphX
• Resilient Distributed Graph, efficient partitioning, implementations of the PowerGraph and Pregel graph-parallel frameworks using RDGs.
26 Telefónica Digital
Outline
• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)
27 Telefónica Digital
Tips and Tricks
• Use of Try monad to debug § Can be used to run controlled exceptions building
RDD[(Arg,Try)] § Allow to Use Exceptions as Data
28 Telefónica Digital
Tips and Tricks
• Traits and Pimp my library to make DSLs § Spark uses implicit conversion to add method to
specific RDD (Pimp my Library) § Traits allow separation of concerns and allow to make
a modular and customizable DSL § Can be loaded with a single import § Allow custom operations and ETL
29 Telefónica Digital
Tips and Tricks
• Summary § Spark is easy to grasp and can be used with almost
no knowledge of Scala of Functional Programming but…
§ Advanced Scala features can make powerful DSL § Use of FP concept boost productivity and improves
parallelization
30 Telefónica Digital
Demo