Apache Spark 101
June 2016
Abdullah Cetin CAVDAR
@accavdar
#AnkaraSparkDay
Apache Spark's Goal
Apache Spark is a fast and general engine for large-scale data processing
Most Active Project in Big Data
Spark Survey 2015
Top 10 Industries Using Spark
Many Types of Product
Spark Engine
Unified engine across diverse workloads & environments
Programming Languages
Open Source Spark Ecosystem
Most Important Aspects
Spark Programming Model
Challenge?
Fast data sharing across parallel jobs
Data Sharing in MapReduce
Data Sharing in Apache Spark
Components
Cluster Managers
Initializing Apache Spark
SparkConf and SparkContext
Apache Spark Shell
Python and Scala
RDD (Resilient Distributed Dataset)
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost
RDD
Read-Only = Immutable
Parallelism
Caching
RDD
Partitioned = Distributed
More partitions = More parallelism
RDD
Rebuilt = Resilient
Recover lost data partitions by replaying data lineage
RDD Operations
Partitions
Logical division of data / basic unit of parallelism
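A minimal sketch of the idea in plain Python, not Spark code: a collection is cut into partitions and each partition is handled by its own task. The helper names `split_into_partitions` and `sum_partition` are hypothetical, and a thread pool stands in for the workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Slice a list into num_partitions contiguous chunks of near-equal size."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def sum_partition(partition):
    # One task works on exactly one partition.
    return sum(partition)


data = list(range(1, 11))
partitions = split_into_partitions(data, 4)

# Each partition is submitted as a separate task; with more partitions,
# more tasks are available to run at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_partition, partitions))

# The per-partition results are combined into the final answer.
total = sum(partial_sums)
```

In Spark itself the number of partitions can be passed when creating the RDD, for example as the second argument of `sc.parallelize()`.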
RDD Lineage
Lazy Evaluation
DAG (Directed Acyclic Graph)
Transformation & Action
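How lineage, lazy evaluation, transformations and actions fit together can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical single-machine stand-in, not Spark's RDD: transformations only record a step, and the action `collect()` replays the recorded steps over the source data.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, single machine, not Spark's class)."""

    def __init__(self, source, lineage=()):
        self._source = source            # original input, never modified
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def describe(self):
        # Human-readable view of the recorded lineage.
        return " -> ".join(["source"] + [name for name, _ in self._lineage])

    def collect(self):
        # Action: replay the lineage over the source data and return the result.
        items = iter(self._source)
        for name, fn in self._lineage:
            items = map(fn, items) if name == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)   # record that the function really ran
    return x * 2


numbers = ToyRDD([1, 2, 3, 4, 5])
pipeline = numbers.map(double).filter(lambda x: x % 4 == 0)

calls_before_action = len(calls)   # still 0: nothing has been computed
result = pipeline.collect()        # the action triggers the computation
```

Because each `ToyRDD` is immutable and keeps its lineage, a lost result can always be rebuilt by calling `collect()` again. In Spark, `rdd.toDebugString()` prints the real lineage of an RDD.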
RDD Creation
Parallelizing a collection
Collection is held in driver application memory
Only for prototyping and testing
Loading an external data set
file://, hdfs://, s3n://
sc.textFile()
sc.hadoopFile(), sc.newAPIHadoopFile()
sqlContext.read
Word Count :)
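A plain-Python sketch of the stages a word count usually goes through, named after the RDD operations they correspond to (`flatMap`, `map`, `reduceByKey`). It runs on one machine and is not Spark code; the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "that is the question"]

# flatMap stage: split every line into words and flatten into one sequence.
words = chain.from_iterable(line.split() for line in lines)

# map stage: pair each word with a count of 1.
pairs = ((word, 1) for word in words)

# reduceByKey stage: add up the counts that share the same word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

word_counts = dict(counts)
```

The first two stages are generators, so no work happens until the loop consumes them, which loosely mirrors how transformations wait for an action.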
Driver & Workers
Main Program is executed on the Driver
Transformations are executed on Workers
Actions transfer results from Workers to the Driver
The Driver cannot get data from executors except through actions and accumulators
RDD Dependencies
Minimize shuffle / wide dependencies
RDD Persistence / Caching
persist() or cache()
Without cache, computation restarts from the first RDD
Eviction policy: LRU (Least Recently Used)
Default Storage Level: MEMORY_ONLY
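The interaction of caching, LRU eviction and recomputation can be sketched with a toy cache in plain Python. `ToyBlockCache` is a hypothetical model, not Spark's block manager: it keeps a fixed number of computed partitions in memory, evicts the least recently used one when full, and rebuilds a partition from scratch on a miss.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Hypothetical in-memory cache of computed partitions with LRU eviction."""

    def __init__(self, capacity, compute):
        self.capacity = capacity       # max number of partitions kept in memory
        self.compute = compute         # rebuilds a partition from scratch
        self.blocks = OrderedDict()    # partition id -> data, least recent first
        self.recomputations = 0

    def get(self, partition_id):
        if partition_id in self.blocks:
            # Cache hit: mark the partition as most recently used.
            self.blocks.move_to_end(partition_id)
            return self.blocks[partition_id]
        # Cache miss: the partition has to be computed again.
        self.recomputations += 1
        data = self.compute(partition_id)
        self.blocks[partition_id] = data
        if len(self.blocks) > self.capacity:
            # Memory is full: drop the least recently used partition.
            self.blocks.popitem(last=False)
        return data


cache = ToyBlockCache(capacity=2, compute=lambda pid: [pid * 10 + i for i in range(3)])
cache.get(0)   # miss
cache.get(1)   # miss
cache.get(0)   # hit
cache.get(2)   # miss, evicts partition 1
cache.get(1)   # miss again, evicts partition 0
```

This matches the behaviour described for MEMORY_ONLY: partitions that do not fit in memory are not stored and are recomputed when they are needed again.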
Storage Levels
Shared Variables
Accumulators and Broadcast Variables
Accumulators
Used to implement counters or sums
Broadcast Variables
Keep a read-only variable cached on each machine
Spark UI
Default port 4040
Deploying to a Cluster
Use spark-submit
Data Frames & Performance
Distributed collection of rows organized into named columns
Tips
Avoid groupByKey and wide dependencies
Use enough partitions
Use coalesce to avoid producing too many small files
Be cautious about serialization/deserialization
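Why groupByKey is worth avoiding for aggregations can be sketched in plain Python by counting how many records would cross the shuffle. This is a toy model, not Spark code: the two functions are hypothetical, the data is made up, and a flat list stands in for the network.

```python
from collections import defaultdict

# Two map-side partitions of (key, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def shuffle_without_combine(parts):
    """groupByKey-style: every record is shuffled as-is, then summed."""
    shuffled = [record for part in parts for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_with_combine(parts):
    """reduceByKey-style: sum per key inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # combine before the shuffle
        shuffled.extend(local.items())   # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_records = shuffle_without_combine(partitions)
reduced_totals, reduced_records = shuffle_with_combine(partitions)
```

Both approaches give the same totals, but the combined version moves 4 records instead of 7. The gap grows with the number of repeated keys per partition.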
Major Features in 2.0
Thank you
#AnkaraSparkDay