Apache Spark: Moving on from Hadoop
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Hadoop is unbeatable (?)
https://spark.apache.org/
Google Trends
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
➢ Open source cluster computing
➢ Distributed disk → Distributed memory
➢ Created at UC Berkeley in 2009
➢ Last major release: Dec 2014
https://spark.apache.org/
What is Apache Spark?
➢ Core concept in Spark
➢ Distributed collection of objects in memory
➢ Operate in parallel on RDDs
➢ Read from file, distributed file system, or parallelize existing collection
Resilient Distributed Dataset (RDD)
➢ RDDs are fault tolerant
➢ Spark maintains a DAG of the operations used to compute an RDD
➢ We can cache RDDs to save computations
Resilient Distributed Dataset (RDD)
https://spark.apache.org/
Spark Architecture
➢ Client: interacts with the cluster
➢ Driver: main program, coordinates tasks
➢ Cluster manager: assigns resources
➢ Workers/executors: carry out tasks, manage RDD chunks
➢ Main application for a Spark script
➢ Creates Spark context and coordinates executors
➢ Executes instructions in Java/Python/Scala
➢ ONLY parallelizes operations on RDDs
Driver program
➢ We will stick to Scala
➢ Functional programming
➢ Completely integrated with Java
➢ Shorter code due to Scala abstractions
Programming in Spark
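As a small taste of why Scala code tends to be shorter, here is a filter-and-aggregate written in functional style. This is plain Scala with no Spark involved; the object name and data are illustrative only:

```scala
// Plain Scala (no Spark needed): functional style keeps transformations short.
object ScalaTaste {
  def main(args: Array[String]): Unit = {
    val ages = List(18, 20, 25, 36, 38)
    // Filter and aggregate in one expression, as we will later do on RDDs.
    val total = ages.filter(_ < 25).sum
    println(total) // 38
  }
}
```

The same pipeline style (`filter`, `map`, `sum`) carries over almost verbatim to RDDs.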
➢ We have an interactive shell
spark-shell --master local[4]
spark-shell --master yarn-client
Programming in Spark
Number of cores to use in local mode
Use resources from a YARN cluster (as in Hadoop)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/02 11:43:03 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
scala>
Programming in Spark
➢ Special type of object
➢ Interacts with distributed resources:
○ Read data
○ Add resources (e.g., jars, files) to the cluster
○ Create RDDs
○ etc.
➢ In the Spark shell, it is automatically created as sc
Spark context
➢ Load from local filesystem
➢ Load from HDFS
Loading a text file
val Students = sc.textFile( "file:///home/victor.sanchez/students.tsv" )
Students: org.apache.spark.rdd.RDD[String] = file:///home/victor.sanchez/students.tsv MappedRDD[1] at textFile at <console>:12
RDD of Strings
val Students = sc.textFile( "hdfs://localhost/user/victor.sanchez/students.tsv" )
val → immutable reference, cannot be reassigned
➢ Take n elements from RDD
➢ Get whole RDD into driver
What’s in my dataset?
val x = Students.take( 3 )
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25)
Array of Strings, LOCAL!! Resides in the driver
val x = Students.collect
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25, 4 Sherlock Holmes M 36, 5 John Watson M 38, 6 Sarah Kerrigan F 21, 7 Bruce Wayne M 32, 8 Tony Stark M 33, 9 Princess Peach F 21, 10 Peter Parker M 23)
Elements to take
With no arguments, the parentheses can be omitted
➢ Parallelize collection from driver
➢ Broadcast a variable (only sent once)
Can I go the reverse way?
val myArray = Array( 1, 2, 3, 4, 5 )
myArray: Array[Int] = Array(1, 2, 3, 4, 5)
val myArrayPar = sc.parallelize( myArray )
myArrayPar: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:14
val x = 6
val xBroad = sc.broadcast( x )
xBroad: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(4)
Array creation
➢ Map: Project or generate new data
➢ It really takes an anonymous function as arg:
Basic operations on RDDs
val StudentsF = Students.map( l => l.split( "\t", -1 ) )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map
StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))
For each l in Students, generate its split
(l:String) => l.split( "\t", -1 )
(x:Int,y:Int) => x+y
Input parameters
Output
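Anonymous functions like the two above are ordinary Scala values, so they can be tried outside Spark. A quick check (the names `splitLine` and `add` are illustrative):

```scala
object AnonFunctions {
  def main(args: Array[String]): Unit = {
    // The two anonymous functions shown above, bound to names.
    val splitLine = (l: String) => l.split("\t", -1)
    val add = (x: Int, y: Int) => x + y
    println(splitLine("1\tJohn\tDoe").mkString("|")) // 1|John|Doe
    println(add(2, 3)) // 5
  }
}
```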
➢ Map: Project or generate new data
Basic operations on RDDs
def splitWrapped( line: String ): Array[String] = { line.split( "\t", -1 ) }
val StudentsF = Students.map( splitWrapped )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map
StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))
Output
Input
➢ Generate a new RDD in which each student has an extra field indicating whether the student is under 25 years old
Exercise
➢ Foreach: Perform operation over each object
➢ Does not return a new RDD!
Basic operations on RDDs
StudentsF.foreach( x => print( x( 1 ) + " " ) )
John Bruce Tony Princess Peter 15/02/03 09:17:04 INFO executor.Executor: Finished task 1.0 in stage 13.0 (TID 27). 1693 bytes result sent to driverMary Lara Sherlock John Sarah 15/02/03 09:17:04 INFO executor.Executor: Finished task 0.0 in stage 13.0 (TID 26). 1693 bytes result sent to driver
➢ Filter: keep elements fulfilling a condition
Basic operations on RDDs
val StudentsFilt = StudentsF.filter( s => s( 0 ).toInt > 3 )
StudentsFilt: org.apache.spark.rdd.RDD[Array[String]] = FilteredRDD[13] at filter at <console>:16
StudentsFilt.take( 3 )
res13: Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21))
Anon. function
Convert to integer
➢ Distinct: keep only distinct objects
Basic operations on RDDs
val StudentsDis = StudentsF.map( s => s( 3 ) ).distinct
StudentsDis: org.apache.spark.rdd.RDD[String] = MappedRDD[18] at distinct at <console>:16
StudentsDis.take( 2 )
res16: Array[String] = Array(F, M)
➢ Fold: Reduce all objects to a single object
➢ Beware: the dummy (zero) element is applied more than once
Basic operations on RDDs
val dummyStudent = Array( "12", "Clark", "Kent", "M", "25" )
val StudentsFold = StudentsF.fold( dummyStudent )( (acc, value) => { if ( value( 4 ).toInt > acc( 4 ).toInt ) value else acc } )
StudentsFold: Array[String] = Array(5, John, Watson, M, 38)
Starting left operand
Left operand
val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { Array( "[" + acc( 0 ) + "-" + value( 0 ) + "]" , acc( 1 ), acc( 2 ), acc( 3 ), acc( 4 ) ) } )
StudentsFold: Array[String] = Array([[12-[[[[12-7]-8]-9]-10]]-[[[[[[12-1]-2]-3]-4]-5]-6]], Clark, Kent, M, 0)
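Why the dummy appears more than once can be mimicked in plain Scala: Spark folds each partition starting from its own copy of the zero element, and then folds the partial results once more. This is a sketch of the semantics under an assumed partitioning, not Spark's implementation:

```scala
object FoldZeroDemo {
  def main(args: Array[String]): Unit = {
    val ages = List(18, 20, 25, 36, 38, 21)
    val zero = 0
    // Simulate 3 partitions of 2 elements: each partition fold starts from its own zero...
    val partials = ages.grouped(2).map(_.fold(zero)(math.max)).toList
    // ...and the partial results are folded once more, again starting from the zero.
    val result = partials.fold(zero)(math.max)
    println(result) // 38
  }
}
```

The zero element is therefore used once per partition plus once for the final combination, which explains the nested brackets in the output above.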
➢ Reduce: Reduce all objects to a single object
Basic operations on RDDs
val StudentsRed = StudentsF.map( s => s( 4 ).toInt ).reduce( _ + _ )
StudentsRed: Int = 267
Binary operator: commutative and associative
➢ Max:
➢ Min:
Basic operations on RDDs
val StudentsMax = StudentsF.map( s => s( 4 ).toInt ).max
StudentsMax: Int = 38
val StudentsMin = StudentsF.map( s => s( 4 ).toInt ).min
StudentsMin: Int = 18
➢ Count:
➢ CountByValue: Count repetitions of elements
Basic operations on RDDs
val StudentsCount = StudentsF.count
StudentsCount: Long = 10
val StudentsCount = StudentsF.map( s => s( 3 ) ).countByValue
StudentsCount: scala.collection.Map[String,Long] = Map(M -> 6, F -> 4)
➢ Count the number of students that are female
Exercise
➢ Sample:
➢ RandomSplit: Splits into random RDDs
Basic operations on RDDs
val StudentsSample = StudentsF.sample( true, 0.5 )
StudentsSample: org.apache.spark.rdd.RDD[Array[String]] = PartitionwiseSampledRDD[33]
StudentsSample.take( 3 )
res18: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(2, Mary, Doe, F, 20))
val StudentsSplit = StudentsF.randomSplit( Array( 0.8, 0.2 ) )
StudentsSplit: Array[org.apache.spark.rdd.RDD[Array[String]]] = Array(PartitionwiseSampledRDD[46]
StudentsSplit( 0 ).collect
res26: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))
With replacement and fraction
Weights for each partition
➢ SortBy: Sort elements according to value
➢ Top: Get largest elements
Basic operations on RDDs
val StudentsSorted = StudentsF.sortBy( x => x( 4 ) )
StudentsSorted.collect
Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), ...
val StudentsTop = StudentsF.map( s => s( 4 ) ).top( 3 )
StudentsTop: Array[String] = Array(38, 36, 33)
Value to sort by
k elements to select
➢ Union: Two RDDs into one
Basic operations on RDDs
val StudentsUnder25 = StudentsF.filter( s => s( 4 ).toInt < 25 )
val StudentsOver30 = StudentsF.filter( s => s( 4 ).toInt > 30 )
val StudentsUnion = StudentsOver30.union( StudentsUnder25 )
StudentsUnion: org.apache.spark.rdd.RDD[Array[String]] = UnionRDD[75]
StudentsUnion.collect
Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(7, Bruce, Wayne, M, 32), Array(8, Tony, Stark, M, 33), Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))
➢ Intersection: Common elements in two RDDs
Basic operations on RDDs
val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsIntersect = StudentsUnder35.intersection( StudentsOver25 )
StudentsIntersect: org.apache.spark.rdd.RDD[Int] = MappedRDD[92]
StudentsIntersect.collect
res31: Array[Int] = Array(32, 33)
➢ Subtract: elements in one RDD that are not in the other
Basic operations on RDDs
val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsSub = StudentsUnder35.subtract( StudentsOver25 )
StudentsSub: org.apache.spark.rdd.RDD[Int] = MappedRDD[12]
StudentsSub.collect
res0: Array[Int] = Array(18, 20, 21, 21, 23, 25)
➢ Tuples in Scala:
➢ Pair RDDs → RDDs with tuples (key, value)
Pair RDDs
val myTuple = ( 13, "Bob", "Squarepants", "M", 10 )
myTuple._1
res6: Int = 13
Tuple creation
Access fields
val PairStudents = StudentsF.map( s => ( s( 3 ), s ) )
PairStudents.take( 3 )
res8: Array[(String, Array[String])] = Array((M,Array(1, John, Doe, M, 18)), (F,Array(2, Mary, Doe, F, 20)), (F,Array(3, Lara, Croft, F, 25)))
➢ Join:
Operations on Pair RDDs
val PairStudentsId = StudentsF.map( s => ( s( 0 ), s ) )
val PairGrades = GradesF.map( g => ( g( 0 ), g ) )
val StudentGrades = PairStudentsId.join( PairGrades )
StudentGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Array[String]))]
StudentGrades.take( 3 )
res13: Array[(String, (Array[String], Array[String]))] = Array((4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Math, 2.3))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Biology, 6.7))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Engineering, 8.0))))
Prepare key, value structure
Output is (key, (value1, value2))
➢ Left Join:
Operations on Pair RDDs
val auxRDD = sc.parallelize( Array( Array( "0", "Dummy", "Student", "M", "10" ), Array( "1", "John", "Doe", "M", "18" ) ) )
val auxPairRDD = auxRDD.map( a => ( a( 0 ), a ) )
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades )
auxGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] = FlatMappedValuesRDD[34]
auxGrades.take( 2 )
res23: Array[(String, (Array[String], Option[Array[String]]))] = Array((0,(Array(0, Dummy, Student, M, 10),None)), (1,(Array(1, John, Doe, M, 18),Some([Ljava.lang.String;@30d4fbf))))
Option = None or a value
String representation of a non-empty value in the Option
➢ Left Join (cont):
Operations on Pair RDDs
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades )
  .map( p => ( p._1, ( p._2._1, if ( !p._2._2.isEmpty ) p._2._2.get ) ) )
auxGrades.take( 2 )
res26: Array[(String, (Array[String], Any))] = Array((0,(Array(0, Dummy, Student, M, 10),())), (1,(Array(1, John, Doe, M, 18),Array(1, Math, 5.6))))
➢ reduceByKey: reduce the values with the same key to a single object
Operations on Pair RDDs
val RedKeys = PairStudents.map({case (k,v) => ( k, v( 4 ).toInt ) }).reduceByKey( _ + _ )
RedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[84]
RedKeys.take( 2 )
res33: Array[(String, Int)] = Array((F,87), (M,180))
More than 1 line in anon function
Pattern matching
Result is a RDD
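The same aggregation can be mirrored on a local Scala collection as an analogy: groupBy plus a per-key sum plays the role of reduceByKey( _ + _ ). Plain Scala, illustrative data:

```scala
object ReduceByKeyLocal {
  def main(args: Array[String]): Unit = {
    val pairs = List(("F", 20), ("M", 18), ("F", 25), ("M", 36))
    // Group the values by key, then reduce each group with +.
    val summed = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
    println(summed.toList.sorted) // List((F,45), (M,54))
  }
}
```

Unlike this local version, reduceByKey combines values within each partition before shuffling, which is why it is preferred over groupByKey for aggregations.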
➢ foldByKey: fold the values with the same key to a single object
Operations on Pair RDDs
val foldedKeys = PairStudents.map({case(k,v) => (k,v(4).toInt)}).foldByKey(0)((a,b) => Math.max(a,b))
foldedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[92]
foldedKeys.take( 2 )
res30: Array[(String, Int)] = Array((F,25), (M,38))
Left parameter
Function
➢ groupByKey: group values with the same key
Operations on Pair RDDs
val groupedKeys = PairStudents.groupByKey
groupedKeys: org.apache.spark.rdd.RDD[(String, Iterable[Array[String]])] = ShuffledRDD[93]
groupedKeys.collect
res35: Array[(String, Iterable[Array[String]])] = Array((F,CompactBuffer([Ljava.lang.String;@31788c16, [Ljava.lang.String;@613511b9, [Ljava.lang.String;@631eba8a, [Ljava.lang.String;@7668ecdc)), (M,CompactBuffer([Ljava.lang.String;@62969c3f, [Ljava.lang.String;@dec1eaa, [Ljava.lang.String;@8d1320a, [Ljava.lang.String;@5e2c330b, [Ljava.lang.String;@27cb477a, [Ljava.lang.String;@12c1aeff)))
groupedKeys.map( { case ( k, v ) => ( k, v.map( x => "(" + x.mkString( "," ) + ")" ) ) } ).take( 2 )
res40: Array[(String, Iterable[String])] = Array((F,List((2,Mary,Doe,F,20), (3,Lara,Croft,F,25), (6,Sarah,Kerrigan,F,21), (9,Princess,Peach,F,21))), (M,List((1,John,Doe,M,18), (4,Sherlock,Holmes,M,36), (5,John,Watson,M,38), (7,Bruce,Wayne,M,32), (8,Tony,Stark,M,33), (10,Peter,Parker,M,23))))
String repr of Iterable[Array[String]]
➢ Really useful for AI and ML
➢ For loop example
var StudentsLoop = StudentsF.map( s => ( s( 0 ).toInt, s( 1 ), s( 2 ) ) )
for( i <- 1 to 10 ){
  StudentsLoop = StudentsLoop.map( { case ( id, name, surname ) => ( id + 1, name, surname ) } )
}
StudentsLoop.collect
res43: Array[(Int, String, String)] = Array((11,John,Doe), (12,Mary,Doe), (13,Lara,Croft), (14,Sherlock,Holmes), (15,John,Watson), (16,Sarah,Kerrigan), (17,Bruce,Wayne), (18,Tony,Stark), (19,Princess,Peach), (20,Peter,Parker))
Looping!
Non final variable
➢ Persist RDDs in memory/disk
➢ Other levels of persistence:
○ MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, etc.
Caching
PairStudents.cache
import org.apache.spark.storage.StorageLevel
GradesF.persist( StorageLevel.MEMORY_AND_DISK )
➢ Store to the local filesystem or HDFS
Saving RDDs
PairStudents.map( x => ( x._1, "(" + x._2.mkString( "," ) + ")" ) ).saveAsTextFile( "file:///home/victor.sanchez/res" )
PairStudents.map( x => ( x._1, "(" + x._2.mkString( "," ) + ")" ) ).saveAsTextFile( "hdfs:///user/victor.sanchez/res" )
Trick to convert Array[String] properly to String
package es.upv.dsic.iarfid.haia
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
object mySparkScript {
def average(data: Iterable[Double]): Double = { data.reduceLeft( _ + _ )/data.size }
def main( args: Array[String] ) {
val sc = new SparkContext( ( new SparkConf() ).setAppName( "MY SPARK SCRIPT" ) )
val Grades = sc.textFile( args( 0 ) ).map( l => l.split( "\t", -1 ) ).map( g => ( g( 1 ), g( 2 ).toDouble ) )
val GradesGr = Grades.groupByKey.map( g => ( g._1, average( g._2 ) ) )
GradesGr.saveAsTextFile( args( 1 ) )
}
}
Scripting
➢ Package declaration
➢ Imports
➢ Support methods
➢ Singleton object with the main method
➢ Program arguments
Compiling Spark code
➢ Scala code is compiled to Java bytecode
➢ sbt is a build tool for Scala and Java
➢ sbt can help us manage our dependencies
➢ Spark cluster → fat jar; the sbt-assembly plugin can build one!
Spark project example
build.sbt
lib/
project/
plugins.sbt
build.scala
src/
main/
resources/
test/
target/
Main .sbt file. Scala code to compile your scala source!
Plugins needed by sbt to compile your source
Your project source file
Extra libraries
Output jar for your project
How to compile your main .sbt
Test sources
Additional files for your jar
Your project code
import AssemblyKeys._
assemblySettings
name := "haia"
version := "1.0"
scalaVersion := "2.10.4"
organization := "es.upv"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
)
jarName in assembly := {
name.value + ".jar"
}
outputPath in assembly := {
file( "target/" + (jarName in assembly).value )
}
Main sbt file example
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
case "unwanted.txt" => MergeStrategy.discard
case PathList( "META-INF", ".*pom.properties" ) => MergeStrategy.first
case x => old(x)
}
}
➢ Fat jar?
○ A jar with all of the jar files it depends on
○ Workers need all dependencies
○ The sbt-assembly plugin can generate fat jars
➢ Generating a fat jar:
sbt assembly
Generating a fat jar
spark-submit --class es.upv.dsic.iarfid.haia.mySparkScript \
  --master yarn-cluster target/haia.jar \
  hdfs:///user/victor.sanchez/grades.tsv \
  hdfs:///user/victor.sanchez/spark_submit_ex
How to execute Spark code from jar
Singleton object to execute
Fat jar file
Program parameters
➢ Simulated annealing → Optimization method
➢ Multi-point → Exploring from different points
➢ Function to optimize:
Exercise: Multi-point simulated annealing
Single point simulated annealing
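As a reminder before parallelising it, here is a minimal single-point simulated annealing sketch in plain Scala. The objective f(x) = x², the cooling schedule, and all constants are illustrative assumptions, not taken from the slides:

```scala
import scala.util.Random

object Annealing {
  // Illustrative objective to minimise: f(x) = x^2, optimum at x = 0 (assumption).
  def f(x: Double): Double = x * x

  // One annealing run from a given starting point.
  def anneal(start: Double, rnd: Random): Double = {
    var x = start
    var t = 100.0                              // initial temperature (assumed)
    while (t > 1e-3) {
      val candidate = x + rnd.nextGaussian()   // random neighbour
      val delta = f(candidate) - f(x)
      // Accept improvements always; accept worse moves with probability exp(-delta/t).
      if (delta < 0 || rnd.nextDouble() < math.exp(-delta / t)) x = candidate
      t *= 0.95                                // geometric cooling (assumed)
    }
    x
  }

  def main(args: Array[String]): Unit = {
    // Multi-point flavour: several independent runs, keep the best solution found.
    val best = (1 to 5).map(seed => anneal(10.0, new Random(seed))).minBy(f)
    println(best)
  }
}
```

For the exercise, each starting point could become one element of a parallelized RDD, with `anneal` applied via map and the best solution selected with reduce.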
Time to work!
➢ Spark is still quite a novel technology
➢ Unexpected out-of-memory exceptions
➢ Memory issues are difficult to debug in Spark
➢ Avoiding out-of-memory scenarios:
○ Use object serialization (Java or Kryo)
○ Choose data structures wisely
○ Increase parallelism (spark.default.parallelism)
○ Avoid groupBy operations → prefer reduceBy
○ More memory for shuffles (spark.shuffle.spill=false or a higher spark.shuffle.memoryFraction)
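Several of these settings can be passed at submission time with --conf. The flag syntax is standard spark-submit; the concrete values below are only illustrative and should be tuned per cluster:

```shell
# Illustrative tuning values; adjust for your cluster and job.
spark-submit \
  --class es.upv.dsic.iarfid.haia.mySparkScript \
  --master yarn-cluster \
  --conf spark.default.parallelism=200 \
  --conf spark.shuffle.memoryFraction=0.4 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  target/haia.jar input_path output_path
```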
A final advice on Spark
Hadoop ecosystem vs Spark

Hadoop:
➢ Disk-based parallelization
➢ No looping
➢ More mature project
➢ Many organizations use it

Spark:
➢ Memory-based parallelization
➢ Looping (nice for AI and ML)
➢ Still taking its initial steps
➢ Changing all Hadoop code has a cost
Extra information
➢ http://spark.apache.org/
➢ Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau et al. Ed. O’Reilly
➢ StackOverflow