Post on 21-Apr-2017
Real-Life Apache Spark: Tips and Tricks from the Trenches
Noah Bieler
Wealthport AG
Zurich Spark Meetup, March 2016
#ZurichSparkUsers
Overview
• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark
Spark and the MapReduce Model
Express your computations in terms of map (embarrassingly parallel) and reduce operations.
[Figure: "Map" and "Reduce" steps turning bread, tomato and cheese into a sandwich]
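The map and reduce steps can be sketched with plain Scala collections, without Spark. This is only an illustration: the object name MapReduceDemo and the choice of string lengths as the computation are assumptions made for this example.

```scala
object MapReduceDemo {
  val words: Seq[String] = Seq("bread", "tomato", "cheese")

  // Map step: every element is transformed independently of the others,
  // which is why this step parallelises trivially.
  val lengths: Seq[Int] = words.map(_.length)

  // Reduce step: the mapped values are combined with an associative operation.
  val totalLength: Int = lengths.reduce(_ + _)

  def main(args: Array[String]): Unit =
    println(s"lengths = $lengths, total = $totalLength")
}
```

Because addition is associative, partial sums could be computed per partition and merged afterwards, which is what makes the reduce step distributable.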
Spark RDD
• RDDs (Resilient Distributed Datasets) are the abstraction Spark uses to model parallelism.
• They follow the MapReduce model (map, reduce, immutability).
• RDDs in the code are only instructions to compute something. What Spark actually does is not obvious (optimisations, predicate pushdown etc.).
• Since the actual computation is “delayed”, you cannot use RDDs within RDDs.
In the code: rdd1 → map f → rdd2 → map g → rdd3 → map h → rdd4 → count → c: Long
In the VM: rdd1 → map f . g . h → rdd2 → count → c: Long
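The delayed computation can be made visible with a small sketch. It assumes Spark 2.0 or later (for longAccumulator) and a local master; the object name LazyRddDemo is made up for this example. The accumulator counts how often the map function has run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddDemo {
  /** Returns how often the map function ran before and after the action. */
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("map calls")
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 4))

    // This only describes the computation; nothing is executed yet.
    val rdd2 = rdd1.map { i => calls.add(1); i * 2 }
    val before = calls.value.longValue

    // The action triggers the actual computation on the executors.
    rdd2.count()
    val after = calls.value.longValue

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-rdd")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

In a single local run without task retries, run should return (0, 4): the map function is not called until count is.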
RDD: Mental Model
[Figure: "On the Driver": rdd1 → map f → rdd2. "On the Nodes": rdd1 → map f → rdd2 within each of Partition 1 … Partition N. Further labels: parallelize, action.]
object Main {
  def test = {
    val rdd: RDD[Int] = ...
    val a = 1 + 2 + 3 // happens on the driver
    rdd.map { i =>
      i + a // happens on the nodes
    }
  }
}
RDDs vs. DataFrames (since 1.3) vs. DataSets (since 1.6)
RDDs are the most basic building block of Spark. Limited API, but full control and type safety.
DataFrames are RDDs of Rows (= Seq[Any], no type safety!) with a schema; basically like a table. More methods but less control. For example, one cannot control the partitioning. They can also be queried with SQL statements.
New Datasets (since Spark 1.6) are like RDDs (type safety) but with (optimised) methods known from DataFrames (count).
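The difference between the three APIs can be sketched by running the same filter three times. Note the assumptions: this uses SparkSession from Spark 2.0 or later rather than the 1.x contexts current at the time of the talk, and Person and ApiComparison are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; defined at top level so that Spark can derive an encoder.
final case class Person(name: String, age: Int)

object ApiComparison {
  /** Counts the people older than 30 via an RDD, a DataFrame and a Dataset. */
  def run(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._
    val people = Seq(Person("Ada", 36), Person("Linus", 28), Person("Grace", 45))

    // RDD: typed lambda, checked by the compiler.
    val viaRdd = spark.sparkContext.parallelize(people).filter(_.age > 30).count()

    // DataFrame: the column is referenced by name and only checked at runtime.
    val viaDataFrame = people.toDF().filter($"age" > 30).count()

    // Dataset: typed lambda again, on top of the DataFrame machinery.
    val viaDataset = people.toDS().filter(_.age > 30).count()

    (viaRdd, viaDataFrame, viaDataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("api-comparison").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

A typo in the column name "age" would only surface when the DataFrame query is analysed, whereas a typo in _.age fails at compile time.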
Spark Pitfall: Join
join: RDD[(K, V)] with RDD[(K, W)]
[Figure: rdd1 and rdd2 are partitioned in the same way, so each partition is joined locally.
Partition 1: rdd1 holds 1 -> “abc”, 2 -> “dfg”, …; rdd2 holds 1 -> 3.142, 2 -> 2.718, …; result holds 1 -> (“abc”, 3.142), 2 -> (“dfg”, 2.718), …
Partition 2: rdd1 holds 3 -> “hij”, 4 -> “xzy”, …; rdd2 holds 3 -> 1.618, 4 -> 8.314, …; result holds 3 -> (“hij”, 1.618), 4 -> (“xzy”, 8.314), …
No network traffic between the partitions.]
Before you join, make sure that both RDDs are partitioned in the same way.
rdd.partitionBy(new HashPartitioner(4 * nodeCount))
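A minimal sketch of such a co-partitioned join, assuming a local master. The object name CoPartitionedJoin and the partition count of 4 are arbitrary choices for this example; the data is taken from the figure.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  /** Returns whether the join kept the partitioner, and the joined data. */
  def run(sc: SparkContext): (Boolean, Map[Int, (String, Double)]) = {
    // Both sides use the same partitioner, so equal keys end up in the same partition.
    val partitioner = new HashPartitioner(4)

    val rdd1 = sc.parallelize(Seq(1 -> "abc", 2 -> "dfg", 3 -> "hij", 4 -> "xzy"))
      .partitionBy(partitioner)
    val rdd2 = sc.parallelize(Seq(1 -> 3.142, 2 -> 2.718, 3 -> 1.618, 4 -> 8.314))
      .partitionBy(partitioner)

    val joined = rdd1.join(rdd2)
    (joined.partitioner.contains(partitioner), joined.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("join"))
    try println(run(sc)) finally sc.stop()
  }
}
```

If an RDD partitioned like this is joined more than once, persisting it after partitionBy avoids repeating the shuffle that partitionBy itself causes.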
Spark Pitfalls: Join
Don’t use map on a partitioned pair RDD; use mapValues if possible. Otherwise the partitioning is lost.
With map:
rdd1 (1 -> “abc”, 2 -> “dfgh”, …) → map { case (k, v) => (k, v.size) } → rdd2 (1 -> 3, 2 -> 4, …)
Spark cannot know whether the key was changed. → Partitioning is erased.
With mapValues:
rdd1 (1 -> “abc”, 2 -> “dfgh”, …) → mapValues(_.size) → rdd2 (1 -> 3, 2 -> 4, …)
Spark knows that the key was not changed. → Partitioning is kept.
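The effect can be checked directly on the partitioner field of the resulting RDDs. A minimal sketch, assuming a local master; the object name MapVsMapValues is made up for this example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MapVsMapValues {
  /** Returns (partitioner defined after map, partitioner defined after mapValues). */
  def run(sc: SparkContext): (Boolean, Boolean) = {
    val partitioned = sc.parallelize(Seq(1 -> "abc", 2 -> "dfgh"))
      .partitionBy(new HashPartitioner(4))

    // map may change the key, so Spark drops the partitioner.
    val viaMap = partitioned.map { case (k, v) => (k, v.length) }

    // mapValues cannot touch the key, so the partitioner is carried over.
    val viaMapValues = partitioned.mapValues(_.length)

    (viaMap.partitioner.isDefined, viaMapValues.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("map-values"))
    try println(run(sc)) finally sc.stop()
  }
}
```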
Spark Pitfalls: Joining a large and a small RDD
[Figure: the large rdd1 (1 -> “a”, 1 -> “b”, 2 -> “c”, 1 -> “c”, 3 -> “c”; not partitioned) is combined with sc.broadcast(rdd2.collect()), where rdd2 is small (1 -> 3.142, ...)]
When joining a large RDD with a small one, it might be better to broadcast the small one, especially if the RDDs would otherwise have to be partitioned first.
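One way to write such a broadcast join is sketched below. It assumes that the small RDD fits into memory on the driver and on every executor and that its keys are unique (toMap keeps only one value per key); the object name BroadcastJoin is made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoin {
  /** Inner join of a large RDD with a small one, without shuffling the large one. */
  def run(sc: SparkContext): Seq[(Int, (String, Double))] = {
    val large = sc.parallelize(Seq(1 -> "a", 1 -> "b", 2 -> "c", 1 -> "c", 3 -> "c"))
    val small = sc.parallelize(Seq(1 -> 3.142, 2 -> 2.718))

    // Ship the small side once to every executor as a read-only lookup table.
    val lookup = sc.broadcast(small.collect().toMap)

    // The join happens inside each partition of `large`; keys without a match are dropped.
    val joined = large.flatMap { case (k, v) =>
      lookup.value.get(k).map(w => k -> (v, w)).toList
    }

    // Sorted only to make the result deterministic for printing.
    joined.collect().toSeq.sortBy { case (k, (v, _)) => (k, v) }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("broadcast-join"))
    try run(sc).foreach(println) finally sc.stop()
  }
}
```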
Spark Pitfalls: Persistence
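Persist DataFrames/RDDs that feed more than one branch of transformations.
[Figure: rdd1 → map f → rdd2, marked with .persist(); from rdd2 one branch goes via map g to rdd3 → … and another via map h to rdd4 → …]
import scala.util.Try
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object RDDs {
  /** Automatically persist and unpersist an RDD
    * before and after the calculation. */
  def withPersistedRDD[A, B](
    rdd: RDD[A],
    storageLevel: StorageLevel
  )(f: RDD[A] => B): B = {
    // Capture a possible failure so that unpersist always runs
    val result = Try(f(rdd.persist(storageLevel)))
    rdd.unpersist()
    result.get // rethrows the failure, if any
  }
}

// Usage: rdd2 is computed once and reused by both branches
RDDs.withPersistedRDD(rdd1.map(f), StorageLevel.MEMORY_ONLY) { rdd2 =>
  val rdd3 = rdd2.map(g)
  val rdd4 = rdd2.map(h)
  /* ... */
  result
}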
Spark Pitfalls: Serialisation
class Algorithm1(val primeNumber: Int) extends Serializable {
  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
    }
  }
  val veryLargeTable = Seq(/* ... */)
}
You actually use this.primeNumber and therefore serialise the whole instance (including veryLargeTable).
class Algorithm2(val primeNumber: Int) {
  def run(rdd: RDD[String]): RDD[Int] = {
    val _primeNumber = primeNumber
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
    }
  }
  val veryLargeTable = Seq(/* ... */)
}
A local copy of this.primeNumber avoids serialising the whole instance.
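The capture of this can be reproduced without Spark, because Spark serialises closures with plain Java serialisation. The following sketch assumes Scala 2.12 or later; ClosureCapture, isSerializable and the two classes are hypothetical names. Both classes are deliberately not Serializable, so that a captured instance shows up as a failure.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCapture {
  /** True if `value` can be written with Java serialisation. */
  def isSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  class CapturesThis(val primeNumber: Int) {
    // primeNumber is really this.primeNumber, so the closure holds on to `this`.
    def hasher: String => Int =
      s => s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
  }

  class UsesLocalCopy(val primeNumber: Int) {
    def hasher: String => Int = {
      val localPrime = primeNumber // a plain Int is copied into the closure
      s => s.foldLeft(0) { case (hash, c) => hash + localPrime * c.toInt }
    }
  }

  def main(args: Array[String]): Unit = {
    println(isSerializable(new CapturesThis(31).hasher))
    println(isSerializable(new UsesLocalCopy(31).hasher))
  }
}
```

Making the class Serializable, as Algorithm1 does, removes the exception but not the cost: the whole instance still travels with every task.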
Spark Pitfalls: Serialisation
class Algorithm(val primeNumber: Int) extends Serializable {
  def hash(s: String) =
    s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s => hash(s) }
  }
  val veryLargeTable = Seq(/* ... */)
}
You actually use this.hash and therefore serialise the whole instance (including veryLargeTable).
class Algorithm(val primeNumber: Int) {
  def hashFunction() = {
    val _primeNumber = primeNumber
    // The returned function only refers to the local copy, not to this
    (s: String) => s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
  }
  def run(rdd: RDD[String]): RDD[Int] = {
    val hash = hashFunction()
    rdd.map { s => hash(s) }
  }
  val veryLargeTable = Seq(/* ... */)
}
A function factory for hash avoids serialising the whole instance.
Spark Pitfalls: MapLike not Serializable
object Main {
  val myMap = Map(1 -> "a", 2 -> "bc", 3 -> "def")
    .mapValues(_.size) // Produces MapLike, not Serializable
    .map(identity)     // Produces Map again
  val myOtherMap = /* ... */
  val totalSize = sc.parallelize(Seq(myMap, myOtherMap))
    .map(_.size)
    .reduce(_ + _) // Would fail without map(identity), SI-7005
}
After running mapValues on a Map, run map(identity) on it to avoid a NotSerializableException.
Spark Pitfalls: Avoid groupByKey followed by mapValues
The collection of values that groupByKey produces for each key can be very large. Try to avoid it.
val rdd = sc.parallelize(Seq("Hello", "World", "Bonjour", "Monde", "Guten Tag", "Welt"))
// Collects all values of a key before counting them
val histogram1 = rdd
  .map(_.size -> null)
  .groupByKey() // : RDD[(Int, Iterable[Null])]
  .mapValues(_.size)
// Adds up the counts per key without building the groups
val histogram2 = rdd
  .map(_.size -> 1)
  .reduceByKey(_ + _)
Spark Pitfalls: Rows and nulls
• Spark’s Row is nothing but a wrapper for Seq[Any]: no type safety!
• A Row will return null if there is no value present!
row.getAs[String](index) == null // no exception if row(index) == null
• A Row can lose its schema.
A proper type hierarchy would not even define the function getAs(fieldName: String) for Rows!
dataFrame
  .map { row =>
    val newRow = Row.fromSeq(row.toSeq.updated(timeIndex, timeStamp))
    row.getAs[Int]("ID") -> newRow // Access element by field name
  }
  .map { case (id, row) =>
    id -> row.getAs[String]("First Name") // No schema!
  }
Pimp my Spark: The “Pimp my Library” Pattern
object RowImplicits {
  implicit class RowImplicit(row: Row) {
    def updated[T](attributeId: AttributeId, value: T): Row = {
      val newRow = Row.fromSeq(row.toSeq.updated(row.fieldIndex(attributeId), value))
      // Keep the schema if the original row had one
      Option(row.schema).map(newRow.withSchema).getOrElse(newRow)
    }

    def withSchema(schema: StructType): Row =
      new GenericRowWithSchema(row.toSeq.toArray, schema)

    def getStringOption(attributeIndex: Int): Option[String] = {
      if (row.isNullAt(attributeIndex)) None
      else Some(row.getString(attributeIndex))
    }
  }
}
Add functionality to any existing library.
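The same pattern works for any type you do not own. A minimal sketch without Spark: StringSyntax, RichText and parseIntOption are hypothetical names chosen for this example.

```scala
import scala.util.Try

object StringSyntax {
  // The implicit class wraps a String and adds a method to it,
  // without modifying java.lang.String itself.
  implicit class RichText(text: String) {
    /** Parses the trimmed text as an Int, or returns None if that fails. */
    def parseIntOption: Option[Int] = Try(text.trim.toInt).toOption
  }
}

object StringSyntaxDemo {
  import StringSyntax._ // brings the extension method into scope

  val parsed: Option[Int] = " 42 ".parseIntOption
  val failed: Option[Int] = "forty-two".parseIntOption

  def main(args: Array[String]): Unit = println((parsed, failed))
}
```

The extension is only visible where StringSyntax._ is imported, so it does not leak into code that does not ask for it.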
User Defined Types
Functional programming stands on three pillars:
• Variables are immutable (no side effects)
• Functions are first-class citizens (higher-order functions)
• Algebraic data types (strongly typed)
A good type hierarchy ensures that each function has only valid input and sane output.
Thus, it is essential that Spark supports custom data types.
http://pt.slideshare.net/ScottWlaschin/fp-patterns-buildstufflt
def div(numerator: Int, denominator: NonZeroInteger) =
  numerator / denominator.value
def div(numerator: Int, denominator: Int) = denominator match {
  case 0 => None
  case _ => Some(numerator / denominator)
}
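The slides do not show how NonZeroInteger is defined. One possible definition is a class with a private constructor and a checked factory method, sketched below; SafeDivision and the method name from are assumptions made for this example.

```scala
object SafeDivision {
  /** Hypothetical wrapper: the private constructor makes a zero value unrepresentable. */
  final class NonZeroInteger private (val value: Int)

  object NonZeroInteger {
    /** The only way to obtain an instance; zero is rejected here, once. */
    def from(value: Int): Option[NonZeroInteger] =
      if (value == 0) None else Some(new NonZeroInteger(value))
  }

  // No check needed: the type guarantees a valid denominator.
  def div(numerator: Int, denominator: NonZeroInteger): Int =
    numerator / denominator.value

  val ok: Option[Int] = NonZeroInteger.from(4).map(div(12, _))
  val rejected: Option[Int] = NonZeroInteger.from(0).map(div(12, _))

  def main(args: Array[String]): Unit = println((ok, rejected))
}
```

The validation moves to the place where the value is created, and every function that takes a NonZeroInteger can rely on it.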
User Defined Types
If you want to identify your rows with UUIDs, you need to use user defined types, since Spark does not support UUIDs.
@SQLUserDefinedType(udt = classOf[EntityIdType])
case class EntityId(uuid: UUID) extends Serializable

object EntityId {
  def generate(): EntityId = EntityId(UUID.randomUUID())
}

case object EntityIdType extends EntityIdType
User Defined Types
class EntityIdType private extends UserDefinedType[EntityId]() {
  override def sqlType: DataType = StringType

  override def serialize(obj: Any): UTF8String = obj match {
    case null => null.asInstanceOf[UTF8String]
    case t: EntityId => UTF8String.fromString(t.uuid.toString)
    case _ => throw new IllegalArgumentException(/*...*/)
  }

  override def deserialize(datum: Any): EntityId = datum match {
    case s: UTF8String => new EntityId(UUID.fromString(s.toString))
    case s: String => new EntityId(UUID.fromString(s))
    case _ => throw new IllegalArgumentException(/*...*/)
  }

  override def userClass: Class[EntityId] = classOf[EntityId]
}
Sometimes Spark serialises using normal Strings, sometimes using UTF8Strings.
Running Spark in the Cloud (AWS)
Three cluster managers:
• Standalone
• Apache Mesos
• Hadoop’s YARN
Two possibilities:
• Create EC2 instances and use the spark-ec2 scripts to manage them. Time-consuming, and not everything works out of the box; e.g. the encoding has to be set manually.
• Use Amazon EMR to have a managed environment. Pricier, and new releases arrive a bit later. Uses YARN.
Both methods let you access data on S3 (AWS storage).
$ cat spark/conf/spark-defaults.conf
spark.akka.frameSize             1000
spark.driver.memory              11g
spark.driver.extraJavaOptions    -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory            55g
spark.executor.extraJavaOptions  -XX:+HeapDumpOnOutOfMemoryError
Running Spark and Cassandra
Cassandra is a distributed NoSQL database optimised for fault tolerance. It was originally developed at Facebook and is now used worldwide (Twitter, Reddit, …).
We have just started to experiment with it and fixed some bugs with respect to user defined types and Scala 2.10 reflection.
We are using DataStax’ driver to connect Spark and Cassandra. It has recently added support for Spark 1.6 (previously 1.5).
We are using cassandra-unit to write our unit tests.
Testing Spark
// Imports assume ScalaTest 3.0 or earlier and the DataStax Java driver 3.x
import java.net.ServerSocket
import scala.collection.mutable
import com.datastax.driver.core.Session
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuite, Matchers}

abstract class TestBase extends FunSuite
    with BeforeAndAfterAll with BeforeAndAfterEach with Matchers {

  protected val sparkConfigProperties = mutable.Map[String, String]()
  protected implicit var sparkContext: SparkContext = _
  protected implicit var sqlContext: SQLContext = _
  protected implicit var cassandraSession: Session = _

  // Asks the OS for a currently free port and releases it again
  private def freePort(): String = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort.toString finally socket.close()
  }

  override def beforeAll(): Unit = {
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")

    val conf = new SparkConf()
      .setMaster("local")
      .set("spark.testing", "true")
      .set("spark.ui.enabled", "false")
      .set("spark.master.ui.port", freePort()) // Avoids port clashes when tests run in parallel
      .set("spark.worker.ui.port", freePort()) // Avoids port clashes when tests run in parallel
      .setAll(sparkConfigProperties)

    sparkContext = new SparkContext(conf)
    sqlContext = new SQLContext(sparkContext)
  }

  override def afterAll(): Unit = {
    sparkContext.stop()
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")
  }
}
Testing Spark and Cassandra
class CassandraTest extends TestBase {
  sparkConfigProperties("spark.cassandra.connection.host") = "127.0.0.1"

  override def beforeAll(): Unit = {
    EmbeddedCassandraServerHelper.startEmbeddedCassandra("cassandra.yaml", 300000)

    sparkConfigProperties("spark.cassandra.connection.port") =
      EmbeddedCassandraServerHelper.getNativeTransportPort.toString

    super.beforeAll()

    cassandraSession = CassandraConnector(sparkContext.getConf).openSession()

    cassandraSession.execute("DROP KEYSPACE IF EXISTS test_keyspace")
    val dataLoader = new CQLDataLoader(cassandraSession)
    dataLoader.load(new ClassPathCQLDataSet("cassandra/create_schema.cql", true, "test_keyspace"))
  }

  override def afterAll(): Unit = {
    cassandraSession.close()
    EmbeddedCassandraServerHelper.cleanEmbeddedCassandra()
    super.afterAll()
  }
}