Real-Life Apache Spark: Tips and Tricks from the Trenches

Noah Bieler

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark



Spark and the MapReduce Model


Map Reduce

Express your computations in terms of map (embarrassingly parallel) and reduce operations.




Spark RDD


• RDDs (Resilient Distributed Datasets) are the abstraction Spark uses to model parallelism.

• Uses the MapReduce model (map, reduce, immutability)

• RDDs in the code are only instructions to compute something. What Spark actually does is not obvious (optimisations, predicate pushdown etc.)

• Since the actual computation is “delayed” you cannot use RDDs within RDDs.

In the code: rdd1 → map f → rdd2 → map g → rdd3 → map h → rdd4 → count → c: Long

In the VM: rdd1 → map f . g . h → rdd2 → count → c: Long
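The difference between the pipeline as written and the pipeline as executed can be mimicked with plain Scala collections. The sketch below is only an analogy and uses no Spark at all: a lazy Iterator stands in for the RDD, the functions f, g and h are made up for illustration, and the order in which they are applied is an assumption.

```scala
object FusionSketch {
  // Hypothetical stand-ins for the f, g and h on the slide.
  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2
  val h: Int => Int = _ - 3

  // "In the code": three separate map steps. An Iterator is lazy,
  // so nothing is computed until toList consumes it.
  def stepByStep(data: Seq[Int]): Seq[Int] =
    data.iterator.map(f).map(g).map(h).toList

  // "In the VM": a single pass that applies the composed function.
  def fused(data: Seq[Int]): Seq[Int] =
    data.iterator.map(f andThen g andThen h).toList
}
```

Both variants give the same result; only the number of intermediate steps differs.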

RDD: Mental Model


On the driver: rdd1 → map f → rdd2

On the nodes: rdd1 → map f → rdd2, once per partition (Partition 1 … Partition N)

object Main {
  def test = {
    val rdd: RDD[Int] = ...

    val a = 1 + 2 + 3 // happens on the driver
    rdd.map { i =>
      i + a // happens on the nodes
    }
  }
}


RDDs vs. DataFrames (since 1.3) vs. Datasets (since 1.6)


RDDs are the most basic building block of Spark. The API is limited, but you get full control and type safety.

DataFrames are RDDs of Rows (= Seq[Any], no type safety!) with a schema; basically like a table. They offer more methods but less control: for example, one cannot control the partitioning. SQL statements can be used.

Datasets (new in Spark 1.6) are like RDDs (type safety) but with the (optimised) methods known from DataFrames (e.g. count).



Spark Pitfall: Join


rdd1: RDD[(K, V)] join rdd2: RDD[(K, W)]

Partition 1:
rdd1: 1 -> “abc”, 2 -> “dfg”, …
rdd2: 1 -> 3.142, 2 -> 2.718, …
result: 1 -> (“abc”, 3.142), 2 -> (“dfg”, 2.718), …

Partition 2:
rdd1: 3 -> “hij”, 4 -> “xzy”, …
rdd2: 3 -> 1.618, 4 -> 8.314, …
result: 3 -> (“hij”, 1.618), 4 -> (“xzy”, 8.314), …

No network traffic.

Before you join, make sure that the two RDDs are partitioned in the same way (same partitioner, same number of partitions).

rdd.partitionBy(new HashPartitioner(4 * nodeCount))
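As a minimal sketch of the advice above, the following joins two pair RDDs that were partitioned with the same HashPartitioner beforehand. The local master, the application name and the value of nodeCount are assumptions made only so that the example is self-contained; the data is taken from the slide.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  /** Returns (did the join reuse the partitioner?, the joined data). */
  def run(): (Boolean, Map[Int, (String, Double)]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("join-sketch")
    val sc = new SparkContext(conf)
    try {
      val nodeCount = 1 // assumption: a single local "node"
      val partitioner = new HashPartitioner(4 * nodeCount)

      // Both sides use the same partitioner, so equal keys
      // end up in the same partition number.
      val rdd1 = sc.parallelize(Seq(1 -> "abc", 2 -> "dfg", 3 -> "hij", 4 -> "xzy"))
        .partitionBy(partitioner)
      val rdd2 = sc.parallelize(Seq(1 -> 3.142, 2 -> 2.718, 3 -> 1.618, 4 -> 8.314))
        .partitionBy(partitioner)

      val joined = rdd1.join(rdd2)
      (joined.partitioner == Some(partitioner), joined.collect().toMap)
    } finally sc.stop()
  }
}
```

If both RDDs are reused afterwards, persisting them after partitionBy avoids repeating the shuffle.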

Spark Pitfalls: Join


On a partitioned PairRDD, use mapValues instead of map whenever possible. Otherwise the partitioning is destroyed.

map { case (k, v) => (k, v.size) }
rdd1: 1 -> “abc”, 2 -> “dfgh”, … becomes 1 -> 3, 2 -> 4, …
Spark cannot know if the key was changed. → Partitioning is erased.

mapValues(_.size)
rdd1: 1 -> “abc”, 2 -> “dfgh”, … becomes 1 -> 3, 2 -> 4, …
Spark knows that the key was not changed. → Partitioning is kept.
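The effect can be checked through the partitioner field of an RDD. The sketch below assumes a local master and an arbitrary application name; the data mirrors the slide.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MapValuesSketch {
  /** Returns (partitioner known after map?, partitioner known after mapValues?). */
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapValues-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(1 -> "abc", 2 -> "dfgh"))
        .partitionBy(new HashPartitioner(4))

      // map may change the key, so Spark forgets the partitioner.
      val viaMap = rdd1.map { case (k, v) => (k, v.size) }
      // mapValues cannot touch the key, so the partitioner survives.
      val viaMapValues = rdd1.mapValues(_.size)

      (viaMap.partitioner.isDefined, viaMapValues.partitioner.isDefined)
    } finally sc.stop()
  }
}
```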


Spark Pitfalls: Joining a large and a small RDD


rdd1 (large, not partitioned): 1 -> “a”, 1 -> “b”, 2 -> “c”, 1 -> “c”, 3 -> “c”, …

sc.broadcast(rdd2.collect()) (small): 1 -> 3.142, ...

When joining a large RDD with a small one, it may be better to broadcast the small one, especially if the RDDs would otherwise have to be partitioned first.
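One possible shape of such a broadcast join is sketched below. It assumes that the small RDD fits into memory on the driver and on every executor, that its keys are unique (so it can be turned into a Map), and that inner-join semantics are wanted. The local master and the application name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def run(): Seq[(Int, (String, Double))] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(1 -> "a", 1 -> "b", 2 -> "c", 1 -> "c", 3 -> "c"))
      val rdd2 = sc.parallelize(Seq(1 -> 3.142))

      // Ship the small side to every executor once.
      val small = sc.broadcast(rdd2.collect().toMap)

      // Look each key up locally; no shuffle of rdd1 is needed.
      // Keys missing from the small side are dropped (inner join).
      val joined = rdd1.flatMap { case (k, v) =>
        small.value.get(k).map(w => k -> (v, w)).toList
      }
      joined.collect().toSeq
    } finally sc.stop()
  }
}
```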

Spark Pitfalls: Persistence


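Persist DataFrames/RDDs that feed more than one branch of transformations. Otherwise each branch recomputes them.

rdd1 → map f → rdd2; rdd2 → map g → rdd3; rdd2 → map h → rdd4

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object RDDs {

  /** Automatically persist and unpersist an RDD
    * before and after the calculation.
    */
  def withPersistedRDD[A, B](
    rdd: RDD[A], storageLevel: StorageLevel
  )(f: RDD[A] => B): B = {
    // Try makes sure that unpersist() runs even if f throws.
    // f has to trigger its actions itself; an RDD returned lazily
    // would be computed only after the unpersist.
    val result = Try(f(rdd.persist(storageLevel)))
    rdd.unpersist()
    result.get
  }
}

withPersistedRDD(rdd1.map(f), storageLevel) { rdd2 =>
  val rdd3 = rdd2.map(g)
  val rdd4 = rdd2.map(h)
  /* ... */
  result
}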

Spark Pitfalls: Serialisation


class Algorithm1(val primeNumber: Int) extends Serializable {

  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}

You actually use this.primeNumber and therefore serialise the whole instance (including veryLargeTable).

class Algorithm2(val primeNumber: Int) {

  def run(rdd: RDD[String]): RDD[Int] = {
    val _primeNumber = primeNumber
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}

A local copy of this.primeNumber avoids serialising the whole instance.
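The capture of this can be observed without a cluster. The following sketch uses plain Java serialisation, which is what Spark uses for closures by default, on two closures built by a class that is deliberately not Serializable. The class and method names are hypothetical and only mirror the slide.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

object ClosureCaptureSketch {

  // Deliberately not Serializable.
  class Algorithm(val primeNumber: Int) {
    val veryLargeTable: Seq[Int] = Seq.fill(1000)(0)

    // Reads this.primeNumber, so the closure drags `this` along.
    def capturingThis: String => Int =
      s => s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }

    // Copies the field into a local value first; only an Int is captured.
    def capturingLocalCopy: String => Int = {
      val _primeNumber = primeNumber
      s => s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
    }
  }

  // True if Java serialisation of obj succeeds.
  def canSerialise(obj: AnyRef): Boolean =
    Try(new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)).isSuccess
}
```

The first closure cannot be serialised because the enclosing instance is not Serializable; the second one can, and both compute the same hash.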

Spark Pitfalls: Serialisation


class Algorithm(val primeNumber: Int) extends Serializable {

  def hash(s: String) =
    s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }

  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s => hash(s) }
  }

  val veryLargeTable = Seq(/* ... */)
}

You actually use this.hash and therefore serialise the whole instance (including veryLargeTable).

class Algorithm(val primeNumber: Int) {

  def hashFunction() = {
    val _primeNumber = primeNumber
    (s: String) =>
      s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
  }

  def run(rdd: RDD[String]): RDD[Int] = {
    val hash = hashFunction()
    rdd.map { s => hash(s) }
  }

  val veryLargeTable = Seq(/* ... */)
}

A function factory for hash avoids serialising the whole instance.

Spark Pitfalls: MapLike not Serializable


object Main {
  val myMap = Map(1 -> "a", 2 -> "bc", 3 -> "def")
    .mapValues(_.size) // Produces MapLike, not Serializable
    .map(identity)     // Produces Map again

  val myOtherMap = /* ... */

  val totalSize = sc.parallelize(Seq(myMap, myOtherMap))
    .map(_.size)
    .reduce(_ + _) // Would fail without map(identity), SI-7005
}

After running mapValues on a Map, run map(identity) on the result to avoid a NotSerializableException.

Spark Pitfalls: Avoid groupByKey followed by mapValues


The collection that groupByKey builds for each key can become very large. Try to avoid it; reduceByKey combines the values per key before the shuffle.

val rdd = sc.parallelize(Seq(
  "Hello", "World", "Bonjour", "Monde", "Guten Tag", "Welt"
))

val histogram1 = rdd
  .map(_.size -> null)
  .groupByKey() // : RDD[(Int, Iterable[Null])]
  .mapValues(_.size)

val histogram2 = rdd
  .map(_.size -> 1)
  .reduceByKey(_ + _)

Spark Pitfalls: Rows and nulls


• Spark’s Row is nothing but a wrapper for Seq[Any]: No type safety!
• A Row will return null if there is no value present!

row.getAs[String](index) == null // no exception!
row(index) == null

• A Row can lose its schema.

A proper type hierarchy would not even define the function getAs(fieldName: String) for Rows that have no schema!

dataFrame
  .map { row =>
    val newRow = Row.fromSeq(row.toSeq.updated(timeIndex, timeStamp))
    row.getAs[Int]("ID") -> newRow // Access element by field name
  }
  .map { case (id, row) =>
    id -> row.getAs[String]("First Name") // No schema!
  }



Pimp my Spark: The “Pimp my Library” Pattern


import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType

object RowImplicits {

  implicit class RowImplicit(row: Row) {

    def updated[T](attributeId: AttributeId, value: T): Row = {
      val newRow = Row.fromSeq(row.toSeq.updated(row.fieldIndex(attributeId), value))
      // Row.fromSeq drops the schema, so re-attach it if there was one.
      Option(row.schema).map(newRow.withSchema).getOrElse(newRow)
    }

    def withSchema(schema: StructType): Row =
      new GenericRowWithSchema(row.toSeq.toArray, schema)

    def getStringOption(attributeIndex: Int): Option[String] = {
      if (row.isNullAt(attributeIndex)) None
      else Some(row.getString(attributeIndex))
    }
  }
}

Add functionality to any library you use.
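The same pattern works on any type, not only on Row. The sketch below enriches a plain Seq[Any] with a null-safe accessor and needs no Spark on the classpath. The names SeqImplicits and NullSafeSeq are made up for illustration, and returning None for non-String values is a design choice of this sketch rather than the behaviour of Row.getString.

```scala
object SeqImplicits {

  // Wraps any Seq[Any] and adds a null-safe accessor to it.
  implicit class NullSafeSeq(values: Seq[Any]) {

    // None for null (and for anything that is not a String),
    // Some(value) otherwise.
    def getStringOption(index: Int): Option[String] =
      Option(values(index)).collect { case s: String => s }
  }
}
```

After `import SeqImplicits._`, the new method can be called as if Seq had always defined it, for example `Seq[Any](1, null, "Ada").getStringOption(2)`.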



User Defined Types


Functional programming stands on three pillars:
• Variables are immutable (no side effects)
• Functions are first-class citizens (higher-order functions)
• Algebraic data types (strongly typed)

A good type hierarchy ensures that each function has only valid input and sane output.

Thus, it is essential that Spark supports custom data types.

// With a custom type, an invalid denominator cannot even be passed in.
def div(nominator: Int, denominator: NonZeroInteger) =
  nominator / denominator.value

// Without it, every caller has to deal with the failure case.
def div(nominator: Int, denominator: Int) = denominator match {
  case 0 => None
  case _ => Some(nominator / denominator)
}
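The slides only show how NonZeroInteger is used, not how it is defined. One way it could be defined is with a private constructor and a checking factory method, as in the sketch below. The definition and the name of the factory method (from) are assumptions.

```scala
object NonZeroSketch {

  // Hypothetical definition: the constructor is private, so the
  // companion's factory is the only way to obtain an instance.
  final class NonZeroInteger private (val value: Int)

  object NonZeroInteger {
    // Zero is rejected here, once, instead of in every function.
    def from(value: Int): Option[NonZeroInteger] =
      if (value == 0) None else Some(new NonZeroInteger(value))
  }

  // Needs no Option in its result: a zero denominator
  // cannot be constructed in the first place.
  def div(nominator: Int, denominator: NonZeroInteger): Int =
    nominator / denominator.value
}
```

The check moves to the boundary where the value enters the system; everything behind that boundary can rely on the type.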

User Defined Types


@SQLUserDefinedType(udt = classOf[EntityIdType])
case class EntityId(uuid: UUID) extends Serializable

object EntityId {
  def generate(): EntityId = EntityId(UUID.randomUUID())
}

case object EntityIdType extends EntityIdType

If you want to identify your rows with UUIDs, you need to use user defined types since Spark does not support UUIDs.

User Defined Types


class EntityIdType private extends UserDefinedType[EntityId]() {

  override def sqlType: DataType = StringType

  override def serialize(obj: Any): UTF8String = obj match {
    case null => null.asInstanceOf[UTF8String]
    case t: EntityId => UTF8String.fromString(t.uuid.toString)
    case _ => throw new IllegalArgumentException(/*...*/)
  }

  override def deserialize(datum: Any): EntityId = datum match {
    case s: UTF8String => new EntityId(UUID.fromString(s.toString))
    case s: String => new EntityId(UUID.fromString(s))
    case _ => throw new IllegalArgumentException(/*...*/)
  }

  override def userClass: Class[EntityId] = classOf[EntityId]
}

Sometimes Spark serialises using normal Strings, sometimes using UTF8Strings.



Running Spark in the Cloud (AWS)


Three cluster managers:
• Standalone
• Apache Mesos
• Hadoop’s YARN

Two possibilities:
• Create an EC2 instance and use the spark-ec2 scripts to manage the instances. Time-consuming, and not everything works out of the box; e.g. the encoding has to be set manually.
• Use Amazon EMR to have a managed environment. Pricier, and new releases arrive a bit later. Uses YARN.

Both methods let you access data on S3 (AWS storage).

$ cat spark/conf/spark-defaults.conf
spark.akka.frameSize 1000
spark.driver.memory 11g
spark.driver.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 55g
spark.executor.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError



Running Spark and Cassandra


Cassandra is a distributed NoSQL database optimised for fault tolerance. It was originally developed at Facebook and is now used world-wide (Twitter, Reddit, …).

We have just started to experiment with it and fixed some bugs with respect to user defined types and Scala 2.10 reflection.

We are using DataStax’s driver to connect Spark and Cassandra. It has recently added support for Spark 1.6 (previously 1.5).

We are using cassandra-unit to write our unit tests.



Testing Spark


import java.net.ServerSocket
import scala.collection.mutable
import com.datastax.driver.core.Session
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuite, Matchers}

abstract class TestBase extends FunSuite
    with BeforeAndAfterAll with BeforeAndAfterEach with Matchers {

  protected val sparkConfigProperties = mutable.Map[String, String]()
  protected implicit var sparkContext: SparkContext = _
  protected implicit var sqlContext: SQLContext = _
  protected implicit var cassandraSession: Session = _

  // Asks the OS for a free port and releases it again straight away.
  private def freePort(): String = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort.toString finally socket.close()
  }

  override def beforeAll(): Unit = {
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName) // SparkContext requires an application name
      .set("spark.testing", "true")
      .set("spark.ui.enabled", "false")
      // Random free ports avoid clashes when tests run in parallel.
      .set("spark.master.ui.port", freePort())
      .set("spark.worker.ui.port", freePort())
      .setAll(sparkConfigProperties)

    sparkContext = new SparkContext(conf)
    sqlContext = new SQLContext(sparkContext)
  }

  override def afterAll(): Unit = {
    sparkContext.stop()
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")
  }
}

Testing Spark and Cassandra


class CassandraTest extends TestBase {
  sparkConfigProperties("") = ""

  override def beforeAll(): Unit = {
    EmbeddedCassandraServerHelper.startEmbeddedCassandra("cassandra.yaml", 300000)

    sparkConfigProperties("spark.cassandra.connection.port") =
      EmbeddedCassandraServerHelper.getNativeTransportPort.toString

    // Creates sparkContext with the properties set above.
    super.beforeAll()

    cassandraSession = CassandraConnector(sparkContext.getConf).openSession()

    cassandraSession.execute("DROP KEYSPACE IF EXISTS test_keyspace")
    val dataLoader = new CQLDataLoader(cassandraSession)
    dataLoader.load(
      new ClassPathCQLDataSet("cassandra/create_schema.cql", true, "test_keyspace"))
  }

  override def afterAll(): Unit = {
    cassandraSession.close()
    EmbeddedCassandraServerHelper.cleanEmbeddedCassandra()
    super.afterAll()
  }
}

Getting your data back into shape.