Real-Life Apache Spark: Tips and Tricks from the Trenches

Noah Bieler

Wealthport AG

Zurich Spark Meetup, March 2016

#ZurichSparkUsers

Overview

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Spark and the MapReduce Model

Map Reduce

Express your computations in terms of map (embarrassingly parallel) and reduce operations.

[Figure: Bread, Tomato, Cheese → Sandwich]

Spark RDD

• RDDs (Resilient Distributed Datasets) are the abstraction Spark uses to model parallelism.

• Uses the MapReduce model (map, reduce, immutability)

• RDDs in the code are only instructions to compute something. What Spark actually does is not obvious (optimisations, predicate pushdown etc.)

• Since the actual computation is “delayed” you cannot use RDDs within RDDs.

In the code: rdd1 → map f → rdd2 → map g → rdd3 → map h → rdd4 → count → c: Long

In the VM: rdd1 → map (f . g . h) → rdd2 → count → c: Long
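The delayed evaluation can be imitated without Spark. The following is only an analogy using plain Scala iterators, not Spark code: the object name and the call counter are made up for illustration. Building the pipeline runs nothing; consuming it (the stand-in for an action) pushes each element through f, g and h in one pass.

```scala
object LazyPipelineSketch {
  // Counts how often any of the three functions is actually invoked.
  var calls = 0

  val f: Int => Int = { x => calls += 1; x + 1 }
  val g: Int => Int = { x => calls += 1; x * 2 }
  val h: Int => Int = { x => calls += 1; x - 3 }

  // "In the code": three separate map steps, but Iterator.map is lazy,
  // so this line only describes the computation.
  val pipeline: Iterator[Int] = Iterator(1, 2, 3).map(f).map(g).map(h)
  val callsBeforeAction: Int = calls // nothing has run yet

  // "In the VM": materialising the iterator plays the role of an action.
  // Each element now flows through f, g and h in a single pass.
  val results: List[Int] = pipeline.toList
  val c: Long = results.size.toLong
  val callsAfterAction: Int = calls
}
```

Spark's scheduler does considerably more than this (stages, partitions, optimisations), so treat the sketch as a mental model only.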

RDD: Mental Model

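[Figure: On the driver, rdd1 → map f → rdd2 is only a description. An action triggers the computation on the nodes, where map f is applied to every partition of rdd1 (Partition 1 … Partition N).]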

object Main {
  def test = {
    val rdd: RDD[Int] = ...

    val a = 1 + 2 + 3 // happens on the driver

    rdd.map { i =>
      i + a // happens on the nodes
    }
  }
}

parallelize

RDDs vs. DataFrames (since 1.3) vs. Datasets (since 1.6)

RDDs are the most basic building block in Spark. Limited API but full control and type safety.

DataFrames are RDDs of Rows (= Seq[Any], no type safety!) with a schema; basically like a table. More methods but less control. For example, one cannot control the partitioning. Possibility to use SQL statements.

New Datasets (since Spark 1.6) are like RDDs (type safety) but with (optimised) methods known from DataFrames (count).

Spark Pitfalls: Join

join: RDD[(K, V)] with RDD[(K, W)]

Partition 1:
  rdd1:   1 -> "abc", 2 -> "dfg", …
  rdd2:   1 -> 3.142, 2 -> 2.718, …
  result: 1 -> ("abc", 3.142), 2 -> ("dfg", 2.718), …

Partition 2:
  rdd1:   3 -> "hij", 4 -> "xzy", …
  rdd2:   3 -> 1.618, 4 -> 8.314, …
  result: 3 -> ("hij", 1.618), 4 -> ("xzy", 8.314), …

No network traffic: matching keys already live in the same partition.

Before you join, make sure that the two RDDs are partitioned with the same partitioner.

rdd.partitionBy(new HashPartitioner(4 * nodeCount))
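A minimal sketch of such a co-partitioned join, using the sample data from the figure. The object name, the function name and the numPartitions parameter are assumptions made for this example; the slides use 4 * nodeCount as the partition count.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoinSketch {

  /** Partitions both inputs with the same partitioner, then joins them. */
  def joinCoPartitioned(
    sc: SparkContext,
    numPartitions: Int // hypothetical parameter; the slides use 4 * nodeCount
  ): RDD[(Int, (String, Double))] = {
    val partitioner = new HashPartitioner(numPartitions)

    val rdd1 = sc.parallelize(Seq(1 -> "abc", 2 -> "dfg", 3 -> "hij", 4 -> "xzy"))
      .partitionBy(partitioner)
      .persist() // keep the partitioned data, otherwise it is reshuffled on reuse

    val rdd2 = sc.parallelize(Seq(1 -> 3.142, 2 -> 2.718, 3 -> 1.618, 4 -> 8.314))
      .partitionBy(partitioner)
      .persist()

    // Both sides share the partitioner, so matching keys are already
    // in the same partition when the join runs.
    rdd1.join(rdd2)
  }
}
```

The partitionBy calls themselves still shuffle once, so this pays off mainly when the partitioned RDDs are reused for several joins.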

Spark Pitfalls: Join

Don't use map on a partitioned pair RDD; use mapValues if possible. Otherwise the partitioning is destroyed.

rdd1: 1 -> "abc", 2 -> "dfgh", …

map { case (k, v) => (k, v.size) }
  result: 1 -> 3, 2 -> 4, …
  Spark cannot know whether the key was changed. → Partitioning is erased.

mapValues(_.size)
  result: 1 -> 3, 2 -> 4, …
  Spark knows that the key was not changed. → Partitioning is kept.
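The difference can be observed through the partitioner field of the resulting RDDs. This is a small sketch; the object and function names are invented for the example, and the partition count of 4 is arbitrary.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD

object KeepPartitioningSketch {

  /** Returns whether a partitioner survives (after map, after mapValues). */
  def partitionersAfter(sc: SparkContext): (Boolean, Boolean) = {
    val rdd1: RDD[(Int, String)] =
      sc.parallelize(Seq(1 -> "abc", 2 -> "dfgh"))
        .partitionBy(new HashPartitioner(4))

    // map may change the key, so Spark drops the partitioner.
    val viaMap = rdd1.map { case (k, v) => (k, v.size) }

    // mapValues cannot touch the key, so the partitioner is kept.
    val viaMapValues = rdd1.mapValues(_.size)

    (viaMap.partitioner.isDefined, viaMapValues.partitioner.isDefined)
  }
}
```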

Spark Pitfalls: Joining a large and a small RDD

[Figure: rdd1 (large, not partitioned: 1 -> "a", 1 -> "b", 2 -> "c", 1 -> "c", 3 -> "c") is joined against sc.broadcast(rdd2.collect()) (small: 1 -> 3.142, ...).]

When joining a large RDD with a small one, it might be better to broadcast the small one, especially if the RDDs would otherwise have to be partitioned first.
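One way to write such a broadcast join by hand is sketched below. The names are made up for the example, and the sketch assumes that the small RDD has unique keys and fits comfortably in memory on the driver and on every executor.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {

  /** Inner join of a large RDD with a small one, without shuffling the large one. */
  def broadcastJoin(
    sc: SparkContext,
    large: RDD[(Int, String)],
    small: RDD[(Int, Double)]
  ): RDD[(Int, (String, Double))] = {
    // Assumption: keys in `small` are unique, so a Map does not lose data.
    val lookup = sc.broadcast(small.collect().toMap)

    // Each record of the large RDD is matched locally against the broadcast map;
    // records without a partner are dropped, as in an inner join.
    large.flatMap { case (k, v) =>
      lookup.value.get(k).map(w => k -> (v, w)).toList
    }
  }
}
```

The large RDD keeps its existing layout, so no partitionBy is needed on either side.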

Spark Pitfalls: Persistence

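Persist DataFrames/RDDs that feed more than one branch of transformations; otherwise they are recomputed for every branch.

[Figure: an RDD marked with .persist() branches into map g and map h.]

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object RDDs {

  /** Automatically persist and unpersist an RDD
    * before and after the calculation. */
  def withPersistedRDD[A, B](
    rdd: RDD[A],
    storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY // same level as a plain persist()
  )(f: RDD[A] => B): B = {
    val result = Try(f(rdd.persist(storageLevel)))
    rdd.unpersist()
    result.get // rethrows a failure of f, but only after unpersisting
  }
}

// Usage (with RDDs.withPersistedRDD imported):
withPersistedRDD(rdd1.map(f)) { rdd2 =>
  val rdd3 = rdd2.map(g)
  val rdd4 = rdd2.map(h)
  /* ... */
  result
}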

Spark Pitfalls: Serialisation

class Algorithm1(val primeNumber: Int) extends Serializable {

  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}

You actually use this.primeNumber and therefore serialise the whole instance (including veryLargeTable).

class Algorithm2(val primeNumber: Int) {

  def run(rdd: RDD[String]): RDD[Int] = {
    val _primeNumber = primeNumber

    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
    }
  }
}

A local copy of this.primeNumber avoids serialising the whole instance.
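The capture of this can be made visible without a cluster. The sketch below uses plain Java serialisation instead of Spark; the object, class and method names are invented for the example, and the class is deliberately left non-serialisable so that the captured instance causes a failure instead of silently travelling along.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {

  // Deliberately NOT Serializable, so that capturing `this` becomes visible.
  class Algorithm(val primeNumber: Int) {

    // Refers to this.primeNumber: the closure holds a reference to the instance.
    def capturingThis: Char => Int = c => primeNumber * c.toInt

    // Copies the field into a local val first: only that Int is captured.
    def capturingLocalCopy: Char => Int = {
      val _primeNumber = primeNumber
      c => _primeNumber * c.toInt
    }
  }

  /** True if plain Java serialisation of `obj` succeeds. */
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With a Serializable class, as in Algorithm1, the first variant would not fail; it would ship the whole instance with every task, which is the costly case described above.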

Spark Pitfalls: Serialisation

class Algorithm(val primeNumber: Int) extends Serializable {

  def hash(s: String) =
    s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }

  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s => hash(s) }
  }
}

You actually use this.hash and therefore serialise the whole instance (including veryLargeTable).

class Algorithm(val primeNumber: Int) {

  def hashFunction() = {
    val _primeNumber = primeNumber
    (s: String) => s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
  }

  def run(rdd: RDD[String]): RDD[Int] = {
    val hash = hashFunction()
    rdd.map { s => hash(s) }
  }
}

A function factory for hash avoids serialising the whole instance.

Spark Pitfalls: MapLike not Serializable

object Main {
  val myMap = Map(1 -> "a", 2 -> "bc", 3 -> "def")
    .mapValues(_.size) // Produces MapLike, not Serializable
    .map(identity)     // Produces Map again

  val myOtherMap = /* ... */

  val totalSize = sc.parallelize(Seq(myMap, myOtherMap))
    .map(_.size)
    .reduce(_ + _) // Would fail without map(identity), SI-7005
}

After running mapValues on a Map, run map(identity) on it to avoid a NotSerializableException.

Spark Pitfalls: Avoid groupByKey followed by mapValues

The collection of values that groupByKey produces for each key can be very large. Try to avoid it and use reduceByKey where possible.

val rdd = sc.parallelize(Seq(
  "Hello", "World", "Bonjour", "Monde", "Guten Tag", "Welt"
))

val histogram1 = rdd
  .map(_.size -> null)
  .groupByKey() // : RDD[(Int, Iterable[Null])]
  .mapValues(_.size)

val histogram2 = rdd
  .map(_.size -> 1)
  .reduceByKey(_ + _)

Spark Pitfalls: Rows and nulls

• Spark's Row is nothing but a wrapper for Seq[Any]: no type safety!
• A Row will return null if there is no value present!

row.getAs[String](index) == null // no exception, if row(index) == null

• A Row can lose its schema.

A proper type hierarchy would not even define the function getAs(fieldName: String) for Rows without a schema!

dataFrame
  .map { row =>
    val newRow = Row.fromSeq(row.toSeq.updated(timeIndex, timeStamp))
    row.getAs[Int]("ID") -> newRow // Access element by field name
  }
  .map { case (id, row) =>
    id -> row.getAs[String]("First Name") // No schema!
  }

Pimp my Spark: The “Pimp my Library” Pattern

object RowImplicits {

  implicit class RowImplicit(row: Row) {

    def updated[T](attributeId: AttributeId, value: T): Row = {
      val newRow = Row.fromSeq(row.toSeq.updated(row.fieldIndex(attributeId), value))
      Option(row.schema).map(newRow.withSchema).getOrElse(newRow)
    }

    def withSchema(schema: StructType): Row =
      new GenericRowWithSchema(row.toSeq.toArray, schema)

    def getStringOption(attributeIndex: Int): Option[String] = {
      if (row.isNullAt(attributeIndex)) None
      else Some(row.getString(attributeIndex))
    }
  }
}

Add functionality to every possible library.

User Defined Types

Functional programming stands on three pillars:
• Variables are immutable (no side effects)
• Functions are first-class citizens (higher-order functions)
• Algebraic data types (strongly typed)

A good type hierarchy ensures that each function has only valid input and sane output.

Thus, it is essential that Spark supports custom data types.

http://pt.slideshare.net/ScottWlaschin/fp-patterns-buildstufflt

def div(numerator: Int, denominator: NonZeroInteger) = numerator / denominator.value

def div(numerator: Int, denominator: Int) = denominator match {
  case 0 => None
  case _ => Some(numerator / denominator)
}
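The slides use NonZeroInteger without defining it. One possible definition is sketched here; the private constructor, the fromInt factory and the enclosing object are assumptions made for this example. The zero check happens exactly once, when the value is created, so the typed div needs no error case.

```scala
object NonZeroSketch {

  // Hypothetical definition: the constructor is private, so the only way
  // to obtain an instance is the checked factory in the companion object.
  final class NonZeroInteger private (val value: Int)

  object NonZeroInteger {
    /** Returns None for 0, so an invalid denominator can never be constructed. */
    def fromInt(i: Int): Option[NonZeroInteger] =
      if (i == 0) None else Some(new NonZeroInteger(i))
  }

  // Always returns a plain Int: division by zero is ruled out by the type.
  def div(numerator: Int, denominator: NonZeroInteger): Int =
    numerator / denominator.value
}
```

A caller would write, for example, NonZeroInteger.fromInt(userInput).map(div(12, _)) and handle the None case at the boundary where the raw input enters the program.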

User Defined Types

@SQLUserDefinedType(udt = classOf[EntityIdType])
case class EntityId(uuid: UUID) extends Serializable

object EntityId {
  def generate(): EntityId = EntityId(UUID.randomUUID())
}

case object EntityIdType extends EntityIdType

If you want to identify your rows with UUIDs, you need to use user defined types since Spark does not support UUIDs.

User Defined Types

class EntityIdType private extends UserDefinedType[EntityId]() {

  override def sqlType: DataType = StringType

  override def serialize(obj: Any): UTF8String = obj match {
    case null => null.asInstanceOf[UTF8String]
    case t: EntityId => UTF8String.fromString(t.uuid.toString)
    case _ => throw new IllegalArgumentException(/*...*/)
  }

  override def deserialize(datum: Any): EntityId = datum match {
    case s: UTF8String => new EntityId(UUID.fromString(s.toString))
    case s: String => new EntityId(UUID.fromString(s))
    case _ => throw new IllegalArgumentException(/*...*/)
  }

  override def userClass: Class[EntityId] = classOf[EntityId]
}

Sometimes Spark serialises using normal Strings, sometimes using UTF8Strings.

Running Spark in the Cloud (AWS)

Three cluster managers:
• Standalone
• Apache Mesos
• Hadoop's YARN

Two possibilities:
• Create an EC2 instance and use the spark-ec2 scripts to manage the instances. Time-consuming, and not everything works out of the box. E.g. the encoding has to be set manually.
• Use Amazon EMR to have a managed environment. Pricier, and new releases arrive a bit later. Uses YARN.

Both methods let you access data on S3 (AWS storage).

$ cat spark/conf/spark-defaults.conf
spark.akka.frameSize 1000
spark.driver.memory 11g
spark.driver.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 55g
spark.executor.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError

Running Spark and Cassandra

Cassandra is a distributed NoSQL database optimised for fault tolerance. Initially developed at Facebook and now used worldwide (Twitter, Reddit, …).

We have just started to experiment with it and fixed some bugs with respect to user defined types and Scala 2.10 reflection.

We are using DataStax's driver to connect Spark and Cassandra. It recently added support for Spark 1.6 (previously 1.5).

We are using cassandra-unit to write our unit tests.

Testing Spark

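import java.net.ServerSocket

import com.datastax.driver.core.Session
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuite, Matchers}

import scala.collection.mutable

abstract class TestBase extends FunSuite
  with BeforeAndAfterAll with BeforeAndAfterEach with Matchers {

  protected val sparkConfigProperties = mutable.Map[String, String]()
  protected implicit var sparkContext: SparkContext = _
  protected implicit var sqlContext: SQLContext = _
  protected implicit var cassandraSession: Session = _

  /** Asks the OS for a free port and releases the socket again.
    * (Passing the ServerSocket itself to String.valueOf yields its
    * toString, not a port number, and leaks the socket.) */
  private def freePort(): String = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort.toString finally socket.close()
  }

  override def beforeAll(): Unit = {
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")

    val conf = new SparkConf()
      .setMaster("local")
      .set("spark.testing", "true")
      .set("spark.ui.enabled", "false")
      .set("spark.master.ui.port", freePort()) // Avoids port clashes with parallel tests running
      .set("spark.worker.ui.port", freePort()) // Avoids port clashes with parallel tests running
      .setAll(sparkConfigProperties)

    sparkContext = new SparkContext(conf)
    sqlContext = new SQLContext(sparkContext)
  }

  override def afterAll(): Unit = {
    sparkContext.stop()
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")
  }
}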

Testing Spark and Cassandra

class CassandraTest extends TestBase {
  sparkConfigProperties("spark.cassandra.connection.host") = "127.0.0.1"

  override def beforeAll(): Unit = {
    EmbeddedCassandraServerHelper.startEmbeddedCassandra("cassandra.yaml", 300000)

    sparkConfigProperties("spark.cassandra.connection.port") =
      EmbeddedCassandraServerHelper.getNativeTransportPort.toString

    super.beforeAll()

    cassandraSession = CassandraConnector(sparkContext.getConf).openSession()

    cassandraSession.execute("DROP KEYSPACE IF EXISTS test_keyspace")
    val dataLoader = new CQLDataLoader(cassandraSession)
    dataLoader.load(new ClassPathCQLDataSet("cassandra/create_schema.cql", true, "test_keyspace"))
  }

  override def afterAll(): Unit = {
    cassandraSession.close()
    EmbeddedCassandraServerHelper.cleanEmbeddedCassandra()
    super.afterAll()
  }
}

Wealthport AG, Rütistrasse 16, CH-8952 Schlieren, +41 43 508 50 96, info@wealthport.com, www.wealthport.com

Getting your data back into shape.
