Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman

Transcript of Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman

Page 1: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman

Morri Feldman

The Road Less Traveled

Highlights and Challenges from Running Spark on Mesos in Production

morri@appsflyer.com

Page 2

The Plan

1. Attribution & Overall Architecture

2. Retention

3. Data Infrastructure - Spark on Mesos

Page 3

The Flow

[Diagram: User Device, Media sources, AppsFlyer Servers; the user is redirected to the store]

Enables:
• Cost Per Install (CPI)
• Cost Per In-app Action (CPA)
• Revenue Share
• Network Optimization
• Retargeting

Page 4

Page 5

[Chart: retention matrix – install day vs. days 1–12 since install]

Page 6

Retention Scale

> 30 Million Installs / Day
> 5 Billion Sessions / Day


Page 7

Retention Dimensions

Page 8

Retention V1 (MVP)

Two Dimensions (App-Id and Media-Source)

Cascalog – Datalog / logic programming over Cascading / Hadoop

Page 11

Retention – Spark SQL / Parquet

S3 Data v1 – Hadoop sequence files:
Key, Value: <Kafka Offset, JSON Message>
Gzip compressed, ~1.8 TB / day

S3 Data v2 – Parquet files (schema on write):
Retain the fields required for retention, apply some business logic while converting.
Generates “tables” for installs and sessions.

Retention v2 – “SELECT … JOIN ON ...”
18 dimensions vs. 2 in the original report
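The v2 report boils down to a join between the installs and sessions tables. A rough pure-Python illustration of that join (field names hypothetical, and only two of the 18 dimensions shown):

```python
from datetime import date

# Hypothetical miniature "tables"; the real ones are Parquet files on S3.
installs = [
    {"app_id": "app1", "device": "a", "media_source": "m1", "install_day": date(2015, 1, 1)},
    {"app_id": "app1", "device": "b", "media_source": "m2", "install_day": date(2015, 1, 1)},
]
sessions = [
    {"app_id": "app1", "device": "a", "day": date(2015, 1, 2)},
    {"app_id": "app1", "device": "a", "day": date(2015, 1, 3)},
    {"app_id": "app1", "device": "b", "day": date(2015, 1, 2)},
]

def daily_retention(installs, sessions):
    """Join sessions to installs on (app_id, device) and count each
    retained device once per (cohort_day, activity_day, dimensions)."""
    by_key = {(i["app_id"], i["device"]): i for i in installs}
    seen, counts = set(), {}
    for s in sessions:
        inst = by_key.get((s["app_id"], s["device"]))
        if inst is None:
            continue  # session with no matching install
        row = (inst["install_day"], s["day"], inst["app_id"], inst["media_source"])
        if (row, s["device"]) in seen:
            continue  # a device counts once per activity day
        seen.add((row, s["device"]))
        counts[row] = counts.get(row, 0) + 1
    return counts

counts = daily_retention(installs, sessions)
```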

Page 12

Retention Calculation Phases

1. Daily aggregation: Cohort_day, Activity_day, <Dimensions>, Retained Count

2. Pivot: Cohort_day, <Dimensions>, Day0, Day1, Day2 …

After Aggregation and Pivot ~ 1 billion rows
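The pivot phase can be sketched in plain Python (the phase-1 output rows here are invented for illustration):

```python
from datetime import date

# Output rows of phase 1: (cohort_day, activity_day, dimensions, retained_count).
daily = [
    (date(2015, 1, 1), date(2015, 1, 1), ("app1", "m1"), 100),
    (date(2015, 1, 1), date(2015, 1, 2), ("app1", "m1"), 40),
    (date(2015, 1, 1), date(2015, 1, 3), ("app1", "m1"), 25),
]

def pivot(rows, max_day=12):
    """Phase 2: turn activity days into Day0..DayN columns per (cohort_day, dims)."""
    out = {}
    for cohort_day, activity_day, dims, count in rows:
        cols = out.setdefault((cohort_day, dims), [0] * (max_day + 1))
        offset = (activity_day - cohort_day).days
        if 0 <= offset <= max_day:
            cols[offset] += count
    return out

table = pivot(daily)
# table[(date(2015, 1, 1), ("app1", "m1"))] starts [100, 40, 25, 0, ...]
```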

Page 13

Data Warehouse v3

Parquet files – schema on read
Retain almost all fields from the original JSON
Do not apply any business logic
Business logic is applied when reading, through use of a shared library

Page 14

Spark and Spark Streaming: ETL for Druid

SQL

Page 15

Why?

All data on S3 – no need for HDFS
Spark & Mesos have a long history
Some interest in moving our attribution services to Mesos
Began using Spark with EC2 “standalone” cluster scripts (no VPC)
Easy to set up
Culture of trying out promising technologies

Page 16

Mesos Creature Comforts

Nice UI – job outputs / sandbox easy to find
Driver and slave logs are accessible

Page 17

Mesos Creature Comforts

Fault tolerant – masters store data in ZooKeeper and can fail over smoothly
Nodes join and leave the cluster automatically at boot-up / shutdown

Page 18

Job Scheduling – Chronos

?

https://aphyr.com/posts/326-jepsen-chronos

Page 19

Specific Lessons / Challenges using Spark, Mesos & S3

-or- What Went Wrong with Spark / Mesos & S3 and How We Fixed It.

Spark / Mesos in production for nearly 1 year

Page 20

S3 is not HDFS

S3n gives tons of timeouts and DNS Errors @ 5pm Daily

Can compensate for timeouts with spark.task.maxFailures set to 20

Use S3a from Hadoop 2.7 (S3a in 2.6 generates millions of partitions – HADOOP-11584)

https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
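In spark-defaults.conf terms, the two mitigations above look roughly like this (a sketch; property names as of Spark 1.x / Hadoop 2.7):

```properties
# Ride out transient S3 timeouts / DNS errors by retrying tasks
spark.task.maxFailures  20

# Read and write through s3a:// (Hadoop 2.7+); mapping the scheme
# explicitly is harmless even where core-default.xml already does it
spark.hadoop.fs.s3a.impl  org.apache.hadoop.fs.s3a.S3AFileSystem
```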

Page 21

S3 is not HDFS, part 2: Use a Direct Output Committer

https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

Spark writes files to staging area and renames them at end of job

Rename on S3 is an expensive operation (~10s of minutes for thousands of files)

Direct Output Committers write to the final output location (safe because S3 is atomic, so writes always succeed)

Disadvantages:
– Incompatible with speculative execution
– Poor recovery from failures during write operations
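In Spark 1.x this was commonly wired up through the Parquet committer setting; a hedged spark-defaults.conf sketch (one plausible configuration, not necessarily AppsFlyer's exact one):

```properties
# Write Parquet output directly to its final S3 location, skipping the
# staging-area write-then-rename (this committer was removed in Spark 2.0)
spark.sql.parquet.output.committer.class  org.apache.spark.sql.parquet.DirectParquetOutputCommitter

# A direct committer cannot arbitrate between duplicate task attempts,
# so speculative execution must stay off
spark.speculation  false
```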

Page 22

Avoid .0 releases if possible

https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

Worst example

Spark 1.4.0 randomly loses data, especially on jobs with many output partitions

Fixed by SPARK-8406

Page 23

Coarse-Grained or Fine-Grained?

TL;DR – Use coarse-grained. Not perfect, but stable.

Page 24

Coarse-Grained – Disadvantages

spark.cores.max (not dynamic)

Page 25

Coarse-Grained with Dynamic Allocation
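Roughly, the two modes differ in these settings (a hedged sketch; the cores value is illustrative):

```properties
# Coarse-grained: the job pins up to this many cores for its whole
# lifetime, whether or not it is using them (illustrative value)
spark.cores.max  120

# Dynamic allocation releases idle executors; it requires the
# external shuffle service running on every node
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
```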

Page 26

Tuning Jobs in Coarse-Grained

Page 27

Tuning Jobs in Coarse-Grained

Set executor memory to ~ the entire memory of a machine (200 GB for an r3.8xlarge)

spark.task.cpus then effectively controls Spark memory per task

[Diagram: a single executor with 200 GB and 32 CPUs hitting OOM!!]
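The arithmetic behind that knob, assuming one fat executor per machine: memory per concurrent task is roughly executor memory divided by (executor cores / spark.task.cpus), so raising spark.task.cpus raises per-task memory. A hedged spark-defaults.conf sketch:

```properties
# One executor spanning ~all of an r3.8xlarge (32 cores):
# 200 GB / 32 concurrent tasks = ~6 GB per task.
spark.executor.memory  200g

# Raising this to 2 halves the concurrent tasks per executor,
# giving each task ~12 GB and heading off the OOM.
spark.task.cpus  2
```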

Page 28

Tuning Jobs in Coarse-Grained

More Shuffle Partitions

[Diagram: the same job with more shuffle partitions avoiding the OOM!!]
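The other lever is the shuffle partition count: with more, smaller partitions, each task holds less data in memory. In Spark SQL this is a single setting (default 200; the value here is illustrative):

```properties
# More, smaller shuffle partitions shrink each task's working set
spark.sql.shuffle.partitions  2000
```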

Page 29

Spark on Mesos Future Improvements

Increased stability
Dynamic allocation
Tungsten

Mesos Maintenance Primitives, experimental in 0.25.0

Gracefully reduce size of cluster by marking nodes that will soon be killed

Inverse Offers – preemption, more dynamic scheduling

Page 30

How We Generated Duplicate Data

OR

S3 is Still Not HDFS

Page 31

S3 is Still Not HDFS

S3 is Eventually Consistent
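One consequence: a listing taken right after a write can be stale, so a retried job can write rows that already exist. Since every message carries its Kafka offset (the key in the v1 sequence files), one defensive pattern is to deduplicate by offset when reading; a minimal pure-Python sketch with hypothetical fields:

```python
# Hypothetical rows; real messages are keyed by Kafka offset, as in the
# v1 sequence files.
rows = [
    {"kafka_offset": 1, "event": "install"},
    {"kafka_offset": 2, "event": "session"},
    {"kafka_offset": 2, "event": "session"},  # written twice by a retried job
]

def dedupe(rows):
    """Keep the first row seen for each Kafka offset."""
    seen, out = set(), []
    for r in rows:
        if r["kafka_offset"] not in seen:
            seen.add(r["kafka_offset"])
            out.append(r)
    return out

clean = dedupe(rows)  # the duplicate session is dropped
```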

Page 32

We are Hiring! https://www.appsflyer.com/jobs/