Spark-summit-2013 Matei Zaharia

31
The State of Spark And Where We’re Going Next Matei Zaharia

description

Spark Summit 2013 The State of Spark, and Where We’re Going Next Matei Zaharia (CTO, Databricks; Assistant Professor, MIT) Summary of recent growth spark project

Transcript of Spark-summit-2013 Matei Zaharia

Page 1: Spark-summit-2013 Matei Zaharia

The State of Spark And Where We’re Going Next

Matei Zaharia

Page 2: Spark-summit-2013 Matei Zaharia

Community Growth

Page 3: Spark-summit-2013 Matei Zaharia

Project History Spark started as research project in 2009 Open sourced in 2010 » 1st version was 1600 LOC, could run Wikipedia demo

Growing community since Entered Apache Incubator in June 2013

Page 4: Spark-summit-2013 Matei Zaharia

Development Community With over 100 developers and 25 companies, one of the most active communities in big data

Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)

Past 6 months: more active devs than Hadoop MapReduce!

Page 5: Spark-summit-2013 Matei Zaharia

Development Community Healthy across the whole ecosystem

Page 6: Spark-summit-2013 Matei Zaharia

Release Growth

Spark 0.6: - Java API, Maven, standalone mode

- 17 contributors

Sept ‘13 Feb ‘13 Oct ‘12

Spark 0.7: - Python API, Spark Streaming

- 31 contributors

Spark 0.8: - YARN, MLlib, monitoring UI

- 67 contributors

Page 7: Spark-summit-2013 Matei Zaharia

YARN support (Yahoo!) Columnar compression in Shark (Yahoo!) Fair scheduling (Intel) Metrics reporting (Intel, Quantifind) New RDD operators (Bizo, ClearStory) Scala 2.10 support (Imaginea)

Some Community Contributions

Page 8: Spark-summit-2013 Matei Zaharia

Conferences

0 50

100 150 200 250 300 350 400 450 500

AMP Camp 1 (Aug 2012)

AMP Camp 2 (Aug 2013)

Spark Summit (Nov 2013)

Atte

ndee

s

Page 9: Spark-summit-2013 Matei Zaharia

Projects Built on Spark

Spark

Spark Streaming

(real-time)

GraphX (graph)

… Shark"(SQL)

MLbase (machine learning)

BlinkDB

Page 10: Spark-summit-2013 Matei Zaharia

What’s Next?

Page 11: Spark-summit-2013 Matei Zaharia

Our View While big data tools have advanced a lot, they are still far too difficult to tune and use Goal: design big data systems that are as powerful & seamless as those for small data

Page 12: Spark-summit-2013 Matei Zaharia

Current Priorities Standard libraries

Deployment

Out-of-the-box usability

Enterprise use +

Page 13: Spark-summit-2013 Matei Zaharia

Standard Libraries While writing K-means in 30 lines is great, it’s even better to call it from a library! Spark’s MLlib and GraphX will be standard libraries supported by core developers » MLlib in Spark 0.8 with 7 algorithms » GraphX coming soon » Both operate directly on RDDs

Page 14: Spark-summit-2013 Matei Zaharia

Standard Libraries val  rdd:  RDD[Array[Double]]  =  ...  val  model  =  KMeans.train(rdd,  k  =  10)      val  graph  =  Graph(vertexRDD,  edgeRDD)  val  ranks  =  PageRank.run(graph,  iters  =  10)  

Page 15: Spark-summit-2013 Matei Zaharia

Standard Libraries Beyond these libraries, Databricks is investing heavily in higher-level projects Spark Streaming:"easier 24/7 operation and optimizations coming in 0.9

Shark:"calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12

Goal: a complete and interoperable stack

Page 16: Spark-summit-2013 Matei Zaharia

Deployment Want Spark to easily run anywhere Spark 0.8: much improved YARN, EC2 support Spark 0.8.1: support for YARN 2.2

SIMR: launch Spark in MapReduce clusters as a Hadoop job (no installation needed!) » For experimenting; see talk by Ahir

Page 17: Spark-summit-2013 Matei Zaharia

Monitoring and metrics (0.8) Better support for large # of tasks (0.8.1) High availability for standalone mode (0.8.1)

External hashing & sorting (0.9)

Ease of Use

Long-term: remove need to tune beyond defaults

Page 18: Spark-summit-2013 Matei Zaharia

Next Releases Spark 0.8.1 (this month) » YARN 2.2, standalone mode HA, optimized shuffle,

broadcast & result fetching

Spark 0.9 (Jan 2014) » Scala 2.10 support, configuration system, Spark

Streaming improvements

Page 19: Spark-summit-2013 Matei Zaharia

What Makes Spark Unique?

Page 20: Spark-summit-2013 Matei Zaharia

Big Data Systems Today

MapReduce

Pregel

Dremel

GraphLab Storm

Giraph

Drill Tez

Impala

S4 …

Specialized systems (iterative, interactive and"

streaming apps)

General batch"processing

Page 21: Spark-summit-2013 Matei Zaharia

Spark’s Approach Instead of specializing, generalize MapReduce"to support new apps in same engine Two changes (general task DAG & data sharing) are enough to express previous models! Unification has big benefits » For the engine » For users

Spark

Stre

aming

Gra

phX

Shar

k

MLb

ase

Page 22: Spark-summit-2013 Matei Zaharia

Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

Impala (SQL) Giraph (Graph)

Spark

non-test, non-example source lines

Page 23: Spark-summit-2013 Matei Zaharia

Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

Impala (SQL) Giraph (Graph)

Spark

non-test, non-example source lines

Streaming

Page 24: Spark-summit-2013 Matei Zaharia

Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

Impala (SQL) Giraph (Graph)

Spark

non-test, non-example source lines

Streaming Shark*

* also calls into Hive

Page 25: Spark-summit-2013 Matei Zaharia

Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

Impala (SQL) Giraph (Graph)

Spark

non-test, non-example source lines

Streaming

GraphX Shark*

* also calls into Hive

Page 26: Spark-summit-2013 Matei Zaharia

Performance

Hive

Impa

la (d

isk)

Impa

la (m

em)

Shar

k (d

isk)

Shar

k (m

em)

0

5

10

15

20

25

30

35

40

45

Resp

onse

Tim

e (s)

SQL

Hado

op

Gira

ph

Gra

phX

0

5

10

15

20

25

30

Resp

onse

Tim

e (m

in)

Graph

Stor

m

Spar

k

0

5

10

15

20

25

30

35

Thro

ughp

ut (M

B/s/

node

)

Streaming

Page 27: Spark-summit-2013 Matei Zaharia

What it Means for Users Separate frameworks:

… HDFS read

HDFS write ET

L HDFS read

HDFS write tra

in HDFS read

HDFS write qu

ery

HDFS

HDFS read ET

L tra

in qu

ery

Spark:

Interactive"analysis

Page 28: Spark-summit-2013 Matei Zaharia

Combining Processing Types

val  points  =  sc.runSql[Double,  Double](      “select  latitude,  longitude  from  historic_tweets”)    val  model  =  KMeans.train(points,  10)    sc.twitterStream(...)      .map(t  =>  (model.closestCenter(t.location),  1))      .reduceByWindow(“5s”,  _  +  _)  

From Scala:

Page 29: Spark-summit-2013 Matei Zaharia

Combining Processing Types

GENERATE  KMeans(tweet_locations)  AS  TABLE  tweet_clusters      //  Scala  table  generating  function  (TGF):  object  KMeans  {      @Schema(spec  =  “x  double,  y  double,  cluster  int”)      def  apply(points:  RDD[(Double,  Double)])  =  {          ...      }  }  

 

From SQL (in Shark 0.8.1):

Page 30: Spark-summit-2013 Matei Zaharia

Conclusion Next challenge in big data will be complex and low-latency applications Spark offers a unified engine to tackle and combine these apps Best strength is the community: enjoy Spark Summit!

Page 31: Spark-summit-2013 Matei Zaharia

Contributors Aaron Davidson Alexander Pivovarov Ali Ghodsi Ameet Talwalkar Andre Shumacher Andrew Ash Andrew Psaltis Andrew Xia Andrey Kouznetsov Andy Feng Andy Konwinski Ankur Dave Antonio Lupher Benjamin Hindman Bill Zhao Charles Reiss Chris Mattmann Christoph Grothaus Christopher Nguyen Chu Tong Cliff Engle Cody Koeninger David McCauley Denny Britz Dmitriy Lyubimov Edison Tung Eric Zhang Erik van Oosten

Ethan Jewett Evan Chan Evan Sparks Ewen Cheslack-Postava Fabrizio Milo Fernand Pajot Frank Dai Gavin Li Ginger Smith Giovanni Delussu Grace Huang Haitao Yao Haoyuan Li Harold Lim Harvey Feng Henry Milner Henry Saputra Hiral Patel Holden Karau Ian Buss Imran Rashid Ismael Juma James Phillpotts Jason Dai Jerry Shao Jey Kottalam Joseph E. Gonzalez Josh Rosen

Justin Ma Kalpit Shah Karen Feng Karthik Tunga Kay Ousterhout Kody Koeniger Konstantin Boudnik Lee Moon Soo Lian Cheng Liang-Chi Hsieh Marc Mercer Marek Kolodziej Mark Hamstra Matei Zaharia Matthew Taylor Michael Heuer Mike Potts Mikhail Bautin Mingfei Shi Mosharaf Chowdhury Mridul Muralidharan Nathan Howell Neal Wiggins Nick Pentreath Olivier Grisel Patrick Wendell Paul Cavallaro Paul Ruan

Peter Sankauskas Pierre Borckmans Prabeesh K. Prashant Sharma Ram Sriharsha Ravi Pandya Ray Racine Reynold Xin Richard Benkovsky Richard McKinley Rohit Rai Roman Tkalenko Ryan LeCompte S. Kumar Sean McNamara Shane Huang Shivaram Venkataraman Stephen Haberman Tathagata Das Thomas Dudziak Thomas Graves Timothy Hunter Tyson Hamilton Vadim Chekan Wu Zeming Xinghao Pan