Spark 1.1 and Beyond
Transcript of Spark 1.1 and Beyond
Patrick Wendell
About Me
Work at Databricks leading the Spark team
Spark 1.1 Release manager
Committer on Spark since AMPLab days
This Talk
Spark 1.1 (and a bit about 1.2)
A few notes on performance
Q&A with myself, Tathagata Das, and Josh Rosen
A Bit about Spark…

The Spark stack:
- Spark core (RDD API), running over HDFS, S3, and Cassandra, on YARN, Mesos, or Standalone cluster managers
- Spark Streaming (real-time): DStreams, streams of RDDs
- GraphX (alpha): RDD-based graphs
- MLlib (machine learning): RDD-based matrices
- Spark SQL: RDD-based tables
Spark Release Process
~3 month release cycle, time-scoped:
- 2 months of feature development
- 1 month of QA
Maintain older branches with bug fixes
Upcoming release: 1.1.0 (previous was 1.0.2)
Branching model:
- Master: more features, less stable
- branch-1.1 → v1.1.0
- branch-1.0 → v1.0.0, v1.0.1 (more stable)

For any POC or non-production cluster, we always recommend running off of the head of a release branch.
Spark 1.1
1,297 patches
200+ contributors (still counting)
Dozens of organizations
To get updates, join our dev list: e-mail [email protected]
Roadmap
Spark 1.1 and 1.2 have similar themes
Spark core: usability, stability, and performance

MLlib/SQL/Streaming: expanded feature set and performance. Around 40% of mailing list traffic is about these libraries.
Spark Core in 1.1
Performance “out of the box”:
- Sort-based shuffle
- Efficient broadcasts
- Disk spilling in Python
- YARN usability improvements

Usability:
- Task progress and user-defined counters
- UI behavior for failing or large jobs
Spark SQL in 1.1
- 1.0 was the first “preview” release
- 1.1 provides an upgrade path for Shark
- Replaced Shark in our benchmarks with 2-3X perf gains
- Can perform optimizations with 10-100X less effort than Hive
Turning an RDD into a Relation

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
```
Querying using SQL

```scala
// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (a la LINQ).
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
```
Spark SQL in 1.1
- JDBC server for multi-tenant access and BI tools
- Native JSON support
- Public types API – “make your own” SchemaRDDs
- Improved operator performance
- Native Parquet support and optimizations
Spark Streaming
Stability improvements across the board
Amazon Kinesis support
Rate limiting for streams
Support for polling Flume streams
Streaming + ML: streaming linear regression
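The new rate limiting for streams caps how many records per second a receiver accepts. As a hedged illustration of the general idea (not Spark's actual `RateLimiter` implementation), a minimal records-per-second limiter can be sketched in a few lines:

```scala
// Minimal sketch of a records-per-second limiter: admit up to
// `maxPerSecond` records in each one-second window, reject the rest.
// Time is passed in explicitly (millis) to keep the sketch testable.
final class SimpleRateLimiter(maxPerSecond: Int) {
  private var windowStart = 0L // start of the current 1s window, in millis
  private var count = 0        // records admitted in the current window

  // Returns true if a record arriving at `nowMillis` is within the limit.
  def tryAcquire(nowMillis: Long): Boolean = {
    if (nowMillis - windowStart >= 1000L) { windowStart = nowMillis; count = 0 }
    if (count < maxPerSecond) { count += 1; true } else false
  }
}
```

A real receiver would block or drop on a rejected record; the windowed counter above is only the admission-control core.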
What’s new in MLlib v1.1
• Contributors: 40 (in v1.0) → 68
• Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression
• Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec
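Among the new feature-extraction tools, tf-idf weighs a term by its frequency in a document against how many documents contain it. A hedged sketch of the classic formula tf(t, d) × log(N / df(t)) on toy documents (not MLlib's exact smoothed variant):

```scala
// Raw term frequency within one document (a tokenized Seq of terms).
def termFreqs(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble }

// tf-idf score for every term of every document in the corpus.
def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
  val n = docs.size.toDouble
  // Document frequency: in how many documents each term appears.
  val df = docs.flatMap(_.distinct).groupBy(identity)
    .map { case (t, ts) => t -> ts.size.toDouble }
  docs.map { doc =>
    termFreqs(doc).map { case (t, tf) => t -> tf * math.log(n / df(t)) }
  }
}
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous terms are zeroed out while rare, document-specific terms score highest.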
• Statistics: sampling (core), correlations, hypothesis testing, random data generation
• Performance and scalability: major improvement to decision tree, tree aggregation
• Python API: decision tree, statistics, linear methods
Performance (v1.0 vs. v1.1) [benchmark chart]
Sort-based Shuffle
Old shuffle: each mapper opens a file for each reducer and writes output simultaneously. Files = # mappers * # reducers.

New shuffle: each mapper buffers reduce output in memory, spills, then sort-merges the on-disk data.
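The key idea is that each mapper produces one output sorted by target reducer, so each reducer reads a contiguous segment rather than the mapper keeping one open file per reducer. A minimal local sketch of that idea on plain collections (hash partitioning and the record shapes are illustrative assumptions, not Spark's internals):

```scala
// Assign a key to one of `numReducers` partitions by hash.
def partition(key: String, numReducers: Int): Int =
  math.abs(key.hashCode) % numReducers

// One mapper's shuffle output: all records tagged with their target
// partition and sorted by it, standing in for a single sorted file.
def mapperOutput(records: Seq[(String, Int)],
                 numReducers: Int): Seq[(Int, (String, Int))] =
  records.map(r => (partition(r._1, numReducers), r)).sortBy(_._1)

// A reducer pulls only its own contiguous segment from every mapper.
def reducerInput(outputs: Seq[Seq[(Int, (String, Int))]],
                 reducerId: Int): Seq[(String, Int)] =
  outputs.flatMap(_.collect { case (p, rec) if p == reducerId => rec })
```

With M mappers and R reducers this produces M sorted outputs instead of M * R files, which is the scalability win the slide describes.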
GroupBy Operator
Spark groupByKey != SQL groupBy

NO:
```scala
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }
```

YES:
```scala
people.map(p => (p.zipCode, p.getIncome))
  .reduceByKey(_ + _)
```
GroupBy Operator
Spark groupByKey != SQL groupBy

NO:
```scala
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }
```

YES:
```scala
people.groupBy('zipCode).select(sum('income))
```
GroupBy Operator
Spark groupByKey != SQL groupBy

NO:
```scala
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }
```

YES:
```sql
SELECT sum(income) FROM people GROUP BY zipCode;
```
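The difference between the two styles can be seen on a plain Scala collection (a hedged local analogy: RDDs behave similarly, but with a cluster-wide shuffle, which is why materializing whole groups is expensive):

```scala
// Toy schema matching the slides' example.
case class Person(zipCode: String, income: Double)

// groupByKey-style: materializes ALL incomes per zip before summing.
def sumByGroup(people: Seq[Person]): Map[String, Double] =
  people.groupBy(_.zipCode).map { case (zip, ps) => zip -> ps.map(_.income).sum }

// reduceByKey-style: folds incomes pairwise, never holding a full group.
def sumByReduce(people: Seq[Person]): Map[String, Double] =
  people.foldLeft(Map.empty[String, Double].withDefaultValue(0.0)) {
    (acc, p) => acc.updated(p.zipCode, acc(p.zipCode) + p.income)
  }
```

Both return the same sums; the point of the slides is that `reduceByKey` (like SQL's aggregating GROUP BY) lets Spark combine values on the map side instead of shuffling every record's full group across the network.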
Other efforts on the Spark stack:
- Pig on Spark
- Hive on Spark
- Ooyala Job Server
Looking Ahead to 1.2+
[Core]
- Scala 2.11 support
- Debugging tools (task progress, visualization)
- Netty-based communication layer

[SQL]
- Portability across Hive versions
- Performance optimizations (TPC-DS and Parquet)
- Planner integration with Cassandra and other sources
Looking Ahead to 1.2+
[Streaming]
- Python support
- Lower-level Kafka API with recoverability

[MLlib]
- Multi-model training
- Many new algorithms
- Faster internal linear solver
Q and A
Josh Rosen (PySpark and Spark Core)
Tathagata Das (Spark Streaming lead)