Strata NYC 2015 - What's coming for the Spark community
![Page 1: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/1.jpg)
What’s New in the Spark Community
Patrick Wendell | @pwendell
![Page 2: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/2.jpg)
About Me
- Co-founder of Databricks
- Founding committer of Apache Spark at U.C. Berkeley
- Today, manage the Spark effort at Databricks
![Page 3: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/3.jpg)
About Databricks
- Team donated Spark to the ASF in 2013; primary maintainers of Spark today
- Hosted analytics stack based on Apache Spark
- Managed clusters, notebooks, collaboration, and third-party apps
![Page 4: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/4.jpg)
Today’s Talk
- Quick overview of Apache Spark
- Technical roadmap directions
- Community and ecosystem trends
![Page 5: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/5.jpg)
What is your familiarity with Spark?
1. Not very familiar with Spark; only at a very high level.
2. Understand the components and uses well, but I’ve never written code.
3. I’ve written Spark code for a POC or production use case.
![Page 6: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/6.jpg)
“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune
![Page 7: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/7.jpg)
Apache Spark Engine
[Figure: Spark Core at the base, with Streaming, SQL and DataFrame, MLlib, and GraphX on top]
Unified engine across diverse workloads & environments
- Scale out, fault tolerant
- Python, Java, Scala, and R APIs
- Standard libraries
![Page 8: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/8.jpg)
This Talk
“What’s new” in Spark? And what’s coming?
Two parts: technical roadmap and community developments
“The future is already here — it's just not very evenly distributed.” - William Gibson
![Page 9: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/9.jpg)
Technical Directions
![Page 10: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/10.jpg)
Spark Technical Directions
- Higher-level APIs: make developers more productive
- Performance of key execution primitives: shuffle, sorting, hashing, and state management
- Pluggability and extensibility: make it easy for other projects to integrate with Spark
![Page 12: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/12.jpg)
Higher-Level APIs
Making Spark accessible to data scientists, engineers, statisticians…
![Page 13: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/13.jpg)
Computing an Average: MapReduce vs Spark
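MapReduce (Java):

    // Mapper fields and method (inside a Mapper<LongWritable, Text, IntWritable, IntWritable>)
    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Text has no split(); convert to a String first
        String[] fields = value.toString().split("\t");
        output.set(Integer.parseInt(fields[1]));
        context.write(one, output);
    }

    // Reducer fields and method (inside a Reducer<IntWritable, IntWritable, IntWritable, DoubleWritable>)
    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }
        average.set(sum / (double) count);
        context.write(key, average);
    }

Spark (Python):

    # Split each line into tab-separated fields
    data = sc.textFile(...).map(lambda line: line.split("\t"))
    # Pair each value with a count of 1, sum both per key, then divide
    data.map(lambda x: (x[0], [float(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()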
![Page 14: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/14.jpg)
Computing an Average with Spark
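    # Split each line into tab-separated fields
    data = sc.textFile(...).map(lambda line: line.split("\t"))
    # Pair each value with a count of 1, sum both per key, then divide
    data.map(lambda x: (x[0], [float(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()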
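To make the (sum, count) pairing in the pipeline above concrete, here is a minimal plain-Python sketch of the same three steps on a tiny in-memory dataset. It does not use Spark; the `reduce_by_key` helper and the sample records are assumptions made up for illustration.

```python
from collections import OrderedDict


def reduce_by_key(pairs, func):
    """Merge values that share a key, mimicking what reduceByKey computes."""
    merged = OrderedDict()
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Hypothetical tab-separated records: a name, then a numeric field.
lines = ["alice\t30", "bob\t20", "alice\t40", "bob\t25", "carol\t50"]

rows = [line.split("\t") for line in lines]
# Step 1: pair each value with a count of 1.
pairs = [(r[0], [float(r[1]), 1]) for r in rows]
# Step 2: add sums and counts element-wise per key.
totals = reduce_by_key(pairs, lambda x, y: [x[0] + y[0], x[1] + y[1]])
# Step 3: divide each sum by its count.
averages = [(key, total / count) for key, (total, count) in totals]

print(averages)
```

Carrying the count alongside the sum is what lets the reduce step stay associative, so partial results can be combined in any order before the final division.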
![Page 15: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/15.jpg)
Computing an Average with DataFrames
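    from pyspark.sql.functions import avg

    # Grouping columns are kept in the result (Spark 1.4+), so only the aggregate is listed
    sqlCtx.table("people") \
        .groupBy("name") \
        .agg(avg("age")) \
        .collect()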
![Page 16: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/16.jpg)
Spark DataFrame API
- Explicit data model and schema
- Selecting columns and filtering
- Aggregation (count, sum, average, etc.)
- User-defined functions
- Joining different data sources
- Statistical functions and easy plotting
- Python, Scala, Java, and R

    from pyspark.sql.functions import avg

    # Grouping columns are kept in the result (Spark 1.4+), so only the aggregate is listed
    sqlCtx.table("people") \
        .groupBy("name") \
        .agg(avg("age")) \
        .collect()
![Page 17: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/17.jpg)
Ask more of your framework!

| Capability | MapReduce | Spark | Spark + DataFrames |
|---|---|---|---|
| Fault tolerance | yes | yes | yes |
| Data distribution | yes | yes | yes |
| Set operators | | yes | yes |
| Operator DAG | | yes | yes |
| Caching | | yes | yes |
| Schema management | | | yes |
| Relational semantics | | | yes |
| Logical plan optimization | | | yes |
| Storage pushdown and optimization | | | yes |
| Analytic operations | | | yes |
| … | | | yes |
![Page 18: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/18.jpg)
Other High-Level APIs

ML Pipelines
[Figure: ML pipeline over datasets ds0 to ds3 with stages tokenizer, hashingTF, lr, and lr.model]

SparkR

    > faithful <- read.df("faithful.json", "json")
    > head(filter(faithful, faithful$waiting < 50))
    ##   eruptions waiting
    ## 1     1.750      47
    ## 2     1.750      47
    ## 3     1.867      48
![Page 20: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/20.jpg)
Performance Initiatives
- Project Tungsten: improving runtime efficiency of key internals
- Everything else: IO optimizations, dynamic plan rewriting
![Page 21: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/21.jpg)
Project Tungsten: The CPU Squeeze
| | 2010 | 2015 | Change |
|---|---|---|---|
| Storage | 50+ MB/s (HDD) | 500+ MB/s (SSD) | 10X |
| Network | 1 Gbps | 10 Gbps | 10X |
| CPU | ~3 GHz | ~3 GHz | ☹ |
![Page 22: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/22.jpg)
Project Tungsten: Code generation for CPU efficiency
- Code generation on by default and using Janino [SPARK-7956]
- Beef up built-in UDF library (added ~100 UDFs with code gen)

AddMonths ArrayContains Ascii Base64 Bin BinaryMathExpression CheckOverflow CombineSets Contains CountSet Crc32 DateAdd
DateDiff DateFormatClass DateSub DayOfMonth DayOfYear Decode Encode EndsWith Explode Factorial FindInSet FormatNumber FromUTCTimestamp
FromUnixTime GetArrayItem GetJsonObject GetMapValue Hex InSet InitCap IsNaN IsNotNull IsNull LastDay Length Levenshtein
Like Lower MakeDecimal Md5 Month MonthsBetween NaNvl NextDay Not PromotePrecision Quarter RLike Round
Second Sha1 Sha2 ShiftLeft ShiftRight ShiftRightUnsigned SortArray SoundEx StartsWith StringInstr StringRepeat StringReverse StringSpace
StringSplit StringTrim StringTrimLeft StringTrimRight TimeAdd TimeSub ToDate ToUTCTimestamp TruncDate UnBase64 UnaryMathExpression Unhex UnixTimestamp
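As a rough illustration of why code generation helps, the following Python sketch evaluates the same expression two ways: by walking an expression tree for every row, and by generating source code once and compiling it into a single function. It is only an analogy for what Spark does with Janino and Java source; the tuple-based expression format and the `interpret` and `generate` helpers are assumptions invented for this sketch.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

# Expression tree for (age + 1) * 2, as nested tuples (an invented format).
EXPR = ("*", ("+", ("col", "age"), ("lit", 1)), ("lit", 2))


def interpret(expr, row):
    """Walk the tree for every row: one branch and one call per node."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def generate(expr):
    """Turn the tree into a Python source string, done once per query."""
    kind = expr[0]
    if kind == "col":
        return "row[%r]" % expr[1]
    if kind == "lit":
        return repr(expr[1])
    return "(%s %s %s)" % (generate(expr[1]), kind, generate(expr[2]))


# Compile the generated source into one function with no per-node dispatch.
source = "lambda row: " + generate(EXPR)
compiled = eval(compile(source, "<generated>", "eval"))

rows = [{"age": 30}, {"age": 41}]
interpreted_results = [interpret(EXPR, r) for r in rows]
compiled_results = [compiled(r) for r in rows]
print(source)
print(interpreted_results, compiled_results)
```

Both paths return the same values; the generated function simply avoids re-inspecting the tree for each row, which is the kind of per-record overhead code generation is meant to remove.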
![Page 23: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/23.jpg)
Project Tungsten
Binary processing for memory management (all data types):
- External sorting with managed memory
- External hashing with managed memory

[Figure: Managed Memory HashMap in Tungsten. An array of (hc, ptr) entries points to key/value records stored in a memory page.]
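The layout in the figure can be approximated in a few lines of plain Python: records live as raw bytes in one buffer (the "memory page"), and a separate array holds a hash code ("hc") and an offset ("ptr") per slot. This is a simplified toy under stated assumptions, not Spark's implementation: keys and values are fixed-width 64-bit integers, there is a single page, the capacity is fixed, and the `PageHashMap` name is hypothetical.

```python
import struct

RECORD = struct.Struct("<qq")  # one record: 8-byte key, 8-byte value
EMPTY = -1


class PageHashMap:
    """Toy long-to-long hash map that keeps its data in one byte buffer."""

    def __init__(self, capacity=16):
        self.capacity = capacity            # fixed; no resizing in this sketch
        self.page = bytearray(capacity * RECORD.size)  # the "memory page"
        self.used = 0                       # bytes of the page already filled
        self.hashes = [0] * capacity        # "hc": hash code per slot
        self.pointers = [EMPTY] * capacity  # "ptr": record offset per slot

    def _slot(self, key, hc):
        """Linear probing: slot holding key, first empty slot, or None if full."""
        slot = hc % self.capacity
        for _ in range(self.capacity):
            ptr = self.pointers[slot]
            if ptr == EMPTY:
                return slot
            # Compare cached hash codes first; only then read the page.
            if self.hashes[slot] == hc and RECORD.unpack_from(self.page, ptr)[0] == key:
                return slot
            slot = (slot + 1) % self.capacity
        return None

    def put(self, key, value):
        hc = hash(key)
        slot = self._slot(key, hc)
        if slot is None:
            raise MemoryError("page is full")
        if self.pointers[slot] == EMPTY:
            # New key: append a record at the end of the used region.
            self.hashes[slot] = hc
            self.pointers[slot] = self.used
            self.used += RECORD.size
        RECORD.pack_into(self.page, self.pointers[slot], key, value)

    def get(self, key):
        slot = self._slot(key, hash(key))
        if slot is None or self.pointers[slot] == EMPTY:
            return None
        return RECORD.unpack_from(self.page, self.pointers[slot])[1]


m = PageHashMap(capacity=4)
m.put(1, 10)
m.put(5, 50)   # same starting slot as key 1 (5 % 4 == 1), so it probes onward
m.put(1, 11)   # existing key: the value is overwritten in place
print(m.get(1), m.get(5), m.get(9), m.used)
```

Because the records are plain bytes rather than per-entry objects, the memory footprint is known exactly (`m.used` bytes), which is what makes spilling and explicit memory accounting practical.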
![Page 24: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/24.jpg)
Where are we going?
[Figure: language frontends (Python, Java/Scala, R, SQL, …) feed the DataFrame logical plan, which runs on the Tungsten backend targeting LLVM, JVM, GPU, NVRAM, …]
![Page 25: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/25.jpg)
[Figure: Python, SQL, R, Streaming, and Advanced Analytics sit on top of DataFrame, which runs on Tungsten Execution]
![Page 27: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/27.jpg)
Pluggability: Rich IO Support
    df = sqlContext.read \
        .format("json") \
        .option("samplingRatio", "0.1") \
        .load("/home/michael/data.json")

    df.write \
        .format("parquet") \
        .mode("append") \
        .partitionBy("year") \
        .saveAsTable("fasterData")
Unified interface to reading/writing data in a variety of formats
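One way to picture the pluggable design behind this interface is a registry that maps a format name to reader and writer functions, so new formats can be added without changing caller code. The plain-Python sketch below is an illustration under that assumption and not Spark's Data Source API; `register_format`, `read`, and `write` are hypothetical names, and it works on in-memory strings rather than files.

```python
import csv
import io
import json

FORMATS = {}


def register_format(name, reader, writer):
    """Plug in a new format under a short name."""
    FORMATS[name] = (reader, writer)


def read(fmt, text):
    """Parse text in the named format into a list of dict rows."""
    return FORMATS[fmt][0](text)


def write(fmt, rows):
    """Serialize a list of dict rows into text in the named format."""
    return FORMATS[fmt][1](rows)


def read_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def write_json_lines(rows):
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def read_csv(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def write_csv(rows):
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Built-in formats and third-party ones register the same way.
register_format("json", read_json_lines, write_json_lines)
register_format("csv", read_csv, write_csv)

rows = read("json", '{"name": "a", "year": "2014"}\n{"name": "b", "year": "2015"}')
csv_text = write("csv", rows)
print(csv_text)
```

The caller only ever names a format; everything format-specific stays behind the registry, which is the property that lets external packages add sources independently.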
![Page 28: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/28.jpg)
Large Number of IO Integrations
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
[Figure: logos of built-in and external data sources, including JSON and JDBC, and more…]
Find more sources at http://spark-packages.org/
![Page 29: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/29.jpg)
Deployment Integrations
![Page 30: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/30.jpg)
Technical Directions
Early on, the focus was: Can Spark be an engine that is faster and easier to use than Hadoop MapReduce?
Today the question is:
Can Spark & its ecosystem make big data as easy as little data?
![Page 31: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/31.jpg)
Community/User Growth
![Page 32: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/32.jpg)
Who is the “Spark Community”?
thousands of users
… hundreds of developers
… dozens of distributors
![Page 33: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/33.jpg)
Getting a better vantage point
Databricks survey - feedback from more than 1,400 users
![Page 34: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/34.jpg)
Community trends: Library & package ecosystem
- Strata NY 2014: widespread use of core RDD API
- Today: most use built-in and community libraries
51% of users use 3 or more libraries
![Page 35: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/35.jpg)
Spark Packages
- Strata NY 2014: didn’t exist
- Today: > 100 community packages
> ./bin/spark-shell --packages databricks/spark-avro:0.2
![Page 36: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/36.jpg)
Spark Packages
- API Extensions: Clojure API, Spark Kernel, Zeppelin Notebook, Indexed RDD
- Deployment Utilities: Google Compute, Microsoft Azure, Spark Jobserver
- Data Sources: Redshift, Avro, CSV, Elastic Search, MongoDB
![Page 37: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/37.jpg)
Increasing storage options
- Strata NY 2014: IO primarily through Hadoop InputFormat API
- January 2015: Spark adds native storage API
- Today: well over 20 natively integrated storage bindings

Cassandra, ElasticSearch, MongoDB, Avro, Parquet, ORC, HBase, Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…
![Page 38: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/38.jpg)
Deployment environments
Strata NY 2014: Traction in the Hadoop community
Today: Growth beyond Hadoop… increasingly public cloud
51% of respondents run Spark in public cloud
![Page 39: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/39.jpg)
Wrapping it up
Spark has grown and developed quickly in the last year!
Looking forward, expect:
- Engineering effort on higher-level APIs and performance
- A broader surrounding ecosystem
- The unexpected
![Page 40: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/40.jpg)
Where to learn more about Spark?
- SparkHub community portal
- Spark Summit conference - https://spark-summit.org/
- Massive online course (edX): Databricks Spark training
- Books
![Page 41: Strata NYC 2015 - What's coming for the Spark community](https://reader031.fdocuments.in/reader031/viewer/2022030304/5879846d1a28ab6c358b6197/html5/thumbnails/41.jpg)
Questions?