Apache Spark Briefing
![Page 1: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/1.jpg)
Apache Spark: The Emerging Platform for Distributed Analytics
July 2014
Thomas W. Dinsmore
![Page 2: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/2.jpg)
What is Apache Spark?
• Distributed in-memory analytics engine
• Runs in standalone clusters or Hadoop
• Fully compatible with Hadoop storage APIs
• Runs under YARN
• Top-level Apache project
• Supported in all major Hadoop distros
• Open source and vendor neutral
![Page 3: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/3.jpg)
Spark Timeline
2009 2010 2011 2012 2013 2014
• Project begins
• Open sourced
• Spark Summit 2013
• Apache Incubator
• Apache Top-Level
• Cloudera Support
• MapR Support
• Hortonworks Support
• SAP Support
News cascade starting late last year.
![Page 4: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/4.jpg)
What problems does Spark solve?
![Page 5: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/5.jpg)
Problem #1: MapReduce I/O sandbags runtime for advanced analytics.
Advanced analytics often requires multiple passes through data
MapReduce must persist results after each pass through data
[Diagram labels: Compute, Store, Hadoop Storage]
![Page 6: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/6.jpg)
Spark Vision: Distributed in-memory platform
Intermediate results stay in memory.
Up to 100X performance improvement for iterative algorithms.
[Diagram labels: Compute (repeated), Hadoop Storage]
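The contrast between Problem #1 and this vision can be sketched with a toy model in plain Python. This is not Spark or MapReduce code: `CountingStorage`, `step`, and both `run_*` functions are hypothetical names invented for illustration, and the "storage" is just a dictionary that counts how often it is touched.

```python
class CountingStorage:
    """Hypothetical stand-in for Hadoop storage that counts reads and writes."""

    def __init__(self, data):
        self._data = {"input": list(data)}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return list(self._data[key])

    def write(self, key, values):
        self.writes += 1
        self._data[key] = list(values)


def step(values):
    """One pass of a toy iterative algorithm: halve every value."""
    return [v / 2 for v in values]


def run_persisting_each_pass(storage, passes):
    """MapReduce-style: read from storage and write back on every pass."""
    key = "input"
    for i in range(passes):
        values = storage.read(key)
        key = f"pass-{i}"
        storage.write(key, step(values))
    return storage.read(key)


def run_in_memory(storage, passes):
    """Spark-style: read once, keep intermediate results in memory."""
    values = storage.read("input")
    for _ in range(passes):
        values = step(values)
    return values


mr = CountingStorage([8.0, 16.0])
mem = CountingStorage([8.0, 16.0])
result_mr = run_persisting_each_pass(mr, 3)
result_mem = run_in_memory(mem, 3)
print(mr.reads, mr.writes)    # storage operations with per-pass persistence
print(mem.reads, mem.writes)  # storage operations with in-memory passes
```

Both runs produce the same result, but the per-pass version touches storage on every iteration, so its I/O grows with the number of passes while the in-memory version reads the input once.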
![Page 7: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/7.jpg)
Problem #2: Many “point” solutions for advanced analytics in Hadoop
• Machine Learning
• Queries
• Graph Analytics
• Streaming Analytics
![Page 8: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/8.jpg)
Spark Vision: single integrated platform for advanced analytics in Hadoop.
• Simplified administration
• Integrated results
![Page 9: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/9.jpg)
How important is Spark?
![Page 10: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/10.jpg)
Mike Olson, Cloudera:
“The leading candidate for ‘successor to MapReduce’ today is Apache Spark.”
![Page 11: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/11.jpg)
M.C. Srivas, MapR:
“We believe Spark on Hadoop is a game changer for any business.”
![Page 12: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/12.jpg)
Ben Lorica, O’Reilly Media:
“The number of companies that are using Spark in production has exploded over the last year.”
![Page 13: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/13.jpg)
Apache Spark is the most active project in the Hadoop ecosystem.
[Chart: Commits, Past 12 Months; labeled value: 22%]
Source: Cloudera
![Page 14: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/14.jpg)
Spark’s Key Capabilities
![Page 15: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/15.jpg)
Spark 1.0 Machine Learning
• Linear Regression
• Logistic Regression
• Linear Support Vector Machine
• Regularization
• Decision Trees
• Naive Bayes
• Alternating Least Squares
• K-Means Plus-Plus
• Singular Value Decomposition
• Principal Components Analysis
• Stochastic Gradient Descent
• L-BFGS
The Spark project expects to double the number of supported techniques in 1.1 (August 2014).
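To make one item from this list concrete, here is a minimal single-machine sketch of stochastic gradient descent fitting a linear regression. It is plain Python, not the Spark machine learning API; the function name `sgd_linear_regression`, its parameters, and the synthetic data are assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(points, learning_rate=0.01, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    points = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(points)  # visit the examples in a random order each epoch
        for x, y in points:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Noise-free synthetic data drawn from y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)]
w, b = sgd_linear_regression(data)
print(round(w, 3), round(b, 3))
```

Every update needs only one example at a time, which is part of why this family of methods is a natural fit for iterative, multi-pass processing over large data sets.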
![Page 16: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/16.jpg)
Spark SQL
• Currently most active project
• Supports fast interactive queries
• Hive-compatible
• Works with Hive data
• Runs unmodified queries
• Roadmap to support more formats
• Will absorb Shark project
![Page 17: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/17.jpg)
Spark Streaming
• Supports analysis of data streams in real time
• Unifies streaming and batch data
• Integrates with popular data sources:
• HDFS
• Flume
• Kafka
• Easy to use
• Fault tolerant
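One way to picture "unifies streaming and batch data" is that the same batch logic is applied to a stream cut into small batches. The sketch below shows that idea in plain Python; it does not use the Spark Streaming API, and `count_words`, `micro_batches`, and `streaming_word_count` are hypothetical names.

```python
from collections import Counter


def count_words(lines):
    """Batch logic: count words in a collection of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    """Group an iterable of records into small consecutive batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def streaming_word_count(stream, batch_size=2):
    """Apply the batch logic to each micro-batch and keep a running total."""
    total = Counter()
    for batch in micro_batches(stream, batch_size):
        total += count_words(batch)
        yield Counter(total)  # snapshot of the running counts so far


lines = ["spark streaming", "spark sql", "spark"]
snapshots = list(streaming_word_count(iter(lines)))
print(snapshots[-1])
```

The final streaming snapshot equals `count_words(lines)` run as a single batch job, so the word-count logic is written once and reused in both modes.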
![Page 18: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/18.jpg)
Spark Graph Analytics
• Currently Alpha release
• Unifies graph-parallel and data-parallel computing under single API
• Performance parity with Giraph
• Replaces Spark Bagel (Pregel on Spark)
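The Pregel model mentioned here runs a graph algorithm as repeated supersteps in which vertices exchange messages along edges. The following is a rough single-machine sketch of that pattern, assuming a connected-components task; it is plain Python rather than the GraphX API, and the function name and sample graph are made up for illustration.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id reachable from it."""
    labels = {v: v for v in vertices}
    changed = True
    while changed:  # each loop iteration is one superstep
        # Data-parallel part: every edge emits a message in both directions.
        messages = []
        for src, dst in edges:
            messages.append((dst, labels[src]))
            messages.append((src, labels[dst]))
        # Reduce the messages per vertex, keeping the minimum label.
        inbox = {}
        for vertex, label in messages:
            inbox[vertex] = min(label, inbox.get(vertex, label))
        # Graph-parallel part: each vertex updates itself from its inbox.
        changed = False
        for vertex, label in inbox.items():
            if label < labels[vertex]:
                labels[vertex] = label
                changed = True
    return labels


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)
```

The message and reduce steps are ordinary operations over a collection of edges, which hints at how graph-parallel and data-parallel work can share one programming model.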
![Page 19: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/19.jpg)
Spark Performance
Machine Learning
• 100X faster than MapReduce
Queries (Shark)
• Comparable to Impala
• 100X faster than Hive
Streaming
• 2X throughput of Storm
Graph (GraphX)
• Comparable to Giraph
• 10X faster than MapReduce
![Page 20: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/20.jpg)
Spark Distributions
Every major Hadoop distribution, plus…
[Vendor logos not recoverable; surviving labels: Connector, Interface to HANA, Big Data Appliance]
![Page 21: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/21.jpg)
Programming Interfaces
Supported APIs
“Alpha” Release: “SparkR”
The Spark project expects to release a production-grade R interface in early 2015.
![Page 22: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/22.jpg)
Spark Users
![Page 23: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/23.jpg)
Certified on Spark
![Page 24: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/24.jpg)
Who is Databricks?
• Commercial venture, established 2013
• Founded by Spark principals
• Services and support business model
• Gatekeepers to Spark
• Just landed $33M in Series B
• Andreessen Horowitz
• New Enterprise Associates
• Just announced Spark Cloud product
![Page 25: Apache Spark Briefing](https://reader034.fdocuments.in/reader034/viewer/2022052303/54c661944a79595e038b458b/html5/thumbnails/25.jpg)
Thank You