Building and Operating a Big Data Service Based on Apache Spark · 2016-08-21
Databricks
Building and Operating a Big Data Service Based on Apache Spark
Ali Ghodsi <[email protected]>
Cloud Computing and Big Data
• Three major trends
  – Computers not getting any faster
  – More people connected to the Internet
  – More devices collecting data
• Computation moving to the cloud
The Dawn of Big Data
• Most companies collect lots of data
  – Cheap storage (hardware, software)
• Everyone is hoping to extract insights
  – Great examples (Netflix, Uber, eBay)
• Big Data is Hard!
Big Data is Hard
• Compute the average of 1,000 integers
• Compute the average of 10 terabytes of integers
Goal: Make Big Data Simple
The Challenges of Data Science
[Diagram: the data science workflow spans building a cluster, ETL, importing and exploring data with different tools, data warehousing, advanced analytics, dashboards & reports, and building and deploying data applications to production]
Databricks is an End-to-End Solution

A single tool for ingest, exploration, advanced analytics, production, and visualization.

[Diagram: automatically managed clusters for short time to value, spanning ETL (diverse data source connectors), data exploration (notebooks & visualization, built-in libraries), data warehousing (real-time query engine), dashboards & reports (3rd-party apps), advanced analytics, and production deployment (job scheduler)]
Databricks in a nutshell
Talk outline
• Apache Spark
  – ETL, interactive queries, streaming, machine learning
• Cluster and Cloud Management
  – Operating thousands of machines in the cloud
• Interactive Workspace
  – Notebook environment, collaboration, visualization, versioning, ACLs
• Lessons
  – Lessons in building a large-scale distributed system in the cloud
PART I: Apache Spark
What we added to Spark
Apache Spark
• Resilient Distributed Datasets (RDDs) as the core abstraction
  – A collection of objects
  – Like a LinkedList<MyObject>
• Spark RDDs are distributed
  – RDD collections are partitioned
  – RDD partitions can be cached
  – RDD partitions can be recomputed

[Diagram: one logical collection of 12 elements, split into partitions across machines]
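The partitioned-collection idea above can be modeled in a few lines of plain Python (no Spark required). This is a toy sketch, not Spark's actual API: the collection is split into partitions, each partition is described by a function that can produce it (its "lineage"), so a lost partition can be recomputed and a computed one can be cached.

```python
# Toy model of an RDD: partitions are zero-argument functions ("lineage"),
# results are cached per partition, and transformations compose lazily.
# All names here are illustrative, not Spark's API.

class ToyRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # list of callables, one per partition
        self.cache = {}               # partition index -> materialized data

    def map(self, f):
        # Build a new ToyRDD whose partitions apply f lazily on top of ours
        return ToyRDD([
            (lambda p=p: [f(x) for x in p()]) for p in self.partitions
        ])

    def compute(self, i):
        # Use the cached copy if present; otherwise recompute from lineage
        if i not in self.cache:
            self.cache[i] = self.partitions[i]()
        return self.cache[i]

    def collect(self):
        return [x for i in range(len(self.partitions)) for x in self.compute(i)]

data = list(range(1, 13))
base = ToyRDD([lambda: data[0:6], lambda: data[6:12]])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
```

The key property is that `doubled` stores no data of its own, only how to derive each partition from its parent, which is what makes recomputation after a failure possible.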
RDDs continued
• RDDs can be composed
  – All RDDs initially derived from a data source
  – RDDs can be created from other RDDs
  – Two basic operations: map & reduce
  – Many other operators: join, filter, union, etc.

[Diagram: map doubles each of the 12 elements, partition by partition]

```scala
val text   = sc.textFile("s3://my-bucket/wikipedia")
val words  = text.flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))
val result = pairs.reduceByKey((a, b) => a + b)
```
Spark Libraries on top of RDDs
• SQL (Spark SQL)
  – Full Hive SQL support with UDFs, UDAFs, etc.
  – How: internally keep RDDs of row objects (or RDDs of column segments)
• Machine Learning (MLlib)
  – Library of machine learning algorithms
  – How: cache an RDD, repeatedly iterate over it
• Streaming (Spark Streaming)
  – Streaming of real-time data
  – How: series of RDDs, each containing seconds of real-time data
• Graph Processing (GraphX)
  – Iterative computation on graphs (e.g. social networks)
  – How: RDD of Tuple<Vertex, Edge, Vertex>, performing self-joins

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX on top of Spark Core]
Unifying Libraries
• Early user feedback
  – Different use cases for R, Python, Scala, Java, SQL
  – How to intermix and go across these?
• Explosion of R data frames and Python Pandas
  – A DataFrame is a table
  – Many procedural operations
  – Ideal for dealing with semi-structured data
• Problem
  – Not declarative, hard to optimize
  – Eagerly executes command by command
  – Language-specific (R data frames, Pandas)
Common performance problem in Spark
```scala
val pairs   = words.map(word => (word, 1))
val grouped = pairs.groupByKey()
val counts  = grouped.map { case (key, values) => (key, values.sum) }
```
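The problem with the `groupByKey` pattern is that every `(word, 1)` pair crosses the network before any summing happens; `reduceByKey` combines values on the map side first, so at most one record per distinct key leaves each partition. This pure-Python sketch (no Spark, illustrative names only) models the difference in shuffled record counts:

```python
# Model how many records cross partition boundaries under groupByKey
# vs. reduceByKey-style map-side combining. Pure Python, not Spark code.

from collections import defaultdict

def shuffle_size_group_by_key(partitions):
    # groupByKey: every (key, value) pair is shuffled as-is
    return sum(len(p) for p in partitions)

def shuffle_size_reduce_by_key(partitions):
    # reduceByKey: pairs are pre-aggregated within each partition, so at
    # most one record per distinct key leaves that partition
    total = 0
    for p in partitions:
        combined = defaultdict(int)
        for key, value in p:
            combined[key] += value
        total += len(combined)
    return total

# Two partitions of (word, 1) pairs with heavy key repetition
parts = [[("the", 1)] * 1000 + [("spark", 1)] * 10,
         [("the", 1)] * 2000]
print(shuffle_size_group_by_key(parts))   # 3010 records shuffled
print(shuffle_size_reduce_by_key(parts))  # 3 records shuffled
```

In real Spark the fix for the slide's code is simply `pairs.reduceByKey(_ + _)`, as in the word-count example earlier in the deck.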
Spark Data Frames
• Procedural DataFrames vs. declarative SQL
  – Two different approaches
• Developed DataFrames for Spark
  – DataFrames sit above the SQL optimizer
  – DataFrame operations available in R, Python, Scala, Java
  – SQL operations return DataFrames

```python
users = context.sql("select * from users")  # SQL
young = users.filter(users.age < 21)        # Python
young.groupBy("gender").count()
```

```python
tokenizer = Tokenizer(inputCol="name", outputCol="words")      # ML
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(young)
```
Proliferation of Data Solutions
• Customers already run a slew of data management systems
  – The MySQL, Cassandra, S3, and HDFS categories
  – ETL all data over to Databricks?
• We added the Spark Data Source API
  – Open APIs for implementing your own data source
  – Examples: CSV, JDBC, Parquet/Avro, Elasticsearch, Redshift, Cassandra
• Features
  – Pushdown of predicates, aggregations, column pruning
  – Locality information
  – User-Defined Types (UDTs), e.g. vectors
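Predicate pushdown and column pruning are the heart of the Data Source API's performance story: the engine hands filters and required columns to the source, which applies them itself instead of shipping every row back. A toy Python model of the idea (this is not Spark's actual interface; all names are illustrative):

```python
# Toy data source demonstrating predicate pushdown and column pruning:
# the caller passes filters and required columns, and the source evaluates
# them locally, so only matching rows and needed columns are returned.

class ToyDataSource:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, standing in for remote storage

    def scan(self, required_columns, filters):
        # filters: (column, op, value) triples the source evaluates itself
        ops = {"=": lambda a, b: a == b, ">": lambda a, b: a > b}
        out = []
        for row in self.rows:
            if all(ops[op](row[col], val) for col, op, val in filters):
                out.append({c: row[c] for c in required_columns})  # pruning
        return out

source = ToyDataSource([
    {"name": "ann", "age": 34, "city": "SF"},
    {"name": "bob", "age": 19, "city": "NY"},
    {"name": "cal", "age": 42, "city": "SF"},
])
print(source.scan(["name"], [("age", ">", 30)]))
# [{'name': 'ann'}, {'name': 'cal'}]
```

For a source like Parquet or a JDBC database, this is the difference between scanning gigabytes remotely and transferring only the few columns and rows the query needs.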
```scala
class PointUDT extends UserDefinedType[Point] {
  def dataType = StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)))

  def serialize(p: Point) = Row(p.x, p.y)

  def deserialize(r: Row) = Point(r.getDouble(0), r.getDouble(1))
}
```
Modern Spark Architecture
[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX on top of Spark Core]
Modern Spark Architecture
[Diagram: data sources ({JSON}, databases, files, …) feed DataFrames, which sit above Spark SQL, Spark Streaming, MLlib, and GraphX on Spark Core]
Databricks as a Just-in-Time Data Warehouse
• Traditional data warehouse
  – Every night, ETL all relevant data to the warehouse
  – Precompute cubes of fact tables
  – Slow, costly, poor recency
• Spark JIT data warehouse
  – Switzerland of storage: NoSQL, SQL, cloud, …
  – Storage remains the source of truth
  – Spark used to directly read and cache data
PART II: Cluster Management
Spark as a Service in the Cloud
• Experience with Mesos, YARN, …
  – Use an off-the-shelf cluster manager?
• Problems
  – Existing cluster managers were not cloud-aware
Cloud-Aware Cluster Management
• Instance manager
  – Responsible for acquiring machines from the cloud provider
• Resource manager
  – Schedules and configures isolated containers on machine instances
• Spark cluster manager
  – Monitors and sets up Spark clusters

[Diagram: the Databricks Cluster Manager comprises an Instance Manager, a Resource Manager, and a Spark Cluster Manager]
Databricks Instance Manager
Instance manager’s job is to manage machine instances
• Pluggable cloud providers
  – A general interface that can be plugged in with AWS, …
  – Availability management (AZs, 1h), configuration management (VPCs)
• Fault handling
  – Terminated or slow instances, spot price hikes
  – Seamlessly replace machines
• Payment management
  – Bid for spot instances, monitor their price
  – Record cluster usage for the payment system
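The fault-handling bullet above amounts to a reconciliation loop: compare the healthy fleet against the desired size and launch replacements. A minimal sketch, under broad assumptions — the provider interface (`list_instances`, `launch_instance`) is hypothetical, not AWS's or Databricks' actual API:

```python
# Sketch of an instance manager's reconciliation pass: replace terminated
# or unhealthy instances so the fleet matches the desired size.
# Illustrative only; real providers, health checks, and spot-bid logic
# are far more involved.

def reconcile(provider, desired_count):
    """Launch replacements until the running fleet matches desired_count;
    returns the ids of instances launched this pass."""
    healthy = [i for i in provider.list_instances() if i["state"] == "running"]
    launched = []
    for _ in range(desired_count - len(healthy)):
        launched.append(provider.launch_instance())
    return launched

class FakeProvider:
    # Stand-in for a real cloud provider, for illustration only
    def __init__(self, states):
        self._instances = [{"id": n, "state": s} for n, s in enumerate(states)]
        self._next_id = len(states)

    def list_instances(self):
        return list(self._instances)

    def launch_instance(self):
        inst = {"id": self._next_id, "state": "running"}
        self._next_id += 1
        self._instances.append(inst)
        return inst["id"]

# Two of four instances were terminated (e.g. a spot price hike):
provider = FakeProvider(["running", "terminated", "running", "terminated"])
print(reconcile(provider, 4))  # launches 2 replacements -> [4, 5]
```

Running this pass periodically is what makes replacement "seamless": users see cluster capacity restored without intervening.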
Databricks Resource Manager
Resource manager’s job is to multiplex tenants on instances
• Isolates tenants using container technology
  – Manages multiple versions of Spark
  – Configures firewall rules, filters traffic
• Provides fast SSD/in-memory caching across containers
  – ramdisk for a fast in-memory cache, mmap to access it from the Spark JVM
  – Bind-mounted into containers for a shared in-memory cache
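The ramdisk-plus-mmap trick works because a file that lives in memory can be mapped by several processes, all backed by the same physical pages. A pure-Python sketch of the reader/writer halves (a temp directory stands in for a ramdisk like /dev/shm; paths and sizes are illustrative):

```python
# Sketch of a shared in-memory cache: one process writes a cache file on
# a ramdisk, another maps it read-only with mmap. The OS page cache backs
# both sides with the same memory, so the data is shared, not copied.

import mmap
import os
import tempfile

# "Cache writer": materialize a block of data into the shared file
path = os.path.join(tempfile.gettempdir(), "toy_shared_cache.bin")
with open(path, "wb") as f:
    f.write(b"cached partition bytes")

# "Cache reader" (e.g. the Spark JVM in another container, which sees the
# same file via a bind mount) maps it in without copying
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        data_read = bytes(m[:6])

print(data_read)  # b'cached'
os.remove(path)
```

In the Databricks setup the map happens from the JVM rather than Python, but the mechanism — a bind-mounted ramdisk file, memory-mapped read-only by each tenant — is the same.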
Databricks Spark Cluster Manager
Spark CM’s job is to setup Spark clusters and multiplex REPLs
• Setting up Spark clusters
  – Currently uses Spark standalone mode
  – Dynamic resizing of clusters based on load (work in progress)
• Multiplexing of multiple REPLs
  – Many interactive REPLs/notebooks on the same Spark cluster
  – ClassLoader isolation and library management
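On the JVM, per-REPL ClassLoaders keep one notebook's classes and libraries from leaking into another's. A loose Python analogue of that isolation, purely illustrative (not Databricks' code): give each REPL its own namespace dict to execute against.

```python
# Toy analogue of per-REPL isolation: each REPL executes code against its
# own namespace, so definitions in one notebook never shadow another's.
# The namespace dict plays the role a ClassLoader plays on the JVM.

class ToyRepl:
    def __init__(self):
        self.namespace = {}

    def run(self, code):
        # Execute the snippet in this REPL's private namespace; by
        # convention, the snippet stores its result in "_"
        exec(code, self.namespace)
        return self.namespace.get("_")

repl_a, repl_b = ToyRepl(), ToyRepl()
repl_a.run("x = 41")
print(repl_a.run("_ = x + 1"))      # 42: x is visible inside REPL A
print("x" in repl_b.namespace)      # False: REPL B never saw it
```

Both toy REPLs could still submit work to the same shared cluster; the isolation is only about names and libraries, not compute.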
PART III: Interactive Workspace
Collaborative Workspace
• Problem
  – Real-time collaboration on notebooks
  – Version control of notebooks
  – Access control on notebooks
Pub/sub-based TreeStore
• Web application server
  – Stores an in-memory representation of the Databricks workspace
• TreeStore is a directory service + a pub/sub service
  – In-memory tree structure representing directories, notebooks, commands, results
  – Browsers subscribe to subtrees and get notifications on updates
  – A special handler sends delta updates over WebSockets
• Usage
  – Subscribe to a notebook, see live edits of the notebook
  – Used to create a collaborative environment
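The TreeStore design above can be sketched in a few lines: an in-memory tree of paths where clients subscribe to a subtree and receive a delta for every change under it. This is an illustrative model, not Databricks' implementation; all names are hypothetical.

```python
# Minimal pub/sub tree store: set() updates a node and notifies every
# subscriber whose subtree contains that node with a delta update
# (in the real system, deltas travel to browsers over WebSockets).

from collections import defaultdict

class ToyTreeStore:
    def __init__(self):
        self.nodes = {}                       # path -> value
        self.subscribers = defaultdict(list)  # subtree path -> callbacks

    def subscribe(self, path, callback):
        self.subscribers[path].append(callback)

    def set(self, path, value):
        self.nodes[path] = value
        # Notify every subscriber whose subtree contains this path
        for prefix, callbacks in self.subscribers.items():
            if path == prefix or path.startswith(prefix + "/"):
                for cb in callbacks:
                    cb({"path": path, "value": value})  # the delta update

store = ToyTreeStore()
seen = []
store.subscribe("/workspace/notebookA", seen.append)
store.set("/workspace/notebookA/cmd1", "select * from users")
store.set("/workspace/notebookB/cmd1", "other")  # outside the subtree
print(seen)  # one delta, for notebookA only
```

Sending only the delta (the changed path and its new value) rather than the whole notebook is what keeps live collaborative editing cheap as the tree grows.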
PART IV: Lessons
Lessons
• Loose coupling is necessary but hard
  – Narrow, well-defined APIs; backwards compatibility; upgrades
• State management is very hard at scale
  – Legacy state: databases, configurations, machines, data formats, …
• Cloud software development is superior
  – Two-week sprints, two-week releases, SCRUM, …
• Testing is key for evolution and scale
  – Step-wise refinement for extension; testing pyramid 70/20/10
• Combine bottom-up with top-down approaches
  – Top-down for quick results, bottom-up for modularity/reuse