Building a Turbo-fast Data Warehousing Platform with Databricks
-
Upload
databricks -
Category
Data & Analytics
-
view
3.021 -
download
3
Transcript of Building a Turbo-fast Data Warehousing Platform with Databricks
![Page 1: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/1.jpg)
Building a Turbo-fast Data Warehousing Platform with Databricks
Parviz Deyhim
![Page 2: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/2.jpg)
Agenda
• Introduction to Databricks • Building a end-to-end Data warehouse platform
• Infrastructure • Data ingest • ETL • Performance optimizations • Process & Visualize
• Securing your platform • Conclusion
![Page 3: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/3.jpg)
3
Parviz Deyhim (Speaker)Parviz works with variety of different customers and helps them with adopting Apache Spark and architecting scalable data processing platform with Databricks Cloud. Previous to joining Databricks, Parviz worked at AWS as a big-data solutions architect.
Denny Lee (Moderator)Denny is a Technology Evangelist with Databricks. Previous to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
About the Speakers
![Page 4: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/4.jpg)
Introduction to Databricks
![Page 5: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/5.jpg)
We are Databricks, the company behind Spark
• Founded by the creators of Apache Spark
• Contributed ~75% of the Spark code in 2014
• Created Databricks cloud, a cloud-based big data platform
on top of Spark to make big data simple
![Page 6: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/6.jpg)
Typical big data project is far from ideal
Weeks to prepare then
explore data, and find
insights
Import and explore data
Months to build, weeks
to provision in existing
Get a cluster up and running
Months of re-engineering
to deploy as an application
Build and deploy data applications
For each new project, it takes months until results
![Page 7: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/7.jpg)
How Databricks powered by Spark helps our customers
No infrastructure management
Interactive workflow
Collaboration across the
organization
Experiment to production
instantly
100x faster than MapReduce
Spark SQL + ML + Streaming + Graph processing
Speed Flexibility Ease-of-use Unified
![Page 8: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/8.jpg)
Databricks helps you to harness the power of Spark
“Light switch” Spark clusters in the cloud
3rd Party Applications
Interactive workspace with notebooks
Production Pipeline Scheduler
![Page 9: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/9.jpg)
Databricks Internal Data Warehouse Use Case
![Page 10: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/10.jpg)
10
Databricks Internal DWH Use Case
Today: Collect logs from deployed customer clusters
Our Goal: ○ Understand customers behavior ○ Create reports for various teams (e.g. customer success &
support)
![Page 11: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/11.jpg)
StagesBuild & Maintain Infrastructure
Data Ingest Process & Visualize
Transform & Store
![Page 12: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/12.jpg)
12
Stages
Build & Maintain Infrastructure
Data Ingest Process & Visualize
Transform & Store
![Page 13: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/13.jpg)
13
Challenges of Building a Data Warehouse
Datacenter or Cloud? • Build/rent data center or use a public cloud offering?
Picking the right resources • If datacenter: what server sizes and types? Storage? • In cloud: what instance size, how large of a disk/SSD to use?
Deployment and Automation • How to automate the deployment process:
• Chef, Puppet, Cloudformation and etc
![Page 14: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/14.jpg)
14
Maintenance
• How to perform seamless upgrades?
Securing the platform • How to encrypt datasets?
• Controls, Policies, Audits
Challenges of Building a Data Warehouse
![Page 15: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/15.jpg)
15
Databricks Hosted Platform
Managed and automated hosted platform • Fully deployed on AWS
• Create resources with a single click
• Zero touch maintenance
![Page 16: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/16.jpg)
16
Compute Resources
Automatic Instance Provisioning • R3.2xlarge instances • Use SSD for caching • No EBS • Deployed in major regions and more coming
![Page 17: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/17.jpg)
17
Networking: VPC
Security & Isolation with AWS VPC
![Page 18: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/18.jpg)
18
Networking: Enhanced NetworkingHigh performance node to node
connectivity with placement groups
![Page 19: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/19.jpg)
19
Integration with AWS services
S3
Kinesis
RDS
Redshift
...
![Page 20: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/20.jpg)
20
Databricks Demo
![Page 21: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/21.jpg)
21
Stages
Build & Maintain Infrastructure
Data Ingest Process & Visualize
Transform & Store
![Page 22: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/22.jpg)
22
Customer Data Sources
Customer have variety of different data sources Cloud storage: S3
Databases: MySQL, NoSQL
APIs: Facebook, SalesForce and etc
Often required to join datasets
![Page 23: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/23.jpg)
23
Traditional Approach
Traditionally data warehouses require data to be copied
Common Question: How do I move my datasets to Databricks?
![Page 24: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/24.jpg)
24
Traditional Approach
Required to create a schema before data is copied
![Page 25: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/25.jpg)
25
Traditional Approach: Challenges
Moving Data:
• Very expensive and time consuming
• Creates inconsistency as data gets updated
Predefined Schema:
• Challenging to change schema for different use-case
![Page 26: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/26.jpg)
26
Databricks Approach: Data Sources
De-coupling compute from storage ● Leverage S3. No HDFS
Read directly from data sources ● Eliminate the need to copy data
Schema definition on read ● SparkSQL
![Page 27: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/27.jpg)
27
Spark Data Sources Support
![Page 28: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/28.jpg)
28
Databricks Use CaseDifferent data sources
• Customer metrics on S3 • Internal CRM
Need a single view of our customers
![Page 29: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/29.jpg)
29
Databricks Use Case
We use Spark to join datasets
![Page 30: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/30.jpg)
30
Databricks Demo1. Reading data from external API
2. Reading usage logs data from S3
3. Joining usage and external datasets
Link
![Page 31: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/31.jpg)
31
Stages
Build & Maintain Infrastructure
Data Ingest Process & Visualize
Transform & Store
![Page 32: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/32.jpg)
32
Data TransformationNeed to transform data before the join operation
• Aggregation • Consolidation • Data cleansing
![Page 34: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/34.jpg)
34
ETL: Common Approaches
Two common approaches • Offline • Streaming/real-time
![Page 35: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/35.jpg)
35
Extract & TransformationOffline ET
• Data gets stored in raw format (as is) • Some recurring job perform ET on the dataset • New transformed dataset gets stored for later processing
Advantage • Easy and quick to setup
Disadvantages • Traditionally slow process
![Page 36: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/36.jpg)
36
Databricks JobsDatabricks Jobs
• Schedule Production workflows using Notebooks or Jars • Create pipelines • Monitor results
![Page 37: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/37.jpg)
37
Databricks DemoJobs
![Page 38: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/38.jpg)
Performance Optimizations
![Page 39: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/39.jpg)
39
Performance Optimizations
Storing data in parquet
Partitioning dataset
Spark caching • JVM
• SSD
![Page 40: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/40.jpg)
40
Spark allows data to be stored in different data sources
![Page 41: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/41.jpg)
41
Parquet: Efficient columnar storage format for data warehousing use-cases
![Page 42: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/42.jpg)
42
Optimization: ParquetColumnar
• Faster Scans
Better compression
Optimized for storage • Memory
• Disk
![Page 43: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/43.jpg)
Advantages • Fast memory access
Disadvantages • GC pressure • No durability after JVM crash
43
Optimization: Caching (JVM)Spark caching: JVM
![Page 44: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/44.jpg)
44
Optimization: Caching (SSD)Spark caching: SSD
Advantages • Survives JVM and instance crash
Disadvantages • Much slower than JVM caching
![Page 45: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/45.jpg)
45
Databricks Use Case: Storing aggregate data in Parquet
![Page 46: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/46.jpg)
46
Databricks Use Case: Storing aggregate data in Parquet
![Page 48: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/48.jpg)
48
Stages
Build & Maintain Infrastructure
Data Ingest Process & Visualize
Transform & Store
![Page 49: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/49.jpg)
49
Databricks Visualizations
Notebook Visualizations 1. Built-in graphing capabilities 2. ggplot and matplotlib 3. D3 visualizations
![Page 50: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/50.jpg)
50
Databricks VisualizationsNotebook Visualizations (DEMO)
D3/SVG
![Page 51: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/51.jpg)
51
3rd Party VisualizationsZoomdata
![Page 52: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/52.jpg)
52
Securing Your Platform
![Page 53: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/53.jpg)
53
Secure Platform
Encryption
1. In flight: SSL 2. At rest: S3 Encryption
![Page 54: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/54.jpg)
54
Secure Platform
User Management: ACLs
Notebooks read-write-execute
Admin users
![Page 55: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/55.jpg)
55
Secure Platform
On Our Roadmap
S3 KMS encryption
Single Sign On (SSO) AD/LDAP support
![Page 56: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/56.jpg)
56
Secure Platform
On Our Roadmap IAM Roles for Spark nodes
![Page 57: Building a Turbo-fast Data Warehousing Platform with Databricks](https://reader031.fdocuments.in/reader031/viewer/2022022412/58f9a96d760da3da068b6e7c/html5/thumbnails/57.jpg)
Thank you