Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
![Page 1: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/1.jpg)
Sparking up Data Engineering
Rohan Sharma
Spark Summit East - Feb 9, 2017
![Page 2: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/2.jpg)
Agenda
● Context: Netflix
● Netflix Data Ecosystem
● Spark Development @ Netflix
● Stranger Things with Spark
● Q&A
![Page 3: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/3.jpg)
#NetflixEverywhere
● 93+ Million Members
● 190+ Countries
● 125+ Million streaming hours / day
● 1000 hours of Original content in 2017
● ⅓ of US internet traffic during evenings
![Page 4: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/4.jpg)
Netflix Culture
● Freedom and Responsibility
● Context, not Control
● Highly aligned, loosely coupled
● The Paved road
![Page 5: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/5.jpg)
#NetflixData
![Page 6: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/6.jpg)
#NetflixData
Product Experience Streaming Experience Content
Marketing Business Operations Other Functions...
![Page 7: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/7.jpg)
Data Producers
● Member Devices
● CDN Servers
● Application Servers
● Device/Server Telemetry
● Application Data
● Vendors / Partner Data
![Page 8: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/8.jpg)
Data Processing
● Stream Processing - Shriya Arora
● Recommendation Systems - DB Tsai & Gary Yeh
● Batch Processing
● Experimentation Analytics
● Operational Analytics
![Page 9: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/9.jpg)
Data Platform (Batch Processing)
Layers: Storage, Compute, Service, Tools
Components shown: API, Big Data Portal, S3, Parquet, Transport, Visualization, Quality, Pig Workflow Vis, Job/Cluster Vis, Interface, Execution, Metadata, Notebooks, Tableau, MicroStrategy, JavaScript Applications
![Page 10: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/10.jpg)
Cloud Warehouse
● 60+ Petabytes
● Hadoop on S3 - separate compute from storage
● Multiple workloads on same cluster
● Tens of thousands of Spark, Pig, Hive, Presto jobs
● Production Cluster: 2600 d2.4xl nodes
● Query Cluster: 400 m4.16xl nodes
![Page 11: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/11.jpg)
Spark Development
● Focus on stakeholders - not process & operations
● Reuse boilerplate code & best practices
● Reuse build & deployment practices
● Make Spark easy-to-use & cruise
![Page 12: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/12.jpg)
Spark Development Template
● Code Reuse: Abstract Spark Project (Template), Netflix Data Utils (cross-platform business logic), Spark Utils (Spark boilerplate + interface implementations)
● Registry + Build Properties
● Project Instances . . .
● Build & Deploy Reuse: Jenkins, S3, Scheduler
● Diagram bands: FRAMEWORK, USER, FRAMEWORK
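The template idea on this slide can be illustrated with a minimal sketch. It is not the Netflix framework: the class names `AbstractJob` and `HoursByCountry`, the `properties` dictionary, and the in-memory rows are all hypothetical, and plain Python lists stand in for Spark DataFrames so the example runs anywhere. The point is only the split the slide describes, where the framework owns the boilerplate and a project instance supplies the business logic.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

Row = Dict[str, object]


class AbstractJob(ABC):
    """Hypothetical stand-in for an abstract project template."""

    def __init__(self, properties: Dict[str, str]) -> None:
        # Assumed stand-in for per-project build/registry properties.
        self.properties = properties

    def read(self) -> List[Row]:
        # Framework-owned boilerplate; a real job would read from the warehouse.
        return [
            {"country": "US", "hours": 3.0},
            {"country": "IN", "hours": 1.5},
            {"country": "US", "hours": 2.0},
        ]

    @abstractmethod
    def transform(self, rows: Iterable[Row]) -> List[Row]:
        """The only method a project instance has to implement."""

    def write(self, rows: List[Row]) -> List[Row]:
        # Framework-owned boilerplate; here it only sorts for a stable result.
        return sorted(rows, key=lambda r: str(r["country"]))

    def run(self) -> List[Row]:
        # Fixed pipeline: read -> user transform -> write.
        return self.write(self.transform(self.read()))


class HoursByCountry(AbstractJob):
    """A project instance: user code supplies only the business logic."""

    def transform(self, rows: Iterable[Row]) -> List[Row]:
        totals: Dict[str, float] = {}
        for row in rows:
            country = str(row["country"])
            totals[country] = totals.get(country, 0.0) + float(row["hours"])
        return [{"country": c, "hours": h} for c, h in totals.items()]


result = HoursByCountry({"env": "test"}).run()
print(result)
```

Because `run()` is defined once in the base class, every project instance gets the same read and write behavior, which is one plausible way to get the code reuse the slide calls for.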
![Page 13: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/13.jpg)
Development Lifecycle
● Develop & Test: Zeppelin, Spark Shell, IntelliJ
● Integration Test
● Commit, Build, Deploy & Run: Bitbucket, Jenkins, S3 / Artifactory
● Execute
![Page 14: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/14.jpg)
What's Next...
● Spark added to paved road in early 2016
● Improved query performance / cluster utilization
● Pyspark templates & data utils
![Page 15: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/15.jpg)
![Page 16: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/16.jpg)
Code Refactoring
● Download Source Code
● Semi-Compile to Abstract Syntax Trees
● Check for refactoring rules
● Codegen refactored lines and update source
● Invoke dockerized CI service to build
● Raise Pull Requests with inferred reviewers
More Details: Jonathan Schneider
LinkedIn: https://www.linkedin.com/in/jonkschneider/
YouTube: https://www.youtube.com/watch?v=JbcKFKiBU60
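The parse, check, and codegen steps above can be sketched with Python's standard `ast` module. This is an illustration of the general technique, not the tooling described in the talk: the rule table `RULES` and the function names `load_table_v1` and `load_table` are invented for the example. Note also that `ast.unparse` (Python 3.9+) regenerates the whole source and drops comments and formatting, whereas the slide describes rewriting only the refactored lines.

```python
import ast

# Hypothetical refactoring rule: calls to the old name become the new name.
RULES = {"load_table_v1": "load_table"}


class RenameCalls(ast.NodeTransformer):
    """Rewrites call sites whose function name matches a rule."""

    def __init__(self, rules):
        self.rules = rules
        self.changed = 0

    def visit_Call(self, node):
        # Visit nested calls first, then check this call against the rules.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in self.rules:
            new_name = ast.Name(id=self.rules[node.func.id], ctx=ast.Load())
            node.func = ast.copy_location(new_name, node.func)
            self.changed += 1
        return node


def refactor(source, rules=RULES):
    """Parse source to an AST, apply the rules, and generate new source."""
    tree = ast.parse(source)
    transformer = RenameCalls(rules)
    new_tree = ast.fix_missing_locations(transformer.visit(tree))
    return ast.unparse(new_tree), transformer.changed


source = "df = load_table_v1('plays')\nprint(df)\n"
new_source, changed = refactor(source)
print(new_source)
print("call sites rewritten:", changed)
```

A production tool would need a format-preserving rewriter so that diffs in the resulting pull request stay limited to the lines a rule actually touched.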
![Page 17: Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma](https://reader031.fdocuments.in/reader031/viewer/2022022200/58ac38b61a28ab145e8b5cb7/html5/thumbnails/17.jpg)
Questions
To learn more, please visit:
YouTube channel: NetflixData
Twitter handle: @NetflixData