Building a data warehouse using Spark SQL

Budapest Data Forum 2018
Gabor Ratky, CTO


About me

Hands-on CTO at Secret Sauce to this day
Software engineer at heart
Made enough bad decisions to know that everything is a trade-off
Code quality and maintainability above all
Not writing code when I don't have to
Not building distributed systems when I don't have to
Not a data warehouse guy, but ❤ data

Simple is better than complex. Complex is better than complicated.

The Zen of Python, by Tim Peters

About Secret Sauce

SV startup in Budapest
B2B2C apparel e-commerce company
Data-driven products to help merchandising
Services built on top of the data we collect
Cloud-based infrastructure (AWS)
Small, effective teams
Strong engineering culture
  Code quality
  Code reviews
  Testability

Everybody needs data to do their jobs

Early days

[Diagram: Partner data, MongoDB, $ mongoimport]

Event analytics

[Diagram: MongoDB, Redshift, S3, PostgreSQL, Partner data, kafka]

Data warehousing

[Diagram: MongoDB, Databricks, S3, PostgreSQL, Partner data, kafka]

Why Databricks and Spark?

Storage and compute are separate
Managed clusters operated by Databricks
Fits into and runs as part of our existing infrastructure (AWS)
Right tool for the job
  Data engineers use pyspark
  Data analysts use SQL
  Data scientists use Python, R, SQL, H2O, Pandas, scikit-learn, dist-keras
Shared metastore (databases and tables)
Collaborative, interactive notebooks
Github integration and flow
Automated jobs and schedules
Programmatic API

Clusters

Workspace

Notebooks

Jobs

Analytics


Build vs buy

BUY

Cost

Cost (Redshift)

Persistent data warehouse
4x ds2.xlarge nodes (8TB, 16 vCPU, 124GB RAM)
On-demand price: $0.85/hr/node
1 month ~ 732 hours

$2,488
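
The Redshift figure follows directly from the numbers on the slide: node count times hourly price times hours in a month. A minimal sketch of that arithmetic, where the function and constant names are illustrative and not from the talk:

```python
# Inputs taken from the slide; the names are hypothetical.
NODES = 4                    # 4x ds2.xlarge
PRICE_PER_NODE_HOUR = 0.85   # on-demand price, USD per node per hour
HOURS_PER_MONTH = 732        # a persistent cluster runs around the clock


def redshift_monthly_cost(nodes=NODES,
                          price_per_node_hour=PRICE_PER_NODE_HOUR,
                          hours=HOURS_PER_MONTH):
    """Monthly cost in USD of an always-on cluster billed per node-hour."""
    return nodes * price_per_node_hour * hours


if __name__ == "__main__":
    print(f"Redshift: ${redshift_monthly_cost():,.2f} per month")
```

The result is about $2,488.80, which the slide shows as $2,488. Because the warehouse is persistent, every one of the 732 hours is billed whether or not anyone runs a query.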

Cost (Databricks)

Ephemeral, interactive, multi-tenant cluster
8TB storage (S3)
i3.xlarge driver node (4 vCPU, 30.5GB RAM)
4x i3.xlarge worker nodes (16 vCPU, 122GB RAM)
Compute: $0.712/hr
  $0.312/hr on-demand price
  4x $0.1/hr spot price
Databricks: $2/hr
  $0.4/hr/node
Storage: $188/mo + change
1 month ~ 22 workdays ~ 176 hours

$665
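
The Databricks total can be reproduced from the slide's line items. The sketch below assumes the $0.4/hr/node Databricks fee applies to all five nodes (driver plus four workers), which is what makes it add up to $2/hr, and that 176 hours means 22 workdays of 8 hours each. The names are illustrative, not from the talk:

```python
# Inputs taken from the slide; the names are hypothetical.
DRIVER_ON_DEMAND = 0.312        # USD/hr, one i3.xlarge driver (on-demand)
WORKER_SPOT = 0.10              # USD/hr, each i3.xlarge worker (spot)
WORKERS = 4
DATABRICKS_FEE_PER_NODE = 0.40  # USD/hr per node, assumed to cover driver and workers
STORAGE_PER_MONTH = 188.0       # USD/month for 8TB on S3 (the slide adds "+ change")
HOURS_PER_MONTH = 22 * 8        # 22 workdays of 8 hours = 176 hours


def databricks_monthly_cost(hours=HOURS_PER_MONTH):
    """Monthly cost in USD of an ephemeral cluster plus always-on S3 storage."""
    nodes = 1 + WORKERS
    compute = DRIVER_ON_DEMAND + WORKERS * WORKER_SPOT   # 0.712 USD/hr
    platform = nodes * DATABRICKS_FEE_PER_NODE           # 2.00 USD/hr
    return (compute + platform) * hours + STORAGE_PER_MONTH


if __name__ == "__main__":
    print(f"Databricks: ${databricks_monthly_cost():,.2f} per month")
```

This comes to roughly $665, matching the slide. Only storage is billed continuously; the cluster is paid for only during the hours it runs, so calling `databricks_monthly_cost(hours=732)` shows what the same cluster would cost if it were left on all month.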

Utilization (Redshift)

Utilization (Databricks)

~34 DBU/day, ~4.5 DBU/hr

~11.5 DBU/day

Our experience so far

Started using Databricks in January 2018
Quick adoption across the whole company
Fast turnaround on data requests
Easy collaboration between technical and non-technical people
Databricks allows us to focus on data engineering, not data infrastructure
Github integration not perfect, but fits into our workflow
Partitioning and schema evolution need a lot of attention
Databricks is an implementation detail, pick your poison
Everything is a trade-off, make the right ones

NIHS*

* not invented here syndrome

Thanks!

Questions?

[email protected]
@rgabo

https://bit.ly/sspinc-careers