Building a data warehouse using Spark SQL



Budapest Data Forum 2018

Gabor Ratky, CTO

About me
- Hands-on CTO at Secret Sauce to this day
- Software engineer at heart
- Made enough bad decisions to know that everything is a trade-off
- Code quality and maintainability above all
- Not writing code when I don't have to
- Not building distributed systems when I don't have to
- Not a data warehouse guy, but ❤ data

Simple is better than complex. Complex is better than complicated.

The Zen of Python, by Tim Peters

About Secret Sauce
- SV startup in Budapest
- B2B2C apparel e-commerce company
- Data-driven products to help merchandising
- Services built on top of the data we collect
- Cloud-based infrastructure (AWS)
- Small, effective teams
- Strong engineering culture

- Code quality
- Code reviews
- Testability

Everybody needs data to do their jobs

Early days

Partner data → MongoDB ($ mongoimport)

[Diagram: partner data from PostgreSQL and MongoDB, loaded into Redshift via S3]

Event analytics

Kafka

[Diagram: partner data from PostgreSQL and MongoDB, loaded into Databricks via S3]

Data warehousing

Kafka

Why Databricks and Spark?
- Storage and compute are separate
- Managed clusters operated by Databricks
- Fits into and runs as part of our existing infrastructure (AWS)
- Right tool for the job

- Data engineers use pyspark
- Data analysts use SQL
- Data scientists use Python, R, SQL, H2O, Pandas, scikit-learn, dist-keras

- Shared metastore (databases and tables)
- Collaborative, interactive notebooks
- GitHub integration and flow
- Automated jobs and schedules
- Programmatic API

Clusters

Workspace

Notebooks

Jobs

Analytics


Build vs buy

BUY

Cost

Cost (Redshift)
- Persistent data warehouse
- 4x ds2.xlarge nodes (8TB, 16 vCPU, 124GB RAM)
- On-demand price: $0.85/hr/node
- 1 month ~ 732 hours

$2,488

Cost (Databricks)
- Ephemeral, interactive, multi-tenant cluster
- 8TB storage (S3)
- i3.xlarge driver node (4 vCPU, 30.5GB RAM)
- 4x i3.xlarge worker nodes (16 vCPU, 122GB RAM)
- Compute: $0.712/hr

- $0.312/hr on-demand price (driver)
- 4x $0.1/hr spot price (workers)

- Databricks: $0.4/hr/node, $2/hr total

- Storage: $188/mo + change
- 1 month ~ 22 workdays ~ 176 hours

$665
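Both monthly totals follow from the per-hour figures on the previous slides; a quick sanity check in Python, using only the node counts and prices stated above:

```python
# Redshift: persistent cluster, billed around the clock.
redshift_nodes = 4
redshift_price_per_node_hr = 0.85     # ds2.xlarge on-demand
hours_per_month = 732
redshift_monthly = redshift_nodes * redshift_price_per_node_hr * hours_per_month
# 4 * 0.85 * 732 = 2488.8 -> the slide's $2,488

# Databricks: ephemeral cluster, only up during work hours.
driver_hr = 0.312                     # i3.xlarge on-demand
workers_hr = 4 * 0.10                 # 4x i3.xlarge spot
databricks_fee_hr = 5 * 0.40          # $0.4/hr/node across 5 nodes = $2/hr
storage_monthly = 188                 # 8TB on S3, plus change
usage_hours = 22 * 8                  # ~22 workdays ~ 176 hours

databricks_monthly = ((driver_hr + workers_hr + databricks_fee_hr) * usage_hours
                      + storage_monthly)
# 2.712 * 176 + 188 = 665.3 -> the slide's $665

print(f"Redshift:   ${redshift_monthly:,.0f}/mo")
print(f"Databricks: ${databricks_monthly:,.0f}/mo")
```

The gap comes almost entirely from utilization: Redshift bills for 732 hours a month whether anyone queries it or not, while the Databricks cluster only exists for the ~176 hours someone is actually using it.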

Utilization (Redshift)


Utilization (Databricks)

~34 DBU/day, ~4.5 DBU/hr

~11.5 DBU/day

Our experience so far
- Started using Databricks in January 2018
- Quick adoption across the whole company
- Fast turnaround on data requests
- Easy collaboration between technical and non-technical people
- Databricks allows us to focus on data engineering, not data infrastructure
- GitHub integration not perfect, but fits into our workflow
- Partitioning and schema evolution need a lot of attention
- Databricks is an implementation detail, pick your poison
- Everything is a trade-off, make the right ones

NIHS*
* not invented here syndrome

Thanks!
Questions?


gabor@secretsaucepartners.com
@rgabo

https://bit.ly/sspinc-careers