Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve Analytics Journey at Celtra:
Snowflake, Spark and Databricks
Grega Kespret
Director of Engineering, Analytics @ Celtra
Matthew J. Glickman
Vice President of Product @ Snowflake
Story (Agenda)
• Where we started (aggregate fact tables)
• Why we needed a data warehouse
• Requirements and evaluations
• Snowflake adoption
• How Celtra handles schema evolution and data rewrites
• Snowflake architecture
• Next steps
Celtra AdCreator: Creative Platform for Brand Advertising
• 188,000+ Ads Built
• 15,000+ Campaigns
• 5,000+ Brands
• 2bn+ Analytics Events / Day
• 1TB+ New Data / Day
Key Pain Points
✗ Difficult to analyze the data collected
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
Pre-aggregations: Unable to Respond to the Speed and Complexity of the Business
[Diagram: Trackers → event data (Amazon S3) + operational data (MySQL) → ETL → OLAP cubes (MySQL) → client-facing dashboards, applications, adhoc queries]
Event Data
• Point-in-time facts about what happened
• Bread and butter of our analytics data
• JSON records
• Very sparse
• Complex relationships between events
• Sessionization: combine discrete events into sessions
Sessionization: Why We Do It
• Patterns more interesting than sums and counts of events
• Easier to troubleshoot/debug with context
• Able to check for/enforce causality (if X happened, Y must also have happened)
• De-duplication possible (no skewed rates because of outliers)
• Later events reveal information about earlier arriving events (e.g. session duration, attribution, identity, etc.)
[Diagram: events → sessionization → sessions]
Spark for Complex ETL on Events
• Complex ETL: de-duplicate, sessionize, clean, validate, emit facts
• Production hourly runs
• Get full expressive power of Scala for ETL
• Shuffle needed for sessionization
• Seamless integration with S3
✗ Difficult to Analyze the Data Collected
• Needed flexibility not provided by precomputed aggregates (unique counting, order statistics, outliers, etc.)
• Needed answers to questions that the existing data model did not support
• Wanted short development cycles and faster experimentation
• Visualizations
For example:
• Analyzing effects of placement position on engagement rates
• Troubleshooting 95th-percentile ad loading time performance
Key Pain Points
✓ Now able to analyze the data collected
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
Better with Databricks + SparkSQL
[Diagram: same pipeline as before, with Databricks + Spark SQL now used for adhoc queries over the event data on Amazon S3]
Key Pain Points
✓ Now able to analyze the data collected
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Complex ETL repeated in adhoc queries (slow, error-prone)
But a New Problem Emerged
[Diagram: same pipeline; the complex ETL logic is now duplicated between the production ETL and the adhoc Spark SQL queries]
Idea: Split ETL, Materialize Sessions
[Diagram: Trackers → event data (Amazon S3) + operational data (MySQL) → ETL Part 1 (complex: deduplication, sessionization, cleaning, validation, external dependencies) → materialized sessions stored in "???" → ETL Part 2 (simple: aggregating across different dimensions, in SQL) → OLAP cubes (MySQL) → dashboards, applications, adhoc queries]
Needed a Data Warehouse to Store Intermediate Results
Requirements
• Fully managed service
• Columnar storage format
• Support for complex nested structures
• Schema evolution possible
• Data rewrites possible
• Scale compute resources separately from storage
Nice-to-Haves
• Transactions
• Partitioning
• Skipping
• Access control
• Appropriate for OLAP use case
Why We Wanted a Managed Service
Operational tasks for a self-service installation:
• Replace failed node
• Refresh projection
• Restart database with one node down
• Remove dead node from DNS
• Ensure enough (at least 2x) disk space available for rewrites
• Back up data
• Archive data
We did not want to deal with these tasks.
3 Choices for How to Model Sessions
1. Denormalize everything
   ✓ Speed (aggregations without joins)
   ✗ Expensive storage
2. Normalize everything
   ✗ Speed (joins)
   ✓ Cheap storage
3. Nested objects: pre-group the data on each grain
   ✓ Speed (a "join" between parent and child is essentially free)
   ✓ Cheap storage
Complex Nested Structures
[Diagram: Sessions contain Unit views, Unit views contain Page views, Page views contain Interactions; each Session belongs to a Creative and a Campaign. Shown twice: as flat data in normalized relational tables, and as a single nested data structure per session.]
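A minimal sketch of what the nested model can look like as a Snowflake table; the column list is abbreviated (taken from the load example in the appendix), the types are assumptions, and the whole session is stored as one VARIANT:

CREATE TABLE sessions (
  utcDate     DATE,
  accountId   STRING,
  campaignId  STRING,
  utcHour     NUMBER,
  creativeId  STRING,
  placementId STRING,
  json        VARIANT NOT NULL   -- full nested session: unitViews -> pageViews -> interactions
);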
Flat vs. Nested Queries
Task: find the top 10 pages on creative units with the most interactions on average.
Flat (joins; requires a unique ID at every grain):
SELECT
  creativeId,
  uv.name unitName,
  pv.name pageName,
  AVG(COUNT(*)) avgInteractions
FROM
  sessions s
  JOIN unitViews uv ON uv.id = s.id
  JOIN pageViews pv ON pv.uvid = uv.id
  JOIN interactions i ON i.pvid = pv.id
GROUP BY 1, 2, 3 ORDER BY avgInteractions DESC LIMIT 10
Nested (the distributed join becomes a local join):
SELECT
creativeId,
unitViews.value:name unitName,
pageViews.value:name pageName,
AVG(ARRAY_SIZE(pageViews.value:interactions)) avgInteractions
FROM
sessions,
LATERAL FLATTEN(json:unitViews) unitViews,
LATERAL FLATTEN(unitViews.value:pageViews) pageViews
GROUP BY 1, 2, 3 ORDER BY avgInteractions DESC LIMIT 10
Final Contenders for the New Data Warehouse
Work with Data, Not Files
• Evaluated a Spark + HCatalog + Parquet + S3 solution
• Too-many-small-files problem => would need file stitching
• No consistency guarantees over a set of files on S3 => would need a secondary index or a convention
• Liked one layer vs. separate query layer (Spark), metadata layer (HCatalog), storage format layer (Parquet), and data layer (S3)
We really wanted a database-like abstraction with transactions, not a file format!
We Chose Snowflake as Our Managed Data Warehouse (DWaaS)
Pain Points
✓ Now able to analyze the data collected
✓ Data processed once & consumed many times; ETL'd data acts as a single source of truth
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
[Diagram: Trackers → event data (Amazon S3) + operational data (MySQL) → ETL → Sessions in Snowflake → ETL → OLAP cubes (MySQL) → dashboards, applications; adhoc SQL queries run against the sessions in Snowflake]
Snowflake Adoption
• Backfilling / recomputing sessions for the last 2 years (from January 2014): 28TB of data (compressed)
• "Soft deploy": a period of mirrored writes, soon switching completely in production
• Each developer/analyst has their own database
• Separate roles and warehouses for production, developers, and analysts (sketch below)
• Analysts & data scientists already using Snowflake through Databricks daily
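A minimal sketch of how such a role/warehouse separation could be set up in Snowflake; the warehouse, role, and database names below are made up for illustration:

-- Hypothetical warehouse and role names; sizes and timeouts are illustrative only.
CREATE WAREHOUSE IF NOT EXISTS etl_wh WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE IF NOT EXISTS analyst_wh WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

CREATE ROLE IF NOT EXISTS analyst;
GRANT USAGE ON WAREHOUSE analyst_wh TO ROLE analyst;
GRANT USAGE ON DATABASE prod_db TO ROLE analyst;                      -- prod_db is a placeholder
GRANT USAGE ON SCHEMA prod_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA prod_db.public TO ROLE analyst;  -- read-only access for analysts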
How Celtra Handles Data Structure Evolution
• Session schema: known, well defined (by the Session Scala model) and enforced
• Latest Session model: the authoritative source for the sessions schema
• Historical sessions conform to the latest Session model: any historical session can be de-serialized
• Readers should ignore fields not in the Session model: we do not guarantee to preserve this data
• Computing facts (metrics, dimensions) from the Session model is time-invariant: whether computed 2 months ago or today, the numbers must be the same
Schema Evolution
Change in Session model         | Top-level / scalar column                   | Nested / VARIANT column
Rename field                    | ALTER TABLE tbl RENAME COLUMN col1 TO col2; | data rewrite (!)
Remove field                    | ALTER TABLE tbl DROP COLUMN col;            | batch together in next rewrite
Add field, no historical values | ALTER TABLE tbl ADD COLUMN col type;        | no change necessary
Also considered views for VARIANT schema evolution:
• For complex scenarios have to use a Javascript UDF => lose the benefits of columnar access
• Not good for practical use
Data Rewrites
• They are sometimes necessary
• We have the ability to do data rewrites
• Rewrites of ~35TB (compressed) are not fun
• Complex and time-consuming, so we fully automate them
• Costly, so we batch multiple changes together
• A rewrite must maintain the sort order for fast access (note: UPDATE breaks it!); see the sketch below
• Javascript UDFs are our default approach for rewrites of data in VARIANT
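A minimal sketch of such a batched rewrite. The column list is abbreviated and the names are taken from the load example in the appendix; transform() stands for a Javascript UDF like the one on the next slide, and the staging-table + SWAP approach is an illustration, not necessarily Celtra's exact procedure:

CREATE TABLE sessions_rewritten AS
SELECT
  utcDate, accountId, campaignId, utcHour, creativeId, placementId,
  transform(json) AS json                    -- apply the batched changes to the VARIANT
FROM sessions
ORDER BY utcDate, accountId, campaignId,     -- re-establish the sort order
         utcHour, creativeId, placementId;   -- (an in-place UPDATE would not preserve it)

ALTER TABLE sessions_rewritten SWAP WITH sessions;
DROP TABLE sessions_rewritten;               -- after the swap this holds the old data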
Inline Rewrites with Javascript UDFs
• Expressive power of Javascript (vs. SQL)
• Run on the whole VARIANT record
• (Almost) constant performance
• More readable and understandable
• For changing a single field, OBJECT_INSERT / OBJECT_DELETE are preferred (example after the UDF below)
CREATE OR REPLACE FUNCTION transform("json" VARIANT)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS '
  // modify json
  return json;
';
SELECT transform(json) FROM sessions;
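For comparison, a single-field change can stay in plain SQL; this is a minimal sketch with made-up field names (the fourth argument TRUE tells OBJECT_INSERT to overwrite an existing key):

SELECT OBJECT_INSERT(json, 'schemaVersion', 2, TRUE) FROM sessions;   -- set/overwrite one field
SELECT OBJECT_DELETE(json, 'deprecatedField') FROM sessions;          -- drop one field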
Snowflake Spark Connector
• Implements Spark Data Sources API
• Access data in Snowflake through Spark SQL (via Databricks)
• Currently available in Beta, soon to be open-source
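A sketch of what that access could look like from Spark SQL. The data source name and option keys are assumptions based on the later open-source spark-snowflake connector, not necessarily the Beta API described here:

CREATE TEMPORARY TABLE sessions_sf
USING net.snowflake.spark.snowflake
OPTIONS (
  sfURL 'account.snowflakecomputing.com',
  sfUser '...', sfPassword '...',
  sfDatabase 'PROD', sfSchema 'PUBLIC', sfWarehouse 'ANALYST_WH',
  dbtable 'sessions'
);

SELECT COUNT(*) FROM sessions_sf WHERE utcDate = '2016-02-01';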
[Diagram: Spark ETL over event data (Amazon S3) and operational data (MySQL) produces sessions in Snowflake; adhoc SQL queries run against them through the connector]
Snowflake Data Warehouse as a Service
• Centralized storage: instant, automatic scalability & elasticity
• Single service: a scalable, resilient cloud services layer coordinates access & management
• Elastically scalable compute: multiple "virtual warehouse" compute clusters scale horsepower & concurrency
[Diagram: cloud services layer over multiple virtual warehouses over shared database storage; accessed via SQL, Python, etc.]
• Data Warehouse as a Service: no infrastructure, knobs or tuning
• Infinite & Independent Scalability: scale storage and compute layers independently
• One Place for All Data: native support for structured & semi-structured data
• Instant Cloning: isolate prod/dev (example below)
• Highly Available: 11 9’s durability, 4 9’s availability
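A minimal sketch of zero-copy cloning to isolate a dev environment from production; the database names are placeholders:

CREATE DATABASE dev_db CLONE prod_db;   -- zero-copy clone: no data is physically duplicated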
Snowflake’s Multi-cluster, Shared Data Service
[Diagram: multiple virtual warehouses (ETL & data loading, finance, dev/test/QA, dashboards, marketing, data science), each sized independently, all working against the same logical databases; dev/test/QA works on a clone]
Native Support for Structured + Semi-structured Data
• Any hierarchical, nested data type (e.g. JSON, Avro)
• Optimized VARIANT data type, no fixed schema or transformation required
• Full benefit of database optimizations: pruning, filtering, etc.
• Semi-structured data stored natively, queried using SQL
Structured data (example rows):
Apple   101.12  250  FIH-2316
Pear     56.22  202  IHO-6912
Orange   98.21  600  WHQ-6090
Semi-structured data (example record):
{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  ...
}
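A minimal sketch of querying such a record with Snowflake's SQL path syntax; the table name (people) and column name (v) are made up for illustration:

SELECT
  v:firstName::string    AS first_name,
  v:address.city::string AS city,
  v:height_cm::float     AS height_cm
FROM people
WHERE v:address.state = 'NY';   -- path access into the VARIANT, with casts where needed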
Next Stage: Snowflake Also for Aggregates
[Diagram: the current pipeline, with sessions in Snowflake and the OLAP cubes still in MySQL; the aggregates are the next piece to move into Snowflake]
End Goal
Pain Points
✓ Now able to analyze the data collected
✓ Data processed once & consumed many times; ETL'd data acts as a single source of truth
✓ Fast schema changes in cubes (e.g. adding / removing metric)
[Diagram: Trackers → event data (Amazon S3) + operational data (MySQL) → ETL → sessions & OLAP cubes, both in Snowflake → dashboards, applications, adhoc queries]
Thank You.
Grega Kespret
Director of Engineering, Analytics @ Celtra
@gregakespret
github.com/gregakespret
slideshare.net/gregak
linkedin.com/in/gregakespret
Matthew J. Glickman
Vice President of Product, Snowflake
@matthewglickman
linkedin.com/in/matthewglickman
Appendix
Snowflake Query Performance
• There are no indexes or projections
• Sort the data on ingest to maintain query performance (example query below)
• 3 "tiers":
  • Query cache
  • File cache
  • S3 storage
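A minimal sketch of a query that benefits from the ingest-time sort order; the filter values are made up, and the columns are the sort keys used in the load example below:

SELECT COUNT(*)
FROM sessions
WHERE utcDate = '2016-02-01'   -- filtering on the leading sort keys lets Snowflake
  AND accountId = '12345';     -- prune most micro-partitions instead of scanning everything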
Getting Data INTO Snowflake
• Save sessions to S3 in parallel (Spark cluster)
sessions.map(serializeJson).saveAsTextFile("s3a://...")
Getting Data INTO Snowflake
• Save sessions to S3 in parallel (Spark cluster)
• Copy from S3 to temporary table (Snowflake cluster)
CREATE TEMPORARY TABLE sessions_import (json VARIANT NOT NULL);

COPY INTO sessions_import
FROM 's3://...'
FILE_FORMAT = (FORMAT_NAME = 'session_gzip_json')
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
REGION = 'external-1';
Getting Data INTO Snowflake
• Save sessions to S3 in parallel (Spark cluster)
• Copy from S3 to temporary table (Snowflake cluster)
• Sort and insert into main table (Snowflake cluster)
INSERT INTO sessions
SELECT
  TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3)::date AS utcDate,
  json:accountId AS accountId,
  json:campaignId AS campaignId,
  HOUR(TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3)) AS utcHour,
  json:creativeId AS creativeId,
  json:placementId AS placementId,
  TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3),
  json
FROM sessions_import
ORDER BY
  utcDate ASC,
  accountId ASC,
  campaignId ASC,
  utcHour ASC,
  creativeId ASC,
  placementId ASC;
Getting Data OUT of Snowflake
• Copy to S3 (Snowflake cluster)
COPY INTO 's3://...'
FROM (SELECT json FROM sessions WHERE ...)
FILE_FORMAT = (FORMAT_NAME = 'session_gzip_json')
REGION = 'external-1'
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
Getting Data OUT of Snowflake
• Copy to S3 (Snowflake cluster)
• Read from S3 and apply schema (Spark cluster)
val sessions: RDD[Session] = sc.textFile(s"s3a://...").map(deserialize)
Combining Spark and Snowflake
[Diagram: parallel unload from Snowflake to AWS S3, then parallel consumption in Spark: InputFormat → RDD[Array[String]] → DataFrame]