Evolving Your Analytics Stack With Your Business

Budapest Data Forum

Hello! I’m Keane

• Data Scientist at Snowplow.

• Work with a number of clients from various industries. Focus on business analytics.

• Help users get set up with Snowplow and build data models.

What is Snowplow?

• Open source event data pipeline.

• Enable users to track, process and act upon their data.

• Own your data.

Businesses are constantly evolving…

• Your products (apps & platforms) change.

• Your questions should change too.

• It’s critical that the analytics stack can evolve with your business.

How?

SELF-DESCRIBING DATA + EVENT DATA MODELING

EVOLVING EVENT DATA PIPELINE

SELF-DESCRIBING DATA

Part 1

No two companies are alike

Define your own events and entities

Events

• Article Load

• Issue Open

• Paywall Hit

Entities

• Article

• Content

• Advert

• Program

Events

• View Recipe

• Add To Basket

• Rate Recipe

Entities

• Recipe

• Customer

• Basket

• Nutrition

You then define a schema for each event and entity

{
  "description": "Schema for a nutrition context",
  "vendor": "com.gousto",
  "name": "nutrition",
  "version": "1-0-2",
  "properties": {
    "Recipe": {"type": "string"},
    "Description": {"type": "string"},
    "URL": {"type": "string"},
    "Calories": {"type": ["integer", "null"]},
    "Protein": {"type": "string"},
    "Fat": {"type": "string"}
  }
}

You then define a schema for each event and entity

{
  "schema": "iglu:com.gousto/nutrition/jsonschema/1-0-2",
  "data": {
    "Recipe": "Beef Goulash",
    "Description": "Hearty beef goulash recipe",
    "Calories": 3000,
    "Protein": "13g",
    "Fat": "8g",
    "Carbohydrates": "123.5g",
    "URL": "www.gousto.com/recipes/beefgoulash"
  }
}
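The "schema" field is what makes the data self-describing: it names the vendor, schema name, format and version that the payload claims to follow. A minimal sketch of reading that reference is given here, assuming the `iglu:vendor/name/format/version` shape seen in the example. The `SchemaKey` type, the `parse_schema_uri` helper and the regular expression are illustrative names, not part of Snowplow.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str


# Assumed shape, taken from the example: iglu:vendor/name/format/version
_SCHEMA_URI = re.compile(r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+-\d+-\d+)$")


def parse_schema_uri(uri: str) -> SchemaKey:
    """Split a schema reference into its four parts, or raise ValueError."""
    match = _SCHEMA_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid schema reference: {uri!r}")
    return SchemaKey(*match.groups())


event = {
    "schema": "iglu:com.gousto/nutrition/jsonschema/1-0-2",
    "data": {"Recipe": "Beef Goulash", "Calories": 3000},
}
key = parse_schema_uri(event["schema"])
print(key.vendor, key.name, key.version)
```

With the reference parsed, a consumer can look up the matching schema before it touches the payload.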

The schemas can then be used in a number of ways

• Validate the data (important for data quality)

• Load the data into tidy tables in your data warehouse

• Make it easy / safe to write downstream data processing applications (e.g. for real-time users)
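To illustrate the validation step, here is a minimal hand-rolled sketch that checks a payload against the "properties" of a schema like the nutrition one. It supports only the three JSON Schema types that appear on the slide. The `validate` function is a hypothetical helper, and a real pipeline would more likely rely on a complete JSON Schema validator.

```python
from typing import Any

# JSON Schema type names mapped to Python types (only the subset used on the slide).
_TYPES = {"string": str, "integer": int, "null": type(None)}


def validate(data: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the data passed."""
    errors = []
    for field, rule in schema["properties"].items():
        if field not in data:
            continue  # the slide's schema declares no compulsory fields
        allowed = rule["type"] if isinstance(rule["type"], list) else [rule["type"]]
        value = data[field]
        ok = any(
            # bool is a subclass of int in Python, so exclude it explicitly
            isinstance(value, _TYPES[t]) and not (t == "integer" and isinstance(value, bool))
            for t in allowed
        )
        if not ok:
            errors.append(f"{field}: expected {allowed}, got {type(value).__name__}")
    return errors


nutrition_schema = {
    "properties": {
        "Recipe": {"type": "string"},
        "Calories": {"type": ["integer", "null"]},
        "Protein": {"type": "string"},
    }
}

good = {"Recipe": "Beef Goulash", "Calories": 3000, "Protein": "13g"}
bad = {"Recipe": "Beef Goulash", "Calories": "3000"}

print(validate(good, nutrition_schema))
print(validate(bad, nutrition_schema))
```

The second payload fails because "Calories" arrives as a string, which is the kind of data quality problem that validation is meant to catch before loading.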

Event Data Modeling

Part 2

What is event data modeling?

• Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is simpler for querying.
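As a small illustration of that definition, the sketch below aggregates event-level rows into one summary row per recipe. The rows, the field names and the `model_recipes` function are hypothetical and chosen only to echo the recipe events listed earlier.

```python
from collections import defaultdict

# Hypothetical event-level rows: (user_id, event_name, recipe)
events = [
    ("u1", "view_recipe", "Beef Goulash"),
    ("u1", "add_to_basket", "Beef Goulash"),
    ("u2", "view_recipe", "Beef Goulash"),
    ("u2", "view_recipe", "Veggie Curry"),
    ("u2", "rate_recipe", "Veggie Curry"),
]


def model_recipes(rows):
    """Aggregate event-level rows into one summary row per recipe."""
    table = defaultdict(lambda: {"views": 0, "basket_adds": 0, "viewers": set()})
    for user_id, event_name, recipe in rows:
        row = table[recipe]
        if event_name == "view_recipe":
            row["views"] += 1
            row["viewers"].add(user_id)
        elif event_name == "add_to_basket":
            row["basket_adds"] += 1
    # Business logic lives here: what counts as a view, who counts as a viewer.
    return {
        recipe: {
            "views": r["views"],
            "basket_adds": r["basket_adds"],
            "unique_viewers": len(r["viewers"]),
        }
        for recipe, r in table.items()
    }


recipe_table = model_recipes(events)
print(recipe_table["Beef Goulash"])
```

The modeled table is far easier to query than the raw rows, but it now encodes decisions that someone may disagree with.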

Modeled vs. unmodeled data

Unmodeled data: IMMUTABLE. UNOPINIONATED. HARD TO CONSUME. NOT CONTENTIOUS.

Modeled data: MUTABLE AND OPINIONATED. EASY TO CONSUME. MAY BE CONTENTIOUS.

In general, event data modeling is performed on the full event stream

• Late arriving events can change the way you understand earlier arriving events

• If we change our data models, this gives us the flexibility to recompute historical data based on the new model
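A short sketch of why late-arriving events matter, using sessionization as the example. The 30-minute inactivity timeout and the `sessionize` function are assumptions made for illustration, and timestamps are plain minutes to keep the arithmetic visible.

```python
def sessionize(timestamps, timeout=30):
    """Group event times (in minutes) into sessions split by gaps longer than `timeout`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # close enough to the previous event
        else:
            sessions.append([ts])  # gap too long: start a new session
    return sessions


on_time = [0, 10, 50, 55]
print(sessionize(on_time))  # [[0, 10], [50, 55]]

# An event stamped at minute 30 arrives late and bridges the gap.
with_late_event = on_time + [30]
print(sessionize(with_late_event))  # [[0, 10, 30, 50, 55]]
```

Because the model is recomputed over the full stream, the late event merges two sessions into one. An incremental model that had already closed the first session would need to revise it.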

Evolving the data pipeline

Part 3

How do we handle pipeline evolution?

▸ Businesses change over time

▸ The events that occur are going to change

▸ Use of the data will change

▸ Insight -> more questions -> more insight -> more questions

▸ Two types of evolution: push and pull

BUSINESSES ARE NOT STATIC, SO EVENT PIPELINES SHOULD NOT BE EITHER

Push & Pull Factors

[Diagram. Sources: Web, Apps, Servers, Comms channels, Push …, Smart car / home, …. Stages: Collection, Processing. Destinations: Data warehouse, Data exploration, Predictive modeling, Real-time dashboards, Real-time data-driven applications (RT Bidder, Voucher, Personalization, …).]

PUSH FACTORS: What is being tracked will change over time.

PULL FACTORS: The questions asked of the data will change over time.


• If data is self-describing, it is easy to add additional sources

• Self-describing data is good for managing bad data and pipeline evolution

I AM AN ISSUE OPEN EVENT AND I HAVE INFORMATION ABOUT THE USER AND ISSUE.
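One way to picture this: because each event announces its own schema, a consumer can route it to the right handler and set aside anything it does not recognise, rather than failing. The sketch below assumes a hypothetical schema reference under `com.example` and hypothetical `route` and `handle_issue_open` helpers. It is not Snowplow's actual bad-row handling.

```python
def handle_issue_open(data):
    """Hypothetical handler for an issue open event."""
    return f"user {data['user']} opened issue {data['issue']}"


# Hypothetical registry: schema reference -> handler
HANDLERS = {
    "iglu:com.example/issue_open/jsonschema/1-0-0": handle_issue_open,
}


def route(events, handlers):
    """Process events with a known schema; collect the rest as bad data."""
    good, bad = [], []
    for event in events:
        handler = handlers.get(event.get("schema"))
        if handler is None:
            bad.append(event)  # unknown or missing schema: keep for inspection
        else:
            good.append(handler(event["data"]))
    return good, bad


incoming = [
    {"schema": "iglu:com.example/issue_open/jsonschema/1-0-0",
     "data": {"user": "u1", "issue": "2017-05"}},
    {"schema": "iglu:com.example/unknown_event/jsonschema/1-0-0", "data": {}},
    {"data": {"user": "u2"}},
]

good, bad = route(incoming, HANDLERS)
print(good)
print(len(bad))
```

Adding a new source then amounts to registering one more schema and handler. Existing events are unaffected.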


INSIGHT -> QUESTION? -> ANSWER

ANSWERING THE QUESTION: 3 POSSIBILITIES

1. Existing data model supports answer

2. Need to update data model

3. Need to update data model and data collection

• Update existing events and entities in a backward-compatible way, e.g. add optional new fields

• Update existing events and entities in a backward-incompatible way, e.g. change field types, remove fields, add compulsory fields

• Add new event and entity types
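The distinction between compatible and incompatible changes can be checked mechanically. The sketch below applies the rules from the bullets to two schema versions. It assumes compulsory fields are listed under a JSON Schema style "required" key, which the slide's schema does not show, and `is_backward_compatible` is a hypothetical helper. It is deliberately conservative: any change to a field's type counts as breaking, even a widening one.

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """True if data valid under `old` is still expected to be valid under `new`."""
    old_props, new_props = old["properties"], new["properties"]
    # Removing a field or changing its type breaks existing data.
    for field, rule in old_props.items():
        if field not in new_props or new_props[field]["type"] != rule["type"]:
            return False
    # Any newly compulsory field breaks data that was valid before.
    newly_required = set(new.get("required", [])) - set(old.get("required", []))
    return not newly_required


v1 = {"properties": {"Recipe": {"type": "string"},
                     "Calories": {"type": ["integer", "null"]}}}

# Adds an optional field: compatible.
v2_optional = {"properties": {**v1["properties"],
                              "Carbohydrates": {"type": "string"}}}

# Adds a compulsory field: incompatible.
v2_required = {**v2_optional, "required": ["Carbohydrates"]}

# Changes a field type: incompatible.
v2_retyped = {"properties": {"Recipe": {"type": "string"},
                             "Calories": {"type": "string"}}}

print(is_backward_compatible(v1, v2_optional))
print(is_backward_compatible(v1, v2_required))
print(is_backward_compatible(v1, v2_retyped))
```

A check like this could run whenever a schema is edited, so that breaking changes are published deliberately under a new version rather than by accident.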

• Add new columns to existing derived tables, e.g. add new audience segmentation

• Change the way existing derived tables are generated, e.g. change sessionization logic

• Create new derived tables
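As a sketch of the first case, the function below rebuilds a derived per-user table from the full event history, so that a newly added segmentation column is populated for every user, historical ones included. The spend events, the 100 threshold and the `build_user_table` function are hypothetical.

```python
def build_user_table(events, segment_fn=None):
    """Recompute the derived per-user table from the full event history."""
    totals = {}
    for user_id, amount in events:
        totals[user_id] = totals.get(user_id, 0) + amount
    table = {}
    for user_id, total in totals.items():
        row = {"total_spend": total}
        if segment_fn is not None:
            row["segment"] = segment_fn(total)  # the new column
        table[user_id] = row
    return table


# Hypothetical purchase events: (user_id, amount)
history = [("u1", 20), ("u2", 150), ("u1", 45)]

# Original model: no segmentation column.
table_v1 = build_user_table(history)

# Updated model: same events, recomputed with an audience segment.
table_v2 = build_user_table(
    history, segment_fn=lambda total: "high" if total >= 100 else "standard"
)

print(table_v1)
print(table_v2)
```

Since the raw events are immutable and still available, changing the model is only a matter of rerunning it.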

SELF-DESCRIBING DATA

RECOMPUTE DATA MODELS ON ENTIRE DATA SET

Self-describing data and the ability to recompute data models are essential to enable pipeline evolution

Questions?
