Evolving Your Analytics Stack With Your Business
Budapest Data Forum
Hello! I’m Keane
• Data Scientist at Snowplow.
• Work with a number of clients from various industries. Focus on business analytics.
• Help users get set up with Snowplow and build data models.
What is Snowplow?
• Open source event data pipeline.
• Enable users to track, process and act upon their data.
• Own your data.
Businesses are constantly evolving…
• Your products (apps & platforms) change.
• Your questions should change too.
• It’s critical that the analytics stack can evolve with your business.
How?
SELF-DESCRIBING DATA + EVENT DATA MODELING = EVOLVING EVENT DATA PIPELINE
SELF-DESCRIBING DATA
Part 1
No two companies are alike
Define your own events and entities
Events
• Article Load
• Issue Open
• Paywall Hit
Entities
• Article
• Content
• Advert
• Program
Events
• View Recipe
• Add To Basket
• Rate Recipe
Entities
• Recipe
• Customer
• Basket
• Nutrition
You then define a schema for each event and entity
{
  "description": "Schema for a nutrition context",
  "vendor": "com.gousto",
  "name": "nutrition",
  "version": "1-0-2",
  "properties": {
    "Recipe": {"type": "string"},
    "Description": {"type": "string"},
    "URL": {"type": "string"},
    "Calories": {"type": ["integer", "null"]},
    "Protein": {"type": "string"},
    "Fat": {"type": "string"}
  }
}
You then define a schema for each event and entity
{
  "schema": "iglu:com.gousto/nutrition/jsonschema/1-0-2",
  "data": {
    "Recipe": "Beef Goulash",
    "Description": "Hearty beef goulash recipe",
    "Calories": 3000,
    "Protein": "13g",
    "Fat": "8g",
    "Carbohydrates": "123.5g",
    "URL": "www.gousto.com/recipes/beefgoulash"
  }
}
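The "schema" value in the data packs a vendor, a name, a format and a version into one string. A minimal Python sketch of splitting such a key into its parts is shown here. The function name `parse_schema_key`, the `SchemaKey` type and the regular expression are illustrative assumptions, not Snowplow's own implementation.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str


# Hypothetical pattern for keys shaped like iglu:vendor/name/format/version.
_KEY_PATTERN = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/(?P<version>\d+-\d+-\d+)$"
)


def parse_schema_key(key: str) -> SchemaKey:
    """Split a schema reference into its four parts, or raise ValueError."""
    match = _KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a valid schema key: {key!r}")
    return SchemaKey(**match.groupdict())


key = parse_schema_key("iglu:com.gousto/nutrition/jsonschema/1-0-2")
print(key.vendor, key.name, key.version)
```

Because the key travels with every event, a downstream consumer can look up the right schema without any out-of-band knowledge of where the event came from.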
The schemas can then be used in a number of ways
• Validate the data (important for data quality)
• Load the data into tidy tables in your data warehouse
• Make it easy / safe to write downstream data processing applications (e.g. for real-time users)
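As a rough illustration of the validation step, the Python sketch below checks incoming data against the property types of the nutrition schema. It is a hand-rolled, simplified check written for this example (the helper names `matches_type` and `validate` are assumptions) and covers only the three types used in that schema; a real pipeline would use a full JSON Schema validator.

```python
from typing import Any

# Property types copied from the nutrition schema on the slide.
NUTRITION_PROPERTIES = {
    "Recipe": {"type": "string"},
    "Description": {"type": "string"},
    "URL": {"type": "string"},
    "Calories": {"type": ["integer", "null"]},
    "Protein": {"type": "string"},
    "Fat": {"type": "string"},
}


def matches_type(value: Any, type_name: str) -> bool:
    """Check one value against one JSON Schema primitive type name."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "null":
        return value is None
    raise ValueError(f"type not handled in this sketch: {type_name}")


def validate(data: dict, properties: dict) -> list:
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    for field, value in data.items():
        if field not in properties:
            continue  # fields not named in the schema are ignored here
        allowed = properties[field]["type"]
        allowed = [allowed] if isinstance(allowed, str) else allowed
        if not any(matches_type(value, t) for t in allowed):
            errors.append(f"{field}: expected {allowed}, got {value!r}")
    return errors


good = {"Recipe": "Beef Goulash", "Calories": 3000, "Protein": "13g"}
bad = {"Recipe": "Beef Goulash", "Calories": "3000"}  # string, not integer

print(validate(good, NUTRITION_PROPERTIES))
print(validate(bad, NUTRITION_PROPERTIES))
```

Data that fails the check can be set aside for inspection instead of being loaded, which is what keeps the warehouse tables tidy.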
Event Data Modeling
Part 2
What is event data modeling?
• Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is simpler for querying.
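A small Python sketch of what such an aggregation can look like, using the recipe events from Part 1. The rows, the `model_users` helper and the "engaged" rule are all made up for illustration; the point is only that business logic turns many event-level rows into one easy-to-query row per user.

```python
from collections import defaultdict

# Hypothetical event-level rows: (user_id, event_name, recipe)
events = [
    ("u1", "View Recipe", "Beef Goulash"),
    ("u1", "Add To Basket", "Beef Goulash"),
    ("u2", "View Recipe", "Beef Goulash"),
    ("u1", "View Recipe", "Pad Thai"),
    ("u2", "Rate Recipe", "Beef Goulash"),
]


def model_users(rows):
    """Aggregate event-level rows into one summary row per user."""
    users = defaultdict(lambda: {"views": 0, "basket_adds": 0, "recipes": set()})
    for user_id, event_name, recipe in rows:
        summary = users[user_id]
        summary["recipes"].add(recipe)
        if event_name == "View Recipe":
            summary["views"] += 1
        elif event_name == "Add To Basket":
            summary["basket_adds"] += 1
    # Business logic (an assumption for this sketch): a user who added
    # anything to the basket counts as "engaged".
    return {
        user_id: {
            "views": s["views"],
            "basket_adds": s["basket_adds"],
            "distinct_recipes": len(s["recipes"]),
            "engaged": s["basket_adds"] > 0,
        }
        for user_id, s in users.items()
    }


user_table = model_users(events)
for user_id, row in sorted(user_table.items()):
    print(user_id, row)
```

The "engaged" column is where the opinion lives: a different team might define it differently, and the raw events would still support either definition.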
Modeled vs. unmodeled data
UNMODELED DATA: IMMUTABLE. UNOPINIONATED. HARD TO CONSUME. NOT CONTENTIOUS
MODELED DATA: MUTABLE AND OPINIONATED. EASY TO CONSUME. MAY BE CONTENTIOUS
In general, event data modeling is performed on the full event stream
• Late-arriving events can change the way you understand earlier-arriving events
• If we change our data models, we can recompute historical data based on the new model
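Both points can be seen in a toy sessionization model, sketched in Python below. The `sessionize` function, the 30-minute default timeout and the timestamps are assumptions chosen for this example. The model is rerun over the whole stream each time, so a late event or a changed rule simply produces a new version of the derived data.

```python
def sessionize(timestamps, timeout_minutes=30):
    """Group event times (minutes since midnight) into sessions.

    A new session starts whenever the gap to the previous event
    exceeds the timeout. The whole list is processed on every call.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout_minutes:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical event times for one user, in minutes since midnight.
stream = [600, 610, 650, 655]
first_pass = sessionize(stream)
print(first_pass)  # the 40-minute gap after 610 splits the events in two

# A late-arriving event at minute 630 bridges that gap.
stream.append(630)
second_pass = sessionize(stream)
print(second_pass)

# Changing the model (a 15-minute timeout) and recomputing history.
third_pass = sessionize(stream, timeout_minutes=15)
print(third_pass)
```

An incremental model that only looked at new events would have kept the two sessions from the first pass and could not have applied the stricter timeout to past data.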
Evolving the data pipeline
Part 3
How do we handle pipeline evolution?
▸ Businesses change over time
▸ The events that occur are going to change
▸ Use of the data will change
▸ Insight -> more questions -> more insight -> more questions
▸ Two types of evolution: push and pull
BUSINESSES ARE NOT STATIC, SO EVENT PIPELINES SHOULD NOT BE EITHER
Push & Pull Factors
[Diagram: pipeline from event sources through Collection and Processing to data consumers]
Sources: Web, Apps, Servers, Comms channels, Push …, Smart car / home, …
Pipeline: Collection, Processing
Consumers: Data warehouse, Data exploration, Predictive modeling, Real-time dashboards, Real-time data-driven applications (RT Bidder, Voucher, Personalization, …)
PUSH FACTORS: What is being tracked will change over time.
PULL FACTORS: The questions asked of the data will change over time.
How do we handle pipeline evolution?
• If data is self-describing, it is easy to add additional sources
• Self-describing data is good for managing bad data and pipeline evolution
I AM AN ISSUE OPEN EVENT AND I HAVE INFORMATION ABOUT THE USER AND ISSUE.
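One way to picture this is a pipeline that looks up each event's schema in a registry and sets aside anything it cannot recognise. The Python sketch below is an assumption-laden illustration: the `com.example` schema keys, the field names, the `thermostat_change` event and the `route` helper are all hypothetical, and the check is reduced to "are the expected fields present".

```python
# Hypothetical registry: schema key -> set of field names the pipeline expects.
registry = {
    "iglu:com.example/issue_open/jsonschema/1-0-0": {"user_id", "issue_id"},
}


def route(events, registry):
    """Split events into (good, bad) using the schema each event names."""
    good, bad = [], []
    for event in events:
        expected = registry.get(event.get("schema"))
        if expected is not None and expected <= set(event.get("data", {})):
            good.append(event)
        else:
            bad.append(event)  # kept for inspection rather than dropped
    return good, bad


issue_open = {
    "schema": "iglu:com.example/issue_open/jsonschema/1-0-0",
    "data": {"user_id": "u1", "issue_id": "2017-05"},
}
smart_home = {
    "schema": "iglu:com.example/thermostat_change/jsonschema/1-0-0",
    "data": {"device_id": "d9", "target_celsius": 21},
}

before = route([issue_open, smart_home], registry)
print(len(before[0]), len(before[1]))

# Adding the new source is a registry change, not a code change.
registry["iglu:com.example/thermostat_change/jsonschema/1-0-0"] = {
    "device_id",
    "target_celsius",
}
after = route([issue_open, smart_home], registry)
print(len(after[0]), len(after[1]))
```

The routing code never mentions a specific event type, which is why a new source can be added without touching it.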
How do we handle pipeline evolution?
[Cycle diagram: QUESTION -> ANSWER -> INSIGHT -> QUESTION]
ANSWERING THE QUESTION: 3 POSSIBILITIES
1. Existing data model supports answer
2. Need to update data model
3. Need to update data model and data collection
• Update existing events and entities in a backward-compatible way, e.g. add optional new fields
• Update existing events and entities in a backward-incompatible way, e.g. change field types, remove fields, add compulsory fields
• Add new event and entity types
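The difference between the first two kinds of update can be checked mechanically. The Python sketch below compares two versions of a schema and labels the change using the rules in the bullets above. The simplified schema shape (a flat field-to-type mapping plus a "required" list), the `classify_change` helper and the "Fibre" field are assumptions made for this example.

```python
def classify_change(old, new):
    """Label a schema change as 'compatible' or 'breaking'.

    Each schema is a dict with 'properties' (field -> type) and
    'required' (list of compulsory field names).
    """
    old_props, new_props = old["properties"], new["properties"]
    reasons = []
    for field, old_type in old_props.items():
        if field not in new_props:
            reasons.append(f"removed field {field}")
        elif new_props[field] != old_type:
            reasons.append(f"changed type of {field}")
    # Any compulsory field that old data was not obliged to carry is breaking.
    for field in set(new.get("required", [])) - set(old.get("required", [])):
        reasons.append(f"added compulsory field {field}")
    return ("breaking" if reasons else "compatible"), sorted(reasons)


v1 = {
    "properties": {"Recipe": "string", "Calories": "integer"},
    "required": ["Recipe"],
}
# Adds an optional field only.
v2 = {
    "properties": {**v1["properties"], "Fibre": "string"},
    "required": ["Recipe"],
}
# Changes a type and makes the new field compulsory.
v3 = {
    "properties": {"Recipe": "string", "Calories": "string", "Fibre": "string"},
    "required": ["Recipe", "Fibre"],
}

print(classify_change(v1, v2))
print(classify_change(v1, v3))
```

A check like this could run before a new schema version is published, so that breaking changes are made deliberately and not by accident.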
• Add new columns to existing derived tables, e.g. add new audience segmentation
• Change the way existing derived tables are generated, e.g. change sessionization logic
• Create new derived tables
SELF-DESCRIBING DATA + RECOMPUTE DATA MODELS ON ENTIRE DATA SET
Self-describing data and the ability to recompute data models are essential to enable pipeline evolution
Questions?