Building Analytics Infrastructure for Growing Tech Companies

Holistics.io. Huy Nguyen, CTO, Cofounder - Holistics.io. Building Analytics Infrastructure for Growing Tech Companies. Data Night Singapore, Aug 2016. (*) aka Data Pipeline

Transcript of Building Analytics Infrastructure for Growing Tech Companies

Page 1: Building Analytics Infrastructure for Growing Tech Companies

Holistics.io

Huy Nguyen, CTO, Cofounder - Holistics.io

Building Analytics Infrastructure for Growing Tech Companies

Data Night Singapore, Aug 2016

(*) aka Data Pipeline

Page 2: Building Analytics Infrastructure for Growing Tech Companies

About Me

● Cofounder of Holistics.io
  ○ Data Reporting (BI) and Infrastructure SaaS
● Previous
  ○ Built Data Pipeline at Viki (Singapore)
  ○ Growth Team at Facebook (US)

Page 3: Building Analytics Infrastructure for Growing Tech Companies

Agenda

● The Data Problem
● Typical Data Pipeline (Startup)
● Choosing An Analytics DB
● Choosing A BI Tool

Page 4: Building Analytics Infrastructure for Growing Tech Companies

Background: What is Analytics/DW?

Page 5: Building Analytics Infrastructure for Growing Tech Companies

A Typical Web Application

Data-related Business Questions:
• Daily/weekly registered users by different platforms, countries?
• How many video uploads do we have every day?

[Diagram: Android, iOS, Web → APIs → Live Databases (Production DBs)]

Page 6: Building Analytics Infrastructure for Growing Tech Companies

A Typical Web Application

Data-related Business Questions:
• Daily/weekly registered users by different platforms, countries?
• How many video uploads do we have every day?

[Diagram: Android, iOS, Web → APIs → Live Databases (Production DBs) → Daily Snapshot → Analytics Database → Reporting / BI]
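
A minimal sketch of how the first business question could be answered once registrations sit in the analytics database. The `users` table and its `registered_at`, `platform`, and `country` columns are hypothetical, and Python's built-in sqlite3 stands in for the analytics database so the example runs anywhere. In Postgres the date extraction would more likely be written `registered_at::date`.

```python
import sqlite3

# In-memory SQLite is a stand-in for the analytics database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, registered_at TEXT, platform TEXT, country TEXT)"
)
# Hypothetical sample of a daily snapshot from the production database.
conn.executemany(
    "INSERT INTO users (registered_at, platform, country) VALUES (?, ?, ?)",
    [
        ("2016-08-01 09:15:00", "android", "SG"),
        ("2016-08-01 10:30:00", "android", "SG"),
        ("2016-08-01 11:00:00", "ios", "VN"),
        ("2016-08-02 08:45:00", "web", "SG"),
    ],
)

# Daily registered users broken down by platform and country.
DAILY_REGISTRATIONS = """
SELECT date(registered_at) AS day, platform, country, COUNT(*) AS registrations
FROM users
GROUP BY day, platform, country
ORDER BY day, platform, country
"""

daily_registrations = conn.execute(DAILY_REGISTRATIONS).fetchall()
for row in daily_registrations:
    print(row)
```

Running this against a snapshot rather than the live database keeps long reporting queries away from user-facing traffic, which is the point of the diagram above.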

Page 7: Building Analytics Infrastructure for Growing Tech Companies

A Typical Web Application

Data-related Business Questions:
• How did my marketing campaigns affect registrations?

[Diagram: Android, iOS, Web → APIs → Live Databases (Production DBs) → Daily Snapshot → Analytics Database → Reporting / BI]

Page 8: Building Analytics Infrastructure for Growing Tech Companies

A Typical Web Application

Data-related Business Questions:
• How did my marketing campaigns affect registrations?

[Diagram: Android, iOS, Web → APIs → Live Databases (Production DBs) → Daily Snapshot → Analytics Database → Reporting / BI, with GA, FB Ads, Adwords... added as further data sources]
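
A rough sketch of why the third-party data has to land next to the registration data: the campaign question is a join across the two. The `registrations` and `ad_spend` tables, their columns, and the campaign names are all hypothetical, and sqlite3 stands in for the analytics database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- From the production database snapshot (hypothetical schema).
CREATE TABLE registrations (user_id INTEGER, campaign TEXT);
-- From an ad-platform API import (hypothetical schema).
CREATE TABLE ad_spend (campaign TEXT, cost_usd REAL);

INSERT INTO registrations VALUES
  (1, 'fb_aug'), (2, 'fb_aug'), (3, 'adwords_aug'), (4, NULL);
INSERT INTO ad_spend VALUES
  ('fb_aug', 100.0), ('adwords_aug', 80.0), ('ga_promo', 20.0);
""")

# LEFT JOIN keeps campaigns that cost money but produced no registrations.
CAMPAIGN_REPORT = """
SELECT s.campaign, s.cost_usd, COUNT(r.user_id) AS registrations
FROM ad_spend AS s
LEFT JOIN registrations AS r ON r.campaign = s.campaign
GROUP BY s.campaign, s.cost_usd
ORDER BY s.campaign
"""

campaign_rows = conn.execute(CAMPAIGN_REPORT).fetchall()

# Cost per registration; None when a campaign produced no registrations.
cost_per_registration = {
    campaign: (cost / regs if regs else None)
    for campaign, cost, regs in campaign_rows
}
print(cost_per_registration)
```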

Page 9: Building Analytics Infrastructure for Growing Tech Companies
Page 10: Building Analytics Infrastructure for Growing Tech Companies

A Typical Data Pipeline

Page 11: Building Analytics Infrastructure for Growing Tech Companies

[Diagram: a typical data pipeline in three columns]

Operational Data:
● Live Databases (Production DBs)
● Event Logs (behavioural data)
● CSVs / Excels / Google Sheets
● 3rd-party Tracking: GA, FB Ads, Adwords...

Data Warehouse:
● Analytics Database, fed by Daily Snapshot, Load, Pre-aggregate, and API Import
● Update / Transform / Aggregate

Reporting / Analysis:
● Reporting / BI
● Data Analysis
● Data Science / ML

Page 12: Building Analytics Infrastructure for Growing Tech Companies

[Diagram: inside the Data Warehouse (Analytics Database)]

Stages: (1) Import → (2) Process → (3) Present

● Sources: Live Databases (Production DBs), Event Logs (behavioural data), CSVs / Excels / Google Sheets, 3rd-party Tracking (GA, FB Ads, Adwords...)
● Import paths: Daily Snapshot, Load, Pre-aggregate, API Import
● Imported data lands in warehouse Tables
● Transform / Aggregate steps build Derived Tables from those Tables

Page 13: Building Analytics Infrastructure for Growing Tech Companies

Data Pipeline Philosophy

Centralize Your Data: join/cross-reference your data

Unix Philosophy: Each component does one thing well

Immutable Data: Don’t modify the original data
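
One way to read the "Immutable Data" principle, sketched under assumptions: raw imported tables are never updated in place, and every derived table is rebuilt from them. The table names `raw_pageviews` and `daily_views_by_country` are hypothetical, and sqlite3 stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw, imported data: treated as read-only after load.
CREATE TABLE raw_pageviews (date_d TEXT, country TEXT, user_id INTEGER, page_name TEXT);
INSERT INTO raw_pageviews VALUES
  ('2016-08-01', 'SG', 1, 'home'),
  ('2016-08-01', 'SG', 2, 'home'),
  ('2016-08-01', 'VN', 3, 'video'),
  ('2016-08-02', 'SG', 1, 'video');
""")


def rebuild_daily_views(connection):
    """Drop and recreate the derived table from the untouched raw table."""
    connection.executescript("""
    DROP TABLE IF EXISTS daily_views_by_country;
    CREATE TABLE daily_views_by_country AS
    SELECT date_d, country, COUNT(*) AS views
    FROM raw_pageviews
    GROUP BY date_d, country;
    """)


rebuild_daily_views(conn)
rebuild_daily_views(conn)  # rerunning is safe because the raw data never changes

raw_count = conn.execute("SELECT COUNT(*) FROM raw_pageviews").fetchone()[0]
derived = conn.execute(
    "SELECT date_d, country, views FROM daily_views_by_country ORDER BY date_d, country"
).fetchall()
print(raw_count, derived)
```

Because the transform only reads the raw table, a bug in the aggregation can be fixed and the derived table regenerated without re-importing anything.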

Page 14: Building Analytics Infrastructure for Growing Tech Companies

Choosing An Analytics DB

Page 15: Building Analytics Infrastructure for Growing Tech Companies

[Same pipeline diagram as Page 11: Operational Data → Data Warehouse (Analytics Database) → Reporting / Analysis]

What database should we pick?

Page 16: Building Analytics Infrastructure for Growing Tech Companies

Transactional DBs vs. Analytics DBs

Transactional DBs

Data:
● Many single-row writes
● Current, single data

Queries:
● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries

Analytics DBs

Data:
● Few large batch imports
● Years of data, many sources

Queries:
● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long, complex queries

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)

Page 17: Building Analytics Infrastructure for Growing Tech Companies

Complex Query...

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)

Page 18: Building Analytics Infrastructure for Growing Tech Companies

Requirements

1. Simple to Get Started

2. Rich Features for Analytics

– Data Pipeline (ETL)

– Data Analysis (SQL et al)

3. Easy to Scale Up

Data Growth: (1) Start → (2) Grow → (3) Scale

Page 19: Building Analytics Infrastructure for Growing Tech Companies

Requirements

1. Simple to Get Started

2. Rich Features for Analytics

– Data Pipeline (ETL)

– Data Analysis (SQL et al)

3. Easy to Scale Up

Data Growth: (1) Start → (2) Grow → (3) Scale

Personal Recommendation:

Page 21: Building Analytics Infrastructure for Growing Tech Companies

1. Simple to Get Started

● Data requests grow gradually as your company grows
● Business users care about results (not backend)

→ Need something quick to start, easy to fine-tune along the way

Postgres:
● Free (open-source)
● Easy to set up

Page 23: Building Analytics Infrastructure for Growing Tech Companies

[Same pipeline diagram as Page 11, split into two parts: Data Pipeline (ETL) and Data Analysis / Reporting]

Page 24: Building Analytics Infrastructure for Growing Tech Companies

2a. Data Pipeline (ETL) & Performance

● Managing Table Data: table partitioning
● Managing Disk Space: tablespace
● Improve Write Performance: unlogged table
● Others: foreign data wrapper, point-in-time recovery

Page 25: Building Analytics Infrastructure for Growing Tech Companies

Managing Data Tables

Analytics tables hold lots of data

pageviews (+ 100k records a day)
date_d | country | user_id | browser | page_name | views

⇒ Table grows big quickly, difficult to manage!

Solution: Split (partition) to multiple tables

pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09

Problem: Difficult to query data across multiple months

Page 26: Building Analytics Infrastructure for Growing Tech Companies

Managing Data Tables: parent table

pageviews_parent (parent table)

pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09

-- Attach the monthly table as a child of the parent table
ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;

-- Declare which dates the child holds
ALTER TABLE pageviews_2015_09 ADD CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01');
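
Creating a child table by hand every month gets tedious, so the DDL is often generated. The helper below is a hypothetical sketch, not something from the talk: it builds one `CREATE TABLE ... INHERITS` statement per month, with the date range as a CHECK constraint, and only returns the SQL text rather than executing it. It assumes the parent table and the `date_d` column named on the slide.

```python
from datetime import date


def month_partition_ddl(parent: str, base: str, year: int, month: int) -> str:
    """Return Postgres DDL for one monthly child table of `parent`.

    `base` is the table-name prefix, e.g. "pageviews" -> pageviews_2015_09.
    """
    start = date(year, month, 1)
    # First day of the following month, rolling over the year in December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{base}_{year:04d}_{month:02d}"
    return (
        f"CREATE TABLE {child} (\n"
        f"  CHECK (date_d >= DATE '{start.isoformat()}' "
        f"AND date_d < DATE '{end.isoformat()}')\n"
        f") INHERITS ({parent});"
    )


ddl = month_partition_ddl("pageviews_parent", "pageviews", 2015, 12)
print(ddl)
```

With the CHECK constraints in place, a query on the parent table that filters on `date_d` can let the planner skip child tables whose range cannot match (constraint exclusion). PostgreSQL 10 and later also offer declarative partitioning, which postdates this talk.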

Page 27: Building Analytics Infrastructure for Growing Tech Companies

Managing Disk-spaces

Analytics DB holds lots of data; hardware space is limited

Data have different access frequency:
● Hot Data
● Warm Data
● Cold Data

Page 28: Building Analytics Infrastructure for Growing Tech Companies

Managing Disk-spaces: tablespace

Tablespace: Define where your tables are stored on disks

CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';

-- beginning of the month
-- (LIKE ...) copies last month's column definitions; the slide omitted the column list, so this is an assumption
CREATE TABLE pageviews_2016_08 (LIKE pageviews_2016_07) TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
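
The monthly rotation can be sketched as a small planning function. Everything here beyond the `hot_data` and `warm_data` names is an assumption: the `cold_data` tablespace, the age thresholds, and the helper names are hypothetical. The function only produces `ALTER TABLE ... SET TABLESPACE` statements as text.

```python
from datetime import date


def months_between(newer: date, older: date) -> int:
    """Whole calendar months from `older` to `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def tablespace_for(partition_month: date, today: date,
                   hot_months: int = 1, warm_months: int = 6) -> str:
    """Pick a storage tier by partition age (thresholds are illustrative)."""
    age = months_between(today, partition_month)
    if age < hot_months:
        return "hot_data"
    if age < warm_months:
        return "warm_data"
    return "cold_data"  # hypothetical third tablespace


def rotation_ddl(partition_months, today):
    """One ALTER TABLE statement per monthly pageviews partition."""
    return [
        f"ALTER TABLE pageviews_{m.year:04d}_{m.month:02d} "
        f"SET TABLESPACE {tablespace_for(m, today)};"
        for m in partition_months
    ]


statements = rotation_ddl(
    [date(2016, 8, 1), date(2016, 7, 1), date(2016, 1, 1)],
    today=date(2016, 8, 1),
)
for statement in statements:
    print(statement)
```

Moving a table between tablespaces physically copies its data and locks the table while it runs, so a real job would probably check where each table currently lives and schedule the moves for quiet hours.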

Page 29: Building Analytics Infrastructure for Growing Tech Companies

Combining TABLESPACE and PARENT TABLE

[Diagram: pageviews_parent (parent table) with monthly child tables pageviews_2015_06, pageviews_2015_07, pageviews_2015_08, pageviews_2015_09]

Page 30: Building Analytics Infrastructure for Growing Tech Companies

2b. Data Analysis (writing SQLs)

● Extract / transform
● Aggregate / summarize
● Statistical analysis

[Diagram: Analytics Database → Reporting / BI, Data Analysis, Data Science / ML]

Page 31: Building Analytics Infrastructure for Growing Tech Companies

2b. Data Analysis

● SQL features
  ○ WITH clause
  ○ Window functions
  ○ Aggregation functions
  ○ Statistical functions
● Data structures
  ○ JSON / JSONB
  ○ Arrays
  ○ PostGIS (geo data)
  ○ Geometry (point, line, etc)
  ○ HyperLogLog (extension)
● PL/pgSQL
● Full-text search (n-gram)
● Performance:
  ○ Parallel queries (pg9.6)
  ○ Materialized views
  ○ BRIN index
● Others:
  ○ DISTINCT ON
  ○ VALUES
  ○ generate_series()
  ○ Support FULL OUTER JOIN
  ○ Better EXPLAIN
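
To make two of the listed SQL features concrete, here is a small sketch combining a WITH clause and window functions: daily totals, a running total, and the change against the previous day. It runs on Python's built-in sqlite3 as a stand-in (this needs a bundled SQLite of 3.25 or newer for window functions); the query text is plain enough that it should carry over to Postgres. The sample `pageviews` rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pageviews (date_d TEXT, country TEXT, views INTEGER);
INSERT INTO pageviews VALUES
  ('2015-09-01', 'SG', 10), ('2015-09-01', 'VN', 5),
  ('2015-09-02', 'SG', 20), ('2015-09-03', 'SG', 30);
""")

TREND_QUERY = """
WITH daily AS (                      -- WITH clause: name an intermediate result
  SELECT date_d, SUM(views) AS views
  FROM pageviews
  GROUP BY date_d
)
SELECT date_d,
       views,
       SUM(views) OVER (ORDER BY date_d)         AS running_views,
       views - LAG(views) OVER (ORDER BY date_d) AS change_vs_prev_day
FROM daily
ORDER BY date_d
"""

trend = conn.execute(TREND_QUERY).fetchall()
for row in trend:
    print(row)
```

Without window functions the running total and day-over-day change would each need a self-join or a pass in application code.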

Page 35: Building Analytics Infrastructure for Growing Tech Companies

3. Scaling Up

● Transactional DB’s downsides:
  ○ Optimized for transactional applications
  ○ Single-core execution; row-based storage
● CitusDB Extension (Postgres)
  ○ Automated data sharding and parallelization
  ○ Columnar Storage Format (better storage and performance)
● Amazon Redshift
  ○ Based on ParAccel DB, itself derived from PostgreSQL 8.0.2
  ○ Columnar Storage & Parallel Executions
  ○ Pay per hour per instance type
● Google BigQuery
  ○ Spun out of Google’s Dremel
  ○ Pay per query, per data access
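
A toy illustration of why columnar storage suits analytics, using plain Python structures rather than any real storage engine. The sample records are made up. In the row layout, summing one field means walking whole records; in the column layout, the same sum reads a single contiguous list.

```python
# Row-oriented layout: one complete record per entry.
rows = [
    {"date_d": "2015-09-01", "country": "SG", "user_id": 1, "views": 3},
    {"date_d": "2015-09-01", "country": "VN", "user_id": 2, "views": 5},
    {"date_d": "2015-09-02", "country": "SG", "user_id": 1, "views": 4},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


# Column-oriented layout: one list per field.
columns = to_columns(rows)

# Row store: every record is visited even though only `views` is needed.
total_from_rows = sum(record["views"] for record in rows)

# Column store: only the `views` column is read.
total_from_columns = sum(columns["views"])

print(total_from_rows, total_from_columns)
```

Real columnar databases add compression and parallel scans on top of this layout, which this sketch does not attempt to model.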

Page 36: Building Analytics Infrastructure for Growing Tech Companies

Other DW Databases (Relational)

● Greenplum
● Teradata
● Infobright
● Google BigQuery
● Aster Data

Related to Postgres:
● ParAccel (Postgres fork)
● Vertica (from Postgres author)
● CitusDB (Postgres extension)
● Amazon Redshift (from ParAccel)

Page 37: Building Analytics Infrastructure for Growing Tech Companies

Compare: Popular SQL Databases

                   | PostgreSQL         | MySQL              | Oracle    | SQL Server
License / Cost     | Free / Open-source | Free / Open-source | Expensive | Expensive
Analytics Features | Strong             | Weak               | Strong    | Strong

Page 38: Building Analytics Infrastructure for Growing Tech Companies

Hadoop vs. MPP Database

Hadoop is a data storage and processing framework

– HDFS: data-storage layer

– YARN: resource management

– MapReduce/Pig/Hive/Spark: processing layer

MPP database (massively parallel processing)

– Columnar-storage database, meant for analytics

– OLAP: Online Analytical Processing

– Examples: Vertica, Amazon Redshift, ParAccel
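
To give a feel for the processing-layer difference, here is a single-process sketch of the MapReduce style: the aggregation an MPP database would express as `SELECT country, SUM(views) ... GROUP BY country` is spelled out as explicit map, shuffle, and reduce steps. This is only an illustration in plain Python with made-up records; it does not use Hadoop or any of its APIs.

```python
from collections import defaultdict

# (country, views) records, as they might appear in a log file.
records = [("SG", 3), ("VN", 5), ("SG", 4), ("US", 1)]


def map_phase(record):
    """Emit a (key, value) pair for one input record."""
    country, views = record
    return (country, views)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


mapped = [map_phase(record) for record in records]
views_by_country = dict(
    reduce_phase(key, values) for key, values in shuffle(mapped).items()
)
print(views_by_country)
```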

Page 39: Building Analytics Infrastructure for Growing Tech Companies

Choosing A BI Tool

Page 40: Building Analytics Infrastructure for Growing Tech Companies

[Same pipeline diagram as Page 11: Operational Data → Data Warehouse (Analytics Database) → Reporting / Analysis]

Which BI software?

Page 41: Building Analytics Infrastructure for Growing Tech Companies

Choosing A BI Tool: Criteria

Use cases: Pretty Visualizations vs. Detailed Data Access, Embedded, Email Schedules, etc

Report Creation: Technical vs. Non-technical

Data Ownership: Your own database vs. BI Software’s storage

Page 42: Building Analytics Infrastructure for Growing Tech Companies

Other Criteria

● Process: Direct Access vs Data Models

● ETL: How is the ETL process managed?

● License Fee: Upfront Investment vs. Value-Based Pricing

● Implementation Fee: Self Service vs. 3rd Party

● Training Fee: Training Costs vs Specialized Skill Sets

Page 43: Building Analytics Infrastructure for Growing Tech Companies

BI Tools

BI features:
● Heavy Visualization: Tableau, Qlik
● Medium + Other Features: Holistics, Chartio, Periscope

Report Creation:
● SQL: Holistics, Periscope
● Drag-and-Drop/Excel: Tableau, Qlik, Sisense, PowerBI

Data Ownership:
● Store your data: Periscope, Tableau, Sisense
● Doesn’t store your data: Holistics, Chartio, Looker

Page 44: Building Analytics Infrastructure for Growing Tech Companies

Comparing BI tools

Tool           | Visualization | Report Creation                   | Data Storage  | Pricing                      | Notes
Tableau        | Strong        | Drag & Drop                       | Store         | Per desktop + server license | Leader in Visualization
Qlik           | Strong        | Drag & Drop                       | Store         | Desktop + Server License     | ETL + Visualization
Holistics      | Standard      | SQL                               | Doesn’t store | Users + Usage                | Strong permission management, detailed data extractions
Periscope Data | Standard      | SQL                               | Store         | No. of records               | Auto-cache your data using Amazon Redshift
Chartio        | Standard      | Drag & Drop                       | Doesn’t store | User + Feature               | Transform your data on Chartio server
Looker         | Standard      | Data Model (LookML) based on SQL  | Doesn’t store | User Packs                   | Accessed through the Looker Data Model

Page 45: Building Analytics Infrastructure for Growing Tech Companies

Summary

● The Data Problem
● Typical Pipeline (Startup)
● Choosing An Analytics DB
● Choosing A BI Tool

Page 46: Building Analytics Infrastructure for Growing Tech Companies

Not Covered

● Setting Up & Performance Optimizations

● Other data types: time-series data, geo data, search data

● Big Data frameworks: Hadoop, Spark, HDFS, etc

● Real-time Data Processing, Stream Processing (Storm, Kafka, Kinesis)

Page 47: Building Analytics Infrastructure for Growing Tech Companies

Holistics.io