Building Analytics Infrastructure for Growing Tech Companies

Post on 26-Jan-2017


Transcript of Building Analytics Infrastructure for Growing Tech Companies

Holistics.io

Huy Nguyen, CTO and Cofounder, Holistics.io

Building Analytics Infrastructure for Growing Tech Companies

Data Night Singapore, Aug 2016

(*) aka Data Pipeline

About Me

● Cofounder of Holistics.io
○ Data Reporting (BI) and Infrastructure SaaS
● Previously
○ Built Data Pipeline at Viki (Singapore)
○ Growth Team at Facebook (US)

Agenda

● The Data Problem
● Typical Data Pipeline (Startup)
● Choosing An Analytics DB
● Choosing A BI Tool

Background: What is Analytics/DW?

- A Typical Web Application

[Diagram: Android, iOS and Web clients → APIs → live production DBs]

Data-related Business Questions:
• Daily/weekly registered users by platform and country?
• How many video uploads do we have every day?

[Diagram, next build: a Daily Snapshot of the production DBs goes into an Analytics Database, which serves Reporting / BI]

Data-related Business Questions:
• How did my marketing campaigns affect registrations?

[Diagram, final build: third-party sources (GA, FB Ads, Adwords...) are added alongside the production DBs]
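As a rough illustration of the registered-users question above, a query along these lines could run against the analytics database. The `users` table and its `created_at`, `platform` and `country` columns are assumptions made for this sketch, not a schema shown in the talk.

```sql
-- Hypothetical schema: one row per registered user.
CREATE TEMP TABLE users (
    user_id    bigint PRIMARY KEY,
    created_at timestamp NOT NULL,
    platform   text NOT NULL,   -- e.g. 'android', 'ios', 'web'
    country    text NOT NULL
);

-- Made-up sample rows.
INSERT INTO users VALUES
    (1, '2016-08-01 09:15', 'android', 'SG'),
    (2, '2016-08-01 11:40', 'ios',     'SG'),
    (3, '2016-08-01 13:05', 'android', 'SG'),
    (4, '2016-08-02 08:30', 'web',     'VN');

-- Daily registrations by platform and country.
-- Swap 'day' for 'week' to get the weekly version.
SELECT date_trunc('day', created_at)::date AS signup_date,
       platform,
       country,
       count(*) AS registered_users
FROM users
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
```

Running this kind of query on a snapshot rather than on the live database keeps reporting load away from production.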

A Typical Data Pipeline

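[Diagram: a typical data pipeline in three stages]

Operational Data → Data Warehouse → Reporting / Analysis

● Operational Data (sources): CSVs / Excels / Google Sheets; Event Logs (behavioural data); Live / Production DBs; 3rd-party Tracking (GA, FB Ads, Adwords...)
● Import steps labelled in the diagram: Daily Snapshot (production DBs), Load, Pre-aggregate, API Import (3rd-party tracking)
● Data Warehouse: Analytics Database (Update / Transform / Aggregate)
● Reporting / Analysis: Reporting / BI, Data Analysis, Data Science / ML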

[Diagram: inside the Data Warehouse (Analytics Database)]

(1) Import → (2) Process → (3) Present

● (1) Import: CSVs / Excels / Google Sheets, Event Logs (behavioural data), Production DBs (Daily Snapshot) and 3rd-party Tracking (API Import) are loaded, some pre-aggregated, into tables
● (2) Process: Transform / Aggregate steps turn those tables into derived tables
● (3) Present

Data Pipeline Philosophy

Centralize Your Data: join/cross-reference your data

Unix Philosophy: Each component does one thing well

Immutable Data: Don’t modify the original data
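One way to read the "immutable data" principle in SQL terms: leave the imported table untouched and write every transformation into a separate derived table. The sketch below assumes a hypothetical `raw_pageviews` import and an illustrative `daily_pageviews` derived table; neither name comes from the talk.

```sql
-- Raw, imported data: loaded once and never updated in place.
CREATE TABLE raw_pageviews (
    date_d    date NOT NULL,
    country   text,
    user_id   bigint,
    page_name text,
    views     integer NOT NULL
);

-- Made-up sample rows.
INSERT INTO raw_pageviews VALUES
    ('2016-08-01', 'SG', 1, 'home',  3),
    ('2016-08-01', 'SG', 2, 'home',  1),
    ('2016-08-01', 'VN', 3, 'video', 5);

-- Derived table: rebuilt from the raw data on each run, so a buggy
-- transformation can be fixed and re-run without losing the original.
BEGIN;
DROP TABLE IF EXISTS daily_pageviews;
CREATE TABLE daily_pageviews AS
SELECT date_d,
       country,
       count(DISTINCT user_id) AS visitors,
       sum(views)              AS total_views
FROM raw_pageviews
GROUP BY date_d, country;
COMMIT;

SELECT * FROM daily_pageviews ORDER BY date_d, country;
```

Because PostgreSQL DDL is transactional, readers of `daily_pageviews` see either the old or the new version, never a half-built table.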

Choosing An Analytics DB

[Pipeline diagram repeated]

What database should we pick?

Transactional DBs vs. Analytics DBs

Transactional DBs

Data:
● Many single-row writes
● Current, single data

Queries:
● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries

Analytics DBs

Data:
● Few large batch imports
● Years of data, many sources

Queries:
● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long, complex queries

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)

Complex Query...

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)

Requirements

1. Simple to Get Started
2. Rich Features for Analytics
– Data Pipeline (ETL)
– Data Analysis (SQL et al)
3. Easy to Scale Up

[Chart: Data Growth across three stages: (1) Start, (2) Grow, (3) Scale]

Personal Recommendation:

1. Simple to Get Started

● Data requests grow gradually as your company grows
● Business users care about results (not backend)

→ Need something quick to start, easy to fine-tune along the way

Postgres:
● Free (open-source)
● Easy to set up

1. Simple start 2. Rich features 3. Scale up

[Pipeline diagram repeated, split into two parts: Data Pipeline (ETL) and Data Analysis / Reporting]

2a. Data Pipeline (ETL) & Performance

● Managing Table Data: table partitioning
● Managing Disk Space: tablespace
● Improve Write Performance: unlogged table
● Others: foreign data wrapper, point-in-time recovery
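For the write-performance point, PostgreSQL's `UNLOGGED` tables skip the write-ahead log, which speeds up bulk loads at the cost of crash safety: an unlogged table is truncated after a crash and is not replicated. A minimal sketch, assuming a hypothetical staging table whose contents can always be re-imported from the source:

```sql
-- Staging table for a bulk import; its contents can be regenerated,
-- so losing them after a crash is acceptable.
CREATE UNLOGGED TABLE staging_pageviews (
    date_d    date,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- Bulk load (generate_series stands in for a real import here).
INSERT INTO staging_pageviews
SELECT DATE '2016-08-01', 'SG', g, 'chrome', 'home', 1
FROM generate_series(1, 100000) AS g;

-- Once the load is verified, the table can be made durable again
-- (ALTER TABLE ... SET LOGGED is available from PostgreSQL 9.5).
ALTER TABLE staging_pageviews SET LOGGED;
```

This trade-off suits intermediate ETL tables; tables that cannot be rebuilt from a source should stay logged.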


Managing Data Tables

Analytics tables hold lots of data

pageviews (+ 100k records a day)

date_d | country | user_id | browser | page_name | views

⇒ Table grows big quickly, difficult to manage!

Solution: Split (partition) into multiple tables

pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09

Problem: Difficult to query data across multiple months

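Managing Data Tables: parent table

pageviews_parent (parent table)

pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09

-- attach the monthly table to the parent
ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;

-- describe which dates the table holds (the constraint name is illustrative)
ALTER TABLE pageviews_2015_09 ADD CONSTRAINT pageviews_2015_09_date_check CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01');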
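Putting the two statements in context, a self-contained sketch of inheritance-based partitioning might look like the following. Column types are guesses based on the column names shown earlier, and the constraint names are illustrative. (PostgreSQL 10 and later also offer declarative `PARTITION BY`, which postdates this talk.)

```sql
-- Parent table: holds no rows itself, only defines the shape.
CREATE TABLE pageviews_parent (
    date_d    date NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- Monthly child tables, each with a CHECK constraint describing its range.
CREATE TABLE pageviews_2015_08 (
    CONSTRAINT pageviews_2015_08_date_check
        CHECK (date_d >= '2015-08-01' AND date_d < '2015-09-01')
) INHERITS (pageviews_parent);

CREATE TABLE pageviews_2015_09 (
    CONSTRAINT pageviews_2015_09_date_check
        CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01')
) INHERITS (pageviews_parent);

-- Made-up sample rows, written directly to the matching child.
INSERT INTO pageviews_2015_08 VALUES ('2015-08-15', 'SG', 1, 'chrome', 'home', 2);
INSERT INTO pageviews_2015_09 VALUES ('2015-09-03', 'VN', 2, 'safari', 'video', 4);

-- Querying the parent reads all children...
SELECT date_d, country, views FROM pageviews_parent ORDER BY date_d;

-- ...and with constraint exclusion (the default setting, 'partition',
-- covers inheritance children) the planner can skip children whose
-- CHECK constraint rules them out. EXPLAIN shows which tables are scanned.
EXPLAIN
SELECT sum(views)
FROM pageviews_parent
WHERE date_d >= '2015-09-01' AND date_d < '2015-10-01';
```

Queries across months go through `pageviews_parent`, which addresses the "difficult to query across multiple months" problem while keeping each month in its own table.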


Managing Disk Space

Analytics DB holds lots of data; hardware space is limited

Data have different access frequencies:
● Hot Data
● Warm Data
● Cold Data

Managing Disk Space: tablespace

Tablespace: Define where your tables are stored on disk

CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';

-- beginning of the month
-- (the new table reuses last month's column definitions)
CREATE TABLE pageviews_2016_08 (LIKE pageviews_2016_07) TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;

Combining TABLESPACE and PARENT TABLE

[Diagram: monthly tables pageviews_2015_06, pageviews_2015_07, pageviews_2015_08 and pageviews_2015_09 under pageviews_parent (parent table)]

2b. Data Analysis (writing SQL)

● Extract / transform
● Aggregate / summarize
● Statistical analysis

[Diagram: Analytics Database → Data Analysis, Reporting / BI, Data Science / ML]

2b. Data Analysis

● SQL features
○ WITH clause
○ Window functions
○ Aggregation functions
○ Statistical functions
● Data structures
○ JSON / JSONB
○ Arrays
○ PostGIS (geo data)
○ Geometry (point, line, etc.)
○ HyperLogLog (extension)
● PL/pgSQL
● Full-text search (n-gram)
● Performance:
○ Parallel queries (PostgreSQL 9.6)
○ Materialized views
○ BRIN index
● Others:
○ DISTINCT ON
○ VALUES
○ generate_series()
○ Support FULL OUTER JOIN
○ Better EXPLAIN
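To make a few of these features concrete, the sketch below combines a `WITH` clause, `generate_series()` and a window function to compute a 7-day moving average of page views, filling in days that have no data. The `pageviews` table follows the column list shown earlier; the column types and sample rows are made up.

```sql
CREATE TEMP TABLE pageviews (
    date_d    date NOT NULL,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer NOT NULL
);

INSERT INTO pageviews VALUES
    ('2016-08-01', 'SG', 1, 'chrome', 'home',  10),
    ('2016-08-02', 'SG', 2, 'chrome', 'home',  20),
    ('2016-08-04', 'VN', 3, 'safari', 'video', 30);

WITH days AS (
    -- generate_series() produces one row per calendar day,
    -- including days that have no pageviews at all.
    SELECT d::date AS date_d
    FROM generate_series(DATE '2016-08-01', DATE '2016-08-05',
                         INTERVAL '1 day') AS d
),
daily AS (
    -- Total views per day; days without rows count as 0.
    SELECT days.date_d,
           coalesce(sum(p.views), 0) AS views
    FROM days
    LEFT JOIN pageviews p ON p.date_d = days.date_d
    GROUP BY days.date_d
)
SELECT date_d,
       views,
       -- Window function: average over the current day and the 6 before it.
       round(avg(views) OVER (ORDER BY date_d
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 1)
           AS moving_avg_7d
FROM daily
ORDER BY date_d;
```

Each named step in the `WITH` clause can be inspected on its own, which helps when debugging long report queries.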


3. Scaling Up

● Transactional DB's downsides:
○ Optimized for transactional applications
○ Single-core execution; row-based storage
● CitusDB Extension (Postgres)
○ Automated data sharding and parallelization
○ Columnar Storage Format (better storage and performance)
● Amazon Redshift
○ Built on ParAccel technology, which derives from PostgreSQL 8.0.2
○ Columnar Storage & Parallel Execution
○ Pay per hour, per instance type
● Google BigQuery
○ Spun out of Google's Dremel
○ Pay per query, per data access

Other DW Databases (Relational)

● Greenplum
● Teradata
● Infobright
● Google BigQuery
● Aster Data
● Paraccel (Postgres fork)
● Vertica (from Postgres author)
● CitusDB (Postgres extension)
● Amazon Redshift (from Paraccel)

Related to Postgres: the entries annotated above (Paraccel, Vertica, CitusDB, Amazon Redshift)

Compare: Popular SQL Databases

                   | PostgreSQL         | MySQL              | Oracle    | SQL Server
License / Cost     | Free / Open-source | Free / Open-source | Expensive | Expensive
Analytics Features | Strong             | Weak               | Strong    | Strong

Hadoop vs. MPP Databases

Hadoop is a data storage and processing framework
– HDFS: data-storage layer
– YARN: resource management
– MapReduce/Pig/Hive/Spark: processing layer

MPP database (massively parallel processing)
– Columnar-storage database, meant for analytics
– OLAP: Online Analytical Processing
– Examples: Vertica, Amazon Redshift, ParAccel

Choosing A BI Tool

[Pipeline diagram repeated]

Which BI software?

Choosing A BI Tool: Criteria

Use cases: Pretty Visualizations vs. Detailed Data Access, Embedded, Email Schedules, etc

Report Creation: Technical vs. Non-technical

Data Ownership: Your own database vs. BI Software’s storage


Other Criteria

● Process: Direct Access vs Data Models

● ETL: How is the ETL process managed?

● License Fee: Upfront Investment vs. Value-Based Pricing

● Implementation Fee: Self Service vs. 3rd Party

● Training Fee: Training Costs vs Specialized Skill Sets

BI Tools

BI features:
● Heavy Visualization: Tableau, Qlik
● Medium + Other Features: Holistics, Chartio, Periscope

Report Creation:
● SQL: Holistics, Periscope
● Drag-and-Drop/Excel: Tableau, Qlik, Sisense, PowerBI

Data Ownership:
● Store your data: Periscope, Tableau, Sisense
● Doesn't store your data: Holistics, Chartio, Looker

Comparing BI tools

Tool | Visualization | Report Creation | Data Storage | Pricing | Notes
Tableau | Strong | Drag & Drop | Store | Per desktop + server license | Leader in Visualization
Qlik | Strong | Drag & Drop | Store | Desktop + Server License | ETL + Visualization
Holistics | Standard | SQL | Doesn't store | Users + Usage | Strong permission management, detailed data extractions
Periscope Data | Standard | SQL | Store | No. of records | Auto-cache your data using Amazon Redshift
Chartio | Standard | Drag & Drop | Doesn't store | User + Feature | Transform your data on Chartio server
Looker | Standard | Data Model (LookML) based on SQL | Doesn't store | User Packs | Accessed through the Looker Data Model

Summary

● The Data Problem
● Typical Pipeline (Startup)
● Choosing An Analytics DB
● Choosing A BI Tool

Not Covered

● Setting Up & Performance Optimizations

● Other data types: time-series data, geo data, search data

● Big Data frameworks: Hadoop, Spark, HDFS, etc

● Real-time Data Processing, Stream Processing (Storm, Kafka, Kinesis)

Huy Nguyen
huy@holistics.io