Building Analytics Infrastructure for Growing Tech Companies
Transcript of Building Analytics Infrastructure for Growing Tech Companies
Holistics.io
Huy Nguyen, CTO, Cofounder - Holistics.io
Building Analytics Infrastructure for Growing Tech Companies
Data Night Singapore, Aug 2016
(*) aka Data Pipeline
About Me
● Cofounder of Holistics.io
○ Data Reporting (BI) and Infrastructure SaaS
● Previous
○ Built Data Pipeline at Viki (Singapore)
○ Growth Team at Facebook (US)
Agenda
● The Data Problem
● Typical Data Pipeline (Startup)
● Choosing An Analytics DB
● Choosing A BI Tool
Background: What is Analytics/DW?
- A Typical Web Application
Data-related Business Questions:
• Daily/weekly registered users by different platforms, countries?
• How many video uploads do we have every day?
[Diagram: Android / iOS / Web → APIs → Live Databases (Production DBs)]
[Diagram: Android / iOS / Web → APIs → Live Databases (Production DBs) → Daily Snapshot → Analytics Database → Reporting / BI]
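A minimal sketch of the Daily Snapshot step, assuming a hypothetical `users` table in production. SQLite (Python standard library) stands in for both the production and the analytics database here, and the table names, columns, and sample rows are illustrative only.

```python
import sqlite3
from datetime import date

def take_daily_snapshot(prod, analytics, snapshot_date):
    """Copy the production `users` table into the analytics DB, tagged with the snapshot date."""
    analytics.execute(
        "CREATE TABLE IF NOT EXISTS users_snapshot ("
        "snapshot_date TEXT, user_id INTEGER, platform TEXT, "
        "country TEXT, registered_on TEXT)"
    )
    rows = prod.execute(
        "SELECT user_id, platform, country, registered_on FROM users"
    ).fetchall()
    analytics.executemany(
        "INSERT INTO users_snapshot VALUES (?, ?, ?, ?, ?)",
        [(snapshot_date.isoformat(), *row) for row in rows],
    )
    analytics.commit()
    return len(rows)

# Hypothetical production data
prod = sqlite3.connect(":memory:")
prod.execute(
    "CREATE TABLE users (user_id INTEGER, platform TEXT, country TEXT, registered_on TEXT)"
)
prod.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1, "android", "SG", "2016-08-01"),
    (2, "ios", "SG", "2016-08-01"),
    (3, "web", "VN", "2016-08-02"),
    (4, "android", "VN", "2016-08-02"),
])

analytics = sqlite3.connect(":memory:")
copied = take_daily_snapshot(prod, analytics, date(2016, 8, 3))

# Business question: daily registered users by platform, answered from the snapshot
daily_by_platform = analytics.execute(
    "SELECT registered_on, platform, COUNT(*) FROM users_snapshot "
    "WHERE snapshot_date = ? "
    "GROUP BY registered_on, platform "
    "ORDER BY registered_on, platform",
    ("2016-08-03",),
).fetchall()
print(daily_by_platform)
```

Reports then query the snapshot instead of the live database, so heavy reporting queries do not compete with user traffic.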
- A Typical Web Application
Data-related Business Questions:
• How did my marketing campaigns affect registrations?
[Diagram: Android / iOS / Web → APIs → Live Databases (Production DBs) → Daily Snapshot → Analytics Database → Reporting / BI; GA, FB Ads, Adwords... also feed the Analytics Database]
A Typical Data Pipeline
[Diagram: Operational Data → Data Warehouse → Reporting / Analysis]
Operational Data:
● Live Databases (Production DBs): Daily Snapshot
● Event Logs (behavioural data)
● CSVs / Excels / Google Sheets
● 3rd-party Tracking (GA, FB Ads, Adwords...): API Import
● Import steps: Load, Pre-aggregate
Data Warehouse:
● Analytics Database: Update / Transform / Aggregate
Reporting / Analysis:
● Reporting / BI
● Data Analysis
● Data Science / ML
[Diagram: inside the Data Warehouse (Analytics Database)]
(1) Import: data from the sources lands in Tables
● Live Databases (Production DBs): Daily Snapshot
● Event Logs (behavioural data)
● CSVs / Excels / Google Sheets
● 3rd-party Tracking (GA, FB Ads, Adwords...): API Import
● Import steps: Load, Pre-aggregate
(2) Process: Transform / Aggregate Tables into Derived Tables
(3) Present
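The three stages can be sketched roughly as follows. This is an illustration only: SQLite stands in for the analytics database, the `pageviews` columns follow the layout shown later in the deck, and the derived table name and sample rows are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# (1) Import: raw rows land in a table that is not modified afterwards
db.execute(
    "CREATE TABLE pageviews ("
    "date_d TEXT, country TEXT, user_id INTEGER, "
    "browser TEXT, page_name TEXT, views INTEGER)"
)
db.executemany("INSERT INTO pageviews VALUES (?, ?, ?, ?, ?, ?)", [
    ("2015-09-01", "SG", 1, "chrome", "home", 3),
    ("2015-09-01", "SG", 2, "safari", "home", 1),
    ("2015-09-01", "VN", 3, "chrome", "video", 5),
    ("2015-09-02", "SG", 1, "chrome", "video", 2),
])

# (2) Process: build a derived table from the raw table
db.execute("""
    CREATE TABLE pageviews_daily_by_country AS
    SELECT date_d,
           country,
           SUM(views) AS views,
           COUNT(DISTINCT user_id) AS users
    FROM pageviews
    GROUP BY date_d, country
""")

# (3) Present: reports read the small derived table
report = db.execute(
    "SELECT date_d, country, views, users "
    "FROM pageviews_daily_by_country ORDER BY date_d, country"
).fetchall()

# The raw table is untouched and can be re-processed at any time
raw_count = db.execute("SELECT COUNT(*) FROM pageviews").fetchone()[0]
print(report, raw_count)
```

Because the derived table is rebuilt from the raw table, a mistake in the transform can be fixed by dropping and recreating it.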
Data Pipeline Philosophy
Centralize Your Data: join/cross-reference your data
Unix Philosophy: Each component does one thing well
Immutable Data: Don’t modify the original data
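As an illustration of the "centralize" principle: once registrations and campaign spend sit in the same database, the marketing question from earlier becomes a single join. The schemas, campaign names, and figures below are hypothetical, and SQLite stands in for the analytics database.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Source 1: registrations, snapshotted from the production DB (hypothetical schema)
db.execute("CREATE TABLE registrations (user_id INTEGER, registered_on TEXT, campaign TEXT)")
db.executemany("INSERT INTO registrations VALUES (?, ?, ?)", [
    (1, "2016-08-01", "fb_summer"),
    (2, "2016-08-01", "fb_summer"),
    (3, "2016-08-02", "adwords_brand"),
    (4, "2016-08-02", None),  # organic signup, no campaign
])

# Source 2: spend per campaign, imported from the ad platforms (hypothetical schema)
db.execute("CREATE TABLE campaign_spend (campaign TEXT, spend_usd REAL)")
db.executemany("INSERT INTO campaign_spend VALUES (?, ?)", [
    ("fb_summer", 100.0),
    ("adwords_brand", 80.0),
])

# Cross-reference the two sources: registrations and cost per registration by campaign
cost_per_registration = db.execute("""
    SELECT s.campaign,
           COUNT(r.user_id) AS registrations,
           s.spend_usd / COUNT(r.user_id) AS cost_per_reg
    FROM campaign_spend s
    JOIN registrations r ON r.campaign = s.campaign
    GROUP BY s.campaign, s.spend_usd
    ORDER BY s.campaign
""").fetchall()
print(cost_per_registration)
```

If the two datasets lived in separate systems, the same answer would need an export and a manual merge.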
Choosing An Analytics DB
[Diagram: the typical data pipeline, repeated]
What database should we pick?
Transactional DBs vs. Analytics DBs
Transactional DBs
Data:
● Many single-row writes
● Current, single data
Queries:
● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries
Analytics DBs
Data:
● Few large batch imports
● Years of data, many sources
Queries:
● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long, complex queries
Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)
Complex Query...
Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)
Requirements
1. Simple to Get Started
2. Rich Features for Analytics
– Data Pipeline (ETL)
– Data Analysis (SQL et al)
3. Easy to Scale Up
Data Growth: (1) Start → (2) Grow → (3) Scale
Personal Recommendation:
1. Simple to Get Started
● Data requests grow gradually as your company grows
● Business users care about results (not backend)
→ Need something quick to start, easy to fine-tune along the way
Postgres:
● Free (open-source)
● Easy to set up
1. Simple start 2. Rich features 3. Scale up
[Diagram: the typical data pipeline, divided into Data Pipeline (ETL) and Data Analysis / Reporting]
2a. Data Pipeline (ETL) & Performance
● Managing Table Data: table partitioning
● Managing Disk Space: tablespace
● Improve Write Performance: unlogged table
● Others: foreign data wrapper, point-in-time recovery
Managing Data Tables
Analytics tables hold lots of data
pageviews (+ 100k records a day)
date_d | country | user_id | browser | page_name | views
⇒ Table grows big quickly, difficult to manage!
Solution: Split (partition) into multiple tables
pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09
Problem: Difficult to query data across multiple months
Managing Data Tables: parent table
pageviews_parent (parent table)
pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09
…
ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;
ALTER TABLE pageviews_2015_09 ADD CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01');
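Creating a child table every month is easy to script. The helper below is a sketch that only generates PostgreSQL statements as strings and does not run them. It follows the inheritance approach on the slide; the function names and the `LIKE ... INCLUDING ALL` step are assumptions of this example, and PostgreSQL 10 and later also offer declarative partitioning as an alternative.

```python
from datetime import date

def month_bounds(year, month):
    """Return (first day of the month, first day of the next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

def partition_ddl(year, month, parent="pageviews_parent", prefix="pageviews"):
    """Build the statements that create one monthly child table and attach it to the parent."""
    child = f"{prefix}_{year:04d}_{month:02d}"
    start, end = month_bounds(year, month)
    return [
        # Copy the column definitions from the parent so that INHERIT is allowed
        f"CREATE TABLE {child} (LIKE {parent} INCLUDING ALL);",
        # The CHECK constraint tells the planner which dates this child can contain
        f"ALTER TABLE {child} ADD CHECK (date_d >= '{start}' AND date_d < '{end}');",
        f"ALTER TABLE {child} INHERIT {parent};",
    ]

for statement in partition_ddl(2015, 9):
    print(statement)
```

Queries then go against `pageviews_parent`, and with constraint exclusion enabled the planner can skip child tables whose CHECK range does not match the filter.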
Managing Disk Space
Analytics DB holds lots of data; disk space is limited
Data have different access frequency:
● Hot Data
● Warm Data
● Cold Data
Managing Disk Space: tablespace
Tablespace: Define where your tables are stored on disk
CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';
-- beginning of the month
-- (columns copied from the parent table; the slide omits the column list)
CREATE TABLE pageviews_2016_08 (LIKE pageviews_parent) TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;
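The start-of-month rollover could be scripted along these lines. This sketch only builds the statements as strings. The function names are hypothetical, the column list is assumed to come from `pageviews_parent`, and moving an existing table uses PostgreSQL's `ALTER TABLE ... SET TABLESPACE` form.

```python
from datetime import date

def previous_month(year, month):
    """Return (year, month) of the month before the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)

def rollover_statements(today, prefix="pageviews", parent="pageviews_parent",
                        hot="hot_data", warm="warm_data"):
    """Statements to run at the beginning of a month: new table on hot storage, last month to warm."""
    new_table = f"{prefix}_{today.year:04d}_{today.month:02d}"
    prev_year, prev_month = previous_month(today.year, today.month)
    old_table = f"{prefix}_{prev_year:04d}_{prev_month:02d}"
    return [
        # This month's table is created directly on the fast disk
        f"CREATE TABLE {new_table} (LIKE {parent}) TABLESPACE {hot};",
        # Last month's table is moved to the slower, cheaper disk
        f"ALTER TABLE {old_table} SET TABLESPACE {warm};",
    ]

for statement in rollover_statements(date(2016, 8, 1)):
    print(statement)
```

Moving a table between tablespaces rewrites it and locks it while the move runs, so a job like this is usually scheduled for a quiet period.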
Combining TABLESPACE and PARENT TABLE
pageviews_parent (parent table)
pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09
…
2b. Data Analysis (writing SQLs)
● Extract / transform
● Aggregate / summarize
● Statistical analysis
[Diagram: Analytics Database → Reporting / BI, Data Analysis, Data Science / ML]
2b. Data Analysis
● SQL features
○ WITH clause
○ Window functions
○ Aggregation functions
○ Statistical functions
● Data structures
○ JSON / JSONB
○ Arrays
○ PostGIS (geo data)
○ Geometry (point, line, etc)
○ HyperLogLog (extension)
● PL/pgSQL
● Full-text search (n-gram)
● Performance:
○ Parallel queries (pg9.6)
○ Materialized views
○ BRIN index
● Others:
○ DISTINCT ON
○ VALUES
○ generate_series()
○ Support FULL OUTER JOIN
○ Better EXPLAIN
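To give a flavour of the first two SQL features, the query below uses a WITH clause to aggregate per day and a window function to add a running total. The SQL itself is standard and should run on Postgres unchanged. It is executed here through Python's built-in SQLite module (window functions need SQLite 3.25 or newer), and the sample rows are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pageviews (date_d TEXT, country TEXT, views INTEGER)")
db.executemany("INSERT INTO pageviews VALUES (?, ?, ?)", [
    ("2015-09-01", "SG", 10),
    ("2015-09-01", "VN", 4),
    ("2015-09-02", "SG", 7),
    ("2015-09-02", "VN", 6),
    ("2015-09-03", "SG", 5),
])

rows = db.execute("""
    WITH daily AS (                      -- WITH clause: a named intermediate result
        SELECT date_d, SUM(views) AS views
        FROM pageviews
        GROUP BY date_d
    )
    SELECT date_d,
           views,
           SUM(views) OVER (ORDER BY date_d) AS running_total  -- window function
    FROM daily
    ORDER BY date_d
""").fetchall()

for row in rows:
    print(row)
```

Without window functions, the running total would need a self-join or application code.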
3. Scaling Up
● Transactional DB’s downsides:
○ Optimized for transactional applications
○ Single-core execution; row-based storage
● CitusDB Extension (Postgres)
○ Automated data sharding and parallelization
○ Columnar Storage Format (better storage and performance)
● Amazon Redshift
○ Based on ParAccel DB, itself derived from PostgreSQL 8.0.2
○ Columnar Storage & Parallel Executions
○ Pay per hour per instance type
● Google BigQuery
○ Spun out of Google’s Dremel
○ Pay per query, per data access
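Why columnar storage helps analytics can be shown with a toy model. This is a deliberately simplified sketch in plain Python, not how any of the databases above are implemented: it only counts how many stored values have to be read to sum one column under each layout. The sample rows are made up.

```python
COLUMN_NAMES = ("date_d", "country", "user_id", "browser", "page_name", "views")

# Row-based layout: all fields of a record are stored together
rows = [
    ("2015-09-01", "SG", 1, "chrome", "home", 3),
    ("2015-09-01", "VN", 2, "firefox", "video", 5),
    ("2015-09-02", "SG", 1, "chrome", "video", 2),
]

# Columnar layout: all values of a column are stored together
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}

def total_views_row_store(rows):
    """Sum `views`, counting every field of every record as read."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole record is fetched to reach one field
        total += row[COLUMN_NAMES.index("views")]
    return total, values_read

def total_views_column_store(columns):
    """Sum `views`, reading only that column."""
    views = columns["views"]
    return sum(views), len(views)

print(total_views_row_store(rows))
print(total_views_column_store(columns))
```

Both layouts give the same total, but the columnar one touches one value per record instead of six. Analytics queries typically aggregate a few columns over many rows, which is the case this layout favours.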
Other DW Databases (Relational)
● Greenplum
● Teradata
● Infobright
● Google BigQuery
● Aster Data
● Paraccel (Postgres fork)
● Vertica (from Postgres author)
● CitusDB (Postgres extension)
● Amazon Redshift (from Paraccel)
Related to Postgres
Compare: Popular SQL Databases
                   | PostgreSQL         | MySQL              | Oracle    | SQL Server
License / Cost     | Free / Open-source | Free / Open-source | Expensive | Expensive
Analytics Features | Strong             | Weak               | Strong    | Strong
Hadoop vs. MPP Databases
Hadoop is a data storage and processing framework
– HDFS: data-storage layer
– YARN: resource management
– MapReduce/Pig/Hive/Spark: processing layer
MPP database (massively parallel processing)
– Columnar-storage database, meant for analytics purposes
– OLAP: Online Analytical Processing
– Examples: Vertica, Amazon Redshift, ParAccel
Choosing A BI Tool
[Diagram: the typical data pipeline, repeated]
Which BI software?
Choosing A BI Tool: Criteria
Use cases: Pretty Visualizations vs. Detailed Data Access, Embedded, Email Schedules, etc
Report Creation: Technical vs. Non-technical
Data Ownership: Your own database vs. BI Software’s storage
Other Criteria
● Process: Direct Access vs Data Models
● ETL: How is the ETL process managed?
● License Fee: Upfront Investment vs. Value-Based Pricing
● Implementation Fee: Self Service vs. 3rd Party
● Training Fee: Training Costs vs Specialized Skill Sets
BI Tools
BI features:
● Heavy Visualization: Tableau, Qlik
● Medium + Other Features: Holistics, Chartio, Periscope
Report Creation:
● SQL: Holistics, Periscope
● Drag-and-Drop/Excel: Tableau, Qlik, Sisense, PowerBI
Data Ownership:
● Store your data: Periscope, Tableau, Sisense
● Doesn’t store your data: Holistics, Chartio, Looker
Comparing BI tools
Tool | Visualization | Report Creation | Data Storage | Pricing | Notes
Tableau | Strong | Drag & Drop | Store | Per desktop + server license | Leader in Visualization
Qlik | Strong | Drag & Drop | Store | Desktop + Server License | ETL + Visualization
Holistics | Standard | SQL | Doesn’t store | Users + Usage | Strong permission management, detailed data extractions
Periscope Data | Standard | SQL | Store | No. of records | Auto-cache your data using Amazon Redshift
Chartio | Standard | Drag & Drop | Doesn’t store | User + Feature | Transform your data on Chartio server
Looker | Standard | Data Model (LookML) based on SQL | Doesn’t store | User Packs | Accessed through the Looker Data Model
Summary
● The Data Problem
● Typical Pipeline (Startup)
● Choosing An Analytics DB
● Choosing A BI Tool
Not Covered
● Setting Up & Performance Optimizations
● Other data types: time-series data, geo data, search data
● Big Data frameworks: Hadoop, Spark, HDFS, etc
● Real-time Data Processing, Stream Processing (Storm, Kafka, Kinesis)