Big Data Infrastructure - GitHub Pages · ~60 Hadoop nodes ~6 people use analytics stack daily...

Big Data Infrastructure Week 6: Analyzing Relational Data (1/3) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details CS 489/698 Big Data Infrastructure (Winter 2017) Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo February 7, 2017 These slides are available at http://lintool.github.io/bigdata-2017w/


Page 1

Big Data Infrastructure

Week 6: Analyzing Relational Data (1/3)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States

See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 489/698 Big Data Infrastructure (Winter 2017)

Jimmy Lin

David R. Cheriton School of Computer Science

University of Waterloo

February 7, 2017

These slides are available at http://lintool.github.io/bigdata-2017w/

Page 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text

Analyzing Graphs

Analyzing Relational Data

Data Mining

Page 3

Evolution of Enterprise Architectures

Next two sessions: techniques, algorithms, and optimizations for relational processing

Page 4

Monolithic Application

users

Page 5

Frontend

Backend

users

Page 6

Source: Wikipedia

Page 7

Frontend

Backend

users

database

Page 8

An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc.

Business Intelligence

Page 9

Frontend

Backend

users

database

BI tools

analysts

Page 10

Frontend

Backend

users

database

BI tools

analysts

Why is my application so slow?

Why does my analysis take so long?

Page 11

Database Workloads

OLTP (online transaction processing)

Typical applications: e-commerce, banking, airline reservations

User facing: real-time, low latency, highly concurrent

Tasks: relatively small set of “standard” transactional queries

Data access pattern: random reads, updates, writes (small amounts of data)

OLAP (online analytical processing)

Typical applications: business intelligence, data mining

Back-end processing: batch workloads, less concurrency

Tasks: complex analytical queries, often ad hoc

Data access pattern: table scans, large amounts of data per query
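
The two access patterns can be contrasted in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module; the `orders` table, its columns, and the sample rows are hypothetical and exist only for illustration. The OLTP-style function touches a single row located by primary key, while the OLAP-style function scans the whole table and aggregates.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 15.0), (4, "carol", 50.0)],
)


def oltp_update_amount(conn, order_id, new_amount):
    """OLTP-style: a small transaction that updates one row found by key."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "UPDATE orders SET amount = ? WHERE order_id = ?",
            (new_amount, order_id),
        )


def olap_revenue_by_customer(conn):
    """OLAP-style: scan every row and aggregate per customer."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)


oltp_update_amount(conn, 2, 40.0)
revenue = olap_revenue_by_customer(conn)
```

On a real system the first kind of query is served by an index lookup, while the second reads most of the table, which is why the two workloads compete when they share one database.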

Page 12

OLTP and OLAP Together?

Downsides of co-existing OLTP and OLAP workloads

Poor memory management

Conflicting data access patterns

Variable latency

Solution?

users and analysts

Page 13

Source: Wikipedia (Warehouse)

Build a data warehouse!

Page 14

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

OLTP database for user-facing transactions

OLAP database for data warehousing

What’s special about OLTP vs. OLAP?

Page 15

Customer Billing

Order Inventory

OrderLine

A Simple OLTP Schema

Page 16

Dim_Customer

Dim_Date

Dim_Product

Fact_Sales

Dim_Store

A Simple OLAP Schema
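
A schema of this shape (one fact table surrounded by dimension tables) might be queried as in the sketch below. The table names Dim_Store, Dim_Product, and Fact_Sales come from the slide; every column name and data value is an assumption made for illustration, and only two of the four dimensions are included to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names and rows are hypothetical; only the table names are from the slide.
conn.executescript("""
CREATE TABLE Dim_Store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE Dim_Product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE Fact_Sales (
    store_id   INTEGER REFERENCES Dim_Store(store_id),
    product_id INTEGER REFERENCES Dim_Product(product_id),
    units      INTEGER
);
INSERT INTO Dim_Store VALUES (1, 'Waterloo'), (2, 'Toronto');
INSERT INTO Dim_Product VALUES (10, 'books'), (11, 'games');
INSERT INTO Fact_Sales VALUES (1, 10, 3), (1, 11, 1), (2, 10, 5), (2, 10, 2);
""")


def units_by_city_and_category(conn):
    """Join the fact table to its dimensions, then group and aggregate."""
    query = """
        SELECT s.city, p.category, SUM(f.units)
        FROM Fact_Sales AS f
        JOIN Dim_Store AS s ON f.store_id = s.store_id
        JOIN Dim_Product AS p ON f.product_id = p.product_id
        GROUP BY s.city, p.category
        ORDER BY s.city, p.category
    """
    return conn.execute(query).fetchall()


result = units_by_city_and_category(conn)
```

The query has the typical analytical form: joins from the fact table out to the dimensions, followed by a group-by and an aggregation.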

Page 17

ETL

Transform

Data cleaning and integrity checking

Schema conversion

Field transformations

When does ETL happen?

Extract

Load
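
The three stages can be sketched as plain functions. This is an illustrative toy under stated assumptions, not a description of any particular ETL tool: the row layout, the field names (`customer`, `amount`, `ts`, `date_key`, `amount_cents`), and the cleaning rules are all hypothetical.

```python
from datetime import datetime


def extract(source_rows):
    """Extract: read raw rows from the (hypothetical) operational source."""
    return list(source_rows)


def transform(rows):
    """Transform: clean, check integrity, and convert to the warehouse schema."""
    out = []
    for row in rows:
        # Data cleaning and integrity checking: drop missing or negative amounts.
        if row.get("amount") is None or row["amount"] < 0:
            continue
        # Field transformation: turn a timestamp string into a date key.
        ts = datetime.strptime(row["ts"], "%Y-%m-%d %H:%M:%S")
        # Schema conversion: emit the (assumed) warehouse row layout.
        out.append({
            "date_key": ts.strftime("%Y%m%d"),
            "customer": row["customer"].strip().lower(),
            "amount_cents": round(row["amount"] * 100),
        })
    return out


def load(warehouse, rows):
    """Load: append the transformed rows to the warehouse table."""
    warehouse.extend(rows)
    return len(rows)


source = [
    {"customer": " Alice ", "amount": 12.5, "ts": "2017-02-07 10:00:00"},
    {"customer": "Bob", "amount": None, "ts": "2017-02-07 11:00:00"},
    {"customer": "Carol", "amount": -1.0, "ts": "2017-02-07 12:00:00"},
]
warehouse = []
loaded = load(warehouse, transform(extract(source)))
```

In this sketch the whole pipeline runs as one batch; how often such a batch runs is what determines how stale the warehouse data is.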

Page 18

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

My data is a day old… Meh.

Page 19

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

Frontend

Backend

users

Frontend

Backend

external APIs

OLTP database

OLTP database

Page 20

What do you actually do?

Dashboards

Report generation

Ad hoc analyses

Page 21

store

product

slice and dice

Common operations

roll up/drill down

pivot

OLAP Cubes
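
The common cube operations can be illustrated on a tiny in-memory cube. This is a sketch only: the third dimension (`month`), the cell values, and the function names are assumptions, and real OLAP engines implement these operations very differently. Roll-up aggregates away dimensions, slice fixes one dimension to a single value, and dice restricts several dimensions to sets of values.

```python
from collections import defaultdict

# Assumed dimension order for every cell key; "month" is a hypothetical third axis.
DIMENSIONS = ("store", "product", "month")

# Hypothetical cells: (store, product, month) -> units sold.
cube = {
    ("S1", "P1", "Jan"): 4,
    ("S1", "P2", "Jan"): 1,
    ("S2", "P1", "Jan"): 6,
    ("S1", "P1", "Feb"): 2,
}


def roll_up(cube, keep):
    """Aggregate away every dimension not listed in keep."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value."""
    i = DIMENSIONS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, allowed):
    """Restrict several dimensions to sets of allowed values."""
    checks = [(DIMENSIONS.index(d), vals) for d, vals in allowed.items()]
    return {k: v for k, v in cube.items() if all(k[i] in vals for i, vals in checks)}


by_store = roll_up(cube, ("store",))
february = slice_cube(cube, "month", "Feb")
s1_january = dice(cube, {"store": {"S1"}, "month": {"Jan"}})
```

Drill-down is the inverse of roll-up: going back from `by_store` to a finer grouping such as `roll_up(cube, ("store", "product"))`.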

Page 22

OLAP Cubes: Challenges

Fundamentally, lots of joins, group-bys and aggregations

How to take advantage of schema structure to avoid repeated work?

Cube materialization

Realistic to materialize the entire cube?

If not, how/when/what to materialize?
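
To see why full materialization is questionable, note that a cube over n dimensions has 2^n possible group-bys (one per subset of dimensions). The sketch below materializes all of them naively for a toy fact table; the rows and field names are hypothetical, and the approach is deliberately the simplest possible one, not an optimized cubing algorithm.

```python
from collections import defaultdict
from itertools import combinations


def materialize_all_cuboids(facts, dimensions):
    """Compute one aggregate table per subset of dimensions (2**n in total)."""
    cuboids = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            agg = defaultdict(int)
            # Naive: rescan every fact row for every cuboid.
            for row in facts:
                agg[tuple(row[d] for d in dims)] += row["units"]
            cuboids[dims] = dict(agg)
    return cuboids


# Hypothetical fact rows.
facts = [
    {"store": "S1", "product": "P1", "month": "Jan", "units": 4},
    {"store": "S2", "product": "P1", "month": "Jan", "units": 6},
    {"store": "S1", "product": "P2", "month": "Feb", "units": 2},
]
cuboids = materialize_all_cuboids(facts, ("store", "product", "month"))
```

This version rescans the facts once per cuboid, which is exactly the repeated work the slide asks about; a coarser cuboid such as `("store",)` could instead be derived from an already computed finer one such as `("store", "product")`.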

Page 23

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

Frontend

Backend

users

Frontend

Backend

external APIs

OLTP database

OLTP database

Page 24

Fast forward…

Page 25

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.”

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In Beautiful Data, O’Reilly, 2009.

Page 26

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

Facebook context?

Page 27

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

“OLTP”

Adding friends

Updating profiles

Likes, comments…

Feed ranking

Friend recommendation

Demographic analysis…

Page 28

Frontend

Backend

users

analysts

ETL (Extract, Transform, and Load)

“OLTP” PHP/MySQL

data scientists✗

Hadoop

or ELT?

Page 29

What’s changed?

Dropping cost of disks

Cheaper to store everything than to figure out what to throw away

Page 30

What’s changed?

Dropping cost of disks

Cheaper to store everything than to figure out what to throw away

Rise of social media and user-generated content

Large increase in data volume

Growing maturity of data mining techniques

Demonstrates value of data analytics

Types of data collected

From data that’s obviously valuable to data whose value is less apparent

Page 31

a useful service

analyze user behavior to extract insights

transform insights into action

$ (hopefully)

Google. Facebook. Twitter. Amazon. Uber.

Virtuous Product Cycle

Page 32

What do you actually do?

Dashboards

Report generation

Ad hoc analyses

“Descriptive”

“Predictive”

Data products

Page 33

a useful service

analyze user behavior to extract insights

transform insights into action

$ (hopefully)

Google. Facebook. Twitter. Amazon. Uber.

data science

data products

Virtuous Product Cycle

Page 34

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.”

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In Beautiful Data, O’Reilly, 2009.

Page 35

Frontend

Backend

users

data scientists

ETL (Extract, Transform, and Load)

“OLTP”

Hadoop

Page 36

Frontend

Backend

users

ETL (Extract, Transform, and Load)

Hadoop

Wait, so why not use a database to begin with?

The Irony…

“OLTP”

data scientists

Page 37

Why not just use a database?

Scalability. Cost.

SQL is awesome

Page 38

Databases are great…

If your data has structure (and you know what the structure is)

If you know what queries you’re going to run ahead of time

If your data is reasonably clean

Databases are not so great…

If your data has little structure (or you don’t know the structure)

If you don’t know what you’re looking for

If your data is messy and noisy

Page 39

“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are unknown unknowns – the ones we don't know we don't know…” – Donald Rumsfeld

Source: Wikipedia

Page 40

Databases are great…

If your data has structure (and you know what the structure is)

If you know what queries you’re going to run ahead of time

If your data is reasonably clean

Databases are not so great…

If your data has little structure (or you don’t know the structure)

If you don’t know what you’re looking for

If your data is messy and noisy

Page 41

Don’t need to know the schema ahead of time

Many analyses are better formulated imperatively

Raw scans are the most common operations

Much faster data ingest rate

Advantages of Hadoop dataflow languages
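
The first three points can be illustrated with a small schema-on-read scan, written imperatively. This is a plain-Python stand-in, not Pig or any other Hadoop dataflow language: the log format (one JSON object per line) and the `page` field are assumptions made for the example. Raw lines are ingested as-is, and structure is imposed only when the scan runs.

```python
import json


def scan_clicks(raw_lines):
    """Raw scan with the schema applied at read time; malformed lines are skipped."""
    counts = {}
    skipped = 0
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # messy input: count it and move on
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        # A field missing from this record falls back to a default value.
        page = record.get("page", "unknown")
        counts[page] = counts.get(page, 0) + 1
    return counts, skipped


# Hypothetical raw log lines, stored without any upfront schema.
raw = ['{"page": "/home", "user": 1}', '{"page": "/home"}', "not json", '{"user": 2}']
counts, skipped = scan_clicks(raw)
```

Because nothing is validated or indexed at write time, ingest is just an append, which is one reason the ingest rate can be much higher than loading into a database.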

Page 42

What do you actually do?

Dashboards

Report generation

Ad hoc analyses

“Descriptive”

“Predictive”

Data products

Page 43

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

Frontend

Backend

users

Frontend

Backend

external APIs

OLTP database

OLTP database

Page 44

Frontend

Backend

users

Frontend

Backend

users

Frontend

Backend

external APIs

“Traditional” BI tools

SQL on Hadoop

Other tools

Data Warehouse

“Data Lake”

data scientists

OLTP database

ETL (Extract, Transform, and Load)

OLTP database

OLTP database

Page 45

Twitter’s data warehousing architecture (circa 2012)

Page 46

circa ~2010

~150 people total

~60 Hadoop nodes

~6 people use analytics stack daily

circa ~2012

~1400 people total

10s of Ks of Hadoop nodes, multiple DCs

10s of PBs total Hadoop DW capacity

~100 TB ingest daily

dozens of teams use Hadoop daily

10s of Ks of Hadoop jobs daily

Page 47

How does ETL actually happen?

Twitter’s data warehousing architecture (circa 2012)

Page 48

Scribe Daemons (Production Hosts)

Main Hadoop DW

Main Datacenter

Staging Hadoop Cluster

HDFS

Scribe Aggregators

Scribe Daemons (Production Hosts)

Datacenter

Staging Hadoop Cluster

HDFS

Scribe Aggregators

Scribe Daemons (Production Hosts)

Datacenter

Staging Hadoop Cluster

HDFS

Scribe Aggregators

Importing Log Data

Page 49

DB partitions

HDFS

Tweets, graph, user profiles

LZO-compressed protobufs

select * from …

mappers

Important: Must carefully throttle resource usage…

Different periodicity (e.g., hourly, daily snapshots, etc.)

* Out of date – for illustration only

Importing Log Data*

Page 50

“Basically, we use Vertica as a cache for HDFS data.” @squarecog

HDFS Vertica MySQL

import

Birdbrain

aggregation

Why?

Vertica provides orders of magnitude faster aggregations!

Interactive browsing tools

Vertica Pipeline*

* Out of date – for illustration only

Page 51

HDFS Vertica MySQL Birdbrain

The catch…

Performance must be balanced against integration costs

Vertica integration is non-trivial

Interactive browsing tools

import aggregation

Vertica Pipeline*

* Out of date – for illustration only

Page 52

HDFS Vertica MySQL Birdbrain

Interactive browsing tools

import aggregation

Vertica Data Ingestion

DB partitions

HDFS

LZO-compressed protobufs

select * from …

mappers

Let’s just run this in reverse!

* Out of date – for illustration only

Page 53

Vertica partitions

HDFS

reducers

Vertica guarantees that each of these batch inserts is atomic

So what’s the challenge?

Did you remember to turn off speculative execution?

What happens when a task dies?

Vertica Pig Storage*

* Out of date – for illustration only
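
The challenge can be illustrated with a toy model. Each batch insert is atomic, but if the same task runs twice (a speculative duplicate, or a retry after a task dies following its insert), the same batch is committed twice. The sketch below is not a Vertica or Hadoop client: `ToyStore`, its methods, and the task-ID guard are hypothetical constructs for illustration, and the slides do not say how the problem was actually solved.

```python
class ToyStore:
    """Stand-in for a database with atomic batch inserts (not a real client)."""

    def __init__(self):
        self.rows = []
        self.committed_tasks = set()

    def batch_insert(self, rows):
        """Atomic in this toy model, but blind to who is inserting."""
        self.rows.extend(rows)

    def batch_insert_once(self, task_id, rows):
        """Hypothetical guard: skip the batch if this task already committed."""
        if task_id in self.committed_tasks:
            return False
        self.rows.extend(rows)
        self.committed_tasks.add(task_id)
        return True


def run_attempts(store, attempts, guarded):
    """Apply each (task_id, rows) attempt in order and return the row count."""
    for task_id, rows in attempts:
        if guarded:
            store.batch_insert_once(task_id, rows)
        else:
            store.batch_insert(rows)
    return len(store.rows)


# Task 0 appears twice, as it would with a speculative duplicate or a retry.
attempts = [(0, ["a", "b"]), (1, ["c"]), (0, ["a", "b"])]
naive_count = run_attempts(ToyStore(), attempts, guarded=False)
guarded_count = run_attempts(ToyStore(), attempts, guarded=True)
```

Atomicity of each insert does not by itself give exactly-once loading: the unguarded run ends with duplicated rows, while the guarded run loads each task's batch once.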

Page 54

What’s Next?

Two developing trends…

Page 55

Frontend

Backend

users

database

BI tools

analysts

Page 56

Frontend

Backend

users

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

Frontend

Backend

users

Frontend

Backend

external APIs

OLTP database

OLTP database

Page 57

Frontend

Backend

users

Frontend

Backend

users

Frontend

Backend

external APIs

“Traditional” BI tools

SQL on Hadoop

Other tools

Data Warehouse

“Data Lake”

data scientists

OLTP database

ETL (Extract, Transform, and Load)

OLTP database

OLTP database

My data is a day old… I refuse to accept that!

Page 58

ETL

OLAP

OLTP

What if you didn’t have to do this?

Page 59

HTAP

Hybrid Transactional/Analytical Processing (HTAP)

Page 60

Frontend

Backend

users

Frontend

Backend

users

Frontend

Backend

external APIs

“Traditional” BI tools

SQL on Hadoop

Other tools

Data Warehouse

“Data Lake”

data scientists

OLTP database

ETL (Extract, Transform, and Load)

OLTP database

OLTP database

Page 61

Frontend

Backend

users

Frontend

Backend

users

Frontend

Backend

external APIs

“Traditional” BI tools

SQL on Hadoop

Other tools

Data Warehouse

“Data Lake”

data scientists

HTAP database

ETL (Extract, Transform, and Load)

HTAP database

HTAP database

Analytics tools

data scientists

Analytics tools

data scientists

Page 62

Frontend

Backend

users

Frontend

Backend

users

Frontend

Backend

external APIs

“Traditional” BI tools

SQL on Hadoop

Other tools

Data Warehouse

“Data Lake”

data scientists

ETL (Extract, Transform, and Load)

Everything in the cloud!

IaaS / Load balance aaS

OLTP database

OLTP database

OLTP database

DBaaS (e.g., RDS)

DBaaS (e.g., Redshift)

S3

“Cloudified” tools

ELT aaS

Page 63

Source: Wikipedia (Japanese rock garden)

Questions?