Hadoop and IDW - When_to_use_which

Transcript of Hadoop and IDW - When_to_use_which

Page 1: Hadoop and IDW - When_to_use_which

HADOOP AND THE DATA WAREHOUSE: WHEN TO USE WHICH

Page 2

Copyright Teradata

Agenda

• Data warehouse strengths

> What is a Data Warehouse?

• Hadoop strengths

• When to use which

> Hadoop

> Data warehouse

Page 3

Three Primary Workloads

Data Hub/Lake
• Fast raw data ingest • Archival • ETL refinery • Search • Relaxed SLAs • Millions of files

Data Warehouse
• Data models • Data integration • Trusted data • Concurrent users • Workload mgmt • Response time

Discovery
• Easy to use • Many tools • Algorithm collections • Data wrangling • Business user access • Semi-production

Page 4

Best Fit: Primary Strengths and Overlaps

(Diagram: the Data Warehouse, Discovery, and Data Lake workloads shown as overlapping regions.)

Page 5

WHY HADOOP IS NOT A DATA WAREHOUSE

Page 6

What is a Data Warehouse?

• A data design pattern, an architecture

> Not necessarily a database

• Definition: Gartner (2005) / Inmon (1992)

> Subject oriented

– Detailed data + modeling of sales, inventory, finance, etc.

> Integrated logical model

– Merged data

– Consistent, standardized data formats and values

> Nonvolatile

– Data stored unmodified for long periods of time

> Time variant

– Record versioning or temporal services

> Persistent storage, not virtual, not federated

Source: Gartner, "Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'", Dec 2005; Inmon, Building the Data Warehouse, 1992, Wiley and Sons

Page 7

Data Warehouse Design Pattern

By Definition                             Data Warehouse   Hadoop
Subject oriented                                5            0
Detailed data                                   5            5
Modeled by business subject                     5            0
Integrated                                      5            0
Merged, deduplicated data                       5            0
Standardized data formats and values            5            0
Nonvolatile storage                             5            5
Time variant: record versions, temporal         5            0
Persistent storage                              5            5

0 = none, 1 = poor, 2 = limited, 3 = average, 4 = robust, 5 = outstanding

Page 8

NoSchema, Schema-on-Read, Complex Schemas

Single file (schema-on-read): no schema, no joins; one source; raw data; 3-5 uses

Data Marts (schema-on-read): star and snowflake schemas; 2-4 fact table joins; multiple sources; raw data, unknown data; key-value stores

Data Warehouse (schema-on-write): 5K-10K tables; 20-50 way joins; cross-organization; pre-integrated, cleansed; referential integrity; many applications

(Diagram: example warehouse subject areas: Events, Locations, Finance, Transaction, Session, Orders, Inventory, Call Center, POS.)
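The schema-on-write vs. schema-on-read contrast above can be shown in a minimal Python sketch. The sample data and field names are illustrative, not from the slides: schema-on-write validates and types every field at load time, while schema-on-read stores the raw lines untouched and applies a schema only when a query needs one.

```python
import csv
import io

RAW = "2013-01-01,store42,19.99\n2013-01-02,store7,oops\n"

# Schema-on-write: validate and type every field at load time;
# bad records are rejected before they ever reach storage.
def load_schema_on_write(raw):
    table, rejects = [], []
    for row in csv.reader(io.StringIO(raw)):
        try:
            table.append({"date": row[0], "store": row[1], "amount": float(row[2])})
        except ValueError:
            rejects.append(row)
    return table, rejects

# Schema-on-read: keep the raw lines as-is; each reader applies whatever
# schema its query needs and skips rows it cannot parse.
def query_schema_on_read(raw):
    total = 0.0
    for row in csv.reader(io.StringIO(raw)):
        try:
            total += float(row[2])   # late binding: typed only now
        except ValueError:
            continue                 # unparseable rows ignored at read time
    return total

table, rejects = load_schema_on_write(RAW)
print(len(table), len(rejects))   # 1 good row, 1 reject
print(query_schema_on_read(RAW))  # 19.99
```

The trade-off in miniature: schema-on-write pays the cleansing cost once, up front, for every consumer; schema-on-read defers it to each query, which is cheap for one-off exploration but expensive when thousands of tables serve many applications.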

Page 9

What Hadoop is Not

• Not a database

> No schema, indexes, or optimizer

> No separation of code and data structure

> Hadoop uses objects and files

– Not rows and columns

• Hive helps a little

> Limited SQL

> Limited metadata

• Not high performance

• Not fully interactive queries

See also:
http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
http://blogs.gartner.com/donald-feinberg/2014/12/22/a-database-by-any-other-name/

Page 10

ACID Advantages of an RDBMS

Atomicity: apply all changes or none
Consistency: roll back on errors
Isolation: one update at a time
Durability: transactions survive crashes

• Guarantees database actions are processed reliably

• Ensures query result accuracy

• Supports updates and deletes

• Needed for applications that require 100% consistency

> Banks, finance, inventory, etc.

> Maybe not for Facebook, Twitter, etc.

• Data you can trust
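The ACID properties above can be seen in miniature with SQLite, itself an ACID-compliant engine. The account table and CHECK constraint are illustrative, not from the slides: a transfer that would overdraw an account fails mid-transaction, and both legs roll back together (atomicity and consistency), leaving no partial update.

```python
import sqlite3

# An in-memory ACID database with a consistency rule: no negative balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        pass  # CHECK constraint fired: the whole transfer was rolled back

transfer(conn, "alice", "bob", 60)    # succeeds: both updates commit together
transfer(conn, "alice", "bob", 500)   # would overdraw: neither update survives
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 40, 'bob': 60}
```

This is exactly the guarantee a bank or inventory system needs, and the one that raw HDFS files do not provide: without transactions, a crash between the two UPDATEs would silently destroy money.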

Page 11

Hadoop's Biggest Differentiators

• Capture and ETL

• Long-term archive

• Cheap, commodity hardware

(Diagram: these strengths feed the integration and analytics done in the data warehouse.)

Page 12

Data Hub Refinery: Parallel ETL

Sources: social networks, mobile, web logs, sensors
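The refinery pattern above can be sketched with a thread pool: many raw feeds are transformed in parallel, one worker per feed, the same fan-out Hadoop performs across data nodes. The feed contents and the trivial `refine` step are hypothetical stand-ins for real ETL jobs.

```python
from concurrent.futures import ThreadPoolExecutor

# Raw feeds arriving at the data hub (hypothetical sample records).
FEEDS = {
    "weblogs": ["GET /a 200", "GET /b 404", "GET /c 200"],
    "sensors": ["t=21.5", "t=22.0"],
    "mobile":  ["tap", "swipe", "tap", "tap"],
}

def refine(item):
    # Stand-in transform: a real refinery would parse, cleanse, and
    # standardize each record; here we just count records per feed.
    name, records = item
    return name, len(records)

# Fan out: one worker per feed, results gathered as they complete.
with ThreadPoolExecutor(max_workers=3) as pool:
    refined = dict(pool.map(refine, FEEDS.items()))
print(refined)  # {'weblogs': 3, 'sensors': 2, 'mobile': 4}
```

The point of the pattern is that each feed is independent, so throughput scales with workers; Hadoop applies the same idea with partitioned files and hundreds of nodes instead of threads.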

Page 13

When We're Too Small for Hadoop ETL

• Avoid hand-coded transforms

• Prefer tool-based ETL

• 2 ETL servers do the job

• ETL is working well

Page 14

When We Need Massive Data Integration

• Dozens of ETL servers

• 10s-100s of TB/day

• High-velocity, real-time data

• The risk is worth the reward

Page 15

When In-database ELT Works Well

• Reference data look-ups

• Joins for derived data

• Lots of derived data

• Service-level goals to meet
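In-database ELT means the transform runs as SQL inside the database rather than on an ETL server. A minimal sketch with SQLite, using hypothetical table and column names: raw facts are loaded as-is, then a reference-data join derives the enriched table in one set-based statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Extract + Load: raw facts and reference data land unchanged.
    CREATE TABLE raw_sales (store_id INTEGER, amount REAL);
    CREATE TABLE ref_stores (store_id INTEGER, region TEXT);
    INSERT INTO raw_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO ref_stores VALUES (1, 'EMEA'), (2, 'APAC');

    -- Transform, in-database: reference look-up join plus derived totals,
    -- one SQL statement instead of row-at-a-time ETL code.
    CREATE TABLE sales_by_region AS
    SELECT r.region, SUM(s.amount) AS total
    FROM raw_sales s JOIN ref_stores r USING (store_id)
    GROUP BY r.region;
""")
print(list(conn.execute("SELECT region, total FROM sales_by_region ORDER BY region")))
# [('APAC', 7.5), ('EMEA', 15.0)]
```

This is why look-ups and derived-data joins favor ELT: the database's join engine and optimizer do the heavy lifting under the same workload management that protects service-level goals.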

Page 16

When to Use Which: It Depends

Reference data: in-database ELT (lookups, joins)

Transformations: in-database ELT (structured data, ELT modules, SQL can do it); Hadoop (unstructured data, some ETL modules, do it yourself)

Service level goals: in-database ELT (predictable, system management)

Data security: in-database ELT (robust)

Costs: Hadoop (commodity hardware)

Data quality: in-database ELT (governance, MDM); Hadoop (low quality/trust OK)

Data volume: in-database ELT (high volume); Hadoop (extreme volume)

Offload ELT: Hadoop (migration costs)

Agility: Hadoop (no governance)

Page 17

WHERE HADOOP EXCELS

Page 18

Hadoop Strengths

• Commodity, low-cost hardware

• Many programming languages

> But mostly it's Java

• Free open source

• Any data structure

• Scale-out to petabytes + parallelism

Page 19

Hadoop: the Data Hub

• ETL on steroids

• Economically "keep files forever"

> Queryable

• File-based reporting and analytics

• Backup and archival storage

> Databases, files, development

Page 20

Where MapReduce Excels

• Temporary data, data exhaust

• Data mining/exploration

> 1000s of continuous variables

> Linear algebra

> Graph mining

> Machine learning

> Random forests, decision trees

> Markov chains

• Not all data mining is MapReduce

> Many things work better in an MPP RDBMS

> In-database SAS, R, Fuzzy Logix

> It depends
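To make the MapReduce model concrete, here is the classic word count expressed as its three phases in plain Python. This is a single-process sketch of the programming model, not Hadoop itself: real Hadoop runs the same map, shuffle, and reduce phases distributed across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: each input record emits (key, value) pairs independently,
    # so this step parallelizes across any number of workers.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, so each key lands on one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map(map_phase, lines))
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

The strengths and limits on this slide both follow from the model: embarrassingly parallel per-record work (scoring, transforms, feature extraction) fits perfectly, while iterative or join-heavy algorithms pay a full shuffle per pass, which is why an MPP RDBMS often wins there.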

Page 21

Developer Advantages with Hadoop

• Easy to work on non-relational data

> Java data types

> JSON, objects

• Hadoop is written in Java

> Compatible APIs, skills, concepts, frameworks, scripts

• Huge open source factories

> Apache, GitHub, Eclipse, SourceForge, etc.

> Assorted compression algorithms

• People

> 9M-10M Java programmers

> Web tutorials with extensive "how to" topics

> University student research

Page 22

NoSchema Advantages

• Raw data format provides complete flexibility

• Non-traditional data types easily supported

> Graph, text, weblog, etc.

• No upfront ETL required

• No data loading required

• Flexible: late binding lets the data scientist choose

Example weblog record:

41521390 2013-01-01 00:25:42 2.111.94.18 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4 "http://www.cokstate.edu/welcome/" "https://www.google.com/#sclient=psyab&hl=en&source=hp&q=oklahoma+state&pbx=1&oq”

Note: there are many pitfalls; schema-on-read is not always a good solution.
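Late binding on a record like the weblog sample above looks something like this sketch: the line is stored untouched, and an analysis extracts only the fields it needs at read time. The field positions and names are assumptions inferred from the sample, which is exactly the pitfall the note warns about: every consumer must rediscover the structure, and nothing enforces it.

```python
# A truncated copy of the sample weblog record (illustrative only).
RECORD = ("41521390 2013-01-01 00:25:42 2.111.94.18 Mozilla/5.0 "
          "(Macintosh; U; Intel Mac OS X 10_6_5; en-us)")

def read_fields(record):
    # Late binding: the schema lives in this reader, not in the storage.
    parts = record.split()
    return {
        "event_id": parts[0],
        "timestamp": parts[1] + " " + parts[2],
        "client_ip": parts[3],
        # Everything else stays raw until some query needs it.
        "rest": " ".join(parts[4:]),
    }

fields = read_fields(RECORD)
print(fields["client_ip"], fields["timestamp"])
# 2.111.94.18 2013-01-01 00:25:42
```

A data scientist exploring one feed benefits from this freedom; a thousand users sharing one definition of "client_ip" are better served by parsing once into a schema.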

Page 23

Attributes Favoring Hadoop

Reason / Description

Cost: low-cost, low-value data before refinement

Multi-structured data ingest: raw weblogs, Twitter, Facebook, mobile, PST files, etc.

Data depth: high data volume, few users, high signal-to-noise ratio

Non-SQL analytics: complex processes, pipeline transforms, random forests, Markov chains, enormous arrays, etc.

Flexibility, autonomy: exploratory analysis with little governance; fast, short-term turnaround

Ugly data: videos, satellite images, format conversions (PDF to text)

Page 24

Key Considerations

MPP RDBMS                 Hadoop
Stable schema             Evolving schema
Structured data           Structure agnostic
Full ANSI SQL             Flexible programming
Iterative analysis        Batch analysis
Fine-grain security       N/A
Cleansed data             Raw data
Seeks                     Scans
Updates/deletes           Ingest
Service level agreements  Flexibility
Core data                 Source files
Complex joins             Complex processing
Efficient CPU and IO      Low-cost storage

Page 25

What I Like About Hadoop

• YARN and Tez

• Queries on flat files!

• Parallel scanning engine

• Developer community

• Complex parallel processing

• Fast ingest of raw data

• Long-term archives at full fidelity

• Good scalability

Page 26

Summary

• Start with workload requirements

> Map the tool capabilities to the requirements

• Hadoop is a data hub, a data lake

> Not a database or data warehouse

> Exploit Hadoop's strengths

• Combine the data warehouse and Hadoop

> Two tool sets solve more objectives

> Better together

Page 27

The End