Hadoop and IDW - When_to_use_which

Transcript of Hadoop and IDW - When_to_use_which

Page 1: Hadoop and IDW - When_to_use_which

HADOOP AND THE DATA WAREHOUSE: WHEN TO USE WHICH

Page 2

Copyright Teradata

Agenda

• Data warehouse strengths

> What is a Data Warehouse?

• Hadoop strengths

• When to use which

> Hadoop

> Data warehouse

Page 3

Three Primary Workloads

Data Hub/Lake
• Fast raw data ingest • Archival • ETL refinery • Search • Relaxed SLAs • Millions of files

Data Warehouse
• Data models • Data integration • Trusted data • Concurrent users • Workload mgmt • Response time

Discovery
• Easy to use • Many tools • Algorithm collections • Data wrangling • Business user access • Semi-production

Page 4

Best Fit: Primary Strengths and Overlaps

(Diagram: the Data Warehouse, Discovery, and Data Lake workloads shown as overlapping regions.)

Page 5

WHY HADOOP IS NOT A DATA WAREHOUSE

Page 6

What is a Data Warehouse?

• A data design pattern, an architecture

> Not necessarily a database

• Definition: Gartner (2005) / Inmon (1992)

> Subject oriented

– Detailed data + modeling of sales, inventory, finance, etc.

> Integrated logical model

– Merged data

– Consistent, standardized data formats and values

> Nonvolatile

– Data stored unmodified for long periods of time

> Time variant

– Record versioning or temporal services

> Persistent storage, not virtual, not federated

Source: Gartner, "Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'", Dec 2005; Inmon, Building the Data Warehouse, 1992, Wiley and Sons

Page 7

Data Warehouse Design Pattern

By Definition                             Data Warehouse   Hadoop
Subject oriented                                5            0
Detailed data                                   5            5
Modeled by business subject                     5            0
Integrated                                      5            0
Merged, deduplicated data                       5            0
Standardized data formats and values            5            0
Nonvolatile storage                             5            5
Time variant: record versions, temporal         5            0
Persistent storage                              5            5

0 = none, 1 = poor, 2 = limited, 3 = average, 4 = robust, 5 = outstanding

Page 8

NoSchema, Schema-on-Read, Complex Schemas

Single file (schema-on-read): no schema, no joins; one source; raw data; 3-5 uses

Data Marts (schema-on-read): star and snowflake schemas; 2-4 fact table joins; multiple sources; raw data, unknown data; key-value stores

Data Warehouse (schema-on-write): 5K-10K tables; 20-50 way joins; cross-organization; pre-integrated, cleansed; referential integrity; many applications

(Diagram: example warehouse subject areas: Events, Locations, Finance, Transaction, Session, Orders, Inventory, Call Center, POS.)
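The schema-on-write vs. schema-on-read contrast above can be shown in a minimal Python sketch. The sample data and field names are illustrative, not from the slides: schema-on-write validates and types every field at load time, while schema-on-read stores the raw lines untouched and applies a schema only when a query needs one.

```python
import csv
import io

RAW = "2013-01-01,store42,19.99\n2013-01-02,store7,oops\n"

# Schema-on-write: validate and type every field at load time;
# bad records are rejected before they ever reach storage.
def load_schema_on_write(raw):
    table, rejects = [], []
    for row in csv.reader(io.StringIO(raw)):
        try:
            table.append({"date": row[0], "store": row[1], "amount": float(row[2])})
        except ValueError:
            rejects.append(row)
    return table, rejects

# Schema-on-read: keep the raw lines as-is; each reader applies whatever
# schema its query needs and skips rows it cannot parse.
def query_schema_on_read(raw):
    total = 0.0
    for row in csv.reader(io.StringIO(raw)):
        try:
            total += float(row[2])   # late binding: typed only now
        except ValueError:
            continue                 # unparseable rows ignored at read time
    return total

table, rejects = load_schema_on_write(RAW)
print(len(table), len(rejects))   # 1 good row, 1 reject
print(query_schema_on_read(RAW))  # 19.99
```

The trade-off in miniature: schema-on-write pays the cleansing cost once, up front, for every consumer; schema-on-read defers it to each query, which is cheap for one-off exploration but expensive when thousands of tables serve many applications.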

Page 9

What Hadoop is Not

• Not a database

> No schema, indexes, or optimizer

> No separation of code and data structure

> Hadoop uses objects and files

– Not rows and columns

• Hive helps a little

> Limited SQL

> Limited metadata

• Not high performance

• Not fully interactive queries

See also:
http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
http://blogs.gartner.com/donald-feinberg/2014/12/22/a-database-by-any-other-name/

Page 10

ACID Advantages of an RDBMS

Atomicity: apply all changes or none
Consistency: roll back on errors
Isolation: one update at a time
Durability: transactions survive crashes

• Guarantees database actions are processed reliably

• Ensures query result accuracy

• Supports updates and deletes

• Needed for applications that require 100% consistency

> Banks, finance, inventory, etc.

> Maybe not for Facebook, Twitter, etc.

• Data you can trust
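The ACID properties above can be seen in miniature with SQLite, itself an ACID-compliant engine. The account table and CHECK constraint are illustrative, not from the slides: a transfer that would overdraw an account fails mid-transaction, and both legs roll back together (atomicity and consistency), leaving no partial update.

```python
import sqlite3

# An in-memory ACID database with a consistency rule: no negative balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        pass  # CHECK constraint fired: the whole transfer was rolled back

transfer(conn, "alice", "bob", 60)    # succeeds: both updates commit together
transfer(conn, "alice", "bob", 500)   # would overdraw: neither update survives
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 40, 'bob': 60}
```

This is exactly the guarantee a bank or inventory system needs, and the one that raw HDFS files do not provide: without transactions, a crash between the two UPDATEs would silently destroy money.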

Page 11

Hadoop's Biggest Differentiators

• Capture and ETL

• Long-term archive

• Cheap, commodity hardware

(Diagram: these strengths feed the integration and analytics done in the data warehouse.)

Page 12

Data Hub Refinery: Parallel ETL

Sources: social networks, mobile, web logs, sensors
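The refinery pattern above can be sketched with a thread pool: many raw feeds are transformed in parallel, one worker per feed, the same fan-out Hadoop performs across data nodes. The feed contents and the trivial `refine` step are hypothetical stand-ins for real ETL jobs.

```python
from concurrent.futures import ThreadPoolExecutor

# Raw feeds arriving at the data hub (hypothetical sample records).
FEEDS = {
    "weblogs": ["GET /a 200", "GET /b 404", "GET /c 200"],
    "sensors": ["t=21.5", "t=22.0"],
    "mobile":  ["tap", "swipe", "tap", "tap"],
}

def refine(item):
    # Stand-in transform: a real refinery would parse, cleanse, and
    # standardize each record; here we just count records per feed.
    name, records = item
    return name, len(records)

# Fan out: one worker per feed, results gathered as they complete.
with ThreadPoolExecutor(max_workers=3) as pool:
    refined = dict(pool.map(refine, FEEDS.items()))
print(refined)  # {'weblogs': 3, 'sensors': 2, 'mobile': 4}
```

The point of the pattern is that each feed is independent, so throughput scales with workers; Hadoop applies the same idea with partitioned files and hundreds of nodes instead of threads.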

Page 13

When We're Too Small for Hadoop ETL

• Avoid hand-coded transforms

• Prefer tool-based ETL

• 2 ETL servers do the job

• ETL is working well

Page 14

When We Need Massive Data Integration

• Dozens of ETL servers

• 10s-100s of TB/day

• High-velocity, real-time data

• The risk is worth the reward

Page 15

When In-database ELT Works Well

• Reference data look-ups

• Joins for derived data

• Lots of derived data

• Service-level goals to meet
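In-database ELT means the transform runs as SQL inside the database rather than on an ETL server. A minimal sketch with SQLite, using hypothetical table and column names: raw facts are loaded as-is, then a reference-data join derives the enriched table in one set-based statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Extract + Load: raw facts and reference data land unchanged.
    CREATE TABLE raw_sales (store_id INTEGER, amount REAL);
    CREATE TABLE ref_stores (store_id INTEGER, region TEXT);
    INSERT INTO raw_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO ref_stores VALUES (1, 'EMEA'), (2, 'APAC');

    -- Transform, in-database: reference look-up join plus derived totals,
    -- one SQL statement instead of row-at-a-time ETL code.
    CREATE TABLE sales_by_region AS
    SELECT r.region, SUM(s.amount) AS total
    FROM raw_sales s JOIN ref_stores r USING (store_id)
    GROUP BY r.region;
""")
print(list(conn.execute("SELECT region, total FROM sales_by_region ORDER BY region")))
# [('APAC', 7.5), ('EMEA', 15.0)]
```

This is why look-ups and derived-data joins favor ELT: the database's join engine and optimizer do the heavy lifting under the same workload management that protects service-level goals.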

Page 16

When to Use Which: It Depends

Reference data: in-database ELT (lookups, joins)

Transformations: in-database ELT (structured data, ELT modules, SQL can do it); Hadoop (unstructured data, some ETL modules, do it yourself)

Service level goals: in-database ELT (predictable, system management)

Data security: in-database ELT (robust)

Costs: Hadoop (commodity hardware)

Data quality: in-database ELT (governance, MDM); Hadoop (low quality/trust OK)

Data volume: in-database ELT (high volume); Hadoop (extreme volume)

Offload ELT: Hadoop (migration costs)

Agility: Hadoop (no governance)

Page 17

WHERE HADOOP EXCELS

Page 18

Hadoop Strengths

• Commodity, low-cost hardware

• Many programming languages

> But mostly it's Java

• Free open source

• Any data structure

• Scale-out to petabytes + parallelism

Page 19

Hadoop: the Data Hub

• ETL on steroids

• Economically "keep files forever"

> Queryable

• File-based reporting and analytics

• Backup and archival storage

> Databases, files, development

Page 20

Where MapReduce Excels

• Temporary data, data exhaust

• Data mining/exploration

> 1000s of continuous variables

> Linear algebra

> Graph mining

> Machine learning

> Random forests, decision trees

> Markov chains

• Not all data mining is MapReduce

> Many things work better in an MPP RDBMS

> In-database SAS, R, Fuzzy Logix

> It depends
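To make the MapReduce model concrete, here is the classic word count expressed as its three phases in plain Python. This is a single-process sketch of the programming model, not Hadoop itself: real Hadoop runs the same map, shuffle, and reduce phases distributed across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: each input record emits (key, value) pairs independently,
    # so this step parallelizes across any number of workers.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, so each key lands on one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map(map_phase, lines))
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

The strengths and limits on this slide both follow from the model: embarrassingly parallel per-record work (scoring, transforms, feature extraction) fits perfectly, while iterative or join-heavy algorithms pay a full shuffle per pass, which is why an MPP RDBMS often wins there.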

Page 21

Developer Advantages with Hadoop

• Easy to work on non-relational data

> Java data types

> JSON, objects

• Hadoop is written in Java

> Compatible APIs, skills, concepts, frameworks, scripts

• Huge open source factories

> Apache, GitHub, Eclipse, SourceForge, etc.

> Assorted compression algorithms

• People

> 9M-10M Java programmers

> Web tutorials with extensive "how to" topics

> University student research

Page 22

NoSchema Advantages

• Raw data format provides complete flexibility

• Non-traditional data types easily supported

> Graph, text, weblog, etc.

• No upfront ETL required

• No data loading required

• Flexible: late binding lets the data scientist choose

Example weblog record:

41521390 2013-01-01 00:25:42 2.111.94.18 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4 "http://www.cokstate.edu/welcome/" "https://www.google.com/#sclient=psyab&hl=en&source=hp&q=oklahoma+state&pbx=1&oq”

Note: there are many pitfalls; schema-on-read is not always a good solution.
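Late binding on a record like the weblog sample above looks something like this sketch: the line is stored untouched, and an analysis extracts only the fields it needs at read time. The field positions and names are assumptions inferred from the sample, which is exactly the pitfall the note warns about: every consumer must rediscover the structure, and nothing enforces it.

```python
# A truncated copy of the sample weblog record (illustrative only).
RECORD = ("41521390 2013-01-01 00:25:42 2.111.94.18 Mozilla/5.0 "
          "(Macintosh; U; Intel Mac OS X 10_6_5; en-us)")

def read_fields(record):
    # Late binding: the schema lives in this reader, not in the storage.
    parts = record.split()
    return {
        "event_id": parts[0],
        "timestamp": parts[1] + " " + parts[2],
        "client_ip": parts[3],
        # Everything else stays raw until some query needs it.
        "rest": " ".join(parts[4:]),
    }

fields = read_fields(RECORD)
print(fields["client_ip"], fields["timestamp"])
# 2.111.94.18 2013-01-01 00:25:42
```

A data scientist exploring one feed benefits from this freedom; a thousand users sharing one definition of "client_ip" are better served by parsing once into a schema.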

Page 23

Attributes Favoring Hadoop

Reason / Description

Cost: low-cost, low-value data before refinement

Multi-structured data ingest: raw weblogs, Twitter, Facebook, mobile, PST files, etc.

Data depth: high data volume, few users, high signal-to-noise ratio

Non-SQL analytics: complex processes, pipeline transforms, random forests, Markov chains, enormous arrays, etc.

Flexibility, autonomy: exploratory analysis with little governance; fast, short-term turnaround

Ugly data: videos, satellite images, format conversions (PDF to text)

Page 24

Key Considerations

MPP RDBMS                 Hadoop
Stable schema             Evolving schema
Structured data           Structure agnostic
Full ANSI SQL             Flexible programming
Iterative analysis        Batch analysis
Fine-grain security       N/A
Cleansed data             Raw data
Seeks                     Scans
Updates/deletes           Ingest
Service level agreements  Flexibility
Core data                 Source files
Complex joins             Complex processing
Efficient CPU and IO      Low-cost storage

Page 25

What I Like About Hadoop

• YARN and Tez

• Queries on flat files!

• Parallel scanning engine

• Developer community

• Complex parallel processing

• Fast ingest of raw data

• Long-term archives at full fidelity

• Good scalability

Page 26

Summary

• Start with workload requirements

> Map the tool capabilities to the requirements

• Hadoop is a data hub, a data lake

> Not a database or data warehouse

> Exploit Hadoop's strengths

• Combine the data warehouse and Hadoop

> Two tool sets solve more objectives

> Better together

Page 27

The End