HADOOP AND THE DATA WAREHOUSE: WHEN TO USE WHICH
Copyright Teradata
Agenda
• Data warehouse strengths
> What is a Data Warehouse?
• Hadoop strengths
• When to use which
> Hadoop
> Data warehouse
Three Primary Workloads
Data Hub/Lake: fast raw data ingest; archival; ETL refinery; search; relaxed SLAs; millions of files
Data Warehouse: data models; data integration; trusted data; concurrent users; workload mgmt; response time
Discovery: easy to use; many tools; algorithm collections; data wrangling; business user access; semi-production
Best Fit: Primary Strengths and Overlaps
[Venn diagram: Data Warehouse, Discovery, and Data Lake workloads overlap]
WHY HADOOP IS NOT A DATA WAREHOUSE
What is a Data Warehouse?
• A data design pattern, an architecture
> Not necessarily a database
• Definition: Gartner (2005) / Inmon (1992)
> Subject oriented
– Detailed data + modeling of sales, inventory, finance, etc.
> Integrated logical model
– Merged data
– Consistent, standardized data formats and values
> Nonvolatile
– Data stored unmodified for long periods of time
> Time variant
– Record versioning or temporal services
> Persistent storage, not virtual, not federated
Source: Gartner, "Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'", Dec 2005; Inmon, Building the Data Warehouse, 1992, Wiley and Sons
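The "time variant" property above can be sketched with a minimal record-versioning example. This is a hypothetical in-memory table with invented keys and values; real warehouses implement this with temporal SQL features or slowly-changing-dimension designs:

```python
from datetime import date

# Hypothetical versioned table: each change closes the current
# version (sets valid_to) and opens a new one, so history survives.
history = []

def upsert(history, key, value, as_of):
    """Close the open version of `key` (if any) and append a new one."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = as_of
    history.append({"key": key, "value": value,
                    "valid_from": as_of, "valid_to": None})

def as_of_query(history, key, when):
    """Return the version of `key` that was current on date `when`."""
    for row in history:
        if (row["key"] == key and row["valid_from"] <= when
                and (row["valid_to"] is None or when < row["valid_to"])):
            return row["value"]
    return None

upsert(history, "cust42", "Gold tier", date(2013, 1, 1))
upsert(history, "cust42", "Platinum tier", date(2014, 6, 1))
```

An as-of query against `history` for mid-2013 returns "Gold tier" even after the 2014 update, because the old version is closed rather than overwritten.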
Data Warehouse Design Pattern

By Definition                              Data Warehouse   Hadoop
Subject oriented                                 5             0
Detailed data                                    5             5
Modeled by business subject                      5             0
Integrated                                       5             0
Merged, deduplicated data                        5             0
Standardized data formats and values             5             0
Nonvolatile storage                              5             5
Time variant: record versions, temporal          5             0
Persistent storage                               5             5

Scale: 0 = none, 1 = poor, 2 = limited, 3 = average, 4 = robust, 5 = outstanding
NoSchema, Schema-on-Read, Complex Schemas
Single file (schema-on-read): no schema, no joins; one source; raw data; 3-5 uses
Data marts (schema-on-read): star and snowflake schemas; 2-4 fact table joins; multiple sources; raw data, unknown data; key-value stores
Data warehouse (schema-on-write): 5K-10K tables; 20-50 way joins; cross-organization; pre-integrated, cleansed; referential integrity; many applications
[Schema diagram: Events, Locations, Finance, Transaction, Session, Orders, Inventory, Call Center, POS]
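The schema-on-write vs. schema-on-read distinction can be illustrated with a small sketch. The three-field record layout and field names here are invented for illustration:

```python
# Hypothetical raw feed: one well-formed record and one malformed one.
raw = "41521390,2013-01-01 00:25:42,2.111.94.18\nBADROW"

# Schema-on-write: structure is enforced at load time; rows that do
# not match the declared schema never reach storage.
def load_schema_on_write(text):
    table = []
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) != 3:
            continue  # rejected at load time
        table.append({"id": int(fields[0]), "ts": fields[1], "ip": fields[2]})
    return table

# Schema-on-read: the raw text is stored as-is; each reader applies
# whatever structure it needs at query time.
def query_ips_schema_on_read(text):
    return [line.split(",")[2] for line in text.splitlines()
            if line.count(",") == 2]

warehouse = load_schema_on_write(raw)
ips = query_ips_schema_on_read(raw)
```

The trade-off in the slide follows directly: schema-on-write pays the cleansing cost once so that thousands of tables can be joined reliably, while schema-on-read defers that cost to every reader.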
What Hadoop is Not
• Not a database
> No schema, indexes, or optimizer
> No separation of code and data structure
> Hadoop uses objects and files
– Not rows and columns
• Hive helps a little
> Limited SQL
> Limited metadata
• Not high performance
• Not fully interactive queries
See also:
http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
http://blogs.gartner.com/donald-feinberg/2014/12/22/a-database-by-any-other-name/
ACID Advantages of an RDBMS
• Guarantees database actions are processed reliably
• Ensures query result accuracy
• Supports updates and deletes
• Needed for applications that require 100% consistency
> Banks, finance, inventory, etc.
> Maybe not for Facebook, Twitter, etc.
• Data you can trust
Atomicity: apply all changes or none
Consistency: rollback on errors
Isolation: one update at a time
Durability: transactions survive crashes
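Atomicity and rollback can be demonstrated in a few lines, using SQLite in memory as a stand-in for any ACID-compliant RDBMS (the table and account names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

def transfer(con, src, dst, amount):
    """Apply both sides of a transfer, or neither."""
    try:
        with con:  # opens a transaction; rolls back on any exception
            con.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                        (amount, src))
            con.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                        (amount, dst))
            (bal,) = con.execute("SELECT balance FROM accounts WHERE name = ?",
                                 (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")  # forces rollback
    except ValueError:
        pass  # transaction rolled back; both balances are unchanged

transfer(con, "alice", "bob", 30)   # commits
transfer(con, "alice", "bob", 500)  # rolls back, leaving 70/80 intact
balances = dict(con.execute("SELECT name, balance FROM accounts"))
```

This is exactly the guarantee raw HDFS files cannot give: a half-applied transfer simply never becomes visible.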
Hadoop's Biggest Differentiators
• Capture and ETL
• Long-term archive
• Cheap, commodity hardware
[Diagram: Hadoop handles capture, ETL, and archive; integration and analytics remain in the data warehouse]
Data Hub Refinery: Parallel ETL
[Diagram: raw feeds from social networks, mobile, web logs, and sensors flow into the Hadoop refinery]
When We’re Too Small for Hadoop ETL
• Avoid hand-coded transforms
• Prefer tool-based ETL
• 2 ETL servers do the job
• ETL is working well
When We Need Massive Data Integration
• Dozens of ETL servers
• High-velocity, real-time data
• 10s-100s of TB/day
• The risk is worth the reward
When In-database ELT Works Well
• Reference data look-ups
• Joins for derived data
• Lots of derived data
• Service-level goals to meet
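A toy illustration of the in-database ELT pattern: data is loaded raw, then a single set-based SQL statement performs the reference look-up and derives new columns inside the database. SQLite stands in for the warehouse; the tables and columns are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sku TEXT, qty INTEGER)")
con.execute("CREATE TABLE ref_products (sku TEXT, unit_price REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("A1", 2), ("B2", 5), ("A1", 1)])
con.executemany("INSERT INTO ref_products VALUES (?, ?)",
                [("A1", 10.0), ("B2", 4.0)])

# ELT step: one declarative statement does the reference-data join
# and computes the derived revenue column, all inside the engine.
con.execute("""
    CREATE TABLE sales_enriched AS
    SELECT s.sku, s.qty, p.unit_price, s.qty * p.unit_price AS revenue
    FROM sales s JOIN ref_products p ON s.sku = p.sku
""")
total = con.execute("SELECT SUM(revenue) FROM sales_enriched").fetchone()[0]
```

Because the join and derivation run as one optimized statement, the database's parallelism and workload management apply, which is why predictable service levels favor this approach.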
When to Use Which: It Depends
Criterion             In-Database ELT                               Hadoop
Reference data        Lookups; joins                                -
Transformations       Structured data; ELT modules; SQL can do it   Unstructured data; some ETL modules; do it yourself
Service level goals   Predictable; system management                -
Data security         Robust                                        -
Costs                 -                                             Commodity hardware
Data quality          Governance, MDM                               Low quality/trust OK
Data volume           High volume                                   Extreme volume
Offload ELT           -                                             Migration costs
Agility               -                                             No governance
WHERE HADOOP EXCELS
Hadoop Strengths
• Commodity low-cost hardware
• Many programming languages
> But mostly it's Java
• Free open source
• Any data structure
• Scale-out to petabytes + parallelism
Hadoop: the Data Hub
• ETL on steroids
• Economically "keep files forever"
> Queryable
• File-based reporting and analytics
• Backup and archival storage
> Databases, files, development
Where MapReduce Excels
• Temporary data, data exhaust
• Data mining/exploration
> 1000s of continuous variables
> Linear algebra
> Graph mining
> Machine learning
> Random forest, decision trees
> Markov chains
• Not all data mining is MapReduce
> Many things work better in an MPP RDBMS
> In-database SAS, R, Fuzzy Logix
> It depends
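The MapReduce pattern itself fits in a few lines: map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group. This single-process word count is only a model of what Hadoop distributes across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in the record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently (hence in parallel).
    return {key: sum(values) for key, values in groups.items()}

records = ["Hadoop stores files", "Hadoop processes files in parallel"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(r) for r in records)))
```

Because each reduce group is independent, the same program scales from this toy to petabytes; the trade-off is that iterative, join-heavy analysis maps poorly onto this batch shape, which is the slide's point about MPP databases.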
Developer Advantages with Hadoop
• Easy to work on non-relational data
> Java data types
> JSON, objects
• Hadoop is written in Java
> Compatible APIs, skills, concepts, frameworks, scripts
• Huge open source factories
> Apache, GitHub, Eclipse, SourceForge, etc.
> Assorted compression algorithms
• People
> 9M-10M Java programmers
> Web tutorials with extensive "how to" topics
> University student research
NoSchema Advantages
• Raw data format provides complete flexibility
• Non-traditional data types easily supported
> Graph, text, weblog, etc.
• No upfront ETL required
• No data loading required
• Flexible: late binding lets the data scientist choose

Example weblog record:
41521390 2013-01-01 00:25:42 2.111.94.18 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4 "http://www.cokstate.edu/welcome/" "https://www.google.com/#sclient=psyab&hl=en&source=hp&q=oklahoma+state&pbx=1&oq"

Note: schema-on-read has many pitfalls and is not always a good solution
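Late binding on a raw weblog record like the one above might look like this: the line is stored untouched, and each reader extracts only the fields it needs at read time. The sample line below is shortened, and the regex patterns are one possible choice rather than any standard:

```python
import re

# The raw record is stored exactly as captured (shortened here).
log_line = ('41521390 2013-01-01 00:25:42 2.111.94.18 '
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) '
            '"http://www.cokstate.edu/welcome/"')

# One reader binds only the client IP and the date...
ip = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b", log_line).group(1)
day = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", log_line).group(1)
# ...another binds the requested URL with a different pattern.
url = re.search(r'"(https?://[^"]+)"', log_line).group(1)
```

The flexibility is real, but so is the pitfall the note warns about: every reader re-implements (and can disagree on) the parsing that schema-on-write would have done once.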
Attributes Favoring Hadoop

Reason                        Description
Cost                          Low cost, low value data before refinement
Multi-structured data ingest  Raw weblogs, Twitter, Facebook, mobile, PST files, etc.
Data depth                    High data volume, few users, high signal-to-noise ratio
Non-SQL analytics             Complex processes, pipeline transforms, random forests, Markov chains, enormous arrays, etc.
Flexibility, autonomy         Exploratory analysis with little governance; fast, short-term turnaround
Ugly data                     Videos, satellite images, format conversions (PDF to text)
Key Considerations

MPP RDBMS                    Hadoop
Stable schema                Evolving schema
Structured data              Structure agnostic
Full ANSI SQL                Flexible programming
Iterative analysis           Batch analysis
Fine-grain security          N/A
Cleansed data                Raw data
Seeks                        Scans
Updates/deletes              Ingest
Service level agreements     Flexibility
Core data                    Source files
Complex joins                Complex processing
Efficient CPU and IO         Low-cost storage
What I Like About Hadoop
• YARN and Tez
• Queries on flat files!
• Parallel scanning engine
• Developer community
• Complex parallel processing
• Fast ingest of raw data
• Long-term archives at full fidelity
• Good scalability
Summary
• Start with workload requirements
> Map the tool capabilities to the requirement
• Hadoop is a data hub, a data lake
> Not a database or data warehouse
> Exploit Hadoop's strengths
• Combine the data warehouse and Hadoop
> Two tool sets solve more objectives
> Better together
The End