A Data Lake is more than Hadoop. Hadoop is more than a Data Lake · 2016-08-30 · • To get their...
#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER
A Data Lake is more than Hadoop.
Hadoop is more than a Data Lake
Dan Graham
Teradata Director Technical Marketing
What’s the Big Idea?
Big idea #1 “store all data” (whatever “all” means)
Big idea #2 “un-washed, raw data” (NoETL / late-binding)
Big idea #3 “resolve the nagging problem of accessibility and data integration”
DTG
Big idea #4 Data access/integration
Isn’t that in the data warehouse?
What is a Data Lake?
A data lake is a collection of long term data containers that capture, refine, and explore any form of raw data at scale, enabled by low cost technologies, from which multiple downstream facilities may draw.
Data sources: sensors, email, transactions, machine logs, geolocation, media
Data Lake
Downstream: BI tools, IDW, data marts, analysis, apps, other data lake
Data Lake is a Design Pattern
Data Lake Design Pattern
• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage
Data Warehouse Design Pattern
• Subject oriented
• Data model of the business
• Integrated
• Consolidated
• Consistent data formats
• Nonvolatile persisted data
• Time variant
• High concurrency levels
Design Patterns vis-à-vis Technologies
Data Lake Design Pattern
• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage
Data Lake Technologies: S3, Teradata 1800
Who is this Guy? What’s he Doing?
Data treatments: capture, refine, explore original raw data and metadata
Data scientists, programmers, business users, batch jobs
Multiple Data Lakes
Sensor data capture, refining
New product design
Market pricing
Hadoop is more than a data lake. A data lake is more than Hadoop.
What the Data Lake is Not
• Not a single central repository for all data
  • Unless you rebuild half the data center
  • 100s of reasons data bypasses the lake
• Not the only system feeding the data warehouse
  • Data goes direct or through ETL servers
• Not an archive
  • Policies, audits, immutability, extreme security, expirations
• Not dashboards and data marts
ETL analysis
data lake
Data Manufacturing
DATA R&D
DATA LAKE DATA PRODUCTS
Data Manufacturing & Hadoop Cluster
DATA R&D
DATA LAKE DATA PRODUCTS
Data Integration: Just Say No to your Inner DBA (and some users)
Levels of data trust | Data integration
Certified            | 100%
Trustworthy          | 80%
Proven               | 60%
Experimental         | 40%
Raw/high risk        | 20%
Investment axis: Low to High
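The trust ladder above can be read as a lookup from trust level to the amount of data integration applied. The following is a minimal Python sketch of that reading. The names `TrustLevel`, `next_level`, and `meets` are hypothetical and do not appear in the slides; the percentages are the ones shown on the slide, and the one-rung-at-a-time promotion rule is an illustrative assumption.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """Levels of data trust; each value is the data integration percentage from the slide."""
    RAW_HIGH_RISK = 20
    EXPERIMENTAL = 40
    PROVEN = 60
    TRUSTWORTHY = 80
    CERTIFIED = 100


def next_level(level: TrustLevel) -> TrustLevel:
    """Return the next rung up the ladder (assumed rule: promote one rung at a time)."""
    ordered = sorted(TrustLevel)
    index = ordered.index(level)
    return ordered[min(index + 1, len(ordered) - 1)]


def meets(level: TrustLevel, required: TrustLevel) -> bool:
    """True when a data set is integrated at least as much as a consumer requires."""
    return level >= required


if __name__ == "__main__":
    print(next_level(TrustLevel.RAW_HIGH_RISK).name)
    print(meets(TrustLevel.PROVEN, TrustLevel.TRUSTWORTHY))
```

Using an ordered enum keeps the "more investment, more trust" relationship explicit: a consumer can state the minimum level it accepts, and each data set carries the level it has earned so far.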
Use Cases
Data Integration Optimization
• Reference data look-ups
• Joins for derived data
• Lots of derived data
• Service-level goals to meet
• High velocity data
• Unstructured data
• Low value data
• Cost savings ROI
Dark Data Insights
• Dark data, data exhaust deleted
• New unstructured data
• Expensive, no ROI, unknown value
• Low user demand
• Dark data often contains insights
• Data lake costs are much lower
• Explore, research, discover
• Promote some to production
Sources: sensors, weblogs, logins, tweets, GPS, mobile
Production
Complex/ Iterative Processing
• Extensive CPU usage
  • Iterative processing
  • Non sequential loops & branches
• Complex algorithms
  • Video content analysis
  • Photo analysis
  • Text analysis
  • Random forests
  • Monte Carlo methods
• Scientific research
  • Weather simulation
  • Electromagnetic modeling
  • Physics, DNA, etc.
Complex processing
Set processing
Managing Shadow IT
• To get their job done, users abscond with data daily
  • Bypass IT, governance, and security
  • Data-mart-under-my-desk
• Dispensing data reliably
  • HELP users get needed data
  • Improve data quality
  • Get some control versus none
  • Add some governance, security, audit
Data Lake
Offloading the Coldest Data
• Offload coldest rows
  • Free up IDW storage
• Temperature = usage
  • Date stamp often irrelevant
• Archive, compliance
• Accessible with QueryGrid
Hot/warm data
Coldest data
ETL
QueryGrid move
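"Temperature = usage" means rows are ranked by how often they are read, not by how old they are. The sketch below illustrates that idea only; it is not how Teradata or QueryGrid measures temperature. The `RowStats` record, the `reads_last_90_days` counter, and the `split_by_temperature` function are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RowStats:
    row_id: int
    date_stamp: date          # creation date; deliberately ignored when ranking
    reads_last_90_days: int   # hypothetical usage counter standing in for "temperature"


def split_by_temperature(rows, offload_fraction):
    """Split rows into (keep, offload), offloading the least-read fraction."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    ranked = sorted(rows, key=lambda r: r.reads_last_90_days)  # coldest first
    cut = int(len(ranked) * offload_fraction)
    return ranked[cut:], ranked[:cut]


if __name__ == "__main__":
    rows = [
        RowStats(1, date(2009, 1, 5), 120),   # old but hot: stays in the warehouse
        RowStats(2, date(2016, 6, 1), 0),     # recent but never read: coldest
        RowStats(3, date(2014, 3, 9), 7),
        RowStats(4, date(2015, 11, 2), 45),
    ]
    keep, offload = split_by_temperature(rows, offload_fraction=0.5)
    print("offload:", [r.row_id for r in offload])
    print("keep:", [r.row_id for r in keep])
```

Note that the oldest row stays in the warehouse because it is read often, while a recent row that nobody reads is offloaded, which is why the date stamp is often irrelevant.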
Single Subject Data Analysis
• Analytics
  • Query and reporting
  • Data mining
  • Dashboards
• Single subject star schema
  • 1-2 raw data fact tables
  • Structured + unstructured data
  • Non cleansed data
  • Non integrated data
  • Dimension tables
#Version: 1.0
#GMT-Offset: -0800
#Software: MyCorpTopaz Web Cache 2.0.0.2.0;
#Start-Date: 2015-06-21 00:00:18
#Fields: c-ip c-dns c-auth-id date time cs-method cs-uri sc-status-ctrl bytes cs(Cookie) cs(Referrer) time-taken cs(User-Agent)
#date: 2015-07-31; ”buyer”=“Willcox”; order”=“lingerie”; DMS.user; GET /images/bottom.gif 200A17x 350 "BIGipServer_webcache”=“217”; ORA_UCM_AGID=%2fMP%2f8M7%3etSHPV%40%2fS%3f%3fDh3V“; "http://www.myDBl.com/nl.html" 37087 "Mozilla/4.5 [en] (WinNT;)"
Raw data files
Dimension tables: store, address, date, type
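The raw file shown above begins with `#`-prefixed directives, including a `#Fields:` line that names the columns. A minimal sketch of late binding on such a file follows: the field names are read from the header and attached to a record only at read time. The function names `read_directives` and `bind_fields` are hypothetical, the field list is shortened, and the sample record is an invented, cleanly delimited line, because the record on the slide is not consistently delimited.

```python
def read_directives(lines):
    """Collect '#Name: value' header directives from a log file into a dict."""
    directives = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("#"):
            continue
        name, _, value = line[1:].partition(":")
        directives[name.strip()] = value.strip()
    return directives


def bind_fields(directives, record):
    """Pair the names from '#Fields' with one whitespace-delimited record at read time."""
    names = directives["Fields"].split()
    values = record.split()
    return dict(zip(names, values))


if __name__ == "__main__":
    header = [
        "#Version: 1.0",
        "#GMT-Offset: -0800",
        "#Fields: c-ip date time cs-method cs-uri",  # shortened field list
    ]
    # Hypothetical record for illustration; real raw data is usually messier.
    record = "10.0.0.1 2015-07-31 00:00:18 GET /images/bottom.gif"
    print(bind_fields(read_directives(header), record))
```

The file itself is stored unchanged; only the reader decides which names to apply, so a different analysis can bind the same bytes differently.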
Big Pictures
Data Lake Architecture
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Acquisition: Ingest, Conversion, Encryption
Preparation: Cleansing, Validation, Aggregation, Materialization
Access: Search, Profiling, Tagging, Analytics
Security, Metadata/Lineage, Administration
Distributed Storage
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Data Lake Architecture
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Access, Preparation, Acquisition: Streams, Search, Aggregations, Msg. queues, Cleansing, Access, Experiments, Governance, Files
Security, Metadata/Lineage, Administration
Distributed Storage
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Hadoop Data Lake Technologies
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Access, Preparation, Acquisition
YARN, Ambari, Navigator, HCatalog, Sentry
HDFS, S3: raw data, derived views
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Data Lake: Teradata 1800
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Access, Preparation, Acquisition
Teradata Parallel Data Environment
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Data Lab Studio
QueryGrid
SAS mining Fuzzy Logix
SPSS Revolution R
Informatica DataStage Oracle DI
SAS DI Studio Ab Initio
Microsoft
TPT Data-mover
Listener REST APIs Attunity
Informatica, IBM DataStage, Oracle Data Integrator, Talend
Viewpoint, Ecosystem Manager, Unity
Data Lake Definition Summary
• The data lake is a design pattern
  • Requires and uses many technologies
• The data lake is more than Hadoop
  • Amazon S3, Cassandra, Teradata
  • Other tools and technologies
• Hadoop is more than a data lake
• The data lake manages raw data
  • Refined in downstream processes
Data sources
Downstream consumers
Thank You
Questions/Comments
Follow Me: Twitter @DanGraham_
Rate This Session #417 with the PARTNERS Mobile App (rate it a 5 please)
Remember To Share Your Virtual Passes
Data Lake Platforms
Data lake definition             | Hadoop | Amazon EMR | Cassandra | Teradata 1800
Long term data containers        | X      | X          | X         | X
Capture, refine, and explore     | X      | X          | X         | X
Raw data at scale                | X      | X          | X         | X
Low cost technologies            | X      | X          | X         | X
Feeds downstream uses            | X      | X          | X         | X
Options
Schema-on-read                   | X      | X          | X         | JSON, NVPs
File system                      | HDFS   | S3         | CFS       | RDBMS
Search engines                   | Solr   |            | Solr      |
SQL, Java, Python, Ruby, scripts | X      | X          | X         | X
Value Creation via Data Integration
DATA LAKE
• Data integration on demand
• Data value assumed
• Typically schema-on-read
INTEGRATED DATA WAREHOUSE
• Data integration up front
• Data value manufactured
• Typically schema-on-write
Sources: SCM, CRM, ERP
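The contrast between integration up front (schema-on-write) and integration on demand (schema-on-read) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any product: the JSON record format, the `REQUIRED` schema, and the three function names are all hypothetical.

```python
import json

# Hypothetical warehouse schema: column name -> converter applied at load time.
REQUIRED = {"customer_id": int, "amount": float}


def load_schema_on_write(table, raw_record):
    """Validate and convert before storing; bad records fail at load time."""
    parsed = json.loads(raw_record)
    # Raises KeyError for a missing column, ValueError for an unconvertible value.
    row = {name: cast(parsed[name]) for name, cast in REQUIRED.items()}
    table.append(row)


def load_schema_on_read(lake, raw_record):
    """Store the raw text untouched; no structure is imposed at load time."""
    lake.append(raw_record)


def query_schema_on_read(lake, field):
    """Apply structure only when reading, skipping records that lack the field."""
    for raw_record in lake:
        try:
            parsed = json.loads(raw_record)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and field in parsed:
            yield parsed[field]


if __name__ == "__main__":
    records = ['{"customer_id": 7, "amount": "19.5"}', '{"customer_id": 8}', "not json"]
    table, lake = [], []
    for record in records:
        load_schema_on_read(lake, record)
        try:
            load_schema_on_write(table, record)
        except (KeyError, ValueError):
            pass  # rejected by the warehouse, still kept raw in the lake
    print(len(table), len(lake))
    print(list(query_schema_on_read(lake, "customer_id")))
```

The warehouse path pays the integration cost once and rejects what does not fit; the lake path keeps everything and defers the cost to each reader.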
Teradata’s Hadoop Data Lake Products
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Access, Preparation, Acquisition
Listener, App Center, Viewpoint
HDFS
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts