Making Sense of the Madness - FST Media · 2015-02-03
Copyright James Mitchell 2014 1
Making Sense of the Madness
Deploying Big Data techniques to deal with real world ‘Bigish Data’ issues
Introduction
Warning!
Parental Guidance Recommended – please read the small print.
The concepts contained herein are exploratory in nature.
A discussion of ideas about potential use cases.
The aim is to open your eyes to the potential benefits that Big Data technology can provide.
I’m convinced enough to give it a shot – Just don’t tell my boss! …. yet
Big Data Presentation
If you work for a bank, don’t talk to your boss about Big Data.
It will only agitate him and give him cold sweats.
Introduction
Instead you tell your boss…
Sold!
But how?
Save on storage.
Profile data & clean up quality issues.
Better data security.
Business intelligence on a fast track.
Speed up backup & restore.
Improve system performance.
All powered by Big Data.
Changing Focus: Data Warehouse to Big Data
Big Data Isn’t New
Interest stayed low from before 2005 until the technology caught up.
Hadoop technologies provided the platform to drive Big Data.
Lucky for us, it also provides other ‘Real World’ benefits.
[Chart: Google search trends for “Big Data”. Source: www.trends.google.com]
Changing Focus: Data Warehouse to Big Data
So what is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications. The challenges include capture, curation, storage, search,
sharing, transfer, analysis and visualization. (Source: Wikipedia)
We have ‘Bigish Data’
It’s not the Google or NSA kind of data.
Data from hundreds of systems spanning many years.
It’s an extremely rich set of data.
Unfortunately we only use about 5% of it. Why?
It’s a mess, that’s why!
It’s everywhere, in every possible format and every possible technology.
And it always seems the people who put it together in the first place have left.
Immediate & Real Problems we Face
I have problems!
Not those kind of problems…
Big Data kind of problems!
It takes so much time and effort to build my EDW.
Data volumes are ever increasing, and storage costs are going up.
The business demands more: storage, regulatory requirements, an analytics focus.
So much of the data is never understood, hence never used.
Data quality and accuracy risks slow the pace of the EDW and lead to bad decisions.
Data security risks: storage limitations lead to “home grown” storage solutions.
Slower batch performance and slower DR due to swelling data volumes.
I’m convinced enough to give it a shot – Just don’t tell my boss!
EDW Vs. Big Data Technologies
Enterprise Data Warehouse Technology
EDW MPP-type appliances (Netezza, Teradata, Greenplum, etc.).
Fast: multi-core, very large memory, e.g. 128 cores, 1 TB of memory.
But not all data needs to be so readily available or so fast.
Too costly to rely entirely on your EDW and SAN.
Big Data Technology
Apache™ Hadoop® has emerged as the main big data platform to deploy a data staging area or enterprise data lake—a massively scalable environment for loading, storing, and refining very large amounts of data of any format or schema requirement. (Source: Teradata)
The underlying technology comes from Google’s GFS and MapReduce papers.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarisation & ad hoc querying.
DIY using open source from Apache & build your own infrastructure.
Vendor appliances from Teradata, IBM, HP, etc. can reduce complexity.
Addressing our Challenges
1. Nearline Data
2. Data Archiving
3. Document / Image Archiving
4. Database Backup Images
5. Operational Data File Storage
6. Data Discovery / Data Quality
1. Nearline Data
Scenario:
A flow-over mechanism for cold data to Hadoop.
Move aged or untouched data on a rule basis.
Move data once a storage threshold is reached.
Keep the data queryable via BigSQL / HiveQL.
Benefits: Less data online, improved system performance.
Reduction in system or SAN storage needs. Ideal for maintaining historical data.
Makes multi-system audits easier.
Provides a very solid foundation for BI / MIS. A net side effect is a more complete dataset for ‘Bigish Data’ analytics.
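The rule-based flow-over above can be sketched in a few lines. Everything concrete here is an illustrative assumption, not a detail from the deck: the partition names, the 365-day threshold, and the idea of tracking last-access dates per partition.

```python
from datetime import date, timedelta

def select_cold_partitions(partitions, today, age_days=365):
    """Pick partitions whose last access is older than the threshold.

    `partitions` maps partition name -> last-accessed date; anything not
    touched within `age_days` is a candidate to flow over to Hadoop.
    """
    cutoff = today - timedelta(days=age_days)
    return sorted(p for p, last_used in partitions.items() if last_used < cutoff)

# Hypothetical EDW partitions keyed by last access.
parts = {
    "txn_2010": date(2011, 1, 15),
    "txn_2013": date(2014, 3, 1),
    "txn_2014": date(2014, 6, 30),
}
cold = select_cold_partitions(parts, today=date(2014, 7, 1))  # -> ["txn_2010"]
```

Once a partition is moved, an external Hive table over the Hadoop copy keeps it queryable through HiveQL, which is what makes the flow-over transparent to downstream users.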
2. Data Archiving
Scenario:
Utilise existing data archiving software with back-end storage onto Hadoop, or utilise the Hadoop API for archival.
Benefits: Archival of data for regulatory purposes.
Avoids clogging up the productive system.
Reduction in system or SAN storage needs.
Cost reduction for data archive storage.
Using the Hadoop API directly removes the cost of an archiving application.
Value-add on top of pure archival: the data remains available for analytics.
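As a minimal sketch of the archival flow, assuming a year/month directory layout: in production the destination would be HDFS written through the Hadoop filesystem API, but here a plain local directory stands in for it, and the filenames are made up.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def archive_file(src: Path, archive_root: Path, when: datetime) -> Path:
    """Move a file into a year/month-partitioned archive layout."""
    dest_dir = archive_root / f"{when.year:04d}" / f"{when.month:02d}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))  # with HDFS this would be an API put
    return dest

# Invented example file; a temp directory stands in for HDFS.
tmp = Path(tempfile.mkdtemp())
doc = tmp / "statements.dat"
doc.write_text("row1\nrow2\n")
archived = archive_file(doc, tmp / "archive", datetime(2014, 5, 20))
# archived now sits under archive/2014/05/ and the source file is gone
```

Partitioning the archive by date keeps regulatory retrieval simple: a query for a given period maps directly to a directory.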
3. Document / Image Archiving
Scenario:
Utilise existing document archiving software with back-end storage onto Hadoop, or utilise the Hadoop API for document archival.
Benefits: Archival of documents for regulatory purposes. Reduction in system or SAN storage needs.
Cost reduction for document archive storage.
Using the Hadoop API directly removes the cost of an archiving application.
Value-add on top of pure archival: the documents remain available for analytics.
4. Database Backup Images
Scenario:
Database image backup to Hadoop.
Restore back to Prod, DR site.
Benefits: Reduced need for system or SAN storage.
Tapeless backup and restore, much faster; parallel backups and restores possible.
Faster disaster recovery (DR) performance (better RTO).
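The parallel backup idea can be sketched as concurrent copies of pre-split image chunks. The chunk naming and worker count are assumptions, and a local directory stands in for the Hadoop cluster whose distributed writes would make the parallelism pay off.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_backup(chunks, dest_dir, workers=4):
    """Copy database image chunks to backup storage concurrently."""
    dest_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(chunk):
        target = dest_dir / chunk.name
        shutil.copy2(chunk, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(copy_one, chunks))

# A hypothetical backup image pre-split into chunk files.
src = Path(tempfile.mkdtemp())
chunks = []
for i in range(3):
    part = src / f"imagedb.part{i}"
    part.write_bytes(b"x" * 10)
    chunks.append(part)
backed_up = parallel_backup(chunks, src / "backup")
```

A restore is the same pattern in reverse: pull the chunks back in parallel and reassemble, which is what shortens the RTO relative to sequential tape.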
[Diagram: current T and T-1 extract files stay with the EDW; older T-2 … T-n files move to Hadoop]
5. Operational Data File Storage
Scenario:
Daily extracts from multiple systems feed the EDW.
Move source extract files to Hadoop after the batch run.
Copy them back if required for BAU purposes.
Benefits: Reduced need for system or SAN storage on EDW.
Reduced costs for EDW storage.
Operationally useful to keep extracts for extended periods.
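The T / T-1 versus T-2 … T-n retention rule amounts to partitioning extract files by generation. The file names, date keys, and keep-two policy below are illustrative, not from the deck.

```python
def split_retention(files_by_day, keep=2):
    """Keep the newest `keep` generations (T, T-1) on the EDW side and
    mark everything older (T-2 ... T-n) for the move to Hadoop."""
    days = sorted(files_by_day, reverse=True)
    hot = {d: files_by_day[d] for d in days[:keep]}
    cold = {d: files_by_day[d] for d in days[keep:]}
    return hot, cold

# Invented daily extract files keyed by ISO date.
extracts = {
    "2014-06-28": "core_extract_20140628.csv",
    "2014-06-29": "core_extract_20140629.csv",
    "2014-06-30": "core_extract_20140630.csv",
    "2014-07-01": "core_extract_20140701.csv",
}
hot, cold = split_retention(extracts)
# hot keeps 2014-07-01 and 2014-06-30; cold holds the older two
```

Run after the nightly batch, the `cold` set is what gets shipped to Hadoop, leaving only the working generations on expensive EDW storage.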
6. Data Discovery
Scenario: Full backups from EDW / sources on Hadoop.
BI / MIS & Analytics feed from Hadoop.
Benefits: Aforementioned storage cost savings.
BI / Analytics don’t impact productive EDW.
Enterprise BI initiatives simplified & faster.
A straight DB copy is much easier than a full mapping into the EDW.
Hadoop analytics can derive information from the mass of data like a rake, not a comb.
Runs in parallel to EDW initiatives, bringing value to the business in months, not years.
Understanding the data sooner simplifies the eventual EDW migration.
An ideal platform for Data Quality initiatives.
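The “rake, not a comb” style of discovery can be illustrated as brute-force profiling of whatever values actually occur, rather than validating against a known schema. The column name and rows are invented.

```python
from collections import Counter

def profile_column(records, column):
    """Rake through raw rows: count every distinct value of one column,
    blanks and oddities included, instead of combing for a known shape."""
    return Counter(r.get(column, "<missing>") for r in records)

# Invented raw rows copied straight from a source system, warts and all.
rows = [
    {"country": "MY"}, {"country": "MY"}, {"country": "my"},
    {"country": ""}, {},
]
freq = profile_column(rows, "country")
# freq reveals the case drift ("my"), the blank, and the missing field
```

At Hadoop scale the same counting would run as a distributed job, but the principle is identical: let the data tell you what is in it before you try to map it into the EDW.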
6. Data Quality
Scenario:
Profile & cleanse DB images on Hadoop, and pass the results back to fix at source.
Provide storage for Data Quality activities.
Profile & cleanse data before it goes to consumers (BI, MIS, Analytics).
Profile, cleanse & mask data for testing purposes.
Provide storage for test data.
Benefits: Centralised & cleansed data repository.
Storage for highly data-intensive activities.
Greatly benefits accuracy of enterprise data.
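Masking for test data, as mentioned in the scenario, can be sketched as deterministic salted hashing. The salt, prefix, and token length are illustrative choices, not a prescribed scheme.

```python
import hashlib

def mask_account(acct, salt="demo-salt"):
    """Deterministically mask an account number: the same input always
    yields the same token, but the original is not recoverable."""
    digest = hashlib.sha256((salt + acct).encode()).hexdigest()
    return "ACCT-" + digest[:10]

masked = mask_account("1234567890")
# identical inputs mask identically, so joins across test tables still work
```

Determinism is the point of the design: referential integrity between masked tables survives, while the salt keeps the tokens from being reversible by simple dictionary hashing.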
Summary
1. Nearline Data
2. Data Archiving
3. Document / Image Archiving
4. Database Backup Images
5. Operational Data File Storage
6. Data Discovery / Data Quality

Save on storage.
Profile data & clean up quality issues.
Better data security.
Business intelligence on a fast track.
Speed up backup & restore.
Improve system performance.
All powered by Big Data.
Thank You for Listening!
James Mitchell
Head of IT, Shared Services
AmBank Berhad
Phone: +6012 709 6961
Email: [email protected]