Making Sense of the Madness - FST Media · 2015-02-03
Copyright James Mitchell 2014 1
Making Sense of the Madness
Deploying Big Data techniques to deal with real world ‘Bigish Data’ issues
Introduction
Warning!
Parental Guidance Recommended – please read the small print.
The concepts contained herein are exploratory in nature.
A discussion of ideas about potential use cases.
The aim is to open your eyes to the potential benefits that Big Data technology can provide.
I’m convinced enough to give it a shot – Just don’t tell my boss! …. yet
Big Data Presentation
If you work for a bank, don’t talk to your boss about Big Data.
It will only agitate him and give him cold sweats.
Introduction
Instead you tell your boss…
Sold!
But how?
Save on storage.
Profile data & clean up quality issues.
Better data security.
Business intelligence on a fast track.
Speed up backup & restore.
Improve system performance.
All powered by Big Data.
Changing Focus: Data Warehouse to Big Data
Big Data Isn’t New
Interest stayed low from before 2005 until the technology caught up.
Hadoop technologies provided the platform to drive Big Data.
Lucky for us, it also provides other ‘Real World’ benefits.
[Chart: Google search trends for “Big Data”. Source: www.trends.google.com]
Changing Focus: Data Warehouse to Big Data
So what is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications. The challenges include capture, curation, storage, search,
sharing, transfer, analysis and visualization. (Source: Wikipedia)
We have ‘Bigish Data’
It’s not the Google or NSA kind of data.
Data from hundreds of systems spanning many years.
It’s an extremely rich set of data.
Unfortunately we only use about 5% of it. Why?
It’s a mess, that’s why!
It’s everywhere, in every possible format and every possible technology.
And it always seems the people who put it together in the first place have left.
Immediate & Real Problems we Face
I have problems!
Not those kind of problems…
Big Data kind of problems!
It takes so much time and effort to build my EDW.
Data volumes are ever increasing, and storage costs are going up.
The business demands more: storage, regulatory requirements, an analytics focus.
So much of the data is never understood, hence never used.
Data quality and accuracy risks slow the pace of the EDW and lead to bad decisions.
Data security risks: storage limitations lead to “home grown” storage solutions.
Slower batch performance and slower DR due to swelling data volumes.
I’m convinced enough to give it a shot – Just don’t tell my boss!
EDW Vs. Big Data Technologies
Enterprise Data Warehouse Technology
EDW MPP-type appliances (Netezza, Teradata, Greenplum, etc.).
Fast: multi-core, very large memory, e.g. 128 cores, 1 TB of memory.
But not all data needs to be so readily available or so fast.
Too costly to rely entirely on your EDW and SAN.
Big Data Technology
Apache™ Hadoop® has emerged as the main big data platform to deploy a data staging area or enterprise data lake—a massively scalable environment for loading, storing, and refining very large amounts of data of any format or schema requirement. (Source: Teradata)
The underlying technology comes from Google’s GFS and MapReduce papers.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarisation & ad hoc querying.
DIY using open source from Apache & build your own infrastructure.
Vendor appliances from Teradata, IBM, HP, etc. can reduce complexity.
Addressing our Challenges
1. Nearline Data
2. Data Archiving
3. Document / Image Archiving
4. Database Backup Images
5. Operational Data File Storage
6. Data Discovery / Data Quality
1. Nearline Data
Scenario:
A flow-over mechanism for cold data to Hadoop.
Move aged or untouched data on a rule basis.
Move data once a storage threshold is reached.
Keep the data queryable via BigSQL / HiveQL.
Benefits: Less data online, improved system performance.
Reduction in system or SAN storage needs. Ideal for maintaining historical data.
Makes multi-system audits easier.
Provides a very solid foundation for BI / MIS. A net side effect is a more complete dataset for ‘Bigish Data’ analytics.
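The rule-based flow-over above can be sketched in a few lines. Everything concrete here is an illustrative assumption, not a detail from the deck: the partition names, the 365-day threshold, and the idea of tracking last-access dates per partition.

```python
from datetime import date, timedelta

def select_cold_partitions(partitions, today, age_days=365):
    """Pick partitions whose last access is older than the threshold.

    `partitions` maps partition name -> last-accessed date; anything not
    touched within `age_days` is a candidate to flow over to Hadoop.
    """
    cutoff = today - timedelta(days=age_days)
    return sorted(p for p, last_used in partitions.items() if last_used < cutoff)

# Hypothetical EDW partitions keyed by last access.
parts = {
    "txn_2010": date(2011, 1, 15),
    "txn_2013": date(2014, 3, 1),
    "txn_2014": date(2014, 6, 30),
}
cold = select_cold_partitions(parts, today=date(2014, 7, 1))  # -> ["txn_2010"]
```

Once a partition is moved, an external Hive table over the Hadoop copy keeps it queryable through HiveQL, which is what makes the flow-over transparent to downstream users.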
2. Data Archiving
Scenario:
Utilise existing data archiving software with back-end storage onto Hadoop, or utilise the Hadoop API for archival.
Benefits: Archival of data for regulatory purposes.
Avoids clogging up the productive system.
Reduction in system or SAN storage needs.
Cost reduction for data archive storage.
Using the Hadoop API directly removes the cost of an archiving application.
Value-add on top of pure archival: the data remains available for analytics.
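As a minimal sketch of the archival flow, assuming a year/month directory layout: in production the destination would be HDFS written through the Hadoop filesystem API, but here a plain local directory stands in for it, and the filenames are made up.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def archive_file(src: Path, archive_root: Path, when: datetime) -> Path:
    """Move a file into a year/month-partitioned archive layout."""
    dest_dir = archive_root / f"{when.year:04d}" / f"{when.month:02d}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))  # with HDFS this would be an API put
    return dest

# Invented example file; a temp directory stands in for HDFS.
tmp = Path(tempfile.mkdtemp())
doc = tmp / "statements.dat"
doc.write_text("row1\nrow2\n")
archived = archive_file(doc, tmp / "archive", datetime(2014, 5, 20))
# archived now sits under archive/2014/05/ and the source file is gone
```

Partitioning the archive by date keeps regulatory retrieval simple: a query for a given period maps directly to a directory.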
3. Document / Image Archiving
Scenario:
Utilise existing document archiving software with back-end storage onto Hadoop, or utilise the Hadoop API for document archival.
Benefits: Archival of documents for regulatory purposes. Reduction in system or SAN storage needs.
Cost reduction for document archive storage.
Using the Hadoop API directly removes the cost of an archiving application.
Value-add on top of pure archival: the documents remain available for analytics.
4. Database Backup Images
Scenario:
Database image backup to Hadoop.
Restore back to Prod, DR site.
Benefits: Reduced need for system or SAN storage.
Tapeless backup and restore, much faster; parallel backups and restores possible.
Faster disaster recovery (DR) performance (better RTO).
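The parallel backup idea can be sketched as concurrent copies of pre-split image chunks. The chunk naming and worker count are assumptions, and a local directory stands in for the Hadoop cluster whose distributed writes would make the parallelism pay off.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_backup(chunks, dest_dir, workers=4):
    """Copy database image chunks to backup storage concurrently."""
    dest_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(chunk):
        target = dest_dir / chunk.name
        shutil.copy2(chunk, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(copy_one, chunks))

# A hypothetical backup image pre-split into chunk files.
src = Path(tempfile.mkdtemp())
chunks = []
for i in range(3):
    part = src / f"imagedb.part{i}"
    part.write_bytes(b"x" * 10)
    chunks.append(part)
backed_up = parallel_backup(chunks, src / "backup")
```

A restore is the same pattern in reverse: pull the chunks back in parallel and reassemble, which is what shortens the RTO relative to sequential tape.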
[Diagram: current T and T-1 extract files stay with the EDW; older T-2 … T-n files move to Hadoop]
5. Operational Data File Storage
Scenario:
Daily extracts from multiple systems feed the EDW.
Move source extract files to Hadoop after the batch run.
Copy them back if required for BAU purposes.
Benefits: Reduced need for system or SAN storage on EDW.
Reduced costs for EDW storage.
Operationally useful to keep extracts for extended periods.
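The T / T-1 versus T-2 … T-n retention rule amounts to partitioning extract files by generation. The file names, date keys, and keep-two policy below are illustrative, not from the deck.

```python
def split_retention(files_by_day, keep=2):
    """Keep the newest `keep` generations (T, T-1) on the EDW side and
    mark everything older (T-2 ... T-n) for the move to Hadoop."""
    days = sorted(files_by_day, reverse=True)
    hot = {d: files_by_day[d] for d in days[:keep]}
    cold = {d: files_by_day[d] for d in days[keep:]}
    return hot, cold

# Invented daily extract files keyed by ISO date.
extracts = {
    "2014-06-28": "core_extract_20140628.csv",
    "2014-06-29": "core_extract_20140629.csv",
    "2014-06-30": "core_extract_20140630.csv",
    "2014-07-01": "core_extract_20140701.csv",
}
hot, cold = split_retention(extracts)
# hot keeps 2014-07-01 and 2014-06-30; cold holds the older two
```

Run after the nightly batch, the `cold` set is what gets shipped to Hadoop, leaving only the working generations on expensive EDW storage.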
6. Data Discovery
Scenario: Full backups from EDW / sources on Hadoop.
BI / MIS & Analytics feed from Hadoop.
Benefits: Aforementioned storage cost savings.
BI / Analytics don’t impact productive EDW.
Enterprise BI initiatives simplified & faster.
A straight DB copy is much easier than a full mapping into the EDW.
Hadoop analytics can derive information from the mass of data like a rake, not a comb.
Runs in parallel to EDW initiatives, bringing value to the business in months, not years.
Understanding the data sooner simplifies the eventual EDW migration.
An ideal platform for Data Quality initiatives.
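The “rake, not a comb” style of discovery can be illustrated as brute-force profiling of whatever values actually occur, rather than validating against a known schema. The column name and rows are invented.

```python
from collections import Counter

def profile_column(records, column):
    """Rake through raw rows: count every distinct value of one column,
    blanks and oddities included, instead of combing for a known shape."""
    return Counter(r.get(column, "<missing>") for r in records)

# Invented raw rows copied straight from a source system, warts and all.
rows = [
    {"country": "MY"}, {"country": "MY"}, {"country": "my"},
    {"country": ""}, {},
]
freq = profile_column(rows, "country")
# freq reveals the case drift ("my"), the blank, and the missing field
```

At Hadoop scale the same counting would run as a distributed job, but the principle is identical: let the data tell you what is in it before you try to map it into the EDW.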
6. Data Quality
Scenario:
Profile & cleanse DB images on Hadoop, and pass the results back to fix at source.
Provide storage for Data Quality activities.
Profile & cleanse data before it goes to consumers (BI, MIS, Analytics).
Profile, cleanse & mask data for testing purposes.
Provide storage for test data.
Benefits: Centralised & cleansed data repository.
Storage for highly data-intensive activities.
Greatly benefits accuracy of enterprise data.
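Masking for test data, as mentioned in the scenario, can be sketched as deterministic salted hashing. The salt, prefix, and token length are illustrative choices, not a prescribed scheme.

```python
import hashlib

def mask_account(acct, salt="demo-salt"):
    """Deterministically mask an account number: the same input always
    yields the same token, but the original is not recoverable."""
    digest = hashlib.sha256((salt + acct).encode()).hexdigest()
    return "ACCT-" + digest[:10]

masked = mask_account("1234567890")
# identical inputs mask identically, so joins across test tables still work
```

Determinism is the point of the design: referential integrity between masked tables survives, while the salt keeps the tokens from being reversible by simple dictionary hashing.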
Summary
1. Nearline Data
2. Data Archiving
3. Document / Image Archiving
4. Database Backup Images
5. Operational Data File Storage
6. Data Discovery / Data Quality

Save on storage.
Profile data & clean up quality issues.
Better data security.
Business intelligence on a fast track.
Speed up backup & restore.
Improve system performance.
All powered by Big Data.
Thank You for Listening!
James Mitchell
Head of IT, Shared Services
AmBank Berhad
Phone: +6012 709 6961
Email: [email protected]