Hadoop 2.0 - Solving the Data Quality Challenge

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Description

The Briefing Room with Dr. Claudia Imhoff and RedPoint Global. Live Webcast on July 22, 2014. Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7bb4cbc33402c3b5f649343052cb9a6d

Whether data is big or small, quality remains the critical characteristic. While traditional approaches to cleansing data have made strides, data quality remains a serious hurdle for all organizations. This is especially true for identity resolution in customer data, but also for a range of other data sets, including social, supply chain, financial and other domains. One of the most promising approaches to solving this decades-old challenge incorporates the power of massively parallel processing, a la Hadoop.

Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Claudia Imhoff, who will explain how Hadoop 2.0 and its YARN architecture can make a serious impact on the previously intractable problem of data quality. She’ll be briefed by George Corugedo of RedPoint Global, who will show how his company’s platform can serve as a super-charged marshaling area for accessing, cleansing and delivering high-quality data. He’ll explain how RedPoint was one of the first applications to be certified to run on YARN, the latest rendition of the now-ubiquitous Hadoop. Visit InsideAnalysis.com for more information.

Transcript of Hadoop 2.0 - Solving the Data Quality Challenge

Page 1: Hadoop 2.0 - Solving the Data Quality Challenge

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: Hadoop 2.0 - Solving the Data Quality Challenge

The Briefing Room

Hadoop 2.0: Solving the Data Quality Challenge

Page 3: Hadoop 2.0 - Solving the Data Quality Challenge

Twitter Tag: #briefr

The Briefing Room

Welcome

Host: Eric Kavanagh

[email protected] @eric_kavanagh

Page 4: Hadoop 2.0 - Solving the Data Quality Challenge


Mission

•  Reveal the essential characteristics of enterprise software, good and bad

•  Provide a forum for detailed analysis of today’s innovative technologies

•  Give vendors a chance to explain their product to savvy analysts

•  Allow audience members to pose serious questions... and get answers!

Page 5: Hadoop 2.0 - Solving the Data Quality Challenge


Topics

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: INNOVATIVE TECHNOLOGY

August: BIG DATA ECOSYSTEM

September: INTEGRATION

Page 6: Hadoop 2.0 - Solving the Data Quality Challenge
Page 7: Hadoop 2.0 - Solving the Data Quality Challenge


Analyst: Dr. Claudia Imhoff

Claudia Imhoff is President & Founder of Intelligent Solutions, Inc.

Page 8: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Global

•  RedPoint Global is a data management and integrated marketing technology company.

•  Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration.

•  RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.

Page 9: Hadoop 2.0 - Solving the Data Quality Challenge


Guest: George Corugedo

George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.

Page 10: Hadoop 2.0 - Solving the Data Quality Challenge

The Neglected Discipline of Data Quality in Hadoop (July 2014)

Page 11: Hadoop 2.0 - Solving the Data Quality Challenge

© RedPoint Global Inc. 2014 Confidential

Overview – Challenges to Adoption

Skills Gap

•  Severe shortage of MapReduce-skilled resources

•  Very expensive resources that are hard to retain

•  Inconsistent skills lead to inconsistent results

•  Underutilizes existing resources

•  Prevents broad leverage of investments across the enterprise

Maturity & Governance

•  A nascent technology ecosystem around Hadoop

•  Emerging technologies only address narrow slivers of functionality

•  New applications are not enterprise class

•  Legacy applications have built only short-term capabilities

Data Into Information

•  Data is not useful in its raw state; it must be turned into information

•  The benefit of Hadoop is that the same data can be used from many perspectives

•  Analysts must now structure the data based on its intended use (see the sketch after this list)
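To make that last point concrete, here is a minimal, hypothetical sketch of schema-on-read in plain Java: the raw line stays untyped in storage, and an analyst imposes a structure (here, a page-view record) only at the moment the data is read. The field layout and class names are illustrative assumptions, not part of any product.

    import java.util.Arrays;
    import java.util.List;

    public class SchemaOnReadSketch {

        // Hypothetical structure one analyst might impose for one specific use of the data.
        static class PageView {
            final String userId;
            final String url;
            final long timestampMillis;
            PageView(String userId, String url, long timestampMillis) {
                this.userId = userId;
                this.url = url;
                this.timestampMillis = timestampMillis;
            }
        }

        // Structure is applied at read time; the raw data in the cluster is never rewritten.
        static PageView parse(String rawLine) {
            String[] f = rawLine.split("\t");
            return new PageView(f[0], f[1], Long.parseLong(f[2]));
        }

        public static void main(String[] args) {
            List<String> rawLines = Arrays.asList(
                    "user42\t/products/123\t1405987200000",
                    "user7\t/checkout\t1405987260000");
            for (String line : rawLines) {
                PageView pv = parse(line);
                System.out.println(pv.userId + " viewed " + pv.url);
            }
        }
    }

A different team could parse the very same lines into an entirely different shape for its own analysis, which is exactly the flexibility the slide is pointing at.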

Page 12: Hadoop 2.0 - Solving the Data Quality Challenge


Key Points to Cover Today

•  Broad functionality across data processing domains

•  Validated ease of use, speed, match quality and superiority for party data

•  Hadoop 2.0/YARN certified – one of the first 17 companies to achieve certification

•  Not a repackaging of Hadoop 1.0 functionality: RedPoint Data Management is a pure YARN application (one of only two in the initial wave of certifications)

•  Building a complex job in RPDM takes a fraction of the time needed to write the same job in MapReduce, and requires none of the coding or Java skills

•  Big functional footprint without touching a line of code

•  Design model consistent with the data-flow paradigm

•  RPDM has a “zero-footprint” install in the Hadoop cluster

•  The same interface and functionality are available for both structured and unstructured data stores, so working across both is seamless from a user’s perspective

•  Data quality done completely within the cluster

Page 13: Hadoop 2.0 - Solving the Data Quality Challenge


Key features of RedPoint Data Management

•  ETL & ELT: profiling, reads/writes, transformations; a single project for all jobs

•  Data Quality: cleanse data; parsing and correction; geo-spatial analysis

•  Integration & Matching: grouping; fuzzy match (illustrated below)

•  Master Key Management: create keys; track changes; maintain matches over time

•  Web Services Integration: consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats

•  Process Automation & Operations: job scheduling, monitoring, notifications; central point of control

All functions can be used on both TRADITIONAL and BIG DATA.

Creates clean, integrated, actionable data – quickly, reliably and at low cost.
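The “fuzzy match” and “maintain matches over time” bullets above describe classic record-matching work. As a generic illustration only (RedPoint’s actual matching logic is proprietary), the sketch below groups name variants by Levenshtein edit distance; the metric, the threshold of 3 and the greedy grouping are all assumptions chosen for brevity.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class FuzzyMatchSketch {

        // Classic dynamic-programming edit distance between two strings.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // Greedy grouping: each record joins the first group whose exemplar is close enough.
        static List<List<String>> group(List<String> names, int maxDistance) {
            List<List<String>> groups = new ArrayList<>();
            for (String name : names) {
                List<String> home = null;
                for (List<String> g : groups) {
                    if (levenshtein(name.toLowerCase(), g.get(0).toLowerCase()) <= maxDistance) {
                        home = g;
                        break;
                    }
                }
                if (home == null) {
                    home = new ArrayList<>();
                    groups.add(home);
                }
                home.add(name);
            }
            return groups;
        }

        public static void main(String[] args) {
            List<String> names = Arrays.asList("Jon Smith", "John Smith", "J. Smyth", "Acme Corp");
            // Groups the three Smith variants together and leaves "Acme Corp" alone.
            System.out.println(group(names, 3));
        }
    }

A production master-key process would add the “create keys” and “track changes” steps on top of such groups, persisting a stable key per group as records arrive over time.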

Page 14: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Data Management on Hadoop

[Diagram: a Parallel Section defined in the UI is decomposed into a partitioning AM and tasks, an execution AM and tasks, data I/O, key/split handling and analysis, all running on YARN alongside MapReduce.]

Page 15: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Functional Footprint

[Diagram: the Hadoop 2.0 stack. Source data (sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory) arrives from databases, JMS queues and files, and is loaded into HDFS via Sqoop, Hive, WebHDFS, Flume and NFS; YARN coordinates data refinement (Hive, Pig), streaming, structure and interactive access (Hive Server2), with HCatalog providing metadata services; Ambari supplies monitoring and management; query, visualization, reporting and analytical tools and apps connect over REST and HTTP; RDBMS and EDW systems sit alongside. The slide overlays RedPoint's functional footprint on this stack.]

Page 16: Hadoop 2.0 - Solving the Data Quality Challenge


Benchmarks – Project Gutenberg

Sample MapReduce code (a small subset of the entire program, which totals nearly 150 lines):

    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Sample Pig script, without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

Comparison of the three approaches:

                  MapReduce                        Pig                                       RedPoint DM
    Code          >150 lines of MR code            ~50 lines of script code                  0 lines of code
    Development   6 hours                          3 hours                                   15 minutes
    Runtime       6 minutes                        15 minutes                                3 minutes
    Tuning        Extensive optimization needed    UDFs required prior to running script     No tuning or optimization required

Page 17: Hadoop 2.0 - Solving the Data Quality Challenge


Attributes of Information

RELEVANT – Information must pertain to a specific problem. General data must be connected to reveal the relevance of the information.

COMPLETE – Partial information is often worse than no information; it frequently leads to worse conclusions than if no data had been used at all.

ACCURATE – This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information.

CURRENT – As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale.

ECONOMICAL – There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but it is also what drives continued use if successful.

Page 18: Hadoop 2.0 - Solving the Data Quality Challenge


Reference Architecture for Matching in Hadoop

[Diagram: data sources feeding the matching architecture include CRM, ERP, billing, subscriber, product, network, weather, competitive, manufacturing, clickstream, online chat, sensor data, social media, call detail records, fabrication logs, and sales and field feedback.]

Page 19: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint DM for Hadoop: Processing Flow

[Diagram: (1) the Data Management Designer sends a Parallel Section to the DM Execution Server; (2) the execution server asks the YARN Resource Manager to launch the DM App Master; (3) the DM App Master launches DM Tasks on the Node Managers across the cluster, where the running DM Tasks execute the Parallel Section. A generic client-side sketch of this submission flow follows.]
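For readers unfamiliar with what being “a pure YARN application” means mechanically, here is a minimal sketch of the generic YARN 2.x submission flow the diagram implies, using the public YarnClient API. The application name and the com.example.DmAppMaster class are hypothetical stand-ins; RedPoint's actual App Master is proprietary.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class YarnSubmitSketch {
        public static void main(String[] args) throws Exception {
            YarnClient client = YarnClient.createYarnClient();
            client.init(new YarnConfiguration());
            client.start();

            // Step 1: ask the Resource Manager for a new application id.
            YarnClientApplication app = client.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("dm-parallel-section");  // hypothetical name

            // Step 2: describe the container that will run the application master.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "$JAVA_HOME/bin/java com.example.DmAppMaster"  // hypothetical AM class
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

            // Step 3: submit; once running, the AM requests its own task containers
            // from the Resource Manager, as in the diagram above.
            ApplicationId appId = client.submitApplication(ctx);
            System.out.println("Submitted application " + appId);
            client.stop();
        }
    }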

Page 20: Hadoop 2.0 - Solving the Data Quality Challenge


The Data Management Designer

Page 21: Hadoop 2.0 - Solving the Data Quality Challenge


DM Hadoop Settings

Page 22: Hadoop 2.0 - Solving the Data Quality Challenge


DM Parallel Section on Hadoop

Page 23: Hadoop 2.0 - Solving the Data Quality Challenge


Who Should Care

•  Companies interested in exploring the promise of big data analytics that need an easy way to get started

•  Companies already investing heavily in big data analytics technologies but stuck due to the shortage of skilled resources

•  Large organizations focused on “operational offloading” that need to achieve it cost-effectively

•  Companies that recognize that much of the data landing in Hadoop is external to the organization and needs data quality and proper data governance applied to it

Page 24: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Convergent Marketing Ecosystem

[Diagram: data inputs (SQL, NoSQL, social, enhancement) feed RedPoint Data Management, which spans analytics, marketing operations and Hadoop and provides geocoding, address standardization, GIS, email and web-services processing; analytics and machine learning (inbox analysis, segmentation, attribution) drive the marketing rules engine in RedPoint Interaction (CRM, trigger, audience, offer) across social, mobile, digital and real-time channels, backed by a cache.]

Page 25: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint real-time decisions: how it works (web site example)

[Diagram: a web site detects a personalization opportunity and makes an API call to the RedPoint execution environment; profile data and context data form a real-time profile, which rules and RedPoint machine learning combine with candidate content (with associated eligibility & scoring rules) to select the winning content; personalized content is returned to the site, with content stored in RedPoint or referenced from a CMS or other system; inbound personalizations are combined with outbound contacts to create a cross-channel interaction history that is updated and maintained over time; 1st-party customer data resides in database(s) and/or Hadoop. A hypothetical client-side call is sketched below.]
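To ground the flow above, here is a hedged sketch of what the web site's side of the API call might look like, using the standard Java 11 HTTP client. The endpoint URL and the JSON payload shape are invented for illustration; RedPoint's real decision API is not documented here.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PersonalizationCallSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical payload: visitor identity plus page context.
            String body = "{\"visitorId\":\"abc123\",\"context\":{\"page\":\"/products/123\"}}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/redpoint/decisions"))  // placeholder URL
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The response body would carry the "winning content" chosen by rules + machine learning.
            System.out.println(response.body());
        }
    }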

Page 26: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint vs. alternatives

The slide shows a checkmark for RedPoint and a cross for the alternatives on each of the following points:

•  Pure YARN, no MapReduce

•  Graphical UI, not code-based

•  All DQ/DI functions available

•  Executes in Hadoop, no data movement

•  Zero-footprint install, nothing in the cluster

•  Same product for Hadoop and database

•  Top rated for ease of use

Page 27: Hadoop 2.0 - Solving the Data Quality Challenge


Perceptions & Questions

Analyst: Dr. Claudia Imhoff

Page 28: Hadoop 2.0 - Solving the Data Quality Challenge

Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved

Solve your business puzzles with Intelligent Solutions


Data Quality in the Hadoop Age

By Claudia Imhoff, PhD Intelligent Solutions, Inc.

Boulder BI Brain Trust [email protected]

Page 29: Hadoop 2.0 - Solving the Data Quality Challenge


Claudia Imhoff


President and Founder Intelligent Solutions, Inc.

A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines. She is also the Founder of the Boulder BI Brain Trust (BBBT), an international consortium of independent analysts and experts. You can follow them on Twitter at #BBBT or become a subscriber at www.bbbt.us. Email: [email protected]

Phone: 303-444-6650 Twitter: Claudia_Imhoff

Page 30: Hadoop 2.0 - Solving the Data Quality Challenge


Agenda

§  Extending the Data Warehouse Architecture

§  Things to Ponder…

Page 31: Hadoop 2.0 - Solving the Data Quality Challenge


Next Generation BI

Based on a concept by Shree Dandekar of Dell; slide compliments of Colin White – BI Research, Inc.

[Diagram: drivers (new business insights, reduced costs) combine with technologies (new technologies, enhanced data management, advanced analytics, new deployment options) to produce next-generation BI.]

Page 32: Hadoop 2.0 - Solving the Data Quality Challenge


Systems of Record

§  Remember – it all starts here!

§  Transactional systems generate most of the data used for all other activities – operational processes, BI & analytical capabilities, etc.

§  The point here is a reminder:

§  Extend OLTP systems of record as a “key” source of data

§  Many companies do not (or cannot) leverage data they already have in their operational systems

[Diagram: operational systems with RT BI services, alongside other internal & external structured & multi-structured data and real-time streaming data.]

Page 33: Hadoop 2.0 - Solving the Data Quality Challenge


Next Generation – Extended Data Warehouse Architecture (XDW)

[Diagram: the extended data warehouse architecture surrounds the traditional EDW environment with an investigative computing platform, a data refinery and an RT analysis platform; a data integration platform feeds the EDW, analytic tools & applications consume from it, and the operational real-time environment (operational systems plus RT BI services) handles real-time streaming data and other internal & external structured & multi-structured data. Slide created by Colin White – BI Research, Inc.]

Page 34: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Traditional EDW

Most BI environments today:

§  New technologies can be incorporated into the EDW environment to improve performance and efficiency, and reduce costs

Use cases:

§  Production reporting (data quality sensitive)

§  Historical comparisons

§  Customer analysis (next best offer, segmentation, life-time value scores, churn analysis, etc.)

§  KPI calculations

§  Profitability analysis

§  Forecasting

[Diagram: the data integration platform feeds the traditional EDW environment, which serves analytic tools & applications; operational systems and RT BI services exchange real-time models & rules with the EDW.]

Page 35: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  The EDW is now the “production” analytical environment

§  It produces the standard reports, comparisons and analytics used as the final word on situations

§  Data must be integrated as much as possible

§  Data must be run through the data quality grist mill

§  There must be a full audit trail from source to ultimate report, analytic, etc.

Page 36: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Data Refinery

Ingests raw detailed data in batch and/or real time into a managed data store (lake, hub, swamp, dump…)

Distills the data into useful business information and distributes the results to downstream systems

May also directly analyze some data

Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost-effectively

Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction

[Diagram: the data refinery sits on the data integration platform between the traditional EDW environment and the investigative computing platform.]

Page 37: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  This is not a data dumping ground! §  It should be monitored and assessed as to the data integration and

quality needs

§  Just because you can store massive sets of data doesn’t mean it is ignored or assumed to not need governance

§  Nor does it mean that there is no need for a business case for the massive amount of data §  If analytic accuracy is at 99% using 45% of the data, why deal with

all of it?

§  But speed of integration and quality processing is also important

37

Page 38: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Investigative Computing

New technologies used here include:

§  Hadoop, in-memory computing, columnar storage, data compression, appliances, etc.

Use cases:

§  Data mining and predictive modeling for EDW and real-time environments

§  Cause-and-effect analysis

§  Data exploration (“Did this ever happen?” “How often?”)

§  Pattern analysis

§  General, unplanned investigations of data

[Diagram: the investigative computing platform draws on the data refinery and data integration platform, serves analytic tools & applications, and feeds the operational real-time environment (operational systems, RT BI services, RT analysis platform).]

Page 39: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  Much more experimental in nature – lots of queries with null results

§  Analytics may be approximations §  Data integration may be needed for some data, not for

other §  Data quality also varies in terms of what data must go

through DQ process §  Difficulty is in determining what get integrated and run

through data quality processing

39

Page 40: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Real Time Operational Environment

Embedded or callable BI services:

§  Real-time fraud detection

§  Real-time loan risk assessment

§  Optimizing online promotions

§  Location-based offers

§  Contact center optimization

§  Supply chain optimization

Real-time analysis engine:

§  Traffic flow optimization

§  Web event analysis

§  Natural resource exploration analysis

§  Stock trading analysis

§  Risk analysis

§  Correlation of unrelated data streams (e.g., weather effects on product sales)

[Diagram: the operational real-time environment (operational systems, RT BI services, RT analysis platform) consumes real-time streaming data and other internal & external structured & multi-structured data.]

Page 41: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  Because of operational nature, data must be as good as it can possibly be

§  Data may or may not bee integrated with other operational systems’ data

§  False positives and negatives to models must be reconciled as quickly as possible

§  But speed of integration and quality processing is of the utmost importance!

41

Page 42: Hadoop 2.0 - Solving the Data Quality Challenge


All Components Must Work Together

[Diagram: operational systems supply existing customer data to the data refinery, enriched with 3rd-party, location and social data; the investigative computing platform builds analytic models, analytic tools & apps run analyses against the traditional EDW environment, and the RT analysis platform delivers the next best customer offer to a call center dashboard or web event stream, with feedback flowing back into the environment. Slide created by Colin White – BI Research, Inc.]

Page 43: Hadoop 2.0 - Solving the Data Quality Challenge


Agenda

§  Extending the Data Warehouse Architecture

§  Things to Ponder…

Page 44: Hadoop 2.0 - Solving the Data Quality Challenge


What Makes People Think These Have Gone Away?

§  Data Redundancy

§  Each system, application and department in the enterprise collects its own version of key business entities and attributes

§  Data Inconsistency

§  Enormous resources (time, money and people) are spent on reconciliation because of fractured data

§  Business Inefficiency

§  Fractured data generates business inefficiency – low productivity, inefficient supply chain management, customer dissatisfaction, wasted marketing efforts

§  Business Change

§  Organizations are constantly changing, and these disruptive events cause a constant stream of changes to data

Page 45: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Challenges

§  Cultural Hurdles

§  Generating a business case and obtaining executive backing and funding

§  Requires a phased approach to quality deployment

§  Overcoming political barriers

§  E.g., moving from an enterprise view to an LOB/parochial view of quality, yet still agreeing on common business definitions

Page 46: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Challenges

§  Technology Challenges

§  Unusual sources of data

§  Creating a flexible data governance model

§  Supporting complex & constantly changing data

§  Providing a flexible data integration infrastructure

§  Wild West mentality…

Page 47: Hadoop 2.0 - Solving the Data Quality Challenge


Data Governance and Data Quality is Changing

§  People using BI must “trust” the data

§  IT must work with the business to create certified data sets

§  Note: not all data must be certified, but all data usage must be documented and monitored

§  Governance still has an important role

§  Determine whether the data used is “governed” (e.g., in a data warehouse or MDM environment) or “ungoverned” (e.g., individual spreadsheets, external sources)

§  The difficulty is figuring out the differences – hence the need to monitor data usage

§  IT must have a monitoring or oversight capability

§  Note: LOB IT or experienced information producers may have to take on some previously traditional central IT roles

Page 48: Hadoop 2.0 - Solving the Data Quality Challenge


Questions

§  What are the biggest challenges for data quality in the Hadoop age?

§  How do you justify the need for integration and quality processing in the “age of hurry up and give me the data”?

§  Not all data needs to be cleaned up and integrated but how do people determine what does and doesn’t?

§  What tips can you give us to help get the time, resources and funding for DQ in the refinery?

§  Technologically speaking, what is different about the Hadoop environment versus a traditional RDBMS one?

§  Who sponsors/is responsible for the data quality/integration effort in the age of Hadoop?


Page 49: Hadoop 2.0 - Solving the Data Quality Challenge

Twitter Tag: #briefr

The Briefing Room

Page 50: Hadoop 2.0 - Solving the Data Quality Challenge


Upcoming Topics

www.insideanalysis.com

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: INNOVATIVE TECHNOLOGY

August: BIG DATA ECOSYSTEM

September: INTEGRATION

Page 51: Hadoop 2.0 - Solving the Data Quality Challenge


THANK YOU for your ATTENTION!