Hadoop 2.0 - Solving the Data Quality Challenge

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Description

The Briefing Room with Dr. Claudia Imhoff and RedPoint Global. Live Webcast on July 22, 2014. Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7bb4cbc33402c3b5f649343052cb9a6d

Whether data is big or small, quality remains the critical characteristic. While traditional approaches to cleansing data have made strides, data quality remains a serious hurdle for all organizations. This is especially true for identity resolution in customer data, but also for a range of other data sets, including social, supply chain, financial and other domains. One of the most promising approaches to solving this decades-old challenge incorporates the power of massively parallel processing, a la Hadoop.

Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Claudia Imhoff, who will explain how Hadoop 2.0 and its YARN architecture can make a serious impact on the previously intractable problem of data quality. She’ll be briefed by George Corugedo of RedPoint Global, who will show how his company’s platform can serve as a super-charged marshaling area for accessing, cleansing and delivering high-quality data. He’ll explain how RedPoint was one of the first applications to be certified to run on YARN, the latest rendition of the now-ubiquitous Hadoop. Visit InsideAnalysis.com for more information.

Transcript of Hadoop 2.0 - Solving the Data Quality Challenge

Page 1: Hadoop 2.0 - Solving the Data Quality Challenge

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: Hadoop 2.0 - Solving the Data Quality Challenge

The Briefing Room

Hadoop 2.0: Solving the Data Quality Challenge

Page 3: Hadoop 2.0 - Solving the Data Quality Challenge

Twitter Tag: #briefr

The Briefing Room

Welcome

Host: Eric Kavanagh

[email protected] @eric_kavanagh

Page 4: Hadoop 2.0 - Solving the Data Quality Challenge


Mission

•  Reveal the essential characteristics of enterprise software, good and bad

•  Provide a forum for detailed analysis of today’s innovative technologies

•  Give vendors a chance to explain their product to savvy analysts

•  Allow audience members to pose serious questions... and get answers!

Page 5: Hadoop 2.0 - Solving the Data Quality Challenge


Topics

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: INNOVATIVE TECHNOLOGY

August: BIG DATA ECOSYSTEM

September: INTEGRATION

Page 6: Hadoop 2.0 - Solving the Data Quality Challenge
Page 7: Hadoop 2.0 - Solving the Data Quality Challenge


Analyst: Dr. Claudia Imhoff

Claudia Imhoff is President & Founder of Intelligent Solutions, Inc.

Page 8: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Global

•  RedPoint Global is a data management and integrated marketing technology company.

•  Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration.

•  RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.

Page 9: Hadoop 2.0 - Solving the Data Quality Challenge


Guest: George Corugedo

George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.

Page 10: Hadoop 2.0 - Solving the Data Quality Challenge

The Neglected Discipline of Data Quality in Hadoop (July 2014)

Page 11: Hadoop 2.0 - Solving the Data Quality Challenge

© RedPoint Global Inc. 2014 Confidential

Overview – Challenges to Adoption

Skills Gap

•  Severe shortage of MapReduce-skilled resources

•  Very expensive resources that are hard to retain

•  Inconsistent skills lead to inconsistent results

•  Underutilizes existing resources

•  Prevents broad leverage of investments across the enterprise

Maturity & Governance

•  A nascent technology ecosystem around Hadoop

•  Emerging technologies only address narrow slivers of functionality

•  New applications are not enterprise class

•  Legacy applications have built only short-term capabilities

Data Into Information

•  Data is not useful in its raw state; it must be turned into information

•  The benefit of Hadoop is that the same data can be used from many perspectives

•  Analysts must now structure the data based on its intended use (see the sketch after this list)
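To make that last point concrete, here is a minimal, hypothetical sketch of schema-on-read in plain Java: the raw line stays untyped in storage, and an analyst imposes a structure (here, a page-view record) only at the moment the data is read. The field layout and class names are illustrative assumptions, not part of any product.

    import java.util.Arrays;
    import java.util.List;

    public class SchemaOnReadSketch {

        // Hypothetical structure one analyst might impose for one specific use of the data.
        static class PageView {
            final String userId;
            final String url;
            final long timestampMillis;
            PageView(String userId, String url, long timestampMillis) {
                this.userId = userId;
                this.url = url;
                this.timestampMillis = timestampMillis;
            }
        }

        // Structure is applied at read time; the raw data in the cluster is never rewritten.
        static PageView parse(String rawLine) {
            String[] f = rawLine.split("\t");
            return new PageView(f[0], f[1], Long.parseLong(f[2]));
        }

        public static void main(String[] args) {
            List<String> rawLines = Arrays.asList(
                    "user42\t/products/123\t1405987200000",
                    "user7\t/checkout\t1405987260000");
            for (String line : rawLines) {
                PageView pv = parse(line);
                System.out.println(pv.userId + " viewed " + pv.url);
            }
        }
    }

A different team could parse the very same lines into an entirely different shape for its own analysis, which is exactly the flexibility the slide is pointing at.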

Page 12: Hadoop 2.0 - Solving the Data Quality Challenge


Key Points to Cover Today

•  Broad functionality across data processing domains

•  Validated ease of use, speed, match quality and superiority for party data

•  Hadoop 2.0/YARN certified – one of the first 17 companies to achieve certification

•  Not a repackaging of Hadoop 1.0 functionality: RedPoint Data Management is a pure YARN application (one of only two in the initial wave of certifications)

•  Building a complex job in RPDM takes a fraction of the time needed to write the same job in MapReduce, and requires none of the coding or Java skills

•  Big functional footprint without touching a line of code

•  Design model consistent with the data-flow paradigm

•  RPDM has a “zero-footprint” install in the Hadoop cluster

•  The same interface and functionality are available for both structured and unstructured data stores, so working across both is seamless from a user’s perspective

•  Data quality done completely within the cluster

Page 13: Hadoop 2.0 - Solving the Data Quality Challenge


Key features of RedPoint Data Management

•  ETL & ELT: profiling, reads/writes, transformations; a single project for all jobs

•  Data Quality: cleanse data; parsing and correction; geo-spatial analysis

•  Integration & Matching: grouping; fuzzy match (illustrated below)

•  Master Key Management: create keys; track changes; maintain matches over time

•  Web Services Integration: consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats

•  Process Automation & Operations: job scheduling, monitoring, notifications; central point of control

All functions can be used on both TRADITIONAL and BIG DATA.

Creates clean, integrated, actionable data – quickly, reliably and at low cost.
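The “fuzzy match” and “maintain matches over time” bullets above describe classic record-matching work. As a generic illustration only (RedPoint’s actual matching logic is proprietary), the sketch below groups name variants by Levenshtein edit distance; the metric, the threshold of 3 and the greedy grouping are all assumptions chosen for brevity.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class FuzzyMatchSketch {

        // Classic dynamic-programming edit distance between two strings.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // Greedy grouping: each record joins the first group whose exemplar is close enough.
        static List<List<String>> group(List<String> names, int maxDistance) {
            List<List<String>> groups = new ArrayList<>();
            for (String name : names) {
                List<String> home = null;
                for (List<String> g : groups) {
                    if (levenshtein(name.toLowerCase(), g.get(0).toLowerCase()) <= maxDistance) {
                        home = g;
                        break;
                    }
                }
                if (home == null) {
                    home = new ArrayList<>();
                    groups.add(home);
                }
                home.add(name);
            }
            return groups;
        }

        public static void main(String[] args) {
            List<String> names = Arrays.asList("Jon Smith", "John Smith", "J. Smyth", "Acme Corp");
            // Groups the three Smith variants together and leaves "Acme Corp" alone.
            System.out.println(group(names, 3));
        }
    }

A production master-key process would add the “create keys” and “track changes” steps on top of such groups, persisting a stable key per group as records arrive over time.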

Page 14: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Data Management on Hadoop

[Diagram: a Parallel Section defined in the UI is decomposed into a partitioning AM and tasks, an execution AM and tasks, data I/O, key/split handling and analysis, all running on YARN alongside MapReduce.]

Page 15: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Functional Footprint

[Diagram: the Hadoop 2.0 stack. Source data (sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory) arrives from databases, JMS queues and files, and is loaded into HDFS via Sqoop, Hive, WebHDFS, Flume and NFS; YARN coordinates data refinement (Hive, Pig), streaming, structure and interactive access (Hive Server2), with HCatalog providing metadata services; Ambari supplies monitoring and management; query, visualization, reporting and analytical tools and apps connect over REST and HTTP; RDBMS and EDW systems sit alongside. The slide overlays RedPoint's functional footprint on this stack.]

Page 16: Hadoop 2.0 - Solving the Data Quality Challenge


Benchmarks – Project Gutenberg

Sample MapReduce code (a small subset of the entire program, which totals nearly 150 lines):

    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Sample Pig script, without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

Comparison of the three approaches:

                  MapReduce                        Pig                                       RedPoint DM
    Code          >150 lines of MR code            ~50 lines of script code                  0 lines of code
    Development   6 hours                          3 hours                                   15 minutes
    Runtime       6 minutes                        15 minutes                                3 minutes
    Tuning        Extensive optimization needed    UDFs required prior to running script     No tuning or optimization required

Page 17: Hadoop 2.0 - Solving the Data Quality Challenge


Attributes of Information

RELEVANT – Information must pertain to a specific problem. General data must be connected to reveal the relevance of the information.

COMPLETE – Partial information is often worse than no information; it frequently leads to worse conclusions than if no data had been used at all.

ACCURATE – This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information.

CURRENT – As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale.

ECONOMICAL – There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but it is also what drives continued use if successful.

Page 18: Hadoop 2.0 - Solving the Data Quality Challenge


Reference Architecture for Matching in Hadoop

[Diagram: data sources feeding the matching architecture include CRM, ERP, billing, subscriber, product, network, weather, competitive, manufacturing, clickstream, online chat, sensor data, social media, call detail records, fabrication logs, and sales and field feedback.]

Page 19: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint DM for Hadoop: Processing Flow

[Diagram: (1) the Data Management Designer sends a Parallel Section to the DM Execution Server; (2) the execution server asks the YARN Resource Manager to launch the DM App Master; (3) the DM App Master launches DM Tasks on the Node Managers across the cluster, where the running DM Tasks execute the Parallel Section. A generic client-side sketch of this submission flow follows.]
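For readers unfamiliar with what being “a pure YARN application” means mechanically, here is a minimal sketch of the generic YARN 2.x submission flow the diagram implies, using the public YarnClient API. The application name and the com.example.DmAppMaster class are hypothetical stand-ins; RedPoint's actual App Master is proprietary.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class YarnSubmitSketch {
        public static void main(String[] args) throws Exception {
            YarnClient client = YarnClient.createYarnClient();
            client.init(new YarnConfiguration());
            client.start();

            // Step 1: ask the Resource Manager for a new application id.
            YarnClientApplication app = client.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("dm-parallel-section");  // hypothetical name

            // Step 2: describe the container that will run the application master.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "$JAVA_HOME/bin/java com.example.DmAppMaster"  // hypothetical AM class
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

            // Step 3: submit; once running, the AM requests its own task containers
            // from the Resource Manager, as in the diagram above.
            ApplicationId appId = client.submitApplication(ctx);
            System.out.println("Submitted application " + appId);
            client.stop();
        }
    }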

Page 20: Hadoop 2.0 - Solving the Data Quality Challenge


The Data Management Designer

Page 21: Hadoop 2.0 - Solving the Data Quality Challenge


DM Hadoop Settings

Page 22: Hadoop 2.0 - Solving the Data Quality Challenge


DM Parallel Section on Hadoop

Page 23: Hadoop 2.0 - Solving the Data Quality Challenge


Who Should Care

•  Companies interested in exploring the promise of big data analytics that need an easy way to get started

•  Companies already investing heavily in big data analytics technologies but stuck due to the shortage of skilled resources

•  Large organizations focused on “operational offloading” that need to achieve it cost-effectively

•  Companies that recognize that much of the data landing in Hadoop is external to the organization and needs data quality and proper data governance applied to it

Page 24: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint Convergent Marketing Ecosystem

[Diagram: data inputs (SQL, NoSQL, social, enhancement) feed RedPoint Data Management, which spans analytics, marketing operations and Hadoop and provides geocoding, address standardization, GIS, email and web-services processing; analytics and machine learning (inbox analysis, segmentation, attribution) drive the marketing rules engine in RedPoint Interaction (CRM, trigger, audience, offer) across social, mobile, digital and real-time channels, backed by a cache.]

Page 25: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint real-time decisions: how it works (web site example)

[Diagram: a web site detects a personalization opportunity and makes an API call to the RedPoint execution environment; profile data and context data form a real-time profile, which rules and RedPoint machine learning combine with candidate content (with associated eligibility & scoring rules) to select the winning content; personalized content is returned to the site, with content stored in RedPoint or referenced from a CMS or other system; inbound personalizations are combined with outbound contacts to create a cross-channel interaction history that is updated and maintained over time; 1st-party customer data resides in database(s) and/or Hadoop. A hypothetical client-side call is sketched below.]
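To ground the flow above, here is a hedged sketch of what the web site's side of the API call might look like, using the standard Java 11 HTTP client. The endpoint URL and the JSON payload shape are invented for illustration; RedPoint's real decision API is not documented here.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PersonalizationCallSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical payload: visitor identity plus page context.
            String body = "{\"visitorId\":\"abc123\",\"context\":{\"page\":\"/products/123\"}}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/redpoint/decisions"))  // placeholder URL
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The response body would carry the "winning content" chosen by rules + machine learning.
            System.out.println(response.body());
        }
    }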

Page 26: Hadoop 2.0 - Solving the Data Quality Challenge


RedPoint vs. alternatives

The slide shows a checkmark for RedPoint and a cross for the alternatives on each of the following points:

•  Pure YARN, no MapReduce

•  Graphical UI, not code-based

•  All DQ/DI functions available

•  Executes in Hadoop, no data movement

•  Zero-footprint install, nothing in the cluster

•  Same product for Hadoop and database

•  Top rated for ease of use

Page 27: Hadoop 2.0 - Solving the Data Quality Challenge


Perceptions & Questions

Analyst: Dr. Claudia Imhoff

Page 28: Hadoop 2.0 - Solving the Data Quality Challenge

Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved

Solve your business puzzles with Intelligent Solutions


Data Quality in the Hadoop Age

By Claudia Imhoff, PhD Intelligent Solutions, Inc.

Boulder BI Brain Trust [email protected]

Page 29: Hadoop 2.0 - Solving the Data Quality Challenge


Claudia Imhoff


President and Founder Intelligent Solutions, Inc.

A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines. She is also the Founder of the Boulder BI Brain Trust (BBBT), an international consortium of independent analysts and experts. You can follow them on Twitter at #BBBT or become a subscriber at www.bbbt.us. Email: [email protected]

Phone: 303-444-6650 Twitter: Claudia_Imhoff

Page 30: Hadoop 2.0 - Solving the Data Quality Challenge


Agenda

§  Extending the Data Warehouse Architecture

§  Things to Ponder…

Page 31: Hadoop 2.0 - Solving the Data Quality Challenge


Next Generation BI

Based on a concept by Shree Dandekar of Dell; slide compliments of Colin White – BI Research, Inc.

[Diagram: drivers (new business insights, reduced costs) combine with technologies (new technologies, enhanced data management, advanced analytics, new deployment options) to produce next-generation BI.]

Page 32: Hadoop 2.0 - Solving the Data Quality Challenge


Systems of Record

§  Remember – it all starts here!

§  Transactional systems generate most of the data used for all other activities – operational processes, BI & analytical capabilities, etc.

§  The point here is a reminder:

§  Extend OLTP systems of record as a “key” source of data

§  Many companies do not (or cannot) leverage data they already have in their operational systems

[Diagram: operational systems with RT BI services, alongside other internal & external structured & multi-structured data and real-time streaming data.]

Page 33: Hadoop 2.0 - Solving the Data Quality Challenge


Next Generation – Extended Data Warehouse Architecture (XDW)

[Diagram: the extended data warehouse architecture surrounds the traditional EDW environment with an investigative computing platform, a data refinery and an RT analysis platform; a data integration platform feeds the EDW, analytic tools & applications consume from it, and the operational real-time environment (operational systems plus RT BI services) handles real-time streaming data and other internal & external structured & multi-structured data. Slide created by Colin White – BI Research, Inc.]

Page 34: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Traditional EDW

Most BI environments today:

§  New technologies can be incorporated into the EDW environment to improve performance and efficiency, and reduce costs

Use cases:

§  Production reporting (data quality sensitive)

§  Historical comparisons

§  Customer analysis (next best offer, segmentation, life-time value scores, churn analysis, etc.)

§  KPI calculations

§  Profitability analysis

§  Forecasting

[Diagram: the data integration platform feeds the traditional EDW environment, which serves analytic tools & applications; operational systems and RT BI services exchange real-time models & rules with the EDW.]

Page 35: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  The EDW is now the “production” analytical environment

§  It produces the standard reports, comparisons and analytics used as the final word on situations

§  Data must be integrated as much as possible

§  Data must be run through the data quality grist mill

§  There must be a full audit trail from source to ultimate report, analytic, etc.

Page 36: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Data Refinery

Ingests raw detailed data in batch and/or real time into a managed data store (lake, hub, swamp, dump…)

Distills the data into useful business information and distributes the results to downstream systems

May also directly analyze some data

Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost-effectively

Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction

[Diagram: the data refinery sits on the data integration platform between the traditional EDW environment and the investigative computing platform.]

Page 37: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  This is not a data dumping ground! §  It should be monitored and assessed as to the data integration and

quality needs

§  Just because you can store massive sets of data doesn’t mean it is ignored or assumed to not need governance

§  Nor does it mean that there is no need for a business case for the massive amount of data §  If analytic accuracy is at 99% using 45% of the data, why deal with

all of it?

§  But speed of integration and quality processing is also important

37

Page 38: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Investigative Computing

New technologies used here include:

§  Hadoop, in-memory computing, columnar storage, data compression, appliances, etc.

Use cases:

§  Data mining and predictive modeling for EDW and real-time environments

§  Cause-and-effect analysis

§  Data exploration (“Did this ever happen?” “How often?”)

§  Pattern analysis

§  General, unplanned investigations of data

[Diagram: the investigative computing platform draws on the data refinery and data integration platform, serves analytic tools & applications, and feeds the operational real-time environment (operational systems, RT BI services, RT analysis platform).]

Page 39: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  Much more experimental in nature – lots of queries with null results

§  Analytics may be approximations §  Data integration may be needed for some data, not for

other §  Data quality also varies in terms of what data must go

through DQ process §  Difficulty is in determining what get integrated and run

through data quality processing

39

Page 40: Hadoop 2.0 - Solving the Data Quality Challenge


Use Case: Real Time Operational Environment

Embedded or callable BI services:

§  Real-time fraud detection

§  Real-time loan risk assessment

§  Optimizing online promotions

§  Location-based offers

§  Contact center optimization

§  Supply chain optimization

Real-time analysis engine:

§  Traffic flow optimization

§  Web event analysis

§  Natural resource exploration analysis

§  Stock trading analysis

§  Risk analysis

§  Correlation of unrelated data streams (e.g., weather effects on product sales)

[Diagram: the operational real-time environment (operational systems, RT BI services, RT analysis platform) consumes real-time streaming data and other internal & external structured & multi-structured data.]

Page 41: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Needed

§  Because of operational nature, data must be as good as it can possibly be

§  Data may or may not bee integrated with other operational systems’ data

§  False positives and negatives to models must be reconciled as quickly as possible

§  But speed of integration and quality processing is of the utmost importance!

41

Page 42: Hadoop 2.0 - Solving the Data Quality Challenge


All Components Must Work Together

[Diagram: operational systems supply existing customer data to the data refinery, enriched with 3rd-party, location and social data; the investigative computing platform builds analytic models, analytic tools & apps run analyses against the traditional EDW environment, and the RT analysis platform delivers the next best customer offer to a call center dashboard or web event stream, with feedback flowing back into the environment. Slide created by Colin White – BI Research, Inc.]

Page 43: Hadoop 2.0 - Solving the Data Quality Challenge


Agenda

§  Extending the Data Warehouse Architecture

§  Things to Ponder…

Page 44: Hadoop 2.0 - Solving the Data Quality Challenge


What Makes People Think These Have Gone Away?

§  Data Redundancy

§  Each system, application and department in the enterprise collects its own version of key business entities and attributes

§  Data Inconsistency

§  Enormous resources (time, money and people) are spent on reconciliation because of fractured data

§  Business Inefficiency

§  Fractured data generates business inefficiency – low productivity, inefficient supply chain management, customer dissatisfaction, wasted marketing efforts

§  Business Change

§  Organizations are constantly changing, and these disruptive events cause a constant stream of changes to data

Page 45: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Challenges

§  Cultural Hurdles

§  Generating a business case and obtaining executive backing and funding

§  Requires a phased approach to quality deployment

§  Overcoming political barriers

§  E.g., moving from an enterprise view to an LOB/parochial view of quality, yet still agreeing on common business definitions

Page 46: Hadoop 2.0 - Solving the Data Quality Challenge


Data Quality Challenges

§  Technology Challenges

§  Unusual sources of data

§  Creating a flexible data governance model

§  Supporting complex & constantly changing data

§  Providing a flexible data integration infrastructure

§  Wild West mentality…

Page 47: Hadoop 2.0 - Solving the Data Quality Challenge


Data Governance and Data Quality is Changing

§  People using BI must “trust” the data

§  IT must work with the business to create certified data sets

§  Note: not all data must be certified, but all data usage must be documented and monitored

§  Governance still has an important role

§  Determine whether the data used is “governed” (e.g., in a data warehouse or MDM environment) or “ungoverned” (e.g., individual spreadsheets, external sources)

§  The difficulty is figuring out the differences – hence the need to monitor data usage

§  IT must have a monitoring or oversight capability

§  Note: LOB IT or experienced information producers may have to take on some previously traditional central IT roles

Page 48: Hadoop 2.0 - Solving the Data Quality Challenge


Questions

§  What are the biggest challenges for data quality in the Hadoop age?

§  How do you justify the need for integration and quality processing in the “age of hurry up and give me the data”?

§  Not all data needs to be cleaned up and integrated but how do people determine what does and doesn’t?

§  What tips can you give us to help get the time, resources and funding for DQ in the refinery?

§  Technologically speaking, what is different about the Hadoop environment versus a traditional RDBMS one?

§  Who sponsors/is responsible for the data quality/integration effort in the age of Hadoop?


Page 49: Hadoop 2.0 - Solving the Data Quality Challenge

Twitter Tag: #briefr

The Briefing Room

Page 50: Hadoop 2.0 - Solving the Data Quality Challenge


Upcoming Topics

www.insideanalysis.com

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: INNOVATIVE TECHNOLOGY

August: BIG DATA ECOSYSTEM

September: INTEGRATION

Page 51: Hadoop 2.0 - Solving the Data Quality Challenge


THANK YOU for your ATTENTION!