Applying Data Quality Best Practices at Big Data Scale
Harald Smith, Director of Product Management
Michael Urbonas, Director of Product Marketing
Speakers
Mike Urbonas, Director of Product Marketing, Syncsort
15 years of software experience, including:
– BI/DW & data visualization
– Data management & ETL
– Text analytics
– Enterprise search
Harald Smith, Director of Product Management, Syncsort
20 years in Information Management, including data quality, integration, and governance
– Consulting, product management, software & solution development
Co-author of Patterns of Information Management, as well as two Redbooks on Information Governance and Data Integration
Today’s agenda
Problem: Huge Big Data investments, scarce Big Data trust
– New insights from the Syncsort 2017 Big Data Trends survey
– Root causes of Data Lake distrust
Sample use cases at Big Data scale
– 360-degree view of the customer, product, or other core entity
– Anti-fraud
Solution: Bringing Data Quality best practices into the Data Lake
– “Design once, deploy anywhere” with the Syncsort/Trillium technology approach
– “Intelligent execution” to leverage the strength of the Big Data platform
Nobody wants a data swamp!
“This sure looked a lot nicer on the whiteboard…”
Key problem: Big Data deemed untrustworthy by business managers and leaders
– Only 33% of senior execs have a high level of trust in the accuracy of their Big Data analytics ~ KPMG 2016
– 59% of global execs do not believe their company has capabilities to generate meaningful business insights from their data ~ Bain 2015
– 85% of global execs say major investments are required to update their existing data platform, including data cleaning and consolidating ~ Bain 2015
Fresh insights from the Syncsort 2017 Big Data Trends survey
Data Quality is recognized as a mission-critical data lake success factor
– Data Quality tops the list of challenges of data lake implementation, followed closely by Data Governance
But… not everyone is making the connection between Data Quality and Big Data success
– Participants who did not include data quality as a top-3 priority for implementing the data lake expressed the most interest in analytically intensive data lake uses… which are highly dependent on proper data quality
Financial services and insurance industry is most focused on Data Quality and Data Governance
– Named Data Quality as a top priority 50% more often than participants from other industries
– Also identified Data Governance as a top priority at more than twice the rate of those from other industries
Root causes of Big Data mistrust
Are these numbers accurate? Are calculations using correctly aggregated data?
Is this data current? When was it last updated?
Are these terms consistent with our business definitions?
Can I trust this data enough to make key decisions and/or allow the data to be used in real-time?
Did we include all of the data we should have? Are additional data sources missing?
Root causes of Big Data mistrust… examples
False Assumptions
– Pinterest targeted marketing campaign mistakenly congratulated single women on upcoming weddings...
Miscoded/Misinterpreted Data
– Predictive analysis falsely found call center workers without a HS diploma were 3x more likely to remain on board for at least 9 months…
Duplicate Data
– Fraud examination revealed massive import tariff evasion on eggs, only to find there was no case to crack…
Sample use cases at Big Data scale…
360 view of customer (or product, or other key entity)
Is the Data Lake essential for this use case?
– YES… Purpose of customer 360 is to optimize customer experience management
– Increasingly broad spectrum of data sources involved in and required for effectively personalizing customer experiences and targeted marketing offers
What types of data?
– Internal sources – often many/overlapping
– 3rd party data – demographics
– Suppression data – keeping customer information updated
– New sources – mobile, social media
Internal Data: Customer Master Data, Point-of-Sale Data, Contact Form Data, Loyalty Program Data, eCommerce Data, Customer Service Data
Suppression Data: Change of Address, Mortality, Do Not Call
Third-Party Data: Age, Occupation, Education, Gender, Income, Geographic
Sample use cases at Big Data scale…
Anti-Fraud/Anti-Money Laundering
Is the Data Lake essential for this use case?
– YES… Fraudulent transaction detection requires huge volumes of customer profile data, recent transaction activity with “last known” values, device data with geolocation and time-based tagging, and 3rd party news/alerts
– Data used to refine Machine Learning models (e.g., anomaly detection, implausible behavior analysis) to review new transactions in real time (a minimal anomaly-scoring sketch follows the data lists below)
What types of data?
– Internal sources – often many/overlapping
– Suppression data – keeping customer information updated
– Mobile data – devices, locations
– New sources – social media, 3rd party data, …
Internal Data: Customer Master Data, Point-of-Sale Data, Contact Form Data, Loyalty Program Data, eCommerce Data, Customer Service Data
Mobile Data: Device, Location, Wearables, Mobile wallets
Suppression Data: Change of Address, Mortality, Do Not Call
Social Data: Sentiment, Opinions, Interests, Social handles
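As an illustration of the kind of anomaly detection mentioned above, here is a minimal, hypothetical sketch that flags transactions whose amounts deviate sharply from a customer's own recent history. The column names, values, and threshold are assumptions for illustration, not part of the Syncsort/Trillium material.

```python
import pandas as pd

# Hypothetical transaction data; column names and values are illustrative only.
txns = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C1", "C2", "C2", "C2"],
    "amount":      [25.0, 30.0, 27.5, 950.0, 110.0, 95.0, 105.0],
})

def robust_zscore(s: pd.Series) -> pd.Series:
    """Score each value against the group's median using the median absolute
    deviation (MAD), so a single extreme value does not mask itself."""
    med = s.median()
    mad = (s - med).abs().median()
    return 0.6745 * (s - med) / (mad if mad else 1.0)

txns["score"] = txns.groupby("customer_id")["amount"].transform(robust_zscore)

# Flag candidates for review; the cutoff of 3 is an arbitrary example threshold.
print(txns[txns["score"].abs() > 3])
```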
The Fundamental Data Quality Question: What are you trying to do?
“Never lead with a data set; lead with a question.”
~ Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017: “The Data Differentiator”
“If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?”
~ Wolf Ruzicka, Chairman of the Board at EastBanc Technologies
Blog post, June 1, 2017: “Grow A Data Tree Out Of The ‘Big Data’ Swamp”
Understanding Data Quality best practices: Where to start?
Establishing Scope
Asking the “right questions” about your data (not just “what” and “how”)
– “Why” questions to understand the core business problem
– “Who” questions to understand the varying needs of all involved users (role, function, etc.)
Empowering users (“Who”) to gain new clarity into the core problem (“Why”)
– Bringing together data sources relevant to asking insightful questions of the data
– Enabling the data to answer the questions freely
– Building data analytics, algorithms, machine learning, etc. to expedite and broadcast answers
The above lines of inquiry inform what Data Quality processing is required
– Determining how, what, and where Data Quality is established based on the business problem
– The definition of “high-quality data” will vary by business problem
Understanding Data Quality best practices: What’s the End Goal?
The End Goal drives Data Quality requirements & processes
Do you have all the data required?
– What’s the central entity? E.g. Customer, Product, Asset
– What’s the definition? E.g. “Customers” may mean customers, prospects, store visitors, …
– Are the sources comprehensive? E.g. any data silos? Do they cover all geographies?
– Will “new” information be added? E.g. demographics, geolocation, …
How will data be matched, consolidated, or connected?
– One “golden” record? Or multiple links to connect all the dots?
What’s needed to facilitate the matching, consolidation, or connection required?
– E.g. Customer may need: Name, Address, Geolocation, Phone, Email
Have you evaluated the sources?
– Are the data sources “Fit for Purpose”?
Applying Data Quality best practices: Identifying required Data Quality dimensions
What data do we care about?
• What are the Critical Data Elements?
What measures can we take advantage of? (a minimal sketch of these checks follows the list)
1) Completeness – Are the relevant fields populated?
2) Integrity – Does the data maintain internal structural integrity, or relational integrity across sources?
3) Uniqueness – Are keys or records unique?
4) Validity – Does the data have the correct values?
5) Consistency – Is the data at consistent levels of aggregation, or does it have consistent valid values over time?
6) Timeliness – Did the data arrive in a time period that makes it useful or usable?
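For concreteness, a minimal sketch of how a few of these dimensions might be profiled on a tabular extract. The DataFrame, column names, reference domain, and date window are assumptions for illustration, not Trillium functionality.

```python
import pandas as pd

# Hypothetical customer extract; column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email":       ["a@x.com", None, "b@y.com", "c@z.com"],
    "country":     ["US", "US", "GB", "USA"],
    "updated_at":  pd.to_datetime(["2017-06-01", "2017-06-02", "2016-01-15", "2017-05-20"]),
})

report = {
    # Completeness: are the relevant fields populated?
    "email_completeness": df["email"].notna().mean(),
    # Uniqueness: are keys unique?
    "customer_id_unique": df["customer_id"].is_unique,
    # Validity: do values come from the expected domain?
    "country_validity": df["country"].isin(["US", "GB", "DE"]).mean(),
    # Timeliness: how much of the data was refreshed in the last year?
    "updated_recently": (df["updated_at"] >= "2016-07-01").mean(),
}
print(report)
```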
Example: Call Center Record
Dimensions annotated on the record: Unique, Integrity, Complete?, Consistent, Timely, Valid?
Callouts from the slide (a small sketch of turning these into automated flags follows below):
– Is Duration = 0 important? Is 01/01/20xx a defaulted date?
– And how will this be linked or connected with my other data?
– The file appears complete, but does it cover all call centers?
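A small, hypothetical sketch of how the first callout might be turned into automated flags. The field names, values, and the sentinel-date rule are assumptions based on the slide, not actual Trillium rules.

```python
import pandas as pd

# Hypothetical call-center extract mirroring the slide's annotations.
calls = pd.DataFrame({
    "call_id":   [1, 2, 3],
    "duration":  [340, 0, 125],   # seconds
    "open_date": pd.to_datetime(["2017-05-01", "2020-01-01", "2017-05-03"]),
})

# Flag zero durations: genuinely dropped calls, or a defaulted value?
calls["zero_duration_flag"] = calls["duration"] == 0

# Flag the suspected system-default date (01/01/20xx in the slide).
calls["default_date_flag"] = (calls["open_date"].dt.month == 1) & (calls["open_date"].dt.day == 1)

print(calls[calls["zero_duration_flag"] | calls["default_date_flag"]])
```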
Example: Twitter Feed
Dimensions to assess on the feed: Unique? Integrity? Complete? Consistent? Timely? Valid?
What else can we review or measure? (a short continuity and repetition sketch follows this list)
1) Coverage (Relevance) – How well does the data source meet the defined needs?
– E.g. does it cover the relevant geography? Is it biased?
2) Continuity – Are there data points for all intervals or expected intervals?
– E.g. sensors, weather records, call data records
3) Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference.
– E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70°
4) Provenance – Where did the data originate, who gathered it, and what criteria were used to create it?
– E.g. government agency, 3rd party provider, free or paid data
5) Transformation from origin – How many layers and/or changes has the data passed through?
– E.g. has the original data source already been merged with two other record sources? And is the result accurate?
6) Repetition or duplication of data patterns – Are data points exactly the same across multiple recording intervals or across multiple sensors?
– E.g. is there tampering with sensors or call data?
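A minimal, hypothetical sketch of two of these extended checks (continuity and repetition) over interval data. The sensor readings and the expected hourly interval are assumptions for illustration.

```python
import pandas as pd

# Hypothetical hourly sensor readings; values and column names are illustrative.
readings = pd.DataFrame({
    "sensor_id": ["S1"] * 5,
    "ts": pd.to_datetime(["2017-06-01 00:00", "2017-06-01 01:00",
                          "2017-06-01 03:00", "2017-06-01 04:00",
                          "2017-06-01 05:00"]),
    "value": [21.5, 21.5, 21.5, 22.0, 22.0],
})

# Continuity: are data points present for every expected hourly interval?
expected = pd.date_range("2017-06-01 00:00", "2017-06-01 05:00", freq="h")
missing = expected.difference(readings["ts"])
print("Missing intervals:", list(missing))   # 02:00 is absent

# Repetition: identical values across consecutive intervals can indicate
# a stuck or tampered sensor.
runs = (readings["value"] != readings["value"].shift()).cumsum()
repeat_lengths = readings.groupby(runs)["value"].transform("size")
print(readings[repeat_lengths >= 3])          # first three rows repeat 21.5
```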
Applying Data Quality best practices: ‘New’ or ‘Extended’ Measures of Data Quality
Example: Twitter Feed
Extended measures annotated on the feed: Triangulated, Continuity, Provenance, Coverage, Usage, Repeated patterns, Transformation
Callouts from the slide:
– Jane Doe pulled from Twitter based on #Blackberry
– All items for #Blackberry in the time interval appear to be included
– Marketing confirms these have high value
– Good association with current sales data
– All tweets appear unique within the date & vs. prior feeds
– This needed to include #BB and #Crackberry as well!
Applying Data Quality best practices: Understanding Context
Context is critical:
– Even on data that is considered “common” or “understood,” such as Name, Address, or Product Description
– To parse or standardize data to useful and usable components for additional processing
– To determine when and where to verify or enrich the data content
– To determine whether and how to match records to a given entity
– To identify whether to consolidate data, and if so, what other data drives the consolidation
Applying Data Quality best practices: Assessing Quality Requirements
Entity data (customer, product, asset, …):
– Requires understanding data provenance and context
– Requires integrating data from multiple data sources
– Requires determining whether specific data should even be included
– Presents differences in coverage, completeness, consistency, provenance, …
– Comes from different points in time
– May contain repetitions, particularly from 3rd party data sources
– May contain data at different levels of consolidation or aggregation
Example record:
– Name: Robert Smith Jr
– Address: 3 Davy Drive
– City: Rotherham
– Postal Code: S66 7EN
– Phone: +44(0)1189 823606 (3rd Party)
Applying Data Quality best practices: Utilizing Data Quality functions to achieve required DQ dimensions
– Parse data values from unstructured fields to their correct domains
– Standardize values to enable higher quality matching and linkage
– Verify and enrich global postal addresses and geolocations
– Enrich data from external, third-party sources to create comprehensive, unified records
– Match and link like records
– Consolidate and aggregate to a “golden” record, if appropriate, based on factors such as data source, date, …
– Match records that belong to the same domain (i.e., household or business)
(A minimal parsing and matching sketch follows the example below.)
Example – Household View:
– Name: Smith
– Address: 3 Davy Drive
– City: Rotherham
– Postal Code: S66 7EN
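A minimal, hypothetical sketch of standardizing name/address fields and grouping records into a household view. The rules and records are illustrative assumptions only, not the Trillium matching engine.

```python
import re

# Hypothetical source records; values are illustrative only.
records = [
    {"name": "Robert Smith Jr", "address": "3 davy drive", "postal": "s66 7en"},
    {"name": "Mrs A. Smith",    "address": "3 Davy Dr.",   "postal": "S66 7EN"},
]

def standardize(rec):
    """Very small standardization pass: casing, abbreviation expansion,
    surname extraction. Real rule sets are far richer than this."""
    addr = rec["address"].upper().replace(".", "")
    addr = re.sub(r"\bDR\b", "DRIVE", addr)
    tokens = rec["name"].replace(".", "").upper().split()
    surname = [t for t in tokens if t not in {"MR", "MRS", "MS", "JR", "SR"}][-1]
    return {"surname": surname, "address": addr, "postal": rec["postal"].upper()}

# Household match key: same surname + standardized address + postal code.
std = [standardize(r) for r in records]
keys = {(r["surname"], r["address"], r["postal"]) for r in std}
print(std)
print("Records fall into", len(keys), "household(s)")   # expect 1
```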
Applying Data Quality best practices: Example
Large telco organization: “What are our customers saying about us in the marketplace? Where are the most common complaints coming from?”
Issue: sparse results concentrated in one region
Required: standardization, enrichment, geocoding, matching/record linkage, address verification
Before Data Quality After Data Quality
Applying Data Quality best practices: Example
Large telco organization: “What are our most profitable regions on a daily basis?”
Issue: poor geolocation identifying the wrong regions
Required: parsing, standardization, address verification, enrichment, geocoding, matching/record linkage
Before Data Quality After Data Quality
Applying Data Quality best practices: Consistent processing
Big Data at scale distributes data across many nodes – not necessarily with other relevant data!
– Implications for joining, sorting, and matching data, whether for enrichment, verification against trusted sources, or a consolidated single view
Data Quality functions must be performed in a consistent manner, no matter where the actual processing takes place, how the data is segmented, or what the data volume is (a minimal sketch follows below)
– Processing routines must apply the same approach and logic each time
– Critical to establishing, building, and maintaining trust
Source: HP Analyst Briefing
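To illustrate the principle (not the Trillium implementation), a small sketch showing that a pure, deterministic standardization routine produces identical results regardless of how the data happens to be partitioned across nodes. The function and records are assumptions for illustration.

```python
# Illustrative only: a pure, deterministic standardization routine gives the
# same result no matter how the records are partitioned for processing.

def standardize_phone(raw: str) -> str:
    """Keep digits only and normalize to a fixed layout (toy example rule)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:] if len(digits) >= 10 else digits

records = ["(617) 555-0101", "617.555.0101", "+1 617 555 0101"]

# Simulate two different partitionings of the same data.
partitioning_a = [records[:1], records[1:]]
partitioning_b = [records[:2], records[2:]]

result_a = sorted(standardize_phone(r) for part in partitioning_a for r in part)
result_b = sorted(standardize_phone(r) for part in partitioning_b for r in part)

assert result_a == result_b   # same logic, same output, regardless of split
print(result_a)               # ['6175550101', '6175550101', '6175550101']
```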
Trillium Quality for Big Data: Focus on Data Quality, not the Big Data platform
• Use existing Data Quality skills and expertise
• No need to worry about mappers, reducers, big side or small side of joins, etc.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x
• Run multiple execution frameworks in a single job
Single GUI – Execute Anywhere!
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop
Bring Data Quality best practices into the Data Lake: The Syncsort/Trillium technology approach
“Design once, deploy anywhere”
– Visually design data quality jobs once and run anywhere (MapReduce, Spark, Linux, Unix, Windows; on premise or in the cloud)
– Use-case templates to fast-track development
– Test & debug locally in Windows/Linux; deploy to Big Data
– Intelligent Execution dynamically optimizes data processing at run-time based on the chosen compute framework; no changes or tuning required
Benefit: Significantly reduce manual data preparation
– Major time sink for data scientists, architects, and analysts
– Risk of inconsistent or incomplete data preparation
Benefit: Significantly increase trust in data
– Major time sink for executives
– Risk of poor data-based business decisions
Single GUI – Execute Anywhere!
“Data is useful. High-quality, well-understood, auditable data is priceless.”
~ Ted Friedman, VP Distinguished Analyst
Article in CRM.com, Mar 8, 2005: “The Coming of BI Competency Centers”
Data Quality remains Data Quality, even at scale
“Data is the new science. Big Data holds the answers. Are you asking the right questions?”
~ Pat Gelsinger, President and COO at EMC
Forbes Insights, June 22, 2012: “Big Bets On Big Data”
Questions and Next Steps
For more information on Trillium Quality for Big Data, visit: trilliumsoftware.com/products/big-data
Contact Info:
Mike Urbonas, Director of Product Marketing, Syncsort/Trillium – [email protected] – https://www.linkedin.com/in/mikeu
Harald Smith, Director of Product Management, Syncsort/Trillium – [email protected] – https://www.linkedin.com/in/harald-smith-71028b
Thank You!