Applying Data Quality Best Practices at Big Data Scale
Harald Smith, Director of Product Management
Michael Urbonas, Director of Product Marketing
Speakers
Mike Urbonas, Director of Product Marketing, Syncsort
15 years of software experience, including:
– BI/DW & data visualization
– Data management & ETL
– Text analytics
– Enterprise search
Harald Smith, Director of Product Management, Syncsort
20 years in Information Management, including data quality, integration, and governance
– Consulting, product management, software & solution development
Co-author of Patterns of Information Management, as well as two Redbooks on Information Governance and Data Integration
Today’s agenda
Problem: Huge Big Data investments, scarce Big Data trust
– New insights from the Syncsort 2017 Big Data Trends survey
– Root causes of Data Lake distrust
Sample use cases at Big Data scale
– 360-degree view of the customer, product, or other core entity
– Anti-fraud
Solution: Bringing Data Quality best practices into the Data Lake
– “Design once, deploy anywhere” with the Syncsort/Trillium technology approach
– “Intelligent execution” to leverage the strength of the Big Data platform
Nobody wants a data swamp!
“This sure looked a lot nicer on the whiteboard…”
Key problem: Big Data deemed untrustworthy by business managers and leaders
– Only 33% of senior execs have a high level of trust in the accuracy of their Big Data analytics ~ KPMG 2016
– 59% of global execs do not believe their company has capabilities to generate meaningful business insights from their data ~ Bain 2015
– 85% of global execs say major investments are required to update their existing data platform, including data cleaning and consolidating ~ Bain 2015
Fresh insights from the Syncsort 2017 Big Data Trends survey
Data Quality is recognized as a mission-critical data lake success factor
– Data Quality tops the list of challenges of data lake implementation, followed closely by Data Governance
But… not everyone is making the connection between Data Quality and Big Data success
– Participants who did not include data quality as a top-3 priority for implementing the data lake expressed the most interest in analytically intensive data lake uses… which are highly dependent on proper data quality
Financial services and insurance industry is most focused on Data Quality and Data Governance
– Named Data Quality as a top priority 50% more often than participants from other industries
– Also identified Data Governance as a top priority at more than twice the rate of those from other industries
Root causes of Big Data mistrust
Are these numbers accurate? Are calculations using correctly aggregated data?
Is this data current? When was it last updated?
Are these terms consistent with our business definitions?
Can I trust this data enough to make key decisions and/or allow the data to be used in real-time?
Did we include all of the data we should have? Are additional data sources missing?
Root causes of Big Data mistrust… examples
False Assumptions
– Pinterest targeted marketing campaign mistakenly congratulated single women on upcoming weddings...
Miscoded/Misinterpreted Data
– Predictive analysis falsely found call center workers without a HS diploma were 3x more likely to remain on board for at least 9 months…
Duplicate Data
– Fraud examination revealed massive import tariff evasion on eggs, only to find there was no case to crack…
Sample use cases at Big Data scale…
360 view of customer (or product, or other key entity)
Is the Data Lake essential for this use case?
– YES… Purpose of customer 360 is to optimize customer experience management
– Increasingly broad spectrum of data sources involved in and required for effectively personalizing customer experiences and targeted marketing offers
What types of data?
– Internal sources – often many/overlapping
– 3rd party data – demographics
– Suppression data – keeping customer information updated
– New sources – mobile, social media
Internal Data: Customer Master Data, Point-of-Sale Data, Contact Form Data, Loyalty Program Data, eCommerce Data, Customer Service Data
Suppression Data: Change of Address, Mortality, Do Not Call
Third-Party Data: Age, Occupation, Education, Gender, Income, Geographic
Sample use cases at Big Data scale…
Anti-Fraud/Anti-Money Laundering
Is the Data Lake essential for this use case?
– YES… Fraudulent transaction detection requires huge volumes of customer profile data, recent transaction activity with “last known” values, device data with geolocation and time-based tagging, and 3rd party news/alerts
– Data used to refine Machine Learning models (e.g., anomaly detection, implausible behavior analysis) to review new transactions in real time (a minimal anomaly-scoring sketch follows the data lists below)
What types of data?
– Internal sources – often many/overlapping
– Suppression data – keeping customer information updated
– Mobile data – devices, locations
– New sources – social media, 3rd party data, …
Internal Data: Customer Master Data, Point-of-Sale Data, Contact Form Data, Loyalty Program Data, eCommerce Data, Customer Service Data
Mobile Data: Device, Location, Wearables, Mobile wallets
Suppression Data: Change of Address, Mortality, Do Not Call
Social Data: Sentiment, Opinions, Interests, Social handles
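As an illustration of the kind of anomaly detection mentioned above, here is a minimal, hypothetical sketch that flags transactions whose amounts deviate sharply from a customer's own recent history. The column names, values, and threshold are assumptions for illustration, not part of the Syncsort/Trillium material.

```python
import pandas as pd

# Hypothetical transaction data; column names and values are illustrative only.
txns = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C1", "C2", "C2", "C2"],
    "amount":      [25.0, 30.0, 27.5, 950.0, 110.0, 95.0, 105.0],
})

def robust_zscore(s: pd.Series) -> pd.Series:
    """Score each value against the group's median using the median absolute
    deviation (MAD), so a single extreme value does not mask itself."""
    med = s.median()
    mad = (s - med).abs().median()
    return 0.6745 * (s - med) / (mad if mad else 1.0)

txns["score"] = txns.groupby("customer_id")["amount"].transform(robust_zscore)

# Flag candidates for review; the cutoff of 3 is an arbitrary example threshold.
print(txns[txns["score"].abs() > 3])
```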
The Fundamental Data Quality Question: What are you trying to do?
“Never lead with a data set; lead with a question.”
~ Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017: “The Data Differentiator”
“If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?”
~ Wolf Ruzicka, Chairman of the Board at EastBanc Technologies
Blog post, June 1, 2017: “Grow A Data Tree Out Of The ‘Big Data’ Swamp”
Understanding Data Quality best practices: Where to start?
Establishing Scope
Asking the “right questions” about your data (not just “what” and “how”)
– “Why” questions to understand the core business problem
– “Who” questions to understand the varying needs of all involved users (role, function, etc.)
Empowering users (“Who”) to gain new clarity into the core problem (“Why”)
– Bringing together data sources relevant to asking insightful questions of the data
– Enabling the data to answer the questions freely
– Building data analytics, algorithms, machine learning, etc. to expedite and broadcast answers
The above lines of inquiry inform what Data Quality processing is required
– Determining how, what, and where Data Quality is established based on the business problem
– The definition of “high-quality data” will vary by business problem
Understanding Data Quality best practices: What’s the End Goal?
The End Goal drives Data Quality requirements & processes
Do you have all the data required?
– What’s the central entity? E.g. Customer, Product, Asset
– What’s the definition? E.g. “Customers” may mean customers, prospects, store visitors, …
– Are the sources comprehensive? E.g. any data silos? Do they cover all geographies?
– Will “new” information be added? E.g. demographics, geolocation, …
How will data be matched, consolidated, or connected?
– One “golden” record? Or multiple links to connect all the dots?
What’s needed to facilitate the matching, consolidation, or connection required?
– E.g. Customer may need: Name, Address, Geolocation, Phone, Email
Have you evaluated the sources?
– Are the data sources “Fit for Purpose”?
Applying Data Quality best practices: Identifying required Data Quality dimensions
What data do we care about?
• What are the Critical Data Elements?
What measures can we take advantage of? (a minimal sketch of these checks follows the list)
1) Completeness – Are the relevant fields populated?
2) Integrity – Does the data maintain internal structural integrity, or relational integrity across sources?
3) Uniqueness – Are keys or records unique?
4) Validity – Does the data have the correct values?
5) Consistency – Is the data at consistent levels of aggregation, or does it have consistent valid values over time?
6) Timeliness – Did the data arrive in a time period that makes it useful or usable?
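For concreteness, a minimal sketch of how a few of these dimensions might be profiled on a tabular extract. The DataFrame, column names, reference domain, and date window are assumptions for illustration, not Trillium functionality.

```python
import pandas as pd

# Hypothetical customer extract; column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email":       ["a@x.com", None, "b@y.com", "c@z.com"],
    "country":     ["US", "US", "GB", "USA"],
    "updated_at":  pd.to_datetime(["2017-06-01", "2017-06-02", "2016-01-15", "2017-05-20"]),
})

report = {
    # Completeness: are the relevant fields populated?
    "email_completeness": df["email"].notna().mean(),
    # Uniqueness: are keys unique?
    "customer_id_unique": df["customer_id"].is_unique,
    # Validity: do values come from the expected domain?
    "country_validity": df["country"].isin(["US", "GB", "DE"]).mean(),
    # Timeliness: how much of the data was refreshed in the last year?
    "updated_recently": (df["updated_at"] >= "2016-07-01").mean(),
}
print(report)
```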
Example: Call Center Record
Dimensions annotated on the record: Unique, Integrity, Complete?, Consistent, Timely, Valid?
Callouts from the slide (a small sketch of turning these into automated flags follows below):
– Is Duration = 0 important? Is 01/01/20xx a defaulted date?
– And how will this be linked or connected with my other data?
– The file appears complete, but does it cover all call centers?
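A small, hypothetical sketch of how the first callout might be turned into automated flags. The field names, values, and the sentinel-date rule are assumptions based on the slide, not actual Trillium rules.

```python
import pandas as pd

# Hypothetical call-center extract mirroring the slide's annotations.
calls = pd.DataFrame({
    "call_id":   [1, 2, 3],
    "duration":  [340, 0, 125],   # seconds
    "open_date": pd.to_datetime(["2017-05-01", "2020-01-01", "2017-05-03"]),
})

# Flag zero durations: genuinely dropped calls, or a defaulted value?
calls["zero_duration_flag"] = calls["duration"] == 0

# Flag the suspected system-default date (01/01/20xx in the slide).
calls["default_date_flag"] = (calls["open_date"].dt.month == 1) & (calls["open_date"].dt.day == 1)

print(calls[calls["zero_duration_flag"] | calls["default_date_flag"]])
```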
Example: Twitter Feed
Dimensions to assess on the feed: Unique? Integrity? Complete? Consistent? Timely? Valid?
What else can we review or measure? (a short continuity and repetition sketch follows this list)
1) Coverage (Relevance) – How well does the data source meet the defined needs?
– E.g. does it cover the relevant geography? Is it biased?
2) Continuity – Are there data points for all intervals or expected intervals?
– E.g. sensors, weather records, call data records
3) Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference.
– E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70°
4) Provenance – Where did the data originate, who gathered it, and what criteria were used to create it?
– E.g. government agency, 3rd party provider, free or paid data
5) Transformation from origin – How many layers and/or changes has the data passed through?
– E.g. has the original data source already been merged with two other record sources? And is the result accurate?
6) Repetition or duplication of data patterns – Are data points exactly the same across multiple recording intervals or across multiple sensors?
– E.g. is there tampering with sensors or call data?
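A minimal, hypothetical sketch of two of these extended checks (continuity and repetition) over interval data. The sensor readings and the expected hourly interval are assumptions for illustration.

```python
import pandas as pd

# Hypothetical hourly sensor readings; values and column names are illustrative.
readings = pd.DataFrame({
    "sensor_id": ["S1"] * 5,
    "ts": pd.to_datetime(["2017-06-01 00:00", "2017-06-01 01:00",
                          "2017-06-01 03:00", "2017-06-01 04:00",
                          "2017-06-01 05:00"]),
    "value": [21.5, 21.5, 21.5, 22.0, 22.0],
})

# Continuity: are data points present for every expected hourly interval?
expected = pd.date_range("2017-06-01 00:00", "2017-06-01 05:00", freq="h")
missing = expected.difference(readings["ts"])
print("Missing intervals:", list(missing))   # 02:00 is absent

# Repetition: identical values across consecutive intervals can indicate
# a stuck or tampered sensor.
runs = (readings["value"] != readings["value"].shift()).cumsum()
repeat_lengths = readings.groupby(runs)["value"].transform("size")
print(readings[repeat_lengths >= 3])          # first three rows repeat 21.5
```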
Applying Data Quality best practices: ‘New’ or ‘Extended’ Measures of Data Quality
Example: Twitter Feed
Extended measures annotated on the feed: Triangulated, Continuity, Provenance, Coverage, Usage, Repeated patterns, Transformation
Callouts from the slide:
– Jane Doe pulled from Twitter based on #Blackberry
– All items for #Blackberry in the time interval appear to be included
– Marketing confirms these have high value
– Good association with current sales data
– All tweets appear unique within the date & vs. prior feeds
– This needed to include #BB and #Crackberry as well!
Applying Data Quality best practices: Understanding Context
Context is critical:
– Even on data that is considered “common” or “understood,” such as Name, Address, or Product Description
– To parse or standardize data to useful and usable components for additional processing
– To determine when and where to verify or enrich the data content
– To determine whether and how to match records to a given entity
– To identify whether to consolidate data, and if so, what other data drives the consolidation
Applying Data Quality best practices: Assessing Quality Requirements
Entity data (customer, product, asset, …):
– Requires understanding data provenance and context
– Requires integrating data from multiple data sources
– Requires determining whether specific data should even be included
– Presents differences in coverage, completeness, consistency, provenance, …
– Comes from different points in time
– May contain repetitions, particularly from 3rd party data sources
– May contain data at different levels of consolidation or aggregation
Example record:
– Name: Robert Smith Jr
– Address: 3 Davy Drive
– City: Rotherham
– Postal Code: S66 7EN
– Phone: +44(0)1189 823606 (3rd Party)
Applying Data Quality best practices: Utilizing Data Quality functions to achieve required DQ dimensions
– Parse data values from unstructured fields to their correct domains
– Standardize values to enable higher quality matching and linkage
– Verify and enrich global postal addresses and geolocations
– Enrich data from external, third-party sources to create comprehensive, unified records
– Match and link like records
– Consolidate and aggregate to a “golden” record, if appropriate, based on factors such as data source, date, …
– Match records that belong to the same domain (i.e., household or business)
(A minimal parsing and matching sketch follows the example below.)
Example – Household View:
– Name: Smith
– Address: 3 Davy Drive
– City: Rotherham
– Postal Code: S66 7EN
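A minimal, hypothetical sketch of standardizing name/address fields and grouping records into a household view. The rules and records are illustrative assumptions only, not the Trillium matching engine.

```python
import re

# Hypothetical source records; values are illustrative only.
records = [
    {"name": "Robert Smith Jr", "address": "3 davy drive", "postal": "s66 7en"},
    {"name": "Mrs A. Smith",    "address": "3 Davy Dr.",   "postal": "S66 7EN"},
]

def standardize(rec):
    """Very small standardization pass: casing, abbreviation expansion,
    surname extraction. Real rule sets are far richer than this."""
    addr = rec["address"].upper().replace(".", "")
    addr = re.sub(r"\bDR\b", "DRIVE", addr)
    tokens = rec["name"].replace(".", "").upper().split()
    surname = [t for t in tokens if t not in {"MR", "MRS", "MS", "JR", "SR"}][-1]
    return {"surname": surname, "address": addr, "postal": rec["postal"].upper()}

# Household match key: same surname + standardized address + postal code.
std = [standardize(r) for r in records]
keys = {(r["surname"], r["address"], r["postal"]) for r in std}
print(std)
print("Records fall into", len(keys), "household(s)")   # expect 1
```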
Applying Data Quality best practices: Example
Large telco organization: “What are our customers saying about us in the marketplace? Where are the most common complaints coming from?”
Issue: sparse results concentrated in one region
Required: standardization, enrichment, geocoding, matching/record linkage, address verification
Before Data Quality After Data Quality
Applying Data Quality best practices: Example
Large telco organization: “What are our most profitable regions on a daily basis?”
Issue: poor geolocation identifying the wrong regions
Required: parsing, standardization, address verification, enrichment, geocoding, matching/record linkage
Before Data Quality After Data Quality
Applying Data Quality best practices: Consistent processing
Big Data at scale distributes data across many nodes – not necessarily with other relevant data!
– Implications for joining, sorting, and matching data, whether for enrichment, verification against trusted sources, or a consolidated single view
Data Quality functions must be performed in a consistent manner, no matter where the actual processing takes place, how the data is segmented, or what the data volume is (a minimal sketch follows below)
– Processing routines must apply the same approach and logic each time
– Critical to establishing, building, and maintaining trust
Source: HP Analyst Briefing
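To illustrate the principle (not the Trillium implementation), a small sketch showing that a pure, deterministic standardization routine produces identical results regardless of how the data happens to be partitioned across nodes. The function and records are assumptions for illustration.

```python
# Illustrative only: a pure, deterministic standardization routine gives the
# same result no matter how the records are partitioned for processing.

def standardize_phone(raw: str) -> str:
    """Keep digits only and normalize to a fixed layout (toy example rule)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:] if len(digits) >= 10 else digits

records = ["(617) 555-0101", "617.555.0101", "+1 617 555 0101"]

# Simulate two different partitionings of the same data.
partitioning_a = [records[:1], records[1:]]
partitioning_b = [records[:2], records[2:]]

result_a = sorted(standardize_phone(r) for part in partitioning_a for r in part)
result_b = sorted(standardize_phone(r) for part in partitioning_b for r in part)

assert result_a == result_b   # same logic, same output, regardless of split
print(result_a)               # ['6175550101', '6175550101', '6175550101']
```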
Trillium Quality for Big Data: Focus on Data Quality, not the Big Data platform
• Use existing Data Quality skills and expertise
• No need to worry about mappers, reducers, big side or small side of joins, etc.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x
• Run multiple execution frameworks in a single job
Single GUI – Execute Anywhere!
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop
Bring Data Quality best practices into the Data Lake: The Syncsort/Trillium technology approach
“Design once, deploy anywhere”
– Visually design data quality jobs once and run anywhere (MapReduce, Spark, Linux, Unix, Windows; on premise or in the cloud)
– Use-case templates to fast-track development
– Test & debug locally in Windows/Linux; deploy to Big Data
– Intelligent Execution dynamically optimizes data processing at run-time based on the chosen compute framework; no changes or tuning required
Benefit: Significantly reduce manual data preparation
– Major time sink for data scientists, architects, and analysts
– Risk of inconsistent or incomplete data preparation
Benefit: Significantly increase trust in data
– Major time sink for executives
– Risk of poor data-based business decisions
Single GUI – Execute Anywhere!
“Data is useful. High-quality, well-understood, auditable data is priceless.”
~ Ted Friedman, VP Distinguished Analyst
Article in CRM.com, Mar 8, 2005: “The Coming of BI Competency Centers”
Data Quality remains Data Quality, even at scale
“Data is the new science. Big Data holds the answers. Are you asking the right questions?”
~ Pat Gelsinger, President and COO at EMC
Forbes Insights, June 22, 2012: “Big Bets On Big Data”
Questions and Next Steps
For more information on Trillium Quality for Big Data, visit: trilliumsoftware.com/products/big-data
Contact Info:
Mike Urbonas, Director of Product Marketing, Syncsort/Trillium – [email protected] – https://www.linkedin.com/in/mikeu
Harald Smith, Director of Product Management, Syncsort/Trillium – [email protected] – https://www.linkedin.com/in/harald-smith-71028b
Thank You!