IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

IBM Cloud and Cognitive Software Fast Start 2020 #FastStart2020 IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak™ for Data – A Data Quality Deep Dive Dan Schallenkamp Data and AI, Offering Manager for Data Quality Thurs. 30-April-2020 CHI UG Meeting

Transcript of IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Page 1: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

IBM Cloud and Cognitive Software Fast Start 2020 #FastStart2020

IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak™ for Data – A Data Quality Deep Dive

Dan Schallenkamp
Data and AI, Offering Manager for Data Quality

Thurs. 30-April-2020 CHI UG Meeting

Page 2: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Legal Disclaimer

© IBM Corporation 2020. All Rights Reserved. The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Page 3: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Session Agenda

• Where is Data Quality Positioned in our offerings?
• Business Value / Purpose
• Data Quality – Key Capabilities
• What’s New in the current GA release?
• Demo
• What’s Planned in the Next release?
• Demo

© 2020 IBM Corporation

Page 4: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

You Are Here: How this session fits in the DataOps story

Page 5: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

The AI Ladder: A prescriptive approach to accelerating the journey to AI

• Infuse: Operationalize AI throughout the business
• Analyze: Build and scale AI with trust and transparency
• Collect: Make data simple and accessible
• Organize: Create a business-ready analytics foundation
• Modernize: Make your data ready for an AI and hybrid cloud world

Page 6: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

DataOps is the concept to deliver Business Ready Data

[Diagram: COLLECT, ORGANIZE, ANALYZE, and INFUSE your data with AI; Analytics and AI at scale and speed to drive Operational Efficiency, Data Quality, and Data privacy & compliance]

DataOps (DevOps for Data + Data Operations)

• A concept, like DevOps for Data, enabling collaboration between data consumer & data provider at speed & scale

• Automated data operations providing curated data pipeline

• Drives agility and innovation everywhere

People Process Technology


Page 7: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Quality – Key Capabilities

Page 8: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Cloud Pak for Data: End-to-End Platform for Business-Ready Data

Integration of data quality (from Information Analyzer), data governance (from Information Governance Catalog) and data consumption (from Watson Knowledge Catalog), now under one experience and brand.

DataStage (Data Engineers)

Enterprise Data Integration
• Build ETL jobs
• Run ETL jobs
• Monitor
• Extract data
• Collect metadata
• Move data
• Ingest data

IBM Watson Knowledge Catalog on Cloud Pak for Data (Data Governance Teams, Data Citizens)

Enterprise Data Quality
• Deep data profiling
• Data quality scoring
• Apply and monitor validation rules against source data

Enterprise Data Governance
• Data lineage
• Data ownership
• Data stewardship
• Data governance workflow
• Discover metadata assets
• Classify data assets
• Build data glossary
• Manage metadata repository
• Manage Reference Data

Enterprise Data Consumption
• Search and find relevant data
• Connect & prepare data for consumption & analysis
• Consume and analyze the data
• Comment, rate and share

Watson Studio, Watson Machine Learning, and OpenScale

AI Lifecycle
• Ground Truth gathering
• Data Cleansing
• Feature Engineering
• Model Selection
• Parameter Optimization
• Ensemble
• Model Validation
• Model Deployment
• Runtime Monitoring
• Model Improvement

Page 9: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Profiling and Quality – Core Capabilities

Analyze – Deep Data Profiling & Analysis
Provides the key understanding of the source data

• Column analysis
• Business Term Assignments
• Data Classification
• Data Quality scores
• Primary Key analysis
• Relationship and Overlap analysis

Monitor Data Quality – using Business Rules
Evaluates user-defined rules against the source data

• Data Rules – targeted evaluation
• Rule Sets – combined assessment

[Diagram panels: Column Analysis; Primary Key Analysis; Relationship & Overlap Analysis (Source 1, Source 2); Rules Analysis (Source 1, Source 2)]

Page 10: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

How to get the best results from Quick scan and Auto Discovery ... Example: for your critical data elements

Step 1: Business Terms
Define Terms, Policies and Rules for your top 50 or 150 CDEs

Step 2: Data Classes
Examine the 200+ built-in data classes, disable those you don’t need, create and test custom data classes. You must link every data class to a business term.

Step 3: Automation Rules
Create Automation Rules for your top 50 or 150 CDEs
- ARs trigger based on Business term assignments
- Can automatically bind/create Quality Rules

Step 4: DQ Dimensions
Examine the 11 built-in data quality dimensions, enable/disable as needed, create and install custom dimensions. Used to calculate the DQ Score for given columns.

Step 5: Auto Discover
• Automatic metadata import
• Analysis
• Auto classification
• Auto term assignment
• Data quality scores

Innovation

Homework
Spend time customizing the tool

Page 11: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Quick scan – Blazing Fast Bulk Discovery

An easy way to start the import, analysis, quality scores, data classification (to find PII data) and automatic business term assignments all with one easy operation.

(see screen shots in demo section below)


Page 12: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Automatic Business Term Assignment

Diagram elements:
• Data Sources: Systems of Record, Systems of Engagement, Systems of Insights, Cloud, Social Media, News, Documents, Hadoop, Others
• Data Discovery (Quick scan)
• Classification: Rule Based Classifiers, ML Classification, Cognitive & Deep Learning
• Recommendations & Auto Term Assignment
• Curator Dashboard Decisions: Approve, Reject / Modify
• Publish
• Enterprise Data Catalog
• Feedback
• Training

Page 13: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Automated Data Classification

Regex/Valid Value/Java Classifiers

JavaScript Classifiers

Column Similarity classifiers

Public Domain Classifiers

Table Classifiers

Auto Grouping and Suggestion
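
The regex and valid-value classifier types listed above can be illustrated with a minimal Python sketch. This is not the product's implementation: the two class definitions, the `classify_column` helper and the 0.8 match threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical class definitions; these are not WKC's built-in data classes.
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")   # regex classifier
US_STATES = {"IL", "NY", "CA", "TX"}        # deliberately tiny valid-value list


def match_ratio(values, predicate):
    """Fraction of non-empty values accepted by the predicate."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    return sum(1 for v in non_empty if predicate(v)) / len(non_empty)


def classify_column(values, threshold=0.8):
    """Return the best-matching class name, or None if no class clears the threshold."""
    scores = {
        "US Zip Code": match_ratio(values, lambda v: US_ZIP.match(v) is not None),
        "US State Code": match_ratio(values, lambda v: v.upper() in US_STATES),
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

A column whose values mostly match one definition gets that class; a column that matches nothing well enough stays unclassified, which is the case the column similarity feature later in this deck is meant to address.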


Page 14: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Automated Data Quality

Quality Analysis

Quality Rules

Quality Dimensions

Automation Rules

ML Suggested rules

Business Term Assignment

Page 15: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Quality

- The Importance of Quality Addresses
- A word on Workflow

Page 16: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

The Importance of Quality Addresses

Good quality addresses are foundational to so many initiatives including:

• Know Your Customer (Prospect, Employee, Vendor, Patient)

• Data Quality in general and Matching and Deduplication specifically

• Shipping, mailing, logistics

IBM’s QualityStage Address Verification Interface (AVI) is tightly integrated with QualityStage

Questions:

• What do you use today to parse, correct, enhance & verify addresses?

• How often do you cleanse all your addresses and at what cost?

• Do you need to add lat/long coordinates to addresses?


Page 17: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Capabilities

– Supports over 248 countries and territories

– Improved verification, suggestion and correction results in batch or real time

– Bi-directional Transliteration support for 8 languages

– Tightly integrated into InfoSphere QualityStage

– Process multiple countries in a single run

– Latitude and longitude assignment

– US Census* and UK PAF data

Benefits

– Reduced errors in shipping/mailing & other activity, lowers cost

– Better customer service and increased revenue

– Increase business confidence when using enterprise data for critical decision making

– Enhanced and standardized address data supports record matching & de-duplication

Address Parse/Validate/Enhance


Page 18: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Quality – What’s New in Watson Knowledge Catalog?

EVERYTHING is New! All DQ is New!


Page 19: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Quality – Retire the two older IA clients in 11.7.1 SP2

11.7.1

– Information Analyzer OneUI: zero footprint, microservices based client (requires the ‘UG Stack’)

– Information Analyzer Workbench: Windows based thick client

– Information Analyzer Thin Client (old/first thin client)

Page 20: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

A Unified User eXperience (UX) across IIS and WKC

[Diagram: Information Analyzer and Information Governance Catalog + Watson Knowledge Catalog on IBM Cloud Pak for Data; Unified User Experience & Single Catalog. Labels: Product Strategy, New]

Page 21: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Quality within ICP/WKC

[Diagram: Watson Knowledge Catalog on IBM Cloud Pak for Data. Label: New]


Page 23: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Rule Definition Management – For the business user


Page 24: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Accelerating Data Quality through ML based automation

Machine Learning assisted Data Quality

• Auto Business Term Assignment – ML assisted

• Auto Business Rule Suggestion – via Automation Rules based on term assignment and data class

• Auto Discovery – a quick way to kick off bulk analysis operations including:
   • Metadata import
   • Data profiling
   • Data quality scores
   • Term assignment

Innovation

Page 25: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Accelerating the Quality & Governance Process

Automating the Governance Process

• Utilizing Machine Learning for an accelerated Metadata Classification Process (Auto Business Term assignment)

• Automatically classify data -- including understanding your PII risk

Innovation

Automation through Machine Learning


Page 26: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Automation Rules

• Automatic Actions/Rules and DQ threshold based on Term assignments
• Enable/Disable all or individual built-in data quality dimensions
• Auto-bind one or more Data Rule Definitions
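
The idea of actions triggered by term assignments can be sketched in a few lines of Python. This is a conceptual model only, not the WKC or Information Analyzer API: the `AutomationRule` class, its field names and the "strictest threshold wins" policy are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AutomationRule:
    """Hypothetical model: when a column carries `term`, apply these actions."""
    term: str
    bind_rule_definitions: list = field(default_factory=list)
    dq_threshold: Optional[float] = None
    disabled_dimensions: set = field(default_factory=set)


def apply_automation_rules(column_terms, rules):
    """Map each column to the actions triggered by its assigned business terms."""
    plan = {}
    for column, terms in column_terms.items():
        actions = {"bind": [], "threshold": None, "disabled_dimensions": set()}
        for rule in rules:
            if rule.term not in terms:
                continue
            actions["bind"].extend(rule.bind_rule_definitions)
            actions["disabled_dimensions"] |= rule.disabled_dimensions
            if rule.dq_threshold is not None:
                # Assumption: if several rules fire, keep the strictest threshold.
                current = actions["threshold"]
                actions["threshold"] = (
                    rule.dq_threshold if current is None else max(current, rule.dq_threshold)
                )
        plan[column] = actions
    return plan
```

The point of the sketch is that nothing is configured per column: the steward maintains rules per business term, and every column that receives that term picks up the bound rule definitions and threshold.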


Page 27: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Automation Rules – Designed for the business user

Innovation

• Automatic Actions/Rules and DQ threshold based on Term assignments
• Enable/Disable all or individual built-in data quality dimensions
• Auto-bind one or more Data Rule Definitions

Page 28: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

SQL Virtual Tables

Can greatly simplify the creation and maintenance of data rule logic by ‘pushing’ the complexities (table JOINs, filters, etc.) to the source database.
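
A rough illustration of that push-down, using Python's built-in `sqlite3` module rather than the product itself. The tables, the query and the `failing_rows` helper are invented for the example; the only point is that the JOIN and the filter live in SQL, so the rule condition stays trivial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, -5.0), (12, 2, 40.0), (13, 2, -1.0);
""")

# The "virtual table": the JOIN and the filter are executed by the database.
VIRTUAL_TABLE_SQL = """
    SELECT o.id AS order_id, c.country, o.amount
    FROM orders o JOIN customer c ON c.id = o.customer_id
    WHERE c.country = 'US'
"""


def failing_rows(connection, virtual_table_sql, condition):
    """Rows of the virtual table that violate a simple rule condition.

    Sketch only: the SQL fragments are trusted constants, not user input.
    """
    query = f"SELECT * FROM ({virtual_table_sql}) AS vt WHERE NOT ({condition})"
    return connection.execute(query).fetchall()


# The rule itself is a one-liner because the complexity sits in the SQL above.
exceptions = failing_rows(conn, VIRTUAL_TABLE_SQL, "amount >= 0")
```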


Page 29: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Data Quality – What’s New?

In IIS 11.7.1 SP2 and Also in WKC?

Page 30: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

What’s New with the Nov 2019 Release?

IIS 11.7.1 SP2 and CPD WKC 2.5

1. 90% of IA (including Quick scan and Auto discovery) is included in WKC and with a common UX - Demo
2. Create/edit/delete virtual columns (both)
3. Limit the number of Data Rule output exceptions (both)
4. Validity Benchmark is back in Data Rules (both)
5. ‘Manage’ Flag in Data Rules (IIS only today)
6. Remember many user choices/preferences (both)

Page 31: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Create/Edit/Delete Virtual Columns (both) 1 of 2

• Choose ‘Create virtual column’ from the Columns tab

• If you ‘Select’ an existing virtual column you can choose ‘Edit’ or ‘Delete’


Page 32: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Create/Edit/Delete Virtual Columns (both) 2 of 2

• Add two or more columns

• Move up or down

• Choose field separator and other settings

• Provide a name and description

• Treated like any other column. You can analyze, run Rules against it, etc.
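
Conceptually, a virtual column is an ordered concatenation of existing columns with a separator. The sketch below shows that idea in plain Python; `add_virtual_column` is a hypothetical helper, and treating missing values as empty strings is an assumption, not documented product behavior.

```python
def add_virtual_column(rows, name, source_columns, separator=" "):
    """Return new row dicts with `name` set to the source columns joined in the given order."""
    if len(source_columns) < 2:
        raise ValueError("a virtual column combines two or more columns")
    result = []
    for row in rows:
        # Assumption: a missing value contributes an empty string.
        parts = ["" if row[c] is None else str(row[c]) for c in source_columns]
        result.append({**row, name: separator.join(parts)})
    return result


rows = [{"FIRST": "Ada", "LAST": "Lovelace"}, {"FIRST": "Alan", "LAST": None}]
with_full_name = add_virtual_column(rows, "FULL_NAME", ["LAST", "FIRST"], separator=", ")
```

Changing the order of `source_columns` corresponds to the "Move up or down" step, and the resulting column can then be profiled or checked by rules like any other.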


Page 33: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Limit # of Data Rule output exceptions (both)

• Sometimes the first 100 or 1000 exceptions are more than enough to share in order to describe and diagnose the quality issue

• Can be a big time savings and disk savings vs the output of all exceptions
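
The trade-off can be sketched as follows: every record is still evaluated, so the pass/fail counts stay exact, but only the first few failing records are kept as output. The `run_rule` function and its `max_exceptions` parameter are illustrative names, not product settings.

```python
def run_rule(records, rule, max_exceptions=100):
    """Evaluate `rule` on every record but keep at most `max_exceptions` failing records."""
    total = failed = 0
    kept = []
    for record in records:
        total += 1
        if not rule(record):
            failed += 1
            if len(kept) < max_exceptions:
                kept.append(record)  # only the first N exceptions are stored
    return {"total": total, "failed": failed, "exceptions": kept}


# Example: 500 of 1000 values fail, but only 3 exception rows are written out.
result = run_rule(range(1000), lambda n: n % 2 == 0, max_exceptions=3)
```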


Page 34: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Validity Benchmark is back (both)

• A longtime IA feature that some customers are using

• Added to help those customers make the move to the new UI and to WKC


Page 35: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

‘Manage’ Flag in Data Rules (IIS only today)

• Previously only available in DQEC

• And only showed up in DQEC if the Data Rule has been executed


Page 36: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Planned Live Demo


Page 37: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

IBM Cloud Pak for Data WKC
Select Roadmap Items


Page 39: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

What Can We Expect in the Next Release?

Planned for mid-June, 2020 release (subject to change): WKC 3.0 and 11.7.1 FP1

1. New much more intuitive Data Quality menu structure (both)
2. Negative term classification (both)
3. WKC experience for Data Rule exceptions (DQEC replacement) (WKC)
4. Data Rule binding drag and drop (both)
5. Visualization of Data Quality scores over time (both)
6. On-going DQ architecture modernization (WKC)
7. New ‘Column Similarity’ (aka Fingerprint) data class (WKC)
8. Many minor UX improvements (retain user preferences, etc.) (both)
9. Relationship Analysis more intuitive (both)
10. Globalization (Translation of our UIs into several languages) (WKC)
11. ML Based Data Rule Definition Generation (WKC)
12. Suggested Automation Rule (available today in 11.7.1 SP2, planned for WKC)

Page 40: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Negative Term Classification

• Improving DQ & Governance for business term assignment

• Remember what the user has manually rejected

• Compare to what is already published
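
One way to picture "remember what the user has manually rejected" is a small store of rejected column/term pairs that filters later suggestions. The slides do not describe the actual mechanism, so the `TermSuggester` class below is purely a hypothetical sketch, and it does not attempt the comparison against published assignments.

```python
class TermSuggester:
    """Hypothetical sketch: suppress term suggestions a user has already rejected."""

    def __init__(self):
        self.rejected = set()  # (column, term) pairs rejected by a user

    def reject(self, column, term):
        self.rejected.add((column, term))

    def filter(self, column, candidates):
        """Drop previously rejected candidates; keep the rest in ranking order."""
        return [
            (term, score)
            for term, score in candidates
            if (column, term) not in self.rejected
        ]


suggester = TermSuggester()
suggester.reject("CUST.CODE", "Postal Code")
remaining = suggester.filter("CUST.CODE", [("Postal Code", 0.91), ("Customer ID", 0.74)])
```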


Page 41: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Innovation – Column Similarity


• ‘No Class Detected’ columns are grouped based on similarity

• User can inspect each group, determine the cutoff score

• Create a new codeless Data Class

• The next time analysis is run, the new Data Class is working

• This is a quick way to create codeless custom Data Classes that are unique to a given customer’s data

Page 42: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Easy Data Class Creation – ‘Column Similarity’

• Mimic how a human brain thinks

• Find patterns that are similar across the multiple datasets under evaluation

• Present them to the user as clusters of “similar patterns”
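
The clustering idea can be approximated with a toy example. The slides do not say how the product computes similarity, so everything here is an assumption: each value is reduced to a format pattern, a column's fingerprint is its set of patterns, and columns are grouped greedily when the Jaccard similarity of their fingerprints reaches a cutoff.

```python
import re


def value_pattern(value):
    """Reduce a value to its format, e.g. 'AB-1234' -> 'AA-9999'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def fingerprint(values):
    """Set of format patterns seen in a column (empty values ignored)."""
    return {value_pattern(v) for v in values if v}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0


def group_columns(columns, cutoff=0.5):
    """Greedy grouping: a column joins the first group whose seed fingerprint is similar enough."""
    groups = []  # list of (seed_fingerprint, [column names])
    for name, values in columns.items():
        fp = fingerprint(values)
        for seed, members in groups:
            if jaccard(seed, fp) >= cutoff:
                members.append(name)
                break
        else:
            groups.append((fp, [name]))
    return [members for _, members in groups]


columns = {
    "T1.POLICY_NO": ["AB-1234", "CD-5678"],
    "T2.POL_NUM": ["ZZ-0001"],
    "T3.PHONE": ["555-0100"],
}
clusters = group_columns(columns)
```

In this toy data the two policy-number columns share a fingerprint and land in one cluster, which a user could then inspect, adjust the cutoff for, and turn into a new data class.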


Page 43: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

New Visualizations and Navigation


Page 44: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

New Visualizations – Data Quality score over time


Page 45: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

New Visualizations – Data Quality score over time


Page 46: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

New Navigation Structure


Page 47: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

New Navigation Structure


Page 48: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Relationship Analysis


Page 49: IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak ...

Thank you

Dan Schallenkamp
Data and AI, Offering Manager for Data Quality
[email protected]
+1-704-458-0467
