How can Data Cataloging Integrate Key Viewpoints using Knowledge Graphs?



TOPQUADRANT COMPANY

FOUNDATION

▪ TopQuadrant was founded in 2001

▪ Strong commitment to standards-based approaches to data semantics

MISSION

▪ Empower people and drive results — by making enterprise information meaningful

FOCUS

▪ Provide comprehensive data governance solutions using knowledge graph technologies


© Copyright 2020 TopQuadrant Inc. Slide 3

Today’s Agenda

▪ Data management/modeling concepts and their use today

– Logical models, physical models, data dictionaries, business glossaries, data catalogs and more

▪ The common challenge in data governance

– Connecting different perspectives and aspects of the data

▪ Knowledge Graphs

– What are they

– How do they help in solving the challenge

▪ Real life examples

▪ Q & A

Irene Polikoff


Logical Data Model

▪ Who creates

– Business analysts (typically in larger IT shops)

▪ When

– At the beginning of system design, during the requirements gathering and high-level design phase. Often not updated as the system changes, unless the development methodology insists that the starting point of any data change is the logical model.

▪ Why

– To serve as a communication and specification artifact. Focuses on a single system.

▪ Format

– Can be a PowerPoint or Visio diagram or, more formally, created in a specialized data modeling tool: PowerDesigner, Idera, Archi, Erwin, etc. A number of notations exist (all focused on diagrammatic depiction): UML, IDEF, …

▪ Contains

– Entities, relationships and attributes. Does not concern itself with implementation details, e.g., datatypes.

“hotness” or “in vogue” indicator


Physical Data Model

▪ Who creates

– Database or software architects, software developers, database administrators

▪ When

– At the system design and implementation phase. Updated as the system evolves.

▪ Why

– It is necessary for implementation. One per system/source.

▪ Format

– A set of DDL commands but, ultimately, a physical data model is embodied by a database. For a non-RDBMS source, it can be an XML Schema, JSON Schema, etc.

▪ Contains

– Tables, views, columns, data elements, exact data formats, primary and foreign keys, etc.

neither “hot” nor “cold”, physical data models are (mostly) always there
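
To make "a set of DDL commands, ultimately embodied by a database" concrete, here is a minimal sketch using Python's standard sqlite3 module. The two tables and their columns are hypothetical and only illustrate the idea: the DDL is the model as written, and the database catalog is the model as embodied.

```python
import sqlite3

# Hypothetical two-table physical model; names are illustrative only.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL
);
CREATE TABLE purchase_order (
    po_id       INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    issued_on   TEXT NOT NULL,
    total_cents INTEGER NOT NULL
);
"""


def build_database() -> sqlite3.Connection:
    """Run the DDL so the physical model is embodied by an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def describe_table(conn, table):
    """Return (column name, declared type, is primary key) for each column."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # PRAGMA table_info columns: cid, name, type, notnull, dflt_value, pk
    return [(r[1], r[2], bool(r[5])) for r in rows]


def foreign_keys(conn, table):
    """Return (local column, referenced table, referenced column) triples."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # PRAGMA foreign_key_list columns: id, seq, table, from, to, ...
    return [(r[3], r[2], r[4]) for r in rows]


conn = build_database()
columns = describe_table(conn, "purchase_order")
fks = foreign_keys(conn, "purchase_order")
```

Reading the structure back from the database, rather than from the DDL text, mirrors the point that the database itself is the authoritative physical model.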


Data Dictionary (Metadata Registry)

▪ Who creates

– A combination of roles, but mostly IT

▪ When

– A data dictionary can cover a single system, but in the context of metadata registries it describes many data sources. A dictionary for a specific system may be initially populated during the implementation phase, but is often created after the fact. Must be continuously maintained and updated.

▪ Why

– For a single system, as a way to have human-readable documentation outside of the database. The goal of a metadata registry is to serve as a one-stop shop for IT system analysts, designers and developers to help them understand their data.

– Can be used to help create business reports, develop new systems, find relevant data and, more generally, to understand, interpret, aggregate, integrate and translate data.

▪ Format

– Documents (spreadsheet, PDF, Word). Sometimes just stored in each database, with users adding comments with additional info. If a traditional metadata registry is used, information is typically stored in an RDBMS.

▪ Contains

– Describes the data structure as well as its content, e.g., number of records, where the data comes from, its meaning, calculations, where it is used, etc.

metadata registries are out of fashion
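
As an illustration of a dictionary that is "created post factum", the sketch below harvests structure and record counts from an existing database and leaves the meaning to be filled in by people. It uses Python's standard sqlite3 module; the `patient` table and the shape of each entry are assumptions made for the example, not a registry format from the talk.

```python
import sqlite3


def harvest_dictionary(conn):
    """Build one dictionary entry per column, with the table's record count."""
    entries = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        record_count = conn.execute(
            f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for row in conn.execute(f"PRAGMA table_info({table})").fetchall():
            # PRAGMA table_info columns: cid, name, type, notnull, dflt_value, pk
            entries.append({
                "table": table,
                "column": row[1],
                "type": row[2],
                "record_count": record_count,
                "definition": None,  # meaning cannot be harvested; a person adds it
            })
    return entries


# Hypothetical source system, used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, birth_date TEXT)")
conn.executemany(
    "INSERT INTO patient (birth_date) VALUES (?)",
    [("1980-01-01",), ("1975-06-30",)])
dictionary = harvest_dictionary(conn)
```

Structure and counts can be collected automatically; definitions, calculations and usage are the parts that need continuous human maintenance.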


Enterprise Logical Data Model

▪ Who creates

– Data modelers

▪ When

– Typically triggered by large-scale data warehousing efforts

▪ Why

– A process- and technology-independent representation of business data: a single view of the business that is not adulterated by IT concepts and processes.

– Should focus only on the business data. Disregard all that “nasty IT stuff” (queries, reporting tools, ETL, OLAP, databases) and focus on the business perspective. The business gets its opportunity to lay a foundation for the ongoing documentation of business definitions. Covers multiple systems; ultimately, one per enterprise.

▪ Format

– Most commonly, data modeling tools, diagrams and printouts are used to communicate results

▪ Contains

– Entities, relationships and attributes. Does not concern itself with implementation details, e.g., datatypes.

few successes, hard to do, it is deeply out of fashion


Enterprise Logical Data Model

Some lessons learned:

▪ Any initiative that requires establishing enterprise-wide shared definitions and agreements is difficult

▪ Perspectives vary depending on the context

▪ For better chances of success, produced artifacts need to:

– Be flexible enough to accommodate different perspectives: some commitments are global, but many are local

– Be easily accessible and usable by a broad range of stakeholders

– Be easily evolvable

– Deliver tangible value early and incrementally, so that the business continues to invest

few successes, hard to do, it is deeply out of fashion


Business Glossary

▪ Who creates them

– Business analysts, data stewards, business subject matter experts.

▪ When

– Defines shared terms. While multiple business glossaries can exist, they are not partitioned by the boundary of any given system or database, but rather by a common set of business functions and processes, e.g., sales and marketing, manufacturing, etc. Must be continuously maintained and updated as business needs evolve.

▪ Why

– The term business glossary has been part of mainstream data management speak for a much shorter period than data dictionary or logical data model. Data dictionaries and logical models are seen very much as IT-owned artifacts, while glossaries are believed to be created and maintained by the business and are more business oriented. The main focus of the glossary is on information designed to improve business understanding and use of data.

▪ Format

– Documents (spreadsheet, PDF, Word). If a business glossary tool is used, information is typically stored in its database.

▪ Contains

– Defines the standardized meaning of key business terms. May include some description of rules and calculations. Is not database oriented. Each term is defined individually, so any database structure or “enterprise object” structure information is limited.

business glossaries are in fashion: relatively easier to do


Business Terms have Relationships and different Meaning in different Contexts – even for Glossaries!

A purchase order is an official document issued by a customer (buyer) committing them to paying the seller for the sale of specific products or services to be delivered in the future.

A blanket PO is a commitment from a buyer to buy products or services from a seller on an ongoing basis, until a certain maximum is reached.

PO Payment Terms describe when a buyer will make a payment to a seller. For example, on delivery of purchased goods or in 30 days after invoice is received.


Data Catalog

▪ Who creates them

– Multiple roles, gathered from sources

▪ When

– Never for a single system. Must be continuously maintained and updated as the data landscape evolves.

▪ Why

– An even more recent term than business glossary. Created with the goal of serving as a one-stop shop for all data stakeholders to help them understand their data. Can be used to help create business reports, develop new systems, find relevant data and, more generally, to help in the use and interpretation of data.

▪ Format

– Whatever format is used by the database behind the data catalog system. The closed nature of most data cataloging systems remains a problem.

▪ Contains

– Everything you would normally see in a data dictionary/metadata registry. Can also contain some data samples.

“Everyone wants data management but most want to avoid metadata management”

A sexier term than metadata registry?


A Common Theme

▪ As goals move beyond one system to a “one-stop-shop” / single view across / holistic approach, the separation between the artifacts starts to blur

▪ Each offers a slightly different take or approach towards meeting the same objective


But wait, there is more to the picture!

▪ Requirements and Regulations

– Data, security, privacy, etc.

– Typically, in documents

▪ Reference Data

– Controlled values or codes used to describe other data

– Key to data understanding, integration and aggregation

– Different sets of values may be used for the same entity, e.g., Gender

– May be stored in databases, software, exported as datasets into data lakes, etc.

▪ Lineage

– Complex information flows

– Can be captured at different levels: business vs. technical

– Documents, ETL scripts

▪ Data Management/Governance Model

– How governance is organized: processes, responsibilities, policies, etc.
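
The reference data point, that different sets of values may be used for the same entity, can be sketched as a mapping through a shared canonical value. The two systems (`hr`, `clinical`) and their codes are hypothetical and chosen only to show why reference data is key to integration and aggregation.

```python
# Hypothetical code sets: two systems encode the same entity (gender) differently.
CODE_SETS = {
    "hr": {"M": "male", "F": "female", "U": "unknown"},
    "clinical": {"1": "male", "2": "female", "9": "unknown"},
}


def to_canonical(system: str, code: str) -> str:
    """Map a system-specific code to the shared canonical value."""
    try:
        return CODE_SETS[system][code]
    except KeyError:
        raise ValueError(
            f"no mapping for code {code!r} in system {system!r}") from None


def translate(code: str, source: str, target: str) -> str:
    """Translate a code from one system's code set into another's."""
    canonical = to_canonical(source, code)
    for target_code, meaning in CODE_SETS[target].items():
        if meaning == canonical:
            return target_code
    raise ValueError(f"{canonical!r} has no code in system {target!r}")


clinical_code = translate("F", "hr", "clinical")
```

Without the shared canonical values, records from the two systems could not be aggregated by gender even though both capture it.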


The Problem

Revamping of the same ideas, still in a disconnected form


Solution to Context and Connectivity - Knowledge Graphs

• What rules and regulations apply to its data?

• What systems process it?

• Where do they process it?

• What is its provenance?

• What logical entities, business terms, etc. is it mapped to?

• What is its structure?

• And more …


Enter Knowledge Graphs

▪ A Knowledge Graph represents a knowledge domain

▪ It represents knowledge as a graph

– A network of nodes and links

– Not tables of rows and columns

▪ It represents facts (data) and models (metadata) in the same way

– Rich rules and inferencing

▪ It is based on open standards, from top to bottom

– Readily connects to knowledge in private and public clouds

There can be different types and instances of Knowledge Graphs …


KNOWLEDGE GRAPHS

[Timeline figure]

2001: Tim Berners-Lee

2012: Google

2019-2020: Knowledge Graphs in the news

> Google on AI and knowledge graphs in ZDNET

> Data governance 2.0 on Dataversity

> Conferences & workshops

> Technology trends 2019

> Forbes article by Kurt Cagle


Typical Use Cases for Knowledge Graphs - All needed for Data Governance

1. Graph Traversal

2. Graph Analytics

3. Data Integration

4. Data Aggregation

5. Information Insights

6. Lineage

Slide annotations:

– Graph algorithms: Statistics, Centrality, Shortest Path, …

– Inferred dependencies across composable graphs and domains: enterprise, technical, data, governance

– Traverse across connected graphs

– Find things that share common attributes or relationships

– Query brokering across disparate systems by mappings to a unified model

– 360 Degree View across composable graphs

[Diagram: connected graphs KG1, KG2 and KG3 with nodes Drug, Target, Pathway, Disease, Protein and Gene, linked by hasTarget, hasPathway, improvedBy, includes and codedBy]
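
A minimal sketch of the first use case, graph traversal, in plain Python. The node and link names come from the slide's diagram, but which node each link connects is an assumption here, since the extracted text does not preserve the arrows.

```python
from collections import deque

# Node and link names are from the slide; the direction and endpoints
# of each link are assumed for illustration.
EDGES = [
    ("Disease", "improvedBy", "Drug"),
    ("Drug", "hasTarget", "Target"),
    ("Target", "hasPathway", "Pathway"),
    ("Pathway", "includes", "Protein"),
    ("Protein", "codedBy", "Gene"),
]


def find_path(edges, start, goal):
    """Breadth-first search; returns the list of links from start to goal, or None."""
    adjacency = {}
    for subject, predicate, obj in edges:
        adjacency.setdefault(subject, []).append((predicate, obj))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, predicate, neighbor)]))
    return None


path = find_path(EDGES, "Disease", "Gene")
```

Because the links are typed, the returned path reads as a chain of statements rather than a bare list of nodes, which is what makes traversal useful for questions such as lineage.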


Graph of Facts about a Table

[Figure: facts shown as SUBJECT, PREDICATE, OBJECT triples, using predicates such as rdf:type and edg:columnOf]


Model

Graph Describing What Kind of Information We Want to Capture about a Table and What it Means
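
The idea that facts and the model describing them live in the same graph can be sketched with plain Python tuples. The predicates `rdf:type` and `edg:columnOf` appear on the slides; the `ex:` identifiers, the class names and the `rdfs:domain` / `rdfs:range` statements are assumptions added for illustration and are not taken from the TopBraid EDG ontology.

```python
# Facts: statements about one hypothetical table and its columns.
FACTS = [
    ("ex:CUSTOMER", "rdf:type", "edg:DatabaseTable"),
    ("ex:CUSTOMER.ID", "rdf:type", "edg:DatabaseColumn"),
    ("ex:CUSTOMER.ID", "edg:columnOf", "ex:CUSTOMER"),
    ("ex:CUSTOMER.NAME", "rdf:type", "edg:DatabaseColumn"),
    ("ex:CUSTOMER.NAME", "edg:columnOf", "ex:CUSTOMER"),
]

# Model: statements about what the predicate means, in the same triple form.
MODEL = [
    ("edg:columnOf", "rdfs:domain", "edg:DatabaseColumn"),
    ("edg:columnOf", "rdfs:range", "edg:DatabaseTable"),
]

GRAPH = FACTS + MODEL


def match(graph, s=None, p=None, o=None):
    """Return the triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# A question about the facts: which columns belong to the table?
columns = sorted(t[0] for t in match(GRAPH, p="edg:columnOf", o="ex:CUSTOMER"))

# A question about the model, answered by the same function.
expected_subject_type = match(GRAPH, s="edg:columnOf", p="rdfs:domain")[0][2]
```

One query function serves both data and metadata, which is the practical meaning of representing them "in the same way".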


Why Knowledge Graphs for Data Governance

▪ Knowledge Graphs provide an open, extensible and “smart” means to represent diverse asset types

▪ Using knowledge graphs yields an interconnected set of information that meaningfully bridges enterprise metadata silos.

Personally Identifiable Information (PII)

GDPR example of a Knowledge Graph


Real Life Customer Story: A Large Financial Services Organization

▪ Mature data governance program

▪ Logical and physical models in Erwin and other tools

▪ Databases, datasets, data lake

▪ Requirement and regulation documents in various formats

▪ Data transformation scripts in Informatica, AbInitio, stored procedures in databases, custom scripts

▪ Business glossary in Collibra; user groups download it and work in Excel

▪ Various other artifacts, e.g., lineage in spreadsheets and other formats, a custom-built applications catalog

ETL


Creating a Knowledge Graph

[Diagram labels: Sources, ETL, Data Governance Knowledge Graph]


Goal: Enable Foundational Transformation

Data Governance Knowledge Graph for Enterprise

Operational Efficiencies

Compliance Adherence

Analytical Insight

Who provides this data and who is using it?

What data is required for this function?

When does this data need to be utilized?

Where should it go?

What form should it take?


Results

▪ Ingested

– Over 500 logical models with over 50K entities

– Similar number of physical models, with over 50K tables, all connected

– Over 1M data elements

– Close to 1,000 applications

– Many ETL processes

▪ Established

– Ongoing ingests

– 360 degree reporting and metrics from many different perspectives:

• business area, owner, mapping to PII, applications, business process, and more

– Managed approach to archiving operational data into a data lake


Surprises (aka Shocks!)

▪ Number of distinct entities and data elements

▪ Definition coverage for data elements – under 8%

▪ Definition coverage for entities/tables – under 2%

▪ Firm has a long-standing policy that all entities must be defined

– Not possible to effectively operationalize the policy, or even know whether it is enforced, when there is no connected view

▪ Some data areas/systems are much better covered with definitions than others

– The range is between 0 and 70% for data elements

– Consistently, coverage of entities is significantly lower than coverage of elements
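
Once assets sit in one connected view, definition coverage becomes a simple computation. The sketch below shows the calculation per system and asset kind; the record layout and the sample assets are hypothetical and do not reproduce the customer's data or figures.

```python
from collections import defaultdict


def definition_coverage(assets):
    """Return {(system, kind): share of assets with a non-empty definition}."""
    totals = defaultdict(int)
    defined = defaultdict(int)
    for asset in assets:
        key = (asset["system"], asset["kind"])
        totals[key] += 1
        if (asset.get("definition") or "").strip():
            defined[key] += 1
    return {key: defined[key] / totals[key] for key in totals}


# Hypothetical sample; "kind" distinguishes entities/tables from data elements.
ASSETS = [
    {"system": "billing", "kind": "element", "name": "INVOICE.AMOUNT",
     "definition": "Total amount due"},
    {"system": "billing", "kind": "element", "name": "INVOICE.STATUS",
     "definition": ""},
    {"system": "billing", "kind": "entity", "name": "INVOICE",
     "definition": None},
    {"system": "crm", "kind": "element", "name": "CUSTOMER.NAME",
     "definition": "Legal name"},
]

coverage = definition_coverage(ASSETS)
```

Grouping by system is what surfaces the spread between well-covered and poorly covered areas, and grouping by kind shows the gap between entities and elements.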


Data Governance Triangle


Assets Example for Medical Enterprise

[Diagram. Labels recovered from the slide:]

Collection types: Governance Assets; Lineage Asset Collections; Technical Asset Collections; Data Asset Collections; Enterprise Asset Collections; Other Collections: Reference Datasets, Glossaries, Documents; Pre-built Assets

Example collections: Lineage for Patient Discharge; Lineage for Patient Invoice; Hospital Facilities and Staff; Hospital Information Systems Catalog; Hospital Information Assets; Hospital Activities and Processes; Logical Flows of Data Exchanges; Healthcare Glossary; Logical Model for Hospital Information; Open MRS Data Assets; RXNorm; Catalog of GNU Health Database; Medical Information Public Use Datasets (PUF); ICD10; Form Templates Corpus

Governance: Governance Areas; Metrics; Governance Roles and Organization; Workflow Templates; Dashboards; Users; Issues; Policies

Ontologies: Lineage Model; Enterprise Assets Model; Technical Assets Model; Data Assets Model; Glossary, Reference Data and other Models; Governance Model


Key Take Away Points

▪ The goal of data governance is to make it possible to answer important questions for a range of data stakeholders

▪ Many different artifacts get produced and used in managing information

▪ Each contributes its own part of the picture

▪ Answering most of the key questions requires bringing the whole picture together

▪ Knowledge graphs offer a powerful and adaptive approach for doing this


Benefits of a Knowledge Graph based Platform for Data Governance 2.0

As an enterprise knowledge graph infrastructure, TopBraid EDG supports Data Governance 2.0 and applications of AI / ML

TopBraid Enterprise Data Governance (EDG):

▪ Is flexible and extensible, based on standards

▪ Integrates reasoning and machine learning

▪ Enables people (UI) and software (APIs/web services) to view, follow and query

▪ Bridges data and metadata “silos” for seamless data governance

▪ Delivers knowledge-driven data governance


… Questions?

Thank You !


To Learn More about TopBraid EDG and Knowledge Graphs:

EDG Product Info:

▪ TopBraid Enterprise Data Governance (TopBraid EDG) (https://www.topquadrant.com/products/topbraid-enterprise-data-governance/)

Contact us at [email protected] to:

▪ Discuss data governance and knowledge graphs

▪ Request a more targeted demo of TopBraid EDG

▪ Ask for a free EDG evaluation account


Latest Whitepaper

Download a copy on our website at: topquadrant.com/knowledge-assets/whitepapers/


More Resources ...

More Webinar Recordings, Slides, Q&A:

▪ https://www.topquadrant.com/knowledge-assets/topquadrant-webinars/

Short Videos:

▪ TopBraid EDG “Quick Grok” Videos: https://www.topquadrant.com/knowledge-assets/videos/

▪ TopBraid EDG Animated Video: https://www.topquadrant.com/project/edg_agile_modular/

Blog:

▪ https://www.topquadrant.com/the-semantic-ecosystems-journal/

Data Governance White Papers:

▪ https://www.topquadrant.com/knowledge-assets/whitepapers/