How can Data Cataloging
Integrate Key Viewpoints using Knowledge Graphs?
TOPQUADRANT COMPANY
FOUNDATION
▪ TopQuadrant was founded in 2001
▪ Strong commitment to standards-based approaches to data semantics
MISSION
▪ Empower people and drive results — by making enterprise information meaningful
FOCUS
▪ Provide comprehensive data governance solutions using knowledge graph technologies
© Copyright 2020 TopQuadrant Inc. Slide 3
Today’s Agenda
▪ Data management/modeling concepts and their use today
– Logical models, physical models, data dictionaries, business glossaries, data catalogs and more
▪ The common challenge in data governance
– Connecting different perspectives and aspects of the data
▪ Knowledge Graphs
– What are they
– How do they help in solving the challenge
▪ Real life examples
▪ Q & A
Irene Polikoff
Logical Data Model
▪ Who creates
– Business analysts (typically in larger IT shops)
▪ When
– At the beginning of system design, during the requirements gathering and high-level design phase. Often not updated as the system changes, unless the development methodology insists that the starting point of any data change is the logical model.
▪ Why
– To serve as a communication and specification artifact. Focuses on a single system.
▪ Format
– Can be a PowerPoint or Visio diagram or, more formally, created in a specialized data modeling tool: PowerDesigner, Idera, Archi, Erwin, etc. A number of notations exist (all focus on diagrammed depiction): UML, IDEF, …
▪ Contains
– Entities, relationships and attributes. Does not concern itself with implementation details, e.g., datatypes.
“hotness” or “in vogue” indicator
Physical Data Model
▪ Who creates
– Database or software architects, software developers, database administrators
▪ When
– At the system design and implementation phase. Updated as the system evolves.
▪ Why
– It is necessary for implementation. One per system/source.
▪ Format
– A set of DDL commands but, ultimately, a physical data model is embodied by a database. For non-RDBMS sources, it can be XML Schema, JSON Schema, etc.
▪ Contains
– Tables, views, columns, data elements, exact data formats, primary and foreign keys, etc.
neither “hot” nor “cold”, physical data models are (mostly) always there
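As an illustration of the point that a physical data model is ultimately embodied by a database, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema (customer, purchase_order) is a hypothetical example, not taken from the talk. The sketch runs the DDL and then reads tables, columns, types and keys back from the database's own catalog, which is roughly what a harvesting step for a data catalog would do.

```python
import sqlite3

# Hypothetical DDL for illustration only; table and column names are assumptions.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchase_order (
    po_id       INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL
);
"""


def describe_physical_model(conn):
    """Return {table: {"columns": [(name, type, is_pk)],
                       "foreign_keys": [(column, ref_table, ref_column)]}}."""
    model = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [(r[1], r[2], bool(r[5]))
                   for r in conn.execute(f"PRAGMA table_info({table})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        foreign_keys = [(r[3], r[2], r[4])
                        for r in conn.execute(f"PRAGMA foreign_key_list({table})")]
        model[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return model


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
physical_model = describe_physical_model(conn)
```

The same idea applies to other databases through their own catalog views; only the introspection queries differ.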
Data Dictionary (Metadata Registry)
▪ Who creates
– Combination of roles, but mostly IT
▪ When
– While a data dictionary can be for a single system, in the context of metadata registries a data dictionary describes many data sources. A dictionary for a specific system may be initially populated during the implementation phase, but is often created after the fact. Must be continuously maintained and updated.
▪ Why
– For a single system, as a way to have human-readable documentation outside of the database. The goal of a metadata registry is to serve as a one-stop-shop for IT system analysts, designers and developers to help them understand their data.
– Can be used to help create business reports, develop new systems, find relevant data and, more generally, to understand, interpret, aggregate, integrate and translate data.
▪ Format
– Documents (spreadsheet, PDF, Word). Sometimes just stored in each database, with users adding comments with additional info. If a traditional metadata registry is used, information is typically stored in an RDBMS.
▪ Contains
– Describes the data structure as well as its content, e.g., number of records, where data comes from, its meaning, calculations, where used, etc.
metadata registries are out of fashion
Enterprise Logical Data Model
▪ Who creates
– Data modelers
▪ When
– Typically triggered by large-scale data warehousing efforts
▪ Why
– Process- and technology-independent representation of business data: a single view of the business that is not adulterated by IT concepts and processes.
– Should focus only on the business data. Disregard all that “nasty IT stuff” (queries, reporting tools, ETL, OLAP, databases) and focus on the business perspective. The business gets its opportunity to lay a foundation for the ongoing documentation of the business definitions. Covers multiple systems; ultimately, one per enterprise.
▪ Format
– Most commonly, data modeling tools, diagrams and printouts are used to communicate results
▪ Contains
– Entities, relationships and attributes. Does not concern itself with implementation details, e.g., datatypes.
few successes, hard to do, it is deeply out of fashion
Enterprise Logical Data Model
Some lessons learned:
▪ Any initiative that requires establishing enterprise-wide shared definitions and agreements is difficult
▪ Perspectives vary depending on the context
▪ For better chances of success, produced artifacts need to:
– Be flexible enough to accommodate different perspectives: some commitments are global, but many are local
– Be easily accessible and usable by a broad range of stakeholders
– Be easily evolvable
– Deliver tangible value early and incrementally, so that the business continues to invest
Business Glossary
▪ Who creates them
– Business analysts, data stewards, business subject matter experts.
▪ When
– Defines shared terms. While multiple business glossaries can exist, they are not partitioned along the boundary of any given system or database, but rather by a common set of business functions and processes, e.g., sales and marketing, manufacturing, etc. Must be continuously maintained and updated as business needs evolve.
▪ Why
– The term business glossary has been part of mainstream data management speak for a much shorter period than data dictionary or logical data model. Data dictionaries and logical models are seen very much as IT-owned artifacts, while glossaries are believed to be created and maintained by the business and are more business oriented. The main focus of the glossary is on information designed to improve business understanding and use of data.
▪ Format
– Documents (spreadsheet, PDF, Word). If a business glossary tool is used, information is typically stored in its database.
▪ Contains
– Defines the standardized meaning of key business terms. May include some description of rules and calculations. Is not database oriented. Each term is defined individually, so any database structure or “enterprise object” structure information is limited.
business glossaries are in fashion and, relatively, easier to do
Business Terms have Relationships and different Meaning in different Contexts – even for Glossaries!
A purchase order is an official document issued by a customer (buyer) committing them to paying the seller for the sale of specific products or services to be delivered in the future.
A blanket PO is a commitment from a buyer to buy products or services from a seller on an ongoing basis, until a certain maximum is reached.
PO Payment Terms describe when a buyer will make a payment to a seller. For example, on delivery of purchased goods or in 30 days after invoice is received.
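The three definitions above are related to each other, which a flat list of terms does not capture. A minimal sketch of a glossary that records such relationships follows. The relationship names `broader` and `related` are assumptions (loosely modeled on common thesaurus vocabulary), and the choice of which term links to which is an interpretation of the slide, not something it states.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    definition: str
    broader: list = field(default_factory=list)  # labels of more general terms
    related: list = field(default_factory=list)  # labels of associated terms


# Definitions are shortened from the slide; the links are assumptions.
glossary = {
    "Purchase Order": Term(
        "Purchase Order",
        "An official document issued by a buyer committing them to pay the "
        "seller for specific products or services to be delivered in the future.",
    ),
    "Blanket PO": Term(
        "Blanket PO",
        "A commitment from a buyer to buy products or services from a seller "
        "on an ongoing basis, until a certain maximum is reached.",
        broader=["Purchase Order"],
    ),
    "PO Payment Terms": Term(
        "PO Payment Terms",
        "Describe when a buyer will make a payment to a seller.",
        related=["Purchase Order"],
    ),
}


def terms_connected_to(label):
    """Labels of terms that name `label` as a broader or related term."""
    return sorted(
        term.label
        for term in glossary.values()
        if label in term.broader or label in term.related
    )


connected = terms_connected_to("Purchase Order")
```

Once terms carry links like these, a reader looking at "Purchase Order" can be shown its specializations and associated terms instead of an isolated definition.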
Data Catalog
▪ Who creates them
– Multiple roles, gathered from sources
▪ When
– Never for a single system. Must be continuously maintained and updated as the data landscape evolves.
▪ Why
– An even more recent term than business glossary. Created with the goal of serving as a one-stop-shop for all data stakeholders to help them understand their data. Can be used to help create business reports, develop new systems, find relevant data and, more generally, to help in the use and interpretation of data.
▪ Format
– Whatever format is used by the database behind the data catalog system. The closed nature of most data cataloging systems remains a problem.
▪ Contains
– Everything you would normally see in a data dictionary/metadata registry. Can also contain some data samples.
“Everyone wants data management but most want to avoid metadata management”
A sexier term than metadata registry?
A Common Theme
▪ As goals move beyond one system to a “one-stop-shop” / single view across / holistic approach, the separation between the artifacts starts to blur
▪ Each offers a slightly different take or approach towards meeting the same objective
But wait, there is more to the picture!
▪ Requirements and Regulations
– Data, security, privacy, etc.
– Typically, in documents
▪ Reference Data
– Controlled values or codes used to describe other data
– Key to data understanding, integration and aggregation
– Different sets of values may be used for the same entity, e.g., Gender
– May be stored in databases, software, exported as datasets into data lakes, etc.
▪ Lineage
– Complex information flows
– Can be captured at different levels: business vs. technical
– Documents, ETL scripts
▪ Data Management/Governance Model
– How is governance organized: processes, responsibilities, policies, etc.
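The reference data point above (different sets of values for the same entity, such as Gender) can be sketched as a crosswalk from system-specific codes to one shared value set. The system names (`crm`, `billing`) and all codes below are hypothetical, chosen only to show why such mappings matter for aggregation.

```python
from collections import Counter

# Hypothetical code sets for the same concept in two systems (assumptions).
CROSSWALK = {
    "crm": {"M": "male", "F": "female", "U": "unknown"},
    "billing": {"1": "male", "2": "female", "9": "unknown"},
}


def to_canonical(system, code):
    """Translate a system-specific code to the shared value.

    Unmapped codes raise an error instead of being silently dropped,
    so gaps in the reference data become visible.
    """
    try:
        return CROSSWALK[system][code]
    except KeyError:
        raise ValueError(
            f"no mapping for code {code!r} in system {system!r}") from None


# Records from both systems can be aggregated only after translation.
records = [("crm", "F"), ("billing", "2"), ("crm", "M"), ("billing", "9")]
counts = Counter(to_canonical(system, code) for system, code in records)
```

Without the crosswalk, "F" and "2" would be counted as two different values even though they describe the same thing.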
The Problem: Revamping of the same ideas – still, in a disconnected form
Solution to Context and Connectivity - Knowledge Graphs
• What rules and regulations apply to its data?
• What systems process it?
• Where do they process it?
• What is its provenance?
• What logical entities, business terms, etc. is it mapped to?
• What is its structure?
• And more …
Enter Knowledge Graphs
▪ A Knowledge Graph represents a knowledge domain
▪ It represents knowledge as a graph
– A network of nodes and links
– Not tables of rows and columns
▪ It represents facts (data) and models (metadata) in the same way
– Rich rules and inferencing
▪ It is based on open standards, from top to bottom
– Readily connects to knowledge in private and public clouds
There can be different types and instances of Knowledge Graphs …
KNOWLEDGE GRAPHS
[Timeline figure; recoverable labels: 2001, Tim Berners-Lee; 2012; 2019-2020]
Knowledge Graphs in the news
> Google on AI and knowledge graphs in ZDNET
> Data governance 2.0 on Dataversity
> Conferences & workshops
> Technology trends 2019
> Forbes article by Kurt Cagle
Typical Use Cases for Knowledge Graphs - All Needed for Data Governance
1. Graph Traversal
2. Graph Analytics
3. Data Integration
4. Data Aggregation
5. Information Insights
6. Lineage
[Figure annotations]
– Traverse across connected graphs (KG1, KG2, KG3). Node labels in the figure: Drug, Target, Pathway, Disease, Protein, Gene. Edge labels: hasTarget, hasPathway, improvedBy, includes, codedBy.
– Graph algorithms: Statistics, Centrality, Shortest Path, …
– Query brokering across disparate systems by mappings to a unified model
– 360 Degree View across composable graphs
– Find things that share common attributes or relationships
– Inferred dependencies across composable graphs and domains: enterprise, technical, data, governance
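Traversal across connected graphs can be sketched with plain (subject, predicate, object) triples and a breadth-first search. The node and edge labels come from the slide's figure, but which nodes each edge connects, and how the labels are split across KG1, KG2 and KG3, are assumptions made for this example.

```python
from collections import deque

# Three small graphs as (subject, predicate, object) triples.
# Edge directions and the split across graphs are assumptions.
KG1 = [("Drug", "hasTarget", "Target")]
KG2 = [("Target", "hasPathway", "Pathway"), ("Pathway", "includes", "Protein")]
KG3 = [("Protein", "codedBy", "Gene"), ("Disease", "improvedBy", "Drug")]


def find_path(triples, start, goal):
    """Breadth-first search over outgoing edges.

    Returns the list of triples linking start to goal, or None if the
    goal cannot be reached.
    """
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for p, o in outgoing.get(node, []):
            if o not in visited:
                visited.add(o)
                queue.append((o, path + [(node, p, o)]))
    return None


# Because all three graphs share node identifiers, connecting them is
# just a matter of combining their triples.
connected = KG1 + KG2 + KG3
path = find_path(connected, "Disease", "Gene")
```

The path from Disease to Gene exists only in the combined graph; no single graph contains it, which is the practical meaning of "composable" here.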
Graph of Facts about a Table
Facts
[Figure: facts shown as SUBJECT / PREDICATE / OBJECT statements; recoverable predicates: rdf:type, edg:columnOf]
Model
Graph Describing What Kind of Information We Want to Capture about a Table and What it Means
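A minimal sketch of facts and model held in the same form, as subject/predicate/object statements, follows. Only `rdf:type` and `edg:columnOf` appear on the slide. The class names `edg:DatabaseTable` and `edg:DatabaseColumn`, the `ex:` identifiers, and the use of `rdfs:domain` / `rdfs:range` to describe `edg:columnOf` are assumptions for illustration. The inference step is a simplified, one-pass version of the standard RDFS domain and range rules, not a full reasoner.

```python
# Model: what we want to capture about tables and columns (assumed names).
MODEL = [
    ("edg:columnOf", "rdfs:domain", "edg:DatabaseColumn"),
    ("edg:columnOf", "rdfs:range", "edg:DatabaseTable"),
]

# Facts about one hypothetical table and its columns.
FACTS = [
    ("ex:CUSTOMER", "rdf:type", "edg:DatabaseTable"),
    ("ex:CUSTOMER.ID", "rdf:type", "edg:DatabaseColumn"),
    ("ex:CUSTOMER.ID", "edg:columnOf", "ex:CUSTOMER"),
    ("ex:CUSTOMER.NAME", "edg:columnOf", "ex:CUSTOMER"),  # type not stated
]

# Model and facts live in one graph and are queried the same way.
graph = set(MODEL) | set(FACTS)


def match(graph, s=None, p=None, o=None):
    """Return sorted triples matching the pattern; None is a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def infer_types(graph):
    """Derive rdf:type facts from rdfs:domain / rdfs:range (one pass)."""
    inferred = set()
    for prop, _, cls in match(graph, p="rdfs:domain"):
        for s, _, _ in match(graph, p=prop):
            inferred.add((s, "rdf:type", cls))
    for prop, _, cls in match(graph, p="rdfs:range"):
        for _, _, o in match(graph, p=prop):
            inferred.add((o, "rdf:type", cls))
    return inferred - graph


columns = [s for s, _, _ in match(graph, p="edg:columnOf", o="ex:CUSTOMER")]
new_facts = infer_types(graph)
```

Because the model is itself a set of statements in the graph, the same `match` function answers both "what columns does this table have?" and "what does `edg:columnOf` mean?".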
Why Knowledge Graphs for Data Governance
▪ Knowledge Graphs provide an open, extensible and “smart” means to represent diverse asset types
▪ Using knowledge graphs yields an interconnected set of information that meaningfully bridges enterprise metadata silos
Personally Identifiable Information (PII)
GDPR example of a Knowledge Graph
Real Life Customer Story: A Large Financial Services Organization
▪ Mature data governance program
▪ Logical and physical models in Erwin and other tools
▪ Databases, datasets, data lake
▪ Requirement and regulation documents in various formats
▪ Data transformation scripts in Informatica, AbInitio, stored procedures in databases, custom scripts
▪ Business glossary in Collibra, user groups download and work in Excel
▪ Various other artifacts e.g., lineage in spreadsheets and other formats, custom built applications catalog
Creating a Knowledge Graph
[Figure labels: Sources, ETL, Data Governance Knowledge Graph]
Goal: Enable Foundational Transformation
Data Governance Knowledge Graph for Enterprise
Operational Efficiencies
Compliance Adherence
Analytical Insight
Who provides this data and who is using it?
What data is required for this function?
When does this data need to be utilized?
Where should it go?
What form should it take?
Results
▪ Ingested
– Over 500 logical models with over 50K entities
– Similar number of physical models, with over 50K tables, all connected
– Over 1M data elements
– Close to 1,000 applications
– Many ETL processes
▪ Established
– Ongoing ingests
– 360 degree reporting and metrics from many different perspectives: business area, owner, mapping to PII, applications, business process, and more
– Managed approach to archiving operational data into a data lake
Surprises (aka Shocks!)
▪ Number of distinct entities and data elements
▪ Definition coverage for data elements – under 8%
▪ Definition coverage for entities/tables – under 2%
▪ Firm has a long standing policy that all entities must be defined
– Not possible to effectively operationalize the policy, or even know if it is enforced, when there is no connected view
▪ Some data areas/systems are much better covered with definitions than others
– The range is between 0 and 70% for data elements
– Consistently, coverage of entities is significantly lower than elements
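Definition coverage of the kind reported above is straightforward to compute once assets from all systems sit in one connected view. The sketch below assumes each asset is a record of (system, kind, name, definition); the sample records and system names are hypothetical and do not reflect the customer's data.

```python
from collections import defaultdict

# Hypothetical asset records: (system, kind, name, definition or None).
assets = [
    ("crm", "element", "customer_id", "Unique identifier of a customer"),
    ("crm", "element", "first_name", "Customer's given name"),
    ("crm", "element", "seg_cd", None),
    ("crm", "element", "flag_7", ""),
    ("crm", "entity", "Customer", None),
    ("billing", "element", "inv_amt", None),
    ("billing", "element", "inv_dt", "   "),
]


def definition_coverage(assets):
    """Percentage of assets with a non-blank definition per (system, kind)."""
    totals = defaultdict(int)
    defined = defaultdict(int)
    for system, kind, _name, definition in assets:
        key = (system, kind)
        totals[key] += 1
        # Empty or whitespace-only definitions count as missing.
        if definition and definition.strip():
            defined[key] += 1
    return {key: 100.0 * defined[key] / totals[key] for key in totals}


coverage = definition_coverage(assets)
```

Grouping by other attributes, such as business area or owner, only requires changing the key, which is what makes this kind of reporting possible from many perspectives.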
Data Governance Triangle
Assets Example for Medical Enterprise
[Figure: asset collections and models for a hospital example. Labels recoverable from the figure:]
– Collection types: Governance Assets, Lineage Asset Collections, Technical Asset Collections, Data Asset Collections, Enterprise Asset Collections, Other Collections (Reference Datasets, Glossaries, Documents)
– Example collections: Lineage for Patient Discharge, Lineage for Patient Invoice, Logical Flows of Data Exchanges, Hospital Facilities and Staff, Hospital Information Systems Catalog, Hospital Information Assets, Hospital Activities and Processes, Healthcare Glossary, Logical Model for Hospital Information, Open MRS Data Assets, Catalog of GNU Health Database, Medical Information Public Use Datasets (PUF), RXNorm, ICD10, Form Templates, Corpus
– Governance assets: Governance Areas, Metrics, Governance Roles and Organization, Workflow Templates, Dashboards, Users, Issues, Policies
– Ontologies: Lineage Model, Enterprise Assets Model, Technical Assets Model, Data Assets Model, Glossary, Reference Data and other Models, Governance Model
– Pre-built Assets
Key Take Away Points
▪ The goal of data governance is to make it possible to answer important questions for a range of data stakeholders
▪ Many different artifacts get produced and used in managing information
▪ Each contributes its own part of the picture
▪ Answering most of the key questions requires bringing the whole picture together
▪ Knowledge graphs offer a powerful and adaptive approach for doing this
Benefits of a Knowledge Graph based Platform for Data Governance 2.0
As an enterprise knowledge graph infrastructure, TopBraid EDG supports Data Governance 2.0 and applications of AI / ML
TopBraid Enterprise Data Governance (EDG):
▪ Is flexible and extensible, based on standards
▪ Integrates reasoning and machine learning
▪ Enables people (UI) and software (APIs/web services) to view, follow and query
▪ Bridges data and metadata “silos” for seamless data governance
▪ Delivers knowledge-driven data governance
… Questions?
Thank You !
To Learn More about TopBraid EDG and Knowledge Graphs:
EDG Product Info:
▪ TopBraid Enterprise Data Governance (TopBraid EDG) (https://www.topquadrant.com/products/topbraid-enterprise-data-governance/)
Contact us at [email protected] to:
▪ Discuss data governance and knowledge graphs
▪ Request a more targeted demo of TopBraid EDG
▪ Ask for a free EDG evaluation account
Latest Whitepaper
Download a copy on our website at: topquadrant.com/knowledge-assets/whitepapers/
More Resources ...
More Webinar Recordings, Slides, Q&A:
▪ https://www.topquadrant.com/knowledge-assets/topquadrant-webinars/
Short Videos:
▪ TopBraid EDG “Quick Grok” Videos https://www.topquadrant.com/knowledge-assets/videos/
▪ TopBraid EDG Animated Video https://www.topquadrant.com/project/edg_agile_modular/
Blog:
▪ https://www.topquadrant.com/the-semantic-ecosystems-journal/
Data Governance White Papers:
▪ https://www.topquadrant.com/knowledge-assets/whitepapers/