Integrating Relational Database Schemas using a Standardized Dictionary.


Ramon Lawrence, University of Manitoba (umlawren@cs.umanitoba.ca)
Ken Barker, University of Calgary

Outline

• Introduction, Motivation, and Background
• The integration architecture
    • Standard dictionary, X-Specs, query processor
• Example integration
    • Northwind, Southstorm databases
• Querying the integrated databases
    • Generating SQL queries from semantic queries
• Unity implementation
• Contributions, Conclusions, and Future Work

What is Integration?

Two levels of integration:
• Schema integration - the description of the data
• Data integration - the individual data instances

Integration problems include:
• Different data models and conflicts within a model
• Incompatible concept representations
• Different user or view perspectives
• Naming conflicts (homonym, synonym)

Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).

Why is Integration Required?

There are many integration environments:
• Operational systems within an organization
• System integration during company merger
• Data warehouses, Intranets, and the WWW

Users require information from many data sources which often do not work together.

Companies require a global view of their entire operations, which may be spread across numerous operational databases for different departments and distributed geographically.

E-commerce demands integration of web databases with production systems.

Previous Work

Research systems:
• integrating systems by logical rules (Sheth)
• defining global dictionaries (Castano)
• Carnot Project using the Cyc knowledge base
• wrapper and mediator systems: Information Manifold, TSIMMIS, Infomaster

Industrial systems and standards:
• Metadata Interchange Specification (MDIS)
• XML, BizTalk, E-commerce portals

Query languages:
• SQL, MSQL, IDL, DIRECT, SchemaSQL

Previous Work Summary

Current techniques for database integration have some of these problems:
• Require integrator to understand all databases
• Integration process is manual
• Do not hide system complexity from the user
• Force changes on the existing database systems
• Construct global view manually
• Suffer from query imprecision (query containment)

Our Approach

Our approach combines standardization and query mapping algorithms.

The major idea is that schema conflicts can be resolved if we:
• Eliminate all naming conflicts
• Define a language capable of determining schema equivalence and performing transformations

Naming conflicts are eliminated by accepting a standard term dictionary.
• Not a knowledge base or set of mediated views
• Leverages semantic information in English words

Integration Architecture

Architecture Components:
1) Integrated Context View
    • user’s view of integration
2) X-Spec Editor
    • stores schema & metadata
    • uses XML
3) Standard Dictionary
    • terms to express semantics
4) Integration Algorithm
    • combines X-Specs into integrated context view
5) Query Processor
    • accepts query on view
    • determines data source mappings and joins
    • executes queries and formats results

[Figure: integration architecture diagram. Labels: Client, Multidatabase Layer, X-Spec Editor, Standard Dictionary, Integration Algorithm, Integrated Context View, Query Processor and ODBC Manager, Subtransactions, Local Transactions, Database, X-Spec.]

Architecture Components

The architecture consists of four components:
• A standard dictionary (SD) to capture data semantics
    • SD terms are used to build semantic names describing semantics of schema elements.
• X-Specs for storing data semantics
    • Database metadata and semantic names stored using XML
• Integration Algorithm
    • Matches concepts in different databases by semantic names.
    • Produces an integrated view of all database concepts.
• Query Processor
    • Allows the user to formulate queries on the view.
    • Translates from semantic names in the integrated view to SQL queries, then integrates and formats the results.
    • Involves determining correct field and table mappings and discovery of join conditions and join paths

Integration Processes

The integration architecture consists of three separate processes:

• Capture process: independently extracts database schema information and metadata into an XML document called an X-Spec.

• Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view.

• Query process: allows the user to formulate queries on the integrated view. The query processor maps these to structural queries (SQL), then integrates and formats the results.

Integration Architecture: The Capture Process

[Figure: capture process diagram. Labels: Relational Schema, Automatic Extraction, Specification Editor, DBA, Lookup of terms, Standard Dictionary, X-Spec.]

Capture process involves:
• Automatically extracting the schema information and metadata using a specification editor
• Assigning semantic names to each schema element (tables and fields) to capture their semantics

Architecture Components: The Standard Dictionary

A standard dictionary (SD) provides standardized terms to capture data semantics.
• Hierarchy of terms related by IS-A or Has-A links
• Contains base set of common database concepts, but new concepts can be added

An SD term is a single, unambiguous semantic definition.

Several SD entries for a single English word are required if the word has multiple definitions.

The top-level dictionary terms are those proposed by Sowa.
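
One way to picture the dictionary described above is as a tree whose nodes are terms and whose edges are IS-A or Has-A links. The sketch below is only an illustration under that assumption: the class names, the root term "Thing", and the sample entries are hypothetical and are not taken from Unity or from Sowa's top-level terms.

```python
from typing import Dict, List, Optional


class Term:
    """One unambiguous dictionary entry (a single sense of a word)."""

    def __init__(self, name: str, definition: str,
                 parent: Optional["Term"] = None, link: Optional[str] = None):
        self.name = name
        self.definition = definition
        self.parent = parent      # None only for the root term
        self.link = link          # "IS-A" or "HAS-A" link to the parent
        self.children: List["Term"] = []


class StandardDictionary:
    """Tree of terms; every term except the root has exactly one parent."""

    def __init__(self, root_name: str, root_definition: str):
        self.root = Term(root_name, root_definition)
        self.terms: Dict[str, Term] = {root_name: self.root}

    def add(self, name: str, definition: str, parent: str, link: str) -> Term:
        # A word with several senses needs several entries with distinct names.
        if name in self.terms:
            raise ValueError(f"term already defined: {name}")
        if link not in ("IS-A", "HAS-A"):
            raise ValueError(f"unknown link type: {link}")
        parent_term = self.terms[parent]
        term = Term(name, definition, parent_term, link)
        parent_term.children.append(term)
        self.terms[name] = term
        return term

    def is_a(self, name: str, ancestor: str) -> bool:
        """True if `ancestor` is reachable from `name` through IS-A links only."""
        term = self.terms[name]
        while term.parent is not None and term.link == "IS-A":
            term = term.parent
            if term.name == ancestor:
                return True
        return False


# Hypothetical sample entries, for illustration only.
sd = StandardDictionary("Thing", "hypothetical root term")
sd.add("Name", "a word by which something is known", "Thing", "IS-A")
sd.add("First Name", "a person's given name", "Name", "IS-A")
sd.add("Order", "a request to purchase goods", "Thing", "IS-A")
sd.add("Order Date", "the date an order was placed", "Order", "HAS-A")
```

Keeping a single parent per term is what makes the structure a tree, so finding related concepts is a walk towards the root.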

Architecture Components: Dictionary vs. Knowledge Base

The standard dictionary differs from a knowledge base such as Cyc because:

• Not intended to be a general English dictionary or contain knowledge facts about the world
    • Dictionary is evolved as new terms are required
    • Not all English words are used
• Dictionary provides the system with no “knowledge”
    • Since no facts are stored, system cannot deduce new facts
    • Dictionary terms are just semantic place holders; integrators, not the system, determine the semantics of the database
• Simplified organization
    • Dictionary is organized as a tree for efficiency and simplicity in determining related concepts
• Re-use of terms
    • Terms are re-used in semantic names

Architecture Components: Using the Standard Dictionary

SD terms are used to build semantic names describing semantics of schema elements.

Semantic names have the form:

    semantic name := [CT_Type] | [CT_Type] CN
    CT_Type := CT | CT {; CT} | CT {, CT}
    CT := context term, CN := concept name
    each CT and CN is a single term from the SD

Semantic names are included in specifications describing a database.
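
To make the grammar above concrete, here is a minimal parsing sketch. The `SemanticName` type and `parse_semantic_name` function are hypothetical names introduced for illustration. The sketch also assumes that a context uses either ";" or "," as its separator but never both, which is one reading of the CT_Type rule.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SemanticName(NamedTuple):
    context: Tuple[str, ...]    # context terms (CT), in the order written
    separator: Optional[str]    # ";" or "," when several context terms are given
    concept: Optional[str]      # concept name (CN); None for a pure context


# "[...]" followed by an optional concept name.
_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text: str) -> SemanticName:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    inner, rest = match.group(1), match.group(2).strip()
    has_semicolon, has_comma = ";" in inner, "," in inner
    if has_semicolon and has_comma:
        # Assumption: the two separators are alternatives and are not mixed.
        raise ValueError("context mixes ';' and ',' separators")
    separator = ";" if has_semicolon else "," if has_comma else None
    parts = inner.split(separator) if separator else [inner]
    terms = tuple(part.strip() for part in parts)
    if any(not term for term in terms):
        raise ValueError(f"empty context term in: {text!r}")
    return SemanticName(terms, separator, rest or None)
```

For example, `parse_semantic_name("[Order;Customer] Id")` yields the context terms `("Order", "Customer")` and the concept name `"Id"`, while `parse_semantic_name("[Category]")` has no concept name.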

Northwind & Southstorm Integration Example

Northwind Database Schema

Table          Fields
Categories     CategoryID, CategoryName
Customers      CustomerID, CompanyName
Employees      EmployeeID, LastName, FirstName
OrderDetails   OrderID, ProductID, UnitPrice, Quantity
Orders         OrderID, CustomerID, EmployeeID, OrderDate, Shipvia
Products       ProductID, ProductName, SupplierID, CategoryID
Shippers       ShipperID, CompanyName
Suppliers      SupplierID, CompanyName

Northwind & Southstorm Integration Example (2)

Southstorm Database Schema

Table          Fields
Orders_tb      Order_num, Cust_name, Emp_name, Item1_id, Item1_qty, Item1_price, Item2_id, Item2_qty, Item2_price

Integration Example (3)

Northwind Semantic Name Mappings (Type: T = table, F = field)

Type  Semantic Name              System Name
T     [Category]                 Categories
F     [Category] Id              CategoryID
F     [Category] Name            CategoryName
T     [Customer]                 Customers
F     [Customer] Id              CustomerID
F     [Customer] Name            CompanyName
T     [Employee]                 Employees
F     [Employee] Id              EmployeeID
F     [Employee] Last Name       LastName
F     [Employee] First Name      FirstName
T     [Order;Product]            OrderDetails
F     [Order] Id                 OrderID
F     [Order;Product] Id         ProductID
F     [Order;Product] Price      UnitPrice
F     [Order;Product] Quantity   Quantity
T     [Order]                    Orders
F     [Order] Id                 OrderID
F     [Order;Customer] Id        CustomerID
F     [Order;Employee] Id        EmployeeID
F     [Order] Date               OrderDate
F     [Order;Shipper] Id         Shipvia
T     [Product]                  Products
F     [Product] Id               ProductID
F     [Product] Name             ProductName
F     [Product;Supplier] Id      SupplierID
F     [Product;Category] Id      CategoryID
T     [Shipper]                  Shippers
F     [Shipper] Id               ShipperID
F     [Shipper] Name             ShipperName
T     [Supplier]                 Suppliers
F     [Supplier] Id              SupplierID
F     [Supplier] Name            SupplierName

Northwind & Southstorm Integration Example (4)

Southstorm Semantic Name Mappings

Type   Semantic Name              System Name
Table  [Order]                    Orders_tb
Field  [Order] Id                 Order_num
Field  [Order;Customer] Name      Cust_name
Field  [Order;Employee] Name      Emp_name
Field  [Order;Product] Id         Item1_id
Field  [Order;Product] Quantity   Item1_qty
Field  [Order;Product] Price      Item1_price
Field  [Order;Product] Id         Item2_id
Field  [Order;Product] Quantity   Item2_qty
Field  [Order;Product] Price      Item2_price

What is a semantic name?

A semantic name is a universal, semantic identifier in a domain.
• Similar to a field name in the Universal Relation.
• Semantics are guaranteed unique by construction.
• System has a mechanism for comparing semantics across domains even though it does not understand them. (Exploiting semantics in English words.)

Important definitions:
• context - a semantic name is a context if it maps to a table
• concept - a semantic name is a concept if it maps to a field
• context closure - the context closure of a semantic name Si, denoted Si*, is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2, …, TN} starting with T1.
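
The context closure can be sketched in a few lines. This sketch assumes that "ordered subsets starting with T1" means the successive prefixes T1, T1;T2, and so on, which is consistent with the nesting shown in the integrated context view. The function name `context_closure` is hypothetical.

```python
from typing import List, Optional, Sequence


def context_closure(context: Sequence[str], concept: Optional[str] = None) -> List[str]:
    """Return the closure of a semantic name as a list, shortest name first.

    Assumption: the closure consists of every prefix of the context terms,
    followed by the full semantic name itself when a concept name is given.
    """
    if not context:
        raise ValueError("a semantic name needs at least one context term")
    closure = ["[" + ";".join(context[:i]) + "]" for i in range(1, len(context) + 1)]
    if concept is not None:
        closure.append(f"{closure[-1]} {concept}")
    return closure
```

Under this reading, the closure of `[Order;Product] Id` is `[Order]`, `[Order;Product]`, and `[Order;Product] Id`.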

Architecture Components: X-Specs

Database metadata and semantic names are combined into specifications called X-Specs:
• Stored and transmitted using XML
• Contains information on a relational schema
• Organized into database, table, and field levels
• Stores semantic names to describe and integrate schema elements

Southstorm X-Spec

<?xml version="1.0" ?>
<Schema name="Southstorm_xspec.xml"
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <ElementType name="[Order]" sys_name="Orders_tb" sys_type="Table">
    <element type="[Order] Id" sys_name="Order_num" sys_type="Field"/>
    <element type="[Order] Total Amount" sys_name="Order_total" sys_type="Field"/>
    <element type="[Order;Customer] Name" sys_name="Cust_name" sys_type="Field"/>
    <element type="[Order;Customer;Address] Address Line 1" sys_name="Cust_address" sys_type="Field"/>
    <element type="[Order;Customer;Address] City" sys_name="Cust_city" sys_type="Field"/>
    <element type="[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/>
    <element type="[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item1_id" sys_type="Field"/>
    <element type="[Order;Product] Quantity" sys_name="Item1_quantity" sys_type="Field"/>
    <element type="[Order;Product] Price" sys_name="Item1_price" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item2_id" sys_type="Field"/>
    <element type="[Order;Product] Quantity" sys_name="Item2_quantity" sys_type="Field"/>
    <element type="[Order;Product] Price" sys_name="Item2_price" sys_type="Field"/>
  </ElementType>
</Schema>
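
As an illustration of how an X-Spec like the one above could be consumed, the following sketch reads a shortened copy of it with Python's standard `xml.etree.ElementTree` module and lists each semantic name with its system name. The function `read_xspec` and the output tuple layout are assumptions made for this example; they are not part of Unity.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Shortened copy of the Southstorm X-Spec, kept small for the example.
XSPEC = """<?xml version="1.0" ?>
<Schema name="Southstorm_xspec.xml"
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <ElementType name="[Order]" sys_name="Orders_tb" sys_type="Table">
    <element type="[Order] Id" sys_name="Order_num" sys_type="Field"/>
    <element type="[Order;Customer] Name" sys_name="Cust_name" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item1_id" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item2_id" sys_type="Field"/>
  </ElementType>
</Schema>"""

# The default namespace in the X-Spec applies to every element, so lookups
# must be namespace-qualified; attributes stay unqualified.
NS = {"x": "urn:schemas-microsoft-com:xml-data"}


def read_xspec(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (semantic name, system type, system name) for each schema element."""
    root = ET.fromstring(xml_text)
    mappings = []
    for table in root.findall("x:ElementType", NS):
        table_name = table.get("sys_name")
        mappings.append((table.get("name"), table.get("sys_type"), table_name))
        for field in table.findall("x:element", NS):
            qualified = f"{table_name}.{field.get('sys_name')}"
            mappings.append((field.get("type"), field.get("sys_type"), qualified))
    return mappings


for semantic_name, sys_type, sys_name in read_xspec(XSPEC):
    print(f"{sys_type:<6} {semantic_name:<24} {sys_name}")
```

Note that two fields (`Item1_id` and `Item2_id`) carry the same semantic name, which is how the X-Spec records that both store the same concept.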

Integration Product: The Integrated Context View

The product of the integration is a structurally-neutral hierarchy of concepts called an integrated context view.

Define a context view (CV) as follows:
• If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV.
• For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si.
• A semantic name Si can only occur once in the CV.

A context view (CV) is a valid Universal Relation.
• Each field is assigned a semantic name which uniquely identifies its semantic connotation.
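
The three conditions in the definition can be illustrated with a small sketch that merges (semantic name, schema element) pairs into one view. It is not the authors' integration algorithm. The names `closure` and `integrate` are hypothetical, the closure is assumed to be the set of context prefixes plus the name itself, and only ";"-separated contexts are handled.

```python
from typing import Dict, Iterable, List, Tuple


def closure(semantic_name: str) -> List[str]:
    """Context closure of a name such as '[Order;Customer] Name' (prefix reading)."""
    context, _, concept = semantic_name.partition("]")
    terms = [term.strip() for term in context.lstrip("[").split(";")]
    names = ["[" + ";".join(terms[:i]) + "]" for i in range(1, len(terms) + 1)]
    if concept.strip():
        names.append(f"{names[-1]} {concept.strip()}")
    return names


def integrate(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build a context view: semantic name -> list of mapped schema elements."""
    view: Dict[str, List[str]] = {}
    for semantic_name, element in entries:
        names = closure(semantic_name)
        for name in names:
            # Condition 1: every member of the closure is in the view.
            # Condition 3: a dict key can occur only once.
            view.setdefault(name, [])
        # Condition 2: each name carries zero or more mappings.
        view[names[-1]].append(element)
    return view


entries = [
    ("[Order]", "NW.Orders"),
    ("[Order]", "SS.Orders_tb"),
    ("[Order] Id", "NW.Orders.OrderID"),
    ("[Order] Id", "SS.Orders_tb.Order_num"),
    ("[Order;Customer] Name", "SS.Orders_tb.Cust_name"),
]
view = integrate(entries)
```

In this run `[Order;Customer]` ends up in the view with no mappings of its own, because it is only needed as the parent context of `[Order;Customer] Name`.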

Northwind & Southstorm Integration Example

Integrated Context View (NW = Northwind, SS = Southstorm)

Integrated View Term      Data Source Mappings (not visible to user)
V (view root)             N/A
- [Category]              NW.Categories
    - Id                  NW.Categories.CategoryID
    - Name                NW.Categories.CategoryName
- [Customer]              NW.Customers
    - Id                  NW.Customers.CustomerID
    - Name                NW.Customers.CompanyName
- [Employee]              NW.Employees
    - Id                  NW.Employees.EmployeeID
    - [Name]
        - First Name      NW.Employees.FirstName
        - Last Name       NW.Employees.LastName
- [Product]               NW.Products
    - Id                  NW.Products.ProductID
    - Name                NW.Products.ProductName
    - [Supplier]
        - Id              NW.Products.SupplierID
    - [Category]
        - Id              NW.Products.CategoryID
- [Order]                 NW.Orders, SS.Orders_tb
    - Id                  NW.[Orders,OrderDetails].OrderID, SS.Orders_tb.Order_num
    - [Customer]
        - Id              NW.Orders.CustomerID
        - Name            SS.Orders_tb.Cust_name
    - [Employee]
        - Id              NW.Orders.EmployeeID
        - Name            SS.Orders_tb.Emp_name
    - [Product]           NW.OrderDetails
        - Id              NW.OrderDetails.ProductID, SS.Orders_tb.Item[1,2]_id
        - Price           NW.OrderDetails.UnitPrice, SS.Orders_tb.Item[1,2]_price
        - Quantity        NW.OrderDetails.Quantity, SS.Orders_tb.Item[1,2]_qty
- [Shipper]               NW.Shippers
    - Id                  NW.Shippers.ShipperID
    - Name                NW.Shippers.ShipperName
- [Supplier]              NW.Suppliers
    - Id                  NW.Suppliers.SupplierID
    - Name                NW.Suppliers.SupplierName

Architecture Components: The Query Processor

The query processor:
• Allows the user to formulate queries on the view.
• Translates from semantic names in the context view to structural queries (SQL) on databases.
    • Involves determining correct field and table mappings and discovery of join conditions and join paths
• Retrieves query results and formats them for display to the user.

Client-side query processing:
• Perform joins between databases using common keys.
• Data value formatting and transformation
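
The translation step can be sketched for a single database as follows. This is a deliberately simplified illustration, not the Unity query processor: the `FIELDS` and `JOINS` tables are filled in by hand from the Northwind mappings, whereas the real system discovers mappings and join paths from the X-Specs. The function name `build_sql` is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Hand-written assumptions for this example: semantic name -> (table, field).
FIELDS: Dict[str, Tuple[str, str]] = {
    "[Order] Id": ("Orders", "OrderID"),
    "[Customer] Name": ("Customers", "CompanyName"),
}

# Hand-written join conditions between pairs of tables.
JOINS: Dict[FrozenSet[str], str] = {
    frozenset(("Orders", "Customers")): "Orders.CustomerID = Customers.CustomerID",
}


def build_sql(semantic_names: List[str]) -> str:
    """Map semantic names to columns and add any known join conditions."""
    columns = [FIELDS[name] for name in semantic_names]
    tables: List[str] = []
    for table, _ in columns:
        if table not in tables:
            tables.append(table)
    conditions = [
        JOINS[frozenset(pair)]
        for pair in combinations(tables, 2)
        if frozenset(pair) in JOINS
    ]
    sql = "SELECT " + ", ".join(f"{t}.{f}" for t, f in columns)
    sql += " FROM " + ", ".join(tables)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


print(build_sql(["[Order] Id", "[Customer] Name"]))
```

The user names only concepts; the table names and the join condition are filled in from the lookup tables, which mirrors the idea of querying by concept instead of structure.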

Advanced Query Processing

Advanced query processor features include:

• global keys and joins - a mechanism for specifying when a field stores a global key such as a social security number.

• result normalization - a procedure for normalizing query results returned from each individual database. (e.g. Southstorm)

• data integration - transforming data representational conflicts at the global level.
    • For example, “M” and “F” may represent “Male” and “Female” in one database, and another may represent these concepts using “0” and “1”.
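
The data integration example can be pictured as a per-source lookup table that maps local codes onto one global representation. This is a sketch only: the source labels `db_a` and `db_b` are hypothetical, and the pairing of "0" with Male and "1" with Female is an assumption, since the slide does not say which code stands for which value.

```python
from typing import Dict

# Assumed local encodings; "0" -> Male and "1" -> Female is a guess for illustration.
GENDER_CODES: Dict[str, Dict[str, str]] = {
    "db_a": {"M": "Male", "F": "Female"},
    "db_b": {"0": "Male", "1": "Female"},
}


def to_global(source: str, value) -> str:
    """Translate a locally encoded value into the global representation."""
    # str() lets numeric codes such as 1 match the "1" key.
    return GENDER_CODES[source][str(value)]
```

With these tables, `to_global("db_a", "F")` and `to_global("db_b", 1)` both produce `"Female"`, so results from the two databases can be merged on a common representation.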

Northwind & Southstorm Query Examples

Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name):
    SS: SELECT Order_num, Cust_name FROM Orders_tb
    NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID

Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids.
    SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb
    NW: SELECT OrderID, ProductID FROM OrderDetails
    Note: In NW, the order id has two different mappings to select from (Orders and OrderDetails). In SS, result normalization is required.
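
Result normalization for Example 2 can be sketched with an in-memory SQLite table standing in for Southstorm. The sample rows are invented for the demonstration, the `normalize` function is a hypothetical name, and skipping NULL item slots is an assumption about how empty slots should be treated.

```python
import sqlite3
from typing import Iterable, List, Tuple


def normalize(rows: Iterable[tuple]) -> List[Tuple[int, int]]:
    """Turn (order, item1, item2, ...) rows into one (order, item) row per item."""
    normalized = []
    for order_num, *item_ids in rows:
        for item_id in item_ids:
            if item_id is not None:  # assumption: NULL means the slot is unused
                normalized.append((order_num, item_id))
    return normalized


# Invented sample data in a cut-down Orders_tb.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders_tb (Order_num INTEGER, Item1_id INTEGER, Item2_id INTEGER)"
)
conn.executemany(
    "INSERT INTO Orders_tb VALUES (?, ?, ?)",
    [(1, 10, 20), (2, 30, None)],
)

ss_rows = conn.execute(
    "SELECT Order_num, Item1_id, Item2_id FROM Orders_tb ORDER BY Order_num"
).fetchall()
ss_normalized = normalize(ss_rows)
conn.close()
```

After normalization the Southstorm result has the same two-column shape (order id, product id) as the Northwind query on OrderDetails, so the two result sets can be combined.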

Integration Example: Discussion

Important points:
• System table and field names are not presented to the user, who queries based on semantic names.
• Database structure is not shown to the user.
• Field and table mappings are automatically determined based on X-Spec information.
• Join conditions are inserted as needed when available to join tables.
• Different physical representations for the same concept are combined.
• Hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary.

Unity Overview

Unity is a software package that implements the integration architecture with a GUI.

Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC).

Unity allows the user to:
• Construct and modify standard dictionaries
• Build X-Specs to describe data sources
• Integrate X-Specs into an integrated view
• Transparently query integrated systems using ODBC and automatically generate SQL transactions

Unity is available for demonstration and distribution.

Contributions

Architecture contributions:
• A unique application of a standard dictionary which is not a knowledge base
• Separates the capture and integration processes
• Allows transparent querying without structure
• Provides algorithms for dynamically extracting database data (creating relevant views)
• Algorithms for mediation of global level conflicts (global keys, normalization, etc.)
• Arguably simpler method for capturing data semantics than using description logic
• An implementation, Unity, which demonstrates the practical benefits of the architecture

Conclusions & Future Work

• Automatic database integration is possible by using a standard term dictionary and defining semantic names for schema elements.
• Users are able to transparently query integrated systems by concept instead of structure.

We are constantly refining Unity.
• Develop an integration component for a web browser
• Test the system in large industrial projects.
• Allow distributed updates and global updates on all databases.

References

Publications:
• Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January 2000.
• Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000.
• Integrating Relational Database Schemas using a Standardized Dictionary, to appear in SAC’2001 - ACM Symposium on Applied Computing, March 2001.

Sponsors: NSERC, TRLabs

Further Information: http://www.cs.umanitoba.ca/~umlawren/