Capturing Semantics in XML Documents

Tok Wang Ling
Department of Computer Science, National University of Singapore
KDXD 2006, Singapore, April 9, 2006


Roadmap

1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [4]
3. The applications of ORA-SS
4. Discovering Semantics in XML documents
5. Conclusion

[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.


1. XML – Brief introduction

• XML (eXtensible Markup Language) is
  – Released by W3C
  – An application of SGML
  – A promising standard for publishing, integrating and exchanging data on the web
• XML schema
  – DTD (Document Type Definition) [3]
  – XSD (XML Schema Definition), W3C recommended standard [6, 7, 8]

[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

1. XML – A motivating example

• Suppose we have an XML document “psj.xml” about different parts, suppliers and projects, where
  – The document has a root element psj;
  – Under psj, there is a sequence of part elements;
  – Under part, there is a sequence of supplier elements;
  – Under supplier, there is a sequence of project elements.

Example 1. psj.xml

<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
  <part>
    <pno>P001</pno> <pname>Nut</pname> <color>Silver</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>60</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>650</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S002</sno> <sname>Beta</sname> <city>Atlanta</city> <city>New York</city> <price>5.5</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>70</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
  …
  <part>
    <pno>P002</pno> <pname>Nut</pname> <color>Copper</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>4.6</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>60</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S003</sno> <sname>Beta</sname> <city>New York</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>20</qty>
      </project>
      <project>
        <jno>J004</jno> <jname>Blue fireworks</jname> <budget>20000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
</psj>

1. XML – the DTD of the “psj.xml”

(a) “psj.dtd”, the DTD of the “psj.xml”

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

(b) psj.dtd in Data Guide

▼♦ psj
  ▼♦ part
      ♦ pno
      ♦ pname
      ♦ color
    ▼♦ supplier
        ♦ sno
        ♦ sname
        ♦ city
        ♦ price
      ▼♦ project
          ♦ jno
          ♦ jname
          ♦ budget
          ♦ qty

1. XML – what the DTD says

• DTD is a simple definition of an XML document, where users can define
  – Element/Attribute types
  – Occurrence constraints (e.g. ?, +, *)
  – Containment among different element types (the structure)
• DTD cannot express
  – Occurrence constraints in numbers (e.g. 2 to 8)
  – Uniqueness/Key constraints on a combination of attributes/elements (in DTD, ID can only be assigned to one attribute at a time.)
  – Relationship types among elements and their degrees
  – The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type.

Simple elements are those element types with PCDATA only, without any attribute types.

1. XML – XSD

“psj.xsd”, the XSD schema of the motivating example data.

<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>

• maxOccurs="unbounded": XSD definition of an element occurrence constraint.
• xs:key: XSD definition of a key constraint, which requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.

1. XML – what XSD can tell

• XSD is the standard of XML schema definition, recommended by W3C and supported by most vendors, which
  – has extensible XML syntax,
  – supports more data types (user-defined types and 44 built-in types),
  – is able to represent uniqueness/key for both attribute types and element types,
  – and has many other improvements in comparison with DTD.

1. XML – XSD still has flaws

XSD is not sufficient in expressing the relational semantics in XML data, such as:

1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD. It is totally different from the key constraint in relational databases.
  – E.g. In the previous XSD, the values of the key attribute, pno of part, should be unique within the set of the part elements in the whole document.
  – Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively.

1. XML – XSD still has flaws (cont.)

- The key element must contain the following (in order):
  a) One and only one selector element
     - contains an XPath expression that specifies the set of elements across which the values specified by the field must be unique
  b) One or more field elements
     - contain XPath expressions that specify the values that must be unique for the set of elements specified by the selector element.
- The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.
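The scope problem of the key constraint can be illustrated with a minimal Python sketch. It is not a schema validator: the hypothetical helper `duplicate_field_values` only mimics the uniqueness check of a key whose selector picks a set of elements and whose field picks one value per element. The XML string is an abridged copy of psj.xml, an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abridged copy of psj.xml: only pno and sno are kept.
PSJ_XML = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S002</sno></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S003</sno></supplier>
  </part>
</psj>
"""


def duplicate_field_values(root, selector, field):
    """Return the field values that occur more than once among the selected elements."""
    counts = Counter(elem.findtext(field) for elem in root.findall(selector))
    return sorted(value for value, n in counts.items() if n > 1)


root = ET.fromstring(PSJ_XML)
# Selector "part", field "pno": the key declared in psj.xsd.
part_dups = duplicate_field_values(root, "part", "pno")
# Selector "part/supplier", field "sno": a document-wide key on sno.
supplier_dups = duplicate_field_values(root, "part/supplier", "sno")
print(part_dups, supplier_dups)
```

In this sketch pno is unique among the part elements, while sno repeats because the same supplier is nested under two parts, so a document-wide key on sno would be violated even though sno identifies a supplier.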

1. XML – XSD still has flaws (Cont.)

2. XSD does not support relationship types and other relational semantic constraints.
  – E.g. The ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
  – E.g. Price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.

Reconsider the semantics in Example 1.

• The XML data in Example 1 (psj.xml) is a typical data-centric XML document, derived from structured data usually stored in relational or object-relational databases.
• The semantics of the data in Example 1 can be described in the ER diagram as follows.

The ER diagram of the data in Example 1.

[Figure: ER diagram. Entity types part (pno, pname, color), supplier (sno, sname, city) and project (jno, jname, budget). PS is a relationship type between part and supplier with attribute price; PSJ is a relationship type among part, supplier and project with attribute qty. Participations are marked n.]

One of the object-relational database representations of psj.xml

part
pno   pname  color
P001  Nut    Silver
P002  Nut    Copper

supplier
sno   sname  city+
S001  Alfa   Atlanta
S002  Beta   {Atlanta, New York}
S003  Gama   New York

project
jno   jname              budget
J001  Rocket boots       20000
J002  Diving helm        18000
J003  Firework launcher  250000
J004  Blue fireworks     20000

PS
pno   sno   price
P001  S001  5
P001  S002  5.5
P002  S001  4.6
P002  S003  5

PSJ
pno   sno   jno   qty
P001  S001  J001  60
P001  S001  J003  650
P001  S002  J002  70
P001  S002  J003  50
P002  S001  J002  60
P002  S003  J001  20
P002  S003  J004  50

There are 5 tables in the relational schema:

part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)


2. ORA-SS in a nutshell

• ORA-SS is a semantically rich data model for semi-structured data.
• It can easily represent the relational semantics and constraints in XML data.
• The ORA-SS model is also a bridge that connects the tree structure of XML and the semantics in relational and object-relational databases.
• In comparison with the traditional ER diagram, the ORA-SS schema diagram represents the hierarchical structure of XML data.

2. ORA-SS in a nutshell

• A complete ORA-SS model has 4 diagrams
  – Schema diagram
    • Represents the structure and constraints (business rules) on XML documents
  – Instance diagram
    • Visually represents the graphical structure of XML data
  – Functional dependency diagram
    • Represents FDs in relationship types
  – Inheritance diagram
    • Represents the specialization/generalization relationships among different object classes in ORA-SS

2. ORA-SS data models

• Object class
  – attributes of object class
  – ordering on object class
• Relationship Type
  – degree of relationship type
  – participating object classes in relationship type
  – attributes of relationship type
  – disjunctive relationship type
  – recursive relationship type
  – ID dependent relationship type

2. ORA-SS data models (Cont.)

• Attribute
  – attributes of object class or relationship type
  – key attribute (OID)
  – foreign key / referential constraint (IDREF/IDREFS)
  – composite attribute
  – disjunctive attribute
  – attribute with unknown structure
  – ordering on attributes
  – fixed or default value of attribute
  – derived attribute

The ORA-SS schema diagram of Example 1.

[Figure: ORA-SS schema diagram. part (pno, pname, color) is the parent of supplier (sno, sname, city +), which is the parent of project (jno, jname, budget). The edge from part to supplier is labelled "PS, 2, +, +" and the edge from supplier to project is labelled "PSJ, 3, +, +". The attribute price carries the label PS and the attribute qty carries the label PSJ.]

• Part, supplier and project are modeled as object classes.
• Pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
• PS is a binary relationship type between part and supplier.
• PSJ is a ternary relationship type defined among part, supplier and project.
• Price is an attribute of the relationship type PS, and qty is an attribute of PSJ.

ORA-SS – Features

• ORA-SS can represent the following semantics
  – Object ID attributes play the role of key constraints in object-relational databases, i.e. the object ID attributes functionally determine (or multi-valued determine) the object attributes of the same object class.
  – Various relationship types, including ID dependent relationship types, with their degrees and participating object classes.
  – The distinction between relationship attributes and object attributes.


3. ORA-SS applications

• Due to the rich semantics in ORA-SS, the model can be widely used in
  – Normal form XML schema
  – Relational/object-relational storage of XML data
  – XML view creation and validation [1]
  – XML schema/data integration
  – XML data query, especially with graphical user interfaces [5]
  – XML query optimization
  – etc.

[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland, Oct 7-11, 2002.
[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.

3. ORA-SS applications – Store ORA-SS in object-relational databases

• Existing storage approaches store XML in flat files (NF relations), which are long and difficult to query and update.
• Pure relational DBMS: joins take much time.
• ORA-SS reflects the nested structure of semi-structured data.
• Fewer joins are needed in nested relations.

3. ORA-SS applications – Store ORA-SS in object-relational databases (Cont.)

[Figure: the ORA-SS schema diagram of Example 1.]

Given an ORA-SS schema diagram:

• Each object class is stored as an object relation with its object ID and its object attributes. (e.g. part, supplier, project)
• Each relationship type is stored as a relationship relation with the object IDs of participating object classes and its relationship attributes. (e.g. PS and PSJ)
• Multi-valued attributes and composite attributes are stored as nested relations. (e.g. city)

3. ORA-SS applications – Store ORA-SS in object-relational databases (Cont.)

Storage Schema for ORA-SS/XML Databases of the data in Example 1.

[Figure: the ORA-SS schema diagram of Example 1, shown next to its storage schema.]

Object relations
  part (pno, pname, color)
  supplier (sno, sname, (city)+)
  project (jno, jname, budget)

Relationship relations
  PS (pno, sno, price)
  PSJ (pno, sno, jno, qty)

Constraint: PSJ[pno, sno] ⊆ PS[pno, sno]
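The storage rules can be sketched in Python as follows. This is an illustrative assumption about how the shredding might be coded, not the authors' implementation: the function name `shred`, the use of dictionaries as relations and the abridged XML string are all hypothetical. The nested city relation is modelled as a frozenset, and the last line checks the inclusion constraint between PSJ and PS.

```python
import xml.etree.ElementTree as ET

# Abridged copy of psj.xml (one part, two suppliers, one project each).
PSJ_XML = """
<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
      <project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>60</qty></project>
    </supplier>
    <supplier><sno>S002</sno><sname>Beta</sname><city>Atlanta</city><city>New York</city><price>5.5</price>
      <project><jno>J002</jno><jname>Diving helm</jname><budget>18000</budget><qty>70</qty></project>
    </supplier>
  </part>
</psj>
"""


def shred(root):
    """Split the nested document into object relations and relationship relations."""
    part, supplier, project, ps, psj = {}, {}, {}, {}, {}
    for p in root.findall("part"):
        pno = p.findtext("pno")
        part[pno] = (p.findtext("pname"), p.findtext("color"))
        for s in p.findall("supplier"):
            sno = s.findtext("sno")
            # The multi-valued attribute city is kept as a nested set.
            cities = frozenset(c.text for c in s.findall("city"))
            supplier[sno] = (s.findtext("sname"), cities)
            # price belongs to the relationship PS, so it is keyed by (pno, sno).
            ps[(pno, sno)] = s.findtext("price")
            for j in s.findall("project"):
                jno = j.findtext("jno")
                project[jno] = (j.findtext("jname"), j.findtext("budget"))
                # qty belongs to the relationship PSJ.
                psj[(pno, sno, jno)] = j.findtext("qty")
    return part, supplier, project, ps, psj


part, supplier, project, ps, psj = shred(ET.fromstring(PSJ_XML))
# Inclusion constraint: every (pno, sno) pair of PSJ must also appear in PS.
inclusion_holds = {(pno, sno) for pno, sno, _jno in psj} <= set(ps)
print(inclusion_holds)
```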

3. ORA-SS applications – Store ORA-SS in object-relational databases (Cont.)

An example to show the advantage of using an object-relational database instead of a relational database.

ORA-SS schema diagram

[Figure: object class employee with attributes eno, ename, hobby (*), quantification (*) and job_history (*); quantification is composed of year, degree and Univ.; job_history is composed of year, job_title and company.]

Storage schema in ORDB

Employee (eno, ename, (hobby)*, quantification(year, degree, Univ)*, job_history(year, job_title, company)*)

Storage schema in traditional RDB

Employee (eno, ename)
E_hobby (eno, hobby)
E_quantification (eno, year, degree, Univ.)
E_job_history (eno, year, job_title, company)

3. ORA-SS applications – Define and validate XML views

• Valid XML views in ORA-SS
• View definition operators: select, project/drop, swap, join

For example, consider the following swapping operation that changes the position of supplier and part in different hierarchical levels:

[Figures: the ORA-SS schema diagram of Example 1, followed by a valid view and an invalid view. In both views supplier is placed above part, with project below and the attributes price and qty.]

The invalid view is invalid because price is a relationship attribute: it cannot be moved up with the supplier elements, which would be semantically meaningless in the result view.

3. ORA-SS applications – Define and validate XML views (cont.)

As another example, consider the following projection operation that drops supplier from the structure:

[Figures: the ORA-SS schema diagram of Example 1; an invalid view in which part is the parent of project and the attributes price and qty are kept unchanged; and a valid view in which part is the parent of project with the attributes Avg_price and Total_qty.]

Dropping supplier makes price and qty become multi-valued attributes, and we should apply aggregation functions to get a meaningful view.
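The aggregation needed after dropping supplier can be sketched in Python over the PS and PSJ relations of Example 1. The function name `drop_supplier` is hypothetical, and the grouping is an assumption: the average price is computed per part and the total quantity per (part, project) pair, which is one plausible reading of the Avg_price and Total_qty attributes in the valid view.

```python
from collections import defaultdict

# PS (pno, sno, price) and PSJ (pno, sno, jno, qty) rows of Example 1.
PS = [("P001", "S001", 5.0), ("P001", "S002", 5.5),
      ("P002", "S001", 4.6), ("P002", "S003", 5.0)]
PSJ = [("P001", "S001", "J001", 60), ("P001", "S001", "J003", 650),
       ("P001", "S002", "J002", 70), ("P001", "S002", "J003", 50),
       ("P002", "S001", "J002", 60), ("P002", "S003", "J001", 20),
       ("P002", "S003", "J004", 50)]


def drop_supplier(ps_rows, psj_rows):
    """Aggregate away sno: average price per part, total qty per (part, project)."""
    prices = defaultdict(list)
    for pno, _sno, price in ps_rows:
        prices[pno].append(price)
    avg_price = {pno: sum(values) / len(values) for pno, values in prices.items()}

    total_qty = defaultdict(int)
    for pno, _sno, jno, qty in psj_rows:
        total_qty[(pno, jno)] += qty
    return avg_price, dict(total_qty)


avg_price, total_qty = drop_supplier(PS, PSJ)
print(avg_price)
print(total_qty)
```

Without the aggregation, part P001 would carry two price values and the pair (P001, J003) two qty values, which is the multi-valued situation the slide warns about.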

3. ORA-SS applications – Graphical XML query based on ORA-SS

A graphical XML query language is designed on the basis of ORA-SS.

[Figure: screenshot of the user interface of our graphical query language.]

• The schema panel loads the ORA-SS schema diagram.
• A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
• Complex query logic such as quantification, negation and IF-THEN construction can be specified in the Condition Logic Window.

Query 1: To select and display the projects that do not have any suppliers located in Atlanta.
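The negation in Query 1 can be spelled out with a small Python sketch over the relations of Example 1. It is not the graphical query language itself; the function name `projects_without_supplier_in` and the dictionary of supplier cities are assumptions made for illustration.

```python
# City sets per supplier and PSJ (pno, sno, jno, qty) rows of Example 1.
SUPPLIER_CITIES = {
    "S001": {"Atlanta"},
    "S002": {"Atlanta", "New York"},
    "S003": {"New York"},
}
PSJ = [("P001", "S001", "J001", 60), ("P001", "S001", "J003", 650),
       ("P001", "S002", "J002", 70), ("P001", "S002", "J003", 50),
       ("P002", "S001", "J002", 60), ("P002", "S003", "J001", 20),
       ("P002", "S003", "J004", 50)]


def projects_without_supplier_in(city, psj_rows, supplier_cities):
    """Return the projects for which no participating supplier is located in city."""
    all_projects = {jno for _pno, _sno, jno, _qty in psj_rows}
    # A project is excluded as soon as one of its suppliers has the city.
    excluded = {jno for _pno, sno, jno, _qty in psj_rows
                if city in supplier_cities[sno]}
    return sorted(all_projects - excluded)


result = projects_without_supplier_in("Atlanta", PSJ, SUPPLIER_CITIES)
print(result)
```

The query is a "for no supplier" condition, so it is evaluated as a set difference rather than as a filter on individual supplier rows.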

3. ORA-SS applications – XML query optimization

• The semantic information represented in ORA-SS is also helpful in optimizing XML queries.

Consider a simple query example, which means:
(Query 2.) To display the budget of project “J001”.

3. ORA-SS applications – XML query optimization

• Traditional processing has to scan the whole XML document, checking every project with jno=“J001” and finding all corresponding budget values.
• However, in ORA-SS, since jno is the object ID, we have the functional dependency

    jno → budget

  so the optimized processing only needs to find the first project instance with jno=“J001” and return the corresponding budget value.
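The difference between the two strategies can be sketched with Python's streaming parser. Both function names are hypothetical and the XML is an abridged copy of psj.xml; the `visited` counter is only there to show how many project elements each strategy inspects.

```python
import io
import xml.etree.ElementTree as ET

# Abridged copy of psj.xml: project J001 appears under two suppliers.
PSJ_XML = b"""<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J003</jno><budget>250000</budget></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J004</jno><budget>20000</budget></project>
    </supplier>
  </part>
</psj>"""


def budget_full_scan(xml_bytes, jno):
    """Visit every project element and collect all matching budget values."""
    visited, budgets = 0, []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "project":
            visited += 1
            if elem.findtext("jno") == jno:
                budgets.append(elem.findtext("budget"))
    return budgets, visited


def budget_first_match(xml_bytes, jno):
    """Stop at the first matching project, relying on jno -> budget."""
    visited = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "project":
            visited += 1
            if elem.findtext("jno") == jno:
                return elem.findtext("budget"), visited
    return None, visited


print(budget_full_scan(PSJ_XML, "J001"))
print(budget_first_match(PSJ_XML, "J001"))
```

The early exit is only safe because the functional dependency guarantees that every project instance with the same jno carries the same budget.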


4. Discover semantics in XML documents

• Problem definition
  – Input: a well-formed XML document, probably with a DTD or XSD schema
  – Output: semantics that are necessary for the ORA-SS schema
• It is a process of enriching an XML schema to an ORA-SS schema by using mining techniques.

4. Discover semantics in XML documents

• Related issues in mining semantics
  – Object classes
    • Identify object classes
    • Identify object IDs
    • Identify object attributes and their cardinalities
    • Identify IDREF(s) attributes
  – Relationship types
    • Find relationship types with their degrees and participating object classes
    • Find attributes and their cardinalities of relationship types

4. Discover semantics in XML documents

The whole vision of the process.

[Figure: flowchart of the process. Main flow: Start; Identify object classes; Pick out multi-valued attributes; Identify object ID; Identify multi-valued and composite object attributes; Identify relationship types with relationship attributes; Identify relationship types without relationship attributes; End. Intermediate and output items: object classes; multi-valued attributes; composite attributes; object ID; single-valued, multi-valued and composite object attributes; relationship types; single-valued, multi-valued and composite relationship attributes. Legend: the main flow of the process, the output flow, the input flow.]

4. Discover semantics in XML documents

• Assumption
  – To simplify the discussion, we do not consider the order of attributes and elements.
• User verification
  – The findings of each step during the process should be verified by the user.
  – The verified findings of previous steps are used in later steps.

4. Discover semantics in XML documents – Find object classes

• Identify object classes from element types:
  – Scan the XML document or, if possible, the DTD/XSD of the XML document to select all internal nodes in the document tree.
  – An internal node is a node that has some child nodes, such as XML attribute types and/or subelement types.
  – An internal node may not be an object class, but an object class must correspond to an internal node. Therefore, internal nodes are candidates for object classes.

4. Discover semantics in XML documents – Find object classes (cont.)

• Detecting composite attributes from object classes
  – Although composite attributes are also internal nodes, there are some special patterns that indicate they are not object classes.

The first pattern is that all subelement types or attributes are:
1) Single-valued
2) Always occurring in the same order
3) No functional dependency can be found within the component attributes of a composite attribute.

[Figure: XML element birthday with the XML elements or XML attributes month, day and year, and the values "20", "3" and "2005".]

4. Discover semantics in XML documents – Find object classes (cont.)

The second pattern is that all subelement types or attributes are:
1) Of the same type (repeated)
2) The set of the subelement/attribute values is often determined by other element/attribute values. (e.g. studNo determines the values of the hobby elements under the “hobbies” element)

[Figure: XML element student with studNo and the XML element hobbies; hobbies contains three hobby elements with the values "reading", "swimming" and "basketball".]

4. Discover semantics in XML documents – Find object classes (cont.)

The DTD of Example 1.

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

Dataguide

▼♦ psj
  ▼♦ part
      ♦ pno
      ♦ pname
      ♦ color
    ▼♦ supplier
        ♦ sno
        ♦ sname
        ♦ city
        ♦ price
      ▼♦ project
          ♦ jno
          ♦ jname
          ♦ budget
          ♦ qty

From the DTD of Example 1, the element types psj, part, supplier and project are internal nodes (this can be seen intuitively in the Dataguide). Then the list { psj, part, supplier, project } contains the candidate object classes. Because a well-formed XML document usually has a document root that is not concerned with the data, we can drop the root node psj from the list and get the final result

{ part, supplier, project }.
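The selection of candidate object classes can be sketched in Python by scanning a document instead of its DTD. The function name `candidate_object_classes` is hypothetical and the XML is an abridged copy of psj.xml; dropping the root follows the heuristic stated above, and the result would still need user verification.

```python
import xml.etree.ElementTree as ET

# Abridged copy of psj.xml.
PSJ_XML = """
<psj>
  <part><pno>P001</pno><pname>Nut</pname>
    <supplier><sno>S001</sno><city>Atlanta</city>
      <project><jno>J001</jno><qty>60</qty></project>
    </supplier>
  </part>
</psj>
"""


def candidate_object_classes(root):
    """Element types with at least one subelement or XML attribute, minus the root."""
    internal = set()
    for elem in root.iter():
        # len(elem) counts subelements; elem.attrib holds the XML attributes.
        if len(elem) > 0 or elem.attrib:
            internal.add(elem.tag)
    # The document root is usually not concerned with the data.
    internal.discard(root.tag)
    return internal


candidates = candidate_object_classes(ET.fromstring(PSJ_XML))
print(sorted(candidates))
```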

4. Discover semantics in XML documents – Identify multi-valued attributes

• After object classes and composite attributes are identified, we pick out all multi-valued attributes for later use.
  – Multi-valued attributes can be detected by checking the occurrence constraints in the DTD/XSD, or by counting directly in the document.
  – A multi-valued attribute can belong either to an object class (e.g. city of supplier) or to a relationship type. To determine the affiliation of multi-valued attributes, we need to find the object ID first.
  – Without considering multi-valued attributes, the search for the object ID is easier.

4. Discover semantics in XML documents – Find object IDs

• For each identified object class (after user verification)
  – If it is located at the first level below the document root, and the DTD/XSD has specified an ID attribute or key constraint, then the corresponding attribute/element should be an object ID.
  – Otherwise
    • A temporary table is built, which contains all XML attributes and single-valued simple subelement types of the object class.
    • Find the full functional dependencies in the temporary table.
      – If all attributes/elements are fully functionally dependent on an attribute/element k, then k is most likely the object ID;
      – Else,
        » find an attribute/element k’ that functionally determines the largest number of attributes/elements; k’ is suggested as the object ID,
        » and the attributes/elements that are not determined by k’ are classified as single-valued attributes of some relationship types to be determined later.
    • The result should be verified by the user.

4. Discover semantics in XML documents – Find object IDs (cont.)

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

Candidate object classes list
  {part, supplier, project}

Three temporary tables
  part_temp (pno, pname, color)
  supplier_temp (sno, sname, price)
  project_temp (jno, jname, budget, qty)

Notice that, at this stage, all simple subelement types and attributes are treated the same.

Multi-valued attributes such as city are not included in the temporary tables.

4. Discover semantics in XML documents – Find object IDs (cont.)

Three temporary tables
  part_temp (pno, pname, color)
  supplier_temp (sno, sname, price)
  project_temp (jno, jname, budget, qty)

1. In part_temp, we find that
     pno → pname, color
   thus, pno is the object ID of part.
2. In supplier_temp, we only have
     sno → sname
   thus, sno is the object ID of supplier, and price is picked out as a relationship attribute.
3. In project_temp, we only have
     jno → jname, budget
   thus, jno is the object ID of project, and qty is picked out as a relationship attribute.
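The search for an object ID in a temporary table can be sketched in Python as follows. This is a simplified assumption, not the authors' algorithm: only single-column candidates are tried, ties are broken by column order, and the names `determines` and `suggest_object_id` are hypothetical. The rows of supplier_temp are taken from the four supplier elements of psj.xml.

```python
def determines(rows, lhs, rhs):
    """True if column lhs functionally determines column rhs in rows."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def suggest_object_id(rows, columns):
    """Suggest the column that determines the largest number of other columns."""
    best, best_determined = None, []
    for k in columns:
        determined = [c for c in columns if c != k and determines(rows, k, c)]
        if len(determined) > len(best_determined):
            best, best_determined = k, determined
    # Columns not determined by the suggested ID are relationship attribute candidates.
    leftover = [c for c in columns if c != best and c not in best_determined]
    return best, best_determined, leftover


# One row per supplier element in psj.xml.
supplier_temp = [
    {"sno": "S001", "sname": "Alfa", "price": "5"},
    {"sno": "S002", "sname": "Beta", "price": "5.5"},
    {"sno": "S001", "sname": "Alfa", "price": "4.6"},
    {"sno": "S003", "sname": "Beta", "price": "5"},
]

object_id, object_attrs, relationship_attrs = suggest_object_id(
    supplier_temp, ["sno", "sname", "price"])
print(object_id, object_attrs, relationship_attrs)
```

On small samples several columns may determine equally many others, which is one reason the suggested object ID should be verified by the user.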

Page 48: Capturing Semantics in XML Documents

April 9, 2006 KDXD 2006, Singapore 48

• In the stage after the process of identifying object IDs, we find out:
– Object IDs of each object class,
– Single-valued object attributes and their corresponding object classes,
– Single-valued relationship attributes, without knowing what relationship type they belong to.

Find object IDs

Page 49: Capturing Semantics in XML Documents


• Recall that, before searching for object IDs, all multi-valued attributes are identified. Given a multi-valued attribute under an object class, we check,
– for each object ID value of the object class, whether there is a unique set of values of the attribute.
  • If so, it is a multi-valued attribute of the object class;
  • otherwise, it is classified as a multi-valued attribute of some relationship type not yet known.

Multi-valued attributes of object classes

Page 50: Capturing Semantics in XML Documents


Multi-valued attributes of object classes

• For example, city is a multi-valued attribute under supplier.
– We check sno and city: since each sno value is associated with the same set of city values, city is a multi-valued attribute of supplier.

The temporary table of sno and city:

sno    city+
S001   Atlanta
S002   {Atlanta, New York}
S001   Atlanta
S003   New York
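The check on the sno/city table can be sketched as follows, using the four rows of the table. The helper name is_object_attribute is hypothetical; the logic simply verifies that each object ID value always appears with the same set of attribute values.

```python
from collections import defaultdict

# One (sno, set of city values) pair per supplier occurrence, as in the table.
occurrences = [
    ("S001", {"Atlanta"}),
    ("S002", {"Atlanta", "New York"}),
    ("S001", {"Atlanta"}),
    ("S003", {"New York"}),
]

def is_object_attribute(occurrences):
    """True if every object ID value is always paired with the same value set."""
    sets_per_id = defaultdict(set)
    for object_id, values in occurrences:
        sets_per_id[object_id].add(frozenset(values))
    return all(len(distinct) == 1 for distinct in sets_per_id.values())

city_belongs_to_supplier = is_object_attribute(occurrences)
print(city_belongs_to_supplier)
```

If S001 appeared once with {Atlanta} and once with {Boston}, the function would return False and city would be deferred to the relationship-type step.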

Page 51: Capturing Semantics in XML Documents


• For multi-valued object attributes, we should know their cardinality.
– If the DTD/XSD specifies it, reuse it.
– Without a schema, count the minimum and maximum occurrences of the multi-valued attributes.
– Notice that both single-valued and multi-valued attributes can be optional (e.g. ? and *). Thus, the result should be verified by the user.
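Counting occurrences when no schema is available can be sketched as below. The function names are hypothetical, the city lists reuse the sno/city example, and the mapping to a DTD occurrence indicator is our reading of the standard ?, * and + semantics rather than something stated on the slide.

```python
def occurrence_range(values_per_object):
    """Minimum and maximum number of values observed per object occurrence."""
    counts = [len(values) for values in values_per_object]
    return min(counts), max(counts)

def dtd_indicator(lo, hi):
    """Map an observed (min, max) range to a DTD-style occurrence indicator."""
    if hi <= 1:
        return "?" if lo == 0 else ""
    return "*" if lo == 0 else "+"

# city values per supplier occurrence, as in the sno/city example
city_per_supplier = [["Atlanta"], ["Atlanta", "New York"], ["Atlanta"], ["New York"]]
lo, hi = occurrence_range(city_per_supplier)
print(lo, hi, repr(dtd_indicator(lo, hi)))
```

The observed range only reflects the current data: a minimum of 1 does not prove the attribute is mandatory, which is why the result is left for the user to verify.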

Find cardinality of object class attributes

Page 52: Capturing Semantics in XML Documents


• Identify IDREFs
– If the DTD/XSD has specified IDREF/IDREFS or keyref constraints, reuse them.
– Without the schema, we compare the object attribute values with the values of other object IDs:
  • If all values of a single-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREF;
  • If all values of a multi-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREFS.

(Note that, if it is an XML attribute, multiple values of IDREFS are separated by a blank character.)
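The schema-less comparison can be sketched as a subset test. Everything named here is hypothetical: classify_reference, the attribute value lists, and the blank-separated raw strings are invented for illustration, and whether an attribute is multi-valued is assumed to have been decided in an earlier step.

```python
def classify_reference(values_per_object, target_ids, multi_valued):
    """Return 'IDREF' or 'IDREFS' if every observed value is an object ID
    of the target class, otherwise None."""
    values = {v for vals in values_per_object for v in vals}
    if not values or not values <= set(target_ids):
        return None
    return "IDREFS" if multi_valued else "IDREF"

part_ids = {"P001", "P002"}

# Hypothetical single-valued attribute: one value per object occurrence.
single = [["P001"], ["P002"], ["P001"]]

# Hypothetical XML attribute holding blank-separated values.
raw_multi = ["P001 P002", "P002"]
multi = [raw.split() for raw in raw_multi]

print(classify_reference(single, part_ids, multi_valued=False))
print(classify_reference(multi, part_ids, multi_valued=True))
print(classify_reference([["P001"], ["X999"]], part_ids, multi_valued=False))
```

A match is only a candidate: values can coincide with another class's object IDs by accident, so the outcome still needs user confirmation.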

Find IDREF/IDREFS

Page 53: Capturing Semantics in XML Documents


• Identify relationship types (basic idea)
– The search for relationship types is based on the object IDs and relationship attributes (single-valued or multi-valued).
– Along a path from the root to a leaf node in the document tree, we may pass through several object classes. The object IDs of these object classes can form a temporary table. We build such a temporary table for each single-valued relationship attribute, and find relationship types.

Find relationship types

Page 54: Capturing Semantics in XML Documents


• For each single-valued relationship attribute, there is a path from the root to the attribute; put the object IDs of the object classes along the path into a temporary table together with the relationship attribute.
– Find the FDs that determine the single-valued relationship attribute in the temporary table.
• For multi-valued relationship attributes, we should find a combination of object IDs of different object classes such that each unique combination of object ID values corresponds to a unique set of the attribute values.

Find relationship types (cont.)

Page 55: Capturing Semantics in XML Documents


Find relationship types (cont.)

• From the data in Example 1, we can build a temporary table for price along the path “part/supplier/price” as follows:

pno    sno    price
P001   S001   5
P001   S002   5.5
P002   S001   4.6
P002   S003   5

We can find that {pno, sno} → price; thus, there is a binary relationship type between part and supplier, and price is an attribute of the binary relationship type.

[Figure: schema diagram with object classes part, supplier and project; labels pno, pname, color, sno, sname, city (+), jno, jname, budget, price and qty]
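The search for the object IDs that determine price can be sketched as a brute-force scan over combinations, smallest first. The rows are those of the price table; the function names fd_holds and minimal_determinants are hypothetical, and the exhaustive search is only one simple way to realise the step.

```python
from itertools import combinations

def fd_holds(rows, lhs, attribute):
    """True if rows that agree on the lhs columns also agree on attribute."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[attribute]) != row[attribute]:
            return False
    return True

def minimal_determinants(rows, id_columns, attribute):
    """Smallest combinations of id_columns that functionally determine attribute."""
    found = []
    for size in range(1, len(id_columns) + 1):
        for combo in combinations(id_columns, size):
            # skip supersets of a combination that already works
            covered = any(set(f) <= set(combo) for f in found)
            if not covered and fd_holds(rows, combo, attribute):
                found.append(combo)
    return found

price_rows = [
    {"pno": "P001", "sno": "S001", "price": "5"},
    {"pno": "P001", "sno": "S002", "price": "5.5"},
    {"pno": "P002", "sno": "S001", "price": "4.6"},
    {"pno": "P002", "sno": "S003", "price": "5"},
]

price_determinants = minimal_determinants(price_rows, ["pno", "sno"], "price")
print(price_determinants)
```

Neither pno nor sno alone determines price in these rows, so the only result is the pair (pno, sno), matching the binary relationship type between part and supplier.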

Page 56: Capturing Semantics in XML Documents


• Similarly, we can build a temporary table for qty along the path “part/supplier/project/qty” as follows:

pno    sno    jno    qty
P001   S001   J001   60
P001   S001   J003   650
P001   S002   J002   70
P001   S002   J003   50
P002   S001   J002   60
P002   S003   J001   20
P002   S003   J004   50

We can find that {pno, sno, jno} → qty; thus, there is a ternary relationship type among part, supplier and project, and qty is an attribute of the ternary relationship type.

[Figure: schema diagram with object classes part, supplier and project; labels pno, pname, color, sno, sname, city (+), jno, jname, budget, price and qty]
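The same kind of check can be run on the qty table to see which combinations of object IDs determine qty. This is a sketch over the seven rows shown; determines_qty and results are hypothetical names.

```python
from itertools import combinations

COLUMNS = ("pno", "sno", "jno")

# Rows of the qty table: (pno, sno, jno, qty)
qty_rows = [
    ("P001", "S001", "J001", "60"),
    ("P001", "S001", "J003", "650"),
    ("P001", "S002", "J002", "70"),
    ("P001", "S002", "J003", "50"),
    ("P002", "S001", "J002", "60"),
    ("P002", "S003", "J001", "20"),
    ("P002", "S003", "J004", "50"),
]

def determines_qty(positions):
    """True if the ID columns at the given positions functionally determine qty."""
    seen = {}
    for row in qty_rows:
        key = tuple(row[i] for i in positions)
        if seen.setdefault(key, row[3]) != row[3]:
            return False
    return True

# Test every non-empty combination of the three object IDs.
results = {}
for size in range(1, len(COLUMNS) + 1):
    for positions in combinations(range(len(COLUMNS)), size):
        names = tuple(COLUMNS[i] for i in positions)
        results[names] = determines_qty(positions)

for names, holds in results.items():
    print(names, holds)
```

The full combination {pno, sno, jno} determines qty, as stated above. In this small sample {sno, jno} alone also happens to determine qty, because no (sno, jno) pair repeats; this looks like a coincidence of the sample rather than a property of the schema, and it is the kind of candidate result that user verification is meant to catch.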

Find relationship types (cont.)

Page 57: Capturing Semantics in XML Documents


• Relationship types can exist without having relationship attributes.

• To find such relationship types, we need to build a temporary table of the object IDs of the different object classes, based on the existing paths in the document tree.

• Search the temporary table and find MVDs (see the following example).

Find relationship types (cont.)

Page 58: Capturing Semantics in XML Documents


We have already identified the
- Hierarchical structure;
- Object classes and their object IDs;
- Attributes of object classes;
- But no attribute is likely to belong to any relationship type.

• Suppose we have another document of project, staff, and paper. After we have found their object ID attributes, i.e. J_no, St_no, and Pa_no respectively, we can create a temporary table as follows.

[Figure: schema diagram with object classes project, staff and paper and their object IDs J_no, St_no and Pa_no; other attributes elided as "..."]

Find relationship types (cont.)

Page 59: Capturing Semantics in XML Documents


Find relationship types (cont.)

We build a temporary table which consists of J_no, St_no, and Pa_no:

J_no   St_no   Pa_no
J001   S001    P001
J001   S002    P003
J002   S001    P001
J002   S003    P001
…      …       …

CASE 1. If we find that each St_no value is associated with a unique set of Pa_no values, i.e. St_no multi-determines Pa_no, then there are two binary relationship types: one between project and staff, and the other between staff and paper.

CASE 2. If there is no FD or MVD in the table, then there is a ternary relationship type among project, staff and paper.

[Figures: CASE 1 shows project, staff and paper linked by two relationship types of degree 2; CASE 2 shows them linked by one relationship type of degree 3]
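The MVD test that separates the two cases can be sketched as below. It uses only the four rows visible in the table (the elided rows are not filled in), and mvd_holds is a hypothetical helper implementing the usual definition: X multi-determines Y in a table over (X, Y, Z) when, for each X value, the observed (Y, Z) pairs are exactly the cross product of its Y values and its Z values.

```python
from collections import defaultdict
from itertools import product

# Visible rows of the temporary table: (J_no, St_no, Pa_no)
rows = [
    ("J001", "S001", "P001"),
    ("J001", "S002", "P003"),
    ("J002", "S001", "P001"),
    ("J002", "S003", "P001"),
]

def mvd_holds(rows, x, y, z):
    """True if column x multi-determines column y, with z the remaining column."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[x]].add((row[y], row[z]))
    for pairs in groups.values():
        ys = {p[0] for p in pairs}
        zs = {p[1] for p in pairs}
        # every y value must occur with every z value for this x value
        if pairs != set(product(ys, zs)):
            return False
    return True

# Column indexes: 0 = J_no, 1 = St_no, 2 = Pa_no
st_multidetermines_pa = mvd_holds(rows, x=1, y=2, z=0)
print(st_multidetermines_pa)
```

On the four visible rows the MVD holds, which corresponds to CASE 1; the rows elided in the table could change that outcome.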

Page 60: Capturing Semantics in XML Documents


• The participation constraints of each relationship type can be obtained by counting the unique object ID values in the corresponding temporary table.

Find participation constraints
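One way to read this counting step is sketched below for the binary relationship type between part and supplier, reusing the pno/sno pairs of the price table. The function name participation and the interpretation (minimum and maximum number of distinct partners per object) are assumptions, not a definition taken from the talk.

```python
from collections import defaultdict

def participation(rows, of, partner):
    """Min and max number of distinct partner IDs seen for each ID in column `of`."""
    partners = defaultdict(set)
    for row in rows:
        partners[row[of]].add(row[partner])
    counts = [len(p) for p in partners.values()]
    return min(counts), max(counts)

# pno/sno pairs taken from the price table
pairs = [
    {"pno": "P001", "sno": "S001"},
    {"pno": "P001", "sno": "S002"},
    {"pno": "P002", "sno": "S001"},
    {"pno": "P002", "sno": "S003"},
]

part_range = participation(pairs, of="pno", partner="sno")
supplier_range = participation(pairs, of="sno", partner="pno")
print(part_range, supplier_range)
```

Objects that never take part in the relationship do not appear in the temporary table at all, so a true minimum of 0 cannot be observed this way and has to come from the user.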

Page 61: Capturing Semantics in XML Documents


• All outputs, including intermediate results, should be verified by users.

• With input from users and their verification, a semi-automatic mining process can be applied to discover the semantics in XML documents that are important in designing XML databases, storing XML data, validating XML views, and processing/optimizing XML queries.

• All the discovered semantics can be represented by ORA-SS; but some of them cannot be represented in DTD/XSD.

User verification

Page 62: Capturing Semantics in XML Documents


Roadmap
1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)
3. The applications of ORA-SS
4. Discovering Semantics in XML documents
5. Conclusion

Page 63: Capturing Semantics in XML Documents


5. Conclusion
1) We demonstrate a data-centric XML document and show the limitations of current XML schema standards in representing relational semantics and constraints.

2) We introduce ORA-SS, a semantically rich data model that can intuitively express the semantics in XML data.

3) We discuss a naïve method of mining semantics from XML data/schema to generate an ORA-SS schema. More efficient methods should be further investigated.

Page 64: Capturing Semantics in XML Documents


5. Conclusion (cont.)

4) The semantics in ORA-SS are crucial in designing XML databases, writing and interpreting XML queries, validating XML views, etc.

5) The method we proposed in the presentation to discover semantics only provides candidate answers. In other words, not all the results are necessarily true, because the contents of the data may change. Therefore, user feedback is indispensable in the process of enriching an XML schema to an ORA-SS schema.

Page 65: Capturing Semantics in XML Documents


References:
[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002.
[2]. C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company (1981).
[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business media, Inc. 2005.
[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

Page 66: Capturing Semantics in XML Documents


Q & A

Page 67: Capturing Semantics in XML Documents


The End