Extracting Schema From Data


Page 1: Extracting Schema From Data

Extracting Schema From Data

• Schemas for semistructured data differ from traditional schemas in that a given semistructured data instance can have more than one schema.

• Schema extraction: given a semistructured data instance, automatically compute a schema for it. When several schemas are possible, we want the one that best describes the structure of that particular data.

Page 2: Extracting Schema From Data

•Schema Extraction for schema graphs

•Schema Extraction for Datalog Typings

Page 3: Extracting Schema From Data

Data Guides

Our goal is to construct a new OEM graph that is a finite description of the list of paths in the data. This graph is called a data guide. It must satisfy two properties:

•Accurate: every path in the data occurs in the data guide, and every path in the data guide occurs in the data.

•Concise: every path occurs exactly once in the data guide.

Page 4: Extracting Schema From Data

[Figure 7.13: An example of OEM data. The root &r has employee edges to the objects &p1 through &p8 and a company edge to &c. Edge labels: employee, company, worksfor, name, manages, managedby, position, phone. Leaves are string values such as "Smith", "Jones", "IT" and "6666".]

Page 5: Extracting Schema From Data

We proceed as follows. The data guide has a root node, call it Root. We then examine each path in the list, one by one, and add new nodes to the data guide as needed:

•employee

•employee.name

•employee.manages

•employee.manages.managedby

•employee.manages.managedby.manages

•employee.manages.managedby.manages.managedby

•company
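The path-by-path construction can be sketched in Python as a subset construction, where each data guide node stands for the set of data objects reached by one label path. This is a minimal sketch under stated assumptions: the function name `build_data_guide`, the dictionary encoding of the graph, and the small sample graph are all hypothetical, and atomic leaf values are left out.

```python
from collections import deque

def build_data_guide(graph, root):
    """Return (start, guide); each guide node is a frozenset of data objects."""
    start = frozenset([root])
    guide = {}                       # guide node -> {label: guide node}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node in guide:            # already expanded
            continue
        # Group the targets of all outgoing edges of this object set by label.
        targets = {}
        for obj in node:
            for label, target in graph.get(obj, []):
                targets.setdefault(label, set()).add(target)
        guide[node] = {label: frozenset(t) for label, t in targets.items()}
        queue.extend(guide[node].values())
    return start, guide

# Hypothetical fragment of an OEM graph (not the full figure).
graph = {
    "&r": [("employee", "&p1"), ("employee", "&p2"), ("company", "&c")],
    "&p1": [("manages", "&p2"), ("worksfor", "&c")],
    "&p2": [("managedby", "&p1"), ("worksfor", "&c")],
}
start, guide = build_data_guide(graph, "&r")
```

Because every guide node has at most one outgoing edge per label, each label path of the data appears exactly once in the result, which is the conciseness property.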

Page 6: Extracting Schema From Data

[Figure: A data guide. Nodes and the data objects they represent: Root (&r); Employees (&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8); Boss (&p1, &p4, &p6); Regular (&p2, &p3, &p5, &p7, &p8); Company (&c). Edge labels: employee, company, worksfor, managedby, manages, name, phone, position.]

Page 7: Extracting Schema From Data

[Figure: Schema graph. Nodes: Root, Emp, Comp. Edge labels: employee, company, name, worksfor, phone, position, managedby, manages.]

Page 8: Extracting Schema From Data

Simulation between a data graph and a data guide

Node in data graph                        Node in data guide
&r                                        Root
&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8    Employee
&p1, &p4, &p6                             Boss
&p2, &p3, &p5, &p7, &p8                   Regular
&c                                        Company

Simulation from the data guide to the schema graph

Node in data guide    Node in schema graph
Root                  Root
Employee              Emp
Boss                  Emp
Regular               Emp
Company               Comp
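A simulation like the ones tabulated here can be computed by starting from all node pairs and discarding pairs that violate the simulation condition. The Python below is a minimal sketch under stated assumptions: the function name `greatest_simulation`, the graph encoding, and the sample graphs are hypothetical, and atomic values are ignored.

```python
def greatest_simulation(data, schema):
    """data, schema: dict mapping a node to a list of (label, target) edges."""
    data_nodes = set(data) | {t for es in data.values() for _, t in es}
    schema_nodes = set(schema) | {t for es in schema.values() for _, t in es}
    # Start with every pair and remove pairs that cannot be in a simulation.
    sim = {(d, s) for d in data_nodes for s in schema_nodes}
    changed = True
    while changed:
        changed = False
        for d, s in list(sim):
            for label, d2 in data.get(d, []):
                # Every data edge must be matched by a schema edge with the
                # same label whose target still simulates the data target.
                if not any(l == label and (d2, s2) in sim
                           for l, s2 in schema.get(s, [])):
                    sim.discard((d, s))
                    changed = True
                    break
    return sim

# Hypothetical fragments of a data graph and a schema graph.
data = {
    "&r": [("employee", "&p1"), ("employee", "&p2"), ("company", "&c")],
    "&p1": [("manages", "&p2"), ("worksfor", "&c")],
    "&p2": [("managedby", "&p1"), ("worksfor", "&c")],
}
schema = {
    "Root": [("employee", "Emp"), ("company", "Comp")],
    "Emp": [("manages", "Emp"), ("managedby", "Emp"), ("worksfor", "Comp")],
}
sim = greatest_simulation(data, schema)
conforms = ("&r", "Root") in sim
```

In this sketch the data conforms to the schema exactly when the pair of roots survives in the computed relation.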

Page 9: Extracting Schema From Data

This construction of the data guide resembles the technique for transforming a nondeterministic finite state automaton into a deterministic one.

The data guide is the most specific schema graph for that data with the following features:

•The data guide is a deterministic schema graph.

•Any other deterministic schema graph to which our data conforms subsumes the data guide.

Page 10: Extracting Schema From Data

[Figure: A nondeterministic schema. Nodes and the data objects they represent: Root (&r); Regular (&p2, &p3, &p5, &p7, &p8); Boss (&p1, &p4, &p6); Comp (&c). Edge labels: employee (twice from Root), company, manages, managedby, worksfor, name, phone.]

Page 11: Extracting Schema From Data

Extracting Datalog rules from data

We have a semistructured data instance and want to extract automatically the most specific typing given by a set of Datalog rules.

We create one predicate for each complex value object in the data. We create the following predicates:

pred_r, pred_c, pred_p1, pred_p2, pred_p3, pred_p4, pred_p5, pred_p6, pred_p7, pred_p8

corresponding to the objects &r, &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8.

Page 12: Extracting Schema From Data

Next we write a set of Datalog rules defining each predicate based exactly on the outgoing edges of its corresponding object:

pred_r(X) :- ref(X, company, Y), pred_c(Y),
             ref(X, employee, Z1), pred_p1(Z1), ……
             ref(X, employee, Z8), pred_p8(Z8)

pred_c(X) :- ref(X, name, N), string(N)

pred_p1(X) :- ref(X, worksfor, Y), pred_c(Y),
              ref(X, name, N), string(N),
              ref(X, phone, P), string(P),
              ref(X, manages, Z), pred_p2(Z),
              ref(X, manages, U), pred_p3(U)

Page 13: Extracting Schema From Data

pred_p2(X) :- ref(X, worksfor, Y), pred_c(Y),
              ref(X, name, N), string(N),
              ref(X, managedby, Z), pred_p1(Z)

pred_p3(X) :- …..

……

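We have to compute the largest fixpoint of the Datalog program on the given data. Starting with every object in the extent of every predicate, each iteration removes the objects that do not satisfy the body of the corresponding rule, until no extent changes.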
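The largest fixpoint can be computed by shrinking the extents until they are stable. The following Python is a minimal sketch under stated assumptions: the function name `largest_fixpoint`, the encoding of a rule body as (label, type) pairs, and the three-employee-free sample data are hypothetical and much smaller than the example in the slides.

```python
def largest_fixpoint(graph, atoms, rules):
    """graph: complex object -> list of (label, target) edges.
    atoms: set of atomic (string) values.
    rules: predicate -> list of (label, type), where type is another
           predicate name or the literal "string"."""
    # Start from the largest candidate: every object in every extent.
    extent = {pred: set(graph) for pred in rules}

    def satisfies(obj, body):
        # Each atom of the rule body needs a witnessing outgoing edge.
        for label, typ in body:
            if not any(l == label and
                       (t in atoms if typ == "string" else t in extent[typ])
                       for l, t in graph[obj]):
                return False
        return True

    changed = True
    while changed:
        changed = False
        for pred, body in rules.items():
            keep = {o for o in extent[pred] if satisfies(o, body)}
            if keep != extent[pred]:
                extent[pred] = keep
                changed = True
    return extent

# Hypothetical data: a root, one company, one boss, one regular employee.
graph = {
    "&r": [("company", "&c"), ("employee", "&p1"), ("employee", "&p2")],
    "&c": [("name", "Widget")],
    "&p1": [("worksfor", "&c"), ("name", "Smith"), ("manages", "&p2")],
    "&p2": [("worksfor", "&c"), ("name", "Jones"), ("managedby", "&p1")],
}
atoms = {"Widget", "Smith", "Jones"}
rules = {
    "pred_r": [("company", "pred_c"), ("employee", "pred_p1"),
               ("employee", "pred_p2")],
    "pred_c": [("name", "string")],
    "pred_p1": [("worksfor", "pred_c"), ("name", "string"),
                ("manages", "pred_p2")],
    "pred_p2": [("worksfor", "pred_c"), ("name", "string"),
                ("managedby", "pred_p1")],
}
extent = largest_fixpoint(graph, atoms, rules)
```

As in the slides, a rule only demands that certain edges exist, so an object with extra edges can still belong to a predicate; here pred_c ends up containing every object that has a name.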

Page 14: Extracting Schema From Data

Predicate    Objects
pred_r       &r
pred_c       &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
pred_p1      &p1
pred_p2      &p2, &p3, &p5, &p7, &p8
pred_p3      &p3, &p5, &p7, &p8
pred_p4      &p1, &p4, &p6
pred_p5      &p3, &p5, &p7, &p8
pred_p6      &p1, &p4, &p6
pred_p7      &p3, &p5, &p7, &p8
pred_p8      &p3, &p5, &p7, &p8

Extents of predicates after one iteration

Page 15: Extracting Schema From Data

Predicate    Objects
pred_r       &r
pred_c       &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
pred_p1      &p1
pred_p2      &p2, &p3
pred_p3      &p3
pred_p4      &p1, &p4, &p6
pred_p5      &p3, &p5, &p7, &p8
pred_p6      &p1, &p4, &p6
pred_p7      &p3, &p5, &p7, &p8
pred_p8      &p3, &p5, &p7, &p8

Extents of predicates after two iterations

Page 16: Extracting Schema From Data

We obtain the following Datalog rules:

Root(X) :- ref(X, company, Y), Company(Y),
           ref(X, employee, Z1), Boss1(Z1),
           ref(X, employee, Z2), Boss2(Z2),
           ref(X, employee, U1), Regular1(U1), ……..,
           ref(X, employee, U3), Regular3(U3)

Company(X) :- ref(X, name, N), string(N)

Boss1(X) :- ref(X, worksfor, Y), Company(Y),
            ref(X, name, N), string(N),
            ref(X, phone, P), string(P),
            ref(X, manages, Z), Regular1(Z),
            ref(X, manages, U), Regular2(U)

Boss2(X) :- ref(X, worksfor, Y), Company(Y),
            ref(X, name, N), string(N),
            ref(X, phone, P), string(P),
            ref(X, manages, Z), Regular3(Z)

Regular1(X) :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, managedby, Z), Boss1(Z)

Regular2(X) :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, position, P), string(P),
               ref(X, managedby, Z), Boss1(Z)

Regular3(X) :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, position, P), string(P),
               ref(X, managedby, Z), Boss2(Z)

Page 17: Extracting Schema From Data

Inferring Schemas From Queries

Some semistructured data instances are the result of queries.

For such data, a schema can be inferred from the query itself.

The following query takes a bibliography file and constructs a home page for every author:

where bib -> L -> X, X -> "author" -> A, X -> "title" -> T, X -> "year" -> Y
create Root(), HomePage(A), YearEntry(A,Y), PaperEntry(X)
link Root() -> "person" -> HomePage(A),
     HomePage(A) -> "year" -> YearEntry(A,Y),
     YearEntry(A,Y) -> "paper" -> PaperEntry(X),
     PaperEntry(X) -> "title" -> T,
     PaperEntry(X) -> "author" -> HomePage(A),
     PaperEntry(X) -> "year" -> Y

Page 18: Extracting Schema From Data

[Figure: Result of the query. Nodes: Root; HomePage("smith"), HomePage("Jones"); YearEntry("smith",1995), YearEntry("smith",1997); PaperEntry(o423), PaperEntry(o552), PaperEntry(o153). Edge labels: person, year, paper, title, author.]

Page 19: Extracting Schema From Data

[Figure: Schema graph inferred from the query. Nodes: Root, HomePage, YearEntry, PaperEntry. Edge labels: person, year, paper, title, author.]

The schema has one class for each function and one edge for each line in the link clause.

Page 20: Extracting Schema From Data

For the following example:

where
create Root(), F(X), F(Y), G(X), H(Y)
link Root() -> "A" -> F(X), F(X) -> "C" -> G(X),
     Root() -> "B" -> F(Y), F(X) -> "D" -> H(Y)

We reach the following schema:

Root: {A : F, B : F}

F : {C : G, D : H}
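The rule of one class per function and one edge per link line can be sketched in a few lines of Python. This is an illustrative sketch: the function name `infer_schema` and the encoding of each link line as a (source function, label, target function) triple are assumptions, and the arguments of the functions are deliberately dropped, which is why F(X) and F(Y) fall into the same class F.

```python
def infer_schema(links):
    """links: list of (source_function, label, target_function) triples."""
    schema = {}
    for source, label, target in links:
        # One class per function name, one edge per link line.
        schema.setdefault(source, {}).setdefault(label, set()).add(target)
    return schema

# The link clause of the example, with function arguments removed.
links = [
    ("Root", "A", "F"),
    ("F", "C", "G"),
    ("Root", "B", "F"),
    ("F", "D", "H"),
]
schema = infer_schema(links)
```

The resulting dictionary has the classes Root and F as keys, with edges A and B leading from Root to F, and edges C and D leading from F to G and H.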

Page 21: Extracting Schema From Data

Path Constraints

In Relational Databases

•In a relational database, the relation declaration tells us more than the types.

•It imposes a key constraint, so that no two tuples have the same key.

Example

create table Employees (
    EmpId: integer,
    EmpName: char(30),
    DeptId: integer,
    …
    primary key(EmpId),
    foreign key(DeptId) references Departments )

create table Departments (
    DeptId: integer,
    Dname: char(10),
    ……
    primary key(DeptId) )

Page 22: Extracting Schema From Data

In Object-Oriented Databases

interface Publication
extent publication
{ attribute String title;
  attribute Date date;
  relationship set<Author> auth          ---> inclusion constraint
      inverse Author::pub;               ---> inverse relationship
}

interface Author
extent author
{ attribute String name;
  attribute String address;
  relationship set<Publication> pub      ---> inclusion constraint
      inverse Publication::auth;         ---> inverse relationship
}

Page 23: Extracting Schema From Data

Inclusion constraints:

•For any publication p, the set p.auth is a subset of the set author. Similarly, for any author a, the set a.pub is a subset of the set publication.

Inverse relationships:

•For any publication p, and for any author a in p.auth, p is a member of a.pub.

•For any author a, and for any publication p in a.pub, a is a member of p.auth.

Page 24: Extracting Schema From Data

[Figure: Illustration of path constraints on semistructured data. The root r has publication and author edges. Publications have auth edges to authors and authors have pub edges to publications. Attribute edges: title, date, name, address.]

Page 25: Extracting Schema From Data

In semistructured data

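•An inclusion constraint is expressed as follows:
$\forall p\,(\exists a\,(\mathit{author}(r,a) \land \mathit{pub}(a,p)) \rightarrow \mathit{publication}(r,p))$

The general form of an inclusion constraint is $\forall x\,(\alpha(r,x) \rightarrow \beta(r,x))$, where $\alpha$ and $\beta$ are paths.

•An inverse relationship is expressed as follows:
$\forall p\,(\mathit{publication}(r,p) \rightarrow \forall a\,(\mathit{auth}(p,a) \rightarrow \mathit{pub}(a,p)))$

The general form of this constraint is $\forall x\,(\alpha(r,x) \rightarrow \forall y\,(\beta(x,y) \rightarrow \gamma(y,x)))$, where $\alpha$, $\beta$ and $\gamma$ are paths.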
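The inclusion constraint and the inverse relationship for publications and authors can be checked directly on an edge-labeled graph. The Python below is a minimal sketch, assuming the graph is given as a set of (source, label, target) triples; the function names and the sample edges are hypothetical.

```python
def targets(edges, node, label):
    """All nodes reached from `node` by one edge carrying `label`."""
    return {t for s, l, t in edges if s == node and l == label}

def holds_inclusion(edges, r):
    # Every p reached by the path author.pub from r is also a publication of r.
    publications = targets(edges, r, "publication")
    return all(p in publications
               for a in targets(edges, r, "author")
               for p in targets(edges, a, "pub"))

def holds_inverse(edges, r):
    # For every publication p of r and every a in p.auth, p is in a.pub.
    return all(p in targets(edges, a, "pub")
               for p in targets(edges, r, "publication")
               for a in targets(edges, p, "auth"))

# Hypothetical data: one publication and one author that reference each other.
good = {
    ("r", "publication", "p1"), ("r", "author", "a1"),
    ("p1", "auth", "a1"), ("a1", "pub", "p1"),
}
# Dropping the pub edge breaks the inverse relationship.
no_inverse = good - {("a1", "pub", "p1")}
# An author pointing to a publication unknown to r breaks the inclusion.
no_inclusion = good | {("a1", "pub", "p2")}
```

Each check is a direct reading of the corresponding formula, with the universal quantifiers turned into loops over the edges.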

Page 26: Extracting Schema From Data

Constraints are also important in Query Optimization. Here is an example:

select row: P2
from r.publication P1, r.publication P2, P1.auth A
where "Database Systems" in P1.title and A in P2.auth

select row: P'
from r.publication P, P.auth A, A.pub P'
where "Database Systems" in P.title

When the inclusion and inverse constraints hold, both queries return the same publications. The query plan implicit in the first one requires two iterations over publication (with P1 and P2), whereas the second requires only one (with P).
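The difference between the two plans can be seen by evaluating both queries naively on a small graph. This Python sketch makes several assumptions: the data is a set of (source, label, target) triples, the function names are hypothetical, and the sample data is invented so that it satisfies the inclusion and inverse constraints.

```python
def targets(edges, node, label):
    """All nodes reached from `node` by one edge carrying `label`."""
    return {t for s, l, t in edges if s == node and l == label}

def first_query(edges, r, title):
    result = set()
    for p1 in targets(edges, r, "publication"):          # first scan
        if title not in targets(edges, p1, "title"):
            continue
        for a in targets(edges, p1, "auth"):
            for p2 in targets(edges, r, "publication"):  # second scan
                if a in targets(edges, p2, "auth"):
                    result.add(p2)
    return result

def second_query(edges, r, title):
    result = set()
    for p in targets(edges, r, "publication"):           # single scan
        if title not in targets(edges, p, "title"):
            continue
        for a in targets(edges, p, "auth"):
            result |= targets(edges, a, "pub")           # follow pub edges
    return result

# Hypothetical data that satisfies the inclusion and inverse constraints.
edges = {
    ("r", "publication", "p1"), ("r", "publication", "p2"),
    ("r", "publication", "p3"), ("r", "author", "a1"), ("r", "author", "a2"),
    ("p1", "title", "Database Systems"), ("p2", "title", "Query Languages"),
    ("p3", "title", "Networks"),
    ("p1", "auth", "a1"), ("p2", "auth", "a1"), ("p3", "auth", "a2"),
    ("a1", "pub", "p1"), ("a1", "pub", "p2"), ("a2", "pub", "p3"),
}
slow = first_query(edges, "r", "Database Systems")
fast = second_query(edges, "r", "Database Systems")
```

On this data both functions return the publications p1 and p2, but only the first one loops over r.publication twice. On data that violates the constraints the two results may differ.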