Table Recognition

The DIADEM Ontology (DIADEM 1.0). Yiyang Bao², Xiaonan Guo², Giorgio Orsi¹,², Christian Schallhart², Cheng Wang². ¹ Institute for the Future of Computing, University of Oxford; ² Department of Computer Science, University of Oxford.


Transcript of Table Recognition

Page 1: Table Recognition

The DIADEM Ontology

DIADEM 1.0

Yiyang Bao², Xiaonan Guo², Giorgio Orsi¹,², Christian Schallhart², Cheng Wang²

¹ Institute for the Future of Computing, University of Oxford

² Department of Computer Science, University of Oxford

Page 2: Table Recognition

The languages of the web

HTML objects provide the data model of a web-page.

CSS boxes and properties provide the layout.

JavaScript provides web dynamics.

<html> <head> <title> </title> </head> <body> <div> … </div> </body> </html>

ox:Property

xsd:string

ox:address

RealWorld

Web

this.value.toLowerCase();

… ?

RDF annotations provide the conceptualization of the domain.

Page 3: Table Recognition

Why ontology?

Ontologies provide a conceptualization of a domain of interest (Gruber ’93).

ox:Property

xsd:string

ox:address

ox:minPrice

ox:partOf

ox:priceSegment

But… we do not want to model only the application domain.

We must model the domain of its web representations, i.e., its phenomenology.

In the end, it is also an ontology

Page 4: Table Recognition

Why ontology?

Can be used to complete an incomplete model.

Can be used to verify a model.

Must tolerate uncertainty and inconsistency.

Page 5: Table Recognition

A logical model for web extraction

Logical model for web entities

input and refinement forms.

result pages

page blocks (e.g., ads)

Phenomenological model

How logical entities are concretely represented

Page 6: Table Recognition

The building blocks

HTML entities

labels

fields (including links)

text-nodes and text attributes

<form>
  <label for="male">Male</label>
  <input type="radio" name="sex" id="male" />
  <label for="female">Female</label>
  <input type="radio" name="sex" id="female" />
</form>

<div>
  <span> Price: </span>
  <span> £ 250 </span>
</div>

Price: £ 250

Logical entities

constructs of our data model

Rules

describe the phenomenology
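
As an illustration of how HTML entities (labels and fields) might be collected before any rule is applied, here is a minimal Python sketch. It is not DIADEM's implementation: the class `LabelFieldParser` and the helper `label_field_pairs` are hypothetical names, and the pairing relies only on the `for`/`id` convention used in the form snippet on this slide.

```python
from html.parser import HTMLParser

# The radio-button form from the slide, used as sample input.
FORM_HTML = """
<form>
  <label for="male">Male</label>
  <input type="radio" name="sex" id="male" />
  <label for="female">Female</label>
  <input type="radio" name="sex" id="female" />
</form>
"""


class LabelFieldParser(HTMLParser):
    """Collects label texts (keyed by their `for` target) and field ids."""

    def __init__(self):
        super().__init__()
        self.labels = {}          # field id -> label text
        self.fields = []          # field ids in document order
        self._current_for = None  # `for` target of the label being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._current_for = attrs.get("for")
        elif tag == "input" and "id" in attrs:
            self.fields.append(attrs["id"])

    def handle_data(self, data):
        # Text inside a <label for="..."> becomes that field's label.
        if self._current_for is not None and data.strip():
            self.labels[self._current_for] = data.strip()

    def handle_endtag(self, tag):
        if tag == "label":
            self._current_for = None


def label_field_pairs(html):
    """Return (label text, field id) pairs; the label is None if missing."""
    parser = LabelFieldParser()
    parser.feed(html)
    return [(parser.labels.get(f), f) for f in parser.fields]


pairs = label_field_pairs(FORM_HTML)
```

Real pages often omit the `for` attribute, which is one reason the deck relies on annotations and visual heuristics rather than markup alone.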

Page 7: Table Recognition

The form model

Goal: model web form phenomenology

Page 8: Table Recognition

The form model

Areas:

button

location

price

room

type

buy/rent

order-by

display

Root entity:

RealEstateForm

Properties:

partOf hierarchical structures

Page 9: Table Recognition

The form model: elements

price

type = {min, max}

purpose = {buy, rent}

currency

room

category = {bathroom, bedroom, …}

type = {min, max}
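
The element descriptions above can be read as small typed records. The following Python sketch shows one possible encoding; all class names (`Bound`, `Purpose`, `RoomCategory`, `PriceElement`, `RoomElement`) are assumptions made for illustration and do not come from the DIADEM ontology itself. The room categories elided with "…" on the slide are deliberately not filled in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bound(Enum):
    """type = {min, max}"""
    MIN = "min"
    MAX = "max"


class Purpose(Enum):
    """purpose = {buy, rent}"""
    BUY = "buy"
    RENT = "rent"


class RoomCategory(Enum):
    """category = {bathroom, bedroom, …}; only the two named values are modelled."""
    BATHROOM = "bathroom"
    BEDROOM = "bedroom"


@dataclass(frozen=True)
class PriceElement:
    type: Bound
    purpose: Optional[Purpose] = None
    currency: Optional[str] = None  # e.g. "GBP"; the value format is an assumption


@dataclass(frozen=True)
class RoomElement:
    category: RoomCategory
    type: Bound


# Example instances: a minimum purchase price field and a maximum bedrooms field.
min_price = PriceElement(type=Bound.MIN, purpose=Purpose.BUY, currency="GBP")
max_beds = RoomElement(category=RoomCategory.BEDROOM, type=Bound.MAX)
```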

Page 10: Table Recognition

The form model: elements

display

per page

add-in-time

property type

button

submit

reset

map search

advance submit

link button

order-by

buy

rent

buy/rent

new/resale

SSTC

other

Page 11: Table Recognition

The form model: phenomenology

Based on linguistic annotations and (visual) heuristics.

buyElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","buy"),
    not hasAnnotationFeature(X,"minorType","rent"),
    not hasAnnotationFeature(X,"minorType","includeSSTC"),
    group(Ns,_,_,F),
    #member(X,Ns).

radiusElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","radius"),
    group(Ns,_,_,F),
    #member(X,Ns).
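
To make the reading of the buyElement rule concrete, here is a minimal Python sketch that evaluates it over a handful of made-up facts. The encoding is an assumption: annotations are stored as (node, feature, value) triples, a group keeps only its member set and its form (the two anonymous arguments of group/4 are ignored), and the names `Group` and `buy_elements` are hypothetical.

```python
from typing import NamedTuple


class Group(NamedTuple):
    members: frozenset  # node ids in the group (Ns in the rule)
    form: str           # form id (F in the rule)


def buy_elements(visible_fields, annotations, groups):
    """Return the set of (X, F) pairs derived by the buyElement rule.

    `annotations` is a set of (node, feature, value) triples standing in
    for hasAnnotationFeature/3; this encoding is an assumption.
    """
    def has(x, feature, value):
        return (x, feature, value) in annotations

    derived = set()
    for group in groups:
        for x in group.members:  # #member(X, Ns)
            if (x in visible_fields
                    and has(x, "majorType", "reform.label")
                    and has(x, "minorType", "buy")
                    and not has(x, "minorType", "rent")
                    and not has(x, "minorType", "includeSSTC")):
                derived.add((x, group.form))
    return derived


# Made-up facts: f1 is a pure "buy" field, f2 is annotated as both buy and
# rent (so the negated literal blocks it), f3 is a radius field.
visible = {"f1", "f2", "f3"}
facts = {
    ("f1", "majorType", "reform.label"), ("f1", "minorType", "buy"),
    ("f2", "majorType", "reform.label"), ("f2", "minorType", "buy"),
    ("f2", "minorType", "rent"),
    ("f3", "majorType", "reform.label"), ("f3", "minorType", "radius"),
}
groups = [Group(frozenset({"f1", "f2", "f3"}), "form1")]

result = buy_elements(visible, facts, groups)
```

With these facts only f1 qualifies, which shows how the two negated literals separate a plain buy field from a combined buy/rent field.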

Page 12: Table Recognition

The form model: segments

A segment is:

• a single element

• a group of elements

• a group of segments

• a pair <segment, label>

Segments: buttons, geographic, price, room, property type, buy/rent, order-by, display, per page, add in time, new/resale, SSTC

Form

real-estate
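
The four-case definition of a segment is recursive, so it maps naturally onto a small algebraic data type. The Python sketch below is one possible encoding under assumed names (`Element`, `ElementGroup`, `SegmentGroup`, `Labelled`, `elements_of`); none of them are taken from DIADEM.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Element:
    """A single element."""
    name: str


@dataclass(frozen=True)
class ElementGroup:
    """A group of elements."""
    elements: Tuple[Element, ...]


@dataclass(frozen=True)
class SegmentGroup:
    """A group of segments."""
    segments: Tuple["Segment", ...]


@dataclass(frozen=True)
class Labelled:
    """A pair <segment, label>."""
    segment: "Segment"
    label: str


Segment = Union[Element, ElementGroup, SegmentGroup, Labelled]


def elements_of(segment):
    """Flatten a segment into the names of the elements it contains."""
    if isinstance(segment, Element):
        return [segment.name]
    if isinstance(segment, ElementGroup):
        return [e.name for e in segment.elements]
    if isinstance(segment, SegmentGroup):
        return [n for s in segment.segments for n in elements_of(s)]
    if isinstance(segment, Labelled):
        return elements_of(segment.segment)
    raise TypeError(f"not a segment: {segment!r}")


# A labelled price segment next to a submit button (made-up example).
price = Labelled(ElementGroup((Element("min price"), Element("max price"))), "Price")
form = SegmentGroup((price, Element("submit")))
names = elements_of(form)
```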

Page 13: Table Recognition

The result-page model

Goal: model result-pages phenomenology

Page 14: Table Recognition

The result-page model

Attributes and values

e.g., < price, £ 250,000 >

Record

groups of pairs < attribute, value >

Data area

groups of records

Mandatory attribute(s)

must be present in a record

sanity check purposes
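
The role of mandatory attributes as a sanity check can be sketched in a few lines of Python. This is an illustration only: records are modelled as plain attribute-to-value dictionaries, a data area as a list of records, and the choice of `price` as the mandatory attribute, the helper names, and the sample values are all assumptions.

```python
from typing import Dict, List

Record = Dict[str, str]   # attribute -> value, e.g. {"price": "£ 250,000"}
DataArea = List[Record]   # a group of records

# Assumption for this sketch: price is the only mandatory attribute.
MANDATORY = {"price"}


def missing_mandatory(record, mandatory=MANDATORY):
    """Return the mandatory attributes that the record lacks."""
    return set(mandatory) - set(record)


def sane_records(data_area, mandatory=MANDATORY):
    """Keep only the records that carry every mandatory attribute."""
    return [r for r in data_area if not missing_mandatory(r, mandatory)]


# Made-up data area: the second candidate record has no price, so it is
# more likely page noise (e.g. an ad block) than a genuine result.
data_area = [
    {"price": "£ 250,000", "location": "Oxford"},
    {"location": "Oxford"},
]
kept = sane_records(data_area)
```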

Page 15: Table Recognition

A Conceptual Model for Data Extraction

Conceptual Modelling on the Web

Software modelling, e.g., UML and stereotypes

Ad hoc languages, e.g., WebML

Page 16: Table Recognition

Linking the domain ontology: OntoX

Page 17: Table Recognition

DIADEM Ontology: discussion

Expressive power

safe nr-datalog with stratified negation and aggregation

pros: easy to compute

cons: not robust to uncertainty and inconsistencies

Adaptability

Result-page model is largely domain-independent

Form model is domain-dependent (entity types)

• The number of entities is limited

Page 18: Table Recognition

Uncertainty, Vagueness and Inconsistencies

Page 19: Table Recognition

Origin

annotations are noisy

entity types are uncertain

Multiple models

probabilistic models

• Markov Logic Networks (Lukasiewicz and Simari)

• C-tables, Bayesian Networks (Olteanu)

ASP

• disjunctive models

• weak constraints


Page 20: Table Recognition

Thank you!