Web Data Extraction Como2010

DIADEM A Short Overview Georg Gottlob
  • date post

    20-Oct-2014

Transcript of Web Data Extraction Como2010

Page 1: Web Data Extraction Como2010

DIADEM: A Short Overview

Georg Gottlob

Page 3: Web Data Extraction Como2010

Web data extraction

WEB: HTML pages (layout)

WRAPPER

Corporate EDP apps: structured data, databases, XML

Goal: Make web contents accessible to electronic data processing

Wrappers: HTML → select, extract, annotate → XML
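The select / extract / annotate pipeline of a wrapper can be illustrated with a minimal sketch. It is not Lixto or DIADEM code: the sample page, the class names `listing`, `price` and `location`, and the output element names are all assumptions made for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical listing page; the class names are assumptions for illustration.
SAMPLE_HTML = """
<div class="listing"><span class="price">1,200</span><span class="location">George Street, OX1</span></div>
<div class="listing"><span class="price">950</span><span class="location">Cowley Road, OX4</span></div>
"""

FIELDS = {"price", "location"}  # select: the spans we care about


class ListingWrapper(HTMLParser):
    """Select <span> elements by class, extract their text, annotate by field name."""

    def __init__(self):
        super().__init__()
        self.records = []   # one dict per listing
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "listing":
            self.records.append({})
        elif tag == "span" and cls in FIELDS and self.records:
            self._field = cls

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def wrap(html: str) -> ET.Element:
    """Turn the HTML page into an XML tree with one <listing> per record."""
    parser = ListingWrapper()
    parser.feed(html)
    root = ET.Element("listings")
    for rec in parser.records:
        item = ET.SubElement(root, "listing")
        for name, value in rec.items():
            ET.SubElement(item, name).text = value
    return root


xml_root = wrap(SAMPLE_HTML)
print(ET.tostring(xml_root, encoding="unicode"))
```

A hand-written wrapper like this is tied to one page layout, which is the maintenance cost the later slides argue against.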

Page 4: Web Data Extraction Como2010
Page 5: Web Data Extraction Como2010

Lixto Visual Developer (VD)

Navigation Steps

Mozilla Web Browser

Extraction Configuration

Page 6: Web Data Extraction Como2010

Need for Automatic Extraction Technology

Example: Real Estate UK
• 17,000 sites
• Many not covered by aggregators
• We do have a list of all homepages (Yellow Pages UK)
• Manual or semi-automatic wrapping too expensive:
  - wrapper construction
  - testing
  - keeping track of changes
• No tool or method can do it fully automatically.

Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.

Page 7: Web Data Extraction Como2010

Need for Automatic Extraction Technology (2)

All search engine providers need it! Many work on it.

Keywords: Vertical search, object search, semantic search.

Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”

Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”

Page 8: Web Data Extraction Como2010

The Blackbox we want to construct

Application domain with thousands of websites

URL → BLACKBOX → Application-relevant structured data (XML or RDF)

To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.

Page 9: Web Data Extraction Como2010

Real estate

Restaurants

Relationship to SeCo & Webdam

Q: Find apartments in Milan whose prices are average in quarters where restaurant quality > average.

Results

Web service A

Web service

Page 10: Web Data Extraction Como2010

How to achieve it?

Rationale: Combine existing and new “low level” annotators with “high level” AI and reasoning.

Low level annotators:
- Bottom-up page analysis
- ML-based entity recognizers
- NLP & ontological text annotation
- Web page classification & analysis
- Basic link analysis
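As a rough illustration of what a low-level annotator produces, the sketch below tags price and postcode spans in page text with regular expressions. The patterns, the `Annotation` type and the sample sentence are assumptions for illustration; the slides do not say which recognizers are used, and real ones would be far more robust.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    label: str
    start: int
    end: int
    text: str


# Deliberately simple, assumed patterns (not taken from the slides).
PATTERNS = {
    "price": re.compile(r"(?:£|GBP\s?)\d{1,3}(?:,\d{3})*"),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?(?:\s\d[A-Z]{2})?\b"),
}


def annotate(text: str) -> list:
    """Return all pattern matches as labelled character spans, in text order."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Annotation(label, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda a: a.start)


sample = "Flat to rent, George Street, OX1 2AB, £1,200 per month"
annotations = annotate(sample)
for a in annotations:
    print(a.label, repr(a.text), a.start, a.end)
```

Annotations of this kind are the facts that the high-level rules on the following slides reason over.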

Page 11: Web Data Extraction Como2010

High level reasoning:
- Goal oriented
- Conceptual domain objects
- Conceptual interaction elements
- High-level object ontology
- Domain knowledge

Page 12: Web Data Extraction Como2010

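[Figure: DOM tree of a property search form with numbered nodes. Nodes in reading order: <table> 113; <tr> 115, <tr> 134; <td> 119 “I’m interested in”; <table> 124 (radio buttons); <tr> 125, <tr> 126; <td> 129 “Buying”, <td> 130 “Renting”; <td> 135 “Maximum price”; <select> 136; <option> 137, <option> 138; <td> 139 “GBP”, <td> 140 “EUR”]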

Page 13: Web Data Extraction Como2010

Bottom-up (low-level) annotation

[Figure: annotation graph over page nodes 105 and 127. Concept labels: Monochromatic Rectangle, Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox. Edge labels: ISA, Occurs in. Bounding box: [(02873,227)(03900,417)]]

Page 14: Web Data Extraction Como2010

Top-down reasoning

[Figure: high-level concepts Property Search Facility, Property List, Single Property Description, Specially highlighted property; edge label: part-of; node label: m1]

Page 15: Web Data Extraction Como2010

Bottom-up processing / Top-down reasoning

[Figure: the bottom-up annotation graph (Monochromatic Rectangle, Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox; edges ISA and Occurs in; nodes 105 and 127; bounding box [(02873,227)(03900,417)]) shown together with the top-down concepts (Property Search Facility, Property List, Single Property Description, Specially highlighted property; part-of; m1) and the label Phenomenology]

Page 16: Web Data Extraction Como2010

Datalog for Web-Object Reasoning

table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).

goodtable(T) & child(Parent,T) → containsgoodtable(Parent).

goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).

If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
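The three rules can be evaluated by hand on a tiny fact base. The sketch below is plain Python, not a Datalog engine. It reads the third rule with a negated containsgoodtable, following the English paraphrase on the slide; that reading is an assumption. The table names and facts are hypothetical, and `child` is assumed to relate a table to a table nested directly inside it.

```python
# Hypothetical facts: t_outer is a layout table wrapping the real form t_form.
tables = {"t_outer", "t_form", "t_footer"}
occurs_in = {
    ("t_outer", "areaselection"), ("t_outer", "priceselection"),
    ("t_form", "areaselection"), ("t_form", "priceselection"),
    ("t_footer", "priceselection"),
}
child = {("t_outer", "t_form")}  # (parent, nested table); assumed relation

# Rule 1: a table containing both an area and a price selection is "good".
goodtable = {
    t for t in tables
    if (t, "areaselection") in occurs_in and (t, "priceselection") in occurs_in
}

# Rule 2: the parent of a good table contains a good table.
containsgoodtable = {p for (p, t) in child if t in goodtable}

# Rule 3 (negation assumed): the innermost good table is the search mask.
# Rules 1 and 2 are fully evaluated first, so the negation is stratified.
propertysearchmask = {t for t in goodtable if t not in containsgoodtable}

print(sorted(goodtable))           # both the wrapper and the form qualify
print(sorted(propertysearchmask))  # only the innermost table remains
```

The outer layout table satisfies the first rule as well, which is why the innermost-table condition is needed.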

Page 17: Web Data Extraction Como2010

Crucial steps

• WP1 data model (KRR model)
• WP2 low & intermediate level annotation
• WP3 High level ontology and Rules (top down) + mapping HL to Int. Level: Phenomenology
• WP4 Access, interaction, & navigation
• WP5 Compilation; Learning XPath expressions
• WP6 Highly parallel execution on clouds
• WP7 General methodology

Page 18: Web Data Extraction Como2010

Knowledge base (WP1)
• General web knowledge: HTML, CSS, script handling
• Domain-specific knowledge: rules, constraints, tasks, ontology
• Site-specific knowledge

Analysis phase (WP2, WP3, WP4), operating on the WWW
• Factual knowledge extraction
• Bottom-up property extraction
• Access, interaction, navigation
• Top-down pattern perception

Compilation phase (WP5)
• Extraction program build
• Optimization, parallelization

Runtime phase (WP6)
• Highly parallel extraction
• maximize speed
• improve consistency
• Use of elastic framework (cloud computing)

Page 19: Web Data Extraction Como2010


Page 20: Web Data Extraction Como2010

The Data Model

Datalog is good but does not suffice. On top of it:

• Need for object creation
• Need for ontological reasoning
• Need for probabilistic reasoning
• Need for default reasoning

Page 21: Web Data Extraction Como2010

Object creation in Datalog+

table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

PRODUCT               PRICE
Toshiba Protégé cx    480
Dell 25416            360
Dell 23233            470
Acer 78987            390
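Object creation means that the rule head introduces a new object X that was not among the input facts. A minimal sketch of one way to simulate this is shown below: a fresh identifier is generated for each matching pair of tables, unless a box containing both already exists. The fact base, the `_box` naming scheme and the witness check are assumptions for illustration and do not describe the DIADEM implementation.

```python
from itertools import count

# Hypothetical input facts.
tables = {"t1", "t2", "t3"}
same_color = {("t1", "t2")}
neighbour_right = {("t1", "t2"), ("t2", "t3")}

# Derived facts, initially empty.
tablebox = set()
contains = set()
fresh = count(1)  # source of new object identifiers


def has_witness(a, b):
    """True if some existing tablebox already contains both a and b."""
    return any((x, a) in contains and (x, b) in contains for x in tablebox)


def apply_rule():
    """Fire the object-creating rule once for every unsatisfied match."""
    created = []
    for (a, b) in sorted(same_color & neighbour_right):
        if a in tables and b in tables and not has_witness(a, b):
            x = f"_box{next(fresh)}"  # the existentially quantified X
            tablebox.add(x)
            contains.update({(x, a), (x, b)})
            created.append(x)
    return created


first = apply_rule()   # creates one box for (t1, t2)
second = apply_rule()  # nothing new: the head is already satisfied
print(first, second, sorted(contains))
```

Checking for an existing witness before creating an object keeps this example finite; with unrestricted rules of this form such a process need not terminate.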

Page 22: Web Data Extraction Como2010

Object creation in Datalog+

table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

T1                    T2
PRODUCT               PRICE
Toshiba Protégé cx    480
Dell 25416            360
Dell 23233            470
Acer 78987            390

Page 23: Web Data Extraction Como2010

Object creation in Datalog+

table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

T1                    T2
PRODUCT               PRICE
Toshiba Protégé cx    480
Dell 25416            360
Dell 23233            470
Acer 78987            390

Page 24: Web Data Extraction Como2010

Object creation in Datalog+

table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

Deduction in Datalog+ undecidable (TGDs)

Page 25: Web Data Extraction Como2010

Object creation in Datalog+

table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

Deduction in Datalog+ undecidable (TGDs)

Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.

Page 26: Web Data Extraction Como2010

Datalog±

Family of languages.

Incorporates ontological reasoning (> DL-LITE)

Further research needed to extend it into an ideal language for web objects.

Transitivity:

containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)

Page 27: Web Data Extraction Como2010

Datalog±

Family of languages.

Incorporates ontological reasoning (> DL-LITE)

Further research needed to extend it into an ideal language for web objects.

Transitivity:

containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)

unguarded!
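A rule body is guarded when one of its atoms contains every variable that occurs in the body. The sketch below checks this syntactically for rule bodies written as strings. The string representation and the convention that variables start with an upper-case letter are assumptions made for this example.

```python
def variables(atom: str) -> set:
    """Variables of an atom such as 'p(X,Y)': arguments starting upper-case."""
    args = atom[atom.index("(") + 1 : atom.rindex(")")]
    return {a.strip() for a in args.split(",") if a.strip()[:1].isupper()}


def is_guarded(body: list) -> bool:
    """True if some body atom (the guard) contains all variables of the body."""
    all_vars = set().union(*(variables(a) for a in body))
    return any(variables(a) == all_vars for a in body)


# Transitivity: no single atom mentions T1, T2 and T3 together.
transitivity = ["containedin(T1,T2)", "containedin(T2,T3)"]

# Object-creation rule: sameColor(T1,T2) covers all body variables.
tablebox_rule = [
    "table(T1)", "table(T2)", "sameColor(T1,T2)", "isNeighbourRight(T1,T2)",
]

print(is_guarded(transitivity))   # unguarded
print(is_guarded(tablebox_rule))  # guarded
```

The object-creation rule from the earlier slides passes the check, while transitivity fails it, matching the remark on the slide.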

Page 28: Web Data Extraction Como2010

DL-LITE

DL-LITE axiom              Datalog[ ,;Lin] rule
Professor ⊑ ∃TeachesTo     Professor(x) → ∃y TeachesTo(x,y)
∃TeachesTo⁻ ⊑ Student      TeachesTo(x,y) → Student(y)
HasTutor⁻ ⊑ TeachesTo      HasTutor(x,y) → TeachesTo(y,x)
funct(HasTutor)            HasTutor(x,y) & HasTutor(x,y’) & Neq(y,y’) → ⊥   (always innocuous!)
Professor ⊑ ¬Student       Professor(x) & Student(x) → ⊥

Fragments shown: DL-Lite_core, DL-Lite_R, DL-Lite_F

Page 29: Web Data Extraction Como2010


Page 30: Web Data Extraction Como2010

Low & Intermediate Level Annotation

We will use various existing tools and techniques (rather than re-invent the wheel)

• Named entity recognizers
• Machine learning
• Computational linguistics
• Page layout analysis
• PDF extraction

Page 31: Web Data Extraction Como2010
Page 32: Web Data Extraction Como2010
Page 33: Web Data Extraction Como2010
Page 34: Web Data Extraction Como2010

Extraction from PDF

Tamir Hassan

Page 35: Web Data Extraction Como2010


Page 36: Web Data Extraction Como2010

Navigation & Interaction

Page 37: Web Data Extraction Como2010


Page 38: Web Data Extraction Como2010
Page 39: Web Data Extraction Como2010

OXPath

• Extension of XPath
• Facilitates querying web forms and retrieving returned data
• Simulates a user filling out web forms
• Highly parallelizable (geared towards cloud computing)
• Navigation and collecting data across multiple pages

Page 40: Web Data Extraction Como2010

Result Extraction

..../next-field::*/{“Renting”}/.../{...}/.../{“Submit”}

Atomic results regardless of presentation (list, table, etc.)

/<XQ>

Page 41: Web Data Extraction Como2010

Result Extraction

<XQ> : For each atomic result A
  Let
    price = A/.../.../text()
    description = A/.../.../../text()
    type = A/.../.../../text()
    ........
  Return
    <rental area="Oxford">
      <price> 1,200 </price>
      <bedrooms> 3 </bedrooms>
      <bathrooms> 1 </bathrooms>
      <type> Flat </type>
      <location> George Street, OX1 </location>
      <description> ... </description>
      <otherInfo> Furnished; Long let - more than six months </otherInfo>
      ...
    </rental>

Labels on the page screenshot: price, description, type, otherInfo, bathrooms, location
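The "for each atomic result, bind fields, return a record" pattern can be sketched in Python with the standard library's ElementTree. This is not OXPath or XQuery. The sample page, the relative paths in `FIELD_PATHS` (standing in for the elided A/.../text() expressions) and the `rentals` wrapper element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page; real pages and paths will differ.
# The two results use different container tags on purpose.
PAGE = """
<results>
  <li><b>1,200</b><p>Flat</p><span>George Street, OX1</span></li>
  <div><b>950</b><p>House</p><span>Cowley Road, OX4</span></div>
</results>
"""

# Assumed relative paths; the slide elides the real ones.
FIELD_PATHS = {"price": "b", "type": "p", "location": "span"}


def extract(page_xml: str, area: str) -> ET.Element:
    """Build one <rental> record per atomic result found in the page."""
    out = ET.Element("rentals")
    for a in ET.fromstring(page_xml):  # each child is one atomic result A
        rental = ET.SubElement(out, "rental", area=area)
        for field, path in FIELD_PATHS.items():
            node = a.find(path)
            if node is not None and node.text:
                ET.SubElement(rental, field).text = node.text.strip()
    return out


rentals = extract(PAGE, "Oxford")
print(ET.tostring(rentals, encoding="unicode"))
```

Because the fields are addressed relative to each atomic result, the same bindings work whether a result is presented as a list item or another container.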

Page 42: Web Data Extraction Como2010
