Web Data Extraction Como2010
Date posted: 20-Oct-2014
Web data extraction
Web: HTML pages, layout
Corporate EDP apps: structured data, databases, XML
WRAPPER
Goal: Make web contents accessible to electronic data processing
Wrappers: HTML -> select, extract, annotate -> XML
Lixto Visual Developer (VD)
[Screenshot labels: Navigation Steps; Mozilla Web Browser; Extraction Configuration]
Need for Automatic Extraction Technology
Example: Real Estate UK
• 17,000 sites
• Many not covered by aggregators
• We do have a list of all homepages (Yellow Pages UK)
• Manual or semi-automatic wrapping too expensive: wrapper construction, testing, keeping track of changes
• No tool or method can do it fully automatically.
Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
Need for Automatic Extraction Technology (2)
All search engine providers need it! Many work on it.
Keywords: Vertical search, object search, semantic search.
Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”
Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”
The Blackbox we want to construct
URL (application domain with thousands of websites) -> BLACKBOX -> application-relevant structured data (XML or RDF)
To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
Relationship to SeCo & Webdam
Q: Find apartments in Milan whose prices are average in quarters where restaurant quality > average.
[Diagram labels: Real estate; Restaurants; Web service A; Web service; Results]
How to achieve it?
Rationale: Combine existing and new “low level” annotators with “high level” AI and reasoning.
Low level annotators:
• Bottom-up page analysis
• ML-based entity recognizers
• NLP & ontological text annotation
• Web page classification & analysis
• Basic link analysis
High level reasoning:
• Goal oriented
• Conceptual domain objects
• Conceptual interaction elements
• High-level object ontology
• Domain knowledge
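[Diagram: DOM tree of a property search form; each tag is followed by its node number]
<table> 113
<tr> 134, <tr> 115
“I’m interested in”
<td> 119
<table> 124
radio buttons
<tr> 125, <tr> 126
<td> 129, <td> 130
“Buying”, “Renting”
<td> 135
“Maximum price”
<select> 136
<option> 137, <option> 138
<td> 139, <td> 140
“GBP”, “EUR”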
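Bottom-up annotation of a DOM fragment like the one above could be sketched as follows. This is only an illustration, not the project's annotator: the sample HTML, the keyword table `KEYWORDS`, the label names, and the sequential node numbering are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical keyword-to-label table; labels loosely echo the slides.
KEYWORDS = {
    "maximum price": "price search facility",
    "gbp": "currency",
    "eur": "currency",
}
VOID = {"input", "br", "img", "hr", "meta", "link"}  # tags without an end tag


class Annotator(HTMLParser):
    """Numbers elements in document order and labels recognised text."""

    def __init__(self):
        super().__init__()
        self.stack = []        # ids of currently open elements
        self.nodes = {}        # node id -> (tag, parent id)
        self.annotations = []  # (node id, label, text)

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        parent = self.stack[-1] if self.stack else None
        self.nodes[node_id] = (tag, parent)
        if tag not in VOID:
            self.stack.append(node_id)

    def handle_endtag(self, tag):
        open_tags = [self.nodes[i][0] for i in self.stack]
        if tag in VOID or tag not in open_tags:
            return  # ignore stray or void end tags
        while self.stack:
            if self.nodes[self.stack.pop()][0] == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        label = KEYWORDS.get(text.lower())
        if label and self.stack:
            self.annotations.append((self.stack[-1], label, text))


def annotate(html):
    parser = Annotator()
    parser.feed(html)
    parser.close()
    return parser


SAMPLE = """
<table>
  <tr><td>I'm interested in</td>
      <td><table><tr><td>Buying</td><td>Renting</td></tr></table></td></tr>
  <tr><td>Maximum price</td>
      <td><select><option>GBP</option><option>EUR</option></select></td></tr>
</table>
"""

if __name__ == "__main__":
    for node_id, label, text in annotate(SAMPLE).annotations:
        print(node_id, label, repr(text))
```

Each annotation records the innermost element that encloses the matched text, which is the kind of low-level fact the high-level rules can later reason over.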
Bottom-up (low-level) annotation
[Diagram: annotation labels connected by ISA and "Occurs in" edges; node numbers 105 and 127]
Monochromatic Rectangle
Geographic search facility
Postcode input field
Active map ...
Price search facility ...
Geo-Price-Searchbox
Bounding box: [(02873,227)(03900,417)]
Top-down reasoning
[Diagram: domain concepts connected by part-of edges (m1)]
Property Search Facility
Property List
Single Property Description
Specially highlighted property
Bottom-up processing and top-down reasoning
[Diagram: the two layers above shown together, with Phenomenology as the mapping between them]
Datalog for Web-Object Reasoning
table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) -> goodtable(T).
goodtable(T) & child(Parent,T) -> containsgoodtable(Parent).
goodtable(T) & not containsgoodtable(T) -> propertysearchmask(T).
If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
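The three rules can be evaluated in order on a toy set of facts, as in the sketch below. It reads the third rule with the negation that the prose explanation requires (a good table that does not itself contain a good table). The function name, the fact encoding as Python sets, and the table ids are assumptions for illustration; like the rules, it only looks at direct children.

```python
def property_search_masks(tables, occurs_in, child):
    """Evaluate the three rules bottom-up.

    tables:    set of table ids
    occurs_in: set of (table id, field kind) pairs
    child:     set of (parent id, child id) pairs
    """
    # Rule 1: a table holding both an area and a price selection is "good".
    good = {
        t for t in tables
        if (t, "areaselection") in occurs_in
        and (t, "priceselection") in occurs_in
    }
    # Rule 2: a parent of a good table "contains a good table".
    contains_good = {parent for (parent, c) in child if c in good}
    # Rule 3: a good table that contains no good table is the search mask.
    return {t for t in good if t not in contains_good}


if __name__ == "__main__":
    tables = {"outer", "inner", "footer"}
    occurs_in = {
        ("outer", "areaselection"), ("outer", "priceselection"),
        ("inner", "areaselection"), ("inner", "priceselection"),
    }
    child = {("outer", "inner")}
    print(property_search_masks(tables, occurs_in, child))
```

Because the negated predicate is fully computed before the third rule uses it, evaluating the rules in this fixed order is enough; no fixpoint iteration is needed for this small program.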
Crucial steps
• WP1 data model (KRR model)
• WP2 low & intermediate level annotation
• WP3 High level ontology and Rules (top down) + mapping HL to Int. Level: Phenomenology
• WP4 Access, interaction, & navigation
• WP5 Compilation; Learning XPath expressions
• WP6 Highly parallel execution on clouds
• WP7 General methodology
Knowledge base (WP1)
• General web knowledge: HTML, CSS, script handling
• Domain-specific knowledge: rules, constraints, tasks, ontology
• Site-specific knowledge
Analysis phase (WP2, WP3, WP4), working on the WWW
• Factual knowledge extraction
• Bottom-up property extraction
• Access, interaction, navigation
• Top-down pattern perception
Compilation phase (WP5)
• Extraction program build
• Optimization, parallelization
Runtime phase (WP6)
• Highly parallel extraction
• maximize speed
• improve consistency
• Use of elastic framework (cloud computing)
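The runtime phase's highly parallel extraction can be illustrated on a single machine. In the sketch below, `extract` is a hypothetical stand-in for a compiled extraction program, and a local thread pool stands in for the elastic cloud framework named above; neither is taken from the project itself.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(page):
    """Hypothetical extraction program: count list items on one page."""
    url, html = page
    return url, html.count("<li>")


def run_parallel(pages, workers=4):
    """Apply the same extraction program to many pages concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with pages.
        return dict(pool.map(extract, pages))


if __name__ == "__main__":
    pages = [
        ("site-a", "<ul><li>flat</li><li>house</li></ul>"),
        ("site-b", "<p>no listings</p>"),
    ]
    print(run_parallel(pages))
```

Pages are independent of each other, which is what makes this step easy to spread over many workers.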
The Data Model
Datalog is good but does not suffice. On top of it:
• Need for object creation
• Need for ontological reasoning
• Need for probabilistic reasoning
• Need for default reasoning
Object creation in Datalog+
table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) -> ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).
T1: PRODUCT            T2: PRICE
Toshiba Protégé cx     480
Dell 25416             360
Dell 23233             470
Acer 78987             390
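The object-creating rule can be mimicked by inventing a fresh identifier for every match of the rule body. The sketch below applies the rule once to a toy fact set; the function name, the `box1`, `box2`, ... naming scheme, and the set-based fact encoding are assumptions for illustration, and the fresh ids merely stand in for the existentially quantified X.

```python
from itertools import count


def create_tableboxes(tables, same_color, neighbour_right):
    """Apply the tablebox rule once, creating one new object per match."""
    fresh = count(1)
    tablebox, contains = set(), set()
    for t1, t2 in sorted(neighbour_right):
        if t1 in tables and t2 in tables and (t1, t2) in same_color:
            x = f"box{next(fresh)}"  # fresh id for the existential X
            tablebox.add(x)
            contains.add((x, t1))
            contains.add((x, t2))
    return tablebox, contains


if __name__ == "__main__":
    tables = {"T1", "T2", "T3"}
    same_color = {("T1", "T2")}
    neighbour_right = {("T1", "T2"), ("T2", "T3")}
    print(create_tableboxes(tables, same_color, neighbour_right))
```

Here only the product table T1 and the price table T2 are grouped, since T2 and T3 are neighbours but do not share a colour.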
Deduction in Datalog+ is undecidable (TGDs)
Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
Datalog±
Family of languages.
Incorporates ontological reasoning (>DL-LITE)
Further research is needed to extend it into an ideal language for web objects.
Transitivity:
containedin(T1,T2), containedin(T2,T3) -> containedin(T1,T3)
unguarded!
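Guardedness is easy to test mechanically: a rule body is guarded when one of its atoms contains every variable that occurs in the body. The sketch below assumes atoms are encoded as `(predicate, args)` tuples and that variables are the arguments starting with an uppercase letter; both conventions are choices made for this example.

```python
def variables(atom):
    """Return the variables of an atom given as (predicate, args)."""
    _, args = atom
    return {a for a in args if a[:1].isupper()}


def is_guarded(body):
    """True if some body atom contains all variables of the body."""
    all_vars = set().union(*map(variables, body))
    return any(variables(atom) >= all_vars for atom in body)


TRANSITIVITY = [
    ("containedin", ("T1", "T2")),
    ("containedin", ("T2", "T3")),
]
TABLEBOX = [
    ("table", ("T1",)),
    ("table", ("T2",)),
    ("sameColor", ("T1", "T2")),
    ("isNeighbourRight", ("T1", "T2")),
]

if __name__ == "__main__":
    print("transitivity guarded:", is_guarded(TRANSITIVITY))
    print("tablebox rule guarded:", is_guarded(TABLEBOX))
```

No single atom of the transitivity body mentions T1, T2 and T3 together, which is why that rule is unguarded, while the tablebox rule is guarded by `sameColor(T1,T2)`.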
DL-LITE
DL-LITE Datalog[ ,;Lin]
Professor ⊑ ∃TeachesTo        Professor(x) -> ∃y TeachesTo(x,y)
∃TeachesTo⁻ ⊑ Student         TeachesTo(x,y) -> Student(y)
HasTutor⁻ ⊑ TeachesTo         HasTutor(x,y) -> TeachesTo(y,x)
funct(HasTutor)               HasTutor(x,y) & HasTutor(x,y’) & Neq(y,y’) -> ⊥   (always innocuous!)
Professor ⊑ ¬Student          Professor(x) & Student(x) -> ⊥
DL-Lite_core, DL-Lite_R, DL-Lite_F
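The translated rules can be run on a small fact base. The sketch below is a simplified illustration, not a general Datalog± engine: it applies the three positive rules once in a fixed order (sufficient here because they are not recursive), invents a labelled null such as `_null_carol` only when a professor has no known student, and then checks the two constraints. All function names and sample individuals are assumptions.

```python
def chase(professor, has_tutor):
    """Derive TeachesTo and Student facts from Professor and HasTutor."""
    # HasTutor(x,y) -> TeachesTo(y,x)
    teaches_to = {(y, x) for (x, y) in has_tutor}
    # Professor(x) -> exists y TeachesTo(x,y): add a null only if needed.
    for p in sorted(professor):
        if not any(teacher == p for (teacher, _) in teaches_to):
            teaches_to.add((p, f"_null_{p}"))
    # TeachesTo(x,y) -> Student(y)
    student = {y for (_, y) in teaches_to}
    return teaches_to, student


def violations(professor, student, has_tutor):
    """Check funct(HasTutor) and the Professor/Student disjointness."""
    found = []
    tutor_of = {}
    for x, y in sorted(has_tutor):
        if tutor_of.setdefault(x, y) != y:
            found.append(("funct(HasTutor)", x))
    found += [("Professor-Student disjointness", x)
              for x in sorted(professor & student)]
    return found


if __name__ == "__main__":
    professor = {"alice", "carol"}
    has_tutor = {("bob", "alice")}
    teaches_to, student = chase(professor, has_tutor)
    print(sorted(teaches_to), sorted(student))
    print(violations(professor, student, has_tutor))
```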
Low & Intermediate Level Annotation
We will use various existing tools and techniques (rather than re-invent the wheel)
• Named entity recognizers
• Machine learning
• Computational linguistics
• Page layout analysis
• PDF extraction
Extraction from PDF
Tamir Hassan
Navigation & Interaction
OXPath
• Extension of XPath
• Facilitates querying web forms and retrieving returned data
• Simulates a user filling out web forms
• Highly parallelizable (geared towards cloud computing)
• Navigation and collecting data across multiple pages
Result Extraction
..../next-field::*/{“Renting”}/.../{...}/.../{“Submit”}
Atomic results regardless of presentation (list, table, etc.)
/<XQ>
Result Extraction <XQ>: For each atomic result A
Let
  price = A/.../.../text()
  description = A/.../.../../text()
  type = A/.../.../../text()
  ........
Return
<rental area="Oxford">
  <price> 1,200 </price>
  <bedrooms> 3 </bedrooms>
  <bathrooms> 1 </bathrooms>
  <type> Flat </type>
  <location> George Street, OX1 </location>
  <description> ... </description>
  <otherInfo> Furnished; Long let - more than six months </otherInfo>
  ...
</rental>
[Screenshot labels: price, description, Type, OtherInfo, Bathrooms, location]
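The "for each atomic result, bind fields, return a record" step can be sketched with Python's standard `xml.etree.ElementTree`. The paths on the slide are elided, so the sketch assumes instead that each field sits in an element with a matching `class` attribute; that markup, the `SAMPLE` fragment, and the function name `to_rental` are hypothetical.

```python
import xml.etree.ElementTree as ET

FIELDS = ("price", "bedrooms", "bathrooms", "type", "location", "description")


def to_rental(result, area):
    """Build a <rental> record from one atomic result element."""
    rental = ET.Element("rental", {"area": area})
    for field in FIELDS:
        # Assumed markup: the value lives in an element with class=field.
        node = result.find(f".//*[@class='{field}']")
        if node is not None and node.text:
            ET.SubElement(rental, field).text = node.text.strip()
    return rental


SAMPLE = """<li>
  <span class="price"> 1,200 </span>
  <span class="bedrooms"> 3 </span>
  <span class="type"> Flat </span>
  <span class="location"> George Street, OX1 </span>
</li>"""

if __name__ == "__main__":
    record = to_rental(ET.fromstring(SAMPLE), "Oxford")
    print(ET.tostring(record, encoding="unicode"))
```

Fields missing from a result are simply left out of the record, so the same program works whether the site presents results as a list, a table, or something else.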