WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Building and Using Knowledge Bases
Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy
& WeST Team
Slide 2
Semantic Web
Web Retrieval
Social Web
Multimedia Web
Software Web
Institut WeST – Web Science & Technologies
GESIS
Slide 3
PhD thesis trauma 17 years ago
„Nach dem Auspacken der LPS 105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“
"Having unwrapped the LPS 105, the onlooker is presented with a sturdy disk drive whose outer dimensions are as small as those of the Maxtor."
Slide 4
GENERAL MOTIVATION
The general motivation is not information extraction, but solving tasks!
Slide 5
General objective: Extracting to LOD
hasLivedIn
useAsExample
Crucial to know: Ontologies nowadays reflect this structure
Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)
Ontologies – LEGO style
Slide 6
Most famous applications
Steve Macbeth (Microsoft), in a discussion on Schema.org: “about 7% of pages we crawl have mark-up” (http://www.w3.org/2012/06/06-schema-minutes.html)
LOD Cloud
Google Knowledge Graph
Bing gets its own knowledge graph
http://searchengineland.com/bing-britannica-partnership-123930
Slide 7
Example ontology-based application 1:
ANALYSIS OF URBAN PARAMETERS
Slide 8
General objective: Analysing LOD
hasLivedIn
useAsExample
Slide 9
http://lisa.west.uni-koblenz.de/lisa-demo/
Family‘s analysis of Koblenz LOD + Open Street Map data
Slide 10
http://lisa.west.uni-koblenz.de/lisa-demo/
Entrepreneur‘s analysis of Koblenz LOD + Open Street Map data
1st Prize, German Linked Open Gov Data Competition 2012
Slide 11
Example ontology-based application 2:
FACETED MULTIMEDIA EXPLORATION
Slide 12
Making Web 2.0 More Accessible
Links Location
Persons
Knowledge Tags
low- to midlevel features
GeoNames
[Schenk et al; JoWS 2009]
Slide 13
Choosing between Koblenz – and Koblenz
Video at: http://vimeo.com/2057249
Slide 16
A tag view of „Koblenz“ & „Castle“
Slide 17
Semantic Identity – Festung Ehrenbreitstein
Slide 18
Persons – Celebrities, FOAFers & Flickr Users
Billion Triples Challenge, 1st Prize 2008
[Schenk et al; JoWS 2009]
Slide 19
Now on to information extraction:
OBSERVATIONS ON INFORMATION EXTRACTION
Slide 20
Challenges & Opportunities for IE
Not all web pages are created equal
Slide 21
Challenges & Opportunities for IE
Some challenges are the same, e.g. finding type instances
Slide 22
Challenges & Opportunities for IE
Some challenges are the same, e.g. finding relation instances
Slide 23
Challenges & Opportunities for IE
Some contain concepts and their descriptions, some don‘t
No types here, few relation types
Slide 24
Challenges & Opportunities for IE
Knowing that they are instances and of which type
Textual indication
Positional indication
Slide 25
Challenges & Opportunities for IE
To some extent, positional and layout indications work across languages and sites
Slide 26
Challenges & Opportunities for IE
owl:sameAs
We should not only think about Web pages, but about Web sites
Slide 28
Comparing related work to our objectives
Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text
Our objectives
• IE on Web sites
• Acquiring items
• Classifying items into
  – Instances
  – Concepts
  – Relation instances
  – Relationships
• IE also based on spatial position
There is overlap, and of course there are exceptions in related work
Slide 29
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation
[Oro et al; VLDB 2010]
Slide 30
Presentation-oriented documents
Slide 31
Presentation-oriented documents
• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
Slide 32
Related Work
Web query languages
• XPath 1.0 and XQuery 1.0
  – Established
  – Too difficult to use for scraping from intricate DOM structures
Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction
• exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath
Slide 33
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation
Slide 34
Representing Spatial Relations between DOM Nodes
Slide 35
Idea: Use Spatial Relations among DOM Nodes
Slide 37
SXPath System Architecture
Slide 38
Querying for Relations Among Nodes
Rectangular Cardinal Relations (RCR)
Topological Relations
r1 E:NE r2
Spatial models allow for expressing disjunctive relations among regions
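As a rough illustration of how a rectangular cardinal relation such as r1 E:NE r2 could be computed, the sketch below splits the plane into nine tiles around the reference rectangle r2 and reports which tiles r1 overlaps. The `Rect` type, the function name `rcr`, and the choice of a y-axis that grows upward are assumptions made for this example; none of them is taken from the SXPath implementation.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle; y is assumed to grow upward (north)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Tile names indexed by (horizontal band, vertical band) relative to the
# reference rectangle: -1 = before it, 0 = within its extent, 1 = after it.
TILES = {
    (-1, 1): "NW", (0, 1): "N", (1, 1): "NE",
    (-1, 0): "W", (0, 0): "B", (1, 0): "E",
    (-1, -1): "SW", (0, -1): "S", (1, -1): "SE",
}


def _bands(lo, hi, ref_lo, ref_hi):
    """Bands of the reference interval that (lo, hi) overlaps with positive length."""
    bands = []
    if lo < ref_lo:
        bands.append(-1)
    if lo < ref_hi and hi > ref_lo:
        bands.append(0)
    if hi > ref_hi:
        bands.append(1)
    return bands


def rcr(primary: Rect, reference: Rect) -> set:
    """Return the set of tiles around `reference` that `primary` overlaps."""
    xs = _bands(primary.x_min, primary.x_max, reference.x_min, reference.x_max)
    ys = _bands(primary.y_min, primary.y_max, reference.y_min, reference.y_max)
    # For a rectangle, the overlapped tiles are the product of the two band lists.
    return {TILES[(x, y)] for x in xs for y in ys}


r2 = Rect(0, 0, 10, 10)
r1 = Rect(12, 5, 15, 14)   # right of r2, straddling the line through r2's top edge
print(rcr(r1, r2))          # the tiles E and NE, i.e. r1 E:NE r2
```

Rendered Web pages usually have y growing downward, so a real layout engine's coordinates would need the vertical bands flipped before applying this sketch.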
Slide 42
From XPath 1.0 towards Spatial Querying with SXPath
SXPath features
• adopts intuitive path notation: axis::nodetest [pred]*
• adds to XPath
  – spatial axes
  – spatial position functions
• natural semantics for spatial querying
Slide 43
SXPath System Architecture
Slide 44
Complexity Results
Formal model defined in the paper [Oro et al; VLDB 2010]
Slide 45
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation
Slide 50
Outline
The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  – Page-level wrapper induction
  – Site-wide wrapper generation
  – Error Correction by Mutual Reinforcement
• Conclusions and Future Directions
The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  – Spatial Data Model
  – Syntax & Semantics
  – Complexity
• Implementation
• Evaluation
Slide 51
>1000 Life Science DBs, number growing quickly
Slide 52
Biochemical Web Sites: Observations - 1
Labeled Data

| Total | Labeled | Unlabeled | Unlabeled (Redundant) |
|---|---|---|---|
| 754 | 719 | 19 | 16 |

Table 1: Data fields across 20 Biochemical Web sites
Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
Slide 53
Biochemical Web Sites: Observations - 2
Dynamic Web Pages
Slide 54
Biochemical Web Sites: Observations - 3
Rich Site Structure
Slide 55
Biochemical Web Sites: Observations - 4
Semantics is often only in the report, not in the underlying relational database
Web Services Survey: 11 of 100 databases¹ provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description
¹ Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey was available at http://sabiork.villa-bosch.de/index.html/survey.html
Slide 56
Biochemical Web Sites: Extraction Tasks
Induce Wrapper
Induce Wrapper
Induce Wrapper
[Mir et al; DILS 2009] [Mir et al; ESWC 2010]
Slide 57
Contributions
Unsupervised Page-Level Wrapper Induction
Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)
(Acquiring the Schema/Ontology)
Automatic Error Detection and Correction by Mutual Reinforcement
Slide 58
Page-Level Wrapper Induction – 1
D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
//*[text()]
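The sets above suggest a simple first pass: text entries that recur on every sample page are treated as labels (O), and the remaining entries as data (D). The following is a minimal sketch of that idea under this assumption; the function name `split_labels_and_data` and the two toy pages (built from the values shown on the slide, with the elided entries left out) are illustrative only and not the authors' implementation.

```python
def split_labels_and_data(pages):
    """Split text entries into shared labels and page-specific data.

    pages: list of lists of text-node strings, one list per sample page
           (for example, the string values of the nodes matched by //*[text()]).
    Returns (labels, data), where labels is the set of strings present on
    every page and data[i] is the set of strings of page i that are not shared.
    """
    text_sets = [set(page) for page in pages]
    labels = set.intersection(*text_sets)
    data = [texts - labels for texts in text_sets]
    return labels, data


# Toy pages assembled from the values on the slide (elided entries omitted).
page1 = ["Entry", "C00221", "Name", "beta-D-Glucose",
         "Reaction", "R00026", "R01520", "Enzyme", "1.1.1.47", "3.2.1.21"]
page2 = ["Entry", "C00185", "Name", "Cellobiose",
         "Reaction", "R00026", "R00306", "Enzyme", "1.1.99.18", "3.2.1.21"]

labels, data = split_labels_and_data([page1, page2])
print(sorted(labels))
print(sorted(data[0]))
print(sorted(data[1]))
```

Values that happen to occur on both pages, such as 3.2.1.21, land among the labels in this first pass, which is consistent with O1 and O2 above and is why a later reclassification step is needed.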
Slide 59
Page-Level Wrapper Induction - 2
Reclassify – Growing Data Regions
Slide 60
Page-Level Wrapper Induction - 3
D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21, …}
O1´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}
D2´ = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21, …}
O2´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}
Slide 61
Page-Level Wrapper Induction - 4
Selecting Labels for Data
html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)
html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)
html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)
Slide 62
Page-Level Wrapper Induction - 5
Anchor the Path
Enzyme - html/table[1]/tr[8]/th[1]/code[1]/
html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]
//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()
Pivot, Generalize, Relative
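One way to read the "Pivot" and "Relative" steps is as plain path arithmetic: find the longest common prefix of the label path and the data path, climb from the label node up to that common ancestor, then descend along the rest of the data path. The sketch below does this on path strings only; `relative_path` and `anchored_xpath` are hypothetical helper names, and the "Generalize" step (merging a[1], a[2], … into one positional predicate) is deliberately not covered.

```python
def split_steps(path):
    """Split a slash-separated location path into its non-empty steps."""
    return [step for step in path.strip("/").split("/") if step]


def relative_path(label_path, data_path):
    """Path from the label node to the data node via their common ancestor."""
    label_steps = split_steps(label_path)
    data_steps = split_steps(data_path)
    common = 0
    for label_step, data_step in zip(label_steps, data_steps):
        if label_step != data_step:
            break
        common += 1
    ups = [".."] * (len(label_steps) - common)   # climb to the common ancestor
    downs = data_steps[common:]                   # descend to the data node
    return "/".join(ups + downs) or "."


def anchored_xpath(label_text, label_path, data_path):
    """Pivot on the label's text instead of its absolute position."""
    return f"//*[text()='{label_text}']/{relative_path(label_path, data_path)}"


label = "html/table[1]/tr[8]/th[1]/code[1]/"
datum = "html/table[1]/tr[8]/td[1]/code[1]/a[1]"
print(anchored_xpath("Enzyme", label, datum))
# //*[text()='Enzyme']/../../td[1]/code[1]/a[1]
```

Anchoring on the label text rather than on tr[8] is what lets the same expression survive pages where rows are added or omitted above the label.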
Slide 63
Selected Sources
KEGG, ChEBI, MSDChem
• Basic qualitative data
• Popular
• Overlapping/complementary data
Slide 64
Wrapper Induction - Evaluation
| Source | #L | #D | #S | TP | FN | FP | P | R |
|---|---|---|---|---|---|---|---|---|
| KEGG Compound (http://www.genome.jp/kegg/compound/) | 10 | 762 | 3 | 411 | 351 | 46 | 89.9 | 53.9 |
| | | | 15 | 759 | 3 | 0 | 100 | 99.6 |
| KEGG Reaction (http://www.genome.jp/kegg/reaction/) | 10 | 205 | 3 | 173 | 32 | 0 | 100 | 84.4 |
| | | | 15 | 205 | 0 | 0 | 100 | 100 |
| ChEBI (http://www.ebi.ac.uk/chebi) | 22 | 831 | 3 | 595 | 236 | 41 | 93.5 | 71.6 |
| | | | 15 | 829 | 2 | 0 | 100 | 99.7 |
| MSDChem (http://www.ebi.ac.uk/msd-srv/msdchem/) | 30 | 600 | 3 | 600 | 0 | 20 | 96.7 | 100 |
| | | | 15 | 600 | 0 | 20 | 96.7 | 100 |
| Average (based on final wrappers for each source) | | | | | | | 99.1 | 99.8 |

~9 samples – ~99% P, ~98% R
Table 2: Page-level wrapper induction results, 20 test pages (L=Labels, D=Data entries, S=Training pages)
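Assuming P and R denote precision and recall in the usual sense (the slide does not spell out the definitions), the percentages can be recomputed from the TP, FN and FP counts. The short check below uses the first KEGG Compound row; the function name is an illustrative choice.

```python
def precision_recall(tp, fn, fp):
    """Standard precision and recall from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46
p, r = precision_recall(tp=411, fn=351, fp=46)
print(f"P = {100 * p:.1f}%, R = {100 * r:.1f}%")  # P = 89.9%, R = 53.9%
```

Note that TP + FN equals the number of data entries #D in each row (411 + 351 = 762 here), which is what recall is measured against.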
Slide 65
Site-Wide Wrapper Induction: Observations
Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
• An efficient approach should ignore these pages
• We don't need to learn the entire site structure
Slide 66
Site-Wide Wrapper Induction: Observations - 2
Classified Link-Collections point to data-intensive pages of the same class.
Slide 67
Site-Wide Wrapper Induction: Observations - 3
Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same
Slide 68
Site-Wide Wrapper Induction
1. Start with C0
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4
S = {C0}
If C0 != Ci (i>0): S = S + Ci
Navigation steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
[Figure: start class C0 with link collections L1, L2, L3 pointing to page sets C1, C2, C3]
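The numbered steps above amount to a breadth-first exploration of page classes. The sketch below shows one possible shape of that loop; it is not the authors' code. Steps 3 and 4 (generating wrappers and deciding which class the target pages form) are collapsed into a caller-supplied `classify` function, and `link_collections`, `classify`, and the toy dictionaries are all hypothetical stand-ins.

```python
from collections import deque


def discover_site_structure(c0, link_collections, classify):
    """Explore page classes reachable from the start class c0.

    link_collections(cls) -> iterable of classified link collections on pages of cls
    classify(cls, link)   -> class of the pages that the link collection points to
    Returns (classes, steps): the discovered classes S (in discovery order) and
    the navigation steps W as (source class, link collection, target class).
    """
    classes = [c0]                 # S = {C0}
    steps = []                     # W
    queue = deque([c0])
    while queue:
        source = queue.popleft()
        for link in link_collections(source):        # step 2
            target = classify(source, link)          # steps 3 and 4
            steps.append((source, link, target))     # step 5
            if target not in classes:                # new class formed
                classes.append(target)
                queue.append(target)                 # step 6
    return classes, steps


# Toy site mirroring the slide: L1 leads back to C0, L2 and L3 to new classes.
links = {"C0": ["L1", "L2", "L3"], "C2": [], "C3": []}
targets = {("C0", "L1"): "C0", ("C0", "L2"): "C2", ("C0", "L3"): "C3"}

classes, steps = discover_site_structure(
    "C0", lambda c: links[c], lambda c, l: targets[(c, l)]
)
print(classes)
print(steps)
```

Because only classified link collections are followed, pages without data never enter the queue, which matches the observation that the entire site structure does not have to be learned.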
Slide 69
Site-Wide Wrapper Induction – Evaluation
| Source | #C | #C´ | #D | TP | FN | FP | P | R |
|---|---|---|---|---|---|---|---|---|
| MSDChem | 1 | 1 | N/A | N/A | N/A | N/A | N/A | N/A |
| ChEBI | 3 | 1 | 1711 | 1195 | 516 | 0 | 100 | 69.8 |
| KEGG | 10 | 7 | 6223 | 5044 | 1179 | 188 | 97 | 81.1 |
| Average | | | | | | | 98.5 | 75.5 |

Table 3: Site-wide wrapper induction results, 20 test pages for each class (C=Classes, C´=Classes discovered, D=Data entries)
Slide 70
Error Detection and Correction: Mutual Reinforcement
Observation: Certain data reappear on more than one class of pages
Slide 71
Error Detection and Correction: Mutual Reinforcement
Reinforcement if reappearing data is correctly classified as Data
Otherwise it points to misclassification
• Label-Data Mismatch
  – Correction: Introduce more samples
• Label-Label Mismatch
  – Cannot be detected
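A label-data mismatch can be detected by comparing how the same value was classified on different classes of pages. The sketch below assumes each page class yields a mapping from text value to "label" or "data"; that representation, the function name, and the toy page classes are assumptions for illustration only.

```python
def label_data_mismatches(classified):
    """Find values classified as a label on one page class and as data on another.

    classified: {page_class: {value: "label" or "data"}}
    """
    kinds_by_value = {}
    for values in classified.values():
        for value, kind in values.items():
            kinds_by_value.setdefault(value, set()).add(kind)
    return {value for value, kinds in kinds_by_value.items()
            if kinds == {"label", "data"}}


# Hypothetical wrapper output for two page classes.
classified = {
    "compound": {"Enzyme": "label", "3.2.1.21": "label", "C00221": "data"},
    "enzyme": {"Entry": "label", "3.2.1.21": "data"},
}
print(label_data_mismatches(classified))  # {'3.2.1.21'}
```

A value that is wrongly classified as a label on every page class produces no conflict in this check, which is the label-label case that cannot be detected.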
Slide 72
Where to go next?
Reverse engineering production
1. LOD (emitting RDF & RDFS)
2. Navigation model (what belongs to what)
3. Interaction model (- not treated at all by us so far -)
4. Layout model (spatial positioning)
Capture this generative model using machine learning
• Relational learning
  – Markov logic programmes?
  – …?
Slide 73
Bibliography
Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
Saqib Mir, Steffen Staab, Isabel Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
Thank you for your attention!