Cui Tao PhD Dissertation Defense
1
Cui Tao
PhD Dissertation Defense
Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages
2
Motivation
• Birth date of my great-grandpa
• Price and mileage of red Nissans, 1990 or newer
• Protein and amino-acid information for gene cdk-4
• US states with property crime rates above 1%
3
Search by Search Engine
4
Search the Hidden Web
• The Hidden Web:
  – Hidden behind forms
  – Hard to query
“cdk-4”
5
Query for Data
• The Hidden Web:
  – Hidden behind forms
  – Hard to query
Find the protein and the amino-acid information for gene “cdk-4”
6
A Web of Pages → A Web of Knowledge
• Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages
• Semantic annotation
  – Domain ontologies
  – Populated conceptual model
• Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?
7
Contributions of Dissertation Work
• Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge
• Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)
8
Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
9
Recognize Tables
Data Table
Layout Tables (discard)
Nested Data Tables
10
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
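The label/value association in the example can be sketched as follows. This is only an illustration, assuming a nested table with one header row of column labels; the function name `associate`, the column label "Gene model", and all cell values except WP:CE28918 are hypothetical placeholders, not taken from the dissertation.

```python
def associate(prefix, header, rows):
    """Map (label path, row path) pairs to cell values for one nested table.

    prefix is the dotted path of labels leading to the nested table,
    header is its row of column labels, rows are its value rows.
    """
    associations = {}
    for row_number, row in enumerate(rows, start=1):
        for label, value in zip(header, row):
            key = (f"{prefix}.{label}", f"{prefix}.{row_number}")
            associations[key] = value
    return associations


# Placeholder rows; only WP:CE28918 appears on the slide.
result = associate(
    "Identification.Gene model(s)",
    ["Gene model", "Protein"],
    [["model-1", "protein-1"], ["model-2", "WP:CE28918"]],
)
print(result[("Identification.Gene model(s).Protein",
              "Identification.Gene model(s).2")])
```

Each value is keyed by the path to its column label and the path to its row, which mirrors the pair shown in the example.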
11
Interpretation Technique: Sibling Page Comparison
12
Interpretation Technique: Sibling Page Comparison
Same
13
Interpretation Technique: Sibling Page Comparison
Almost Same
14
Interpretation Technique: Sibling Page Comparison
Different
Same
15
Technique Details
• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (table for layout → discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
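The matching step can be illustrated with a small sketch. It assumes tables are already unnested into lists of rows of cell text; the scoring function, the 0.3 threshold, and all names are assumptions made for illustration, not the measure TISP actually uses.

```python
def match_score(table_a, table_b):
    """Fraction of aligned cells whose text is identical in both tables."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    total = max(len(cells_a), len(cells_b))
    if total == 0:
        return 1.0
    same = sum(1 for a, b in zip(cells_a, cells_b) if a == b)
    return same / total


def classify(table_a, table_b, reasonable=0.3):
    """Label a table pair; the threshold is an arbitrary illustrative value."""
    score = match_score(table_a, table_b)
    if score == 1.0:
        return "layout"      # "perfect" match: nothing varies, discard
    if score >= reasonable:
        return "sibling"     # "reasonable" match: same structure, varying data
    return "unmatched"


def split_labels_values(table_a, table_b):
    """Cells that stay the same are label candidates; cells that vary are values."""
    labels, values = [], []
    for row_a, row_b in zip(table_a, table_b):
        for a, b in zip(row_a, row_b):
            (labels if a == b else values).append(a)
    return labels, values


# Invented sibling pages with placeholder values.
page_1 = [["Gene", "g1"], ["Protein", "p1"]]
page_2 = [["Gene", "g2"], ["Protein", "p2"]]
print(classify(page_1, page_2))
print(split_labels_values(page_1, page_2))
```

A value that happens to be identical on both pages would be mistaken for a label here, which is one reason a real system needs more than two pages or further adjustment.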
16
Table Unnesting
17
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …
Pattern combinations are also possible.
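The two patterns listed on this slide (rows that each hold a label followed by a value, and a header row of labels followed by one or more rows of values) can be applied roughly as below. This is a sketch assuming each table is already parsed into a list of rows of cell text; the function names and the sample data are invented.

```python
def interpret_label_value_rows(rows):
    """Pattern 1: every row is a label cell followed by a value cell."""
    if not rows or any(len(row) != 2 for row in rows):
        return None  # the table does not fit this pattern
    return {label: value for label, value in rows}


def interpret_header_then_values(rows):
    """Pattern 2: one row of n labels, then one or more rows of n values."""
    if len(rows) < 2:
        return None
    header, *body = rows
    if any(len(row) != len(header) for row in body):
        return None  # a value row does not line up with the labels
    return [dict(zip(header, row)) for row in body]


# Invented sample tables.
vertical = [["Make", "Nissan"], ["Year", "1995"]]
horizontal = [["Make", "Year"], ["Nissan", "1995"], ["Honda", "1998"]]
print(interpret_label_value_rows(vertical))
print(interpret_header_then_values(horizontal))
```

Returning `None` when a table does not fit leaves room for trying another pattern or a combination of patterns.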
18
Table Structure Patterns
<tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
19
Pattern Usage
20
Dynamic Pattern Adjustment
21
TISP++
• Automatic ontology generation
• Automatic information annotation
22
Ontology Generation – OSM
• Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation
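The mapping from interpreted tables to object and relationship sets might look like the sketch below. It assumes an interpreted table is represented as a nested dictionary (a label maps either to a value string or to another table); that representation, the function name, and the sample data are assumptions for illustration, and constraints are left out.

```python
def generate_object_sets(table):
    """Split labels into lexical and non-lexical sets and record nesting.

    A label whose content is a plain value is lexical; a label whose
    content is another table is non-lexical. Each nesting step yields a
    (parent label, child label) relationship.
    """
    lexical, non_lexical, relationships = set(), set(), []

    def visit(parent, node):
        for label, content in node.items():
            if isinstance(content, dict):
                non_lexical.add(label)
                visit(label, content)
            else:
                lexical.add(label)
            if parent is not None:
                relationships.append((parent, label))

    visit(None, table)
    return lexical, non_lexical, relationships


# Invented sample; the nesting only loosely follows the slide example.
sample = {
    "Identification": {
        "Name": "cdk-4",
        "Gene model(s)": {"Protein": "WP:CE28918"},
    }
}
lexical, non_lexical, relationships = generate_object_sets(sample)
print(sorted(lexical), sorted(non_lexical), sorted(relationships))
```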
23
Ontology Generation – OWL
• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
  – OWL data type property
  – Different annotation properties to keep track of the provenance
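As a rough illustration of this correspondence, the sketch below writes object sets, relationship sets, and lexical object sets as OWL declarations in Turtle syntax. The `ex:` namespace, the property naming scheme (`has_…`, `…_value`), and the function names are hypothetical; provenance annotation properties are omitted, and the output is not the ontology format produced by TISP++.

```python
import re


def local_name(label):
    """Turn a table label into a simple identifier (illustrative only)."""
    return re.sub(r"\W+", "_", label).strip("_")


def to_turtle(classes, object_props, datatype_props):
    """Emit Turtle for classes, (domain, range) pairs, and (domain, label) pairs."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix ex: <http://example.org/onto#> .",  # hypothetical namespace
        "",
    ]
    for name in classes:
        lines.append(f"ex:{local_name(name)} a owl:Class .")
    for domain, target in object_props:
        lines.append(
            f"ex:has_{local_name(target)} a owl:ObjectProperty ; "
            f"rdfs:domain ex:{local_name(domain)} ; "
            f"rdfs:range ex:{local_name(target)} ."
        )
    for domain, label in datatype_props:
        lines.append(
            f"ex:{local_name(label)}_value a owl:DatatypeProperty ; "
            f"rdfs:domain ex:{local_name(domain)} ."
        )
    return "\n".join(lines)


ttl = to_turtle(
    ["Identification", "Gene model(s)"],
    [("Identification", "Gene model(s)")],
    [("Gene model(s)", "Protein")],
)
print(ttl)
```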
Generated Ontology
Generated Ontology
26
RDF Graph
27
Query the Data
Find the protein and the amino-acid information for gene “cdk-4”
28
TISP Evaluation
• Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
  – Initial two sibling pages
    • Correct separation of data tables from layout tables?
    • Correct pattern recognition?
  – Remaining tables in site
    • Information properly extracted?
    • Able to detect and adjust for pattern variations?
29
Experimental Results
• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments all correct
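For reference, the counts above can be converted to percentages. The snippet only restates the reported numbers; the helper name `rate` is illustrative.

```python
def rate(correct, total):
    """Percentage of correct cases, rounded to one decimal place."""
    return round(100 * correct / total, 1)


results = {
    "table recognition": (157, 158),
    "pattern recognition": (69, 72),
    "path and label adjustments": (5 + 34, 5 + 34),
}
for name, (correct, total) in results.items():
    print(f"{name}: {correct}/{total} = {rate(correct, total)}%")
```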
30
TISP++ Performance
• Performance depends on TISP
• TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly
31
Form-based Ontology Creation and Information Harvesting (FOCIH)
• Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
• Transformable to ontological descriptions
• Capable of accepting source data
• Automated ontology creation
• Automated information harvesting
32
Form Creation
33
Created Sample Form
34
Generated Ontology View
35
Source-to-Form Mapping
36
Source-to-Form Mapping
37
Source-to-Form Mapping
38
Source-to-Form Mapping
39
Almost Ready to Harvest
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition
40
Reading Path
41
Pattern & Instance Recognition
42
Pattern & Instance Recognition
43
Pattern & Instance Recognition
• regular expression for decimal number
• left context
• right context
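A substring value surrounded by fixed text can be pulled out with a pattern built from a left context, a value expression, and a right context. The sketch below assumes literal contexts and a simple decimal-number expression; the function names and the sample string are invented.

```python
import re

# Simple decimal number such as 36.5; an illustrative choice of expression.
DECIMAL = r"\d+\.\d+"


def build_extractor(left_context, right_context, value_pattern=DECIMAL):
    """Compile left context + captured value + right context into one regex."""
    return re.compile(
        re.escape(left_context) + "(" + value_pattern + ")" + re.escape(right_context)
    )


def extract(extractor, text):
    """Return the captured value, or None when the contexts do not match."""
    match = extractor.search(text)
    return match.group(1) if match else None


extractor = build_extractor("weight: ", " kDa")
print(extract(extractor, "Mol. weight: 36.5 kDa"))
```

The same extractor can then be applied to the corresponding cell on every sibling page, as long as the surrounding text stays regular.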
44
Pattern & Instance Recognition
list pattern, delimiter is “,”
45
Pattern & Instance Recognition
list pattern, delimiter is regular expression for percentage numbers and a comma
46
Pattern & Instance Recognition
list pattern, delimiter is regular expression for percentage numbers and a comma
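Both kinds of list pattern, a plain comma delimiter and a delimiter made of a percentage number plus a comma, can be handled by splitting on a regular expression. The delimiter expressions, the function name, and the sample strings below are assumptions for illustration.

```python
import re

COMMA = r"\s*,\s*"
# A percentage such as 78.5% followed by an optional comma.
PERCENT_AND_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?\s*"


def split_list(text, delimiter):
    """Split text on the delimiter regex and drop empty items."""
    return [item for item in re.split(delimiter, text) if item]


print(split_list("red, blue,green", COMMA))
print(split_list("Christian 78.5%, Jewish 1.7%, Muslim 0.6%", PERCENT_AND_COMMA))
```

Treating the percentage as part of the delimiter keeps only the item names; capturing it instead would be the choice if the numbers were also wanted.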
47
Can Now Harvest
48
Can Now Harvest
49
Can Now Harvest
50
Semantic Annotation
51
Semantic Annotation
52
Semantic Annotation
53
Semantic Annotation
54
Semantic Annotation
55
Semantic Query
56
FOCIH Performance
• Ontology creation
• Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance
57
FOCIH Performance
• Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
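The per-category counts add up to the 71 tested mappings, and an overall figure follows from them. The overall percentage below is derived here from the reported counts and is not stated on the slide.

```python
mappings = {
    "full-string values": (25, 25),   # (correct, tested)
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

total_tested = sum(tested for _, tested in mappings.values())
total_correct = sum(correct for correct, _ in mappings.values())
overall = round(100 * total_correct / total_tested, 1)

for name, (correct, tested) in mappings.items():
    print(f"{name}: {correct}/{tested} = {round(100 * correct / tested, 1)}%")
print(f"overall: {total_correct}/{total_tested} = {overall}%")
```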
58
FOCIH Difficulties
59
FOCIH Difficulties
60
FOCIH Difficulties
No selection
61
WoK via TISP
62
WoK via TISP
63
WoK via FOCIH
64
WoK via FOCIH
65
Contributions
• TISP: automatic sibling table interpretation
• TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
• FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge
66
Future Work
• Sibling pages in addition to sibling tables
• Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies.