12/7/2007
BYU
Grand challenge: a new generation of the World Wide Web
The current Web
Enormous amount of content
Feasible for humans to read/write
But there is simply too much content to read
The future Web
Even more content but machine-processable
Feasible for humans and machines to read/write
Key issue: converting non-machine-processable content to machine-processable content, i.e., semantic annotation

Semantic annotation: the general picture
Data Extraction / Instance Recognition Engine
AptRental Ontology

Ontology
Definition: Explicit, formal specifications of conceptualizations
Unique identity of each concept
Unique identity of each relationship among concepts
Logic derivation rules underneath every declared relationship
Annotation:
533-0293 is-a AptRental:ContactPhone
$1250 is-a AptRental:MonthlyRate
533-0293 is-about AptRentalAd-instance-1
$1250 is-about AptRentalAd-instance-1
Ontology:
AptRentalAd has ContactPhone
AptRentalAd has MonthlyRate
Logic derivation:
To rent the apartment that costs $1250 monthly please call 533-0293. (machine understanding)
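The derivation above can be sketched in a few lines of Python. This is only an illustration, assuming the annotations are stored as simple (value, relation, target) triples; the function name `derive_facts` and the dictionary layout are hypothetical, not part of any system described here.

```python
from collections import defaultdict

# Ontology: the properties an AptRentalAd may have (from the slide).
ONTOLOGY = {"AptRentalAd": {"ContactPhone", "MonthlyRate"}}

# Annotations as (value, relation, target) triples, copied from the slide.
ANNOTATIONS = [
    ("533-0293", "is-a", "AptRental:ContactPhone"),
    ("$1250", "is-a", "AptRental:MonthlyRate"),
    ("533-0293", "is-about", "AptRentalAd-instance-1"),
    ("$1250", "is-about", "AptRentalAd-instance-1"),
]


def derive_facts(annotations, ontology):
    """Group typed values by the instance they are about.

    Only properties that the ontology declares for AptRentalAd are kept.
    """
    types, about = {}, {}
    for value, relation, target in annotations:
        if relation == "is-a":
            types[value] = target.split(":", 1)[1]  # drop the "AptRental:" prefix
        elif relation == "is-about":
            about[value] = target
    facts = defaultdict(dict)
    for value, instance in about.items():
        concept = types.get(value)
        if concept in ontology["AptRentalAd"]:
            facts[instance][concept] = value
    return dict(facts)


facts = derive_facts(ANNOTATIONS, ONTOLOGY)
ad = facts["AptRentalAd-instance-1"]
sentence = (
    f"To rent the apartment that costs {ad['MonthlyRate']} monthly "
    f"please call {ad['ContactPhone']}."
)
print(sentence)
```

The point of the sketch is that the sentence is assembled from typed, linked values rather than from the raw ad text.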

Automated semantic annotation: methods
Layout-driven method (e.g. [Mukherjee et al. 03])
Machine-learning-based method (e.g. [Handschuh et al. 02])
Rule-based method (e.g. [Dill et al. 03])
NLP-based method (e.g. [Popov et al. 03])
Ontology-based method (e.g. [Ding et al. 06])

Data extraction ontology
Standard ontology: concept BedroomNr
Epistemological extension (instance recognizer) for BedroomNr:
External representation
Context phrase
Exception phrase
Sample text: CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d, views, 1700 sq ft. $1250 mo. Call 533-0293
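A minimal sketch of what an instance recognizer with these three parts might look like, assuming each part is a regular expression. The class name, the field names, the concrete patterns, and the "furniture" exception are all hypothetical illustrations, not the recognizers used in the actual system.

```python
import re
from dataclasses import dataclass


@dataclass
class InstanceRecognizer:
    concept: str
    external_representation: str  # regex for the value itself
    context_phrase: str           # regex that must follow the value
    exception_phrase: str = ""    # regex that vetoes a match when it follows the context

    def find(self, text):
        """Return every value that has its context phrase and no exception phrase."""
        pattern = re.compile(
            rf"\b(?P<value>{self.external_representation})\s*(?:{self.context_phrase})",
            re.IGNORECASE,
        )
        hits = []
        for match in pattern.finditer(text):
            rest = text[match.end():]
            if self.exception_phrase and re.match(self.exception_phrase, rest, re.IGNORECASE):
                continue  # the exception phrase overrides the context phrase
            hits.append(match.group("value"))
        return hits


# Hypothetical patterns for BedroomNr.
bedroom_nr = InstanceRecognizer(
    concept="BedroomNr",
    external_representation=r"\d+",
    context_phrase=r"(?:bdrm|bedroom|br)s?\b",
    exception_phrase=r"\s+(?:furniture|sets?)\b",
)

AD = ("CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, "
      "1700 sq ft. $1250 mo. Call 533-0293")

print(bedroom_nr.find(AD))
print(bedroom_nr.find("Sale: 3 bedroom furniture sets"))
```

In the sample ad only the "2" in front of "bdrm" is accepted; the second call shows the exception phrase rejecting a number that the context phrase alone would have accepted.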

Ontology-based annotation
BedroomNr
External representation
Context Phrase
BathNr
External representation
Context Phrase
Feature
External representation
MonthRate
External representation
Context Phrase
ContactPhone
External representation
CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views,
1700 sq ft. $1250 mo. Call 533-0293
Context Keyword
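As a rough illustration of how several recognizers could be run over the sample ad, the sketch below assumes each concept is reduced to one regular expression whose `value` group is the external representation and whose surrounding text plays the role of the context phrase or keyword. The patterns themselves are assumptions made for this example, including the choice of "Call" as the keyword for ContactPhone.

```python
import re

AD = ("CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, "
      "1700 sq ft. $1250 mo. Call 533-0293")

# Hypothetical recognizers: one regex per concept.
RECOGNIZERS = {
    "BedroomNr": r"(?P<value>\d+)\s*bdrm",
    "BathNr": r"(?P<value>\d+)\s*bath",
    "Feature": r"(?P<value>w/d|views)",
    "MonthRate": r"(?P<value>\$\d+)\s*mo\b",
    "ContactPhone": r"call\s*(?P<value>\d{3}-\d{4})",
}


def annotate(text, recognizers):
    """Return (concept, value, offset) annotations in document order."""
    annotations = []
    for concept, pattern in recognizers.items():
        for match in re.finditer(pattern, text, re.IGNORECASE):
            annotations.append((concept, match.group("value"), match.start("value")))
    return sorted(annotations, key=lambda a: a[2])


for concept, value, offset in annotate(AD, RECOGNIZERS):
    print(f"{value!r} is-a AptRental:{concept} (offset {offset})")
```

Nothing in this sketch looks at layout, which is why the approach is unaffected by how the ad is formatted on the page.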

Ontology-based annotation: strengths and weaknesses
Strengths
Ignores layout differences
Ignores layout changes
Little maintenance once built
Weakness
Expensive to build instance recognizers

Layout-driven annotation: strengths and weaknesses
Strengths
Accurate
Simple and straightforward
Requires less domain knowledge
Weakness
Layout patterns are expensive to maintain

Problem
How can we overcome the weaknesses while retaining the strengths at the same time?

Observation
Extraction Domain ontology
A Document
Conceptual Annotator
(ontology-based annotation)
Annotated Document
Layout Patterns
Structural Annotator
(layout-driven annotation)
Domain ontology
A Document
Annotated Document
accurate
resilient

Synergistic model
Extraction Domain ontology
A Document
Conceptual Annotator
(ontology-based annotation)
Annotated Document
Pattern Generation
Layout Patterns
Structural Annotator
(layout-driven annotation)
Annotated Document
Instance Recognizer Enrichment

Pattern generation
Get the annotated outputs from the ontology-based annotator
Apply HTML-structure analysis and produce a typical layout pattern for each extracted field
If applicable, produce a sequential dependency between the generated layout patterns
If applicable, produce simple heuristic rules such as “if A then B” between the generated layout patterns
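The first two steps can be sketched as follows, assuming that a "layout pattern" is simply the most frequent tag path under which a field's annotated values appear. That definition, the class and function names, and the second sample page (DOWNTOWN, $800) are assumptions made for illustration; the sequential dependencies and heuristic rules are not covered. The sketch also assumes reasonably well-formed HTML.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record the tag path (e.g. 'div.ad/span.loc') of every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_paths = []  # list of (path, text)

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_paths.append(("/".join(self.stack), text))


def generate_patterns(annotated_pages):
    """Map each field to the most common tag path that holds its values.

    annotated_pages: list of (html, [(field, value), ...]) pairs, where the
    (field, value) pairs come from the ontology-based annotator.
    """
    counts = {}
    for html, annotations in annotated_pages:
        collector = PathCollector()
        collector.feed(html)
        collector.close()
        for field, value in annotations:
            for path, text in collector.text_paths:
                if value in text:
                    counts.setdefault(field, Counter())[path] += 1
    return {field: c.most_common(1)[0][0] for field, c in counts.items()}


# Made-up pages for illustration only.
PAGES = [
    ('<div class="ad"><span class="loc">CAPITOL HILL</span>'
     '<p>Luxury 2 bdrm 2 bath. $1250 mo.</p></div>',
     [("Location", "CAPITOL HILL"), ("MonthRate", "$1250")]),
    ('<div class="ad"><span class="loc">DOWNTOWN</span>'
     '<p>1 bdrm. $800 mo.</p></div>',
     [("Location", "DOWNTOWN"), ("MonthRate", "$800")]),
]

patterns = generate_patterns(PAGES)
print(patterns)
```

A field such as Location, which sits alone in its own element, yields a sharp pattern; a field buried in free text shares its path with everything else in that paragraph, which is one plausible reason semi-structured fields are harder.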

Instance recognizer enrichment
Get the annotated outputs from the layout-driven annotator
Apply each result to the corresponding current instance recognizer
If the value is recognized, continue
Otherwise:
for a dictionary-type recognizer, insert the value
for a regular-expression-type recognizer, try to generate a new regular expression and alert the user to check it
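The decision procedure above can be sketched as follows. The recognizer layout (plain dictionaries with `entries`, `patterns`, and `pending_review` keys) and the character-class generalization in `generalize` are assumptions made for this example; the slides do not say how the new regular expression is generated.

```python
import re
from itertools import groupby


def generalize(value):
    """Naive generalization: digits -> \\d, ASCII letters -> [A-Za-z], rest escaped.

    Runs of the same token are collapsed into a counted repetition.
    """
    def token(ch):
        if ch.isdigit():
            return r"\d"
        if ch.isascii() and ch.isalpha():
            return "[A-Za-z]"
        return re.escape(ch)

    parts = []
    for tok, run in groupby(token(ch) for ch in value):
        n = len(list(run))
        parts.append(tok if n == 1 else f"{tok}{{{n}}}")
    return "".join(parts)


def enrich(recognizer, value):
    """Apply one layout-driven annotation to a recognizer and report the outcome."""
    if recognizer["type"] == "dictionary":
        if value in recognizer["entries"]:
            return "recognized"
        recognizer["entries"].add(value)  # dictionary type: just insert
        return "inserted"
    # Regular-expression type.
    if any(re.fullmatch(p, value) for p in recognizer["patterns"]):
        return "recognized"
    # Do not activate the new pattern automatically; queue it for the user.
    recognizer["pending_review"].append(generalize(value))
    return "needs review"


location = {"type": "dictionary", "entries": {"CAPITOL HILL"}}
phone = {"type": "regex", "patterns": [r"\d{3}-\d{4}"], "pending_review": []}

print(enrich(location, "Provo"))
print(enrich(phone, "533-0293"))
print(enrich(phone, "(801) 422-7604"))
print(phone["pending_review"])
```

Keeping generated expressions in a review queue mirrors the "alert the user to check" step: an over-general pattern could otherwise degrade the recognizer silently.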

Preliminary results
Apartment Rental domain
Ontology-based annotation: 90% accuracy on average for both precision and recall on nearly all fields, except Location and Contact Name
Layout-driven annotation: nearly 100% accuracy for both precision and recall on Location and Contact Name; lower recall on fields such as BedroomNr
Pattern generation: works well on well-structured fields such as Location; less successful on semi-structured fields such as BedroomNr
Instance recognizer enrichment: good results even with poorly constructed initial instance recognizers
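For reference, precision and recall over a set of annotations can be computed as below. The predicted and gold sets are made-up illustrative values, not the experimental data behind the figures above, and the function name is hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted, recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Illustrative only: the annotator misses BathNr but makes no wrong annotation.
gold = {
    ("BedroomNr", "2"),
    ("BathNr", "2"),
    ("MonthRate", "$1250"),
    ("ContactPhone", "533-0293"),
}
predicted = {
    ("BedroomNr", "2"),
    ("MonthRate", "$1250"),
    ("ContactPhone", "533-0293"),
}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

A missed value lowers recall without affecting precision, which is the pattern reported for layout-driven annotation on fields such as BedroomNr.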

Summary
Automatically produce layout patterns using outputs of ontology-based annotation
Automatically enrich domain-specific instance recognizers using outputs of layout-driven annotation
A new synergistic annotation model that retains original strengths and minimizes original weaknesses
An annotation system that self-improves its performance during its execution

Future work
Dynamic tuning of annotation based on user perspectives
Ensemble of various annotators
Collaborative annotation

Thank you
Yihong Ding [email protected]
(801) 422-7604
2262 TMCB, Brigham Young University
Provo, UT 84601
Data Extraction Research Lab at Brigham Young University
http://www.deg.byu.edu
Homepage, my virtual home on Web 1.0
http://www.deg.byu.edu/ding/
Thinking Space, my virtual home on Web 2.0
http://yihongs-research.blogspot.com/