May 11, 2005 WWW 2005 -- Chiba, Japan 1
Thresher: Automating the Unwrapping of
Semantic Content from the World Wide Web
Andrew Hogue
Google
MIT CSAIL
Acknowledgments
• David Karger (karger@csail.mit.edu)
• Haystack Group (http://haystack.csail.mit.edu)
Agenda
• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics
Unwrapping the Web
• Majority of semantic content in “deep web”
• Transformed into human-readable HTML by scripts
• HTML is difficult for automated agents to understand
• Little incentive for content providers to provide RDF markup
• How to “unwrap” this content?
Thresher
• Simple UI for wrapper induction on structured web content
• “Demonstrate” examples of objects
• Induce wrapper, or pattern, based on DOM
• User may also label properties with RDF
Thresher
• Built on Haystack Semantic Web client
• Everything is RDF
• Everything has context menus
• Thresher brings RDF into the web browser
• Wrappers reify web objects for full interaction
Thresher
• Underlying wrapper algorithm based on tree edit distance
• Align user’s examples
• Keep aligned nodes (layout elements)
• Wildcard non-aligned nodes (content)
• Pattern matching is also alignment
Wrapper Induction
• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  – head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support Vector Machines
  – Other Machine Learning
Wrapper Induction
• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern
Tree Edit Distance
• Calculate cost of sequence of operations to transform one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Least-cost set of operations gives best mapping between elements
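The idea on this slide and the two Wrapper Induction slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Thresher implementation: it uses a simplified top-down variant of tree edit distance in which children are aligned in order and nodes are only mapped at the same depth, changing a label costs 1, and inserting or deleting a child costs the size of its subtree. The names `node`, `size`, `align`, `distance`, `generalize`, and `WILDCARD`, and the example rows, are all hypothetical. Unmapped or differing nodes are turned into wildcards.

```python
from functools import lru_cache

def node(label, *children):
    """A tree is a hashable (label, children) pair."""
    return (label, tuple(children))

WILDCARD = node("*")

def size(tree):
    """Number of nodes in the subtree."""
    return 1 + sum(size(child) for child in tree[1])

def align(xs, ys):
    """Edit-distance table over two ordered child lists."""
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        d[i][0] = d[i - 1][0] + size(x)            # delete subtree x
    for j, y in enumerate(ys, 1):
        d[0][j] = d[0][j - 1] + size(y)            # insert subtree y
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            d[i][j] = min(d[i - 1][j] + size(x),
                          d[i][j - 1] + size(y),
                          d[i - 1][j - 1] + distance(x, y))  # map x to y
    return d

@lru_cache(maxsize=None)
def distance(a, b):
    """Cost of turning tree a into tree b (top-down variant)."""
    change = 0 if a[0] == b[0] else 1
    return change + align(a[1], b[1])[-1][-1]

def generalize(a, b):
    """Keep nodes mapped by a least-cost alignment; wildcard the rest."""
    if a[0] != b[0]:
        return WILDCARD
    xs, ys = a[1], b[1]
    d = align(xs, ys)
    i, j, kids = len(xs), len(ys), []
    while i or j:                                  # backtrack through the table
        if i and j and d[i][j] == d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]):
            kids.append(generalize(xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + size(xs[i - 1]):
            kids.append(WILDCARD)                  # node only in a
            i -= 1
        else:
            kids.append(WILDCARD)                  # node only in b
            j -= 1
    kids.reverse()
    # Merge runs of adjacent wildcards into a single wildcard.
    merged = [k for n, k in enumerate(kids)
              if not (n and k == WILDCARD and kids[n - 1] == WILDCARD)]
    return (a[0], tuple(merged))

# Two made-up table rows with the same layout but different text.
ex1 = node("tr", node("td", node("text:Talk A")), node("td", node("text:3:30 PM")))
ex2 = node("tr", node("td", node("text:Talk B")), node("td", node("text:4:00 PM")))
print(distance(ex1, ex2))     # 2
print(generalize(ex1, ex2))
```

In this example the layout elements (`tr`, `td`) survive in the pattern and the two text nodes, which differ between the examples, become wildcards. When several alignments tie for least cost, the sketch simply takes the first one found during backtracking.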
Mapping Examples
Pattern Matching
• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match
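A rough Python sketch of this matching step follows. It rests on several assumptions that the slide does not state: children are matched in document order, every non-wildcard wrapper node must match exactly one document node, a wildcard child absorbs zero or more sibling subtrees, and only the first alignment found for each subtree is reported. The names `node`, `match`, `match_children`, and `find_matches`, and the page content, are hypothetical.

```python
def node(label, *children):
    """A tree is a (label, children) pair; the label "*" marks a wildcard."""
    return (label, tuple(children))

def match(pattern, tree):
    """Return the subtrees captured by each wildcard, or None if no alignment."""
    if pattern[0] != tree[0]:
        return None
    return match_children(pattern[1], tree[1])

def match_children(ps, ts):
    if not ps:
        return [] if not ts else None        # leftover document nodes: no match
    head, rest = ps[0], ps[1:]
    if head[0] == "*":
        # A wildcard may absorb zero or more consecutive document children.
        for k in range(len(ts) + 1):
            tail = match_children(rest, ts[k:])
            if tail is not None:
                return [ts[:k]] + tail
        return None
    if not ts:
        return None                          # wrapper node left unmapped
    first = match(head, ts[0])
    if first is None:
        return None
    tail = match_children(rest, ts[1:])
    return None if tail is None else first + tail

def find_matches(pattern, tree):
    """Yield (subtree, captures) for each document subtree the wrapper aligns with."""
    captures = match(pattern, tree)
    if captures is not None:
        yield tree, captures
    for child in tree[1]:
        yield from find_matches(pattern, child)

# A made-up page: a header row followed by two data rows.
wrapper = node("tr", node("td", node("*")), node("td", node("*")))
page = node("table",
            node("tr", node("th", node("text:Title"))),
            node("tr", node("td", node("text:Talk A")), node("td", node("text:3:30 PM"))),
            node("tr", node("td", node("text:Talk B")), node("td", node("text:4:00 PM"))))
for subtree, captures in find_matches(wrapper, page):
    print(captures)
```

Here the header row is skipped because its `th` cell has no counterpart in the wrapper, while both data rows match and each wildcard captures one text node.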
Matching Example
Adding Semantics
• How to tie wrappers to semantic content?
• Assert RDF statements about unwrapped objects
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards
Semantic Labels
Semantic Matching
[ <rdf:type> <TalkAnnouncement> ;
  <series> “Dertouzos Lect…” ;
  <dc:title> “Distributed Hash…” ;
  <time> “3:30 PM” ]
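One way to picture how a match becomes RDF statements like the ones above is the following Python sketch. It is an illustration only: statements are plain `(subject, predicate, object)` tuples rather than objects from an RDF library, the function name `unwrap` and the blank-node naming scheme are hypothetical, and the literal values are the truncated strings exactly as they appear on the slide. The class is bound to the wrapper as a whole, and each wildcard carries at most one property.

```python
from itertools import count

_blank_ids = count(1)

def unwrap(wrapper_class, wildcard_properties, captured_values):
    """Turn one wrapper match into (subject, predicate, object) statements."""
    if len(wildcard_properties) != len(captured_values):
        raise ValueError("expected one captured value per wildcard")
    subject = f"_:match{next(_blank_ids)}"        # blank node for the unwrapped object
    statements = [(subject, "rdf:type", wrapper_class)]
    for prop, value in zip(wildcard_properties, captured_values):
        if prop is not None:                      # unlabeled wildcards assert nothing
            statements.append((subject, prop, value))
    return statements

statements = unwrap(
    "TalkAnnouncement",
    ["series", "dc:title", "time"],
    ["Dertouzos Lect…", "Distributed Hash…", "3:30 PM"],
)
for statement in statements:
    print(statement)
```

Every match of the same wrapper yields a fresh subject with the same class and properties, which is what lets the unwrapped objects be treated uniformly once they are in the RDF store.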
Automatically Adding Examples
• Find additional examples automatically
• Consider nodes neighboring the example
• Require low normalized cost:
• Often allows us to create wrappers with a single example
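The heuristic can be sketched as follows. The formula for the normalized cost is not preserved in this text, so the sketch assumes one: the edit distance divided by the combined size of the two subtrees. The threshold of 0.3, the helper names (`node`, `size`, `distance`, `normalized_cost`, `similar_siblings`), the top-down distance variant, and the table rows are likewise assumptions made for illustration.

```python
from functools import lru_cache

def node(label, *children):
    """A tree is a hashable (label, children) pair."""
    return (label, tuple(children))

def size(tree):
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def distance(a, b):
    """Top-down edit distance; inserting or deleting a child costs its subtree size."""
    xs, ys = a[1], b[1]
    row = [0]
    for y in ys:
        row.append(row[-1] + size(y))
    for x in xs:
        new = [row[0] + size(x)]
        for j, y in enumerate(ys, 1):
            new.append(min(row[j] + size(x),             # delete x
                           new[j - 1] + size(y),         # insert y
                           row[j - 1] + distance(x, y))) # map x to y
        row = new
    return (a[0] != b[0]) + row[-1]

def normalized_cost(a, b):
    """ASSUMED normalization: distance over the combined size of both trees."""
    return distance(a, b) / (size(a) + size(b))

def similar_siblings(parent, index, threshold=0.3):
    """Siblings of the example at parent[1][index] whose normalized cost is low."""
    example = parent[1][index]
    return [sibling for k, sibling in enumerate(parent[1])
            if k != index and normalized_cost(example, sibling) <= threshold]

header = node("tr", node("th", node("text:Series")), node("th", node("text:Time")))
row1 = node("tr", node("td", node("text:Talk A")), node("td", node("text:3:30 PM")))
row2 = node("tr", node("td", node("text:Talk B")), node("td", node("text:4:00 PM")))
table = node("table", header, row1, row2)

extra_examples = similar_siblings(table, 1)   # the user selected row1
print(extra_examples == [row2])
```

With these made-up rows, the second data row differs from the selected one only in its text and is picked up as an additional example, while the header row differs in both its cell elements and its text and falls above the threshold.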
List Collapse
• Current wrappers generalize well for single elements
• Will not recognize variable length lists
• Collapse neighboring nodes with low normalized cost
• For matching, allow nodes to match more than once
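A small Python sketch of list collapse is given below, with one deliberate simplification: adjacent pattern nodes are collapsed only when they are identical, which stands in for "low normalized cost" with a cost of zero. A collapsed node is flagged as repeatable and may then match one or more consecutive document nodes. The three-field tree representation and the names `node`, `collapse`, `matches`, and `match_kids` are hypothetical.

```python
def node(label, *children, repeat=False):
    """A tree is (label, children, repeat); repeat marks a collapsed list item."""
    return (label, tuple(children), repeat)

def collapse(tree):
    """Merge runs of identical neighboring children into one repeatable child."""
    label, children, repeat = tree
    out = []
    for child in map(collapse, children):
        if out and out[-1][:2] == child[:2]:
            out[-1] = (child[0], child[1], True)   # same shape as its neighbor
        else:
            out.append(child)
    return (label, tuple(out), repeat)

def matches(pattern, tree):
    return pattern[0] == tree[0] and match_kids(pattern[1], tree[1])

def match_kids(ps, ts):
    if not ps:
        return not ts
    head, rest = ps[0], ps[1:]
    if head[0] == "*":
        # Wildcard: absorb zero or more document children.
        return any(match_kids(rest, ts[k:]) for k in range(len(ts) + 1))
    if not ts or not matches(head, ts[0]):
        return False
    if match_kids(rest, ts[1:]):
        return True
    # A repeatable node is allowed to match the next document child as well.
    return head[2] and match_kids(ps, ts[1:])

# Wrapper induced from a two-item list; the document has three items.
two_items = node("ul", node("li", node("*")), node("li", node("*")))
three = node("ul", node("li", node("text:a")), node("li", node("text:b")),
             node("li", node("text:c")))

print(matches(two_items, three))             # False
print(matches(collapse(two_items), three))   # True
```

The uncollapsed wrapper expects exactly two items and rejects the longer list, while the collapsed wrapper accepts any list with at least one item of the same shape.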
Wrapper Wrap-up
• Gather user example(s)
• Automatically find additional examples
• Generalize examples using best mapping
• Add semantic labels
• Match by finding alignments
• Overlay objects on the page for interaction
Additional Tools
• Wrapper Sharing
• RSS
• Web Operations
Our Contributions
• End-user wrapper induction
• Few examples required
• Bring object interaction into the browser
• Wrappers bridge syntactic-semantic gap
Future Work and Applications
• Document-level classes
• Page reformatting
• Autonomous agent interaction
• Negative examples
• Automatic wrapper induction
ahogue@google.com
http://haystack.csail.mit.edu