Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web

May 11, 2005 WWW 2005 -- Chiba, Japan 1

Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web

Andrew Hogue

Google, MIT CSAIL

Acknowledgments

• David Karger (karger@csail.mit.edu)

• Haystack Group (http://haystack.csail.mit.edu)

Agenda

• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics

Unwrapping the Web

• Majority of semantic content in “deep web”
• Transformed into human-readable HTML by scripts
• HTML is difficult for automated agents to understand
• Little incentive for content providers to provide RDF markup
• How to “unwrap” this content?

Thresher

• Simple UI for wrapper induction on structured web content
• “Demonstrate” examples of objects
• Induce wrapper, or pattern, based on DOM
• User may also label properties with RDF

Thresher

• Built on Haystack Semantic Web client
• Everything is RDF
• Everything has context menus
• Thresher brings RDF into the web browser
• Wrappers reify web objects for full interaction

Thresher

• Underlying wrapper algorithm based on tree edit distance
• Align user’s examples
• Keep aligned nodes (layout elements)
• Wildcard non-aligned nodes (content)
• Pattern matching is also alignment

Wrapper Induction

• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  – head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support Vector Machines
  – Other Machine Learning

Wrapper Induction

• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern
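
The generalization step above can be illustrated with a minimal sketch. It is not Thresher's implementation: it assumes a simplified alignment that matches children by tag using a longest common subsequence rather than a full tree edit distance, and the `Node` class, the `*` wildcard marker, and the sample rows are all hypothetical. Text content is modeled as leaf nodes whose tag is the text itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

WILDCARD = "*"  # hypothetical marker for "variable content goes here"


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def lcs_pairs(a: List[Node], b: List[Node]) -> List[Tuple[int, int]]:
    """Index pairs of a longest common subsequence of two child lists, by tag."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i].tag == b[j].tag:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i].tag == b[j].tag:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def generalize(a: Node, b: Node) -> Node:
    """Keep nodes that align in both examples; replace the rest with wildcards."""
    if a.tag != b.tag:
        return Node(WILDCARD)
    pattern = Node(a.tag)
    prev_i = prev_j = 0
    end = (len(a.children), len(b.children))  # sentinel to flush trailing gaps
    for i, j in lcs_pairs(a.children, b.children) + [end]:
        if i > prev_i or j > prev_j:  # unaligned children on either side
            pattern.children.append(Node(WILDCARD))
        if i < len(a.children):  # a real aligned pair, not the sentinel
            pattern.children.append(generalize(a.children[i], b.children[j]))
        prev_i, prev_j = i + 1, j + 1
    return pattern


def show(n: Node) -> str:
    if not n.children:
        return n.tag
    return f"{n.tag}({', '.join(show(c) for c in n.children)})"


# Two hypothetical table rows selected by the user as examples.
ex1 = Node("tr", [Node("td", [Node("b", [Node("Talk A")])]), Node("td", [Node("3:30 PM")])])
ex2 = Node("tr", [Node("td", [Node("b", [Node("Talk B")])]), Node("td", [Node("4:00 PM")])])

pattern = generalize(ex1, ex2)
print(show(pattern))  # tr(td(b(*)), td(*))
```

The layout elements (`tr`, `td`, `b`) align in both examples and are kept, while the differing text becomes a wildcard.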

Tree Edit Distance

• Calculate cost ( ) of sequence of operations to transform one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Least-cost set of operations gives best mapping between elements
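
As a rough illustration of the cost model above, the following sketch computes a restricted, top-down tree edit distance. Several details are assumptions, not taken from the slides: trees are nested tuples `(tag, children)`, changing a node's label costs 1, and whole subtrees are inserted or deleted at a cost equal to their size. The general tree edit distance allows more kinds of mappings than this top-down variant.

```python
from functools import lru_cache

# A tree is (tag, (child, child, ...)); tuples keep trees hashable for caching.


def size(tree):
    """Number of nodes in the subtree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def tree_dist(a, b):
    """Top-down edit distance: relabel the roots, then align the child lists."""
    relabel = 0 if a[0] == b[0] else 1  # assumed cost of changing one node
    ca, cb = a[1], b[1]
    # dp[i][j]: cheapest way to turn the first i children of a
    # into the first j children of b.
    dp = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])  # delete a whole subtree
    for j in range(1, len(cb) + 1):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])  # insert a whole subtree
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(ca[i - 1]),
                dp[i][j - 1] + size(cb[j - 1]),
                dp[i - 1][j - 1] + tree_dist(ca[i - 1], cb[j - 1]),
            )
    return relabel + dp[len(ca)][len(cb)]


# Hypothetical rows: the second has one extra cell containing a link.
row_a = ("tr", (("td", (("b", ()),)), ("td", ())))
row_b = ("tr", (("td", (("b", ()),)), ("td", ()), ("td", (("a", ()),))))

print(tree_dist(row_a, row_b))  # 2: insert the two-node subtree td(a)
```

The minimizing choices in the `dp` table are what define the mapping between nodes; this sketch returns only the cost.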

Mapping Examples

[Figures: three slides of mapping examples]

Pattern Matching

• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match
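
A minimal sketch of the matching rules above, under simplifying assumptions: trees are nested tuples `(tag, children)`, a `*` node in the pattern absorbs zero or more sibling subtrees, and matching is strict, so a document node not covered by a pattern node or a wildcard makes the alignment fail. The talk describes matching as an alignment; this exact-match recursion is a stand-in for it, and the sample table is hypothetical.

```python
WILDCARD = "*"  # hypothetical wildcard marker in patterns


def matches(pattern, node):
    """True if the pattern's root aligns with this document node."""
    if pattern[0] != node[0]:
        return False
    return match_children(pattern[1], node[1])


def match_children(pkids, kids):
    """Align a sequence of pattern children with a sequence of document children."""
    if not pkids:
        return not kids  # nothing left in the pattern: document must be used up too
    head, rest = pkids[0], pkids[1:]
    if head[0] == WILDCARD:
        # The wildcard may swallow zero or more consecutive siblings.
        return any(match_children(rest, kids[k:]) for k in range(len(kids) + 1))
    return bool(kids) and matches(head, kids[0]) and match_children(rest, kids[1:])


def find_matches(pattern, root):
    """Every document subtree, in document order, that the pattern aligns with."""
    found = [root] if matches(pattern, root) else []
    for child in root[1]:
        found.extend(find_matches(pattern, child))
    return found


pattern = ("tr", (("td", (("b", (("*", ()),)),)), ("td", (("*", ()),))))

header = ("tr", (("th", ()),))
row_a = ("tr", (("td", (("b", (("Talk A", ()),)),)), ("td", (("3:30 PM", ()),))))
row_b = ("tr", (("td", (("b", (("Talk B", ()),)),)), ("td", ())))  # empty time cell
doc = ("table", (header, row_a, row_b))

found = find_matches(pattern, doc)
print(len(found))  # 2: both data rows match, the header row does not
```

The second row matches even though its time cell is empty, because the wildcard is allowed to match zero times.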

Matching Example

[Figure: matching example]

Adding Semantics

• How to tie wrappers to semantic content?
• Assert RDF statements about unwrapped objects
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards

Semantic Labels

[Figure: semantic labels]

Semantic Matching

[Figures: three slides of semantic matching]

[ <rdf:type> <TalkAnnouncement> ;
  <series> “Dertouzos Lect…” ;
  <dc:title> “Distributed Hash…” ;
  <time> “3:30 PM” ]
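
The binding of classes to wrappers and properties to wildcards can be sketched as follows. This is an illustration only: the function name, the slot numbering, and the `_:talk1` blank-node label are hypothetical, statements are plain Python tuples rather than objects from an RDF library, and the literal values are copied from the slide as they appear there, truncation included.

```python
def assert_statements(subject, rdf_class, bindings, values):
    """Build (subject, predicate, object) statements for one matched object.

    bindings: wildcard slot -> property name (set when the user labels the wrapper)
    values:   wildcard slot -> text captured by that wildcard in this match
    """
    triples = [(subject, "rdf:type", rdf_class)]  # class is bound to the wrapper
    for slot, prop in bindings.items():
        if slot in values:  # a wildcard that matched nothing asserts nothing
            triples.append((subject, prop, values[slot]))
    return triples


# Hypothetical labels for a talk-announcement wrapper with three wildcards.
bindings = {0: "series", 1: "dc:title", 2: "time"}
values = {0: "Dertouzos Lect…", 1: "Distributed Hash…", 2: "3:30 PM"}

triples = assert_statements("_:talk1", "TalkAnnouncement", bindings, values)
for triple in triples:
    print(triple)
```

Because the labels live on the wrapper, every later match of the same wrapper yields statements with the same class and properties.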

Automatically Adding Examples

• Find additional examples automatically
• Consider nodes neighboring the example
• Require low normalized cost:
• Often allows us to create wrappers with a single example
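
A sketch of how neighboring nodes might be screened as additional examples. The slide's formula for the normalized cost is not preserved here, so the normalization used below (edit cost divided by the combined size of the two trees) is an assumption, as are the `0.2` threshold, the function names, and the restricted top-down distance. Trees are nested tuples `(tag, children)` and text content is ignored.

```python
from functools import lru_cache


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def tree_dist(a, b):
    """Restricted top-down edit distance; subtree insert/delete costs its size."""
    ca, cb = a[1], b[1]
    dp = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])
    for j in range(1, len(cb) + 1):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(ca[i - 1]),
                dp[i][j - 1] + size(cb[j - 1]),
                dp[i - 1][j - 1] + tree_dist(ca[i - 1], cb[j - 1]),
            )
    return (0 if a[0] == b[0] else 1) + dp[len(ca)][len(cb)]


def normalized_cost(a, b):
    # Assumed normalization: 0 for identical trees, larger for dissimilar ones.
    return tree_dist(a, b) / (size(a) + size(b))


def neighbors_like(siblings, index, threshold=0.2):
    """Indices of siblings structurally close to the user's single example."""
    example = siblings[index]
    return [
        i
        for i, sibling in enumerate(siblings)
        if i != index and normalized_cost(example, sibling) <= threshold
    ]


rows = (
    ("tr", (("th", ()), ("th", ()))),                   # header row
    ("tr", (("td", (("b", ()),)), ("td", ()))),         # the user's example
    ("tr", (("td", (("b", ()),)), ("td", ()))),         # same structure
    ("tr", (("td", (("b", ()),)), ("td", (("a", ()),)))),  # one extra link
)

print(neighbors_like(rows, 1))  # [2, 3]
```

The header row is rejected (normalized cost 3/7) while the two data rows are accepted (0 and 1/9), so a single demonstrated row is enough to collect further examples.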

[Figure: automatically adding examples]

List Collapse

• Current wrappers generalize well for single elements
• Will not recognize variable length lists
• Collapse neighboring nodes with low normalized cost
• For matching, allow nodes to match more than once

Wrapper Wrap-up

• Gather user example(s)
• Automatically find additional examples
• Generalize examples using best mapping
• Add semantic labels
• Match by finding alignments
• Overlay objects on the page for interaction

Additional Tools

• Wrapper Sharing
• RSS
• Web Operations

Our Contributions

• End-user wrapper induction

• Few examples required

• Bring object interaction into the browser

• Wrappers bridge syntactic-semantic gap

Future Work and Applications

• Document-level classes
• Page reformatting
• Autonomous agent interaction
• Negative examples
• Automatic wrapper induction

ahogue@google.com

http://haystack.csail.mit.edu