Extracting and Structuring Web Data

25
Extracting and Structuring Web Data David W. Embley Department of Computer Science Brigham Young University D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith S.W. Liddle, D.W. Quass

description

Extracting and Structuring Web Data. David W. Embley Department of Computer Science Brigham Young University. D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith S.W. Liddle, D.W. Quass. GOAL Query the Web like we query a database. Example: Get the year, make, model, and price for - PowerPoint PPT Presentation

Transcript of Extracting and Structuring Web Data

Page 1: Extracting and Structuring Web Data

Extracting and Structuring Web Data

David W. EmbleyDepartment of Computer Science

Brigham Young University

D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. SmithS.W. Liddle, D.W. Quass

Page 2: Extracting and Structuring Web Data

GOALQuery the Web like we query a database

Example: Get the year, make, model, and price for 1987 or later cars that are red or white.

Year Make Model Price-----------------------------------------------------------------------97 CHEV Cavalier 11,99594 DODGE 4,99594 DODGE Intrepid 10,00091 FORD Taurus 3,50090 FORD Probe88 FORD Escort 1,000

Page 3: Extracting and Structuring Web Data

PROBLEMThe Web is not structured like a database.

Example: <html><head><title>The Salt Lake Tribune Classified’s</title></head>…<hr><h4> ‘97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! Asking only $11,995. #1415JERRY SEINER MIDVALE, 566-3800 </h4><hr>…</html>

Page 4: Extracting and Structuring Web Data

Making the Web Look Like a Database

• Web Query Languages– Treat the Web as a graph (pages = nodes, links = edges).– Query the graph (e.g., Find all pages within one hop of

pages with the words “Cars for Sale”).

• Wrappers– Find page of interest.– Parse page to extract attribute-value pairs and insert them

into a database.• Write parser by hand.

• Use syntactic clues to generate parser semi-automatically.

– Query the database.

Page 5: Extracting and Structuring Web Data

for a page of unstructured documents, rich in data and narrow in ontological breadth

Automatic Wrapper Generation

Record Extractor

UnstructuredRecord Documents

Web Page

Ontology Parser

Record-Level Objects,Relationships, Constraints

DatabaseScheme

Database-Instance Generator

PopulatedDatabaseData-Record Table

Constant/Keyword Recognizer

Constant/KeywordMatching Rules

ApplicationOntology

Record Extractor

UnstructuredRecord Documents

Web Page

Ontology Parser

Record-Level Objects,Relationships, Constraints

DatabaseScheme

Database-Instance Generator

PopulatedDatabaseData-Record Table

Constant/Keyword Recognizer

Constant/KeywordMatching Rules

ApplicationOntology

Page 6: Extracting and Structuring Web Data

Application Ontology:Object-Relationship Model Instance

PhoneNr

Car

ExtensionFeature

Mileage

PriceYear

Make

Model

->0:1

has1:*

0:1

has

1:*

0:1

has

1:*

0:1

has

1:*

0:1has

1:*

0:*

has

1:*

0:1has

1:*

1:*

is for0:1

PhoneNr

Car

ExtensionFeature

Mileage

PriceYear

Make

Model

->0:1

has1:*

0:1

has

1:*

0:1

has

1:*

0:1

has

1:*

0:1has

1:*

0:*

has

1:*

0:1has

1:*

1:*

is for0:1

Car [-> object];Car [0:1] has Model [1:*];Car [0:1] has Make [1:*];Car [0:1] has Year [1:*];Car [0:1] has Price [1:*];Car [0:1] has Mileage [1:*];PhoneNr [1:*] is for Car [0:1];PhoneNr [0:1] has Extension [1:*];Car [0:*] has Feature [1:*];

Page 7: Extracting and Structuring Web Data

Application Ontology: Data Frames

Make matches [10] case insensitive constant { extract “\bchev\b”; }, { extract “\bchevy\b”; }, { extract “\bdodge\b”; }, …end;Model matches [16] case insensitive constant { extract “88”; context “\bolds\S*\s*88\b”; }, …end;Mileage matches [7] case insensitive constant { extract “[1-9]\d{0,2}k”; substitute “k” -> “,000”; }, … keyword “\bmiles\b”, “\bmi\.”, “\bmi\b”;end;...

Page 8: Extracting and Structuring Web Data

Ontology Parser

Record-Level Objects,Relationships, Constraints

DatabaseScheme

Constant/KeywordMatching Rules

ApplicationOntology

Ontology Parser

Record-Level Objects,Relationships, Constraints

DatabaseScheme

Constant/KeywordMatching Rules

ApplicationOntology

Ontology Parser

Make : \bchev\b…KEYWORD(Mileage) : \bmiles\bKEYWORD(Mileage) : \bmi\.... create table Car (

Car integer, Year varchar(2), … );create table CarFeature ( Car integer, Feature varchar(10)); ...

Object: Car;...Car: Year [0:1];Car: Make [0:1];…CarFeature: Car [0:*] has Feature [1:*];

Page 9: Extracting and Structuring Web Data

Record Extractor

Record Extractor

UnstructuredRecord Documents

Web Page

Record Extractor

UnstructuredRecord Documents

Web Page

<html>…<h4> ‘97 CHEV Cavalier, Red, 5 spd, … </h4><hr><h4> ‘89 CHEV Corsica Sdn teal, auto, … </h4><hr>….</html>

…#####‘97 CHEV Cavalier, Red, 5 spd, …#####‘89 CHEV Corsica Sdn teal, auto, …#####...

Page 10: Extracting and Structuring Web Data

Record Extractor:High Fan-Out Heuristic

<html><head><title>The Salt Lake Tribune … </title></head><body bgcolor=“#FFFFFF”><h1 align=”left”>Domestic Cars</h1>…<hr><h4> ‘97 CHEV Cavalier, Red, … </h4><hr><h4> ‘89 CHEV Corsica Sdn … </h4><hr>…</body></html>

html

head

title

body

… hr h4 hr h4 hr ...h1

Page 11: Extracting and Structuring Web Data

Record Extractor:Record-Separator Heuristics

…<hr><h4> ‘97 CHEV Cavalier, Red, 5 spd, <i>only 7,000 miles</i>on her. Asking <i>only $11,995</i>. … </h4><hr><h4> ‘89 CHEV Corsica Sdn teal, auto, air, <i>trouble free</i>.Only $8,995 … </h4><hr>...

• Identifiable “separator” tags• Highest-count tag(s)• Interval standard deviation• Ontological match• Repeating tag patterns

Example:

Page 12: Extracting and Structuring Web Data

Constant/Keyword Recognizer

UnstructuredRecord Documents

Data-Record Table

Constant/Keyword Recognizer

Constant/KeywordMatching Rules

UnstructuredRecord Documents

Data-Record Table

Constant/Keyword Recognizer

Constant/KeywordMatching Rules

Year|97|1|3Make|CHEV|5|8Model|Cavalier|10|17Feature|Red|20|22Feature|5 spd|25|29Mileage|7,000|37|41KEYWORD(Mileage)|miles|43|47Price|11,995|108|114PhoneNr|556-3800|146|153

Descriptor/String/Position(start/end)

Page 13: Extracting and Structuring Web Data

Record-Level Objects,Relationships, Constraints

DatabaseScheme

Database-Instance Generator

PopulatedDatabaseData-Record Table

Record-Level Objects,Relationships, Constraints

DatabaseScheme

Database-Instance Generator

PopulatedDatabaseData-Record Table

Database-Instance Generator

insert into Car values(1001, “97”, “CHEV”, “Cavalier”, “7,000”, “11,995”, “556-3800”)insert into CarFeature values(1001, “Red”)insert into CarFeature values(1001, “5 spd”)

Page 14: Extracting and Structuring Web Data

Recall and Precision

‘97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart

broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 556-3800

Recall = Nr Correct Facts / Actual Nr of Facts in Source

Precision = Nr Correct / (Nr Correct + Nr Incorrect)

Page 15: Extracting and Structuring Web Data

Results: Car Ads

Training set for tuning ontology: 100Test set: 116

Salt Lake Tribune

Recall % Precision %Year 100 100Make 97 100Model 82 100Mileage 90 100Price 100 100PhoneNr 94 100Extension 50 100Feature 91 99

Page 16: Extracting and Structuring Web Data

Car Ads: Comments

• Unbounded sets– missed: MERC, Town Car, 98 Royale– could use lexicon of makes and models

• Unspecified variation in lexical patterns– missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)– could adjust lexical patterns

• Misidentification of attributes– classified AUTO in AUTO SALES as automatic transmission– could adjust exceptions in lexical patterns

• Typographical errors– “Chrystler”, “DODG ENeon”, “I-15566-2441”– could look for spelling variations and common typos

Page 17: Extracting and Structuring Web Data

Results: Computer Job Ads

Training set for tuning ontology: 50Test set: 50

Los Angeles Times

Recall % Precision %Degree 100 100Skill 74 100Email 91 83Fax 100 100Voice 79 92

Page 18: Extracting and Structuring Web Data

Obituaries(A More Demanding Application)

Our beloved Brian Fielding Frost,age 41, passed away Saturday morning,March 7, 1998, due to injuries sustainedin an automobile accident. He was bornAugust 4, 1956 in Salt Lake City, toDonald Fielding and Helen Glade Frost.He married Susan Fox on June 1, 1981. He is survived by Susan; sons Jord-dan (9), Travis (8), Bryce (6); parents,three brothers, Donald Glade (Lynne),Kenneth Wesley (Ellen), … Funeral services will be held at 12noon Friday, March 13, 1998 in theHoward Stake Center, 350 South 1600East. Friends may call 5-7 p.m. Thurs-day at Wasatch Lawn Mortuary, 3401S. Highland Drive, and at the StakeCenter from 10:45-11:45 a.m.

Names

Addresses

FamilyRelationships

MultipleDates

MultipleViewings

Page 19: Extracting and Structuring Web Data

Obituary Ontology

Deceased Person

Interment

Funeral

Viewing

EndingTime: Time

BeginningTime: Time

ViewingAddress:Address

Viewing Date: Date

RelativeName: Name

Relationship

DeceasedName: Name

Age

Birth Date: Date

Death Date: Date->0:1

has

1:*

0:1has1:*

0:1

has

1:*

1

has

1:* -> 1

0:*

Deceased Personhas Relationshipto Relative Name 1:*

1:* -> 1

0:*has

0:10:1

has

1:*

1

has

1:*

0:1

has1:*

0:1 has

1:*

0:1

has

1

0:1has

1

Deceased Person

Interment

Funeral

Viewing

EndingTime: Time

BeginningTime: Time

ViewingAddress:Address

Viewing Date: Date

RelativeName: Name

Relationship

DeceasedName: Name

Age

Birth Date: Date

Death Date: Date->0:1

has

1:*

0:1has1:*

0:1

has

1:*

1

has

1:* -> 1

0:*

Deceased Personhas Relationshipto Relative Name 1:*

1:* -> 1

0:*has

0:10:1

has

1:*

1

has

1:*

0:1

has1:*

0:1 has

1:*

0:1

has

1

0:1has

1

(partial)

Page 20: Extracting and Structuring Web Data

Data FramesLexicons & Specializations

Name matches [80] case sensitive constant { extract First, “\s+”, Last; }, … { extract “[A-Z][a-zA-Z]*\s+([A-Z]\.\s+)?”, Last; }, … lexicon { First case insensitive; filename “first.dict”; }, { Last case insensitive; filename “last.dict”; };end;Relative Name matches [80] case sensitive constant { extract First, “\s+\(“, First, “\)\s+”, Last; substitute “\s*\([^)]*\)” -> “”; } …end;...

Page 21: Extracting and Structuring Web Data

Keyword HeuristicsSingleton Items

RelativeName|Brian Fielding Frost|16|35DeceasedName|Brian Fielding Frost|16|35KEYWORD(Age)|age|38|40Age|41|42|43KEYWORD(DeceasedName)|passed away|46|56KEYWORD(DeathDate)|passed away|46|56BirthDate|March 7, 1998|76|88DeathDate|March 7, 1998|76|88IntermentDate|March 7, 1998|76|98FuneralDate|March 7, 1998|76|98ViewingDate|March 7, 1998|76|98...

Page 22: Extracting and Structuring Web Data

Keyword HeuristicsMultiple Items

…KEYWORD(Relationship)|born … to|152|192Relationship|parent|152|192KEYWORD(BirthDate)|born|152|156BirthDate|August 4, 1956|157|170DeathDate|August 4, 1956|157|170IntermentDate|August 4, 1956|157|170FuneralDate|August 4, 1956|157|170ViewingDate|August 4, 1956|157|170BirthDate|August 4, 1956|157|170RelativeName|Donald Fielding|194|208DeceasedName|Donald Fielding|194|208RelativeName|Helen Glade Frost|214|230DeceasedName|Helen Glade Frost|214|230KEYWORD(Relationship)|married|237|243...

Page 23: Extracting and Structuring Web Data

Results: Obituaries

Training set for tuning ontology: ~ 24Test set: 90

Arizona Daily Star

Recall % Precision %DeceasedName* 100 100Age 86 98BirthDate 96 96DeathDate 84 99FuneralDate 96 93FuneralAddress 82 82FuneralTime 92 87…Relationship 92 97RelativeName* 95 74

*partial or full name

Page 24: Extracting and Structuring Web Data

Results: Obituaries

Training set for tuning ontology: ~ 12Test set: 38

Salt Lake Tribune

Recall % Precision %DeceasedName* 100 100Age 91 95BirthDate 100 97DeathDate 94 100FuneralDate 92 100FuneralAddress 96 96FuneralTime 97 100…Relationship 81 93RelativeName* 88 71

*partial or full name

Page 25: Extracting and Structuring Web Data

Conclusions

• Given an ontology and record extractor, automatic extracting and structuring of Web data is possible.

• Recall and Precision results are encouraging.– Car Ads: ~ 94% recall and ~ 99% precision– Job Ads: ~ 84% recall and ~ 98% precision– Obituaries: ~ 90% recall and ~ 95% precision (except on names: ~

73% precision)

• Future Work– Find pages of interest.– Strengthen heuristics for separation, extraction, and construction.– Add richer conversions and additional constraints to data frames.– Provide ways to infer data as well as extract data.