Information Extraction from the World Wide Web
![Page 1: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/1.jpg)
Information Extraction from the World Wide Web
William W. Cohen, Carnegie Mellon University
Andrew McCallum, University of Massachusetts Amherst
KDD 2003
![Page 2: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/2.jpg)
Example: The Problem
Martin Baker, a person
Genomics job
Employer's job posting form
![Page 3: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/3.jpg)
Example: A Solution
![Page 4: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/4.jpg)
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
![Page 5: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/5.jpg)
Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.
![Page 6: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/6.jpg)
Data Mining the Extracted Job Information
![Page 7: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/7.jpg)
IE from Research Papers
![Page 8: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/8.jpg)
IE from Chinese Documents regarding Weather
Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
![Page 9: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/9.jpg)
IE from SEC Filings
This filing covers the period from December 1996 to September 1997.
ENRON GLOBAL POWER & PIPELINES L.L.C. CONSOLIDATED BALANCE SHEETS (IN THOUSANDS, EXCEPT SHARE AMOUNTS)
                                             SEPTEMBER 30,   DECEMBER 31,
                                                 1997            1996
                                              (UNAUDITED)
ASSETS
Current Assets
  Cash and cash equivalents                   $ 54,262        $ 24,582
  Accounts receivable                            8,473           6,301
  Current portion of notes receivable            1,470           1,394
  Other current assets                             336             404
    Total Current Assets                        71,730          32,681
Investments in Unconsolidated Subsidiaries     286,340         298,530
Notes Receivable                                16,059          12,111
    Total Assets                              $374,408        $343,843
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
  Accounts payable                            $ 13,461        $ 11,277
  Accrued taxes                                  1,910           1,488
    Total Current Liabilities                   15,371          49,348
Deferred Income Taxes                              525           4,301
The U.S. energy markets in 1997 were subject to significant fluctuation
Data mine these reports for:
- suspicious behavior
- to better understand what is normal
![Page 10: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/10.jpg)
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
![Page 11: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/11.jpg)
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
IE
![Page 12: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/12.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + clustering + association
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
![Page 13: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/13.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
![Page 14: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/14.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
![Page 15: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/15.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
* * * *
![Page 16: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/16.jpg)
IE in Context
Create ontology
Segment, Classify, Associate, Cluster
Load DB
Spider
Query, Search
Data mine
IE
Document collection
Database
Filter by relevance
Label training data
Train extraction models
![Page 17: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/17.jpg)
Why IE from the Web?
• Science
  – Grand old dream of AI: build a large KB* and reason with it. IE from the Web enables the creation of this KB.
  – IE from the Web is a complex problem that inspires new advances in machine learning.
• Profit
  – Many companies interested in leveraging data currently "locked in unstructured text on the Web".
  – Not yet a monopolistic winner in this space.
• Fun!
  – Build tools that we researchers like to use ourselves: Cora & CiteSeer, MRQE.com, FAQFinder, …
  – See our work get used by the general public.
* KB = “Knowledge Base”
![Page 18: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/18.jpg)
Tutorial Outline
• IE History
• Landscape of problems and solutions
• Parade of models for segmenting/classifying:
  – Sliding window
  – Boundary finding
  – Finite state machines
  – Trees
• Overview of related problems and solutions
  – Association, Clustering
  – Integration with Data Mining
• Where to go from here
15 min break
![Page 19: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/19.jpg)
IE History

Pre-Web
• Mostly news articles
  – De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
  – Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
  – E.g. SRI's FASTUS, hand-built FSMs.
  – But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]

Web
• AAAI '94 Spring Symposium on "Software Agents"
  – Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni.
• Tom Mitchell's WebKB, '96
  – Build KBs from the Web.
• Wrapper Induction
  – Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …
![Page 20: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/20.jpg)
www.apple.com/retail
What makes IE from the Web Different?
Less grammar, but more formatting & linking
The directory structure, link structure, formatting & layout of the Web is its own new grammar.
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
Newswire Web
![Page 21: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/21.jpg)
Landscape of IE Tasks (1/4): Pattern Feature Domain
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Non-grammatical snippets, rich formatting & links; Tables
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
![Page 22: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/22.jpg)
Landscape of IE Tasks (2/4): Pattern Scope
Web site specific Genre specific Wide, non-specific
Amazon.com Book Pages Resumes University Names
Formatting Layout Language
![Page 23: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/23.jpg)
Landscape of IE Tasks (3/4): Pattern Complexity

E.g. word patterns:

Closed set (e.g. U.S. states)
  He was born in Alabama…
  The big Wyoming sky…

Regular set (e.g. U.S. phone numbers)
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299

Complex pattern (e.g. U.S. postal addresses)
  University of Arkansas, P.O. Box 140, Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (e.g. person names)
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
![Page 24: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/24.jpg)
Landscape of IE Tasks (4/4): Pattern Combinations

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction)
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut

N-ary record
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt
![Page 25: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/25.jpg)
Evaluation of Single Entity Extraction
TRUTH: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.

PRED: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.

Precision = # correctly predicted segments / # predicted segments = 2/6

Recall = # correctly predicted segments / # true segments = 2/4

F1 = harmonic mean of Precision & Recall = 1 / ( ((1/P) + (1/R)) / 2 )
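A minimal sketch (not from the tutorial) of these segment-level metrics, assuming predictions and truth are given as sets of (start, end) spans and a prediction counts as correct only on an exact match:

```python
def evaluate_segments(predicted, truth):
    """Segment-level precision, recall and F1: a predicted segment counts as
    correct only if it exactly matches a true segment."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy spans mirroring the slide: 6 predicted, 4 true, 2 correct.
pred = {(0, 2), (3, 4), (5, 6), (8, 10), (11, 12), (13, 15)}
true = {(0, 2), (8, 10), (16, 18), (19, 21)}
print(evaluate_segments(pred, true))  # -> (0.333..., 0.5, 0.4)
```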
![Page 26: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/26.jpg)
State of the Art Performance
• Named entity recognition
  – Person, Location, Organization, …
  – F1 in high 80's or low- to mid-90's
• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in 60's or 70's or 80's
• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30 min) required on each site
![Page 27: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/27.jpg)
Landscape of IE Techniques (1/1): Models
Any of these models can be used to capture words, formatting or both.
Lexicons
Alabama, Alaska, …, Wisconsin, Wyoming
Abraham Lincoln was born in Kentucky.
member?
Classify Pre-segmented Candidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
…and beyond
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternate window sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Context Free Grammars
Abraham Lincoln was born in Kentucky.
NNP V P NP V NNP
NP
PP
VP
VP
S
Most likely parse?
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
![Page 28: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/28.jpg)
Landscape: Focus of this Tutorial
Pattern complexity
Pattern feature domain
Pattern scope
Pattern combinations
Models
closed set regular complex ambiguous
words words + formatting formatting
site-specific genre-specific general
entity binary n-ary
lexicon regex window boundary FSM CFG
![Page 29: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/29.jpg)
Sliding Windows
![Page 30: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/30.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
![Page 31: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/31.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
![Page 32: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/32.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
![Page 33: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/33.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
![Page 34: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/34.jpg)
A "Naïve Bayes" Sliding Window Model [Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
  [ w_{t-m} … w_{t-1} ][ w_t … w_{t+n} ][ w_{t+n+1} … w_{t+n+m} ]
        prefix               contents              suffix

If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
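A small sketch of the scoring rule described above, with hypothetical probability tables (p_prefix, p_content, p_suffix, p_len, prior are placeholders that would be estimated from labeled training windows):

```python
import math

def window_log_score(prefix, contents, suffix,
                     p_prefix, p_content, p_suffix, p_len, prior):
    """Log of the (unnormalized) posterior Pr(LOCATION | window) under the
    independence assumptions above; unseen words get a small floor."""
    score = math.log(prior) + math.log(p_len.get(len(contents), 1e-6))
    for words, table in ((prefix, p_prefix), (contents, p_content), (suffix, p_suffix)):
        for w in words:
            score += math.log(table.get(w, 1e-6))
    return score

def extract_locations(tokens, tables, max_len=6, context=2, threshold=-25.0):
    """Try all 'reasonable' windows (varying position and length) and keep
    those whose score clears a threshold."""
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            s = window_log_score(tokens[max(0, i - context):i], tokens[i:j],
                                 tokens[j:j + context], **tables)
            if s > threshold:
                hits.append((i, j, s))
    return hits
```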
![Page 35: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/35.jpg)
“Naïve Bayes” Sliding Window Results
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Domain: CMU UseNet Seminar Announcements
Field         F1
Person Name   30%
Location      61%
Start Time    98%
![Page 36: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/36.jpg)
SRV: a realistic sliding-window-classifier IE system
• What windows to consider?
  – All windows containing as many tokens as the shortest example, but no more tokens than the longest example
• How to represent a classifier? It might:
  – Restrict the length of window;
  – Restrict the vocabulary or formatting used before/after/inside window;
  – Restrict the relative order of tokens;
  – Etc…
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>
[Freitag AAAI '98]
![Page 37: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/37.jpg)
SRV: a rule-learner for sliding-window classification
Rule learning: greedily add conditions to rules, rules to rule set
Search metric: SRV algorithm greedily adds conditions to maximize “information gain”
To prevent overfitting:
rules are built on 2/3 of data, then their false positive rate is estimated on the 1/3 holdout set.
Candidate conditions: …“Two tokens, one a 3-char token, starting just after the title”
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>

courseNumber(X) :-
  tokenLength(X,=,2),
  every(X, inTitle, false),
  some(X, A, <previousToken>, inTitle, true),
  some(X, B, <>, tripleton, true)
![Page 38: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/38.jpg)
SRV: a rule-learner for sliding-window classification
• Primitive predicates used by SRV:
  – token(X,W), allLowerCase(W), numerical(W), …
  – nextToken(W,U), previousToken(W,V)
• HTML-specific predicates:
  – inTitleTag(W), inH1Tag(W), inEmTag(W), …
  – emphasized(W) = "inEmTag(W) or inBTag(W) or …"
  – tableNextCol(W,U) = "U is some token in the column after the column W is in"
  – tablePreviousCol(W,V), tableRowHeader(W,T), …
![Page 39: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/39.jpg)
SRV: a rule-learner for sliding-window classification
• Non-primitive "conditions" used by SRV:
  – every(+X, f, c) = for all W in X : f(W) = c
  – some(+X, W, <f1,…,fk>, g, c) = exists W : g(fk(…f1(W)…)) = c
  – tokenLength(+X, relop, c)
  – position(+W, direction, relop, c)
    • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)

courseNumber(X) :-
  tokenLength(X,=,2),
  every(X, inTitle, false),
  some(X, A, <previousToken>, inTitle, true),
  some(X, B, <>, tripleton, true)
Non-primitive conditions make greedy search easier
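A rough illustration (not Freitag's implementation) of how a learned rule like the one above could be checked against a candidate window, assuming each token is a dict of feature values with relational keys such as previousToken pointing at neighboring token dicts:

```python
def every(window, feature, value):
    """every(X, f, c): feature f equals c for every token in the window."""
    return all(tok.get(feature) == value for tok in window)

def some(window, feature, value, path=()):
    """some(X, W, <f1..fk>, g, c): some token, after following the relational
    path (e.g. previousToken), has feature g equal to c."""
    for tok in window:
        for rel in path:
            tok = tok.get(rel)
            if tok is None:
                break
        else:
            if tok.get(feature) == value:
                return True
    return False

def course_number_rule(window):
    """Hypothetical rendering of the courseNumber rule shown above."""
    return (len(window) == 2                                       # tokenLength(X,=,2)
            and every(window, "inTitle", False)                    # every(X, inTitle, false)
            and some(window, "inTitle", True, ("previousToken",))  # some(X, A, <previousToken>, inTitle, true)
            and some(window, "tripleton", True))                   # some(X, B, <>, tripleton, true)
```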
![Page 40: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/40.jpg)
Rapier: an alternative approach
A bottom-up rule learner:
initialize RULES to be one rule per example;
repeat {
  randomly pick N pairs of rules (Ri, Rj);
  let {G1, …, GN} be the consistent pairwise generalizations;
  let G* = the Gi that optimizes "compression";
  let RULES = RULES + {G*} – {R': covers(G*, R')}
}
where compression(G, RULES) = size of RULES – {R': covers(G, R')},
and covers(G, R) means every example matching R also matches G.
[Califf & Mooney, AAAI ‘99]
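A toy sketch of this compression loop, under the simplifying assumptions that a rule is just a set of conditions and that pairwise generalization is their intersection; the check that generalizations stay consistent with negative examples is omitted:

```python
import random

def generalize(r1, r2):
    """Pairwise generalization, simplified to the conditions shared by both
    rules; Rapier's real generalization works over token-pattern sequences."""
    return r1 & r2

def covers(general, specific):
    """A generalization covers a rule whose conditions are a superset of its own."""
    return general <= specific

def compress(rules, n_pairs=10, rounds=20):
    """Bottom-up loop from the pseudocode above: start with one maximally
    specific rule (set of conditions) per example, and repeatedly keep the
    sampled pairwise generalization that covers the most existing rules."""
    rules = [frozenset(r) for r in rules]
    for _ in range(rounds):
        if len(rules) < 2:
            break
        candidates = [generalize(*random.sample(rules, 2)) for _ in range(n_pairs)]
        best = max(candidates, key=lambda g: sum(covers(g, r) for r in rules))
        covered = [r for r in rules if covers(best, r)]
        if len(covered) >= 2:            # only keep generalizations that compress
            rules = [r for r in rules if r not in covered] + [best]
    return rules
```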
![Page 41: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/41.jpg)
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1> …
<title>Syllabus and meeting times for Eng 214</title>
<h1>Eng 214 Software Engineering for Non-programmers </h1>…
courseNum(window1) :- token(window1,’CS’), doubleton(‘CS’), prevToken(‘CS’,’CS213’), inTitle(‘CS213’), nextTok(‘CS’,’213’), numeric(‘213’), tripleton(‘213’), nextTok(‘213’,’C++’), tripleton(‘C++’), ….
courseNum(window2) :- token(window2,’Eng’), tripleton(‘Eng’), prevToken(‘Eng’,’214’), inTitle(‘214’), nextTok(‘Eng’,’214’), numeric(‘214’), tripleton(‘214’), nextTok(‘214’,’Software’), …
courseNum(X) :- token(X,A), prevToken(A, B), inTitle(B), nextTok(A,C), numeric(C), tripleton(C), nextTok(C,D), …
Common conditions carried over to generalization
Differences dropped
![Page 42: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/42.jpg)
Rapier: an alternative approach
- Combines top-down and bottom-up learning
  - Bottom-up to find common restrictions on content
  - Top-down greedy addition of restrictions on context
- Use of part-of-speech and semantic features (from WORDNET).
- Special "pattern-language" based on sequences of tokens, each of which satisfies one of a set of given constraints
  - < <tok ∈ {'ate','hit'}, POS ∈ {'vb'}>, <tok ∈ {'the'}>, <POS ∈ {'nn'}> >
![Page 43: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/43.jpg)
Rapier: results – precision/recall
![Page 44: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/44.jpg)
Rule-learning approaches to sliding-window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD ‘97]
– Representations for classifiers allow restriction of the relationships between tokens, etc
– Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
– Use of these “heavyweight” representations is complicated, but seems to pay off in results
• Can simpler representations for classifiers work?
![Page 45: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/45.jpg)
BWI: Learning to detect boundaries
• Another formulation: learn three probabilistic classifiers:
  – START(i) = Prob(position i starts a field)
  – END(j) = Prob(position j ends a field)
  – LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i), as sketched below
• LEN(k) is estimated from a histogram
[Freitag & Kushmerick, AAAI 2000]
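A small sketch of this scoring scheme; start_prob and end_prob stand in for the learned START/END detectors (which BWI builds by boosting) and are assumptions here:

```python
from collections import Counter

def length_histogram(true_spans):
    """LEN(k), estimated from a histogram of field lengths in training data."""
    counts = Counter(j - i for i, j in true_spans)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def best_extraction(tokens, start_prob, end_prob, len_prob, max_len=10):
    """Score every candidate span (i, j) by START(i) * END(j) * LEN(j - i).
    start_prob and end_prob are callables standing in for the boosted
    boundary detectors, which are not implemented here."""
    best, best_score = None, 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            score = start_prob(tokens, i) * end_prob(tokens, j) * len_prob.get(j - i, 0.0)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score
```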
![Page 46: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/46.jpg)
BWI: Learning to detect boundaries
• BWI uses boosting to find “detectors” for START and END
• Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
• Each “pattern” is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
• Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns
![Page 47: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/47.jpg)
BWI: Learning to detect boundaries
Field         F1
Person Name   30%
Location      61%
Start Time    98%
![Page 48: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/48.jpg)
Problems with Sliding Windows and Boundary Finders
• Decisions in neighboring parts of the input are made independently from each other.
– Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both be above threshold.
– In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
![Page 49: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/49.jpg)
Finite State Machines
![Page 50: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/50.jpg)
Hidden Markov Models
(graphical model: hidden states S_{t-1}, S_t, S_{t+1}, … each emitting an observation O_{t-1}, O_t, O_{t+1}, …)

Finite state model / Graphical model

Parameters: for all states S = {s_1, s_2, …}
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t)
Training: maximize probability of training observations (w/ prior)

P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) \, P(o_t | s_t)
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Generates:
  State sequence (transitions)
  Observation sequence: o1 o2 o3 o4 o5 o6 o7 o8
Each emission is usually a multinomial over an atomic, fixed alphabet.
![Page 51: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/51.jpg)
IE with Hidden Markov Models
Yesterday Pedro Domingos spoke this example sentence.
Yesterday Pedro Domingos spoke this example sentence.
Person name: Pedro Domingos
Given a sequence of observations:
and a trained HMM:
Find the most likely state sequence: (Viterbi)
Any words said to be generated by the designated "person name" state are extracted as a person name:

s* = \arg\max_s P(s, o)
person name
location name
background
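A compact Viterbi sketch for this extraction-by-decoding idea, assuming dictionary-based log-probability tables and the slide's "person name" state label:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence argmax_s P(s, o) for an HMM.
    log_start, log_trans and log_emit are dicts of log-probabilities;
    unseen entries get a small floor so toy examples run."""
    floor = math.log(1e-8)
    V = [{s: log_start.get(s, floor) + log_emit.get(s, {}).get(obs[0], floor)
          for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans.get(p, {}).get(s, floor))
            scores[s] = (V[-1][prev] + log_trans.get(prev, {}).get(s, floor)
                         + log_emit.get(s, {}).get(o, floor))
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

def extract_person_names(tokens, path, person_state="person name"):
    # Words aligned with the designated "person name" state are extracted.
    return [tok for tok, s in zip(tokens, path) if s == person_state]
```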
![Page 52: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/52.jpg)
HMM Example: “Nymble”
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
Task: Named Entity Extraction
Train on ~500k words of news wire text.
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%
[Bikel, et al 1998], [BBN “IdentiFinder”]
Person
Org
Other
(Five other name classes)
start-of-sentence
end-of-sentence
Transition probabilities:                Observation probabilities:
P(s_t | s_{t-1}, o_{t-1})                P(o_t | s_t, s_{t-1})  or  P(o_t | s_t, o_{t-1})
  back-off to: P(s_t | s_{t-1})            back-off to: P(o_t | s_t)
  back-off to: P(s_t)                      back-off to: P(o_t)
Results:
![Page 53: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/53.jpg)
We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
(graphical model: states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1})

Example features of the word "Wisniewski" (part of a noun phrase, ends in "-ski"):
  identity of word
  ends in "-ski"
  is capitalized
  is part of a noun phrase
  is in a list of city names
  is under node X in WordNet
  is in bold font
  is indented
  is in hyperlink anchor
  last person name was female
  next two words are "and Associates"
  …
![Page 54: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/54.jpg)
Problems with Richer Representation and a Joint Model

These arbitrary features are not independent.
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!

Ignore the dependencies. This causes "over-counting" of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!
(two graphical models over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: one modeling the dependencies among observations, one ignoring them)
![Page 55: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/55.jpg)
Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
![Page 56: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/56.jpg)
From HMMs to CRFs  [Lafferty, McCallum, Pereira 2001]
Conditional Finite State Sequence Models  [McCallum, Freitag & Pereira, 2000]
(A super-special case of Conditional Random Fields.)

s = s_1, s_2, …, s_n        o = o_1, o_2, …, o_n

(two linear-chain graphical models: a generative HMM and its conditional counterpart)

Joint:        P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) \, P(o_t | s_t)

Conditional:  P(s | o) = \frac{1}{P(o)} \prod_{t=1}^{|o|} P(s_t | s_{t-1}) \, P(o_t | s_t)

                       = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \Phi_s(s_{t-1}, s_t) \, \Phi_o(o_t, s_t)

where  \Phi_o(t) = \exp\left( \sum_k \lambda_k f_k(s_t, o_t) \right)
![Page 57: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/57.jpg)
Conditional Random Fields
(linear chain: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} conditioned on the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4})

1. FSM special-case: linear chain among unknowns, parameters tied across time steps.

P(s | o) = \frac{1}{Z(o)} \exp\left( \sum_{t=1}^{|o|} \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right)
[Lafferty, McCallum, Pereira 2001]
2. In general: CRFs = "Conditionally-trained Markov Network" arbitrary structure among unknowns
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: Parameters tied across hits from SQL-like queries ("clique templates")
![Page 58: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/58.jpg)
Feature Functions
Example, for f_k(s_t, s_{t-1}, o, t):

f_{Capitalized, s_i, s_j}(s_t, s_{t-1}, o, t) =
  1  if Capitalized(o_t) and s_{t-1} = s_i and s_t = s_j
  0  otherwise

o = o_1 o_2 o_3 o_4 o_5 o_6 o_7
    Yesterday Pedro Domingos spoke this example sentence.
    (with states s_1, s_2, s_3, s_4 over the first words)

e.g.  f_{Capitalized, s_1, s_2}(s_2, s_1, o, 2) = 1
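A hedged illustration of such feature functions and of the unnormalized score they induce; the state names and the particular feature are hypothetical:

```python
def f_capitalized_b_to_i(s_prev, s_cur, obs, t):
    """Example feature function f_k(s_t, s_{t-1}, o, t): fires when the
    current word is capitalized and the transition is from the hypothetical
    state "B-PER" to "I-PER"."""
    return 1.0 if obs[t][:1].isupper() and (s_prev, s_cur) == ("B-PER", "I-PER") else 0.0

def unnormalized_log_score(states, obs, features, weights):
    """sum_t sum_k lambda_k f_k(s_t, s_{t-1}, o, t); exponentiating and
    dividing by Z(o) would give the CRF probability P(s | o)."""
    return sum(w * f(states[t - 1], states[t], obs, t)
               for t in range(1, len(states))
               for f, w in zip(features, weights))
```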
![Page 59: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/59.jpg)
Learning Parameters of CRFs
Maximize the log-likelihood of parameters Λ = {λ_k} given training data D:

L_Λ = \sum_{\langle o, s \rangle \in D} \log \left( \frac{1}{Z_o} \exp \sum_{t=1}^{|o|} \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right) - \sum_k \frac{\lambda_k^2}{2\sigma^2}

Log-likelihood gradient:

\frac{\partial L}{\partial \lambda_k} = \sum_{\langle o, s \rangle \in D} \#_k(s, o) - \sum_{\langle o, s \rangle \in D} \sum_{s'} P_\Lambda(s' | o) \, \#_k(s', o) - \frac{\lambda_k}{\sigma^2}

where \#_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t)

Methods:
• iterative scaling (quite slow)
• conjugate gradient (much faster)
• limited-memory quasi-Newton methods, BFGS (super fast)   [Sha & Pereira 2002] & [Malouf 2002]
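A brute-force sketch of this gradient for a single training instance (empirical counts minus expected counts minus the Gaussian-prior term); real CRF training computes the expectations with forward-backward rather than by enumerating every state sequence:

```python
import math
from itertools import product

def log_score(states, obs, features, weights):
    # Unnormalized log score: sum_t sum_k lambda_k f_k(s_t, s_{t-1}, o, t)
    return sum(w * f(states[t - 1], states[t], obs, t)
               for t in range(1, len(states))
               for f, w in zip(features, weights))

def gradient(obs, true_states, state_set, features, weights, sigma2=10.0):
    """Gradient of the regularized log-likelihood for one instance:
    empirical counts - expected counts under P(s'|o) - lambda_k / sigma^2.
    Enumeration over all state sequences is only feasible for toy chains."""
    def counts(states):
        return [sum(f(states[t - 1], states[t], obs, t)
                    for t in range(1, len(states))) for f in features]

    all_seqs = list(product(state_set, repeat=len(obs)))
    log_z = math.log(sum(math.exp(log_score(s, obs, features, weights))
                         for s in all_seqs))
    empirical = counts(true_states)
    expected = [0.0] * len(features)
    for s in all_seqs:
        p = math.exp(log_score(s, obs, features, weights) - log_z)
        for k, c in enumerate(counts(s)):
            expected[k] += p * c
    return [emp - exp_ - w / sigma2
            for emp, exp_, w in zip(empirical, expected, weights)]
```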
![Page 60: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/60.jpg)
Voted Perceptron Sequence Models
Given training data: D = { (o, s)^{(i)} }
Initialize parameters to zero: λ_k = 0 for all k
Iterate to convergence:
  for all training instances i:
    s^{Viterbi} = \arg\max_s \exp \sum_t \sum_k \lambda_k f_k(s_t, s_{t-1}, o^{(i)}, t)
    for all k:  λ_k ← λ_k + C_k(s^{(i)}, o^{(i)}) − C_k(s^{Viterbi}, o^{(i)})
where C_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t), as before
(analogous to the gradient for this one training instance)

[Collins 2002]

Like CRFs with stochastic gradient ascent and a Viterbi approximation.

Avoids calculating the partition function (normalizer) Z_o, but uses gradient ascent, not a 2nd-order or conjugate-gradient method.
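A sketch of one training pass of this idea, without the voting/averaging step and with brute-force decoding in place of Viterbi, so it only runs on toy-sized sequences:

```python
from itertools import product

def feature_counts(states, obs, features):
    return [sum(f(states[t - 1], states[t], obs, t)
                for t in range(1, len(states))) for f in features]

def perceptron_epoch(data, state_set, features, weights):
    """One pass of the (unaveraged) structured perceptron: decode each
    instance with the current weights, then add the true sequence's feature
    counts and subtract the predicted sequence's counts. Decoding is brute
    force here; Collins' algorithm uses Viterbi."""
    for obs, true_states in data:
        pred = max(product(state_set, repeat=len(obs)),
                   key=lambda s: sum(w * c for w, c in
                                     zip(weights, feature_counts(s, obs, features))))
        if list(pred) != list(true_states):
            true_c = feature_counts(true_states, obs, features)
            pred_c = feature_counts(pred, obs, features)
            for k in range(len(weights)):
                weights[k] += true_c[k] - pred_c[k]
    return weights
```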
![Page 61: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/61.jpg)
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; require less training data
• Easy to incorporate domain knowledge
• State means only "state of process", vs. "state of process" and "observational history I'm keeping"
![Page 62: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/62.jpg)
Person name Extraction [McCallum 2001, unpublished]
![Page 63: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/63.jpg)
Person name Extraction
![Page 64: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/64.jpg)
Features in Experiment
Capitalized Xxxxx
Mixed Caps XxXxxx
All Caps XXXXX
Initial Cap X….
Contains Digit xxx5
All lowercase xxxx
Initial X
Punctuation .,:;!(), etc
Period .
Comma ,
Apostrophe ‘
Dash -
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc)
In honorific list (Mr, Mrs, Dr, Sen, etc)
In person suffix list (Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list ("J. C. Penny")
In list of company suffixes (Inc, & Associates, Foundation)
Hand-built FSM person-name extractor says yes, (prec/recall ~ 30/95)
Conjunctions of all previous feature pairs, evaluated at the current time step.
Conjunctions of all previous feature pairs, evaluated at current step and one step ahead.
All previous features, evaluated two steps ahead.
All previous features, evaluated one step behind.
Total number of features = ~500k
![Page 65: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/65.jpg)
Training and Testing
• Trained on 65k words from 85 pages, 30 different companies’ web sites.
• Training takes 4 hours on a 1 GHz Pentium.
• Training precision/recall is 96% / 96%.
• Tested on different set of web pages with similar size characteristics.
• Testing precision is 92 – 95%, recall is 89 – 91%.
![Page 66: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/66.jpg)
Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

                Milk Cows and Production of Milk and Milkfat:
                         United States, 1993-95
--------------------------------------------------------------------------------
         :            :      Production of Milk and Milkfat 2/
         :   Number   :-------------------------------------------------------
  Year   :     of     :   Per Milk Cow   : Percentage    :       Total
         :Milk Cows 1/:------------------: of Fat in All :------------------
         :            :  Milk  : Milkfat : Milk Produced :   Milk   : Milkfat
--------------------------------------------------------------------------------
         : 1,000 Head    --- Pounds ---      Percent       Million Pounds
  1993   :   9,589      15,704      575       3.66        150,582    5,514.4
  1994   :   9,500      16,175      592       3.66        153,664    5,623.7
  1995   :   9,461      16,451      602       3.66        155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
![Page 67: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/67.jpg)
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
--------------------------------------------------------------------------------
: : Production of Milk and Milkfat 2/
: Number :-------------------------------------------------------
Year : of : Per Milk Cow : Percentage : Total
:Milk Cows 1/:-------------------: of Fat in All :------------------
: : Milk : Milkfat : Milk Produced : Milk : Milkfat
--------------------------------------------------------------------------------
: 1,000 Head --- Pounds --- Percent Million Pounds
:
1993 : 9,589 15,704 575 3.66 150,582 5,514.4
1994 : 9,500 16,175 592 3.66 153,664 5,623.7
1995 : 9,461 16,451 602 3.66 155,644 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)
[Pinto, McCallum, Wei, Croft, 2003]
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
100+ documents from www.fedstats.gov
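A minimal sketch of computing a few of the listed per-line features from the raw report text (the exact feature definitions here are assumptions):

```python
def line_features(line, prev_line=""):
    """A few of the per-line features listed above, computed directly from
    the raw text of a report line."""
    chars = [c for c in line if not c.isspace()]
    digits = sum(c.isdigit() for c in chars)
    alphas = sum(c.isalpha() for c in chars)
    return {
        "pct_digit": digits / len(chars) if chars else 0.0,
        "pct_alpha": alphas / len(chars) if chars else 0.0,
        "indented": line.startswith("    "),
        "big_gap": "     " in line.strip(),          # 5+ consecutive spaces
        # crude check that a whitespace gap lines up with the previous line
        "aligns_with_prev": any(line[i:i + 2] == "  " == prev_line[i:i + 2]
                                for i in range(max(0, min(len(line), len(prev_line)) - 1))),
    }
```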
![Page 68: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/68.jpg)
Table Extraction Experimental Results
Line labels, percent correct (figure comparing HMM, Stateless MaxEnt, CRF w/out conjunctions, and CRF; values shown: 52%, 65%, 85%, 95%; error reduction = 85%)
[Pinto, McCallum, Wei, Croft, 2003]
![Page 69: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/69.jpg)
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels: Examples:
PER    Yayuk Basuki, Innocent Butare
ORG    3M, KDP, Leicestershire
LOC    Leicestershire, Nirmal Hriday, The Oval
MISC   Java, Basque, 1,000 Lakes Rally

Reuters stories on international news; train on ~300k words
![Page 70: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/70.jpg)
Automatically Induced Features
Index Feature
0 inside-noun-phrase (ot-1)
5 stopword (ot)
20 capitalized (ot+1)
75 word=the (ot)
100 in-person-lexicon (ot-1)
200 word=in (ot+2)
500 word=Republic (ot+1)
711 word=RBI (ot) & header=BASEBALL
1027 header=CRICKET (ot) & in-English-county-lexicon (ot)
1298 company-suffix-word (firstmentiont+2)
4040 location (ot) & POS=NNP (ot) & capitalized (ot) & stopword (ot-1)
4945 moderately-rare-first-name (ot-1) & very-common-last-name (ot)
4474 word=the (ot-2) & word=of (ot)
[McCallum 2003]
![Page 71: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/71.jpg)
Named Entity Extraction Results
Method                                                   F1     # parameters
BBN's IdentiFinder, word features                        79%    ~500k
CRFs, word features, w/out Feature Induction             80%    ~500k
CRFs, many features, w/out Feature Induction             75%    ~3 million
CRFs, many candidate features, with Feature Induction    90%    ~60k
[McCallum & Li, 2003]
![Page 72: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/72.jpg)
Inducing State-Transition Structure [Chidlovskii, 2000]
K-reversible grammars
Structure learning for HMMs + IE [Seymore et al 1999], [Freitag & McCallum 2000]
![Page 73: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/73.jpg)
Limitations of Finite State Models
• Finite state models have a linear structure
• Web documents have a hierarchical structure
  – Are we suffering by not modeling this structure more explicitly?
• How can one learn a hierarchical extraction model?
![Page 74: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/74.jpg)
Tree-based Models
![Page 75: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/75.jpg)
• Extracting from one web site
  – Use site-specific formatting information: e.g., "the JobTitle is a bold-faced paragraph in column 2"
  – For large well-structured sites, like parsing a formal language
• Extracting from many web sites:
  – Need general solutions to entity extraction, grouping into records, etc.
  – Primarily use content information
  – Must deal with a wide range of ways that users present data.
  – Analogous to parsing natural language
• Problems are complementary:
  – Site-dependent learning can collect training data for a site-independent learner
  – Site-dependent learning can boost accuracy of a site-independent learner on selected key sites
![Page 76: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/76.jpg)
![Page 77: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/77.jpg)
Learner
User gives first K positive—and thus many implicit negative examples
![Page 78: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/78.jpg)
![Page 79: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/79.jpg)
STALKER: Hierarchical boundary finding
• Main idea:
  – To train a hierarchical extractor, pose a series of learning problems, one for each node in the hierarchy
  – At each stage, extraction is simplified by knowing about the "context."
[Muslea,Minton & Knoblock 99]
![Page 80: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/80.jpg)
![Page 81: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/81.jpg)
(BEFORE=null, AFTER=(Tutorial,Topics))
(BEFORE=null, AFTER=(Tutorials,and))
![Page 82: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/82.jpg)
(BEFORE=null, AFTER=(<,li,>,))
![Page 83: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/83.jpg)
(BEFORE=(:), AFTER=null)
![Page 84: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/84.jpg)
(BEFORE=(:), AFTER=null)
![Page 85: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/85.jpg)
(BEFORE=(:), AFTER=null)
![Page 86: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/86.jpg)
Stalker: hierarchical decomposition of two web sites
![Page 87: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/87.jpg)
Stalker: summary and results
• Rule format:
  – "landmark automata" format for rules, which extended BWI's format
  – E.g.: <a>W. Cohen</a> CMU: Web IE </li>
    • BWI: BEFORE=(<, /, a, >, ANY, :)
    • STALKER: BEGIN = SkipTo(<, /, a, >), SkipTo(:)
• Top-down rule learning algorithm
  – Carefully chosen ordering between types of rule specializations
• Very fast learning: e.g. 8 examples vs. 274
• A lesson: we often control the IE training data!
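A small sketch of how SkipTo landmark rules of this form can be applied to a token sequence; the tokenization of the example is hypothetical:

```python
def skip_to(tokens, start, landmarks):
    """SkipTo(l1, ..., lk): scan forward from `start` until the landmark
    token sequence is found; return the position just after it, or None."""
    k = len(landmarks)
    for i in range(start, len(tokens) - k + 1):
        if tokens[i:i + k] == list(landmarks):
            return i + k
    return None

def apply_begin_rule(tokens, rule):
    """A begin rule is a sequence of SkipTo calls applied left to right,
    e.g. BEGIN = SkipTo(<, /, a, >), SkipTo(:) from the slide."""
    pos = 0
    for landmarks in rule:
        pos = skip_to(tokens, pos, landmarks)
        if pos is None:
            return None
    return pos

# Hypothetical tokenization of the slide's example: <a>W. Cohen</a> CMU: Web IE </li>
tokens = ["<", "a", ">", "W.", "Cohen", "<", "/", "a", ">",
          "CMU", ":", "Web", "IE", "<", "/", "li", ">"]
print(apply_begin_rule(tokens, [("<", "/", "a", ">"), (":",)]))  # -> 11, the start of "Web IE"
```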
![Page 88: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/88.jpg)
Why low sample complexity is important in “wrapper learning”
At training time, only four examples are available—but one would like to generalize to future pages as well…
![Page 89: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/89.jpg)
“Wrapster”: a hybrid approach to representing wrappers
• Common representations for web pages include:
  – a rendered image
  – a DOM tree (tree of HTML markup & text)
    • gives some of the power of hierarchical decomposition
– a sequence of tokens
– a bag of words, a sequence of characters, a node in a directed graph, . . .
• Questions: – How can we engineer a system to generalize quickly?
– How can we explore representational choices easily?
[Cohen,Jensen&Hurst WWW02]
![Page 90: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/90.jpg)
Example Wrapster predicate
http://wasBang.org/aboutus.html
WasBang.com contact info:
Currently we have offices in two locations:
– Pittsburgh, PA
– Provo, UT
(DOM tree)
html
  head
  body
    p: "WasBang.com .. info:"
    p: "Currently.."
    ul
      li > a: "Pittsburgh, PA"
      li > a: "Provo, UT"
![Page 91: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/91.jpg)
Example Wrapster predicate
Example: p(s1, s2) iff s2 is the sequence of tokens below an li node inside a ul node inside s1.
EXECUTE(p,s1) extracts
– “Pittsburgh, PA”
– “Provo, UT”
http://wasBang.org/aboutus.html
WasBang.com contact info:
Currently we have offices in two locations:
– Pittsburgh, PA
– Provo, UT
![Page 92: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/92.jpg)
Wrapster builders
• Builders are based on simple, restricted languages, for example:
  – L_tagpath: p is defined by tag1,…,tagk, and p_{tag1,…,tagk}(s1, s2) is true iff s1 and s2 correspond to DOM nodes and s2 is reached from s1 by following a path ending in tag1,…,tagk
    • EXECUTE(p_{ul,li}, s1) = {"Pittsburgh, PA", "Provo, UT"}
  – L_bracket: p is defined by a pair of strings (l, r), and p_{l,r}(s1, s2) is true iff s2 is preceded by l and followed by r.
    • EXECUTE(p_{in,locations}, s1) = {"two"}
![Page 93: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/93.jpg)
Wrapster builders
For each language L there is a builder B which implements:
• LGG(positive examples of p(s1,s2)): least general p in L that covers all the positive examples (like pairwise generalization)
  – For L_bracket, the longest common prefix and suffix of the examples (see the sketch below).
• REFINE(p, examples): a set of p's that cover some but not all of the examples.
  – For L_tagpath, extend the path with one additional tag that appears in the examples.
• Builders/languages can be combined:
  – E.g. to construct a builder for (L1 and L2) or (L1 composeWith L2)
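A sketch of the L_bracket builder's LGG and EXECUTE operations, assuming each positive example is given as its (left context, right context) string pair:

```python
import os

def lgg_bracket(examples):
    """LGG for L_bracket: l must immediately precede every example, so it is
    the longest common suffix of the left contexts; r is the longest common
    prefix of the right contexts."""
    lefts = [left for left, _ in examples]
    rights = [right for _, right in examples]
    l = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    r = os.path.commonprefix(rights)
    return l, r

def execute_bracket(page, left, right):
    """EXECUTE(p_{l,r}, page): every substring preceded by l and followed by r."""
    out, i = [], 0
    while True:
        a = page.find(left, i)
        if a < 0:
            break
        b = page.find(right, a + len(left))
        if b < 0:
            break
        out.append(page[a + len(left):b])
        i = b + len(right)
    return out
```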
![Page 94: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/94.jpg)
Wrapster builders - examples
• Compose `tagpaths' and `brackets'
  – E.g., "extract strings between '(' and ')' inside a list item inside an unordered list"
• Compose `tagpaths' and language-based extractors
  – E.g., "extract city names inside the first paragraph"
• Extract items based on position inside a rendered table, or properties of the rendered text
  – E.g., "extract items inside any column headed by text containing the words 'Job' and 'Title'"
  – E.g., "extract items in boldfaced italics"
![Page 95: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/95.jpg)
Wrapster results
[Chart: F1 vs. number of training examples]
![Page 96: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/96.jpg)
Broader Issues in IE
![Page 97: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/97.jpg)
Broader View
[Pipeline diagram: Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Data mine / Query, Search; supporting tasks: Filter by relevance, Label training data, Train extraction models, Create ontology]
Up to now we have been focused on segmentation and classification
![Page 98: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/98.jpg)
Broader View
[Pipeline diagram repeated, with the IE step expanded to Tokenize, Segment, Classify, Associate, Cluster and five numbered stages marking the issues discussed next]
Now touch on some other issues
![Page 99: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/99.jpg)
(1) Association as Binary Classification
[Zelenko et al, 2002]
Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.
Person-Role (Christos Faloutsos, KDD 2003 General Chair) NO
Person-Role ( Ted Senator, KDD 2003 General Chair) YES
Person Person Role
Do this with SVMs and tree kernels over parse trees.
![Page 100: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/100.jpg)
(1) Association with Finite State Machines[Ray & Craven, 2001]
… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …
DET this / N enzyme / N ubc6 / V localizes / PREP to / ART the / ADJ endoplasmic / N reticulum / PREP with / ART the / ADJ catalytic / N domain / V facing / ART the / N cytosol
⇒ Subcellular-localization(UBC6, endoplasmic reticulum)
![Page 101: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/101.jpg)
(1) Association using Parse Tree [Miller et al 2000]
Simultaneously POS tag, parse, extract & associate!
Increase the space of parse constituents to include entity and relation tags.

Notation:
  c_h : head constituent category
  c_m : modifier constituent category
  X_p : X of parent node
  t   : POS tag
  w   : word

Parameters, with examples:
  P(c_h | c_p)                        e.g. P(vp | s)
  P(c_m | c_p, c_hp, c_{m-1}, w_p)    e.g. P(per/np | s, vp, null, said)
  P(t_m | c_m, t_h, w_h)              e.g. P(per/nnp | per/np, vbd, said)
  P(w_m | c_m, t_m, t_h, w_h)         e.g. P(nance | per/np, per/nnp, vbd, said)
(This is also a great exampleof extraction using a tree model.)
![Page 102: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/102.jpg)
(1) Association with Graphical Models [Roth & Yih 2002]
Capture arbitrary-distance dependencies among predictions.
Local language models contribute evidence to entity classification.
Local language models contribute evidence to relation classification.
Random variable over the class of entity #2, e.g. over {person, location, …}
Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}
Dependencies between classes of entities and relations!
Inference with loopy belief propagation.
![Page 103: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/103.jpg)
(1) Association with Graphical Models [Roth & Yih 2002]
Also capture long-distance dependencies among predictions.
Local language models contribute evidence to entity classification.
Random variable over the class of entity #1, e.g. over {person, location, …}
Local language models contribute evidence to relation classification.
Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}
Dependencies between classes of entities and relations!
Inference with loopy belief propagation.
[Figure labels: person?, person, lives-in]
![Page 104: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/104.jpg)
(1) Association with Graphical Models [Roth & Yih 2002]
Also capture long-distance dependencies among predictions.
Local language models contribute evidence to entity classification.
Random variable over the class of entity #1, e.g. over {person, location, …}
Local language models contribute evidence to relation classification.
Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}
Dependencies between classes of entities and relations!
Inference with loopy belief propagation.
[Figure labels: location, person, lives-in]
![Page 105: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/105.jpg)
(1) Association with “Grouping Labels”
• Create a simple language that reflects a field’s relation to other fields
• Language represents ability to define:
  – Disjoint fields
  – Shared fields
  – Scope
• Create rules that use field labels
[Jensen & Cohen, 2001]
![Page 106: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/106.jpg)
(1) Grouping labels: A simple example
Kites
Buy a kite
Box Kite   $100
Stunt Kite $300

Name: Box Kite
Company: -
Location: -
Order: -
Cost: $100
Description: -
Color: -
Size: -
Next:Name:recordstart
![Page 107: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/107.jpg)
(2) Grouping labels: A messy example
Kites
Buy a kite
Box Kite   $100
Stunt Kite $300

Box Kite
Great for kids
Detailed specs

Specs
Color: blue
Size: small

Name: Box Kite
Company: -
Location: -
Order: -
Cost: $100
Description: Great for kids
Color: blue
Size: small
prevlink:Cost
next:Name:recordstart
pagetype:Product
![Page 108: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/108.jpg)
(2) User interface: adding labels to extracted fields
![Page 109: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/109.jpg)
(1) Experimental Evaluation of Grouping Labels
[Pie chart of label usage: next 84%, pagetype 10%, prevlink 4%, prev 1%, nextlink 1%]
Fixed language, then wrapped 499 new sites—all of which could be handled.
![Page 110: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/110.jpg)
Broader View
[Pipeline diagram repeated]
Now touch on some other issues; next: Object Consolidation
![Page 111: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/111.jpg)
(2) Learning a Distance Metric Between Records[Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003]
Learn Pr({duplicate, not-duplicate} | record1, record2) with a Maximum Entropy classifier.
Do greedy agglomerative clustering, using this probability as a distance metric.
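A minimal sketch of that clustering step, assuming the pairwise model is already trained and wrapped in a function p_duplicate(r1, r2) returning Pr(duplicate); the single-link merge rule and all names here are our simplifications, not the cited systems.

```python
def cluster_records(records, p_duplicate, threshold=0.5):
    """Greedy agglomerative clustering: repeatedly merge the pair of clusters whose
    closest records have the highest duplicate probability above `threshold`."""
    clusters = [[r] for r in records]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link score between two clusters: best pairwise probability.
                score = max(p_duplicate(a, b) for a in clusters[i] for b in clusters[j])
                if score > threshold and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
```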
![Page 112: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/112.jpg)
(2) String Edit Distance
• distance(“William Cohen”, “Willliam Cohon”)
s:    W I L L   I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2
(alignment: C = copy, S = substitute, I = insert)
![Page 113: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/113.jpg)
(2) Computing String Edit Distance
D(i,j) = min( D(i-1,j-1) + d(s_i,t_j),   // subst/copy
              D(i-1,j) + 1,              // insert
              D(i,j-1) + 1 )             // delete
C O H E N
M 1 2 3 4 5
C 1 2 3 4 5
C 2 3 3 4 5
O 3 2 3 4 5
H 4 3 2 3 4
N 5 4 3 3 3
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
(learn these parameters)
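The recurrence above translates directly into a short dynamic program; the sketch below uses fixed unit costs where a learned model would plug in its parameters, and returns both the distance and one traced-back alignment (ties may yield a different, equally good alignment).

```python
def edit_distance(s, t, sub_cost=1, gap_cost=1):
    """Fill the DP table D and trace back one best alignment as a string of ops."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap_cost
    for j in range(1, n + 1):
        D[0][j] = j * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j - 1] + d,        # substitute / copy
                          D[i - 1][j] + gap_cost,     # delete
                          D[i][j - 1] + gap_cost)     # insert
    ops, i, j = [], m, n          # trace where each minimum came from
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost):
            ops.append("C" if s[i - 1] == t[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap_cost:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return D[m][n], "".join(reversed(ops))

print(edit_distance("William Cohen", "Willliam Cohon"))   # cost 2: one insert, one substitution
```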
![Page 114: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/114.jpg)
(2) String Edit Distance Learning
Precision/recall for MAILING dataset duplicate detection
[Bilenko & Mooney, 2002, 2003]
![Page 115: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/115.jpg)
(2) Information Integration
Goal might be to merge results of two IE systems:
Name: Introduction to Computer Science
Number: CS 101
Teacher: M. A. Kludge
Time: 9-11am
Name: Data Structures in Java
Room: 5032 Wean Hall
Title: Intro. to Comp. Sci.
Num: 101
Dept: Computer Science
Teacher: Dr. Klüdge
TA: John Smith
Topic: Java Programming
Start time: 9:10 AM
[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],[Richardson & Domingos 2003]
![Page 116: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/116.jpg)
(2) Two further Object Consolidation Issues
• Efficiently clustering large data sets by pre-clustering with a cheap distance metric (a hybrid of string-edit distance and term-based distances)
  – [McCallum, Nigam & Ungar, 2000]
• Don’t simply merge greedily: capture dependencies among multiple merges.
  – [Cohen, MacAllister, Kautz, KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]
![Page 117: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/117.jpg)
Relational Identity Uncertainty with Probabilistic Relational Models (PRMs)
[Russell 2001], [Pasula et al 2002], [Marthi, Milch, Russell 2003]
[Plate diagram: mention-level variables (id, words, context, distance, fonts) linked to object-level attributes (id, surname, age, gender), with plates of size N]
(Applied to citation matching, and object correspondence in vision)
![Page 118: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/118.jpg)
A Conditional Random Field for Co-reference
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
Y
N
N
P(y|x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij)  +  Σ_{i,j,k} Σ_l' λ'_l' f'_l'(y_ij, y_jk, y_ik) )

[Figure: pairwise coreference variables y_ij over the three mentions, labeled Y/N, with example edge scores 45, 30, 11]
[McCallum & Wellner, 2003]
![Page 119: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/119.jpg)
. . . Powell . . .
Inference in these CRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

log P(y|x) ∝ Σ_{i,j within partitions} Σ_l λ_l f_l(x_i, x_j, y_ij)  −  Σ_{i,j across partitions} Σ_l λ_l f_l(x_i, x_j, y_ij)

[Figure: mention graph with edge weights 45, 11, 30, 134, 10, 106; for the partition shown, the objective = 22]
. . . Condoleezza Rice . . .
. . . she . . .
. . . Mr. Powell . . .
![Page 120: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/120.jpg)
Inference in these CRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
. . . Condoleezza Rice . . .
. . . she . . .
. . . Powell . . . . . . Mr. Powell . . .
log P(y|x) ∝ Σ_{i,j within partitions} Σ_l λ_l f_l(x_i, x_j, y_ij)  −  Σ_{i,j across partitions} Σ_l λ_l f_l(x_i, x_j, y_ij)

[Figure: same mention graph and edge weights (45, 11, 30, 134, 10, 106); for this alternative partition, the objective = 314]
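To make the objective concrete, here is a toy greedy sketch that scores a partition by within-cluster minus across-cluster weights and merges clusters while the score improves; it only illustrates the objective above, not the cited graph-partitioning algorithms, and the edge weights are hypothetical.

```python
from itertools import combinations

def objective(partition, w):
    """Sum of edge weights within clusters minus sum of edge weights across clusters."""
    score = 0.0
    for (a, b), weight in w.items():
        same = any(a in cluster and b in cluster for cluster in partition)
        score += weight if same else -weight
    return score

def greedy_partition(mentions, w):
    """Start from singletons and greedily merge pairs of clusters while the objective improves."""
    partition = [{m} for m in mentions]
    improved = True
    while improved:
        improved = False
        base = objective(partition, w)
        for i, j in combinations(range(len(partition)), 2):
            merged = [c for k, c in enumerate(partition) if k not in (i, j)]
            merged.append(partition[i] | partition[j])
            if objective(merged, w) > base:
                partition, improved = merged, True
                break
    return partition

# Hypothetical pairwise scores (positive = evidence of coreference).
w = {("Mr Powell", "Powell"): 45.0, ("Mr Powell", "she"): -30.0, ("Powell", "she"): 11.0}
print(greedy_partition(["Mr Powell", "Powell", "she"], w))   # [{'Mr Powell', 'Powell'}, {'she'}]
```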
![Page 121: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/121.jpg)
Broader View
[Pipeline diagram repeated]
Now touch on some other issues
![Page 122: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/122.jpg)
(3) Automatically Inducing an Ontology[Riloff, ‘95]
Heuristic “interesting” meta-patterns.
(1) (2)
Two inputs:
![Page 123: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/123.jpg)
(3) Automatically Inducing an Ontology[Riloff, ‘95]
Subject/Verb/Object patterns that occur more often in the relevant documents than in the irrelevant ones.
![Page 124: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/124.jpg)
Broader View
[Pipeline diagram repeated]
Now touch on some other issues
![Page 125: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/125.jpg)
(4) Training IE Models using Unlabeled Data
[Collins & Singer, 1999]
See also [Brin 1998], [Blum & Mitchell 1998], [Riloff & Jones 1999]
…says Mr. Cooper, a vice president of …
Use two independent sets of features:
Contents: full-string=Mr._Cooper, contains(Mr.), contains(Cooper)
Context:  context-type=appositive, appositive-head=president
NNP NNP appositive phrase, head=president
1. Start with just seven seed rules and ~1M sentences of NYTimes:
   full-string=New_York     → Location
   full-string=California   → Location
   full-string=U.S.         → Location
   contains(Mr.)            → Person
   contains(Incorporated)   → Organization
   full-string=Microsoft    → Organization
   full-string=I.B.M.       → Organization
2. Alternately train & label using each feature set.
3. Obtain 83% accuracy at finding person, location, organization & other in appositives and prepositional phrases!
Consider just appositives and prepositional phrases...
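A schematic sketch of the alternating loop in step 2 above. It approximates the idea with simple precision-thresholded rules over two feature views ("contents" and "context"); the real DL-CoTrain algorithm of Collins & Singer uses confidence-ranked decision lists, so treat every name and threshold here as our own simplification.

```python
from collections import Counter, defaultdict

def apply_rules(rules, examples, view):
    """Label an example if any rule's feature appears among its features for `view`."""
    labeled = []
    for ex in examples:
        for feat, label in rules:
            if feat in ex[view]:
                labeled.append((ex, label))
                break
    return labeled

def induce_rules(labeled, view, min_count=2, min_precision=0.95):
    """Keep (feature -> label) rules from `view` that are frequent and nearly unambiguous."""
    counts = defaultdict(Counter)
    for ex, label in labeled:
        for feat in ex[view]:
            counts[feat][label] += 1
    rules = []
    for feat, label_counts in counts.items():
        label, count = label_counts.most_common(1)[0]
        if count >= min_count and count / sum(label_counts.values()) >= min_precision:
            rules.append((feat, label))
    return rules

def cotrain(examples, seed_rules, rounds=5):
    """Alternately label with one feature view and induce new rules over the other."""
    spelling_rules, context_rules = list(seed_rules), []
    for _ in range(rounds):
        context_rules = induce_rules(apply_rules(spelling_rules, examples, "contents"), "context")
        spelling_rules = induce_rules(apply_rules(context_rules, examples, "context"), "contents")
        spelling_rules = list(set(spelling_rules) | set(seed_rules))
    return spelling_rules, context_rules

# examples: dicts such as {"contents": {"contains(Mr.)", ...}, "context": {"appositive-head=president", ...}}
# seed_rules: [("contains(Mr.)", "Person"), ("full-string=Microsoft", "Organization"), ...]
```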
![Page 126: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/126.jpg)
Broader View
[Pipeline diagram repeated]
Now touch on some other issues
![Page 127: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/127.jpg)
(5) Data Mining: Working with IE Data
• Some special properties of IE data:
  – It is based on extracted text
  – It is “dirty” (missing or extraneous facts, improperly normalized entity names, etc.)
  – May need cleaning before use
• What operations can be done on dirty, unnormalized databases?
  – Data-mine it directly.
  – Query it directly with a language that has “soft joins” across similar, but not identical, keys [Cohen 1998] (see the sketch below).
  – Use it to construct features for learners [Cohen 2000].
  – Infer a “best” underlying clean database [Cohen, Kautz, MacAllester, KDD 2000].
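A minimal illustration of such a "soft join" on dirty keys, using token-set Jaccard similarity as a cheap stand-in for the TFIDF-based similarity of WHIRL [Cohen 1998]; the tables, threshold, and function names are hypothetical.

```python
import re

def jaccard(a, b):
    """Token-set Jaccard similarity; a crude stand-in for TFIDF/cosine similarity."""
    ta, tb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def soft_join(left, right, key_left, key_right, threshold=0.5):
    """Join rows whose similar (but not necessarily identical) keys exceed the threshold."""
    return [(l, r, jaccard(l[key_left], r[key_right]))
            for l in left for r in right
            if jaccard(l[key_left], r[key_right]) >= threshold]

jobs = [{"employer": "WasBang Inc."}, {"employer": "FoodScience.com"}]
contacts = [{"company": "WasBang, Inc"}, {"company": "Acme Corp."}]
print(soft_join(jobs, contacts, "employer", "company"))
# [({'employer': 'WasBang Inc.'}, {'company': 'WasBang, Inc'}, 1.0)]
```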
![Page 128: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/128.jpg)
(5) Data Mining: Mutually supportive IE and Data Mining [Nahm & Mooney, 2000]
1. Extract a large database.
2. Learn rules to predict the value of each field from the other fields.
3. Use these rules to increase the accuracy of IE (a toy sketch of rule application follows the rules below).
Example DB record Sample Learned Rules
platform:AIX & !application:Sybase & application:DB2 → application:Lotus Notes
language:C++ & language:C & application:Corba & title=SoftwareEngineer → platform:Windows
language:HTML & platform:WindowsNT & application:ActiveServerPages → area:Database
language:Java & area:ActiveX & area:Graphics → area:Web
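One way such rules can feed back into extraction is to propose values for fields the extractor missed. The toy sketch below applies rules of the form shown above to a partially filled record; the rule encoding and helper names are ours, not the cited system's.

```python
# Each rule: (conditions, conclusion); a condition is (field, value, must_be_present).
RULES = [
    ([("platform", "AIX", True), ("application", "Sybase", False), ("application", "DB2", True)],
     ("application", "Lotus Notes")),
    ([("language", "HTML", True), ("platform", "WindowsNT", True), ("application", "ActiveServerPages", True)],
     ("area", "Database")),
]

def apply_field_rules(record, rules):
    """Add each rule's conclusion when its conditions hold; record maps field -> set of values."""
    for conditions, (field, value) in rules:
        if all((val in record.get(f, set())) == present for f, val, present in conditions):
            record.setdefault(field, set()).add(value)
    return record

job = {"platform": {"AIX"}, "application": {"DB2"}}
print(apply_field_rules(job, RULES))
# {'platform': {'AIX'}, 'application': {'DB2', 'Lotus Notes'}}
```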
![Page 129: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/129.jpg)
(5) Working with IE Data
• Association rule mining using IE data
• Classification using IE data
![Page 130: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/130.jpg)
[Cohen, ICML2000]
Idea: do very lightweight “site wrapping” of relevant pages
Make use of (partial, noisy) wrappers
![Page 131: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/131.jpg)
![Page 132: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/132.jpg)
![Page 133: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/133.jpg)
![Page 134: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/134.jpg)
![Page 135: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/135.jpg)
(5) Working with IE Data
• Association rule mining using IE data
• Classification using IE data
– Many features based on lists, tables, etc are proposed
– The learner filters these features and decides which to use in a classifier
– How else can proposed structures be filtered?
![Page 136: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/136.jpg)
(5) Finding “pretty good” wrappers without site-specific training data
• Local structure in extraction
  – Assume a set of “seed examples” similar to the field to be extracted.
  – Identify small possible wrappers (e.g., simple tagpaths).
  – Use semantic information to evaluate each wrapper:
    • the average minimum TFIDF distance to a known positive “seed example”, over all extracted strings
  – Adopt the best single tagpath (see the sketch below).
• Results on 84 pre-wrapped page types (Cohen, AAAI-99)
  – 100% equivalent to the target wrapper 80% of the time.
  – A more conventional learning approach: 100% equivalent to the target wrapper 50% of the time (Cohen & Fan, WWW-99).
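A toy sketch of that scoring idea: rank candidate tagpaths by how close their extracted strings are, on average, to the nearest seed example, then keep the best one. Jaccard token distance stands in for TFIDF distance, and the candidate lists and names are hypothetical.

```python
import re

def token_distance(a, b):
    """1 - Jaccard token overlap; a cheap stand-in for TFIDF distance."""
    ta, tb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    return 1.0 - (len(ta & tb) / len(ta | tb) if ta | tb else 0.0)

def score_tagpath(extracted, seeds):
    """Average, over extracted strings, of the distance to the closest seed example."""
    return sum(min(token_distance(x, s) for s in seeds) for x in extracted) / len(extracted)

def best_tagpath(candidates, seeds):
    """candidates: {tagpath: strings it extracts}; adopt the lowest-scoring path."""
    return min(candidates, key=lambda path: score_tagpath(candidates[path], seeds))

seeds = ["Pittsburgh, PA", "Provo, UT"]
candidates = {("ul", "li"): ["Pittsburgh, PA", "Provo, UT"],
              ("p",): ["WasBang.com contact info:", "Currently we have offices in two locations:"]}
print(best_tagpath(candidates, seeds))   # ('ul', 'li')
```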
![Page 137: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/137.jpg)
![Page 138: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/138.jpg)
![Page 139: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/139.jpg)
![Page 140: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/140.jpg)
![Page 141: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/141.jpg)
(5) Working with IE Data
• Association rule mining using IE data
• Classification using IE data
– Many features based on lists, tables, etc are proposed
– The learner filters these features and decides which to use in a classifier
– How else can proposed structures be filtered?– How else can structures be proposed?
![Page 142: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/142.jpg)
builder
predicate
List1
Task: classify links as to whether they point to an “executive biography” page
![Page 143: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/143.jpg)
builder
predicate
List2
![Page 144: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/144.jpg)
builder
predicate
List3
![Page 145: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/145.jpg)
Features extracted:
{ List1, List3,…},
{ List1, List2, List3,…}, { List2, List 3,…},
{ List2, List3,…},
…
![Page 146: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/146.jpg)
Experimental results
[Chart: error (0–0.25) on nine problems for Winnow and D-Tree learners, with builder features vs. none; annotations mark problems where builder features hurt and where they give no improvement]
Error reduced by almost half on average
![Page 147: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/147.jpg)
Learning Formatting Patterns “On the Fly”: “Scoped Learning” for IE
[Blei, Bagnell, McCallum, 2002][Taskar, Wong, Koller 2003]
Formatting is regular on each site, but there are too many different sites to wrap.
Can we get the best of both worlds?
![Page 148: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/148.jpg)
Scoped Learning Generative Model
1. For each of the D documents:
   a) Generate the document’s multinomial formatting-feature parameters (the local, per-document parameters) from their prior.
2. For each of the N words in the document:
   a) Generate the nth category c_n from p(c_n).
   b) Generate the nth word (global feature) from p(w_n | c_n, global parameters).
   c) Generate the nth formatting feature (local feature) from p(f_n | c_n, local parameters).

[Plate diagram: category c generates word w and formatting feature f; plates over the N words and the D documents]
Global Extractor: Precision = 46%, Recall = 75%
![Page 150: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/150.jpg)
Scoped Learning Extractor: Precision = 58%, Recall = 75%; error reduced by 22%
![Page 151: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/151.jpg)
Wrap-up
![Page 152: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/152.jpg)
IE Resources
• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC)
    • Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
  – http://www.cs.umass.edu/~mccallum/data
• Code
  – TextPro, http://www.ai.sri.com/~appelt/TextPro
  – MALLET, http://www.cs.umass.edu/~mccallum/mallet
  – SecondString, http://secondstring.sourceforge.net/
• Both
  – http://www.cis.upenn.edu/~adwait/penntools.html
![Page 153: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/153.jpg)
Where from Here?
• Science
  – Higher accuracy, integration with data mining.
  – Relational learning, minimizing labeled-data needs, unified models of all four of IE’s components.
  – Multi-modal IE: text, images, video, audio. Multi-lingual.
• Profit
  – SRA, Inxight, Fetch, Mohomine, Cymfony, … you?
  – Bio-informatics, intelligent tutors, information overload, anti-terrorism.
• Fun
  – Search engines that return “things” instead of “pages” (people, companies, products, universities, courses…).
  – New insights by mining previously untapped knowledge.
![Page 154: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/154.jpg)
Thank you!
More information:
William Cohen: http://www.cs.cmu.edu/~wcohen
Andrew McCallum: http://www.cs.umass.edu/~mccallum
![Page 155: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/155.jpg)
References
• [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.
• [Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of the Eleventh International World Wide Web Conference (WWW-2002).
• [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
• [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
• [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
• [Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
• [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
• [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
• [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
• [Freitag, 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
• [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
• [Freitag & Kushmerick, 1999] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Freitag & McCallum 1999] Freitag, D.; and McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
• [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118 (pp. 15-68).
• [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in Proceedings of ICML-2001.
• [Leek 1997] Leek, T. R.: Information extraction using hidden Markov models. Master’s thesis, UC San Diego.
• [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation, in Proceedings of ICML-2000.
• [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.
![Page 156: Information Extraction from the World Wide Web](https://reader037.fdocuments.in/reader037/viewer/2022110211/56813087550346895d966306/html5/thumbnails/156.jpg)
References
• [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of Autonomous Agents-99.
• [Muslea et al, 2000] Muslea, I.; Minton, S.; and Knoblock, C. Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.
• [Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 627--632, Austin, TX.
• [Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural Information Processing Systems 13.
• [Ratnaparkhi 1996] Ratnaparkhi, A., A maximum entropy part-of-speech tagger, in Proc. Empirical Methods in Natural Language Processing Conference, p133-141.
• [Ray & Craven 2001] Ray, S.; and Craven, M. Representing Sentence Structure in Hidden Markov Models for Information Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann.
• [Soderland 1997]: Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).
• [Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1/3):233-277.