Information Extraction
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita
Information Extraction (IE) & Integration

The extraction task: Given
– E: a set of structured elements
– S: an unstructured source
extract all instances of E from S

• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications
• Classical Named Entity Recognition
  – Extract person, location, and organization names

"According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms."
IE from free format text
Several applications:
– News tracking
  • Monitor events
– Bio-informatics
  • Protein and gene names from publications
– Customer care
  • Part number, problem description from emails in help centers
Problem definition
Source: concatenation of structured elements with limited reordering and some missing fields
– Examples: addresses, bibliographic records

Address example:
| House number | Building | Road | Area | City | Zip |
| 156 | Hillside ctype | Scenic drive | Powai | Mumbai | 400076 |

Bibliographic example:
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
Labeled elements: Author | Year | Title | Journal | Volume | Page
Relation Extraction: Disease Outbreaks
• Extract structured relations from text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
| Date | Disease Name | Location |
| Jan. 1995 | Malaria | Ethiopia |
| July 1995 | Mad Cow Disease | U.K. |
| Feb. 1995 | Pneumonia | U.S. |
| May 1995 | Ebola | Zaire |
[Figure: an Information Extraction system (e.g., NYU's Proteus) maps disease-outbreak stories in The New York Times to the structured table above.]
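As a toy illustration of the task, here is a minimal, naive Python sketch that pulls one (date, disease, location) tuple from the example sentence with hand-written patterns and a hypothetical disease dictionary; a real system such as Proteus uses far richer linguistic analysis.

    import re

    text = ("May 19 1995, Atlanta -- The Centers for Disease Control and "
            "Prevention, which is in the front line of the world's response "
            "to the deadly Ebola epidemic in Zaire, is finding itself hard "
            "pressed to cope with the crisis...")

    DATE = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\.?\s+\d{1,2},?\s+\d{4}"
    DISEASES = ["Ebola", "Malaria", "Pneumonia"]   # assumed dictionary

    date = re.search(DATE, text)
    disease = next((d for d in DISEASES if d in text), None)
    # Grab the capitalized word after "<disease> epidemic in".
    loc = re.search(rf"{disease}\s+epidemic\s+in\s+([A-Z]\w+)", text)
    print((date.group(0), disease, loc.group(1)))
    # -> ('May 19 1995', 'Ebola', 'Zaire')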
Information Extraction on the web
Personal Information Systems
– Automatically add a BibTeX entry for a paper I download
– Integrate a resume received in email with the candidates database

[Figure: a personal information system linking People, Papers, Projects, Emails, Resumes, and Files gathered from the Web.]
Hand-Coded Methods
• Easy to construct in many cases
  – e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
• Easier to debug & maintain
  – Especially if written in a "high-level" language (as is usually the case), e.g. (sketched in Python below):

    ContactPattern ← RegularExpression(Email.body, "can be reached at")
    PersonPhone ← Precedes(Person, Precedes(ContactPattern, Phone, D), D)
    [From Avatar]

• Easier to incorporate / reuse domain knowledge
• Can be quite labor intensive to write
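To make the pattern language concrete, here is a minimal Python sketch of how such rules compose; regular_expression and precedes are our own hypothetical stand-ins, not Avatar's actual operators.

    import re

    def regular_expression(text, pattern):
        # All (start, end) character spans in `text` matching `pattern`.
        return [m.span() for m in re.finditer(pattern, text)]

    def precedes(left_spans, right_spans, D):
        # Pairs (l, r) where span l ends at most D characters before span r starts.
        return [(l, r) for l in left_spans for r in right_spans
                if 0 <= r[0] - l[1] <= D]

    email_body = "Sunita can be reached at 555-1234 today."
    contact = regular_expression(email_body, r"can be reached at")
    phone = regular_expression(email_body, r"\d{3}-\d{4}")
    print(precedes(contact, phone, D=5))   # -> [((7, 24), (25, 33))]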
Example of a Hand-Coded Entity Tagger [Ramakrishnan G., 2005; slides from Doan et al., SIGMOD 2006]

Rule 1: finds person names with a salutation (e.g., Dr. Laura Haas) followed by two capitalized words:
  <token>INITIAL</token> <token>DOT</token> <token>CAPSWORD</token> <token>CAPSWORD</token>

Rule 2: finds person names where two capitalized words are present in a Person dictionary:
  <token>PERSONDICT, CAPSWORD</token> <token>PERSONDICT, CAPSWORD</token>

Token classes:
  CAPSWORD: a word starting with an uppercase letter whose second letter is lowercase; e.g., "DeWitt" satisfies it, "DEWITT" does not: \p{Upper}\p{Lower}[\p{Alpha}]{1,25}
  DOT: the character '.'
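As an illustration, a minimal Python sketch of Rule 1 as a single regular expression; the CAPSWORD class follows the definition above, while the salutation list is an assumption (the slide shows only "Dr."):

    import re

    CAPSWORD = r"[A-Z][a-z][A-Za-z]{1,25}"     # "DeWitt" matches, "DEWITT" does not
    SALUTATION = r"(?:Dr|Mr|Ms|Mrs|Prof)"      # assumed contents of the INITIAL class
    RULE1 = re.compile(rf"\b{SALUTATION}\.\s+({CAPSWORD})\s+({CAPSWORD})")

    text = "Please welcome Dr. Laura Haas to the panel."
    for m in RULE1.finditer(text):
        print("Person:", m.group(1), m.group(2))   # -> Person: Laura Haas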
Hand-Coded Rule Example: Conference Name

    # These are subordinate patterns
    my $wordOrdinals = "(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
    my $numberOrdinals = "(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
    my $ordinals = "(?:$wordOrdinals|$numberOrdinals)";
    my $confTypes = "(?:Conference|Workshop|Symposium)";
    my $words = "(?:[A-Z]\\w+\\s*)";  # a word starting with a capital letter, ending with 0 or more spaces
    my $confDescriptors = "(?:international\\s+|[A-Z]+\\s+)";  # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
    my $connectors = "(?:on|of)";
    my $abbreviations = "(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))";  # conference abbreviations like "(SIGMOD'06)"

    # The actual pattern we search for. A typical conference name this pattern will find is
    # "3rd International Conference on Blah Blah Blah (ICBBB-05)"
    my $fullNamePattern = "((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

    # Given a <dbworldMessage>, look for the conference pattern
    lookForPattern($dbworldMessage, $fullNamePattern);

    # In a given <file>, look for occurrences of <pattern>
    # <pattern> is a regular expression
    sub lookForPattern {
        my ($file, $pattern) = @_;
Some Hand-Coded Entity Taggers
• FRUMP [DeJong 82]
• CIRCUS / AutoSlog [Riloff 93]
• SRI FASTUS [Appelt, 1996]
• MITRE Alembic (available for use)
• Alias-I LingPipe (available for use)
• OSMX [Embley, 2005]
• DBLife [Doan et al, 2006]
• Avatar [Jayram et al, 2006]
Learning models for extraction
• Rule-based extractors
  – For each label, build two classifiers, one accepting each of its two boundaries (start and end).
  – Each classifier: a sequence of rules
    • Each rule: a conjunction of predicates
    • E.g.: if the previous token is a last name, the current token is ".", and the next token is an article → start of title (see the sketch after this list)
  – Examples: Rapier, GATE, LP2 & several more
• Critique of rule-based approaches
  – Cannot output meaningful uncertainty values
  – Brittle
  – Limited flexibility in the clues that can be exploited
  – Not good at combining several weak clues
  – (Pro) Somewhat easier to tune
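A minimal sketch of one such start-of-title rule, with hypothetical hand-written predicates (a rule learner such as Rapier or LP2 would induce rules of this shape from training data):

    def is_last_name(tok):
        # Assumed dictionary lookup; a real system would use a name gazetteer.
        return tok in {"Singh", "Haas", "Dordick"}

    def is_article(tok):
        return tok.lower() in {"a", "an", "the"}

    def title_starts_after(tokens, i):
        """True if a title starts at position i+1: previous token a last
        name, current token '.', next token an article."""
        return (i >= 1 and is_last_name(tokens[i - 1])
                and tokens[i] == "." and i + 1 < len(tokens)
                and is_article(tokens[i + 1]))

    tokens = "S. Singh . The Code Book .".split()
    print([i + 1 for i in range(len(tokens)) if title_starts_after(tokens, i)])
    # -> [3]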
Statistical models of IE
• Generative models like HMMs
  – Intuitive
  – Very restricted feature sets → lower accuracy
  – Output probabilities are highly skewed (the counterpart of naïve Bayes)
• Conditional (discriminative) models
  – Local models: maximum entropy models
  – Global models: Conditional Random Fields
  – Conditional models output meaningful probabilities, support flexible features, generalize well, and are getting increasingly popular: the state of the art!
IE with Hidden Markov Models
• Probabilistic models for IE
[Figure: an HMM for citation segmentation with states Author, Title, Journal, Year; arcs carry transition probabilities (values such as 0.9, 0.8, 0.5, 0.2, 0.1), and each state carries an emission distribution over symbols, e.g., the Year state emits "dddd" with probability 0.8 and "dd" with probability 0.2.]
HMM Structure
• Naïve model: one state per element
• Nested model: each element is itself an HMM
HMM Dictionary
• For each word (= feature), associate the probability of emitting that word
  – Multinomial model
• More advanced models allow overlapping features of a word, for example:
  – part of speech
  – capitalized or not
  – type: number, letter, word, etc.
• Maximum entropy models (McCallum 2000)
Learning model parameters
• When the training data defines a unique path through the HMM (both estimates are sketched in code below):
  – Transition probabilities:
    Pr(state i → state j) = (number of transitions from i to j) / (total transitions out of state i)
  – Emission probabilities:
    Pr(state i emits symbol k) = (number of times k is generated from i) / (number of transitions from i)
• When the training data defines multiple paths:
  – A more general EM-like algorithm (Baum-Welch)
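A minimal Python sketch of these counting estimates for the unique-path case; the (word, state) pair encoding of training sequences is our own convention:

    from collections import Counter, defaultdict

    def train_hmm(sequences):
        # sequences: lists of (word, state) pairs with the state path observed.
        trans, emit = defaultdict(Counter), defaultdict(Counter)
        for seq in sequences:
            for (_, s), (_, s_next) in zip(seq, seq[1:]):
                trans[s][s_next] += 1          # count transitions i -> j
            for w, s in seq:
                emit[s][w] += 1                # count emissions of k from i
        A = {i: {j: n / sum(c.values()) for j, n in c.items()}
             for i, c in trans.items()}
        B = {i: {w: n / sum(c.values()) for w, n in c.items()}
             for i, c in emit.items()}
        return A, B

    seq = [("115", "House"), ("Grant", "Road"), ("street", "Road"),
           ("Mumbai", "City"), ("400070", "Pin")]
    A, B = train_hmm([seq])
    print(A["Road"])   # {'Road': 0.5, 'City': 0.5}
    print(B["Road"])   # {'Grant': 0.5, 'street': 0.5}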
Using the HMM to segment
• Find the highest-probability path through the HMM
• Viterbi: a dynamic programming algorithm, quadratic in the number of states (sketched below, after the figure)
[Figure: Viterbi trellis for segmenting "115 Grant street Mumbai 400070" over the states House, Road, City, Pin; each column corresponds to one observed token o_t, and the highest-probability path through the trellis gives the segmentation.]
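A minimal Viterbi sketch in log space, reusing the A, B tables from the training sketch above; flooring unseen probabilities at 1e-6 is our assumption. Each token costs |states|^2 work, which is the quadratic cost mentioned above.

    import math

    def viterbi(words, states, A, B, start):
        FLOOR = 1e-6                      # assumed smoothing for unseen events
        def lp(p): return math.log(max(p, FLOOR))
        V = [{s: lp(start.get(s, 0)) + lp(B.get(s, {}).get(words[0], 0))
              for s in states}]
        back = []
        for w in words[1:]:
            prev, col, ptr = V[-1], {}, {}
            for s in states:
                r = max(states, key=lambda r: prev[r] + lp(A.get(r, {}).get(s, 0)))
                col[s] = (prev[r] + lp(A.get(r, {}).get(s, 0))
                          + lp(B.get(s, {}).get(w, 0)))
                ptr[s] = r
            V.append(col)
            back.append(ptr)
        path = [max(states, key=lambda s: V[-1][s])]   # best final state
        for ptr in reversed(back):                     # follow back-pointers
            path.append(ptr[path[-1]])
        return list(reversed(path))

    words = "115 Grant street Mumbai 400070".split()
    states = ["House", "Road", "City", "Pin"]
    print(viterbi(words, states, A, B, {"House": 1.0}))
    # -> ['House', 'Road', 'Road', 'City', 'Pin']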
Comparative Evaluation
• Naïve model – one state per element in the HMM
• Independent HMM – one HMM per element
• Rule learning method – Rapier
• Nested model – each state in the naïve model replaced by an HMM
Results: Comparative Evaluation

| Dataset | Instances | Elements |
| IITB student addresses | 2388 | 17 |
| Company addresses | 769 | 6 |
| US addresses | 740 | 6 |

The nested model does best in all three cases (from Borkar 2001).
HMM approach: summary

| Property modeled | Mechanism |
| Inter-element sequencing | Outer HMM transitions |
| Intra-element sequencing | Inner HMM |
| Element length | Multi-state inner HMM |
| Characteristic words | Dictionary |
| Non-overlapping tags | Global optimization |
Basic chain model for extraction

Position t:    1  2      3  4        5    6       7  8  9
x:             My review of Fermat's last theorem by S. Singh
y (y1 ... y9): Other Other Other Title Title Title Other Author Author

Independent model: each label y_t is predicted separately given x.
Features
• The word as-is
• Orthographic word properties
  – Capitalized? Digit? Ends with a dot?
• Part of speech
  – Noun?
• Match in a dictionary
  – Appears in a dictionary of people names?
  – Appears in a list of stop-words?
• Fire these for each label, and for the token itself, W tokens to its left or right, or a concatenation of tokens (see the sketch below)
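A minimal sketch of firing these features at one position; the dictionaries and feature names are illustrative, not from any particular toolkit:

    PEOPLE = {"Singh", "Fermat"}
    STOPWORDS = {"of", "by", "the", "my"}

    def features(tokens, i):
        w = tokens[i]
        feats = {f"word={w}"}                      # the word as-is
        if w[0].isupper():
            feats.add("is_capitalized")            # orthographic properties
        if w.isdigit():
            feats.add("is_digit")
        if w.endswith("."):
            feats.add("ends_with_dot")
        if w in PEOPLE:
            feats.add("in_people_dict")            # dictionary matches
        if w.lower() in STOPWORDS:
            feats.add("is_stopword")
        if i > 0:                                  # window of W = 1 neighbors
            feats.add(f"prev_word={tokens[i-1]}")
        if i + 1 < len(tokens):
            feats.add(f"next_word={tokens[i+1]}")
        return feats

    tokens = "My review of Fermat's last theorem by S. Singh".split()
    print(sorted(features(tokens, 8)))
    # ['in_people_dict', 'is_capitalized', 'prev_word=S.', 'word=Singh']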
Basic chain model for extraction

(Same example as above: x = "My review of Fermat's last theorem by S. Singh", y = Other Other Other Title Title Title Other Author Author.)

Global conditional model over Pr(y1, y2, ..., y9 | x):
Pr(y | x) ∝ exp( Σ_i w · f(y_i, y_{i-1}, x, i) )

Features
• A feature vector f(y_i, y_{i-1}, x, i) at each position i, built from the i-th label, word i & its neighbors, and the previous label
• Parameters: one weight per feature
  – Features are user provided; the weights are machine learnt (see the sketch below)
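A minimal sketch of the global model with two illustrative feature templates; the brute-force normalizer is only for exposition (a real CRF computes it by dynamic programming), and all weights here are toy values:

    import math
    from itertools import product

    LABELS = ["Other", "Title", "Author"]

    def f(x, i, y_prev, y_cur):
        # Feature templates: current label with current word, and label pair.
        return [f"label={y_cur},word={x[i]}", f"trans={y_prev}->{y_cur}"]

    def score(x, y, w):
        return sum(w.get(feat, 0.0)
                   for i in range(len(x))
                   for feat in f(x, i, y[i - 1] if i else "START", y[i]))

    def prob(x, y, w):
        # Pr(y|x) = exp(score) / Z, normalizing over all label sequences.
        Z = sum(math.exp(score(x, y2, w)) for y2 in product(LABELS, repeat=len(x)))
        return math.exp(score(x, y, w)) / Z

    x = ["by", "S.", "Singh"]
    w = {"label=Author,word=S.": 1.5, "label=Author,word=Singh": 2.0,
         "trans=Author->Author": 1.0}
    print(prob(x, ["Other", "Author", "Author"], w))   # a probability in (0, 1)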
Transforming real-world extraction
• Partition each label into Begin / Continue / End / Unique parts (sketched in code after the figure)
• Independent extraction per label?

Fred | please | stop | by | my | office | this | afternoon
Person-Unique | Other-Begin | Other-Continue | Other-End | Loc-Begin | Loc-End | Other-Unique | Time-Unique

[Figure: allowed transitions among the Unique, Begin, Continue, and End states of each label, plus Other.]
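A minimal sketch of this Begin/Continue/End/Unique re-encoding, reproducing the tagging shown above:

    def bceu_encode(spans):
        """spans: (label, tokens) segments; returns one sub-label per token."""
        tags = []
        for label, toks in spans:
            if len(toks) == 1:
                tags.append(f"{label}-Unique")
            else:
                tags += ([f"{label}-Begin"]
                         + [f"{label}-Continue"] * (len(toks) - 2)
                         + [f"{label}-End"])
        return tags

    spans = [("Person", ["Fred"]), ("Other", ["please", "stop", "by"]),
             ("Loc", ["my", "office"]), ("Other", ["this"]),
             ("Time", ["afternoon"])]
    print(bceu_encode(spans))
    # ['Person-Unique', 'Other-Begin', 'Other-Continue', 'Other-End',
    #  'Loc-Begin', 'Loc-End', 'Other-Unique', 'Time-Unique']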
Examples: features with weights (publications)

| # | Name | Person | Location | Other |
| 1 | x_i is a noun | 1.2 | 1.2 | -0.5 |
| 4 | "at" in {x_{i-1}, x_{i-2}} | -0.3 | 3 | 0.2 |
| 7 | x_{i-1} x_i in people-names dictionary | 3 | -0.4 | 0 |
| 10 | x_{i-1} is a single capital followed by a dot | 2.1 | -1.0 | -0.1 |
| 13 | y_{i-1} is Location | -1.5 | 0.3 | 1.0 |
| ... | ... | ... | ... | ... |
| 100000 | ... | ... | ... | ... |

A large number of such features; a usage sketch follows.
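A minimal sketch of how such weights are used: at a position, the score of each label sums the weights of the features that fire there (values copied from the table above; the y_{i-1} feature is ignored here for brevity):

    WEIGHTS = {  # feature -> (Person, Location, Other) weights from the table
        "xi_is_noun":          (1.2, 1.2, -0.5),
        "at_in_prev_two":      (-0.3, 3.0, 0.2),
        "prev_cur_in_people":  (3.0, -0.4, 0.0),
        "prev_is_caps_dot":    (2.1, -1.0, -0.1),
    }

    def label_scores(fired):
        scores = [0.0, 0.0, 0.0]
        for feat in fired:
            for k, wk in enumerate(WEIGHTS[feat]):
                scores[k] += wk
        return dict(zip(("Person", "Location", "Other"), scores))

    # For "Singh" preceded by "S.": it is a noun, "S. Singh" is in the people
    # dictionary, and the previous token is a single capital followed by a dot.
    print(label_scores(["xi_is_noun", "prev_cur_in_people", "prev_is_caps_dot"]))
    # -> Person ≈ 6.3, Location ≈ -0.2, Other ≈ -0.6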
Typical numbers
• Seminar announcements (CMU):
  – speaker, location, timings
  – SVMs for start/end boundaries
  – 250 training examples
  – F1: 85% for speaker and location, 92% for timings (Finn & Kushmerick '04)
• Job postings in newsgroups:
  – 17 fields: title, location, company, language, etc.
  – 150 training examples
  – F1: 84% overall with LP2 (Lavelli et al. '04)
Publications
• Cora dataset
  – Paper headers: extract title, author, affiliation, address, email, abstract
    • 94% F1 with CRFs
    • 76% F1 with HMMs
  – Paper citations: extract title, author, date, editor, booktitle, pages, institution
    • 91% F1 with CRFs
    • 78% F1 with HMMs
(Peng & McCallum 2004)