Information Extraction Sunita Sarawagi IIT Bombay sunita.

Information Extraction

Sunita Sarawagi

IIT Bombay
http://www.it.iitb.ac.in/~sunita

Information Extraction (IE) & Integration

The Extraction task: Given,
– E: a set of structured elements
– S: unstructured source S
extract all instances of E from S

• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications

• Classical Named Entity Recognition
  – Extract person, location, organization names

According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms

IE from free format text

Several applications
– News tracking
  • Monitor events
– Bio-informatics
  • Protein and Gene names from publications
– Customer care
  • Part number, problem description from emails in help centers

Problem definition

Source: concatenation of structured elements with limited reordering and some missing fields
– Example: Addresses, bib records

Address elements: House number, Building, Road, Area, City, Zip

156 Hillside ctype Scenic drive Powai Mumbai 400076

Bib record elements: Author, Year, Title, Journal, Volume, Page

P.P.Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media J.Amer. Chem. Soc. 115, 12231-12237.

Relation Extraction: Disease Outbreaks

• Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire

Information Extraction System

(e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times

Information Extraction on the web

Personal Information Systems
– Automatically add a bibtex entry of a paper I download
– Integrate a resume in email with the candidates database

[Figure labels: People, Papers, Projects, Emails; Email, Web, Files, Resumes]

Hand-Coded Methods
• Easy to construct in many cases
  – e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
• Easier to debug & maintain
  – Especially if written in a “high-level” language (as is usually the case): e.g.,

    ContactPattern ← RegularExpression(Email.body, ”can be reached at”)
    PersonPhone ← Precedes(Person, Precedes(ContactPattern, Phone, D), D)

    [From Avatar]
• Easier to incorporate / reuse domain knowledge
• Can be quite labor intensive to write
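The two Avatar-style rules on this slide can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the `PERSON` and `PHONE` regular expressions, the `max_gap` parameter (standing in for the distance D), and the reading of Precedes(a, b, D) as "a ends at most D characters before b starts" are all illustrative choices, not part of the original rule language.

```python
import re

# ContactPattern <- RegularExpression(Email.body, "can be reached at")
CONTACT_PATTERN = re.compile(r"can be reached at", re.IGNORECASE)

# Hypothetical stand-ins for the Person and Phone annotators.
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def person_phone(body, max_gap=40):
    """Return (person, phone) pairs where Person precedes the contact
    phrase and the contact phrase precedes Phone, each within max_gap
    characters (assumed meaning of the distance D)."""
    pairs = []
    for contact in CONTACT_PATTERN.finditer(body):
        # Inner rule: Precedes(ContactPattern, Phone, D)
        phone = PHONE.search(body, contact.end())
        if phone is None or phone.start() - contact.end() > max_gap:
            continue
        # Outer rule: Precedes(Person, <inner match>, D)
        persons = [
            p for p in PERSON.finditer(body, 0, contact.start())
            if contact.start() - p.end() <= max_gap
        ]
        if persons:
            pairs.append((persons[-1].group(), phone.group()))
    return pairs
```

For example, `person_phone("Laura Haas can be reached at 408-555-1234 during office hours.")` pairs the name with the number that follows the contact phrase.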

Example of Hand-Coded Entity Tagger [Ramakrishnan. G, 2005, Slides from Doan et al., SIGMOD 2006]

Rule 1 This rule will find person names with a salutation (e.g. Dr. Laura Haas) and two capitalized words

<token> INITIAL</token><token>DOT </token><token>CAPSWORD</token><token>CAPSWORD</token>

Rule 2 This rule will find person names where two capitalized words are present in a Person dictionary

<token>PERSONDICT, CAPSWORD </token><token>PERSONDICT, CAPSWORD</token>

CAPSWORD : Word starting with uppercase, second letter lowercase
E.g., DeWitt will satisfy it (DEWITT will not)
\p{Upper}\p{Lower}[\p{Alpha}]{1,25}

DOT : The character ‘.’
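Rule 1 and the CAPSWORD definition can be tried out with Python's `re` module. This is a rough sketch, not the original tagger: `re` has no `\p{Upper}` classes, so ASCII ranges are used instead, and because the slide does not define INITIAL, a small salutation list (`SALUTATION`) is assumed purely for illustration.

```python
import re

# ASCII approximation of \p{Upper}\p{Lower}[\p{Alpha}]{1,25}
CAPSWORD = r"[A-Z][a-z][A-Za-z]{1,25}"

# Assumption: INITIAL is taken to be one of a few salutations.
SALUTATION = r"(?:Dr|Mr|Mrs|Ms|Prof)"

# Rule 1: INITIAL DOT CAPSWORD CAPSWORD
RULE1 = re.compile(rf"\b{SALUTATION}\.\s*({CAPSWORD})\s+({CAPSWORD})\b")


def find_person_names(text):
    """Return the two capitalized words that follow a salutation."""
    return [" ".join(m.groups()) for m in RULE1.finditer(text)]
```

With these definitions, `re.fullmatch(CAPSWORD, "DeWitt")` succeeds while `re.fullmatch(CAPSWORD, "DEWITT")` does not, matching the note on the slide.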

Hand Coded Rule Example: Conference Name

# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

##############################################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);

#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern {
  my ($file,$pattern) = @_;

Some Hand Coded Entity Taggers

• FRUMP [DeJong 82]
• CIRCUS / AutoSlog [Riloff 93]
• SRI FASTUS [Appelt, 1996]
• MITRE Alembic (available for use)
• Alias-I LingPipe (available for use)
• OSMX [Embley, 2005]
• DBLife [Doan et al, 2006]
• Avatar [Jayram et al, 2006]

Learning models for extraction
• Rule-based extractors
  – For each label, build two classifiers for accepting its two boundaries.
  – Each classifier: sequence of rules
    • Each rule: conjunction of predicates
  – E.g: If previous token is a last-name, current token is “.”, and next token is an article → start of title.
  – Examples: Rapier, GATE, LP2 & several more
• Critique of rule-based approaches
  – Cannot output meaningful uncertainty values
  – Brittle
  – Limited flexibility in clues that can be exploited
  – Not too good at combining several weak clues.
  – (Pros) Somewhat easier to tune.
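The "conjunction of predicates" idea behind the boundary rule above can be sketched as follows. This is a hand-written illustration, not a learned Rapier or LP2 rule; the helper names (`make_rule`, `title_starts`), the article list, and the last-name dictionary are assumptions made for the example.

```python
ARTICLES = {"a", "an", "the"}


def make_rule(*predicates):
    """A rule fires at position i only if every predicate holds there."""
    def rule(tokens, i):
        return all(pred(tokens, i) for pred in predicates)
    return rule


def prev_is_last_name(last_names):
    def pred(tokens, i):
        return i >= 1 and tokens[i - 1].lower() in last_names
    return pred


def cur_is_dot(tokens, i):
    return tokens[i] == "."


def next_is_article(tokens, i):
    return i + 1 < len(tokens) and tokens[i + 1].lower() in ARTICLES


def title_starts(tokens, last_names):
    """Return the positions where the rule predicts a title begins,
    i.e. the token right after the matched '.'."""
    rule = make_rule(prev_is_last_name(last_names), cur_is_dot, next_is_article)
    return [i + 1 for i in range(len(tokens)) if rule(tokens, i)]
```

Because the rule either fires or does not, it yields no confidence value, which is the first point of the critique above.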

Statistical models of IE

• Generative models like HMM
  – Intuitive
  – Very restricted feature sets → lower accuracy
  – Output probabilities are highly skewed (counterpart, naïve Bayes)
• Conditional discriminative models
  – Local models: Maximum entropy models
  – Global models: Conditional Random Fields.

Conditional models
  – output meaningful probabilities,
  – flexible, generalize,
  – getting increasingly popular
  – State-of-the-art!

IE with Hidden Markov Models

• Probabilistic models for IE

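[Figure: an HMM with states Author, Title, Journal and Year.
Transition probabilities on the edges: 0.9, 0.5, 0.5, 0.8, 0.2, 0.1.
Emission probabilities shown next to the states (in extraction order): A 0.6, B 0.3, C 0.1; X 0.4, B 0.2, Z 0.4; Y 0.1, A 0.1, C 0.8; dddd 0.8, dd 0.2.]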

HMM Structure

• Naïve Model: One state per element
• Nested model: Each element another HMM

HMM Dictionary

• For each word (=feature), associate the probability of emitting that word

• Multinomial model

• More advanced models with overlapping features of a word,
  – example,
    • part of speech,
    • capitalized or not
    • type: number, letter, word etc
  – Maximum entropy models (McCallum 2000)

Learning model parameters
• When training data defines a unique path through the HMM
  – Transition probabilities
    • Probability of transitioning from state i to state j =
      (number of transitions from i to j) / (total transitions from state i)
  – Emission probabilities
    • Probability of emitting symbol k from state i =
      (number of times k generated from i) / (number of transitions from i)
• When training data defines multiple paths:
  – A more general EM like algorithm (Baum-Welch)
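When every training token carries its state label, the two ratios above reduce to counting. The sketch below is a minimal illustration with no smoothing. The function name `estimate_hmm` and the input format (lists of `(token, state)` pairs) are assumptions. Emissions are normalized by the number of tokens each state emitted, which coincides with the slide's denominator when every visit to a state is followed by a transition (for instance, with an explicit end state).

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sequences):
    """Maximum-likelihood estimates from fully labelled sequences.

    tagged_sequences: iterable of sequences of (token, state) pairs.
    Returns (transition, emission) as nested dicts of probabilities.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in tagged_sequences:
        for token, state in seq:
            emit_counts[state][token] += 1
        for (_, s_from), (_, s_to) in zip(seq, seq[1:]):
            trans_counts[s_from][s_to] += 1

    def normalize(counts):
        return {
            state: {k: c / sum(ctr.values()) for k, c in ctr.items()}
            for state, ctr in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)
```

In practice unseen words and transitions would receive zero probability under these raw counts, so some form of smoothing is usually added.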

Using the HMM to segment

• Find highest probability path through the HMM.
• Viterbi: quadratic dynamic programming algorithm

[Figure: Viterbi trellis over the states House, Road, City, Pin for the observation sequence o_t = "115 Grant street Mumbai 400070"]
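The Viterbi search on this slide can be sketched in Python as below. The cost is quadratic in the number of states for each token. All probabilities in `START`, `TRANS` and `EMIT`, and the `to_symbol` mapping from tokens to emission symbols, are made-up toy values chosen only so that the address from the slide can be segmented; they are not taken from the original model.

```python
import math


def viterbi(symbols, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a symbol sequence."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    best = [{s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(symbols[0], 0))
             for s in states}]
    back = []
    for sym in symbols[1:]:
        scores, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states,
                       key=lambda r: best[-1][r] + lg(trans_p[r].get(s, 0)))
            scores[s] = (best[-1][prev] + lg(trans_p[prev].get(s, 0))
                         + lg(emit_p[s].get(sym, 0)))
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)

    # Follow back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def to_symbol(token):
    """Toy emission alphabet: digit patterns, else the lowercased word."""
    if token.isdigit():
        return "pin" if len(token) == 6 else "num"
    return token.lower()


# Hypothetical toy parameters for illustration only.
STATES = ["House", "Road", "City", "Pin"]
START = {"House": 1.0}
TRANS = {
    "House": {"Road": 1.0},
    "Road": {"Road": 0.5, "City": 0.5},
    "City": {"City": 0.2, "Pin": 0.8},
    "Pin": {},
}
EMIT = {
    "House": {"num": 1.0},
    "Road": {"grant": 0.3, "street": 0.6, "mumbai": 0.1},
    "City": {"mumbai": 0.8, "grant": 0.1, "street": 0.1},
    "Pin": {"pin": 1.0},
}


def segment(tokens):
    return viterbi([to_symbol(t) for t in tokens], STATES, START, TRANS, EMIT)
```

Calling `segment("115 Grant street Mumbai 400070".split())` labels the tokens House, Road, Road, City, Pin under these toy parameters.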

Comparative Evaluation

• Naïve model – One state per element in the HMM
• Independent HMM – One HMM per element
• Rule Learning Method – Rapier
• Nested Model – Each state in the Naïve model replaced by a HMM

Results: Comparative Evaluation

The Nested model does best in all three cases

(from Borkar 2001)

Dataset                 Instances  Elements
IITB student Addresses  2388       17
Company Addresses       769        6
US Addresses            740        6

HMM approach: summary

Inter-element sequencing   Outer HMM transitions
Intra-element sequencing   Inner HMM
Element length             Multi-state Inner HMM
Characteristic words       Dictionary
Non-overlapping tags       Global optimization

Statistical models of IE

• Generative models like HMM
  – Intuitive
  – Very restricted feature sets → lower accuracy
  – Output probabilities are highly skewed (counterpart, naïve Bayes)
• Conditional discriminative models
  – Local models: Maximum entropy models
  – Global models: Conditional Random Fields.

Conditional models
  – output meaningful probabilities,
  – flexible, generalize,
  – getting increasingly popular
  – State-of-the-art!

Basic chain model for extraction

t   1      2       3      4         5      6        7      8       9
x   My     review  of     Fermat’s  last   theorem  by     S.      Singh
y   Other  Other   Other  Title     Title  Title    Other  Author  Author
    y1     y2      y3     y4        y5     y6       y7     y8      y9

Independent model

Features

• The word as-is
• Orthographic word properties
  – Capitalized? Digit? Ends-with-dot?
• Part of speech
  – Noun?
• Match in a dictionary
  – Appears in a dictionary of people names?
  – Appears in a list of stop-words?
• Fire these for each label and
  – The token,
  – W tokens to the left or right, or
  – Concatenation of tokens.
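A feature extractor along the lines of this list might look like the sketch below. The function name `token_features`, the feature-name strings, and the dictionary arguments are all illustrative assumptions. Part-of-speech features are omitted because they need an external tagger, and pairing each feature with each label is left to the learning model.

```python
def token_features(tokens, i, people_names=frozenset(),
                   stop_words=frozenset(), window=1):
    """Features for position i, fired for the token itself and for
    `window` tokens to the left and right."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue
        tok = tokens[j]
        prefix = f"w[{offset}]"
        feats[f"{prefix}.word={tok.lower()}"] = 1          # the word as-is
        feats[f"{prefix}.capitalized"] = int(tok[:1].isupper())
        feats[f"{prefix}.digit"] = int(tok.isdigit())
        feats[f"{prefix}.ends_with_dot"] = int(tok.endswith("."))
        feats[f"{prefix}.in_people_dict"] = int(tok.lower() in people_names)
        feats[f"{prefix}.stop_word"] = int(tok.lower() in stop_words)
    if i > 0:
        # Concatenation of the previous and current tokens.
        feats[f"bigram={tokens[i - 1].lower()}_{tokens[i].lower()}"] = 1
    return feats
```

For the last token of "My review of Fermat's last theorem by S. Singh", the extractor fires the people-dictionary feature on "Singh" (given a dictionary containing it) and the ends-with-dot feature on the preceding "S.".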

Basic chain model for extraction


t   1      2       3      4         5      6        7      8       9
x   My     review  of     Fermat’s  last   theorem  by     S.      Singh
y   Other  Other   Other  Title     Title  Title    Other  Author  Author
    y1     y2      y3     y4        y5     y6       y7     y8      y9

Global conditional model over Pr(y1,y2…y9|x)

Features
• Feature vector for each position: a function of the i-th label, the previous label, and word i & neighbors (user provided)
• Examples
• Parameters: weight for each feature (vector) (machine learnt)
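For reference, the standard linear-chain Conditional Random Field form that combines such feature vectors and weights is written out below. The notation is an assumption, since the slide's own formula did not survive extraction: f is the feature vector at position i, w is the weight vector, and Z(x) normalizes over all label sequences y'.

```latex
\Pr(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{i=1}^{n} \mathbf{w} \cdot \mathbf{f}(y_i, y_{i-1}, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{i=1}^{n} \mathbf{w} \cdot \mathbf{f}(y'_i, y'_{i-1}, \mathbf{x}, i) \Big)
```

The features are supplied by the user, while the weights w are learnt from labelled data.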

Transforming real-world extraction

• Partition label into different parts?

• Independent extraction per label?

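Fred please stop by my office this afternoon

Fred       Person Unique
please     Other Begin
stop       Other Continue
by         Other End
my         Loc Begin
office     Loc End
this       Other Unique
afternoon  Time Unique

[Figure: a label (here Other) partitioned into the parts Unique, Begin, Continue, End]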
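Partitioning each label into Begin, Continue, End and Unique parts is mechanical once the segments are known. The sketch below shows one way to do it; the function name `bceu_encode` and the `(label, tokens)` input format are assumptions made for the example.

```python
def bceu_encode(segments):
    """Expand (label, tokens) segments into per-token tags.

    A one-token segment is tagged Unique; longer segments are tagged
    Begin, then Continue for any middle tokens, then End.
    """
    tagged = []
    for label, tokens in segments:
        n = len(tokens)
        if n == 1:
            parts = ["Unique"]
        else:
            parts = ["Begin"] + ["Continue"] * (n - 2) + ["End"]
        tagged.extend((tok, f"{label} {part}")
                      for tok, part in zip(tokens, parts))
    return tagged
```

Applied to the segments Person = "Fred", Other = "please stop by", Loc = "my office", Other = "this", Time = "afternoon", this reproduces the tagging shown on the slide.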

Examples: features with weights (publications).

#       Name                                Person  Location  Other
1       xi is noun                          1.2     1.2       -0.5
4       “at” in {xi-1, xi-2}                -0.3    3         0.2
7       xi-1 xi in people names dictionary  3       -0.4      0
10      xi-1 is single caps & dot           2.1     -1.0      -0.1
13      yi-1 is Location                    -1.5    0.3       1.0
...
100000  ...

A large number
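To see how such weights are used, the sketch below sums the weights of the features that fire at one position and normalizes them into per-label probabilities. The weights are the five rows shown on the slide; the feature keys are invented names for those rows. Normalizing at a single position corresponds to a local (maximum entropy style) model, whereas a CRF normalizes over whole label sequences.

```python
import math

LABELS = ("Person", "Location", "Other")

# Weights copied from the slide; the keys are hypothetical feature names.
WEIGHTS = {
    "xi_is_noun":              {"Person": 1.2,  "Location": 1.2,  "Other": -0.5},
    "at_in_prev_two":          {"Person": -0.3, "Location": 3.0,  "Other": 0.2},
    "prev_cur_in_people_dict": {"Person": 3.0,  "Location": -0.4, "Other": 0.0},
    "prev_is_single_caps_dot": {"Person": 2.1,  "Location": -1.0, "Other": -0.1},
    "prev_label_is_location":  {"Person": -1.5, "Location": 0.3,  "Other": 1.0},
}


def label_scores(active_features):
    """Sum the weights of the firing features for each label."""
    return {label: sum(WEIGHTS[f][label] for f in active_features)
            for label in LABELS}


def label_probs(active_features):
    """Softmax over the per-label scores at one position."""
    scores = label_scores(active_features)
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}
```

For a token such as "Singh" preceded by "S.", the people-dictionary and single-caps-and-dot features both fire, and Person receives by far the highest score. Several individually weak clues add up, which is the flexibility rule-based extractors lack.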

Typical numbers

• Seminar announcements (CMU):
  – speaker, location, timings
  – SVMs for start-end boundaries
  – 250 training examples
  – F1: 85% speaker, location, 92% timings (Finn & Kushmerick ’04)
• Job postings in news groups
  – 17 fields: title, location, company, language, etc
  – 150 training examples
  – F1: 84% overall (LP2) (Lavelli et al 04)

Publications

• Cora dataset
  – Paper headers: Extract title, author, affiliation, address, email, abstract
    • 94% F1 with CRFs
    • 76% F1 with HMMs
  – Paper citations: Extract title, author, date, editor, booktitle, pages, institution
    • 91% F1 with CRFs
    • 78% F1 with HMMs

Peng & McCallum 2004
