8/3/2019 Mlas06 Nigam Tie 01
Machine Learning for Information Extraction: An Overview
Kamal Nigam, Google Pittsburgh
With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea
Example: A Problem
Genomics job
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Example: A Solution
Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy
Contact:[email protected]
Category: Travel/Hospitality
Function: Food Services
Potential Enabler of Faceted Search
Lots of Structured Information in Text
IE from Research Papers
What is Information Extraction?
Recovering structured data from formatted text
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication
Today, focus mostly on field identification and a little on record association
IE Posed as a Machine Learning Task
Training data: documents marked up with ground truth
In contrast to text classification, local features are crucial. Features of:
Contents
Text just before item
Text just after item
Begin/end boundaries
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
prefix contents suffix
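The prefix/contents/suffix split above can be sketched in code. This is a minimal illustration, not the tutorial's implementation; the window size, feature names, and helper function are invented:

```python
# Hypothetical sketch: build local features for a candidate span, using
# the span's contents plus the tokens just before (prefix) and just
# after (suffix) it.

def candidate_features(tokens, start, end, window=2):
    """Feature dict for the candidate span tokens[start:end]."""
    feats = {}
    # Contents features: identity of each word inside the span.
    for tok in tokens[start:end]:
        feats["contents=" + tok.lower()] = 1
    # Prefix features: tokens in the window just before the span.
    for tok in tokens[max(0, start - window):start]:
        feats["prefix=" + tok.lower()] = 1
    # Suffix features: tokens in the window just after the span.
    for tok in tokens[end:end + window]:
        feats["suffix=" + tok.lower()] = 1
    # Boundary features: how the span begins and ends.
    feats["begins-capitalized"] = int(tokens[start][0].isupper())
    feats["ends-capitalized"] = int(tokens[end - 1][0].isupper())
    return feats

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
f = candidate_features(tokens, 8, 10)  # candidate span "Sebastian Thrun"
```

A real system would add many of the dictionary and formatting features listed on the next slides to the same dict.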
Good Features for Information Extraction
Example word features:
identity of word
is in all caps
ends in "-ski"
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor
Features of past & future:
last person name was female
next two words are "and Associates"
begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30
Creativity and Domain Knowledge Required!
Good Features for Information Extraction
Is Capitalized
Is Mixed Caps
Is All Caps
Initial Cap
Contains Digit
All lowercase
Is Initial
Punctuation
Period
Comma
Apostrophe
Dash
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc.)
In honorific list (Mr, Mrs, Dr, Sen, etc.)
In person suffix list (Jr, Sr, PhD, etc.)
In name particle list (de, la, van, der, etc.)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list (J. C. Penny)
In list of company suffixes (Inc, & Associates, Foundation)
Word feature lists: lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases
HTML/Formatting features:
{begin, end, in} x { , , , } x {lengths 1, 2, 3, 4, or longer}
{begin, end} of line
Creativity and Domain Knowledge Required!
IE History
Pre-Web
Mostly news articles
De Jong's FRUMP [1982]: hand-built system to fill Schank-style scripts from news wire
Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]
Most early work dominated by hand-built models
E.g. SRI's FASTUS, hand-built FSMs
But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek 97], BBN [Bikel et al 98]
Web
AAAI '94 Spring Symposium on Software Agents
Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
Tom Mitchell's WebKB, '96: build KBs from the Web
Wrapper induction
Initially hand-built, then ML: [Soderland 96], [Kushmerick 97], ...
Landscape of ML Techniques for IE:
Any of these models can be used to capture words, formatting or both.
Classify Candidates: "Abraham Lincoln was born in Kentucky." Classifier asks: which class?
Sliding Window: "Abraham Lincoln was born in Kentucky." Classifier asks: which class? Try alternate window sizes.
Boundary Models: "Abraham Lincoln was born in Kentucky." Classifier asks: which class? (BEGIN or END)
Finite State Machines: "Abraham Lincoln was born in Kentucky." Most likely state sequence?
Wrapper Induction: "Abraham Lincoln was born in Kentucky." Learn and apply a pattern for a website (e.g. PersonName).
Sliding Windows & Boundary Detection
Information Extraction by Sliding Windows
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g., looking for a seminar location
Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
w_{t-m} ... w_{t-1} [ w_t ... w_{t+n} ] w_{t+n+1} ... w_{t+n+m}
prefix | contents | suffix
Standard supervised learning setting:
Positive instances: candidates with real label
Negative instances: all other candidates
Features based on candidate, prefix and suffix
Special-purpose rule learning systems work well:
courseNumber(X) :-
    tokenLength(X, =, 2),
    every(X, inTitle, false),
    some(X, A, , inTitle, true),
    some(X, B, . tripleton, true)
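The candidate-generation step behind this supervised setting can be sketched as follows; the function name, maximum window length, and example are hypothetical:

```python
# Hypothetical sketch: enumerate every sliding window up to a maximum
# length, labeling the window that matches the annotated field as
# positive and all other windows as negative.

def make_instances(tokens, true_span, max_len=4):
    """Yield (start, end, label) for every candidate window."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            label = 1 if (start, end) == true_span else 0
            yield start, end, label

tokens = "Speaker : Sebastian Thrun".split()
# Ground truth: "Sebastian Thrun" occupies tokens[2:4].
instances = list(make_instances(tokens, true_span=(2, 4)))
positives = [(s, e) for s, e, y in instances if y == 1]
```

Each (start, end) pair would then be turned into a feature vector (candidate, prefix, suffix) and fed to a standard classifier.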
Rule-learning approaches to sliding-window classification: Summary
Representations for classifiers allow restriction of the relationships between tokens, etc.
Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog).
Use of these heavyweight representations is complicated, but seems to pay off in results.
IE by Boundary Detection
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g., looking for a seminar location
BWI: Learning to detect boundaries
Another formulation: learn three probabilistic classifiers:
START(i) = Prob(position i starts a field)
END(j) = Prob(position j ends a field)
LEN(k) = Prob(an extracted field has length k)
Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i)
LEN(k) is estimated from a histogram
[Freitag & Kushmerick, AAAI 2000]
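The scoring rule above is easy to sketch, assuming the three classifiers are given as probability tables; the toy numbers below are invented, not from BWI:

```python
# Sketch of BWI-style span scoring: score every (i, j) span by
# START(i) * END(j) * LEN(j - i) and keep the best one.

def best_extraction(n, start_p, end_p, len_p):
    """Return the (i, j) span maximizing START(i)*END(j)*LEN(j-i)."""
    best, best_score = None, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            score = (start_p.get(i, 0.0) * end_p.get(j, 0.0)
                     * len_p.get(j - i, 0.0))
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy tables: position 3 likely starts a field, position 5 likely ends
# one, and length-2 fields are common (LEN estimated from a histogram).
start_p = {3: 0.9, 0: 0.1}
end_p = {5: 0.8, 4: 0.2}
len_p = {1: 0.2, 2: 0.7, 3: 0.1}
span, score = best_extraction(6, start_p, end_p, len_p)
```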
BWI: Learning to detect boundaries
BWI uses boosting to find detectors for START and END.
Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
Each pattern is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, ...
Weak learner for patterns uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns.
BWI: Learning to detect boundaries
Field         F1
Person Name   30%
Location      61%
Start Time    98%
Problems with Sliding Windows and Boundary Finders
Decisions in neighboring parts of the input are made independently from each other.
A naive-Bayes sliding window may predict a seminar end time before the seminar start time.
It is possible for two overlapping windows to both be above threshold.
In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Finite State Machines
Hidden Markov Models
[Figure: finite state model and graphical model views of an HMM, with states s_{t-1}, s_t, s_{t+1} (transitions) emitting observations o_{t-1}, o_t, o_{t+1}]
Parameters: for all states S = {s1, s2, ...}
Start state probabilities: P(s_t)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t)
Training: maximize probability of training observations (w/ prior)
P(s, o) = Prod_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
Generates a state sequence and an observation sequence o1 o2 o3 ...; observations are usually a multinomial over an atomic, fixed alphabet.
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi):
s* = argmax_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name: Lawrence Saul
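Viterbi decoding for extraction can be sketched with a toy two-state HMM (a background state and a person-name state); all probabilities below are invented for illustration:

```python
import math

# Sketch: Viterbi finds the most likely state sequence for an
# observation sequence; words assigned to the "name" state are then
# read off as the extracted person name.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under the given HMM."""
    # Log-probabilities; unseen emissions get a tiny floor probability.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
          for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for s in states:
            prev, score = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            scores[s] = score + math.log(emit_p[s].get(o, 1e-12))
            ptrs[s] = prev
        V.append(scores)
        back.append(ptrs)
    # Trace back from the best final state.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.insert(0, state)
    return path

states = ["bg", "name"]
start_p = {"bg": 0.8, "name": 0.2}
trans_p = {"bg": {"bg": 0.7, "name": 0.3},
           "name": {"bg": 0.4, "name": 0.6}}
emit_p = {"bg": {"yesterday": 0.5, "spoke": 0.5},
          "name": {"lawrence": 0.5, "saul": 0.5}}
obs = ["yesterday", "lawrence", "saul", "spoke"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
names = [o for o, s in zip(obs, path) if s == "name"]
```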
Generative Extraction with HMMs
Parameters: {P(s_t | s_{t-1}), P(o_t | s_t)} for all states s_t and words o_t. Parameters define the generative model:
P(s, o) = Prod_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)
[McCallum, Nigam, Seymore & Rennie 00]
HMM Example: Nymble
[Bikel, et al 97]
Task: Named Entity Extraction
[Figure: HMM with one state per name class (Person, Org, five other name classes) plus Other, connected to start-of-sentence and end-of-sentence states]
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), with back-off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), with back-off to P(o_t | s_t), then P(o_t)
Train on 450k words of news wire text.
Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%
Other examples of HMMs in IE: [Leek 97; Freitag & McCallum 99; Seymore et al. 99]
Regrets from Atomic View of Tokens
Would like richer representation of text: multiple overlapping features, whole chunks of text.
Line, sentence, or paragraph features:
length
is centered in page
percent of non-alphabetics
white-space aligns with next line
containing sentence has two verbs
grammatically contains a question
contains links to authoritative pages
emissions that are uncountable
features at multiple levels of granularity
Example word features:
identity of word
is in all caps
ends in "-ski"
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor
Features of past & future:
last person name was female
next two words are "and Associates"
Problems with Richer Representation and a Generative Model
These arbitrary features are not independent:
Overlapping and long-distance dependencies
Multiple levels of granularity (words, characters)
Multiple modalities (words, formatting, layout)
Observations from past and future
HMMs are generative models of the text: P(s, o).
Generative models do not easily handle these non-independent features. Two choices:
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
Ignore the dependencies. This causes over-counting of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models
We would prefer a conditional model: P(s|o) instead of P(s,o):
Can examine features, but not responsible for generating them.
Don't have to explicitly model their dependencies.
Don't waste modeling effort trying to generate what we are given at test time anyway.
If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.
Conditional Markov Models
Generative (traditional HMM):
[Figure: chain of states s_{t-1}, s_t, s_{t+1} with transitions between states, generating observations o_{t-1}, o_t, o_{t+1}]
P(s, o) = Prod_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Conditional:
[Figure: same chain, but each state conditions on its observation rather than generating it]
P(s | o) = Prod_{t=1..|o|} P(s_t | s_{t-1}, o_t)
Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.
Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]
Exponential Form for Next State Function
P(s_t | s_{t-1}, o_t) = P_{s_{t-1}}(s_t | o_t) = (1 / Z(o_t, s_{t-1})) exp( Sum_k lambda_k f_k(o_t, s_t) )
(lambda_k: weight; f_k: feature)
Capture dependency on s_{t-1} with |S| independent functions, P_{s_{t-1}}(s_t | o_t).
Each state contains a next-state classifier that, given the next observation, produces a probability of the next state, P_{s_{t-1}}(s_t | o_t).
Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum entropy.
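A next-state classifier of this exponential (softmax) form can be sketched as follows; the states, features, and weights are toy values, not a trained model:

```python
import math

# Sketch of an MEMM next-state classifier: each previous state owns a
# maximum-entropy model over next states given binary observation
# features, normalized per state (the Z above).

def next_state_probs(prev_state, obs_feats, weights, states):
    """P_{prev}(s | o) = exp(sum_k lambda_k f_k(o, s)) / Z."""
    scores = {}
    for s in states:
        # Sum the weights of the features that fire for this next state.
        scores[s] = sum(weights.get((prev_state, s, f), 0.0)
                        for f in obs_feats)
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

states = ["bg", "name"]
# Toy weight: seeing a capitalized token pushes bg -> name.
weights = {("bg", "name", "is-capitalized"): 1.5}
p = next_state_probs("bg", ["is-capitalized"], weights, states)
```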
Label Bias Problem
Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1 * Pr(5|4,i)/Z2 * Pr(3|5,b)/Z3 = 0.5 * 1 * 1
Because states 1 and 4 each have only one outgoing transition, Pr(2|1,i) = 1 and Pr(5|4,o) = 1: the middle observation is ignored, so "rib" and "rob" receive identical path probabilities.
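The effect can be checked numerically with a toy version of this MEMM; the transition tables below are invented to mirror the slide's topology:

```python
# Toy demonstration of label bias: each state normalizes its next-state
# distribution locally, so a state with a single successor passes all
# probability mass along regardless of the observation.

def transition(state, obs_char):
    # State 0 saw 'r' begin "rob" and "rib" equally often in training.
    if state == 0:
        return {1: 0.5, 4: 0.5}
    # States 1, 2, 4, 5 each have exactly one successor, so local
    # normalization forces probability 1 for any observation.
    succ = {1: 2, 2: 3, 4: 5, 5: 3}[state]
    return {succ: 1.0}

def path_prob(path, word):
    p = 1.0
    for prev, nxt, ch in zip(path, path[1:], word):
        p *= transition(prev, ch).get(nxt, 0.0)
    return p
```

The path 0-1-2-3 gets probability 0.5 whether the input is "rob" or "rib": the observations after the first step cannot change the outcome.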
From HMMs to MEMMs to CRFs
s = s1, s2, ..., sn    o = o1, o2, ..., on
HMM (a special case of MEMMs and CRFs):
P(s, o) = Prod_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)
MEMM:
P(s | o) = Prod_{t=1..|o|} P(s_t | s_{t-1}, o_t) = Prod_{t=1..|o|} (1 / Z(s_{t-1}, o_t)) exp( Sum_j lambda_j f_j(s_t, s_{t-1}) + Sum_k mu_k g_k(o_t, s_t) )
CRF [Lafferty, McCallum, Pereira 2001]:
P(s | o) = (1 / Z_o) Prod_{t=1..|o|} exp( Sum_j lambda_j f_j(s_t, s_{t-1}) + Sum_k mu_k g_k(o_t, s_t) )
Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]
[Figure: linear chain of states S_t ... S_{t+4}, each connected to the whole observation sequence O = O_t, O_t+1, O_t+2, O_t+3, O_t+4]
Markov on s, conditional dependency on o:
P(s | o) = (1 / Z_o) Prod_{t=1..|o|} exp( Sum_k lambda_k f_k(s_t, s_{t-1}, o, t) )
The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
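The dynamic program mentioned above can be sketched for the partition function Z_o; the feature functions and weights here are invented toy values:

```python
import math

# Sketch of linear-chain CRF inference: the partition function Z_o is
# computed by a forward-style dynamic program, in O(|o| * |S|^2), just
# like the HMM forward algorithm.

def log_potential(prev_s, s, obs, t, weights):
    """Sum of weighted features for one clique (edge + node)."""
    feats = [("trans", prev_s, s), ("emit", s, obs[t])]
    return sum(weights.get(f, 0.0) for f in feats)

def log_partition(obs, states, weights):
    """log Z_o via the forward recursion."""
    alpha = {s: log_potential(None, s, obs, 0, weights) for s in states}
    for t in range(1, len(obs)):
        alpha = {s: math.log(sum(
                     math.exp(alpha[p] + log_potential(p, s, obs, t, weights))
                     for p in states))
                 for s in states}
    return math.log(sum(math.exp(v) for v in alpha.values()))

states = ["bg", "name"]
weights = {("trans", "bg", "name"): 0.5, ("emit", "name", "Saul"): 2.0}
logZ = log_partition(["Yesterday", "Saul", "spoke"], states, weights)
```

With all weights zero, every potential is 1 and Z_o simply counts the |S|^|o| state sequences, which is a handy sanity check.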
Training CRFs
Maximize the log-likelihood of parameters Lambda = {lambda_k} given training data {(o^(i), s^(i))}:
L(Lambda) = Sum_i log P_Lambda(s^(i) | o^(i)) - Sum_k lambda_k^2 / (2 sigma^2)
Log-likelihood gradient:
dL/dlambda_k = Sum_i C_k(s^(i), o^(i)) - Sum_i Sum_s P_Lambda(s | o^(i)) C_k(s, o^(i)) - lambda_k / sigma^2
where C_k(s, o) = Sum_t f_k(s_t, s_{t-1}, o, t)
i.e. (feature count using correct labels) - (feature count using labels assigned by current parameters) - (smoothing penalty)
Methods:
iterative scaling (quite slow)
conjugate gradient (much faster)
conjugate gradient with preconditioning (super fast)
limited-memory quasi-Newton methods (also super fast)
Complexity comparable to standard Baum-Welch
[Sha & Pereira 2002] & [Malouf 2002]
Sample IE Applications of CRFs
Noun phrase segmentation [Sha & Pereira 03]
Named entity recognition [McCallum & Li 03]
Protein names in bio abstracts [Settles 05]
Addresses in web pages [Culotta et al. 05]
Semantic roles in text [Roth & Yih 05]
RNA structural alignment [Sato & Sakakibara 05]
Examples of Recent CRF Research
Semi-Markov CRFs [Sarawagi & Cohen 05]
Awkwardness of token-level decisions for segments
Segment sequence model alleviates this
Two-level model with sequences of segments, which are sequences of tokens
Stochastic Meta-Descent [Vishwanathan 06]
Stochastic gradient optimization for training
Take gradient step with small batches of examples
Order of magnitude faster than L-BFGS
Same resulting accuracies for extraction
Further Reading about CRFs
Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006.
http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf