Machine Learning Methods in Natural Language Processing
Michael Collins, MIT CSAIL
Some NLP Problems
Information extraction
Named entities
Relationships between entities
Finding linguistic structure
Part-of-speech tagging
Parsing
Machine translation
Common Themes
Need to learn mapping from one discrete structure to another
Strings to hidden state sequences: named-entity extraction, part-of-speech tagging
Strings to strings: machine translation
Strings to underlying trees: parsing
Strings to relational data structures: information extraction
Speech recognition is similar (and shares many techniques)
Two Fundamental Problems
TAGGING: Strings to Tagged Sequences
a b e e a f h j a/C b/D e/C e/C a/D f/C h/D j/C
PARSING: Strings to Trees
d e f g (A (B (D d) (E e)) (C (F f) (G g)))
Information Extraction
Named Entity Recognition
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

Relationships between Entities
INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.
OUTPUT:
Relationship = Company-Location    Relationship = Employer-Employee
Company      = Boeing              Employer     = Boeing Co.
Location     = Seattle             Employee     = Alan Mulally
Named Entity Extraction as Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
SP = Start Person
CP = Continue Person
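The start/continue tag scheme above can be decoded back into entity spans with a single scan. A minimal sketch (the function and dictionary names are illustrative assumptions, not from the tutorial):

```python
# Decode start/continue tags into (entity type, phrase) spans.
# Tag names follow the legend above: SC/CC = Company, SL/CL = Location,
# SP/CP = Person, NA = no entity.
TAG_TYPES = {"SC": "Company", "CC": "Company",
             "SL": "Location", "CL": "Location",
             "SP": "Person", "CP": "Person"}

def decode_entities(tagged):
    """tagged: list of (word, tag) pairs; returns (type, phrase) spans."""
    entities, current = [], None
    for word, tag in tagged:
        if tag.startswith("S"):                # S* tags open a new entity
            if current:
                entities.append(current)
            current = (TAG_TYPES[tag], [word])
        elif tag.startswith("C") and current:  # C* tags extend the open entity
            current[1].append(word)
        else:                                  # NA closes any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(t, " ".join(ws)) for t, ws in entities]

print(decode_entities([("Boeing", "SC"), ("Co.", "CC"), (",", "NA"),
                       ("Alan", "SP"), ("Mulally", "CP")]))
# [('Company', 'Boeing Co.'), ('Person', 'Alan Mulally')]
```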
Parsing (Syntactic Structure)
INPUT: Boeing is located in Seattle.
OUTPUT:
(S (NP (N Boeing))
   (VP (V is)
       (VP (V located)
           (PP (P in)
               (NP (N Seattle))))))
Machine Translation
INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.
OUTPUT:
Boeing ist in Seattle. Alan Mulally ist der CEO.
Techniques Covered in this Tutorial
Generative models for parsing
Log-linear (maximum-entropy) taggers
Learning theory for NLP
Data for Parsing Experiments
Penn WSJ Treebank = 50,000 sentences with associated trees
Usual set-up: 40,000 training sentences, 2,400 test sentences
An example tree:
(TOP (S (NP (NNP Canadian) (NNPS Utilities))
        (VP (VBD had)
            (NP (NP (NP (CD 1988) (NN revenue))
                    (PP (IN of)
                        (NP (QP ($ C$) (CD 1.16) (CD billion))))
                    (PUNC, ,))
                (PP (ADVP (RB mainly)) (IN from)
                    (NP (NP (PRP$ its) (JJ natural) (NN gas) (CC and)
                            (JJ electric) (NN utility) (NNS businesses))
                        (PP (IN in) (NP (NNP Alberta)))
                        (PUNC, ,)
                        (SBAR (WHADVP (WRB where))
                              (S (NP (DT the) (NN company))
                                 (VP (VBZ serves)
                                     (NP (QP (RB about) (CD 800,000))
                                         (NNS customers)))))))))
     (PUNC. .))
Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its
natural gas and electric utility businesses in Alberta , where the company
serves about 800,000 customers .
2) Phrases

(S (NP (DT the) (N burglar))
   (VP (V robbed)
       (NP (DT the) (N apartment))))
Noun Phrases (NP): the burglar, the apartment
Verb Phrases (VP): robbed the apartment
Sentences (S): the burglar robbed the apartment
An Example Application: Machine Translation
English word order is subject verb object
Japanese word order is subject object verb
English: IBM bought Lotus
Japanese: IBM Lotus bought

English: Sources said that IBM bought Lotus yesterday
Japanese: Sources yesterday IBM Lotus bought that said
Context-Free Grammars
[Hopcroft and Ullman 1979]
A context-free grammar is a 4-tuple $G = (N, \Sigma, R, S)$ where:

$N$ is a set of non-terminal symbols
$\Sigma$ is a set of terminal symbols
$R$ is a set of rules of the form $X \rightarrow Y_1 Y_2 \ldots Y_n$ for $n \ge 0$, $X \in N$, $Y_i \in (N \cup \Sigma)$
$S \in N$ is a distinguished start symbol
A Context-Free Grammar for English
N = {S, NP, VP, PP, D, Vi, Vt, N, P}
S = S
Σ = {sleeps, saw, man, woman, telescope, the, with, in}
R:
S  → NP VP
VP → Vi
VP → Vt NP
VP → VP PP
NP → D N
NP → NP PP
PP → P NP
Vi → sleeps
Vt → saw
N  → man
N  → woman
N  → telescope
D  → the
P  → with
P  → in
Note: S=sentence, VP=verb phrase, NP=noun phrase, PP=prepositional phrase,
D=determiner, Vi=intransitive verb, Vt=transitive verb, N=noun, P=preposition
Left-Most Derivations
A left-most derivation is a sequence of strings $s_1 \ldots s_n$ where

$s_1 = S$, the start symbol
$s_n \in \Sigma^*$, i.e. $s_n$ is made up of terminal symbols only
Each $s_i$ for $i = 2 \ldots n$ is derived from $s_{i-1}$ by picking the left-most non-terminal $X$ in $s_{i-1}$ and replacing it by some $\beta$ where $X \rightarrow \beta$ is a rule in $R$

For example: [S], [NP VP], [D N VP], [the N VP], [the man VP], [the man Vi], [the man sleeps]

Representation of a derivation as a tree:
(S (NP (D the) (N man)) (VP (Vi sleeps)))
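The derivation above is mechanical enough to replay in a few lines of code. A small sketch, assuming the toy English grammar from the previous slide (the GRAMMAR dictionary and function name are illustrative):

```python
# Replay a left-most derivation: at each step, rewrite the left-most
# non-terminal using the rule index given in `choices`.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["Vi"]],
    "D":  [["the"]], "N": [["man"]], "Vi": [["sleeps"]],
}

def leftmost_derivation(choices):
    """choices: which rule to use at each step; returns the string sequence."""
    s = ["S"]
    steps = [list(s)]
    for c in choices:
        # find the left-most non-terminal and replace it by the chosen RHS
        i = next(j for j, sym in enumerate(s) if sym in GRAMMAR)
        s = s[:i] + GRAMMAR[s[i]][c] + s[i + 1:]
        steps.append(list(s))
    return steps

for step in leftmost_derivation([0, 0, 0, 0, 0, 0]):
    print(step)
# ['S'], ['NP','VP'], ['D','N','VP'], ['the','N','VP'],
# ['the','man','VP'], ['the','man','Vi'], ['the','man','sleeps']
```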
The Problem with Parsing: Ambiguity
INPUT: She announced a program to promote safety in trucks and vans
POSSIBLE OUTPUTS:
(Six candidate parse trees for the sentence, differing in where the PP headed by "in" attaches and in the scope of the coordination "and vans". Two of the readings, in bracketed form:

(S (NP She)
   (VP announced
       (NP (NP a program)
           (VP to promote
               (NP safety (PP in (NP trucks and vans)))))))

(S (NP She)
   (VP announced
       (NP (NP a program)
           (VP to promote
               (NP (NP safety (PP in (NP trucks))) and (NP vans)))))))
And there are more...
An Example Tree
Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its natural gas and electric utility businesses in Alberta , where the company serves about 800,000 customers .

(The same Penn Treebank tree shown earlier for this sentence.)
A Probabilistic Context-Free Grammar
S  → NP VP     1.0
VP → Vi        0.4
VP → Vt NP     0.4
VP → VP PP     0.2
NP → D N       0.3
NP → NP PP     0.7
PP → P NP      1.0
Vi → sleeps    1.0
Vt → saw       1.0
N  → man       0.7
N  → woman     0.2
N  → telescope 0.1
D  → the       1.0
P  → with      0.5
P  → in        0.5

Probability of a tree with rules $\alpha_i \rightarrow \beta_i$ is $\prod_i P(\alpha_i \rightarrow \beta_i \mid \alpha_i)$

Maximum Likelihood estimation:
$$P(\mathrm{VP} \rightarrow \mathrm{V\ NP} \mid \mathrm{VP}) = \frac{\mathrm{Count}(\mathrm{VP} \rightarrow \mathrm{V\ NP})}{\mathrm{Count}(\mathrm{VP})}$$
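The maximum-likelihood estimates are just relative frequencies of rules in a treebank. A sketch of how they might be computed (the tuple-based tree encoding is an assumption for illustration):

```python
# ML estimate P(rule | lhs) = Count(rule) / Count(lhs) from a treebank.
from collections import Counter

def pcfg_estimates(trees):
    """trees: nested tuples like ("S", ("NP", ...), ("VP", ...))."""
    rule_counts, lhs_counts = Counter(), Counter()

    def visit(node):
        if isinstance(node, str):   # terminal symbol: nothing to count
            return
        lhs, children = node[0], node[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
        for c in children:
            visit(c)

    for t in trees:
        visit(t)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

tree = ("S", ("NP", ("D", "the"), ("N", "man")), ("VP", ("Vi", "sleeps")))
for rule, p in pcfg_estimates([tree]).items():
    print(rule, p)
```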
PCFGs
[Booth and Thompson 73] showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:

1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.
2. A technical condition on the rule probabilities ensures that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)
(TOP (S (NP (N IBM)) (VP (V bought) (NP (N Lotus)))))

PROB = P(TOP → S)
     × P(S → NP VP | S)
     × P(NP → N | NP) × P(N → IBM | N)
     × P(VP → V NP | VP) × P(V → bought | V)
     × P(NP → N | NP) × P(N → Lotus | N)
The SPATTER Parser (Magerman 95; Jelinek et al. 94)
For each rule, identify the head child
S  → NP VP
VP → V NP
NP → DT N

Add the head word to each non-terminal:

(S(questioned)
   (NP(lawyer) (DT the) (N lawyer))
   (VP(questioned) (V questioned)
      (NP(witness) (DT the) (N witness))))
A Lexicalized PCFG
S(questioned)  → NP(lawyer) VP(questioned)   ??
VP(questioned) → V(questioned) NP(witness)   ??
NP(lawyer)     → D(the) N(lawyer)            ??
NP(witness)    → D(the) N(witness)           ??

The big question: how to estimate rule probabilities??
CHARNIAK (1997)

First generate the rule skeleton, conditioned on the parent:
$$P(\mathrm{NP\ VP(questioned)} \mid \mathrm{S(questioned)})$$

then generate the head word of the remaining non-terminal:
$$P(\mathrm{lawyer} \mid \mathrm{S, VP, NP, questioned})$$

giving the lexicalized rule
S(questioned) → NP(lawyer) VP(questioned)
Smoothed Estimation

$$P(\mathrm{NP\ VP} \mid \mathrm{S(questioned)}) = \lambda_1 \times \frac{\mathrm{Count}(\mathrm{S(questioned)} \rightarrow \mathrm{NP\ VP})}{\mathrm{Count}(\mathrm{S(questioned)})} + \lambda_2 \times \frac{\mathrm{Count}(\mathrm{S} \rightarrow \mathrm{NP\ VP})}{\mathrm{Count}(\mathrm{S})}$$

where $0 \le \lambda_1, \lambda_2 \le 1$ and $\lambda_1 + \lambda_2 = 1$
Smoothed Estimation

$$P(\mathrm{lawyer} \mid \mathrm{S, NP, VP, questioned}) = \lambda_1 \times \frac{\mathrm{Count}(\mathrm{lawyer} \mid \mathrm{S, NP, VP, questioned})}{\mathrm{Count}(\mathrm{S, NP, VP, questioned})} + \lambda_2 \times \frac{\mathrm{Count}(\mathrm{lawyer} \mid \mathrm{S, NP, VP})}{\mathrm{Count}(\mathrm{S, NP, VP})} + \lambda_3 \times \frac{\mathrm{Count}(\mathrm{lawyer} \mid \mathrm{NP})}{\mathrm{Count}(\mathrm{NP})}$$

where $0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$
$$P(\mathrm{NP(lawyer)\ VP(questioned)} \mid \mathrm{S(questioned)}) = \left( \lambda_1 \frac{\mathrm{Count}(\mathrm{S(questioned)} \rightarrow \mathrm{NP\ VP})}{\mathrm{Count}(\mathrm{S(questioned)})} + \lambda_2 \frac{\mathrm{Count}(\mathrm{S} \rightarrow \mathrm{NP\ VP})}{\mathrm{Count}(\mathrm{S})} \right) \times \left( \lambda_3 \frac{\mathrm{Count}(\mathrm{lawyer} \mid \mathrm{S,NP,VP,questioned})}{\mathrm{Count}(\mathrm{S,NP,VP,questioned})} + \lambda_4 \frac{\mathrm{Count}(\mathrm{lawyer} \mid \mathrm{S,NP,VP})}{\mathrm{Count}(\mathrm{S,NP,VP})} + \lambda_5 \frac{\mathrm{Count}(\mathrm{lawyer} \mid \mathrm{NP})}{\mathrm{Count}(\mathrm{NP})} \right)$$
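The blended estimate is a simple convex combination of relative frequencies. A sketch with made-up counts and illustrative lambda values (the slides leave the lambdas unspecified; in practice they are tuned on held-out data):

```python
# Linearly interpolate relative-frequency estimates of different specificity.
def interpolated(counts_and_totals, lambdas):
    """counts_and_totals: [(count, total), ...], most to least specific."""
    assert abs(sum(lambdas) - 1.0) < 1e-9 and all(l >= 0 for l in lambdas)
    return sum(l * (c / t if t > 0 else 0.0)
               for l, (c, t) in zip(lambdas, counts_and_totals))

# e.g. P(lawyer | S,NP,VP,questioned) blended from three specificity levels
# (all counts below are invented for illustration):
p = interpolated([(2, 5),       # Count(lawyer | S,NP,VP,questioned) / Count(S,NP,VP,questioned)
                  (10, 60),     # Count(lawyer | S,NP,VP) / Count(S,NP,VP)
                  (50, 1000)],  # Count(lawyer | NP) / Count(NP)
                 lambdas=[0.5, 0.3, 0.2])
print(p)  # 0.5*0.4 + 0.3*0.1667 + 0.2*0.05 = 0.26
```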
Lexicalized Probabilistic Context-Free Grammars
Transformation to lexicalized rules:
S → NP VP   vs.   S(questioned) → NP(lawyer) VP(questioned)

Smoothed estimation techniques blend different counts

Search for most probable tree through dynamic programming

Perform vastly better than PCFGs (88% vs. 73% accuracy)
Independence Assumptions

PCFGs:
(S (NP (DT the) (N lawyer))
   (VP (V questioned)
       (NP (DT the) (N witness))))

Lexicalized PCFGs:
(S(questioned)
   (NP(lawyer) (DT the) (N lawyer))
   (VP(questioned) (V questioned)
      (NP(witness) (DT the) (N witness))))
Results

Method                                             Accuracy
PCFGs (Charniak 97)                                73.0%
Conditional Models: Decision Trees (Magerman 95)   84.2%
Lexical Dependencies (Collins 96)                  85.5%
Conditional Models: Logistic (Ratnaparkhi 97)      86.9%
Generative Lexicalized Model (Charniak 97)         86.7%
Generative Lexicalized Model (Collins 97)          88.2%
Logistic-inspired Model (Charniak 99)              89.6%
Boosting (Collins 2000)                            89.8%

Accuracy = average recall/precision
Parsing for Information Extraction: Relationships between Entities

INPUT: Boeing is located in Seattle.

OUTPUT:
Relationship = Company-Location
Company = Boeing
Location = Seattle
A Generative Model (Miller et al.)

[Miller et al. 2000] use non-terminals to carry lexical items and semantic tags:

(S(is)/CL
   (NP(Boeing)/COMPANY Boeing)
   (VP(is)/CL-LOC (V is)
      (VP(located)/CL-LOC (V located)
         (PP(in)/CL-LOC (P in)
            (NP(Seattle)/LOCATION Seattle)))))

e.g., in PP(in)/CL-LOC: "in" is the lexical head, CL-LOC is the semantic tag
A Generative Model [Miller et al. 2000]

We're now left with an even more complicated estimation problem:

P(S(is)/CL → NP(Boeing)/COMPANY VP(is)/CL-LOC)

See [Miller et al. 2000] for the details.

Parsing algorithm recovers annotated trees
Simultaneously recovers syntactic structure and named-entity relationships
Accuracy (precision/recall) is greater than 80% in recovering relations
Techniques Covered in this Tutorial
Generative models for parsing
Log-linear (maximum-entropy) taggers
Learning theory for NLP
Tagging Problems
TAGGING: Strings to Tagged Sequences
a b e e a f h j a/C b/D e/C e/C a/D f/C h/D j/C
Example 1: Part-of-speech tagging
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N
Mulally/N announced/V first/ADJ quarter/N results/N ./.
Example 2: Named Entity Recognition
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA
CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA
results/NA ./NA
Log-Linear Models
Assume we have sets $\mathcal{X}$ and $\mathcal{Y}$.

Goal: define $P(y \mid x)$ for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$.

A feature-vector representation is $\phi : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$

Parameters $W \in \mathbb{R}^d$

Define
$$P(y \mid x; W) = \frac{e^{\phi(x, y) \cdot W}}{\sum_{y' \in \mathcal{Y}} e^{\phi(x, y') \cdot W}}$$
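Concretely, the definition above is a softmax over feature scores. A minimal sketch, assuming phi(x, y) returns a length-d list of feature values and Y is the list of possible labels:

```python
# P(y | x; W) = exp(phi(x,y).W) / sum_{y'} exp(phi(x,y').W)
import math

def log_linear_prob(x, y, Y, phi, W):
    def score(y_):
        return sum(f * w for f, w in zip(phi(x, y_), W))
    m = max(score(y_) for y_ in Y)   # subtract the max for numerical stability
    Z = sum(math.exp(score(y_) - m) for y_ in Y)
    return math.exp(score(y) - m) / Z
```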
Log-Linear Taggers: Independence Assumptions

The basic idea:

Chain rule:
$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid t_1 \ldots t_{i-1}, w_1 \ldots w_n)$$

Independence assumptions:
$$P(t_i \mid t_1 \ldots t_{i-1}, w_1 \ldots w_n) = P(t_i \mid t_{i-2}, t_{i-1}, w_1 \ldots w_n)$$

Two questions:
1. How to parameterize $P(t_i \mid t_{i-2}, t_{i-1}, w_1 \ldots w_n)$?
2. How to find $\arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)$?
The Parameterization Problem
Hispaniola/NNP quickly/RB became/VB an/DT
important/JJ base/?? from which Spain expanded
its empire into the rest of the Western Hemisphere .
There are many possible tags in the position ??
Need to learn a function from (context, tag) pairs to a probability
Representation: Histories
A history is a 4-tuple $\langle t_{-2}, t_{-1}, w_{[1:n]}, i \rangle$

$t_{-2}, t_{-1}$ are the previous two tags.
$w_{[1:n]}$ are the $n$ words in the input sentence.
$i$ is the index of the word being tagged.

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

Here $t_{-2}, t_{-1}$ = DT, JJ; $w_{[1:n]}$ = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩; $i = 6$.
Feature-Vector Representations

Take a history/tag pair $(h, t)$.

$\phi_s(h, t)$ for $s = 1 \ldots d$ are features representing tagging decision $t$ in context $h$.

Example: POS Tagging [Ratnaparkhi 96]

Word/tag features:
$\phi_s(h, t) = 1$ if current word $w_i$ is "base" and $t$ = VB; 0 otherwise
$\phi_s(h, t) = 1$ if current word $w_i$ ends in "ing" and $t$ = VBG; 0 otherwise

Contextual features:
$\phi_s(h, t) = 1$ if $\langle t_{-2}, t_{-1}, t \rangle$ = ⟨DT, JJ, VB⟩; 0 otherwise
Part-of-Speech (POS) Tagging [Ratnaparkhi 96]
Word/tag features:
$\phi_s(h, t) = 1$ if current word $w_i$ is "base" and $t$ = VB; 0 otherwise

Spelling features:
$\phi_s(h, t) = 1$ if current word $w_i$ ends in "ing" and $t$ = VBG; 0 otherwise
$\phi_s(h, t) = 1$ if current word $w_i$ starts with "pre" and $t$ = NN; 0 otherwise
Ratnaparkhi's POS Tagger
Contextual features:
$\phi_s(h, t) = 1$ if $\langle t_{-2}, t_{-1}, t \rangle$ = ⟨DT, JJ, VB⟩; 0 otherwise
$\phi_s(h, t) = 1$ if $\langle t_{-1}, t \rangle$ = ⟨JJ, VB⟩; 0 otherwise
$\phi_s(h, t) = 1$ if $\langle t \rangle$ = ⟨VB⟩; 0 otherwise
$\phi_s(h, t) = 1$ if previous word $w_{i-1}$ = "the" and $t$ = VB; 0 otherwise
$\phi_s(h, t) = 1$ if next word $w_{i+1}$ = "the" and $t$ = VB; 0 otherwise
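These indicator features are straightforward to express as functions of the (history, tag) pair. A sketch with hypothetical feature names, covering the word/tag, spelling and contextual features above:

```python
# A history h is the 4-tuple (t_minus2, t_minus1, words, i) defined earlier;
# each feature fires (value 1) on a particular (history, tag) combination.
def features(h, t):
    t2, t1, words, i = h
    return {
        ("word=base", "VB"):   1 if words[i] == "base" and t == "VB" else 0,
        ("suffix=ing", "VBG"): 1 if words[i].endswith("ing") and t == "VBG" else 0,
        ("prefix=pre", "NN"):  1 if words[i].startswith("pre") and t == "NN" else 0,
        ("tags=DT,JJ", "VB"):  1 if (t2, t1, t) == ("DT", "JJ", "VB") else 0,
        ("prev=the", "VB"):    1 if i > 0 and words[i-1] == "the" and t == "VB" else 0,
        ("next=the", "VB"):    1 if i + 1 < len(words) and words[i+1] == "the" and t == "VB" else 0,
    }

words = "Hispaniola quickly became an important base".split()
h = ("DT", "JJ", words, 5)   # tagging position 5 ("base"), 0-indexed
print(features(h, "VB"))
```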
Log-Linear (Maximum-Entropy) Models
Take a history/tag pair $(h, t)$.

$\phi_s(h, t)$ for $s = 1 \ldots d$ are features
$W_s$ for $s = 1 \ldots d$ are parameters

Parameters define a conditional distribution
$$P(t \mid h; W) = \frac{e^{\phi(h, t) \cdot W}}{Z(h, W)}$$
where
$$Z(h, W) = \sum_{t' \in \mathcal{T}} e^{\phi(h, t') \cdot W}$$
Log-Linear (Maximum Entropy) Models
Word sequence: $w_{[1:n]} = [w_1, w_2, \ldots, w_n]$
Tag sequence: $t_{[1:n]} = [t_1, t_2, \ldots, t_n]$
Histories: $h_i = \langle t_{i-2}, t_{i-1}, w_{[1:n]}, i \rangle$

$$\log P(t_{[1:n]} \mid w_{[1:n]}) = \underbrace{\sum_{i=1}^{n} \phi(h_i, t_i) \cdot W}_{\text{Linear Score}} - \underbrace{\sum_{i=1}^{n} \log Z(h_i, W)}_{\text{Local Normalization Terms}}$$
Log-Linear Models
Parameter estimation: Maximize likelihood of training data through gradient descent or iterative scaling.

Search for $\arg\max_{t_{[1:n]}} P(t_{[1:n]} \mid w_{[1:n]})$: dynamic programming, $O(n |\mathcal{T}|^3)$ complexity.

Experimental results:
Almost 97% accuracy for POS tagging [Ratnaparkhi 96]
Over 90% accuracy for named-entity extraction [Borthwick et al. 98]
Better results than an HMM for FAQ segmentation [McCallum et al. 2000]
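For a trigram tagger the argmax can be found exactly with the Viterbi algorithm. A sketch, assuming a supplied log_p(h, t) returning log P(t | h) for history h = (t_minus2, t_minus1, words, i) (the implementation details here are illustrative, not Ratnaparkhi's):

```python
# Viterbi search over tag sequences for a trigram log-linear tagger.
# Runs in O(n |T|^3) time, matching the complexity stated above.
def viterbi(words, tags, log_p):
    n = len(words)
    # pi[(i, u, v)] = best log-prob of any tag prefix ending at position i
    # with last two tags (u, v); "*" pads the two positions before the start.
    pi = {(-1, "*", "*"): 0.0}
    bp = {}
    for i in range(n):
        prev_pairs = {(u, v) for (j, u, v) in pi if j == i - 1}
        for u, v in prev_pairs:
            for t in tags:
                score = pi[(i - 1, u, v)] + log_p((u, v, words, i), t)
                if score > pi.get((i, v, t), float("-inf")):
                    pi[(i, v, t)] = score
                    bp[(i, v, t)] = u
    # pick the best final tag pair, then follow back-pointers
    _, u, v = max((k for k in pi if k[0] == n - 1), key=pi.get)
    seq = [u, v]
    for j in range(n - 1, 1, -1):
        seq.insert(0, bp[(j, seq[0], seq[1])])
    return seq[-n:]   # drop any leading "*" padding
```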
Linear Models for Classification
Goal: learn a function $F : \mathcal{X} \rightarrow \mathcal{Y}$ where $\mathcal{Y} = \{-1, +1\}$

Training examples $(x_i, y_i)$ for $i = 1 \ldots n$

A representation $\phi : \mathcal{X} \rightarrow \mathbb{R}^d$, parameter vector $W \in \mathbb{R}^d$.

Classifier is defined as
$$F(x) = \mathrm{sign}(\phi(x) \cdot W)$$

Unifying framework for many results: support vector machines, boosting, kernel methods, logistic regression, margin-based generalization bounds, online algorithms (perceptron, winnow), mistake bounds, etc.

How can these methods be generalized beyond classification problems?
Linear Models for Parsing and Tagging
Goal: learn a function $F : \mathcal{X} \rightarrow \mathcal{Y}$

Training examples $(x_i, y_i)$ for $i = 1 \ldots n$

Three components to the model:
A function $\mathrm{GEN}(x)$ enumerating candidates for $x$
A representation $\Phi : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$
A parameter vector $W \in \mathbb{R}^d$

Function is defined as
$$F(x) = \arg\max_{y \in \mathrm{GEN}(x)} \Phi(x, y) \cdot W$$
Component 1: GEN
GEN(x) enumerates a set of candidates for a sentence x, e.g., for the input
She announced a program to promote safety in trucks and vans
(GEN(x) returns the six candidate parse trees shown earlier for this sentence, differing in PP attachment and coordination scope.)
Examples of GEN
A context-free grammar
A finite-state machine
Top $N$ most probable analyses from a probabilistic grammar
Features
A feature is a function on a structure, e.g.,

$h(y)$ = Number of times (A → B C) is seen in $y$

For example, for
$T_1$ = (A (B (D d) (E e)) (C (F f) (G g)))
we have $h(T_1) = 1$.
Feature Vectors
A set of functions $h_1 \ldots h_d$ define a feature vector

$$\Phi(x) = \langle h_1(x), h_2(x), \ldots, h_d(x) \rangle$$

where each $h_s$ counts occurrences of a particular rule or fragment in the structure, as in the example trees above.
Component 3: W
$W$ is a parameter vector in $\mathbb{R}^d$.

$\Phi$ and $W$ together map a candidate to a real-valued score:
$$\Phi(x, y) \cdot W$$

(On the slide, one candidate parse of "She announced a program to promote safety in trucks and vans" is mapped to its feature vector, and the inner product with $W$ gives its score.)
Putting it all Together
$\mathcal{X}$ is a set of sentences, $\mathcal{Y}$ is a set of possible outputs (e.g. trees)

Need to learn a function $F : \mathcal{X} \rightarrow \mathcal{Y}$

Given $\mathrm{GEN}$, $\Phi$ and $W$, define
$$F(x) = \arg\max_{y \in \mathrm{GEN}(x)} \Phi(x, y) \cdot W$$

Choose the highest scoring tree as the most plausible structure.

Given examples $(x_i, y_i)$, how to set $W$?
She announced a program to promote safety in trucks and vans
(The six candidate parses from GEN, scored by $\Phi(x, y) \cdot W$: 13.6, 12.2, 12.1, 3.3, 9.4 and 11.1. The highest-scoring tree, with score 13.6, is selected as the output.)
Markov Random Fields [Johnson et al. 1999, Lafferty et al. 2001]
Parameters define a conditional distribution over candidates:
$$P(y \mid x; W) = \frac{e^{\Phi(x, y) \cdot W}}{\sum_{y' \in \mathrm{GEN}(x)} e^{\Phi(x, y') \cdot W}}$$

Gaussian prior: $-\frac{\|W\|^2}{2\sigma^2}$

MAP parameter estimates maximise
$$\sum_{i=1}^{n} \log P(y_i \mid x_i; W) - \frac{\|W\|^2}{2\sigma^2}$$

Note: This is a globally normalised model.
A Variant of the Perceptron Algorithm
Inputs: Training set $(x_i, y_i)$ for $i = 1 \ldots n$

Initialization: $W = 0$

Define: $F(x) = \arg\max_{y \in \mathrm{GEN}(x)} \Phi(x, y) \cdot W$

Algorithm: For $t = 1 \ldots T$, $i = 1 \ldots n$:
    $z_i = F(x_i)$
    If $z_i \neq y_i$ then $W = W + \Phi(x_i, y_i) - \Phi(x_i, z_i)$

Output: Parameters $W$
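The algorithm translates almost line-for-line into code. A sketch, assuming GEN(x) returns a list of candidates and phi(x, y) returns a sparse dict of feature counts (both are supplied by the user; the names are illustrative):

```python
# Structured perceptron: additive updates toward the gold structure and
# away from the current best-scoring (wrong) candidate.
def perceptron_train(examples, GEN, phi, T=5):
    """examples: list of (x, y) pairs; returns the parameter dict W."""
    W = {}  # sparse parameter vector, feature -> weight

    def score(x, y):
        return sum(W.get(f, 0.0) * v for f, v in phi(x, y).items())

    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda cand: score(x, cand))
            if z != y:
                for f, v in phi(x, y).items():   # promote correct features
                    W[f] = W.get(f, 0.0) + v
                for f, v in phi(x, z).items():   # demote incorrect features
                    W[f] = W.get(f, 0.0) - v
    return W
```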
Results that Carry Over from Classification
[Freund and Schapire 99] define the Voted Perceptron, and prove results for the inseparable case.

Compression bounds [Littlestone and Warmuth, 1986]:
Say the perceptron algorithm makes $m$ mistakes when run to convergence over a training set of size $n$. Then for all distributions $D$, with probability at least $1 - \delta$ over the choice of training set of size $n$ drawn from $D$, if $h$ is the hypothesis at convergence, its expected error satisfies

$$\mathrm{ER}(h) \le \frac{1}{n - m} \left( m \log n + \log \frac{1}{\delta} \right)$$

NB. $m \le R^2 / \gamma^2$.
Define:

The margin on the $i$'th example:
$$m_i = \Phi(x_i, y_i) \cdot W - \max_{z \in \mathrm{GEN}(x_i),\, z \neq y_i} \Phi(x_i, z) \cdot W$$

For a value $\gamma > 0$, the proportion of training examples with margin at most $\gamma$:
$$\hat{\epsilon}_\gamma = \frac{1}{n} \left| \{ i : m_i \le \gamma \} \right|$$

Generalization Theorem: For all distributions $D(x, y)$ generating examples, for all $W$ with $\|W\| = 1$, for all $\gamma > 0$, with probability at least $1 - \delta$ over the choice of training set of size $n$ drawn from $D$,

$$\mathrm{ER}(W) \le \hat{\epsilon}_\gamma + O\left( \frac{1}{\sqrt{n}} \sqrt{ \frac{R^2 \log n \log k}{\gamma^2} + \log \frac{1}{\delta} } \right)$$

where $R$ is a constant such that $\forall i, \forall z \in \mathrm{GEN}(x_i)$: $\|\Phi(x_i, y_i) - \Phi(x_i, z)\| \le R$, and $k$ is the smallest positive integer such that $\forall i$: $|\mathrm{GEN}(x_i)| \le k$.
Proof: See [Collins 2002b]. Based on [Bartlett 1998, Zhang, 2002]; closely
related to multiclass bounds in e.g., [Schapire et al., 1998].
Perceptron Algorithm 1: Tagging
Going back to tagging:

Inputs $x$ are sentences $w_{[1:n]}$

$\mathrm{GEN}(w_{[1:n]}) = \mathcal{T}^n$, i.e. all tag sequences of length $n$

Global representations $\Phi$ are composed from local feature vectors $\phi$:
$$\Phi(w_{[1:n]}, t_{[1:n]}) = \sum_{i=1}^{n} \phi(h_i, t_i)$$
where $h_i = \langle t_{i-2}, t_{i-1}, w_{[1:n]}, i \rangle$
Perceptron Algorithm 1: Tagging
Typically, local features are indicator functions, e.g.,

$\phi_s(h, t) = 1$ if current word $w_i$ ends in "ing" and $t$ = VBG; 0 otherwise

and global features are then counts:

$\Phi_s(w_{[1:n]}, t_{[1:n]})$ = Number of times a word ending in "ing" is tagged as VBG in $(w_{[1:n]}, t_{[1:n]})$
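Composing global features as sums of local feature vectors might look as follows (a sketch; the "*" padding convention for the first two histories is an assumption):

```python
# Global feature vector = sum of local feature vectors over positions,
# as in the equation above. local_phi(h, t) returns a dict of counts.
from collections import Counter

def global_features(words, tags, local_phi):
    padded = ["*", "*"] + list(tags)   # pad the two initial tag positions
    Phi = Counter()
    for i in range(len(words)):
        h = (padded[i], padded[i + 1], words, i)
        Phi.update(local_phi(h, tags[i]))
    return Phi
```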
Perceptron Algorithm 1: Tagging
Score for a $(w_{[1:n]}, t_{[1:n]})$ pair is
$$\sum_{i=1}^{n} \phi(h_i, t_i) \cdot W$$

Viterbi algorithm for
$$\arg\max_{t_{[1:n]}} \sum_{i=1}^{n} \phi(h_i, t_i) \cdot W$$

Log-linear taggers (see earlier part of the tutorial) are locally normalised models:
$$\log P(t_{[1:n]} \mid w_{[1:n]}) = \underbrace{\sum_{i=1}^{n} \phi(h_i, t_i) \cdot W}_{\text{Linear Model}} - \underbrace{\sum_{i=1}^{n} \log Z(h_i, W)}_{\text{Local Normalization}}$$
Training the Parameters
Inputs: Training set $(w^i_{[1:n_i]}, t^i_{[1:n_i]})$ for $i = 1 \ldots n$.

Initialization: $W = 0$

Algorithm: For $t = 1 \ldots T$, $i = 1 \ldots n$:

$$z_{[1:n_i]} = \arg\max_{u_{[1:n_i]}} \Phi(w^i_{[1:n_i]}, u_{[1:n_i]}) \cdot W$$

($z_{[1:n_i]}$ is the output on the $i$'th sentence with current parameters, found with the Viterbi algorithm.)

If $z_{[1:n_i]} \neq t^i_{[1:n_i]}$ then

$$W = W + \underbrace{\Phi(w^i_{[1:n_i]}, t^i_{[1:n_i]})}_{\text{correct tags' feature value}} - \underbrace{\Phi(w^i_{[1:n_i]}, z_{[1:n_i]})}_{\text{incorrect tags' feature value}}$$

Output: Parameter vector $W$.
An Example
Say the correct tags for the $i$'th sentence are
the/DT man/NN bit/VBD the/DT dog/NN

Under current parameters, output is
the/DT man/NN bit/NN the/DT dog/NN

Assume also that features track: (1) all bigrams; (2) word/tag pairs

Parameters incremented:
⟨NN, VBD⟩, ⟨VBD, DT⟩, ⟨VBD → bit⟩

Parameters decremented:
⟨NN, NN⟩, ⟨NN, DT⟩, ⟨NN → bit⟩
Experiments
Wall Street Journal part-of-speech tagging data:
Perceptron = 2.89% error, Max-ent = 3.28% error (11.9% relative error reduction)

[Ramshaw and Marcus 95] NP chunking data:
Perceptron = 93.63% accuracy, Max-ent = 93.29% accuracy (5.1% relative error reduction)

See [Collins 2002a]
Perceptron Algorithm 2: Reranking Approaches
GEN(x) is the top $N$ most probable candidates from a base model:
Parsing: a lexicalized probabilistic context-free grammar
Tagging: maximum entropy tagger
Speech recognition: existing recogniser
Parsing Experiments
Beam search used to parse training and test sentences: around 27 parses for each sentence
"
'
, where
log-likelihood fromfi rst-pass parser,
are
indicator functions
if
contains "
'
otherwise
S
NP
She
VP
announced NP
NP
a program
VP
to VP
promote NP
safety PP
in NP
NP
trucks
and NP
vans
"
'
Named Entities
Top 20 segmentations from a maximum-entropy tagger
"
'
,
if
contains a boundary = [The
otherwise
Whether youre an aging flower child or a clueless
[Gen-Xer], [The Day They Shot John Lennon], playing at the
[Dougherty Arts Center], entertains the imagination.
"
'
Whether youre an aging flower child or a clueless
[Gen-Xer], [The Day They Shot John Lennon], playing at the
[Dougherty Arts Center] entertains the imagination
8/8/2019 Problematic Language Processing in Artificial Intelligence
75/91
[Dougherty Arts Center], entertains the imagination.
"
'
Whether youre an aging flower child or a cluelessGen-Xer, The Day [They Shot John Lennon], playing at the
[Dougherty Arts Center], entertains the imagination.
"
'
Whether youre an aging flower child or a clueless
[Gen-Xer], The Day [They Shot John Lennon], playing at the
[Dougherty Arts Center], entertains the imagination.
"
'
Experiments
Parsing, Wall Street Journal Treebank:
Training set = 40,000 sentences, test = 2,416 sentences
State-of-the-art parser: 88.2% F-measure
Reranked model: 89.5% F-measure (11% relative error reduction)
Boosting: 89.7% F-measure (13% relative error reduction)

Recovering Named Entities in Web Data:
Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)
State-of-the-art tagger: 85.3% F-measure
Reranked model: 87.9% F-measure (17.7% relative error reduction)
Boosting: 87.6% F-measure (15.6% relative error reduction)
Perceptron Algorithm 3: Kernel Methods (Work with Nigel Duffy)
It's simple to derive a dual form of the perceptron algorithm

If we can compute the inner product $\Phi(x, y) \cdot \Phi(x', y')$ efficiently, we can learn efficiently with the representation $\Phi$
All Subtrees Representation [Bod 98]

Given: non-terminal symbols {A, B, C, ...} and terminal symbols {a, b, c, ...}
An infinite set of subtrees, e.g.,
(A B C), (A (B b) E), (A (B b) C), (A B), ...
Step 1: Choose an (arbitrary) mapping from subtrees to integers

$h_i(x)$ = Number of times subtree $i$ is seen in $x$

$\Phi(x) = \langle h_1(x), h_2(x), h_3(x), \ldots \rangle$
All Subtrees Representation
$\Phi$ is now huge

But the inner product $\Phi(T_1) \cdot \Phi(T_2)$ can be computed efficiently using dynamic programming. See [Collins and Duffy 2001, Collins and Duffy 2002]
An Example
$T_1$ = (A (B (D d) (E e)) (C (F f) (G g)))
$T_2$ = (A (B (D d) (E e)) (C (F h) (G i)))

$$\Phi(T_1) \cdot \Phi(T_2) = \sum_i h_i(T_1)\, h_i(T_2)$$

Most of these terms are 0 (subtrees appearing in only one of the two trees). Some are non-zero, e.g. the subtrees rooted at B that occur in both trees:

(B D E), (B (D d) E), (B D (E e)), (B (D d) (E e))
Recursive Definition of $C(n_1, n_2)$

Define $C(n_1, n_2)$ = number of common subtrees rooted at nodes $n_1$ and $n_2$.

If the productions at $n_1$ and $n_2$ are different: $C(n_1, n_2) = 0$

Else if $n_1, n_2$ are pre-terminals: $C(n_1, n_2) = 1$

Else:
$$C(n_1, n_2) = \prod_{j=1}^{nc(n_1)} \left( 1 + C(ch(n_1, j), ch(n_2, j)) \right)$$

where $nc(n_1)$ is the number of children of node $n_1$, and $ch(n_1, j)$ is the $j$'th child of $n_1$.
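The recursion translates directly into a short dynamic program (memoization omitted for brevity). A sketch, using nested tuples for trees; production, is_preterminal and tree_kernel are assumed helper names:

```python
# All-subtrees kernel via the C(n1, n2) recursion above.
# Trees are nested tuples, e.g.
# ("A", ("B", ("D", "d"), ("E", "e")), ("C", ("F", "f"), ("G", "g"))).
def production(n):
    return (n[0],) + tuple(c if isinstance(c, str) else c[0] for c in n[1:])

def is_preterminal(n):
    return len(n) == 2 and isinstance(n[1], str)

def C(n1, n2):
    """Number of common subtrees rooted at nodes n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):   # same arity, since productions match
        result *= 1 + C(c1, c2)
    return result

def tree_kernel(t1, t2):
    """Inner product of all-subtree vectors: sum C over all node pairs."""
    nodes = lambda t: [t] + [m for c in t[1:] if not isinstance(c, str)
                             for m in nodes(c)]
    return sum(C(a, b) for a in nodes(t1) for b in nodes(t2))
```

On the example trees above, C at the two B nodes gives (1 + 1)(1 + 1) = 4, matching the four common subtrees listed on the previous slide.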
Illustration of the Recursion
$T_1$ = (A (B (D d) (E e)) (C (F f) (G g)))
$T_2$ = (A (B (D d) (E e)) (C (F h) (G i)))

How many subtrees do the B nodes have in common? i.e., what is $C(B_1, B_2)$?

(B D E), (B (D d) E), (B D (E e)), (B (D d) (E e))

so $C(B_1, B_2) = 4$. At the C nodes only (C F G) is shared, so $C(C_1, C_2) = 1$.
Similar Kernels Exist for Tagged Sequences
Whether you're an aging flower child or a clueless [Gen-Xer], [The Day They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

Fragments of the tagged sequence, e.g.:
Whether ... [Gen-Xer], ... Day They ... John Lennon], ... playing
Whether you're an aging flower child or a clueless [Gen
Open Questions
Can the large-margin hypothesis be trained efficiently, even when $\mathrm{GEN}(x)$ is huge? (see [Altun, Tsochantaridis, and Hofmann, 2003])

Can the large-margin bound be improved, to remove the $\log n \log k$ factor?

Which representations lead to good performance on parsing, tagging etc.?

Can methods with global kernels be implemented efficiently?
Conclusions
Some Other Topics in Statistical NLP:
Machine translation
Unsupervised/partially supervised methods
Finite state machines
Generation
Question answering
Coreference
Language modeling for speech recognition
Lexical semantics
Word sense disambiguation
Summarization
References
[Altun, Tsochantaridis, and Hofmann, 2003] Altun, Y., Tsochantaridis, I., and Hofmann, T. 2003. Hidden Markov Support Vector Machines. In Proceedings of ICML 2003.

[Bartlett 1998] Bartlett, P. L. 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536.

[Bod 98] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

[Booth and Thompson 73] Booth, T., and Thompson, R. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5):442-450.

[Borthwick et al. 98] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.

[Collins and Duffy 2001] Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural Language. In Proceedings of NIPS 14.

[Collins and Duffy 2002] Collins, M. and Duffy, N. (2002). New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings of ACL 2002.

[Collins 2002a] Collins, M. (2002a). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.

[Collins 2002b] Collins, M. (2002b). Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods. To appear as a book chapter.

[Crammer and Singer 2001a] Crammer, K., and Singer, Y. 2001a. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(Dec):265-292.

[Crammer and Singer 2001b] Crammer, K., and Singer, Y. 2001b. Ultraconservative Online Algorithms for Multiclass Problems. In Proceedings of COLT 2001.
[Freund and Schapire 99] Freund, Y. and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277-296.

[Helmbold and Warmuth 95] Helmbold, D., and Warmuth, M. (1995). On Weak Learning. Journal of Computer and System Sciences, 50(3):551-573.

[Hopcroft and Ullman 1979] Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley.

[Johnson et al. 1999] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic unification-based grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

[Lafferty et al. 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282-289.

[Littlestone and Warmuth, 1986] Littlestone, N., and Warmuth, M. 1986. Relating data compression and learnability. Technical report, University of California, Santa Cruz.

[MSM93] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19:313-330.

[McCallum et al. 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML 2000.

[Miller et al. 2000] Miller, S., Fox, H., Ramshaw, L., and Weischedel, R. 2000. A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of ANLP 2000.
[Ramshaw and Marcus 95] Ramshaw, L., and Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora.

[Ratnaparkhi 96] Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference.

[Schapire et al., 1998] Schapire, R., Freund, Y., Bartlett, P., and Lee, W. S. 1998. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686.

[Zhang, 2002] Zhang, T. 2002. Covering Number Bounds of Certain Regularized Linear Function Classes. Journal of Machine Learning Research, 2(Mar):527-550.