CS294-1 Behavioral Data Mining: transcript of jfc/DataMining/SP13/lecs/lec16.pdf
Behavioral Data Mining
Lecture 16
Natural Language Processing I
Outline
• Parts-of-Speech and POS taggers
• HMMs
• Named Entity Recognition
• Conditional Random Fields (CRFs)
Part-Of-Speech Tags
• Part-of-speech tags provide limited but useful information
about word type (e.g. local sentiment attachment).
• They are widely used as an alternative to full parsing, which
can be very slow.
• While our intuition about tags tends to be semantic (nouns
are “people, places, things”), classification and inference is
based on morphology and context.
Penn Treebank Tagset
Part-Of-Speech Tags
POS tagging normally follows tokenization, which includes
punctuation:
Sheena | leads | , | Sheila | needs | .
POS tags
• Morphology:
– “liked,” “follows,” “poke”
• Context: “can” can be:
– “trash can,” “can do,” “can it”
• Therefore taggers should look at the neighborhood of a
word.
Constraint-Based Tagging
Based on a table of words with morphological and context
features:
Outline
• Parts-of-Speech and POS taggers
• HMMs
• Named Entity Recognition
• Conditional Random Fields (CRFs)
Markov Models (notes due to David Hall)
• A Markov model is a chain-structured Bayes Net
– Each node is identically distributed (stationarity)
– Value of X at a given time is called the state
– As a Bayes net:
– Parameters, called transition probabilities or dynamics, specify how the state evolves over time (plus initial probabilities)
(Diagram: chain X1 → X2 → X3 → X4)
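As a concrete sketch, the transition parameters and initial probabilities can be written down and sampled from; the two-state weather chain below is an illustrative toy, not from the lecture:

```python
import random

# Toy two-state Markov chain (illustrative numbers).
states = ["sun", "rain"]
init = {"sun": 0.9, "rain": 0.1}                 # initial probabilities
trans = {"sun": {"sun": 0.8, "rain": 0.2},       # transition probabilities
         "rain": {"sun": 0.4, "rain": 0.6}}      # (the "dynamics")

def sample_chain(n, seed=0):
    """Draw a length-n state sequence from the chain."""
    rng = random.Random(seed)
    x = rng.choices(states, weights=[init[s] for s in states])[0]
    seq = [x]
    for _ in range(n - 1):
        x = rng.choices(states, weights=[trans[x][s] for s in states])[0]
        seq.append(x)
    return seq
```

Each state depends only on the previous state, which is exactly the chain-structured Bayes net above.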
Hidden Markov Models
• Markov chains are not so useful for most applications
– Eventually you don't know anything anymore
– Need observations to update your beliefs
• Hidden Markov models (HMMs)
– Underlying Markov chain over states S
– You observe outputs (effects) at each time step
– As a Bayes’ net:
(Diagram: hidden chain X1 → X2 → X3 → X4 → X5, each state Xt emitting an observation Et)
Example
• An HMM is defined by:
– Initial distribution: P(X_1)
– Transitions: P(X_t | X_{t-1})
– Emissions: P(E_t | X_t)
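Concretely, the three ingredients can be written as arrays; the two-state umbrella model below is a made-up toy for illustration:

```python
import numpy as np

# States: 0 = "rain", 1 = "sun"; observations: 0 = "umbrella", 1 = "no umbrella".
init  = np.array([0.5, 0.5])       # initial distribution P(X_1)
trans = np.array([[0.7, 0.3],      # trans[i, j] = P(X_t = j | X_{t-1} = i)
                  [0.3, 0.7]])
emit  = np.array([[0.9, 0.1],      # emit[s, o] = P(E_t = o | X_t = s)
                  [0.2, 0.8]])

# Every conditional distribution must sum to 1 over its outcomes.
assert np.allclose(trans.sum(axis=1), 1.0)
assert np.allclose(emit.sum(axis=1), 1.0)
```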
Conditional Independence
• HMMs have two important independence properties:
– Markov hidden process: the future depends on the past only via the present
– Current observation independent of all else given current state
• Does this mean that observations are independent given no evidence?
– [No, correlated by the hidden state]
(Diagram: hidden chain X1 → X2 → X3 → X4 → X5, each state Xt emitting an observation Et)
Real HMM Examples
• POS tagging
• DNA alignment: observations are nucleotides (ACGT); transitions are substitutions, insertions, deletions
• Named Entity Recognition / field segmentation: observations are words (tens of thousands); states are "Person", "Place", "Other", etc. (usually 3-20)
• Ad targeting: observations are click streams; states are interests, or likelihood to buy
Named Entity Recognition
• People, places, (specific) things – “Chez Panisse”: Restaurant
– “Berkeley, CA”: Place
– “Churchill-Brenneis Orchards”: Company(?)
– Not “Page” or “Medjool”: part of a non-rigid designator
Named Entity Recognition
• Why a sequence model?
– "Page" can be "Larry Page" or "Page mandarins" or just a page.
– “Berkeley” can be a person.
Named Entity Recognition
• Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates
• States:
– Restaurant, Place, Company, GPE, Movie, "Outside"
– Usually use BeginRestaurant, InsideRestaurant. Why?
• Emissions: words
– Estimating these probabilities is harder.
Filtering / Monitoring
• Filtering, or monitoring, is the task of tracking the distribution
B(X) (the belief state) over time
• We start with B(X) in an initial setting, usually uniform
• As time passes, or we get observations, we update B(X)
• The Kalman filter (a continuous-state analogue of the HMM) was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program
![Page 18: CS294-1 Behavioral Data Miningjfc/DataMining/SP13/lecs/lec16.pdf · Part-Of-Speech Tags • Part-of-speech tags provide limited but useful information about word type (e.g. local](https://reader035.fdocuments.in/reader035/viewer/2022071212/6025f5402a23702a5116ab26/html5/thumbnails/18.jpg)
Passage of Time
Assume we have the current belief B(X_t) = P(X_t | e_{1:t}), the evidence to date.
Then, after one time step passes:
P(X_{t+1} | e_{1:t}) = Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})
Or, compactly: B'(X_{t+1}) = Σ_{x_t} P(X_{t+1} | x_t) B(x_t)
Basic idea: beliefs get "pushed" through the transitions
(Diagram: X1 → X2)
Example HMM
The Forward Algorithm
We are given evidence at each time and want to know P(X_t | e_{1:t}).
We can derive the following update:
P(x_t, e_{1:t}) = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}, e_{1:t-1})
We can normalize as we go if we want to have P(x | e) at each time step, or just once at the end.
The Forward Algorithm
• Expressible as repeated matrix multiplication
• Careful: you can get numerical underflow very quickly!
– Use logs, or renormalize after each product
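A minimal sketch of the forward recursion with per-step renormalization (the function and the toy model numbers are illustrative, not from the slides):

```python
import numpy as np

def forward(init, trans, emit, observations):
    """Forward algorithm with per-step renormalization to avoid underflow.

    init:  (S,)   initial state distribution
    trans: (S, S) trans[i, j] = P(next = j | current = i)
    emit:  (S, O) emit[s, o]  = P(obs = o | state = s)
    Returns the filtered distributions P(X_t | e_{1:t}), one row per step.
    """
    belief = init * emit[:, observations[0]]
    belief = belief / belief.sum()          # renormalize instead of using logs
    beliefs = [belief]
    for obs in observations[1:]:
        # Push beliefs through the transitions, then weight by the evidence.
        belief = (belief @ trans) * emit[:, obs]
        belief = belief / belief.sum()
        beliefs.append(belief)
    return np.array(beliefs)

# Toy 2-state model: state 0 = "rain", 1 = "sun"; obs 0 = "umbrella", 1 = "no umbrella".
init  = np.array([0.5, 0.5])
trans = np.array([[0.7, 0.3], [0.3, 0.7]])
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])
beliefs = forward(init, trans, emit, [0, 0, 1])
```

Renormalizing each step keeps the numbers in range and directly yields the filtered posteriors P(X_t | e_{1:t}).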
Most Likely Sequence
The Viterbi Algorithm
Just like the forward algorithm with max instead of sum
Compare to the Forward Algorithm
How do we actually extract the max sequence from V?
– Store back pointers to the best predecessor at each step
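A sketch of Viterbi in log space with back pointers (the function and toy model numbers are illustrative):

```python
import numpy as np

def viterbi(init, trans, emit, observations):
    """Forward recursion with max instead of sum; back pointers recover the path."""
    T, S = len(observations), len(init)
    V = np.zeros((T, S))                  # V[t, s]: best log-score of a path ending in s
    back = np.zeros((T, S), dtype=int)
    V[0] = np.log(init) + np.log(emit[:, observations[0]])
    for t in range(1, T):
        scores = V[t - 1][:, None] + np.log(trans)   # scores[i, j]: best path via i -> j
        back[t] = scores.argmax(axis=0)
        V[t] = scores.max(axis=0) + np.log(emit[:, observations[t]])
    path = [int(V[-1].argmax())]          # best final state...
    for t in range(T - 1, 0, -1):         # ...then follow the back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state model: state 0 = "rain", 1 = "sun"; obs 0 = "umbrella", 1 = "no umbrella".
init  = np.array([0.5, 0.5])
trans = np.array([[0.7, 0.3], [0.3, 0.7]])
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])
```

Working in log space sidesteps the underflow issue entirely, since max commutes with log.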
Estimating HMMs: Forward-Backward
• Given observed output sequences from an HMM (the states stay hidden), the forward-backward algorithm (aka Baum-Welch) produces an estimate of the model parameters.
• Baum-Welch is a special case of EM (Expectation-
Maximization), and requires iteration to find the best
parameters.
• Convergence is generally quite fast (10s of iterations for
speech HMMs).
Higher-Order Markov Models
• Sometimes, first-order history isn't enough
– Part-of-Speech Tagging
• Need to model P(X_t | X_{t-1}, X_{t-2})
• Claim: Can use a first-order model – How?
• Solution: Product States
• Estimating probabilities a little more subtle
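One way to picture product states, using a hypothetical three-tag fragment:

```python
from itertools import product

# A second-order tagger conditions on the two previous tags: P(t_i | t_{i-2}, t_{i-1}).
# Pairing consecutive tags into "product states" (t_{i-1}, t_i) turns this into a
# first-order model over pairs (the tag set here is an illustrative fragment).
tags = ["DT", "NN", "VB"]
pair_states = list(product(tags, tags))      # 3 tags -> 9 product states

def compatible(s, t):
    """A transition (a, b) -> (c, d) is only allowed when b == c,
    since both refer to the same underlying tag."""
    return s[1] == t[0]
```

The first-order transition ( a, b ) → ( b, c ) then carries the second-order probability P(c | a, b), which is the subtlety in estimating the parameters.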
Outline
• Parts-of-Speech and POS taggers
• HMMs
• Named Entity Recognition
• Conditional Random Fields (CRFs)
Conditional Random Fields
• HMMs are generative models of sequences.
• Hidden states X
• Observations E
• Two kinds of parameters:
– Transitions: P(X_t | X_{t-1})
– Emissions: P(E_t | X_t)
(Diagram: states X1, X2, X3, X4, each with an attached feature vector F1, F2, F3, F4)
Generative Models
• Generative Models have both a prior term p(X) and
a likelihood term P(E|X).
• By Bayes’ Rule:
– P(X|E) = P(X) P(E|X) / P(E)
• We always observe E...
• Why do we waste time modeling E?
Generative Models
• Forward algorithm
Discriminative Models
• Key Idea: Model P(X|E) directly!
– Consequence: P(E) and P(E|X) no longer (exactly) make
sense.
• Use a linear score function, just like SVMs or regression:
score(X, E) = w · f(X, E)
• (Generic version:)
– If you have a dynamic program in a generative model with
P(...|...), just replace them with a score function.
CRF intuition
Naïve Bayes → HMMs (generative: single label → sequence)
Logistic Regression → CRFs (discriminative: single label → sequence)
Conditional Random Fields
• Z(E) is called the "partition function" or normalizing constant.
– It's like P(E)
CRFs vs HMMs
• Easy to map HMMs to CRFs
• How?
– One feature for each transition
– One feature for each emission
Named Entity Recognition
• People, places, (specific) things – “Chez Panisse”: Restaurant
– “Berkeley, CA”: Place
– “Churchill-Brenneis Orchards”: Company(?)
– Not “Page” or “Medjool”: part of a non-rigid designator
Named Entity Recognition
• With an HMM, you have to estimate the probability P(E|X)
– But what's P(Panisse | Restaurant)
• If you’ve never seen Panisse before?
• Hard to estimate!
CRF features
• NER: – Features to detect names:
• Word on the left is “Mr.” or “Ms.”
• Word two back is “Mr.” or “Ms.”
• Word is capitalized.
• Caution: If you have features over an “arbitrary” window size, complexity guarantees of forward algorithm go out the window!
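The name-detection features listed above can be sketched as a simple extractor (the function name and feature keys are made up for illustration):

```python
def ner_features(words, t):
    """Illustrative CRF feature extractor for position t in a token list."""
    w = words[t]
    feats = {
        "word=" + w.lower(): 1.0,
        "is_capitalized": float(w[:1].isupper()),   # word is capitalized
    }
    if t >= 1 and words[t - 1] in ("Mr.", "Ms."):
        feats["prev_is_honorific"] = 1.0            # word on the left is "Mr." or "Ms."
    if t >= 2 and words[t - 2] in ("Mr.", "Ms."):
        feats["prev2_is_honorific"] = 1.0           # word two back is "Mr." or "Ms."
    return feats
```

Unlike HMM emissions, these features can overlap and fire together; the "word two back" feature is already a window of size two, which is where the complexity caution comes in.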
CRFs vs HMMs
• Not possible to do unsupervised learning
– You need a generative model, or an MRF
• Slower to train
• Much easier to “overfit” – Use regularization!
Training CRFs
• Needs gradient updates (SGD/LBFGS)
• Optimize the log-likelihood
L(w) = Σ_i [ w · f(x_i, e_i) − log Z(e_i) ]
• Gradient: ∇L = Σ_i [ f(x_i, e_i) − E_{x' ~ P(x' | e_i)} f(x', e_i) ], the true features minus the expected features over x'
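The gradient of the CRF log-likelihood is the observed features minus the model's expected features. For intuition, here it is in the simplest case, a sequence of length one, where the model reduces to multiclass logistic regression and the expectation is a softmax-weighted average over labels (all names are illustrative):

```python
import numpy as np

def crf_gradient_one_position(w, feats, true_label):
    """Gradient of the conditional log-likelihood at a single position.

    feats[y] is the feature vector f(y, e) for candidate label y; the
    gradient is f(true_label, e) minus the expectation of f under the
    model: "ground truth features" minus "model guess features".
    """
    scores = np.array([w @ f for f in feats])
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()                 # softmax: P(y | e)
    expected = sum(p * f for p, f in zip(probs, feats))
    return feats[true_label] - expected
```

For real sequences the expectation is computed with the forward-backward recursions instead of an explicit sum over labelings, but the observed-minus-expected shape is the same.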
Training CRFs
• Interpretation: the derivative drives the model's expected ("guessed") feature counts toward the ground-truth feature counts
Training CRFs
• CRF log likelihood is convex
• Any minimum is a global minimum
CRF Viterbi
• HMM Viterbi
• CRF Viterbi
Software
• Stanford POS tagger:
http://nlp.stanford.edu/software/tagger.shtml
• Stanford NER recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
Summary
• Parts-of-Speech and POS taggers
• HMMs
• Named Entity Recognition
• Conditional Random Fields (CRFs)