Spectracular Learning Algorithms for Natural Language Processing - University of...
Spectracular Learning Algorithms for Natural Language Processing
Shay Cohen
University of Edinburgh
June 10, 2014
Spectral Learning for NLP 1
Latent-variable Models
Latent-variable models are used in many areas of NLP, speech, etc.:
• Hidden Markov Models
• Latent-variable PCFGs
• Naive Bayes for clustering
• Lexical representations: Brown clustering, Saul and Pereira, etc.
• Alignments in statistical machine translation
• Topic modeling
• etc. etc.
The Expectation-Maximization (EM) algorithm is generally used for estimation in these models (Dempster et al., 1977)
Other relevant algorithms: cotraining, clustering methods
Example 1: Hidden Markov Models
S1 S2 S3 S4
the dog saw him
Parameterized by π(s), t(s|s′) and o(w|s)
Spectral learning: Hsu et al. (2009)
Dynamical systems: Siddiqi et al. (2009), Boots and Gordon (2011)
Head-automaton grammars for dep. parsing: Luque et al. (2012)
Example 2: Latent-Variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)
(S (NP (D the) (N dog)) (VP (V saw) (P him)))

=⇒

(S1 (NP3 (D1 the) (N2 dog)) (VP2 (V4 saw) (P1 him)))
Example 3: Naïve Bayes
H
X Y
p(h, x, y) = p(h) × p(x|h) × p(y|h)

Samples: (the, dog), (I, saw), (ran, to), (John, was), ...
• EM can be used to estimate parameters
Example 4: Language Modelling
h
w1 w2
p(w_2|w_1) = Σ_h p(h|w_1) × p(w_2|h)   (Saul and Pereira, 1997)
Example 5: HMMs for Speech
Phoneme boundaries are hidden variables
Refinement HMMs (Stratos et al., 2013)
Example 6: Topic Models
Latent topics attached to a document or to each word in a document
Method of moments algorithms such as Arora et al. (2012; 2013)
Example 7: Unsupervised Parsing
The bear ate the fish

x = (DT, NN, VBD, DT, NN)

u(x) = ((DT NN) (VBD (DT NN)))

[Diagram: two candidate bracketings over w_1 ... w_5, with latent nodes z_1, z_2, z_3]
Latent structure is a bracketing (Parikh et al., 2014)
Similar in flavor to tree learning algorithms (e.g. Anandkumar, 2011)
Very different in flavor from estimation algorithms
Example 8: Word Embeddings
[Scatter plot along PC 1 and PC 2: words such as home, car, house, word, talk, river, dog, agree, cat, listen, boat, carry, truck, sleep, drink, eat, push, disagree embedded in two dimensions]
Embed a vocabulary into d-dimensional space
Can later be used for various NLP problems downstream
Related to canonical correlation analysis (Dhillon et al., 2012)
Spectral Methods

Basic idea: replace EM with methods based on matrix decompositions, in particular singular value decomposition (SVD)

SVD: given a matrix A with m rows and n columns, approximate it as

A ≈ UΣVᵀ

which means

A_jk ≈ Σ_{h=1}^d σ_h U_jh V_kh

where the σ_h are "singular values", and U and V are m×d and n×d matrices
Remarkably, can find the optimal rank-d approximation efficiently
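As a small concrete sketch in numpy (the matrix values below are invented for illustration), the factorization and the entrywise formula can be checked directly:

```python
import numpy as np

# Small dense matrix; the values are invented purely for illustration.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0]])

# Thin SVD: U is m x d, Vt is d x n, with d = min(m, n) = 3 here.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values come back sorted: sigma_1 >= ... >= sigma_d >= 0.
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)

# Entrywise formula: A_jk = sum_h sigma_h * U_jh * V_kh (exact when all
# d singular triples are kept; truncating to fewer gives the approximation).
A_rebuilt = sum(s[h] * np.outer(U[:, h], Vt[h, :]) for h in range(len(s)))
assert np.allclose(A, A_rebuilt)
```

With all d = min(m, n) triples kept the reconstruction is exact; the "≈" in the slide refers to keeping fewer of them.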
Similarity of SVD to Naïve Bayes
H
X Y
P(X = x, Y = y) = Σ_{h=1}^d p(h) p(x|h) p(y|h)

A_jk ≈ Σ_{h=1}^d σ_h U_jh V_kh
• SVD approximation minimizes squared loss, not log-loss
• σ_h not interpretable as probabilities
• U_jh, V_jh may be positive or negative, not probabilities
BUT we can still do a lot with SVD (and higher-order, tensor-based decompositions)
Outline
• Singular value decomposition
• Canonical correlation analysis
• Spectral learning of hidden Markov models
• Algorithm for latent-variable PCFGs
Singular Value Decomposition (SVD)
A (of size m×n) = Σ_{i=1}^d σ_i u_i (v_i)ᵀ, where each σ_i is a scalar, u_i is m×1, and (v_i)ᵀ is 1×n

• d = min(m, n)
• σ_1 ≥ ... ≥ σ_d ≥ 0
• u_1 ... u_d ∈ R^m are orthonormal: ‖u_i‖_2 = 1 and u_i · u_j = 0 for all i ≠ j
• v_1 ... v_d ∈ R^n are orthonormal: ‖v_i‖_2 = 1 and v_i · v_j = 0 for all i ≠ j
SVD in Matrix Form
A (m×n) = U (m×d) Σ (d×d) Vᵀ (d×n)

U = [u_1 ... u_d] ∈ R^{m×d}      Σ = diag(σ_1, ..., σ_d) ∈ R^{d×d}      V = [v_1 ... v_d] ∈ R^{n×d}
Matrix Rank
A ∈ R^{m×n}, rank(A) ≤ min(m, n)

• rank(A) := number of linearly independent columns in A

[1 1 2]      [1 1 2]
[1 2 2]      [1 2 2]
[1 1 2]      [1 1 3]
rank 2       rank 3 (full-rank)
Matrix Rank: Alternative Definition
• rank(A) := number of positive singular values of A

[1 1 2]      [1 1 2]
[1 2 2]      [1 2 2]
[1 1 2]      [1 1 3]

Σ = diag(4.53, 0.7, 0)      Σ = diag(5, 0.98, 0.2)
rank 2                      rank 3 (full-rank)
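The two example matrices above can be checked numerically; a small numpy sketch:

```python
import numpy as np

A = np.array([[1., 1., 2.],
              [1., 2., 2.],
              [1., 1., 2.]])   # third column = 2 x first column
B = np.array([[1., 1., 2.],
              [1., 2., 2.],
              [1., 1., 3.]])

sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)

# A has a (numerically) zero singular value; all of B's are positive.
assert np.allclose(sA, [4.53, 0.70, 0.00], atol=0.01)
assert np.all(sB > 0.1)

# Counting positive singular values agrees with the column-based definition.
assert np.linalg.matrix_rank(A) == 2
assert np.linalg.matrix_rank(B) == 3
```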
SVD and Low-Rank Matrix Approximation
• Suppose we want to find B* such that

B* = argmin_{B : rank(B) = r} Σ_{jk} (A_jk − B_jk)²

• Solution:

B* = Σ_{i=1}^r σ_i u_i (v_i)ᵀ
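A numpy sketch of this (random matrix, invented sizes): truncating the SVD to the top r singular triples gives the best rank-r approximation, and its squared loss equals the sum of the squared discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))          # random matrix; sizes are invented

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
# B* = sum_{i=1}^r sigma_i u_i (v_i)^T: keep only the top-r singular triples.
B_star = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
assert np.linalg.matrix_rank(B_star) == r

# Its squared loss equals the sum of the squared discarded singular values,
# and no other rank-r matrix does better (checked here against random ones).
loss = lambda B: np.sum((A - B) ** 2)
assert np.isclose(loss(B_star), np.sum(s[r:] ** 2))
for _ in range(20):
    C = rng.normal(size=(6, r)) @ rng.normal(size=(r, 5))   # random rank-r
    assert loss(B_star) <= loss(C)
```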
SVD in Practice
• Black box, e.g., in Matlab
• Input: matrix A; output: scalars σ_1 ... σ_d, vectors u_1 ... u_d and v_1 ... v_d
• Efficient implementations
• Approximate, randomized approaches also available
• Can be used to solve a variety of optimization problems
• For instance, Canonical Correlation Analysis (CCA)
SVD in Practice - Random Projections
For large matrices (Halko et al., 2011)
Outline
• Singular value decomposition
• Canonical correlation analysis
• Spectral learning of hidden Markov models
• Algorithm for latent-variable PCFGs
Simplest Model in Complexity: Naive Bayes
H
X Y
p(h, x, y) = p(h) × p(x|h) × p(y|h)

Samples: (the, dog), (I, saw), (ran, to), (John, was), ...
CCA helps identify H
Canonical Correlation Analysis (CCA)
• Data consists of paired samples: (x^(i), y^(i)) for i = 1 ... n
• As in co-training, x^(i) ∈ R^d and y^(i) ∈ R^{d′} are two "views" of a sample point

View 1                       View 2
x^(1) = (1, 0, 0, 0)         y^(1) = (1, 0, 0, 1, 0, 1, 0)
x^(2) = (0, 0, 1, 0)         y^(2) = (0, 1, 0, 0, 0, 0, 1)
...                          ...
x^(100000) = (0, 1, 0, 0)    y^(100000) = (0, 0, 1, 0, 1, 1, 1)
Projection Matrices
• Project samples to a lower dimensional space: x ∈ R^d =⇒ x′ ∈ R^p
• If p is small, we can learn with far fewer samples!
• CCA finds projection matrices A ∈ R^{d×p}, B ∈ R^{d′×p}
• The new data points are a^(i) ∈ R^p, b^(i) ∈ R^p where

a^(i) (p×1) = Aᵀ (p×d) x^(i) (d×1)      b^(i) (p×1) = Bᵀ (p×d′) y^(i) (d′×1)
Mechanics of CCA: Step 1
• Compute C_XY ∈ R^{d×d′}, C_XX ∈ R^{d×d}, and C_YY ∈ R^{d′×d′}:

[C_XY]_jk = (1/n) Σ_{i=1}^n (x_j^(i) − x̄_j)(y_k^(i) − ȳ_k)

[C_XX]_jk = (1/n) Σ_{i=1}^n (x_j^(i) − x̄_j)(x_k^(i) − x̄_k)

[C_YY]_jk = (1/n) Σ_{i=1}^n (y_j^(i) − ȳ_j)(y_k^(i) − ȳ_k)

where x̄ = Σ_i x^(i)/n and ȳ = Σ_i y^(i)/n
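This step can be sketched in a few lines of numpy (sample count and view dimensions below are invented) and cross-checked against numpy's own covariance routine:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, dp = 500, 3, 4                 # sample count and view dimensions (invented)
x = rng.normal(size=(n, d))
y = rng.normal(size=(n, dp))

xc = x - x.mean(axis=0)              # subtract x-bar
yc = y - y.mean(axis=0)              # subtract y-bar

Cxy = xc.T @ yc / n                  # [C_XY]_jk = (1/n) sum_i (x_j - xbar_j)(y_k - ybar_k)
Cxx = xc.T @ xc / n
Cyy = yc.T @ yc / n

assert Cxy.shape == (d, dp) and Cxx.shape == (d, d) and Cyy.shape == (dp, dp)

# Cross-check against numpy's covariance routine (bias=True gives the 1/n scaling).
full = np.cov(np.hstack([x, y]).T, bias=True)
assert np.allclose(Cxy, full[:d, d:])
assert np.allclose(Cxx, full[:d, :d])
assert np.allclose(Cyy, full[d:, d:])
```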
Mechanics of CCA: Step 2
• Do SVD on C_XX^(−1/2) C_XY C_YY^(−1/2) ∈ R^{d×d′}:

C_XX^(−1/2) C_XY C_YY^(−1/2) = UΣVᵀ

Let U_p ∈ R^{d×p} be the top p left singular vectors. Let V_p ∈ R^{d′×p} be the top p right singular vectors.
Mechanics of CCA: Step 3
• Define projection matrices A ∈ R^{d×p} and B ∈ R^{d′×p}:

A = C_XX^(−1/2) U_p      B = C_YY^(−1/2) V_p

• Use A and B to project each (x^(i), y^(i)) for i = 1 ... n:

x^(i) ∈ R^d =⇒ Aᵀ x^(i) ∈ R^p      y^(i) ∈ R^{d′} =⇒ Bᵀ y^(i) ∈ R^p
Justification of CCA: Correlation Coefficients
• Sample correlation coefficient for a_1 ... a_n ∈ R and b_1 ... b_n ∈ R is

Corr({a_i}_{i=1}^n, {b_i}_{i=1}^n) = Σ_{i=1}^n (a_i − ā)(b_i − b̄) / ( √(Σ_{i=1}^n (a_i − ā)²) √(Σ_{i=1}^n (b_i − b̄)²) )

where ā = Σ_i a_i/n, b̄ = Σ_i b_i/n

[Scatter plot of (a, b) pairs lying close to a line: correlation ≈ 1]
Simple Case: p = 1
• CCA projection matrices are vectors u_1 ∈ R^d, v_1 ∈ R^{d′}
• Project x^(i) and y^(i) to scalars u_1 · x^(i) and v_1 · y^(i)
• What vectors does CCA find? Answer:

u_1, v_1 = argmax_{u,v} Corr({u · x^(i)}_{i=1}^n, {v · y^(i)}_{i=1}^n)
Finding the Next Projections
• After finding u_1 and v_1, what vectors u_2 and v_2 does CCA find? Answer:

u_2, v_2 = argmax_{u,v} Corr({u · x^(i)}_{i=1}^n, {v · y^(i)}_{i=1}^n)

subject to the constraints

Corr({u_2 · x^(i)}_{i=1}^n, {u_1 · x^(i)}_{i=1}^n) = 0
Corr({v_2 · y^(i)}_{i=1}^n, {v_1 · y^(i)}_{i=1}^n) = 0
CCA as an Optimization Problem
• CCA finds, for j = 1 ... p (each column of A and B),

u_j, v_j = argmax_{u,v} Corr({u · x^(i)}_{i=1}^n, {v · y^(i)}_{i=1}^n)

subject to the constraints

Corr({u_j · x^(i)}_{i=1}^n, {u_k · x^(i)}_{i=1}^n) = 0
Corr({v_j · y^(i)}_{i=1}^n, {v_k · y^(i)}_{i=1}^n) = 0

for k < j
Guarantees for CCA
H
X Y
• Assume data is generated from a Naive Bayes model
• Latent variable H is of dimension k, variables X and Y are of dimension d and d′ (typically k ≪ d and k ≪ d′)
• Use CCA to project X and Y down to k dimensions (needs (x, y) pairs only!)
• Theorem: the projected samples are as good as the original samples for prediction of H (Foster, Johnson, Kakade, Zhang, 2009)
• Because k ≪ d and k ≪ d′ we can learn to predict H with far fewer labeled examples
Guarantees for CCA (continued)
Kakade and Foster, 2007 - cotraining-style setting:

• Assume that we have a regression problem: predict some value z given two "views" x and y
• Assumption: either view x or y is sufficient for prediction
• Use CCA to project x and y down to a low-dimensional space
• Theorem: if correlation coefficients drop off to zero quickly, we will need far fewer samples to learn when using the projected representation
• Very similar setting to cotraining, but no assumption of independence between the two views
“Variants” of CCA
C_XX^(−1/2) C_XY C_YY^(−1/2) ∈ R^{d×d′}

Centering leads to a non-sparse C_XY.

Computing C_XX^(−1/2) and C_YY^(−1/2) leads to large non-sparse matrices
Outline
• Singular value decomposition
• Canonical correlation analysis
• Spectral learning of hidden Markov models
• Algorithm for latent-variable PCFGs
A Spectral Learning Algorithm for HMMs
• Algorithm due to Hsu, Kakade and Zhang (COLT 2009; JCSS 2012)
• Algorithm relies on singular value decomposition followed by very simple matrix operations
• Close connections to CCA
• Under assumptions on singular values arising from the model, has PAC-learning style guarantees (contrast with EM, which has problems with local optima)
• It is a very different algorithm from EM
Hidden Markov Models (HMMs)
H1 H2 H3 H4
the dog saw him
p(the dog saw him (= x_1 ... x_4), 1 2 1 3 (= h_1 ... h_4))
= π(1) × t(2|1) × t(1|2) × t(3|1) × o(the|1) × o(dog|2) × o(saw|1) × o(him|3)

• Initial parameters: π(h) for each latent state h
• Transition parameters: t(h′|h) for each pair of states h′, h
• Observation parameters: o(x|h) for each state h, obs. x
Hidden Markov Models (HMMs)
H1 H2 H3 H4
the dog saw him
Throughout this section:
• We use m to refer to the number of hidden states
• We use n to refer to the number of possible words (observations)
• Typically, m ≪ n (e.g., m = 20, n = 50,000)
HMMs: the forward algorithm
H1 H2 H3 H4
the dog saw him
p(the dog saw him) = Σ_{h1,h2,h3,h4} p(the dog saw him, h1 h2 h3 h4)

The forward algorithm:

f0_h = π(h)
f1_h = Σ_{h′} t(h|h′) o(the|h′) f0_{h′}
f2_h = Σ_{h′} t(h|h′) o(dog|h′) f1_{h′}
f3_h = Σ_{h′} t(h|h′) o(saw|h′) f2_{h′}
f4_h = Σ_{h′} t(h|h′) o(him|h′) f3_{h′}
p(the dog saw him) = Σ_h f4_h
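A minimal numpy sketch of this recursion (the toy parameter values below are invented, not taken from the tutorial), checked against brute-force enumeration over all hidden-state sequences:

```python
import numpy as np
from itertools import product

m = 2                                   # number of hidden states
words = {"the": 0, "dog": 1, "saw": 2, "him": 3}
pi = np.array([0.7, 0.3])               # pi(h)
T = np.array([[0.6, 0.5],               # T[h, h'] = t(h|h'); columns sum to 1
              [0.4, 0.5]])
O = np.array([[0.4, 0.1],               # O[x, h] = o(x|h); columns sum to 1
              [0.2, 0.3],
              [0.2, 0.3],
              [0.2, 0.3]])

sent = ["the", "dog", "saw", "him"]

# f0_h = pi(h); f^i_h = sum_{h'} t(h|h') o(x_i|h') f^{i-1}_{h'}; p = sum_h f4_h
f = pi.copy()
for w in sent:
    f = T @ (O[words[w]] * f)
p_forward = f.sum()

# Sanity check: sum the joint probability over all m^4 hidden-state sequences.
def joint(hs):
    p = pi[hs[0]] * O[words[sent[0]], hs[0]]
    for i in range(1, len(sent)):
        p *= T[hs[i], hs[i - 1]] * O[words[sent[i]], hs[i]]
    return p

p_brute = sum(joint(hs) for hs in product(range(m), repeat=len(sent)))
assert np.isclose(p_forward, p_brute)
```

The recursion costs O(N m²) for a sentence of length N, versus O(m^N) for the brute-force sum.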
Spectral Learning for NLP 41
![Page 49: Spectracular Learning Algorithms for Natural Language Processing - University of Edinburghhomepages.inf.ed.ac.uk/scohen/cmu-tutorial.pdf · 2014-06-27 · car home house word talk](https://reader034.fdocuments.in/reader034/viewer/2022050520/5fa3ca83d194fd10b53a0ee1/html5/thumbnails/49.jpg)
HMMs: the forward algorithm
H1 H2 H3 H4
the dog saw him
p(the dog saw him) =∑
h1,h2,h3,h4
p(the dog saw him, h1 h2 h3 h4)
The forward algorithm:
f0h = π(h) f1h =∑h′
t(h|h′)o(the|h′)f0h′
f2h =∑h′
t(h|h′)o(dog|h′)f1h′ f3h =∑h′
t(h|h′)o(saw|h′)f2h′
f4h =∑h′
t(h|h′)o(him|h′)f3h′ p(. . .) =∑h
f4h
Spectral Learning for NLP 41
![Page 50: Spectracular Learning Algorithms for Natural Language Processing - University of Edinburghhomepages.inf.ed.ac.uk/scohen/cmu-tutorial.pdf · 2014-06-27 · car home house word talk](https://reader034.fdocuments.in/reader034/viewer/2022050520/5fa3ca83d194fd10b53a0ee1/html5/thumbnails/50.jpg)
HMMs: the forward algorithm
H1 H2 H3 H4
the dog saw him
p(the dog saw him) =∑
h1,h2,h3,h4
p(the dog saw him, h1 h2 h3 h4)
The forward algorithm:
f0h = π(h)
f1h =∑h′
t(h|h′)o(the|h′)f0h′
f2h =∑h′
t(h|h′)o(dog|h′)f1h′ f3h =∑h′
t(h|h′)o(saw|h′)f2h′
f4h =∑h′
t(h|h′)o(him|h′)f3h′ p(. . .) =∑h
f4h
Spectral Learning for NLP 41
![Page 51: Spectracular Learning Algorithms for Natural Language Processing - University of Edinburghhomepages.inf.ed.ac.uk/scohen/cmu-tutorial.pdf · 2014-06-27 · car home house word talk](https://reader034.fdocuments.in/reader034/viewer/2022050520/5fa3ca83d194fd10b53a0ee1/html5/thumbnails/51.jpg)
HMMs: the forward algorithm
H1 H2 H3 H4
the dog saw him
p(the dog saw him) =∑
h1,h2,h3,h4
p(the dog saw him, h1 h2 h3 h4)
The forward algorithm:
f0h = π(h) f1h =∑h′
t(h|h′)o(the|h′)f0h′
f2h =∑h′
t(h|h′)o(dog|h′)f1h′ f3h =∑h′
t(h|h′)o(saw|h′)f2h′
f4h =∑h′
t(h|h′)o(him|h′)f3h′ p(. . .) =∑h
f4h
Spectral Learning for NLP 41
![Page 52: Spectracular Learning Algorithms for Natural Language Processing - University of Edinburghhomepages.inf.ed.ac.uk/scohen/cmu-tutorial.pdf · 2014-06-27 · car home house word talk](https://reader034.fdocuments.in/reader034/viewer/2022050520/5fa3ca83d194fd10b53a0ee1/html5/thumbnails/52.jpg)
HMMs: the forward algorithm
H1 H2 H3 H4
the dog saw him
p(the dog saw him) =∑
h1,h2,h3,h4
p(the dog saw him, h1 h2 h3 h4)
The forward algorithm:
f0h = π(h) f1h =∑h′
t(h|h′)o(the|h′)f0h′
f2h =∑h′
t(h|h′)o(dog|h′)f1h′
f3h =∑h′
t(h|h′)o(saw|h′)f2h′
f4h =∑h′
t(h|h′)o(him|h′)f3h′ p(. . .) =∑h
f4h
Spectral Learning for NLP 41
HMMs: the forward algorithm in matrix form
H1 H2 H3 H4
the dog saw him
• For each word x, define the matrix A_x ∈ R^{m×m} as

  [A_x]_{h′,h} = t(h′|h) o(x|h),   e.g., [A_the]_{h′,h} = t(h′|h) o(the|h)

• Define π as the vector with elements π(h), and 1 as the vector of all ones

• Then

  p(the dog saw him) = 1^T × A_him × A_saw × A_dog × A_the × π
Forward algorithm through matrix multiplication!
Spectral Learning for NLP 42
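In code, the matrix form is a one-liner once the A_x matrices are built. The toy parameters below are made up for illustration:

```python
import numpy as np

# made-up toy HMM; vocabulary 0=the, 1=dog, 2=saw, 3=him
pi = np.array([0.6, 0.4])
t = np.array([[0.7, 0.2], [0.3, 0.8]])    # t[h', h] = t(h' | h)
o = np.array([[0.4, 0.1], [0.2, 0.3],
              [0.1, 0.4], [0.3, 0.2]])    # o[x, h]  = o(x | h)

def A(x):
    # [A_x]_{h',h} = t(h' | h) o(x | h): broadcast o[x] across the columns of t
    return t * o[x, :]

# p(the dog saw him) = 1^T A_him A_saw A_dog A_the pi
p = np.ones(2) @ A(3) @ A(2) @ A(1) @ A(0) @ pi
print(p)
```

Note the right-to-left order: the first word's matrix is applied to π first.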
The Spectral Algorithm: definitions
H1 H2 H3 H4
the dog saw him
Define the following matrix P2,1 ∈ R^{n×n}:

  [P2,1]_{i,j} = P(X2 = i, X1 = j)

Easy to derive an estimate:

  [P̂2,1]_{i,j} = Count(X2 = i, X1 = j) / N
Spectral Learning for NLP 43
The Spectral Algorithm: definitions
H1 H2 H3 H4
the dog saw him
For each word x, define the following matrix P3,x,1 ∈ R^{n×n}:

  [P3,x,1]_{i,j} = P(X3 = i, X2 = x, X1 = j)

Easy to derive an estimate, e.g.:

  [P̂3,dog,1]_{i,j} = Count(X3 = i, X2 = dog, X1 = j) / N
Spectral Learning for NLP 44
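The two count-based estimates can be sketched as follows; the random `corpus` array of word ids is only a stand-in for the first three words of N real training sentences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 100_000                 # vocabulary size, number of sentences

# stand-in for real data: the first three words (x1, x2, x3) of each sentence
corpus = rng.integers(0, n, size=(N, 3))

P21_hat = np.zeros((n, n))        # P21_hat[i, j]     ~ P(X2 = i, X1 = j)
P3x1_hat = np.zeros((n, n, n))    # P3x1_hat[x, i, j] ~ P(X3 = i, X2 = x, X1 = j)
for x1, x2, x3 in corpus:
    P21_hat[x2, x1] += 1
    P3x1_hat[x2, x3, x1] += 1
P21_hat /= N
P3x1_hat /= N
```

Both estimates are joint distributions, so all their entries sum to one.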
Main Result Underlying the Spectral Algorithm

• Define the following matrix P2,1 ∈ R^{n×n}:

  [P2,1]_{i,j} = P(X2 = i, X1 = j)

• For each word x, define the following matrix P3,x,1 ∈ R^{n×n}:

  [P3,x,1]_{i,j} = P(X3 = i, X2 = x, X1 = j)

• SVD(P2,1) ⇒ U ∈ R^{n×m}, Σ ∈ R^{m×m}, V ∈ R^{n×m}

• Definition:

  B_x = (U^T × P3,x,1 × V) × Σ^{-1}   (both factors are m×m)

• Theorem: if P2,1 is of rank m, then

  B_x = G A_x G^{-1}

  where G ∈ R^{m×m} is invertible
Spectral Learning for NLP 45
Why does this matter?

• Theorem: if P2,1 is of rank m, then B_x = G A_x G^{-1}, where G ∈ R^{m×m} is invertible

• Recall p(the dog saw him) = 1^T A_him A_saw A_dog A_the π: the forward algorithm through matrix multiplication!

• Now note that

  B_him × B_saw × B_dog × B_the
  = G A_him G^{-1} × G A_saw G^{-1} × G A_dog G^{-1} × G A_the G^{-1}
  = G A_him × A_saw × A_dog × A_the G^{-1}

  The G's cancel!

• It follows that if we have b∞ = 1^T G^{-1} and b0 = G π, then

  b∞ × B_him × B_saw × B_dog × B_the × b0
  = 1^T × A_him × A_saw × A_dog × A_the × π
Spectral Learning for NLP 46
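The cancellation is easy to check numerically: conjugate arbitrary matrices by any invertible G and the inner G^{-1}G pairs vanish. Everything below is synthetic (random matrices standing in for the A_x; a random G, which is invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
As = [rng.random((m, m)) for _ in range(4)]   # stand-ins for A_the ... A_him
G = rng.random((m, m)) + np.eye(m)            # some invertible matrix
Ginv = np.linalg.inv(G)

Bs = [G @ A @ Ginv for A in As]               # B_x = G A_x G^{-1}

prodA = np.linalg.multi_dot(As)
prodB = np.linalg.multi_dot(Bs)
# inner G^{-1} G pairs cancel: prod B = G (prod A) G^{-1}
print(np.allclose(prodB, G @ prodA @ Ginv))   # True

# with b_inf = 1^T G^{-1} and b0 = G pi, the outer G's cancel too
pi = rng.random(m)
ones = np.ones(m)
lhs = (ones @ Ginv) @ prodB @ (G @ pi)
rhs = ones @ prodA @ pi
print(np.isclose(lhs, rhs))                   # True
```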
The Spectral Learning Algorithm

1. Derive estimates

   [P̂2,1]_{i,j} = Count(X2 = i, X1 = j) / N

   and, for all words x,

   [P̂3,x,1]_{i,j} = Count(X3 = i, X2 = x, X1 = j) / N

2. SVD(P̂2,1) ⇒ U ∈ R^{n×m}, Σ ∈ R^{m×m}, V ∈ R^{n×m}

3. For all words x, define B_x = (U^T × P̂3,x,1 × V) × Σ^{-1} (both factors m×m).
   (similar definitions for b0, b∞; details omitted)

4. For a new sentence x1 . . . xn, calculate its probability, e.g.,

   p(the dog saw him) = b∞ × B_him × B_saw × B_dog × B_the × b0
Spectral Learning for NLP 47
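The whole pipeline can be sketched end to end. To keep the check exact, the snippet uses population moments computed from a known random HMM rather than counts (with count-based estimates, P2,1 and P3,x,1 are approximate and so is the output). The closed forms b0 = U^T P1 and b∞ = P1^T V Σ^{-1}, with P1 the unigram distribution, are one standard choice in the style of Hsu et al. (2009); the slide omits these details:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                                   # hidden states, vocabulary size

# a random HMM (columns of T and O sum to 1)
pi = rng.dirichlet(np.ones(m))
T = rng.dirichlet(np.ones(m), size=m).T       # T[h2, h1] = t(h2 | h1)
O = rng.dirichlet(np.ones(n), size=m).T       # O[x, h]   = o(x | h)

def A(x):                                     # [A_x]_{h',h} = t(h' | h) o(x | h)
    return T * O[x, :]

# population moments (in practice: estimated from counts)
P1 = O @ pi                                   # P1[i]     = P(X1 = i)
P21 = O @ T @ np.diag(pi) @ O.T               # P21[i, j] = P(X2 = i, X1 = j)
def P3x1(x):                                  # [i, j]    = P(X3 = i, X2 = x, X1 = j)
    return O @ A(x) @ T @ np.diag(pi) @ O.T

# SVD step: keep the top m singular directions of P21
U, S, Vt = np.linalg.svd(P21)
U, S, V = U[:, :m], S[:m], Vt[:m].T

B = {x: U.T @ P3x1(x) @ V @ np.diag(1.0 / S) for x in range(n)}
b0 = U.T @ P1                                 # = G pi
binf = P1 @ V @ np.diag(1.0 / S)              # = 1^T G^{-1}

sentence = [0, 3, 1, 2]                       # word ids x1 x2 x3 x4
v = b0
for x in sentence:                            # apply B_{x1} first, B_{x4} last
    v = B[x] @ v
p_spectral = binf @ v

# reference: forward algorithm with the true parameters
f = pi
for x in sentence:
    f = A(x) @ f
p_forward = f.sum()
print(p_spectral, p_forward)
```

With exact moments the two numbers agree up to floating point, even though the algorithm never sees π, t, or o directly.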
Guarantees

• Throughout the algorithm we've used estimates P̂2,1 and P̂3,x,1 in place of P2,1 and P3,x,1

• If P̂2,1 = P2,1 and P̂3,x,1 = P3,x,1, then the method is exact. But we will always have estimation errors

• A PAC-style theorem: fix some length T. To have

  Σ_{x1...xT} |p̂(x1 . . . xT) − p(x1 . . . xT)| ≤ ε      (L1 distance between p̂ and p)

  with probability at least 1 − δ, the number of samples required is polynomial in

  n, m, 1/ε, 1/δ, 1/σ, T

  where σ is the m-th largest singular value of P2,1
Spectral Learning for NLP 48
Intuition behind the Theorem

• Define

  ||Â − A||_2 = √( Σ_{j,k} (Â_{j,k} − A_{j,k})² )

• With N samples, with probability at least 1 − δ,

  ||P̂2,1 − P2,1||_2 ≤ ε   and   ||P̂3,x,1 − P3,x,1||_2 ≤ ε

  where

  ε = √( (1/N) log(1/δ) ) + √( 1/N )

• Then we need to carefully bound how the error ε propagates through the SVD step, the various matrix multiplications, etc. The "rate" at which ε propagates depends on T, m, n, 1/σ
Spectral Learning for NLP 49
Summary
• The problem solved by EM: estimate HMM parameters π(h), t(h′|h), o(x|h) from observation sequences x1 . . . xn

• The spectral algorithm:

  • Calculate estimates P̂2,1 (bigram counts) and P̂3,x,1 (trigram counts)
  • Run an SVD on P̂2,1
  • Calculate parameter estimates using simple matrix operations

• Guarantee: we recover the parameters up to linear transforms that cancel
Spectral Learning for NLP 50
Outline
• Singular value decomposition
• Canonical correlation analysis
• Spectral learning of hidden Markov models
• Algorithm for latent-variable PCFGs
Spectral Learning for NLP 51
Problems with spectral HMM learning algorithm
Parameters are masked by an unknown linear transformation
I Negative marginals (due to sampling error)
I Parameters cannot be easily interpreted
I Cannot improve parameters using, for example, EM
Hsu et al. suggest a way to extract probabilities, but the method is unstable
Spectral Learning for NLP 52
This part of the talk
Like the spectral algorithm, the method presented here has theoretical guarantees
Estimates are actual probabilities
More efficient than EM
Can be used to initialize EM, which converges in an iteration or two
Spectral Learning for NLP 53
L-PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)
(S (NP (D the) (N dog)) (VP (V saw) (P him)))

  ⇒ (S_1 (NP_3 (D_1 the) (N_2 dog)) (VP_2 (V_4 saw) (P_1 him)))
Spectral Learning for NLP 54
The probability of a tree
(S_1 (NP_3 (D_1 the) (N_2 dog)) (VP_2 (V_4 saw) (P_1 him)))

p(tree, 1 3 1 2 2 4 1)
  = π(S_1)
  × t(S_1 → NP_3 VP_2 | S_1)
  × t(NP_3 → D_1 N_2 | NP_3)
  × t(VP_2 → V_4 P_1 | VP_2)
  × q(D_1 → the | D_1)
  × q(N_2 → dog | N_2)
  × q(V_4 → saw | V_4)
  × q(P_1 → him | P_1)

p(tree) = Σ_{h1...h7} p(tree, h1 h2 h3 h4 h5 h6 h7)
Spectral Learning for NLP 55
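Spelled out with made-up rule probabilities (the numbers are illustrative, not estimated from anything), the product for the fixed latent assignment 1 3 1 2 2 4 1 is:

```python
# made-up L-PCFG probabilities, for illustration only
pi = {"S_1": 0.5}
t = {("S_1", "NP_3", "VP_2"): 0.2,               # t(S_1 -> NP_3 VP_2 | S_1)
     ("NP_3", "D_1", "N_2"): 0.4,
     ("VP_2", "V_4", "P_1"): 0.3}
q = {("D_1", "the"): 0.6, ("N_2", "dog"): 0.1,   # q(D_1 -> the | D_1), ...
     ("V_4", "saw"): 0.2, ("P_1", "him"): 0.5}

p = (pi["S_1"]
     * t[("S_1", "NP_3", "VP_2")]
     * t[("NP_3", "D_1", "N_2")]
     * t[("VP_2", "V_4", "P_1")]
     * q[("D_1", "the")] * q[("N_2", "dog")]
     * q[("V_4", "saw")] * q[("P_1", "him")])
print(p)   # p(tree, 1 3 1 2 2 4 1) under these numbers
```

Summing this product over all m^7 latent assignments gives p(tree); in practice that sum is computed with the inside algorithm rather than by enumeration.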
Inside and Outside Trees

At the VP node of (S (NP (D the) (N dog)) (VP (V saw) (P him))):

Outside tree o = (S (NP (D the) (N dog)) VP*)

Inside tree t = (VP (V saw) (P him))

Conditionally independent given the label and the hidden state:

p(o, t | VP, h) = p(o | VP, h) × p(t | VP, h)
Spectral Learning for NLP 56
Designing Feature Functions

Design functions φ and ψ:

φ maps any inside tree to a binary vector of length d

ψ maps any outside tree to a binary vector of length d′

For the outside tree o and inside tree t of the example above, e.g.,

ψ(o) = [0, 1, 0, 0, . . . , 0, 0] ∈ R^{d′}

φ(t) = [1, 0, 0, 0, . . . , 0, 0] ∈ R^d

Think of φ and ψ as multinomials p(f) for f ∈ [d] and p(g) for g ∈ [d′].
Spectral Learning for NLP 57
Latent State Distributions
Think of f and g as representing a whole inside/outside tree
Say we had a way of getting:
• p(f | h, VP) for each latent state h and inside feature f

• p(g | h, VP) for each latent state h and outside feature g
Then we could run EM on a convex problem to find parameters. How?
Spectral Learning for NLP 58
Binary Rule Estimation

Take M samples of nodes with rule VP → V NP. At sample i:

• g^(i) = outside feature at VP
• f_2^(i) = inside feature at V
• f_3^(i) = inside feature at NP

t̂(h1, h2, h3 | VP → V NP)
  = arg max_t Σ_{i=1}^M log Σ_{h1,h2,h3} ( t(h1, h2, h3 | VP → V NP) × p(g^(i) | h1, VP) × p(f_2^(i) | h2, V) × p(f_3^(i) | h3, NP) )

Spectral Learning for NLP 59
Binary Rule Estimation
• Use Bayes rule to convert
t(h1, h2, h3|VP→ V NP)
to
t(VP→ V NP, h2, h3|VP, h1).
• The log-likelihood function is convex, and therefore EM converges to the global maximum
• Estimation of π and q is similar in flavor
Main question: how do we get the latent state distributions p(h|f,VP) and p(h|g,VP)?
Spectral Learning for NLP 60
Vector Representation of Inside and Outside Trees

Design functions Y and Z:

Y maps any inside feature value f ∈ [d] to a vector of length m.

Z maps any outside feature value g ∈ [d′] to a vector of length m.

Convention: m is the number of hidden states of the L-PCFG.

For the outside tree o and inside tree t of the example above, e.g.,

Z(g) = [1, 0.4, −5.3, . . . , 72] ∈ R^m

Y(f) = [−3, 17, 2, . . . , 3.5] ∈ R^m

Z and Y reduce the dimensionality of ψ and φ using CCA
Spectral Learning for NLP 61
Identifying Latent State Distributions
• For each f ∈ [d], define:

  v(f) = Σ_{g=1}^{d′} p(g | f, VP) Z(g) = E[Z(g) | f, VP]

• v(f) ∈ R^m is "the expected value of an outside tree (representation) given an inside tree (feature)"

• By conditional independence:

  v(f) = Σ_{h=1}^m p(h | f, VP) w(h)

  where w(h) ∈ R^m and

  w(h) = Σ_{g=1}^{d′} p(g | h, VP) Z(g) = E[Z(g) | h, VP].

• w(h) is "the expected value of an outside tree (representation) given a latent state"
Spectral Learning for NLP 62
Pivot Assumption

Reminder: v(f) = Σ_{h=1}^m p(h | f, VP) w(h)

• If we know w(h), we can find latent state distributions:

  • Given an inside tree (feature f) and a node such as VP, compute v(f)
  • Solve

    arg min_{p(h|f,VP)} || v(f) − Σ_{h=1}^m p(h | f, VP) w(h) ||²

Assumption: for each latent state h there is a "pivot feature value" f ∈ [d] such that p(h | f, VP) = 1.

Result of this: v(f) = w(h) for any pivot feature f
Spectral Learning for NLP 63
Identifying Latent State Distributions
• Suppose we have m pivot features {f1, . . . , fm} such that v(f_h) = w(h). Then, for all f ∈ [d],

  v(f) = Σ_{h=1}^m p(h | f, VP) v(f_h)

• Therefore, we can identify p(h | f, VP) for all f by solving:

  arg min_{p(h|f,VP)} || v(f) − Σ_{h=1}^m p(h | f, VP) v(f_h) ||²
Spectral Learning for NLP 64
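Given the pivot vectors, recovering p(h | f, VP) is a small least-squares problem per feature. The sketch below builds synthetic v(f) vectors from known mixing weights and recovers them; the simplex constraint on p(h | f, VP) is dropped, which is harmless here because the synthetic data is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 3, 10                              # latent states, inside features

W = rng.random((m, m)) + np.eye(m)        # column h is w(h) = v(f_h); invertible stand-in
P = rng.dirichlet(np.ones(m), size=d)     # P[f, h] = p(h | f), the target distributions
V = P @ W.T                               # row f is v(f) = sum_h p(h | f) w(h)

# least squares: for each f, solve W p = v(f) for the mixing weights
P_hat = np.linalg.lstsq(W, V.T, rcond=None)[0].T
print(np.allclose(P_hat, P))   # True on this exact synthetic data
```

With real (noisy) v(f) vectors, a constrained solver that keeps the weights on the probability simplex would be the safer choice.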
Identifying Pivot Features
• v(f) are observable quantities, can be calculated from data
• Arora et al. (2012) showed how to find the pivot features
• Basic idea: find the corners of the convex hull spanned by the d feature vectors
Spectral Learning for NLP 65
Identifying Latent State Distributions
Algorithm:

• Identify m pivot features f1, . . . , fm by finding the vertices of ConvexHull(v(1), . . . , v(d)) (Arora et al., 2012)

• Solve, for each f ∈ [d]:

  arg min_{p(h|f,VP)} || v(f) − Σ_{h=1}^m p(h | f, VP) v(f_h) ||²

Output:

• Latent state distributions p(h | f, VP) for any f ∈ [d]

Can analogously get:

• Latent state distributions p(h | g, VP) for any g ∈ [d′]

• We managed to extract latent state probabilities from observed data only!
Spectral Learning for NLP 66
Experiments - Language Modeling
• Saul and Pereira (1997):

  p(w2 | w1) = Σ_h p(w2 | h) p(h | w1)

  (graphical model: w1 → h → w2)

This model is a special case of an L-PCFG

• Experimented with bigram modeling on two corpora: the Brown corpus and the Gigaword corpus
Spectral Learning for NLP 67
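The aggregate bigram model is a rank-m factorization of the bigram matrix. A synthetic sketch (random p(w2 | h) and p(h | w1) tables; nothing here is from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 6, 2                                   # vocabulary size, hidden states

E = rng.dirichlet(np.ones(V), size=m).T       # E[w, h] = p(w2 = w | h); columns sum to 1
H = rng.dirichlet(np.ones(m), size=V)         # H[w, h] = p(h | w1 = w); rows sum to 1

bigram = E @ H.T                              # bigram[w2, w1] = sum_h p(w2|h) p(h|w1)
print(bigram.sum(axis=0))                     # each column is a proper p(. | w1)
```

The resulting V×V conditional matrix has rank at most m, which is exactly the low-rank structure the spectral methods exploit.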
Results: perplexity
                           Brown                        NYT
m                     128     256    test        128     256    test
bigram Kneser-Ney     408     415                271     279
trigram Kneser-Ney    386     394                150     158
EM                    388     365     364        284     265     267
  (EM iterations)       9       8                  35      32
pivot                 426     597     560        782     886     715
Spectral Learning for NLP 68
Results: perplexity
                           Brown                        NYT
m                     128     256    test        128     256    test
bigram Kneser-Ney     408     415                271     279
trigram Kneser-Ney    386     394                150     158
EM                    388     365     364        284     265     267
  (EM iterations)       9       8                  35      32
pivot                 426     597     560        782     886     715
pivot+EM              310     327     357        279     292     281
  (EM iterations)       1       1                  19      12

• Initialize EM with the pivot algorithm's output

• EM converges in far fewer iterations

• Still consistent, a scheme called "two-step estimation" (Lehmann and Casella, 1998)
Spectral Learning for NLP 69
Results with EM (section 22 of Penn treebank)
Performance with expectation-maximization (m = 32): 88.56%
Vanilla binarized PCFG maximum likelihood estimation performance: 68.62%
Performance with spectral algorithm (Cohen et al., 2013): 88.82%
Spectral Learning for NLP 70
Inside features used

Consider the VP node in the following tree:

(S (NP (D the) (N cat)) (VP (V saw) (NP (D the) (N dog))))

The inside features consist of:
I The pairs (VP, V) and (VP, NP)
I The rule VP → V NP
I The tree fragment (VP (V saw) NP)
I The tree fragment (VP V (NP D N))
I The pair of head part-of-speech tag with VP: (VP, V)
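As an illustrative sketch (the tree encoding and helper names are our own, not the tutorial's implementation), the first four feature types above can be read off a node encoded as nested tuples; the head part-of-speech feature is omitted since it requires head-finding rules:

```python
# Trees as nested tuples: (label, child, child, ...); leaves are plain strings.

def label(t):
    """Nonterminal label of a subtree, or the word itself for a leaf."""
    return t if isinstance(t, str) else t[0]

def expand(t):
    """One-level expansion of a subtree, e.g. (V, saw) or (NP, D, N)."""
    return t if isinstance(t, str) else (t[0],) + tuple(label(c) for c in t[1:])

def inside_features(node):
    kids = node[1:]
    # Pairs of the node's label with each child label: (VP, V), (VP, NP).
    feats = [("pair", node[0], label(k)) for k in kids]
    # The CFG rule at the node: VP -> V NP.
    feats.append(("rule", node[0]) + tuple(label(k) for k in kids))
    # Tree fragments with one child expanded a level: (VP (V saw) NP), (VP V (NP D N)).
    for i in range(len(kids)):
        frag = tuple(expand(c) if j == i else label(c) for j, c in enumerate(kids))
        feats.append(("fragment", node[0]) + frag)
    return feats

vp = ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "dog")))
print(inside_features(vp))
```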
Spectral Learning for NLP 71
Outside features used

Consider the D node above "the" in "the dog" in the following tree:

(S (NP (D the) (N cat)) (VP (V saw) (NP (D the) (N dog))))

The outside features consist of:

I The fragments (NP D∗ N), (VP V (NP D∗ N)), and (S NP (VP V (NP D∗ N)))
I The pair (D, NP) and triplet (D, NP, VP)
I The pair of head part-of-speech tag with D: (D, N)
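A parallel sketch for the outside features (again with our own encoding, not the authors' extractor): star the target node, then wrap it in successively larger ancestor contexts along the path to the root. The head part-of-speech feature is again omitted:

```python
# Trees as nested tuples: (label, child, child, ...); leaves are plain strings.
# The target node is addressed by a path of child indices from the root.

def label(t):
    return t if isinstance(t, str) else t[0]

def outside_features(root, path):
    spine = [root]
    for i in path:                       # walk down to the target node
        spine.append(spine[-1][1 + i])
    target = label(spine[-1])
    feats = []
    if len(spine) >= 2:                  # pair with the parent label: (D, NP)
        feats.append(("pair", target, spine[-2][0]))
    if len(spine) >= 3:                  # triplet with the grandparent: (D, NP, VP)
        feats.append(("triplet", target, spine[-2][0], spine[-3][0]))
    # Fragments: (NP D* N), then (VP V (NP D* N)), then (S NP (VP V (NP D* N))).
    frag = target + "*"
    for anc, idx in zip(spine[-2::-1], path[::-1]):
        frag = (anc[0],) + tuple(frag if j == idx else label(c)
                                 for j, c in enumerate(anc[1:]))
        feats.append(("fragment", frag))
    return feats

tree = ("S",
        ("NP", ("D", "the"), ("N", "cat")),
        ("VP", ("V", "saw"),
               ("NP", ("D", "the"), ("N", "dog"))))
# Path to the D above "the dog": S -> child 1 (VP) -> child 1 (NP) -> child 0 (D).
print(outside_features(tree, (1, 1, 0)))
```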
Spectral Learning for NLP 72
Results
|                               | sec. 22, m=8   | sec. 22, m=16  | sec. 22, m=24  | sec. 22, m=32  | sec. 23 |
|-------------------------------|----------------|----------------|----------------|----------------|---------|
| EM                            | 86.69 (40 it.) | 88.32 (30 it.) | 88.35 (30 it.) | 88.56 (20 it.) | 87.76   |
| Spectral (Cohen et al., 2013) | 85.60          | 87.77          | 88.53          | 88.82          | 88.05   |
| Pivot                         | 83.56          | 86.00          | 86.87          | 86.40          | 85.83   |
| Pivot+EM                      | 86.83 (2 it.)  | 88.14 (6 it.)  | 88.64 (2 it.)  | 88.55 (2 it.)  | 88.03   |

(Iteration counts for the EM rows are shown in parentheses.)
Again, EM converges in very few iterations
Spectral Learning for NLP 73
Conclusion
Formal guarantees:
I Statistical consistency
I No problem of local maxima
Advantages over traditional spectral methods:
I No negative probabilities
I More intuitive to understand
Spectral Learning for NLP 74
Things we did not talk about
These algorithms can be kernelized (e.g., Song et al., 2010)
Many other algorithms similar in flavor (see reading list)
I Rely on some decomposition of observable quantities to get a handle on the parameters
Spectral Learning for NLP 75
References I
[1] A. Anandkumar, D. Foster, D. Hsu, S. M. Kakade, and Y. Liu. A spectral algorithm for latent Dirichlet allocation. arXiv:1204.6703, 2012.

[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent-variable models. arXiv:1210.7559, 2012.

[3] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. arXiv:1212.4777, 2012.

[4] R. Bailly, A. Habrard, and F. Denis. A spectral approach for probabilistic grammatical inference on trees. In Proceedings of ALT, 2010.
Spectral Learning for NLP 76
References II
[5] B. Balle and M. Mohri. Spectral learning of general weighted automata via constrained matrix completion. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2168–2176. 2012.

[6] B. Balle, A. Quattoni, and X. Carreras. A spectral learning algorithm for finite state transducers. In Proceedings of ECML, 2011.

[7] B. Boots and G. J. Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of AAAI, 2011.

[8] S. B. Cohen and M. Collins. A provably correct learning algorithm for latent-variable PCFGs. In Proceedings of ACL, 2014.
Spectral Learning for NLP 77
References III

[9] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Experiments with spectral learning of latent-variable PCFGs. In Proceedings of NAACL, 2013.

[10] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research, 2014.

[11] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

[12] P. Dhillon, D. P. Foster, and L. H. Ungar. Multi-view learning of word embeddings via CCA. In Proceedings of NIPS, 2011.

[13] P. Dhillon, J. Rodu, M. Collins, D. P. Foster, and L. H. Ungar. Spectral dependency parsing with latent variables. In Proceedings of EMNLP, 2012.
Spectral Learning for NLP 78
References IV
[14] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.

[15] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.

[16] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proceedings of COLT, 2009.

[17] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.

[18] T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse Processes, 25:259–284, 1998.
Spectral Learning for NLP 79
References V
[19] P. Liang, D. J. Hsu, and S. M. Kakade. Identifiability and unmixing of latent parse trees. In Proceedings of NIPS, pages 1511–1519, 2012.

[20] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning for non-deterministic dependency parsing. In Proceedings of EACL, 2012.

[21] T. Matsuzaki, Y. Miyao, and J. Tsujii. Probabilistic CFG with latent annotations. In Proceedings of ACL, 2005.

[22] A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In Proceedings of ICML, 2011.
Spectral Learning for NLP 80
References VI
[23] A. P. Parikh, S. B. Cohen, and E. Xing. Spectral unsupervised parsing with additive tree metrics. In Proceedings of ACL, 2014.

[24] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, 2006.

[25] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of EMNLP, pages 81–89, 1997.

[26] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of ICML, pages 991–998, 2010.
Spectral Learning for NLP 81
References VII
[27] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical Report No. 2009-05, 2009.

[28] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. Journal of Computer and System Sciences, 68(4):841–860, 2004.
Spectral Learning for NLP 82