Transcript of the Graph-based SSL ICASSP 2013 tutorial slides
-
A Tutorial on
Graph-based Semi-Supervised Learning Algorithms for Speech and Spoken Language Processing
Partha Pratim Talukdar (Carnegie Mellon University)
Amarnag Subramanya (Google Research)
http://graph-ssl.wikidot.com/
ICASSP 2013, Vancouver
-
Supervised Learning
Labeled Data → Learning Algorithm → Model
Examples: Decision Trees, Support Vector Machine (SVM), Maximum Entropy (MaxEnt)
-
Semi-Supervised Learning (SSL)
Labeled Data + A Lot of Unlabeled Data → Learning Algorithm → Model
Examples: Self-Training, Co-Training
-
Why SSL?
How can unlabeled data be helpful?
[Figure, example from [Belkin et al., JMLR 2006]: without unlabeled data, the decision boundary is placed using the labeled instances alone; adding unlabeled instances yields a more accurate decision boundary.]
-
Inductive vs Transductive
- Supervised (labeled data only), inductive (generalizes to unseen data): SVM, Maximum Entropy
- Semi-supervised (labeled + unlabeled), inductive: Manifold Regularization
- Semi-supervised, transductive (doesn't generalize to unseen data): Label Propagation, MAD, MP, TACO, ... — the focus of this tutorial
Most graph SSL algorithms are non-parametric (i.e., the number of parameters grows with data size).
See Chapter 25 of the SSL Book: http://olivier.chapelle.cc/ssl-book/discussion.pdf
-
Why Graph-based SSL?
- Some datasets are naturally represented by a graph: web, citation network, social network, ...
- Uniform representation for heterogeneous data
- Easily parallelizable, scalable to large data
- Effective in practice
[Bar chart comparing Graph SSL, supervised, and non-graph SSL methods on text classification.]
-
Graph-based SSL
[Figure: nodes (some labeled, e.g., /m/ and /f/) connected by edges weighted by similarity, e.g., 0.6, 0.2, 0.8.]
-
Graph-based SSL
Smoothness assumption: if two instances are similar according to the graph, then their output labels should also be similar.
Two stages:
- Graph construction (if a graph is not already present)
- Label inference
-
Outline
- Motivation
- Graph Construction
- Inference Methods
- Scalability
- Applications
- Conclusion & Future Work
-
Graph Construction
- Neighborhood methods: k-NN graph construction (k-NNG), ε-neighborhood method
- Metric learning
- Other approaches
-
Neighborhood Methods
- k-nearest neighbor graph (k-NNG): add edges between an instance and its k nearest neighbors (e.g., k = 3)
- ε-neighborhood: add edges to all instances inside a ball of radius ε
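Both construction rules can be sketched in a few lines of numpy. This is a minimal illustration with unit edge weights and brute-force distances; practical systems typically use similarity-based weights and approximate nearest-neighbor search for scalability.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetrized k-NN graph: connect each point to its k nearest
    neighbors (Euclidean distance); unit edge weights for simplicity."""
    n = len(X)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-edges
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:   # k nearest neighbors of i
            W[i, j] = W[j, i] = 1.0       # symmetrize (raw k-NN is asymmetric)
    return W

def eps_graph(X, eps):
    """ε-neighborhood graph: connect all pairs within distance ε."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = (d2 <= eps ** 2).astype(float)
    np.fill_diagonal(W, 0.0)              # no self-edges
    return W
```

Note the symmetrization step in `knn_graph`: it papers over the asymmetry issue discussed on the next slide by keeping an edge whenever either endpoint selects the other.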
-
Issues with k-NNG
- Not scalable (quadratic in the number of instances)
- Results in an asymmetric graph: b can be the closest neighbor of a, but not the other way around
- Results in irregular graphs: some nodes may end up with higher degree than other nodes (e.g., a node can have degree 4 in a k-NNG with k = 1)
-
Issues with ε-Neighborhood
- Not scalable
- Sensitive to the value of ε: not invariant to scaling
- Fragmented graph: disconnected components (figure from [Jebara et al., ICML 2009])
-
Graph Construction using Metric Learning

    w_ij ∝ exp(−D_A(x_i, x_j)),  where  D_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j)

is a Mahalanobis distance whose matrix A is estimated using metric learning algorithms:
- Supervised metric learning: ITML [Kulis et al., ICML 2007], LMNN [Weinberger and Saul, JMLR 2009]
- Semi-supervised metric learning: IDML [Dhillon et al., UPenn TR 2010]
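The edge-weight formula can be sketched directly. In this minimal illustration, A is any positive semi-definite matrix (e.g., one produced by ITML/LMNN/IDML); A = I recovers ordinary squared-Euclidean (Gaussian-kernel-style) weights.

```python
import numpy as np

def mahalanobis_edge_weights(X, A):
    """w_ij = exp(-D_A(x_i, x_j)), with
    D_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j).
    A is assumed positive semi-definite (e.g., learned by a
    Mahalanobis metric learning algorithm)."""
    diff = X[:, None, :] - X[None, :, :]            # (n, n, d) pairwise differences
    D = np.einsum('ijd,de,ije->ij', diff, A, diff)  # D_A for every pair
    W = np.exp(-D)
    np.fill_diagonal(W, 0.0)                        # no self-edges
    return W
```

Stretching or shrinking directions of the feature space via A directly reshapes the graph: pairs that the learned metric considers close get heavier edges.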
-
Benefits of Metric Learning for Graph Construction
[Bar chart from [Dhillon et al., UPenn TR 2010]: error on the Amazon, Newsgroups, Reuters, Enron A, and Text datasets for graphs built from the original features, RP, PCA, ITML (supervised metric learning), and IDML (semi-supervised metric learning); 100 seed and 1400 test instances, all inference using LP.]
Careful graph construction is critical!
-
Other Graph Construction Approaches
- Local reconstruction: Linear Neighborhood [Wang and Zhang, ICML 2005]
- Regular graph: b-matching [Jebara et al., ICML 2008]
- Fitting graph to vector data [Daitch et al., ICML 2009]
- Graph kernels [Zhu et al., NIPS 2005]
-
Outline
- Motivation
- Graph Construction
- Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation
- Scalability
- Applications
- Conclusion & Future Work
-
Graph Laplacian
The (un-normalized) Laplacian of a graph: L = D − W, where D_ii = Σ_j W_ij and D_ij = 0 for i ≠ j.
Example: for the graph on nodes a, b, c, d with edge weights w(a,b) = 1, w(a,c) = 2, w(b,c) = 3, w(c,d) = 1:

          a   b   c   d
      a [ 3  -1  -2   0 ]
      b [-1   4  -3   0 ]
      c [-2  -3   6  -1 ]
      d [ 0   0  -1   1 ]
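The Laplacian of this slide's example graph can be checked in a few lines of numpy (edge weights w(a,b)=1, w(a,c)=2, w(b,c)=3, w(c,d)=1, read off the slide's matrix):

```python
import numpy as np

# Weighted graph on nodes a, b, c, d from the slide
W = np.array([[0., 1., 2., 0.],
              [1., 0., 3., 0.],
              [2., 3., 0., 1.],
              [0., 0., 1., 0.]])

D = np.diag(W.sum(axis=1))   # D_ii = sum_j W_ij
L = D - W                    # un-normalized graph Laplacian

# L matches the matrix on the slide:
# [[ 3 -1 -2  0]
#  [-1  4 -3  0]
#  [-2 -3  6 -1]
#  [ 0  0 -1  1]]
```

As the next slide notes, L is positive semi-definite for non-negative weights, which `np.linalg.eigvalsh(L)` confirms (all eigenvalues ≥ 0 up to rounding).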
-
Graph Laplacian (contd.)
- L is positive semi-definite (assuming non-negative weights)
- Smoothness of a prediction f over the graph in terms of the Laplacian:

      f^T L f = Σ_{i,j} W_ij (f_i − f_j)^2

  a measure of non-smoothness, where f is the vector of scores for a single label on the nodes.
Example (same graph as before):
- f^T = [1 10 5 25] gives f^T L f = 588 (not smooth)
- f^T = [1 1 1 3] gives f^T L f = 4 (smooth)
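Both smoothness values on this slide can be verified numerically on the example graph:

```python
import numpy as np

# Same example graph as on the previous slide (nodes a, b, c, d)
W = np.array([[0., 1., 2., 0.],
              [1., 0., 3., 0.],
              [2., 3., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

def non_smoothness(f, L):
    """f^T L f = sum over edges of W_ij (f_i - f_j)^2."""
    return f @ L @ f

print(non_smoothness(np.array([1., 10., 5., 25.]), L))  # 588.0 (not smooth)
print(non_smoothness(np.array([1., 1., 1., 3.]), L))    # 4.0   (smooth)
```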
-
Relationship between Eigenvalues of the Laplacian and Smoothness
For an eigenvector g of L with eigenvalue λ:

    L g = λ g
    g^T L g = λ g^T g = λ   (g^T g = 1, as eigenvectors are orthonormal)

Since g^T L g is the measure of non-smoothness from the previous slide: if an eigenvector is used to classify nodes, then the corresponding eigenvalue gives its measure of non-smoothness.
-
Spectrum of the Graph Laplacian
[Figure from [Zhu et al., 2005]: the number of connected components equals the number of 0 eigenvalues, and the corresponding eigenvectors are constant within each component; higher eigenvalues have more irregular eigenvectors, i.e., less smoothness.]
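The components-vs-zero-eigenvalues fact is easy to check on a small example with two disconnected components:

```python
import numpy as np

# Block-diagonal weight matrix: two disconnected components {0,1} and {2,3}
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 2.],
              [0., 0., 2., 0.]])
L = np.diag(W.sum(axis=1)) - W

eigvals = np.linalg.eigvalsh(L)                  # L is symmetric PSD
n_components = int(np.sum(np.isclose(eigvals, 0.0)))
print(n_components)  # 2: one zero eigenvalue per connected component
```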
-
Notation
- Ŷ_v,l : score of estimated label l on node v (estimated scores)
- Y_v,l : score of seed label l on node v (seed scores)
- R_v,l : regularization target for label l on node v (label priors)
- S : seed node indicator (diagonal matrix)
- W_uv : weight of edge (u, v) in the graph
-
LP-ZGL [Zhu et al., ICML 2003]

    argmin_Ŷ  Σ_{l=1}^{m} Σ_{u,v} W_uv (Ŷ_ul − Ŷ_vl)^2  =  Σ_{l=1}^{m} Ŷ_l^T L Ŷ_l
    such that  Ŷ_ul = Y_ul  for all u with S_uu = 1

(L is the graph Laplacian; the objective enforces smoothness, and the constraint matches the seeds exactly — a hard constraint.)
- Smoothness: two nodes connected by an edge with high weight should be assigned similar labels
- The solution satisfies the harmonic property
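With hard seed constraints, the LP-ZGL solution for one label reduces to a linear system over the unlabeled nodes; the harmonic property means each unlabeled node's score is the weighted average of its neighbors' scores. A sketch on the 4-node example graph used earlier, with hypothetical seeds Y_a = 1 and Y_d = 0:

```python
import numpy as np

W = np.array([[0., 1., 2., 0.],   # example graph: nodes a, b, c, d
              [1., 0., 3., 0.],
              [2., 3., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

labeled, unlabeled = [0, 3], [1, 2]   # seeds on a and d (hypothetical choice)
y_l = np.array([1., 0.])              # seed scores: Y_a = 1, Y_d = 0

# Harmonic solution: solve L_uu f_u = W_ul y_l for the unlabeled scores
L_uu = L[np.ix_(unlabeled, unlabeled)]
W_ul = W[np.ix_(unlabeled, labeled)]
f_u = np.linalg.solve(L_uu, W_ul @ y_l)
print(f_u)  # b and c interpolate smoothly between the two seeds
```

Each entry of `f_u` lies between the seed values, and each equals the weighted average of its neighbors' scores, which is exactly the harmonic property.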
-
Two Related Views
[Figure: two equivalent pictures of inference on a graph with labeled (seeded) and unlabeled nodes — label diffusion outward from a labeled node, and a random walk from an unlabeled node to a labeled one.]
-
Random Walk View
Starting at node U, upon reaching a node V, what next?
- Continue the walk with probability p^cont_v
- Assign V's seed label to U with probability p^inj_v
- Abandon the random walk with probability p^abnd_v, and assign U a dummy label
-
Discounting Nodes
Certain nodes can be unreliable (e.g., high-degree nodes): do not allow propagation/walks through them.
Solution: increase the abandon probability on such nodes:

    p^abnd_v ∝ degree(v)
-
Redefining Matrices
- W'_uv = p^cont_u × W_uv  (new edge weight)
- S_uu = p^inj_u
- R_u,dummy = p^abnd_u, and 0 for non-dummy labels
-
Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]

    argmin_Ŷ  Σ_{l=1}^{m+1} [ ||S Ŷ_l − S Y_l||^2  +  μ1 Σ_{u,v} M_uv (Ŷ_ul − Ŷ_vl)^2  +  μ2 ||Ŷ_l − R_l||^2 ]

i.e., match seeds (soft) + smooth + match priors (regularizer).
- m labels, +1 dummy (none-of-the-above) label
- M = W' + W'^T is the symmetrized weight matrix
- Ŷ_vl: weight of label l on node v; Y_vl: seed weight for label l on node v
- S: diagonal matrix, nonzero for seed nodes
- R_vl: regularization target for label l on node v
MAD has extra regularization compared to LP-ZGL [Zhu et al., ICML 2003]; similar to QC [Bengio et al., 2006].
MAD's objective is convex.
-
Solving the MAD Objective
- Can be solved using matrix inversion (as in LP), but matrix inversion is expensive
- Instead, solve exactly as a system of linear equations (Ax = b) using Jacobi iterations, which yields iterative updates with guaranteed convergence; see [Bengio et al., 2006] and [Talukdar and Crammer, ECML 2009] for details
-
Solving MAD using Iterative Updates
Inputs: Y, R: |V| × (|L|+1); W: |V| × |V|; S: |V| × |V| diagonal

    Ŷ ← Y;  M = W' + W'^T
    Z_v ← S_vv + μ1 Σ_{u≠v} M_vu + μ2,  for all v ∈ V
    repeat
        for all v ∈ V:  Ŷ_v ← (1/Z_v) [ (S Y)_v + μ1 M_v Ŷ + μ2 R_v ]
    until convergence

- The importance of a node can be discounted
- Easily parallelizable: scalable (more later)
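The update above can be sketched in numpy. This is a simplified version: it assumes M has a zero diagonal and omits the p^cont/p^inj/p^abnd node discounting, so it illustrates the Jacobi iteration rather than full MAD.

```python
import numpy as np

def mad(Y, R, S, M, mu1=1.0, mu2=1e-2, n_iter=100):
    """Jacobi-style iterative updates for the MAD objective (sketch).
    Y, R: |V| x |L| seed scores / priors; S: diagonal seed indicator
    as a 1-D array; M: symmetrized weight matrix (zero diagonal)."""
    Yhat = Y.astype(float).copy()
    # per-node normalizer: Z_v = S_vv + mu1 * sum_{u != v} M_vu + mu2
    Z = S + mu1 * M.sum(axis=1) + mu2
    for _ in range(n_iter):
        # Yhat_v <- (1/Z_v) [ S_vv Y_v + mu1 * (M Yhat)_v + mu2 * R_v ]
        Yhat = (S[:, None] * Y + mu1 * M @ Yhat + mu2 * R) / Z[:, None]
    return Yhat

# Example (hypothetical): chain a - b - c with unit weights, node a seeded
M = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Y = np.array([[1., 0.], [0., 0.], [0., 0.]])
R = np.zeros_like(Y)
S = np.array([1., 0., 0.])
Yhat = mad(Y, R, S, M)
# the seeded label's score decays with distance from the seed
```

Because Z_v strictly exceeds μ1 Σ_u M_vu whenever S_vv or μ2 is positive, the update is a contraction, which is why the Jacobi iteration converges.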
-
When is MAD most effective?
[Plot: relative increase in MRR of MAD over LP-ZGL (0 to 0.4) against average degree (0 to 30).]
MAD is particularly effective in denser graphs, where there is greater need for regularization.
-
Transduction Algorithm with Confidence (TACO) [Orbach and Crammer, ECML 2012]
- Main lesson from MAD: discount bad (high-degree) nodes
- TACO generalizes the notion of bad, adding a per-node, per-class confidence
[Figure: each node carries label scores and a confidence value; an unreliable node has low confidence.]
-
Introducing Confidence
[Figure: a node whose neighbors' scores are 0.49, 0.51, 0.3, 0.9, 0.3 is assigned score 0.35.]
- Neighborhood disagreement ⇒ low confidence
- Lower the effect of poorly estimated (low-confidence) scores
-
TACO Objective
[Objective terms not captured in the transcript]
- Convex objective
- Iterative solution
-
TACO: Iterative Algorithm
[Algorithm details not captured in the transcript]
-
TACO: Score Update
[Figure: a node's score (0.35) is computed from neighbor scores 0.3, 0.3, 0.9.]
- Average the neighbors
- Consider confidence
-
TACO: Confidence Update
[Figure: the same node, with neighbor scores 0.3, 0.3, 0.9 and score 0.35.]
Confidence is monotonic in the neighbors' disagreement.
-
Manifold Regularization [Belkin et al., JMLR 2006]

    f* = argmin_f  (1/l) Σ_{i=1}^{l} V(y_i, f(x_i))  +  f^T L f  +  ||f||^2_K

- V: loss function (e.g., soft margin) — training-data loss
- f^T L f: smoothness regularizer, where L is the Laplacian of a graph over the labeled and unlabeled data
- ||f||^2_K: regularizer (e.g., L2)
Trains an inductive classifier that can generalize to unseen instances.
-
Measure Propagation (MP) [Subramanya and Bilmes, EMNLP 2008, NIPS 2009, JMLR 2011]

    C_KL = argmin_{p_i}  Σ_{i=1}^{l} D_KL(r_i || p_i)  +  μ Σ_{i,j} w_ij D_KL(p_i || p_j)  −  ν Σ_{i=1}^{n} H(p_i)
    s.t.  Σ_y p_i(y) = 1,  p_i(y) ≥ 0  ∀ y, i

- r_i, p_i: seed and estimated label distributions (normalized) on node i
- First term: divergence on seed nodes; second term: smoothness (divergence across edges); third term: entropic regularizer
- KL divergence: D_KL(p_i || p_j) = Σ_y p_i(y) log(p_i(y) / p_j(y));  entropy: H(p_i) = −Σ_y p_i(y) log p_i(y)
- C_KL is convex (with non-negative edge weights and hyper-parameters)
- MP is related to Information Regularization [Corduneanu and Jaakkola, 2003]
-
Solving the MP Objective
For ease of optimization, reformulate the MP objective:

    C_MP = argmin_{p_i, q_i}  Σ_{i=1}^{l} D_KL(r_i || q_i)  +  μ Σ_{i,j} w'_ij D_KL(p_i || q_j)  −  ν Σ_{i=1}^{n} H(p_i)

where q_i is a new probability measure, one for each vertex, similar to p_i, and w'_ij = w_ij + α δ(i, j); the δ(i, j) term encourages agreement between p_i and q_i.
- C_MP is also convex (with non-negative edge weights and hyper-parameters)
- C_MP can be solved using Alternating Minimization (AM)
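The AM scheme can be sketched by minimizing C_MP over q with p fixed, then over p with q fixed; the closed forms below follow from setting derivatives (with Lagrange multipliers for the simplex constraint) to zero. The hyper-parameter names mu, nu, alpha are assumptions matching the roles of the coefficients in the objective; this is an illustration, not the reference implementation.

```python
import numpy as np

def measure_propagation(r, seed_mask, W, mu=1.0, nu=0.01, alpha=1.0, n_iter=50):
    """Alternating-minimization sketch for the reformulated MP objective.
    r: n x m seed label distributions (rows for unlabeled nodes are ignored);
    seed_mask: length-n 0/1 array marking the l seed nodes;
    W: symmetric non-negative weight matrix."""
    n, m = r.shape
    Wp = W + alpha * np.eye(n)            # w'_ij = w_ij + alpha * delta(i, j)
    q = np.full((n, m), 1.0 / m)          # initialize q uniformly
    p = q.copy()
    for _ in range(n_iter):
        # p-step: p_i(y) prop. to exp( mu * sum_j w'_ij log q_j(y) / (nu + mu * sum_j w'_ij) )
        gamma = nu + mu * Wp.sum(axis=1)
        logp = mu * (Wp @ np.log(q)) / gamma[:, None]
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # q-step: q_j(y) prop. to seed_mask_j * r_j(y) + mu * sum_i w'_ij p_i(y)
        q = seed_mask[:, None] * r + mu * (Wp.T @ p)
        q /= q.sum(axis=1, keepdims=True)
    return p

# Example (hypothetical): 3-node chain, node 0 seeded with label 0
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
r = np.array([[1., 0.],
              [.5, .5],
              [.5, .5]])
p = measure_propagation(r, np.array([1., 0., 0.]), W)
# the seed's label dominates at every node after a few iterations
```

The self-loop weight alpha keeps every q_j strictly positive, so the log in the p-step is always defined.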
-
Alternating Minimization
[Figure: alternating projections between the two sets — Q0 → P1 → Q1 → P2 → Q2 → P3 → ...]
C_MP satisfies the necessary conditions for AM to converge [Subramanya and Bilmes, JMLR 2011]
-
Why AM?
[Slide content not captured in the transcript]
-
Performance of SSL Algorithms
[Comparison of accuracies for different numbers of labeled samples across the COIL (6 classes) and OPT (10 classes) datasets.]
Graph SSL can be effective when the data satisfies the manifold assumption. More results and discussion in Chapter 21 of the SSL Book (Chapelle et al.).
-
Outline
- Motivation
- Graph Construction
- Inference Methods: Label Propagation, Modified Adsorption, Manifold Regularization, Spectral Graph Transduction, Measure Propagation, Sparse Label Propagation
- Scalability
- Applications
- Conclusion & Future Work
-
Background: Factor Graphs [Kschischang et al., 2001]
A factor graph is a bipartite graph over:
- variable nodes V (e.g., the label distribution on a node)
- factor nodes F: fitness functions over assignments of the variables
The distribution over all variables' values is proportional to the product of the factors, each applied to the variables connected to it.
-
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
min Edge Smoothness LossRegularization
Loss + +Seed Matching Loss (if any)
3-term Graph SSL Objective (common to many algorithms)
53
-
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
min Edge Smoothness LossRegularization
Loss + +Seed Matching Loss (if any)
3-term Graph SSL Objective (common to many algorithms)
w1,2 ||q1 − q2||²  (edge between q1 and q2)
53
-
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
min Edge Smoothness LossRegularization
Loss + +Seed Matching Loss (if any)
3-term Graph SSL Objective (common to many algorithms)
w1,2 ||q1 − q2||²  (edge between q1 and q2)
ψ(q1, q2)  (factor node between q1 and q2)
53
-
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
min Edge Smoothness LossRegularization
Loss + +Seed Matching Loss (if any)
3-term Graph SSL Objective (common to many algorithms)
w1,2 ||q1 − q2||²  (edge between q1 and q2)
ψ(q1, q2)  (factor node between q1 and q2)
ψ(q1, q2) = exp(−w1,2 ||q1 − q2||²)  Smoothness Factor
53
-
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
min Edge Smoothness LossRegularization
Loss + +Seed Matching Loss (if any)
3-term Graph SSL Objective (common to many algorithms)
w1,2 ||q1 − q2||²  (edge between q1 and q2)
ψ(q1, q2)  (factor node between q1 and q2)
ψ(q1, q2) = exp(−w1,2 ||q1 − q2||²)  Smoothness Factor
Seed MatchingFactor (unary)
53
-
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
min Edge Smoothness LossRegularization
Loss + +Seed Matching Loss (if any)
3-term Graph SSL Objective (common to many algorithms)
w1,2 ||q1 − q2||²  (edge between q1 and q2)
ψ(q1, q2)  (factor node between q1 and q2)
Regularization factor (unary)
ψ(q1, q2) = exp(−w1,2 ||q1 − q2||²)  Smoothness Factor
Seed MatchingFactor (unary)
53
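The 3-term objective can be made concrete with a small sketch. Squared-error losses are used here purely for illustration (individual algorithms substitute their own divergences, e.g. KL for Measure Propagation), and all names are hypothetical:

```python
def graph_ssl_objective(W, Q, seeds, U, mu=1.0, nu=0.01):
    """Evaluate: seed-matching loss + mu * edge-smoothness loss
    + nu * regularization toward a prior distribution U (e.g. uniform).
    W: dict (i, j) -> edge weight; Q: dict node -> label distribution
    (dict label -> score); seeds: dict node -> seed distribution."""
    seed_loss = sum((Q[i].get(y, 0.0) - r[y]) ** 2
                    for i, r in seeds.items() for y in r)
    smooth_loss = sum(w * (Q[i].get(y, 0.0) - Q[j].get(y, 0.0)) ** 2
                      for (i, j), w in W.items() for y in U)
    reg_loss = sum((Q[i].get(y, 0.0) - U[y]) ** 2 for i in Q for y in U)
    return seed_loss + mu * smooth_loss + nu * reg_loss
```

Minimizing this over Q trades off fidelity to the seeds against smoothness along heavy edges and closeness to the prior.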
-
Factor Graph Interpretation[Zhu et al., ICML 2003][Das and Smith, NAACL 2012]
[Figure: factor graph with seed distributions r1–r4 attached to variable nodes q1–q4, plus unlabeled variable nodes q9264–q9270; factors shown: 1. factors encouraging agreement on seed labels, 2. smoothness factors, 3. unary regularization factors −log ψ_t(q_t)]
54
-
Label Propagation with Sparsity
55
-
Label Propagation with Sparsity
Enforce through sparsity inducing unary factor
55
-
Label Propagation with Sparsity
Enforce through sparsity inducing unary factor
−log ψ_t(q_t) = λ Σ_y |q_t(y)|   Lasso (Tibshirani, 1996)
−log ψ_t(q_t) = λ (Σ_y |q_t(y)|)²   Elitist Lasso (Kowalski and Torrésani, 2009)
55
-
Label Propagation with Sparsity
Enforce through sparsity inducing unary factor
−log ψ_t(q_t) = λ Σ_y |q_t(y)|   Lasso (Tibshirani, 1996)
−log ψ_t(q_t) = λ (Σ_y |q_t(y)|)²   Elitist Lasso (Kowalski and Torrésani, 2009)
For more details, see [Das and Smith, NAACL 2012]
55
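The two sparsity-inducing penalties are easy to compute directly. A sketch under the assumption that each node's label scores are kept unnormalized (as in Das and Smith, NAACL 2012); function names are hypothetical:

```python
def lasso_penalty(Q, lam=0.1):
    """Lasso unary penalty: lam * sum_t sum_y |q_t(y)|.
    Q is a list of per-node label-score dicts (unnormalized)."""
    return lam * sum(abs(v) for q in Q for v in q.values())

def elitist_lasso_penalty(Q, lam=0.1):
    """Elitist-lasso unary penalty: lam * sum_t (sum_y |q_t(y)|)^2,
    i.e. a squared L1 norm per node, which favors few active labels
    on each node rather than few labels globally."""
    return lam * sum(sum(abs(v) for v in q.values()) ** 2 for q in Q)
```

The elitist variant grows quadratically with the per-node L1 mass, so spreading mass over many labels on one node is penalized much more heavily than concentrating it.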
-
Other Graph-SSL Methods
Spectral Graph Transduction [Joachims, ICML 2003] SSL on Directed Graphs
[Zhou et al, NIPS 2005], [Zhou et al., ICML 2005]
Learning with dissimilarity edges [Goldberg et al., AISTATS 2007]
Learning to order: GraphOrder [Talukdar et al., CIKM 2012] Graph Transduction using Alternating Minimization
[Wang et al., ICML 2008]
Graph as regularizer for Multi-Layered Perceptron [Karlen et al., ICML 2008], [Malkin et al., Interspeech 2009]
56
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues- Node reordering- MapReduce Parallelization
57
-
More (Unlabeled) Data is Better Data
[Subramanya & Bilmes, JMLR 2011]58
-
More (Unlabeled) Data is Better Data
Graph with 120m vertices
[Subramanya & Bilmes, JMLR 2011]58
-
More (Unlabeled) Data is Better Data
Graph with 120m vertices
Challenges with large unlabeled data:
Constructing graph from large data Scalable inference over large graphs
[Subramanya & Bilmes, JMLR 2011]58
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues- Node reordering- MapReduce Parallelization
59
-
Scalability Issues (I)Graph Construction
60
-
Brute force (exact) k-NNG too expensive (quadratic)
Scalability Issues (I)Graph Construction
60
-
Brute force (exact) k-NNG too expensive (quadratic)
Approximate nearest neighbor using kd-tree [Friedman et al., 1977, also see http://www.cs.umd.edu/mount/]
Scalability Issues (I)Graph Construction
60
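A kd-tree answers nearest-neighbor queries in roughly O(log n) per point instead of the O(n) scan behind the quadratic brute-force graph. A self-contained sketch of the idea (large-scale systems typically use optimized or approximate libraries instead); all names here are illustrative:

```python
import heapq

def build_kdtree(points, ids=None, depth=0):
    """Recursively build a kd-tree over points (list of equal-length tuples)."""
    if ids is None:
        ids = list(range(len(points)))
    if not ids:
        return None
    axis = depth % len(points[0])
    ids = sorted(ids, key=lambda i: points[i][axis])
    mid = len(ids) // 2
    return {'id': ids[mid], 'axis': axis,
            'left': build_kdtree(points, ids[:mid], depth + 1),
            'right': build_kdtree(points, ids[mid + 1:], depth + 1)}

def knn_query(node, points, q, k, heap=None):
    """k nearest neighbors of q via kd-tree traversal with pruning.
    Returns [(squared_distance, point_id)] sorted nearest-first."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    p = points[node['id']]
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    heapq.heappush(heap, (-d2, node['id']))     # max-heap on distance
    if len(heap) > k:
        heapq.heappop(heap)
    diff = q[node['axis']] - p[node['axis']]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    knn_query(near, points, q, k, heap)
    # only descend into the far half-space if it could hold a closer point
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_query(far, points, q, k, heap)
    return sorted((-d, i) for d, i in heap)
```

Querying every point against the tree yields the k-NN graph; since each point's nearest hit is itself (distance 0), query with k+1 and drop the self match.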
-
Sub-sample the data
Construct graph over a subset of the unlabeled data [Delalleau et al., AISTATS 2005]
Sparse Grids [Garcke & Griebel, KDD 2001]
Scalability Issues (II)Label Inference
61
-
Sub-sample the data
Construct graph over a subset of the unlabeled data [Delalleau et al., AISTATS 2005]
Sparse Grids [Garcke & Griebel, KDD 2001]
How about using more computation? (next section) Symmetric multi-processor (SMP) Distributed Computer
Scalability Issues (II)Label Inference
61
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues- Node reordering [Subramanya & Bilmes, JMLR 2011; Bilmes & Subramanya, 2011]
- MapReduce Parallelization
62
-
Parallel computation on a SMP
.....
k+1
k
2
1
..... Graph nodes (neighbors not shown)
SMP with k Processors
-
Parallel computation on a SMP
Processor 1
.....
k+1
k
2
1
..... Graph nodes (neighbors not shown)
SMP with k Processors
-
Parallel computation on a SMP
Processor 1
.....
k+1
k
2
1
.....
Processor 2
Graph nodes (neighbors not shown)
SMP with k Processors
-
Parallel computation on a SMP
Processor 1
.....
k+1
k
2
1
.....
Processor 2
Processor k
Processor 1
Graph nodes (neighbors not shown)
SMP with k Processors
-
Parallel computation on a SMP
Processor 1
.....
k+1
k
2
1
.....
Processor 2
Processor k
Processor 1
Graph nodes (neighbors not shown)
SMP with k Processors
-
Label Update using Message Passing
Seed Prior
0.75
0.05
0.60
Current label estimate on a
a b
c
v
64
Processor 1
.....
k+1
k
2
1
.....
Processor 2
Processor k
Processor 1
Graph nodes (neighbors not shown)
SMP with k Processors
-
Label Update using Message Passing
Seed Prior
0.75
0.05
0.60
New label estimate on v
a b
c
v
64
Processor 1
.....
k+1
k
2
1
.....
Processor 2
Processor k
Processor 1
Graph nodes (neighbors not shown)
SMP with k Processors
-
Speed-up on SMP
[Subramanya & Bilmes, JMLR, 2011]
Graph with 1.4M nodes SMP with 16 cores and 128GB of RAM
65
-
Speed-up on SMP
Ratio of time spent when using 1 processor to
time spent using n processors
[Subramanya & Bilmes, JMLR, 2011]
Graph with 1.4M nodes SMP with 16 cores and 128GB of RAM
65
-
Speed-up on SMP
Ratio of time spent when using 1 processor to
time spent using n processors
Cache miss?
[Subramanya & Bilmes, JMLR, 2011]
Graph with 1.4M nodes SMP with 16 cores and 128GB of RAM
65
-
Node Reordering Algorithm
Input: Graph G = (V, E)
Result: Node ordered graph
1. Select an arbitrary node v
2. while unselected nodes remain do
2.1. select an unselected node v` from among the neighbors of v whose neighbors have maximum overlap with the neighbors of v
2.2. mark v` as selected
2.3. set v to v`
[Subramanya & Bilmes, JMLR, 2011]66
-
Node Reordering Algorithm
Input: Graph G = (V, E)
Result: Node ordered graph
1. Select an arbitrary node v
2. while unselected nodes remain do
2.1. select an unselected node v` from among the neighbors of v whose neighbors have maximum overlap with the neighbors of v
2.2. mark v` as selected
2.3. set v to v`
Exhaustive search is feasible for sparse
(e.g., k-NN) graphs
[Subramanya & Bilmes, JMLR, 2011]66
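The greedy heuristic above can be sketched in a few lines: repeatedly move to the unselected node whose neighbor set overlaps most with the current node's, so that consecutive nodes share cached neighbor messages. A sketch after Subramanya & Bilmes (JMLR 2011); the function name and the `start` parameter are illustrative additions:

```python
def reorder_nodes(neighbors, start=None):
    """Greedy node reordering for cache locality.
    neighbors: dict mapping node -> set of neighbor nodes."""
    remaining = set(neighbors)
    v = start if start is not None else next(iter(remaining))  # step 1: arbitrary node
    order = [v]
    remaining.discard(v)
    while remaining:
        # prefer unselected neighbors of v; fall back to any unselected node
        pool = (neighbors[v] & remaining) or remaining
        v = max(pool, key=lambda u: len(neighbors[v] & neighbors[u]))
        order.append(v)
        remaining.remove(v)
    return order
```

On the intuition example from the following slides (N(k)={a,b,c}, N(a)={e,j,c}, N(b)={a,d,c}, N(c)={l,m,n}), starting at k the heuristic picks b next, since |N(k) ∩ N(b)| = 2 is the largest overlap.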
-
Node Reordering Algorithm : Intuition
k
a
b
c
67
-
Node Reordering Algorithm : Intuition
k
a
b
c
67
Which node should be placed after k to
optimize cache performance?
-
Node Reordering Algorithm : Intuition
k
a
b
c
a
e
j
c
67
Which node should be placed after k to
optimize cache performance?
-
Node Reordering Algorithm : Intuition
k
a
b
c
a
e
j
c
b
a
d
c
c
l
m
n67
Which node should be placed after k to
optimize cache performance?
-
Node Reordering Algorithm : Intuition
k
a
b
c
a
e
j
c
b
a
d
c
c
l
m
n
|N(k) ∩ N(a)| = 1
67
Which node should be placed after k to
optimize cache performance?
-
Node Reordering Algorithm : Intuition
k
a
b
c
a
e
j
c
b
a
d
c
c
l
m
n
|N(k) ∩ N(a)| = 1
67
Which node should be placed after k to
optimize cache performance?
Cardinality of Intersection
-
Node Reordering Algorithm : Intuition
k
a
b
c
a
e
j
c
b
a
d
c
c
l
m
n
|N(k) ∩ N(a)| = 1
|N(k) ∩ N(b)| = 2
|N(k) ∩ N(c)| = 0
67
Which node should be placed after k to
optimize cache performance?
Cardinality of Intersection
-
Node Reordering Algorithm : Intuition
k
a
b
c
a
e
j
c
b
a
d
c
c
l
m
n
|N(k) ∩ N(a)| = 1
|N(k) ∩ N(b)| = 2
|N(k) ∩ N(c)| = 0
Best Node
67
Which node should be placed after k to
optimize cache performance?
Cardinality of Intersection
-
Speed-up on SMP after Node Ordering
[Subramanya & Bilmes, JMLR, 2011]68
-
Distributed Processing
Maximize overlap between consecutive nodes within the same machine
Minimize overlap across machines (reduce inter-machine communication)
69
-
Distributed Processing
[Bilmes & Subramanya, 2011]70
-
Distributed ProcessingMaximize
intra-processor overlap
[Bilmes & Subramanya, 2011]70
-
Distributed ProcessingMaximize
intra-processor overlap
Minimizeinter-processor
overlap
[Bilmes & Subramanya, 2011]70
-
Node reordering for Distributed Computer
...... ............ ......
Processor #i Processor #j
71
-
Node reordering for Distributed Computer
...... ............ ......
Processor #i Processor #j
Minimize Overlap
71
-
Node reordering for Distributed Computer
...... ............ ......
Processor #i Processor #j
Minimize OverlapMaximize
Overlap
71
-
Distributed Processing Results
[Bilmes & Subramanya, 2011]72
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues- Node reordering- MapReduce Parallelization
73
-
MapReduce Implementation of MAD
Seed Prior
0.75
0.05
0.60
Current label estimate on b
a b
c
v
74
-
MapReduce Implementation of MAD Map
Each node sends its current label assignments to its neighbors
Seed Prior
0.75
0.05
0.60
Current label estimate on b
a b
c
v
74
-
MapReduce Implementation of MAD Map
Each node sends its current label assignments to its neighbors
Seed Prior
0.75
0.05
0.60
a b
c
v
74
-
MapReduce Implementation of MAD Map
Each node sends its current label assignments to its neighbors
Seed Prior
0.75
0.05
0.60
a b
c
v Reduce Each node updates its own label
assignment using messages received from neighbors, and its own information (e.g., seed labels, reg. penalties etc.)
Repeat until convergence
Reduce
74
-
MapReduce Implementation of MAD Map
Each node sends its current label assignments to its neighbors
Seed Prior
0.75
0.05
0.60
New label estimate on v
a b
c
v Reduce Each node updates its own label
assignment using messages received from neighbors, and its own information (e.g., seed labels, reg. penalties etc.)
Repeat until convergence74
-
MapReduce Implementation of MAD Map
Each node sends its current label assignments to its neighbors
Seed Prior
0.75
0.05
0.60
New label estimate on v
a b
c
v Reduce Each node updates its own label
assignment using messages received from neighbors, and its own information (e.g., seed labels, reg. penalties etc.)
Repeat until convergenceCode in Junto Label Propagation Toolkit
(includes Hadoop-based implementation)
http://code.google.com/p/junto/74
-
MapReduce Implementation of MAD Map
Each node sends its current label assignments to its neighbors
Seed Prior
0.75
0.05
0.60
New label estimate on v
a b
c
v Reduce Each node updates its own label
assignment using messages received from neighbors, and its own information (e.g., seed labels, reg. penalties etc.)
Repeat until convergenceCode in Junto Label Propagation Toolkit
(includes Hadoop-based implementation)
http://code.google.com/p/junto/74
Graph-based algorithms are amenable to distributed processing
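The map and reduce steps described above can be sketched as plain functions. This is a simplified MAD-style update for illustration (the full algorithm also uses per-node random-walk probabilities and a dummy-label regularizer); all names and the `alpha` mixing weight are assumptions:

```python
from collections import defaultdict

def map_step(graph, labels):
    """Map: every node emits its current label distribution to each neighbor.
    graph: dict node -> {neighbor: edge weight}; labels: dict node -> dist."""
    for node, nbrs in graph.items():
        for nbr, w in nbrs.items():
            yield nbr, (w, labels[node])

def reduce_step(messages, seeds, alpha=0.9):
    """Reduce: each node combines weighted neighbor messages with its own
    seed labels. Nodes receiving no messages are omitted in this sketch."""
    grouped = defaultdict(list)
    for node, msg in messages:
        grouped[node].append(msg)
    new_labels = {}
    for node, msgs in grouped.items():
        agg = defaultdict(float)
        total = sum(w for w, _ in msgs)
        for w, dist in msgs:
            for lbl, p in dist.items():
                agg[lbl] += alpha * w * p / total
        for lbl, p in seeds.get(node, {}).items():
            agg[lbl] += (1 - alpha) * p
        new_labels[node] = dict(agg)
    return new_labels
```

Iterating map then reduce until the label estimates stop changing mirrors the "repeat until convergence" loop on the slide.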
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification- Text Categorization - Dialog Act Tagging- Statistical Machine Translation- POS Tagging- MultiLingual POS Tagging
75
-
Problem Description & Motivation
76
-
Problem Description & Motivation
Given a frame of speech, classify it into one of n phones
76
-
Problem Description & Motivation
Given a frame of speech, classify it into one of n phones
Training supervised models requires large amounts of labeled data (phone classification in resource-scarce languages)
76
-
TIMIT
77
-
TIMIT
Corpus of read speech
77
-
TIMIT
Corpus of read speech Broadband recordings of 630 speakers of 8 major
dialects of American English
77
-
TIMIT
Corpus of read speech Broadband recordings of 630 speakers of 8 major
dialects of American English
Each speaker has read 10 sentences
77
-
TIMIT
Corpus of read speech Broadband recordings of 630 speakers of 8 major
dialects of American English
Each speaker has read 10 sentences Includes time-aligned phonetic transcriptions
77
-
TIMIT
Corpus of read speech Broadband recordings of 630 speakers of 8 major
dialects of American English
Each speaker has read 10 sentences Includes time-aligned phonetic transcriptions Phone set has 61 phones [Lee & Hon, 89]
77
-
TIMIT
Corpus of read speech Broadband recordings of 630 speakers of 8 major
dialects of American English
Each speaker has read 10 sentences Includes time-aligned phonetic transcriptions Phone set has 61 phones [Lee & Hon, 89] mapped down to 48 phones for modeling
77
-
TIMIT
Corpus of read speech Broadband recordings of 630 speakers of 8 major
dialects of American English
Each speaker has read 10 sentences Includes time-aligned phonetic transcriptions Phone set has 61 phones [Lee & Hon, 89] mapped down to 48 phones for modeling further mapped down to 39 phones for scoring
77
-
Feature Extraction
78
-
Feature Extraction
78
MFCC
Hamming Window
25ms @ 100 Hz
-
Feature Extraction
78
MFCC
Hamming Window
25ms @ 100 Hz
Delta
-
Feature Extraction
78
MFCC
Hamming Window
25ms @ 100 Hz
Delta .... ....
ti
-
Feature Extraction
78
MFCC
Hamming Window
25ms @ 100 Hz
Delta .... ....
ti
-
Feature Extraction
78
MFCC
Hamming Window
25ms @ 100 Hz
Delta .... ....
ti
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
-
Graph Construction
79
TIMIT Training Set
TIMIT Development
Set
-
Graph Construction
79
TIMIT Training Set
TIMIT Development
Set
+
-
Graph Construction
79
TIMIT Training Set
TIMIT Development
Set
+ k-NN Graph
-
Graph Construction
79
TIMIT Training Set
TIMIT Development
Set
+ k-NN Graph
wij = exp{−(xi − xj)ᵀ Σ⁻¹ (xi − xj)}
-
Graph Construction
79
TIMIT Training Set
TIMIT Development
Set
+ k-NN Graph
- k = 10- Graph with about 1.4 M nodes
wij = exp{−(xi − xj)ᵀ Σ⁻¹ (xi − xj)}
-
Graph Construction
79
TIMIT Training Set
TIMIT Development
Set
+ k-NN Graph
- k = 10- Graph with about 1.4 M nodes
Labels not used during graph construction
wij = exp{−(xi − xj)ᵀ Σ⁻¹ (xi − xj)}
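The edge-weight formula above is a Mahalanobis-style RBF kernel. A direct sketch, where `sigma_inv` stands for the inverse covariance Σ⁻¹ (typically diagonal, estimated from the data); the function name is illustrative:

```python
import numpy as np

def rbf_weight(xi, xj, sigma_inv):
    """Edge weight w_ij = exp(-(x_i - x_j)^T Sigma^{-1} (x_i - x_j))."""
    d = xi - xj
    return float(np.exp(-d @ sigma_inv @ d))
```

Identical feature vectors get weight 1.0, and the weight decays exponentially with (covariance-scaled) squared distance.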
-
Inductive Extension
80
TIMIT Training Set
TIMIT Development
Set
+ Graph
-
Inductive Extension
80
TIMIT Training Set
TIMIT Development
Set
+ Graph
Test Sample
-
Inductive Extension
80
TIMIT Training Set
TIMIT Development
Set
+ Graph
Test Sample
Feature Extraction
-
Inductive Extension
80
TIMIT Training Set
TIMIT Development
Set
+ Graph Nearest Neighbors
Test Sample
Feature Extraction
-
Inductive Extension
80
TIMIT Training Set
TIMIT Development
Set
+ Graph Nearest Neighbors
Test Sample
Feature Extraction
-
Inductive Extension
80
TIMIT Training Set
TIMIT Development
Set
+ Graph Nearest Neighbors
Test Sample
Feature Extraction
Nadaraya-Watson Estimator
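The inductive extension above can be sketched directly: the label distribution of an unseen test point is a kernel-weighted average (Nadaraya-Watson estimator) of the learned distributions on its nearest graph nodes. A minimal sketch with an assumed Gaussian kernel and hypothetical names:

```python
import numpy as np

def nadaraya_watson(x, X_graph, Q, sigma=1.0, k=5):
    """Estimate the label distribution of test point x from the learned
    distributions Q of its k nearest graph nodes X_graph."""
    d2 = ((X_graph - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                    # k nearest graph nodes
    w = np.exp(-d2[nn] / (2.0 * sigma ** 2))   # Gaussian kernel weights
    return (w[:, None] * Q[nn]).sum(axis=0) / w.sum()
```

Because the output is a convex combination of the neighbors' distributions, it is itself a valid distribution whenever the rows of Q are.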
-
Results (I)
81
-
Results (II)
82
-
Labeled Graph Construction
83
TIMIT Training Set
[Alexandrescu & Kirchoff, ASRU 2007]
-
Labeled Graph Construction
83
TIMIT Training Set
Feature Extraction
[Alexandrescu & Kirchoff, ASRU 2007]
-
Labeled Graph Construction
83
TIMIT Training Set
Feature Extraction
MLP Training
[Alexandrescu & Kirchoff, ASRU 2007]
-
Labeled Graph Construction
83
TIMIT Training Set
Feature Extraction
MLP Training
TIMIT Training Set
TIMIT Development
Set
+ MLP
[Alexandrescu & Kirchoff, ASRU 2007]
-
Labeled Graph Construction
83
TIMIT Training Set
Feature Extraction
MLP Training
TIMIT Training Set
TIMIT Development
Set
+ MLP Graph Construction (Jensen-Shannon Divergence)
[Alexandrescu & Kirchoff, ASRU 2007]
-
Results (III)
84
[Bar chart: phone accuracy (65–74%) of MLP, MP, MAD, and p-MP at 10%, 50%, and 100% of the training data]
[Liu & Kirchoff, UW-EE TR 2012]
-
Results (III)
84
[Bar chart: phone accuracy (65–74%) of MLP, MP, MAD, and p-MP at 10%, 50%, and 100% of the training data]
[Liu & Kirchoff, UW-EE TR 2012]
Take advantage of labeled data during graph construction Regularize towards prior (when available)
-
Switchboard Phonetic Annotation
85
-
Switchboard Phonetic Annotation
Switchboard corpus consists of about 300 hours of conversational speech.
Less reliable automatically generated phone-level annotations [Deshmukh et al., 1998]
85
-
Switchboard Phonetic Annotation
Switchboard corpus consists of about 300 hours of conversational speech.
Less reliable automatically generated phone-level annotations [Deshmukh et al., 1998]
Switchboard transcription project (STP) [Greenberg, 1995]
Manual phonetic annotation; only about 75 minutes of data annotated
85
-
Graph Construction
86
-
Graph Construction
86
x% of SWB
STP
+
-
Graph Construction
86
x% of SWB
STP
+ k-NN Graph
-
Graph Construction
86
x% of SWB
STP
+ k-NN Graph
- kd-tree- k = 10- Graph with about 120 M nodes
-
Results
[Subramanya & Bilmes, JMLR 2011]87
-
Results
Graph with 120m vertices
[Subramanya & Bilmes, JMLR 2011]87
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification- Text Categorization - Dialog Act Tagging- Statistical Machine Translation- POS Tagging- MultiLingual POS Tagging
88
-
Problem Description & Motivation
89
-
Problem Description & Motivation
Given a document (e.g., web page, news article), assign it to a fixed number of semantic categories (e.g., sports, politics, entertainment)
89
-
Problem Description & Motivation
Given a document (e.g., web page, news article), assign it to a fixed number of semantic categories (e.g., sports, politics, entertainment)
Multi-label problem
89
-
Problem Description & Motivation
Given a document (e.g., web page, news article), assign it to a fixed number of semantic categories (e.g., sports, politics, entertainment)
Multi-label problem Training supervised models requires large
amounts of labeled data [Dumais et al., 1998]
89
-
Corpora
Reuters [Lewis, et al., 1978] Newswire: about 20K documents with 135 categories. Use
top 10 categories (e.g., earnings, acquisitions, wheat, interest) and label the remaining as other
90
-
Corpora
Reuters [Lewis, et al., 1978] Newswire: about 20K documents with 135 categories. Use
top 10 categories (e.g., earnings, acquisitions, wheat, interest) and label the remaining as other
WebKB [Bekkerman, et al., 2003] 8K webpages from 4 academic domains Categories include course, department,
faculty and project90
-
Feature Extraction
Showers continued throughout the week inthe Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ...
Document[Lewis, et al., 1978]
91
-
Feature Extraction
Showers continued throughout the week inthe Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ...
Document
Showers continued week Bahia cocoa zone alleviating drought early January improving prospects coming temporao, ...
Stop-word Removal[Lewis, et al., 1978]
91
-
Feature Extraction
Showers continued throughout the week inthe Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ...
Document
Showers continued week Bahia cocoa zone alleviating drought early January improving prospects coming temporao, ...
shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao, ...
Stop-word Removal Stemming[Lewis, et al., 1978]
91
-
Feature Extraction
Showers continued throughout the week inthe Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ...
Document
Showers continued week Bahia cocoa zone alleviating drought early January improving prospects coming temporao, ...
shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao, ...
Stop-word Removal Stemming
showerbahiacocoa
.
.
.
.
0.010.340.33
.
.
.
.
TFIDF weighted Bag-of-words[Lewis, et al., 1978]
91
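The last step of the pipeline, TF-IDF weighting over the stemmed bag of words, can be sketched compactly. A minimal sketch with a common TF-IDF variant (term frequency times log inverse document frequency); real systems use tuned weighting schemes, and the function name is illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weighted bag-of-words for a list of non-empty token lists
    (tokens assumed already stop-word-filtered and stemmed)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / df[t])
                    for t, c in tf.items()})
    return out
```

Terms occurring in every document (df = n) get weight zero, so only discriminative terms contribute to the document vectors.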
-
Results
Average PRBEP
SVM TSVM SGT LP MP MAD
Reuters 48.9 59.3 60.3 59.7 66.3 -
WebKB 23.0 29.2 36.8 41.2 51.9 53.7
Precision-recall break even point (PRBEP)92
-
Results
Average PRBEP
SVM TSVM SGT LP MP MAD
Reuters 48.9 59.3 60.3 59.7 66.3 -
WebKB 23.0 29.2 36.8 41.2 51.9 53.7
Precision-recall break even point (PRBEP)
Support Vector
Machine (Supervised)
92
-
Results
Average PRBEP
SVM TSVM SGT LP MP MAD
Reuters 48.9 59.3 60.3 59.7 66.3 -
WebKB 23.0 29.2 36.8 41.2 51.9 53.7
Precision-recall break even point (PRBEP)
Support Vector
Machine (Supervised)
Transductive SVM
[Joachims 1999]
92
-
Results
Average PRBEP
SVM TSVM SGT LP MP MAD
Reuters 48.9 59.3 60.3 59.7 66.3 -
WebKB 23.0 29.2 36.8 41.2 51.9 53.7
Precision-recall break even point (PRBEP)
Support Vector
Machine (Supervised)
Transductive SVM
[Joachims 1999]
Spectral Graph
Transduction (SGT)
[Joachims 2003]
Label Propagation [Zhu & Ghahramani 2002]
Measure Propagation [Subramanya & Bilmes 2008]
Modified Adsorption[Talukdar & Crammer 2009]
92
-
Results on WebKB
[Subramanya & Bilmes, EMNLP 2008]93
-
Results on WebKBModified Adsorption (MAD)
[Talukdar & Crammer 2009]
[Subramanya & Bilmes, EMNLP 2008]93
-
Results on WebKBModified Adsorption (MAD)
[Talukdar & Crammer 2009]
More labeled data => Better Performance Unnormalized distributions (scores) more suitable for multi-label problems (MAD outperforms other approaches)
[Subramanya & Bilmes, EMNLP 2008]93
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification- Text Categorization- Dialog Act Tagging [Subramanya & Bilmes, JMLR 2011]- Statistical Machine Translation- POS Tagging- MultiLingual POS Tagging
94
-
Problem description
Dialog acts (DA) reflect the function that utterances serve in discourse
Applications in automatic speech recognition (ASR), machine translation (MT) & natural language processing (NLP)
95
-
Switchboard DA Corpus
96
-
Switchboard DA Corpus
Switchboard dialog act (DA) tagging project [Jurafsky, et al., 1997]
96
-
Switchboard DA Corpus
Switchboard dialog act (DA) tagging project [Jurafsky, et al., 1997]
Manually labeled 1155 conversations (about 200k sentences)
96
-
Switchboard DA Corpus
Switchboard dialog act (DA) tagging project [Jurafsky, et al., 1997]
Manually labeled 1155 conversations (about 200k sentences) Example labels: question, answer, backchannel, agreement... (total
of 42 different DAs)
96
-
Switchboard DA Corpus
Switchboard dialog act (DA) tagging project [Jurafsky, et al., 1997]
Manually labeled 1155 conversations (about 200k sentences) Example labels: question, answer, backchannel, agreement... (total
of 42 different DAs)
Use top 18 DAs (about 185k sentences)
96
-
Graph Construction
97
-
Graph Construction
Bigram & trigram TFIDF
97
-
Graph Construction
Bigram & trigram TFIDF Cosine similarity
97
-
Graph Construction
Bigram & trigram TFIDF Cosine similarity k-NN graph
97
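The three construction steps above fit in a short sketch: TF-IDF vectors (here as sparse dicts), cosine similarity, and a brute-force k-NN graph for illustration; all names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_edges(vecs, k=2):
    """k-NN graph over TF-IDF vectors using cosine similarity
    (brute force; large graphs would use approximate search)."""
    edges = {}
    for i, u in enumerate(vecs):
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(vecs) if j != i),
                      reverse=True)[:k]
        for s, j in sims:
            edges[(min(i, j), max(i, j))] = s   # undirected edge
    return edges
```

Each sentence links to the k sentences with the most similar bigram/trigram TF-IDF profiles, which is the graph the label-propagation step then runs over.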
-
SWB DA Tagging Results
Baseline performance: 84.2% [Ji & Bilmes, 2005]
98
[Bar chart: DA tagging accuracy (79–86%) of SQ-Loss and MP with bigram and trigram features]
[Subramanya & Bilmes, JMLR 2011]
-
Dihana Corpus
99
-
Dihana Corpus
Computer-human dialogues answering telephone queries about train services in Spain; about 900 dialogues; topics include timetables, fares and services
99
-
Dihana Corpus
Computer-human dialogues answering telephone queries about train services in Spain; about 900 dialogues; topics include timetables, fares and services
225 speakers
99
-
Dihana Corpus
Computer-human dialogues answering telephone queries about train services in Spain; about 900 dialogues; topics include timetables, fares and services
225 speakers Standard train/test set (16k/7.5k sentences)
99
-
Dihana Corpus
Computer-human dialogues answering telephone queries about train services in Spain; about 900 dialogues; topics include timetables, fares and services
225 speakers Standard train/test set (16k/7.5k sentences) Number of labels = 72
99
-
Dihana Results
Results for classifying user turns
Baseline Performance: 76.4%
100
[Bar chart: user-turn DA classification accuracy (75–82%) of SQ-Loss and MP with bigram and trigram features]
[Subramanya & Bilmes, JMLR 2011]
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification- Text Categorization- Dialog Act Tagging- Statistical Machine Translation [Alexandrescu & Kirchoff, NAACL 2009]- POS Tagging- MultiLingual POS Tagging
101
-
Problem Description
Phrase-based statistical machine translation (SMT) Sentences are translated in isolation
102
-
Motivation
103
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
E*rA lA ymknk t$gyl ......Source
excuse me i you turn tv ....1-best
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
E*rA lA ymknk t$gyl ......Source
excuse me i you turn tv ....1-best
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
E*rA lA ymknk t$gyl ......Source
excuse me i you turn tv ....1-best
Correct: lA ymknk → you may not / you cannot
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
E*rA lA ymknk t$gyl ......Source
excuse me i you turn tv ....1-best
Correct: lA ymknk → you may not / you cannot
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
E*rA lA ymknk t$gyl ......Source
excuse me i you turn tv ....1-best
Correct
Different Translations
lA ymknk you may not/you cannot
-
Motivation
103
Asf lA ymknk *lk hnAk .....Source
im sorry you cant there is a cost1-best
E*rA lA ymknk t$gyl ......Source
excuse me i you turn tv ....1-best
Correct
Different TranslationsGraph to enforce smoothness constraint
lA ymknk you may not/you cannot
-
Issues
104
-
Issues
What we want to do - exploit similarity between sentences
104
-
Issues
What we want to do - exploit similarity between sentences
Input consists of variable-length word strings
104
-
Issues
What we want to do - exploit similarity between sentences
Input consists of variable-length word strings Output space is structured (number of possible
labels is very large)
104
-
Graph Construction (I)
105
-
Graph Construction (I)
105
{(s1, t1), . . . , (sl, tl)}Labeled data:
-
Graph Construction (I)
105
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
-
Graph Construction (I)
105
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
(si, ti)
Encodes a source &
target sentence
-
Graph Construction (I)
105
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
(si, ti)
Encodes a source &
target sentence
(sj , tj)
-
Graph Construction (I)
105
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
(si, ti)
Encodes a source &
target sentence
(sj , tj)wij?
-
Graph Construction (I)
105
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
(si, ti)
Encodes a source &
target sentence
(sj , tj)wij?
How do we compute similarity? What about the test set?
-
Graph Construction (II)
106
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
-
Graph Construction (II)
106
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
si
Unlabeled sentence
-
Graph Construction (II)
106
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
First pass Decoder
si → {h(1)si, . . . , h(N)si}  (N-best list); si is an unlabeled sentence
-
Graph Construction (II)
106
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
First pass Decoder
si → {h(1)si, . . . , h(N)si}  (N-best list); si is an unlabeled sentence
X
-
Graph Construction (II)
106
{(s1, t1), . . . , (sl, tl)}Labeled data:Unlabeled data (test set): {sl+1, . . . , sn}
First pass Decoder
si → {h(1)si, . . . , h(N)si}  (N-best list); si is an unlabeled sentence
X
{(si, h(1)si ), . . . , (si, h(N)si )}
-
Graph Construction (III)
107
si
Unlabeled sentence
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
Targets & Hypothesis
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
Targets & Hypothesis
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
Targets & Hypothesis
(h(j)si , tk)
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
Targets & Hypothesis
(h(j)si , tk)
Label similarity
(si, sj) >
-
Graph Construction (III)
107
All sentences (Labeled + Test)
si
Unlabeled sentence
Find most similar
sentences
Sentence similarity
h(j)si
Targets & Hypothesis
(h(j)si , tk)
Averaging
Label similarity
(si, sj) >
-
Corpora
IWSLT 2007 task Italian-to-English (IE) & Arabic-to-English (AE) travel tasks
Each task has train/dev/eval sets Baseline: Standard phrase-based SMT based on a
log-linear model. Yields state-of-the-art performance.
Results are measured using BLEU score and Phrase error rate (PER) [Papineni et al., ACL 2002]
108
-
Results (I)
109
[Bar chart: BLEU (29–33) on the IE eval set, Baseline vs. LP, with BLEU-based and string-kernel sentence similarity]
IE Results on eval set
-
Results (I)
109
[Bar chart: BLEU (29–33) on the IE eval set, Baseline vs. LP, with BLEU-based and string-kernel sentence similarity]
IE Results on eval set
Geometric mean based averaging worked the best
-
Results (II)
110
[Bar chart: BLEU and PER (37–40) on the IE eval set, Baseline vs. LP]
IE Results on eval set
-
Results (II)
110
[Bar chart: BLEU and PER (37–40) on the IE eval set, Baseline vs. LP]
IE Results on eval set
Higher is better
Lower is better
-
Results (II)
110
[Bar charts: IE eval set (BLEU and PER, 37–40) and AE eval set (BLEU and PER, 37–42), Baseline vs. LP; BLEU higher is better, PER lower is better]
-
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification- Text Categorization- Dialog Act Tagging- Statistical Machine Translation- POS Tagging [Subramanya et. al., EMNLP 2008]- MultiLingual POS Tagging
111
-
Motivation
Domain Adaptation
Small amounts of labeled source-domain data:
... bought a book detailing the ... / ... VBD DT NN VBG DT ...
... wanted to book a flight to ... / ... VBD TO VB DT NN TO ...
... the book is about the ... / ... DT NN VBZ PP DT ...
Large amounts of unlabeled target-domain data:
... how to book a band ...
... can you book a day room ...
... how to unrar a file ...
112
-
Graph Construction (I)
when do you book plane tickets?
WRB VBP PRP VB NN NNS
do you read a book on the plane?
VBP PRP VB DT NN IN DT NN
Similarity
113
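One generic way to score the similarity edge between two such sentences is cosine similarity over bag-of-n-gram count vectors; a sketch (an illustrative measure, not necessarily the one used in the tutorial):

```python
import math
from collections import Counter

def ngram_profile(sentence, max_n=2):
    """Counter over all 1..max_n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    profile = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            profile[tuple(tokens[i:i + n])] += 1
    return profile

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * q[g] for g, c in p.items())
    norm = math.sqrt(sum(c * c for c in p.values())) * \
           math.sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

# The two sentences from the slide share "do you", "book", "plane", ...
a = ngram_profile("when do you book plane tickets ?")
b = ngram_profile("do you read a book on the plane ?")
```

Shared n-grams give the pair a nonzero similarity even though "book" is a noun in one sentence and a verb in the other, which is exactly why the tutorial moves to finer-grained trigram-type nodes next.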
-
Graph Construction (II)
how much does it cost to book a band ?
what was the book that has no letter e ?
how to get a book agent ?
can you book a day room at hilton hawaiian village ?
114
-
Graph Construction (II)
you book a
the book that
to book a
a book agent
k-nearest neighbors?
114
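A k-nearest-neighbor graph over such trigram types can be sketched as follows. Jaccard overlap of context features is an illustrative stand-in for the tutorial's actual similarity measure, and the feature values below are hypothetical:

```python
def knn_graph(features, k=2):
    """Connect each trigram-type node to its k most similar neighbors.
    `features` maps node -> set of feature strings; similarity is Jaccard
    overlap (illustrative stand-in for the tutorial's measure)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    graph = {}
    for u in features:
        sims = sorted(((jaccard(features[u], features[v]), v)
                       for v in features if v != u), reverse=True)
        # Keep only the k best neighbors with nonzero similarity.
        graph[u] = [(v, s) for s, v in sims[:k] if s > 0]
    return graph

# Toy trigram types keyed by a few context features (hypothetical values).
feats = {
    "to book a":  {"left:cost", "right:band", "center:book"},
    "to run a":   {"left:how", "right:marathon", "center:run"},
    "you book a": {"left:can", "right:band", "center:book"},
}
g = knn_graph(feats, k=1)
```

In this toy graph the two verb uses of "book" link to each other, while "to run a" stays disconnected; with real feature sets it would link to other verb-context trigrams instead.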
-
Graph Construction (III)
you book a
the book that
to book a
a book agent
the city that
to run a
to become a
you start a
a movie agent
you unrar a
115
-
Graph Construction - Features
how much does it cost to book a band ?
Trigram: to book a
Trigram + Context: cost to book a band
Left Context: cost to
Right Context: a band
Center Word: book
Trigram - Center Word: to ____ a
Left Word + Right Context: to ____ a band
Left Context + Right Word: cost to ____ a
Suffix: none
116
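Features like these can be extracted mechanically for each trigram occurrence. A sketch following the slide's inventory (the suffix check here is a simple stand-in for the tutorial's actual suffix handling):

```python
def trigram_features(tokens, i):
    """Features for the trigram centered at tokens[i], following the
    slide's inventory. x1..x5 is the 5-word window around the center
    word (assumed to exist, i.e. 2 <= i <= len(tokens) - 3);
    '____' marks the held-out center."""
    x1, x2, x3, x4, x5 = tokens[i - 2:i + 3]
    return {
        "Trigram":                   f"{x2} {x3} {x4}",
        "Trigram + Context":         f"{x1} {x2} {x3} {x4} {x5}",
        "Left Context":              f"{x1} {x2}",
        "Right Context":             f"{x4} {x5}",
        "Center Word":               x3,
        "Trigram - Center Word":     f"{x2} ____ {x4}",
        "Left Word + Right Context": f"{x2} ____ {x4} {x5}",
        "Left Context + Right Word": f"{x1} {x2} ____ {x4}",
        # Stand-in suffix feature; the tutorial's real suffix list differs.
        "Suffix": "ing" if x3.endswith("ing") else "none",
    }

sent = "how much does it cost to book a band ?".split()
feats = trigram_features(sent, sent.index("book"))
```

Running this on the slide's sentence reproduces the table above for the trigram "to book a".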