Graph SSL ICASSP 2013 Tutorial Slides


Transcript of Graph SSL ICASSP 2013 Tutorial Slides

  • A Tutorial on

    Graph-based Semi-Supervised Learning Algorithms for Speech and Spoken Language Processing

    Partha Pratim Talukdar (Carnegie Mellon University)

    Amarnag Subramanya (Google Research)

    http://graph-ssl.wikidot.com/

    ICASSP 2013, Vancouver

  • Supervised Learning

    Labeled Data → Learning Algorithm → Model

    Examples: Decision Trees, Support Vector Machine (SVM), Maximum Entropy (MaxEnt)

  • Semi-Supervised Learning (SSL)

    Labeled Data + A Lot of Unlabeled Data → Learning Algorithm → Model

    Examples: Self-Training, Co-Training

  • Why SSL?

    How can unlabeled data be helpful?

    Without unlabeled data: the decision boundary is estimated from the labeled instances alone.

    With unlabeled data: a more accurate decision boundary can be found in the presence of unlabeled instances.

    Example from [Belkin et al., JMLR 2006]

  • Inductive vs Transductive

                                             Inductive                      Transductive
                                             (generalizes to unseen data)   (doesn't generalize to unseen data)

    Supervised (Labeled)                     SVM, Maximum Entropy           X
    Semi-supervised (Labeled + Unlabeled)    Manifold Regularization        Label Propagation, MAD, MP, TACO, ...

    Focus of this tutorial: graph SSL methods, primarily the transductive ones (Label Propagation, MAD, MP, TACO, ...)

    Most Graph SSL algorithms are non-parametric (i.e., # parameters grows with data size)

    See Chapter 25 of the SSL Book: http://olivier.chapelle.cc/ssl-book/discussion.pdf

  • Why Graph-based SSL?

    Some datasets are naturally represented by a graph: web, citation networks, social networks, ...

    Uniform representation for heterogeneous data

    Easily parallelizable, scalable to large data

    Effective in practice

    [Figure: text classification results comparing Graph SSL, Supervised, and Non-Graph SSL]

  • Graph-based SSL

    [Figure: a graph whose nodes are instances and whose edge weights encode similarity (e.g., 0.6, 0.2, 0.8); two nodes are seeded with the labels /m/ and /f/]

  • Graph-based SSL

    Smoothness Assumption: if two instances are similar according to the graph, then their output labels should be similar

    Two stages: graph construction (if a graph is not already present), then label inference

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

  • Graph Construction

    Neighborhood Methods: k-NN Graph Construction (k-NNG), ε-Neighborhood Method

    Metric Learning

    Other approaches

  • Neighborhood Methods

    k-Nearest Neighbor Graph (k-NNG): add edges between an instance and its k nearest neighbors (e.g., k = 3)

    ε-Neighborhood: add edges to all instances inside a ball of radius ε
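    A minimal sketch of the two neighborhood constructions using NumPy. The toy data, the RBF-style edge weights, and the parameter values are illustrative assumptions rather than anything prescribed by the tutorial.

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetrized k-NN graph: connect each instance to its k nearest neighbors."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                           # no self edges
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:                    # k nearest neighbors of i
            w = np.exp(-d2[i, j])                          # similarity as edge weight (assumed RBF)
            W[i, j] = W[j, i] = max(W[i, j], w)            # symmetrize (k-NN alone is asymmetric)
    return W

def eps_graph(X, eps=1.0):
    """ε-neighborhood graph: connect all pairs within a ball of radius ε."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.where((d2 <= eps ** 2) & (d2 > 0), np.exp(-d2), 0.0)

X = np.random.RandomState(0).randn(20, 2)                  # toy 2-D instances
print(knn_graph(X, k=3).shape, eps_graph(X, eps=1.0).sum())
```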

  • Issues with k-NNG

    Not scalable (quadratic)

    Results in an asymmetric graph: b may be the closest neighbor of a without a being the closest neighbor of b

    Results in irregular graphs: some nodes may end up with much higher degree than other nodes (e.g., a node of degree 4 in a k-NNG with k = 1)

  • Issues with ε-Neighborhood

    Not scalable

    Sensitive to the value of ε: not invariant to scaling

    Fragmented graph: disconnected components

    [Figure from [Jebara et al., ICML 2009]: data vs. a disconnected ε-neighborhood graph]

  • Graph Construction using Metric Learning

    Edge weight between instances x_i and x_j:  w_ij ∝ exp(-D_A(x_i, x_j))

    D_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j), with A estimated using Mahalanobis metric learning algorithms

    Supervised Metric Learning: ITML [Kulis et al., ICML 2007], LMNN [Weinberger and Saul, JMLR 2009]

    Semi-supervised Metric Learning: IDML [Dhillon et al., UPenn TR 2010]
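    A minimal sketch of building Mahalanobis-weighted edge weights w_ij = exp(-D_A(x_i, x_j)). The metric A below is a simple placeholder (inverse sample covariance); in the methods cited above it would instead be learned by ITML, LMNN, or IDML.

```python
import numpy as np

def mahalanobis_weights(X, A):
    """Edge weights w_ij = exp(-D_A(x_i, x_j)), with D_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j)."""
    diffs = X[:, None, :] - X[None, :, :]                # (n, n, d) pairwise differences
    D = np.einsum('ijk,kl,ijl->ij', diffs, A, diffs)     # Mahalanobis distances
    W = np.exp(-D)
    np.fill_diagonal(W, 0.0)                             # no self edges
    return W

X = np.random.RandomState(1).randn(50, 4)
# Placeholder metric: inverse sample covariance (a learned A would come from ITML/LMNN/IDML).
A = np.linalg.inv(np.cov(X, rowvar=False) + 1e-6 * np.eye(4))
print(mahalanobis_weights(X, A).shape)
```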

  • Benefits of Metric Learning for Graph Construction

    [Figure: error on the Amazon, Newsgroups, Reuters, Enron A, and Text datasets for graphs built from the original features, RP, PCA, ITML (supervised metric learning), and IDML (semi-supervised metric learning); 100 seed and 1400 test instances, all inference using LP]

    Careful graph construction is critical! [Dhillon et al., UPenn TR 2010]

  • Other Graph Construction Approaches

    Local Reconstruction: Linear Neighborhood [Wang and Zhang, ICML 2005], Regular Graph: b-matching [Jebara et al., ICML 2008], Fitting Graph to Vector Data [Daitch et al., ICML 2009]

    Graph Kernels [Zhu et al., NIPS 2005]

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation

  • Graph Laplacian

    Laplacian (un-normalized) of a graph:  L = D - W,  where D_ii = Σ_j W_ij and D_ij (i ≠ j) = 0

    Example: a graph over nodes a, b, c, d with edge weights W_ab = 1, W_ac = 2, W_bc = 3, W_cd = 1 gives

              a   b   c   d
        a [   3  -1  -2   0 ]
        b [  -1   4  -3   0 ]
        c [  -2  -3   6  -1 ]
        d [   0   0  -1   1 ]

  • Graph Laplacian (contd.)

    L is positive semi-definite (assuming non-negative weights)

    Smoothness of a prediction f over the graph in terms of the Laplacian (f: vector of scores for a single label on the nodes):

        f^T L f = (1/2) Σ_{i,j} W_ij (f_i - f_j)^2     ← measure of non-smoothness

    Example (same graph as before):  f^T = [1 10 5 25] gives f^T L f = 588 (not smooth);  f^T = [1 1 1 3] gives f^T L f = 4 (smooth)
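    The worked example above can be checked directly; a short sketch that builds the example graph, forms L = D - W, and evaluates the non-smoothness f^T L f for both score vectors:

```python
import numpy as np

# Edge weights of the example graph over nodes a, b, c, d
W = np.array([[0, 1, 2, 0],
              [1, 0, 3, 0],
              [2, 3, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W                                   # un-normalized graph Laplacian

def non_smoothness(f, L):
    """f^T L f = 1/2 * sum_ij W_ij (f_i - f_j)^2, the measure of non-smoothness."""
    return f @ L @ f

print(L)                                                   # matches the matrix on the slide
print(non_smoothness(np.array([1., 10., 5., 25.]), L))     # 588.0 (not smooth)
print(non_smoothness(np.array([1., 1., 1., 3.]), L))       # 4.0   (smooth)
```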

  • Relationship between Eigenvalues of the Laplacian and Smoothness

    For an eigenvector g of L with eigenvalue λ:  L g = λ g

    g^T L g = λ g^T g = λ, since eigenvectors are orthonormal (g^T g = 1)

    g^T L g is the measure of non-smoothness from the previous slide, so if an eigenvector is used to classify nodes, the corresponding eigenvalue gives its measure of non-smoothness

  • Spectrum of the Graph Laplacian

    Figure from [Zhu et al., 2005]

    Number of connected components = number of 0 eigenvalues (the corresponding eigenvectors are constant within each component)

    Higher eigenvalue → more irregular eigenvector → less smoothness

  • Notation

    Ŷ_v,l : score of estimated label l on node v (estimated scores)

    Y_v,l : score of seed label l on node v (seed scores)

    R_v,l : regularization target (label prior) for label l on node v

    S : seed node indicator (diagonal matrix)

    W_uv : weight of edge (u, v) in the graph

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation


  • LP-ZGL [Zhu et al., ICML 2003]

        argmin_Ŷ  Σ_{l=1}^{m} Σ_{u,v} W_uv (Ŷ_ul - Ŷ_vl)^2  =  Σ_{l=1}^{m} Ŷ_l^T L Ŷ_l        (L: graph Laplacian)

        such that  Ŷ_ul = Y_ul  for all u with S_uu = 1     (match seeds, a hard constraint)

    Smoothness: two nodes connected by an edge with high weight should be assigned similar labels

    The solution satisfies the harmonic property
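    A minimal sketch of LP-ZGL-style inference, assuming a simple iterative scheme rather than the closed-form solve: seed scores are clamped and every unlabeled node repeatedly takes the weighted average of its neighbors' scores, which converges to the harmonic solution. The toy graph and seeds are illustrative.

```python
import numpy as np

def lp_zgl(W, Y_seed, is_seed, n_iter=200):
    """W: (n, n) edge weights; Y_seed: (n, m) seed label scores; is_seed: (n,) boolean mask."""
    Y = Y_seed.copy()
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero on isolated nodes
    for _ in range(n_iter):
        Y = (W @ Y) / deg                    # weighted average of neighbors' scores
        Y[is_seed] = Y_seed[is_seed]         # clamp seed nodes (hard constraint)
    return Y

# Toy usage: 4 nodes, 2 labels, nodes 0 and 3 are seeds.
W = np.array([[0, 1, 2, 0], [1, 0, 3, 0], [2, 3, 0, 1], [0, 0, 1, 0]], dtype=float)
Y_seed = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
is_seed = np.array([True, False, False, True])
print(lp_zgl(W, Y_seed, is_seed).round(3))
```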

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation

  • Two Related Views

    Label diffusion: labels spread outward from labeled (seeded) nodes to unlabeled nodes

    Random walk: start at an unlabeled node and walk until a labeled node is reached

  • Random Walk View

    Start at an unlabeled node U; at each node V on the walk, decide what to do next:

    Continue the walk with probability p_v^cont

    Assign V's seed label to U with probability p_v^inj

    Abandon the random walk with probability p_v^abnd, and assign U a dummy label

  • Discounting Nodes

    Certain nodes can be unreliable (e.g., high-degree nodes): do not allow propagation/walks through them

    Solution: increase the abandon probability on such nodes, with p_v^abnd growing with degree(v)

  • Redefining Matrices

    W'_uv = p_u^cont × W_uv     (new edge weight)

    S_uu = p_u^inj

    R_ul = p_u^abnd for the dummy label, and 0 for non-dummy labels

  • Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]

        argmin_Ŷ  Σ_{l=1}^{m+1} [ ||S Ŷ_l - S Y_l||^2  +  μ1 Σ_{u,v} M_uv (Ŷ_ul - Ŷ_vl)^2  +  μ2 ||Ŷ_l - R_l||^2 ]

        first term: match seeds (soft);  second term: smoothness;  third term: match priors (regularizer)

    m labels, plus 1 dummy ("none-of-the-above") label

    M = W' + W'^T is the symmetrized weight matrix

    Ŷ_vl: estimated weight of label l on node v;  Y_vl: seed weight for label l on node v

    S: diagonal matrix, nonzero for seed nodes;  R_vl: regularization target (label prior) for label l on node v

    MAD has extra regularization compared to LP-ZGL [Zhu et al., ICML 03]; similar to QC [Bengio et al., 2006]

    MAD's objective is convex

  • Solving MAD Objective

    Can be solved using matrix inversion (as in LP), but matrix inversion is expensive

    Instead, the objective is minimized exactly by solving a system of linear equations (Ax = b) with Jacobi iterations

    This yields iterative updates with guaranteed convergence; see [Bengio et al., 2006] and [Talukdar and Crammer, ECML 2009] for details

  • Solving MAD using Iterative Updates

    Inputs:  Y, R : |V| × (|L|+1);  W' : |V| × |V|;  S : |V| × |V| diagonal

    Ŷ ← Y;  M = W' + W'^T

    Z_v ← S_vv + μ1 Σ_{u ≠ v} M_vu + μ2,  for all v ∈ V

    repeat
        for all v ∈ V do
            Ŷ_v ← (1 / Z_v) [ (S Y)_v + μ1 M_v Ŷ + μ2 R_v ]
        end for
    until convergence

    At each update, node v combines its seed score, its prior, and the current label estimates on its neighbors (e.g., a, b, c)

    The importance of a node can be discounted; easily parallelizable and scalable (more later)
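    A minimal sketch of the Jacobi-style MAD updates above, with assumed values for μ1 and μ2 and a toy graph; W below plays the role of the re-weighted edge matrix W', and S holds the per-node injection weights.

```python
import numpy as np

def mad(W, Y, R, S, mu1=1.0, mu2=0.01, n_iter=100):
    """W: (n,n) re-weighted edges; Y, R: (n, m+1) seed scores / priors; S: (n,) seed injection weights."""
    M = W + W.T                                        # symmetrized weight matrix
    Z = S + mu1 * M.sum(axis=1) + mu2                  # per-node normalizer Z_v
    Yhat = Y.copy()
    for _ in range(n_iter):
        Yhat = (S[:, None] * Y + mu1 * (M @ Yhat) + mu2 * R) / Z[:, None]
    return Yhat

# Toy usage: 4 nodes, 2 real labels + 1 dummy label; nodes 0 and 3 are seeds.
W = np.array([[0, 1, 2, 0], [1, 0, 3, 0], [2, 3, 0, 1], [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 1, 0]], dtype=float)
R = np.zeros_like(Y); R[:, -1] = 0.1                   # small prior mass on the dummy label
S = np.array([1.0, 0.0, 0.0, 1.0])                     # injection weights on seed nodes
print(mad(W, Y, R, S).round(3))
```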

  • When is MAD most effective?

    [Figure: relative increase in MRR of MAD over LP-ZGL (0 to 0.4) as a function of the average degree of the graph (0 to 30)]

    MAD is particularly effective in denser graphs, where there is greater need for regularization.

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation

  • Transduction Algorithm with Confidence (TACO) [Orbach and Crammer, ECML 2012]

    Main lesson from MAD: discount bad (high-degree) nodes

    TACO generalizes the notion of "bad" and adds a per-node, per-class confidence

    Each node maintains both label scores and confidences (low confidence where estimates are unreliable)

  • Introducing Confidence

    [Figure: a node whose neighbors carry conflicting label scores (e.g., 0.49, 0.51, 0.3, 0.9) receives an uncertain estimate (around 0.35-0.5)]

    Neighborhood disagreement ⇒ low confidence

    Lower the effect of poorly estimated (low-confidence) scores

  • TACO Objective

    Convex objective; iterative solution

  • TACO: Iterative Algorithm

  • TACO: Score Update

    [Figure: a node's new score (0.35) is computed from its neighbors' scores (0.3, 0.3, 0.9)]

    Average the neighbors' scores, taking their confidences into account

  • TACO: Confidence Update

    Confidence is monotonic in the neighbors' disagreement

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation

  • Manifold Regularization [Belkin et al., JMLR 2006]

        f* = argmin_f  (1/l) Σ_{i=1}^{l} V(y_i, f(x_i))  +  f^T L f  +  ||f||^2_K

    V: loss function (e.g., soft margin); the first term is the training-data loss

    f^T L f: smoothness regularizer, where L is the Laplacian of the graph over the labeled and unlabeled data

    ||f||^2_K: regularizer (e.g., L2)

    Trains an inductive classifier which can generalize to unseen instances

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Transduction with Confidence, Manifold Regularization, Measure Propagation, Sparse Label Propagation

  • Measure Propagation (MP) [Subramanya and Bilmes, EMNLP 2008, NIPS 2009, JMLR 2011]

        C_KL = Σ_{i=1}^{l} D_KL(r_i || p_i)  +  μ Σ_{i,j} w_ij D_KL(p_i || p_j)  -  ν Σ_{i=1}^{n} H(p_i)

        s.t.  Σ_y p_i(y) = 1,  p_i(y) ≥ 0  ∀ y, i        (normalization constraint)

    r_i, p_i: seed and estimated label distributions (normalized) on node i;  μ, ν: hyper-parameters

    First term: divergence on seed nodes;  second term: smoothness (divergence across edges);  third term: entropic regularizer

    KL divergence:  D_KL(p_i || p_j) = Σ_y p_i(y) log [ p_i(y) / p_j(y) ];   entropy:  H(p_i) = -Σ_y p_i(y) log p_i(y)

    C_KL is convex (with non-negative edge weights and hyper-parameters)

    MP is related to Information Regularization [Corduneanu and Jaakkola, 2003]
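    To make the three terms concrete, a small sketch that simply evaluates C_KL for candidate distributions (it is not the AM solver; μ, ν, and the toy data are assumed values):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def c_kl(P, R, W, labeled, mu=1.0, nu=0.1):
    """C_KL = sum_{i in labeled} KL(r_i || p_i) + mu * sum_{i,j} w_ij KL(p_i || p_j) - nu * sum_i H(p_i)."""
    seed_term = sum(kl(R[i], P[i]) for i in labeled)
    smooth_term = sum(W[i, j] * kl(P[i], P[j])
                      for i in range(len(P)) for j in range(len(P)) if W[i, j] > 0)
    entropy_term = sum(entropy(p) for p in P)
    return seed_term + mu * smooth_term - nu * entropy_term

# Toy usage: 3 nodes, 2 labels, node 0 labeled
P = np.array([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]])
R = np.array([[1.0 - 1e-6, 1e-6], [0.5, 0.5], [0.5, 0.5]])
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(round(c_kl(P, R, W, labeled=[0]), 4))
```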

  • Solving MP Objective

    For ease of optimization, reformulate the MP objective:

        C_MP = Σ_{i=1}^{l} D_KL(r_i || q_i)  +  μ Σ_{i,j} w'_ij D_KL(p_i || q_j)  -  ν Σ_{i=1}^{n} H(p_i)

    q_i: a new probability measure, one for each vertex, similar to p_i

    w'_ij = w_ij + α δ(i, j): the added self-edge encourages agreement between p_i and q_i

    C_MP is also convex (with non-negative edge weights and hyper-parameters)

    C_MP can be solved using Alternating Minimization (AM)

  • Alternating Minimization

    [Figure: AM alternates between the two sets of measures, producing the sequence Q0 → P1 → Q1 → P2 → Q2 → P3 → ...]

    C_MP satisfies the necessary conditions for AM to converge [Subramanya and Bilmes, JMLR 2011]

  • Why AM?

  • Performance of SSL Algorithms

    Comparison of accuracies for different numbers of labeled samples on the COIL (6 classes) and OPT (10 classes) datasets

    Graph SSL can be effective when the data satisfies the manifold assumption. More results and discussion in Chapter 21 of the SSL Book (Chapelle et al.)

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Inference Methods: Label Propagation, Modified Adsorption, Manifold Regularization, Spectral Graph Transduction, Measure Propagation, Sparse Label Propagation

  • Background: Factor Graphs [Kschischang et al., 2001]

    Factor graph: a bipartite graph with variable nodes (e.g., the label distribution on a node) and factor nodes (fitness functions over variable assignments)

    The distribution over all variable values factorizes over the factors, each factor depending only on the variables connected to it

  • Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]

    3-term Graph SSL objective (common to many algorithms):  min  Seed Matching Loss (if any) + Edge Smoothness Loss + Regularization Loss

    Smoothness factor over an edge (q_1, q_2):  ψ(q_1, q_2) ∝ exp( -w_{1,2} ||q_1 - q_2||^2 )

    Seed matching factor (unary) and regularization factor (unary) attach to individual nodes

  • Factor Graph Interpretation [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]

    [Figure: factor graph over label distributions q_1, q_2, ..., q_9270 with seed distributions r_1, ..., r_4]

    1. Factors encouraging agreement on seed labels

    2. Smoothness factors

    3. Unary factors for regularization, -log ψ_t(q_t)

  • Label Propagation with Sparsity

    Enforce sparsity through a sparsity-inducing unary factor -log ψ_t(q_t)

    Choices include the Lasso penalty (Tibshirani, 1996) and the Elitist Lasso penalty (Kowalski and Torrésani, 2009)

    For more details, see [Das and Smith, NAACL 2012]

  • Other Graph-SSL Methods

    Spectral Graph Transduction [Joachims, ICML 2003]

    SSL on Directed Graphs [Zhou et al., NIPS 2005], [Zhou et al., ICML 2005]

    Learning with dissimilarity edges [Goldberg et al., AISTATS 2007]

    Learning to order: GraphOrder [Talukdar et al., CIKM 2012]

    Graph Transduction using Alternating Minimization [Wang et al., ICML 2008]

    Graph as regularizer for Multi-Layered Perceptron [Karlen et al., ICML 2008], [Malkin et al., Interspeech 2009]

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Scalability: Scalability Issues, Node Reordering, MapReduce Parallelization

  • More (Unlabeled) Data is Better Data [Subramanya & Bilmes, JMLR 2011]

    Graph with 120M vertices

    Challenges with large unlabeled data: constructing a graph from large data, and scalable inference over large graphs

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Scalability: Scalability Issues, Node Reordering, MapReduce Parallelization

  • Scalability Issues (I): Graph Construction

    Brute-force (exact) k-NNG is too expensive (quadratic)

    Approximate nearest neighbors using kd-trees [Friedman et al., 1977; also see http://www.cs.umd.edu/mount/]

  • Scalability Issues (II): Label Inference

    Sub-sample the data: construct the graph over a subset of the unlabeled data [Delalleau et al., AISTATS 2005]; Sparse Grids [Garcke & Griebel, KDD 2001]

    How about using more computation? (next section): symmetric multi-processor (SMP), distributed computer

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Scalability: Scalability Issues, Node Reordering [Subramanya & Bilmes, JMLR 2011; Bilmes & Subramanya, 2011], MapReduce Parallelization

  • Parallel Computation on an SMP

    [Figure: graph nodes 1, 2, ..., k, k+1, ... (neighbors not shown) are assigned in turn to processors 1, 2, ..., k of an SMP with k processors]

  • Label Update using Message Passing

    [Figure: node v updates its label estimate from its seed score, prior, and the current label estimates on its neighbors a, b, c, while the graph nodes are distributed over the k processors of the SMP]

  • Speed-up on SMP [Subramanya & Bilmes, JMLR 2011]

    Graph with 1.4M nodes; SMP with 16 cores and 128 GB of RAM

    Speed-up = ratio of the time spent using 1 processor to the time spent using n processors

    [Figure: the speed-up curve flattens well below linear as more processors are used; cache misses?]

  • Node Reordering Algorithm [Subramanya & Bilmes, JMLR 2011]

    Input: Graph G = (V, E);  Result: node-ordered graph

    1. Select an arbitrary node v

    2. While unselected nodes remain:

       2.1. Select an unselected node v' whose neighbor set has maximum overlap with the neighbors of v

       2.2. Mark v' as selected

       2.3. Set v to v'

    Exhaustive search over the candidates is practical for sparse (e.g., k-NN) graphs
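    A minimal sketch of the greedy reordering described above (selection over all unselected nodes, ties broken arbitrarily); the toy adjacency mirrors the intuition example on the next slide and is only illustrative.

```python
def reorder_nodes(adj, start=None):
    """adj: dict node -> set of neighbor nodes.
    Greedily place next the unselected node whose neighborhood overlaps most with the
    current node's neighborhood, to improve cache locality of the label updates."""
    unselected = set(adj)
    v = start if start is not None else next(iter(unselected))    # 1. arbitrary start node
    order = [v]
    unselected.discard(v)
    while unselected:                                             # 2. while unselected nodes remain
        # 2.1 node with maximum neighborhood overlap with v
        v_next = max(unselected, key=lambda u: len(adj[v] & adj[u]))
        order.append(v_next)                                      # 2.2 mark as selected
        unselected.discard(v_next)
        v = v_next                                                # 2.3 continue from v'
    return order

# Toy usage mirroring the intuition slide: N(k) = {a, b, c}, N(a) = {e, j, c},
# N(b) = {a, d, c}, N(c) = {l, m, n}; b has the largest overlap with k and is placed next.
adj = {'k': {'a', 'b', 'c'}, 'a': {'e', 'j', 'c'}, 'b': {'a', 'd', 'c'}, 'c': {'l', 'm', 'n'},
       'e': set(), 'j': set(), 'd': set(), 'l': set(), 'm': set(), 'n': set()}
print(reorder_nodes(adj, start='k'))
```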

  • Node Reordering Algorithm: Intuition

    Which node should be placed after k to optimize cache performance? Compare the cardinality of the neighborhood intersections.

    Node k has neighbors N(k) = {a, b, c}, with N(a) = {e, j, c}, N(b) = {a, d, c}, and N(c) = {l, m, n}

    |N(k) ∩ N(a)| = 1,  |N(k) ∩ N(b)| = 2,  |N(k) ∩ N(c)| = 0  ⇒  b is the best node to place after k

  • Speed-up on SMP after Node Ordering

    [Subramanya & Bilmes, JMLR 2011]

  • Distributed Processing

    Maximize overlap between consecutive nodes within the same machine

    Minimize overlap across machines (reduce inter-machine communication)

  • Distributed Processing [Bilmes & Subramanya, 2011]

    Maximize intra-processor overlap; minimize inter-processor overlap

  • Node Reordering for a Distributed Computer

    [Figure: node sequences on processor #i and processor #j; maximize overlap between consecutive nodes within a processor, minimize overlap across processors]

  • Distributed Processing Results

    [Bilmes & Subramanya, 2011]

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Scalability: Scalability Issues, Node Reordering, MapReduce Parallelization

  • MapReduce Implementation of MAD

    Map: each node sends its current label assignments to its neighbors

    Reduce: each node updates its own label assignment using the messages received from its neighbors and its own information (e.g., seed labels, regularization penalties, etc.)

    Repeat until convergence

    Code in the Junto Label Propagation Toolkit (includes a Hadoop-based implementation): http://code.google.com/p/junto/

    Graph-based algorithms are amenable to distributed processing
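    A single-process sketch that mimics one MapReduce iteration of MAD: the map phase emits each node's current labels to its neighbors, and the reduce phase combines the incoming messages with the node's own seed and prior information. It does not reproduce the Junto/Hadoop implementation; μ1, μ2, and the dictionary-based data layout are assumptions.

```python
from collections import defaultdict

def map_phase(graph, Yhat):
    """Map: each node sends its current label assignment to every neighbor."""
    for u, nbrs in graph.items():                     # graph: {u: {v: weight}}, symmetrized weights
        for v, w in nbrs.items():
            yield v, (w, Yhat[u])                     # message keyed by destination node

def reduce_phase(messages, Y, R, S, mu1=1.0, mu2=0.01):
    """Reduce: each node combines neighbor messages with its own seed/prior information."""
    Yhat_new = {}
    for v, msgs in messages.items():
        num = {l: S.get(v, 0.0) * s for l, s in Y.get(v, {}).items()}   # seed term
        Z = S.get(v, 0.0) + mu2                                         # normalizer
        for w, labels in msgs:                                          # neighbor (smoothness) term
            Z += mu1 * w
            for l, s in labels.items():
                num[l] = num.get(l, 0.0) + mu1 * w * s
        for l, s in R.get(v, {}).items():                               # prior term
            num[l] = num.get(l, 0.0) + mu2 * s
        Yhat_new[v] = {l: s / Z for l, s in num.items()}
    return Yhat_new

def mad_mapreduce_iteration(graph, Yhat, Y, R, S):
    messages = defaultdict(list)
    for v, msg in map_phase(graph, Yhat):
        messages[v].append(msg)                       # "shuffle": group messages by destination node
    return reduce_phase(messages, Y, R, S)
```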

  • Outline

    Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

    Applications: Phone Classification, Text Categorization, Dialog Act Tagging, Statistical Machine Translation, POS Tagging, Multilingual POS Tagging

  • Problem Description & Motivation

    Given a frame of speech, classify it into one of n phones

    Training supervised models requires large amounts of labeled data (e.g., phone classification in resource-scarce languages)

  • TIMIT

    Corpus of read speech: broadband recordings of 630 speakers of 8 major dialects of American English

    Each speaker read 10 sentences; includes time-aligned phonetic transcriptions

    Phone set has 61 phones [Lee & Hon, 89], mapped down to 48 phones for modeling and further down to 39 phones for scoring

  • Feature Extraction

    [Figure: MFCC features computed over a 25 ms Hamming window at 100 Hz, with delta features, around the current frame t_i]

  • Graph Construction

    TIMIT training set + TIMIT development set → k-NN graph

    w_ij = exp{ -(x_i - x_j)^T Σ^{-1} (x_i - x_j) }

    k = 10; graph with about 1.4M nodes

    Labels not used during graph construction
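    A small sketch of the edge-weight computation above over each node's k nearest neighbors, assuming a diagonal Σ built from per-feature variances; the random matrix only stands in for stacked MFCC+delta frames.

```python
import numpy as np

def timit_style_weights(X, k=10):
    """Edge weights w_ij = exp{-(x_i - x_j)^T Sigma^{-1} (x_i - x_j)} over each node's
    k nearest neighbors; Sigma is assumed diagonal (per-feature variances)."""
    inv_var = 1.0 / X.var(axis=0)                              # diagonal Sigma^{-1}
    d2 = (((X[:, None, :] - X[None, :, :]) ** 2) * inv_var).sum(-1)
    np.fill_diagonal(d2, np.inf)                               # no self edges
    W = np.zeros_like(d2)
    for i in range(X.shape[0]):
        nn = np.argsort(d2[i])[:k]                             # k nearest neighbors of frame i
        W[i, nn] = np.exp(-d2[i, nn])
    return np.maximum(W, W.T)                                  # symmetrize

X = np.random.RandomState(2).randn(200, 39)                    # stand-in for MFCC+delta frames
print(timit_style_weights(X).shape)
```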


  • Inductive Extension

    80

    Test Sample → Feature Extraction → Nearest Neighbors in the graph (TIMIT Training Set + TIMIT Development Set) → Nadaraya-Watson Estimator
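    A minimal sketch of the inductive step, assuming graph inference has already produced a label distribution per graph node; the kernel mirrors the graph weights above, and all names are illustrative:

```python
# Sketch: Nadaraya-Watson estimate for an unseen test frame x.
# node_feats:       (n, d) features of the graph nodes
# node_label_dists: (n, C) per-node label distributions from graph inference
# inv_sigma:        (d,) diagonal of Sigma^{-1}, as used for the graph weights
import numpy as np

def nadaraya_watson(x, node_feats, node_label_dists, inv_sigma, k=10):
    diff = node_feats - x
    dist2 = np.sum(diff * diff * inv_sigma, axis=1)   # (x_i - x)^T Sigma^{-1} (x_i - x)
    nbrs = np.argsort(dist2)[:k]                      # k nearest graph nodes
    w = np.exp(-dist2[nbrs])                          # kernel weights
    return w @ node_label_dists[nbrs] / w.sum()       # (C,) soft label for the test frame
```

    The arg-max of the returned distribution gives the predicted phone for the test frame.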

  • Results (I)

    81

  • Results (II)

    82


  • Labeled Graph Construction

    83

    TIMIT Training Set → Feature Extraction → MLP Training

    TIMIT Training Set + TIMIT Development Set → MLP → Graph Construction using Jensen-Shannon Divergence

    [Alexandrescu & Kirchhoff, ASRU 2007]
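    Here node similarity is computed between MLP phone-posterior vectors rather than raw acoustic features. A minimal sketch of the Jensen-Shannon divergence and one possible way to turn it into an edge weight (the exponential conversion and the scale alpha are assumptions, not taken from the paper):

```python
# Sketch: Jensen-Shannon divergence between two phone-posterior vectors p and q
# (e.g., MLP outputs for two frames), and a possible divergence-to-weight mapping.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def edge_weight(p, q, alpha=1.0):
    return np.exp(-alpha * js_divergence(p, q))   # smaller divergence -> larger weight
```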


  • Results (III)

    84

    [Chart: phone accuracy (y-axis 65-74) vs. fraction of labeled data (10%, 50%, 100%) for MLP, MP, MAD, and p-MP]

    [Liu & Kirchhoff, UW-EE TR 2012]

    Take advantage of labeled data during graph construction; regularize towards a prior (when available)


  • Switchboard Phonetic Annotation

    Switchboard corpus consists of about 300 hours of conversational speech.

    Less reliable automatically generated phone-level annotations [Deshmukh et al., 1998]

    Switchboard transcription project (STP) [Greenberg, 1995]

    Manual phonetic annotation; only about 75 minutes of data annotated

    85


  • Graph Construction

    86

    x% of SWB + STP → k-NN Graph

    - kd-tree for nearest-neighbor search
    - k = 10
    - Graph with about 120M nodes
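    At this scale, brute-force neighbor search is impractical, hence the kd-tree. A minimal sketch of kd-tree-accelerated k-NN queries with SciPy (an illustrative choice; a graph of this size would additionally require batching and parallelism):

```python
# Sketch: kd-tree based k-NN edge list, k = 10 as on the slide.
from scipy.spatial import cKDTree

def knn_edges(X, k=10):
    tree = cKDTree(X)
    dist, idx = tree.query(X, k=k + 1)       # first hit is the query point itself
    edges = []
    for i in range(len(X)):
        for j, d in zip(idx[i, 1:], dist[i, 1:]):
            edges.append((i, int(j), float(d)))
    return edges                              # (i, j, distance) triples
```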


  • Results

    Graph with 120M vertices

    [Subramanya & Bilmes, JMLR 2011]

    87

  • Outline

    Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work

    - Phone Classification
    - Text Categorization
    - Dialog Act Tagging
    - Statistical Machine Translation
    - POS Tagging
    - MultiLingual POS Tagging

    88


  • Problem Description & Motivation

    Given a document (e.g., web page, news article), assign it to a fixed number of semantic categories (e.g., sports, politics, entertainment)

    Multi-label problem

    Training supervised models requires large amounts of labeled data [Dumais et al., 1998]

    89


  • Corpora

    Reuters [Lewis et al., 1978]: newswire; about 20K documents with 135 categories. Use the top 10 categories (e.g., earnings, acquisitions, wheat, interest) and label the remaining as "other"

    WebKB [Bekkerman et al., 2003]: 8K webpages from 4 academic domains. Categories include course, department, faculty and project

    90


  • Feature Extraction

    Document [Lewis et al., 1978]:
    "Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ..."

    After stop-word removal:
    "Showers continued week Bahia cocoa zone alleviating drought early January improving prospects coming temporao, ..."

    After stemming:
    "shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao, ..."

    TF-IDF weighted bag-of-words vector:
    shower 0.01, bahia 0.34, cocoa 0.33, ...

    91
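    A minimal sketch of this pipeline (stop-word removal, stemming, TF-IDF weighting) using NLTK and scikit-learn; the exact stop list, stemmer, and weighting used in the original experiments are not specified on the slide, so these are illustrative choices:

```python
# Sketch: stop-word removal + Porter stemming + TF-IDF bag-of-words features.
# Requires nltk.download("stopwords") once.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

_stemmer = PorterStemmer()
_stop = set(stopwords.words("english"))

def analyzer(doc):
    tokens = [t.lower() for t in doc.split() if t.isalpha()]
    return [_stemmer.stem(t) for t in tokens if t not in _stop]

def tfidf_features(docs):
    vec = TfidfVectorizer(analyzer=analyzer)
    X = vec.fit_transform(docs)    # row i: TF-IDF weighted bag-of-words of document i
    return X, vec
```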


  • Results

    Average PRBEP (precision-recall break-even point)

              SVM    TSVM   SGT    LP     MP     MAD
    Reuters   48.9   59.3   60.3   59.7   66.3   -
    WebKB     23.0   29.2   36.8   41.2   51.9   53.7

    SVM: Support Vector Machine (supervised)
    TSVM: Transductive SVM [Joachims 1999]
    SGT: Spectral Graph Transduction [Joachims 2003]
    LP: Label Propagation [Zhu & Ghahramani 2002]
    MP: Measure Propagation [Subramanya & Bilmes 2008]
    MAD: Modified Adsorption [Talukdar & Crammer 2009]

    92
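    PRBEP is the point on the precision-recall curve where precision equals recall; for a ranked list it equals the precision within the top R documents, where R is the number of positive documents for the category. A minimal sketch for one category (an illustrative implementation, not the evaluation script behind these numbers):

```python
# Sketch: precision-recall break-even point (PRBEP) for one category.
import numpy as np

def prbep(scores, labels):
    """scores: per-document scores; labels: 1 if the document is in the category."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    r = int(labels.sum())              # number of positive documents
    top = np.argsort(-scores)[:r]      # top-R ranked documents
    return labels[top].mean()          # at rank R, precision == recall
```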


  • Results on WebKB

    Modified Adsorption (MAD) [Talukdar & Crammer 2009]

    More labeled data => better performance

    Unnormalized distributions (scores) more suitable for multi-label problems (MAD outperforms other approaches)

    [Subramanya & Bilmes, EMNLP 2008]

    93

  • Outline

    Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work

    - Phone Classification
    - Text Categorization
    - Dialog Act Tagging [Subramanya & Bilmes, JMLR 2011]
    - Statistical Machine Translation
    - POS Tagging
    - MultiLingual POS Tagging

    94

  • Problem description

    Dialog acts (DA) reflect the function that utterances serve in discourse

    Applications in automatic speech recognition (ASR), machine translation (MT) & natural language processing (NLP)

    95


  • Switchboard DA Corpus

    Switchboard dialog act (DA) tagging project [Jurafsky et al., 1997]

    Manually labeled 1155 conversations (about 200k sentences)

    Example labels: question, answer, backchannel, agreement, ... (42 different DAs in total)

    Use top 18 DAs (about 185k sentences)

    96


  • Graph Construction

    Bigram & trigram TF-IDF features

    Cosine similarity

    k-NN graph

    97
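    A minimal sketch of this construction, using scikit-learn for the bigram/trigram TF-IDF vectors and for the cosine k-NN search (illustrative choices; the tutorial does not name an implementation):

```python
# Sketch: cosine-similarity k-NN graph over sentences represented by
# bigram & trigram TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def build_sentence_graph(sentences, k=10):
    vec = TfidfVectorizer(analyzer="word", ngram_range=(2, 3))
    X = vec.fit_transform(sentences)                           # sparse TF-IDF matrix
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    dist, idx = nn.kneighbors(X)
    edges = []
    for i in range(X.shape[0]):
        for j, d in zip(idx[i, 1:], dist[i, 1:]):              # skip the self neighbor
            edges.append((i, int(j), 1.0 - float(d)))          # similarity = 1 - distance
    return edges
```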

  • SWB DA Tagging Results

    Baseline performance: 84.2% [Ji & Bilmes, 2005]

    98

    [Chart: DA tagging accuracy (y-axis 79-86) with bigram and trigram features, comparing SQ-Loss and MP]

    [Subramanya & Bilmes, JMLR 2011]


  • Dihana Corpus

    Computer-human dialogues answering telephone queries about train services in Spain

    About 900 dialogues; topics include timetables, fares and services

    225 speakers

    Standard train/test set (16k/7.5k sentences)

    Number of labels = 72

    99

  • Dihana Results

    Results for classifying user turns

    Baseline Performance: 76.4%

    100

    [Chart: user-turn DA tagging accuracy (y-axis 75-82) with bigram and trigram features, comparing SQ-Loss and MP]

    [Subramanya & Bilmes, JMLR 2011]

  • Outline

    Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work

    - Phone Classification
    - Text Categorization
    - Dialog Act Tagging
    - Statistical Machine Translation [Alexandrescu & Kirchhoff, NAACL 2009]
    - POS Tagging
    - MultiLingual POS Tagging

    101

  • Problem Description

    Phrase-based statistical machine translation (SMT)

    Sentences are translated in isolation

    102


  • Motivation

    103

    Source: Asf lA ymknk *lk hnAk .....
    1-best: im sorry you cant there is a cost

    Source: E*rA lA ymknk t$gyl ......
    1-best: excuse me i you turn tv ....

    Correct: lA ymknk → you may not / you cannot

    The same source phrase receives different translations in the two sentences; a graph can be used to enforce a smoothness constraint across them


  • Issues

    What we want to do: exploit similarity between sentences

    Input consists of variable-length word strings

    Output space is structured (the number of possible labels is very large)

    104


  • Graph Construction (I)

    105

    Labeled data: $\{(s_1, t_1), \ldots, (s_l, t_l)\}$
    Unlabeled data (test set): $\{s_{l+1}, \ldots, s_n\}$

    Each node $(s_i, t_i)$ encodes a source & target sentence pair

    What should the edge weight $w_{ij}$ between nodes $(s_i, t_i)$ and $(s_j, t_j)$ be?

    How do we compute similarity? What about the test set?


  • Graph Construction (II)

    106

    Labeled data: $\{(s_1, t_1), \ldots, (s_l, t_l)\}$
    Unlabeled data (test set): $\{s_{l+1}, \ldots, s_n\}$

    Each unlabeled sentence $s_i$ is run through a first-pass decoder to produce an N-best list $\{h^{(1)}_{s_i}, \ldots, h^{(N)}_{s_i}\}$

    This yields hypothesis pairs $\{(s_i, h^{(1)}_{s_i}), \ldots, (s_i, h^{(N)}_{s_i})\}$ for the test sentences


  • Graph Construction (III)

    107

    For each unlabeled sentence $s_i$, find the most similar sentences among all sentences (labeled + test) using a sentence (source-side) similarity, keeping pairs whose similarity $(s_i, s_j)$ exceeds a threshold

    Compare its hypotheses $h^{(j)}_{s_i}$ against the targets & hypotheses $t_k$ of those similar sentences using a label (target-side) similarity $(h^{(j)}_{s_i}, t_k)$

    Combine the resulting similarities by averaging
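    The results that follow compare a BLEU-based similarity with a string kernel on the source side. A minimal sketch of a symmetric sentence-level BLEU similarity that could play the role of the sentence similarity above (the smoothing method and the symmetrization by averaging are assumptions):

```python
# Sketch: BLEU-based similarity between two source sentences, a candidate for
# the sentence similarity used to connect graph nodes.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_smooth = SmoothingFunction().method1

def bleu_similarity(sent_i, sent_j):
    a, b = sent_i.split(), sent_j.split()
    # Sentence-level BLEU is asymmetric, so score both directions and average.
    return 0.5 * (sentence_bleu([a], b, smoothing_function=_smooth)
                  + sentence_bleu([b], a, smoothing_function=_smooth))
```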

  • Corpora

    IWSLT 2007 task: Italian-to-English (IE) & Arabic-to-English (AE) travel tasks

    Each task has train/dev/eval sets

    Baseline: standard phrase-based SMT based on a log-linear model; yields state-of-the-art performance

    Results are measured using BLEU score [Papineni et al., ACL 2002] and phrase error rate (PER)

    108


  • Results (I)

    109

    [Chart: IE results on the eval set, BLEU (y-axis 29-33), Baseline vs. LP, with BLEU-based and string-kernel similarities]

    Geometric-mean-based averaging worked the best


  • Results (II)

    110

    [Charts: IE and AE results on the eval sets, comparing Baseline vs. LP on BLEU (higher is better) and PER (lower is better)]

  • Outline

    Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work

    - Phone Classification
    - Text Categorization
    - Dialog Act Tagging
    - Statistical Machine Translation
    - POS Tagging [Subramanya et al., EMNLP 2008]
    - MultiLingual POS Tagging

    111


  • Motivation

    Small amounts of labeled source-domain data:
    ... bought a book detailing the ...     → ... VBD DT NN VBG DT ...
    ... wanted to book a flight to ...      → ... VBD TO VB DT NN TO ...
    ... the book is about the ...           → ... DT NN VBZ PP DT ...

    Large amounts of unlabeled target-domain data:
    ... how to book a band ...
    ... can you book a day room ...
    ... how to unrar a file ...

    Domain Adaptation

    112


  • Graph Construction (I)

    when do you book plane tickets?      WRB VBP PRP VB NN NNS
    do you read a book on the plane?     VBP PRP VB DT NN IN DT NN

    Similarity?

    113

  • Graph Construction (II)

    how much does it cost to book a band ?

    what was the book that has no letter e ?

    how to get a book agent ?

    can you book a day room at hilton hawaiian village ?

    Trigram nodes extracted from these sentences: "to book a", "the book that", "a book agent", "you book a"

    k-nearest neighbors?

    114


  • Graph Construction (III)

    Trigram nodes: "you book a", "the book that", "to book a", "a book agent"

    Their nearest neighbors include similar trigrams such as "the city that", "to run a", "to become a", "you start a", "a movie agent", "you unrar a"

    115


  • Graph Construction - Features

    how much does it cost to book a band ?

    Features for the trigram "to book a" in this occurrence (context: "cost to book a band"):

    Trigram + Context           cost to book a band
    Left Context                cost to
    Right Context
    Center Word                 book
    Trigram - Center Word       to ____ a
    Left Word + Right Context   to ____ a band
    Left Context + Right Word   cost to ____ a
    Suffix                      none

    116
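    A minimal sketch of extracting these features for one occurrence of a trigram; the feature names follow the table above, the Right Context value (left blank on the extracted slide) is filled in the obvious way for illustration, and the suffix rule is left unspecified because the slide lists it as "none" for this example:

```python
# Sketch: occurrence-level features for a trigram node, following the table above.
# tokens: sentence as a list of words; i: index of the trigram's center word.
# Assumes at least two words of context on each side of the center word.
def trigram_features(tokens, i):
    left2, left1 = tokens[i - 2], tokens[i - 1]
    center, right1, right2 = tokens[i], tokens[i + 1], tokens[i + 2]
    return {
        "trigram + context":          " ".join([left2, left1, center, right1, right2]),
        "trigram":                    " ".join([left1, center, right1]),
        "left context":               f"{left2} {left1}",
        "right context":              f"{right1} {right2}",       # illustrative assumption
        "center word":                center,
        "trigram - center word":      f"{left1} ____ {right1}",
        "left word + right context":  f"{left1} ____ {right1} {right2}",
        "left context + right word":  f"{left2} {left1} ____ {right1}",
        "suffix":                     None,    # suffix rule not specified on the slide
    }

# Example: center word "book" at index 6 of the sentence above
# trigram_features("how much does it cost to book a band ?".split(), 6)
# -> trigram "to book a", trigram + context "cost to book a band",
#    left context "cost to", trigram - center word "to ____ a", ...
```

    Since the graph nodes are trigram types (as the preceding slides suggest), these occurrence-level features would in practice be aggregated over all occurrences of the same trigram before computing node similarities.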

  • Graph Construction - Features

    how much does it cost to book a band ?

    Trigram + Context

    Left Context

    Right Context

    Center Word book

    Trigram - Center Word to ____ a

    Left Word + Right Context to ____ a band

    Left Context + Right Word cost to ____ a

    Suffix none

    to book a