
Graph-Based Semi-Supervised Learning

Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux

Université de Montréal

CIAR Workshop - April 26th, 2005


Outline

1 Semi-Supervised Setting

2 Graph Regularization and Label Propagation

3 Transduction vs. Induction

4 Curse of Dimensionality


Semi-Supervised Learning for Dummies

Task of binary classification with labels y_i ∈ {−1, 1}.

Semi-supervised = learn something about the labels using both labeled and unlabeled data:

X = (x_1, x_2, . . . , x_n)

Y_l = (y_1, y_2, . . . , y_l)

n = l + u

Transduction ⇒ estimate Ŷ_u = (ŷ_{l+1}, ŷ_{l+2}, . . . , ŷ_n)

Induction ⇒ learn a function ŷ : x → ŷ(x)


The Classical Two-Moon Problem


Where are Manifolds and Kernels?

What is a good labeling Ŷ = (Ŷ_l, Ŷ_u)?

1 one that is consistent with the given labels: Ŷ_l ≃ Y_l

2 one that is smooth on the manifold where the data lie (manifold / cluster assumption): ŷ_i ≃ ŷ_j when x_i is close to x_j

⇒ Cost function

C(Ŷ) = ∑_{i=1}^{l} (ŷ_i − y_i)² + (μ/2) ∑_{i,j=1}^{n} W_ij (ŷ_i − ŷ_j)²

with W_ij = W_X(x_i, x_j) a positive weighting function (e.g. a kernel)
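As an illustration, this cost is straightforward to compute; a minimal numpy sketch (the Gaussian choice of W_X here is an assumption for the example, not prescribed by the slides):

```python
import numpy as np

def gaussian_weights(X, sigma=1.0):
    # W_ij = exp(-||x_i - x_j||^2 / sigma^2): one common choice of kernel
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def cost(y_hat, y_l, W, mu):
    # C(Y_hat) = fit to the l given labels + (mu/2) * graph smoothness penalty
    l = len(y_l)
    fit = ((y_hat[:l] - y_l) ** 2).sum()
    smooth = (W * (y_hat[:, None] - y_hat[None, :]) ** 2).sum()
    return fit + 0.5 * mu * smooth
```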


Graphs Would be Cool too!

Nodes = data points; an edge (i, j) is present ⇔ W_ij > 0.


Regularization on Graph

Graph Laplacian: L_ii = ∑_{j≠i} W_ij and L_ij = −W_ij for i ≠ j.

C(Ŷ) = ∑_{i=1}^{l} (ŷ_i − y_i)² + (μ/2) ∑_{i,j=1}^{n} W_ij (ŷ_i − ŷ_j)²
     = ‖Ŷ_l − Y_l‖² + μ Ŷ⊤ L Ŷ

C(Ŷ) is minimized when

(S + μL) Ŷ = S Y

(S the diagonal matrix with S_ii = 1 for labeled points and 0 otherwise)
⇒ a linear system with n unknowns and n equations.
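This system can be solved directly; a dense-matrix sketch with numpy (illustrative only — a real implementation would exploit the sparsity of W):

```python
import numpy as np

def graph_regularization(W, y_l, mu):
    # Minimize C(Y_hat) by solving (S + mu*L) Y_hat = S Y,
    # where the first l points are the labeled ones.
    n, l = W.shape[0], len(y_l)
    L = np.diag(W.sum(axis=1)) - W                     # L_ii = sum_{j!=i} W_ij, L_ij = -W_ij
    S = np.diag((np.arange(n) < l).astype(float))      # S_ii = 1 iff i is labeled
    Y = np.concatenate([y_l, np.zeros(n - l)])         # zeros stand in for unknown labels
    return np.linalg.solve(S + mu * L, S @ Y)
```

On a connected graph with at least one labeled point, S + μL is invertible, so the solve is well defined.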


From Matrix Inversion to Label Propagation

The linear system rewrites, for a labeled point,

ŷ_i^{(t+1)} = (∑_j W_ij ŷ_j^{(t)} + (1/μ) y_i) / (∑_j W_ij + 1/μ)

and for an unlabeled point,

ŷ_i^{(t+1)} = (∑_j W_ij ŷ_j^{(t)}) / (∑_j W_ij).

⇒ Jacobi or Gauss-Seidel iteration algorithm.
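The Jacobi variant of this label-propagation iteration can be sketched in a few lines of numpy (a minimal illustration under the same conventions as above, not the authors' implementation):

```python
import numpy as np

def label_propagation(W, y_l, mu, n_iter=200):
    # Jacobi iteration for the fixed-point equations above;
    # the first l points are the labeled ones.
    n, l = W.shape[0], len(y_l)
    y_hat = np.concatenate([y_l, np.zeros(n - l)])
    deg = W.sum(axis=1)                                # sum_j W_ij
    for _ in range(n_iter):
        prop = W @ y_hat                               # sum_j W_ij * y_hat_j
        new = prop / deg                               # update for unlabeled points
        new[:l] = (prop[:l] + y_l / mu) / (deg[:l] + 1.0 / mu)  # labeled points
        y_hat = new
    return y_hat
```

Each sweep costs one multiplication by the (typically sparse) W, which is what makes propagation cheaper than a direct matrix inversion.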


No, I didn’t come up with that last week-end

X. Zhu, Z. Ghahramani and J. Lafferty (2003): Semi-supervised learning using Gaussian fields and harmonic functions

D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf (2004): Learning with local and global consistency

M. Belkin, I. Matveeva and P. Niyogi (2004): Regularization and Semi-supervised Learning on Large Graphs


Remember your Physics class?

Electric network analogy (Doyle and Snell, 1984; Zhu, Ghahramani and Lafferty, 2003). Graph ⇔ electric network with resistors between nodes:

R_ij = 1 / W_ij

Ohm's law (potential = label):

ŷ_j − ŷ_i = R_ij I_ij

Kirchhoff's law on an unlabeled node i:

∑_j I_ij = 0

⇒ Same linear system as minimizing C(Ŷ) over Ŷ_u only.


From Transduction to Induction

Solving the linear system ⇒ Ŷ (transduction).

For a new point x, reusing the already computed Ŷ:

C(ŷ(x)) = C(Ŷ) + (μ/2) ∑_{i=1}^{n} W_X(x_i, x) (ŷ_i − ŷ(x))²

⇒ ŷ(x) = (∑_{i=1}^{n} W_X(x_i, x) ŷ_i) / (∑_{i=1}^{n} W_X(x_i, x))

(Induction like Parzen windows, but using the estimated labels Ŷ.)
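The induction formula is just a weighted average of the estimated labels, in the spirit of a Parzen-window predictor; a sketch assuming a Gaussian W_X (that choice, and the function name, are illustrative):

```python
import numpy as np

def induce(x, X, y_hat, sigma=1.0):
    # Predict the label of a new point x as the normalized weighted
    # average of the estimated labels y_hat, with weights W_X(x_i, x).
    w = np.exp(-((X - x) ** 2).sum(axis=1) / sigma**2)
    return (w * y_hat).sum() / w.sum()
```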


Faster Training from Subset

Previous algorithms are at least quadratic in n.

Induction formula ⇒ one could train only on a subset S.

Better: minimize the full cost over Ŷ_S only.

For x_i ∈ R = X \ S:

ŷ_i = (∑_{j∈S} W_ij ŷ_j) / (∑_{j∈S} W_ij)

i.e. Ŷ_R = W̄_RS Ŷ_S (with W̄_RS the row-normalized weights): the cost C(Ŷ) now only depends on Ŷ_S.
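Extending the subset solution to the remaining points is then a single matrix product with row-normalized weights; a small sketch (the function name is hypothetical):

```python
import numpy as np

def extend_to_rest(W_RS, y_S):
    # Labels of the points outside S as normalized weighted averages
    # of the subset labels: Y_R = Wbar_RS @ Y_S.
    Wbar = W_RS / W_RS.sum(axis=1, keepdims=True)  # each row sums to 1
    return Wbar @ y_S
```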


Nice Cost, isn't it?

C(Ŷ_S) = (μ/2) ∑_{i,j∈R} W_ij (ŷ_i − ŷ_j)²        [C_RR]
       + 2 × (μ/2) ∑_{i∈R, j∈S} W_ij (ŷ_i − ŷ_j)²  [C_RS]
       + (μ/2) ∑_{i,j∈S} W_ij (ŷ_i − ŷ_j)²        [C_SS]
       + ∑_{i∈L} (ŷ_i − y_i)²                      [C_L]


Let’s Make it Simpler

Computing C_RR is quadratic in n ⇒ just get rid of it.

Linear system with |S| = m ≪ n unknowns ⇒ much faster.

Still need to do matrix multiplications ⇒ scales as O(m²n).

This slide looked pretty empty with only three points.

It is important to have reading material when nobody understands your accent.


Subset Selection

1 Random: fast, easy, crappy.
Main problem = does not "fill the space" well enough ⇒ bad approximation by the induction formula (some points have no near neighbors in the subset)

2 Heuristic: greedy construction of the subset, starting with the labeled points and iteratively adding the point x_i minimizing

∑_{j∈S} W_ij

i.e. the point farthest from the current subset.
(Additional tricks to eliminate outliers and sample more points near the decision surface)
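The greedy heuristic can be sketched as follows (function name and the bare stopping rule are illustrative; the outlier and decision-surface tricks are omitted):

```python
import numpy as np

def greedy_subset(W, labeled_idx, m):
    # Grow a subset S of size m: start from the labeled points, then
    # repeatedly add the point least connected to the current subset,
    # i.e. the one minimizing sum_{j in S} W_ij.
    S = list(labeled_idx)
    rest = set(range(W.shape[0])) - set(S)
    while len(S) < m and rest:
        i = min(rest, key=lambda j: W[j, S].sum())
        S.append(i)
        rest.remove(i)
    return S
```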


Experimental Results

This is not last-minute research, so we do have results!

Table: Comparative Classification Error (Induction)

% labeled   Method            LETTERS   MNIST   COVTYPE
1%          NoSub             56.0      35.8    47.3
1%          RandSub_subOnly   59.8      29.6    44.8
1%          RandSub           57.4      27.7    75.7
1%          SmartSub          55.8      24.4    45.0
5%          NoSub             27.1      12.8    37.1
5%          RandSub_subOnly   32.1      14.9    35.4
5%          RandSub           29.1      12.6    70.6
5%          SmartSub          28.5      12.3    35.8
10%         NoSub             18.8      9.5     34.7
10%         RandSub_subOnly   22.5      11.4    32.4
10%         RandSub           20.3      9.7     64.7
10%         SmartSub          19.8      9.5     33.4

More comparisons between RandSub and SmartSub on 8 more UCI datasets ⇒ SmartSub always performs better.


Curse of Dimensionality

Labeling function: ŷ(x) = ∑_i ŷ_i W̄_X(x_i, x) (with W̄_X the normalized weights)

Locality: a far point's prediction = that of its nearest neighbor; also, the normal vector ∂ŷ/∂x(x) is approximately in the span of the nearest neighbors of x

Smoothness: ŷ does not vary much within a ball of radius small w.r.t. σ (obtained from the second derivative); also, the form of the cost ⇒ the learned function varies smoothly in regions with no labeled examples

Curse: one needs lots of unlabeled examples to "fill" the region near the decision surface, and lots of labeled examples to account for all "clusters"


“Almost Real-Life” Example

Need unlabeled examples along the sinusoidal decision surface, and labeled examples in each class region.


Conclusion

Simple non-parametric setting ⇒ powerful non-parametric semi-supervised algorithm

Can scale to large datasets thanks to sparsity / subset selection

Interesting links with electric networks / heat diffusion

Limitations of local weights: curse of dimensionality

Makes it possible to be on time for lunch!!!


References

Belkin, M., Matveeva, I., and Niyogi, P. (2004). Regularization and semi-supervised learning on large graphs. In Shawe-Taylor, J. and Singer, Y., editors, COLT'2004. Springer.

Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction in semi-supervised learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics.

Doyle, P. G. and Snell, J. L. (1984). Random walks and electric networks. Mathematical Association of America.

Zhou, D., Bousquet, O., Navin Lal, T., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, Cambridge, MA. MIT Press.

Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML'2003.