Active Semi-supervised Learning Using Sampling Theory for Graph Signals
Akshay Gadde, Aamir Anis and Antonio Ortega
University of Southern California
August 26, 2014
KDD 2014 Active SSL using sampling theory for graph signals 1 / 21
Motivation and Problem Definition
- Unlabeled data is abundant. Labeled data is expensive and scarce.
- Solution: Active Semi-supervised Learning (SSL).
- Problem setting: offline, pool-based, batch-mode active SSL via graphs.
[Overview figure: data points in feature space → construct similarity graph → choose points to label → predict labels for the rest.
- Classical analogy: a sampling rate of B allows perfect reconstruction of signals with bandwidth B/2.
- For graph signals, different sampling patterns uniquely represent signals of different bandwidths.
- For a given budget, choose a sampling pattern that can represent signals of maximum bandwidth.
- Methodology: extending Nyquist-Shannon sampling theory to signals on graphs.]
1. How to predict unknown labels from the known labels?
2. What is the optimal set of nodes to label given the learning algorithm?
Graph Signal Processing
- Graph G = (V, E) with N nodes.
- Nodes ≡ data points; wij : similarity between i and j.
- Adjacency matrix W = [wij]_{N×N}.
- Degree matrix D = diag(d1, …, dN), where di = Σj wij.
- Laplacian L = D − W.
- Normalized Laplacian 𝓛 = D^{−1/2} L D^{−1/2}.
[Figure: example 4-node graph with edge weights w12, w13, w23, w24.]
- Graph signal f : V → R, denoted as a vector f ∈ R^N.
- Class membership functions are graph signals:
  fc(j) = 1 if node j is in class c, and 0 otherwise.
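As a concrete (if toy) illustration of these definitions, the matrices can be assembled in a few lines of NumPy. The 4-node graph and its edge weights below are placeholders chosen for illustration, not values from the slides:

```python
import numpy as np

# Toy 4-node graph from the figure; the edge weights w12, w13, w23, w24
# are arbitrary illustrative values.
N = 4
W = np.zeros((N, N))
edges = {(0, 1): 0.5, (0, 2): 0.8, (1, 2): 0.3, (1, 3): 0.9}
for (i, j), w in edges.items():
    W[i, j] = W[j, i] = w                 # symmetric adjacency matrix

d = W.sum(axis=1)                          # node degrees
D = np.diag(d)                             # degree matrix
L = D - W                                  # combinatorial Laplacian
Ln = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)  # normalized Laplacian

# The normalized Laplacian is symmetric PSD with spectrum in [0, 2].
evals = np.linalg.eigvalsh(Ln)
print(np.round(evals, 4))                  # smallest eigenvalue is 0
```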
Notion of Frequency for Graph Signals
Spectrum of 𝓛 provides a frequency interpretation:
- λk ∈ [0, 2]: graph frequencies.
- uk : graph Fourier basis.

[Figure: example graph signals of increasing variation; graph frequencies λ = 0, 0.27, 1.32, 1.59 play the role of ω = (π/4)·{0, 1, 4, 7} for regular signals.]

- Fourier coefficients of f: f̂(λi) = ⟨f, ui⟩.
- Graph Fourier Transform (GFT): f̂ = U^T f.
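A minimal sketch of the GFT, assuming NumPy and a random toy graph: diagonalizing the normalized Laplacian yields the graph frequencies λ and the Fourier basis U, and the transform is just U^T f.

```python
import numpy as np

# Random small graph for illustration (symmetric positive weights).
rng = np.random.default_rng(0)
N = 6
W = np.triu(rng.random((N, N)), 1)
W = W + W.T
d = W.sum(axis=1)
Ln = np.eye(N) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)

lam, U = np.linalg.eigh(Ln)    # graph frequencies and Fourier basis

f = rng.random(N)              # a graph signal
f_hat = U.T @ f                # GFT: f_hat(lam_i) = <f, u_i>
f_rec = U @ f_hat              # inverse GFT
print(np.allclose(f, f_rec))   # True: U is orthonormal
```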
Bandlimited Signals on Graphs
- ω-bandlimited signal: GFT supported on [0, ω].
- Paley-Wiener space PWω(G): space of all ω-bandlimited signals.
- PWω(G) is a subspace of R^N.
- ω1 ≤ ω2 ⇒ PWω1(G) ⊆ PWω2(G).
- Bandwidth of a signal: ω(f) = max{λ : |f̂(λ)| > 0}, the largest frequency with a nonzero GFT coefficient.
- Class membership functions can be approximated by bandlimited graph signals.
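The claim that membership functions are approximately bandlimited can be checked numerically. A sketch on a synthetic two-cluster graph (an assumption, standing in for the real datasets), mirroring the energy-CDF computation shown in Figure 3:

```python
import numpy as np

# Two clusters with strong internal weights and one weak cross link;
# the cluster-membership signal should be approximately bandlimited.
N = 10
W = np.zeros((N, N))
W[:5, :5] = 1.0
W[5:, 5:] = 1.0                        # intra-cluster weights
np.fill_diagonal(W, 0)
W[0, 5] = W[5, 0] = 0.05               # weak cross-cluster link

d = W.sum(axis=1)
Ln = np.eye(N) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)
lam, U = np.linalg.eigh(Ln)

f = np.zeros(N)
f[:5] = 1.0                            # membership function of cluster 1
f_hat = U.T @ f
cdf = np.cumsum(f_hat ** 2) / np.sum(f_hat ** 2)
print(np.round(cdf[:3], 4))            # almost all energy in the lowest frequencies
```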
[Plots: CDF of energy in the GFT coefficients vs. eigenvalue index for (a) USPS, (b) Isolet, (c) 20 newsgroups.]

Figure 3: Cumulative distribution of energy in the GFT coefficients of one of the class membership functions pertaining to the three real-world dataset experiments considered in Section 5. Note that most of the energy is concentrated in the low-pass region.
term is expected to be negligible as compared to the first one due to differencing, and we get

x^T L x ≈ Σ_{j∈Sc} (p_j / d_j) x_j², (24)
where p_j = Σ_{i∈S} w_ij is defined as the "partial out-degree" of node j ∈ Sc, i.e., it is the sum of weights of edges crossing over to the set S. Therefore, given the current selected S, the greedy algorithm selects the next node to be added to S that maximizes the increase in
Ω_1(S) ≈ inf_{||x||=1} Σ_{j∈Sc} (p_j / d_j) x_j². (25)
Due to the constraint ||x|| = 1, the expression being minimized is essentially an infimum over a convex combination of the fractional out-degrees, and its value is largely determined by nodes j ∈ Sc for which p_j/d_j is small. In other words, we must worry about those nodes that have a low ratio of partial degree to actual degree. Thus, in the simplest case, our selection algorithm tries to remove from the unlabeled set those nodes that are weakly connected to nodes in the labeled set. This makes intuitive sense: in the end, most prediction algorithms involve propagation of labels from the labeled to the unlabeled nodes. If an unlabeled node is strongly connected to numerous labeled points, its label can be assigned with greater confidence.
Note that using a higher power k in the cost function, i.e., finding Ω_k(S) for k > 1, involves x^T L^k x, which, loosely speaking, takes into account higher-order interactions between the nodes while choosing the nodes to label. In a sense, we expect it to capture connectivity in a more global sense, beyond local interactions, taking into account the underlying manifold structure of the data.
3.3 Complexity

We now comment on the time and space complexity of our algorithm. The most complex step in the greedy procedure for maximizing Ω_k(S) is computing the smallest eigen-pair of (L^k)_Sc. This can be accomplished using an iterative Rayleigh-quotient-minimization algorithm; specifically, the locally optimal preconditioned conjugate gradient (LOPCG) method [14] is suitable for this approach. Note that (L^k)_Sc can be written as I_{Sc,V} · L · L ⋯ L · I_{V,Sc}, hence the eigenvalue computation can be broken into atomic matrix-vector products Lx. Typically, the graphs encountered in learning applications are sparse, leading to efficient implementations of Lx. If |L| denotes the number of non-zero elements in L, then the complexity of the matrix-vector product is O(|L|). The complexity of each eigen-pair computation for (L^k)_Sc is then O(k|L|r), where r is a constant equal to the average number of iterations required by the LOPCG algorithm (r depends on the spectral properties of L and is independent of its size |V|). The complexity of the label selection algorithm then becomes O(k|L|mr), where m is the number of labels requested.

In the iterative reconstruction algorithm, since we use polynomial graph filters (Section 2.5), the atomic step is once again the matrix-vector product Lx. The complexity of this algorithm is O(|L|pq), where p is the order of the polynomial used to design the filter and q is the average number of iterations required for convergence. Again, both these parameters are independent of |V|. Thus, the overall complexity of our algorithm is O(|L|(kmr + pq)). In addition, our algorithm has major advantages in terms of space complexity: since the atomic operation at each step is the matrix-vector product Lx, we only need to store L and a constant number of vectors. Moreover, the structure of the Laplacian matrix allows one to perform the aforementioned operations in a distributed fashion. This makes it well-suited for large-scale implementations using software packages such as GraphLab [16].
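The eigen-pair step above can be sketched with SciPy's `lobpcg` (which implements the LOBPCG method of [14]). The sparse random graph is a stand-in for a real similarity graph, and for brevity the restriction (L^k)_Sc is formed densely here rather than as a chain of matrix-vector products Lx:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(1)

# Random sparse symmetric graph as a stand-in for a real similarity graph.
N = 200
A = np.triu(rng.random((N, N)) < 0.05, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0                        # guard against isolated nodes
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

S = list(range(20))                    # currently selected (labeled) set
Sc = np.setdiff1d(np.arange(N), S)     # unlabeled complement
k = 2

# (L^k)_Sc: restriction of L^k to the unlabeled rows/columns.
Lk_Sc = csr_matrix(np.linalg.matrix_power(Ln, k)[np.ix_(Sc, Sc)])

# Smallest eigen-pair via LOBPCG (Rayleigh-quotient minimization).
X0 = rng.random((len(Sc), 1))
sigma, psi = lobpcg(Lk_Sc, X0, largest=False, tol=1e-8, maxiter=500)
sigma0 = max(float(sigma[0]), 0.0)     # clip tiny negative round-off
omega_k = sigma0 ** (1.0 / k)          # estimated cut-off Omega_k(S)
print(round(omega_k, 4))
```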
3.4 Prediction Error and Number of Labels

As discussed in Section 2.5, given the samples f(S) of the true graph signal on a subset of nodes S ⊂ V, its estimate on Sc is obtained by solving the following problem:

f̂(Sc) = U_{Sc,K} α*, where α* = arg min_α ||U_{S,K} α − f(S)||. (26)

Here, K is the index set of eigenvectors with eigenvalues less than the cut-off ωc(S). If the true signal f ∈ PW_{ωc(S)}(G), then the prediction is perfect. However, this is not the case in most problems. The prediction error ||f − f̂|| roughly equals the portion of energy of the true signal in the [ωc(S), λN] frequency band. By choosing the sampling set S that maximizes ωc(S), we try to capture most of the signal energy and thus reduce the prediction error.
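Equation (26) is an ordinary least-squares problem. A self-contained sketch on a random toy graph with a synthetically bandlimited signal (all names and sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Small random graph; reconstruct a bandlimited signal from samples on S.
N = 40
A = np.triu(rng.random((N, N)) < 0.2, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
lam, U = np.linalg.eigh(Ln)

K = np.arange(5)                       # indices of low-frequency eigenvectors
f = U[:, K] @ rng.random(len(K))       # a truly bandlimited signal

S = rng.choice(N, size=15, replace=False)
Sc = np.setdiff1d(np.arange(N), S)

# alpha* = argmin || U_{S,K} alpha - f(S) ||, then f_hat(Sc) = U_{Sc,K} alpha*
alpha, *_ = np.linalg.lstsq(U[np.ix_(S, K)], f[S], rcond=None)
f_rec = U[np.ix_(Sc, K)] @ alpha

print(np.allclose(f_rec, f[Sc], atol=1e-8))  # True when f lies in PW_{omega_c(S)}(G)
```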
An important question in the context of active learning isdetermining the minimum number of labels required so that
Sampling Theory for Graph Signals
Sampling theorem: bandwidth ω ⇔ sampling rate for unique representation
[Diagram: a bandlimited signal is sampled on a subset of nodes and reconstructed from the samples.]

Sampling theory for graph signals addresses three problems:
- P1: Maximum bandwidth ω, given the sampling set S.
- P2: Smallest (best) sampling set S, given the bandwidth ω.
- P3: Estimate of the signal f, given the samples f(S).
Relevance of Sampling Theory to Active SSL
Active Semi-supervised Learning ↔ Graph Signal Sampling
- Class labels ↔ bandlimited signals.
- Criterion function ↔ cut-off frequency for a given sampling set.
- Select points to label based on the criterion function ↔ choose the sampling set that maximizes the cut-off frequency.
- Predict unknown labels from known labels ↔ reconstruct the bandlimited signal from its sample values.
P1: Cut-off Frequency
How "smooth" does the label information have to be to allow reconstruction from S?

Condition for unique sampling of PWω(G) on S:
Let L2(Sc) = {φ : φ(S) = 0}. Then we need PWω(G) ∩ L2(Sc) = {0}.

Sampling Theorem: f can be perfectly recovered from f(S) iff
ω(f) ≤ ωc(S) := inf_{φ∈L2(Sc)} ω(φ).

- Cut-off frequency = smallest bandwidth that a signal φ ∈ L2(Sc) can have.
P1: Computing the Cut-off Frequency for Given S

Approximate bandwidth of a signal:
ω_k(f) := (f^T L^k f / f^T f)^{1/k}, where k ∈ Z+.

- Monotonicity: for all f, k1 < k2 ⇒ ω_{k1}(f) ≤ ω_{k2}(f).
- Convergence: lim_{k→∞} ω_k(f) = ω(f).

Minimize the approximate bandwidth over L2(Sc) to estimate the cut-off frequency:
Ω_k(S) := min_{φ∈L2(Sc)} ω_k(φ) = min_{φ: φ(S)=0} (φ^T L^k φ / φ^T φ)^{1/k} = (min_ψ ψ^T (L^k)_Sc ψ / ψ^T ψ)^{1/k},
where the inner minimization is a Rayleigh quotient.

Let (σ_{1,k}, ψ_{1,k}) be the smallest eigen-pair of (L^k)_Sc. Then:
- Estimated cut-off frequency: Ω_k(S) = (σ_{1,k})^{1/k}.
- Corresponding smoothest signal: φ_k^opt(Sc) = ψ_{1,k}, φ_k^opt(S) = 0.
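On a small graph, Ω_k(S) and the smoothest signal can be computed directly by dense eigendecomposition. A sketch (random toy graph, illustrative S), which also exhibits the monotonicity in k noted above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Small random graph; compute Omega_k(S) by dense eigendecomposition.
N = 30
A = np.triu(rng.random((N, N)) < 0.3, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

def cutoff(S, k):
    """Estimated cut-off Omega_k(S) and the smoothest signal vanishing on S."""
    Sc = np.setdiff1d(np.arange(N), S)
    Lk_Sc = np.linalg.matrix_power(Ln, k)[np.ix_(Sc, Sc)]
    sig, psi = np.linalg.eigh(Lk_Sc)       # ascending eigenvalues
    phi = np.zeros(N)
    phi[Sc] = psi[:, 0]                    # phi_opt(Sc) = psi_1k, phi_opt(S) = 0
    return max(float(sig[0]), 0.0) ** (1.0 / k), phi

S = [0, 1, 2, 3, 4]
for k in (1, 2, 4):
    print(k, round(cutoff(S, k)[0], 4))    # estimates are non-decreasing in k
```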
P2: Sampling Set Selection

- Optimal sampling set should maximally capture signal information.
- S_opt = arg max_{|S|=m} Ω_k(S) → combinatorial!
- Greedy gradient-based approach:
  - Start with S = ∅.
  - Add nodes one by one, while ensuring maximum increase in Ω_k(S).
- Binary relaxation of the set constraint: at t = 1_S, the gradient satisfies
  dλ_{1,k}^α(t)/dt(i) |_{t=1_S} ≈ α (φ_k^opt(i))².

Greedy algorithm: S ← S ∪ {v}, where v = arg max_j (φ^opt(j))².
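The greedy rule above can be sketched as a loop that alternates the smoothest-signal computation with the argmax update (dense linear algebra on a random toy graph, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy graph; greedily grow the sampling set using the smoothest signal.
N = 30
A = np.triu(rng.random((N, N)) < 0.3, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

def greedy_select(m, k=2):
    S = []
    Lk = np.linalg.matrix_power(Ln, k)
    for _ in range(m):
        Sc = np.setdiff1d(np.arange(N), S)
        _, psi = np.linalg.eigh(Lk[np.ix_(Sc, Sc)])
        phi = psi[:, 0]                         # smoothest signal vanishing on S
        S.append(int(Sc[np.argmax(phi ** 2)]))  # v = argmax_j phi_opt(j)^2
    return S

S = greedy_select(5)
print(S)   # 5 distinct nodes chosen for labeling
```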
Connection with Active Learning
- Cut-off function Ω_k(S) ≡ variation of the smoothest signal in L2(Sc).
- Larger cut-off function ⇒ more variation in φ^opt ⇒ more cross-links between S and Sc.

Intuition: unlabeled nodes end up strongly connected to labeled nodes!
P3: Label Prediction as Signal Reconstruction
- C1 = {x : x(S) = f(S)} and C2 = PWω(G).
- We need to find the unique f ∈ C1 ∩ C2; the sampling theorem guarantees uniqueness.

Projection onto convex sets (POCS):
f_{i+1} = P_{C2} P_{C1} f_i, where f_0 = [f(S)^T, 0]^T.

[Diagram: alternating projections f_0, f_1, f_2, … between C1 and C2 converge to a point in C1 ∩ C2.]
- P_{C1} resets the samples on S to f(S).
- P_{C2} = U h(Λ) U^T sets f̂(λ) = 0 if λ > ω, using the ideal low-pass response
  h(λ) = 1 if λ < ω, and 0 if λ ≥ ω.

[Plot: h(λ) on λ ∈ [0, 2], exact response vs. polynomial approximation.]
- P_{C2} ≈ Σ_{i=1}^N ( Σ_{j=0}^p a_j λ_i^j ) u_i u_i^T = Σ_{j=0}^p a_j L^j → a p-hop localized operation.

Predicted class of node n: arg max_c f̂_c(n).
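The POCS iteration can be sketched end-to-end on a toy graph: a signal is synthesized in PWω(G), sampled on a random S, and recovered by alternating P_C1 and P_C2 (here P_C2 is the exact spectral low-pass filter rather than its polynomial approximation):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy graph and a truly bandlimited signal to reconstruct.
N = 40
A = np.triu(rng.random((N, N)) < 0.2, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
lam, U = np.linalg.eigh(Ln)

f = U[:, :5] @ rng.random(5)            # ground truth in PW_omega(G)
omega = lam[5]                          # cut-off keeps the 5 lowest frequencies
S = rng.choice(N, size=20, replace=False)

P_C2 = U @ np.diag((lam < omega).astype(float)) @ U.T   # ideal low-pass filter

x = np.zeros(N)
x[S] = f[S]                             # f_0 = [f(S)^T, 0]^T
for _ in range(1000):
    x[S] = f[S]                         # P_C1: reset known samples on S
    x = P_C2 @ x                        # P_C2: project onto PW_omega(G)

print(round(float(np.abs(x - f).max()), 6))  # ≈ 0.0: reconstruction converges
```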
Summary of the Algorithm
Input data → construct graph → choose nodes to label by maximizing the cut-off frequency → query labels of the chosen nodes → predict labels by signal reconstruction.
Related Work
Submodular optimization:
- Optimizing the "strength" of a network (Ψ-max) [Guillory and Bilmes, 2011]
  - computationally complex
- Graph-partitioning-based heuristic (METIS) [Guillory and Bilmes, 2009]

Generalization error bound minimization:
- Minimizing a generalization error bound for LLGC [Gu and Han, 2012]
  - contains a regularization parameter that needs to be tuned

Optimal experiment design:
- Local linear reconstruction (LLR) [Zhang et al., 2011]
  - does not consider the learning algorithm
Results: Toy Example
Task: pick 8 data points for labeling.

[Figure: two-circle toy dataset; the 8 points selected by the proposed method are shown alongside the selections of LLGC bound, LLR and Ψ-max.]

- 4 data points picked from each circle.
- Maximally separated points within one circle.
- Maximal spacing between selected data points in different circles.
Results: Real Datasets
[Plots: classification accuracy vs. percentage of labeled data, comparing Random, LLGC Bound, METIS, LLR and the proposed method on each dataset.]

USPS: handwritten digits
- x_i = 16×16 image
- number of classes = 10
- K-NN graph with K = 10
- w_ij = exp(−‖x_i − x_j‖² / 2σ²)

ISOLET: spoken letters
- x_i ∈ R^617 speech features
- number of classes = 26
- K-NN graph with K = 10
- w_ij = exp(−‖x_i − x_j‖² / 2σ²)

Newsgroups: documents
- x_i ∈ R^3000 tf-idf of words
- number of classes = 10
- K-NN graph with K = 10
- w_ij = x_i^T x_j / (‖x_i‖ ‖x_j‖)
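The graph construction used in these experiments can be sketched as follows. The feature matrix below is random stand-in data (not USPS/ISOLET features), and σ is an arbitrary choice:

```python
import numpy as np

# Symmetric K-NN graph with Gaussian (RBF) edge weights.
rng = np.random.default_rng(6)
X = rng.random((100, 16))                # stand-in feature vectors
K, sigma = 10, 0.5

# Pairwise squared distances and full Gaussian affinity.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Wfull = np.exp(-D2 / (2 * sigma ** 2))
np.fill_diagonal(Wfull, 0)               # no self-loops

# Keep only each node's K nearest neighbours, then symmetrize.
W = np.zeros_like(Wfull)
idx = np.argsort(-Wfull, axis=1)[:, :K]  # K largest weights per row
rows = np.repeat(np.arange(len(X)), K)
W[rows, idx.ravel()] = Wfull[rows, idx.ravel()]
W = np.maximum(W, W.T)                   # symmetric adjacency
print(W.shape, int((W > 0).sum(axis=1).min()))  # every node keeps ≥ K edges
```

A cosine-similarity graph (as used for the Newsgroups data) would replace the Gaussian affinity with `X @ X.T` normalized by the row norms.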
Results: Effect of k
Larger k ⇒ better estimate of cut-off frequency is optimized.
[Plots: accuracy vs. percentage of labeled data for k = 4, 6, 8 on (a) USPS, (b) Isolet, (c) 20 newsgroups.]

Figure 6: Effect of k on classification accuracy of the proposed method. Plots show the average classification accuracy for different percentages of labeled data.
[8] Q. Gu, T. Zhang, C. Ding, and J. Han. Selective labeling via error bound minimization. In Advances in Neural Information Processing Systems 25, pages 332–340. 2012.
[9] A. Guillory and J. Bilmes. Label selection on graphs. In Advances in Neural Information Processing Systems 22, pages 691–699. 2009.
[10] A. Guillory and J. Bilmes. Active semi-supervised learning using submodular functions. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 274–282, 2011.
[11] D. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[12] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, pages 417–424, 2006.
[13] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 1998.
[14] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM Journal on Scientific Computing, 23(2):517–541, 2001.
[15] D. Kong, C. H. Ding, H. Huang, and F. Nie. An iterative locally linear embedding algorithm. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[16] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, Apr. 2012.
[17] S. Narang and A. Ortega. Perfect reconstruction two-channel wavelet filter banks for graph structured data. IEEE Transactions on Signal Processing, 60(6):2786–2799, June 2012.
[18] S. K. Narang, A. Gadde, E. Sanou, and A. Ortega. Localized iterative methods for interpolation in graph structured data. In Signal and Information Processing (GlobalSIP), 2013 IEEE Global Conference on, 2013.
[19] B. Osting, C. D. White, and E. Oudet. Minimal Dirichlet energy partitions for graphs. Aug. 2013. arXiv:1308.4915 [math.OC].
[20] I. Pesenson. Sampling in Paley-Wiener spaces on combinatorial graphs. Transactions of the American Mathematical Society, 360(10):5603–5627, 2008.
[21] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[22] K. Sauer and J. Allebach. Iterative reconstruction of bandlimited images from nonuniformly spaced samples. Circuits and Systems, IEEE Transactions on, 34(12):1497–1506, 1987.
[23] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
[24] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. Signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular data domains. Signal Processing Magazine, arXiv:1211.0053, May 2013.
[25] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.
[26] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning, pages 1081–1088, 2006.
[27] L. Zhang, C. Chen, J. Bu, D. Cai, X. He, and T. S. Huang. Active learning based on locally linear reconstruction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(10):2026–2038, 2011.
[28] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16. 2004.
[29] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2008.
[30] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.
0 200 400 600 800 10000
0.2
0.4
0.6
0.8
1
Eigenvalue index
CD
F o
f e
ne
rgy in
GF
T c
oe
ffic
ien
ts
(a) USPS
0 200 400 600 800 1000 1200 14000
0.2
0.4
0.6
0.8
1
Eigenvalue index
CD
F o
f e
ne
rgy in
GF
T c
oe
ffic
ien
ts
(b) Isolet
0 200 400 600 800 10000
0.2
0.4
0.6
0.8
1
Eigenvalue index
CD
F o
f e
ne
rgy in
GF
T c
oe
ffic
ien
ts(c) 20 newsgroups
Figure 3: Cumulative distribution of energy in the GFT coefficients of one of the class membership functions pertaining tothe three real-world dataset experiments considered in Section 5. Note that most of the energy is concentrated in the low-passregion.
term is expected to be negligible as compared to the firstone due to differencing, and we get
xtLx ≈
j∈Sc
pj
dj
x2
j , (24)
where, pj =
i∈S wij is defined as the “partial out-degree”of node j ∈ Sc, i.e., it is the sum of weights of edges crossingover to the set S. Therefore, given a current selected S, thegreedy algorithm selects the next node, to be added to S,that maximizes the increase in
Ω1(S) ≈ inf||x||=1
j∈Sc
pj
dj
x2
j . (25)
Due to the constraint ||x|| = 1, the expression being mini-mized is essentially an infimum over a convex combinationof the fractional out-degrees and its value is largely deter-mined by nodes j ∈ Sc for which pj/dj is small. In otherwords, we must worry about those nodes that have a lowratio of partial degree to the actual degree. Thus, in thesimplest case, our selection algorithm tries to remove thosenodes from the unlabeled set that are weakly connected tonodes in the labeled set. This makes intuitive sense as, inthe end, most prediction algorithms involve propagation oflabels from the labeled to the unlabeled nodes. If an unla-beled node is strongly connected to various numerous points,its label can be assigned with greater confidence.
Note that using a higher power k in the cost function,i.e., finding Ωk(S) for k > 1 involves xLkx which, looselyspeaking, takes into account higher order interactions be-tween the nodes while choosing the nodes to label. In asense, we expect it to capture the connectivities in a moreglobal sense, beyond local interactions, taking into accountthe underlying manifold structure of the data.
3.3 Complexity

We now comment on the time and space complexity of our algorithm. The most complex step in the greedy procedure for maximizing Ω_k(S) is computing the smallest eigenpair of (L^k)_{Sᶜ}. This can be accomplished using an iterative Rayleigh-quotient minimization algorithm; specifically, the locally-optimal pre-conditioned conjugate gradient (LOPCG) method [14] is suitable for this approach. Note that (L^k)_{Sᶜ} can be written as I_{Sᶜ,V} · L · L ⋯ L · I_{V,Sᶜ}, so the eigenvalue computation can be broken into atomic matrix-vector products L·x. Typically, the graphs encountered in learning applications are sparse, leading to efficient implementations of L·x. If |L| denotes the number of non-zero elements in L, then the complexity of the matrix-vector product is O(|L|), and the complexity of each eigenpair computation for (L^k)_{Sᶜ} is O(k|L|r), where r is the average number of iterations required by the LOPCG algorithm (r depends on the spectral properties of L and is independent of its size |V|). The complexity of the label selection algorithm is then O(k|L|mr), where m is the number of labels requested.
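A sketch of this eigenpair computation, assuming SciPy is available; `scipy.sparse.linalg.lobpcg` implements the block variant (LOBPCG) of the locally-optimal preconditioned CG family mentioned above, and the `LinearOperator` wrapper keeps every step an L·x product, never forming (L^k)_{Sᶜ} explicitly:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, lobpcg

def smallest_eigenpair(L, Sc, k=1, seed=0):
    """Smallest eigenpair of (L^k)_{Sc}, touching L only through
    matrix-vector products: y = I_{Sc,V} L ... L I_{V,Sc} x."""
    m = len(Sc)

    def matvec(x):
        y = np.zeros(L.shape[0])
        y[Sc] = np.ravel(x)         # upsample: I_{V,Sc} x
        for _ in range(k):
            y = L @ y               # k sparse products L.y
        return y[Sc]                # downsample: I_{Sc,V} (.)

    A = LinearOperator((m, m), matvec=matvec, dtype=float)
    X = np.random.default_rng(seed).standard_normal((m, 1))
    vals, vecs = lobpcg(A, X, largest=False, tol=1e-8, maxiter=2000)
    return vals[0], vecs[:, 0]
```

Because only `matvec` is supplied, memory usage stays at O(|L|) plus a few length-N vectors, matching the space-complexity discussion below.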
In the iterative reconstruction algorithm, since we use polynomial graph filters (Section 2.5), the atomic step is once again the matrix-vector product L·x. The complexity of this algorithm is O(|L|pq), where p is the order of the polynomial used to design the filter and q is the average number of iterations required for convergence; both parameters are independent of |V|. Thus, the overall complexity of our algorithm is O(|L|(kmr + pq)). In addition, our algorithm has major advantages in terms of space complexity: since the atomic operation at each step is the matrix-vector product L·x, we only need to store L and a constant number of vectors. Moreover, the structure of the Laplacian matrix allows these operations to be performed in a distributed fashion, making the method well-suited to large-scale implementations using software packages such as GraphLab [16].
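A minimal sketch of such an iterative reconstruction, with one deliberate simplification: it uses an exact spectral low-pass projector (via a full eigendecomposition) for clarity, whereas the method of Section 2.5 replaces that projector with an order-p polynomial graph filter so that each iteration reduces to products L·x.

```python
import numpy as np

def iterative_reconstruction(L, S, fS, cutoff, n_iter=500):
    """Alternate between (i) resetting the known samples on S and
    (ii) projecting onto the bandlimited space PW_cutoff(G).
    The low-pass step here is an exact spectral projector; a polynomial
    graph filter would approximate it using only products L.x."""
    lam, U = np.linalg.eigh(L)
    Uk = U[:, lam < cutoff]
    lowpass = Uk @ Uk.T             # projector onto PW_cutoff(G)
    f = np.zeros(L.shape[0])
    for _ in range(n_iter):
        f[S] = fS                   # (i) enforce the known labels
        f = lowpass @ f             # (ii) bandlimit
    return f
```

When the true signal is bandlimited and S is a uniqueness set for that bandwidth, the alternating projections converge to the exact signal.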
3.4 Prediction Error and Number of Labels

As discussed in Section 2.5, given the samples f(S) of the true graph signal on a subset of nodes S ⊂ V, its estimate on Sᶜ is obtained by solving the following problem:
f̂(Sᶜ) = U_{Sᶜ,K} α*,  where α* = arg min_α ||U_{S,K} α − f(S)||.   (26)
Here, K is the index set of eigenvectors with eigenvalues less than the cut-off ωc(S). If the true signal f ∈ PW_{ωc(S)}(G), then the prediction is perfect. However, this is not the case in most problems. The prediction error ||f − f̂|| roughly equals the portion of the energy of the true signal in the [ωc(S), λ_N] frequency band. By choosing the sampling set S that maximizes ωc(S), we try to capture most of the signal energy and thus reduce the prediction error.
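For small graphs, the least-squares prediction of (26) can be computed directly from the eigendecomposition; a sketch:

```python
import numpy as np

def bandlimited_predict(L, S, fS, cutoff):
    """Direct solution of the least-squares problem (26): fit the labeled
    samples f(S) in the span of the eigenvectors with eigenvalue below
    the cut-off (index set K), then read off the estimate everywhere."""
    lam, U = np.linalg.eigh(L)
    UK = U[:, lam < cutoff]                      # columns indexed by K
    alpha, *_ = np.linalg.lstsq(UK[S, :], fS, rcond=None)
    return UK @ alpha                            # estimate on all of V
```

If f truly lies in PW_{cutoff}(G) and U_{S,K} has full column rank, the fit is exact; otherwise the out-of-band energy shows up as prediction error, as described above.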
An important question in the context of active learning is determining the minimum number of labels required so that the prediction error satisfies ||f − f̂|| ≤ δ for a given tolerance δ.
Conclusion and Future Work
I Application of graph signal sampling theory to active SSL
I Class labels ⇒ bandlimited graph signals
I Choosing nodes ⇒ best sampling set selection
I Predicting unknown labels ⇒ signal reconstruction from samples

I Proposed approach gives significantly better results than the compared methods (LLGC bound, LLR, Ψ-max).

I Future work:
I Approximate optimality of the proposed sampling set selection
I Robustness against noise
References
A. Anis, A. Gadde, and A. Ortega. Towards a sampling theorem for signals on arbitrary graphs. In ICASSP, 2014.

S. K. Narang, A. Gadde, and A. Ortega. Localized iterative methods for interpolation in graph structured data. In IEEE GlobalSIP, 2013.

A. Guillory and J. Bilmes. Active semi-supervised learning using submodular functions. In UAI, 2011.

A. Guillory and J. Bilmes. Label selection on graphs. In NIPS, 2009.

L. Zhang, C. Chen, J. Bu, D. Cai, X. He, and T. Huang. Active learning based on locally linear reconstruction. TPAMI, 2011.

Q. Gu and J. Han. Towards active learning on graphs, an error bound minimization approach. In ICDM, 2012.
Thank you!
Label Complexity
I Let f̂ be the reconstruction of f obtained from its samples on S.

I What is the minimum number of labels required so that ‖f − f̂‖ ≤ δ?
Smoothness of a signal
Let Pθ be the projector for PWθ(G). Then γ(f) = min θ s.t. ‖f − Pθf‖ ≤ δ.
Theorem
The minimum number of labels |S| required to satisfy ‖f − f̂‖ ≤ δ is greater than p, where p is the number of eigenvalues of L less than γ(f).
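A numerical sketch of this bound, under our assumption (for concreteness; the slide does not fix the convention) that the projector P_θ keeps graph frequencies at or below θ:

```python
import numpy as np

def label_lower_bound(L, f, delta):
    """Compute the smoothness gamma(f) = min theta s.t. ||f - P_theta f|| <= delta,
    then return p = #{eigenvalues of L below gamma(f)}, a lower bound on the
    number of labels |S| needed for ||f - f_hat|| <= delta."""
    lam, U = np.linalg.eigh(L)
    energy = (U.T @ f) ** 2                       # graph Fourier energies
    tail_energy = np.cumsum(energy[::-1])[::-1]   # energy at frequencies >= lam[i]
    # residual ||f - P_{lam[i]} f|| when keeping frequencies <= lam[i]
    tail = np.sqrt(np.append(tail_energy[1:], 0.0))
    i_star = int(np.argmax(tail <= delta))        # first cutoff achieving delta
    gamma = lam[i_star]
    p = int(np.sum(lam < gamma))
    return gamma, p
```

On a signal spanned by the first few eigenvectors, p recovers exactly the number of active frequencies below γ(f), matching the theorem's count.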