Active Semi-supervised Learning Using Sampling Theory for Graph Signals
Akshay Gadde, Aamir Anis and Antonio Ortega
University of Southern California
August 26, 2014
KDD 2014 Active SSL using sampling theory for graph signals 1 / 21
Motivation and Problem Definition
- Unlabeled data is abundant. Labeled data is expensive and scarce.
- Solution: Active Semi-supervised Learning (SSL).
- Problem setting: offline, pool-based, batch-mode active SSL via graphs.
[Overview figure: data points in feature space → construct similarity graph → choose points to label → predict labels for the rest.
- Classical analogy: a sampling rate of B allows perfect reconstruction of signals with bandwidth B/2.
- For graph signals, different sampling patterns uniquely represent signals of different bandwidths.
- For a given budget, choose a sampling pattern that can represent signals of maximum bandwidth.
- Methodology: extending Nyquist-Shannon sampling theory to signals on graphs.]
1. How to predict unknown labels from the known labels?
2. What is the optimal set of nodes to label given the learning algorithm?
Graph Signal Processing
- Graph G = (V, E) with N nodes.
- Nodes ≡ data points; wij : similarity between i and j.
- Adjacency matrix W = [wij]_{N×N}.
- Degree matrix D = diag(d1, …, dN), where di = Σj wij.
- Laplacian L = D − W.
- Normalized Laplacian 𝓛 = D^{−1/2} L D^{−1/2}.
[Figure: example 4-node graph with edge weights w12, w13, w23, w24.]
- Graph signal f : V → R, denoted as a vector f ∈ R^N.
- Class membership functions are graph signals:
  fc(j) = 1 if node j is in class c, and 0 otherwise.
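As a concrete (if toy) illustration of these definitions, the matrices can be assembled in a few lines of NumPy. The 4-node graph and its edge weights below are placeholders chosen for illustration, not values from the slides:

```python
import numpy as np

# Toy 4-node graph from the figure; the edge weights w12, w13, w23, w24
# are arbitrary illustrative values.
N = 4
W = np.zeros((N, N))
edges = {(0, 1): 0.5, (0, 2): 0.8, (1, 2): 0.3, (1, 3): 0.9}
for (i, j), w in edges.items():
    W[i, j] = W[j, i] = w                 # symmetric adjacency matrix

d = W.sum(axis=1)                          # node degrees
D = np.diag(d)                             # degree matrix
L = D - W                                  # combinatorial Laplacian
Ln = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)  # normalized Laplacian

# The normalized Laplacian is symmetric PSD with spectrum in [0, 2].
evals = np.linalg.eigvalsh(Ln)
print(np.round(evals, 4))                  # smallest eigenvalue is 0
```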
Notion of Frequency for Graph Signals
Spectrum of 𝓛 provides a frequency interpretation:
- λk ∈ [0, 2]: graph frequencies.
- uk : graph Fourier basis.

[Figure: example graph signals of increasing variation; graph frequencies λ = 0, 0.27, 1.32, 1.59 play the role of ω = (π/4)·{0, 1, 4, 7} for regular signals.]

- Fourier coefficients of f: f̂(λi) = ⟨f, ui⟩.
- Graph Fourier Transform (GFT): f̂ = U^T f.
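A minimal sketch of the GFT, assuming NumPy and a random toy graph: diagonalizing the normalized Laplacian yields the graph frequencies λ and the Fourier basis U, and the transform is just U^T f.

```python
import numpy as np

# Random small graph for illustration (symmetric positive weights).
rng = np.random.default_rng(0)
N = 6
W = np.triu(rng.random((N, N)), 1)
W = W + W.T
d = W.sum(axis=1)
Ln = np.eye(N) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)

lam, U = np.linalg.eigh(Ln)    # graph frequencies and Fourier basis

f = rng.random(N)              # a graph signal
f_hat = U.T @ f                # GFT: f_hat(lam_i) = <f, u_i>
f_rec = U @ f_hat              # inverse GFT
print(np.allclose(f, f_rec))   # True: U is orthonormal
```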
Bandlimited Signals on Graphs
- ω-bandlimited signal: GFT supported on [0, ω].
- Paley-Wiener space PWω(G): space of all ω-bandlimited signals.
- PWω(G) is a subspace of R^N.
- ω1 ≤ ω2 ⇒ PWω1(G) ⊆ PWω2(G).
- Bandwidth of a signal: ω(f) = max{λ : |f̂(λ)| > 0}, the largest frequency with a nonzero GFT coefficient.
- Class membership functions can be approximated by bandlimited graph signals.
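The claim that membership functions are approximately bandlimited can be checked numerically. A sketch on a synthetic two-cluster graph (an assumption, standing in for the real datasets), mirroring the energy-CDF computation shown in Figure 3:

```python
import numpy as np

# Two clusters with strong internal weights and one weak cross link;
# the cluster-membership signal should be approximately bandlimited.
N = 10
W = np.zeros((N, N))
W[:5, :5] = 1.0
W[5:, 5:] = 1.0                        # intra-cluster weights
np.fill_diagonal(W, 0)
W[0, 5] = W[5, 0] = 0.05               # weak cross-cluster link

d = W.sum(axis=1)
Ln = np.eye(N) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)
lam, U = np.linalg.eigh(Ln)

f = np.zeros(N)
f[:5] = 1.0                            # membership function of cluster 1
f_hat = U.T @ f
cdf = np.cumsum(f_hat ** 2) / np.sum(f_hat ** 2)
print(np.round(cdf[:3], 4))            # almost all energy in the lowest frequencies
```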
[Plots: CDF of energy in the GFT coefficients vs. eigenvalue index for (a) USPS, (b) Isolet, (c) 20 newsgroups.]

Figure 3: Cumulative distribution of energy in the GFT coefficients of one of the class membership functions pertaining to the three real-world dataset experiments considered in Section 5. Note that most of the energy is concentrated in the low-pass region.
term is expected to be negligible as compared to the first one due to differencing, and we get

x^T L x ≈ Σ_{j∈Sc} (p_j / d_j) x_j², (24)
where p_j = Σ_{i∈S} w_ij is defined as the "partial out-degree" of node j ∈ Sc, i.e., it is the sum of weights of edges crossing over to the set S. Therefore, given the current selected S, the greedy algorithm selects the next node to be added to S that maximizes the increase in
Ω_1(S) ≈ inf_{||x||=1} Σ_{j∈Sc} (p_j / d_j) x_j². (25)
Due to the constraint ||x|| = 1, the expression being minimized is essentially an infimum over a convex combination of the fractional out-degrees, and its value is largely determined by nodes j ∈ Sc for which p_j/d_j is small. In other words, we must worry about those nodes that have a low ratio of partial degree to actual degree. Thus, in the simplest case, our selection algorithm tries to remove from the unlabeled set those nodes that are weakly connected to nodes in the labeled set. This makes intuitive sense: in the end, most prediction algorithms involve propagation of labels from the labeled to the unlabeled nodes. If an unlabeled node is strongly connected to numerous labeled points, its label can be assigned with greater confidence.
Note that using a higher power k in the cost function, i.e., finding Ω_k(S) for k > 1, involves x^T L^k x, which, loosely speaking, takes into account higher-order interactions between the nodes while choosing the nodes to label. In a sense, we expect it to capture connectivity in a more global sense, beyond local interactions, taking into account the underlying manifold structure of the data.
3.3 Complexity

We now comment on the time and space complexity of our algorithm. The most complex step in the greedy procedure for maximizing Ω_k(S) is computing the smallest eigen-pair of (L^k)_Sc. This can be accomplished using an iterative Rayleigh-quotient-minimization algorithm; specifically, the locally optimal preconditioned conjugate gradient (LOPCG) method [14] is suitable for this approach. Note that (L^k)_Sc can be written as I_{Sc,V} · L · L ⋯ L · I_{V,Sc}, hence the eigenvalue computation can be broken into atomic matrix-vector products Lx. Typically, the graphs encountered in learning applications are sparse, leading to efficient implementations of Lx. If |L| denotes the number of non-zero elements in L, then the complexity of the matrix-vector product is O(|L|). The complexity of each eigen-pair computation for (L^k)_Sc is then O(k|L|r), where r is a constant equal to the average number of iterations required by the LOPCG algorithm (r depends on the spectral properties of L and is independent of its size |V|). The complexity of the label selection algorithm then becomes O(k|L|mr), where m is the number of labels requested.

In the iterative reconstruction algorithm, since we use polynomial graph filters (Section 2.5), the atomic step is once again the matrix-vector product Lx. The complexity of this algorithm is O(|L|pq), where p is the order of the polynomial used to design the filter and q is the average number of iterations required for convergence. Again, both these parameters are independent of |V|. Thus, the overall complexity of our algorithm is O(|L|(kmr + pq)). In addition, our algorithm has major advantages in terms of space complexity: since the atomic operation at each step is the matrix-vector product Lx, we only need to store L and a constant number of vectors. Moreover, the structure of the Laplacian matrix allows one to perform the aforementioned operations in a distributed fashion. This makes it well-suited for large-scale implementations using software packages such as GraphLab [16].
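The eigen-pair step above can be sketched with SciPy's `lobpcg` (which implements the LOBPCG method of [14]). The sparse random graph is a stand-in for a real similarity graph, and for brevity the restriction (L^k)_Sc is formed densely here rather than as a chain of matrix-vector products Lx:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(1)

# Random sparse symmetric graph as a stand-in for a real similarity graph.
N = 200
A = np.triu(rng.random((N, N)) < 0.05, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0                        # guard against isolated nodes
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

S = list(range(20))                    # currently selected (labeled) set
Sc = np.setdiff1d(np.arange(N), S)     # unlabeled complement
k = 2

# (L^k)_Sc: restriction of L^k to the unlabeled rows/columns.
Lk_Sc = csr_matrix(np.linalg.matrix_power(Ln, k)[np.ix_(Sc, Sc)])

# Smallest eigen-pair via LOBPCG (Rayleigh-quotient minimization).
X0 = rng.random((len(Sc), 1))
sigma, psi = lobpcg(Lk_Sc, X0, largest=False, tol=1e-8, maxiter=500)
sigma0 = max(float(sigma[0]), 0.0)     # clip tiny negative round-off
omega_k = sigma0 ** (1.0 / k)          # estimated cut-off Omega_k(S)
print(round(omega_k, 4))
```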
3.4 Prediction Error and Number of Labels

As discussed in Section 2.5, given the samples f(S) of the true graph signal on a subset of nodes S ⊂ V, its estimate on Sc is obtained by solving the following problem:

f̂(Sc) = U_{Sc,K} α*, where α* = arg min_α ||U_{S,K} α − f(S)||. (26)

Here, K is the index set of eigenvectors with eigenvalues less than the cut-off ωc(S). If the true signal f ∈ PW_{ωc(S)}(G), then the prediction is perfect. However, this is not the case in most problems. The prediction error ||f − f̂|| roughly equals the portion of energy of the true signal in the [ωc(S), λN] frequency band. By choosing the sampling set S that maximizes ωc(S), we try to capture most of the signal energy and thus reduce the prediction error.
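Equation (26) is an ordinary least-squares problem. A self-contained sketch on a random toy graph with a synthetically bandlimited signal (all names and sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Small random graph; reconstruct a bandlimited signal from samples on S.
N = 40
A = np.triu(rng.random((N, N)) < 0.2, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
lam, U = np.linalg.eigh(Ln)

K = np.arange(5)                       # indices of low-frequency eigenvectors
f = U[:, K] @ rng.random(len(K))       # a truly bandlimited signal

S = rng.choice(N, size=15, replace=False)
Sc = np.setdiff1d(np.arange(N), S)

# alpha* = argmin || U_{S,K} alpha - f(S) ||, then f_hat(Sc) = U_{Sc,K} alpha*
alpha, *_ = np.linalg.lstsq(U[np.ix_(S, K)], f[S], rcond=None)
f_rec = U[np.ix_(Sc, K)] @ alpha

print(np.allclose(f_rec, f[Sc], atol=1e-8))  # True when f lies in PW_{omega_c(S)}(G)
```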
An important question in the context of active learning isdetermining the minimum number of labels required so that
Sampling Theory for Graph Signals
Sampling theorem: bandwidth ω ⇔ sampling rate for unique representation
[Diagram: a bandlimited signal is sampled on a subset of nodes and reconstructed from the samples.]

Sampling theory for graph signals addresses three problems:
- P1: Maximum bandwidth ω, given the sampling set S.
- P2: Smallest (best) sampling set S, given the bandwidth ω.
- P3: Estimate of the signal f, given the samples f(S).
Relevance of Sampling Theory to Active SSL
Active Semi-supervised Learning ↔ Graph Signal Sampling
- Class labels ↔ bandlimited signals.
- Criterion function ↔ cut-off frequency for a given sampling set.
- Select points to label based on the criterion function ↔ choose the sampling set that maximizes the cut-off frequency.
- Predict unknown labels from known labels ↔ reconstruct the bandlimited signal from its sample values.
P1: Cut-off Frequency
How "smooth" does the label information have to be to allow reconstruction from S?

Condition for unique sampling of PWω(G) on S:
Let L2(Sc) = {φ : φ(S) = 0}. Then we need PWω(G) ∩ L2(Sc) = {0}.

Sampling Theorem: f can be perfectly recovered from f(S) iff
ω(f) ≤ ωc(S) := inf_{φ∈L2(Sc)} ω(φ).

- Cut-off frequency = smallest bandwidth that a signal φ ∈ L2(Sc) can have.
P1: Computing the Cut-off Frequency for Given S

Approximate bandwidth of a signal:
ω_k(f) := (f^T L^k f / f^T f)^{1/k}, where k ∈ Z+.

- Monotonicity: for all f, k1 < k2 ⇒ ω_{k1}(f) ≤ ω_{k2}(f).
- Convergence: lim_{k→∞} ω_k(f) = ω(f).

Minimize the approximate bandwidth over L2(Sc) to estimate the cut-off frequency:
Ω_k(S) := min_{φ∈L2(Sc)} ω_k(φ) = min_{φ: φ(S)=0} (φ^T L^k φ / φ^T φ)^{1/k} = (min_ψ ψ^T (L^k)_Sc ψ / ψ^T ψ)^{1/k},
where the inner minimization is a Rayleigh quotient.

Let (σ_{1,k}, ψ_{1,k}) be the smallest eigen-pair of (L^k)_Sc. Then:
- Estimated cut-off frequency: Ω_k(S) = (σ_{1,k})^{1/k}.
- Corresponding smoothest signal: φ_k^opt(Sc) = ψ_{1,k}, φ_k^opt(S) = 0.
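On a small graph, Ω_k(S) and the smoothest signal can be computed directly by dense eigendecomposition. A sketch (random toy graph, illustrative S), which also exhibits the monotonicity in k noted above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Small random graph; compute Omega_k(S) by dense eigendecomposition.
N = 30
A = np.triu(rng.random((N, N)) < 0.3, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

def cutoff(S, k):
    """Estimated cut-off Omega_k(S) and the smoothest signal vanishing on S."""
    Sc = np.setdiff1d(np.arange(N), S)
    Lk_Sc = np.linalg.matrix_power(Ln, k)[np.ix_(Sc, Sc)]
    sig, psi = np.linalg.eigh(Lk_Sc)       # ascending eigenvalues
    phi = np.zeros(N)
    phi[Sc] = psi[:, 0]                    # phi_opt(Sc) = psi_1k, phi_opt(S) = 0
    return max(float(sig[0]), 0.0) ** (1.0 / k), phi

S = [0, 1, 2, 3, 4]
for k in (1, 2, 4):
    print(k, round(cutoff(S, k)[0], 4))    # estimates are non-decreasing in k
```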
P2: Sampling Set Selection

- Optimal sampling set should maximally capture signal information.
- S_opt = arg max_{|S|=m} Ω_k(S) → combinatorial!
- Greedy gradient-based approach:
  - Start with S = ∅.
  - Add nodes one by one, while ensuring maximum increase in Ω_k(S).
- Binary relaxation of the set constraint: at t = 1_S, the gradient satisfies
  dλ_{1,k}^α(t)/dt(i) |_{t=1_S} ≈ α (φ_k^opt(i))².

Greedy algorithm: S ← S ∪ {v}, where v = arg max_j (φ^opt(j))².
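The greedy rule above can be sketched as a loop that alternates the smoothest-signal computation with the argmax update (dense linear algebra on a random toy graph, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy graph; greedily grow the sampling set using the smoothest signal.
N = 30
A = np.triu(rng.random((N, N)) < 0.3, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

def greedy_select(m, k=2):
    S = []
    Lk = np.linalg.matrix_power(Ln, k)
    for _ in range(m):
        Sc = np.setdiff1d(np.arange(N), S)
        _, psi = np.linalg.eigh(Lk[np.ix_(Sc, Sc)])
        phi = psi[:, 0]                         # smoothest signal vanishing on S
        S.append(int(Sc[np.argmax(phi ** 2)]))  # v = argmax_j phi_opt(j)^2
    return S

S = greedy_select(5)
print(S)   # 5 distinct nodes chosen for labeling
```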
Connection with Active Learning
- Cut-off function Ω_k(S) ≡ variation of the smoothest signal in L2(Sc).
- Larger cut-off function ⇒ more variation in φ^opt ⇒ more cross-links between S and Sc.

Intuition: unlabeled nodes end up strongly connected to labeled nodes!
P3: Label Prediction as Signal Reconstruction
- C1 = {x : x(S) = f(S)} and C2 = PWω(G).
- We need to find the unique f ∈ C1 ∩ C2; the sampling theorem guarantees uniqueness.

Projection onto convex sets (POCS):
f_{i+1} = P_{C2} P_{C1} f_i, where f_0 = [f(S)^T, 0]^T.

[Diagram: alternating projections f_0, f_1, f_2, … between C1 and C2 converge to a point in C1 ∩ C2.]
- P_{C1} resets the samples on S to f(S).
- P_{C2} = U h(Λ) U^T sets f̂(λ) = 0 if λ > ω, using the ideal low-pass response
  h(λ) = 1 if λ < ω, and 0 if λ ≥ ω.

[Plot: h(λ) on λ ∈ [0, 2], exact response vs. polynomial approximation.]
- P_{C2} ≈ Σ_{i=1}^N ( Σ_{j=0}^p a_j λ_i^j ) u_i u_i^T = Σ_{j=0}^p a_j L^j → a p-hop localized operation.

Predicted class of node n: arg max_c f̂_c(n).
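The POCS iteration can be sketched end-to-end on a toy graph: a signal is synthesized in PWω(G), sampled on a random S, and recovered by alternating P_C1 and P_C2 (here P_C2 is the exact spectral low-pass filter rather than its polynomial approximation):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy graph and a truly bandlimited signal to reconstruct.
N = 40
A = np.triu(rng.random((N, N)) < 0.2, 1).astype(float)
W = A + A.T
d = W.sum(axis=1)
d[d == 0] = 1.0
Ln = np.eye(N) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
lam, U = np.linalg.eigh(Ln)

f = U[:, :5] @ rng.random(5)            # ground truth in PW_omega(G)
omega = lam[5]                          # cut-off keeps the 5 lowest frequencies
S = rng.choice(N, size=20, replace=False)

P_C2 = U @ np.diag((lam < omega).astype(float)) @ U.T   # ideal low-pass filter

x = np.zeros(N)
x[S] = f[S]                             # f_0 = [f(S)^T, 0]^T
for _ in range(1000):
    x[S] = f[S]                         # P_C1: reset known samples on S
    x = P_C2 @ x                        # P_C2: project onto PW_omega(G)

print(round(float(np.abs(x - f).max()), 6))  # ≈ 0.0: reconstruction converges
```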
Summary of the Algorithm
Input data → construct graph → choose nodes to label by maximizing the cut-off frequency → query labels of the chosen nodes → predict labels by signal reconstruction.
Related Work
Submodular optimization:
- Optimizing the "strength" of a network (Ψ-max) [Guillory and Bilmes, 2011]
  - computationally complex
- Graph-partitioning-based heuristic (METIS) [Guillory and Bilmes, 2009]

Generalization error bound minimization:
- Minimizing a generalization error bound for LLGC [Gu and Han, 2012]
  - contains a regularization parameter that needs to be tuned

Optimal experiment design:
- Local linear reconstruction (LLR) [Zhang et al., 2011]
  - does not consider the learning algorithm
Results: Toy Example
Task: pick 8 data points for labeling.

[Figure: two-circle toy dataset; the 8 points selected by the proposed method are shown alongside the selections of LLGC bound, LLR and Ψ-max.]

- 4 data points picked from each circle.
- Maximally separated points within one circle.
- Maximal spacing between selected data points in different circles.
Results: Real Datasets
[Plots: classification accuracy vs. percentage of labeled data, comparing Random, LLGC Bound, METIS, LLR and the proposed method on each dataset.]

USPS: handwritten digits
- x_i = 16×16 image
- number of classes = 10
- K-NN graph with K = 10
- w_ij = exp(−‖x_i − x_j‖² / 2σ²)

ISOLET: spoken letters
- x_i ∈ R^617 speech features
- number of classes = 26
- K-NN graph with K = 10
- w_ij = exp(−‖x_i − x_j‖² / 2σ²)

Newsgroups: documents
- x_i ∈ R^3000 tf-idf of words
- number of classes = 10
- K-NN graph with K = 10
- w_ij = x_i^T x_j / (‖x_i‖ ‖x_j‖)
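The graph construction used in these experiments can be sketched as follows. The feature matrix below is random stand-in data (not USPS/ISOLET features), and σ is an arbitrary choice:

```python
import numpy as np

# Symmetric K-NN graph with Gaussian (RBF) edge weights.
rng = np.random.default_rng(6)
X = rng.random((100, 16))                # stand-in feature vectors
K, sigma = 10, 0.5

# Pairwise squared distances and full Gaussian affinity.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Wfull = np.exp(-D2 / (2 * sigma ** 2))
np.fill_diagonal(Wfull, 0)               # no self-loops

# Keep only each node's K nearest neighbours, then symmetrize.
W = np.zeros_like(Wfull)
idx = np.argsort(-Wfull, axis=1)[:, :K]  # K largest weights per row
rows = np.repeat(np.arange(len(X)), K)
W[rows, idx.ravel()] = Wfull[rows, idx.ravel()]
W = np.maximum(W, W.T)                   # symmetric adjacency
print(W.shape, int((W > 0).sum(axis=1).min()))  # every node keeps ≥ K edges
```

A cosine-similarity graph (as used for the Newsgroups data) would replace the Gaussian affinity with `X @ X.T` normalized by the row norms.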
Results: Effect of k
Larger k ⇒ better estimate of cut-off frequency is optimized.
[Plots: accuracy vs. percentage of labeled data for k = 4, 6, 8 on (a) USPS, (b) Isolet, (c) 20 newsgroups.]

Figure 6: Effect of k on classification accuracy of the proposed method. Plots show the average classification accuracy for different percentages of labeled data.
[8] Q. Gu, T. Zhang, C. Ding, and J. Han. Selective labeling via error bound minimization. In Advances in Neural Information Processing Systems 25, pages 332–340. 2012.
[9] A. Guillory and J. Bilmes. Label selection on graphs. In Advances in Neural Information Processing Systems 22, pages 691–699. 2009.
[10] A. Guillory and J. Bilmes. Active semi-supervised learning using submodular functions. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 274–282, 2011.
[11] D. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[12] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, pages 417–424, 2006.
[13] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 1998.
[14] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM Journal on Scientific Computing, 23(2):517–541, 2001.
[15] D. Kong, C. H. Ding, H. Huang, and F. Nie. An iterative locally linear embedding algorithm. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[16] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, Apr. 2012.
[17] S. Narang and A. Ortega. Perfect reconstruction two-channel wavelet filter banks for graph structured data. IEEE Transactions on Signal Processing, 60(6):2786–2799, June 2012.
[18] S. K. Narang, A. Gadde, E. Sanou, and A. Ortega. Localized iterative methods for interpolation in graph structured data. In Signal and Information Processing (GlobalSIP), 2013 IEEE Global Conference on, 2013.
[19] B. Osting, C. D. White, and E. Oudet. Minimal Dirichlet energy partitions for graphs. Aug. 2013. arXiv:1308.4915 [math.OC].
[20] I. Pesenson. Sampling in Paley-Wiener spaces on combinatorial graphs. Transactions of the American Mathematical Society, 360(10):5603–5627, 2008.
[21] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[22] K. Sauer and J. Allebach. Iterative reconstruction of bandlimited images from nonuniformly spaced samples. Circuits and Systems, IEEE Transactions on, 34(12):1497–1506, 1987.
[23] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
[24] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. Signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular data domains. Signal Processing Magazine, arXiv:1211.0053, May 2013.
[25] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.
[26] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning, pages 1081–1088, 2006.
[27] L. Zhang, C. Chen, J. Bu, D. Cai, X. He, and T. S. Huang. Active learning based on locally linear reconstruction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(10):2026–2038, 2011.
[28] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16. 2004.
[29] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2008.
[30] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.
0 200 400 600 800 10000
0.2
0.4
0.6
0.8
1
Eigenvalue index
CD
F o
f e
ne
rgy in
GF
T c
oe
ffic
ien
ts
(a) USPS
0 200 400 600 800 1000 1200 14000
0.2
0.4
0.6
0.8
1
Eigenvalue index
CD
F o
f e
ne
rgy in
GF
T c
oe
ffic
ien
ts
(b) Isolet
0 200 400 600 800 10000
0.2
0.4
0.6
0.8
1
Eigenvalue index
CD
F o
f e
ne
rgy in
GF
T c
oe
ffic
ien
ts(c) 20 newsgroups
Figure 3: Cumulative distribution of energy in the GFT coefficients of one of the class membership functions pertaining tothe three real-world dataset experiments considered in Section 5. Note that most of the energy is concentrated in the low-passregion.
term is expected to be negligible as compared to the firstone due to differencing, and we get
xtLx ≈
j∈Sc
pj
dj
x2
j , (24)
where, pj =
i∈S wij is defined as the “partial out-degree”of node j ∈ Sc, i.e., it is the sum of weights of edges crossingover to the set S. Therefore, given a current selected S, thegreedy algorithm selects the next node, to be added to S,that maximizes the increase in
Ω1(S) ≈ inf||x||=1
j∈Sc
pj
dj
x2
j . (25)
Due to the constraint ||x|| = 1, the expression being mini-mized is essentially an infimum over a convex combinationof the fractional out-degrees and its value is largely deter-mined by nodes j ∈ Sc for which pj/dj is small. In otherwords, we must worry about those nodes that have a lowratio of partial degree to the actual degree. Thus, in thesimplest case, our selection algorithm tries to remove thosenodes from the unlabeled set that are weakly connected tonodes in the labeled set. This makes intuitive sense as, inthe end, most prediction algorithms involve propagation oflabels from the labeled to the unlabeled nodes. If an unla-beled node is strongly connected to various numerous points,its label can be assigned with greater confidence.
Note that using a higher power k in the cost function,i.e., finding Ωk(S) for k > 1 involves xLkx which, looselyspeaking, takes into account higher order interactions be-tween the nodes while choosing the nodes to label. In asense, we expect it to capture the connectivities in a moreglobal sense, beyond local interactions, taking into accountthe underlying manifold structure of the data.
3.3 Complexity

We now comment on the time and space complexity of our algorithm. The most complex step in the greedy procedure for maximizing Ω_k(S) is computing the smallest eigenpair of (L^k)_{Sᶜ}. This can be accomplished using an iterative Rayleigh-quotient minimization algorithm; specifically, the locally-optimal pre-conditioned conjugate gradient (LOPCG) method [14] is suitable for this approach. Note that (L^k)_{Sᶜ} can be written as I_{Sᶜ,V} · L · L ⋯ L · I_{V,Sᶜ}, so the eigenvalue computation can be broken into atomic matrix-vector products L·x. Typically, the graphs encountered in learning applications are sparse, leading to efficient implementations of L·x. If |L| denotes the number of non-zero elements in L, then the complexity of the matrix-vector product is O(|L|), and the complexity of each eigenpair computation for (L^k)_{Sᶜ} is O(k|L|r), where r is the average number of iterations required by the LOPCG algorithm (r depends on the spectral properties of L and is independent of its size |V|). The complexity of the label selection algorithm is then O(k|L|mr), where m is the number of labels requested.
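A sketch of this eigenpair computation, assuming SciPy is available; `scipy.sparse.linalg.lobpcg` implements the block variant (LOBPCG) of the locally-optimal preconditioned CG family mentioned above, and the `LinearOperator` wrapper keeps every step an L·x product, never forming (L^k)_{Sᶜ} explicitly:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, lobpcg

def smallest_eigenpair(L, Sc, k=1, seed=0):
    """Smallest eigenpair of (L^k)_{Sc}, touching L only through
    matrix-vector products: y = I_{Sc,V} L ... L I_{V,Sc} x."""
    m = len(Sc)

    def matvec(x):
        y = np.zeros(L.shape[0])
        y[Sc] = np.ravel(x)         # upsample: I_{V,Sc} x
        for _ in range(k):
            y = L @ y               # k sparse products L.y
        return y[Sc]                # downsample: I_{Sc,V} (.)

    A = LinearOperator((m, m), matvec=matvec, dtype=float)
    X = np.random.default_rng(seed).standard_normal((m, 1))
    vals, vecs = lobpcg(A, X, largest=False, tol=1e-8, maxiter=2000)
    return vals[0], vecs[:, 0]
```

Because only `matvec` is supplied, memory usage stays at O(|L|) plus a few length-N vectors, matching the space-complexity discussion below.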
In the iterative reconstruction algorithm, since we use polynomial graph filters (Section 2.5), the atomic step is once again the matrix-vector product L·x. The complexity of this algorithm is O(|L|pq), where p is the order of the polynomial used to design the filter and q is the average number of iterations required for convergence; both parameters are independent of |V|. Thus, the overall complexity of our algorithm is O(|L|(kmr + pq)). In addition, our algorithm has major advantages in terms of space complexity: since the atomic operation at each step is the matrix-vector product L·x, we only need to store L and a constant number of vectors. Moreover, the structure of the Laplacian matrix allows these operations to be performed in a distributed fashion, making the method well-suited to large-scale implementations using software packages such as GraphLab [16].
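A minimal sketch of such an iterative reconstruction, with one deliberate simplification: it uses an exact spectral low-pass projector (via a full eigendecomposition) for clarity, whereas the method of Section 2.5 replaces that projector with an order-p polynomial graph filter so that each iteration reduces to products L·x.

```python
import numpy as np

def iterative_reconstruction(L, S, fS, cutoff, n_iter=500):
    """Alternate between (i) resetting the known samples on S and
    (ii) projecting onto the bandlimited space PW_cutoff(G).
    The low-pass step here is an exact spectral projector; a polynomial
    graph filter would approximate it using only products L.x."""
    lam, U = np.linalg.eigh(L)
    Uk = U[:, lam < cutoff]
    lowpass = Uk @ Uk.T             # projector onto PW_cutoff(G)
    f = np.zeros(L.shape[0])
    for _ in range(n_iter):
        f[S] = fS                   # (i) enforce the known labels
        f = lowpass @ f             # (ii) bandlimit
    return f
```

When the true signal is bandlimited and S is a uniqueness set for that bandwidth, the alternating projections converge to the exact signal.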
3.4 Prediction Error and Number of Labels

As discussed in Section 2.5, given the samples f(S) of the true graph signal on a subset of nodes S ⊂ V, its estimate on Sᶜ is obtained by solving the following problem:
f̂(Sᶜ) = U_{Sᶜ,K} α*,  where α* = arg min_α ||U_{S,K} α − f(S)||.   (26)
Here, K is the index set of eigenvectors with eigenvalues less than the cut-off ωc(S). If the true signal f ∈ PW_{ωc(S)}(G), then the prediction is perfect. However, this is not the case in most problems. The prediction error ||f − f̂|| roughly equals the portion of the energy of the true signal in the [ωc(S), λ_N] frequency band. By choosing the sampling set S that maximizes ωc(S), we try to capture most of the signal energy and thus reduce the prediction error.
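For small graphs, the least-squares prediction of (26) can be computed directly from the eigendecomposition; a sketch:

```python
import numpy as np

def bandlimited_predict(L, S, fS, cutoff):
    """Direct solution of the least-squares problem (26): fit the labeled
    samples f(S) in the span of the eigenvectors with eigenvalue below
    the cut-off (index set K), then read off the estimate everywhere."""
    lam, U = np.linalg.eigh(L)
    UK = U[:, lam < cutoff]                      # columns indexed by K
    alpha, *_ = np.linalg.lstsq(UK[S, :], fS, rcond=None)
    return UK @ alpha                            # estimate on all of V
```

If f truly lies in PW_{cutoff}(G) and U_{S,K} has full column rank, the fit is exact; otherwise the out-of-band energy shows up as prediction error, as described above.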
An important question in the context of active learning is determining the minimum number of labels required so that the prediction error satisfies ||f − f̂|| ≤ δ for a given tolerance δ.
Conclusion and Future Work
I Application of graph signal sampling theory to active SSL
I Class labels ⇒ bandlimited graph signals
I Choosing nodes ⇒ best sampling set selection
I Predicting unknown labels ⇒ signal reconstruction from samples

I Proposed approach gives significantly better results than the compared methods (LLGC bound, LLR, Ψ-max).

I Future work:
I Approximate optimality of the proposed sampling set selection
I Robustness against noise
References
A. Anis, A. Gadde, and A. Ortega. Towards a sampling theorem for signals on arbitrary graphs. In ICASSP, 2014.

S. K. Narang, A. Gadde, and A. Ortega. Localized iterative methods for interpolation in graph structured data. In IEEE GlobalSIP, 2013.

A. Guillory and J. Bilmes. Active semi-supervised learning using submodular functions. In UAI, 2011.

A. Guillory and J. Bilmes. Label selection on graphs. In NIPS, 2009.

L. Zhang, C. Chen, J. Bu, D. Cai, X. He, and T. Huang. Active learning based on locally linear reconstruction. TPAMI, 2011.

Q. Gu and J. Han. Towards active learning on graphs, an error bound minimization approach. In ICDM, 2012.
Thank you!
Label Complexity
I Let f̂ be the reconstruction of f obtained from its samples on S.

I What is the minimum number of labels required so that ‖f − f̂‖ ≤ δ?
Smoothness of a signal
Let Pθ be the projector for PWθ(G). Then γ(f) = min θ s.t. ‖f − Pθf‖ ≤ δ.
Theorem
The minimum number of labels |S| required to satisfy ‖f − f̂‖ ≤ δ is greater than p, where p is the number of eigenvalues of L less than γ(f).
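A numerical sketch of this bound, under our assumption (for concreteness; the slide does not fix the convention) that the projector P_θ keeps graph frequencies at or below θ:

```python
import numpy as np

def label_lower_bound(L, f, delta):
    """Compute the smoothness gamma(f) = min theta s.t. ||f - P_theta f|| <= delta,
    then return p = #{eigenvalues of L below gamma(f)}, a lower bound on the
    number of labels |S| needed for ||f - f_hat|| <= delta."""
    lam, U = np.linalg.eigh(L)
    energy = (U.T @ f) ** 2                       # graph Fourier energies
    tail_energy = np.cumsum(energy[::-1])[::-1]   # energy at frequencies >= lam[i]
    # residual ||f - P_{lam[i]} f|| when keeping frequencies <= lam[i]
    tail = np.sqrt(np.append(tail_energy[1:], 0.0))
    i_star = int(np.argmax(tail <= delta))        # first cutoff achieving delta
    gamma = lam[i_star]
    p = int(np.sum(lam < gamma))
    return gamma, p
```

On a signal spanned by the first few eigenvectors, p recovers exactly the number of active frequencies below γ(f), matching the theorem's count.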