Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
M. Defferrard¹, X. Bresson², P. Vandergheynst¹
¹EPFL, Lausanne, Switzerland
²Nanyang Technological University, Singapore
Presented by Itay Boneh and Asher Kabakovitch, Tel-Aviv University, Deep Learning Seminar
2017
Spectral Filtering: Non-parametric

From the convolution theorem:

x ∗_G g = U((U^T g) ⊙ (U^T x)) = U(ĝ ⊙ U^T x)

Or algebraically:

x ∗_G g = U diag(ĝ_1, …, ĝ_n) U^T x

▶ Not localized
▶ O(n) parameters to train
▶ O(n²) multiplications (no FFT)
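As a concrete illustration of why the non-parametric filter costs O(n²), here is a minimal NumPy sketch (the graph, signal, and filter values are made-up toy data, not from the paper):

```python
import numpy as np

# Toy sketch of non-parametric spectral filtering: filter a graph signal
# through a full eigendecomposition of the Laplacian.
n = 5
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:   # a 5-cycle
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W        # combinatorial Laplacian L = D - W

lam, U = np.linalg.eigh(L)            # O(n^3) EVD: L = U diag(lam) U^T

rng = np.random.default_rng(0)
x = rng.standard_normal(n)            # graph signal
g_hat = np.exp(-lam)                  # one free value per eigenvalue: O(n) parameters

# x *_G g = U (g_hat ⊙ (U^T x)): two dense O(n^2) matrix-vector products, no FFT
y = U @ (g_hat * (U.T @ x))
```

The two dense products with U and U^T are exactly the O(n²) cost the slide refers to; there is no graph analogue of the FFT to speed them up.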
Spectral Filtering: Polynomial Parametrization

g is a continuous 1-D function, parametrized by θ:

x ∗_G g = U(g_θ(λ) ⊙ U^T x) = U diag(g_θ(λ_1), …, g_θ(λ_n)) U^T x = U g_θ(Λ) U^T x = g_θ(L) x

(using the spectral mapping: Tφ = φΛ ⇒ f(T) = φ f(Λ) φ^{-1})
Polynomial parametrization:

g_θ(Λ) = Σ_{k=0}^{K−1} θ_k Λ^k

▶ K-localized: 1 hop per application of L
▶ K parameters to train, independent of the graph's size
▶ Still O(n²) multiplications (multiplications with the basis U)
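The K-localization claim can be checked directly: applying L once spreads a signal's support by one hop, so a degree-(K−1) polynomial filter applied to a delta reaches at most K−1 hops. A toy sketch (path graph and θ values are made up):

```python
import numpy as np

# Toy sketch of the polynomial filter g_θ(L)x = Σ_{k<K} θ_k L^k x on a
# path graph. A delta at node 0 spreads at most K-1 hops.
n, K = 7, 3
W = np.zeros((n, n))
for i in range(n - 1):                 # path graph 0-1-2-...-6
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

theta = np.array([1.0, 0.5, 0.25])     # K = 3 parameters, independent of n
x = np.zeros(n)
x[0] = 1.0                             # delta signal at node 0

xk = x.copy()
y = theta[0] * xk
for k in range(1, K):
    xk = L @ xk                        # one more hop of support per power of L
    y += theta[k] * xk

print(np.nonzero(y)[0])                # support confined to nodes 0..K-1
```

With K = 3 the output is nonzero only on nodes {0, 1, 2}, i.e. within 2 hops of the delta, which is what "K-localized" means here.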
Spectral Filtering: Polynomial Parametrization

[Figure: impulse response of a filter on a 2-D Euclidean domain vs. the impulse response on a graph]
Spectral Filtering: Recursive Polynomial Parametrization

Chebyshev polynomials: T_k(λ) = 2λ T_{k−1}(λ) − T_{k−2}(λ)
T_0(λ) = 1
T_1(λ) = λ

Parametrization:

g_θ(L̃) = Σ_{k=0}^{K−1} θ_k T_k(L̃)

L̃ = 2L λ_n^{-1} − I (rescaled so the spectrum lies in [−1, 1], where the Chebyshev polynomials form an orthogonal basis)
Spectral Filtering: Recursive Polynomial Parametrization

Filtering:

y = g_θ(L̃) x = Σ_{k=0}^{K−1} θ_k T_k(L̃) x = Σ_{k=0}^{K−1} θ_k x̄_k

Recurrence: x̄_k = T_k(L̃) x = 2L̃ x̄_{k−1} − x̄_{k−2}
x̄_0 = x
x̄_1 = L̃ x

▶ K-localized: 1 hop per application of L
▶ K parameters to train, independent of the graph's size
▶ O(Kn) multiplications (actually O(K|E|))
▶ No EVD of L
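The Chebyshev recurrence above can be sketched in a few lines; the recurrence needs only sparse matrix-vector products, so the whole filter costs K matvecs and no eigendecomposition (the toy graph and θ values below are made up):

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(L_t, x, theta):
    """y = Σ_k θ_k T_k(L̃) x via the three-term recurrence: K sparse
    matvecs, O(K|E|) total, no EVD. L_t must have spectrum in [-1, 1]."""
    x0 = x
    y = theta[0] * x0
    if len(theta) > 1:
        x1 = L_t @ x0
        y = y + theta[1] * x1
    for k in range(2, len(theta)):
        x2 = 2.0 * (L_t @ x1) - x0     # x̄_k = 2 L̃ x̄_{k-1} - x̄_{k-2}
        y = y + theta[k] * x2
        x0, x1 = x1, x2
    return y

# Toy graph and coefficients, made up for illustration
n = 6
W = np.zeros((n, n))
for i in range(n):                                    # 6-cycle
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lmax = np.linalg.eigvalsh(L)[-1]
L_t = sp.csr_matrix(2.0 * L / lmax - np.eye(n))       # L̃ = 2L/λ_max - I

theta = np.array([0.7, -0.3, 0.1])
x = np.arange(n, dtype=float)
y = chebyshev_filter(L_t, x, theta)
```

Note that only λ_max is needed for the rescaling (cheap to estimate with a few power iterations on a real graph); the full EVD used here to build the toy example is never required by the filter itself.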
Graph CNN

Graph CNN: Learning Filters

y_{s,j} = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L) x_{s,i}

∂L/∂θ_{i,j} = Σ_{s=1}^{S} [x̄_{s,i,0}, …, x̄_{s,i,K−1}]^T ∂L/∂y_{s,j}

∂L/∂x_{s,i} = Σ_{j=1}^{F_out} g_{θ_{i,j}}(L) ∂L/∂y_{s,j}

s = 1, …, S: sample index
i = 1, …, F_in: input feature map index
j = 1, …, F_out: output feature map index
θ_{i,j}: F_in × F_out Chebyshev coefficient vectors of order K
L: loss over a mini-batch of S samples
Cost: O(K|E| F_in F_out S)
Easily parallelized
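A forward pass of one such layer can be sketched as follows: precompute the Chebyshev features x̄_{s,i,k} = T_k(L̃) x_{s,i} once per input map, then every output map is a linear combination of K vectors (all shapes and values are toy data; the diagonal matrix stands in for a real rescaled Laplacian):

```python
import numpy as np

# Sketch of one graph-conv layer forward pass: y_{s,j} = Σ_i g_{θ_{i,j}}(L̃) x_{s,i}.
S, Fin, Fout, K, n = 2, 3, 4, 3, 5
rng = np.random.default_rng(1)

L_t = np.diag(rng.uniform(-1, 1, n))              # stand-in rescaled Laplacian, spectrum in [-1, 1]
X = rng.standard_normal((S, Fin, n))              # input feature maps
theta = rng.standard_normal((Fin, Fout, K))       # F_in x F_out coefficient vectors of order K

Xbar = np.empty((S, Fin, K, n))                   # Chebyshev features T_k(L̃) x_{s,i}
Xbar[:, :, 0] = X
Xbar[:, :, 1] = X @ L_t.T
for k in range(2, K):
    Xbar[:, :, k] = 2.0 * Xbar[:, :, k - 1] @ L_t.T - Xbar[:, :, k - 2]

# y[s, j] = Σ_i Σ_k θ[i, j, k] * Xbar[s, i, k]
Y = np.einsum('ijk,sikn->sjn', theta, Xbar)
```

With a sparse L̃ the feature computation costs O(K|E|) per (s, i) pair, giving the O(K|E| F_in F_out S) total from the slide, and the einsum contraction parallelizes trivially over samples and feature maps.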
Graph Pooling: Coarsening

▶ Multilevel clustering algorithm
▶ Reduce the size of the graph by a specified factor (here, 2)
▶ Do all this efficiently

Graclus multilevel clustering algorithm:

▶ Maximizes the local normalized cut.
▶ Greedily pick an unmarked vertex i and match it with an unmatched vertex j which maximizes the local normalized cut W_{i,j}(1/d_i + 1/d_j).
▶ Extremely fast.
▶ Divides the number of nodes by approximately 2.
▶ Might generate singleton (unmatched) nodes; solved by adding fake nodes.
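The greedy matching step can be sketched as below. This is a hedged toy reimplementation of one coarsening pass, not the actual Graclus code, and the dense neighbor scan is O(n²) where the real algorithm iterates only over each vertex's neighbors:

```python
import numpy as np

# One Graclus-style coarsening pass: greedily match each unmarked vertex i
# with the unmatched neighbor j maximizing the local normalized cut
# W[i,j] * (1/d_i + 1/d_j). Unmatched vertices become singletons, to be
# paired with fake nodes later.
def coarsen_once(W):
    n = W.shape[0]
    d = W.sum(axis=1)
    matched = np.zeros(n, dtype=bool)
    clusters = []
    for i in range(n):
        if matched[i]:
            continue
        matched[i] = True
        best_j, best_cut = -1, 0.0
        for j in range(n):
            if not matched[j] and W[i, j] > 0:
                cut = W[i, j] * (1.0 / d[i] + 1.0 / d[j])
                if cut > best_cut:
                    best_j, best_cut = j, cut
        if best_j >= 0:
            matched[best_j] = True
            clusters.append((i, best_j))
        else:
            clusters.append((i,))       # singleton
    return clusters

# Toy 4-cycle with unit weights: two pairs, node count halved
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(coarsen_once(W))                  # [(0, 1), (2, 3)]
```

Each pass roughly halves the node count; repeating it builds the multilevel hierarchy used for pooling.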
Graph Pooling: Fast Pooling

Example: pooling by 4

▶ Graclus generates singletons: n_0 = 8 → n_1 = 5 → n_2 = 3
▶ By adding fake nodes we get: n_2 = 3 → n_1 = 6 → n_0 = 12
▶ z = [max(x_0, x_1), max(x_4, x_5, x_6), max(x_8, x_9, x_10)]
▶ Balanced binary trees ⇒ efficient on GPUs.
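With the slide's toy numbers, the trick looks like this: after adding fake nodes the hierarchy is a balanced binary tree over n_0 = 12 leaves, so pooling by 4 reduces to a single stride-4 1-D max pool. The signal values below are made up; fake nodes hold −∞ so they never win the max:

```python
import numpy as np

# Fast pooling by 4 over n_0 = 12 reordered nodes. Real nodes sit at
# {0, 1, 4, 5, 6, 8, 9, 10} as in the slide's example; the rest are fake.
x = np.full(12, -np.inf)
real = [0, 1, 4, 5, 6, 8, 9, 10]
x[real] = [3.0, 1.0, 7.0, 2.0, 5.0, 4.0, 9.0, 6.0]   # toy signal values

z = x.reshape(3, 4).max(axis=1)   # z = [max(x0,x1), max(x4,x5,x6), max(x8,x9,x10)]
print(z)                          # [3. 7. 9.]
```

A regular strided max over a contiguous array is exactly the memory-access pattern GPUs are fastest at, which is the point of the balanced-tree reordering.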
Experiments: MNIST

W_{i,j} = exp(−‖x_i − x_j‖²₂ / σ²)

▶ 28×28 pixels + 192 fake nodes ⇒ n = |V| = 976
▶ 8-NN graph ⇒ |E| = 3198
▶ Based on LeNet-5 ⇒ K = 5
▶ Isotropic filters (no orientation)
▶ Optimizations and initializations not yet investigated
Experiments: 20NEWS

▶ 18,846 text documents associated with 20 classes
▶ 10K most common words out of the 94K unique words ⇒ n = |V| = 10K
▶ 16-NN graph, W_{i,j} = exp(−‖z_i − z_j‖²₂ / σ²) (z_i: word2vec embedding) ⇒ |E| = 132,834
▶ x: bag-of-words model
Experiments: 20NEWS

Comparison with other methods (K = 5):

▶ Slightly worse than a multinomial naive Bayes classifier.
▶ Beats fully-connected networks with far fewer parameters.

Total training time divided by the number of gradient steps:

▶ Scales as O(n), as opposed to O(n²).
Experiments: Graph Quality

⇒ The structure of the data is important
⇒ A well-constructed graph is important
⇒ Proper approximations (e.g., LSHForest) can be used for larger databases.
Conclusions and Future Work

▶ Introduced a model with linear complexity.
▶ The quality of the input graph is of paramount importance.
▶ Local stationarity and compositionality hold for text documents as long as the graph is well constructed.