
Convolutional Graph Embeddings

Si Yi (Cathy) Meng

March 11, 2019

UBC MLRG


Motivation

• Graphs are everywhere.

• Machine Learning tasks on graphs:

• Node classification

• Link prediction

• Neighbourhood identification

• . . .

• Representation learning on graphs: learn vector representations of nodes or subgraphs for downstream ML tasks.


Motivation

Figure 2: Schizophrenia PPIs


Table of contents

1. Node Embeddings

2. Convolution on Graphs (Graph-CNN)

3. Graph Convolutional Networks (GCN)

4. Inductive Representation Learning on Large Graphs (GraphSAGE)


Node Embeddings


Node Embeddings

Figure 3: Perozzi et al. 2014. [6]

• The vector representation of nodes should preserve information about pairwise relationships.

• But mapping from non-Euclidean space to a feature vector is not straightforward.


Notation

• Undirected graph G = (V, E), with |V| = n nodes

• Adjacency matrix A ∈ R^{n×n}, binary or weighted

• Diagonal degree matrix D, where D_ii = ∑_j A_ij

• May also have node attributes X ∈ R^{n×d}
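As a concrete instance of this notation, here is a minimal NumPy sketch; the small 4-node graph and the random attributes are made up purely for illustration:

```python
import numpy as np

# Toy undirected graph on n = 4 nodes with edges (0,1), (0,2), (1,2), (2,3).
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

A = np.zeros((n, n))          # binary adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # undirected graph: A is symmetric

D = np.diag(A.sum(axis=1))    # diagonal degree matrix, D_ii = sum_j A_ij

d = 2
X = np.random.randn(n, d)     # optional node attribute matrix X in R^{n x d}
```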


Node Embeddings - Shallow Embeddings

Encoder-decoder framework

ENC(v_i) = Z v_i, where Z ∈ R^{m×n} and v_i is a one-hot indicator vector

DEC(ENC(v_i), ENC(v_j)) = DEC(z_i, z_j) ≈ s_G(v_i, v_j)

where s_G is a pre-defined similarity metric between two nodes, defined over the graph.


Node Embeddings - Shallow Embeddings

• Matrix factorization-based approaches:

• Deterministic similarity measure such as A_ij.

• Minimize the reconstruction error: L = ∑_{i,j} ℓ(DEC(z_i, z_j), s_G(v_i, v_j)) ≈ ‖Z^T Z − S‖²_2, where S collects the pairwise similarities.

• Random walk approaches:

• Stochastic measure of node similarity based on random walk statistics.

• Decoder uses a softmax over the inner products of the encoded features.

See Hamilton et al. 2017 [3] for an in-depth review.
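As a sketch of the matrix-factorization flavour, assuming the similarity matrix S is simply the adjacency matrix and the decoder is an inner product (the function name, step size, and embedding size below are illustrative, not from the slides):

```python
import numpy as np

def shallow_embeddings(S, m=16, lr=0.01, steps=500, seed=0):
    """Fit Z (m x n) so that Z^T Z approximates a symmetric similarity
    matrix S, by gradient descent on ||Z^T Z - S||_F^2."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    Z = 0.1 * rng.standard_normal((m, n))
    for _ in range(steps):
        R = Z.T @ Z - S         # symmetric residual of the reconstruction
        Z -= lr * 4.0 * Z @ R   # gradient of ||Z^T Z - S||_F^2 when S = S^T
    return Z                    # column i of Z is the embedding z_i of node v_i

# Usage: with the deterministic similarity s_G(v_i, v_j) = A_ij, take S = A.
```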


Node Embeddings - Shallow Embeddings

• These approaches are simple and intuitive.

• But . . .

• No parameter sharing as the encoder is just a lookup table.

• Overfitting

• Very costly for large n, since the number of embedding parameters grows with n.

• Not utilizing node features X .

• Inherently transductive.

• Instead of a lookup table, we could use a neural network to encode a node’s local structure.

• We will focus on methods using convolution operations on graphs.


Convolution on Graphs (Graph-CNN)


Convolution on Graphs - a Spectral Formulation

How do we define localized convolutional filters on graphs?

We don’t have grids or sequences to define a fixed-size neighborhood.


The Graph Laplacian

• Unnormalized graph Laplacian: ∆ = D − A

• Symmetric normalized graph Laplacian: L := D^{-1/2} ∆ D^{-1/2} = I_n − D^{-1/2} A D^{-1/2}

• L = L^T ⪰ 0 (symmetric positive semidefinite)

• Multiplicity of the eigenvalue 0 indicates the number of connected components in the graph.

Figure 4: Example of a Laplacian matrix
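A small NumPy sketch of these definitions (it reuses the adjacency matrix A from the notation example above; the helper name is ours):

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I_n - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5   # guard against isolated nodes
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

# The eigenvalues are real and nonnegative, and the multiplicity of the
# eigenvalue 0 equals the number of connected components:
# eigvals = np.linalg.eigvalsh(normalized_laplacian(A))
```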


The Graph Laplacian

Since L = L^T ⪰ 0, it has, for ℓ = 1, . . . , n,

• A complete set of orthonormal eigenvectors {u_ℓ} ⊂ R^n

• also called "Fourier modes" or "Fourier basis functions"

• related to spectral clustering

• Corresponding ordered, real, nonnegative eigenvalues {λ_ℓ}

• the "frequencies of the graph"

• L = U Λ U^T, with U = [u_1, . . . , u_n] and Λ = diag(λ_1, . . . , λ_n)


Spectral Filtering

Let x ∈ R^n be a signal vector for all the nodes (we can generalize this to a vector per node). The graph Fourier transform (GFT) of x is defined as

x̂ = U^T x

and the inverse GFT is x = U x̂.

Define the spectral convolution as the multiplication of a signal with a filter g_θ = diag(θ), parameterized by coefficients θ ∈ R^n, in the Fourier domain:

g_θ ⋆ x = U g_θ U^T x

That is: take the GFT of x, apply the filter in the Fourier domain, then transform back. Note that g_θ is a function of Λ.
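A direct (and deliberately naive) NumPy sketch of this definition via the full eigendecomposition; the filter coefficients theta are arbitrary placeholders:

```python
import numpy as np

def spectral_filter(L, x, theta):
    """Compute g_theta * x = U diag(theta) U^T x, where L = U Lambda U^T."""
    lam, U = np.linalg.eigh(L)      # eigendecomposition of the symmetric Laplacian
    x_hat = U.T @ x                 # graph Fourier transform of the signal
    return U @ (theta * x_hat)      # filter in the Fourier domain, then inverse GFT

# Usage (toy): x = np.random.randn(L.shape[0]); theta = np.ones(L.shape[0])
# With theta = 1 (identity filter) this returns x up to numerical error.
```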


Spectral Filtering

• But since gθ is non-parametric, learning is expensive.

• And it’s not localized in space.

Solution: Approximate g_θ(Λ) with a polynomial filter

g_θ(Λ) ≈ ∑_{k=0}^{K−1} θ_k Λ^k,

where θ = [θ_0, . . . , θ_{K−1}] now has size independent of n.


Spectral Filtering

• But computing the eigendecomposition of L is also expensive for large graphs.

Solution: Approximate g_θ(Λ) with a K-th order truncated expansion of Chebyshev polynomials T_k(x):

g_θ(Λ) ≈ ∑_{k=0}^{K−1} θ_k T_k(Λ̃)

with the rescaled Λ̃ = (2/λ_max) Λ − I_n.

The Chebyshev polynomials are recursively defined as T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x.


Spectral Filtering

The filtering operation g_θ ⋆ x can now be written as

g_θ ⋆ x = U g_θ U^T x
        ≈ U (∑_{k=0}^{K−1} θ_k T_k(Λ̃)) U^T x
        = ∑_{k=0}^{K−1} θ_k T_k(L̃) x

with L̃ = (2/λ_max) L − I_n, where the last equality comes from L̃^k = (U Λ̃ U^T)^k = U Λ̃^k U^T.


Spectral Filtering

The spectral filter represented by L is also localized:

• It can be shown that d_G(i, j) > k′ ⟹ (L^{k′})_{i,j} = 0, where d_G(i, j) is the shortest-path distance between two vertices.

• g_θ operates on the K-hop neighbors of a vertex!

Now we have gotten rid of the eigendecomposition.

So what’s the algorithm?


Chebyshev Spectral Graph Convolution

We had a feature matrix X ∈ R^{n×d}. Let H_k = T_k(L̃) X ∈ R^{n×d}; then we have

H_0 = X
H_1 = L̃ X
H_k = 2 L̃ H_{k−1} − H_{k−2}

The filtering operation costs O(K|E|), and the corresponding K-hop convolution operation is

X′ = ∑_{k=0}^{K−1} H_k Θ_k,

where Θ_k ∈ R^{d×m} for a desired output size m. We can now use X′ as a feature extractor (node embeddings).
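A sketch of this recurrence with SciPy sparse matrices, using the common shortcut λ_max ≈ 2 and placeholder weights Θ_k (the function name and defaults are ours):

```python
import numpy as np
import scipy.sparse as sp

def cheb_conv(A, X, Theta, lam_max=2.0):
    """K-hop Chebyshev convolution X' = sum_k T_k(L_tilde) X Theta_k (K >= 2).

    A: (n, n) sparse adjacency, X: (n, d) features, Theta: list of (d, m) weights.
    """
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel().astype(float)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = sp.eye(n) - sp.diags(d_inv_sqrt) @ A @ sp.diags(d_inv_sqrt)  # normalized Laplacian
    L_tilde = (2.0 / lam_max) * L - sp.eye(n)                        # rescaled Laplacian

    H_prev, H = X, L_tilde @ X                       # H_0 = X, H_1 = L_tilde X
    out = H_prev @ Theta[0] + H @ Theta[1]
    for k in range(2, len(Theta)):
        H_prev, H = H, 2.0 * (L_tilde @ H) - H_prev  # H_k = 2 L_tilde H_{k-1} - H_{k-2}
        out = out + H @ Theta[k]
    return out                                       # (n, m) node embeddings
```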


Graph-CNN

Remarks:

• Convolution followed by non-linear activation.

• Graph coarsening/downsampling to group together similar vertices.

• Graph clustering, but NP-hard.

• Greedy algorithm: Graclus multilevel clustering gives successive coarsened graphs.

• Graph pooling: create a balanced binary tree to remember which nodes were matched, in order to perform pooling.

• For more information, see [1, 4].


Graph-CNN

Figure 5: Architecture of a CNN on graphs and the four ingredients of a (graph) convolutional layer. Defferrard et al. 2016 [1].


Graph Convolutional Networks (GCN)


GCN

Kipf & Welling [5] introduced the multi-layer Graph Convolutional Network (GCN) with the following layer-wise propagation rule:

H^{(ℓ+1)} = σ(D̃^{-1/2} Ã D̃^{-1/2} H^{(ℓ)} W^{(ℓ)}),   (1)

where Ã = A + I_n is the adjacency matrix of the undirected graph G with added self-loops, D̃ is its (diagonal) degree matrix, σ(·) is some nonlinear activation, and H^{(0)} = X.
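A minimal dense NumPy sketch of one propagation step of eq. (1), with σ taken to be ReLU (the function name is ours; a real implementation would use sparse matrices):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: ReLU( D~^{-1/2} A~ D~^{-1/2} H W )."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # add self-loops
    d_tilde = A_tilde.sum(axis=1)                 # degrees are >= 1 after self-loops
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # renormalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)         # ReLU activation
```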


GCN

Recall the Chebyshev spectral graph convolution derived earlier,

g_θ ⋆ x ≈ ∑_{k=0}^{K−1} θ_k T_k(L̃) x

For K = 2 and the approximation λ_max ≈ 2, we have

g_θ ⋆ x ≈ θ_0 x + θ_1 L̃ x
        = θ_0 x + θ_1 ((2/λ_max) L − I_n) x
        ≈ θ_0 x + θ_1 (L − I_n) x
        = θ_0 x − θ_1 D^{-1/2} A D^{-1/2} x
        = θ (I_n + D^{-1/2} A D^{-1/2}) x     (letting θ = θ_0 = −θ_1)

The eigenvalues of I_n + D^{-1/2} A D^{-1/2} are in the range [0, 2], which may lead to numerical instability under repeated applications of this filter.


GCN

Renormalization trick:

I_n + D^{-1/2} A D^{-1/2} → D̃^{-1/2} Ã D̃^{-1/2}

Generalizing to node signals with multiple dimensions and using W as the parameters instead of θ, we get the convolution operation (prior to activation) in eq. (1),

Z = D̃^{-1/2} Ã D̃^{-1/2} X W

where W ∈ R^{d×m} for a desired output size m.

The cost of the filtering operation (prior to multiplication by W) is O(|E|), and all matrix multiplications here can be computed efficiently.
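A sparse-matrix sketch of the renormalized propagation, which keeps the filtering cost proportional to |E| (again, the helper name is ours):

```python
import numpy as np
import scipy.sparse as sp

def renormalized_adjacency(A):
    """Return A_hat = D~^{-1/2} (A + I_n) D~^{-1/2} as a sparse CSR matrix."""
    n = A.shape[0]
    A_tilde = A + sp.eye(n)
    d_tilde = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(d_tilde ** -0.5)
    return (D_inv_sqrt @ A_tilde @ D_inv_sqrt).tocsr()

# Convolution prior to activation: the sparse-dense product A_hat @ X costs
# O(|E| d), followed by the dense product (A_hat @ X) @ W of cost O(n d m).
# Z = renormalized_adjacency(A) @ X @ W
```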


GCN - Semi-supervised Classification

To perform semi-supervised classification under this framework, first compute Â = D̃^{-1/2} Ã D̃^{-1/2}. The 2-layer forward model used is

Ŷ = softmax(Â ReLU(Â X W_0) W_1)

The cross-entropy loss is applied over all labeled examples,

L = − ∑_{i ∈ Y_L} ∑_{c=1}^{K} Y_{ic} ln Ŷ_{ic}

where Y_L is the set of node indices where labels exist.
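A forward-pass sketch of this 2-layer model and its masked cross-entropy loss in dense NumPy (no training loop; the weight shapes and function names are illustrative):

```python
import numpy as np

def gcn_2layer_forward(A_hat, X, W0, W1):
    """Y_hat = softmax(A_hat ReLU(A_hat X W0) W1), softmax taken row-wise."""
    H = np.maximum(A_hat @ X @ W0, 0.0)            # first layer with ReLU
    logits = A_hat @ H @ W1                        # second layer
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def masked_cross_entropy(Y_hat, Y, labeled_idx):
    """L = - sum_{i in Y_L} sum_c Y_ic ln Y_hat_ic, over labeled nodes only."""
    return -np.sum(Y[labeled_idx] * np.log(Y_hat[labeled_idx] + 1e-12))
```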


GCN

Remarks:

• No longer limited to the explicit parameterization given by the Chebyshev polynomials.

• Alleviates the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions.

• Scalable.

• But it is transductive in nature.


Inductive Representation Learning on Large Graphs (GraphSAGE)


GraphSAGE

Goal: Efficiently generate node embeddings for nodes unseen at training time, or for entirely new graphs.

• Essential for high-throughput, production-level systems.

• Generalization across graphs with similar structures.

• How to achieve this without re-training with the entire graph?


GraphSAGE

GraphSAGE: Sample and Aggregate (Hamilton et al. [2])

• Train a set of aggregator functions that learn to aggregate feature information from a node’s local neighborhood.

• Learn how to aggregate node features, degree statistics, etc.

• At test time, apply the learned aggregation functions to generate embeddings for entirely unseen nodes.

• Unsupervised loss function.


GraphSAGE - Embedding Generation

Assume the AGGREGATE_k functions, ∀k ∈ {1, . . . , K}, have been learned, along with a set of weight matrices W_k; the embedding generation procedure is then as follows.

Figure 6: Hamilton et al. [2]


GraphSAGE - Embedding Generation

Figure 7: Hamilton et al. [2]

After K iterations, each node’s embedding will contain information from all of its K-hop neighbors. In the minibatch setting, first forward-sample the required neighborhood sets and then run the inner loop.
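A rough sketch of that embedding-generation loop with a mean aggregator; it follows the shape of GraphSAGE's procedure, but the concatenation, weight shapes, and helper names here are illustrative:

```python
import numpy as np

def graphsage_embed(X, neighbors, Ws, act=lambda t: np.maximum(t, 0.0)):
    """K = len(Ws) aggregation steps with a mean aggregator.

    X: (n, d) input features; neighbors[v]: list of neighbor ids of node v;
    Ws[k]: weight matrix applied to the concatenated [self, aggregated] vector.
    """
    H = X.copy()
    for W in Ws:                                        # k = 1, ..., K
        H_new = np.zeros((H.shape[0], W.shape[1]))
        for v in range(H.shape[0]):
            nbrs = neighbors[v]
            agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
            h = act(np.concatenate([H[v], agg]) @ W)    # combine self + neighborhood
            H_new[v] = h / (np.linalg.norm(h) + 1e-12)  # normalize to unit length
        H = H_new
    return H                                            # final node embeddings z_v
```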


GraphSAGE

• Uniformly sample a fixed-size set of neighbors to keep the computational cost of each batch under control.

• Graph-based loss function (unsupervised):

J_G(z_u) = − log(σ(z_u^T z_v)) − Q · E_{v_n ∼ P_n(v)} log(σ(−z_u^T z_{v_n}))

• v: a node that co-occurs near u on a fixed-length random walk

• σ: sigmoid

• P_n: negative sampling distribution

• Q: number of negative samples

• Can also replace/augment this loss with a supervised, task-specific objective.
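A small sketch of this unsupervised loss for one positive pair (u, v), where the expectation over P_n is replaced by an average over Q sampled negatives; drawing negatives uniformly is our simplification:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def unsupervised_loss(Z, u, v, Q=5, rng=None):
    """J_G(z_u) = -log sigma(z_u . z_v) - Q * E_{vn ~ Pn} log sigma(-z_u . z_vn)."""
    rng = rng or np.random.default_rng(0)
    vn = rng.integers(0, Z.shape[0], size=Q)        # negative samples (uniform Pn here)
    pos = -np.log(sigmoid(Z[u] @ Z[v]) + 1e-12)     # pull the co-occurring pair together
    neg = -np.mean(np.log(sigmoid(-Z[vn] @ Z[u]) + 1e-12))   # push negatives apart
    return pos + Q * neg                            # Q times the Monte Carlo expectation
```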


GraphSAGE - Aggregator Functions

• Mean aggregator

• Elementwise mean of {h_u^{k−1}, ∀u ∈ N(v)}.

• LSTM aggregator

• LSTMs operate on sequences.

• Apply LSTMs to a random permutation of a node’s neighbors.

• Pooling aggregator

• Each neighbor’s vector is independently fed through a FC layer.

• Then perform elementwise max-pooling.

• AGGREGATE_k^{pool} = max({σ(W_pool h_{u_i}^k + b), ∀u_i ∈ N(v)})
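Sketches of the mean and max-pooling aggregators acting on a stack of neighbor vectors; here σ is taken to be ReLU, and W_pool, b are illustrative parameters:

```python
import numpy as np

def mean_aggregate(h_neighbors):
    """Elementwise mean of the neighbors' representations {h_u^{k-1}}."""
    return h_neighbors.mean(axis=0)

def pool_aggregate(h_neighbors, W_pool, b):
    """max({ sigma(W_pool h_u + b) : u in N(v) }) with an elementwise max."""
    z = np.maximum(h_neighbors @ W_pool + b, 0.0)   # per-neighbor FC layer + ReLU
    return z.max(axis=0)                            # elementwise max-pooling
```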


Other methods

Figure 8: PyTorch Geometric


Libraries

Figure 9: DGL

Figure 10: PyTorch Geometric


References i

[1] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[2] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[3] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.


References ii

[4] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[5] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[6] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
