Cross-Graph Learning of Multi-Relational Associations

Hanxiao Liu, Yiming Yang
Carnegie Mellon University
{hanxiaol, yiming}@cs.cmu.edu

June 22, 2016
Outline
Task Description
New Contributions
Framework
Scalable Inference
Empirical Evaluation
Summary
Task Description
Goal: Predict associations among heterogeneous graphs.
(a) Drug-Target Interaction: a compound graph (structure similarity) and a protein graph (sequence similarity), linked by "interact" associations.

(b) Citation Network: an author graph (coauthorship), a paper graph (citation), and a venue graph (shared foci), linked by write, publish, and attend associations.
Example: "John published a reinforcement learning paper at ICML." ↦ (John, RL Paper, ICML)
New Contributions
- A unified framework for integrating heterogeneous information from multiple graphs.
- Transductive learning that leverages both labeled data (sparse) and unlabeled data (massive).
- A convex approximation enabling scalable inference over the combinatorially many candidate tuples.
Framework
Notation
- G^(1), G^(2), ..., G^(J) are the individual graphs;
- n_j is the number of nodes in G^(j);
- (i_1, i_2, ..., i_J) is a tuple (a multi-relation);
- f_{i_1, i_2, ..., i_J} is the predicted score for the tuple;
- f is a tensor in R^{n_1 × n_2 × ··· × n_J}.
Framework

Product graph P induced from G^(1), ..., G^(J).

Tensor product: P(G^(1), G^(2), G^(3)) = G^(1) ⊗ G^(2) ⊗ G^(3)
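As a concrete illustration (my own sketch, not from the slides), here is a minimal numpy example of the tensor (Kronecker) graph product for J = 2; the adjacency matrices A_G and A_H are hypothetical toy inputs.

```python
import numpy as np

# Toy adjacency matrices for two small graphs G and H (hypothetical inputs).
A_G = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)  # 3-node path graph
A_H = np.array([[0, 1],
                [1, 0]], dtype=float)     # 2-node edge

# Tensor (Kronecker) product graph: node (u, v) is adjacent to (u', v')
# iff u ~ u' in G and v ~ v' in H.
A_P = np.kron(A_G, A_H)
print(A_P.shape)  # (6, 6): one node per (G-node, H-node) pair
```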
Framework
Why product graph?
- Maps the heterogeneous graphs onto a single unified graph for label propagation (transductive learning).
Framework

Assuming

\[ \mathrm{vec}(f) \sim \mathcal{N}(0, \mathcal{P}) \tag{1} \]

which implies

\[ -\log p(f \mid \mathcal{P}) \propto \mathrm{vec}(f)^\top \mathcal{P}^{-1}\, \mathrm{vec}(f) =: \|f\|_{\mathcal{P}}^2 \tag{2} \]

Optimization problem:

\[ \min_f \; \ell_O(f) + \frac{\gamma}{2} \|f\|_{\mathcal{P}}^2 \tag{3} \]
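To make the semi-norm in (2) concrete, a brute-force sketch (my own, feasible only for tiny graphs) that materializes a positive-definite stand-in for P and evaluates the quadratic form directly; the later slides replace this with a spectral shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 2
f = rng.standard_normal((n1, n2))   # toy score tensor (J = 2)

# Any symmetric positive-definite matrix can stand in for P here.
M = rng.standard_normal((n1 * n2, n1 * n2))
P = M @ M.T + np.eye(n1 * n2)

vec_f = f.reshape(-1)                            # vec(f)
seminorm_sq = vec_f @ np.linalg.solve(P, vec_f)  # vec(f)^T P^{-1} vec(f)
print(seminorm_sq)
```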
Framework

For computational tractability, we focus on the spectral graph product family of P.

Spectral Graph Product (SGP): the eigensystem of P_κ(G^(1), ..., G^(J)) is parametrized by the eigensystems of the individual graphs, i.e.,

\[ \Big\{ \kappa(\lambda_{i_1}, \dots, \lambda_{i_J}),\; \bigotimes_j v_{i_j} \Big\}_{i_1, \dots, i_J} \tag{4} \]

where λ_{i_j} / v_{i_j} is the i_j-th eigenvalue/eigenvector of the j-th graph.
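A small numpy sketch (my own, J = 2) of how an SGP eigensystem is assembled from the factor graphs per (4); with κ(x, y) = x · y it reproduces the Kronecker-product graph exactly, which the final assertion checks.

```python
import numpy as np

# Symmetric toy adjacency matrices for two graphs (hypothetical inputs).
A_G = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A_H = np.array([[0., 1.], [1., 0.]])

lam_G, V_G = np.linalg.eigh(A_G)   # eigensystem of G
lam_H, V_H = np.linalg.eigh(A_H)   # eigensystem of H

def kappa(x, y):
    return x * y                   # the tensor-product choice, eq. (5)

# SGP eigenvalues: kappa over every pair of factor eigenvalues, eq. (4).
lam_P = kappa(lam_G[:, None], lam_H[None, :]).reshape(-1)
# SGP eigenvectors: Kronecker products of the factor eigenvectors.
V_P = np.kron(V_G, V_H)

# With this kappa, the SGP is exactly the Kronecker-product graph.
A_P = V_P @ np.diag(lam_P) @ V_P.T
assert np.allclose(A_P, np.kron(A_G, A_H))
```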
Framework

Nice properties of SGP:

Subsuming basic operations:

\[ \kappa(x, y) = x \times y \implies P_\kappa(G, H) = G \otimes H \quad \text{(tensor product)} \tag{5} \]
\[ \kappa(x, y) = x + y \implies P_\kappa(G, H) = G \oplus H \quad \text{(Cartesian product)} \tag{6} \]

Supporting graph diffusions:

\[ \sigma_{\mathrm{heat}}(P_\kappa) = I + P_\kappa + \tfrac{1}{2} P_\kappa^2 + \cdots = P_{e^\kappa} \tag{7} \]
\[ \sigma_{\mathrm{von\,Neumann}}(P_\kappa) = I + P_\kappa + P_\kappa^2 + \cdots = P_{\frac{1}{1-\kappa}} \tag{8} \]

Order-insensitivity: if κ is commutative, then the SGP is commutative (up to graph isomorphism).
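A numerical check (my own sketch, not from the slides) of property (7): running heat diffusion on P_κ is the same as building the SGP with e^κ in place of κ, because the diffusion only transforms the spectrum elementwise.

```python
import numpy as np
from scipy.linalg import expm

A_G = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A_H = np.array([[0., 1.], [1., 0.]])

lam_G, V_G = np.linalg.eigh(A_G)
lam_H, V_H = np.linalg.eigh(A_H)
V_P = np.kron(V_G, V_H)                               # shared eigenvectors

kappa = np.multiply.outer(lam_G, lam_H).reshape(-1)   # kappa(x, y) = x * y

P_kappa = V_P @ np.diag(kappa) @ V_P.T                # the SGP itself
P_heat = V_P @ np.diag(np.exp(kappa)) @ V_P.T         # SGP built with e^kappa

# Heat diffusion I + P + P^2/2! + ... equals the e^kappa SGP, eq. (7).
assert np.allclose(expm(P_kappa), P_heat)
```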
Scalable Inference

For a general graph product, the semi-norm is computed as

\[ \|f\|_{\mathcal{P}}^2 = \mathrm{vec}(f)^\top \mathcal{P}^{-1}\, \mathrm{vec}(f) \tag{9} \]

For an SGP, P_κ no longer has to be explicitly computed:

\[ \|f\|_{P_\kappa}^2 = \sum_{i_1, i_2, \dots, i_J}^{n_1, n_2, \dots, n_J} \frac{f(v_{i_1}, \dots, v_{i_J})^2}{\kappa(\lambda_{i_1}, \dots, \lambda_{i_J})} \tag{10} \]

- f(v_{i_1}, v_{i_2}, ..., v_{i_J}) = f ×_1 v_{i_1} ×_2 v_{i_2} ··· ×_J v_{i_J} (mode products)
- However, even evaluating (10) is expensive: it ranges over all ∏_j n_j eigenvalue combinations.
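A numpy sketch (my own, J = 2) of evaluating (10) without ever forming P_κ: rotate f into each factor's eigenbasis, then weight the squared spectral coefficients by 1/κ; the assertion compares against the explicit quadratic form (9). The +3 eigenvalue shift is my own choice to keep κ positive.

```python
import numpy as np

rng = np.random.default_rng(0)
A_G = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A_H = np.array([[0., 1.], [1., 0.]])

lam_G, V_G = np.linalg.eigh(A_G)
lam_H, V_H = np.linalg.eigh(A_H)

f = rng.standard_normal((3, 2))        # toy score tensor

# Spectral coefficients f(v_{i1}, v_{i2}): mode products with each basis.
F = V_G.T @ f @ V_H

# Shifted product kernel so every kappa value is positive (my choice).
kappa = np.multiply.outer(lam_G + 3.0, lam_H + 3.0)

seminorm_sq = np.sum(F**2 / kappa)     # eq. (10), no P_kappa materialized

# Sanity check against the explicit quadratic form of eq. (9).
P = np.kron(V_G, V_H) @ np.diag(kappa.reshape(-1)) @ np.kron(V_G, V_H).T
vec_f = f.reshape(-1)
assert np.isclose(seminorm_sq, vec_f @ np.linalg.solve(P, vec_f))
```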
Scalable Inference
Using a low-rank SGP:

- f lies in the linear span of the eigenvectors of P.
- Eigenvectors of high volatility can be pruned away.

Figure: Eigenvectors of G (blue), H (red), and P(G, H).
Scalable Inference

Restrict f to the linear span of the "smooth" bases of P:

\[ f(\alpha) = \sum_{i_1, i_2, \dots, i_J = 1}^{d_1, d_2, \dots, d_J} \alpha_{i_1, i_2, \dots, i_J} \bigotimes_j v_{i_j} \tag{11} \]

where the core tensor α ∈ R^{d_1 × d_2 × ··· × d_J}, with d_j ≪ n_j.

The semi-norm becomes

\[ \|f(\alpha)\|_{P_\kappa}^2 = \sum_{i_1, i_2, \dots, i_J = 1}^{d_1, d_2, \dots, d_J} \frac{\alpha_{i_1, i_2, \dots, i_J}^2}{\kappa(\lambda_{i_1}, \lambda_{i_2}, \dots, \lambda_{i_J})} \tag{12} \]

We then optimize w.r.t. α instead of f. Parameter count: ∏_j n_j → ∏_j d_j.
Scalable Inference
Figure: Tucker decomposition, where α is the core tensor.
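A sketch (my own, J = 2, hypothetical sizes) of the low-rank restriction (11)-(12): keep only the d_j smoothest bases per graph and carry all the parameters in the small core tensor α. QR factors stand in for the retained eigenvectors, and the κ-values are assumed positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d1, d2 = 100, 80, 10, 8   # d_j << n_j (hypothetical sizes)

# Stand-ins for the d_j "smoothest" eigenpairs of each graph.
V1 = np.linalg.qr(rng.standard_normal((n1, d1)))[0]
V2 = np.linalg.qr(rng.standard_normal((n2, d2)))[0]
lam1 = np.linspace(2.0, 1.0, d1)
lam2 = np.linspace(2.0, 1.0, d2)

alpha = rng.standard_normal((d1, d2))   # core tensor: only d1*d2 parameters

# Eq. (11): f(alpha) = sum_{ij} alpha_ij (v_i ⊗ v_j) = V1 @ alpha @ V2^T.
f = V1 @ alpha @ V2.T

# Eq. (12): the semi-norm touches only the small core tensor.
kappa = np.multiply.outer(lam1, lam2)
seminorm_sq = np.sum(alpha**2 / kappa)

print(f.shape, alpha.size)   # (100, 80) scores from just 80 parameters
```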
Scalable Inference

Revised optimization objective:

\[ \min_{\alpha \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_J}} \; \ell_O(f(\alpha)) + \frac{\gamma}{2} \|f(\alpha)\|_{P_\kappa}^2 \tag{13} \]

Ranking loss function (a squared hinge over observed tuples O and unobserved tuples Ō):

\[ \ell_O(f) = \frac{1}{|O \times \bar{O}|} \sum_{\substack{(i_1, \dots, i_J) \in O \\ (i'_1, \dots, i'_J) \in \bar{O}}} \big( f_{i'_1 \dots i'_J} - f_{i_1 \dots i_J} \big)_+^2 \tag{14} \]

\[ \nabla_\alpha = \frac{\partial \ell_O}{\partial f} \left( \frac{\partial f_{i'_1, \dots, i'_J}}{\partial \alpha} - \frac{\partial f_{i_1, \dots, i_J}}{\partial \alpha} \right) + \gamma\, \alpha \oslash \kappa \tag{15} \]

where ⊘ denotes element-wise division by the tensor of κ-values (cf. (12)). Tensor algebra is carried out on the GPU.
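A minimal end-to-end training sketch (my own, J = 2, plain numpy rather than GPU tensor algebra) of objective (13) with the ranking loss (14) and gradients in the spirit of (15); the observed/negative index lists and the unit margin are hypothetical additions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d1, d2 = 50, 40, 8, 6
gamma, lr = 0.1, 0.5

# Stand-ins for the retained eigenpairs of the two graphs.
V1 = np.linalg.qr(rng.standard_normal((n1, d1)))[0]
V2 = np.linalg.qr(rng.standard_normal((n2, d2)))[0]
kappa = np.multiply.outer(np.linspace(2.0, 1.0, d1),
                          np.linspace(2.0, 1.0, d2))

observed = [(0, 1), (3, 5), (7, 2)]    # tuples in O (hypothetical)
negatives = [(4, 4), (9, 0), (1, 3)]   # tuples sampled from O-bar

alpha = np.zeros((d1, d2))
for step in range(500):
    f = V1 @ alpha @ V2.T                      # scores f(alpha), eq. (11)
    g = np.zeros_like(alpha)
    for i, j in observed:
        for p, q in negatives:
            # Unit margin added (my choice) so the toy model learns from
            # the all-zero initialization; eq. (14) is the margin-free case.
            slack = 1.0 + f[p, q] - f[i, j]
            if slack > 0:                      # squared hinge is active
                # df[p,q]/dalpha - df[i,j]/dalpha are rank-1 outer products.
                g += 2 * slack * (np.outer(V1[p], V2[q])
                                  - np.outer(V1[i], V2[j]))
    g /= len(observed) * len(negatives)        # the |O x O-bar| average
    alpha -= lr * (g + gamma * alpha / kappa)  # gradient step on eq. (13)

f = V1 @ alpha @ V2.T
print(f[0, 1] - f[4, 4])   # observed tuple now outscores this negative
```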
Empirical Evaluation
Datasets:

- Enzyme: 445 compounds, 664 proteins.
- DBLP: 34K authors, 11K papers, 22 venues.

Representative baselines:

- TF/GRTF: Tensor Factorization / Graph-Regularized TF
- NN: one-class Nearest Neighbor
- RSVM: Ranking SVMs
- LTKM: Low-Rank Tensor Kernel Machines
Empirical Evaluation
Our method: “TOP” (blue).
Figure: Performance on Enzyme (above) and DBLP (below). Each panel plots MAP, AUC, or Hits@5 (%) against training size (12.5%, 25%, 50%, and 100%) for TOP, LTKM, NN, RSVM, TF, and GRTF.
Summary
Contributions:

- A unified framework for integrating heterogeneous information from multiple graphs.
- Transductive learning that leverages both labeled data (sparse) and unlabeled data (massive).
- A convex approximation enabling scalable inference over the combinatorially many candidate tuples.

Future/Ongoing Work:

- Learning structured associations.
- Larger problems: Microsoft Academic Graph (37 GB).
Thank You