Representation and Generation of Molecular Graphs

Representation and Generation of Molecular Graphs Tommi Jaakkola MIT in collaboration with Wengong Jin, Vikas Garg, Regina Barzilay

Transcript of Representation and Generation of Molecular Graphs

Page 1: Representation and Generation of Molecular Graphs

Representation and Generation of Molecular Graphs

Tommi Jaakkola, MIT

in collaboration with Wengong Jin, Vikas Garg, Regina Barzilay

Page 2: Representation and Generation of Molecular Graphs

Molecules = richly annotated graphs

‣ E.g., antibiotic (cephalosporin)

‣ Together, the features give rise to various molecular properties (e.g., solubility, toxicity, etc.)

[Figure: a molecular graph annotated with node labels, edge labels, substructures (motifs), and 3D information]

Page 3: Representation and Generation of Molecular Graphs

Why interesting for ML?

‣ Rich, complex objects: molecules are complicated structures, and properties may depend on intricate features
‣ Data: big and small, heterogeneous
‣ Estimation/inferential challenges: many high-impact but non-trivial tasks such as prediction of chemical properties, molecular optimization, etc.

(Daptomycin antibiotic)

Page 4: Representation and Generation of Molecular Graphs

Our motivation: this talk

‣ MIT Consortium (https://mlpds.mit.edu/)
  - 13 major pharmaceutical companies
  - chemistry/chemical engineering (Jensen, Green, Jamison)
  - computer science (Barzilay, Jaakkola)

‣ Deeper into known chemistry
  - extract chemical knowledge from journals, notebooks

‣ Deeper into molecules
  - molecular property prediction (e.g., toxicity, bioactivity)
  - (multi-criteria) optimization (e.g., potency, toxicity)

‣ Deeper into reactions
  - forward synthesis prediction (major products of reactions)
  - forward synthesis optimization (conditions, reagents, etc.)

‣ Deeper into making things
  - retrosynthetic planning (efficient/inexpensive routes)

Page 5: Representation and Generation of Molecular Graphs

Our motivation: broader context

‣ MIT Consortium (https://mlpds.mit.edu/)
  - 14 major pharmaceutical companies
  - chemistry/chemical engineering (Jensen, Green, Jamison)
  - computer science (Barzilay, Jaakkola)

‣ Deeper into known chemistry
  - extract chemical knowledge from journals, notebooks

‣ Deeper into molecules
  - molecular property prediction (e.g., toxicity, bioactivity)
  - (multi-criteria) optimization (e.g., potency, toxicity)

‣ Deeper into reactions
  - forward synthesis prediction (major products of reactions)
  - forward synthesis optimization (conditions, reagents, etc.)

‣ Deeper into making things
  - retrosynthetic planning (efficient/inexpensive routes)

Page 6: Representation and Generation of Molecular Graphs

Automating drug design

‣ Our problem: how to programmatically modify precursor molecules to have better properties

‣ Key challenges:
  1. representation and prediction: learn to predict molecular properties
  2. generation and optimization: realize target molecules with better properties programmatically
  3. understanding: uncover principles (or diagnose errors) underlying complex predictions

[Figure: precursor molecules modified to meet design specs]

Page 7: Representation and Generation of Molecular Graphs

Graph neural networks (GNNs)

‣ GNNs are parameterized message passing algorithms operating on molecular graphs, and result in atom, bond, and graph embeddings tailored for the end task
‣ Many recent results about representational power (Xu et al. 2019, Sato et al. 2019, Maron et al. 2019, …)

Iterative embedding, then readout for prediction (solubility, toxicity, bioactivity, etc.):

h_u^{t+1} = UPDATE( h_u^t, x_u, AGGREGATE({(h_v^t, x_{uv})}_{v ∈ N(u)}) )

h_G^T = READOUT({h_u^T})
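The update/readout equations above can be sketched numerically. The following is a minimal illustration, not the talk's actual architecture: weights are random placeholders (a trained model would learn them), with sum aggregation over neighbors and sum readout over nodes.

```python
import numpy as np

def gnn_embed(adj, x, num_layers=3, dim=8, seed=0):
    """Sketch of h_u^{t+1} = UPDATE(h_u^t, x_u, AGGREGATE over N(u)).

    adj: (n, n) 0/1 adjacency matrix; x: (n, d) node features.
    Weights are fixed random placeholders, not learned.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    w_in = 0.1 * rng.standard_normal((d, dim))    # embeds raw node features
    w_agg = 0.1 * rng.standard_normal((dim, dim)) # mixes aggregated messages
    h = np.maximum(x @ w_in, 0.0)                 # initial embeddings h^0
    for _ in range(num_layers):
        agg = adj @ h                                  # AGGREGATE: sum over N(u)
        h = np.maximum(x @ w_in + agg @ w_agg, 0.0)    # UPDATE(h_u, x_u, agg)
    return h.sum(axis=0)                               # READOUT: sum over nodes

# A toy 3-node path graph with 2-dimensional node features:
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(gnn_embed(adj, x).shape)  # (8,)
```

A real molecular GNN would additionally condition the messages on bond features x_{uv} and use a task-specific readout head.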

Page 8: Representation and Generation of Molecular Graphs

Representational power of “GNNs”

the set of all n-node graphs

Generalization and Representational Limits of Graph Neural Networks

Proposition 2. LU-GNNs with sum/average/max readout cannot decide the following graph properties: (a) girth (length of shortest cycle), (b) circumference (length of longest cycle), (c) diameter (maximum distance, in terms of shortest path, between any pair of nodes in the graph), (d) radius (minimum node eccentricity, where eccentricity of a node u is defined as the maximum distance from u to other vertices), (e) conjoint cycle (two cycles that share an edge), (f) total number of cycles, and (g) k-clique (a subgraph of at least k ≥ 3 vertices such that each vertex in the subgraph is connected by an edge to any other vertex in the subgraph). These results also hold for CPNGNN models.

[Figure: the 4-cycle S4 with node labels A, B, C, D, and the 8-cycle S8 with node labels A1, B1, C1, D1, A2, B2, C2, D2]

Proof. To argue about girth and circumference, we construct a pair of graphs that have cycles of different length but produce the same output embedding via the readout function. Specifically, we show that LU-GNNs cannot distinguish a graph having cycles of length n from a cycle of length 2n. We first construct a counterexample for average and max readout functions for n = 4. Let {A, B, C, D} be the set of node labels. Consider a cycle S4 = ABCDA (of length 4), and a cycle S8 = A1B1C1D1A2B2C2D2A1 (of length 8), where X1 and X2 are copies of X ∈ {A, B, C, D} (i.e., have the same node feature vectors as X). We also enforce that edge e(Xℓ, Yℓ′) is a copy of e(X, Y), for ℓ, ℓ′ ∈ {1, 2}, whenever Xℓ is a copy of X and Yℓ′ is a copy of Y.

Additionally, for CPNGNN (Sato et al., 2019), we first find a consistent port numbering for S4 using Algorithm 3 in (Sato et al., 2019). Thus, we find a function p which satisfies the following. For each edge (X, Y) in S4, we obtain a pair of port numbers (i, j) with i ∈ {1, 2, …, degree(X)} and j ∈ {1, 2, …, degree(Y)} such that p(X, i) = (Y, j) and p(p(X, i)) = (X, i). For each edge (Xℓ, Yℓ′) in S8, we associate the same pair of ports (i, j) as for (X, Y) in S4. This ensures that both S4 and S8 have locally isomorphic consistent port-ordered neighborhoods.

We first prove by induction that LU-GNNs and CPNGNNs produce identical node embeddings for locally isomorphic graphs. Consider any such GNN with L + 1 layers parameterized by the sequence θ_{1:L+1} ≜ (θ_1, …, θ_L, θ_{L+1}).

Fix any X ∈ {A, B, C, D}. Since the neighbors and corresponding port pairs of X in S4 are identical copies of those of X1 and X2 in S8, the updated embeddings X(θ_1), X1(θ_1), and X2(θ_1) output by the first layer are all identical. Assume that the node embeddings X(θ_{1:L}), X1(θ_{1:L}), and X2(θ_{1:L}) are identical. Since the neighbors of X in S4 are locally isomorphic to neighbors of X1 (and X2) in S8, this immediately implies that (for any θ_{L+1}) we must have X(θ_{1:L+1}) = X1(θ_{1:L+1}) = X2(θ_{1:L+1}).

Thus, the average embedding vector for S4 is the same as for S8 (and likewise for their respective maximum embeddings) even though girth as well as circumference is equal to 4 for S4, and equal to 8 for S8. Moreover, note that diameter and radius of S4 are both equal to 2. In contrast, diameter (and radius) of S8 is 4. For the sum function, we simply compare S8 with a graph that has two disjoint copies of S4 as its components. To edges in both these components, we assign the corresponding port numbers from S4. Doing so does not change the girth and circumference. However, since the graph with two components is disconnected, its diameter (and radius) is ∞.

Our choice of θ_{1:L+1} was arbitrary, so properties (a)-(d) follow immediately for any LU-GNN and CPNGNN with sum, average, or maximum readout.

[Figure: graph G1, two cycles on nodes A, B, C, D sharing an edge (a conjoint cycle), and graph G2 on nodes A1, B1, C1, D1, A2, B2, C2, D2 with no conjoint cycle]

For properties (e)-(g), we craft another pair of graphs that differ in the specified properties but cannot be discriminated. The main idea is to replicate the effect of the common edge in the conjoint cycle (graph G1) via two identical subgraphs (of graph G2 that does not have any conjoint cycle) that are cleverly aligned to reproduce the local port-ordered neighborhoods and thus present the same view to each node (see the adjoining figure). Note that the number of cycles in G1 is 3 (ABCDA, ABDA, BCDB), while the corresponding number in G2 is 6 (A1B1C1D1A1, A2B2C2D2A2, A1B1D2A2B2D1A1, A1B1D2C2B2D1A1, A2B2D1C1B1D2A2, D1C1B1D2C2B2), and thus
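The S4/S8 construction can be checked numerically. Below is a sketch, not the paper's exact model: a locally-unordered GNN with sum aggregation, mean readout, random fixed weights, and one-hot features for the labels A, B, C, D. The 4-cycle and the 8-cycle receive identical graph embeddings even though girth, circumference, diameter, and radius all differ.

```python
import numpy as np

def cycle_adj(n):
    """Adjacency matrix of an n-node cycle."""
    a = np.zeros((n, n))
    for i in range(n):
        a[i, (i + 1) % n] = a[(i + 1) % n, i] = 1.0
    return a

def embed(adj, x, w_in, w_agg, num_layers=3):
    """Message passing with sum aggregation, then mean readout."""
    h = np.maximum(x @ w_in, 0.0)
    for _ in range(num_layers):
        h = np.maximum(x @ w_in + (adj @ h) @ w_agg, 0.0)  # sum over neighbors
    return h.mean(axis=0)                                   # average readout

rng = np.random.default_rng(0)
w_in = rng.standard_normal((4, 8))       # shared (fixed) weights for both graphs
w_agg = rng.standard_normal((8, 8))

labels = np.eye(4)                       # one-hot features for A, B, C, D
x4 = labels                              # S4: A B C D around a 4-cycle
x8 = np.vstack([labels, labels])         # S8: A B C D A B C D around an 8-cycle

e4 = embed(cycle_adj(4), x4, w_in, w_agg)
e8 = embed(cycle_adj(8), x8, w_in, w_agg)
print(np.allclose(e4, e8))               # True, yet girth is 4 vs 8
```

Every node in S8 sees the same labeled neighborhood as its copy in S4, so by the induction in the proof the node embeddings coincide copy-wise, and the mean readout cannot separate the two graphs.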

Page 9: Representation and Generation of Molecular Graphs

Representational power of “GNNs”

the set of all n-node graphs

1-WL equivalence class


Page 10: Representation and Generation of Molecular Graphs

Representational power of “GNNs”

the set of all n-node graphs

1-WL equivalence class


Page 11: Representation and Generation of Molecular Graphs

Representational power of “GNNs”

the set of all n-node graphs

1-WL equivalence class


{h_1, …, h_n}    {h'_{π(1)}, …, h'_{π(n)}}

GNN/GIN

Page 12: Representation and Generation of Molecular Graphs

Representational power of “GNNs”

the set of all n-node graphs

1-WL equivalence class


{h_1, …, h_n}    {h'_{π(1)}, …, h'_{π(n)}}

GNN/GIN

identical node embeddings as multi-sets

[Xu et al. 2019][Garg et al. 2020]
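The connection to 1-WL [Xu et al. 2019] can be illustrated with hash-based color refinement, a sketch under the same S4/S8 labeling as above: after refinement, the 8-cycle's color multiset is exactly two copies of the 4-cycle's, so anything at most as powerful as 1-WL cannot separate them.

```python
from collections import Counter

def cycle(n):
    """Neighbor lists of an n-node cycle."""
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]

def wl_refine(neighbors, labels, rounds=3):
    """1-WL color refinement: repeatedly hash (own color, sorted neighbor colors)."""
    colors = list(labels)
    for _ in range(rounds):
        colors = [
            hash((colors[u], tuple(sorted(colors[v] for v in neighbors[u]))))
            for u in range(len(neighbors))
        ]
    return colors

s4 = wl_refine(cycle(4), ["A", "B", "C", "D"])
s8 = wl_refine(cycle(8), ["A", "B", "C", "D"] * 2)

# S8's color multiset is two disjoint copies of S4's: 1-WL sees the
# 8-cycle and two 4-cycles (or, with mean readout, one) as the same.
print(Counter(s8) == Counter(s4 + s4))  # True
```

Each copy Xℓ in S8 receives the same color as X in S4 at every round, which is the multiset statement on this slide in combinatorial form.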

Page 13: Representation and Generation of Molecular Graphs

Representational power of “GNNs”

‣ Indistinguishable graph features
  - shortest/longest cycle, radius
  - presence of conjoint cycle
  - number of cycles, k-clique
  - etc.

the set of all n-node graphs

1-WL equivalence class

Proposition 2. LU-GNNs with sum/average/max readout

cannot decide the following graph properties: (a) girth

(length of shortest cycle), (b) circumference (length of

longest cycle), (c) diameter (maximum distance, in terms

of shortest path, between any pair of nodes in the graph),

(d) radius (minimum node eccentricity, where eccentricity

of a node u is defined as the maximum distance from u to

other vertices), (e) conjoint cycle (two cycles that share an

edge), (f) total number of cycles, and (g) k-clique (a sub-

graph of at least k � 3 vertices such that each vertex in the

subgraph is connected by an edge to any other vertex in the

subgraph). These results also hold for CPNGNN models.

A B

CD

A B

CD

A1 B1 C1 D1

D2 C2 B2 A2

S4

S4

S8

Proof. To argue about girth and circumference, we con-struct a pair of graphs that have cycles of different lengthbut produce the same output embedding via the readoutfunction. Specifically, we show that LU-GNNs cannot de-cide a graph having cycles of length n from a cycle oflength 2n. We first construct a counterexample for aver-age and max readout functions for n = 4. Let {A, B, C, D}be the set of node labels. Consider a cycle S4 = ABCDA

(of length 4), and a cycle S8 = A1B1C1D1A2B2C2D2A1

(of length 8), where X1 and X2 are copies of X 2{A,B,C,D} (i.e. have the same node feature vectors asX). We also enforce that edge e(X`, Y`0) is a copy ofe(X,Y ), for `, `0 2 {1, 2}, whenever X` is a copy of Xand Y`0 is a copy of Y .

Additionally, for CPNGNN (Sato et al., 2019), we first finda consistent port numbering for S4 using Algorithm 3 in(Sato et al., 2019). Thus, we find a function p which satis-fies the following. For each edge (X,Y ) in S4, we obtain apair of port numbers (i, j) with i 2 {1, 2, . . . , degree(X)}and j 2 {1, 2, . . . , degree(Y )} such that p(X, i) = (Y, j)and p(p(X, i)) = (X, i). For each edge (X`, Y`0) in S8, weassociate the same pair of ports (i, j) as for (X,Y ) in S4.This ensures that both S4 and S8 have locally isomorphicconsistent port-ordered neighborhoods.

We first prove by induction that LU-GNNs and CPNGNNsproduce identical node embeddings for locally isomorphicgraphs. Consider any such GNN with L + 1 layers pa-rameterized by the sequence ✓1:L+1 , (✓1, . . . , ✓L, ✓L+1).

Fix any X 2 {A,B,C,D}. Since the neighbors and cor-responding port pairs of X in S4 are identical copies ofthose of X1 and X2 in S8, the updated embeddings X(✓1),X1(✓1), and X2(✓1) output by the first layer are all identi-cal. Assume that the node embeddings X(✓1:L), X1(✓1:L),and X2(✓1:L) are identical. Since the neighbors of X in S4

are locally isomorphic to neighbors of X1 (and X2) in S8,this immediately implies that (for any ✓L+1) we must haveX(✓1:L+1) = X1(✓1:L+1) = X2(✓1:L+1).

Thus, the average embedding vector for S4 is the same as for S8 (and likewise for their respective maximum embeddings) even though the girth and circumference are both equal to 4 for S4, and equal to 8 for S8. Moreover, note that the diameter and radius of S4 are both equal to 2. In contrast, the diameter (and radius) of S8 is 4. For the sum function, we simply compare S8 with a graph that has two disjoint copies of S4 as its components. To the edges in both these components, we assign the corresponding port numbers from S4. Doing so does not change the girth and circumference. However, since the graph with two components is disconnected, its diameter (and radius) is ∞.

Our choice of θ_{1:L+1} was arbitrary, so properties (a)-(d) follow immediately for any LU-GNN and CPNGNN with sum, average, or maximum readout.
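The S4-versus-S8 counterexample can be checked numerically. A minimal sketch, assuming a generic sum-aggregation message-passing update (the weight matrices, nonlinearity, and layer count are illustrative, not the specific architectures from the proof): because every node with a given label sees the same multiset of neighbor labels in both graphs, the mean- and max-pooled readouts coincide exactly.

```python
import numpy as np

def gnn_readout(adj, feats, layers=3, seed=0):
    """Run a toy locally-updated GNN (sum aggregation over neighbors)
    and return mean- and max-pooled graph embeddings."""
    rng = np.random.default_rng(seed)  # same weights for every graph
    d = feats.shape[1]
    h = feats.astype(float)
    for _ in range(layers):
        w_self = rng.standard_normal((d, d))
        w_nbr = rng.standard_normal((d, d))
        h = np.tanh(h @ w_self + adj @ h @ w_nbr)
    return h.mean(axis=0), h.max(axis=0)

def cycle_adj(n):
    """Adjacency matrix of an undirected n-cycle."""
    a = np.zeros((n, n))
    for i in range(n):
        a[i, (i + 1) % n] = a[(i + 1) % n, i] = 1.0
    return a

# S4 = ABCDA: one node per label; S8: two copies of each label in a cycle.
feats4 = np.eye(4)
feats8 = np.vstack([np.eye(4), np.eye(4)])

mean4, max4 = gnn_readout(cycle_adj(4), feats4)
mean8, max8 = gnn_readout(cycle_adj(8), feats8)
assert np.allclose(mean4, mean8) and np.allclose(max4, max8)
```

The assertions pass for any weight draw, mirroring the inductive argument: node embeddings depend only on the label and its (identical) local neighborhood.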

[Figure: the cycles S4 (nodes A, B, C, D) and S8 (nodes A1, B1, C1, D1, A2, B2, C2, D2), and the graphs G1 (with a conjoint cycle) and G2 (without one).]

For properties (e)-(g), we craft another pair of graphs that differ in the specified properties but cannot be discriminated. The main idea is to replicate the effect of the common edge in the conjoint cycle (graph G1) via two identical subgraphs (of graph G2, which does not have any conjoint cycle) that are cleverly aligned to reproduce the local port-ordered neighborhoods and thus present the same view to each node (see the adjoining figure). Note that the number of cycles in G1 is 3 (ABCDA, ABDA, BCDB), while the corresponding number in G2 is 6 (A1B1C1D1A1, A2B2C2D2A2, A1B1D2A2B2D1A1, A1B1D2C2B2D1A1, A2B2D1C1B1D2A2, D1C1B1D2C2B2), and thus

[Figure: two distinct graphs whose node embeddings {h1, ..., hn} and {h′π1, ..., h′πn} are identical as multi-sets under GNN/GIN; Xu et al. 2019, Garg et al. 2020]

[Garg et al. 2020]

Page 14: Representation and Generation of Molecular Graphs

Structural motifs: polymer


Page 21: Representation and Generation of Molecular Graphs

Structural motifs: polymer

‣ This is a simple, two-level hierarchy; the motif graph does not encode how the substructures are attached to each other

Page 22: Representation and Generation of Molecular Graphs

Structural motifs: polymer

‣ This is a simple, two-level hierarchy; the motif graph does not encode how the substructures are attached to each other

‣ We extend this to a three-level hierarchical representation for each molecule

Page 23: Representation and Generation of Molecular Graphs

Fine-to-coarse graph encoding

[Jin et al. 2020]

Page 24: Representation and Generation of Molecular Graphs

Is hierarchy helpful?

‣ A simple example on solubility; ESOL dataset (averaged over 5 folds)

ESOL RMSE (lower is better):
  MPNN with basic atom features            1.11
  MPNN with rich substructure features     0.69
  Hier-MPNN with basic features            0.65

Page 25: Representation and Generation of Molecular Graphs

New Antibiotic Discovery

‣ If we can accurately predict molecular properties, we can screen (select and repurpose) molecules from a large candidate set

‣ Antibiotic Discovery [Stokes et al., 2020]
- Trained a model to predict growth inhibition against E. coli
- Data: ~2,000 measured compounds from the Broad Institute
- Screened ~100 million possible compounds
- Tested the 15 highest-scoring molecules in the lab
- 7 of them were validated to be inhibitory in vitro
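The screening workflow above reduces to ranking a candidate library by a learned property predictor and sending the top scorers to the lab. A minimal sketch; the `screen` helper, the candidate list, and the scoring function are hypothetical stand-ins for the trained model and the real 100M-compound library:

```python
def screen(candidates, predict, k=3):
    """Rank candidate molecules by predicted property score and
    return the top-k for experimental validation."""
    return sorted(candidates, key=predict, reverse=True)[:k]

# Toy example: SMILES strings scored by a stand-in predictor
# (a lookup table here; a trained GNN in practice).
library = {"CCO": 0.2, "c1ccccc1": 0.9, "CC(=O)O": 0.5, "CCN": 0.7}
top = screen(list(library), predict=library.get, k=2)
# top == ["c1ccccc1", "CCN"]
```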

Page 26: Representation and Generation of Molecular Graphs

Automating drug design

‣ Our problem: how to programmatically modify precursor molecules to have better properties

‣ Key challenges:

1. representation and prediction: learn to predict molecular properties

2. generation and optimization: realize target molecules with better properties programmatically

3. understanding: uncover principles (or diagnose errors) underlying complex predictions


Page 27: Representation and Generation of Molecular Graphs

Optimization as graph translation

‣ Goal: learn to turn precursor molecules into molecules that satisfy given design specification(s)

‣ The training set consists of (source, target) molecular pairs

‣ Key challenge: molecule generation

[Figure: a source molecule X is encoded to a latent representation and decoded into a target molecule Y]

Page 28: Representation and Generation of Molecular Graphs

Coarse-to-fine graph generation

‣ We realize graphs auto-regressively, in a coarse-to-fine manner, one substructure at a time

[Jin et al. 2020]


Page 32: Representation and Generation of Molecular Graphs

Does the hierarchy help?

‣ Example with polymers: 86K molecules (76K/5K/5K train/validation/test split)

[Jin et al. 2020]

Page 33: Representation and Generation of Molecular Graphs

Graph generation: diversity

‣ Goal: learn to turn precursor molecules into molecules that satisfy given design specification(s)

‣ We'd like to generate a diverse set of candidate molecules that satisfy the criteria

[Figure: a source X is encoded, combined with a latent z ~ P(z) for diversity, and decoded into candidate molecules Y]

[Jin et al. 2019,2020]

Page 34: Representation and Generation of Molecular Graphs

Example results

‣ Single-property optimization: DRD2 success % (from inactive to active)

  JT-VAE     3.4
  GCPN       4.4
  MMPA      46.4
  Seq2Seq   75.9
  JT-G2G    77.8
  AtomG2G   75.8
  HierG2G   85.9

[Jin et al. 2020]

Page 35: Representation and Generation of Molecular Graphs

Example results

‣ Single-property optimization: QED success % (QED > 0.9)

  JT-VAE     8.8
  GCPN       9.4
  MMPA      32.9
  Seq2Seq   58.5
  JT-G2G    59.9
  AtomG2G   73.6
  HierG2G   76.9

[Jin et al. 2020]

Page 36: Representation and Generation of Molecular Graphs

Inverse design challenge

‣ Many examples of molecules with a particular property

‣ Few instances of molecules that satisfy multiple (esp. new) property combinations

‣ Challenge: How do we realize a diverse distribution of molecules that satisfy all the criteria without any examples of such molecules?

[Figure: an unknown molecule that must inhibit both GSK3β and JNK3]

Page 37: Representation and Generation of Molecular Graphs

Our strategy

‣ Step 1: rationale extraction
- carve out candidate substructures — rationales — likely responsible for each molecular property

‣ Step 2: multi-rationale assembly
- learn to assemble the pieces together into a complete molecule that satisfies all the properties

[Embedded paper excerpt shown on the slide:]

Composing Molecules with Multiple Property Constraints

Abstract

Drug discovery aims to find novel compounds with specified chemical property profiles. In terms of generative modeling, the goal is to learn to sample molecules in the intersection of multiple property constraints. This task becomes increasingly challenging when there are many property constraints. We propose to offset this complexity by composing molecules from a vocabulary of substructures that we call molecular rationales. These rationales are identified from molecules as substructures that are likely responsible for each property of interest. We then learn to expand rationales into a full molecule using graph generative models. Our final generative model composes molecules as mixtures of multiple rationale completions, and this mixture is fine-tuned to preserve the properties of interest. We evaluate our model on various drug design tasks and demonstrate significant improvements over state-of-the-art baselines in terms of accuracy, diversity, and novelty of generated compounds.

1. Introduction

The key challenge in drug discovery is to find molecules that satisfy multiple constraints, from potency and safety to desired metabolic profiles. Optimizing these constraints simultaneously is challenging for existing computational models. The primary difficulty lies in the lack of training instances of molecules that conform to all the constraints. For this reason, for example, Jin et al. (2019a) report over 60% performance loss when moving beyond the single-constraint setting.

In this paper, we propose a novel approach to multi-property molecular optimization. Our strategy is inspired by fragment-based drug discovery (Murray & Rees, 2009), often followed by medicinal chemists. The idea is to start with substructures (e.g., functional groups or larger pieces) that drive specific properties of interest, and then combine these building blocks into a target molecule. To automate this process, our model has to learn two complementary tasks illustrated in Figure 1: (1) identification of the building blocks that we call rationales, and (2) assembling multiple rationales together into a fully formed target molecule. In contrast to competing methods, our generative model does not build molecules from scratch, but instead assembles them from automatically extracted rationales already implicated for specific properties (see Figure 1).

Figure 1. Illustration of our rationale-based generative model. To generate a dual inhibitor against biological targets GSK3β and JNK3, our model first identifies rationale substructures S for each property, and then learns to compose them into a full molecule G. Note that rationales are not provided as domain knowledge.

We implement this idea using a generative model of molecules where the rationale choices play the role of latent variables. Specifically, a molecular graph G is generated from underlying rationale sets S according to:

P(G) = Σ_S P(G|S) P(S)    (1)

As ground-truth rationales (e.g., functional groups or subgraphs) are not provided, the model has to extract candidate rationales from molecules with the help of a property predictor. We formulate this task as a discrete optimization problem efficiently solved by Monte Carlo tree search. Our rationale-conditioned graph generator, P(G|S), is initially
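Eq. (1) corresponds to ancestral sampling: draw a rationale set S from P(S), then a completion G from P(G|S). A schematic sketch; the rationale distribution and the completion function here are toy placeholders, not the paper's learned models:

```python
import random

def sample_molecule(rationale_probs, complete, rng):
    """Ancestral sample from P(G) = sum_S P(G|S) P(S):
    first draw S ~ P(S), then G ~ P(G|S)."""
    rationales = list(rationale_probs)
    weights = [rationale_probs[s] for s in rationales]
    s = rng.choices(rationales, weights=weights, k=1)[0]
    return complete(s, rng)

# Toy completion: decorate the rationale with a random peripheral group.
def complete(rationale, rng):
    return rationale + rng.choice(["-CH3", "-OH", "-NH2"])

rng = random.Random(0)
mol = sample_molecule({"rationale_A": 0.7, "rationale_B": 0.3}, complete, rng)
assert mol.startswith("rationale_")
```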


[Jin et al. 2020]


Page 39: Representation and Generation of Molecular Graphs

Step 1: rationale extraction

‣ We can use Monte Carlo tree search (MCTS) to remove all parts of the molecule not relevant for the property (according to a given property predictor)


…et al. (2017), each edge (s, a) stores the following statistics:

• N(s, a) is the visit count of deletion a, which is used for the exploration-exploitation tradeoff in the search process.

• W(s, a) is the total action value, which indicates how likely the deletion a will lead to a good rationale.

• R(s, a) = r_i(s′) is the predicted property score of the new subgraph s′ derived from deleting a from s.

Guided by these statistics, MCTS searches for rationales over multiple iterations. Each iteration consists of two phases:

1. Forward pass: Select a path s_0, ..., s_L from the root s_0 to a leaf state s_L with fewer than N atoms and evaluate its property score r_i(s_L). At each state s_k, a deletion a_k is selected according to the statistics in the search tree:

a_k = argmax_a [ W(s_k, a) / N(s_k, a) + U(s_k, a) ]    (5)

U(s_k, a) = c_puct · R(s_k, a) · sqrt(Σ_b N(s_k, b)) / (1 + N(s_k, a))    (6)

where c_puct determines the level of exploration. This search strategy is a variant of the PUCT algorithm (Rosin, 2011). It initially prefers to explore deletions with high R(s, a) and low visit count, but asymptotically prefers deletions that are likely to lead to good rationales.

2. Backward pass: The edge statistics are updated for each state s_k. Specifically, N(s_k, a_k) ← N(s_k, a_k) + 1 and W(s_k, a_k) ← W(s_k, a_k) + r_i(s_L).

In the end, we collect all the leaf states s with r_i(s) ≥ δ_i and add them to the rationale vocabulary V_S^i.
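The selection rule of Eqs. (5)-(6) and the backward update can be sketched in a few lines. A simplified illustration, assuming edge statistics stored as plain dictionaries (the tree machinery and the property predictor are omitted):

```python
import math

def select_deletion(stats, c_puct=1.0):
    """PUCT selection (Eqs. 5-6): pick the deletion a maximizing
    W/N + c_puct * R * sqrt(sum_b N(s,b)) / (1 + N)."""
    total = sum(e["N"] for e in stats.values())
    def score(a):
        e = stats[a]
        q = e["W"] / e["N"] if e["N"] > 0 else 0.0
        u = c_puct * e["R"] * math.sqrt(total) / (1 + e["N"])
        return q + u
    return max(stats, key=score)

def backup(path, reward):
    """Backward pass: N <- N + 1 and W <- W + r_i(s_L) along the path."""
    for edge in path:
        edge["N"] += 1
        edge["W"] += reward

# A state with two candidate deletions: a well-visited one and an
# unvisited one with a high predicted property score R.
stats = {"del_bond": {"N": 10, "W": 9.0, "R": 0.1},
         "del_ring": {"N": 0, "W": 0.0, "R": 0.9}}
assert select_deletion(stats) == "del_ring"  # exploration term wins first
backup([stats["del_ring"]], reward=0.8)
assert stats["del_ring"]["N"] == 1 and stats["del_ring"]["W"] == 0.8
```

As in the text, the unvisited high-R deletion is explored first; as counts grow, the W/N term dominates and the search concentrates on deletions that actually led to good rationales.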

Multi-property rationale: For a set of M properties, we can similarly define a rationale S^[M] by imposing all M property constraints at the same time, namely

∀i: r_i(S^[M]) ≥ δ_i,  i = 1, ..., M

In principle, we can apply MCTS to extract rationales from molecules that satisfy all the property constraints. However, in many cases no such molecules are available. As a result, we propose to construct rationales from the single-property rationales for each property i. Specifically, each multi-property rationale S^[M] is a disconnected graph with M connected components S_1, ..., S_M:

V_S^[M] = { S^[M] = S_1 ⊕ ⋯ ⊕ S_M | S_i ∈ V_S^i }    (7)

where ⊕ means concatenating the two graphs S_i and S_j. For notational convenience, we denote both single- and multi-property rationales as S. In the rest of the paper, S is a rationale graph with one or multiple connected components.
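Eq. (7) builds the multi-property vocabulary as every combination of one rationale per property. A minimal sketch, with rationales represented as strings and each multi-property rationale as a tuple of disconnected components:

```python
from itertools import product

def multi_property_vocab(single_vocabs):
    """V_S^[M] (Eq. 7): all concatenations S_1 (+) ... (+) S_M with one
    rationale drawn from each property's single-property vocabulary."""
    return list(product(*single_vocabs))

# Toy single-property vocabularies (hypothetical rationale names).
v_gsk3b = ["S_g1", "S_g2"]
v_jnk3 = ["S_j1", "S_j2", "S_j3"]
vocab = multi_property_vocab([v_gsk3b, v_jnk3])
assert len(vocab) == 6 and ("S_g1", "S_j2") in vocab
```

The vocabulary size is the product of the single-property vocabulary sizes, which is how the three-property experiment below ends up with thousands of extracted rationales.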

3.2. Graph Completion

This module is a variational autoencoder which completes a full molecule G given a rationale S. Since each rationale S can be realized into many different molecules, we introduce a latent variable z to generate diverse outputs:

P(G|S) = ∫_z P(G|S, z) P(z) dz    (8)

where P(z) is the prior distribution. Different from standard graph generation, our graph decoder must generate graphs that contain the subgraph S. Our VAE architecture is adapted from existing atom-by-atom generative models (You et al., 2018b; Liu et al., 2018) to incorporate the subgraph constraint. For completeness, we present our architecture here.

Figure 4. Illustration of Monte Carlo tree search for molecules. Peripheral bonds and rings are highlighted in red. In the forward pass, the model deletes from each state the peripheral bond or ring that has maximum Q + U value. In the backward pass, the model updates the statistics of each state.

Encoder: Our encoder is a message passing network (MPN) which learns the approximate posterior Q(z|G, S) for variational inference. Let e(a_u) be the embedding of atom u with atom type a_u, and e(b_uv) be the embedding of bond (u, v) with bond type b_uv. The MPN computes atom representations {h_v | v ∈ G}:

{h_v} = MPN_e(G, {e(a_u)}, {e(b_uv)})    (9)

For simplicity, we denote the MPN encoding process as MPN(·), which is detailed in the appendix. The atom vectors are aggregated to represent G as a single vector h_G = Σ_v h_v. Finally, we sample the latent vector z_G from Q(z|G, S) with mean μ(h_G) and log variance Σ(h_G):

z_G = μ(h_G) + exp(Σ(h_G)) · ε;  ε ~ N(0, I)    (10)

Decoder: The decoder generates the molecule G according to its breadth-first order. In each step, the model generates a
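Eq. (10) is the standard reparameterization trick. A minimal numerical sketch; following the paper's exp(Σ(·)) scaling, Σ(h_G) is treated here as a log standard deviation (the mean and variance networks are replaced by fixed arrays):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """z_G = mu(h_G) + exp(Sigma(h_G)) * eps with eps ~ N(0, I) (Eq. 10).
    The draw is differentiable in mu and log_sigma, so the VAE can
    backpropagate through the sampling step."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
# With a very small log-scale the sample collapses onto the mean.
z = reparameterize(mu, np.full(3, -10.0), rng)
assert np.allclose(z, mu, atol=1e-3)
```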

[Jin et al. 2020]

Page 40: Representation and Generation of Molecular Graphs

Step 2: multi-rationale assembly

‣ Pre-train assembler: learn a graph completion model P(G|S) that can expand any substructure S into a complete molecule G

‣ Fine-tune multi-rationale completions: we use RL to optimize sampled completions towards satisfying all the properties

[Jin et al. 2020]

S_{1:k} ~ P(S_{1:k})    (sample a candidate rationale set)
G ~ P(G | S_{1:k})    (sample a graph completion)
reward = 1 if G has all the properties, 0 otherwise    (evaluate)


Page 42: Representation and Generation of Molecular Graphs

Example results

‣ Example comparison of RL-based multi-criteria optimization methods


Table 1. Results on molecule design with single property constraint.

Method DRD2 GSK3� JNK3Success Novelty Diversity Success Novelty Diversity Success Novelty Diversity

GVAE + RL 55.4% 56.2% 0.858 33.2% 76.4% 0.874 57.7% 62.6% 0.832GCPN 40.1% 12.8% 0.880 42.4% 11.6% 0.904 32.3% 4.4% 0.884

REINVENT 98.1% 28.8% 0.798 99.3% 61.0% 0.733 98.5% 31.6% 0.729Ours 100% 84.9% 0.866 100% 76.8% 0.850 100% 81.4% 0.847

Table 2. Results on molecule design with two property constraints.

Method     | DRD2 + GSK3β                | DRD2 + JNK3                 | GSK3β + JNK3
           | Success  Novelty  Diversity | Success  Novelty  Diversity | Success  Novelty  Diversity
GVAE + RL  | 96.1%    0%       0.447     | 98.5%    0%       0.0       | 40.7%    80.3%    0.783
GCPN       | 0.1%     40%      0.793     | 0.1%     25%      0.785     | 3.5%     8.0%     0.874
REINVENT   | 98.7%    100%     0.584     | 91.6%    100%     0.533     | 97.4%    39.7%    0.595
Ours       | 100%     100%     0.866     | 100%     100%     0.837     | 100%     94.4%    0.844

Table 3. Molecule design with three property constraints.

Method     | DRD2 + GSK3β + JNK3
           | Success  Novelty  Diversity
GVAE + RL  | 0%       0%       0.0
GCPN       | 0%       0%       0.0
REINVENT   | 48.3%    100%     0.166
Ours       | 86.2%    100%     0.726

• Three properties: the rationale for three properties is constructed as V_S^{DRD2} ∪ V_S^{GSK3β} ∪ V_S^{JNK3}. In total, we extracted 8163 different rationales for this task.
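Combining one rationale per property into a multi-property rationale can be sketched as a union over all cross-combinations. The atom-index sets below are made up for illustration; only the union construction reflects the text above.

```python
from itertools import product

# Hedged sketch: a three-property rationale is the union
# V_S^{DRD2} ∪ V_S^{GSK3β} ∪ V_S^{JNK3}, one rationale drawn per property.
# Rationales are represented as frozensets of atom indices (illustrative).

drd2_rationales = [frozenset({1, 2}), frozenset({3, 4})]
gsk3b_rationales = [frozenset({5, 6})]
jnk3_rationales = [frozenset({7}), frozenset({8, 9})]

three_property = [a | b | c for a, b, c in
                  product(drd2_rationales, gsk3b_rationales, jnk3_rationales)]

print(len(three_property))  # 2 * 1 * 2 = 4 candidate combinations
```

This cross-product growth is why the extracted rationale set for the three-property task (8163 rationales) is much larger than any single-property set.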

Model Setup. Our model (and all baselines) are pretrained on the same ChEMBL dataset from Olivecrona et al. (2017). We fine-tune our model for L = 5 iterations, where each rationale is expanded K = 20 times.

4.1. Results

The results on all design tasks are reported in Tables 1-3. On the single-property design tasks, our model and REINVENT demonstrate a nearly perfect success rate since there is only one constraint. We also outperform all the baselines on the novelty metric. Though our diversity score is slightly lower than GCPN's, our success rate and novelty score are much higher.
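For reference, the diversity column in these tables is typically computed as the average pairwise Tanimoto distance over the generated molecules' fingerprints. A minimal sketch under that assumption, with fingerprints as plain sets of on-bits (a real pipeline would derive Morgan fingerprints with a cheminformatics toolkit such as RDKit):

```python
from itertools import combinations

# Hedged sketch of the diversity metric: average pairwise Tanimoto distance.
# Fingerprints are toy bit-sets here, not real molecular fingerprints.

def tanimoto(a, b):
    """Tanimoto similarity of two bit-sets: |a ∩ b| / |a ∪ b|."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diversity(fingerprints):
    """Average pairwise (1 - Tanimoto) over all molecule pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

fps = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
print(round(diversity(fps), 3))  # (0.5 + 1.0 + 1.0) / 3 ≈ 0.833
```

A diversity near 0 (as GVAE + RL shows on DRD2 + JNK3) thus means the model keeps emitting near-identical molecules.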

Table 2 summarizes the results on dual-property design tasks. On the GSK3β + JNK3 task, we achieved a 100% success rate while maintaining 94.4% novelty and a 0.844 diversity score. Meanwhile, GCPN fails to discover positive compounds on all the tasks due to reward sparsity. GVAE + RL fails to discover novel positives on the DRD2 + GSK3β and DRD2 + JNK3 tasks because there are fewer than three positive compounds available for training; therefore it can only learn to replicate existing positives. Our method succeeds in all cases regardless of training data scarcity.

Table 3 shows the results on the three-property design task, which is the most challenging. The difference between our model and the baselines becomes significantly larger. In fact, GVAE + RL and GCPN completely fail on this task due to reward sparsity. Our model outperforms REINVENT by a wide margin (success rate: 86.2% versus 48.3%; diversity: 0.726 versus 0.166).

When there are multiple properties, our method significantly outperforms GVAE + RL, which has the same generative architecture but does not utilize rationales and generates molecules from scratch. We thus conclude that rationales are important for multi-property molecular design.

Visualization. We further provide visualizations to help understand our model. In Figure 5, we plot a t-SNE (Maaten & Hinton, 2008) embedding of the extracted rationales for GSK3β and JNK3. For both properties, the rationales mostly cover the chemical space populated by existing positive molecules. The generated GSK3β + JNK3 dual inhibitors mostly appear where the GSK3β and JNK3 rationales overlap. In Figure 6, we show examples of molecules generated from dual-property rationales. Each rationale has two connected components: one from GSK3β and the other from JNK3.

4.2. Rationale Sanity Check

While our rationales are mainly extracted for generation, it is also important for them to be chemically relevant. In other words, the extracted rationales should accurately explain

[Jin et al. 2020]

Page 43: Representation and Generation of Molecular Graphs

Summary
‣ Molecules embody many of the key challenges in prediction, generation, and manipulation of complex objects

‣ While molecular design methods are rapidly becoming viable tools for drug discovery, many challenges remain:
  - generalizing predictions to unexplored chemical spaces
  - incorporating 3D features and physical constraints
  - explainability
  - etc.