Efficient Pattern Matching on Graph Patterns of Bounded Treewidth


Takashi Yamada 1 and Takayoshi Shoudai 2,3

Department of Informatics, Kyushu University

Fukuoka 819-0395, Japan

Abstract

This paper deals with the problem of deciding whether a given graph structure appears as a pattern in the structure of a given graph. A graph pattern is a connected graph with structured variables. A variable is an ordered list of vertices that can be replaced with a connected graph by a kind of hyperedge replacement. The graph pattern matching problem (GPMP) is the computational problem of deciding whether a given graph pattern matches a given graph. In this paper, we show that GPMP is solvable in polynomial time if, for a given graph pattern $p$, the lengths of all variables of $p$ are 2 and the base graph of $p$ is of bounded treewidth.

Keywords: graph pattern, pattern matching problem, treewidth, partial k-tree.

1 Introduction

Large amounts of graph-structured data, such as map data, CAD data, biomolecules, chemical molecules, and the World Wide Web, are stored in databases.

1 Email: [email protected]

2 Email: [email protected]

3 This research was partially supported by the Japanese Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (C), 20500016, 2008–2010.

Electronic Notes in Discrete Mathematics 37 (2011) 117–122

1571-0653/$ – see front matter © 2011 Elsevier B.V. All rights reserved.

www.elsevier.com/locate/endm

doi:10.1016/j.endm.2011.05.021

[Figure 1: the graph pattern $p$ on the vertices $u_1, \ldots, u_9$ with variables $h_1$, $h_2$, $h_3$; the graphs $G_1$, $G_2$, $G_3$ with distinguished vertices $(v_1, v_2)$, $(w_1, w_2)$, $(z_1, z_2)$; and the resulting graph $G$.]

Fig. 1. A variable is drawn as a box with lines to its elements. The graph pattern $p$ has three variables $h_1 = (u_1, u_2)$, $h_2 = (u_3, u_5)$, and $h_3 = (u_7, u_8)$.

One of the natural computational problems in such databases is to decide whether or not a given graph structure appears as a pattern in the structure of a given graph $G$. In this paper, a graph pattern is defined as a graph-structured pattern with internal variables, which represents characteristic common structures in graph-structured data. We consider the graph pattern matching problem for a given graph pattern of bounded treewidth. Practical applications include graph pattern mining for chemical databases [3,6,7]. For example, Horvath and Ramon reported in [3] that 99.97% of the 250,251 chemical compounds in the NCI chemical dataset$^4$ are expressed by graphs of treewidth at most 3. Theoretical motivation for this research can be found in [1,2,5].

All graphs in this paper are undirected, finite, and have neither loops nor multiple edges. Let $V$ be the set of vertices of some graph. A variable is an ordered list of distinct vertices in $V$, denoted by $(v_1, \ldots, v_\ell)$ ($\ell \geq 1$), where $v_i \neq v_j$ if $i \neq j$ ($1 \leq i, j \leq \ell$). The length of a variable is the number of vertices in the variable. A triplet $p = (V, E, H)$ is called a graph pattern if $(V, E)$ is a connected graph and $H$ is a set of variables of $(V, E)$. We give an example of a graph pattern in Fig. 1. Let $p$ be a graph pattern and $h = (u_1, \ldots, u_\ell)$ a variable of $p$. Let $G$ be a connected graph with at least $\ell$ vertices, and let $\sigma = (v_1, \ldots, v_\ell)$ be an ordered list of $\ell$ distinct vertices of $G$. A binding is an operation, denoted by $h := [G, \sigma]$, that substitutes a graph for a variable. A new graph pattern $p\{h := [G, \sigma]\}$ is obtained from $p$ and $h := [G, \sigma]$ by removing the variable $h$ and identifying the vertices $v_1, \ldots, v_\ell$ of $G$ with the vertices $u_1, \ldots, u_\ell$ in $h$, respectively. If the obtained graph pattern has multiple edges, we remove the duplicates. A set of bindings is called a substitution. For example, in Fig. 1, the graph $G$ is obtained from $p$ and the substitution $\theta = \{h_1 := [G_1, (v_1, v_2)], h_2 := [G_2, (w_1, w_2)], h_3 := [G_3, (z_2, z_1)]\}$. The graph pattern matching problem (GPMP) is to decide whether or not a given graph is obtained from a given graph pattern by a substitution.

4 http://cactus.nci.nih.gov
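The binding operation can be illustrated with a minimal Python sketch. The representation (vertex sets, edges as 2-element frozensets, variables as tuples), the function name `apply_binding`, and the small example pattern are assumptions made for illustration and are not taken from the paper.

```python
def apply_binding(pattern, h, graph, sigma):
    """Return p{h := [G, sigma]} for a pattern p = (V, E, H).

    E and the edge set of G are sets of 2-element frozensets;
    h and sigma are tuples of the same length.
    """
    V, E, H = pattern
    VG, EG = graph
    if h not in H or len(h) != len(sigma):
        raise ValueError("h must be a variable of p with the same length as sigma")
    if len(set(sigma)) != len(sigma) or not set(sigma) <= set(VG):
        raise ValueError("sigma must list distinct vertices of G")
    # Vertices of G outside sigma get fresh names (tagged with h);
    # the i-th vertex of sigma is identified with the i-th vertex of h.
    rename = {v: (h, v) for v in VG}
    rename.update(zip(sigma, h))
    new_V = set(V) | set(rename.values())
    # Using a set of frozensets drops duplicate (multiple) edges automatically.
    new_E = set(E) | {frozenset(rename[x] for x in e) for e in EG}
    new_H = set(H) - {h}
    return new_V, new_E, new_H


# Hypothetical example: a triangle pattern with one variable on (u1, u2).
h = ("u1", "u2")
p = (
    {"u1", "u2", "u3"},
    {frozenset({"u1", "u3"}), frozenset({"u2", "u3"}), frozenset({"u1", "u2"})},
    {h},
)
G1 = (
    {"v1", "x", "v2"},
    {frozenset({"v1", "x"}), frozenset({"x", "v2"}), frozenset({"v1", "v2"})},
)
new_V, new_E, new_H = apply_binding(p, h, G1, ("v1", "v2"))
```

In this example the edge between `v1` and `v2` of `G1` coincides with the existing edge between `u1` and `u2`, so only one copy is kept.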

GPMP is the graph isomorphism problem if the given graph pattern has no variables. Moreover, GPMP is NP-complete even if the lengths of all variables of the given graph pattern $p$ are 4 [4]. In this paper, we show that GPMP is solvable in polynomial time if, for a given graph pattern $p$, the lengths of all variables of $p$ are 2 and the base graph of $p$ is of bounded treewidth.

2 Preliminaries

For a graph pattern $p$, we denote by $V(p)$, $E(p)$, and $H(p)$ the sets of all vertices, edges, and variables of $p$, respectively. Similarly, for a graph $G$, we denote by $V(G)$ and $E(G)$ the sets of all vertices and edges of $G$, respectively. For a subset $U$ of $V(G)$, the induced subgraph of $G$ by $U$, denoted by $G[U]$, is the subgraph $G[U] = (U, \{\{u, w\} \in E(G) \mid u, w \in U\})$.

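A tree-decomposition of a graph $G$ is a 2-tuple $(T, \mathcal{X})$, where $T$ is a tree and $\mathcal{X} = \{\mathcal{X}(\alpha) \mid \mathcal{X}(\alpha) \subseteq V(G) \text{ for all } \alpha \in V(T)\}$, satisfying the following three conditions: (i) $\bigcup_{\alpha \in V(T)} \mathcal{X}(\alpha) = V(G)$, (ii) $\forall v, w \in V(G)\ [\{v, w\} \in E(G) \Rightarrow \exists \alpha \in V(T)\ [\{v, w\} \subseteq \mathcal{X}(\alpha)]]$, and (iii) $\forall \alpha, \beta, \gamma \in V(T)$ [$\beta$ is on the path from $\alpha$ to $\gamma$ in $T$ $\Rightarrow$ $\mathcal{X}(\alpha) \cap \mathcal{X}(\gamma) \subseteq \mathcal{X}(\beta)$]. The width of a tree-decomposition $(T, \mathcal{X})$ is $\max_{\alpha \in V(T)} |\mathcal{X}(\alpha)| - 1$. The treewidth of a graph $G$ is the minimum width over all possible tree-decompositions of $G$. We say that a tree-decomposition of a graph $G$ is optimal if its width equals the treewidth of $G$. A partial $k$-tree is a graph of treewidth at most $k$.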
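The three conditions can be checked mechanically. The following Python sketch is illustrative only: the function names and the dictionary representation of the sets attached to the nodes of the tree are assumptions. Condition (iii) is tested in its standard equivalent form, namely that for every vertex the nodes whose sets contain it induce a connected subtree.

```python
from collections import defaultdict


def _connected(nodes, adj):
    """True if the node set induces a connected subgraph under adjacency adj."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for b in adj[a]:
            if b in nodes and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen == nodes


def is_tree_decomposition(vertices, edges, tree_edges, bags):
    """bags maps each tree node to its vertex set; edges are vertex pairs."""
    adj = defaultdict(set)
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    # T itself must be a tree: connected with |V(T)| - 1 edges.
    if len(tree_edges) != len(bags) - 1 or not _connected(bags.keys(), adj):
        return False
    # (i) the sets cover V(G).
    if set().union(*bags.values()) != set(vertices):
        return False
    # (ii) every edge lies inside some set.
    if any(not any({v, w} <= bag for bag in bags.values()) for v, w in edges):
        return False
    # (iii) the nodes containing a given vertex form a connected subtree.
    return all(
        _connected({a for a, bag in bags.items() if v in bag}, adj)
        for v in vertices
    )


def width(bags):
    return max(len(bag) for bag in bags.values()) - 1


# Hypothetical example: the path a-b-c-d has a decomposition of width 1.
path_vertices = ["a", "b", "c", "d"]
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
good_bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
bad_bags = {1: {"a", "b"}, 2: {"c", "d"}, 3: {"b", "c"}}
chain = [(1, 2), (2, 3)]
```

In `bad_bags` the vertex `b` occurs at nodes 1 and 3 but not at node 2, which lies between them, so condition (iii) fails.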

Hereafter, to distinguish them from the vertices of a graph $G$, we call the vertices of $T$ nodes. Below we assume that the tree $T$ of a tree-decomposition $(T, \mathcal{X})$ is rooted by specifying a node of $T$. We denote by $r_T$ the root of $T$. For a tree-decomposition $(T, \mathcal{X})$, we denote by $T{\downarrow}\alpha$ the maximal subtree rooted at a node $\alpha \in V(T)$, and by $\mathcal{X}(T{\downarrow}\alpha)$ the union of the elements of the nodes of $T{\downarrow}\alpha$, i.e., $\mathcal{X}(T{\downarrow}\alpha) = \bigcup_{\beta \in V(T{\downarrow}\alpha)} \mathcal{X}(\beta)$.

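A tree-decomposition $(T, \mathcal{X})$ is smooth if $\forall \{\alpha, \beta\} \in E(T)\ [|\mathcal{X}(\alpha) \setminus \mathcal{X}(\beta)| = |\mathcal{X}(\beta) \setminus \mathcal{X}(\alpha)| = 1]$. A tree-decomposition $(T, \mathcal{X})$ of $G$ with a rooted tree $T$ has the subtree connected characteristic if, for every $\{\alpha, \beta\} \in E(T)$ such that $\beta$ is a child of $\alpha$, the graph $G[\mathcal{X}(T{\downarrow}\beta) \setminus \mathcal{X}(\alpha)]$ is connected. A tree-decomposition $(T, \mathcal{X})$ is normalized if it satisfies the following three conditions: (i) $(T, \mathcal{X})$ is optimal, (ii) $(T, \mathcal{X})$ is smooth, and (iii) $T$ is a rooted tree and $(T, \mathcal{X})$ has the subtree connected characteristic.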
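Smoothness and the subtree connected characteristic can both be tested directly from their definitions. The sketch below is a hedged illustration: the function names, the `children` dictionary that encodes the rooted tree, and the two small example graphs are assumptions and do not come from the paper.

```python
def is_smooth(tree_edges, bags):
    """Adjacent sets differ in exactly one vertex on each side."""
    return all(
        len(bags[a] - bags[b]) == 1 and len(bags[b] - bags[a]) == 1
        for a, b in tree_edges
    )


def subtree_union(node, children, bags):
    """Union of the sets of all nodes in the subtree rooted at node."""
    result = set(bags[node])
    for child in children.get(node, ()):
        result |= subtree_union(child, children, bags)
    return result


def _connected(vertices, adj):
    vertices = set(vertices)
    if not vertices:
        return True
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y in vertices and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vertices


def has_subtree_connected_characteristic(children, bags, edges):
    """children maps a node to the list of its children in the rooted tree."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set()).add(v)
    for parent, kids in children.items():
        for kid in kids:
            part = subtree_union(kid, children, bags) - bags[parent]
            if not _connected(part, adj):
                return False
    return True


# Hypothetical example 1: a path a-b-c-d, decomposed along the path.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
path_bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
path_children = {1: [2], 2: [3]}

# Hypothetical example 2: a star with center c; smooth, but below the root
# the leaves y and z remain, and they are not connected to each other.
star_edges = [("c", "x"), ("c", "y"), ("c", "z")]
star_bags = {"r": {"c", "x"}, "s": {"c", "y"}, "t": {"c", "z"}}
star_children = {"r": ["s"], "s": ["t"]}
```

The star example shows that smoothness alone does not imply the subtree connected characteristic, which is why the definition of a normalized tree-decomposition lists the two conditions separately.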

[Figure 2: left, the base graph $G_p$ on the vertices $u_1, \ldots, u_9$; right, the tree $T$ with nodes $\alpha_1, \ldots, \alpha_7$, whose vertex sets are $\{u_1, u_2, u_3\}$, $\{u_1, u_3, u_4\}$, $\{u_2, u_3, u_5\}$, $\{u_3, u_5, u_6\}$, $\{u_5, u_6, u_9\}$, $\{u_6, u_7, u_9\}$, and $\{u_7, u_8, u_9\}$.]

Fig. 2. The graph $G_p$ is the base graph of $p$ in Fig. 1. The right figure is a normalized tree-decomposition $(T, \mathcal{X})$ of $p$ in Fig. 1, where $\mathcal{X}(\alpha_1) = \{u_1, u_2, u_3\}$, and so on.

A normalized tree-decomposition of a graph $G$ of treewidth $k$ is obtained from any optimal tree-decomposition of $G$ in $O(kn^2)$ time [5]. The graph isomorphism problem for partial $k$-trees is solved in $O(n^{k+4})$ time [2].

Definition 2.1 A graph pattern $p = (V, E, H)$ is said to be a partial $k$-tree pattern if all variables have length 2 and the graph $(V, E \cup E_H)$ is a partial $k$-tree, where $E_H = \{\{u_1, u_2\} \mid (u_1, u_2) \in H\}$. We call the graph $(V, E \cup E_H)$ the base graph of $p$. A normalized tree-decomposition of a partial $k$-tree pattern $p$ is a normalized tree-decomposition of its base graph.

The graph pattern $p$ in Fig. 1 is a partial 2-tree pattern. In Fig. 2, the graph $G_p$ is the base graph of $p$ in Fig. 1, and the right figure is a normalized tree-decomposition of $p$ (and of $G_p$).
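As a small illustration of Definition 2.1, the following Python sketch builds the base graph of a pattern and checks that each variable is covered by some vertex set of a tree-decomposition. The function names and the four-vertex pattern are hypothetical. The variables $h_1$, $h_2$, $h_3$ and the seven vertex sets are those given for Fig. 1 and Fig. 2; the ordinary edges of $p$ are not listed in the text, so they are not reproduced here.

```python
def base_graph(V, E, H):
    """Return (V, E ∪ E_H) for a pattern whose variables all have length 2."""
    if any(len(h) != 2 for h in H):
        raise ValueError("every variable must have length 2")
    E_H = {frozenset(h) for h in H}
    return set(V), set(E) | E_H


def covered(h, bags):
    """True if both vertices of the variable h lie in a common vertex set."""
    return any(set(h) <= bag for bag in bags)


# Hypothetical pattern: the path a-b-c-d with one variable on (a, c).
V = {"a", "b", "c", "d"}
E = {frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"c", "d"})}
H = {("a", "c")}
base_V, base_E = base_graph(V, E, H)

# Variables of p from Fig. 1 and the vertex sets listed for Fig. 2.
fig1_variables = [("u1", "u2"), ("u3", "u5"), ("u7", "u8")]
fig2_bags = [
    {"u1", "u2", "u3"}, {"u1", "u3", "u4"}, {"u2", "u3", "u5"},
    {"u3", "u5", "u6"}, {"u5", "u6", "u9"}, {"u6", "u7", "u9"},
    {"u7", "u8", "u9"},
]
all_covered = all(covered(h, fig2_bags) for h in fig1_variables)
```

Because each variable contributes an edge to the base graph, condition (ii) of a tree-decomposition forces every variable to lie inside some vertex set, which is what `covered` tests.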

3 A pattern matching algorithm for partial k-tree patterns

Let $p$ be a partial $k$-tree pattern and $G$ a connected graph. We say that $p$ matches $G$ if there is a substitution $\theta$ for $p$ such that $p\theta$ is isomorphic to $G$. An injection $\varphi : V(p) \to V(G)$ is said to be a pattern isomorphism from $p$ to $G$ if there are a substitution $\theta$ for $p$ and a graph isomorphism $\rho$ from $p\theta$ to $G$ such that $\varphi$ is the restriction of $\rho$ to the domain $V(p)$. We denote the graph isomorphism $\rho$ by $\varphi(\theta)$. We now define our problem formally.

PARTIAL K-TREE PATTERN MATCHING PROBLEM

Input: A partial $k$-tree pattern $p$ and a connected graph $G$.

Problem: Decide whether or not $p$ matches $G$, i.e., whether there is a pattern isomorphism from $p$ to $G$.

[Figure 3: three panels showing an internal node $\alpha$ of $T$ other than the root, with the sets $\mathcal{X}(\alpha')$, $\mathcal{X}(\alpha)$, $\mathcal{X}(\beta_1), \ldots, \mathcal{X}(\beta_g)$ and $\mathcal{X}(T{\downarrow}\beta_1), \ldots, \mathcal{X}(T{\downarrow}\beta_g)$; variables are marked as types (a), (b), and (c).]

Fig. 3. The algorithm incrementally decides a pattern isomorphism from $p$ to $G$.

Definition 3.1 Let $p$ be a partial $k$-tree pattern and $G$ a connected graph. Let $(T, \mathcal{X})$ be a tree-decomposition of $p$. For a node $\alpha \in V(T)$, we say that an injection $\psi : \mathcal{X}(\alpha) \to V(G)$ is a node mapping of $\alpha$ if $\forall v, w \in \mathcal{X}(\alpha)\ [\{v, w\} \in E(p) \Leftrightarrow \{\psi(v), \psi(w)\} \in E(G)]$.

A node mapping is a local mapping from $p$ to $G$ that preserves edge relations. For two mappings $\psi$ and $\psi'$ whose domains are $D$ and $D'$, respectively, we say that $\psi$ does not contradict $\psi'$ if $\psi(x) = \psi'(x)$ for any $x \in D \cap D'$.
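A brute-force enumeration makes these two notions concrete. The Python sketch below follows the condition of Definition 3.1 exactly as stated; the function names, the edge representation (2-element frozensets), and the example graph are assumptions made for illustration. For a vertex set of size $k+1$ and a graph with $N$ vertices there are at most $N^{k+1}$ candidate injections, so the enumeration is polynomial for fixed $k$.

```python
from itertools import combinations, permutations


def node_mappings(bag, pattern_edges, graph_vertices, graph_edges):
    """Yield every injection psi on bag with {v,w} in E(p) <=> {psi(v),psi(w)} in E(G)."""
    bag = sorted(bag)
    for image in permutations(sorted(graph_vertices), len(bag)):
        psi = dict(zip(bag, image))
        if all(
            (frozenset((v, w)) in pattern_edges)
            == (frozenset((psi[v], psi[w])) in graph_edges)
            for v, w in combinations(bag, 2)
        ):
            yield psi


def does_not_contradict(psi, psi2):
    """True if the two mappings agree on the intersection of their domains."""
    return all(psi[x] == psi2[x] for x in psi.keys() & psi2.keys())


# Hypothetical example: the set {a, b, c} carries the path a-b-c, and the
# target graph is a triangle 1-2-3 with a pendant vertex 4 attached to 3.
pattern_edges = {frozenset(("a", "b")), frozenset(("b", "c"))}
graph_vertices = [1, 2, 3, 4]
graph_edges = {
    frozenset((1, 2)), frozenset((1, 3)), frozenset((2, 3)), frozenset((3, 4)),
}
found = list(node_mappings({"a", "b", "c"}, pattern_edges, graph_vertices, graph_edges))
```

In this example `b` must map to vertex 3, since 3 is the only vertex with two non-adjacent neighbors.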

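Let $(T, \mathcal{X})$ be a normalized tree-decomposition of $p$ of treewidth $k$. Let $\alpha$ be a node of $T$ that is not the root of $T$, and $\alpha'$ the parent of $\alpha$. Let $S_\alpha = \mathcal{X}(\alpha) \cap \mathcal{X}(\alpha')$ and $P_\alpha = \mathcal{X}(\alpha) \setminus \mathcal{X}(\alpha')$. Since $(T, \mathcal{X})$ is smooth and of width $k$, every set $\mathcal{X}(\alpha)$ has exactly $k + 1$ vertices, and hence $|S_\alpha| = k$ and $|P_\alpha| = 1$. For the root $r_T$ of $T$, we define $S_{r_T} = \emptyset$ and $P_{r_T} = \mathcal{X}(r_T)$. We show the relationship between $S_\alpha$ and $P_\alpha$ in the leftmost figure of Fig. 3. Let $\beta_1, \ldots, \beta_g$ be the children of a node $\alpha \in V(T)$. Then, for any $i, j$ ($1 \leq i \neq j \leq g$), $P_{\beta_i} \cap P_{\beta_j} = \emptyset$, and there is neither an edge nor a variable between $P_{\beta_i}$ and $P_{\beta_j}$.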
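The sets $S_\alpha$ and $P_\alpha$ are simple to compute from a rooted decomposition. In the Python sketch below, the vertex sets are those listed for Fig. 2, whereas the node names `n1` to `n7` and the parent relation are assumptions, chosen so that adjacent sets differ in exactly one vertex; the actual tree in the figure may be labeled differently.

```python
def separators_and_new_vertices(bags, parent):
    """Return (S, P): S[a] = X(a) ∩ X(parent), P[a] = X(a) \\ X(parent).

    A node missing from `parent` is treated as the root, for which
    S is empty and P is the whole vertex set.
    """
    S, P = {}, {}
    for node, bag in bags.items():
        if node not in parent:
            S[node], P[node] = set(), set(bag)
        else:
            S[node] = bag & bags[parent[node]]
            P[node] = bag - bags[parent[node]]
    return S, P


# Vertex sets as listed for Fig. 2; tree shape assumed for illustration.
bags = {
    "n1": {"u1", "u2", "u3"},
    "n2": {"u1", "u3", "u4"},
    "n3": {"u2", "u3", "u5"},
    "n4": {"u3", "u5", "u6"},
    "n5": {"u5", "u6", "u9"},
    "n6": {"u6", "u7", "u9"},
    "n7": {"u7", "u8", "u9"},
}
parent = {"n2": "n1", "n3": "n1", "n4": "n3", "n5": "n4", "n6": "n5", "n7": "n6"}
S, P = separators_and_new_vertices(bags, parent)
```

With width 2, every non-root node has a separator of size 2 and introduces exactly one new vertex, and the siblings `n2` and `n3` introduce the distinct vertices `u4` and `u5`.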

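For any node $\alpha \in V(T)$ and node mapping $\psi$ of $\alpha$, a partial $k$-tree pattern $p_{\alpha,\psi}$ is defined as follows: (i) $V(p_{\alpha,\psi}) = \mathcal{X}(T{\downarrow}\alpha)$, (ii) $E(p_{\alpha,\psi}) = E(p[\mathcal{X}(T{\downarrow}\alpha)]) \cup \{\{v, w\} \mid (v, w) \in H(p[S_\alpha]) \wedge \{\psi(v), \psi(w)\} \in E(G)\}$, (iii) $H(p_{\alpha,\psi}) = H(p[\mathcal{X}(T{\downarrow}\alpha)]) \setminus H(p[S_\alpha])$.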

The idea of our algorithm is as follows. Let $\alpha$ be a node of $T$, and $\beta_1, \ldots, \beta_g$ the children of $\alpha$. We assume that for each $\beta_i$ ($1 \leq i \leq g$), we have already computed a pattern isomorphism $\varphi_i(\theta_i)$ from $p_{\beta_i,\varphi_i}$ to a subgraph of $G$. The colored areas in the center figure of Fig. 3 show the set of vertices for which mappings have already been computed. For a node mapping $\psi$ of $\alpha$, by extending $\varphi_i(\theta_i)$, we construct a new substitution $\theta$ and a pattern isomorphism $\varphi(\theta)$ from $p_{\alpha,\psi}$ to a subgraph of $G$. The algorithm decides whether or not such an extension is possible by constructing a bipartite graph that represents possible correspondences between variables of $p_{\alpha,\psi}$ and connected components of $G$, and computing a maximum matching in the bipartite graph. Finally, we obtain a pattern isomorphism whose domain is shown as the colored area in the rightmost figure of Fig. 3. At the same time, the procedure computes all bindings for variables of types (a) and (b), and recomputes all bindings for variables of type (c), which are the thick colored variables in the center and rightmost figures of Fig. 3.
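The matching step can be sketched in Python as follows. The text does not specify how the bipartite graph is built or which matching algorithm is used, so this example only shows a standard augmenting-path method (often called Kuhn's algorithm) on a hypothetical compatibility relation between variables and connected components. The names `h1`, `C1`, and so on are placeholders.

```python
def maximum_bipartite_matching(compatible):
    """compatible maps each left vertex to the right vertices it may be paired with.

    Returns a dict from matched right vertices to their left partners.
    """
    match_of_right = {}

    def try_augment(left, visited):
        # Search for an augmenting path starting at the left vertex.
        for right in compatible[left]:
            if right in visited:
                continue
            visited.add(right)
            # Take a free right vertex, or re-route its current partner.
            if right not in match_of_right or try_augment(match_of_right[right], visited):
                match_of_right[right] = left
                return True
        return False

    for left in compatible:
        try_augment(left, set())
    return match_of_right


# Hypothetical compatibility between variables and connected components.
compatible = {"h1": ["C1", "C2"], "h2": ["C1"], "h3": ["C2", "C3"]}
matching = maximum_bipartite_matching(compatible)

# A case where no matching can cover both variables.
blocked = maximum_bipartite_matching({"h1": ["C1"], "h2": ["C1"]})
```

In the first example every variable receives its own component because `h1` is re-routed from `C1` to `C2` when `h2` is processed. How the size of the matching is turned into a decision about the extension is left open here, since the text omits those details.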

Finally we have the following theorem. We omit the proof.

Theorem 3.2 PARTIAL K-TREE PATTERN MATCHING PROBLEM is solvable in $O(N^{k+4.5})$ time for a given partial $k$-tree pattern and a given graph with $N$ vertices.

References

[1] Arnborg, S., D.G. Corneil, and A. Proskurowski, Complexity of Finding Embeddings in a k-Tree, SIAM J. Alg. Disc. Methods 8(2) (1987), 277–284.

[2] Bodlaender, H.L., Polynomial algorithms for graph isomorphism and chromatic index on partial k-trees, J. Algorithms 11(4) (1990), 631–643.

[3] Horvath, T. and J. Ramon, Efficient Frequent Connected Subgraph Mining in Graphs of Bounded Tree-Width, Theoret. Comput. Sci. 411(31–33) (2010), 2784–2797.

[4] Miyahara, T., T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda, Polynomial time matching algorithms for tree-like structured patterns in knowledge discovery, Proc. 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Springer, Lecture Notes in Artificial Intelligence 1805 (2000), 5–16.

[5] Nagoya, T., S. Tani, and S. Toda, A Polynomial-Time Algorithm for Counting Graph Isomorphisms among Partial k-trees, IEICE Trans. on Information and Systems J85-D1(5) (2002), 424–435 (in Japanese).

[6] Yamasaki, H., Y. Sasaki, T. Shoudai, T. Uchida, and Y. Suzuki, Learning block-preserving graph patterns and its application to data mining, Machine Learning 76(1) (2009), 137–173.

[7] Yamasaki, H. and T. Shoudai, Mining of Frequent Externally Extensible Outerplanar Graph Patterns, Proc. 7th Inter. Conf. on Machine Learning and Applications (2008), 871–876.
