Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo


Frequent Subgraph/Substructure Mining

Seminar 2009


Outline

Introduction

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary


Graphs are everywhere


Graph Mining Problems

Graph Pattern Mining
• Frequent subgraph pattern mining
• Pattern summarization
• Optimal graph patterns
• Graph patterns with constraints
• Approximate graph patterns
• …

Graph Classification
• Graph clustering
• Important node identification
• Bridge and hub identification

Other Important Topics
• Graph compression
• Graph model
• Social network analysis


Subgraph pattern Mining

Frequent subgraph
• A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.

Applications of subgraph pattern mining
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
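The support definition above can be illustrated with a minimal Python sketch. It assumes a hypothetical graph encoding (a dict of vertex labels plus a dict of labeled undirected edges) and uses a brute-force containment test that is only practical for very small graphs. The names `contains`, `support`, and `is_frequent` are illustrative and do not come from any library.

```python
from itertools import permutations

def contains(graph, pattern):
    """Return True if `pattern` occurs in `graph` as a labeled subgraph.

    Both arguments are (vertex_labels, edges) pairs (an assumed encoding):
    vertex_labels maps vertex id -> label, and edges maps
    frozenset({u, v}) -> edge label. Brute force, exponential time.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_vertices):
            continue
        if all(
            g_edges.get(frozenset(mapping[v] for v in edge)) == label
            for edge, label in p_edges.items()
        ):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in `dataset` that contain `pattern`."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_support):
    """A pattern is frequent if its support is no less than the threshold."""
    return support(dataset, pattern) >= min_support

# Toy dataset (made up for illustration): three small labeled graphs.
g1 = ({0: "A", 1: "B", 2: "C"}, {frozenset({0, 1}): "x", frozenset({1, 2}): "x"})
g2 = ({0: "A", 1: "B"}, {frozenset({0, 1}): "x"})
g3 = ({0: "A", 1: "C"}, {frozenset({0, 1}): "x"})
dataset = [g1, g2, g3]
pattern = ({0: "A", 1: "B"}, {frozenset({0, 1}): "x"})

print(support(dataset, pattern))         # 2
print(is_frequent(dataset, pattern, 2))  # True
```

The containment test is the expensive part: it is a subgraph isomorphism check, which is why the mining algorithms in the rest of the deck try hard to limit how often it runs.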


Frequent Subgraph Example

[Figure: three example graphs (1), (2), (3) with vertices labeled A, B, and C, followed by a table of subgraphs and their support values.]


Key Challenges in Subgraph Mining

Graph isomorphism
• Detect whether two graphs are identical in structure

Graph representation (canonical labeling)
• A canonical label is a unique code for a given graph.
• The canonical label should be the same no matter how a graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.

Subgraph candidate generation
• Generate candidate frequent subgraphs from the dataset


Subgraph Mining Approaches

Apriori-based
• AGM/AcGM: Inokuchi et al. (PKDD’00)
• FSG: Kuramochi and Karypis (ICDM’01)
  M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001.
• PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
• FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
• FTOSM: Horvath et al. (KDD’06)

Pattern growth based
• Subdue: Holder et al. (KDD’94)
• MoFa: Borgelt and Berthold (ICDM’02)
• gSpan: Yan and Han (ICDM’02)
  Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721.
• Gaston: Nijssen and Kok (KDD’04)
• CMTreeMiner: Chi et al. (TKDE’05)
• LEAP: Yan et al. (SIGMOD’08)


Outline

Introduction and Background

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary


Apriori-based Approach

FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001.

Flattened representation as canonical labeling

Apriori-based method to generate subgraph candidates


Graph Representation in FSG

Flattened Representation

[Figure: example graph with its adjacency matrix and the resulting flattened code.]


Graph Representation in FSG

Flattened Representation

Lexicographic order or dictionary order
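One way to see how a flattened adjacency matrix, combined with lexicographic order, can act as a canonical label: the sketch below writes the vertex labels followed by the upper-triangle entries of the adjacency matrix for every vertex ordering and keeps the lexicographically smallest string. The encoding, the `"0"` placeholder for a missing edge, single-character labels, and the choice of the smallest code are all assumptions made for illustration. Enumerating every ordering is exponential, so this is only meant for tiny graphs and is not presented as FSG's actual procedure.

```python
from itertools import permutations

def flattened_code(order, vertex_labels, edges):
    """Vertex labels in `order`, then upper-triangle adjacency entries,
    read column by column. A missing edge is written as "0" (assumed)."""
    labels = "".join(vertex_labels[v] for v in order)
    entries = []
    for col in range(1, len(order)):
        for row in range(col):
            entries.append(edges.get(frozenset({order[row], order[col]}), "0"))
    return labels + " " + "".join(entries)

def canonical_label(vertex_labels, edges):
    """Lexicographically smallest flattened code over all vertex orderings."""
    return min(
        flattened_code(order, vertex_labels, edges)
        for order in permutations(vertex_labels)
    )

# The same path a - b - a, stored with two different vertex numberings.
labels_1 = {0: "a", 1: "b", 2: "a"}
edges_1 = {frozenset({0, 1}): "e", frozenset({1, 2}): "e"}
labels_2 = {0: "b", 1: "a", 2: "a"}
edges_2 = {frozenset({0, 1}): "e", frozenset({0, 2}): "e"}

print(canonical_label(labels_1, edges_1))  # aab 0ee
print(canonical_label(labels_2, edges_2))  # aab 0ee
```

Because both numberings produce the same string, comparing canonical labels replaces a pairwise isomorphism test with a string comparison.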


Apriori-based method

Apriori Property
• If a graph is frequent, all of its subgraphs are frequent.

Candidate Generation
• Create a set of candidates of size k+1
  - from two given frequent k-subgraphs
  - containing the same (k-1)-subgraph
  - resulting in several candidates of size k+1
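The join-and-prune idea can be sketched in Python under a strong simplifying assumption: every vertex carries a unique label, so a graph is fully described by its set of edges and isomorphism reduces to set equality. With repeated labels a single join can yield several candidates, which this sketch does not capture. The function names and the edge encoding (tuples of two vertex labels) are hypothetical.

```python
from itertools import combinations

def is_connected(edges):
    """True if the edge set forms a single connected component."""
    edges = list(edges)
    if not edges:
        return False
    reached = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in reached) != (v in reached):
                reached |= {u, v}
                changed = True
    return all(u in reached and v in reached for u, v in edges)

def generate_candidates(frequent_k):
    """Join pairs of frequent k-edge graphs that share a (k-1)-edge core."""
    frequent_k = set(frequent_k)
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))
    candidates = set()
    for g1, g2 in combinations(frequent_k, 2):
        if len(g1 & g2) != k - 1:
            continue  # the two graphs do not share a (k-1)-edge core
        cand = g1 | g2
        if not is_connected(cand):
            continue
        # Apriori pruning: every connected k-edge subgraph must be frequent.
        if all(cand - {e} in frequent_k
               for e in cand if is_connected(cand - {e})):
            candidates.add(cand)
    return candidates

# Frequent 1-edge graphs (made-up data); each graph is a frozenset of edges.
frequent_1 = {
    frozenset({("A", "B")}),
    frozenset({("B", "C")}),
    frozenset({("C", "D")}),
}
candidates_2 = generate_candidates(frequent_1)
print(len(candidates_2))  # 2
```

Candidates still have to be counted against the dataset afterwards; the join only guarantees that nothing frequent is missed, not that every candidate is frequent.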



Candidate Generation Example



FlowChart



Experimental Result
• Chemical compound dataset, which contains 340 compounds and 24 different atoms (vertices)


Outline

Introduction

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary


Motivation of gSpan

Weaknesses of the Apriori-based approach
• Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
• Pruning false positives requires subgraph isomorphism tests, and subgraph isomorphism is an NP-complete problem, so these tests are costly.

gSpan: Graph-Based Substructure Pattern Mining
• Changes the way a graph is represented (DFS: Depth First Search)
• Uses pattern growth to generate new subgraph candidates


gSpan: Graph-Based Substructure Pattern Mining

DFS (Depth First Search) Code

• First Step: Traverse the graph depth-first and use the edges along the traversal to represent the graph.

• Second Step: DFS Lexicographic Order

Pattern Growth subgraph generation


DFS code

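An edge is represented by a 5-tuple:

$(i, j, l_i, l_{(i,j)}, l_j)$

where $i$ and $j$ are the DFS indices of the two end vertices, $l_i$ and $l_j$ are their labels, and $l_{(i,j)}$ is the edge label.

Example: $(0, 1, X, a, Y)$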
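To make the 5-tuple encoding concrete, here is a minimal Python sketch that walks a small connected graph depth-first and records one DFS code. The graph encoding (dicts keyed by vertex id and by `frozenset({u, v})`), the function name `dfs_code`, and the tie-breaking by vertex id are assumptions for illustration. A graph has many DFS codes; this produces just one of them, not the minimum.

```python
def dfs_code(vertex_labels, edge_labels, start):
    """Return one DFS code of a connected labeled graph as a list of 5-tuples.

    Each tuple is (i, j, l_i, l_(i,j), l_j), where i and j are DFS
    discovery indices. Self-loops are not supported in this sketch.
    """
    adjacency = {v: set() for v in vertex_labels}
    for edge in edge_labels:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)

    index = {start: 0}  # vertex id -> discovery index
    used = set()        # edges already written to the code
    code = []

    def emit(u, w):
        edge = frozenset({u, w})
        used.add(edge)
        code.append((index[u], index[w], vertex_labels[u],
                     edge_labels[edge], vertex_labels[w]))

    def visit(u):
        # Backward edges from u to already discovered vertices come first.
        for w in sorted((w for w in adjacency[u] if w in index), key=index.get):
            if frozenset({u, w}) not in used:
                emit(u, w)
        # Then grow forward edges, one undiscovered neighbour at a time.
        for w in sorted(adjacency[u]):
            if w not in index:
                index[w] = len(index)
                emit(u, w)
                visit(w)

    visit(start)
    return code

# A labeled triangle (made-up example): X -a- Y -b- Z -a- X.
vertex_labels = {0: "X", 1: "Y", 2: "Z"}
edge_labels = {
    frozenset({0, 1}): "a",
    frozenset({1, 2}): "b",
    frozenset({2, 0}): "a",
}
for edge in dfs_code(vertex_labels, edge_labels, start=0):
    print(edge)
# (0, 1, 'X', 'a', 'Y')
# (1, 2, 'Y', 'b', 'Z')
# (2, 0, 'Z', 'a', 'X')
```

The first tuple matches the example edge on the slide; the last tuple is a backward edge, recognizable because its first index is larger than its second.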


DFS code

Second Step: DFS Lexicographic Order
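Since one graph has many DFS codes, the lexicographic order is what singles out one of them as the canonical (minimum) code. The sketch below enumerates every DFS code of a tiny connected graph and picks the smallest under a simplified sort key: at the first position where two codes differ, a backward edge precedes a forward edge, backward edges are ordered by target index, forward edges that start deeper come first, and ties are broken by the labels. The graph encoding and all function names are assumptions, and the brute-force enumeration is for illustration only; it is not how gSpan computes minimum codes.

```python
def all_dfs_codes(vertex_labels, edge_labels):
    """Enumerate every DFS code of a small connected labeled graph."""
    adjacency = {v: set() for v in vertex_labels}
    for edge in edge_labels:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)

    codes = []

    def tuple_for(u, w, index):
        return (index[u], index[w], vertex_labels[u],
                edge_labels[frozenset({u, w})], vertex_labels[w])

    def grow(stack, index, used, code):
        if len(used) == len(edge_labels):
            codes.append(tuple(code))
            return
        u = stack[-1]
        # Pending backward edges from the current vertex, smallest target first.
        back = sorted((w for w in adjacency[u]
                       if w in index and frozenset({u, w}) not in used),
                      key=index.get)
        if back:
            w = back[0]
            grow(stack, index, used | {frozenset({u, w})},
                 code + [tuple_for(u, w, index)])
            return
        fresh = [w for w in adjacency[u] if w not in index]
        if not fresh:
            grow(stack[:-1], index, used, code)  # backtrack
            return
        for w in fresh:  # branch over every possible next forward edge
            new_index = {**index, w: len(index)}
            grow(stack + [w], new_index, used | {frozenset({u, w})},
                 code + [tuple_for(u, w, new_index)])

    for start in vertex_labels:
        grow([start], {start: 0}, frozenset(), [])
    return codes

def order_key(code):
    """Simplified DFS lexicographic key (assumes comparable labels)."""
    return tuple((0, j, li, le, lj) if i > j else (1, -i, li, le, lj)
                 for i, j, li, le, lj in code)

def min_dfs_code(vertex_labels, edge_labels):
    return min(all_dfs_codes(vertex_labels, edge_labels), key=order_key)

# The same labeled triangle stored with two different vertex numberings.
labels_a = {0: "X", 1: "Y", 2: "Z"}
edges_a = {frozenset({0, 1}): "a", frozenset({1, 2}): "b", frozenset({2, 0}): "a"}
labels_b = {10: "Z", 20: "X", 30: "Y"}
edges_b = {frozenset({20, 30}): "a", frozenset({30, 10}): "b", frozenset({10, 20}): "a"}

print(min_dfs_code(labels_a, edges_a) == min_dfs_code(labels_b, edges_b))  # True
```

Two graphs are isomorphic exactly when their minimum DFS codes are equal, which is why the minimum code can serve as a canonical label.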


Pattern Growth Approach

Pattern Growth (free extension)



Duplicate Graphs



Free extension



Rightmost extension
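Rightmost extension restricts where a pattern may grow: a backward edge may only leave the rightmost vertex and land on an earlier vertex of the rightmost path, and a forward edge may only start from a vertex on the rightmost path. The sketch below computes the rightmost path of a DFS code and lists the allowed attachment points as index pairs, leaving labels out. The function names are hypothetical, and the code assumes a valid, non-empty DFS code given as tuples whose first two fields are the indices i and j.

```python
def rightmost_path(code):
    """Discovery indices on the rightmost path, root (0) to rightmost vertex."""
    path = []
    target = None
    for i, j, *_ in reversed(code):
        # Follow forward edges (i < j) back from the rightmost vertex.
        if i < j and (target is None or j == target):
            path.append(j)
            target = i
    path.append(0)
    return path[::-1]

def rightmost_extensions(code):
    """(i, j) index pairs where a new edge may be attached (labels omitted)."""
    path = rightmost_path(code)
    rightmost = path[-1]
    new_vertex = max(max(i, j) for i, j, *_ in code) + 1
    # Largest target of a backward edge already leaving the rightmost vertex;
    # new backward edges must go to a later vertex to keep the code ordered.
    last_back = max((j for i, j, *_ in code if i == rightmost and j < i),
                    default=-1)
    # Backward: rightmost vertex -> earlier vertex on the rightmost path
    # (the parent is skipped because the tree edge already connects them).
    backward = [(rightmost, v) for v in path[:-2] if v > last_back]
    # Forward: any rightmost-path vertex -> a brand-new vertex, deepest first.
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward + forward

# A made-up DFS code: triangle 0-1-2 plus a pendant vertex 3 attached to 1.
code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "Z"),
    (2, 0, "Z", "a", "X"),
    (1, 3, "Y", "c", "W"),
]
print(rightmost_path(code))        # [0, 1, 3]
print(rightmost_extensions(code))  # [(3, 0), (3, 4), (1, 4), (0, 4)]
```

Vertex 2 is not on the rightmost path, so no extension touches it. Free extension would also allow edges from vertex 2 and would therefore generate more duplicate graphs.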



Examples (cont.)


gSpan



Experimental result using chemical data
• 340 molecules, with 66 atom types and 4 bond types as labels
• On average only 27 vertices and 28 edges per graph


Summary

Graph representation
• Flattened representation vs. DFS code

Generation of candidate patterns
• Apriori vs. pattern growth


Pattern-Growth Approach


Frequent Graph Pattern

Given a graph dataset D, find subgraph g, s.t.

$\mathrm{freq}(g) \geq \theta$

where $\mathrm{freq}(g)$ is the percentage of graphs in D that contain g and $\theta$ is the minimum support threshold.

Problem 1: Exponential Pattern Set

Problem 2: Threshold Setting


Difference between frequent itemset and frequent subgraph discovery


Frequent itemset discovery


Subgraph Mining Algorithms

Apriori-based approach
– AGM/AcGM: Inokuchi et al. (PKDD’00)
– FSG: Kuramochi and Karypis (ICDM’01)
– PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
– FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
– FTOSM: Horvath et al. (KDD’06)

Pattern growth approach
– Subdue: Holder et al. (KDD’94)
– MoFa: Borgelt and Berthold (ICDM’02)
– gSpan: Yan and Han (ICDM’02)
– Gaston: Nijssen and Kok (KDD’04)
– CMTreeMiner: Chi et al. (TKDE’05)
– LEAP: Yan et al. (SIGMOD’08)


Framework of Subgraph Mining Algorithms

Search Order
• breadth vs. depth
• complete vs. incomplete

Generation of Candidate Patterns
• apriori vs. pattern growth

Discovery Order of Patterns
• DFS order
• path → tree → graph

Elimination of Duplicate Subgraphs
• passive vs. active

Support Calculation
• embedding store or not


Frequent Subgraph

Examples:


Example (cont.)




Outline

Introduction and Background

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

DFS code

Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721.
