Post on 05-Jan-2016

University at Buffalo, The State University of New York

Lei Shi

Department of Computer Science and Engineering

State University of New York at Buffalo

Frequent Subgraph/Substructure Mining

Seminar 2009

Outline

Introduction

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

Graphs are everywhere

Graph Mining Problems

Graph Pattern Mining
• Frequent subgraph pattern mining
• Pattern summarization
• Optimal graph patterns
• Graph patterns with constraints
• Approximate graph patterns
• …

Graph Classification
• Graph clustering
• Important node identification
• Bridge and hub identification

Other Important Topics
• Graph compression
• Graph model
• Social network analysis

Subgraph pattern Mining

Frequent subgraph
• A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.

Applications of subgraph pattern mining
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
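The support-based definition of a frequent subgraph can be illustrated with a minimal sketch. It assumes small, undirected, vertex-labeled graphs and uses a brute-force containment test; the names `make_graph`, `contains`, `support`, and `is_frequent`, as well as the toy dataset, are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: {vertex: label}; edges: iterable of undirected (u, v) pairs."""
    return labels, {frozenset(e) for e in edges}

def contains(graph, pattern):
    """Brute force: try every injective, label-preserving vertex mapping."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue
        # Every pattern edge must map onto an edge of the graph.
        if all(frozenset(mapping[x] for x in e) in g_edges for e in p_edges):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_sup):
    """Frequent means: support is no less than the threshold."""
    return support(dataset, pattern) >= min_sup

# Toy dataset (an assumption, not the example from the slides).
dataset = [
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),
    make_graph({0: "A", 1: "B", 2: "A"}, [(0, 1), (1, 2)]),
    make_graph({0: "B", 1: "C"}, [(0, 1)]),
]
pattern_ab = make_graph({0: "A", 1: "B"}, [(0, 1)])
pattern_aba = make_graph({0: "A", 1: "B", 2: "A"}, [(0, 1), (1, 2)])
```

With a minimum support of 2, the single edge A-B is frequent in this toy dataset, while the path A-B-A is not. The containment test is exponential in the pattern size, which is one reason the miners discussed later avoid testing arbitrary candidates.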

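Frequent Subgraph Example

[Figure: three example graphs (1), (2), (3) with vertex labels A, B, and C, followed by a table of subgraphs and their support values; the support row reads "331" in the extracted text.]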

Key Challenges in Subgraph Mining

Graph isomorphism
• Detecting whether two graphs are identical in structure

Graph representation (canonical labeling)
• A canonical label is a unique code for a given graph.
• The canonical label must be the same no matter how a graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.

Subgraph candidate generation
• Generating candidate frequent subgraphs from the dataset
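The canonical-label requirement above can be sketched with a brute-force scheme: try every vertex ordering, write out the vertex labels followed by the upper triangle of the adjacency matrix, and keep the lexicographically smallest string. This is only an illustration under stated assumptions (single-character labels, "0" reserved for "no edge", tiny graphs); the function name `canonical_label` is hypothetical, and practical miners prune the orderings rather than enumerating all of them.

```python
from itertools import permutations

def canonical_label(labels, edges):
    """labels: list of vertex labels (index = vertex id).
    edges: {(u, v): edge_label} for an undirected graph."""
    n = len(labels)
    adj = {}
    for (u, v), lab in edges.items():
        adj[(u, v)] = adj[(v, u)] = lab
    best = None
    for order in permutations(range(n)):
        vertex_part = "".join(labels[v] for v in order)
        # Upper triangle of the permuted adjacency matrix, column by column.
        matrix_part = "".join(
            adj.get((order[i], order[j]), "0")
            for j in range(n) for i in range(j)
        )
        code = (vertex_part, matrix_part)
        if best is None or code < best:
            best = code
    return " ".join(best)

# The same graph stored with two different vertex numberings ...
g_a = (["A", "B", "C"], {(0, 1): "x", (1, 2): "y"})
g_b = (["C", "A", "B"], {(1, 2): "x", (2, 0): "y"})
# ... and a graph with the edge labels swapped.
g_c = (["A", "B", "C"], {(0, 1): "y", (1, 2): "x"})
```

Here `canonical_label(*g_a)` and `canonical_label(*g_b)` agree because the two inputs describe the same labeled graph, while `g_c` receives a different label. Comparing canonical labels therefore replaces a pairwise isomorphism test, at the price of computing the label.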

Subgraph Mining Approaches

Apriori-based
• AGM/AcGM: Inokuchi et al. (PKDD’00)
• FSG: Kuramochi and Karypis (ICDM’01)
  M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001.
• PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
• FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
• FTOSM: Horvath et al. (KDD’06)

Pattern growth based
• Subdue: Holder et al. (KDD’94)
• MoFa: Borgelt and Berthold (ICDM’02)
• gSpan: Yan and Han (ICDM’02)
  X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), December 9-12, 2002. IEEE Computer Society, Washington, DC, p. 721.
• Gaston: Nijssen and Kok (KDD’04)
• CMTreeMiner: Chi et al. (TKDE’05)
• LEAP: Yan et al. (SIGMOD’08)

Outline

Introduction and Background

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

Apriori-based Approach

FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001.

Flattened representation as canonical labeling

Apriori-based method to generate subgraph candidates

Graph Representation in FSG

Flattened Representation

[Figure: a graph and its flattened representation.]

Graph Representation in FSG

Flattened Representation

Lexicographic order or dictionary order

Apriori-based method

Apriori property
• If a graph is frequent, all of its subgraphs are frequent.

Candidate generation
• Create a set of candidates of size k+1 from two given frequent k-subgraphs that contain the same (k-1)-subgraph.
• Joining one such pair can result in several candidates of size k+1.
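The level-wise idea above can be sketched in a deliberately simplified setting. The sketch assumes that every vertex label occurs at most once per graph, so a graph is just a set of labeled edges, subgraph containment is set inclusion, and a join yields exactly one candidate. General graph miners such as FSG need canonical labels and isomorphism tests instead, and one join can yield several candidates. All names (`graph`, `is_connected`, `generate_candidates`, `mine`) and the toy dataset are hypothetical.

```python
from itertools import combinations

def graph(*pairs):
    """graph("AB", "BC") is the edge set {A-B, B-C}."""
    return frozenset(frozenset(p) for p in pairs)

def is_connected(edges):
    """True if the edges form a single connected component."""
    edges = list(edges)
    if not edges:
        return False
    seen, remaining, changed = set(edges[0]), edges[1:], True
    while changed:
        changed = False
        for e in list(remaining):
            if seen & e:
                seen |= e
                remaining.remove(e)
                changed = True
    return not remaining

def support(dataset, pattern):
    return sum(pattern <= g for g in dataset)

def generate_candidates(frequent_k):
    """Join two frequent k-edge patterns that share k-1 edges."""
    frequent_k = set(frequent_k)
    candidates = set()
    for p, q in combinations(frequent_k, 2):
        union = p | q
        if len(union) != len(p) + 1 or not is_connected(union):
            continue
        # Apriori pruning: each connected k-edge sub-pattern must be frequent.
        subs = (union - {e} for e in union)
        if all(s in frequent_k for s in subs if is_connected(s)):
            candidates.add(union)
    return candidates

def mine(dataset, min_sup):
    """Return {pattern: support} for all frequent connected patterns."""
    singles = {frozenset([e]) for g in dataset for e in g}
    level = {p for p in singles if support(dataset, p) >= min_sup}
    result = {}
    while level:
        for p in level:
            result[p] = support(dataset, p)
        level = {c for c in generate_candidates(level)
                 if support(dataset, c) >= min_sup}
    return result

dataset = [graph("AB", "BC", "AC"), graph("AB", "BC"),
           graph("AB", "BC", "CD"), graph("BC", "CD")]
result = mine(dataset, min_sup=2)
```

On this toy dataset the three-edge path A-B-C-D is generated as a candidate from the two frequent two-edge paths but is then discarded, because only one graph contains it.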

Apriori-based method

Candidate generation example

Apriori-based method

Flowchart

Apriori-based method

Experimental result
• Chemical compound dataset containing 340 compounds and 24 different atoms (vertices)

Outline

Introduction

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

Motivation of gSpan

Weaknesses of the Apriori-based approach
• Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
• Pruning false positives is expensive: subgraph isomorphism is an NP-complete problem.

gSpan: Graph-Based Substructure Pattern Mining
• Changes the way a graph is represented (DFS: depth-first search)
• Uses pattern growth to generate new subgraph candidates

gSpan: Graph-Based Substructure Pattern Mining

DFS (Depth First Search) Code

• First Step: traverse the graph depth-first and use the edges, in the order they are visited, to represent the graph.

• Second Step: DFS Lexicographic Order

Pattern Growth subgraph generation

DFS code

An edge is represented by a 5-tuple.

$(i, j, l_i, l_{(i,j)}, l_j)$

Example: $(0, 1, X, a, Y)$
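As a rough sketch of the first step, the function below emits the 5-tuples for one particular depth-first traversal of a small labeled graph. It assumes integer vertex ids and visits neighbors in increasing id order, so it yields a DFS code and not necessarily the minimum one that gSpan selects as the canonical form. The name `dfs_code` and the example graph are hypothetical.

```python
def dfs_code(vertex_labels, edge_labels, start):
    """vertex_labels: {vertex: label}; edge_labels: {(u, v): label}, undirected.
    Returns a list of tuples (i, j, l_i, l_(i,j), l_j)."""
    adj = {v: {} for v in vertex_labels}
    for (u, v), lab in edge_labels.items():
        adj[u][v] = lab
        adj[v][u] = lab
    index = {}    # vertex -> discovery index
    used = set()  # edges already written to the code
    code = []

    def visit(v):
        index[v] = len(index)
        # Backward edges: from v to already discovered vertices, by index.
        back = sorted((index[u], u) for u in adj[v]
                      if u in index and frozenset((u, v)) not in used)
        for j, u in back:
            used.add(frozenset((u, v)))
            code.append((index[v], j, vertex_labels[v], adj[v][u], vertex_labels[u]))
        # Forward edges: discover new vertices and recurse.
        for u in sorted(adj[v]):
            if u not in index:
                used.add(frozenset((u, v)))
                code.append((index[v], len(index),
                             vertex_labels[v], adj[v][u], vertex_labels[u]))
                visit(u)

    visit(start)
    return code

# A labeled triangle: X -a- Y -b- Z -c- X.
triangle_code = dfs_code({0: "X", 1: "Y", 2: "Z"},
                         {(0, 1): "a", (1, 2): "b", (2, 0): "c"}, start=0)
```

For the triangle, the first tuple is `(0, 1, "X", "a", "Y")`, matching the example above. Forward edges have i < j, and the edge that closes the cycle appears as a backward edge with i > j.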

DFS code

Second Step: DFS Lexicographic Order

Pattern Growth Approach

Pattern Growth (free extension)

Pattern Growth Approach

Duplicate Graphs

Pattern Growth Approach

Free extension

Pattern Growth Approach

Right most extension
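In gSpan, rightmost extension restricts where a pattern may grow: backward edges may start only at the rightmost vertex and end on the rightmost path, and forward edges may start only at vertices on the rightmost path. The sketch below lists those positions for a given DFS code. The helper names `rightmost_path` and `extension_slots` are hypothetical, and labels are ignored because only the positions matter here.

```python
def rightmost_path(code):
    """Vertex indices from the root to the rightmost (last discovered) vertex."""
    parent = {}
    for i, j, *_ in code:
        if i < j:  # forward edge: i discovered j
            parent[j] = i
    path = [max(parent, default=0)]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def extension_slots(code):
    """Return (backward, forward) lists of allowed (i, j) index pairs."""
    path = rightmost_path(code)
    rightmost = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in code}
    # Backward: rightmost vertex to an earlier vertex on the rightmost path.
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward: any rightmost-path vertex to a brand-new vertex.
    new_vertex = rightmost + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

# DFS code of a small tree: 0-1, 1-2, then back to the root and 0-3.
code = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "Z"), (0, 3, "X", "c", "W")]
backward, forward = extension_slots(code)
```

In this example vertices 1 and 2 are off the rightmost path, so no new edge may be attached to them. Free extension would allow it and would produce many more duplicate graphs.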

Pattern Growth Approach

Examples (cont.)

gSpan

gSpan

Pattern Growth Approach

Experimental result using chemical data
• 340 molecules, with 66 atom types and 4 bond types as labels
• On average, only 27 vertices and 28 edges per graph

Summary

Graph representation: flattened representation vs. DFS code

Generation of candidate patterns: Apriori vs. pattern growth

Pattern-Growth Approach

Frequent Graph Pattern

Given a graph dataset D, find every subgraph g such that

$\mathrm{freq}(g) \ge \theta$

where $\mathrm{freq}(g)$ is the percentage of graphs in D that contain g, and $\theta$ denotes the minimum support threshold.

Problem 1: Exponential pattern set

Problem 2: Threshold setting

Difference between frequent itemset and frequent subgraph discovery

Frequent itemset discovery

Subgraph Mining Algorithms

Apriori-based approach
– AGM/AcGM: Inokuchi et al. (PKDD’00)
– FSG: Kuramochi and Karypis (ICDM’01)
– PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
– FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
– FTOSM: Horvath et al. (KDD’06)

Pattern growth approach
– Subdue: Holder et al. (KDD’94)
– MoFa: Borgelt and Berthold (ICDM’02)
– gSpan: Yan and Han (ICDM’02)
– Gaston: Nijssen and Kok (KDD’04)
– CMTreeMiner: Chi et al. (TKDE’05)
– LEAP: Yan et al. (SIGMOD’08)

Framework of Subgraph Mining Algorithms

Search order
• breadth vs. depth
• complete vs. incomplete

Generation of candidate patterns
• Apriori vs. pattern growth

Discovery order of patterns
• DFS order
• path, tree, graph

Elimination of duplicate subgraphs
• passive vs. active

Support calculation
• embedding store or not

Frequent Subgraph

Examples:

Example (cont.)

Outline

Introduction and Background

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

DFS code

X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), December 9-12, 2002. IEEE Computer Society, Washington, DC, p. 721.

Pattern Growth Approach