University at Buffalo, The State University of New York

Lei Shi

Department of Computer Science and Engineering

State University of New York at Buffalo

Frequent Subgraph/Substructure Mining

Seminar 2009

Outline

Introduction

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

Graphs are everywhere

Graph Mining Problems

Graph Pattern Mining
• Frequent subgraph pattern mining
• Pattern summarization
• Optimal graph patterns
• Graph patterns with constraints
• Approximate graph patterns
• ...

Graph Classification
• Graph clustering
• Important node identification
• Bridge and hub identification

Other Important Topics
• Graph compression
• Graph model
• Social network analysis

Subgraph Pattern Mining

Frequent subgraph
• A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.

Applications of subgraph pattern mining
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
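The definition of a frequent subgraph can be made concrete with a small brute-force sketch. It is not how FSG or gSpan count support; the graph encoding (a label dictionary plus a set of undirected edges) and the names `make_graph`, `contains`, `support`, and `is_frequent` are assumptions chosen for illustration, and support is counted as the number of dataset graphs that contain the pattern.

```python
from itertools import permutations


def make_graph(labels, edges):
    """Build a graph as (vertex id -> label, set of undirected edges)."""
    return labels, {frozenset(e) for e in edges}


def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving subgraph.

    Brute force: try every injective mapping of pattern vertices onto
    graph vertices. Exponential, so only suitable for tiny examples.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    for image in permutations(g_labels, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_vertices):
            continue  # vertex labels do not match
        if all(frozenset(mapping[v] for v in e) in g_edges for e in p_edges):
            return True  # every pattern edge is present in the graph
    return False


def support(pattern, dataset):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)


def is_frequent(pattern, dataset, min_support):
    """Frequent means: support is no less than the threshold."""
    return support(pattern, dataset) >= min_support


dataset = [
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),          # path A-B-C
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2), (1, 2)]),  # triangle
    make_graph({0: "A", 1: "A", 2: "C"}, [(0, 1), (1, 2)]),          # path A-A-C
]
edge_ab = make_graph({0: "A", 1: "B"}, [(0, 1)])
path_abc = make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)])
```

With this toy dataset the single edge A-B occurs in the first two graphs only, so it is frequent for a threshold of 2 and infrequent for a threshold of 3.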

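Frequent Subgraph Example

[Figure: three example graphs (1), (2), (3) with vertices labeled A, B, and C, and a table of subgraphs with their support; the extracted support row reads "3 3 1".]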

Key Challenges in Subgraph Mining

Graph isomorphism
• To detect whether two graphs are identical in structure

Graph representation (canonical labeling)
• A canonical label is a unique code of a given graph.
• The canonical label should be the same no matter how a graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.

Subgraph candidate generation
• Generate candidate frequent subgraphs from datasets
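The canonical-labeling requirement can be illustrated with a deliberately naive sketch: try every vertex ordering, write down the vertex labels followed by the upper triangle of the adjacency matrix, and keep the smallest code. This is an assumption-laden toy (the function name `canonical_label`, the `"0"` marker for a missing edge, and the choice of the minimum are all illustrative), not the optimized labeling used by FSG.

```python
from itertools import permutations


def canonical_label(labels, edges):
    """Smallest (vertex labels, flattened upper triangle) over all orderings.

    labels: list of vertex labels, indexed by vertex id.
    edges:  dict mapping frozenset({u, v}) to an edge label.
    Runs in factorial time, so it is only meant for very small graphs.
    """
    n = len(labels)
    best = None
    for order in permutations(range(n)):
        vertex_part = tuple(labels[v] for v in order)
        # Upper triangle of the permuted adjacency matrix, column by column;
        # "0" marks the absence of an edge.
        edge_part = tuple(
            edges.get(frozenset((order[i], order[j])), "0")
            for j in range(1, n)
            for i in range(j)
        )
        code = (vertex_part, edge_part)
        if best is None or code < best:
            best = code
    return best


# The same path A-B-C stored with two different vertex numberings.
path_1 = (["A", "B", "C"], {frozenset((0, 1)): "e", frozenset((1, 2)): "e"})
path_2 = (["C", "A", "B"], {frozenset((1, 2)): "e", frozenset((2, 0)): "e"})
# A different graph: B-A-C, with A in the middle.
star = (["A", "B", "C"], {frozenset((0, 1)): "e", frozenset((0, 2)): "e"})
```

Because the code records every vertex label and every adjacency entry, two graphs receive the same label exactly when they are isomorphic; here `path_1` and `path_2` agree while `star` differs.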

Subgraph Mining Approaches

Apriori-based
• AGM/AcGM: Inokuchi et al. (PKDD’00)
• FSG: Kuramochi and Karypis (ICDM’01)
  M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001.
• PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
• FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
• FTOSM: Horvath et al. (KDD’06)

Pattern growth based
• Subdue: Holder et al. (KDD’94)
• MoFa: Borgelt and Berthold (ICDM’02)
• gSpan: Yan and Han (ICDM’02)
  Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721.
• Gaston: Nijssen and Kok (KDD’04)
• CMTreeMiner: Chi et al. (TKDE’05)
• LEAP: Yan et al. (SIGMOD’08)

Outline

Introduction and Background

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

Apriori-based Approach

FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001.

Flattened representation as canonical labeling

Apriori-based method to generate subgraph candidates

Graph Representation in FSG

Flattened Representation

[Figure: flattened-representation example; only the fragment "00000 10 ee" survives in the extracted text.]

Graph Representation in FSG

Flattened Representation

Lexicographic order or dictionary order

Apriori-based method

Apriori Property
• If a graph is frequent, all of its subgraphs are frequent.

Candidate Generation
• Create a set of candidates of size k+1
  - from two given frequent k-subgraphs
  - containing the same (k-1)-subgraph
  - resulting in several candidates of size k+1
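The Apriori property is the anti-monotonicity of support. It can be written out as follows, where freq(g) is the support of g in the dataset and θ the minimum support threshold (θ is notation assumed here for illustration):

```latex
g \subseteq g' \;\Longrightarrow\; \mathrm{freq}(g) \ge \mathrm{freq}(g'),
\qquad\text{hence}\qquad
\mathrm{freq}(g') \ge \theta \;\Longrightarrow\; \mathrm{freq}(g) \ge \theta .
```

Read in the contrapositive, a size-(k+1) candidate can be discarded without counting its support as soon as one of its size-k subgraphs is known to be infrequent.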

Apriori-based method

Candidate generation example

Apriori-based method

FlowChart

Apriori-based method

Experimental Result
• Chemical compound dataset, which contains 340 compounds and 24 different atoms (vertices)

Outline

Introduction

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

Motivation of gSpan

Weaknesses of the Apriori-based approach
• Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
• Pruning false positives is costly: subgraph isomorphism is an NP-complete problem.

gSpan: Graph-Based Substructure Pattern Mining
• Change the way a graph is represented (DFS: Depth First Search)
• Use pattern growth to generate new subgraph candidates

gSpan: Graph-Based Substructure Pattern Mining

DFS (Depth First Search) Code

• First Step: traverse the graph depth-first and use the edges, in the order they are traversed, to represent the graph.

• Second Step: DFS Lexicographic Order

Pattern Growth subgraph generation

DFS code

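An edge is represented by a 5-tuple:

$(i, j, l_i, l_{(i,j)}, l_j)$

where $i$ and $j$ are the DFS indices of the two endpoints, $l_i$ and $l_j$ are their vertex labels, and $l_{(i,j)}$ is the edge label.

Example: $(0, 1, X, a, Y)$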
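A short sketch can show how a sequence of such 5-tuples arises from one depth-first traversal. It produces the DFS code of a single traversal only; choosing the minimum code over all traversals, as gSpan does for canonical labeling, is not attempted here. The function name `dfs_code`, the adjacency-list encoding, and the vertex names `p`, `q`, `r` are assumptions for illustration.

```python
def dfs_code(labels, adjacency, start):
    """DFS code of one depth-first traversal starting at `start`.

    labels:    dict vertex -> vertex label
    adjacency: dict vertex -> list of (neighbor, edge label), in visiting order
    Returns a list of 5-tuples (i, j, l_i, l_(i,j), l_j), where i and j are
    discovery indices. Forward edges have i < j, backward edges have i > j.
    """
    index = {start: 0}   # discovery index of each visited vertex
    seen_edges = set()   # undirected edges already written to the code
    code = []

    def visit(u):
        # Backward edges from u to earlier vertices come first,
        # ordered by the discovery index of their target.
        back = [(index[v], v, el) for v, el in adjacency[u]
                if v in index and frozenset((u, v)) not in seen_edges]
        for _, v, el in sorted(back):
            seen_edges.add(frozenset((u, v)))
            code.append((index[u], index[v], labels[u], el, labels[v]))
        # Then forward edges, each followed by the subtree it discovers.
        for v, el in adjacency[u]:
            if v not in index:
                index[v] = len(index)
                seen_edges.add(frozenset((u, v)))
                code.append((index[u], index[v], labels[u], el, labels[v]))
                visit(v)

    visit(start)
    return code


# A labeled triangle: p(X) -a- q(Y), q(Y) -b- r(X), r(X) -a- p(X).
labels = {"p": "X", "q": "Y", "r": "X"}
adjacency = {
    "p": [("q", "a"), ("r", "a")],
    "q": [("p", "a"), ("r", "b")],
    "r": [("q", "b"), ("p", "a")],
}
code = dfs_code(labels, adjacency, "p")
```

For this triangle the traversal starting at `p` yields two forward edges followed by one backward edge that closes the cycle, and the first tuple is the `(0, 1, X, a, Y)` form shown above.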

DFS code

Second Step: DFS Lexicographic Order

Pattern Growth Approach

Pattern Growth (free extension)

Pattern Growth Approach

Duplicate Graphs

Pattern Growth Approach

Free extension

Pattern Growth Approach

Rightmost extension
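As a rough illustration of rightmost extension in gSpan: new backward edges may only start at the rightmost vertex (the last vertex discovered) and end on the rightmost path, and new forward edges may only start from a vertex on the rightmost path. The sketch below computes that path and the allowed extension sites from a DFS code given as 5-tuples; the helper names `rightmost_path` and `extension_sites` and the example code are assumptions for illustration, and edge labels of the new edges are left out.

```python
def rightmost_path(code):
    """DFS indices from the root (0) down to the rightmost vertex."""
    parent = {}
    for i, j, *_ in code:
        if i < j:            # forward edge: vertex j was discovered from i
            parent[j] = i
    rightmost = max(parent, default=0)
    path = [rightmost]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def extension_sites(code):
    """Allowed (from, to) index pairs for one more edge.

    Backward: from the rightmost vertex to a vertex on the rightmost path,
    unless that edge already exists. Forward: from any vertex on the
    rightmost path to a brand-new vertex.
    """
    path = rightmost_path(code)
    rightmost = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in code}
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    new_vertex = rightmost + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward


# Hypothetical DFS code with five vertices; vertex 4 is the rightmost vertex.
code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (2, 3, "X", "c", "Z"),
    (1, 4, "Y", "d", "Z"),
]
backward, forward = extension_sites(code)
```

In this example the rightmost path is 0, 1, 4, so vertices 2 and 3 cannot be extended, which is what restricts the search compared with free extension.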

Pattern Growth Approach

Examples (cont.)

gSpan

gSpan

Pattern Growth Approach

Experimental result using Chemical data

• 340 molecules
• 66 atom types and 4 bond types as labels
• On average only 27 vertices with 28 edges

Summary

Graph representation
• Flattened representation vs. DFS code

Generation of candidate patterns
• Apriori vs. pattern growth

Pattern-Growth Approach

Frequent Graph Pattern

Given a graph dataset D, find subgraph g, s.t.

$\mathrm{freq}(g) \geq \theta$

where $\mathrm{freq}(g)$ is the percentage of graphs in D that contain g, and $\theta$ is the minimum support threshold.

Problem 1: Exponential Pattern Set

Problem 2: Threshold Setting

Difference between frequent itemset and frequent subgraph discovery

Frequent itemset discovery

Subgraph Mining Algorithms

Apriori-based approach
– AGM/AcGM: Inokuchi et al. (PKDD’00)
– FSG: Kuramochi and Karypis (ICDM’01)
– PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
– FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
– FTOSM: Horvath et al. (KDD’06)

Pattern growth approach
– Subdue: Holder et al. (KDD’94)
– MoFa: Borgelt and Berthold (ICDM’02)
– gSpan: Yan and Han (ICDM’02)
– Gaston: Nijssen and Kok (KDD’04)
– CMTreeMiner: Chi et al. (TKDE’05)
– LEAP: Yan et al. (SIGMOD’08)

Framework of Subgraph Mining Algorithms

Search Order
• breadth vs. depth
• complete vs. incomplete

Generation of Candidate Patterns
• apriori vs. pattern growth

Discovery Order of Patterns
• DFS order
• path, tree, graph

Elimination of Duplicate Subgraphs
• passive vs. active

Support Calculation
• store embeddings or not

Frequent Subgraph

Examples:

Example (cont.)

Outline

Introduction and Background

Apriori-based Subgraph Mining

Pattern Growth Subgraph Mining

Summary

DFS code

Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721.

Pattern Growth Approach