Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo
University at Buffalo, The State University of New York
Lei Shi
Department of Computer Science and Engineering
State University of New York at Buffalo
Frequent Subgraph/Substructure Mining
Seminar 2009
Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Graphs are everywhere
Graph Mining Problems
Graph Pattern Mining
• Frequent subgraph pattern mining
• Pattern summarization
• Optimal graph patterns
• Graph patterns with constraints
• Approximate graph patterns
• …
Graph Classification
• Graph clustering
• Important node identification
• Bridge and hub identification
Other Important Topics
• Graph compression
• Graph model
• Social network analysis
Subgraph Pattern Mining
Frequent subgraph
• A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of subgraph pattern mining
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
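The support definition on this slide can be made concrete with a small sketch. It assumes undirected graphs with labeled vertices and unlabeled edges, counts support as the number of dataset graphs that contain the pattern at least once, and uses a brute-force embedding search that is only practical for very small graphs. The names `Graph`, `contains`, `support`, and `is_frequent` are illustrative and do not come from the slides.

```python
from itertools import permutations

class Graph:
    """Undirected graph with labeled vertices (hypothetical helper)."""
    def __init__(self, labels, edges):
        self.labels = list(labels)                   # labels[v] = label of vertex v
        self.edges = {frozenset(e) for e in edges}   # each edge as an unordered pair

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph."""
    n, k = len(graph.labels), len(pattern.labels)
    # image[p] is the graph vertex that pattern vertex p is mapped to
    for image in permutations(range(n), k):
        if any(graph.labels[image[p]] != pattern.labels[p] for p in range(k)):
            continue
        if all(frozenset(image[v] for v in e) in graph.edges for e in pattern.edges):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_support):
    """Frequent means support is no less than the minimum support threshold."""
    return support(dataset, pattern) >= min_support

dataset = [
    Graph("ABC", [(0, 1), (1, 2), (0, 2)]),   # triangle A-B-C
    Graph("ABC", [(0, 1), (1, 2)]),           # path A-B-C
    Graph("AAC", [(0, 1), (1, 2)]),           # path A-A-C
]
edge_ab = Graph("AB", [(0, 1)])
triangle = Graph("ABC", [(0, 1), (1, 2), (0, 2)])

print(support(dataset, edge_ab), support(dataset, triangle))
```

With a minimum support of 2, the single edge A-B is frequent in this toy dataset and the triangle is not.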
Frequent Subgraph Example
[Figure: three graphs (1), (2), (3) with vertex labels A, B, C, and a table of subgraphs with support values 3, 3, 1]
Key Challenges in Subgraph Mining
Graph isomorphism
• Detecting whether two graphs are identical in structure
Graph representation (canonical labeling)
• A canonical label is a unique code for a given graph.
• The canonical label should be the same no matter how the graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.
Subgraph candidate generation
• Generating candidate frequent subgraphs from the dataset
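The isomorphism check named on this slide can be sketched by brute force: try every vertex mapping and test whether it preserves vertex labels, edges, and edge labels. This is an illustration under assumptions, not the method of any algorithm listed in the deck. It assumes simple undirected graphs without self-loops, string labels, and each edge listed once; the function name `isomorphic` is made up for the example.

```python
from itertools import permutations

def isomorphic(labels_a, edges_a, labels_b, edges_b):
    """Brute-force isomorphism test for small labeled graphs.

    labels_x: list of vertex labels, indexed by vertex id.
    edges_x:  dict {(u, v): edge_label}, one entry per undirected edge.
    """
    if len(labels_a) != len(labels_b) or len(edges_a) != len(edges_b):
        return False
    if sorted(labels_a) != sorted(labels_b):
        return False
    # Normalize B's edges so that (u, v) and (v, u) look the same
    norm_b = {frozenset(e): lab for e, lab in edges_b.items()}
    n = len(labels_a)
    for perm in permutations(range(n)):   # perm[v] = vertex of B matched to v
        if any(labels_a[v] != labels_b[perm[v]] for v in range(n)):
            continue
        if all(norm_b.get(frozenset((perm[u], perm[v]))) == lab
               for (u, v), lab in edges_a.items()):
            return True
    return False

# The same path A -x- B -y- C written with two different vertex numberings
g1 = (["A", "B", "C"], {(0, 1): "x", (1, 2): "y"})
g2 = (["C", "B", "A"], {(2, 1): "x", (1, 0): "y"})
# Same vertices, but the edge labels are swapped
g3 = (["A", "B", "C"], {(0, 1): "y", (1, 2): "x"})

print(isomorphic(*g1, *g2), isomorphic(*g1, *g3))
```

The factorial number of mappings is what makes this check expensive, which is why mining algorithms rely on canonical labels to compare graphs instead.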
Subgraph Mining Approaches
Apriori-based
• AGM/AcGM: Inokuchi et al. (PKDD’00)
• FSG: Kuramochi and Karypis (ICDM’01)
  M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001.
• PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
• FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
• FTOSM: Horvath et al. (KDD’06)
Pattern growth based
• Subdue: Holder et al. (KDD’94)
• MoFa: Borgelt and Berthold (ICDM’02)
• gSpan: Yan and Han (ICDM’02)
  X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), December 9-12, 2002. IEEE Computer Society, Washington, DC, 721.
• Gaston: Nijssen and Kok (KDD’04)
• CMTreeMiner: Chi et al. (TKDE’05)
• LEAP: Yan et al. (SIGMOD’08)
Outline
Introduction and Background
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Apriori-based Approach
FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001.
Flattened Representation as Canonical Labeling
Apriori-based method to generate subgraph candidates
Graph Representation in FSG
Flattened Representation
Graph Representation in FSG
Flattened Representation
Lexicographic order or dictionary order
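One plausible reading of "flattened representation" combined with "lexicographic order" is sketched below: write the vertex labels and the upper triangle of the adjacency matrix as a single string, and take the lexicographically smallest such string over all vertex orderings as the canonical label. This is an assumption-laden illustration, not FSG's exact code definition: it assumes single-character labels, uses "0" for "no edge", and tries every ordering exhaustively, which only works for very small graphs. `flatten` and `canonical_label` are hypothetical names.

```python
from itertools import permutations

def flatten(labels, edges, order):
    """Code string for one vertex ordering.

    labels: list of single-character vertex labels.
    edges:  dict {frozenset({u, v}): single-character edge label}.
    order:  tuple giving the vertex placed at each matrix position.
    """
    n = len(order)
    vertex_part = "".join(labels[v] for v in order)
    # Upper triangle of the adjacency matrix, row by row; "0" means no edge
    edge_part = "".join(
        edges.get(frozenset((order[i], order[j])), "0")
        for i in range(n) for j in range(i + 1, n)
    )
    return vertex_part + "|" + edge_part

def canonical_label(labels, edges):
    """Smallest code in dictionary order over all vertex orderings."""
    norm = {frozenset(e): lab for e, lab in edges.items()}
    return min(flatten(labels, norm, p) for p in permutations(range(len(labels))))

# Two numberings of the same path A -x- B -y- C
label_1 = canonical_label(["A", "B", "C"], {(0, 1): "x", (1, 2): "y"})
label_2 = canonical_label(["C", "B", "A"], {(2, 1): "x", (1, 0): "y"})
# A different graph: edge labels swapped
label_3 = canonical_label(["A", "B", "C"], {(0, 1): "y", (1, 2): "x"})

print(label_1, label_2, label_3)
```

Because isomorphic graphs generate the same set of code strings, their minimum is identical, so comparing two graphs reduces to comparing two strings.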
Apriori-based method
Apriori property
• If a graph is frequent, all of its subgraphs are frequent.
Candidate generation
• Create a set of candidates of size k+1
  - from two given frequent k-subgraphs
  - that contain the same (k-1)-subgraph
  - one such pair can result in several candidates of size k+1
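The join and prune steps above can be sketched under one strong simplifying assumption that is not in the slides: every vertex label occurs at most once per graph. A subgraph is then fully described by its set of labeled edges, isomorphism collapses to set equality, and each join yields exactly one candidate. With repeated labels, which FSG has to handle, the same join can yield several candidates. Size is measured in edges, and all function names are illustrative.

```python
from itertools import combinations

def is_connected(edges):
    """True if the edge set forms a single connected component."""
    edges = list(edges)
    if not edges:
        return False
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in seen) != (v in seen):   # edge leaves the reached region
                seen.update((u, v))
                changed = True
    return all(u in seen for u, _ in edges)

def generate_candidates(frequent_k):
    """Join pairs of frequent k-edge subgraphs that share k-1 edges."""
    candidates = set()
    for g1, g2 in combinations(frequent_k, 2):
        union = g1 | g2
        if len(union) == len(g1) + 1 and is_connected(union):
            candidates.add(union)
    return candidates

def prune(candidates, frequent_k):
    """Apriori pruning: every connected k-edge subgraph must be frequent."""
    kept = set()
    for cand in candidates:
        subgraphs = (cand - {e} for e in cand)
        if all(s in frequent_k or not is_connected(s) for s in subgraphs):
            kept.add(cand)
    return kept

AB, BC, CD = ("A", "B"), ("B", "C"), ("C", "D")
frequent_1 = {frozenset({AB}), frozenset({BC}), frozenset({CD})}
candidates_2 = generate_candidates(frequent_1)     # A-B-C and B-C-D
candidates_3 = generate_candidates(candidates_2)   # A-B-C-D
print(len(candidates_2), len(candidates_3))
```

In a full miner each surviving candidate would still need a support count against the dataset; pruning only removes candidates that cannot be frequent.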
Apriori-based method
Candidate generation example
Apriori-based method
Flowchart
Apriori-based method
Experimental results: chemical compound dataset containing 340 compounds and 24 different atom types (vertex labels)
Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Motivation of gSpan
Weaknesses of the Apriori-based approach
• Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
• Pruning false positives is costly: it requires subgraph isomorphism tests, and subgraph isomorphism is an NP-complete problem.
gSpan: Graph-Based Substructure Pattern Mining
• Changes the way a graph is represented (DFS code; DFS: depth-first search)
• Uses pattern growth to generate new subgraph candidates
gSpan: Graph-Based Substructure Pattern Mining
DFS (Depth First Search) Code
• First Step: traverse the graph depth-first and represent the graph by its edges, in the order the traversal visits them.
• Second Step: DFS Lexicographic Order
Pattern Growth subgraph generation
DFS code
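An edge is represented by a 5-tuple:
$(i, j, l_i, l_{(i,j)}, l_j)$
where $i$ and $j$ are the DFS discovery indices of the two endpoints, $l_i$ and $l_j$ are their vertex labels, and $l_{(i,j)}$ is the edge label.
Example: $(0, 1, X, a, Y)$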
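A minimal sketch of how one DFS traversal turns a graph into a sequence of such 5-tuples is given below. It assumes a simple, connected, undirected graph with integer vertex ids and a fixed neighbor order (ascending id), so it yields one of the many DFS codes of the graph, not necessarily the minimum one that gSpan uses as the canonical label. The function name `dfs_code` and the input layout are assumptions made for illustration.

```python
def dfs_code(vertex_labels, edges, start):
    """Build one DFS code: a list of (i, j, l_i, l_ij, l_j) tuples.

    vertex_labels: dict {vertex: label}
    edges:         dict {(u, v): edge_label}, one entry per undirected edge
    start:         vertex where the traversal begins
    """
    adj = {v: {} for v in vertex_labels}
    for (u, v), lab in edges.items():
        adj[u][v] = lab
        adj[v][u] = lab

    index = {}   # discovery index of every visited vertex
    code = []

    def visit(v, parent):
        index[v] = len(index)
        # Backward edges: from the new vertex to already-discovered vertices
        # (other than its DFS parent), by increasing discovery index.
        back = sorted((index[u], u) for u in adj[v] if u in index and u != parent)
        for iu, u in back:
            code.append((index[v], iu, vertex_labels[v], adj[v][u], vertex_labels[u]))
        # Forward edges: each one discovers a new vertex, then recurses.
        for u in sorted(adj[v]):
            if u not in index:
                code.append((index[v], len(index),
                             vertex_labels[v], adj[v][u], vertex_labels[u]))
                visit(u, v)

    visit(start, None)
    return code

# Triangle: X -a- Y, Y -b- Z, Z -c- X
labels = {0: "X", 1: "Y", 2: "Z"}
triangle = {(0, 1): "a", (1, 2): "b", (2, 0): "c"}
for edge in dfs_code(labels, triangle, start=0):
    print(edge)
```

Forward edges have $i < j$ and backward edges have $i > j$, so the two kinds can be told apart from the tuple alone. Starting from a different vertex or visiting neighbors in a different order produces a different code for the same graph, which is why an ordering over codes is needed.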
DFS code
Second Step: DFS Lexicographic Order
Pattern Growth Approach
Pattern Growth (free extension)
Pattern Growth Approach
Duplicate Graphs
Pattern Growth Approach
Free extension
Pattern Growth Approach
Rightmost extension
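Rightmost extension restricts where a pattern may grow: backward edges may only start at the rightmost vertex (the last one discovered) and end on the rightmost path, and forward edges may only start from a vertex on the rightmost path. The sketch below computes the rightmost path from a DFS code and lists the allowed positions as (i, j) index pairs. It leaves out labels, and it does not enforce the ordering of backward edges or the minimum DFS code check that a full gSpan implementation would apply afterwards. The function names are hypothetical.

```python
def rightmost_path(code):
    """Vertices from the root to the rightmost vertex, following forward edges."""
    # A forward edge (i, j) with i < j discovers vertex j from vertex i
    parent = {j: i for i, j, *_ in code if i < j}
    v = max(parent, default=0)   # rightmost vertex = last one discovered
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]

def extension_slots(code):
    """Allowed (i, j) positions for one more edge under rightmost extension."""
    path = rightmost_path(code)
    rightmost = path[-1]
    present = {frozenset((i, j)) for i, j, *_ in code}
    # Backward: rightmost vertex to a rightmost-path vertex not yet adjacent to it
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in present]
    # Forward: any rightmost-path vertex to a brand-new vertex
    new_vertex = max((max(i, j) for i, j, *_ in code), default=0) + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward + forward

code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "Z"),
    (2, 0, "Z", "c", "X"),
    (1, 3, "Y", "d", "W"),
]
print(rightmost_path(code))
print(extension_slots(code))
```

In this example vertex 2 is not on the rightmost path, so no new edge may be attached to it; growing there would regenerate graphs that are reachable through another extension order.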
Pattern Growth Approach
Examples (cont.)
gSpan
Pattern Growth Approach
Experimental results on chemical data
• 340 molecules
• 66 atom types and 4 bond types as labels
• On average, only 27 vertices and 28 edges per graph
Summary
Graph representation
• Flattened representation vs. DFS code
Generation of candidate patterns
• Apriori vs. pattern growth
Pattern-Growth Approach
Frequent Graph Pattern
Given a graph dataset D, find subgraph g, s.t.
$freq(g) \geq \theta$
where $freq(g)$ is the percentage of graphs in D that contain g and $\theta$ is the minimum support threshold.
Problem 1: Exponential pattern set
Problem 2: Threshold setting
Difference between frequent itemset and frequent subgraph discovery
Frequent itemset discovery
Framework of Subgraph Mining Algorithms
Search order
• breadth vs. depth
• complete vs. incomplete
Generation of candidate patterns
• apriori vs. pattern growth
Discovery order of patterns
• DFS order
• path → tree → graph
Elimination of duplicate subgraphs
• passive vs. active
Support calculation
• embedding store or not
Frequent Subgraph
Examples:
Example (cont.)
DFS code
X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), December 9-12, 2002. IEEE Computer Society, Washington, DC, 721.
Pattern Growth Approach