CSCE822 Data Mining and Warehousing

45
Lecture 16 Graph Mining MW 4:00PM-5:15PM Dr. Jianjun Hu http://mleg.cse.sc.edu/edu/csce822 CSCE822 Data Mining and Warehousing University of South Carolina Department of Computer Science and Engineering

description

CSCE822 Data Mining and Warehousing. Lecture 16 Graph Mining MW 4:00PM-5:15PM Dr. Jianjun Hu http://mleg.cse.sc.edu/edu/csce822. University of South Carolina Department of Computer Science and Engineering. Roadmap. What, Why Graph Mining? Methods for Mining Frequent Subgraphs - PowerPoint PPT Presentation

Transcript of CSCE822 Data Mining and Warehousing

Page 1: CSCE822 Data Mining and Warehousing

Lecture 16 Graph MiningMW 4:00PM-5:15PM

Dr. Jianjun Huhttp://mleg.cse.sc.edu/edu/csce822

CSCE822 Data Mining and Warehousing

University of South CarolinaDepartment of Computer Science and Engineering

Page 2: CSCE822 Data Mining and Warehousing

RoadmapWhat, Why Graph Mining?

Methods for Mining Frequent Subgraphs

Mining Variant and Constrained Substructure Patterns

Applications:

Classification and Clustering Graph Indexing Similarity Search

Summary

April 21, 2023Mining and Searching Graphs in Graph Databases

Page 3: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph, Graph, Everywhere

Aspirin Yeast protein interaction network

fro

m H

. Je

on

g e

t a

l Na

ture

41

1, 4

1 (

200

1)

Internet Co-author network

RelationshipsInteractionsconnections

Page 4: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Why Graph Mining?

Graphs are ubiquitous Chemical compounds (Cheminformatics)

Protein structures, biological pathways/networks (Bioinformactics)

Program control flow, traffic flow, and workflow analysis

XML databases, Web, and social network analysis

Graph is a general model Trees, lattices, sequences, and items are degenerated graphs

Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted,

with angles & geometry (topological vs. 2-D/3-D)

Complexity of algorithms: many problems are of high

complexity

Page 5: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph Pattern Mining

Frequent subgraphs

A (sub)graph is frequent if its support (occurrence

frequency) in a given dataset is no less than a minimum

support threshold

Applications of graph pattern mining

Mining biochemical structures

Program control flow analysis

Mining XML structures or Web communities

Building blocks for graph classification, clustering,

compression, comparison, and correlation analysis

Page 6: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Example: Frequent Subgraphs

GRAPH DATASET

FREQUENT PATTERNS(MIN SUPPORT IS 2)

(A) (B) (C)

(1) (2)

Page 7: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph Mining Algorithms

Incomplete beam search – Greedy (Subdue 1994)

Inductive logic programming (WARMR 1998)

Graph theory-based approaches

Apriori-based approach

Pattern-growth approach

Page 8: CSCE822 Data Mining and Warehousing

Graph Definitionsa

b a

c c

b

(a) Labeled Graph

pq

p

p

rs

tr

t

qp

a

a

c

b

(b) Subgraph

p

s

t

p

a

a

c

b

(c) Induced Subgraph

p

rs

tr

p

Page 9: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

SUBDUE (Holder et al. KDD’94)

Start with single vertices

Expand best substructures with a new edge

Limit the number of best substructures

Substructures are evaluated based on their ability to

compress input graphs Using minimum description length (DL) Best substructure S in graph G minimizes: DL(S) +

DL(G\S)Terminate until no new substructure is discovered

Page 10: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Frequent Subgraph Mining Approaches

Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03)

Pattern growth approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04)

Page 11: CSCE822 Data Mining and Warehousing

Apriori-like AlgorithmFind frequent 1-subgraphsRepeat

Candidate generation Use frequent (k-1)-subgraphs to generate candidate k-subgraph

Candidate pruning Prune candidate subgraphs that contain infrequent

(k-1)-subgraphs Support counting

Count the support of each remaining candidate Eliminate candidate k-subgraphs that are infrequent

In practice, it is not as easy. There are many other issues

Page 12: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Apriori-Based Approach

G

G1

G2

Gn

k-edge(k+1)-edge

G’

G’’

JOIN

Pruning

G1

G2

SupportCounting

G1

G2

Elimination

G1

Page 13: CSCE822 Data Mining and Warehousing

Candidate GenerationIn Apriori:

Merging two frequent k-itemsets will produce a candidate (k+1)-itemset

In frequent subgraph mining (vertex/edge growing) Merging two frequent k-subgraphs may produce more

than one candidate (k+1)-subgraph

Page 14: CSCE822 Data Mining and Warehousing

Vertex Growinga

a

e

a

p

q

r

p

a

a

a

p

rr

d

G1 G2

p

000

00

00

0

1

q

rp

rp

qpp

MG

000

0

00

00

2

r

rrp

rp

pp

MG

a

a

a

p

q

r

ep

0000

0000

00

000

00

3

q

r

rrp

rp

qpp

MG

G3 = join(G1,G2)

dr+

Page 15: CSCE822 Data Mining and Warehousing

Edge Growing

a

a

f

a

p

q

r

p

a

a

a

p

rr

f

G1 G2

p

a

a

a

p

q

r

fp

G3 = join(G1,G2)

r+

Page 16: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

FFSM (Huan, et al. ICDM’03)

Represent graphs using canonical adjacency matrix (CAM)Join two CAMs or extend a CAM to generate a new graphStore the embeddings of CAMs

All of the embeddings of a pattern in the database Can derive the embeddings of newly generated CAMs

Page 17: CSCE822 Data Mining and Warehousing

Example: Dataset

a

b

e

c

p

q

rp

a

b

d

p

r

G1 G2

q

e

c

a

p q

r

b

p

G3

d

rd

r

(a,b,p) (a,b,q) (a,b,r) (b,c,p) (b,c,q) (b,c,r) … (d,e,r)G1 1 0 0 0 0 1 … 0G2 1 0 0 0 0 0 … 0G3 0 0 1 1 0 0 … 0G4 0 0 0 0 0 0 … 0

a eq

c

d

p p

p

G4

r

Page 18: CSCE822 Data Mining and Warehousing

Example

p

a b c d ek=1FrequentSubgraphs

a b

pc d

pc e

qa e

rb d

pa b

d

r

pd c

e

p

(Pruned candidate)

Minimum support count = 2

k=2FrequentSubgraphs

k=3CandidateSubgraphs

Page 19: CSCE822 Data Mining and Warehousing

Graph IsomorphismA graph is isomorphic if it is topologically equivalent

to another graph

A

A

A A

B A

B

A

B

B

A

A

B B

B

B

Page 20: CSCE822 Data Mining and Warehousing

Graph Isomorphism

Page 21: CSCE822 Data Mining and Warehousing

Graph IsomorphismTest for graph isomorphism is needed:

During candidate generation step, to determine whether a candidate has been generated

During candidate pruning step, to check whether its (k-1)-subgraphs are frequent

During candidate counting, to check whether a candidate is contained within another graph

Page 22: CSCE822 Data Mining and Warehousing

Graph IsomorphismUse canonical labeling to handle isomorphism

Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding

Example: Lexicographically largest adjacency matrixFind the permutations of the vertices so that the adjacency matrix

is lexicographically maximized when read off from left to right, one row at a time.

Page 23: CSCE822 Data Mining and Warehousing

Graph Isomorphism Example:

Lexicographically largest adjacency matrix

0110

1011

1100

0100

String: 0010001111010110

0001

0011

0101

1110

Canonical: 0111101011001000

Page 24: CSCE822 Data Mining and Warehousing

Lexicographically largest adjacency matrixFind the permutations of the vertices so that the

adjacency matrix is lexicographically maximized when read off from left to right, one row at a time.

Page 25: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph Pattern Explosion Problem

If a graph is frequent, all of its subgraphs are frequent ─

the Apriori property

An n-edge frequent graph may have 2n subgraphs

Among 422 chemical compounds which are confirmed to

be active in an AIDS antiviral screen dataset, there are

1,000,000 frequent graph patterns if the minimum support

is 5%

Page 26: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Closed Frequent Graphs

Motivation: Handling graph pattern explosion problemClosed frequent graph

A frequent graph G is closed if there exists no supergraph of G that carries the same support as G

If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)

Lossless compression: still ensures that the mining result is complete

Page 27: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

CLOSEGRAPH (Yan & Han, KDD’03)

A Pattern-Growth Approach

G

G1

G2

Gn

k-edge

(k+1)-edge

At what condition, can westop searching their children

i.e., early termination?

If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.

Page 28: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Experimental ResultThe AIDS antiviral screen compound dataset from

NCI/NIH

The dataset contains 43,905 chemical compounds

Among these 43,905 compounds, 423 of them belongs

to CA, 1081 are of CM, and the remaining are in class

CI

Page 29: CSCE822 Data Mining and Warehousing

April 21, 2023 Mining and Searching Graphs in Graph Databases

Discovered Patterns

20% 10%

5%

Page 30: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph MiningMethods for Mining Frequent Subgraphs

Mining Variant and Constrained Substructure Patterns

Applications:

Graph Indexing Similarity Search Classification and Clustering

Summary

Page 31: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Constrained Patterns

DegreeDensityDiameterConnectivityMin, Max, Avg

Page 32: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Pattern-Growth ApproachFind a small frequent candidate graph

Remove vertices (shadow graph) whose degree is less than the connectivity

Decompose it to extract the subgraphs satisfying the connectivity constraint

Stop decomposing when the subgraph has been checked before

Extend this candidate graph by adding new vertices and edges

Repeat

Page 33: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph MiningMethods for Mining Frequent Subgraphs

Mining Variant and Constrained Substructure Patterns

Applications:

Classification and Clustering Graph Indexing Similarity Search

Summary

Page 34: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph ClusteringGraph similarity measure

Feature-based similarity measureEach graph is represented as a feature vector

The similarity is defined by the distance of their corresponding vectors

Frequent subgraphs can be used as features

Structure-based similarity measureMaximal common subgraph

Graph edit distance: insertion, deletion, and relabel

Graph alignment distance

Page 35: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph ClassificationLocal structure based approach

Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

Graph pattern-based approach Subgraph patterns from domain knowledge Subgraph patterns from data mining

Kernel-based approach Random walk (Gärtner ’02, Kashima et al. ’02, ICML’03,

Mahé et al. ICML’04) Optimal local assignment (Fröhlich et al. ICML’05)

Page 36: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph Pattern-Based Classification

Subgraph patterns from domain knowledge Molecular descriptors

Subgraph patterns from data mining General idea

Each graph is represented as a feature vector x = {x1,

x2, …, xn}, where xi is the frequency of the i-th pattern

in that graph Each vector is associated with a class label Classify these vectors in a vector space

Page 37: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph MiningMethods for Mining Frequent Subgraphs

Mining Variant and Constrained Substructure Patterns

Applications:

Classification and Clustering Graph Indexing Similarity Search

Summary

Page 38: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph Search

Querying graph databases: Given a graph database and a query graph, find all

the graphs containing this query graph

query graph graph database

Page 39: CSCE822 Data Mining and Warehousing

Scalability IssueSequential scan

Disk I/Os Subgraph isomorphism testing

An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03

April 21, 2023Mining and Searching Graphs in Graph Databases

Page 40: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Indexing Strategy

Graph (G)

Substructure

Query graph (Q)

If graph G contains query graph Q, G should contain any substructure of Q

Remarks Index substructures of a query graph to prune graphs that do not

contain these substructures

Page 41: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Indexing Framework

Two steps in processing graph queriesStep 1. Index Construction

Enumerate structures in the graph database, build an inverted index between structures and graphs

Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing

these structures Prune the false positive answers by

performing subgraph isomorphism test

Page 42: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Graph MiningMethods for Mining Frequent Subgraphs

Mining Variant and Constrained Substructure Patterns

Applications:

Classification and Clustering Graph Indexing Similarity Search

Summary

Page 43: CSCE822 Data Mining and Warehousing

April 21, 2023 Mining and Searching Graphs in Graph Databases

Structure Similarity Search

(a) caffeine (b) diurobromine (c) viagra

• CHEMICAL COMPOUNDS

• QUERY GRAPH

Page 44: CSCE822 Data Mining and Warehousing

April 21, 2023 Mining and Searching Graphs in Graph Databases

Feature-Graph Matrix

G1 G2 G3 G4 G5

f10 1 0 1 1

f20 1 0 0 1

f31 0 1 1 1

f41 0 0 0 1

f50 0 1 1 0

Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold

graphs in databasefe

atu

res

Page 45: CSCE822 Data Mining and Warehousing

April 21, 2023Mining and Searching Graphs in Graph Databases

Summary: Graph Mining

Graph mining has wide applications

Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach

Graph indexing techniques Frequent and discriminative subgraphs are high-quality indexing features

Similarity search in graph databases Indexing and feature-based matching

Further development and application exploration