Graph Based Clustering

Graph Based Clustering Summer School “Achievements and Applications of Contemporary Informatics, Mathematics and Physics” (AACIMP 2011) August 8-20, 2011, Kiev, Ukraine Erik Kropat University of the Bundeswehr Munich Institute for Theoretical Computer Science, Mathematics and Operations Research Neubiberg, Germany


AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat.

Transcript of Graph Based Clustering

Page 1: Graph Based Clustering

Graph Based Clustering

Summer School “Achievements and Applications of Contemporary Informatics, Mathematics and Physics” (AACIMP 2011)

August 8-20, 2011, Kiev, Ukraine

Erik Kropat

University of the Bundeswehr Munich

Institute for Theoretical Computer Science, Mathematics and Operations Research

Neubiberg, Germany

Page 2: Graph Based Clustering

Real World Networks

• Biological Networks

− Gene regulatory networks

− Metabolic networks

− Neural networks

− Food webs

• Technological Networks

− Telecommunication networks

− Internet

− Power grids

Figures: food web, power grid.

Page 3: Graph Based Clustering

Real World Networks

• Social Networks

− Communication networks

− Organizational networks

− Social media

− Online communities

• Economic Networks

− Financial market networks

− Trade networks

− Collaboration networks

Figures: social networks, economic networks.

Source: Frank Schweitzer et al., “Economic Networks: The New Challenges,” Science 325, no. 5939 (July 24, 2009): 422-425.

Page 4: Graph Based Clustering

Graph Theory

• Graph theory can provide more detailed information about the inner structure of the data set in terms of

− cliques (subsets of nodes where each pair of elements is connected)

− clusters (highly connected groups of nodes)

− centrality (important nodes, hubs)

− outliers . . . (unimportant nodes)

• Applications

− social network analysis

− diffusion of information

− spreading of diseases or rumours

⇒ marketing campaigns, viral marketing, social network advertising
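
Two of the notions listed on this slide, cliques and centrality, can be made concrete with a small sketch. The toy graph and the function names are hypothetical, and degree centrality is only one of several possible centrality measures; it is chosen here because it is the simplest.

```python
from itertools import combinations

def is_clique(graph, nodes):
    """True if every pair of the given nodes is connected by an edge."""
    return all(y in graph[x] for x, y in combinations(nodes, 2))

def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

# Hypothetical undirected toy graph, stored as adjacency sets.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(is_clique(graph, ["a", "b", "c"]))   # a, b, c are pairwise connected
print(is_clique(graph, ["a", "b", "d"]))   # b and d are not connected
print(degree_centrality(graph))            # node a is connected to all others
```

In this toy graph node a plays the role of a hub, while node d, with a single connection, is the closest thing to an outlier.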

Page 5: Graph Based Clustering

Graph-Based Clustering

• A collection of a wide range of very popular clustering algorithms that are based on graph theory.

• Organizes information in large datasets so that users can access the required information faster.

Page 6: Graph Based Clustering

Idea

• Objects are represented as nodes in a complete or connected graph.

• Assign a weight to each branch between the two nodes x and y.

The weight is defined by the distance d(x,y) between the nodes.
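
A minimal sketch of this construction, assuming the objects are points in the plane and d(x,y) is the Euclidean distance. Both are assumptions made for illustration; the slides leave the distance open, and the object names and coordinates are hypothetical.

```python
import math
from itertools import combinations

def complete_graph(points):
    """Return {(x, y): d(x, y)} for every unordered pair of objects.

    `points` maps an object name to its coordinates. The Euclidean
    distance is an assumption; any distance d(x, y) could be used.
    """
    return {
        (x, y): math.dist(points[x], points[y])
        for x, y in combinations(sorted(points), 2)
    }

# Hypothetical toy data, not taken from the slides.
points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (6.0, 8.0)}
weights = complete_graph(points)

for (x, y), d in weights.items():
    print(f"d({x},{y}) = {d}")
```

Every pair of objects gets exactly one branch, so n objects produce n(n-1)/2 weighted branches.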

Diagram: distance between objects, distance between clusters, clustering.

Page 7: Graph Based Clustering

Idea

Figure, three panels: graph, minimal spanning tree, clusters.

Page 8: Graph Based Clustering

Graph Based Clustering

Hierarchical method

(1) Determine a minimal spanning tree (MST)

(2) Delete branches iteratively

New connected components = Cluster

Figure: example graph with branch weights 1, 3, 5, 8, 4, 6.

Page 9: Graph Based Clustering

Minimal Spanning Trees

Page 10: Graph Based Clustering

Minimal Spanning Tree

A minimal spanning tree of a connected graph G = (V,E)

is a connected subgraph with minimal weight

that contains all nodes of G and has no cycles.
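
In symbols, the definition can be read as a minimization problem. The weight of a branch e is written d_e, as on the later slides; the notation 𝒯(G) for the set of all spanning trees of G is introduced here only for illustration.

```latex
T^{*} \;=\; \operatorname*{arg\,min}_{T = (V, E_T) \,\in\, \mathcal{T}(G)} \; \sum_{e \in E_T} d_e ,
\qquad |E_T| = |V| - 1 .
```

The side condition states that a spanning tree always has exactly |V| - 1 branches: it is connected, reaches all nodes and contains no cycle.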

Figure, two panels: the graph G = (V, E) on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6, and a minimal spanning tree of G.

Page 11: Graph Based Clustering

Minimal spanning trees can be calculated with...

(1) Prim’s algorithm.

(2) Kruskal’s algorithm.

Figure: example graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6.

Page 12: Graph Based Clustering

Example: Prim’s Algorithm

Figure: example graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6.

Set VT = {a}, ET = { }

Figure: example graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

VT = {a,b} and ET = { (a,b) }.

Page 13: Graph Based Clustering

Example: Prim’s Algorithm

Figure: example graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

VT = {a,b,d} and ET = { (a,b), (a,d) }.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

VT = {a,b,c,d} and ET = { (a,b), (a,d),(b,c) }.

Figure: example graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6.

Page 14: Graph Based Clustering

Prim’s Algorithm

INPUT: Weighted graph G = (V, E), undirected + connected

OUTPUT: Minimal spanning tree T = (VT, ET)

(1) Set VT = {v}, ET = { }, where v is an arbitrary node from V (starting point).

(2) REPEAT

(3) Choose an edge (a,b) with minimal weight, such that a ∈ VT and b ∉ VT.

(4) Set VT = VT ∪ {b} and ET = ET ∪ { (a,b) }.

(5) UNTIL VT = V
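
The steps above can be sketched in Python as follows. The binary heap used to find the branch with minimal weight is an implementation choice, not something the slides prescribe. The assignment of the weights 1, 3, 5, 8, 4, 6 to particular branches is hypothetical, because the figure of the example graph is not available; it was chosen so that the result agrees with the example slides.

```python
import heapq

def prim_mst(graph, start):
    """Prim's algorithm on an undirected, connected, weighted graph.

    `graph` maps each node to a dict {neighbour: weight}.
    Returns the tree branches as a list of (a, b, weight) tuples.
    """
    in_tree = {start}                      # VT
    tree_edges = []                        # ET
    # Candidate branches (weight, a, b) with a in VT.
    frontier = [(w, start, b) for b, w in graph[start].items()]
    heapq.heapify(frontier)
    while frontier and len(in_tree) < len(graph):
        w, a, b = heapq.heappop(frontier)
        if b in in_tree:                   # both end nodes already in VT
            continue
        in_tree.add(b)
        tree_edges.append((a, b, w))
        for c, w_bc in graph[b].items():
            if c not in in_tree:
                heapq.heappush(frontier, (w_bc, b, c))
    return tree_edges

# Hypothetical assignment of the slide weights to branches.
edges = [("a", "b", 1), ("a", "d", 3), ("b", "c", 4),
         ("b", "d", 5), ("c", "d", 6), ("a", "c", 8)]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, {})[b] = w
    graph.setdefault(b, {})[a] = w

mst = prim_mst(graph, "a")
print(mst)  # [('a', 'b', 1), ('a', 'd', 3), ('b', 'c', 4)]
```

With this assumed assignment the branches are added in the order (a,b), (a,d), (b,c), the same order as in the example.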

Page 15: Graph Based Clustering

Kruskal’s Algorithm

INPUT: Weighted graph G = (V, E), undirected + connected

OUTPUT: Minimal spanning tree T = (VT, ET)

(1) Set VT = V, ET = { }, H = E.

(2) Initialize a queue to contain all edges in G, using the weights in ascending order as keys.

(3) WHILE H ≠ { }

(4) Choose an edge e ∈ H with minimal weight.

(5) Set H = H \ {e}.

(6) If (VT, ET ∪ {e}) has no cycles, then ET = ET ∪ {e} .

(7) END
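
A possible Python sketch of these steps. The cycle test in step (6) is done here with a union-find structure, which is a common choice but an assumption with respect to the slides. The example graph uses a hypothetical assignment of the weights 1, 3, 5, 8, 4, 6 to branches.

```python
def kruskal_mst(nodes, edges):
    """Kruskal's algorithm; `edges` is a list of (weight, a, b) tuples.

    Returns the tree branches as a list of (a, b, weight) tuples.
    """
    parent = {v: v for v in nodes}         # union-find forest

    def find(v):
        """Return the representative of the component containing v."""
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree_edges = []
    for w, a, b in sorted(edges):          # branches in ascending weight order
        root_a, root_b = find(a), find(b)
        if root_a != root_b:               # adding (a, b) closes no cycle
            parent[root_a] = root_b
            tree_edges.append((a, b, w))
    return tree_edges

# Hypothetical assignment of the slide weights to branches.
edges = [(1, "a", "b"), (3, "a", "d"), (4, "b", "c"),
         (5, "b", "d"), (6, "c", "d"), (8, "a", "c")]
mst = kruskal_mst(["a", "b", "c", "d"], edges)
print(mst)  # [('a', 'b', 1), ('a', 'd', 3), ('b', 'c', 4)]
```

Two end nodes with the same representative are already connected in (VT, ET), so a branch between them would close a cycle and is skipped.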

Page 16: Graph Based Clustering

Branch Deletion

Page 17: Graph Based Clustering

Delete Branches - Different Strategies

(1) Delete the branch with maximum weight.

(2) Delete inconsistent branches.

(3) Delete by analysis of weights.

Page 18: Graph Based Clustering

(1) Delete the branch with maximum weight

• In each step, create two new clusters by deleting the branch with maximum weight.

• Repeat until the given number of clusters is reached.
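
A sketch of this strategy, assuming the minimal spanning tree is already given as a list of branches. Deleting the k - 1 heaviest branches at once gives the same result as deleting them one by one. The path-shaped tree and the node names p0 to p8 are hypothetical; only the branch weights 2, 2, 6, 3, 4, 2, 2, 2 come from the slides.

```python
def cut_heaviest(tree_edges, nodes, k):
    """Split a spanning tree into k clusters by deleting the k - 1 heaviest branches.

    `tree_edges` is a list of (a, b, weight) tuples.
    Returns a list of clusters, each a sorted list of nodes.
    """
    by_weight = sorted(tree_edges, key=lambda e: e[2])
    kept = by_weight[: len(tree_edges) - (k - 1)]
    neighbours = {v: set() for v in nodes}
    for a, b, _ in kept:
        neighbours[a].add(b)
        neighbours[b].add(a)
    clusters, seen = [], set()
    for v in nodes:                        # connected components by depth-first search
        if v in seen:
            continue
        stack, component = [v], []
        seen.add(v)
        while stack:
            x = stack.pop()
            component.append(x)
            for y in neighbours[x] - seen:
                seen.add(y)
                stack.append(y)
        clusters.append(sorted(component))
    return clusters

# Hypothetical path-shaped tree carrying the slide weights.
nodes = [f"p{i}" for i in range(9)]
tree = [(f"p{i}", f"p{i + 1}", w) for i, w in enumerate([2, 2, 6, 3, 4, 2, 2, 2])]
clusters = cut_heaviest(tree, nodes, k=3)
print(clusters)
```

For k = 3 the branches with weights 6 and 4 are deleted, which leaves three connected components. If several branches share the maximum weight, the choice among them is arbitrary.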

Figure: minimal spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2.

Page 19: Graph Based Clustering

Example: Delete the branch with maximum weight

Figure: minimal spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2.

Ordered weights of branches: 6, 4, 3, 2, 2, 2, 2, 2.

Page 20: Graph Based Clustering

Example: Delete the branch with maximum weight

Figure: minimal spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2.

Ordered weights of branches: 6, 4, 3, 2, 2, 2, 2, 2.

Step 1: Delete branch (weight 6) ⇒ 2 clusters

Page 21: Graph Based Clustering

Example: Delete the branch with maximum weight

Figure: minimal spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2.

Ordered weights of branches: 6, 4, 3, 2, 2, 2, 2, 2.

Step 1: Delete branch (weight 6) ⇒ 2 clusters

Step 2: Delete branch (weight 4) ⇒ 3 clusters

Page 22: Graph Based Clustering

(2) Delete inconsistent branches

• A branch e is inconsistent if the corresponding weight $d_e$ is (much) larger than a reference value $\bar{d}_e$.

• The reference value $\bar{d}_e$ can be defined by the average weight of all branches adjacent to e.

Figure: branch e with weight 6 and adjacent branches with weights 1, 2, 3.

$\bar{d}_e = \frac{3 + 2 + 1}{3} = 2$

$d_e = 6 > 2 = \bar{d}_e$ ⇒ e inconsistent
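
The rule on this slide can be sketched as follows. Two branches count as adjacent if they share a node. The parameter `factor` is a hypothetical way to express "(much) larger", and the node names u, v, p, q, r are invented; only the weights 6, 3, 2, 1 come from the slide.

```python
def inconsistent_branches(tree_edges, factor=1.0):
    """Return the branches whose weight exceeds factor * reference value.

    `tree_edges` is a list of (a, b, weight) tuples. The reference value
    of a branch is the average weight of the branches sharing a node with it.
    """
    flagged = []
    for i, (a, b, d_e) in enumerate(tree_edges):
        adjacent = [
            w for j, (x, y, w) in enumerate(tree_edges)
            if j != i and {a, b} & {x, y}
        ]
        if not adjacent:                   # a lone branch has no reference value
            continue
        reference = sum(adjacent) / len(adjacent)
        if d_e > factor * reference:
            flagged.append((a, b, d_e))
    return flagged

# Hypothetical tree: branch e = (u, v) with weight 6 and
# three adjacent branches with weights 3, 2 and 1.
tree = [("u", "v", 6), ("u", "p", 3), ("u", "q", 2), ("v", "r", 1)]
print(inconsistent_branches(tree))  # [('u', 'v', 6)]
```

Only the branch with weight 6 exceeds its reference value (3 + 2 + 1) / 3 = 2. The other three branches are all adjacent to it, which raises their reference values above their own weights.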

Page 23: Graph Based Clustering

(3) Delete by analysis of weights

• Perform an “analysis” of all weights of branches in the MST. Determine a threshold S.

• The threshold can be estimated by histograms on the weights of branches (= length of branches).

• Delete a branch if its weight is higher than the threshold S.
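
A small sketch of the threshold strategy. The histogram is computed as a plain table of counts, and the threshold S = 3.5 is a hypothetical value read off that table by eye. The path-shaped tree is also hypothetical; only the branch weights 2, 2, 6, 3, 4, 2, 2, 2 are taken from the earlier example.

```python
from collections import Counter

def weight_histogram(tree_edges):
    """Count how many branches carry each weight (a text stand-in for the histogram)."""
    return sorted(Counter(w for _, _, w in tree_edges).items())

def delete_above_threshold(tree_edges, threshold):
    """Keep only the branches whose weight does not exceed the threshold S."""
    return [(a, b, w) for a, b, w in tree_edges if w <= threshold]

# Hypothetical path-shaped tree carrying the weights of the earlier example.
tree = [(f"p{i}", f"p{i + 1}", w) for i, w in enumerate([2, 2, 6, 3, 4, 2, 2, 2])]

histogram = weight_histogram(tree)
print(histogram)                           # most branches have weight 2

kept = delete_above_threshold(tree, threshold=3.5)  # S = 3.5 is an assumed value
print(len(tree) - len(kept), "branches deleted")
```

The connected components of the remaining branches are the clusters. Unlike strategy (1), the number of clusters is not fixed in advance but follows from the choice of S.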

Figure: histograms of the branch weights (horizontal axis: weight of branch, i.e. length of branch; vertical axis: number of branches), with the threshold S marked.

Page 24: Graph Based Clustering

Exercise

Find a minimal spanning tree and provide a clustering of the graph by deleting all inconsistent branches.

Figure: weighted graph on the nodes a, b, c, d, e, f, g with branch weights 10, 2, 12, 4, 1, 3, 20, 8, 5, 9, 15, 6.

Page 25: Graph Based Clustering

Example

Set VT = {a}, ET = { }

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Page 26: Graph Based Clustering

Example

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Page 27: Graph Based Clustering

Example

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Page 28: Graph Based Clustering

Example

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

minimal spanning tree

Page 29: Graph Based Clustering

Example

For each branch calculate the reference value

(average weight of adjacent branches)

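Figure: minimal spanning tree on the nodes a, b, c, d, e, f, g. Branch weights: 2, 4, 1, 3, 5, 6. Reference values (shown in parentheses in the figure): 3, 3, 4.5, 3.6, 5, 4.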

Page 30: Graph Based Clustering

Example

Delete inconsistent branches

(weight is larger than the reference value)

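Figure: nodes a, b, c, d, e, f, g after deleting the inconsistent branches. Remaining branch weights shown: 4, 1, 3, with reference values (3), (3), (4). Labels in the figure: “2 clusters” and “Noise?”.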

Page 31: Graph Based Clustering

Summary

Page 32: Graph Based Clustering

Summary

• In graph-based clustering, objects are represented as nodes in a complete or connected graph.

• The distance between two objects is given by the weight of the corresponding branch.

• Hierarchical method

(1) Determine a minimal spanning tree (MST)

(2) Delete branches iteratively

• Visualization of information in large datasets.

Page 33: Graph Based Clustering

Literature

• V. Kumar, M. Steinbach, P.-N. Tan: Introduction to Data Mining. Addison Wesley, 2005.

Other work mentioned in the presentation

• J.A. Dunne, R.J. Williams, N.D. Martinez, R.A. Wood, D.H. Erwin: Compilation and Network Analyses of Cambrian Food Webs. PLoS Biol 6(4): e102. doi:10.1371/journal.pbio.0060102

• F. Schweitzer, G. Fagiolo, D. Sornette, F. Vega-Redondo, A. Vespignani, D.R. White: Economic Networks: The New Challenges. Science 325, no. 5939 (July 24, 2009): 422-425.

Page 34: Graph Based Clustering

Thank you very much!