Clustering Pathways Using Graph Mining Approach Mahmud Shahriar Hossain Monika Akbar Pramodh Pochu...

Post on 17-Jan-2016

Clustering Pathways Using Graph Mining Approach

Mahmud Shahriar Hossain, Monika Akbar, Pramodh Pochu, Venkata Sesha Sanagavarapu

2

Design Pipeline

Preprocessor

Frequent Subgraph Discovery

Graph Objects of Pathways

Mined Data

Pathway Clustering

STKE Dataset

NN Search

Pathway Relations

3

Dataset Properties (size)

Total Pathways = 50

[Figure: number of k-edge pathways (y-axis, 0 to 4) versus size of pathway, k (x-axis, 0 to 110).]

4

Dataset Properties (size)

Total Pathways = 50

[Figure: number of pathways in size range (y-axis, 0 to 13) versus size range (x-axis, bins 0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, 100-110).]

5

pf-ipf (tf-idf)

Transaction          Items bought
David Lopez          Orange Juice (2), Potato chip (3), Pepsi (1)
Robbie Lamb          Potato chip (3), Pepsi (3), Beer (1)
Jonathan Branden     Potato chip (1), Pepsi (1)
John Paxton          Potato chip (2), Coconut Cookies (2), Pepsi (1)
Rafal Angryk         Swiss Army Knife (15)
Jeannete Radclif     Potato chip (2), Coconut Cookies (3)
Rocky Ross           Orange Juice (2), Coconut Cookies (3)
Richard MaClaster    Coconut Cookies (3), Beer (1)
…………                 ……………………………….
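The transaction table reads naturally as a tf-idf example: an item bought in bulk by one customer only (Swiss Army Knife) should score higher than an item that appears in most baskets (Potato chip). The slides do not spell out the pf-ipf formula, so the sketch below assumes the textbook tf-idf form, with term frequency taken as quantity divided by basket total and inverse frequency as ln(N / df). The function name `tf_idf_scores` is hypothetical, and only the eight rows listed on the slide are used.

```python
import math

# The eight transactions listed on the slide (the elided rows are not guessed).
transactions = {
    "David Lopez": {"Orange Juice": 2, "Potato chip": 3, "Pepsi": 1},
    "Robbie Lamb": {"Potato chip": 3, "Pepsi": 3, "Beer": 1},
    "Jonathan Branden": {"Potato chip": 1, "Pepsi": 1},
    "John Paxton": {"Potato chip": 2, "Coconut Cookies": 2, "Pepsi": 1},
    "Rafal Angryk": {"Swiss Army Knife": 15},
    "Jeannete Radclif": {"Potato chip": 2, "Coconut Cookies": 3},
    "Rocky Ross": {"Orange Juice": 2, "Coconut Cookies": 3},
    "Richard MaClaster": {"Coconut Cookies": 3, "Beer": 1},
}


def tf_idf_scores(baskets):
    """Assumed textbook tf-idf: (qty / basket total) * ln(N / df)."""
    n = len(baskets)
    # Document frequency: in how many baskets each item appears.
    df = {}
    for items in baskets.values():
        for item in items:
            df[item] = df.get(item, 0) + 1
    scores = {}
    for owner, items in baskets.items():
        total = sum(items.values())
        scores[owner] = {
            item: (qty / total) * math.log(n / df[item])
            for item, qty in items.items()
        }
    return scores


scores = tf_idf_scores(transactions)
for owner, item_scores in scores.items():
    print(owner, {item: round(s, 3) for item, s in item_scores.items()})
```

Under these assumptions the rare, high-quantity item dominates every Potato chip score, which is the behaviour the analogy is meant to convey when items are replaced by pathway edges.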

6

Dataset Properties (pf-ipf)

Number of Edges in MPG = 1376

[Figure: number of edges left (y-axis, 0 to 1400) versus min_pfipf (x-axis, 0.00 to 0.20).]

7

Dataset Properties (pf-ipf)

Total Pathways = 50

[Figure: number of pathways left (y-axis, 20 to 50) versus min_pfipf (x-axis, 0.00 to 0.20).]
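The two pf-ipf plots count how many edges and how many pathways survive as the min_pfipf threshold rises. A minimal sketch of that bookkeeping follows. It assumes each pathway stores a score per edge, that an edge is dropped from a pathway when its score falls below the threshold, and that a pathway counts as "left" while it keeps at least one edge. The data layout, the function names `prune` and `summarize`, and the toy scores are all illustrative and not taken from the STKE dataset.

```python
def prune(pathway_scores, min_pfipf):
    """Keep only edges scoring at least min_pfipf; drop emptied pathways."""
    pruned = {}
    for pathway, edge_scores in pathway_scores.items():
        kept = {edge: s for edge, s in edge_scores.items() if s >= min_pfipf}
        if kept:
            pruned[pathway] = kept
    return pruned


def summarize(pruned):
    """Return (distinct edges left, pathways left)."""
    edges = set()
    for edge_scores in pruned.values():
        edges.update(edge_scores)
    return len(edges), len(pruned)


# Toy scores, invented for illustration only.
toy = {
    "pathway_A": {("n1", "n2"): 0.12, ("n2", "n3"): 0.03},
    "pathway_B": {("n2", "n3"): 0.01, ("n3", "n4"): 0.05},
    "pathway_C": {("n3", "n4"): 0.015},
}

for threshold in (0.0, 0.02, 0.04, 0.2):
    print(threshold, summarize(prune(toy, threshold)))
```

Sweeping the threshold this way produces the two monotonically decreasing curves of the kind shown on the slides.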

8

Subgraph Discovery

k     # of subgraphs generated    Time (sec.)
1     1,376                       Existing
2     5,380                       41
3     29,565                      149
4     187,508                     971
5     1,274,852                   7,518
---   ----                        -----

min_sup = 2%

• What is so novel about pruning edges?
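The table shows the number of generated subgraphs growing several-fold with each increment of k, which is why reducing the edge set before mining matters. The sketch below illustrates one level of that growth under a strong simplifying assumption: every node label is unique across pathways, so a subgraph occurs in a pathway exactly when its edge set is contained in the pathway's edge set. It is a simplified Apriori-style step, not FSG and not the authors' algorithm, and all names and toy data are hypothetical.

```python
from itertools import combinations


def frequent_edges(pathways, min_sup):
    """Edges whose support (fraction of pathways containing them) >= min_sup."""
    n = len(pathways)
    counts = {}
    for edges in pathways:
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge for edge, c in counts.items() if c / n >= min_sup}


def frequent_two_edge_subgraphs(pathways, min_sup):
    """Join frequent edges that share a node, then keep the frequent pairs."""
    n = len(pathways)
    freq = sorted(frequent_edges(pathways, min_sup))
    result = {}
    for e1, e2 in combinations(freq, 2):
        if not set(e1) & set(e2):
            continue  # the two edges do not touch, so the pair is not connected
        support = sum(1 for edges in pathways if e1 in edges and e2 in edges) / n
        if support >= min_sup:
            result[frozenset((e1, e2))] = support
    return result


# Toy pathways as sets of directed edges (illustrative only).
pathways = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c")},
    {("b", "c"), ("c", "d")},
    {("x", "y")},
]

one_edge = frequent_edges(pathways, 0.5)
two_edge = frequent_two_edge_subgraphs(pathways, 0.5)
print(len(one_edge), len(two_edge))
```

Every edge removed before this step also removes all candidate joins that would have involved it, so the saving compounds at higher k.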

9

Subgraph Discovery

[Figure: contour graph for number of subgraphs over min_sup (4 to 20) and pf-ipf threshold (0.01 to 0.10), with an accompanying plot of number of subgraphs (0 to 6000) against both parameters.]

Total runs: 10 × 9

10

Subgraph Discovery

min_sup = 4.0%, min_tfidf = 0.01

[Figure: time in ms (y-axis, 0 to 400×10^3) versus k (x-axis, 3 to 21) for FSG and SEM.]

11

Subgraph Discovery

min_sup = 4.0%, min_tfidf = 0.01

[Figure: time in ms (y-axis, 0 to 3000) versus k (x-axis, 3 to 21) for FSG and SEM.]

12

Subgraph Discovery

min_sup = 4.0%, min_tfidf = 0.01

[Figure: number of attempts (y-axis, 0 to 1250000) versus k (x-axis, 3 to 21) for FSG and SEM.]

k     Number of subgraphs    Time saved (%)    Attempts saved (%)
2     186                    99.83             98.98
3     246                    98.33             86.15
4     305                    98.57             86.38
5     323                    98.95             86.91
6     313                    98.96             85.64
7     279                    98.88             83.25
8     263                    98.67             78.91
9     292                    98.38             74.76
10    364                    98.58             74.75
11    470                    98.76             78.08
12    608                    99.04             81.84
13    785                    99.22             85.02
14    980                    99.38             87.63
15    1117                   99.48             89.48
16    1075                   99.53             90.26
17    804                    99.51             89.40
18    430                    99.34             85.22
19    141                    98.76             71.22
20    20                     96.15             9.19
21    1                      75.74             -574.47

Overall attempts saved = 89.52%

Overall time saved = 99.39%
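The negative entry at k = 21 is easier to read with the percentage written out. The sketch below assumes the "saved" columns compare SEM against FSG as the baseline, which the transcript does not state explicitly. The function name and the input numbers are hypothetical, since the raw timings and attempt counts are not part of the transcript.

```python
def percent_saved(baseline, candidate):
    """Percentage of the baseline cost avoided by the candidate method."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical inputs: the candidate needs 1% of the baseline time.
print(round(percent_saved(1000, 10), 2))

# Hypothetical inputs: the candidate makes more attempts than the baseline,
# so the saving turns negative.
print(round(percent_saved(100, 674.47), 2))
```

A value below zero therefore means the candidate did more work than the baseline at that k. An overall figure would presumably come from summed totals rather than from averaging the per-k percentages.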

13

Clustering

[Figure: average SC mesh plot for 10 clusters using different min_sup (6 to 20) and pf-ipf threshold (0.01 to 0.10) values; average SC ranges from 0.04 to 0.24.]

14

Clustering

[Figure: average SC contour graph for 10 clusters using different min_sup (4 to 20) and pf-ipf threshold (0.01 to 0.10) values; contour levels run from 0.08 to 0.20.]
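The clustering plots report an average SC per parameter setting. Assuming SC stands for the silhouette coefficient (the slides do not expand the abbreviation), a minimal sketch of that measure over a precomputed distance matrix looks like this. The function name and the toy points are illustrative only.

```python
def average_silhouette(dist, labels):
    """Mean silhouette over all points, given a full distance matrix."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Four toy points on a line, with absolute difference as the distance.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = average_silhouette(dist, [0, 0, 1, 1])
bad = average_silhouette(dist, [0, 1, 0, 1])
print(round(good, 4), round(bad, 4))
```

Values near 1 indicate tight, well-separated clusters, so a parameter sweep like the one plotted would look for the region where this average peaks.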

15

Nearest Neighbors

Cover Tree and Brute-force method

Each bar indicates the time taken by 100 executions of the NN search for one pathway.

[Figure: time in ms (y-axis, 0 to 16000) for each sample pathway (x-axis, 0 to 25), comparing Cover Tree and Brute-force.]
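The brute-force baseline in this comparison is a linear scan: compute the distance from the query pathway to every other pathway and keep the k smallest. The sketch below shows only that baseline; a cover tree is too long to reproduce here. It assumes each pathway is represented by the set of mined subgraph identifiers it contains and uses Jaccard distance. Both the representation and the distance are assumptions, as the slides do not name them.

```python
import heapq


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|, with two empty sets at distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def brute_force_knn(query, items, k, distance=jaccard_distance):
    """Return the k nearest (distance, name) pairs by scanning every item."""
    candidates = [
        (distance(items[query], features), name)
        for name, features in items.items()
        if name != query
    ]
    return heapq.nsmallest(k, candidates)


# Toy feature sets (illustrative only).
items = {
    "p1": {1, 2, 3},
    "p2": {1, 2, 4},
    "p3": {7, 8},
    "p4": {1, 2, 3, 4},
}

print(brute_force_knn("p1", items, 2))
```

Each query costs one distance evaluation per stored pathway, which is the cost an index structure such as a cover tree aims to reduce.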

16

Pathway Relations (StoryTelling)

Bidirectional Search

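[Diagram: bidirectional search between a start pathway S and a target pathway T, with intermediate pathways p1, p2, p3 and p7, p8, p9.]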
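A story between two pathways can be read as a chain of intermediate pathways in which each step moves to a near neighbor. The sketch below shows a generic bidirectional breadth-first search over such a neighbor graph, growing one frontier from S and one from T until they meet. It assumes a symmetric neighbor graph given as a dictionary. The pathway labels follow the slide, but the edges between them are invented for illustration, and the function is not the authors' implementation.

```python
from collections import deque


def bidirectional_story(neighbors, start, target):
    """Return one chain of pathways from start to target, or None."""
    if start == target:
        return [start]
    parents_s = {start: None}   # search tree grown from the start
    parents_t = {target: None}  # search tree grown from the target
    frontier_s, frontier_t = deque([start]), deque([target])

    def expand(frontier, parents, other_parents):
        # Expand one full level; return the meeting node if the trees touch.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt in parents:
                    continue
                parents[nxt] = node
                if nxt in other_parents:
                    return nxt
                frontier.append(nxt)
        return None

    while frontier_s and frontier_t:
        meet = expand(frontier_s, parents_s, parents_t)
        if meet is None:
            meet = expand(frontier_t, parents_t, parents_s)
        if meet is not None:
            # Walk back to the start, then forward to the target.
            story = []
            node = meet
            while node is not None:
                story.append(node)
                node = parents_s[node]
            story.reverse()
            node = parents_t[meet]
            while node is not None:
                story.append(node)
                node = parents_t[node]
            return story
    return None


# Invented symmetric neighbor graph using the slide's labels.
neighbors = {
    "S": ["p1", "p2"],
    "p1": ["S", "p3"],
    "p2": ["S"],
    "p3": ["p1", "p7"],
    "p7": ["p3", "T"],
    "T": ["p7", "p8"],
    "p8": ["T"],
}

print(bidirectional_story(neighbors, "S", "T"))
```

Limiting each node's neighbor list to its b nearest pathways gives the branching factor b, so a larger b makes more chains reachable at the cost of larger frontiers.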

17

Pathway Relations (StoryTelling)

[Figure: numbers of varying-length stories for different branching factors. Number of t-length stories (y-axis, 0 to 350) versus story length, t (x-axis, 3 to 16), for b = 2, 4, 6, 8.]

18

Pathway Relations (StoryTelling)

[Figure: numbers of varying-length stories for different branching factors. Number of t-length stories (y-axis, 0 to 350) versus story length, t (x-axis, 3 to 16), for b = 2, 3, 4, 5, 6, 7, 8, 9, 10.]

19

Pathway Relations (StoryTelling)

[Figure: total stories from all pairs (y-axis, 0 to 1000) versus branching factor, b (x-axis, 2 to 10).]

[Figure: time to generate all stories in ms (y-axis, 0 to 1.4×10^6) versus branching factor, b (x-axis, 2 to 10).]

[Figure: length of the longest story (y-axis, 4 to 16) versus branching factor, b (x-axis, 2 to 10).]

20

Questions ???