Clustering - people.dsv.su.se/~panagiotis/DAMI2014/lect3.pdf


Transcript of Clustering - people.dsv.su.se/~panagiotis/DAMI2014/lect3.pdf

Page 1:

Clustering

Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University

Page 2:

TODAY…

Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

Page 3:

What is clustering?

•  A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: two groups of points; intra-cluster distances are minimized, inter-cluster distances are maximized]

Page 4:

Outliers

•  Outliers are objects that do not belong to any cluster or form clusters of very small cardinality

•  In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster of points with a few outliers away from it]

Page 5:

Applications of clustering?

•  Image Processing: cluster images based on their visual content
•  Web
   –  Cluster groups of users based on their access patterns on webpages
   –  Cluster webpages based on their content
•  Bioinformatics
   –  Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.)
•  Many more…

Page 6:

The clustering task

•  Group observations into groups so that the observations belonging to the same group are similar, whereas observations in different groups are different

•  Basic questions:
   –  What does "similar" mean?
   –  What is a good partition of the objects? I.e., how is the quality of a solution measured?
   –  How do we find a good partition of the observations?

Page 7:

Observations to cluster

•  Usually data objects consist of a set of attributes (also known as dimensions)

•  Example: J. Smith, 20, 200K

•  If all d dimensions are real-valued then we can visualize the data as points in a d-dimensional space

•  If all d dimensions are binary then we can think of each data point as a binary vector

Page 8:

Distance functions for binary vectors

       Q1  Q2  Q3  Q4  Q5  Q6
  X    1   0   0   1   1   0
  Y    0   1   1   0   1   0

•  Jaccard similarity between binary vectors X and Y:

   JSim(X, Y) = |X ∩ Y| / |X ∪ Y|

•  Jaccard distance between binary vectors X and Y:

   Jdist(X, Y) = 1 − JSim(X, Y)

•  Example (X and Y above; only Q5 is 1 in both, and five attributes are 1 in at least one):
   –  JSim = 1/5
   –  Jdist = 4/5
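A minimal Python sketch of these two measures on the vectors above (the function name and layout are illustrative, not from the slides):

```python
def jaccard_similarity(x, y):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # |X intersect Y|
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)  # |X union Y|
    return both / either if either else 0.0

X = [1, 0, 0, 1, 1, 0]
Y = [0, 1, 1, 0, 1, 0]
sim = jaccard_similarity(X, Y)
print(sim, 1 - sim)  # JSim = 1/5 = 0.2, Jdist = 4/5 = 0.8
```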

Page 9:

Distance functions for real-valued vectors

•  Lp norms or Minkowski distance:

Page 10:

Distance functions for real-valued vectors

•  Lp norms or Minkowski distance:

   Lp(x, y) = ( Σ_{i=1..d} |xi − yi|^p )^(1/p)

   where p is a positive integer

Page 11:

Distance functions for real-valued vectors

•  Lp norms or Minkowski distance:

   Lp(x, y) = ( Σ_{i=1..d} |xi − yi|^p )^(1/p)

   where p is a positive integer

•  If p = 1, L1 is the Manhattan (or city block) distance:

   L1(x, y) = Σ_{i=1..d} |xi − yi|

•  If p = 2, L2 is the Euclidean distance:

   L2(x, y) = √( |x1 − y1|² + |x2 − y2|² + … + |xd − yd|² )
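A small Python sketch of the Lp distance (the function and the test vectors are illustrative assumptions):

```python
def minkowski(x, y, p):
    """L_p (Minkowski) distance between two real-valued vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski(x, y, 1))  # Manhattan: |0-4| + |3-0| = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(16 + 9) = 5.0
```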

Page 12:

Data Structures

•  Data matrix (n tuples/objects × d attributes/dimensions):

   [ x11  …  x1d ]
   [  …   …   …  ]
   [ xi1  …  xid ]
   [  …   …   …  ]
   [ xn1  …  xnd ]

•  Distance matrix (objects × objects, lower triangular):

   [ 0                         ]
   [ d(2,1)  0                 ]
   [ d(3,1)  d(3,2)  0         ]
   [   …       …     …         ]
   [ d(n,1)  d(n,2)  …   0     ]
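For concreteness, a short sketch of how the two structures relate, using SciPy's pdist/squareform (the toy data matrix is an assumption for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n objects (rows) x d attributes (columns).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [5.0, 1.0]])

# Full symmetric n x n distance matrix with zeros on the diagonal;
# only the lower (or upper) triangle carries information.
D = squareform(pdist(X, metric="euclidean"))
print(D)
```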

Page 13:

Partitioning algorithms

•  Basic concept:
   –  Construct a partition of a set of n objects into a set of k clusters
   –  Each object belongs to exactly one cluster
   –  The number of clusters k is given in advance

Page 14:

The k-means problem

•  Given a set X of n points in a d-dimensional space and an integer k

•  Task: choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters C = {C1, C2, …, Ck} such that

   Cost(C) = Σ_{i=1..k} Σ_{x∈Ci} L2²(x, ci) = Σ_{i=1..k} Σ_{x∈Ci} (x − ci)²

   is minimized

•  Some special cases: k = 1, k = n

Page 15:

Algorithmic properties of the k-means problem

•  NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)

•  Finding the best solution in polynomial time is infeasible

•  For d = 1 the problem is solvable in polynomial time

•  A simple iterative algorithm works quite well in practice

Page 16:

The k-means algorithm

•  Randomly pick k cluster centers {c1, …, ck}

•  For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i

•  For each i, let ci be the center of cluster Ci (mean of the vectors in Ci)

•  Repeat until convergence
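A compact Python sketch of this iteration (Lloyd's algorithm); the random initialization and the empty-cluster handling below are simplifying assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign points to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```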

Page 17:

K-means Clustering: Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: points on a 5 × 5 grid with cluster centers k1, k2, k3]

Page 18:

K-means Clustering: Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: points on a 5 × 5 grid with cluster centers k1, k2, k3]

Page 19:

K-means Clustering: Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: points on a 5 × 5 grid with cluster centers k1, k2, k3]

Page 20:

K-means Clustering: Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: points on a 5 × 5 grid with cluster centers k1, k2, k3]

Page 21:

Properties of the k-means algorithm

•  Finds a local optimum

•  Often converges quickly (but not always)

•  It is guaranteed to converge after at most k^N iterations (N is the number of objects)

•  The choice of initial points can have a large influence on the result

Page 22:

Discussion of the k-means algorithm

[Figure: four points x1, x2, x3, x4 to be clustered; K = 2]

Page 23:

Discussion of the k-means algorithm

[Figure: the four points x1, …, x4 with initial centers c1 and c2; K = 2]

Page 24:

Discussion of the k-means algorithm

[Figure: the four points x1, …, x4 with centers c1 and c2; K = 2]

K-means converges immediately!

Page 25:

Discussion of the k-means algorithm

[Figure: the four points x1, …, x4 with centers c1 and c2; K = 2]

K-means converges immediately!

Bad result :(

Page 26:

Discussion of the k-means algorithm

[Figure: the four points x1, …, x4 with centers c1 and c2; K = 2]

K-means converges immediately!

Even worse as the horizontal lines stretch

Page 27:

K-means++

•  Choose one center uniformly at random from X

•  For each data point x, compute d(x)
   –  the distance between x and the nearest center that has already been chosen

•  Choose "the farthest" data point as a new center
   –  using a weighted probability distribution where a point x is chosen with probability proportional to d(x)

•  Repeat Steps 2 and 3 until k centers have been chosen
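A sketch of the seeding step in Python. One caveat: the original k-means++ paper (Arthur and Vassilvitskii, 2007) weights points by d(x)², and the sketch below follows that convention:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each new center is sampled with probability
    proportional to the squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```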

Page 28:

K-means

•  Limitations?

•  What if the space is not known?

•  Example: images…

[Figure: points on a 5 × 5 grid with cluster centers k1, k2, k3]

Page 29:

Clustering images

Page 30:

Clustering images: scientist or not

What are the centroids?

Page 31:

The k-median problem

•  Given a set X of n points in a d-dimensional space and an integer k

•  Task: choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that

   Cost(C) = Σ_{i=1..k} Σ_{x∈Ci} L1(x, ci)

   is minimized

Page 32:

The k-medoids algorithm

•  Or … PAM (Partitioning Around Medoids, 1987)
   –  Choose randomly k medoids from the original dataset X
   –  Assign each of the n − k remaining points in X to their closest medoid
   –  Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost (a sketch follows below)
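A minimal sketch of these three steps on a precomputed distance matrix; the exhaustive swap search below is the naive variant, for illustration only:

```python
import numpy as np

def pam(D, k, seed=0):
    """Naive PAM on an n x n distance matrix D: greedy medoid swaps."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        # Every point contributes its distance to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                if cost(trial) < cost(medoids):  # keep the swap if it helps
                    medoids, improved = trial, True
    return medoids
```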

Page 33:

Discussion of the PAM algorithm

•  The algorithm is very similar to the k-means algorithm

•  Advantages and disadvantages?

Page 34:

K-means vs. K-medoids

•  K-medoids is more flexible:
   –  Can be used with any distance/similarity measure
   –  K-means works only with measures that are consistent with the mean (e.g., Euclidean distance)

•  K-medoids is more robust to outliers:
   –  Example: [1 2 3 4 100000]
   –  mean = 20002 vs. medoid = 3

•  K-medoids is more expensive:
   –  All pairwise distances are computed at each iteration

Page 35:

CLARA (Clustering Large Applications)

•  It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

•  Strength: deals with larger data sets than PAM

•  Weakness:
   –  Efficiency depends on the sample size
   –  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

Page 36:

What is the right number of clusters?

•  …or who sets the value of k?

•  For n points to be clustered, consider the case where k = n. What is the value of the error function?

•  What happens when k = 1?

•  Since we want to minimize the error, why don't we always select k = n?

•  X-means:
   –  based on the Bayesian Information Criterion (BIC)
   –  detects the optimal number of clusters

Page 37:

Hierarchical Clustering

•  Produces a set of nested clusters organized as a hierarchical tree

•  Can be visualized as a dendrogram
   –  A tree-like diagram that records the sequences of merges or splits

[Figure: six points and the corresponding dendrogram; merge heights range from 0 to 0.2]

Page 38:

Strengths of Hierarchical Clustering

•  No assumptions on the number of clusters
   –  Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

•  Hierarchical clusterings may correspond to meaningful taxonomies
   –  Examples on the web (e.g., product catalogs), etc.

Page 39:

Hierarchical Clustering

•  Two main types of hierarchical clustering
   –  Agglomerative:
      •  Start with the points as individual clusters
      •  At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
   –  Divisive:
      •  Start with one, all-inclusive cluster
      •  At each step, split a cluster until each cluster contains a point (or there are k clusters)

•  Traditional hierarchical algorithms use a similarity or distance matrix
   –  Merge or split one cluster at a time

Page 40:

Complexity of hierarchical clustering

•  The distance matrix is used for deciding which clusters to merge/split

•  At least quadratic in the number of data points

•  Not usable for large datasets

Page 41:

Agglomerative clustering algorithm

•  Most popular hierarchical clustering technique

•  Basic algorithm:
   1.  Compute the distance matrix between the input data points
   2.  Let each data point be a cluster
   3.  Repeat
   4.     Merge the two closest clusters
   5.     Update the distance matrix
   6.  Until only a single cluster remains

•  The key operation is the computation of the distance between two clusters (see the sketch below)
   –  Different definitions of the distance between clusters lead to different algorithms
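In practice, steps 1 to 6 are available off the shelf; a brief sketch with SciPy (the toy points are an assumption for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.0], [4.1, 3.9], [9.0, 0.5]])

# Agglomerative clustering; method can be 'single', 'complete',
# 'average' or 'ward', matching the cluster distances on later slides.
Z = linkage(pdist(X), method="single")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```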

Page 42:

Input / Initial setting

•  Start with clusters of individual points and a distance/proximity matrix

[Figure: points p1, …, p12 and their distance/proximity matrix]

Page 43:

Intermediate State

•  After some merging steps, we have some clusters

[Figure: clusters C1, …, C5 and the distance/proximity matrix between them]

Page 44:

Intermediate State

•  Merge the two closest clusters (C2 and C5) and update the distance matrix

[Figure: clusters C1, …, C5 with C2 and C5 about to be merged; distance/proximity matrix]

Page 45:

After Merging

•  "How do we update the distance matrix?"

[Figure: merged cluster C2 ∪ C5; its distances to C1, C3 and C4 in the matrix are marked "?"]

Page 46:

Distance between two clusters

•  Each cluster is a set of points

•  How do we define the distance between two sets of points?
   –  Lots of alternatives
   –  Not an easy task

Page 47:

Distance between two clusters

•  Single-link distance between clusters Ci and Cj: the minimum distance (or highest similarity) between any object in Ci and any object in Cj

•  The distance is defined by the two most similar objects

   D_sl(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }

Page 48:

Single-link clustering: example

•  Determined by one pair of points, i.e., by one link in the proximity graph, using the similarity matrix below:

        I1    I2    I3    I4    I5
   I1   1.00  0.90  0.10  0.65  0.20
   I2   0.90  1.00  0.70  0.60  0.50
   I3   0.10  0.70  1.00  0.40  0.30
   I4   0.65  0.60  0.40  1.00  0.80
   I5   0.20  0.50  0.30  0.80  1.00

[Figure: the resulting dendrogram over items 1 to 5]
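The same matrix can be used to compare the linkage criteria on the following slides. A sketch in Python (the cluster split {I1, I2} vs. {I3, I4, I5} is a hypothetical choice, and similarities are converted to distances as 1 − sim):

```python
import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = 1.0 - S  # turn similarities into distances

ci, cj = [0, 1], [2, 3, 4]   # clusters {I1, I2} and {I3, I4, I5}
block = D[np.ix_(ci, cj)]    # all pairwise distances between the two clusters
print(block.min())   # single link   (two most similar objects)
print(block.max())   # complete link (two most dissimilar objects)
print(block.mean())  # average link
```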

Page 49:

Strengths of single-link clustering

•  Can handle non-elliptical shapes

[Figure: original points and the two clusters found]

Page 50:

Limitations of single-link clustering

•  Sensitive to noise and outliers
•  It produces long, elongated clusters

[Figure: original points and the two clusters found]

Page 51:

Distance between two clusters

•  Complete-link distance between clusters Ci and Cj: the maximum distance (minimum similarity) between any object in Ci and any object in Cj

•  The distance is defined by the two most dissimilar objects

   D_cl(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }

Page 52:

Complete-link clustering: example

•  Distance between clusters is determined by the two most distant points in the different clusters (same similarity matrix as on page 48)

[Figure: the resulting dendrogram over items 1 to 5]

Page 53:

Strengths of complete-link clustering

•  More balanced clusters (with equal diameter)
•  Less susceptible to noise

[Figure: original points and the two clusters found]

Page 54:

Limitations of complete-link clustering

•  Tends to break large clusters
•  All clusters tend to have the same diameter; small clusters are merged with larger ones

[Figure: original points and the two clusters found]

Page 55:

Distance between two clusters

•  Average-link distance between clusters Ci and Cj is the average distance between any object in Ci and any object in Cj

   D_avg(Ci, Cj) = ( 1 / (|Ci| × |Cj|) ) Σ_{x∈Ci, y∈Cj} d(x, y)

Page 56:

Average-link clustering: example

•  Proximity of two clusters is the average of pairwise proximity between points in the two clusters (same similarity matrix as on page 48)

[Figure: the resulting dendrogram over items 1 to 5]

Page 57:

Average-link clustering: discussion

•  Compromise between single and complete link

•  Strengths:
   –  Less susceptible to noise and outliers

•  Limitations:
   –  Biased towards globular clusters

Page 58:

Distance between two clusters

•  Ward's distance between clusters Ci and Cj is the difference between:
   –  the total within-cluster sum of squares for the two clusters separately, and
   –  the within-cluster sum of squares resulting from merging the two clusters into cluster Cij

   D_w(Ci, Cj) = Σ_{x∈Ci} (x − ri)² + Σ_{x∈Cj} (x − rj)² − Σ_{x∈Cij} (x − rij)²

•  ri: centroid (mean) of Ci
•  rj: centroid (mean) of Cj
•  rij: centroid (mean) of Cij

Page 59:

Ward's distance for clusters

•  Similar to average-link

•  Less susceptible to noise and outliers

•  Biased towards globular clusters

•  Hierarchical analogue of k-means
   –  Can be used to initialize k-means

Page 60:

Hierarchical Clustering: Time and Space Requirements

•  For a dataset X consisting of n points:

•  O(n²) space; it requires storing the distance matrix

•  O(n³) time in most cases
   –  There are n steps, and at each step the size-n² distance matrix must be updated and searched
   –  Complexity can be reduced to O(n² log n) time by using appropriate data structures

Page 61:

How good is a clustering?

•  Cluster purity
   –  Each cluster is assigned to the class label that is most frequent in the cluster
   –  Purity is measured by counting the number of correctly assigned objects and dividing by the total number of objects

•  Example (three clusters whose majority classes have sizes 5, 4 and 3, out of 17 objects):

   Purity = (5 + 4 + 3) / 17 = 12 / 17
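A short sketch of the computation; the label lists below are hypothetical, chosen only to reproduce the slide's counts (majorities 5, 4 and 3 out of 17 objects):

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of true class labels per discovered cluster."""
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return correct / sum(len(c) for c in clusters)

clusters = [list("xxxxxoa"),  # 7 objects, majority class x (5)
            list("ooooxz"),   # 6 objects, majority class o (4)
            list("aaao")]     # 4 objects, majority class a (3)
print(purity(clusters))       # (5 + 4 + 3) / 17 = 12/17 ≈ 0.706
```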

Page 62:

How good is purity?

•  Cluster purity
   –  Simple!
   –  Intuitive

•  Trade-off between number of clusters and purity?
   –  If each object belongs to its own cluster, then purity is 1.0
   –  Is that desirable?

•  Purity cannot trade off the quality of clusters against the number of clusters

•  Alternatives:
   –  Normalized Mutual Information
   –  Rand Index

Page 63:

What if we don't have class labels?

•  a(i): average dissimilarity (distance) of i to all other data within the same cluster
•  b(i): the lowest average dissimilarity (distance) of i to any other cluster where i is not a member

•  The Silhouette of object i:

   s(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }

•  Values between −1 and 1
   –  if b(i) >> a(i) then s(i) ≈ 1
   –  if b(i) << a(i) then s(i) ≈ −1

•  Total Silhouette:
   –  the average silhouette value over all objects
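scikit-learn implements both the per-object silhouette and its average; a brief sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9], [9, 10]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])  # an assumed clustering of the six points

print(silhouette_samples(X, labels))  # s(i) for every object
print(silhouette_score(X, labels))    # total silhouette: average over all objects
```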

Page 64:

Correct number of clusters

•  The silhouette can be used to identify the correct number of clusters

Page 65:

Clustering aggregation

•  Many different clusterings for the same dataset!
   –  Different objective functions
   –  Different algorithms
   –  Different numbers of clusters

•  Which clustering is the best?
   –  Aggregation: we do not need to decide, but rather find a reconciliation between different outputs

Page 66:

The clustering-aggregation problem

•  Input
   –  n objects X = {x1, x2, …, xn}
   –  m clusterings of the objects {C1, …, Cm}
   –  clustering: a collection of disjoint sets that cover X

•  Output
   –  a single clustering C that is as close as possible to all input clusterings

•  How do we measure the closeness of clusterings?
   –  disagreement distance

Page 67:

Disagreement distance

•  For two partitions C and P, and objects x, y in X, define:

   I_{C,P}(x, y) = 1 if ( C(x) = C(y) and P(x) ≠ P(y) )
                     or ( C(x) ≠ C(y) and P(x) = P(y) ),
                   0 otherwise

•  If I_{C,P}(x, y) = 1 we say that x, y create a disagreement between clusterings C and P

•  The disagreement distance between C and P is:

   D(C, P) = Σ_{(x,y)} I_{C,P}(x, y)

•  Example:

   U    C   P
   x1   1   1
   x2   1   2
   x3   2   1
   x4   3   3
   x5   3   4

   D(C, P) = 3
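A direct Python sketch of D(C, P) that reproduces the table above (representing clusterings as dicts mapping object to cluster id is an illustrative choice):

```python
from itertools import combinations

def disagreement(C, P):
    """Count object pairs grouped together in one clustering but not the other."""
    return sum((C[x] == C[y]) != (P[x] == P[y])
               for x, y in combinations(C, 2))

C = {"x1": 1, "x2": 1, "x3": 2, "x4": 3, "x5": 3}
P = {"x1": 1, "x2": 2, "x3": 1, "x4": 3, "x5": 4}
print(disagreement(C, P))  # 3, as in the table
```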

Page 68:

Clustering aggregation

•  Given m clusterings C1, …, Cm, find C such that the aggregation cost

   D(C) = Σ_{i=1..m} D(C, Ci)

   is minimized

•  Example:

   U    C1  C2  C3  C
   x1   1   1   1   1
   x2   1   2   2   2
   x3   2   1   1   1
   x4   2   2   2   2
   x5   3   3   3   3
   x6   3   4   3   3

Page 69:

Why clustering aggregation?

•  Clustering categorical data

   U    City         Profession  Nationality
   x1   New York     Doctor      U.S.
   x2   New York     Teacher     Canada
   x3   Boston       Doctor      U.S.
   x4   Boston       Teacher     Canada
   x5   Los Angeles  Lawyer      Mexican
   x6   Los Angeles  Actor       Mexican

Page 70:

Why clustering aggregation?

•  Detect outliers
   –  outliers are defined as points for which there is no consensus among the clusterings

•  Improve the robustness of clustering algorithms
   –  different algorithms have different weaknesses
   –  combining them can produce a better result

•  Privacy-preserving clustering
   –  different companies have data for the same users
   –  compute an aggregate clustering without sharing the actual data

Page 71:

Complexity of Clustering Aggregation

•  The clustering-aggregation problem is NP-hard
   –  the median partition problem [Barthelemy and LeClerc 1995]

•  Look for heuristics and approximate solutions:

   ALG(I) ≤ c · OPT(I)

Page 72:

A simple 2-approximation algorithm

•  The disagreement distance D(C, P) is a metric
   –  What does that mean? We will revisit this in a few lectures…

•  The algorithm BEST:
   –  select among the input clusterings the clustering C* that minimizes D(C*)
   –  a 2-approximate solution

Page 73:

TODOs

•  Assignment 1:
   –  Due next week [Nov. 17 at 11:00am]
   –  Discussion forum on this assignment now available on iLearn2

•  Exercise Session 1:
   –  Nov. 17 at 11:00am