UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016...

46
UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University of Virginia Department of Computer Science 11/22/16 Dr. Yanjun Qi / UVA CS 6316 / f16 1

Transcript of UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016...

Page 1: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

UVACS6316/4501–Fall2016

MachineLearning

Lecture19:UnsupervisedClustering(I)

Dr.YanjunQi

UniversityofVirginia

DepartmentofComputerScience

11/22/16

Dr.YanjunQi/UVACS6316/f16

1

Page 2: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Wherearewe?èmajorsecJonsofthiscourse

q Regression(supervised)q ClassificaJon(supervised)

q FeatureselecJonq Unsupervisedmodels

q DimensionReducJon(PCA)q Clustering(K-means,GMM/EM,Hierarchical)

q Learningtheoryq Graphicalmodels

11/22/16

Dr.YanjunQi/UVACS6316/f16

2

Page 3: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

AnunlabeledDatasetX

•  Data/points/instances/examples/samples/records:[rows]•  Features/a0ributes/dimensions/independentvariables/covariates/predictors/regressors:[columns]

11/22/16

Dr.YanjunQi/UVACS6316/f16

a data matrix of n observations on p variables x1,x2,…xp

Unsupervisedlearning=learningfromraw(unlabeled,unannotated,etc)data,asopposedtosuperviseddatawhereaclassificaJonlabelofexamplesisgiven

3

Page 4: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

Today:Whatisclustering?

•  Arethereany“groups”?•  Whatiseachgroup?•  Howmany?•  HowtoidenJfythem?

4

Page 5: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

•  Find groups (clusters) of data points such that data points in a group will be similar (or related) to one another and different from (or unrelated to) the data points in other groups

Whatisclustering?

Inter-cluster distances are maximized

Intra-cluster distances are

minimized

11/22/16

Dr.YanjunQi/UVACS6316/f16

5

Page 6: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

Whatisclustering?•  Clustering:theprocessofgroupingasetofobjectsintoclassesofsimilarobjects–  highintra-classsimilarity–  lowinter-classsimilarity–  Itisthecommonestformofunsupervisedlearning

•  AcommonandimportanttaskthatfindsmanyapplicaJonsinScience,Engineering,informaJonScience,andotherplaces,e.g.

•  GroupgenesthatperformthesamefuncJon•  GroupindividualsthathassimilarpoliJcalview•  Categorizedocumentsofsimilartopics•  Idealitysimilarobjectsfrompictures

6

Page 7: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

ToyExamples•  People

•  Images

•  Language

•  species

7

Page 8: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

Issuesforclustering•  Whatisanaturalgroupingamongtheseobjects?

–  DefiniJonof"groupness"•  Whatmakesobjects“related”?

–  DefiniJonof"similarity/distance"•  RepresentaJonforobjects

–  Vectorspace?NormalizaJon?•  Howmanyclusters?

–  Fixedapriori?–  Completelydatadriven?

•  Avoid“trivial”clusters-toolargeorsmall•  ClusteringAlgorithms

–  ParJJonalalgorithms–  Hierarchicalalgorithms

•  FormalfoundaJonandconvergence8

Page 9: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

TodayRoadmap:clustering

§  DefiniJonof"groupness”§  DefiniJonof"similarity/distance"§  RepresentaJonforobjects§  Howmanyclusters?§  ClusteringAlgorithms

§ ParJJonalalgorithms§ Hierarchicalalgorithms

§  FormalfoundaJonandconvergence9

Page 10: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

Whatisanaturalgroupingamongtheseobjects?

10

Page 11: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Anotherexample:clusteringissubjecJve

A

B

A

B

A

B

A

B A

B

A

B

TwopossibleSoluJons…

11/22/16 Dr.YanjunQi/UVACS6316/f16 11

Page 12: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

TodayRoadmap:clustering

§  DefiniJonof"groupness”§  DefiniJonof"similarity/distance"§  RepresentaJonforobjects§  Howmanyclusters?§  ClusteringAlgorithms

§ ParJJonalalgorithms§ Hierarchicalalgorithms

§  FormalfoundaJonandconvergence12

Page 13: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

WhatisSimilarity?

•  TherealmeaningofsimilarityisaphilosophicalquesJon.WewilltakeamorepragmaJcapproach

•  DependsonrepresentaJonandalgorithm.Formanyrep./alg.,easiertothinkintermsofadistance(ratherthansimilarity)betweenvectors.

Hardtodefine!Butweknowitwhenweseeit

13

Page 14: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

WhatproperJesshouldadistancemeasurehave?

•  D(A,B)=D(B,A) Symmetry

•  D(A,A)=0 ConstancyofSelf-Similarity

•  D(A,B)=0IIfA=B Posi=vitySepara=on

•  D(A,B)<=D(A,C)+D(B,C) TriangularInequality

14

Page 15: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

•  D(A,B)=D(B,A) Symmetry–  Otherwiseyoucouldclaim"AlexlookslikeBob,butBoblooksnothing

likeAlex"

•  D(A,A)=0 ConstancyofSelf-Similarity–  Otherwiseyoucouldclaim"AlexlooksmorelikeBob,thanBobdoes"

•  D(A,B)=0IIfA=B Posi=vitySepara=on–  Otherwisethereareobjectsinyourworldthataredifferent,butyou

cannottellapart.

•  D(A,B)<=D(A,C)+D(B,C) TriangularInequality–  Otherwiseyoucouldclaim"AlexisverylikeBob,andAlexisverylike

Carl,butBobisveryunlikeCarl"

IntuiJonsbehinddesirableproperJesofdistancemeasure

15

Page 16: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

DistanceMeasures:MinkowskiMetric

•  Supposetwoobjectxandybothhavepfeatures

•  TheMinkowskimetricisdefinedby•  MostCommonMinkowskiMetrics

!!d(x , y)= |xi− yi

i=1

p

∑ |rr

!!

x = (x1 ,x2 ,!,xp)y = ( y1 , y2 ,!, yp)

1,r =2(Euclideandistance)d(x , y)= |xi− yii=1

p

∑ |22

2,r =1(Manhattandistance)d(x , y)= |xi− yii=1

p

∑ |

3,r = +∞("sup"distance)d(x , y)=max1≤i≤p

|xi− yi |16

Page 17: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

.},{max :distance sup"" :3. :distanceManhattan :2

. :distanceEuclidean :1

434734

5342 22

==+

=+

AnExample

4

3

x

y

17

Page 18: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

11011111100001110100111001001001101716151413121110987654321

GeneBGeneA

. :Distance Hamming 5141001 =+=+ )#()#(

•  ManhanandistanceiscalledHammingdistancewhenallfeaturesarebinary.

–  E.g.,GeneExpressionLevelsUnder17CondiJons(1-High,0-Low)

Hammingdistance:binaryfeatures

!!d(x , y)= |xi− yi

i=1

p

∑ |

18

Page 19: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

SimilarityMeasures:CorrelaJonCoefficient

Time

Gene A

Gene B

Gene A Time

Gene B

Expression Level Expression Level

Expression Level

Time

Gene A Gene B

19

CorrelaJonisunitindependent;IfyouscaleoneoftheobjectstenJmes,youwillgetdifferenteuclideandistancesandsamecorrelaJondistances.

Page 20: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

•  PearsoncorrelaJoncoefficient

•  Specialcase:cosinedistance11/22/16

Dr.YanjunQi/UVACS6316/f16

. and where

)()(

))((),(

∑∑

∑ ∑

==

= =

=

==

−×−

−−=

p

iip

p

iip

p

i

p

iii

p

iii

yyxx

yyxx

yyxxyxs

1

1

1

1

1 1

22

1

1≤),( yxs

SimilarityMeasures:CorrelaJonCoefficient

yxyxyxs !!!!

⋅⋅=),(

•  MeasuringthelinearcorrelaMonbetweentwosequences,xandy,

•  givingavaluebetween+1and−1inclusive,where1istotalposiJvecorrelaMon,0isnocorrelaMon,and−1istotalnegaJvecorrelaMon.

CorrelaJonisunitindependent

20

Page 21: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

EditDistance:Agenerictechniqueformeasuringsimilarity

•  Tomeasurethesimilaritybetweentwoobjects,transformoneoftheobjectsintotheother,andmeasurehowmucheffortittook.Themeasureofeffortbecomesthedistancemeasure.

ThedistancebetweenPanyandSelma.

Changedresscolor,1pointChangeearringshape,1pointChangehairpart,1point

D(Pany,Selma)=3

ThedistancebetweenMargeandSelma.

Changedresscolor,1pointAddearrings,1pointDecreaseheight,1pointTakeupsmoking,1pointLoseweight,1point

D(Marge,Selma)=5

ThisiscalledtheEditdistanceortheTransformaJondistance21

SelmaPanyMarge

Page 22: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

TodayRoadmap:clustering

§  DefiniJonof"groupness”§  DefiniJonof"similarity/distance"§  RepresentaJonforobjects§  Howmanyclusters?§  ClusteringAlgorithms

§ ParJJonalalgorithms§ Hierarchicalalgorithms

§  FormalfoundaJonandconvergence22

Page 23: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

ClusteringAlgorithms

•  ParJJonalalgorithms– Usuallystartwitharandom(parJal)parJJoning

–  RefineititeraJvely•  Kmeansclustering•  Mixture-Modelbasedclustering

•  Hierarchicalalgorithms–  Bonom-up,agglomeraJve–  Top-down,divisive

23

Page 24: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

TodayRoadmap:clustering

§  DefiniJonof"groupness”§  DefiniJonof"similarity/distance"§  RepresentaJonforobjects§  Howmanyclusters?§  ClusteringAlgorithms

§ ParJJonalalgorithms§ Hierarchicalalgorithms

§  FormalfoundaJonandconvergence24

Page 25: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

HierarchicalClustering•  Buildatree-basedhierarchicaltaxonomy(dendrogram)fromasetofobjects,e.g.organisms,documents.

•  NotethathierarchiesarecommonlyusedtoorganizeinformaJon,forexampleinawebportal.–  Yahoo!hierarchyismanuallycreated,wewillfocusonautomaJccreaJonofhierarchiesindatamining.

Withbackbone Withoutbackbone

25

Page 26: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

(How-to) Hierarchical Clustering The number of dendrograms with n leafs

= (2n -3)!/[(2(n -2)) (n -2)!]

Number Number of Possibleof Leafs Dendrograms 2 13 34 155 105... …10  34,459,425

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Clustering:theprocessofgroupingasetofobjectsintoclassesofsimilarobjectsè

highintra-classsimilaritylowinter-classsimilarity

Agreedylocal

opJmalsoluJon

11/22/16

Dr.YanjunQi/UVACS6316/f16

26

Page 27: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

0 8 8 7 7

0 2 4 4

0 3 3

0 1

0

D( , ) = 8 D( , ) = 1

We begin with a distance matrix which contains the distances between every pair of objects in our database.

11/22/16

Dr.YanjunQi/UVACS6316/f16

27

Page 28: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

… Consider all possible merges…

Choose the best

11/22/16

Dr.YanjunQi/UVACS6316/f16

28

Page 29: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

… Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

11/22/16

Dr.YanjunQi/UVACS6316/f16

29

Page 30: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

… Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Consider all possible merges…

Choose the best …

11/22/16

Dr.YanjunQi/UVACS6316/f16

30

Page 31: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

… Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Consider all possible merges…

Choose the best … But how do we compute distances

between clusters rather than objects?

11/22/16

Dr.YanjunQi/UVACS6316/f16

31

Page 32: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

Howtodecidethedistancesbetweenclusters?

•  Single-Link

– NearestNeighbor:theirclosestmembers.

•  Complete-Link– FurthestNeighbor:theirfurthestmembers.

•  Average:– averageofallcross-clusterpairs.

32

Page 33: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Computing distance between clusters: Single Link

•  cluster distance = distance of two closest members in each class

- Potentially long and skinny clusters

11/22/16

Dr.YanjunQi/UVACS6316/f16

33

Page 34: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Computing distance between clusters: : Complete Link

•  cluster distance = distance of two farthest members

+ tight clusters

11/22/16

Dr.YanjunQi/UVACS6316/f16

34

Page 35: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Computing distance between clusters: Average Link

•  cluster distance = average distance of all pairs

the most widely used measure

Robust against noise

11/22/16

Dr.YanjunQi/UVACS6316/f16

35

Page 36: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

12 3 4

5

11/22/16

Dr.YanjunQi/UVACS6316/f16

36

Page 37: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

12 3 4

5

11/22/16

Dr.YanjunQi/UVACS6316/f16

37

Page 38: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

⎥⎥⎥⎥

⎢⎢⎢⎢

0458079

030

543)2,1(

543)2,1(

12 3 4

5

8}8,9min{},min{9}9,10min{},min{3}3,6min{},min{

5,25,15),2,1(

4,24,14),2,1(

3,23,13),2,1(

======

===

ddddddddd

11/22/16

Dr.YanjunQi/UVACS6316/f16

38

Page 39: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Example: single link

⎥⎥⎥

⎢⎢⎢

04507

0

54)3,2,1(

54)3,2,1(

12 3 4

5

⎥⎥⎥⎥

⎢⎢⎢⎢

0458079

030

543)2,1(

543)2,1(

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

5}5,8min{},min{7}7,9min{},min{

5,35),2,1(5),3,2,1(

4,34),2,1(4),3,2,1(

======

dddddd

11/22/16

Dr.YanjunQi/UVACS6316/f16

39

Page 40: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Example: single link

⎥⎥⎥

⎢⎢⎢

04507

0

54)3,2,1(

54)3,2,1(

12 3 4

5

⎥⎥⎥⎥

⎢⎢⎢⎢

0458079

030

543)2,1(

543)2,1(

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

5},min{ 5),3,2,1(4),3,2,1()5,4(),3,2,1( == ddd

11/22/16

Dr.YanjunQi/UVACS6316/f16

40

Page 41: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

29 2 6 11 9 17 10 13 24 25 26 20 22 30 27 1 3 8 4 12 5 14 23 15 16 18 19 21 28 7

1

2

3

4

5

6

7

Average linkage

Single linkage

Height represents distance between objects / clusters

ParJJonsbycutngthedendrogramatadesiredlevel:eachconnectedcomponentformsacluster.

11/22/16

Dr.YanjunQi/UVACS6316/f16

41

Page 42: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

HierarchicalClustering•  Bonom-UpAgglomeraJveClustering

–  Startswitheachobjectinaseparatecluster–  thenrepeatedlyjoinstheclosestpairofclusters,–  unJlthereisonlyonecluster.

Thehistoryofmergingformsabinarytreeorhierarchy(dendrogram)

•  Top-Downdivisive–  StarJngwithallthedatainasinglecluster,–  Considereverypossiblewaytodividetheclusterintotwo.Choosethebestdivision

–  Andrecursivelyoperateonbothsides.

42

Page 43: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

11/22/16

Dr.YanjunQi/UVACS6316/f16

ComputaJonalComplexity

•  InthefirstiteraJon,allHACmethodsneedtocomputesimilarityofallpairsofnindividualinstanceswhichisO(n2p).

•  Ineachofthesubsequentn−2mergingiteraJons,computethedistancebetweenthemostrecentlycreatedclusterandallotherexisJngclusters.

•  Forthesubsequentsteps,inordertomaintainanoverallO(n2)performance,compuJngsimilaritytoeachotherclustermustbedoneinconstantJme.ElseO(n2logn)orO(n3)ifdonenaively

43

Page 44: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

SummaryofHierarchalClusteringMethods

•  Noneedtospecifythenumberofclustersinadvance.

•  HierarchicalstructuremapsnicelyontohumanintuiJonforsomedomains

•  Theydonotscalewell:JmecomplexityofatleastO(n2),wherenisthenumberoftotalobjects.

•  LikeanyheurisJcsearchalgorithms,localopJmaareaproblem.

•  InterpretaJonofresultsis(very)subjecJve.11/22/16

Dr.YanjunQi/UVACS6316/f16

44

Page 45: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

Hierarchical Clustering

Clustering

n/a

No clearly defined loss

greedy bottom-up (or top-down)

Dendrogram (tree)

Task

Representation

Score Function

Search/Optimization

Models, Parameters

11/22/16 45

Dr.YanjunQi/UVACS6316/f16

Page 46: UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19 ... · UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University

References

q HasJe,Trevor,etal.Theelementsofsta=s=callearning.Vol.2.No.1.NewYork:Springer,2009.

q BigthankstoProf.EricXing@CMUforallowingmetoreusesomeofhisslides

q BigthankstoProf.ZivBar-Joseph@CMUforallowingmetoreusesomeofhisslides

11/22/16

Dr.YanjunQi/UVACS6316/f16

46