Data Analysis
Christophe Ambroise
Clustering
Introduction
A widespread confusion exists between the terms classification and clustering:
Classification presupposes the existence of classes to which certain objects are known to belong, while clustering tries to discover a class structure that is "natural" to the data.
In the pattern recognition literature, the distinction between the two approaches is often referred to by the terms "supervised" and "unsupervised" learning.
Objectives
Clustering can have different motivations:
compress information, to describe large masses of data in a simplified way,
structure a set of knowledge,
reveal structures, hidden causes,
make a diagnosis . . .
Similarity
The first step of any clustering is to define a measure of resemblance between objects (pattern vectors). Traditionally, two approaches are envisaged:
monothetic: two objects are said to be similar if they share a certain feature. This is the basis of the Aristotelian approach [@Sutcliffe1994]: all objects in the same class share a number of characteristics (e.g. "All men are mortal");
polythetic: the similarity can also be measured using a measure of proximity (distance, dissimilarity). In this case the notion of resemblance is fuzzier, and two objects of the same class will have "close" characteristics in the sense of the measure used.
Context of this lecture
polythetic classification
Classes or groups
A classification leads to the distribution of the set of pattern vectors into different homogeneous classes. The definition of a class and the relations between classes can be very varied. In this chapter we will focus on the two main classification structures:
partition,
hierarchy.
Partitions
Definition
$\Omega$ being a finite set, a set $P = (C_1, C_2, \ldots, C_K)$ of non-empty parts of $\Omega$ is a partition if:
1 $\forall i \neq j$, $C_i \cap C_j = \emptyset$,
2 $\bigcup_i C_i = \Omega$.
In a set $\Omega = (x_1, \ldots, x_N)$ partitioned into $K$ classes, each element of the set belongs to one and only one class. A practical way of describing this partition $P$ consists of using matrix notation. Let $C(P)$ be the characteristic matrix (or classification matrix) of the partition $P = (C_1, C_2, \ldots, C_K)$:
$$C(P) = C = \begin{pmatrix} c_{11} & \cdots & c_{1K} \\ \vdots & \ddots & \vdots \\ c_{N1} & \cdots & c_{NK} \end{pmatrix}$$
where $c_{ik} = 1$ iff $x_i \in C_k$, and $c_{ik} = 0$ otherwise.
Note that the sum of the $i$-th row is equal to 1 (an element belongs to a single class) and the sum of the values of the $k$-th column is $n_k$, the number of elements of the class $C_k$. We thus have $\sum_{k=1}^{K} n_k = N$.
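As a small illustration (our own sketch, not from the slides), the characteristic matrix can be built in R from a vector of class labels and the two properties checked:
labels <- c(1, 2, 2, 3, 1)                         # a toy partition of N = 5 objects into K = 3 classes
C <- 1 * outer(labels, sort(unique(labels)), "==") # c_ik = 1 iff x_i belongs to C_k
rowSums(C)  # all equal to 1: each element belongs to exactly one class
colSums(C)  # the class sizes n_1, ..., n_K, which sum to N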
Hard and Fuzzy partition
The notion of hard partition is based on the classical conception of a set. Following the work of [@Zadeh1965] on fuzzy sets, a definition of the concept of fuzzy partition seems "natural". Fuzzy classification, developed at the beginning of the 1970s [@Ruspini1969], generalizes the classical approach by broadening the concept of membership of a class.
Fuzzy sets
In the classical conception of a set, an individual $x_i$ either belongs or does not belong to a given set $C_k$. In the theory of fuzzy subsets, an individual can belong to several classes with different degrees of membership.
Fuzzy clustering
In classification, this amounts to allowing the pattern vectors to belong to all classes, which results in relaxing the binarity constraint on the membership coefficients $c_{ik}$. A fuzzy partition is defined by a fuzzy classification matrix $C = \{c_{ik}\}$ satisfying the following conditions:
1 $\forall k = 1..K$, $\forall x_i \in \Omega$, $c_{ik} \in [0, 1]$,
2 $\forall k = 1..K$, $0 < \sum_{i=1}^{N} c_{ik} < N$,
3 $\forall x_i \in \Omega$, $\sum_{k=1}^{K} c_{ik} = 1$.
The second condition reflects the fact that no class should be empty, and the third expresses the concept of total membership.
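As an illustration (a sketch of ours; the e1071 package and the choice of the iris petal variables are assumptions, not from the slides), the fuzzy c-means algorithm returns exactly such a matrix:
library(e1071)                    # provides cmeans()
fcm <- cmeans(iris[, 3:4], centers = 3, m = 2)
head(fcm$membership)              # the c_ik, all in [0, 1]
range(rowSums(fcm$membership))    # every row sums to 1: total membership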
Indexed Hierarchy
$\Omega$ is a finite set. A set $H$ of non-empty subsets of $\Omega$ is a hierarchy iff:
$\Omega \in H$,
$\forall x \in \Omega$, $\{x\} \in H$,
$\forall h, h' \in H$, $h \cap h' = \emptyset$ or $h' \subset h$ or $h \subset h'$.
Index
The index of a hierarchy is a function $i$ from $H$ to $\mathbb{R}^+$ having the following properties:
$h \subset h' \Rightarrow i(h) < i(h')$,
$\forall x \in \Omega$, $i(\{x\}) = 0$.
$(H, i)$ is then an indexed hierarchy.
Partition and Hierarchy
Each level of an indexed hierarchy is a partition.
$\{\Omega, P_1, P_2, \ldots, P_K, \{x_1\}, \ldots, \{x_n\}\}$ is a hierarchy (for nested partitions $P_1, \ldots, P_K$).
Clustering ideal
Ideal
Two objects of the same class should be closer than two objects of two different classes $\Rightarrow$ most of the time impossible to achieve.
Numerical approach: definition of a criterion (with or without a unique extremum).
The Bell numbers
But the number of possible partitions, even for a problem of reasonable size, is huge. Indeed, if we consider a set of $N$ objects to partition into $K$ classes, the number of possible partitions is:
$$NP(N, K) = \frac{1}{K!} \sum_{k=0}^{K} (-1)^{K-k} \binom{K}{k} \, k^N.$$
Figure 1: The 52 partitions of a set with 5 elements
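A quick numerical check of the formula (our own sketch): summing $NP(5, K)$ over $K = 1, \ldots, 5$ should recover the 52 partitions of Figure 1.
np <- function(N, K) {
  k <- 0:K
  sum((-1)^(K - k) * choose(K, k) * k^N) / factorial(K)
}
sum(sapply(1:5, function(K) np(5, K)))  # 52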
The Bell number
The Bell number recurrence
Show that the number of partitions of $n$ objects satisfies
$$B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k.$$
Compute the number of partitions of 5 objects.
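A possible solution to the second question (a sketch of ours) implements the recurrence directly:
bell <- function(n) {
  B <- 1                                # B[1] stores B_0 = 1
  for (m in 0:(n - 1)) {
    B <- c(B, sum(choose(m, 0:m) * B))  # B_{m+1} = sum_k C(m, k) B_k
  }
  B[n + 1]
}
bell(5)  # 52, in agreement with Figure 1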
The Bell number
Fix an element $x$ in a set with $n + 1$ elements.
Sort the partitions according to the number $k$ of elements outside the part containing $x$.
For each value of $k$ from 0 to $n$, it is necessary to choose $k$ elements among the $n$ elements different from $x$, then to give a partition of these $k$ elements.
Criteria and algorithms I
The concepts of partition and polythetic classification being specified, the following question emerges: how to find an optimal partition of a dataset when the resemblance between two individuals is assessed by a measure of proximity?
The first thing to do is to formally clarify the meaning of the word optimal.
The solution generally adopted is to choose a numerical measure of the quality of a partition.
This measure is sometimes called a criterion, a functional, or an energy function. The purpose of a classification procedure is thus to find the partition or partitions that give the best value (the smallest or the largest) of a given criterion.
Rather than looking for the best partition, the one that gives the optimal value of the criterion, most methods converge towards "local" optima of the criterion. The partitions thus found are often satisfactory.
Intra-class inertia and partition I
Many criteria exist [@Gordon1980]. Some may be related, as we will see in what follows, to the choice of a model for the data. One of the most used functions is the sum of intra-class variances:
$$I_W = \sum_{k=1}^{K} \sum_{i=1}^{N} c_{ik} \, \|x_i - \mu_k\|^2 = \mathrm{trace}(S_W)$$
where the $\mu_k$ are the prototypes (centers) of the classes and the $c_{ik}$ are the elements of a hard partition matrix. The problem is then an optimization problem under constraints (related to the $c_{ik}$):
$$(c, \mu) = \arg\min_{(c, \mu)} I_W(c, \mu) \qquad (1)$$
where $\mu$ represents the set of centers of gravity.
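A direct computation of $I_W$ (our own sketch; the choice of the two petal variables is ours) can be checked against the value reported by kmeans():
X <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
res <- kmeans(X, 3, nstart = 10)
IW <- sum(sapply(1:3, function(k) {
  Xk <- X[res$cluster == k, , drop = FALSE]
  sum(sweep(Xk, 2, colMeans(Xk))^2)   # sum over class k of ||x_i - mu_k||^2
}))
all.equal(IW, res$tot.withinss)       # TRUE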
K-means algorithm I
A common algorithm for solving this problem is k-means. Historically, this algorithm dates from the sixties. It was proposed by several researchers in different areas at close dates [@Edwards1965; @Lloyd1957]. This algorithm, based on geometric considerations, certainly owes its success to its simplicity and efficiency:
Algorithm
1 Initialization of the centers: a common method is to initialize the centers with the coordinates of $K$ randomly selected points.
2 Then the iterations have the following alternating form:
1 given $\mu_1, \cdots, \mu_K$, choose the $c_{ik}$ minimizing $I_W$,
2 given $C = \{c_{ik}\}$, minimize $I_W$ with respect to $\mu_1, \cdots, \mu_K$.
The first step
assigns each $x_i$ to the nearest prototype,
the second step
recalculates the position of the prototypes: the prototype of class $k$ becomes its mean vector. A from-scratch sketch is given below.
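A minimal from-scratch version of these two alternating steps (a sketch of ours, with no handling of empty classes):
my_kmeans <- function(X, K, iter.max = 100) {
  X <- as.matrix(X)
  mu <- X[sample(nrow(X), K), , drop = FALSE]  # step 1: K random points as initial centers
  cluster <- integer(nrow(X))
  for (it in seq_len(iter.max)) {
    # assignment step: choose the c_ik minimizing I_W (nearest prototype)
    d2 <- sapply(seq_len(K), function(k) colSums((t(X) - mu[k, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break       # assignments unchanged: converged
    cluster <- new_cluster
    # update step: each prototype becomes the mean vector of its class
    mu <- t(sapply(seq_len(K), function(k) colMeans(X[cluster == k, , drop = FALSE])))
  }
  list(cluster = cluster, centers = mu)
}
my_kmeans(iris[, 3:4], 3)$centers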
K-means algorithm II
Convergence of the k-means algorithm
It is possible to show that each iteration decreases the criterion, but in general there is no guarantee of convergence towards a global optimum.
Link with fuzzy clustering
If the k-means criterion is considered from the point of view of the search for a fuzzy partition, that is, if the constraints on the $c_{ik}$ are relaxed and become $c_{ik} \in [0, 1]$ instead of $c_{ik} \in \{0, 1\}$, the optimal partition in the sense of the new criterion is the optimal one for the classical criterion [@Selim1984]. In other words, there is no point in considering fuzzy partitions when working with the k-means criterion.
K-means algorithm variants
Nuées dynamiques (dynamic clusters)
This form of alternating algorithm, where a criterion is optimized alternately with respect to the class membership variables and then with respect to the parameters defining these classes, has been extensively exploited. Let us mention among others the dynamic clouds of Diday [@Diday1971] and the fuzzy c-means algorithm [@Bezdeck1974].
Note that Walter Fisher [@Fisher1958] (not to be confused with Ronald Fisher) had proposed an algorithm finding the optimal partition, in the sense of intra-class variance, of a set of $N$ one-dimensional data points in $O(N \cdot K^2)$ operations, using dynamic programming.
In R, the function kmeans {stats} I
effectively implements the algorithm.
library(dplyr)    # for %>% and select()
library(ggplot2)
data(iris)
kmeans.res <- iris %>%
  select(-Species, -Sepal.Length, -Sepal.Width) %>%
  kmeans(3, nstart = 10)
cluster <- as.factor(kmeans.res$cluster)
centers <- as_tibble(kmeans.res$centers)  # as_tibble replaces the deprecated as.tibble
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
  geom_point() +
  geom_point(data = centers, color = "coral", size = 4, pch = 21) +
  geom_point(data = centers, color = "coral", size = 50, alpha = 0.2)
In R, the function kmeans {stats} II
Figure 2: Automatic classification into three classes of iris by the k-means algorithm
In R, the function kmeans {stats} III
knitr::kable(table(iris$Species, kmeans.res$cluster),
             caption = "Contingency table crossing the true classes and the structure estimated by k-means")
Table 1: Contingency table crossing the true classes and the structure estimated by k-means
             1    2    3
setosa       0   50    0
versicolor   2    0   48
virginica   46    0    4
Voronoi tiling I
By definition, k-means partitions the space according to the Voronoi tiling defined by the centers.
library(deldir)
# This creates the Voronoi line segments
voronoi <- deldir(centers$Petal.Length, centers$Petal.Width)
Voronoi tiling II
# Now we can make a plot
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
  geom_point() +
  # Plot the Voronoi lines
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2),
               size = 2,
               data = voronoi$dirsgs,
               linetype = 1,
               color = "coral") +
  # Plot the centers
  geom_point(data = centers, fill = "coral", pch = 21, size = 4, color = "#333333")
Voronoi tiling III
Figure 3: Automatic classification of the irises into three classes by the k-means algorithm, with the Voronoi tiling
Spectral Clustering
Hierarchical Clustering
There are two main approaches:
agglomerative,
divisive.
Hierarchical Agglomerative Clustering
The principle of Hierarchical Agglomerative Clustering is fairly simple:
1 Initialisation: each element of $\Omega$ constitutes a class. A dissimilarity $D$ is computed between all classes.
2 While the number of clusters is > 1:
1 group the two closest classes in the sense of the dissimilarity $D$,
2 compute the "distances" between the new class and the others.
A naive sketch of this loop is given below.
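The naive sketch announced above (our own illustration, single link only, cubic in $N$):
naive_hac <- function(D, K = 1) {
  D <- as.matrix(D)
  diag(D) <- Inf
  cl <- seq_len(nrow(D))                          # initialisation: one class per element
  while (length(unique(cl)) > K) {
    ij <- which(D == min(D), arr.ind = TRUE)[1, ] # the two closest classes
    cl[cl == cl[ij[2]]] <- cl[ij[1]]              # group them
    # single link update: the distance to the new class is the minimum of the two
    newd <- pmin(D[ij[1], ], D[ij[2], ])
    D[ij[1], ] <- newd; D[, ij[1]] <- newd
    D[ij[2], ] <- Inf;  D[, ij[2]] <- Inf
    D[ij[1], ij[1]] <- Inf
  }
  cl
}
table(naive_hac(dist(iris[, 3:4]), K = 3))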
Cluster dissimilarity
The dissimilarity $D$ between two parts $h$ and $h'$ of $\Omega$ can be defined in many ways from a measure of dissimilarity $d$ on $\Omega$:
single link (critère du lien minimum):
$$D(h, h') = \min \{ d(x, y) : x \in h \text{ and } y \in h' \},$$
complete link (critère du lien maximum):
$$D(h, h') = \max \{ d(x, y) : x \in h \text{ and } y \in h' \},$$
group average:
$$D(h, h') = \frac{\sum_{i=1}^{n_h} \sum_{j=1}^{n_{h'}} d(x_i, x_j)}{n_h \cdot n_{h'}},$$
Ward criterion:
$$D(h, h') = \frac{n_h \cdot n_{h'}}{n_h + n_{h'}} \, \|m_h - m_{h'}\|^2.$$
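All four criteria are available in R (a usage sketch of ours) through stats::hclust; agnes, used below, offers them as well:
d <- dist(iris[, 3:4])
plot(hclust(d, method = "single"))    # single link
plot(hclust(d, method = "complete"))  # complete link
plot(hclust(d, method = "average"))   # group average
plot(hclust(d, method = "ward.D2"))   # Ward criterion, for Euclidean distances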
Intra-class criterion and Ward method I
When we have a partition into $K$ classes, the criterion of intra-class inertia measures its homogeneity:
$$I_W = \mathrm{trace}(S_W) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ik} - m_k)^t (x_{ik} - m_k).$$
Let us consider two partitions: $P = (P_1, \cdots, P_K)$, and $P'$, the partition obtained by merging the classes $C_k$ and $C_\ell$.
Intra-class criterion and Ward method II
We can show that the difference between the inertias of the two partitions is equal to Ward's aggregation criterion:
$$I_{W'} - I_W = \frac{n_k \cdot n_\ell}{n_k + n_\ell} \, \|m_k - m_\ell\|^2.$$
Thus, each stage of Ward's algorithm chooses a new partition that limits the increase of intra-class inertia. Note that this property does not guarantee global optimization of the criterion.
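A numerical check of this identity (our own sketch, on an arbitrary 3-class partition of the iris data):
X <- as.matrix(iris[, 1:4])
cl <- kmeans(X, 3, nstart = 10)$cluster
iw <- function(cl) sum(sapply(unique(cl), function(k) {
  Xk <- X[cl == k, , drop = FALSE]
  sum(sweep(Xk, 2, colMeans(Xk))^2)
}))
m <- function(k) colMeans(X[cl == k, ])
n <- table(cl)
ward <- n[1] * n[2] / (n[1] + n[2]) * sum((m(1) - m(2))^2)        # Ward cost of merging classes 1 and 2
all.equal(as.numeric(ward), iw(ifelse(cl == 2, 1, cl)) - iw(cl))  # TRUE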
Ultrametric distance
An ultrametric distance $\delta$ satisfies all the properties that define a classical distance, and in addition satisfies the inequality
$$\delta(x, y) \leq \max(\delta(x, z), \delta(z, y)),$$
which is stronger than the triangle inequality.
It is possible to interpret the minimum number of nestings required for two pattern vectors to belong to one class as a dissimilarity.
This dissimilarity is an ultrametric distance.
It is possible to understand the problem of hierarchical clustering as the search for an ultrametric $\delta$ close to $d$, the dissimilarity used on $\Omega$.
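In R, the ultrametric induced by a dendrogram is its cophenetic distance (a sketch of ours using stats::cophenetic):
d <- dist(iris[, 3:4])
hc <- hclust(d, method = "average")
delta <- cophenetic(hc)  # the ultrametric induced by the hierarchy
cor(d, delta)            # how close the ultrametric is to the original dissimilarity d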
Divisive hierarchical clustering
The divisive clustering approach is much less popular.
In theory, the first step of a top-down method must compare the $2^{N-1} - 1$ possible partitions of $N$ vectors into two classes.
To avoid these intractable calculations, one solution is to apply a partitioning method to obtain the two classes; repeating this process recursively on each class obtained yields fast algorithms, as in the sketch below.
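A sketch of this recursive strategy (our own, splitting the largest class in two with kmeans at each step):
bisecting <- function(X, K) {
  cl <- rep(1L, nrow(X))
  while (max(cl) < K) {
    big <- which.max(tabulate(cl))                     # split the largest class
    idx <- which(cl == big)
    sub <- kmeans(X[idx, , drop = FALSE], 2, nstart = 5)$cluster
    cl[idx[sub == 2]] <- max(cl) + 1L                  # relabel one of the two halves
  }
  cl
}
table(bisecting(iris[, 3:4], 3))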
Hierarchical agglomerative clustering with R I
agnes (Agglomerative Nesting)
Description: Computes agglomerative hierarchical clustering of the dataset.
Usage: agnes(x, diss = inherits(x, "dist"), metric = "euclidean", stand = FALSE, method = "average", keep.diss = n < 100, keep.data = !diss)
Example of use I
library(cluster)
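The examples below use a data set crabsquant2 whose construction is not shown in the slides. A plausible reconstruction (an assumption of ours, not from the source) keeps the five quantitative measurements of MASS::crabs and divides each crab by its total size, so that shape rather than overall size drives the clustering:
library(MASS)                                    # provides the crabs data
crabsquant <- crabs[, c("FL", "RW", "CL", "CW", "BD")]
crabsquant2 <- crabsquant / rowSums(crabsquant)  # hypothetical size normalization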
Example of use II
res <- agnes(crabsquant2, method = "ward")
plot(res, which.plots = 1)
Example of use III
[Banner plot of agnes(x = crabsquant2, method = "ward"); Agglomerative Coefficient = 0.98]
Example of use IV
plot(res,which.plots=2)
Example of use V
[Dendrogram of agnes(x = crabsquant2, method = "ward"); Agglomerative Coefficient = 0.98]
Example of use VI
plot(1:199,sort(res$height))
Example of use VII
[Plot of sort(res$height) against 1:199, the sorted merge heights of the agnes tree]
Divisive Hierarchical Clustering with R
diana (DIvisive ANAlysis Clustering)
Description: Computes a divisive hierarchical clustering of the dataset, returning an object of class 'diana'.
Usage: diana(x, diss = inherits(x, "dist"), metric = "euclidean", stand = FALSE, keep.diss = n < 100, keep.data = !diss)
Example of use I
res<-diana(crabsquant2)
plot(res,which.plots=1)
Example of use II
[Banner plot of diana(x = crabsquant2); Divisive Coefficient = 0.91]
Example of use III
plot(res,which.plots=2)
Example of use IV
[Dendrogram of diana(x = crabsquant2); Divisive Coefficient = 0.91]
Example of use V
plot(1:199,sort(res$height))
Example of use VI
[Plot of sort(res$height) against 1:199, the sorted split heights of the diana tree]