Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica...
Transcript of Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica...
![Page 1: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/1.jpg)
A robust clustering approach to fraud detection
Luis Angel Garcıa-Escudero
Dpto. de Estadıstica e I.O. and IMUVA - Universidad de Valladolid
joint (and on-going...) work with A. Mayo-Iscar, A. Gordaliza, C. Matran(U. Valladolid) and colleagues from M. Riani, A. Cerioli (U. Parma) and D.Perrotta, F. Torti (JRC-Ispra)
Conference on Benford’s Law for fraud detection. Stresa, July 2019 1
![Page 2: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/2.jpg)
1. CLUSTERING AND ROBUSTNESS
• Clustering is the task of grouping a set of objects in such a way thatobjects in the same cluster are more similar to each other than to thosein other clusters:
![Page 3: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/3.jpg)
• Sample mean:
� m = 1n
∑ni=1 xi minimizes
∑ni=1 ‖xi −m‖2
� m may be seen as the “center” of a data-cloud:
0 1 2 3 4
01
23
4
x1
y2 X
• k clusters ⇒ k “data-clouds” ⇒ k-means
![Page 4: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/4.jpg)
• k-means: Search for
� k centers m1, ...,mk
� a partition {R1, ..., Rk} of {1, 2, ..., n}
minimizingk∑j=1
∑i∈Rj
‖xi −mj‖2.
• Cluster j:
Rj = {i : ‖xi −mj‖ ≤ ‖xi −ml‖ for every l = 1, ..., k}
(...assignment to the closest center...)
![Page 5: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/5.jpg)
• Robustness: Many statistical procedures are strongly affected by evenfew outlying observations:
� The mean is not robust:
x =1.72 + 1.67 + 1.80 + 1.70 + 1.82 + 1.73 + 1.78
7= 1.745
x =1.72 + 1.67 + 1.80 + 1.70 + 182 + 1.73 + 1.78
7= 27.485
� k-means inherits that lack of robustness from the mean
![Page 6: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/6.jpg)
• Lack of robustness of k-means:
−20 −10 0 10 20 30
−20
−10
010
2030
(a) 3−means
x1
x2
2 groups artific. joined
0 50 100 150
−40
−20
020
40
(b) 2−means
x1
x2
Cluster 1
Cluster 2
![Page 7: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/7.jpg)
• Outliers can be seen as “clusters by themselves”
• So, why not increasing the number of clusters...?
� But:
· Due to (physical, economical,...) reasons we could have an initialidea of k without being aware of the existence of outliers
· “Radial/background” noise requires large k’s
• Moreover, the detection of outliers may be the goal itself!!!
![Page 8: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/8.jpg)
• Outliers in trade data can be associated to “frauds”:
� Heterogeneous sources of data (clustering) + Few outliers (frauds??)
0 50 100 150 200
010
020
030
040
050
060
0
Trade data
Quantity (tons)
Valu
e (1
000
euro
s)
![Page 9: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/9.jpg)
2.- TRIMMED k-MEANS
• Trimming is the oldest and most widely used way to achieve robustness.
• Trimmed mean: The proportion α/2 smallest and α/2 largestobservations are discarded before computing the mean:
−2 0 2 4 6
−1.0
−0.5
0.0
0.5
1.0
x
Trimmed Trimmed
![Page 10: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/10.jpg)
• But,... how to trim in clustering?
� Why not trimming outlying “bridge” points?
0 5 10 15
−1.0
−0.5
0.00.5
1.0
x
Non−trimmed ’bridge’ points
� Why a symmetric trimming?
0 10 20 30
−1.0
−0.5
0.0
0.5
1.0
x
Symmetric trimming?
� How to trim in multivariate clustering problems?
![Page 11: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/11.jpg)
• Idea: Data itself tell us which are the most outlying observations!!
� Data-driven, adaptive, impartial,... trimming!
• Trimmed k-means: we search for
� k centers m1, ...,mk and
� a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]
minimizingk∑j=1
∑i∈Rj
‖xi −mj‖2.
[A fraction α of data is not taken into account Trimmed]
![Page 12: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/12.jpg)
• Black circles: trimmed points (k = 3 and α = 0.05):
−20 −10 0 10 20 30
−20
020
40
(a)
x1
x2
−5 0 5
−50
510
(b)
x1
x2
![Page 13: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/13.jpg)
• Old Faithful Geyser data: x1 = “Eruption length”, x2 = “Previouseruption length” and n = 271
Classificationk = 3, α = 0.03
Eruption length
Pre
viou
s er
uptio
n le
ngth
1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
� k = 3 and α = 0.03 (0.03 · 271 ' 9 trimmed obs.): 6 rare “short-followed-by-short” eruptions trimmed, 3 bridge points...
![Page 14: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/14.jpg)
3.- ROBUST MODEL-BASED CLUSTERING
• k-means and trimmed k-means prefer spherical clusters:
−6 −4 −2 0 2 4 6
−6−4
−20
24
6
(a) 2−means (spherical groups)
−6 −4 −2 0 2 4 6
−20
−10
010
20
(b) 2−means (elliptical groups)
• Elliptically contoured clusters?
![Page 15: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/15.jpg)
• Multivariate normal distributions with densities φ(·;µ,Σ):
� µ =
(22
)and Σ =
(1 00 1
)[spherical] in (a)
� µ =
(22
)and Σ =
(2 11 1
)[non-spherical] in (b)
−2 0 2 4 6
−20
24
6
(a)
x1
y2
−2 0 2 4 6
−20
24
6
(b)
x1
y2
φ(x;µ,Σ) = (2π)−p/2|Σ|−1/2 exp(− (x− µ)′Σ−1(x− µ)/2
)
![Page 16: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/16.jpg)
• Trimmed likelihoods: Search for
� k centers m1, ...,mk,
� k scatter matrices S1, ..., Sk, and,
� a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]
maximizing
k∑j=1
∑xi∈Rj
log φ(xi;mj, Sj) (obs. in R0 not taken into account)
Garcıa-Escudero et al 2008, Neykov et al 2007, Gallegos and Ritter 2005,...
![Page 17: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/17.jpg)
• Constraints on the Sj scatter matrices needed:
� Unbounded target likelihood functions� Avoid detecting (non-interesting) “spurious” clusters
• Control relative axes’ lengths (eigenvalues constraints):
c = 1 Large c value
![Page 18: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/18.jpg)
• The FSDA Matlab toolbox:
• The R package tclust at CRAN repository:
![Page 19: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/19.jpg)
• The R package tclust :
> library(tclust)
• tkmeans(data, k, alpha)
� k = “number of groups”
� alpha = “trimming proportion”
• tclust(data, k, alpha, restr.fact, ...)
� restr.fact = “Strength of the constraints”
![Page 20: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/20.jpg)
• tclust(X,k=3,alpha=0.03,restr.fact=50)
Large restr.factk = 3, α = 0.03
x1
x2
0 5 10 15 20
05
1015
20
![Page 21: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/21.jpg)
• Old Faithful Geyser data again:
Classificationk = 4, α = 0
Eruption length
Prev
ious e
rupti
on le
ngth
1.5 2.5 3.5 4.5
1.52.0
2.53.0
3.54.0
4.55.0
Classificationk = 3, α = 0.03
Eruption lengthPr
eviou
s eru
ption
leng
th1.5 2.5 3.5 4.5
1.52.0
2.53.0
3.54.0
4.55.0
• Why k = 3 and α = 0.03 was a sensible solution?
![Page 22: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/22.jpg)
• Applying ctlcurves to the Old Faithful Geyser data:
0.0 0.2 0.4 0.6 0.8
−800
−600
−400
−200
0CTL−Curves
α
Objec
tive F
uncti
on Va
lue
Restriction Factor = 50
5
5
5
5555555
555555555
55555555555
4
4
44444444444444
444
44444444444
3
3
3333333333333333
3333333333
33
2
2
222222222
2
22222222
2222222222
1
111111111
11
1
1111111111
1111111
0.00 0.02 0.04 0.06 0.08 0.10
−800
−700
−600
−500
−400
CTL−Curves
α
Objec
tive F
uncti
on Va
lue
Restriction Factor = 50
55
5 5 55
5 5 5 55 5 5 5 5
5 5 5 55
44 4
4 44 4 4
4 44 4
4 4 44
4 4 44
33
3
33
3 3 33 3
3 33 3 3
3 3 3 3 3
22
2
22
2 2 22 2
2 22 2 2
2 22 2
2
11 1
1 11 1 1
1 11 1
1 1 11 1
1 11
![Page 23: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/23.jpg)
4.- ROBUST CLUSTERING AROUND LINEARSUBSPACES
• Robust linear grouping: Higher p dimensions, but assuming that ourdata “live” in k low-dimensional (affine) subspaces...
� We search for
· k linear subspaces h1, ..., hk in Rp
· a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]
minimizingk∑j=1
∑i∈Rj
‖xi − Prhj(xi)‖2.
� Prh(·) denotes the “orthogonal” projection onto the linear subspace h
![Page 24: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/24.jpg)
• Example: Three linear structures in presence of noise:
1 1.5 2 2.5 3 3.5 4 4.5 5−6
−4
−2
0
2
4
6
X
Y
1 1.5 2 2.5 3 3.5 4 4.5 5−6
−4
−2
0
2
4
6
X
Y(a) α = 0 (b) α = 0.1 (◦ = “Trimmed”)
Trimmed “mixtures of regressions” can also be applied...
![Page 25: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/25.jpg)
• k = 1 case ⇒ Robust “Principal Components Analysis (PCA)”:
� PCA provides a q-dimensional (q << p) representation of data by
minBq,Aq,m
n∑i=1
||xi − xi||2 for
xi = Prh(xi) = xi(Bq,Aq,m) = m + Bqai
· Aq =
−a1−· · ·−ai−· · ·−an−
is the scores matrix (n× q)
· Bq =
−b1−· · ·−bj−· · ·−bp−
is a matrix (p × q) whose columns generate a q-dimensional approximating subspace h
![Page 26: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/26.jpg)
• Principal Components Analysis is highly non-robust!!!
• Least Trimmed Squares PCA (Maronna 2005): Minimize
n∑i=1
wi‖xi − xi‖2 =
n∑i=1
wi‖xi − xi(Bq,Aq,m)‖2,
with {wi}ni=1 being “0-1 weights” such that
n∑i=1
wi = [n(1− α)]
� Weights: wi =
{1 If xi is not trimmed
0 If xi is trimmed.
![Page 27: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/27.jpg)
• Cases → xi = (xi1, ..., xip)′ ∈ Rp and Cells → xij ∈ R
� i denotes a country (or a trader; company;...) for i = 1, ..., n
� xij is the “quantity-value ratio” for country i in the j-th month (orthe j-th year; the j-th product;...) for j = 1, ..., p
• Casewise trimming: Trim xi cases with (at least one) outlying xij
n = 100× p = 4 data matrix with 2% outlying cells:
Outlying xij cells Trimmed xi cases (black lines)
![Page 28: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/28.jpg)
• But when the dimension p increases... we do not expect many xicompletely free of outlying xij cells:
n = 100× p = 80 data matrix with 2% outlying cells:
Outlying xij cells Trimmed xi cases (black lines)
• Cellwise trimming:
� Only trimming outlying cells... (⇒ “Particular” frauds identified...??)
![Page 29: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/29.jpg)
• PCA approximation xi = m + Bqai = (xi1, ..., xip)T re-written as
xij = mj + aTi bj.
• Cellwise LTS (Cevallos-Valdiviezo 2016): Minimize
n∑i=1
wij(xij −mj − aTi bj)2
� wij = 0 if cell xij is trimmed and wij = 1 if not with
n∑i=1
wij = [n(1− α)], for j = 1, ..., p.
![Page 30: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/30.jpg)
• Different patterns/structures in data ⇒ G subspace approximations:
xgi(Bgqg,A
gqg,m
g)
= mg + Bgqga
gi or xgij = mg
j + (agi )Tbgj ,
for g = 1, ..., G
• Minimize
minwgij,B
gq ,A
gq,mg
n∑i=1
p∑j=1
G∑g=1
wgij(xij − xgij)
2.
� wgij = 1 if cell xij is assigned to cluster g and non-trimmed and 0otherwise
� Appropriate constraints on the wgij
q1, ..., qG are intrinsic dimensions...
![Page 31: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/31.jpg)
• Example 1: n = 400 in dimension p = 100 with 2 groups and 2%“scattered” outliers:
0.0 0.2 0.4 0.6 0.8 1.0
−20
−10
010
2030
40
0.0 0.2 0.4 0.6 0.8 1.0
−20
−10
010
2030
40
![Page 32: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/32.jpg)
• k = 2, q = 2 and α = 0.05:
“-” are the trimmed cells
![Page 33: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/33.jpg)
• Cluster means and trimmed cells (◦):
![Page 34: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/34.jpg)
• Example 2: n = 400 in dimension p = 100 with 2 groups and fewcurves with 20% consecutive cells corrupted:
0.0 0.2 0.4 0.6 0.8 1.0
−40
−20
020
40
0.0 0.2 0.4 0.6 0.8 1.0
−40
−20
020
40
![Page 35: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/35.jpg)
• Results:
![Page 36: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/36.jpg)
• Real data example: Average daily temperatures in 83 Spanishmeteorologic stations between 2007-2009 (n = 83 and p = 1096).
![Page 37: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/37.jpg)
• Artificial outliers:
� Two periods of 50 consecutive days in Oviedo replaced by 0oC.
� 150 consecutive days in Huelva temperature replaced by 0oC.
![Page 38: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/38.jpg)
• Cluster means:
� “Meseta” (Central plateau-Castile): Mediterranean:
� Cantabrian Coast: Canary Islands:
![Page 39: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/39.jpg)
• Clustered stations:
![Page 40: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/40.jpg)
• Clusters found and trimmed cells:
![Page 41: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/41.jpg)
−10 −5 0 5 10 15
05
1015
First two scores of cluster 1
PONFERRADA
OURENSE
SORIA
BURGOS
VALLADOLID
ÁVILA
PUERTO DE NAVAC
SEGOVIA
VALLADOLID AIR
ZAMORA
LEÓN
SALAMANCA AIR
SALAMANCAMADRID AIRTORREJÓN ARDOZ
COLMENAR V.
MADRIDMADRID CUAT VI
GETAFE
TOLEDOCIUDAD REALGRANADA BA
GRANADA AIR
CUENCA
ALBACETE BA
ALBACETE
TERUEL
FORONDA
DAROCA
−5 0 5 10 15
05
10
First two scores of cluster 2
BARCELONA AIR
BARCELONA FABRA
CACERES
BADAJOZ AIR
HUELVA R. ESTE
JAÉN
CORDOBA
SEVILLA
MORÓN FRONTE
ROTA
JEREZ FRONT
MELILLAMALAGA
ALMERIA AIR SAN JAVIERMURCIA
ALCANTARILLAALICANTE AIR
ALICANTE
VALENCIA AIR
VALENCIA
CASTELLÓN ALMAZ
LOGROÑO
PAMPLONAPAMPLONA AIR
ZARAGOZA
LLEIDA
HUESCA AIR
TORTOSA
PALMA PORT
PALMA AIRMENORCA
IBIZA
−5 0 5
−4
−3
−2
−1
01
2
First two scores of cluster 3
HONDARRIBIA
SAN SEBASTIÁN
BILBAO
SANTANDER AIR
SANTANDER
GIJÓN PORT VITORIAOVIEDO
CORUÑACORUÑA AIR
SANTIAGO DE C.
PONTEVEDRA
VIGO
LUGO
TENERIFE NORTE
−1 0 1 2 3
−1
01
2
First two scores of cluster 4
LANZAROTE
LA PALMA
FUERTEVENTURA
STA CRUZ TENER
GRAN CANARIA
HIERRO
![Page 42: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/42.jpg)
BARCELONA AIRBARCELONA FABRA
HONDARRIBIASAN SEBASTIÁN
BILBAOSANTANDER AIR
SANTANDERGIJÓN PORT
VITORIAOVIEDO
CORUÑACORUÑA AIR
SANTIAGO DE C.PONTEVEDRA
VIGOLUGO
PONFERRADAOURENSE
SORIABURGOS
VALLADOLIDÁVILA
PUERTO DE NAVACSEGOVIA
VALLADOLID AIRZAMORA
LEÓNSALAMANCA AIR
SALAMANCAMADRID AIR
TORREJÓN ARDOZCOLMENAR V.
MADRIDMADRID CUAT VI
GETAFETOLEDO
CACERESCIUDAD REALBADAJOZ AIR
HUELVA R. ESTEJAÉN
CORDOBAGRANADA BA
GRANADA AIRSEVILLA
MORÓN FRONTEROTA
JEREZ FRONTMELILLAMALAGA
ALMERIA AIRSAN JAVIER
MURCIAALCANTARILLAALICANTE AIR
ALICANTECUENCA
ALBACETE BAALBACETE
TERUELVALENCIA AIR
VALENCIACASTELLÓN ALMAZ
FORONDALOGROÑO
PAMPLONAPAMPLONA AIR
DAROCAZARAGOZA
LLEIDAHUESCA AIR
TORTOSAPALMA PORT
PALMA AIRMENORCA
IBIZALANZAROTE
LA PALMAFUERTEVENTURATENERIFE NORTESTA CRUZ TENER
GRAN CANARIAHIERRO
0 300 600 900Dias
Clusters
0
1
2
3
4
![Page 43: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/43.jpg)
• Reconstructed curves “ ” and true real data “ ” in Oviedo:
![Page 44: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/44.jpg)
• Conclusions:
� Different patterns/structures in data ⇒ Cluster Analysis
� Robust clustering aimed at (jointly) detecting main clusters(bulk of data) and outliers ⇒ Potential “frauds”...
� Higher dimensional problems: Assume clusters “living” in low-dimensional subspaces
� “Casewise” and “cellwise” trimming
![Page 45: Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica e I.O. and IMUVA - Universidad de Valladolid joint (and on-going...) work with](https://reader034.fdocuments.in/reader034/viewer/2022051910/600081730615213da35a1f0e/html5/thumbnails/45.jpg)
Some References:
· Cuesta-Albertos, J.A., Gordaliza, A. and Matran, C. (1997), “Trimmed
k-means: An attempt to robustify quantizers,” Ann. Statist., 25, 553-576.
· Garcıa-Escudero, L.A. and Gordaliza, A. (1999), “Robustness properties of
k-means and trimmed k-means,” J. Amer. Statist. Assoc., 94, 956-969.
· Garcıa-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A.
(2008), “A General Trimming Approach To Robust Cluster Analysis,” Ann. Statist.,
36, 1324-1345.
· Garcıa-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A.
(2010), “A review of robust clustering methods,” Advances in Data Analysis and
Classification, 4, 89-109.
· Fritz, H.,Garcıa-Escudero, L.A. and Mayo-Iscar; A (2012), “tclust: An R
package for a trimming approach to Cluster Analysis,” Journal of Statistical Software,
47, Issue 12.
Thanks for your attention!!!