Monothetic divisive clustering with geographical constraints

Monothetic divisive clustering with geographical constraints Marie Chavent (1) Yves Lechevallier (2) Francoise Vernier (3) Kevin Petit (3) (1) Université Bordeaux2, IMB, UMR 5251 CNRS, France [email protected] (2) INRIA, Paris-Rocquencourt 78153 Le Chesnay cedex, France [email protected] (3) CEMAGREF-Bordeaux, Unité de recherche ADER 50, France francoise.vernier,[email protected] COMPSTAT 2008, Porto, Portugal Chavent,Lechevallier,Vernier,Petit Monothetic divisive clustering with geographical constraints

Transcript of Monothetic divisive clustering with geographical constraints

Page 1: Monothetic divisive clustering with geographical constraints

Monothetic divisive clustering with geographical constraints

Marie Chavent (1), Yves Lechevallier (2)

Francoise Vernier (3), Kevin Petit (3)

(1) Université Bordeaux2, IMB, UMR 5251 CNRS, France, [email protected]

(2) INRIA, Paris-Rocquencourt 78153 Le Chesnay cedex, France, [email protected]

(3) CEMAGREF-Bordeaux, Unité de recherche ADER 50, France, francoise.vernier,[email protected]

COMPSTAT 2008, Porto, Portugal

Chavent,Lechevallier,Vernier,Petit Monothetic divisive clustering with geographical constraints

Page 2: Monothetic divisive clustering with geographical constraints

Introduction

DIVCLUS-T is a divisive and monothetic hierarchical clustering method which proceeds by optimization of a polythetic criterion. The bipartitional algorithm and the choice of the cluster to be split are based on the minimization of the within-cluster inertia.

C-DIVCLUS-T is an extension of DIVCLUS-T which is able to take contiguity constraints into account. The new criterion defined to include these constraints is a distance-based criterion.

Page 3: Monothetic divisive clustering with geographical constraints

DIVCLUS-T

The DIVCLUS-T algorithm repeats the following two steps:

Splitting a cluster into a bipartition which optimizes a criterion W. Complete enumeration is avoided by using a monothetic approach.

Choosing, in the current partition, the cluster to be split in such a way that the new partition optimizes the criterion W.

⇒ The process stops after a number of iterations specified by the user.

⇒ The output is an indexed hierarchy (dendrogram) which is also a decision tree.

Page 4: Monothetic divisive clustering with geographical constraints

DIVCLUS-T

First: how does the bipartitional algorithm work? The best bipartition is chosen among the set of bipartitions induced by all possible binary questions.

On a numerical variable X, a binary question is written "is X ≤ c?".

On a categorical variable X, a binary question is written "is X ∈ C?".

⇒ Note that for numerical variables with complex descriptions such as intervals, it is not possible to answer yes or no to this binary question.

Page 5: Monothetic divisive clustering with geographical constraints

DIVCLUS-T

On a numerical variable X, the number of binary questions is infinite, but these binary questions induce at most $n_\ell - 1$ different bipartitions of a cluster $C_\ell$ with $n_\ell$ objects.

On a categorical variable X with m categories, at most $2^{m-1} - 1$ different bipartitions are induced → computational problem.
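The two bounds can be checked by enumeration. The sketch below is illustrative only: the function names and the toy values are assumptions, not part of DIVCLUS-T. It lists one cut point per pair of consecutive distinct values for a numerical variable, and one subset per bipartition of the category set for a categorical variable.

```python
from itertools import combinations


def numeric_cuts(values):
    """Cut points c for which the questions "is X <= c ?" give distinct bipartitions."""
    distinct = sorted(set(values))
    # One cut between each pair of consecutive distinct values.
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def categorical_subsets(categories):
    """Subsets C for "is X in C ?", one per bipartition of the category set."""
    cats = sorted(set(categories))
    first, rest = cats[0], cats[1:]
    subsets = []
    # Keeping `first` inside C avoids counting both C and its complement.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            subset = {first, *extra}
            if len(subset) < len(cats):  # skip the trivial split where C is everything
                subsets.append(subset)
    return subsets


x_num = [9.5, 3.1, 7.0, 3.1, 12.2]        # toy values: 5 objects, 4 distinct values
x_cat = ["clay", "sand", "loam", "peat"]  # toy categories: m = 4

cuts = numeric_cuts(x_num)
subsets = categorical_subsets(x_cat)
print(len(cuts), "cut points for", len(x_num), "objects")
print(len(subsets), "subsets for", len(x_cat), "categories")
```

The categorical count doubles with each extra category, which is the computational problem noted above.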

Page 6: Monothetic divisive clustering with geographical constraints

DIVCLUS-T

Second: how to choose the cluster to split? Choose the cluster $C_\ell = A_\ell \cup \bar{A}_\ell$ of $P_k$ such that the partition $P_{k+1} = \{C_1, \ldots, C_{\ell-1}, A_\ell, \bar{A}_\ell, C_{\ell+1}, \ldots, C_k\}$ has the smallest homogeneity criterion $W(P_{k+1})$.

⇒ If the homogeneity criterion $W(P_k)$ is additive,

$$W(P_k) = \sum_{\ell=1}^{k} D(C_\ell),$$

⇒ then the chosen cluster $C_\ell$ maximizes $h(C_\ell) = D(C_\ell) - D(A_\ell) - D(\bar{A}_\ell)$.
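The step from "smallest $W(P_{k+1})$" to "largest $h(C_\ell)$" follows directly from additivity. A short derivation, using only the symbols defined on this slide:

```latex
\begin{aligned}
W(P_{k+1}) &= \sum_{j \neq \ell} D(C_j) + D(A_\ell) + D(\bar{A}_\ell) \\
           &= W(P_k) - D(C_\ell) + D(A_\ell) + D(\bar{A}_\ell) \\
           &= W(P_k) - h(C_\ell).
\end{aligned}
```

Since $W(P_k)$ does not depend on which cluster is split, minimizing $W(P_{k+1})$ over $\ell$ is the same as maximizing $h(C_\ell)$.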

Page 7: Monothetic divisive clustering with geographical constraints

DIVCLUS-T

Third: how to define the hierarchical level?

The number of divisions is fixed, so the hierarchy is an upper hierarchy.

The hierarchical level is $h(C_\ell) = D(C_\ell) - D(A_\ell) - D(\bar{A}_\ell)$.

Page 8: Monothetic divisive clustering with geographical constraints

DIVCLUS-T : a simple example

[Figure: two dendrograms of the European countries of the protein data set. The first tree is annotated with binary questions (Nuts > 3.5, Fish > 5.7, Red Meat > 12.2, Starchy Foods > 3.9, Fruits/Veg. > 5.35), each answered Yes or No. Node heights are printed on both trees but cannot be reliably assigned from the transcript.]

Page 9: Monothetic divisive clustering with geographical constraints

DIVCLUS-T : a simple example

What is the price paid in terms of inertia for this additional monothetic interpretation?

Proportion of the inertia (in %) explained by the k-cluster partitions obtained with DIVCLUS-T and Ward on the protein data set:

k            2     3     4     5     6     7     8     9     10
DIVCLUS-T    37.1  50.6  59.2  65.5  71.2  73.5  79.3  81.6  84
Ward         34.7  48.5  58.5  66.7  72.4  75.5  79    81.6  84

Chavent, M., Briant, O., Lechevallier, Y. (2007). DIVCLUS-T: a monothetic divisive hierarchical clustering method. Computational Statistics and Data Analysis, 52 (2), 687-701.

Page 10: Monothetic divisive clustering with geographical constraints

A distance-based homogeneity criterion

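How can a homogeneity criterion be defined when the data have complex descriptions? Let $D = (d_{ii'})_{n \times n}$ be the distance matrix.

A distance-based homogeneity criterion $D$ of a cluster $C_\ell$ can be defined by:

$$D(C_\ell) = \sum_{i \in C_\ell} \sum_{i' \in C_\ell} \frac{w_i w_{i'}}{2\mu_\ell}\, d_{ii'}^2 \quad \text{with} \quad \mu_\ell = \sum_{i \in C_\ell} w_i,$$

where $w_i$ denotes the weight of object $i$.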

Page 11: Monothetic divisive clustering with geographical constraints

A distance-based homogeneity criterion

A distance-based homogeneity criterion $W$ of a partition $P_k$ can be defined by:

$$W(P_k) = \sum_{\ell=1}^{k} D(C_\ell)$$

$W(P_k)$ is the within-cluster inertia criterion for classical numerical data and the Euclidean distance.

Analysis of symbolic data, Ed. H.H. Bock, E. Diday, Springer.
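The equivalence with the within-cluster inertia can be checked numerically. The sketch below uses made-up points and weights (both are assumptions for illustration). It computes $W(P_k)$ once from the squared Euclidean distance matrix alone and once from the weighted centroids.

```python
def distance_based_d(cluster, weights, d2):
    """D(C) = sum over i, i' in C of w_i * w_i' / (2 * mu) * d2[i][i']."""
    mu = sum(weights[i] for i in cluster)
    return sum(weights[i] * weights[k] * d2[i][k]
               for i in cluster for k in cluster) / (2 * mu)


def inertia(cluster, weights, points):
    """Weighted sum of squared Euclidean distances to the weighted centroid."""
    mu = sum(weights[i] for i in cluster)
    dim = len(points[0])
    g = [sum(weights[i] * points[i][j] for i in cluster) / mu for j in range(dim)]
    return sum(weights[i] * sum((points[i][j] - g[j]) ** 2 for j in range(dim))
               for i in cluster)


points = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0), (6.0, 1.0)]  # toy data
weights = [0.1, 0.2, 0.3, 0.4]                             # toy weights
# Squared Euclidean distances between all pairs of points.
d2 = [[sum((a - b) ** 2 for a, b in zip(p, r)) for r in points] for p in points]

partition = [[0, 1], [2, 3]]
w_dist = sum(distance_based_d(c, weights, d2) for c in partition)
w_inertia = sum(inertia(c, weights, points) for c in partition)
print(w_dist, w_inertia)  # the two values agree up to rounding
```

Only `d2` is needed by `distance_based_d`, which is why the criterion still applies when the objects have no coordinates and only pairwise distances are available.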

Page 12: Monothetic divisive clustering with geographical constraints

A new distance-based criterion

The geographical constraints are represented in an adjacency matrix $Q = (q_{ii'})_{n \times n}$ where

$q_{ii'} = 1$ if $i'$ is a neighbor of $i$,
$q_{ii'} = 0$ otherwise.

Page 13: Monothetic divisive clustering with geographical constraints

A new distance-based homogeneity criterion

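We have

$$D(C_\ell) = \sum_{i \in C_\ell} \sum_{i' \in C_\ell} \frac{w_i w_{i'}}{2\mu_\ell}\, d_{ii'}^2 = \sum_{i \in C_\ell} \frac{w_i}{2\mu_\ell}\, D_i(C_\ell) \quad \text{with} \quad D_i(C_\ell) = \sum_{i' \in C_\ell} w_{i'}\, d_{ii'}^2,$$

which measures the proximity between the object $i$ and the cluster $C_\ell$ to which it belongs.

We define a new homogeneity criterion $D(C_\ell)$ by replacing $D_i(C_\ell)$ with a new criterion $D_i(C_\ell) = \alpha\, a_i(C_\ell) + (1 - \alpha)\, b_i(C_\ell)$ with $\alpha \in [0, 1]$.

The new distance-based criterion is $W_\alpha(P_k) = \sum_{\ell=1}^{k} D(C_\ell)$.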

Page 14: Monothetic divisive clustering with geographical constraints

A new distance-based criterion

In the criterion

$$D_i(C_\ell) = \alpha\, a_i(C_\ell) + (1 - \alpha)\, b_i(C_\ell),$$

the first part

$$a_i(C_\ell) = \sum_{i' \in C_\ell} w_{i'} (1 - q_{ii'})\, d_{ii'}^2$$

measures the coherence or the dissimilarity between $i$ and its cluster $C_\ell$. It is small when $i$ is similar to the objects in $C_\ell$ ($d_{ii'} \approx 0$) and when these objects are neighbors of $i$ ($q_{ii'} = 1$, so that $1 - q_{ii'} = 0$).

Page 15: Monothetic divisive clustering with geographical constraints

A new distance-based criterion

In the criterion

$$D_i(C_\ell) = \alpha\, a_i(C_\ell) + (1 - \alpha)\, b_i(C_\ell),$$

the second part

$$b_i(C_\ell) = \sum_{i' \notin C_\ell} w_{i'}\, q_{ii'} (1 - d_{ii'}^2)$$

measures the coherence between $i$ and the objects which are not in $C_\ell$. It is small when $i$ is dissimilar from the objects which are not in $C_\ell$ ($d_{ii'} \approx 1$) and when the objects which are not in $C_\ell$ are not neighbors of $i$ ($q_{ii'} = 0$). In other words, $b_i(C_\ell)$ represents a penalty for the neighbors of $i$ which belong to other clusters.
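The two parts and the resulting criterion can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code. Distances are assumed to be scaled to [0, 1], as the condition $d_{ii'} \approx 1$ suggests. The diagonals of Q and of the distance matrix are assumed to be zero. The function names and the four-unit example are hypothetical.

```python
def a_i(i, cluster, w, q, d2):
    """Within-cluster part: squared distances to members that are not neighbours of i."""
    return sum(w[k] * (1 - q[i][k]) * d2[i][k] for k in cluster)


def b_i(i, cluster, w, q, d2):
    """Penalty part: neighbours of i outside the cluster, weighted by 1 - d2."""
    inside = set(cluster)
    return sum(w[k] * q[i][k] * (1 - d2[i][k])
               for k in range(len(w)) if k not in inside)


def w_alpha(partition, w, q, d2, alpha):
    """W_alpha(P): sum over clusters and objects of w_i / (2 mu) * D_i."""
    total = 0.0
    for cluster in partition:
        mu = sum(w[i] for i in cluster)
        for i in cluster:
            d_i = alpha * a_i(i, cluster, w, q, d2) + (1 - alpha) * b_i(i, cluster, w, q, d2)
            total += w[i] / (2 * mu) * d_i
    return total


# Four units on a chain 0 - 1 - 2 - 3; units 0, 1 are alike and so are units 2, 3.
q = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
d = [[0.0, 0.1, 0.8, 0.9], [0.1, 0.0, 0.7, 0.8], [0.8, 0.7, 0.0, 0.2], [0.9, 0.8, 0.2, 0.0]]
d2 = [[x ** 2 for x in row] for row in d]
w = [0.25] * 4

good = [[0, 1], [2, 3]]  # contiguous and homogeneous clusters
bad = [[0, 2], [1, 3]]   # clusters that cut across neighbours
print(w_alpha(good, w, q, d2, 0.5), w_alpha(bad, w, q, d2, 0.5))
```

On this toy example the contiguous partition gets the smaller value, since it pays no distance cost for non-neighbors inside a cluster and only a small penalty for the single adjacency that crosses clusters.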

Page 16: Monothetic divisive clustering with geographical constraints

Study of the parameter α

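The parameter α can be chosen by the user (usually, α = 0.5).

If $\alpha = 1$ then $W_1(P_n) = 0$ and, for a partition $P_k$, we have:

$$W_1(P_k) = \sum_{\ell=1}^{k} \sum_{i \in C_\ell} \sum_{i' \in C_\ell} \frac{w_i w_{i'}}{2\mu_\ell} (1 - q_{ii'})\, d_{ii'}^2.$$

If $\alpha = 0$ then $W_0(P_1) = 0$ and, for a partition $P_k$, we have:

$$W_0(P_k) = \sum_{\ell=1}^{k} \sum_{i \in C_\ell} \sum_{i' \notin C_\ell} \frac{w_i w_{i'}}{2\mu_\ell}\, q_{ii'} (1 - d_{ii'}^2).$$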
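The two extreme cases combine linearly. Since $D_i(C_\ell) = \alpha\, a_i(C_\ell) + (1-\alpha)\, b_i(C_\ell)$ is linear in α, a short derivation (assuming $d_{ii} = 0$ for every object) gives:

```latex
\begin{aligned}
W_\alpha(P_k)
  &= \sum_{\ell=1}^{k} \sum_{i \in C_\ell} \frac{w_i}{2\mu_\ell}
     \bigl[\alpha\, a_i(C_\ell) + (1-\alpha)\, b_i(C_\ell)\bigr] \\
  &= \alpha\, W_1(P_k) + (1-\alpha)\, W_0(P_k).
\end{aligned}
```

In the partition into singletons $P_n$, the only term left in $a_i$ is $i' = i$, which vanishes, so $W_\alpha(P_n) = (1-\alpha)\, W_0(P_n)$. In the one-cluster partition $P_1$ no object lies outside the cluster, so every $b_i$ is an empty sum and $W_\alpha(P_1) = \alpha\, W_1(P_1)$.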

Page 17: Monothetic divisive clustering with geographical constraints

Automatic choice of α

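The parameter α can be chosen automatically such that $W_\alpha(P_1) = W_\alpha(P_n)$. The parameter α is then equal to:

$$\alpha = \frac{A}{A + B}$$

where

$$A = \sum_{i \in \Omega} \sum_{i' \in \Omega,\, i \neq i'} q_{ii'} (1 - d_{ii'}^2), \qquad B = \sum_{i \in \Omega} \sum_{i' \in \Omega} (1 - q_{ii'})\, d_{ii'}^2.$$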
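As an illustration, the ratio can be computed directly from Q and the squared distances. The sketch below implements the two double sums exactly as written above. The function name and the three-unit example are assumptions, and distances are taken to lie in [0, 1].

```python
def automatic_alpha(q, d2):
    """alpha = A / (A + B), with A and B the double sums over all pairs of objects."""
    n = len(q)
    # A: similarity (1 - d2) summed over pairs of distinct neighbours.
    a_sum = sum(q[i][k] * (1 - d2[i][k]) for i in range(n) for k in range(n) if k != i)
    # B: squared distance summed over pairs that are not neighbours.
    b_sum = sum((1 - q[i][k]) * d2[i][k] for i in range(n) for k in range(n))
    return a_sum / (a_sum + b_sum)  # undefined when A + B = 0


# Three units on a chain 0 - 1 - 2 (toy example).
q = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
d = [[0.0, 0.2, 0.9], [0.2, 0.0, 0.5], [0.9, 0.5, 0.0]]
d2 = [[x ** 2 for x in row] for row in d]

alpha = automatic_alpha(q, d2)
print(round(alpha, 4))
```

With many similar neighboring pairs A grows and α moves towards 1, while large distances between non-neighbors increase B and pull α towards 0.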

Page 18: Monothetic divisive clustering with geographical constraints

Hydrological areas clustering

A study is being carried out at Cemagref in the context of the SPICOSA project (web site: www.spicosa.eu).

The purpose is to define the relevant spatial unit, helpful for the integrated management of the "Charente river basin".

The task is to find a partition of the 140 hydrological units within the studied area.

Page 19: Monothetic divisive clustering with geographical constraints

Hydrological areas clustering

The 140 hydrological units are characterized by: 14 types of soils, 17 types of soil occupation, 8 main crops, a mean slope and a drainage rate.

Zhydro   Type of soil          Soil occupation        Crop                  Mean slope   Drainage rate
         S1   S2   ...  S14    O1    O2    ...  O17   C1   C2   ...  C8
R000     12   22   ...  7.8    9.8   12.6  ...  9.4   12   8.7  ...  32.1   4.44         11.28
...      ...  ...  ...  ...    ...   ...   ...  ...   ...  ...  ...  ...    ...          ...

Two files:

the first file includes the descriptions of the 140 hydrological units;

the second file includes, for each hydrological area, the list of its neighbors.
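A neighbors list of this kind maps directly onto the adjacency matrix Q. The sketch below assumes the list has already been read into a dictionary, since the actual file format is not given in the slides. The identifiers other than "R000" are made up, and contiguity is assumed to be symmetric.

```python
def adjacency_matrix(unit_ids, neighbours):
    """Build Q = (q_ii') with q_ii' = 1 when i' is listed as a neighbour of i."""
    index = {uid: pos for pos, uid in enumerate(unit_ids)}
    n = len(unit_ids)
    q = [[0] * n for _ in range(n)]
    for uid, listed in neighbours.items():
        for other in listed:
            if other == uid:
                continue  # a unit is not its own neighbour
            q[index[uid]][index[other]] = 1
            q[index[other]][index[uid]] = 1  # assumption: contiguity is symmetric
    return q


unit_ids = ["R000", "R001", "R002"]  # "R001" and "R002" are hypothetical identifiers
neighbours = {"R000": ["R001"], "R001": ["R000", "R002"], "R002": []}

q = adjacency_matrix(unit_ids, neighbours)
for uid, row in zip(unit_ids, q):
    print(uid, row)
```

Symmetrizing on the fly means a unit missing from one neighbor's list, such as "R001" from the list of "R002" here, still gets the link.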

Page 20: Monothetic divisive clustering with geographical constraints

Hydrological areas clustering

The DIVCLUS-T method has been applied to the first data file.

C-DIVCLUS-T has been applied to the same data file, taking into account the contiguity of the data given in the neighbors file.

The five-cluster partition has been retained in both cases.

Page 21: Monothetic divisive clustering with geographical constraints

Results with DIVCLUS-T and C-DIVCLUS-T

The maps give the clusters obtained by DIVCLUS-T and C-DIVCLUS-T on the Charente basin.

Page 22: Monothetic divisive clustering with geographical constraints

Results with C-DIVCLUS-T

A part of the coastal area can be linked to the presence of Doucins soils (moors).

In the North of the river basin, a homogeneous area with cereal crops stands out.

Another relevant area is delimited in the South of the basin by the variable "limestone soils": vineyards and complex cultivation patterns are found here.

Cluster 1 can be linked to more artificialised areas.

Page 23: Monothetic divisive clustering with geographical constraints

Conclusion

This is a first trial of taking contiguity constraints into account in the clustering of this dataset.

Many other approaches exist and may be used.

The advantage of C-DIVCLUS-T remains its monothetic aspect and its distance-based criterion, which is able to deal with data having complex descriptions.
