Determining the number of clusters using information entropy for mixed data

Presenter: Hong-Yi, Cai

Authors: Jiye Liang, Xingwang Zhao, Deyu Li, Fuyuan Cao, Chuangyin Dang

Pattern Recognition (PR), 2012

1

Outline

• Motivation

• Objectives

• Methodology

• Experiments

• Conclusions

• Comments

2

Motivation

• Determining the initial parameters of a clustering algorithm, such as the number of clusters, is among the most difficult problems in clustering.

• Existing clustering algorithms do not cluster mixed (numerical and categorical) data sets effectively.

3

Objectives

• To propose a generalized mechanism for mixed data sets by integrating Rényi entropy and complement entropy.

• To improve the k-prototypes algorithm by using the new generalized mechanism.

4

Methodology

• K-Prototype…
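The slide content for k-prototypes did not survive extraction. As a reminder of the general idea only, here is a minimal sketch of the mixed dissimilarity commonly attributed to Huang's k-prototypes: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical attributes. The function name, argument names, and the default value of `gamma` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def k_prototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Mixed dissimilarity between one object and one cluster prototype.

    x_num / proto_num: numerical attribute values.
    x_cat / proto_cat: categorical attribute values (same order and length).
    gamma: weight balancing the categorical part against the numerical part
           (hypothetical default; in practice it is tuned per data set).
    """
    x_num = np.asarray(x_num, dtype=float)
    proto_num = np.asarray(proto_num, dtype=float)
    # Numerical part: squared Euclidean distance.
    numeric_part = float(np.sum((x_num - proto_num) ** 2))
    # Categorical part: simple matching, i.e. the number of mismatched attributes.
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

d = k_prototypes_dissimilarity(
    [1.0, 2.0], ["red", "small"],
    [0.0, 2.0], ["red", "large"],
    gamma=0.5,
)
print(d)  # 1.5
```

Each object is assigned to the prototype with the smallest such dissimilarity. The number of clusters must still be given in advance, which is the gap the presented method addresses.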

5

Methodology

• A generalized mechanism for numerical data…

6

Rényi entropy:

Parzen window density estimation:

By the convolution theorem…

Within-Cluster Entropy:

Between-Cluster Entropy:

Improved Entropy for numerical data:
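The formulas for this slide were lost in extraction. The following is a minimal sketch of the standard construction that the labels above point to, not a reproduction of the slide: the quadratic Rényi entropy, H = -log ∫ f(x)² dx, with f estimated by a Gaussian Parzen window of width sigma. Because the convolution of two Gaussians is again a Gaussian, the integral reduces to an average of Gaussian kernels with variance 2·sigma² over all pairs of points. The function name and the isotropic kernel are assumptions made for illustration.

```python
import numpy as np

def renyi_quadratic_entropy(X, sigma=1.0):
    """Quadratic Renyi entropy of a sample under a Gaussian Parzen estimate.

    X: array of shape (n, d), or (n,) for one-dimensional data.
    sigma: Parzen window width (assumed isotropic here).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, d = X.shape
    # Pairwise squared Euclidean distances between all sample points.
    diffs = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Convolving two Gaussians of variance sigma^2 gives variance 2 * sigma^2.
    var = 2.0 * sigma ** 2
    kernel = np.exp(-sq_dists / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    # Average kernel value over all pairs estimates the integral of f(x)^2.
    integral_f_squared = kernel.sum() / n ** 2
    return -np.log(integral_f_squared)

tight = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
print(renyi_quadratic_entropy(tight) < renyi_quadratic_entropy(spread))  # True
```

A compact cluster yields a lower entropy than a scattered one, which is the property a within-cluster entropy relies on. How the slides combine this into the between-cluster and improved entropies is not recoverable from the transcript.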

Methodology

• A generalized mechanism for categorical data…

7

Indiscernibility relation…

Complement Entropy:

Within-Cluster Entropy:

Improved Entropy for categorical data:

Between-Cluster Entropy:

Huang Dissimilarity for categorical data:
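The formulas for this slide were also lost in extraction. As a hedged illustration of the complement entropy named above, the sketch below uses its usual definition for a single categorical attribute: the attribute's values partition the objects into equivalence classes (the indiscernibility relation), and the entropy sums, over the classes, the fraction of objects inside the class times the fraction outside it. The function name is an assumption, and how the slides aggregate over several attributes is not shown in the transcript.

```python
from collections import Counter

def complement_entropy(values):
    """Complement entropy of the partition induced by one categorical attribute.

    values: the attribute value of each object in the cluster or data set.
    Returns sum over classes of p * (1 - p), where p is the class proportion.
    """
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)  # sizes of the equivalence classes
    return sum((c / n) * (1.0 - c / n) for c in counts.values())

print(complement_entropy(["a", "a", "a", "a"]))  # 0.0
print(complement_entropy(["a", "a", "b", "b"]))  # 0.5
print(complement_entropy(["a", "b", "c", "d"]))  # 0.75
```

The value is 0 when all objects share one category and grows as the objects spread across more categories, so a homogeneous cluster scores low.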

Methodology

• A generalized mechanism for mixed data set…

8

Methodology

• Cluster validity index for mixed data…

9

For numerical data…

For categorical data…

For mixed data…

10

Methodology

Experiments

• Ten-cluster data set

11

Experiments

• STUDENT

12

Experiments

• Real data sets…

13

Experiments

• Wine and Breast

14

Experiments

• Voting and Car

15

Experiments

• DNA and TAE

16

Experiments

• Heart and Credit

17

Experiments

• CMC and Adult

18

Experiments

19

Conclusions

• The proposed generalized mechanism and the improved algorithm can cluster mixed data sets effectively and determine the optimal number of clusters.

20

Comments

• Advantages: the entropy measure can be applied to mixed data sets.

• Applications: clustering of mixed-type data.

21