Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of...

26
Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    214
  • download

    0

Transcript of Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of...

Page 1: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Basic concepts of Data Mining, Clustering and Genetic Algorithms

Tsai-Yang JeaDepartment of

Computer Science and Engineering

SUNY at Buffalo

Page 2: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Data Mining Motivation

• Mechanical production of data need for mechanical consumption of data

• Large databases = vast amounts of information

• Difficulty lies in accessing it

Page 3: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

KDD and Data Mining

• KDD: Extraction of knowledge from data

– “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data”

• Data Mining: Discovery stage of the KDD process

Page 4: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Data Mining Techniques

• Query tools• Statistical techniques• Visualization• On-line analytical

processing (OLAP)• Clustering

• Classification• Decision trees• Association rules• Neural networks• Genetic algorithms

Any technique that helps to extract more out of data is useful

Page 5: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

What’s Clustering

• Clustering is a kind of unsupervised learning.• Clustering is a method of grouping data that share

similar trend and patterns.• Clustering of data is a method by which large sets

of data is grouped into clusters of smaller sets of similar data.– Example:

Thus, we see clustering means grouping of data or dividing a large data set into smaller data sets of some similarity.

After clustering:

Page 6: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

The usage of clustering

• Some engineering sciences such as pattern recognition, artificial intelligence have been using the concepts of cluster analysis. Typical examples to which clustering has been applied include handwritten characters, samples of speech, fingerprints, and pictures.

• In the life sciences (biology, botany, zoology, entomology, cytology, microbiology), the objects of analysis are life forms such as plants, animals, and insects. The clustering analysis may range from developing complete taxonomies to classification of the species into subspecies. The subspecies can be further classified into subspecies.

• Clustering analysis is also widely used in information, policy and decision sciences. The various applications of clustering analysis to documents include votes on political issues, survey of markets, survey of

products, survey of sales programs, and R & D.

Page 7: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

A Clustering Example

Income: HighChildren:1Car:Luxury

Income: LowChildren:0Car:Compact

Car: Sedan andChildren:3Income: Medium

Income: MediumChildren:2Car:Truck

Cluster 1

Cluster 2Cluster 3

Cluster 4

Page 8: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Different ways of representing clusters

(b)

ad

kj

h

gif

ec

b

ad

kj

h

g if

ec

b

(a)

(c) 1 2 3

abc

0.4 0.1 0.50.1 0.8 0.10.3 0.3 0.4

...

(d)

g a c i e d k b j f h

Page 9: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

K Means Clustering(Iterative distance-based clustering)

• K means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.

Page 10: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

K means clustering(Cont.)

Select the k cluster centers randomly.

Store the k cluster centers.

.cluster tobelongingpattern th - theis and ,

cluster in patterns trainingofnumber theis mean, new theis where

1

:cluster theofmean thefindingby center its recompute cluster,each For

1

kjXk

NM

XN

M

jk

kk

N

jjk

kk

k

. ofmember a as classify and center cluster nearest the

find set, trainingin the patern each For set. trainingentire heClassify t CXC

X

i

i

Loop until the change in cluster means is less the amount specified

by the user.

Page 11: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

The drawbacks of K-means clustering

• The final clusters do not represent a global optimization result but only the local one, and complete different final clusters can arise from difference in the initial randomly chosen cluster centers. (fig. 1)

• We have to know how many clusters we will have at the first.

Page 12: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Drawback of K-means clustering(Cont.)

Figure 1

Page 13: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Clustering with Genetic Algorithm

• Introduction of Genetic Algorithm

• Elements consisting GAs

• Genetic Representation

• Genetic operators

Page 14: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Introduction of GAs

• Inspired by biological evolution.

• Many operators mimic the process of the biological evolution including

– Natural selection

– Crossover

– Mutation

Page 15: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Elements consisting GAs

• Individual (chromosome):– feasible solution in an optimization

problem• Population

– Set of individuals

– Should be maintained in each generation

Page 16: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Elements consisting GAs

• Genetic operators. (crossover, mutation…)• Define the fitness function.

– The fitness function takes a single chromosome as input and returns a measure of the goodness of the solution represented by the chromosome.

Page 17: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Genetic Representation

• The most important starting point to develop a genetic algorithm

• Each gene has its special meaning

• Based on this representation, we can define

– fitness evaluation function,

– crossover operator,

– mutation operator.

Page 18: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Genetic Representation (Cont.)

• Examples 1

Outlook 0

Wind

1

PlayTennis

1

Overcast

Rain

Sunny

1 1

Strong

Normal

Yes

No

0 0If Outlook isOvercast or Rainand Wind is Strong,thenPlayTennis = Yes

If Outlook isOvercast or Rainand Wind is Strong,thenPlayTennis = Yes

0 1 1 1 0 1 0A chromosome

Gene

Allele value

Page 19: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Genetic Representation (Cont.)

• Examples 2 ( In clustering problem)

– Each chromosome represents a set of clusters; each gene represents an object; each allele value represents a cluster. Genes with the same allele value are in the same cluster.

1 2 1 4 3 5 5

A B C D E F G

Page 20: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Crossover

• Exchange features of two individuals to produce two offspring (children)

• Selected mates may have good properties to survive in next generations

• So, we can expect that exchanging features may produce other good individuals

Page 21: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Crossover (cont.)

• Single-point Crossover

• Two-point Crossover

• Uniform Crossover

1 1 01 1

0 0 00 1

0 0 1 0 0 0

0 1 0 1 0 1

1 1 01 1

0 0 00 1

0 1 0 1 0 1

0 0 1 0 0 0

1 1 01 1

0 0 00 1

0 0 1 0 0 0

0 1 0 1 0 1

1 1 00 1

0 0 01 1

0 1 1 0 0 0

0 0 0 1 0 1

1 0 10 1 0 1 0 0 1 1

1 1 01 1

0 0 00 1

0 0 1 0 0 0

0 1 0 1 0 1

1 0 00 1

0 1 01 1

0 0 0 1 0 0

0 1 1 0 0 1

Crossover templateCrossover template

Page 22: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Mutation

• Usually change a single bit in a bit string

• This operator should happen with very low probability.

0 1 01 1

0 1 11 1

Mutation point(random)

Page 23: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Typical Procedures

• Crossover mates are probabilistically selected based on their fitness value.

0 1 00 11 1 01 0

0 0 11 10 1 01 1

1 1 01 01 1 01 1

1 1 01 1

0 1 00 1

1 1 00 1

0 1 01 1

Crossover pointrandomly selected

1 1 00 1

0 1 11 1

0 1 11 1

old generation

new generation0 1 01 1

1 1 01 01 1 01 1

Mutation point(random)

Probabilistically select individualsProbabilistically select individuals

Page 24: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

• Preparing the chromosomes

• Defining genetic operators– Fusion: takes two unique allele values and combines them into a

single allele value, combining two clusters into one.

– Fission: takes a single allele value and gives it a different random allele value, breaking a cluster apart.

• Defining fitness functions

How to apply GA on a clustering problem

1 2 33 5

1 2 33 5 3 2 33 5

1 3 33 5 1 3 44 5

Page 25: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Example: (Cont.)

CrossoverMutationFusionFission

Old generation

New generation

Select the chromosomes according to the fitness

function.

1 2 33 51 2 43 5

2 1 33 52 2 43 5

1 1 13 5

2 2 32 51 2 53 5

2 4 33 42 2 43 5

2 1 23 5

Page 26: Basic concepts of Data Mining, Clustering and Genetic Algorithms Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo.

Finally…