1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

46
1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    2

Transcript of 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

Page 1: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

1

Genetic Algorithms

Data Mining and Distributed Computing Group

Facultad de Informática

Page 2: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

2

DATSI, Universidad Politécnica de Madrid

Overview

A class of probabilistic optimization algorithms Inspired by the biological evolution process Uses concepts of “Natural Selection” and “Genetic

Inheritance” (Darwin 1859) Originally developed by John Holland (1975) Particularly well suited for hard problems where little

is known about the underlying search space Widely-used in business, science and engineering

Page 3: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

3

DATSI, Universidad Politécnica de Madrid

Natural Evolution

Darwin: The Origin of Species– Search of optimal forms (environment)– Based on:

assortment : recombination of genetic material

randomness selection : survival of the fittest

Page 4: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

4

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Cell)

• A set of many small “factories” working together

• Center: cell nucleus

• The nucleus contains the genetic information

Page 5: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

5

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Chromosome)

• Chromosomes store genetic information

• Each chromosome is build of DNA

• Chromosomes in humans form pairs (23 pairs)

• The chromosome is divided in parts: genes

• Genes code for properties

• Possible values of the genes: allele

• Position of the gene in the

chromosome: locus

Page 6: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

6

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Genetics)

• The entire combination of genes: genotype

• A genotype is expressed as a phenotype

• Alleles can be either dominant or recessive

• Dominant alleles will always express from the genotype to the phenotype

• Recessive alleles can survive in the population for many generations, without being expressed

Page 7: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

7

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Reproduction)

Mitosis: copying the same genetic information to new offspring: there is no exchange of information

Normal way of growing of multicell structures, like organs.

Page 8: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

8

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Reproduction)

• Meiosis: the basis of sexual reproduction

• After meiotic division 2 gametes appear in the process

• In reproduction two gametes conjugate to a zygote wich will become the new individual

• Genetic information is shared between the parents in order to create new offspring

Page 9: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

9

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Reproduction)

• During reproduction “errors” occur

• Due to these “errors” genetic variation exists

• Most important “errors” are:

• Recombination (cross-over)

• Mutation

Page 10: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

10

DATSI, Universidad Politécnica de Madrid

Biological Concepts (Natural Selection)

• The origin of species: “Preservation of favourable variations and rejection of unfavourable variations.”

• Survival of the fittest

• Mathematical expresses as fitness: success in life

Page 11: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

11

DATSI, Universidad Politécnica de Madrid

Genetic Algorithms (GAs)

John Holland (~1973)– GAs use the principle of natural selection to solve

complicated optimization problems.– An alternative of brute-force search on a complex

solution space. Instead of complete exploration Estocastic-driven search

Page 12: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

12

DATSI, Universidad Politécnica de Madrid

Genetic Algorithms (GAs)

F inonacc i N ew ton

D irect m ethods Indirect m ethods

C alcu lus-based techn iques

Evolu tionary s trategies

C entra l ized D is tr ibuted

Para l le l

S teady-s ta te G enera tiona l

Sequentia l

G enetic a lgori thm s

Evolutionary a lgori thm s S im u lated annealing

G uided random search techniques

D ynam ic program m ing

Enum erative techn iques

Search techniques

© W. Williams, ‘99

Page 13: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

13

DATSI, Universidad Politécnica de Madrid

Genetic Algorithms: Definitions

Concepts:– Individual: A possible solution of the problem.– Gene: A solution component (a.k.a. an attribute) [eye color]– Allele: A gene value [eye color=green]– Population: group of potential solutions

10011101011011

-19 4 108 46

0.433 -33.345 0.0013

+

A x

B 3

Page 14: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

14

DATSI, Universidad Politécnica de Madrid

Genetic Algorithms: Definitions

Concepts:– Genotype: Genetic codification (seq. of genes)– Phenotype: Actual individual (with its capabilities)

For GAs, phenotype represents the fitness value of the solution:

Asumption: there exists a relationship between genetic information and individual fitness.

Page 15: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

15

DATSI, Universidad Politécnica de Madrid

Fitness Landscapes

Fitness: Function that evaluates individual capabilities.

Graphical representation

N+1 dimensions

fitness

© L.J. Pit, ‘96

Page 16: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

16

DATSI, Universidad Politécnica de Madrid

Natural Evolution and Operators

How new individuals are created?– Crossover (recombination)

mixing existing genetic material, assortment Inheritage of usefull characteristics

– Mutation: rare (very rare in biological perspective) Randomness

They simulate the evolution process

Page 17: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

17

DATSI, Universidad Politécnica de Madrid

Natural Evolution and Operators

Crossover

1001110101101100010100010001

#9

00010100011011

Variants: 2, 3, n-points crossover Uniform crossover Other...

Page 18: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

18

DATSI, Universidad Politécnica de Madrid

Natural Evolution and Operators

Mutation

00010100011011

#5

00011100011011

Variants: Mutation-by-swap Biased mutation Cataclysmic mutation

Page 19: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

19

DATSI, Universidad Politécnica de Madrid

Natural Evolution and Operators

Selection:1. Evaluation of individuals (fitness)

2. Parent selection (for each crossover operation).

Selective pressure: relationship between capacities (fitness) of the individual and its possibilities to generate new individuals (participate in the reproduction).

Page 20: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

20

DATSI, Universidad Politécnica de Madrid

Natural Evolution and Operators

Selection (alternatives):– Tournament selection– Rank-based: Continuous function– Roulette wheel

Variants:– Previous population selection (worst individual

elimination).

Page 21: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

21

DATSI, Universidad Politécnica de Madrid

Canonical Genetic Algorithm

P0

Initial population

end condition?

selection P’iSelected population

crossover/mutation

Fitness-function

Pi+1

Next population

Includes the chances of participation in reproduction

Page 22: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

22

DATSI, Universidad Politécnica de Madrid

GA Variants (Replacement)

Elitism:– The best(s) individuals in Pi stay in population Pi.

– Provides monotonic fitness increment

Steady-state:– Only one individual is created per generation.– New one replaces worst individual.

Page 23: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

23

DATSI, Universidad Politécnica de Madrid

Replacement Overview

Pi

Current population

SelectionCrossover/mutation

P’iNext population

Offi

Offspring population

Canonical Steady-sate Elitism

Offspring size N 1 >, = ó < N

Current Next 0 N-1 << N

Replacement All Worst Worst

Size=N

Page 24: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

24

DATSI, Universidad Politécnica de Madrid

The Metaphor

Genetic Algorithm Nature

Optimization problem Environment

Feasible solutions Individuals living in that environment

Solutions quality (fitness function) Individual’s degree of adaptation to its surrounding environment

A set of feasible solutions A population of organisms (species)

Stochastic operators Selection, recombination and mutation in nature’s evolutionary process

Iteratively applying a set of stochastic operators on a set of feasible solutions

Evolution of populations to suit their environment

Page 25: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

25

DATSI, Universidad Politécnica de Madrid

Aspects to consider: Population Size

What’s the optimal size of the population?– Too small: premature convergence (weak

exploration)– Too large: waste of resources

Page 26: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

26

DATSI, Universidad Politécnica de Madrid

Aspects to consider:Problem representation

A key aspect: chromosomes representation– Initially, a binary string– Nowadays, other possibilities:

Bit strings (0101 ... 1100) Real numbers (43.2 -33.1 ... 0.0 89.2) Permutations of element (E11 E3 E7 ... E1 E15) Lists of rules (R1 R2 R3 ... R22 R23) Program elements (genetic programming) ... any data structure ...

– Great influence in the problem resolution

Page 27: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

27

DATSI, Universidad Politécnica de Madrid

GA elements to define

Chromosome representation Creation of initial population Fitness function Genetic operators Parameters (population size,

probabilities of the genetic operators, etc.)

Population size: 50 – 100

Children per generation:

= population size

Crossovers: > 85%

Mutations: < 5%

Generations: 20 – 20,000

Typical configuration for small problems:

Page 28: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

28

DATSI, Universidad Politécnica de Madrid

Example (The MAXONE problem)

• Suppose we want to maximize the number of ones in a string of m binary digits

Is it a trivial problem?

• It may seem so because we know the answer in advance

• However, we can think of it as maximizing the number of correct answers, each encoded by 1, to m yes/no difficult questions`

Page 29: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

29

DATSI, Universidad Politécnica de Madrid

MAXONE problem

An individual is encoded (naturally) as a string of m binary digits

The fitness f of a candidate solution to the MAXONE problem is the number of ones in its genetic code

We start with a population of n random strings. Suppose that m = 10 and n = 6

Page 30: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

30

DATSI, Universidad Politécnica de Madrid

MAXONE problem (initialization)

• We toss a fair coin 60 times and get the following initial population:

s1 = 1111010101 f (s1) = 7

s2 = 0111000101 f (s2) = 5

s3 = 1110110101 f (s3) = 7

s4 = 0100010011 f (s4) = 4

s5 = 1110111101 f (s5) = 8

s6 = 0100110000 f (s6) = 3

Page 31: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

31

DATSI, Universidad Politécnica de Madrid

MAXONE problem (selection)

• Next we apply fitness proportionate selection with the roulette wheel method:

21n

3

Area is Proportional to fitness value

Individual i will have a

probability to be chosen

i

if

if

)(

)(

4

• We repeat the extraction as many times as the number of individuals we need to have the same parent population size (6 in our case)

Page 32: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

32

DATSI, Universidad Politécnica de Madrid

MAXONE problem (selection)

• Suppose that, after performing selection, we get the following population:

s1` = 1111010101 (s1)

s2` = 1110110101 (s3)

s3` = 1110111101 (s5)

s4` = 0111000101 (s2)

s5` = 0100010011 (s4)

s6` = 1110111101 (s5)

Page 33: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

33

DATSI, Universidad Politécnica de Madrid

MAXONE problem (crossover)

• Next we mate strings for crossover. For each couple we decide according to crossover probability (for instance 0.6) whether to actually perform crossover or not

• Suppose that we decide to actually perform crossover only for couples (s1`, s2`) and (s5`, s6`). For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second

Page 34: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

34

DATSI, Universidad Politécnica de Madrid

MAXONE problem (crossover)

s1` = 1111010101 s2` = 1110110101

s5` = 0100010011 s6` = 1110111101

Before crossover:

After crossover:

s1`` = 1110110101 s2`` = 1111010101

s5`` = 0100011101 s6`` = 1110110011

Page 35: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

35

DATSI, Universidad Politécnica de Madrid

MAXONE problem (mutation)

• The final step is to apply random mutation: for each bit that we are to copy to the new population we allow a small probability of error (for instance 0.1)

• Before applying mutation:

s1`` = 1110110101 s4`` = 0111000101

s2`` = 1111010101 s5`` = 0100011101

s3`` = 1110111101 s6`` = 1110110011

Page 36: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

36

DATSI, Universidad Politécnica de Madrid

MAXONE problem (mutation)

After applying mutation:

s1``` = 1110100101 f (s1``` ) = 6

s2``` = 1111110100 f (s2``` ) = 7

s3``` = 1110101111 f (s3``` ) = 8

s4``` = 0111000101 f (s4``` ) = 5

s5``` = 0100011101 f (s5``` ) = 5

s6``` = 1110110001 f (s6``` ) = 6

Page 37: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

37

DATSI, Universidad Politécnica de Madrid

MAXONE problem

• In one generation, the total population fitness changed from 34 to 37, thus improved by ~9%.

• At this point, we go through the same process all over again, until a stopping criterion is met

Page 38: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

38

DATSI, Universidad Politécnica de Madrid

Local Optimizations (Lamark)

Fitness calculation performs an improved solution in the neibourghood:– Lamarkian Evolution: the inheritance of acquired

characteristics.– Methods:

Hill-climbing Simulated annealing

Page 39: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

39

DATSI, Universidad Politécnica de Madrid

Local Optimizations (Baldwin)

Other alternatives:– Evolve fitness

function– But keep

genetic representation

As a result:– New fitness

landscape

© L.J. Pit, ‘96

Page 40: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

40

DATSI, Universidad Politécnica de Madrid

GAs: Why do they work?

• Schema Theorem

• Building Blocks Hypothesis

Page 41: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

41

DATSI, Universidad Politécnica de Madrid

Schema Notation

• {0,1,#} is the symbol alphabet, where # is a special wild card symbol

• A schema is a template consisting of a string composed of these three symbols

• Example: the schema [1#00#] matches the strings: [10000], [10001], [11000] and [11001]

100

1 000 0

1 000 1

1 100 0

1 100 1

Page 42: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

42

DATSI, Universidad Politécnica de Madrid

Schema

“A schema is a similarity template describing a subset of strings with similarities at certain positions” John Holland

Crossover causes low order schema, to get increased representation in the next generation.

Mutation causes long schema to be destroyed! Thus short schema have a better chance of surviving to the next generation

Page 43: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

43

DATSI, Universidad Politécnica de Madrid

Schema Theorem and Building Block

GAs explore the search space by short, low-order schemata which, subsequently, are used for information exchange during crossover

Objective: Building blocks (set of alleles with representative participation on fitness/solution quality).

Page 44: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

44

DATSI, Universidad Politécnica de Madrid

Building Blocks Hypothesis

“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks”

Some building blocks (short, low-order schemata) can mislead GA and cause its convergence to suboptimal points

Page 45: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

45

DATSI, Universidad Politécnica de Madrid

Differences betwwen GA and Conventional Search Algorithms

GA works on a coding of the parameters set, not the parameter themselves

GA searches from a population of points, not a single point GA uses only a payoff function, and no domain knowledge GA uses probabilistic transition rules, not deterministic

ones GA can provide a number of potential solutions to a given

problem. The final choice is left to the user.

Page 46: 1 Genetic Algorithms Data Mining and Distributed Computing Group Facultad de Informática.

46

DATSI, Universidad Politécnica de Madrid

Conclusions

“Genetic Algorithms are good at taking large, potentially huge search spaces and navigating them, looking for optimal

combinations of things, solutions you might not otherwise find in a lifetime.”

Salvatore Mangano

Computer Design, May 1995