Lecture on bioinformatics, chapter 2: optimization and genetic algorithms

Fabian J. Theis

Institute of Biophysics
University of Regensburg, Germany

4th May 2005

Theis Optimization and genetic algorithms

Outline

Introduction
  - Reinforcement learning
  - Optimization
  - Imitate nature
  - Genetic algorithms

Algorithm
  - Basic algorithm
  - Data representation
  - Selection
  - Reproduction

Properties and extensions
  - Overview
  - Convergence analysis
  - Schema theorem
  - Genetic programming

Examples
  - 2d-function optimization
  - Genetic Mastermind
  - Hyperplane detection

Introduction

- idea of genetic algorithms (GAs)
  - extract optimization strategies nature uses successfully → Darwinian evolution
  - transform them for application in mathematical optimization theory
- abstract goal: find the global optimum of a problem/function in a defined phase space

Optimization

- GA as special kind of reinforcement learning
  - no access to the full problem/function
  - but: rewards are given for a given action/search space position
  - goal: use rewards to find optimum
  - this contrasts to learning by (given) examples as in supervised learning e.g. using neural networks
  - → traverse search space manually

Optimization

- simple algorithm: random sampling
  - pick a single location in the search space
  - store it if reward is higher than at previous locations, discard it otherwise
  - repeat
- other such algorithms
  - Markov-Chain-Monte-Carlo search (MCMC)
  - simulated annealing
  - if derivative of reward is available: (conjugate) gradient ascent/descent etc.
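The random sampling loop above can be sketched in a few lines. This is only an illustration: the function name `random_sampling`, the quadratic reward and the sampling interval are assumptions made for the example, not part of the lecture.

```python
import random

def random_sampling(reward, sample, n_iter=1000, seed=0):
    """Keep the best of n_iter independently drawn search-space locations."""
    rng = random.Random(seed)
    best_x, best_r = None, float("-inf")
    for _ in range(n_iter):
        x = sample(rng)            # pick a single location in the search space
        r = reward(x)              # the reward is all we learn about the problem
        if r > best_r:             # store it if the reward is higher than before
            best_x, best_r = x, r  # (otherwise the location is discarded)
    return best_x, best_r

# Hypothetical reward with its maximum at x = 2, sampled uniformly on [-10, 10].
best_x, best_r = random_sampling(lambda x: -(x - 2.0) ** 2,
                                 lambda rng: rng.uniform(-10.0, 10.0))
```

No location influences where the next one is drawn, which is what the later methods (Metropolis sampling, GAs) improve upon.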

Optimization

- simple algorithmic maximization is possible e.g. by gradient ascent:
  - a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ can be maximized by local updates in the direction of its gradient
  - given a sufficiently small learning rate $\eta > 0$ and a starting point $x(0) \in \mathbb{R}^n$
  - local maxima of $f$ can be found by iterating

$$x(t+1) = x(t) + \eta\,\Delta x(t)$$

with

$$\Delta x(t) = (Df)(x(t))^\top = \operatorname{grad} f(x(t)) = \frac{\partial f}{\partial x}(x(t))$$

the gradient of $f$ at $x(t)$.
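A minimal sketch of this iteration, assuming the gradient is available as a Python function. The name `gradient_ascent`, the fixed step count and the concave test function are illustrative assumptions.

```python
def gradient_ascent(grad, x0, eta=0.1, n_steps=200):
    """Iterate x(t+1) = x(t) + eta * grad f(x(t)) for a fixed number of steps."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad(x)
        x = [xi + eta * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical example: f(x, y) = -(x - 1)^2 - (y + 2)^2 has its maximum at (1, -2).
grad_f = lambda x: [-2.0 * (x[0] - 1.0), -2.0 * (x[1] + 2.0)]
x_max = gradient_ascent(grad_f, x0=[0.0, 0.0])
```

For a function with several peaks the same loop would stop at whichever local maximum is uphill from `x0`.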

Stochastic optimization

- Problem: minimize a multivariate, real function $f(x)$, $x = (x_1, x_2, \ldots, x_n)$
- if no analytical expression of $f(x)$ is given (but only some values) or $f$ is non-continuous, $f$ can be minimized by stochastic methods
- Random search
  - choose $x_1$ randomly
  - calculate $f(x_1)$
  - choose $\delta$ randomly
  - calculate $f(x_1 + \delta)$
  - if $f(x_1 + \delta) < f(x_1)$: choose $x_1 + \delta$ as new starting point
  - if $f(x_1 + \delta) > f(x_1)$: choose new $\delta$
  - choose $\delta = \delta(t)$ with $\lim_{t\to\infty} \delta(t) = 0$

- Metropolis sampling
  - like random search, but
  - if $f(x_1 + \delta) > f(x_1)$, only use the newly generated $x$ with probability

$$\frac{1}{1 + \exp\left(\dfrac{f(x_1+\delta) - f(x_1)}{\alpha}\right)} \quad\text{with}\quad \lim_{t\to\infty} \alpha(t) = 0$$

  - then the algorithm leaves a local minimum with $P > 0$
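The following sketch shows Metropolis-style minimization in one dimension. It assumes that a worse point is accepted with probability 1 / (1 + exp(Δf / α)); the geometric decay of α, the uniform proposal and the test function are illustrative choices, not taken from the slides.

```python
import math
import random

def metropolis_minimize(f, x0, step=0.5, alpha0=1.0, decay=0.995,
                        n_iter=5000, seed=0):
    """Random search that sometimes accepts a worse point (assumed logistic rule)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    alpha = alpha0
    for _ in range(n_iter):
        cand = x + rng.uniform(-step, step)   # x + delta
        fc = f(cand)
        diff = fc - fx
        if diff <= 0:
            accept = True                     # better (or equal) point: always move
        else:
            z = diff / alpha
            # worse point: accept with probability 1 / (1 + exp(diff / alpha));
            # the z < 700 guard avoids overflow in math.exp
            accept = z < 700 and rng.random() < 1.0 / (1.0 + math.exp(z))
        if accept:
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        alpha *= decay                        # alpha(t) -> 0
    return best_x, best_f

# Hypothetical multi-minimum test function.
f = lambda x: x ** 2 + 3.0 * math.sin(3.0 * x)
best_x, best_f = metropolis_minimize(f, x0=4.0)
```

While α is large, uphill moves are accepted fairly often, so the walk can leave a local minimum; as α shrinks the procedure behaves like plain random search.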

Genetic algorithms

- here: imitate nature’s robust way of evolving successful organisms
  - organisms ill-suited to an environment die off, whereas fit ones reproduce
  - offspring is similar to the parents, so population fitness increases with generations
  - mutation can randomly generate new species
  - ‘The Origin of Species by Means of Natural Selection’, C.R. Darwin, D. Appleton and Company, NY, 1897
- history:
  - introduced by J. Holland 1975
  - further investigated by his students e.g. K. DeJong 1975
  - more recently theoretical advances e.g. by M. Vose 1993

Genetic algorithms

- what’s good for nature is good for artificial systems
- imagine a population of individual ‘explorers’ sent into the optimization phase-space
  - an explorer is defined by its genes, encoding its phase-space position
  - the optimization problem is given by a fitness function
- the struggle of ‘life’ begins
  - selection
  - crossover
  - mutation
- according to these rules populations tend to increase overall fitness

Genetic algorithms

- advantages
  - global not only local optimization
  - simple and hence easy to implement
  - easy parallelization possible
- disadvantages
  - how to encode phase-space position
  - rather low speed and high computational cost
  - parameter dependencies (population size, selection and reproduction parameters)

Algorithm

Basic genetic algorithm

Data: population, a set of individuals
      fitness-function Fitness, a function measuring fitness of an individual
Result: an individual

repeat
    parents ← Selection(population, Fitness)
    population ← Reproduction(parents)
until some individual is fit enough
return the best individual in population according to Fitness
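The pseudocode translates almost directly into Python. The sketch below is one possible instantiation, not the lecture's implementation: the generation cap `max_generations`, tournament selection of size 2, the 1% mutation rate and the "count the ones" fitness are assumptions chosen to keep the example small.

```python
import random

def genetic_algorithm(population, fitness, select, reproduce, fit_enough,
                      max_generations=200):
    """repeat selection + reproduction until some individual is fit enough."""
    for _ in range(max_generations):       # cap added as a safeguard (assumption)
        parents = select(population, fitness)
        population = reproduce(parents)
        if any(fit_enough(ind) for ind in population):
            break
    return max(population, key=fitness)

rng = random.Random(0)
N, L = 30, 16                              # population size, chromosome length
onemax = sum                               # hypothetical fitness: number of 1-bits

def select(pop, fitness):
    # tournament of size 2, drawn anew for every parent slot
    return [max(rng.sample(pop, 2), key=fitness) for _ in pop]

def reproduce(parents):
    children = []
    for a, b in zip(parents[::2], parents[1::2]):   # pair parents
        r = rng.randrange(1, L)                     # crossover point
        for child in (a[:r] + b[r:], b[:r] + a[r:]):
            # flip each gene with small probability
            children.append([1 - g if rng.random() < 0.01 else g for g in child])
    return children

pop = [[rng.randint(0, 1) for _ in range(L)] for _ in range(N)]
best = genetic_algorithm(pop, onemax, select, reproduce,
                         lambda ind: onemax(ind) == L)
```

Selection and reproduction are passed in as functions so that the operators discussed on the following slides can be swapped in without touching the main loop.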

Individual

- an individual encodes the data space position
- classic GA approach: representation by word (chromosome) over a finite alphabet
  - each letter is called gene
  - real DNA: alphabet is {A, G, T, C}
  - here: usually binary alphabet {0, 1}
  - some authors speak more generally of evolutionary programming if the alphabet is larger
  - finite alphabet implies discrete search space
- continuous search space
  - use continuous ‘alphabet’, i.e. genes $\in \mathbb{R}$ or bounded genes $\in [a, b]$
  - so individual $\in \mathbb{R}^n$ respectively $\in [a_1, b_1] \times \ldots \times [a_n, b_n]$

Selection

- goal: select individuals that produce the next generation
- probabilistic selection
  - based on fitness function $f$
  - better individuals have increased chance of reproduction
  - usually selection with replacement → very fit individuals reproduce several times
- selection probabilities
  - roulette wheel (Holland 1975)

$$P(\text{choice of individual } i) = \frac{f(i)}{\sum_j f(j)}$$

    problem: negative $f$? minimization?
  - ranking methods, i.e. choose individuals according to fitness rank, e.g. normalized geometric ranking (Joines and Houck 1994)
  - tournament selection, i.e. select best among a randomly selected subset
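Two of these selection schemes can be sketched as follows. The function names and the toy population are assumptions; roulette wheel selection as written requires non-negative fitness values that are not all zero, which is exactly the limitation noted above.

```python
import random

def roulette_wheel(population, fitness, rng):
    """Draw len(population) parents with replacement, P(i) = f(i) / sum_j f(j)."""
    weights = [fitness(ind) for ind in population]   # must be >= 0, not all zero
    return rng.choices(population, weights=weights, k=len(population))

def tournament(population, fitness, rng, size=3):
    """Each parent is the best individual of a random subset of the given size."""
    return [max(rng.sample(population, size), key=fitness) for _ in population]

rng = random.Random(0)
pop = [0, 1, 2, 3]              # hypothetical individuals
identity = lambda ind: ind      # hypothetical fitness: the value itself
wheel_parents = roulette_wheel(pop, identity, rng)
tour_parents = tournament(pop, identity, rng, size=4)
```

Tournament selection only compares fitness values, so it also works for negative fitness or, with `min` instead of `max`, for minimization.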

Reproduction

- typically consists of two stages
  - crossover (or mating): selected individuals are randomly paired and (usually two) children are produced
  - mutation: genes can be altered by random mutation to a different value according to a small probability
- use genetic operators to produce and alter new offspring → basic search mechanism in GAs

Crossover

- let $x, y \in A^n$ be the genes of the two parents
- simple crossover
  - choose $r$ randomly in $\{1, \ldots, n\}$
  - generate children $x', y' \in A^n$ by

$$x'_i := \begin{cases} x_i & \text{if } i < r \\ y_i & \text{otherwise} \end{cases} \qquad y'_i := \begin{cases} y_i & \text{if } i < r \\ x_i & \text{otherwise} \end{cases}$$

- in the case of continuous genes: arithmetic crossover
  - choose $r$ randomly in $[0, 1]$
  - generate children $x', y' \in A^n$ by

$$x' := r x + (1 - r) y, \qquad y' := (1 - r) x + r y$$
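Both crossover operators are short enough to write out. The sketch below uses Python lists for chromosomes; the function names are assumptions, and `rng.random()` draws from [0, 1) rather than the closed interval.

```python
import random

def simple_crossover(x, y, rng):
    """One-point crossover: child genes with 1-based index i < r come from 'their own' parent."""
    r = rng.randint(1, len(x))            # r in {1, ..., n}
    cut = r - 1                           # 1-based "i < r" is the 0-based slice [:r-1]
    return x[:cut] + y[cut:], y[:cut] + x[cut:]

def arithmetic_crossover(x, y, rng):
    """Children are convex combinations of the (continuous) parent genes."""
    r = rng.random()
    child_x = [r * a + (1 - r) * b for a, b in zip(x, y)]
    child_y = [(1 - r) * a + r * b for a, b in zip(x, y)]
    return child_x, child_y

rng = random.Random(0)
cx, cy = simple_crossover([0, 0, 0, 0, 0], [1, 1, 1, 1, 1], rng)
ax, ay = arithmetic_crossover([0.0, 2.0], [4.0, 6.0], rng)
```

Note that with r = 1 the simple crossover just swaps the two parents, and that arithmetic crossover preserves the sum of the parents' genes at every position.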

Mutation

- let $x_i \in A$ be the gene of an individual that is to be mutated
- binary gene: binary mutation
  - $x'_i := 1 - x_i$
- discrete or continuous bounded $A$: uniform mutation
  - set $x'_i$ to be a uniformly randomly chosen element of $A$
- also possible: non-uniform mutation
  - needs fixed distribution for element choice
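A sketch of binary and uniform mutation, applied gene by gene with mutation probability `p`. The function names and the per-gene `bounds` argument are assumptions made for the example.

```python
import random

def binary_mutation(chrom, p, rng):
    """Flip each binary gene (x_i -> 1 - x_i) with probability p."""
    return [1 - g if rng.random() < p else g for g in chrom]

def uniform_mutation(chrom, p, bounds, rng):
    """Replace each bounded continuous gene by a uniform draw from [a, b] with probability p."""
    return [rng.uniform(a, b) if rng.random() < p else g
            for g, (a, b) in zip(chrom, bounds)]

rng = random.Random(0)
flipped = binary_mutation([0, 1, 1], p=1.0, rng=rng)
untouched = binary_mutation([0, 1, 1], p=0.0, rng=rng)
resampled = uniform_mutation([0.5, 0.5], p=1.0, bounds=[(0.0, 1.0), (2.0, 3.0)], rng=rng)
```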

One generation example

Analytical example

- optimize $f(x) := x^2$ in $[0, 1]$
- $x \to 0.b_9 b_8 \ldots b_0$ ⇒ 1024 values of $x$
- minimal distance of adjacent values is $2^{-10}$
- mutation of the bit at position $i$ after the binary point changes $x$ by $\delta = \pm 2^{-i}$ → $x + \delta$
- crossover of two character strings:

$$x = 0.b_9 b_8 \ldots b_0, \qquad y = 0.a_9 a_8 \ldots a_0 = x + \delta \quad\Rightarrow\quad z = 0.b_9 b_8 \ldots b_i\, a_{i-1} \ldots a_0$$

- crossover corresponds to improved stochastic search
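The 10-bit fixed-point coding can be checked numerically. In this sketch the bit list is ordered most significant bit first, and positions are counted from 1 after the binary point; the helper names and the concrete bit pattern are assumptions.

```python
def decode(bits):
    """Map bits (b9, ..., b0), most significant first, to x = 0.b9b8...b0 in [0, 1)."""
    return sum(b * 2.0 ** -(k + 1) for k, b in enumerate(bits))

def flip(bits, i):
    """Flip the bit at position i after the binary point (i = 1 is the first bit)."""
    out = list(bits)
    out[i - 1] = 1 - out[i - 1]
    return out

def crossover(x_bits, y_bits, r):
    """Child with the r leading bits of x and the remaining bits of y."""
    return x_bits[:r] + y_bits[r:]

x_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_bits = flip(x_bits, 3)                  # mutate position 3: 1 -> 0
delta = decode(y_bits) - decode(x_bits)   # change of x caused by the mutation
z_bits = crossover(x_bits, y_bits, 5)     # leading bits from x, trailing bits from y
```

Flipping position 3 moves x by exactly 2^-3 in magnitude, and the crossover child keeps the high-order bits of one parent, so it differs from that parent only at the fine scales contributed by the other.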

Progress diagram of a genetic algorithm

1. Gen := 0; create initial random population.
2. Termination criteria satisfied? If yes: End.
3. If no: evaluate fitness of each individual; i := 0.
4. i = M? If yes: Gen := Gen + 1 and continue with step 2.
5. If no: select a genetic operation probabilistically:
   - with probability P_r: select one individual, perform reproduction, copy into new population
   - with probability P_c: select two individuals, perform crossover, insert two offspring into new population, i := i + 1
   - with probability P_m: select one individual, perform mutation, insert mutant into new population
6. i := i + 1 and continue with step 4.

Convergence analysis

- corresponds to a ‘schema’ (J. Holland 1975)
- example goal: maximize $f(x) = x^2$ in $[0, 1]$
  - generate $N$ numbers in $[0, 1]$ using 10 bit fixed point coding
  - generate new population by choosing codes with decreasing probability according to fitness, e.g.

    1   0.1...  →  0.1...  →  0.11...  →  ...  →  0.11...
    2   0.1...  →  0.1...  →  0.11...  →  ...  →  0.11...
    ⋮
    N   0.0...  →  0.1...  →  0.10...  →  ...  →  0.11...

    final population → 0.1111111111

- Each crossover defines new points, with step sizes close to x = 1 becoming more and more popular. The final step size will be $2^{-10}$.

Schema

A schema is a bit pattern representing a set of binary strings using the symbols {0, 1, .}.

Genetic algorithms consist of a finite series of the three steps:

1. choice of parent bit pattern
2. recombination
3. mutation

With which probability do the fitter bit patterns survive from generation to generation?

Consider the probability of the fitter bit patterns being selected as parents and surviving the recombination and mutation operations.

- population at time $t$: $N$ binary chains (schemas) of length $l$
- $O(H, t)$: number of elements of the population at time $t$ that contain the schema $H$
- $d(H)$: diameter of a schema, i.e. the length of the shortest subpattern containing all non-. characters (fixed bits), e.g. d(..1.1..) = d(1.1) = 3
- maximize fitness $f$ defined on all binary chains of length $l$
- selection of a parent chain with probability

$$p(H_j) = \frac{f(H_j)}{\sum_{i=1}^{N} f(H_i)}$$

- mean fitness of the population

$$f_\mu = \frac{1}{N} \sum_{i=1}^{N} f(H_i) \quad\Rightarrow\quad p(H_j) = \frac{f(H_j)}{N \cdot f_\mu}$$
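The schema quantities defined above are easy to compute for small populations. In this sketch chains and schemas are Python strings over '0', '1' and '.'; the function names are assumptions.

```python
def matches(schema, chain):
    """True if the chain contains the schema ('.' matches either bit)."""
    return all(s == "." or s == c for s, c in zip(schema, chain))

def occurrences(schema, population):
    """O(H, t): number of chains in the population that contain schema H."""
    return sum(matches(schema, chain) for chain in population)

def diameter(schema):
    """d(H): length of the shortest subpattern containing all fixed bits."""
    fixed = [i for i, s in enumerate(schema) if s != "."]
    return fixed[-1] - fixed[0] + 1 if fixed else 0

def fixed_bits(schema):
    """b(H): number of fixed (non-'.') positions."""
    return sum(s != "." for s in schema)

population = ["101", "111", "001", "100"]
count = occurrences("1.1", population)
```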

Selection

- selection probability of a chain containing schema $H$

$$P = \sum_{j=1}^{k} \frac{f(H_j)}{N \cdot f_\mu}$$

  with $H_1, \ldots, H_k$ the chains of the population with schema $H$
- fitness $f(H)$ of schema $H$ in generation $t$

$$f(H) = \frac{\sum_j f(H_j)}{O(H, t)} \quad\Rightarrow\quad P = \frac{O(H, t)\, f(H)}{N \cdot f_\mu}$$

- probability $P_A$ for two chains containing $H$ to be selected as parents:

$$P_A = \left[\frac{O(H, t)\, f(H)}{N \cdot f_\mu}\right]^2$$

- probability $P_B$ that exactly one of two selected chains contains $H$:

$$P_B = 2 \cdot \frac{O(H, t)\, f(H)}{N \cdot f_\mu} \cdot \left(1 - \frac{O(H, t)\, f(H)}{N \cdot f_\mu}\right)$$

Recombination

- probability that schema $H$ is contained in a child chain: if both parents contain $H$, then 1; if only one, then 1/2 on average
- schema $H$ is divided during crossover with probability

$$P_{div} = \frac{d(H) - 1}{l - 1}$$

  hence probability of survival during recombination

$$W \ge \left(\frac{O(H, t)\, f(H)}{N \cdot f_\mu}\right)^2 + \frac{O(H, t)\, f(H)}{N \cdot f_\mu} \left(1 - \frac{O(H, t)\, f(H)}{N \cdot f_\mu}\right) \left(1 - \frac{d(H) - 1}{l - 1}\right)$$

$$\ge \frac{O(H, t)\, f(H)}{N \cdot f_\mu} \left(1 - \frac{d(H) - 1}{l - 1} \left(1 - \frac{O(H, t)\, f(H)}{N \cdot f_\mu}\right)\right)$$
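The step between the two bounds is a short algebraic rearrangement. Writing $P$ for the selection probability $O(H,t)\,f(H)/(N \cdot f_\mu)$ and $P_{div}$ for $(d(H)-1)/(l-1)$, and assuming the first bound is the sum of the "both parents contain $H$" term and half of the "exactly one parent contains $H$" term that survives the cut, it can be spelled out as:

```latex
\begin{align*}
W &\ge P^2 + \tfrac{1}{2} \cdot 2P(1 - P) \cdot (1 - P_{div}) \\
  &= P \,\bigl[\, P + (1 - P) - P_{div}(1 - P) \,\bigr] \\
  &= P \,\bigl[\, 1 - P_{div}(1 - P) \,\bigr].
\end{align*}
```

So the last relation in the chain holds with equality; the only genuine inequality is the first one, which ignores children that regain $H$ by chance after a cut.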

Mutation

- during mutation a schema $H$ with $b(H)$ fixed bits survives with probability $(1 - p)^{b(H)}$, where $p$ is the mutation probability
- hence total survival probability is

$$W' \ge \frac{O(H, t)\, f(H)}{N \cdot f_\mu} \left(1 - \frac{d(H) - 1}{l - 1} \left(1 - \frac{O(H, t)\, f(H)}{N \cdot f_\mu}\right)\right) (1 - p)^{b(H)}$$

- Schema theorem: If $N$ new chains are generated in generation $t$, then the mean number of chains containing $H$ in generation $t + 1$ is

$$\langle O(H, t + 1) \rangle = N \cdot W' \ge N \cdot P \cdot \bigl(1 - P_{div}(1 - P)\bigr)(1 - p)^{b(H)}$$

  with

$$P = \frac{O(H, t)\, f(H)}{N \cdot f_\mu} \quad\text{and}\quad P_{div} = \frac{d(H) - 1}{l - 1}$$
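To get a feeling for the bound, it can be evaluated numerically. The sketch below computes the lower bound on N · W′, i.e. on the expected number of chains containing H in the next generation; the function name and the example numbers are assumptions.

```python
def schema_bound(O, f_H, f_mean, N, d, l, b, p):
    """Lower bound on N * W': expected count of chains containing H at t + 1."""
    P = O * f_H / (N * f_mean)       # selection probability of schema H
    P_div = (d - 1) / (l - 1)        # probability that crossover divides H
    W = P * (1 - P_div * (1 - P)) * (1 - p) ** b
    return N * W

# Hypothetical numbers: 10 of 100 chains contain H, which is 20% fitter than average.
short_schema = schema_bound(O=10, f_H=1.2, f_mean=1.0, N=100, d=2, l=10, b=2, p=0.01)
long_schema = schema_bound(O=10, f_H=1.2, f_mean=1.0, N=100, d=9, l=10, b=2, p=0.01)
no_loss = schema_bound(O=10, f_H=1.2, f_mean=1.0, N=100, d=1, l=10, b=1, p=0.0)
```

With no division and no mutation the bound reduces to O(H, t) · f(H) / f_µ, so an above-average schema grows; a larger diameter or a higher mutation rate lowers the bound.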

Conclusions from the schema theorem

- chains with higher fitness and small diameter are favored and reproduce with higher probability
- too high a mutation rate $p$ destroys every schema
- if a schema $H$ is well represented in a population, it reproduces better even at medium fitness
- if $f(H) = f_\mu$ then

$$1 - P_{div}(1 - P) = 1 - P_{div}\left(1 - \frac{O(H, t)}{N}\right)$$

  i.e. schemata with high $O(H, t)/N$ are less likely to be cut during crossover and survive, hence the algorithm converges to these patterns with mean fitness ⇒ genetic drift (corresponds to a random walk in search space)
- genetic drift loses bit patterns, which can be countered by higher mutation rates
- the correct balance between mutation and crossover is important, but depends on the problem

Genetic programming

- generalization of genetic algorithms to automatically develop computer programs (CP)
- each structure of the population represents a single CP
- properties of CPs
  - hierarchical operations
  - alternative calculation using conditions
  - iterative calculations
  - manipulation of different data types
  - procedural calculations

Genetic programming: coding

representation of a CP by a tree, e.g.

    if a > b then x := 3

can be represented by the tree

           if
         ↙    ↘
        >      :=
       ↙ ↘    ↙ ↘
      a   b  x   3

- search for a CP that solves a given problem can be performed using principles of genetic algorithms
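The tree coding can be made concrete with nested tuples. This is a toy interpreter for exactly the three node types in the example tree; the tuple layout `(operator, left, right)` and the function name `run` are assumptions.

```python
def run(node, env):
    """Evaluate a program tree against a dict of variable values (modified in place)."""
    if isinstance(node, str):
        return env[node]                 # variable lookup
    if not isinstance(node, tuple):
        return node                      # numeric constant
    op, left, right = node
    if op == "if":
        if run(left, env):               # left child: condition, right child: body
            run(right, env)
        return None
    if op == ">":
        return run(left, env) > run(right, env)
    if op == ":=":
        env[left] = run(right, env)      # left child is the variable name
        return None
    raise ValueError(f"unknown operator: {op}")

# if a > b then x := 3
program = ("if", (">", "a", "b"), (":=", "x", 3))

env_true = {"a": 5, "b": 2, "x": 0}
run(program, env_true)
env_false = {"a": 1, "b": 2, "x": 0}
run(program, env_false)
```

A fitness function for genetic programming would run such trees on test inputs and score the resulting variable values.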

Genetic programming: crossover operator

parents generation

      A              G
    ↙   ↘          ↙   ↘
   B     D        H     I
   ↓    ↙ ↘            ↙ ↘
   C   E   F          J   K
                         ↙ ↘
                        L   M

children generation

      A              G
    ↙   ↘          ↙   ↘
   K     D        H     I
  ↙ ↘   ↙ ↘            ↙ ↘
 L   M E   F          J   B
                          ↓
                          C
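Subtree crossover amounts to swapping two subtrees between copies of the parents. The sketch below stores a tree as a nested list `[label, child, ...]` and addresses a node by a path of child indices; this representation, the helper names, and the reading of the diagram (the subtree rooted at B is exchanged with the subtree rooted at K, which hangs under I) are assumptions.

```python
import copy

def get(tree, path):
    """Follow a path of list indices from the root to a subtree."""
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, subtree):
    """Replace the subtree at a non-empty path."""
    parent = get(tree, path[:-1])
    parent[path[-1]] = subtree

def subtree_crossover(t1, t2, path1, path2):
    """Return two children obtained by swapping the subtrees at path1 and path2."""
    c1, c2 = copy.deepcopy(t1), copy.deepcopy(t2)
    s1, s2 = get(c1, path1), get(c2, path2)
    put(c1, path1, s2)
    put(c2, path2, s1)
    return c1, c2

parent1 = ["A", ["B", ["C"]], ["D", ["E"], ["F"]]]
parent2 = ["G", ["H"], ["I", ["J"], ["K", ["L"], ["M"]]]]

# Index 0 is the node label, so B sits at path (1,) and K at path (2, 2).
child1, child2 = subtree_crossover(parent1, parent2, (1,), (2, 2))
```

In an actual GP run the two paths would be drawn at random; they are fixed here to reproduce the diagram.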

Genetic programming: mutation operator

a node is chosen randomly and replaced by a randomly generated subtree:

      A                 A
    ↙   ↘             ↙   ↘
   B     D           B     D
   ↓    ↙ ↘          ↓    ↙ ↘
   C   E   F         C   E   K
                            ↙ ↘
                           L   M

Examples

- continuous example
  - global optimization of continuous function f : [a, b] → R
- binary example
  - genetic Mastermind
  - select optimal guess using GA
- example from our research
  - perform overcomplete blind source separation by sparse component analysis
  - key problem: hyperplane detection
  - solution: optimize cost function using GAs

2d-function optimization

[Figure: panel "multipeak" (horizontal axis from −10 to 10); panel "f performance (optimal individual and mean)" (horizontal axis from 0 to 100)]

Genetic Mastermind

Hyperplane detection

- perform overcomplete blind source separation by sparse component analysis [Georgiev et al., 2004, Theis et al., 2004]
- key problem: hyperplane detection
- solution: optimize cost function using GAs

Conclusions

- genetic algorithms perform global optimization
- they mimic nature by letting a population evolve according to its fitness
- algorithm
  - selection
  - reproduction: by crossover and mutation
- simple applicability in real-world situations

- Resources
  - books: [Goldberg, 1989, Schoneburg et al., 1994]
  - Matlab GA optimization toolbox: http://www.ie.ncsu.edu/mirage/GAToolBox/gaot
- Details and papers on my website http://fabian.theis.name
- This research was supported by the DFG and BMBF.

- References

  P. Georgiev, F. Theis, and A. Cichocki. Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans. on Neural Networks, in print, 2004.

  D. Goldberg. Genetic Algorithms in Search Optimization and Machine Learning. Addison Wesley Publishing, 1989.

  E. Schoneburg, F. Heinzmann, and S. Feddersen. Genetische Algorithmen und Evolutionsstrategien. Addison Wesley Publishing, 1994.

  F. Theis, P. Georgiev, and A. Cichocki. Robust overcomplete matrix recovery for sparse sources using a generalized hough transform. In Proc. ESANN 2004, pages 343–348, Bruges, Belgium, 2004. d-side, Evere, Belgium. URL http://homepages.uni-regensburg.de/~thf11669/publications/theis04houghSCA ESANN04.pdf.
