Spatial Co-Evolution - Quicker, Fitter and Less Bloated

8
Spatial Co-Evolution - Quicker, Fitter and Less Bloated Robin Harper University of Sydney, Sydney, Australia [email protected] ABSTRACT Operator equalisation is a methodology inspired by the cross- over bias theory that attempts to limit bloat in genetic pro- gramming (GP ). This paper examines a bivariate regression problem and demonstrates that operator equalisation suf- fers from bloat like behaviour when attempting to solve this problem. This is in contrast to a spatial co-evolutionary mechanism (SCALP ) that appears to avoid bloat, without any need for express bloat control mechanisms. A previously analysed real world problem (human oral bioavailability pre- diction) is examined. The behaviour of SCALP on this prob- lem is quite different from that of standard GP and operator equalisation leading to short, general candidate solutions. Categories and Subject Descriptors I.2.2 [Artificial Intelligence]: Automatic Programming— program synthesis General Terms Algorithms, Design, Experimentation Keywords bloat, GP, operator equalisation, spatial co-evolution (SCALP) 1. INTRODUCTION Bloat is a problem plaguing evolutionary systems that utilise a variable length genome. Genetic Programming (GP) is no exception. A feature of bloat is that as the run pro- gresses there is a tendency for the average size of the code to increase rapidly without any corresponding effective increase in fitness. Not only does this consume resources but it has also been linked to early convergence of the run [11]. Not surprisingly the phenomenon of bloat has been much stud- ied. It has been linked to a number of theories including the ‘defense against crossover’ theory [1], the ‘fitness causes bloat’ theory [10] and the ‘crossover bias’ theory [2]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GECCO’12, July 7-11, 2012, Philadelphia, Pennsylvania, USA. Copyright 2012 ACM 978-1-4503-1177-9/12/07 ...$10.00. Recent work, inspired by the crossover bias theory, intro- duced an operator equalisation bloat control method. Sec- tion 2 explains the way this operator works in more detail, but essentially it seeks to control the sizes of newly gener- ated individuals entering the population maintained by the GP system. To date a number of articles have shown how this operator, which has several flavours, is able to allow a GP system to stave off bloat, evolving fit and, importantly, short individuals. This is at the cost of requiring a large number of additional evaluations (sometimes by an order of magnitude). [17] indicates that further research is being car- ried out with the aim of reducing the overhead introduced by operator equalisation. This paper examines a bivariate regression problem (XY regression) that GP has great difficulty solving, normally be- cause of bloat. It is found that with XY regression even the operator equalisation methodology is often unable to con- strain program sizes for long enough to allow a solution to be found. The increase in program size that occurs together with the large number of evaluations required by operator equalisation means that runs are frequently brought to a standstill. This is contrasted with a co-evolutionary system (SCALP, described in section 4) that solves XY regression with comparative ease, fewer evaluations (and wall-clock time) and in a way that seems to produce comparatively short solutions. The performance of operator equalisation over a real world test problem has been analysed in several previous reports (e.g. [18, 16]). The comparison between it and both stan- dard GP and SCALP is looked at in a slightly different light from the previous reports, with SCALP appearing to be able to create general solutions whilst avoiding bloat even when confronted with noisy data. 2. OPERATOR EQUALISATION For the purposes of this paper standard GP means a ge- netic programming methodology similar to that exemplified by Koza [9]. In standard GP a generational system is typ- ically used. In such a system there is a population of in- dividuals and each generation a number of new individuals are created (typically by crossover). These new individu- als together with a percentage of the old individuals form the population of the new generation. This process is re- peated until a solution is found, the search process stagnates or one has exhausted the number of generations for which the search is conducted. Operator equalisation is a recent technique that has undergone a number of iterations [3, 18]. Unlike standard GP it vets the newly created individuals be- 759

Transcript of Spatial Co-Evolution - Quicker, Fitter and Less Bloated

Spatial Co-Evolution - Quicker, Fitter and Less Bloated

Robin HarperUniversity of Sydney, Sydney, Australia

[email protected]

ABSTRACTOperator equalisation is a methodology inspired by the cross-over bias theory that attempts to limit bloat in genetic pro-gramming (GP). This paper examines a bivariate regressionproblem and demonstrates that operator equalisation suf-fers from bloat like behaviour when attempting to solve thisproblem. This is in contrast to a spatial co-evolutionarymechanism (SCALP) that appears to avoid bloat, withoutany need for express bloat control mechanisms. A previouslyanalysed real world problem (human oral bioavailability pre-diction) is examined. The behaviour of SCALP on this prob-lem is quite different from that of standard GP and operatorequalisation leading to short, general candidate solutions.

Categories and Subject DescriptorsI.2.2 [Artificial Intelligence]: Automatic Programming—program synthesis

General TermsAlgorithms, Design, Experimentation

Keywordsbloat, GP, operator equalisation, spatial co-evolution (SCALP)

1. INTRODUCTIONBloat is a problem plaguing evolutionary systems that

utilise a variable length genome. Genetic Programming (GP)is no exception. A feature of bloat is that as the run pro-gresses there is a tendency for the average size of the code toincrease rapidly without any corresponding effective increasein fitness. Not only does this consume resources but it hasalso been linked to early convergence of the run [11]. Notsurprisingly the phenomenon of bloat has been much stud-ied. It has been linked to a number of theories includingthe ‘defense against crossover’ theory [1], the ‘fitness causesbloat’ theory [10] and the ‘crossover bias’ theory [2].

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.GECCO’12, July 7-11, 2012, Philadelphia, Pennsylvania, USA.Copyright 2012 ACM 978-1-4503-1177-9/12/07 ...$10.00.

Recent work, inspired by the crossover bias theory, intro-duced an operator equalisation bloat control method. Sec-tion 2 explains the way this operator works in more detail,but essentially it seeks to control the sizes of newly gener-ated individuals entering the population maintained by theGP system. To date a number of articles have shown howthis operator, which has several flavours, is able to allow aGP system to stave off bloat, evolving fit and, importantly,short individuals. This is at the cost of requiring a largenumber of additional evaluations (sometimes by an order ofmagnitude). [17] indicates that further research is being car-ried out with the aim of reducing the overhead introducedby operator equalisation.

This paper examines a bivariate regression problem (XYregression) that GP has great difficulty solving, normally be-cause of bloat. It is found that with XY regression even theoperator equalisation methodology is often unable to con-strain program sizes for long enough to allow a solution tobe found. The increase in program size that occurs togetherwith the large number of evaluations required by operatorequalisation means that runs are frequently brought to astandstill. This is contrasted with a co-evolutionary system(SCALP, described in section 4) that solves XY regressionwith comparative ease, fewer evaluations (and wall-clocktime) and in a way that seems to produce comparativelyshort solutions.

The performance of operator equalisation over a real worldtest problem has been analysed in several previous reports(e.g. [18, 16]). The comparison between it and both stan-dard GP and SCALP is looked at in a slightly different lightfrom the previous reports, with SCALP appearing to be ableto create general solutions whilst avoiding bloat even whenconfronted with noisy data.

2. OPERATOR EQUALISATIONFor the purposes of this paper standard GP means a ge-

netic programming methodology similar to that exemplifiedby Koza [9]. In standard GP a generational system is typ-ically used. In such a system there is a population of in-dividuals and each generation a number of new individualsare created (typically by crossover). These new individu-als together with a percentage of the old individuals formthe population of the new generation. This process is re-peated until a solution is found, the search process stagnatesor one has exhausted the number of generations for whichthe search is conducted. Operator equalisation is a recenttechnique that has undergone a number of iterations [3, 18].Unlike standard GP it vets the newly created individuals be-

759

fore allowing them to form part of the new generation. Theaim is to prevent individuals that are too short to be usefulor longer than needed from entering the target population.It uses the concept of a target distribution. If one imaginesa histogram of the target distribution then each bar of thehistogram is a bin that contains a number of individuals ofa certain length. The width and position of the bar (or bin)is the range of individual lengths that are allocated to thatparticular bin and the capacity of the bin is the number ofindividuals that fit into it. A target distribution (see later)determines the capacity of each of the different bins.

2.1 Basic Equalisation AlgorithmThe system works with the generational model of genetic

programming. However, instead of just generating x individ-uals each generation, it validates the entry of each individualinto the new generation as follows:

• If its length puts it in a bin which has free space, thenit is accepted.

• If its length puts it into a bin which is already full, itis evaluated and if its fitness is better than any otherindividual in that bin it is accepted;

• If its length is such that it is longer than any bin thathas been created, then it is evaluated and if it is betterthan the best individual found in that run so far itis accepted and bins (of capacity one) are opened upbetween the previous maximum bin size and this newbin size;

Otherwise it is rejected and does not enter the new genera-tion. This process is repeated until the requisite number ofindividuals has been accepted into the new generation.

2.2 Target DistributionMost recent papers (e.g. [3, 16]) favour a target distribu-

tion algorithm that allocates bin sizes based on the averagefitness of the individuals allocated to each bin in the previousgeneration (Dynamic Operator Equalisation). This createsa moving distribution that biases the search space for thenew population towards the size of programs that currentlyhave the best fitness, avoiding the creation of short unfitindividuals (as well as longer individuals which do not im-prove the fitness of the population). In addition there are anumber of equalisation techniques that use different distri-bution targets, such as flat distribution (where bin sizes areallocated equally up to the maximum bin that has openedso far). These have also been found to be effective [16]. Inthis paper the Dynamic Operator Equalisation method asoutlined in [18] is used.

2.3 Increased Effort in Operator EqualisationOne of the features of the operator equalisation technique

is that each generation every individual generated (whetheror not they are accepted into the population) needs to beevaluated by the fitness function. This means that everygeneration there are a larger number of fitness evaluationsthan with standard GP. By way of example, with a popula-tion of 5,000, it was found that in a typical XY regressionrun 25,000 individuals might be evaluated in the first fewgenerations of the run, dropping to about 15,000 individ-uals per generation after a few generations (once a largervariety of bin sizes have opened up). This can be compared

to a standard GP run where there are only 5,000 evalua-tions per generation. This represents a considerable increasein the work required per generation with operator equalisa-tion. This has been noted in previous papers and attemptshave made to quantify the additional work in [17]. Thereare, however, many circumstances in which the trade off be-tween work done and fighting bloat is a reasonable one. Ifthe fitness function is relatively quick and the time taken toevaluate an individual is proportionate to the length of thatindividual (or the number of nodes in that individual) thenthe wall clock time of operator equalisation might not belarger (and may be smaller) than standard GP. This is be-cause although there are more evaluations, each evaluationwill be quicker than evaluations with a bloated population.There are also circumstances in which the aim is not onlyto find a solution but to find a solution which is short (and,hopefully, therefore more readily analysed). In such a casethe amount of work done to find a solution is only of sec-ondary importance to the type of solution found. Finally(as is the case with the XY regression problem discussedin section 4) it may be that the bloat in the standard GPruns prevents solutions from being found. If it constrainsbloat, the additional work per generation required by oper-ator equalisation may be worthwhile because it might allowbetter solutions to be found in circumstances where stan-dard GP struggles.

On the other hand there are situations where the evalua-tion function requires a substantial amount of time, such asproblems that require a simulation to be run, competitionsto be held or even software to be run in an embedded en-vironment. In such situations the length of the individualbeing evaluated is unlikely to be determinative of the timetaken to evaluate the individual. In these circumstances theadditional burden of operator equalisation may well be un-acceptable, as the time to move from one generation to thenext will be greatly increased. Such problems require tech-niques that try to minimise the number of evaluations eachgeneration.

3. SCALPSpatial Co-Evolution in Age Layered Planes (SCALP) has

been previously used to solve the XY regression problem [5](detailed later). In SCALP the population is laid out in anumber of planes. Each plane is an MxN grid which wraps atthe boundaries. It is a combination of the co-evolutionarysystems described, inter alia, in [7] and [19] and the agelayered system (ALPS) described in [8].

SCALP works as follows: At the beginning of the runthere is only one co-evolutionary plane. To start with eachnode within the plane is connected to the eight surroundingnodes. In a regression style problem each node containsa host (which is an individual or candidate solution) anda parasite (which is one of the test cases of the problem).Each generation the following takes place:

• Fitness calculation: The host in each node is evalu-ated against the nine parasites (test cases) in its localneighbourhood (being the one in its node and the eightin the surrounding nodes, a tile). A host’s score is thecombined error from each of these nine test cases (thenine parasites). Typically for a regression style prob-lem a low score is better. A parasite (a test case) isscored on the error it caused in the hosts that are eval-

760

uated against it. For a parasite the bigger the error itcauses the host to suffer, the better it has performed.A parasite is given a score equal to its least impressiveperformance against these nine hosts.

• Selection: For each node the host and its surrounding8 hosts are ranked according to their fitness, from 1 to9. A host has a probability of being selected for thissite equal to 0.5rank other than the last one which usesa probability of 0.58 , ensuring the probabilities add upto 1. Each selected host is marked as the new host (h’).A similar process is applied to the parasites.

• Crossover of hosts: 40% of the nodes are selectedfor crossover. If crossover is to be performed at a par-ticular site the h’ at that site is crossed over with oneof the surrounding h’ (chosen randomly) and replacedby one of the children formed from the crossover (againselected randomly). With the regression style problemthere is no crossover for parasites.

• Mutation of hosts: 20% of the nodes are selectedand the h’ in each of these nodes is mutated.

• Mutation of parasites: 10% of the nodes are se-lected and the p’ at each of these nodes is mutated. Ifa parasite is mutated then it is replaced by a differenttest case.

Then if the new host (h’) is valid it replaces the old hostin each of the nodes. The new parasite (p’) replaces theold parasite. A SCALP system consists of several stackedlayers, where each layer limits the oldest genetic life of theindividuals allowed in it. A genetic life is the number ofgenerations an individual has lived through; a child inheritsthe genetic life of its oldest parent. For each subsequent layerintroduced (see section 4.2) the competitive neighbourhoodof a node is the eight surrounding nodes and the tile of ninenodes directly underneath it.

4. THE XY REGRESSION PROBLEMThis problem is a regression problem with two variables

as follows:

f(x, y) =1

(1 + x−4)+

1

(1 + y−4)

It is evaluated over the range of

−5 ≤ x ≤ 5 and− 5 ≤ y ≤ 5 (x, y �= 0)

and has a graph of the form shown in figure 1. Typically thetraining set consists of data points spaced 0.4 apart betweenthe limits, leading to 676 distinct x,y data points over whichthe regression can be trained (each a test point). As with atypical regression problem the fitness valuation reports as anerror value the sum of the absolute differences between thevalue given by the evolved function and value given by thetarget function for each of the test points (so a low fitnessis better). A successful run is one in which every data pointhas an error of less than 0.01 (each success being a ‘hit’).The XY regression problem has been studied in a number ofpapers such as [15, 20] and as noted in [15] is easily scalableto any number of dimensions by the inclusion of additionalvariables. The reason why the problem is used here is thedifficulty standard GP has in solving it. Over all the runs

-4-2

0 2

4X axis -4

-2

0

2

4

Y axis

0

0.5

1

1.5

2

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Figure 1: Graph of f(x, y) = 1(1+x−4)

+ 1(1+y−4)

, the

function used for the XY regression problem.

reported in the papers discussed above standard GP failedto find a solution and in this paper only, approximately,12% of the runs were successful; indicating that the prob-lem is very difficult for standard GP to solve. One of theproblems standard GP has is that the population becomesbloated very quickly, effectively halting any improvementsin fitness. Since this is exactly the type of problem thatoperator equalisation should be good at (the time taken toevaluate an individual is directly related to the complexityof the individual in question), XY regression is a good hardtest of the principles behind operator equalisation.

4.1 Details of the system used: GP and Oper-ator Equalisation

The experiments reported in this paper were implementedusing a Grammatical Evolution system [14]. GrammaticalEvolution (GE) is an enhancement of Genetic Programmingthat can be constrained to act in the same way as GP. It usesa Backus Naur Form (BNF ) grammar to specify the pheno-types to be evolved by the system. Whilst there are a vari-ety of crossover mechanisms available to a GE system, if theGP style crossover is used, with an appropriate grammar,then the system behaves in a way that is indistinguishablefrom a standard GP system [6]. For these experiments onlythe GP style crossover was used. A GE individual consistsof a genotype, which is a series of numbers, called codons.These codons are translated by the grammar chosen to forma phenotype. In this case the phenotype is a mathematicalexpression that can be evaluated on the test data points. Inthis paper the length of an individual means the number ofcodons that are actually used to form the phenotype. Thenumber of codons required to translate an individual’s geno-type to its phenotype represents the number of choices thatmust be made when applying the grammar. In a more tra-ditional GP system the phenotype is typically representedby an expression tree (or a lisp s-expression). The numberof codons used to form an individual in a GE system cor-responds closely to the number of nodes in a standard GPexpression tree - rather than, say, the depth of the tree. Itis an appropriate choice when trying to measure the size ofthe individual solutions, specifically when one is analysingthe system for signs of bloat.

For the XY regression problem a simple no-function re-gression grammar was used. With the no function grammar

761

the phenotype consists of the standard +, -, * and % (pro-tected division) operators, the x and y parameters and con-stants that can be created by the system. Unlike StandardGP these constants can mutate. This simple (no function,no types) grammar was used in the anticipation that theresults of the paper will have a wider application to all GPsystems and not just GE systems. For each of the runs withstandard GP and operator equalisation a population of 5,000was used. The runs were allowed to continue for 100 gener-ations. Each child generated had a 5% chance of a mutationbeing applied. If a child was generated that was invalid itwas rejected and a new child was generated in its place. Aninvalid child is one that fails to translate from the codons(genotype) to the phenotype. (This doesn’t occur often butwhen it does it is typically because an individual has toofew codons to complete the translation process1.) Tourna-ment selection was used to select individuals for breeding,with a tournament size of 7. With the standard GP andthe operator equalisation runs a form of lexicographic par-simony pressure was applied. Basically this means that iftwo individuals would otherwise have the same fitness, theshorter individual is favoured. Such parsimony control does,however, seem to be slightly less effective where there is afloating point fitness function, such as in the present case.The populations were initialised using Sean Luke’s prob-

abilistic tree creation method (version 2)[12]. This method-ology creates a broader distribution of tree shapes and sizesthan, say, Koza’s ramped half and half method [4]. For theoperator equalisation method runs were conducted with binwidths of 1 and 7. Save for the number of evaluations re-quired each run (see subsection 4.4) there was no significantdifference between the runs so only the runs with bin width1 are reported.

4.2 Details of the system used: SCALPThe SCALP system was run using a layer size of 30 x 30

(900 individuals), with layer lives of 300, 600, 1200 and 2700.This means that the lowest layer does not allow individualsto remain in it if they are older than 300 generations, thesecond layer older than 600 generations and so on. The fifthlayer has an implied“no life limit”. Layers are only generatedwhen required. So for the first 300 generations there is onlyone layer. After 300 generations, but before 600 generationsthere two layers, etc. Finally it should also be rememberedthat in the SCALP system there are far fewer test case eval-uations per individual per generation. In the lowest layerthere are only 9 evaluations per individual (being the nineparasites in an individual’s neighbourhood) compared with676 evaluations per individual per generation with standardGP. Each subsequent layer introduces a further 18 evalua-tions (for a total of 9 + 18 or 27) per individual per layer,the additional evaluations being the evaluations against thenine parasites in the corresponding tile in the layer imme-diately below and the comparison evaluation of the host inthe layer below with the nine parasites in the upper layer.This means, for instance, that in the first 300 generationsthere are (30 * 30) population * 300 (generations) * 9 (eval-uations) = (approx.) 2.4 million evaluations. This is fewerevaluations than the equivalent of a single generation of a

1It should also be noted that determining whether an indi-vidual is valid or invalid is different from and (for any non-trivial problem) far less time-consuming than evaluating thefitness of the individual.

Op Eq (2/25) Normal GP (3/25) SCALP (24/25)

050

100

150

200

Type of run (number of successes/total number of runs)

Fitn

ess

at th

e en

d of

the

run

Figure 2: XY Regression: Box plot of the final fit-ness scores for each of the different systems.

5,000 population standard GP system (5000 population *676 evaluations = (approx.) 3.4 million evaluations). Oncethe last layer is in place (four layers with limited lives, onewith unlimited lives) there will be 4,500 individuals (fivelayers of 900 individuals) and a total of 105,300 evaluationsper generation. Going by evaluations performed; an approx-imate equivalence of SCALP generations to standard GPgenerations is 50:1. Consequently in graphs comparing thedifferent systems where generations form one of the axis, 50SCALP generations are taken as equaling one standard GPgeneration.

4.3 Preparation of DataWhere graphs are produced showing items such as fitness

over time against generations, the data has been compiledby taking the median result from the 25 runs performed.The median is used because it is more robust to some of theoutliers that are observed - particularly so in the Bio Avail-ability runs, described in section 5. If a run was successful(and therefore terminated early) the results for the success-ful generation were replicated if data was needed from thatrun for later generations. Where a box-whisker chart (suchas figure 2) is shown to demonstrate the spread of a partic-ular measure at the end of the run, then each of the runs isrepresented. As is typical for such charts the median is themiddle line, the box represents the quartiles, the whiskersthe most extreme runs providing they lie within 1.5 timesthe inter-quartile spread. Outliers (if shown) are plottedseparately.

4.4 Analysis of Operator EqualisationFigure 2 shows a box-whisker chart of the fittest individ-

ual produced at the end of the XY problem runs; a lowfitness being better. The number after the name indicatesthe number of runs (out of 25) that had a successful solu-tion (less than 0.01 error for each of the 676 test points).The difference in success rate between standard GP andoperator equalisation was not significant. As can be seenoperator equalisation produced a “tighter” range of fitnessthan standard GP. There were several standard GP runswhich suffered from bloat very early on, as a consequenceof which even the best individuals at the end of such runshad a very poor fitness. As discussed in section 2.3 oper-

762

0

10000

20000

30000

40000

50000

0 10 20 30 40 50 60 70 80 90 100

Numb

er of

evalu

ation

s req

uired

Number of Generations

Operator Equalisation, bin width 7Operator Equalisation, bin width 1

Figure 3: A comparison of the number evaluationsper generation required by operator equalisationwith a bin width of 1 and a bin width of 7.

0

50

100

150

200

250

300

0 10 20 30 40 50 60 70 80 90 100

Media

n fitn

ess o

ver a

ll run

s (low

is be

tter)

Number of Generations (or equivalent for SCALP)

Standard GPOperator Equalisation

Spatial Co-Evolution

Figure 4: XY regression showing the median fit-ness over 25 runs versus generations (or generationequivalent for SCALP).

ator equalisation requires far more evaluations than stan-dard GP. Whilst standard GP evaluates 5,000 individualsper generation, operator equalisation required, on average,approximately 16,000 evaluations per generation with a binwidth of 1 and 32,000 evaluations per generation with a binwidth of 7. No other bin widths were tried. Figure 3 showsa generation by generation comparison of those two meth-ods. As noted earlier, other than the increased effort, thereappeared to be no notable differences between the runs witha bin width of 1 and those with a bin width of 7 and so onlythe bin width 1 runs are reported.Figure 4 shows a standard plot of fitness versus gener-

ations. For each run the best fitness in each generationwas recorded. The plot shows the median of these recordedbest fitnesses per generation. Looking at the plots of thestandard GP fitness and the operator equalisation fitness itcan be seen that operator equalisation appears to find lower(and therefore better) fitness levels more quickly than stan-dard GP. However, when reading this graph it must be keptin mind that operator equalisation is performing approxi-mately three times the number of evaluations over standardGP. As well as the fitness of the population it is also impor-tant to look at the average size of the programs in the pop-ulation. Figure 5 shows the median of the average programsize in the population for each of the different methods. Ascan be seen with operator equalisation and standard GP av-erage program size increases, with very little to distinguishthe two methods. The question then becomes whether this isan increase in size that corresponds to an increase in fitnessor whether the system is bloating.The metric used, inter alia, in [18] provides a methodology

0

200

400

600

800

1000

1200

1400

0 10 20 30 40 50 60 70 80 90 100

Media

n of a

verag

e pop

ulatio

n len

gths

Number of Generations (or equivalent for SCALP)

Standard GPOperator Equalisation

SCALP

Figure 5: XY regression showing the program sizeagainst the number of generations. The SCALP plotshows the average program size in the oldest layer.

0

50

100

150

200

250

300

0 200 400 600 800 1000 1200 1400

Media

n fitn

ess o

ver a

ll run

s (low

is be

tter)

Average size of individuals in the population

Standard GPOperator Equalisation

Spatial Co-Evolution

Figure 6: XY regression, showing fitness as againstaverage program length. Both Standard GP andOperator Equalisation show standard bloat charac-teristics, namely an increase in length with no im-provement (decrease) in fitness. The SCALP profileshows a shrinking of program size as fitness dropsbelow, about, 25.

to answer this question. Instead of comparing fitness withevaluations or generations, it compares fitness with programlength. For this type of chart, if a better fitness is lower,a line heading down the chart (south) shows a fitness thatimproves without an increase in program length, a line whichheads south-east indicates a better fitness, at the cost ofincreasing program length and a line which heads directlyeast indicates the type of condition which is, by definition,bloat (increase in average program length, no correspondingimprovement in fitness).

Figure 6 shows the fitness/length graphs for each of thethree different runs conducted. Looking at the standard GPrun it can be see the graph runs south east at the start show-ing an improvement in fitness together with an increase insize followed by a longer east running tail showing an in-crease in size with very little improvement in fitness. Thisis typical of a standard GP run on many such problems andindicates bloat taking hold. Looking at the operator equal-isation run one can see a “south” running line from abouta fitness of 250 to approximately 75 indicating an improve-ment in fitness without an increase in size, i.e. operatorequalisation is constraining bloat. After that, however, thegraph quickly levels off running east and showing that oper-ator equalisation is also suffering from bloat like behaviour.(the SCALP run is analysed in the next subsection.) Fromthe above it appears that with both standard GP and op-erator equalisation as the generations increase the runs in-

763

0

200

400

600

800

1000

1200

1400

0 10 20 30 40 50 60 70 80 90 100

Avera

ge of

each

run’s

avera

ge pr

ogram

size

Generations (or generation equivalent for SCALP)

Standard GPOperator Equalisation

SCALP

Figure 7: XY regression, showing the average ofeach run’s average program length versus genera-tions. This shows that with SCALP even for theruns that have not yet solved the problem there isno increase in the size of program length once fit,short, phenotypes are found.

creasingly suffer from an increase in program size with verylittle (if any) improvement in fitness.

4.5 Analysis of SCALPAs can be seen from the box-whisker chart in Figure 2,

SCALP produced a solution in 24 out of the 25 runs withinthe 5,000 generations allocated. 16 of the 25 solutions hada fitness of zero (i.e. they had found the actual solution),of the remaining 9, 8 found approximate solutions (i.e. 676hits) and one had not yet found the solution. The ability ofa spatial co-evolution system to solve XY regression is welldocumented (e.g. [15, 20]). What has not been looked atbefore is the length of the solutions found.In order to analyse whether SCALP was controlling bloat

it was necessary to gather additional metrics from the sys-tem. To this end every 50 generations each individual wasevaluated as against each of the 676 data points in orderto measure how the fitness of the population was improving.This measurement was for data gathering purposes only anddid not affect the evolutionary algorithm. In addition theaverage size of the individuals in the oldest layer was mea-sured. This data was incorporated into Figures 4 to 6. Fig-ure 5 shows the average length of programs in the top layeras against time (measured by way of generations). The in-teresting point about this chart is that the length appearsnot only to increase, but to decrease as well. As previouslymentioned, evolution was halted once a solution was foundand the long flat length versus generations line for SCALPmust be treated with a bit of caution as a result of this.After about generation 3000 (generation 60 on the graph)well over 50% of the SCALP runs had solved so the medianwould be unchanged even if the unsolved runs bloated.Figure 7 shows the average of the average size (rather

than the median of the average size shown in Figure 6).This confirms that the SCALP program size is not increas-ing for those runs that have yet to find a solution. With atypical standard GP or even operator equalisation run thelength trends upwards. With SCALP, whilst longer individ-uals appear in the population, it appears that the dynamicsof the methodology (e.g. the constantly changing test cases)serves to penalise larger individuals, allowing smaller indi-viduals to dominate the population. Unlike the standard GPand operator equalisation methodologies there was no formof length control or direct parsimony pressure. This would

seem to imply that smaller individuals are able to use theevolutionary mechanisms (crossover and mutation) to adaptto the change in parasites more quickly. Perhaps the ‘defenseagainst crossover’ provided by long individuals serves to in-hibit adaptation to the changing environment. The plots inFigures 6 and 7 demonstrate the difference between the be-haviour of SCALP and standard GP/operator equalisation.As can be seen, towards the end of runs, fitness improve-ments are correlated with a shrinking of program lengths (awest moving line in the figure) - the exact opposite of bloat.

5. REAL WORLD PROBLEM - HUMAN ORALBIOAVAILABILITY

This is a real world problem that has been the subjectof various operator equalisation papers (e.g. [18]). It isthe prediction of human oral bioavailability of a set of drugcompounds, based on their molecular structure. This paperuses the same dataset as [18], which is available online. Thedataset consists of 262 items (representing molecules) with241 parameters and a target value. Each parameter is amolecular descriptor, the target value is the known bioavail-ability value. In previous papers the datasets were preparedby randomly splitting them at the start of each run so that70% of the molecules (randomly selected with uniform prob-ability) formed the training set, whilst the remaining 30%formed the validation set. The phenotype was based on asimple regression problem with non-terminals *, +, - and% (protected division). For tree creation, where a choice isto be made between choosing a non-terminal or a terminal,an equal probability was assigned with respect to whethera non-terminal would be chosen or a terminal. A similarmethodology was adopted for the tests in this paper.

Some of the interesting things about this dataset is thatit may well be noisy, the data included may be insufficientto give an accurate prediction of bioavailability and/or therepresentation chosen may not be complex enough to allowaccurate predictions. How such a dataset impacts on theco-evolution methodology is interesting.

[18, 17] were concerned with whether operator equalisa-tion might prevent over-fitting of the data on such datasets.Not surprisingly they concluded that over-fitting and bloatwere not positively correlated and may in fact be counter-indicative. It is arguable that any system which is capableof learning an arbitrarily complex relationship will over-fitwhen presented with noisy/partial data. Within any setof noisy and/or partial data there will exist relationshipsthat are purely coincidental and limited to that sample ofthe data. A system that strives to fit the data perfectly islikely to take advantage of these coincidental rules, whichwill limit the generality of the solution. The trick with de-cision trees and, it is argued, GP is to know when to stopbuilding/evolving [13]. One method of doing this is to usethe validation data set to work out when to stop learning.One can either halt the learning process when the valida-tion score starts to worsen or when the combination of thetraining data and the validation score starts to worsen. Theproblem with using validation data for this purpose is thatif the data is used to determine which individual to returnit cannot be used to work out how successful the run was- the validation data becomes part of the overall data usedto generate the evolved individual. Consequently, a thirdset of data is required to work out how successful the train-

764

GP T+V OpEq T+V SCALP T+V GP Test OpEq Test SCALP Test

2030

4050

6070

80

Fitne

ss

Type of DataTraining + Validation Test

Two GP Test outliers not shown

Figure 8: Bioavailability data. The first three plotsshow the scores of the individual that had the lowest(best) score on the combined training and validation(T+V) data. Operator equalisation found the bestfit for this data. The second three plots show thescore on the test data, this is the data not seen bythe evolutionary system. SCALP returns the tight-est and lowest scores here suggesting it is findingbetter general solutions.

ing methodology is. To do this the validation data needs tobe split into two sets, one is used as validation data (to de-termine when over-fitting occurs) the second set (comprising15% of the original data) is used to determine how successfulthe run is (the test data).Rather than using population sizes of 500 as per [18] the

runs in this paper used a population size of 5,000 - with allother parameters being as in the XY regression run. Theruns were allowed 100 generations (or 5,000 in the case ofSCALP) and the individual which had the lowest (i.e. best)combined test and validation score during the run was noted.Since the score from the validation data might worsen thenimprove again during the course of the run, rather than stopthe run as soon as the combined test and validation errorworsen, the current best individual is remembered whilst thefull 100 generations are finished. The best individual foundduring the course of the run is returned2.

5.1 Changes necessary for SCALPOne of the issues with SCALP is that every generation

each individual is only assessed against a small number ofthe training cases. It therefore becomes difficult to assesswhen the run has been completed, especially if, as with thisproblem, there is no prefect solution. In order to allow acomparison to be made with the two other methodologieseach SCALP run was terminated after a set number of gen-erations (5000). The individuals in the oldest layer weretested every 50 generations as against the training data andthe validation data and the individual which scored best dur-ing the run was retained and returned as the result of therun. As with the other methodologies the validation datais not being used to guide the evolution, rather it is being

2Since there were relatively few items of validation data(about 54) using only the validation data to try and de-termine if over-fitting had occurred led to large variationsbetween runs

20

25

30

35

40

45

50

0 10 20 30 40 50 60 70 80 90 100

Fitne

ss on

traini

ng D

ata

Generations (or equivalent for SCALP)

SCALPstandard GP

operator equalisation

Figure 9: Bioavailability Data, showing the fitnessagainst the training data as it changes during thecourse of the run. Both standard GP and operatorequalisation show continued improvement as they fitthe data, but SCALP is constantly oscillating appar-ently unable to fit the data.

used to try and avoid over-fitting by returning the individ-ual that scores best on the training data and the validationdata. The unseen test data is used to work out how wellthe run performed. It should be noted that for this problemwhilst the evaluations every 50 generations on the combinedtraining data and validation data do not influence the courseof the evolutionary run they are used in determining whichevolved individual is returned at the end of the run. Ac-cordingly these additional evaluations constitute additionaloverhead for this methodology.

5.2 AnalysisFigures 9 and 10 show the plots of fitness against gener-

ations and fitness against program length for this problem.The plots for operator equalisation and standard GP aremuch as expected. Both show improving (decreasing) fit-ness as the evolution proceeds. Figure 10 shows the “south”headed line for operator equalisation which indicates thatfitness is improving without an increase in size. However asthe fitness continues to decrease the average length of theprograms increases - although not as much as they do withstandard GP3. The behaviour of the SCALP run is quite dif-ferent. Instead of a continual improvement in fitness as onemight expect, figure 9 demonstrates an oscillating fitness.The reasons for this are complex and linked to the way theparasites (test cases) self select. Whilst figure 9 is showingthe score against all the test data, at any one time the toplayer is only evolving against a subset of this. Since this oralbioavailability problem does not appear to have a perfect so-lution (for this representation at least) SCALP appears to becycling between different solutions which score well againstdifferent parasites. Unlike both operator equalisation andstandard GP, SCALP seems incapable of finding a hypoth-esis that scores better (less) than, approximately, 29 whentested against all the training data. Figure 10 shows thesame effect, although this time it is plotting program lengthagainst fitness. As can be seen SCALP is criss-crossing be-tween certain lengths of population size. Unlike the XYregression problem there is no perfect solution to be foundand so it keeps cycling through imperfect solutions. What isremarkable is that the population does not appear to bloat,

3The difference between the OpEq results shown here andin [18] appear to be mainly attributable to the different ini-tialisation methodology and the increased population size.

765

20

25

30

35

40

45

50

0 500 1000 1500 2000 2500 3000

Fitne

ss on

traini

ng D

ata

Average length of individuals in the population

SCALPstandard GP

operator equalisation

Figure 10: Bioavailability Data, showing averagesize of program length as against the error on thetraining data. Again both operator equalisation andstandard GP show typical graphs, slowly improvingfitness against increase in length. SCALP howevershows a totally different type of graph exploring arange of program sizes and fitness over and overagain.

presumably for similar reasons as discussed with the XY re-gression problem, that is if the solution becomes too long itis not responsive to the change in the parasites.Figure 8 contains the last items to note. Whilst the best

individual returned by SCALP does not score as well on thetraining (plus validation) data as, say, operator equalisation,there is a very tight range returned over all 25 runs. Theperformance of SCALP on the unseen test data is the bestof the methods, returning a tight range of low scores. Thiswould seem to indicate that SCALP is not over-fitting thetraining data unlike the other two methods (despite theiruse of validation data to try and prevent over-fitting).

6. SUMMARY AND CONCLUSIONS.SCALP appears to be remarkably resilient to bloat even

although it has no express bloat control measures. Its be-haviour in the oral bioavailability problem is both concern-ing and exciting. It is concerning because it appears thatit is incapable of fully fitting (over-fitting) the data; whichindicates that there will be problems it struggles with. Ex-citing because it appears to be better at finding a generalsolution than either of the two other methods examined inthis paper. Its search of the program space is quite differentfrom more traditional methods. Of course it is not possi-ble to draw any conclusions on one set of test data, but theresults are definitely interesting and warrant further inves-tigation.

7. REFERENCES[1] L. Altenberg. The evolution of evolvability in genetic

programming. In K. E. Kinnear, Jr., editor, Advancesin Genetic Programming, chapter 3, pages 47–74. MITPress, 1994.

[2] S. Dignum and R. Poli. Generalisation of the limitingdistribution of program sizes in tree-based geneticprogramming and analysis of its effects on bloat. InH. Lipson, editor, GECCO 2007, pages 1588–1595.ACM, 2007.

[3] S. Dignum and R. Poli. Operator equalisation andbloat free GP. In EuroGP 2008, volume 4971 of

Lecture Notes in Computer Science, pages 110–121.Springer, 26-28 Mar. 2008.

[4] R. Harper. GE, explosive grammars and the lastinglegacy of bad initialisation. In IEEE Congress onEvolutionary Computation (CEC 2010). IEEE Press,18-23 July 2010.

[5] R. Harper. Spatial co-evolution in age layered planes(SCALP). In CEC 2010. IEEE Press, 2010.

[6] R. T. R. Harper. Enhancing Grammatical Evolution.PhD thesis, School of Computer Science andEngineering, The University of New South Wales,Sydney 2052, Australia, 2009.

[7] W. D. Hillis. Co-evolving parasites improve simulatedevolution as an optimization procedure. In C. G.Langton, C. E. Taylor, J. D. Farmer, andS. Rasmussen, editors, Artificial Life II, volume X ofSanta Fe Institute Studies in the Sciences ofComplexity, pages 313–324. Addison-Wesley, Santa FeInstitute, New Mexico, USA, Feb. 1990 1992.

[8] G. S. Hornby. ALPS: the age-layered populationstructure for reducing the problem of prematureconvergence. In GECCO 2006, volume 1, pages815–822. ACM Press, 8-12 July 2006.

[9] J. R. Koza. Genetic Programming: On theProgramming of Computers by Means of NaturalSelection. MIT Press, 1992.

[10] W. B. Langdon. The evolution of size in variablelength representations. In 1998 IEEE InternationalConference on Evolutionary Computation, pages633–638. IEEE Press, 5-9 May 1998.

[11] W. B. Langdon and R. Poli. Foundations of GeneticProgramming. Springer-Verlag, 2002.

[12] S. Luke. Two fast tree-creation algorithms for geneticprogramming. IEEE Transactions on EvolutionaryComputation, 4(3):274–283, Sept. 2000.

[13] T. M. Mitchell. Machine Learning. McGraw-Hill, Inc.,New York, NY, USA, 1 edition, 1997.

[14] M. O’Neill and C. Ryan. Grammatical evolution.IEEE Transactions on Evolutionary Computation,5(4):349–358, Aug. 2001.

[15] L. Pagie and P. Hogeweg. Evolutionary consequencesof coevolving targets. Evolutionary Computation,5(4):401–418, 2011/12/16 1997.

[16] S. Silva. Reassembling operator equalisation: a secretrevealed. GECCO ’11, pages 1395–1402, New York,NY, USA, 2011. ACM.

[17] S. Silva, S. Dignum, and L. Vanneschi. Operatorequalisation for bloat free genetic programming and asurvey of bloat control methods. Genetic Programmingand Evolvable Machines, pages 1–42, 2011.10.1007/s10710-011-9150-5.

[18] S. Silva and L. Vanneschi. Operator equalisation,bloat and overfitting: a study on human oralbioavailability prediction. In GECCO ’09, pages1115–1122. ACM, 8-12 July 2009.

[19] N. Williams and M. Mitchell. Investigating the successof spatial coevolutionary learning, pages 523–530.ACM Press, 2005.

[20] N. Williams and M. Mitchell. Investigating the successof spatial coevolutionary learning. In GECCO-2005,pages 523–530. ACM Press, 2005.

766