Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Bioinformatics Literature Review A Review of Genetic Algorithms Lit Review Talk by Kato Mivule COSC891 – Bioinformatics, Spring 2014 Bowie State University Bowie State University Department of Computer Science

Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms and Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

Page 1: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Bioinformatics Literature Review

A Review of Genetic Algorithms

Lit Review Talk

by

Kato Mivule

COSC891 – Bioinformatics, Spring 2014

Bowie State University

Bowie State University Department of Computer Science

Page 2: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Outline

• Introduction

• Biological Background

• Genetic Algorithm

• Genetic Algorithm Paper Discussion

• Conclusion

Page 3: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Sources

Information presented in these slides is adapted from the following sources:

1. Michael Skinner, Genetic Algorithms Overview, http://geneticalgorithms.ai-depot.com/Tutorial/Overview.html, accessed online March 2nd, 2014.

2. Genetic Algorithms, Lecture Notes, UC Davis Computer Science Dept, http://www.cs.ucdavis.edu/~vemuri/classes/ecs271/Genetic%20Algorithms%20Short%20Tutorial.htm

3. Wikipedia, Genetic algorithm, http://en.wikipedia.org/wiki/Genetic_algorithm

4. Nobal Niraula, Genetic Algorithms by Example, http://www.slideshare.net/kancho/genetic-algorithm-by-example

5. BBC Genetics, http://www.bbc.co.uk/bitesize/intermediate2/biology/environmental_and_genetics/factors_affecting_variation_species/revision/6/

6. Deoxyribonucleic Acid (DNA), https://www.genome.gov/25520880#al-3

7. MATLAB, How the Genetic Algorithm Works, http://www.mathworks.com/help/gads/how-the-genetic-algorithm-works.html

Page 4: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic Algorithms (GA) - Introduction

• Genetic Algorithms (GA) were first developed by John Holland (1975).

• GA is a search heuristic that mimics the process of natural evolution.

• GA uses Darwin's concepts of “Natural Selection” and “Genetic Inheritance”.

• GAs can be applied to problems about which little is known in advance.

• GAs are general: in principle they can work in any search space.

• GA use selection and evolution to generate numerous solutions to a problem.

Page 5: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic Algorithms (GA) – Introduction

• GAs work well when the set of candidate solutions is very large.

• In simpler search spaces, GAs are often outperformed by more problem-specific algorithms.

• GAs are not always the best choice, because their run time can be long.

• GAs are good at producing high-quality solutions to a problem.

Page 6: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic Algorithms (GA) – Introduction

• GA use the process of natural selection and evolution.

• “…Some birds developed large, strong beaks suited to cracking nuts, others long,

narrow beaks more suitable for digging bugs out of wood. The birds that had these

characteristics when blown to the island survived longer than other birds. This

allowed them to reproduce more and therefore have more offspring that also had this

unique characteristic. Those without the characteristic gradually died out from

starvation. Eventually all of the birds had a type of beak that helped it survive on its

island. The individuals themselves do not change, but those that survive better, or

have a higher fitness, will survive longer and produce more offspring. This continues

to happen, with the individuals becoming more suited to their environment every

generation. It was this continuous improvement that inspired computer scientists, one

of the most prominent being John Holland, to create genetic algorithms…” Genetic

Algorithms Overview, Michael Skinner

Page 7: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology background

• The body is made up of cells. Each cell has a center called a nucleus. The nucleus contains the chromosomes. Each chromosome is composed of tightly coiled strands of deoxyribonucleic acid (DNA).

• Genes are sections of DNA that determine particular traits, like eye and skin color. You have more than 20,000 genes. A gene mutation is a modification in DNA. Some changes in your genes result in genetic disorders.

Source: http://www.riversideonline.com/health_reference/Tools/DS00549.cfm

Page 8: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology background

• The body is made up of cells. Each cell has a center called a nucleus. The nucleus contains the chromosomes, and the chromosomes contain the DNA strand.

Source: BBC Genetics: http://www.bbc.co.uk/bitesize/intermediate2/biology/environmental_and_genetics/factors_affecting_variation_species/revision/6/

Page 9: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology background

• Each chromosome is composed of tightly coiled strands of deoxyribonucleic acid (DNA). Genes are sections of DNA that determine particular traits, like eye and skin color.

Source: BBC Genetics: http://www.bbc.co.uk/bitesize/intermediate2/biology/environmental_and_genetics/factors_affecting_variation_species/revision/6/

Page 10: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology background

• DNA: A molecule of DNA is made up of two strands that form a double helix. Each strand contains four types of bases: Adenine (A), Thymine (T), Guanine (G) and Cytosine (C). The bases on opposite strands are held together by weak hydrogen bonds. Adenine pairs with Thymine. Guanine pairs with Cytosine.

• A section of this DNA is called a gene. It is normally hundreds or thousands of DNA bases long.

Source: BBC Genetics: http://www.bbc.co.uk/bitesize/intermediate2/biology/environmental_and_genetics/factors_affecting_variation_species/revision/6/
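The pairing rule above (A with T, G with C) can be written as a small lookup. This is a minimal illustrative sketch; the function names are hypothetical and not taken from the cited sources.

```python
# Base pairing as stated on the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Complement read in the opposite direction (the partner strand)."""
    return complement(strand)[::-1]

if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```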

Page 11: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology background

• Genes and Proteins: The genetic information coded into the DNA of the genes gives cells the instructions to make many specific protein molecules.

• Proteins are built from amino acid molecules. The order of the DNA bases is the code for the order of amino acids in the protein.

Source: BBC Genetics: http://www.bbc.co.uk/bitesize/intermediate2/biology/environmental_and_genetics/factors_affecting_variation_species/revision/6/
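To make "the order of the DNA bases is the code for the order of amino acids" concrete, the sketch below reads a DNA string three bases at a time. The table holds only a handful of entries from the standard genetic code, chosen for illustration; the function name and the example sequence are hypothetical.

```python
# A small excerpt of the standard genetic code (DNA codons, coding strand).
# Only a few of the 64 codons are listed; this is for illustration only.
CODON_TABLE = {
    "ATG": "Met", "TTT": "Phe", "GGC": "Gly",
    "AAA": "Lys", "TGG": "Trp", "TAA": "STOP",
}

def translate(dna: str) -> list:
    """Read the DNA three bases at a time and map each codon to an amino acid."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]  # KeyError for codons not in the excerpt
        if amino == "STOP":
            break
        protein.append(amino)
    return protein

if __name__ == "__main__":
    print(translate("ATGTTTGGCTAA"))  # ['Met', 'Phe', 'Gly']
```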

Page 12: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology Background

• Random assortment of chromosomes: The members of each chromosome pair separate completely at random, giving many possible combinations.

Source: BBC Genetics: http://www.bbc.co.uk/bitesize/intermediate2/biology/environmental_and_genetics/factors_affecting_variation_species/revision/6/

Page 13: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology Background

Natural Selection Process

Source: BBC Biology Genetics: http://www.bbc.co.uk/bitesize/higher/biology/genetics_adaptation/natural_selection/revision/2/

Page 14: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Biology Background

Natural Selection Process

Source: Wikipedia, Evolution: http://en.wikipedia.org/wiki/Evolution

Page 15: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic Algorithm Pseudo-code

Generate an initial population of individuals

Evaluate the fitness of all individuals

while termination condition not met do

    Select fitter individuals for reproduction

    Recombine between individuals

    Mutate individuals

    Evaluate the fitness of the modified individuals

    Generate a new population

end while

Source: Nobal Niraula, Genetic Algorithms by Example http://www.slideshare.net/kancho/genetic-algorithm-by-example
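The pseudo-code can be turned into a short runnable sketch. Everything specific here is an assumption made for illustration and does not come from the cited slides: the toy "count the 1s" fitness, binary tournament selection, one-point crossover, bit-flip mutation, and all parameter values.

```python
import random

def one_max(bits):
    # Toy fitness, an assumption for illustration: the number of 1s in the string.
    return sum(bits)

def tournament(pop, fitness, rng):
    # Select fitter individuals: keep the better of two randomly drawn ones.
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def run_ga(n_bits=20, pop_size=30, generations=50, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # Generate an initial population of individuals.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=one_max)
    # Termination condition: generation budget used up or perfect string found.
    for _ in range(generations):
        if one_max(best) == n_bits:
            break
        children = []
        while len(children) < pop_size:
            p1 = tournament(pop, one_max, rng)
            p2 = tournament(pop, one_max, rng)
            cut = rng.randint(1, n_bits - 1)          # recombine (one-point crossover)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutate
            children.append(child)
        pop = children                                 # generate a new population
        best = max(pop + [best], key=one_max)          # remember the best seen so far
    return best

if __name__ == "__main__":
    best = run_ga()
    print(one_max(best), best)
```

Keeping `best` outside the population means the returned solution never gets worse from one generation to the next, even though this sketch replaces the whole population each time.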

Page 16: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic Algorithm

Nobal Niraula, Genetic Algorithms by Example http://www.slideshare.net/kancho/genetic-algorithm-by-example

Page 17: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic algorithm process

Phases in the Genetic algorithm process.

Source: http://www.cs.ucdavis.edu/~vemuri

Page 18: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Genetic Algorithm (GA)

•Initial Population: GA starts by generating a random initial population

•Creating the Next Generation: children are created from the current population

•GA generates three types of children for the next generation:

•Elite children: individuals with the best fitness values who survive.

•Crossover children: combining the vectors of a pair of parents.

•Mutation children: introducing random changes to a single parent.

•Stopping Conditions for the Algorithm

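•The algorithm stops when a stopping condition is met, for example when the best fitness value reaches a specified limit or a maximum number of generations is reached.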

Source: MATLAB How the Genetic Algorithm Works, http://www.mathworks.com/help/gads/how-the-genetic-algorithm-works.html
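The three kinds of children can be sketched as follows. This is loosely modeled on the description above and is not MATLAB's implementation: the parameter names (`n_elite`, `crossover_fraction`, `sigma`), the choice of parents from the better half of the population, and the Gaussian mutation step are all assumptions. Lower fitness is treated as better.

```python
import random

def next_generation(pop, fitness, n_elite=2, crossover_fraction=0.8,
                    sigma=0.1, rng=random):
    """Build the next generation from real-valued vectors (lower fitness is better)."""
    ranked = sorted(pop, key=fitness)
    # Elite children: the best individuals survive unchanged.
    elite = [list(v) for v in ranked[:n_elite]]
    n_rest = len(pop) - n_elite
    n_cross = round(crossover_fraction * n_rest)
    # Simplifying assumption: parents come from the better half of the population.
    parents = ranked[:max(2, len(pop) // 2)]
    # Crossover children: each entry is taken from one of two parents at random.
    crossover = []
    for _ in range(n_cross):
        a, b = rng.sample(parents, 2)
        crossover.append([x if rng.random() < 0.5 else y for x, y in zip(a, b)])
    # Mutation children: random changes applied to a single parent.
    mutation = []
    for _ in range(n_rest - n_cross):
        parent = rng.choice(parents)
        mutation.append([x + rng.gauss(0, sigma) for x in parent])
    return elite + crossover + mutation

if __name__ == "__main__":
    rng = random.Random(1)
    sphere = lambda v: sum(x * x for x in v)   # toy objective to minimize
    pop = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(10)]
    for _ in range(40):
        pop = next_generation(pop, sphere, rng=rng)
    print(min(map(sphere, pop)))
```

Because the elite children are copied over unchanged, the best fitness in the population cannot get worse from one generation to the next.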

Page 19: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

The Problem:

•The expression dataset being analyzed involves multiple classes.

•The efficient selection of good predictive gene groups from datasets that are

inherently ‘noisy’.

•The development of new methodologies that can enhance the successful

classification of these complex datasets.

Page 20: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

Methods:

• GA is applied to the problem of multi-class prediction.

•A GA-based gene selection scheme is employed to automatically

•Determine the members of a predictive gene group

•Determine the optimal group size

•Determine the classification success using a maximum likelihood (MLHD)

classification method.

Page 21: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

Results:

•The authors state that the GA/MLHD-based approach achieves higher classification accuracies than other published predictive methods on the same multi-class test dataset.

•The authors claim that GA/MLHD also permits substantial feature reduction in classifier gene sets without compromising predictive accuracy.

Page 22: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

Dataset and Data Preprocessing

•The authors used the NCI60 gene expression dataset, which contains the gene expression profiles of 64 cancer cell lines as measured by cDNA microarrays containing 9703 spotted cDNA sequences.

•The authors downloaded the data from http://genome-www.stanford.edu/sutech/download/nci60/dross arrays nci60.tgz.

•During data preprocessing, the authors excluded spots with missing data as well as control and empty spots, leaving 6167 genes.

Page 23: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

Overall Methodology

The GA/MLHD classification strategy consists of two main components:

(1) a GA-based gene selector

(2) a maximum likelihood (MLHD) classifier.

•The actual classification process is performed using the maximum likelihood

(MLHD) classifier.

•Each individual in the population thus represents a specific gene predictor

subset

•A fitness function is used to determine the classification accuracy of a predictor

set.

Page 24: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

System and Methods

•Initialization and Evaluation: An initial population is formed by creating N

random strings, where the population size N is pre-specified

•Selection, Crossover and Mutation: Two selection methods were used to

select the strings for the mating pool: (i) stochastic universal sampling (SUS) and

(ii) roulette wheel selection (RWS).
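The two selection methods differ in how they place pointers on the fitness "wheel". The sketch below is a generic textbook version, not the authors' code; the function names are hypothetical, and fitness values are assumed to be non-negative with a positive sum.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_wheel(fitnesses, n, rng=random):
    """RWS: spin the wheel n times independently; returns selected indices."""
    cumulative = list(accumulate(fitnesses))
    total = cumulative[-1]
    last = len(fitnesses) - 1
    # The min() guards against a rounding edge case where r equals total.
    return [min(bisect_right(cumulative, rng.random() * total), last)
            for _ in range(n)]

def stochastic_universal_sampling(fitnesses, n, rng=random):
    """SUS: one spin, then n equally spaced pointers around the wheel."""
    cumulative = list(accumulate(fitnesses))
    step = cumulative[-1] / n
    start = rng.random() * step
    picks, i = [], 0
    for k in range(n):
        pointer = start + k * step
        # Advance to the slot whose cumulative range contains the pointer.
        while cumulative[i] <= pointer and i < len(cumulative) - 1:
            i += 1
        picks.append(i)
    return picks

if __name__ == "__main__":
    rng = random.Random(0)
    fitnesses = [1, 1, 2, 4]
    print(roulette_wheel(fitnesses, 8, rng))
    print(stochastic_universal_sampling(fitnesses, 8, rng))
```

With fitnesses 1, 1, 2, 4 and eight picks, SUS gives each individual exactly its expected share (1, 1, 2 and 4 copies), while RWS only matches those proportions on average.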

Page 25: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction

for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

System and Methods

•Crossovers: performed by randomly choosing a pair of strings from the

mating pool and then applying a crossover operation on the selected string

pair.

•Uniform mutation: applied with probability p(m) to each of the offspring strings produced by crossover.

•Termination: evaluation, selection, crossover and mutation are repeated for G generations, after which the string with the best fitness across all generations is output as the solution.
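A minimal sketch of the two operators on strings of gene indices. The slide does not specify the string encoding or the crossover type, so the fixed-length list of gene indices and the one-point crossover used here are assumptions, as are the function names; `n_genes=6167` simply reuses the gene count left after preprocessing.

```python
import random

def one_point_crossover(a, b, rng=random):
    """Swap the tails of two equal-length strings at a random cut point."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def uniform_mutation(string, n_genes, p_m, rng=random):
    """With probability p_m, replace each position by a uniformly random gene index."""
    return [rng.randrange(n_genes) if rng.random() < p_m else g for g in string]

if __name__ == "__main__":
    rng = random.Random(0)
    n_genes = 6167                      # genes remaining after preprocessing
    parent_a = [rng.randrange(n_genes) for _ in range(10)]
    parent_b = [rng.randrange(n_genes) for _ in range(10)]
    child_1, child_2 = one_point_crossover(parent_a, parent_b, rng)
    print(uniform_mutation(child_1, n_genes, p_m=0.1, rng=rng))
    print(uniform_mutation(child_2, n_genes, p_m=0.1, rng=rng))
```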

Page 26: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the

analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

A maximum likelihood (MLHD) classifier

•To build an MLHD classifier (James, 1985), a total of M(t) tumor samples are

used as training samples. The remaining M(θ) tumor samples are used as test

samples.

•For the NCI60 dataset, the ratio between M(t) and M(θ) is 2:1.

•Discriminant Function: The basis of the discriminant function is Bayes’ rule of

maximum likelihood: Assign the sample to the class with the highest conditional

probability.
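One common way to turn this rule into a formula is sketched below. It assumes that each class q has a Gaussian density with class mean μ_q, that all classes share one covariance matrix Σ, and that class priors are equal; none of these assumptions is stated on the slide, and the symbols e, μ_q, Σ and f_q are introduced here for illustration.

```latex
\begin{aligned}
\hat{q}(\mathbf{e}) &= \arg\max_{q} P(q \mid \mathbf{e})
  = \arg\max_{q}\; p(\mathbf{e} \mid q)\,P(q) \\
p(\mathbf{e} \mid q) &\propto
  \exp\!\left(-\tfrac{1}{2}(\mathbf{e}-\boldsymbol{\mu}_q)^{\top}
  \Sigma^{-1}(\mathbf{e}-\boldsymbol{\mu}_q)\right) \\
f_q(\mathbf{e}) &= \boldsymbol{\mu}_q^{\top}\Sigma^{-1}\mathbf{e}
  - \tfrac{1}{2}\,\boldsymbol{\mu}_q^{\top}\Sigma^{-1}\boldsymbol{\mu}_q,
  \qquad \hat{q}(\mathbf{e}) = \arg\max_{q} f_q(\mathbf{e})
\end{aligned}
```

Taking the logarithm of the Gaussian density and dropping the term that does not depend on q leaves the linear function f_q, so the sample e is assigned to the class with the largest f_q(e).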

Page 27: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the

analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

“…Comparing GA-based Predictor Sets to Predictor Sets Obtained from Other Methodologies The best

predictor set obtained using the GA-based selection scheme exhibited a cross validation error rate of

14.63% and an independent test error rate of 5% (Table 1, row 1, and see Supplementary Information for

specific misclassifications). This is an improvement in accuracy as compared to other methodologies

assessed by Dudoit et al. (2000), where the lowest independent test error rate was reported as 19%...” Ooi

and Tan (2003)

Page 28: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the

analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

“…Comparison of expression profiles of predictor sets obtained through different methodologies.

Columns represent different class distinctions, and only training set samples are depicted. (a) Expression

profile of genes selected through the GA/MLHD method (only genes for the best predictor set are shown).

(b) Expression profile of 20 genes selected through the BSS/WSS ratio ranking method. (c) Expression

profile of 18 genes selected through the OVA/S2N ratio ranking method. Arrows depict genes which have

highly correlated expression patterns across the sample classes. Classes are labeled as follows: BR

(breast), CN (central nervous system), CL (colon), LE (leukemia), ME (melanoma), NS (non-small-cell

lung carcinoma), OV (ovarian), RE (renal) and PR (reproductive system)…” Ooi and Tan (2003)

Page 29: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Paper: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the

analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.

Conclusion

•The authors state that their report shows that highly accurate classification

results can be obtained using a combination of GA-based gene selection and

discriminant-based classification methods.

•The authors note that the accuracy achieved (95% for NCI60) is better than that of other published methods employing the same dataset.

•The authors note that other advantages of the GA-based approach are that it automatically determines the optimal predictor set size and delivers predictive accuracies comparable to other methods.

Page 30: Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms

Conclusion

• In simpler search spaces, genetic algorithms tend to be outperformed by more problem-specific algorithms.

• Genetic algorithms are not always the best choice, because their run time can be long.

• Genetic algorithms are good at producing high-quality solutions to a problem.
