Jerry Tsai

Jerry.Tsai@clintuition.com

This presentation available at: clintuition.com/pubs/


Optimal Model Search By a Genetic Algorithm Using SAS®

Jerry Tsai


Problem Statement

- n observations; p possible predictors
- n >> p >> 0
- 2^p possible subsets of the set of predictors
- The challenge: choose a subset of the possible predictors that has the greatest predictive ability relative to its size


Problem Definition

- What do statisticians call this problem?
  - “Subset selection”
  - Finding the “best (predictive) model”
  - Finding a “parsimonious model”
- How do statisticians approach this problem?
  - Conduct a search through a space defined by the 2^p possible combinations of the p parameters to find a subset of those parameters that optimizes an objective function


Reasons to Search for an Optimal Model

1. To describe the relative importance of variables
2. To save money in data collection and management
3. To enhance predictive ability

But we should make very sure it is worth the effort:

- Inappropriate for estimation and hypothesis testing
- Time-consuming


Commonly-Known Search Heuristics

- Forward; backward; stepwise
  - Found in REG, LOGISTIC, PHREG, more
- LAR (least angle regression)
- LASSO (least absolute shrinkage and selection operator)
  - Both found in GLMSELECT
- All of these heuristics use an incremental approach when searching for an optimal model


Incremental Approach

- To a set, add or subtract one variable at a time
- Include or exclude a candidate variable if:
  - The variable meets entry and stopping criteria, OR
  - The set of variables with the candidate variable added better optimizes the objective function


Holistic Approach

- Assess a set of variables as a whole
- Sets of variables are compared to one another
- Each element (variable) of the set is treated equally
- Disadvantage: less “helpful” elements of the set are treated the same as more “helpful” elements of the set
- Advantage: May uncover synergism or confounding among variables


Advantage of a Non-incremental Approach

- The absolute optimum may be undiscoverable through an incremental approach, due to:
  - Confounding
  - Endogeneity
  - Nonlinearity (with respect to a link function)
- Space searched could be much greater
  - Forward selection: O(p^2)
  - $$\lim_{p \to \infty} \frac{O(p^2)}{2^p} = 0$$ and this expression quickly converges
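To make the size gap concrete, here is a small sketch in Python (used for illustration only; the presentation itself works in SAS). It assumes forward selection runs until every predictor has entered, so it fits at most p + (p-1) + ... + 1 models. The function names are hypothetical.

```python
def forward_selection_models(p: int) -> int:
    """Upper bound on models fitted by forward selection run to the full model."""
    return p * (p + 1) // 2


def all_subsets(p: int) -> int:
    """Number of possible subsets of p predictors."""
    return 2 ** p


# Print how quickly the fraction of the space that forward selection visits shrinks.
for p in (5, 10, 20, 40):
    visited = forward_selection_models(p)
    total = all_subsets(p)
    print(f"p={p:>2}  forward={visited:>4}  subsets={total:>14}  ratio={visited / total:.2e}")
```

The ratio column is the quantity in the limit above: the O(p^2) count divided by 2^p.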


Advantage of Using Regression

Statisticians are very familiar with generalized linear models (GLMs)

Parameter estimates are amenable to comprehensible interpretation


Genetic Algorithm Implementation

- Create a generation of sets of variables (a set of sets)
- Score all sets in a generation
- Sets that score higher are selected for reproduction
- These selected sets are recombined and mutated to yield additional sets.
- These additional sets will constitute a new generation that will in turn undergo scoring, selection, and recombination.


Why Use a Genetic Algorithm?

Examples from nature suggest local optima are eventually found

A holistic approach allows variables to be assessed simultaneously

The search covers a much larger area than traditional incremental approaches


Implementation

- The presence (or absence) of each variable in a set is represented by a bit
- A string of bits together represents a chromosome
- So each chromosome represents a subset of the possible predictors


Implementation Illustration

- 12 possible parameters: alfa, bravo, charlie, delta…kilo, lima
- Representation example: the variables bravo, charlie, and kilo constitute a subset (i.e., constitute a model)

0110 0000 0010
abcd efgh ijkl
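The bit-string representation can be sketched as follows, in Python for illustration only. The slide names only alfa through delta, kilo, and lima (echo and foxtrot appear in later examples); the names golf through juliett are assumed here from the NATO alphabet, and `encode` / `decode` are hypothetical helper names.

```python
# Assumed variable list; golf..juliett are not spelled out on the slides.
VARIABLES = ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot",
             "golf", "hotel", "india", "juliett", "kilo", "lima"]


def encode(subset) -> str:
    """Turn a collection of variable names into a chromosome (bit string)."""
    return "".join("1" if name in subset else "0" for name in VARIABLES)


def decode(bits: str) -> list:
    """Turn a chromosome back into the list of variables it includes."""
    return [name for name, bit in zip(VARIABLES, bits) if bit == "1"]


print(encode({"bravo", "charlie", "kilo"}))  # 011000000010
print(decode("011000000010"))                # ['bravo', 'charlie', 'kilo']
```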


Genetic Operation – Mutation

- Logically negate bits within a chromosome (point mutation)
  - 0 becomes 1; 1 becomes 0


Mutation Example

Before: 0110 0000 0010

Randomly selected for mutation: bravo, echo, lima

After:  0010 1000 0011


Genetic Operation – Mutation

- Logically negate random bits within a chromosome (point mutation)
- 0 becomes 1; 1 becomes 0
- Example: {bravo; charlie; kilo}; MUTATE(bravo; echo; lima) gives 0010 1000 0011
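A minimal sketch of point mutation, in Python for illustration only. The function names and the per-bit `rate` parameter are assumptions, not part of the presentation's macro.

```python
import random


def mutate(bits: str, positions) -> str:
    """Flip the bits at the given 0-based positions."""
    flipped = list(bits)
    for i in positions:
        flipped[i] = "0" if flipped[i] == "1" else "1"
    return "".join(flipped)


def random_mutation(bits: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate` (assumed scheme)."""
    positions = [i for i in range(len(bits)) if rng.random() < rate]
    return mutate(bits, positions)


# bravo, echo, and lima sit at 0-based positions 1, 4, and 11.
print(mutate("011000000010", [1, 4, 11]))  # 001010000011
```

The printed string is the slide's result 0010 1000 0011 without the spacing.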


Genetic Operation – Crossover

Two chromosomes exchange genetic information (Morgan 1916)


Crossover Example

Parents:

0110 0000 0010  {bravo; charlie; kilo}
0100 1000 0001  {bravo; echo; lima}

Offspring:

0110 0000 0001  {bravo; charlie; lima}
0100 1000 0010  {bravo; echo; kilo}


Genetic Operation – Crossover

Two chromosomes exchange genetic information (Morgan 1916)

Example: CROSSOVER[{bravo; charlie; kilo}; {bravo; echo; lima}; @ foxtrot]

Before: 0110 0000 0010 and 0100 1000 0001

After:  0110 0000 0001 and 0100 1000 0010
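A minimal sketch of single-point crossover, in Python for illustration only. It assumes that "@ foxtrot" means the chromosomes are cut just after the sixth bit; the function name is hypothetical.

```python
def crossover(parent_a: str, parent_b: str, point: int):
    """Single-point crossover: the children swap everything after the first `point` bits."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


# {bravo; charlie; kilo} crossed with {bravo; echo; lima}, cut after foxtrot (bit 6).
print(crossover("011000000010", "010010000001", 6))
# ('011000000001', '010010000010')
```

The two children decode to {bravo; charlie; lima} and {bravo; echo; kilo}, matching the example.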


Genetic Algorithm - Main Steps

- Initialize
  - Set up environment
  - Create starting generation
- Evaluate (i.e., score)
  - Chromosomes (i.e., individuals)
  - Generation
- Report, interim
- Select (i.e., choose which individuals reproduce)
- Reproduce (i.e., create new generation)
  - Apply genetic operators
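The main steps can be strung together as in the sketch below, written in Python for illustration only and not a translation of the presentation's SAS macro. Several choices are assumptions: higher scores are better, selection is plain fitness-proportionate sampling with replacement (simpler than stochastic universal sampling), the generation size is even, and the loop escapes after `patience` generations without a new best score. `toy_score` is a hypothetical stand-in for fitting a regression model and returning a score.

```python
import random


def run_ga(score, n_bits, pop_size=20, max_generations=50,
           mutation_rate=0.05, patience=10, seed=0):
    rng = random.Random(seed)
    # Initialize: create a starting generation of random chromosomes.
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    cache = {}                      # saved scores, so no chromosome is scored twice
    best, best_score, stale = None, float("-inf"), 0
    history = []                    # mean score per generation
    for _ in range(max_generations):
        # Evaluate: score only chromosomes not seen before.
        for chromosome in population:
            if chromosome not in cache:
                cache[chromosome] = score(chromosome)
        scores = [cache[c] for c in population]
        # Report, interim: store the generation's mean score.
        history.append(sum(scores) / len(scores))
        # Escape? Stop when the best score has not improved for `patience` generations.
        top = max(population, key=cache.get)
        if cache[top] > best_score:
            best, best_score, stale = top, cache[top], 0
        else:
            stale += 1
        if stale >= patience:
            break
        # Select: shift scores so every weight is positive, then sample parents.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)
        # Reproduce: single-point crossover followed by point mutation.
        next_generation = []
        for a, b in zip(parents[0::2], parents[1::2]):
            point = rng.randrange(1, n_bits)
            for child in (a[:point] + b[point:], b[:point] + a[point:]):
                child = "".join(
                    ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
                    for bit in child)
                next_generation.append(child)
        population = next_generation
    # Report, final.
    return best, best_score, history


TARGET = "011000000010"  # hypothetical "true" model, used only by the toy score


def toy_score(bits):
    """Negative Hamming distance to TARGET; 0 is a perfect match."""
    return -sum(a != b for a, b in zip(bits, TARGET))


best, best_score, history = run_ga(toy_score, n_bits=12)
print(best, best_score, len(history))
```

In real use, `score` would run the chosen PROC for the variables encoded by the chromosome and return a value derived from the objective function.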


Flow Chart

Initialize → Evaluate → Report, Interim → Escape?

- No: Select → Reproduce → back to Evaluate
- Yes: Report, Final


Initialize

- Clear environment
- Initialize parameters
- Create &&VAR&I macro variables from the list of possible parameters
- Evaluate and store minimum (aka null) model
- Evaluate and store maximum (aka full) model
- Initialize parents (create starting generation)


Evaluate

- Individual (chromosomes)
  - If a chromosome has a score saved, assign that score to the chromosome
  - Otherwise, evaluate the chromosome on its fitness for reproduction
  - Save scores for newly-evaluated chromosomes
- Generation (of chromosomes)
  - Evaluate and store historical information on the characteristics of the generation, e.g., the mean score.


Scores

Evaluate each chromosome by computing the value of these functions:

- Objective function = the function to be optimized
  - Reward greater predictive ability while penalizing any increase in the number of parameters
  - e.g., Akaike’s Information Criterion (AIC)
- Fitness function
  - A function based on the objective function that determines the probability of a chromosome being selected for reproduction.
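As one possible pairing of the two functions, the sketch below (Python, for illustration only) uses AIC = 2k - 2 ln(L) as the objective, where lower is better, and converts it to a fitness score where higher is better. The conversion in `fitness_from_aic` is a hypothetical choice, not the presentation's; the log-likelihood values are made up for the example.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def fitness_from_aic(aics):
    """Hypothetical fitness: how far each AIC sits below the worst AIC in the
    generation, plus 1 so that even the worst chromosome can still be selected."""
    worst = max(aics)
    return [worst - a + 1.0 for a in aics]


# Three made-up models: (log-likelihood, number of parameters).
generation = [(-120.0, 3), (-118.5, 5), (-125.0, 2)]
aics = [aic(ll, k) for ll, k in generation]
print(aics)                    # [246.0, 247.0, 254.0]
print(fitness_from_aic(aics))  # [9.0, 8.0, 1.0]
```

The second model has the highest likelihood but pays for its two extra parameters, so the first model ends up with the best fitness.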


SAS® Code Evaluation Illustration

    proc anly-proc data = input-data-set <options>;
      model
        %do i = 1 %to &cntvars.;                        /* p = # of possible parameters */
          %if %substr(&bitstrg., &i., 1) = 1 %then %do; /* chromosome */
            &&var&i..                                   /* variable(s) */
          %end;
        %end;
        </ options>;
      <other statements>;
    run;


SAS® Code Comments

You will very likely create output data sets from the PROC (through the use of ODS statements, OUTPUT statements, or an output option on the MODEL statement) to obtain the statistics that will constitute your objective function and fitness function scores.

I actually use a modified version of my %ITERLIST macro (Tsai, WUSS 2008) to create the list of variables in the MODEL statement.


Evaluate Escape Criterion

- You need to specify a condition to escape the loop… if you want the algorithm to terminate
- Escape criteria examples:
  - Mean score for a particular generation fails to exceed any of those for a specified number of generations immediately preceding
  - Failure to surpass the best score seen so far within a specified number of generations
  - Time or resource constraints reached
  - Minimum score surpassed
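One of these criteria, failure to surpass the best score within a specified number of generations, might be checked as in this sketch (Python, for illustration only). It assumes higher scores are better and that the caller keeps a list of each generation's best score; `should_escape` and `patience` are hypothetical names.

```python
def should_escape(best_scores, patience: int) -> bool:
    """True when the last `patience` generations failed to beat the best
    score seen in all earlier generations."""
    if len(best_scores) <= patience:
        return False  # not enough history yet
    earlier_best = max(best_scores[:-patience])
    recent_best = max(best_scores[-patience:])
    return recent_best <= earlier_best


print(should_escape([1, 2, 3, 3, 3], patience=2))  # True: no gain in the last 2 generations
print(should_escape([1, 2, 3, 4, 5], patience=2))  # False: still improving
```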


Select

- Those chromosomes with superior scores are given preference in the selection for reproduction
- The method of selection is at the analyst’s discretion.
  - One popular method used in GAs is stochastic universal sampling


Stochastic Universal Sampling

- Uses a single randomly-chosen value to sample from the generation, choosing chromosomes at evenly-spaced intervals across their collective fitness score
- F = sum of the fitness scores for all chromosomes in a generation
- N = number of chromosomes to be selected for reproduction
- Selection pointers are spaced F/N apart, starting from a random value in [0, F/N)
- Source: Wikipedia, 2009
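A minimal sketch of stochastic universal sampling, in Python for illustration only. It assumes all fitness scores are non-negative with a positive sum; the function name and the optional `start` argument (handy for reproducing a run) are hypothetical.

```python
import random


def stochastic_universal_sampling(fitness, n, start=None, rng=random):
    """Return the indices of n selected chromosomes.

    fitness: non-negative fitness scores, one per chromosome
    n:       number of chromosomes to select (N)
    start:   optional first pointer in [0, F/N); drawn at random if omitted
    """
    total = sum(fitness)               # F
    step = total / n                   # F / N, the spacing between pointers
    if start is None:
        start = rng.random() * step    # the single random value
    selected = []
    index = 0
    cumulative = fitness[0]
    for k in range(n):
        pointer = start + k * step
        # Walk along the cumulative fitness until the pointer falls inside a chromosome.
        while cumulative < pointer and index < len(fitness) - 1:
            index += 1
            cumulative += fitness[index]
        selected.append(index)
    return selected


# F = 18, N = 3, so pointers are 6 apart: 4, 10, 16.
print(stochastic_universal_sampling([9.0, 8.0, 1.0], 3, start=4.0))  # [0, 1, 1]
```

A chromosome whose fitness exceeds F/N is guaranteed at least one selection, which is the main practical difference from repeated roulette-wheel draws.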


Reproduce

Apply to selected chromosomes the genetic operations of crossover and mutation.

The resulting chromosomes constitute (in part and possibly in full) a new generation.


Final Report

- Number of generations algorithm evaluated
- Mean fitness score for each generation
- Best chromosome discovered and its fitness and objective scores


Disadvantages of using a GA

- Not a built-in SAS functionality
- Many parameters to specify
  - Generation size
  - Crossover probability
  - Mutation rate
  - Objective function / Fitness function
- Time-consuming to run
- Still may not find the absolute optimum


Advantages of using a GA

- Deeper exploration of the model space.
- Allows you to remain within a familiar paradigm (regression) with interpretable parameter coefficients
- Agnostic to the regression model chosen – can use the same macro for any GLM with minor modifications
- “Proven” success in the real world


Suggested Reading

- References in paper
- Search heuristics
  - LAR and LASSO heuristics -- Robert Cohen, Peter Flom, and David Cassell
- Information criteria in model selection
  - Linear regression -- Dennis Beal
  - Logistic and proportional hazards regression -- Ernest Shtatland
  - Mixed models -- Jesse Canchola and Torsten Neilands
