GADataMining CNA
Genetic Algorithms for
Data Mining
Sid Bhattacharyya
Overview
• Genetic Algorithms: a gentle introduction
  – What are GAs
  – How do they work, and why?
  – Critical issues
• Using genetic algorithms (effectively)
• Use in data mining
Natural Genetics to AI
• Computational models inspired by biological evolution
  – survival of the fittest
  – reproduction through cross-breeding
Genetic Algorithms
• Population-based search (parallel)
  – simultaneous search from multiple points in the search space
  – population members: potential solutions
• Fitness function (search objective)
  – numerical “figure of merit”/utility measure of an individual
  – drives selection
• “Mating” and reproduction of individuals
  – crossover, mutation
• Evolution from one generation to the next
  – iterative search, convergence
Advantage GAs
• General-purpose, robust search technique
  – applicable to varied problem types
• Data mining
  – fitness function: flexible expression of modeling criteria,
    tradeoffs among multiple objectives
  – models optimized to specific business objectives
  – diverse model representations: linear and non-linear interaction
    terms, rules, sequences, etc.
GA Application Examples
• Function optimizers
  – difficult, discontinuous, multi-modal, noisy functions
• Combinatorial optimization
  – layout of VLSI circuits, factory scheduling, traveling salesman problem
• Design and control
  – bridge structures, neural networks, communication network design;
    control of chemical plants, pipelines
• Machine learning
  – classification rules, economic modeling, scheduling strategies

Portfolio design, optimized trading models, direct-marketing models,
sequencing of TV advertisements, adaptive agents, data mining, etc.
GAs: Basic Principles
• Representation of individuals
  – string of parameters (genes): the chromosome
    e.g. F(p,q,r,s,t): p q r s t
  – bit-string representation:
    1 0 0 1 1 0 1 0 1 1 0 1 1 0 0
  – genotype and phenotype
GAs: Basic Principles
• Survival of the fittest (fitness function)
  – numerical “figure of merit”/utility measure of an individual
  – tradeoff among multiple evaluation criteria
  – efficient evaluation
GAs: Basic Principles
• Reproduction to create offspring
  – Selection
  – Crossover
  – Mutation
GAs: Basic Principles
• Convergence
  – progression towards uniformity in the population
  – premature convergence? (local optima)
GA: Basic Operation

Generation t       Selection        Recombination            Generation t+1
                                    (Crossover, Mutation)
Solution1 (f1)     Solution1                                 Offspring1(1,4)
Solution2 (f2)     Solution2                                 Offspring2(1,4)
Solution3 (f3)     Solution2                                 Offspring3(2,7)
Solution4 (f4)     Solution4                                 Offspring4(2,7)
...                ...                                       ...
SolutionN (fN)     SolutionX                                 OffspringN(x,y)
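The generation-to-generation loop sketched above can be written out as a minimal GA. This is an illustrative sketch only (a toy OneMax fitness, tournament selection, single-point crossover, bit-flip mutation; all names and parameter values here are mine, not from the slides):

```python
import random

def one_max(bits):
    # Toy fitness: count of 1-genes in the chromosome (OneMax).
    return sum(bits)

def tournament(pop, fits, k=2):
    # Return the fitter of k randomly sampled individuals.
    best = random.randrange(len(pop))
    for _ in range(k - 1):
        j = random.randrange(len(pop))
        if fits[j] > fits[best]:
            best = j
    return pop[best]

def evolve(n=30, length=20, generations=40, pc=0.9, pm=0.02, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(generations):
        fits = [one_max(ind) for ind in pop]
        nxt = []
        while len(nxt) < n:
            p1, p2 = tournament(pop, fits), tournament(pop, fits)
            c1, c2 = p1[:], p2[:]
            if random.random() < pc:          # single-point crossover
                cut = random.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):            # bit-flip mutation
                for i in range(length):
                    if random.random() < pm:
                        child[i] = 1 - child[i]
            nxt += [c1, c2]
        pop = nxt[:n]
    return max(pop, key=one_max)
```

With these settings the population typically converges close to the all-ones optimum within a few dozen generations.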
GAs: Parallel Search

[Figure: fitness landscape. A hill climber follows a single trajectory from
one starting point x, while a GA searches from multiple points in parallel.]
Typical GA Run

[Figure: best and average population fitness rising over generations]
Operators: Selection
• Fitness-proportionate selection: individual i receives an expected
  fi / f̄ reproductive trials (f̄ = average population fitness)
Selection
• Roulette-wheel selection (stochastic sampling with replacement)
  – wheel slots sized in proportion to fitness values
  – N (pop. size) spins of the wheel
Selection
• Stochastic universal sampling
  – N equally spaced pins on the wheel
  – single spin of the wheel
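Stochastic universal sampling fits in a few lines; `sus` below is my illustrative implementation (one spin, N equally spaced pointers), which guarantees each individual between the floor and ceiling of its expected number of trials:

```python
import random

def sus(fitnesses, n):
    # Stochastic universal sampling: one spin, n equally spaced pointers.
    # Returns the indices of the n selected individuals (in wheel order).
    total = sum(fitnesses)          # assumes all fitnesses are positive
    step = total / n
    start = random.uniform(0, step)
    picks, cum, i = [], fitnesses[0], 0
    for k in range(n):
        pointer = start + k * step
        while cum < pointer:        # advance to the slot under the pointer
            i += 1
            cum += fitnesses[i]
        picks.append(i)
    return picks
```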
Selection
• Premature convergence
• Fitness scaling: f' = f - (2·avg. - max.)
• Ranked fitness
• Elitism
• Steady-state selection
• Demetic grouping
Operators: Crossover

Parent 1:    11010 | 101100101
Parent 2:    xxyxx | yxyyxxyxy
                   ↑ crossover site
Offspring 1: 11010 | yxyyxxyxy
Offspring 2: xxyxx | 101100101

(Single-point crossover)
• combines good building blocks
Crossover

Parent 1:    axpsqvqbtpihd
Parent 2:    qzxxaycgbtphw
             (multiple crossover sites)
Offspring 1: azpsavcbtpphd
Offspring 2: qxxxqyqgbtihw

(Uniform crossover)
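Both operators are easy to sketch on list chromosomes (illustrative helper names, not from the slides):

```python
import random

def single_point(p1, p2):
    # One crossover site; offspring swap the parents' tails.
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform(p1, p2, p_swap=0.5):
    # Each gene position is independently exchanged between the parents.
    c1, c2 = list(p1), list(p2)
    for i in range(len(c1)):
        if random.random() < p_swap:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2
```

Note that both operators only reshuffle genes between the two parents; no new gene values are introduced (that is mutation's job).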
Crossover

[Figure: fitness landscape with parents and offspring marked. Offspring fall
in the region of the search space defined by the parents.]
Operators: Mutation
• alters each gene with a small probability

  x 1 y x 0 y 0 y y 0 x y x y
  x 1 y x 0 y 1 y y 0 x x x y
Recombination operators
• Mutation & premature convergence
• Mutation vs. crossover
  – operator probabilities
  – which is more important?
• Optimal parameter settings (!)
Non-Binary Representations
• Integer, real-number, order-based, rules, ...
• Binary or real-valued?
  – real representations give faster, more consistent, more accurate results
• High-level representation
  – intuitive; can use specialized crossover and mutation
  – effective search over complex spaces
  – design of representation and operators: forma theory
Real-valued representation

Parent 1:    3.45  0.56  6.78  0.976  2.5
Parent 2:    0.98  1.06  4.20  0.34   1.8

Offspring 1: 3.22  0.56  6.78  0.65   2.12
Offspring 2: 1.43  1.06  4.20  0.41   1.93

(Arithmetic crossover)
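Arithmetic crossover blends each gene pair with a random convex combination, so offspring genes stay between the corresponding parent genes. A sketch (per-gene random weights are one common variant; the function name is mine):

```python
import random

def arithmetic_crossover(p1, p2):
    # Offspring genes are convex combinations a*x + (1-a)*y of the
    # parents' genes, so each child gene lies between the parent genes.
    c1, c2 = [], []
    for x, y in zip(p1, p2):
        a = random.random()
        c1.append(a * x + (1 - a) * y)
        c2.append(a * y + (1 - a) * x)
    return c1, c2
```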
High-level representation

Parent 1: {(1.2 ≤ x1 ≤ 3.4) ∧ (5.8 ≤ x2 ≤ 6.0) ∧ (0.2 ≤ x7 ≤ 0.61)}
Parent 2: {(2.3 ≤ x1 ≤ 4.1) ∧ (3.6 ≤ x2 ≤ 5.1) ∧ (5.1 ≤ x4 ≤ 5.6)
           ∧ (0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7)}

Offspring 1: {(1.2 ≤ x1 ≤ 3.4) ∧ (2.2 ≤ x9 ≤ 2.7) ∧ (5.1 ≤ x4 ≤ 5.6)}
Offspring 2: {(2.3 ≤ x1 ≤ 4.1) ∧ [(3.6 ≤ x2 ≤ 5.1) ∨ (5.8 ≤ x2 ≤ 6.0)]
              ∧ (0.3 ≤ x3 ≤ 1.1) ∧ (0.2 ≤ x7 ≤ 0.61)}
High-level representation
• Generalize / Specialize

  {(0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7)}
  → {(0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7) ∧ (5.1 ≤ x4 ≤ 6.2)}

  {(0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7)}
  → {(0.45 ≤ x3 ≤ 0.9) ∧ (1.9 ≤ x9 ≤ 2.9)}
Tree-structured representation (GP)

Tree 1:  (/ (* x (log y)) 5)
         → (x · log(y)) / 5

Tree 2:  (if (AND (< y 7) (> x 2)) 0 (+ (* x 2) y))
         → If (y < 7) and (x > 2) then 0, else 2x + y
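A GP system needs an interpreter for such trees. A minimal sketch, using tuples as nodes (the protected log and protected division are common GP conventions, not stated on the slides):

```python
import math

def eval_tree(node, env):
    # A node is a number, a variable name, or a tuple (op, children...).
    if isinstance(node, (int, float)):
        return node
    if isinstance(node, str):
        return env[node]
    op = node[0]
    if op == 'log':                                   # protected log
        v = eval_tree(node[1], env)
        return math.log(v) if v > 0 else 0.0
    if op == 'if':                                    # (if cond then else)
        branch = node[2] if eval_tree(node[1], env) else node[3]
        return eval_tree(branch, env)
    a, b = eval_tree(node[1], env), eval_tree(node[2], env)
    return {'+': a + b, '-': a - b, '*': a * b,
            '/': a / b if b else 1.0,                 # protected division
            '<': a < b, '>': a > b, 'and': a and b}[op]
```

The two slide trees become `('/', ('*', 'x', ('log', 'y')), 5)` and `('if', ('and', ('<', 'y', 7), ('>', 'x', 2)), 0, ('+', ('*', 2, 'x'), 'y'))`.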
Genetic search: Issues
• Coding scheme and fitness function are critical
  – the general mechanism is so robust that, within reasonable margins,
    parameter settings are not critical
  – exploiting problem-specific knowledge
  – the “art” in GA design!
Genetic search: Issues
• Stochastic search
  – multiple runs with different random streams
• Exploration vs. exploitation of the search
• Does not guarantee optimality! But ...
• Structured population models
• Parallelizable for large data
GAs and Optimization
• Search space: representation
• Global search without gradient information
  – functions with multiple local optima
  – non-differentiable functions
• Robust, assumption-free, and very general
• Hybrid approaches: GAs with conventional optimization techniques
Using GAs?
• When to use a GA?
• GAs and traditional techniques
• How long does it take?
• Will it perform better?

Using GAs
• population size
• mutation, crossover rates
• how many generations?
• multiple runs

Is it a “black box”?
• Data characteristics
• Fitness function
• GA parameters
GA Application Examples
• Function optimizers
  – difficult, discontinuous, multimodal, noisy functions
• Combinatorial optimization
  – layout of VLSI circuits, factory scheduling
• Design and control
  – bridge structures, neural networks, communication network design;
    control of chemical plants, pipelines
• Machine learning
  – classification rules, economic modeling, scheduling strategies

Portfolio design, optimized trading models, direct-marketing models,
sequencing of TV advertisements, adaptive agents, data mining, etc.
GAs and Data Mining
• Discovery
• Prediction
• Hypothesis testing and refinement
Data Mining
• Pattern templates
  ([attribute in {v1,v2}] and [attribute=value]) or
  ([attribute in {v1,v2,v3}] and [attribute>value]) or ...
• when S, if C then P
    when region = ne
    if inc > 41K and child > 2
    then x-sales > 100
• when S, C and P are positively correlated
• the mean of A when S and C is significantly different
  from the mean of A when S

[Diagram: condition set C and pattern set P inside data segment S]
Data mining
• How good are the patterns?
  – accuracy = (# cases in C and P) / (# cases in C)
  – coverage = (# cases in C and P) / (# cases in P)
  – support  = (# cases in C) / (# cases in S)
• Understandability
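These three ratios can be computed directly from the case counts. A sketch (the predicate-based interface and function name are my choices):

```python
def rule_metrics(cases, in_S, in_C, in_P):
    # in_S / in_C / in_P are predicates over a case record.
    s = [r for r in cases if in_S(r)]
    c = [r for r in s if in_C(r)]
    cp = [r for r in c if in_P(r)]
    p = [r for r in s if in_P(r)]
    accuracy = len(cp) / len(c) if c else 0.0   # |C and P| / |C|
    coverage = len(cp) / len(p) if p else 0.0   # |C and P| / |P|
    support = len(c) / len(s) if s else 0.0     # |C| / |S|
    return accuracy, coverage, support
```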
GA for Data Mining
• Fitness evaluation

• Chi-square
  – from the 2x2 contingency table of C against P, with cell counts n_ij,
    row totals r_i, column totals c_j, and total n:
      expected counts:  e_ij = r_i · c_j / n
      χ² = Σ_i Σ_j (n_ij − e_ij)² / e_ij
  – higher values imply C and P are related
  – Cramér's V = √(χ² / n)  (for a 2x2 table)

• Correlation
  – linear correlation: product-moment correlation coefficient
  – monotonically correlated: Spearman's rank correlation coefficient
  – correlation coefficient × support

• Interesting rule
  – I = s(C∧P | S) − s(C | S) · s(P | S)
    (observed joint support under S minus its expected value if C and P
    were independent)
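For a 2x2 contingency table, the chi-square statistic and Cramér's V reduce to a few lines of plain Python (a sketch; the function name is mine):

```python
def chi_square_2x2(n11, n12, n21, n22):
    # Expected counts e_ij = r_i * c_j / n, then
    # chi2 = sum over cells of (n_ij - e_ij)^2 / e_ij.
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    obs = ((n11, n12), (n21, n22))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n
            chi2 += (obs[i][j] - e) ** 2 / e
    v = (chi2 / n) ** 0.5     # Cramer's V for a 2x2 table
    return chi2, v
```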
DM application
• Symbolic models of consumer choice
  – assumption-free
  – behavioral insights for targeting promotions
  – advantages over decision-tree algorithms?
    • DTs are stepwise optimal, but not globally so
    • high noise-sensitivity of DTs
  – advantages over neural networks

  {(35K ≤ inc ≤ 40K) ∧ (age < 43)} or {(inc > 63K) ∧ (age > 55)} then Buy
Performance evaluation
• Accuracy / error rate
  – will higher accuracy give better performance for the target task?

“The use of error rate often suggests insufficiently careful thought about
the real objectives of the research”
  – David J. Hand, Construction and Assessment of Classification Rules

               Predicted
               P          N
  Actual  P    True P     False N
          N    False P    True N

• sensitivity, specificity
• misclassification costs
• Of course, with a 99:1 split in the data, a default dummy model gives
  99% accuracy.
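Sensitivity and specificity follow directly from the confusion matrix, and the 99:1 remark is exactly why they are more informative than raw accuracy. A sketch (function name is mine):

```python
def confusion_rates(tp, fn, fp, tn):
    # sensitivity: fraction of actual positives predicted positive
    # specificity: fraction of actual negatives predicted negative
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy
```

A dummy model that predicts N for everything on a 99:1 split scores 99% accuracy but zero sensitivity.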
Model Representation
• Non-linear tree-structured models (GP)
  – non-linear interaction terms
  – function set (internal nodes): {+, -, *, /, log}
  – terminal set (leaf nodes): {constants, variables}

  Example tree: (/ (* x1 (log x3)) 5)  →  (x1 · log(x3)) / 5
DM Performance: Decile Analysis

Decile   Number of   Number of   Response   Cumulative   Cum. Response   Cum. Response
         Customers   Responses   Rate (%)   Responses    Rate (%)        Lift
top      2,500       2,179       87.2       2,179        87.2            447
2        2,500       1,753       70.1       3,932        78.6            403
3        2,500         396       15.8       4,328        57.7            296
4        2,500         111        4.4       4,439        44.4            228
5        2,500         110        4.4       4,549        36.4            187
6        2,500          85        3.4       4,634        30.9            158
7        2,500          67        2.7       4,701        26.9            138
8        2,500          69        2.8       4,770        23.9            122
9        2,500          49        2.0       4,819        21.4            110
bottom   2,500          55        2.2       4,874        19.5            100
Total   25,000       4,874       19.5

Cumulative Lift (decile) = (cum. avg. performance through decile /
                            overall avg. performance) × 100
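The cumulative-lift formula can be checked against the table above (a sketch, rounding to integers as the slide does):

```python
def cumulative_lifts(responses, customers):
    # Cumulative lift at decile d:
    #   (cum. response rate through d / overall response rate) * 100
    overall = sum(responses) / sum(customers)
    lifts, cum_r, cum_n = [], 0, 0
    for r, n in zip(responses, customers):
        cum_r += r
        cum_n += n
        lifts.append(round(100 * (cum_r / cum_n) / overall))
    return lifts

# Decile-analysis table from the slide: 4,874 responses among 25,000 customers
lifts = cumulative_lifts(
    [2179, 1753, 396, 111, 110, 85, 67, 69, 49, 55], [2500] * 10)
```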
Decile Maximization (DMAX)
• Objective
    Find model f(x) (predictor variables x) such that performance in the
    upper deciles (a specified depth-of-file) is maximized
• Explicitly manages a resource constraint
  – mailings to particular depths-of-file
• Performance at different mailing depths
  – models optimized for different mailing depths

[Table sketch: deciles (top ... bottom) of responders/profit, with models
maximized to depth 2, 3, or 4]
DMAX: Illustrative Example

[Scatter plot: ten prospects (profits $1-$10) plotted on predictors X1, X2,
with the OLS boundary ($28 profit) and the DMAX 40% boundary ($32 profit)]

OLS:      .14 X1 + .06 X2
DMAX 40%: .19 X1 + .07 X2

Profit   X1   X2
$10      45    5
$9       35   21
$8       31   38
$7       30   30
$6        6   10
$5       45   37
$4       30   10
$3       23   30
$2       16   13
$1       12   30
GA DMAX
• Representation: w1 w2 w3 ... wk
• Integrated variable selection
• Fitness evaluation
  – classification accuracy
  – model reliability
  – maximize specified decile performance
    • response, profit, etc.
• Hybrid algorithm
Comparative Performance: Case I
• Response modeling
  – maximize response in top 3 deciles
  – 4.6% response to mailing

DMAX (30%):     -0.01X1 - 2.51X2 - 0.008X3 - 0.08X4
LOGIT:          -0.40 - 0.01X2 - 0.007X3 - 3.25X4
Neural network: 3 layers, 2 hidden nodes, 12 coefficients
Case I: Genetic Algorithm DMAX (30%)

Decile   Number of   Number of   Decile          Cum.            Cum.
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      4,617       865         18.7%           18.7%           411
2        4,617       382          8.3%           13.5%           296
3        4,617       290          6.3%           11.1%           244
4        4,617       128          2.8%            9.0%           198
5        4,617        97          2.1%            7.6%           167
6        4,617        81          1.8%            6.7%           146
7        4,617        79          1.7%            5.9%           130
8        4,617        72          1.6%            5.4%           118
9        4,617        67          1.5%            5.0%           109
bottom   4,617        43          0.9%            4.6%           100
TOTAL   46,170     2,104          4.6%
Case I: Cum. Response Lift Comparison

Decile   GA DMAX (30%)   Logistic Regression   Neural Network
top      411             384                   385
2        296             284                   277
3        244             227                   221
4        198             194                   186
5        167             166                   164
6        146             146                   146
7        130             131                   131
8        118             119                   118
9        109             108                   108
bottom   100             100                   100
Case II (2% Response Rate): Cum. Response Lift Comparison

Decile   GA DMAX   GA DMAX   GA DMAX   GA DMAX   Logistic
         (10%)     (20%)     (30%)     (40%)     Regression
1        220       186       191       192       194
2        174       195       166       166       165
3        157       173       179       150       148
4        148       158       158       161       154*
5        139       145       146       146       146
6        131       135       138       138       138
7        122       124       127       127       127
8        114       116       117       117       117
9        108       108       109       109       109
bottom   100       100       100       100       100
Case II (2% Response Rate), Smoothness: Logistic Regression

Decile   Number of   Number of   Decile          Cum.            Cum.
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      7,203       283         3.9%            3.9%            194
2        7,220       200         2.8%            3.3%            165
3        7,225       165         2.3%            3.0%            148
4        7,215       255*        3.5%            3.1%            154*
5        7,227       167         2.3%            3.0%            146
6        7,220       140         1.9%            2.8%            138
7        7,209        89         1.2%            2.6%            127
8        7,228        68         0.9%            2.4%            117
9        7,205        65         0.9%            2.2%            109
bottom   7,232        32         0.4%            2.0%            100
TOTAL   72,184     1,464         2.0%
Case II (2% Response Rate), Smoothness: GA DMAX (10%)

Decile   Number of   Number of   Decile          Cum.            Cum.
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      7,203       322         4.5%            4.5%            220
2        7,220       188         2.6%            3.5%            174
3        7,225       178         2.5%            3.2%            157
4        7,215       178         2.5%            3.0%            148
5        7,227       151         2.1%            2.8%            139
6        7,220       133         1.8%            2.7%            131
7        7,209       103         1.4%            2.5%            122
8        7,228        84         1.2%            2.3%            114
9        7,205        81         1.1%            2.2%            108
bottom   7,232        46         0.6%            2.0%            100
TOTAL   72,184     1,464         2.0%
Case II (2% Response Rate), Smoothness: GA DMAX (20%)

Decile   Number of   Number of   Decile          Cum.            Cum.
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      7,203       271         3.8%            3.8%            186
2        7,220       299*        4.1%            4.0%            195*
3        7,225       191         2.6%            3.5%            173
4        7,215       162         2.2%            3.2%            158
5        7,227       140         1.9%            2.9%            145
6        7,220       119         1.8%            2.7%            135
7        7,209        90         1.2%            2.5%            124
8        7,228        85         1.2%            2.3%            116
9        7,205        69         1.0%            2.2%            108
bottom   7,232        38         0.5%            2.0%            100
TOTAL   72,184     1,464         2.0%
Comparative Performance: Case III
• Profit modeling
  – maximize profit in top 2 deciles
  – mailing (profit / size):
    » Non-responder:    -$0.29 / 92.55%
    » Unpaid responder: -$5.65 /  7.10%
    » Paid responder:  +$275   /  0.35%
  – average profit per mailing: +$0.32

DMAX (20%):  -.36X1 - .23X2 + .005X3 + .24X4
LOGIT (PR):  -.01X1 - .03X2 + .322X3 + .25X4
Case IV: Profit Model, Genetic Algorithm DMAX (20%)

Decile   Number of   Percent PAID   Percent UNPAID   Decile Avg.   Cum. Avg.   Cum.
         Customers   Responders     Responders       Profit        Profit      Profit Lift
top      8,171       0.82%          10.1%             $1.43        $1.43       444
2        8,171       0.62%           8.7%             $0.96        $1.20       371
3        8,171       0.37%           8.2%             $0.28        $0.89       277
4        8,171       0.34%           8.4%             $0.20        $0.72       223
5        8,171       0.29%           5.9%             $0.20        $0.62       191
6        8,171       0.32%           7.4%             $0.19        $0.54       169
7        8,171       0.23%           4.0%             $0.13        $0.49       151
8        8,171       0.18%           4.8%            -$0.04        $0.42       130
9        8,171       0.24%           8.3%            -$0.06        $0.37       114
bottom   8,171       0.17%           4.9%            -$0.08        $0.32       100
TOTAL   81,710       0.35%           7.1%
Case IV: Profit Model, Cum. Profit Lift Comparison

Decile   GA DMAX (20%)   Logistic Regression
top      444             385
2        371             294
3        277             235
4        223             190
5        191             184
6        169             163
7        151             146
8        130             123
9        114             111
bottom   100             100
Modeling on Multiple Objectives
• Model [y1, ..., yk] = f(x)
  – simultaneously optimize on multiple objectives
• Some common DM modeling desirables
  – response and high purchase revenues
  – likely churners with high usage of services
  – high tenure and usage
  – purchase and non-return
  – cross-selling, etc.
  [or CPR (Combined Profit and Response) models]
Multiple objectives
• Traditional approaches
  – multiple single-objective models, then combine
  – weighted average of objectives
• Conflicting objectives
  – different levels of tradeoffs
• Frontier of non-dominated solutions
  – choice of final model based on diverse decision-maker objectives;
    can also be subjective
Pareto Frontier
• Non-dominated solutions
  – with multiple objectives πi, f^a(x) is better than f^b(x) if
      ∀i: πi(f^a(x)) ≥ πi(f^b(x))   and   ∃j: πj(f^a(x)) > πj(f^b(x))
• A single GA run obtains
  – the tradeoff frontier of non-dominated solutions f^k(x)

[Figure: objective space (π1, π2) showing non-dominated models on the
frontier and dominated models behind it]
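The dominance condition translates directly into a filter that extracts the non-dominated frontier from a population (maximizing all objectives; a sketch with my function names):

```python
def dominates(a, b):
    # a dominates b: no worse on every objective, strictly better on one.
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(points):
    # Keep the points no other point dominates (the Pareto frontier).
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```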
Multi-objective GA
• Pareto-based selection (Louis and Rawlins, ’93)
  – randomly select a pair of solutions from the population
  – generate two new “offspring”
  – determine the Pareto-optimal set from parents and offspring,
    and choose two solutions for the new population
• Elitism
  – retain the best solution intact in the next population
    • fosters local search around the best solution
  – retain the non-dominated set of solutions intact in the next generation
Fitness evaluation
• DMAX approach
  – fitness at a specified depth-of-file d
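One illustrative reading of a DMAX-style fitness (my sketch, not necessarily the exact function used in the study): score every case with the candidate model, then count the responders captured in the top fraction d of the file:

```python
def dmax_fitness(scores, responded, depth=0.3):
    # Rank cases by model score, keep the top `depth` fraction
    # (depth-of-file d), and count the responders captured there.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    cutoff = int(len(scores) * depth)
    return sum(responded[i] for i in order[:cutoff])
```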
Experimental Study: Data
• Cellular-phone provider seeking to identify potential high-value churners
  – two dependent variables
    • binary Churn variable
    • continuous variable measuring revenue ($)
  – predictors: minutes-of-use (peak and off-peak), average charges,
    payment information, etc.
    • obtained after EDA, normalized to 0 mean, 1 s.d.
  – 50,000 sample: 25,000 for training, 25,000 for the test set
Multiple Objectives: Performance
• Churn-Lift at depth d:
    Churn-Lift_d = (NC_d / NC) / (N_d / N) × 100
  (NC_d = churners in the top d of the file, N_d = customers there;
  NC, N = overall totals)
  – a model capturing more churners in the top deciles is better
• $-Lift at depth d:
    $-Lift_d = (NR_d / NR) / (N_d / N) × 100   (NR = revenue)
  – a model placing high-revenue customers in the upper deciles is better
• Overall modeling objective
  – maximize expected revenue saved through identification of
    high-value churners
  – Churn-Lift × $-Lift
Experimental Study
Non-dominated models: Decile 1 (Training)

[Scatter plot: Churn-Lift vs. $-Lift for GP, GA, Logistic, and OLS models]

5 independent GA runs; the sets of non-dominated solutions are aggregated.
Experimental Study
Non-dominated models: Decile 1 (Test)

[Scatter plot: Churn-Lift vs. $-Lift for GP, GA, Logistic, and OLS models]
Experimental Study
Non-dominated models: Decile 2 (Test)

[Scatter plot: Churn-Lift vs. $-Lift for GP, GA, Logistic, and OLS models]
Experimental Study
Non-dominated models: Decile 3 (Test)

[Scatter plot: Churn-Lift vs. $-Lift for GP, GA, Logistic, and OLS models]
Experimental Study
Non-dominated models: Decile 7 (Test)

[Scatter plot: Churn-Lift vs. $-Lift for GP, GA, Logistic, and OLS models]
Experimental Study: Performance Summary

                      Performance          Decile 1       Decile 2       Decile 3       Decile 7
GA-best               Churn-Lift, $-Lift   304.9, 261.7   265.4, 207.4   272.3, 155.0   138.8, 126.9
                      Product of Lifts     797.8          550.4          422.2          176.1
GP-best               Churn-Lift, $-Lift   343.7, 256.5   343.5, 182.1   275.1, 178.3   139.4, 131.2
                      Product of Lifts     881.5          625.5          490.4          182.9
Logistic Regression   Churn-Lift, $-Lift   447.1, 111.8   403.4, 72.6    295.9, 57.4    137.8, 66.7
                      Product of Lifts     499.8          292.7          169.96         91.9
OLS Regression        Churn-Lift, $-Lift   116.2, 360.5   108.1, 271.7   99.7, 223.2    91.8, 136.2
                      Product of Lifts     418.8          293.71         222.5          125.1
OLS × Logistic        Churn-Lift, $-Lift   79, 357        76, 263        74, 217        78, 136
                      Product of Lifts     282            201            160            106
General Optimization of Lifts
• Fitness function
  – seeks a general maximization of lifts at all deciles
Specific vs. General Lift Optimization

                      Performance          Decile 1       Decile 2       Decile 3       Decile 7
GA-best Lift-Opt      Churn-Lift, $-Lift   304.9, 261.7   265.4, 207.4   272.3, 155.0   138.8, 126.9
                      Product of Lifts     797.8          550.4          422.2          176.1
GA-best General-Opt   Churn-Lift, $-Lift   303.2, 261     288.3, 188.8   276.7, 151.3   138.1, 104.5
                      Product of Lifts     791.4          544.3          418.6          144.3
GP-best Lift-Opt      Churn-Lift, $-Lift   343.7, 256.5   343.5, 182.1   275.1, 178.3   139.4, 131.2
                      Product of Lifts     881.5          625.5          490.4          182.9
GP-best General-Opt   Churn-Lift, $-Lift   332, 252.5     265, 223.1     233.9, 186.5   132.3, 133.1
                      Product of Lifts     838.3          591.2          436.2          176.1

Table: Best Product-of-Lifts by decile
Specific vs. General Lift Optimization

                      Decile 1             Decile 2             Decile 3             Decile 7
Performance           $-Lift   Churn-Lift  $-Lift   Churn-Lift  $-Lift   Churn-Lift  $-Lift   Churn-Lift
GA-best Lift-Opt      361.4    464.7       271.6    401.3       223.9    309.8       136.6    139.5
GA-best General-Opt   361.7    421         273.3    398.1       223.9    304.1       136.6    138.4
GP-best Lift-Opt      372.7    475.2       276.5    417.9       226.1    310.3       137.2    139.8
GP-best General-Opt   372.1    421.3       276.8    378.3       226.6    296.7       137.1    139.8

Table: Best $-Lift and Churn-Lift by decile
Case Study – “EC challenge”: EDA, Variable Selection
• Problem
  – 15,178 obs., 79 variables, “response” dependent
  – seeking maximum lift in the top decile
• Logistic regression model
  – 15 variables, after EDA and transformation (this is the hard part!)
    (many of them combinations of multiple vars.)
  – lift of 126 in the top decile
• EC approach
  – include all variables
  – explore simple “terms”: non-linear GP models
    • small populations, looking for robust terms
  – final model(s) using the obtained terms

Case Study – “EC challenge” (contd.)
• Various 2-5 variable terms show some predictability
  – lifts ranging in 122-127
• Models on these terms
  – non-linear and linear models: lifts in 126-132
• Examples
  – 3 tan(HC211) + EC31                                      Trg: 122.5  Test: 122.5
  – (OCC81 - log10(ORDTERM1/IC191)) * STATE2 * HHAS21        Trg: 124.9  Test: 126.4
  – STATE2 * HHAS21                                          Trg: 121.3  Test: 121.3
  – (OCC81 - log10(B)) * B * (A + B + ORDTERM1*(A + B))      Trg: 131.5  Test: 126.9
      where A = (STATE2 - SECGENDE) and B = STATE2*HHAS21
  – B + tan(2B + HHAS21) + EC31
    + ORDTERM1*(B + tan[B + HHAS21 + (HHAS21*HV31)/2.1])     Trg: 131.1  Test: 127.8
  – AB^3 (1 + OCC81) + AB(OCC81) + 2DEB(OCC81)^2             Trg: 134.4  Test: 131.6
  – 4A + B + C + 2D + E + 2*OCC81  (10 vars. total)          Trg: 132.5  Test: 131.7