YONSEI UNIVERSITY
Computer Science

Database System: Term Project Final Presentation

Efficient huge-scale feature selection using Speciated Genetic Algorithm and Bayesian Network

Team: 9C
Members:
Ja-Min Koo, Head (M.S. 3rd) icicle@sclab.yonsei.ac.kr
Ji-Oh Yoo (M.S. 1st) taiji391@sclab.yonsei.ac.kr
Jin-Hyuk Hong (Ph.D 1st) hjinh@sclab.yonsei.ac.kr
Gsum-Sung Hwang (Ph.D 1st) yellowg@sclab.yonsei.ac.kr

Soft Computing Lab., Computer Science, Yonsei University
Presentation date: June 10th, 2004
Presentation by Ja-Min Koo
Softcomputing Lab. 1/20
Agenda

Motivation
Related Work
– Classification & feature selection
– Conventional feature selection with genetic algorithm
Proposed Method
– Overview
– Genetic algorithm with speciation
– Modified representation for huge-scale feature selection
– Bayesian network learning and classification
Experiments
– Experimental environment
– Experimental results
Conclusion
Motivation

Classification issue
– One of the popular data mining problems in databases.
Selecting informative features
– An NP-complete problem.
Huge-scale features
– Web usage profiles and gene expression data: over 1,000 features!
In this project…
– Genetic algorithm + Bayesian classifier.
Related Work

Classification

Classification – a 2-step process
1. Build a model that describes the data classes or concepts.
2. Classify new samples and evaluate the accuracy.

Classifiers
– Decision trees, neural networks, SVMs, kNN, Bayesian classifier, etc.
– The Bayesian classifier is based on Bayes' theorem:

  p(h|x) = p(x|h) p(h) / p(x)

– Based on class conditional independence and a graphical probability model.
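The Bayes rule above can be turned into a minimal classifier; the sketch below is illustrative (not the project's implementation), assuming binary features and Laplace smoothing:

```python
def train_nb(samples, labels):
    """Estimate p(h) and p(x_f = 1 | h) with Laplace smoothing (binary features)."""
    classes = set(labels)
    prior = {h: labels.count(h) / len(labels) for h in classes}
    n_feat = len(samples[0])
    cond = {}
    for h in classes:
        rows = [x for x, y in zip(samples, labels) if y == h]
        cond[h] = [(sum(r[f] for r in rows) + 1) / (len(rows) + 2)
                   for f in range(n_feat)]
    return prior, cond

def classify_nb(x, prior, cond):
    """Pick h maximizing p(h) * prod_f p(x_f | h); p(x) is the same for all h."""
    def score(h):
        s = prior[h]
        for f, v in enumerate(x):
            s *= cond[h][f] if v else 1 - cond[h][f]
        return s
    return max(prior, key=score)

prior, cond = train_nb([[1, 1], [1, 0], [0, 1], [0, 0]], [1, 1, 0, 0])
print(classify_nb([1, 1], prior, cond))  # → 1
```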
Related Work

Feature selection issue

Feature selection
– Huge-scale feature sets may contain irrelevant or redundant features.
– Problems
  Need to ensure the statistical variability between patterns from different classes.
  Such features mislead learning algorithms or overfit the data.
  The model becomes more complex.
– Therefore, feature selection that extracts informative features is necessary!
Related Work

Conventional feature selection with GA

Previous work
– Goal: maximize the distribution of distance between groups.
– Evaluates each feature in one dimension, so it may lose crucial information from combinations of features.

GA wrapper method
– The number of possible feature subsets of n_f features is

  n_c = Σ_{k=1}^{n_f} C(n_f, k) = 2^{n_f} − 1  ▶ Too large!!

– It is almost impossible to evaluate them all.

[Figure: GA procedure – selection → crossover → mutation → evaluation]
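A quick sketch checking the subset-count identity above and its magnitude for data of this scale:

```python
from math import comb

def num_subsets(n_f: int) -> int:
    """Number of non-empty feature subsets: sum_{k=1}^{n_f} C(n_f, k) = 2**n_f - 1."""
    return sum(comb(n_f, k) for k in range(1, n_f + 1))

assert num_subsets(10) == 2**10 - 1  # the identity holds
# For the 7,129-feature Leukemia data the count has over 2,000 decimal digits:
print(len(str(2**7129 - 1)))
```

Exhaustive evaluation is therefore hopeless, which is why a GA searches the subset space instead.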
Proposed Method

Overview

SGABN
– Speciated Genetic Algorithm and Bayesian Network
Proposed Method

Genetic algorithm with speciation

Speciation technique
– Generates multiple species within the population of an evolutionary method.
– Method: restrict an individual to mate only with similar ones, while other approaches manipulate its fitness using niching pressure to control selection.
– Advantages
  Avoids "genetic drift".
  Maintains a wide search landscape.

We use explicit fitness sharing.
– sf_i : shared fitness, f_i : fitness
– d_ij : distance between individuals i and j
– σ_s : sharing radius

  sf_i = f_i / m_i
  m_i = Σ_{j=1}^{N} sh(d_ij)
  sh(d_ij) = 1 − (d_ij / σ_s)^α,  for 0 ≤ d_ij < σ_s
           = 0,                   for d_ij ≥ σ_s
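A minimal sketch of the explicit fitness sharing formulas above, using Hamming distance between bit-string genotypes (σ_s and α here are illustrative values, not the project's settings):

```python
def sharing(d, sigma_s, alpha=1.0):
    """sh(d) = 1 - (d / sigma_s)**alpha for 0 <= d < sigma_s, else 0."""
    return 1.0 - (d / sigma_s) ** alpha if d < sigma_s else 0.0

def shared_fitness(pop, fitness, sigma_s, alpha=1.0):
    """sf_i = f_i / m_i, where m_i = sum_j sh(d_ij) over the whole population."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    sf = []
    for i, ind in enumerate(pop):
        m_i = sum(sharing(hamming(ind, other), sigma_s, alpha) for other in pop)
        sf.append(fitness[i] / m_i)  # m_i >= 1 because sh(d_ii) = sh(0) = 1
    return sf

# Two identical individuals split their fitness; an isolated one keeps it whole.
pop = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
print(shared_fitness(pop, [1.0, 1.0, 1.0], sigma_s=2))  # → [0.5, 0.5, 1.0]
```

This is the mechanism that keeps multiple species alive: crowded niches are penalized, so selection pressure spreads across the search landscape.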
Proposed Method

Modified representation for huge-scale feature selection

The datasets in this project consist of thousands of features.
– It is hard to apply the conventional representation.

New approach
– Modify the representation of the chromosome: each gene holds the index of a selected feature.
– In this project, n_s is set to 25, and 13-bit and 12-bit indices are used to represent the 7,129 features of the Leukemia data and the 4,026 features of the Lymphoma cancer data, respectively.
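The index-based representation above can be sketched as follows; the decode step is an assumption about how duplicate indices would be handled, not taken from the slides:

```python
import random

N_S = 25          # genes per chromosome (from the slide)
BITS = 13         # a 13-bit index covers the 7,129 Leukemia features (2**13 = 8192)
N_FEATURES = 7129

def random_chromosome():
    """A chromosome is n_s feature indices instead of a 7,129-bit mask."""
    return [random.randrange(N_FEATURES) for _ in range(N_S)]

def decode(chrom):
    """Feature subset encoded by the chromosome (duplicate indices collapse)."""
    return sorted(set(chrom))

chrom = random_chromosome()
print(len(decode(chrom)), "features selected; genotype length:", N_S * BITS, "bits")
```

The payoff is size: 25 × 13 = 325 bits per individual versus a 7,129-bit mask, and the selected subset can never exceed n_s features.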
Proposed Method

BN learning and classification

Bayesian network learning
– Training data: the feature subsets selected by the GA.
– Structure learning: the K2 algorithm.
– The accuracy on the training data is used as the GA fitness function.

K2 algorithm
– A greedy heuristic search algorithm.
– Given a database D, it searches for the BN structure G with maximal Pr(G, D).

Noisy-OR gate
– Describes the interaction between n causes X1, X2, …, Xn and their common effect Y.
– Binary noisy-OR gate
  For Y's CPD, define p1, p2, …, pn, where p_i is the probability that Y is true when only the cause X_i is true and all the others are false.
  The probability of y given a subset X_p of the X_i's is then

  P(y | X_p) = 1 − Π_{i : X_i ∈ X_p} (1 − p_i)
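The binary noisy-OR CPD above is compact enough to compute directly; the p_i values in this sketch are illustrative:

```python
def noisy_or(p, active):
    """P(Y = true | the causes in `active` are true) = 1 - prod_i (1 - p_i)."""
    prob_all_fail = 1.0
    for i in active:
        prob_all_fail *= 1.0 - p[i]  # each active cause independently fails
    return 1.0 - prob_all_fail

# Illustrative link probabilities p_i for three causes
p = [0.9, 0.5, 0.2]
print(noisy_or(p, [0]))     # single active cause: P(y) = p_0
print(noisy_or(p, [1, 2]))  # 1 - (1 - 0.5)(1 - 0.2) = 0.6
```

This is why the noisy-OR gate is attractive here: the CPD needs only n parameters instead of the 2^n entries of a full conditional probability table.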
Experiments

Experimental environment (1)

Experimental domains
– Leukemia data (Lin et al., 2001)
  # of features: 7,129
  # of samples: 72 (training: 38, testing: 34)
  Classes
  – ALL patients (47 samples) are labeled as class 0.
  – AML patients (25 samples) are labeled as class 1.
– Lymphoma data (Lossos et al., 2000)
  # of features: 4,026
  # of samples: 47 (training: 22, testing: 25)
  Classes
  – GC B-like (24 samples) are labeled as class 0.
  – Activated B-like (23 samples) are labeled as class 1.
Experiments

Experimental environment (2)

Parameters of the genetic operators
– Population size: 20
– Selection rate: 0.8
– Crossover rate: 0.8
– Mutation rate: 0.005
– Elitism is used.

Parameters of the neural network
– Learning rate: 0.3 / 0.1
– Momentum: 0.5 / 0.8
– Maximum iterations: 500
– Minimum error: 0.02
– # of hidden nodes: 2 / 5

[Figure: GA procedure (selection → crossover → mutation → evaluation) and binary chromosome encoding]
Experiments

Experimental results: Number of features used

[Figure: number of features used vs. generation (0–100) for sGA and SGABN on the Lymphoma (left) and Leukemia (right) data; SGABN converges to 25 features, while sGA remains near 2,000 (Lymphoma) and 3,500 (Leukemia)]
Experiments

Experimental results: Bayesian network

Result table

Selected features
– Simple GA: 14 features are selected → 0.638889
  507 651 35 1999 274 154 1044 1277 522 1221 875 105 624 320
– Speciated GA: 13 features are selected → 0.86
  448 1999 235 743 1834 1575 634 1809 879 1132 380 1247 1556
Experimental results: GA comparison

[Figure: fitness vs. generation (1–301) on the Lymphoma data for the simple GA, the speciated GA, and the no-index sGA; fitness axis ranges from 0.55 to 1]
Experiments

Experimental results: Neural network (1)

Leukemia
Measure | sGA | GANN | SGANN
Average features used | 3569 | 25 | 25
Average training error rate (std) | 0.342 (0.02) | 0.210 (0.03) | 0.236 (0.03)
Average training error rate of the best solution (std) | 0.211 (0.06) | 0.008 (0.07) | 0.026 (0.08)
Generation for finding 3 solutions | Not found in 100 generations | Not found in 100 generations | 86 generations
Discovered solutions in last 10 generations (train error rate ≤ 1/38) | 0 | 1 | 5

Lymphoma
Measure | sGA | GANN | SGANN
Average features used | 2009 | 25 | 25
Average training error rate (std) | 0.637 (0.02) | 0.266 (0.02) | 0.269 (0.02)
Average training error rate of the best solution (std) | 0.3217 (0.06) | 0.002 (0.09) | 0.002 (0.09)
Generation for finding 5 solutions | Not found in 100 generations | 16 generations | 15 generations
Discovered solutions in last 5 generations (train error rate = 0) | 0 | 8 | 7
Experiments

Experimental results: Neural network (2)

Leukemia
Measure | sGA | GANN | SGANN
Input node # | 3569 | 25 | 25
Average processing time (10 generations) | 42,359 sec | 559 sec | 590 sec
Average test error rate (std) | 0.3006 (0.03) | 0.3401 (0.03) | 0.2252 (0.09)

Lymphoma
Measure | sGA | GANN | SGANN
Input node # | 2009 | 25 | 25
Average processing time (10 generations) | 34,390 sec | 502 sec | 545 sec
Average test error rate (std) | 0.4692 (0.07) | 0.2662 (0.10) | 0.2247 (0.08)

[Figure: test accuracy (0–80%) on the Lymphoma data for the feature selection methods PC, SC, ED, CC, MI, sGA, GANN, and SGANN, each combined with five classifiers: neural network, SASOM, SVM (linear kernel), SVM (RBF kernel), and kNN (cosine)]
Conclusion

Huge-scale feature data
– Search for diverse solutions using the speciated GA.
– Improve the classification performance.
– Easy interpretability through the Bayesian network.

Future work
– Some problems remain with the BN classifier; the remaining results will be included in the final report.
– Cross-validation: too few samples.
Any Questions?

Thank You~!!