Feature selection concepts and methods
Feature Selection
Concepts and Methods
Electronic & Computer Department
Isfahan University Of Technology
Reza Ramezani
What are Features?
Features are the attributes whose values make up an instance.
With features we can identify instances.
Features are the determinant values that indicate which class an instance belongs to.
Classifying Features
Relevant: features that have an influence on the output and whose role cannot be assumed by the rest.
Irrelevant: features that do not have any influence on the output, and whose values could be generated at random for each example.
Redundant: a redundancy exists whenever a feature can take the role of another.
What is Feature Selection?
Feature selection is a preprocessing step for machine learning that chooses a subset of the original features according to a certain evaluation criterion. It is effective in:
removing or reducing the effect of irrelevant data,
removing redundant data,
reducing dimensionality (binary model),
increasing learning accuracy, and
improving result comprehensibility.
Other Definitions
A process which selects a subset of features, defined by one of three approaches:
1) the subset of a specified size that optimizes an evaluation measure
2) the subset of smallest size that satisfies a certain restriction on the evaluation measure
3) the subset with the best compromise between its size and the value of its evaluation measure (general case).
Feature Selection Algorithm (FSA)
An FSA is a computational solution that is motivated by a certain definition of relevance.
1) The relevance of a feature may have several definitions depending on the objective that is sought.
2) Find a compromise between minimizing and maximizing (general case).
3) An irrelevant feature is not useful for induction, but not all relevant features are necessarily useful for induction.
Classifying FSAs
FSAs can be classified according to the kind of output they yield:
1) Algorithms that give a weighted linear order of the features (continuous feature selection problem).
2) Algorithms that give a subset of the original features (binary feature selection problem).
Note that both types can be seen in a unified way by noting that in (2) the weighting is binary.
Notation
X = feature set
X' = feature subset
xi = a feature
E = space of instances
p = probability distribution on E
W = space of labels (e.g. classes)
c = objective function c: E → W that labels each instance according to its relevant features (the classifier)
S = data set (training set)
Relevance of a Feature
The purpose of an FSA is to identify relevant features according to a definition of relevance.
Unfortunately, the notion of relevance in machine learning has not yet been rigorously defined by common agreement.
Let us define relevance from several aspects:
Relevance with respect to an objective
A feature xi is relevant to an objective function c if there exist two examples A and B in the instance space E such that A and B differ only in their assignment to xi and c(A) ≠ c(B).
Strong relevance with respect to S
A feature xi is strongly relevant to the sample S if there exist two examples A and B in S that differ only in their assignment to xi and c(A) ≠ c(B).
That is to say, it is the same as the last definition, but now A, B ∈ S: the definition is with respect to the sample S.
Tid | Refund | Marital Status | Taxable Income | Cheat
--- | ------ | -------------- | -------------- | -----
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)
[Decision tree: MarSt: Married → NO; Single or Divorced → Refund: Yes → NO; No → TaxInc: < 80K → NO, > 80K → YES]
Strong relevance with respect to p
A feature xi is strongly relevant to an objective c in the distribution p if there exist two examples A and B with p(A) > 0 and p(B) > 0 that differ only in their assignment to xi and c(A) ≠ c(B).
This definition is the natural extension of the last one but, contrary to it, the distribution is assumed to be known.
Weak relevance with respect to S
A feature xi is weakly relevant to the sample S if there exists at least a proper subset X' of X containing xi for which xi is strongly relevant with respect to S.
A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.
Weak relevance with respect to p
A feature xi is weakly relevant to the objective c in the distribution p if there exists at least a proper subset X' of X containing xi for which xi is strongly relevant with respect to p.
These five definitions are important to decide which features should be conserved and which can be eliminated.
Strongly Relevant Features
The strongly relevant features are, in theory, important to maintain the structure of the domain,
and they should be conserved by any feature selection algorithm in order to avoid adding ambiguity to the sample.
Weakly Relevant Features
Weakly relevant features could be important or not, depending on:
The other features already selected.
The evaluation measure that has been chosen (accuracy, simplicity, consistency, etc.).
Relevance as a complexity measure
Define r(S, c) as the smallest number of features relevant to c such that the error in S is the least possible for the inducer.
In other words, it is the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.
Incremental usefulness
Given a data sample S, a learning algorithm L, and a subset of features X', the feature xi is incrementally useful to L with respect to X' if the accuracy reached by L using the group of features X' ∪ {xi} is better than the accuracy reached using only the subset X'.
Example (features X1 … X30)
X1…………X11…………X21…………X30
100000000000000000000000000000  +
111111111100000000000000000000  +
000000000011111111110000000000  +
000000000000000000001111111111  +
000000000000000000000000000000  –
X1 is strongly relevant; the rest are weakly relevant.
r(S, c) = 3.
Incremental usefulness: after choosing {X1, X2}, none of X3 … X10 would be incrementally useful, but any of X11 … X30 would.
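The strong-relevance claim above can be checked mechanically against the definition. A minimal sketch in Python (pure standard library; the function name is ours) that scans all instance pairs for two examples differing in exactly one feature but labelled differently:

```python
from itertools import combinations

# The slide's sample S: 30 binary features per instance, plus a class label.
rows = [
    ("100000000000000000000000000000", "+"),
    ("111111111100000000000000000000", "+"),
    ("000000000011111111110000000000", "+"),
    ("000000000000000000001111111111", "+"),
    ("000000000000000000000000000000", "-"),
]
S = [([int(b) for b in bits], label) for bits, label in rows]

def strongly_relevant_features(sample):
    """Indices i such that two instances in the sample differ ONLY in
    feature i and belong to different classes (strong relevance w.r.t. S)."""
    relevant = set()
    for (a, ca), (b, cb) in combinations(sample, 2):
        if ca == cb:
            continue
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diff) == 1:
            relevant.add(diff[0])
    return sorted(relevant)
```

On this sample only index 0 (X1) comes out strongly relevant, matching the slide.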
General Schemes for Feature Selection
The relationship between an FSA and the inducer. The inducer is:
• the process chosen to evaluate the usefulness of the features, and
• the learning process itself.
Filter Scheme
Wrapper Scheme
Embedded Scheme
Filter Scheme
The feature selection process takes place before the induction step.
This scheme is independent of the induction algorithm.
• High speed
• Low accuracy
Wrapper Scheme
Uses the learning algorithm as a subroutine to evaluate the feature subsets.
The inducer must be known.
• Low speed
• High accuracy
Embedded Scheme
Similar to the wrapper approach:
features are specifically selected for a certain inducer, and
the inducer selects the features in the process of learning (explicitly or implicitly).
Embedded Scheme Example

Refund | Marital Status | Taxable Income | Age | Cheat
------ | -------------- | -------------- | --- | -----
Yes    | Single         | 125K           | 18  | No
No     | Married        | 100K           | 30  | No
No     | Single         | 70K            | 28  | No
Yes    | Married        | 120K           | 19  | No
No     | Divorced       | 95K            | 18  | Yes
No     | Married        | 60K            | 20  | No
Yes    | Divorced       | 220K           | 25  | No
No     | Single         | 85K            | 30  | Yes
No     | Married        | 75K            | 20  | No
No     | Single         | 90K            | 18  | Yes

(Refund and Marital Status are categorical; Taxable Income and Age are continuous; Cheat is the class.)
[Decision tree: MarSt: Married → NO; Single or Divorced → Refund: Yes → NO; No → TaxInc: < 80K → NO, > 80K → YES]
The decision tree maker algorithm will automatically remove the 'Age' feature.
Characterization of FSAs
Search Organization: General strategy with which the space of hypothesis is explored.
Generation of Successors: Mechanism by which possible successor candidates of the current state are proposed.
Evaluation Measure: Function by which successor candidates are evaluated.
Types of Search Organization
We consider three types of search:
Exponential
Sequential
Random
Exponential Search
Algorithms that carry out searches with cost O(2^|X|).
The best solution is guaranteed.
The exhaustive search is an optimal search, but an optimal search need not be exhaustive:
• Branch and Bound, for a monotonic evaluation measure;
• a search with an admissible heuristic (e.g. A*).
A measure J is monotonic if for any two subsets X1 ⊆ X2, J(X1) ≤ J(X2).
Sequential Search
This strategy selects one among all the successors of the current state.
Once the state is selected, it is not possible to go back.
The number of such steps must be limited by the number of features |X|.
If k is the number of subsets evaluated in each state change, the cost of this search is therefore polynomial (on the order of k·|X| evaluations).
These methods do not guarantee an optimal result.
Random Search
Uses randomness to prevent the algorithm from getting stuck in a local minimum.
Allows temporarily moving to other states with worse solutions.
These are anytime algorithms.
They can give several optimal subsets as a solution.
Types of Successor Generation
Forward
Backward
Compound
Weighting
Random
Forward Successor Generation
Starts with X' = ∅.
Adds features to the current solution X', among those that have not been selected yet.
In each step, the feature that makes the evaluation measure J(X') greatest is added to the solution.
The cost of the operator is O(|X|).
Backward Successor Generation
Starts with X' = X.
Removes features from the current solution X', among those that have not been removed yet.
In each step, the feature whose removal makes J greatest is removed from the solution.
The cost of the operator is O(|X|).
Forward and Backward Methods: Stopping Criteria
The size |X'| reaches a value k that has been fixed in advance.
The value of J has not increased in the last k steps.
The value of J has surpassed a prefixed threshold.
In practice the backward method demands more computation than its forward counterpart.
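As an illustration of the forward method together with the "J did not increase" stopping criterion, here is a small sketch. It is not from the slides: the wrapper-style measure J (leave-one-out accuracy of a 1-nearest-neighbour classifier under Hamming distance) and the toy data are our own assumptions.

```python
def loo_1nn_accuracy(data, feats):
    """J(X'): leave-one-out accuracy of 1-NN using only the features
    in `feats` (Hamming distance on a list of (vector, label) pairs)."""
    if not feats:
        return 0.0
    correct = 0
    for i, (x, y) in enumerate(data):
        best = None
        for j, (x2, y2) in enumerate(data):
            if i == j:
                continue
            d = sum(x[f] != x2[f] for f in feats)
            if best is None or d < best[0]:
                best = (d, y2)
        correct += best[1] == y
    return correct / len(data)

def forward_selection(data, n_features, J):
    """Greedy forward search: start from the empty subset, add the
    feature that increases J most, stop when no addition improves J."""
    selected, best_j = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        j_new, f_new = max((J(data, selected + [f]), f) for f in candidates)
        if j_new <= best_j:        # stopping criterion: J did not increase
            break
        selected.append(f_new)
        best_j = j_new
    return selected, best_j

# Toy data: feature 0 determines the class, feature 1 is noise,
# feature 2 duplicates feature 0 (redundant).
data = [([1, 0, 1], 1), ([1, 1, 1], 1), ([0, 0, 0], 0), ([0, 1, 0], 0)]
selected, score = forward_selection(data, 3, loo_1nn_accuracy)
```

The search stops after one step with perfect accuracy, picking one of the two redundant informative features and never the noise feature.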
Compound Successor Generation
Apply f consecutive forward steps and b consecutive backward ones.
If f > b the net result is a forward operator; otherwise it is a backward one.
This method allows discovering new interactions among features.
Other stopping conditions should be established if f = b.
In a sequential FSA, the condition f ≠ b assures a maximum number of steps and thus a polynomial total cost.
Weighting Successor Generation
Weighting operators deal with continuous feature weights.
All of the features are present in the solution to a certain degree.
A successor state is a state with a different weighting.
This is typically done by iteratively sampling the available set of instances.
Random Successor Generation
Includes those operators that can potentially generate any other state in a single step.
Restricted to some criterion of advance:
• in the number of features, or
• in improving the measure J at each step.
Evaluation Measures
• Probability of Error
• Divergence
• Dependence
• Interclass Distance
• Information or Uncertainty
• Consistency
The relative values assigned to different subsets reflect their greater or lesser relevance to the objective function.
Let J(X') be an evaluation measure to be maximized, where X' is a (weighted) feature subset.
Evaluation Measures: Probability of Error
The ultimate goal is to build a classifier minimizing the (Bayesian) probability of error, so the error rate of the classifier seems to be the most natural choice of evaluation measure.
Evaluation Measures: Probability of Error
Since the class-conditional densities are usually unknown, they can either be explicitly modeled (using parametric or non-parametric methods) or implicitly estimated.
Evaluation Measures: Probability of Error
Provided the classifier has been built using only a subset X' of the features, we have:
St is a test data sample, and Sc is the subset of St where the classifier performed correctly.
Finally we have the test-set accuracy: J(X') = |Sc| / |St|.
Evaluation Measures: Divergence
These measures compute a probabilistic distance or divergence among the class-conditional probability densities p(x|ωi), using a general formula of the form J = ∫ f(p(x|ω1), p(x|ω2)) dx.
Evaluation Measures: Divergence
For a valid measure, the function f must be such that the value of J satisfies the following conditions:
1) J = 0 only when the class-conditional densities are equal;
2) J is maximum when the densities are non-overlapping.
If the features used in a solution are good ones, the divergence will be significant.
Divergence: some classical choices are, for example, the Kullback–Leibler divergence and the Bhattacharyya distance.
Evaluation Measures: Dependence
These measures quantify how strongly two features are associated with one another:
knowing the value of one feature, it is possible to predict the value of the other.
The correlation coefficient is a classical measure that is still used for these methods.
Evaluation Measures: Interclass Distance
These measures are based on the assumption that instances of different classes are distant in the instance space.
A typical measure averages the distances between instances of different classes, with x(i)j being the j-th instance of class i, and ni the number of instances of class i.
The most usual distances belong to the Euclidean family.
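A sketch of an interclass-distance measure (Euclidean distance averaged over all pairs of instances from different classes; the exact weighting in the slide's lost formula is unknown, so plain averaging and the toy data are our assumptions):

```python
from itertools import product
from math import dist

def interclass_distance(class_a, class_b, feats):
    """Mean Euclidean distance, over the selected features, between
    every pair of instances taken from two different classes."""
    pairs = list(product(class_a, class_b))
    total = sum(dist([a[f] for f in feats], [b[f] for f in feats])
                for a, b in pairs)
    return total / len(pairs)

# Toy data: feature 0 separates the classes, feature 1 is noise.
class_a = [(0.0, 0.0), (0.0, 1.0)]
class_b = [(10.0, 0.0), (10.0, 1.0)]
```

Here the discriminating feature yields a far larger distance (10.0) than the noise feature (0.5), which is exactly what the measure rewards.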
Evaluation Measures: Consistency
An inconsistency in X' and S is defined as two instances in S that are equal when considering only the features in X' but that belong to different classes.
The aim is thus to find the minimum subset of features leading to zero inconsistencies.
Evaluation Measures: Consistency
The inconsistency count of an instance A is defined as IC(A) = n(A) − max_k n_k(A), where
n(A) is the number of instances in S equal to A using only the features in X', and
n_k(A) is the number of instances in S of class k equal to A using only the features in X'.
Evaluation Measures: Consistency
The inconsistency rate of a feature subset X' in a sample S is:
IR(X') = Σ_A IC(A) / |S|, summing over the distinct instances A.
This measure lies in [0, 1] and must be minimized.
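The inconsistency rate can be computed directly from the definitions above; a minimal sketch (the data and names are ours):

```python
from collections import Counter, defaultdict

def inconsistency_rate(data, feats):
    """IR(X'): group instances by their projection onto `feats`; for each
    pattern, IC = group size minus the largest single-class count; the
    rate is the sum of the ICs divided by |S|."""
    groups = defaultdict(Counter)
    for x, y in data:
        groups[tuple(x[f] for f in feats)][y] += 1
    total_ic = sum(sum(cnt.values()) - max(cnt.values())
                   for cnt in groups.values())
    return total_ic / len(data)

# Feature 0 is consistent with the class; feature 1 alone is not.
data = [((0, 0), "c0"), ((0, 1), "c0"), ((1, 0), "c1"), ((1, 1), "c1")]
```

Projecting onto feature 0 gives zero inconsistencies, while feature 1 alone leaves every instance in a mixed-class group.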
General Algorithm for Feature Selection
All FSAs can be represented in a space of characteristics according to the criteria of:
• search organization (Org),
• generation of successor states (GS), and
• evaluation measure (J).
This space <Org, GS, J> encompasses the whole spectrum of possibilities for an FSA.
An FSA is hybrid when it requires more than one point in the same coordinate to be characterized.
FCBF: Fast Correlation-Based Filter
(Filter Mode)
<Sequential, Compound, Information>
Previous Works and Their Defects
1) Huge time complexity.
Binary mode: subset search algorithms search through candidate feature subsets guided by a certain search strategy and an evaluation measure.
Different search strategies, namely exhaustive, heuristic, and random search, are combined with this evaluation measure to form different algorithms.
Previous Works and Their Defects
The time complexity is exponential in the data dimensionality for exhaustive search and quadratic for heuristic search.
The complexity can be linear in the number of iterations for a random search, but experiments show that in order to find the best feature subset, the number of iterations required is usually at least quadratic in the number of features.
Previous Works and Their Defects
2) Inability to recognize redundant features.
Relief: the key idea of Relief is to estimate the relevance of features according to how well their values distinguish between instances of the same and different classes that are near each other.
Relief randomly samples a number (m) of instances from the training set and updates the relevance estimate of each feature based on the difference between the selected instance and the two nearest instances of the same and opposite classes.
Previous Works and Their Defects
The time complexity of Relief for a data set with M instances and N features is O(mMN).
With m being a constant, the time complexity becomes O(MN), which makes it very scalable to data sets with both a huge number of instances and a very high dimensionality.
However, Relief does not help with removing redundant features.
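A deterministic sketch of Relief's weight update. For clarity we sweep every instance instead of sampling m at random and use Hamming difference on nominal features; both simplifications, and the toy data, are our own:

```python
def relief_weights(data, n_features):
    """For each instance, find its nearest hit (same class) and nearest
    miss (other class); decrease a feature's weight by its difference
    to the hit and increase it by its difference to the miss."""
    m = len(data)
    w = [0.0] * n_features

    def nearest(x, pool):
        return min(pool, key=lambda z: sum(a != b for a, b in zip(x, z)))

    for x, y in data:
        hits = [x2 for x2, y2 in data if y2 == y and x2 != x]
        misses = [x2 for x2, y2 in data if y2 != y]
        hit, miss = nearest(x, hits), nearest(x, misses)
        for f in range(n_features):
            w[f] -= (x[f] != hit[f]) / m
            w[f] += (x[f] != miss[f]) / m
    return w

# Feature 0 equals the class; feature 1 is noise.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
weights = relief_weights(data, 2)
```

The relevant feature ends with a strongly positive weight and the noise feature with a negative one. Note that a second, redundant copy of feature 0 would receive the same high weight, illustrating why Relief cannot remove redundancy.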
Good Feature
A feature is good if it is relevant to the class concept but is not redundant with any of the other relevant features.
Correlation as a goodness measure:
a feature is good if it is highly correlated with the class but not highly correlated with any of the other features.
Approaches to Measure The Correlation
Classical Linear Correlation (Linear Correlation Coefficient)
Information theory (Entropy or Uncertainty)
Linear Correlation Coefficient
For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula:
r = Σi (xi − x̄)(yi − ȳ) / sqrt(Σi (xi − x̄)² · Σi (yi − ȳ)²)
If X and Y are completely correlated, r takes the value 1 or −1.
If X and Y are totally independent, r is zero.
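The coefficient above translates directly into code; a minimal sketch (sample data in the test are ours):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Linear correlation coefficient r between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```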
Advantages
• It helps to remove features with near-zero linear correlation to the class.
• It helps to reduce redundancy among the selected features.
Disadvantages
• It may not be able to capture correlations that are not linear in nature.
• The calculation requires all features to contain numerical values.
Entropy
The entropy of a variable (feature) X is defined as:
H(X) = −Σi P(xi) log2 P(xi)
The entropy of X after observing values of another variable Y is defined as:
H(X|Y) = −Σj P(yj) Σi P(xi|yj) log2 P(xi|yj)
Entropy, Information Gain
The amount by which the entropy of X decreases reflects additional information about X provided by Y:
IG(X|Y) = H(X) − H(X|Y)
Feature Y is regarded as more correlated to feature X than to feature Z if IG(X|Y) > IG(Z|Y).
Information gain is symmetrical for two random variables X and Y: IG(X|Y) = IG(Y|X).
Entropy, Symmetrical Uncertainty
Information gain is biased in favor of features with more values; thus we must normalize it:
SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y))
SU(X, Y) values are normalized to the range [0, 1]:
a value of 1 indicates that knowledge of the value of either one completely predicts the value of the other;
a value of 0 indicates that X and Y are independent.
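The three entropy-based quantities above, sketched directly from the formulas (base-2 logarithms; inputs are lists of nominal values; the constant-variable fallback in SU is our own convention):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum_i P(xi) log2 P(xi)."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    """H(X|Y): entropy of X within each value of Y, weighted by P(y)."""
    n = len(ys)
    return sum(c / n * entropy([x for x, y2 in zip(xs, ys) if y2 == y])
               for y, c in Counter(ys).items())

def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - cond_entropy(xs, ys)

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0  # two constant variables: each trivially predicts the other
    return 2 * information_gain(xs, ys) / (hx + hy)
```

For example, SU of a variable with itself is 1, and SU of two independent binary variables is 0.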
Entropy, Symmetrical Uncertainty
Symmetrical uncertainty still treats a pair of features symmetrically.
Entropy-based measures require nominal features.
Entropy-based measures can be applied to measure correlations between continuous features as well, if the values are discretized properly in advance.
Algorithm Steps
Aspects of developing a procedure to select good features for classification:
1) How to decide whether a feature is relevant to the class or not (C-correlation).
2) How to decide whether such a relevant feature is redundant or not when considering it together with other relevant features (F-correlation).
Select the features with SU greater than a threshold.
Predominant Correlation
The correlation SUi,c between a feature Fi and the class C is predominant iff SUi,c ≥ δ and there exists no feature Fj (j ≠ i) such that SUj,i ≥ SUi,c.
Redundant Feature
If SUj,i ≥ SUi,c, feature Fj is redundant to feature Fi; we use SPi to denote the set of all redundant peers for Fi.
We divide SPi into two parts: SPi+ = {Fj | SUj,c > SUi,c} and SPi− = {Fj | SUj,c ≤ SUi,c}.
Predominant Feature
A feature is predominant to the class iff:
its correlation to the class is predominant, or
it can become predominant after removing its redundant peers.
Feature selection for classification is a process that identifies all features predominant to the class concept and removes the rest.
Heuristic
We must use heuristics in order to avoid a pairwise analysis of the F-correlations between all relevant features.
Heuristic (if SPi+ = ∅): treat Fi as a predominant feature, remove all features in SPi−, and skip identifying redundant peers for them.
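Putting the pieces together, a compact sketch of the FCBF-style selection loop (SU as defined earlier; the toy data, threshold, and function names are our own, and the loop is a simplified rendering of the ranking-plus-redundancy-removal idea rather than the paper's exact pseudo-code):

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetrical uncertainty SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0
    h_cond = sum(c / len(ys) * entropy([x for x, y2 in zip(xs, ys) if y2 == y])
                 for y, c in Counter(ys).items())
    return 2 * (hx - h_cond) / (hx + hy)

def fcbf(features, classes, delta):
    """Keep features whose SU with the class is >= delta, ranked by that
    SU; walking down the ranking, drop any later feature Fj whose
    correlation to an already-kept Fi is at least its correlation to
    the class (i.e. Fj is a redundant peer)."""
    ranked = sorted((i for i in range(len(features))
                     if su(features[i], classes) >= delta),
                    key=lambda i: -su(features[i], classes))
    selected = []
    for j in ranked:
        if all(su(features[j], features[i]) < su(features[j], classes)
               for i in selected):
            selected.append(j)
    return selected

classes = [0, 0, 1, 1]
features = [
    [0, 0, 1, 1],   # F0: copies the class (relevant)
    [1, 1, 0, 0],   # F1: inverse of F0 (redundant with it)
    [0, 1, 0, 1],   # F2: independent noise
]
```

On this data the noise feature falls below the threshold and the redundant copy is removed, leaving only F0.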
FCBF Algorithm
GA-SVM: Genetic Algorithm + Support Vector Machine
(Wrapper Mode)
<Sequential, Compound, Classifier>
Support Vector Machine (SVM)
SVM is one of the best techniques for pattern classification, widely used in many application areas.
SVM classifies data by determining a set of support vectors and their distance to the hyperplane.
SVM provides a generic mechanism that fits the hyperplane surface to the training data.
SVM Main Idea
Under the hypothesis that the classes are linearly separable, build the hyperplane with maximum margin that separates the classes.
When the classes are not linearly separable, map them to a high-dimensional space in order to separate them linearly there.
[Figure: a separating surface between the point sets A+ and A−.]
Support Vectors
The support vectors (SV) are the training set instances nearest to the hyperplane; we can use the SVs instead of the whole training set.
Line equation: w·x + b = 0 (w and b are unknown).
[Figure: two classes (+1 and −1) in the (X1, X2) plane, with the support vectors lying closest to the separating line.]
Kernel
[Figure: one-dimensional data at points 1, 2, 4, 5, 6 with interleaved classes cannot be separated linearly in 1 dimension, but becomes linearly separable after mapping to 2 dimensions.]
Kernel Map the data into a higher dimension! The user may select a kernel function for the SVM during the training process. The kernel parameter settings used when training the SVM impact the classification accuracy.
The parameters that should be optimized include the penalty parameter C and the kernel function parameters.
80
![Page 81: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/81.jpg)
Linear SVM SVM concepts for a typical two-class classification problem:
Training set of instance-label pairs (x_i, y_i), i = 1, …, n, with y_i ∈ {+1, −1}.
For the linearly separable case, the data points will be correctly classified by y_i (w · x_i + b) ≥ 1, i = 1, …, n.
81
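As a concrete check of the constraint above, the sketch below verifies y_i (w · x_i + b) ≥ 1 for a hand-picked hyperplane on toy data (the points, w, and b are illustrative assumptions, not the output of a trained SVM); the geometric margin between the classes is 2/‖w‖.

```python
import math

# Toy linearly separable data: (point, label)
data = [((2.0, 2.0), +1), ((3.0, 2.0), +1),
        ((1.0, 1.0), -1), ((0.0, 1.0), -1)]

# Hand-picked hyperplane w . x + b = 0 (an assumption, not a fitted model)
w = (1.0, 1.0)
b = -3.0

def constraint(x, y):
    """Return y * (w . x + b); >= 1 means the margin constraint holds."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)

values = [constraint(x, y) for x, y in data]
print(values)                  # [1.0, 2.0, 1.0, 2.0] -- all >= 1
print(2.0 / math.hypot(*w))    # geometric margin 2/||w||
```

The two points with constraint value exactly 1 lie on the margin: they are the support vectors.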
![Page 82: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/82.jpg)
Linear SVM Find an optimal separating hyperplane with the maximum margin by solving the following optimization problem:
minimize (1/2)‖w‖² subject to y_i (w · x_i + b) ≥ 1, i = 1, …, n.
To solve this quadratic optimization problem one must find the saddle point of the Lagrange function:
L(w, b, α) = (1/2)‖w‖² − Σ_i α_i [y_i (w · x_i + b) − 1]
α_i denotes the Lagrange multipliers, hence α_i ≥ 0.
82
![Page 83: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/83.jpg)
Linear SVM After differentiating and applying the Karush-Kuhn-Tucker (KKT) conditions:
w = Σ_i α_i y_i x_i,  Σ_i α_i y_i = 0
The α_i values determine the parameters w and b of the optimal hyperplane. Thus, we obtain an optimal decision hyperplane:
f(x) = sign(Σ_i α_i y_i (x_i · x) + b)
83
![Page 84: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/84.jpg)
Linear Generalized SVM When the data cannot be linearly separated, the goal is to construct a hyperplane that makes the smallest number of errors, introducing non-negative slack variables ξ_i ≥ 0.
Solve:
minimize (1/2)‖w‖² + C Σ_i ξ_i subject to y_i (w · x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0.
84
C: tradeoff parameter between error and margin; Σ_i ξ_i bounds the number of misclassified instances.
![Page 85: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/85.jpg)
Linear Generalized SVM This optimization model can be solved using the Lagrangian method.
The penalty parameter C is now the upper bound on the Lagrange multipliers: 0 ≤ α_i ≤ C.
85
![Page 86: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/86.jpg)
NonLinear SVM The nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function φ. Inner products are then replaced by a kernel function:
K(x_i, x_j) = φ(x_i) · φ(x_j)
86
![Page 87: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/87.jpg)
NonLinear SVM, Kernels
Final hyperplane equation:
f(x) = sign(Σ_i α_i y_i K(x_i, x) + b)
87
![Page 88: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/88.jpg)
NonLinear SVM, Kernels In order to improve classification accuracy, the parameters in the kernel functions should be properly set.
Polynomial kernel: K(x_i, x_j) = (x_i · x_j + 1)^d
Radial basis function kernel: K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)
Sigmoid kernel: K(x_i, x_j) = tanh(κ (x_i · x_j) + θ)
88
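The three kernels can be written directly as small functions; a minimal plain-Python sketch (the parameter values d, gamma, kappa, theta are arbitrary illustrative choices):

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, d=2):
    """Polynomial kernel: (x . y + 1)^d"""
    return (dot(x, y) + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * (x . y) + theta)"""
    return math.tanh(kappa * dot(x, y) + theta)

x, y = (1.0, 2.0), (3.0, 0.0)
print(poly_kernel(x, y))   # (3 + 1)^2 = 16
print(rbf_kernel(x, x))    # 1.0 -- RBF of any point with itself
```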
![Page 89: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/89.jpg)
Genetic Algorithm (GA) Genetic algorithms (GA), as an optimization and search methodology, are a promising alternative to conventional heuristic methods.
A GA works with a set of candidate solutions called a population.
Based on the Darwinian principle of 'survival of the fittest', the GA obtains the optimal solution after a series of iterative computations.
The GA generates successive populations of alternative solutions, each represented by a chromosome.
A fitness function assesses the quality of a solution in the evaluation step. 89
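The loop described above (evaluate, select, cross over, mutate, repeat) can be sketched in a few lines. Here the fitness is the toy "one-max" count of 1-bits standing in for a real evaluation such as SVM accuracy, and all parameters (population size, rates, generation count) are arbitrary illustrative choices:

```python
import random

random.seed(0)
N_BITS, POP, GENS = 12, 20, 40

def fitness(chrom):
    """Toy fitness: number of 1-bits (stand-in for a real evaluation)."""
    return sum(chrom)

def select(pop):
    """Tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point crossover of two parent chromosomes."""
    cut = random.randrange(1, N_BITS)
    return p1[:cut] + p2[cut:]

def mutate(chrom, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))  # close to N_BITS after evolution
```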
![Page 90: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/90.jpg)
90
![Page 91: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/91.jpg)
GA Feature Selection Structure
The chromosome comprises three parts: C, γ, and the features mask. (Different parameters are encoded when other types of kernel functions are used.)
A binary coding system is used to represent the chromosome: n_C is the number of bits representing parameter C, n_γ is the number of bits representing parameter γ, and n_f is the number of bits representing the features.
Choose n_C and n_γ according to the required calculation precision.
91
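One common way to decode such a binary chromosome is to map each bit group linearly into a parameter range and read the remaining bits as the feature mask. The bit widths, parameter ranges, and the linear decoding rule below are illustrative assumptions, not values from the slides:

```python
def decode_bits(bits, lo, hi):
    """Linearly map a bit list onto [lo, hi]:
    value = lo + (hi - lo) * decimal(bits) / (2^len - 1)."""
    decimal = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * decimal / (2 ** len(bits) - 1)

# Assumed layout: 4 bits for C, 4 bits for gamma, 5 bits of feature mask.
N_C, N_G, N_F = 4, 4, 5
chromosome = [1, 0, 0, 1,  0, 0, 1, 1,  1, 0, 1, 1, 0]

C     = decode_bits(chromosome[:N_C], lo=0.1, hi=100.0)        # penalty C
gamma = decode_bits(chromosome[N_C:N_C + N_G], lo=0.01, hi=1.0)
mask  = chromosome[N_C + N_G:]                                  # 1 = kept

print(round(C, 3), round(gamma, 3), mask)  # 60.04 0.208 [1, 0, 1, 1, 0]
```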
![Page 92: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/92.jpg)
Evaluation Measure Three criteria are used to design the fitness function: classification accuracy, the number of selected features, and the feature cost.
Thus, an individual (chromosome) with high classification accuracy, a small number of features, and low total feature cost produces a high fitness value. 92
![Page 93: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/93.jpg)
Evaluation Measure
fitness = W_A × SVM_accuracy + W_F × (Σ_i C_i × F_i + 1)⁻¹
W_A: accuracy weight; W_F: feature weight; C_i: cost of feature i; F_i = 1 if feature i is selected, 0 otherwise.
93
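A sketch of such a weighted fitness in plain Python. The weights, per-feature costs, and accuracy values below are arbitrary illustrative numbers; the combination rule follows the three criteria listed above (reward accuracy, penalize many or expensive features):

```python
def fitness(accuracy, mask, costs, w_a=0.8, w_f=0.2):
    """Weighted fitness: reward accuracy, penalize selected-feature cost.
    fitness = w_a * accuracy + w_f * 1 / (sum of selected costs + 1)."""
    total_cost = sum(c for c, used in zip(costs, mask) if used)
    return w_a * accuracy + w_f / (total_cost + 1.0)

costs = [1.0, 5.0, 2.0, 0.5]  # assumed per-feature costs

# With equal accuracy, the cheaper feature subset scores higher:
print(fitness(0.90, [1, 0, 0, 1], costs))  # small, cheap subset
print(fitness(0.90, [1, 1, 1, 1], costs))  # all features -> lower fitness
```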
![Page 94: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/94.jpg)
94
![Page 95: Feature selection concepts and methods](https://reader031.fdocuments.in/reader031/viewer/2022020122/54bdd8c04a7959e3508b4576/html5/thumbnails/95.jpg)
95
Thanks For Your Regard