Feature Selection [slides taken from the UC Berkeley course CS294-10 (2006 / 2009)]
[Page 1]

Feature Selection
[Slides taken from the UC Berkeley course CS294-10 (2006 / 2009)]
http://www.cs.berkeley.edu/~jordan/courses/294-fall09
[Page 2]

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary recommendations
[Page 3]

Review
• Data: pairs $(x_i, y_i)$, $i = 1, \dots, m$
• $x_i$: a vector of $n$ features
• Features can be real-valued ($x_{if} \in \mathbb{R}$), categorical ($x_{if} \in \{1, \dots, k\}$), or more structured
• $y$: response (dependent) variable
  – $y_i \in \{0, 1\}$: binary classification
  – $y_i \in \mathbb{R}$: regression
  – Typically, this is what we want to be able to predict, having observed some new $x$.
[Page 4]

Featurization
• Data is often not originally in vector form
  – Have to choose how to featurize
  – Features often encode expert knowledge of the domain
  – Can have a huge effect on performance
• Example: documents
  – “Bag of words” featurization: throw out order, keep a count of how many times each word appears. (A short sketch follows below.)
    • Surprisingly effective for many tasks
  – Sequence featurization: one feature for the first letter in the document, one for the second letter, etc.
    • A poor feature set for most purposes: similar documents are not close to one another in this representation.
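As an illustration (not from the original slides), here is a minimal bag-of-words featurization using scikit-learn's `CountVectorizer`; the toy documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: word order is discarded, only counts survive.
docs = [
    "the cat chased the kitten",
    "the kitten chased the cat",   # same bag of words as the first document
    "stock prices fell sharply",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse (n_docs, n_words) count matrix

print(vectorizer.get_feature_names_out())    # the vocabulary, one feature per word
print(X.toarray())                           # rows 0 and 1 are identical vectors
```

Note that the first two documents, which differ only in word order, map to exactly the same feature vector.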
[Page 5]

What is feature selection?
• Reducing the feature space by throwing out some of the features (covariates)
  – Also called variable selection
• Motivating idea: try to find a simple, “parsimonious” model
  – Occam’s razor: the simplest explanation that accounts for the data is best
[Page 6]

What is feature selection?

Example 1. Task: classify whether a document is about cats. Data: word counts in the document.
  Full X: cat 2, and 35, it 20, kitten 8, electric 2, trouble 4, then 5, several 9, feline 2, while 4, …, lemon 2
  Reduced X: cat 2, kitten 8, feline 2

Example 2. Task: predict chances of lung disease. Data: medical history survey.
  Full X: Vegetarian No; Plays video games Yes; Family history No; Athletic No; Smoker Yes; Sex Male; Lung capacity 5.8 L; Hair color Red; Car Audi; …; Weight 185 lbs
  Reduced X: Family history No; Smoker Yes
[Page 7]

Why do it?
• Case 1: We’re interested in the features themselves; we want to know which are relevant. If we fit a model, it should be interpretable.
• Case 2: We’re interested in prediction; features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor).
[Page 8]

Why do it? Case 1.
• What causes lung cancer?
  – Features are aspects of a patient’s medical history
  – Binary response variable: did the patient develop lung cancer?
  – Which features best predict whether lung cancer will develop? Might want to legislate against these features.
• What causes a program to crash? [Alice Zheng ’03, ’04, ’05]
  – Features are aspects of a single program execution
    • Which branches were taken?
    • What values did functions return?
  – Binary response variable: did the program crash?
  – Features that predict crashes well are probably bugs.
• What stabilizes protein structure? (my research)
  – Features are structural aspects of a protein
  – Real-valued response variable: protein energy
  – Features that give rise to low energy are stabilizing.

We want to know which features are relevant; we don’t necessarily want to do prediction.
[Page 9]

Why do it? Case 2.
• Text classification
  – Features for all $10^5$ English words, and maybe all word pairs
  – Common practice: throw in every feature you can think of, and let feature selection get rid of the useless ones
  – Training is too expensive with all features
  – The presence of irrelevant features hurts generalization.
• Classification of leukemia tumors from microarray gene expression data [Xing, Jordan, Karp ’01]
  – 72 patients (data points)
  – 7130 features (expression levels of different genes)
• Disease diagnosis
  – Features are outcomes of expensive medical tests
  – Which tests should we perform on a patient?
• Embedded systems with limited resources
  – Classifier must be compact
  – Voice recognition on a cell phone
  – Branch prediction in a CPU (4K code limit)

We want to build a good predictor.
[Page 10]

Get at Case 1 through Case 2
• Even if we just want to identify features, it can be useful to pretend we want to do prediction.
• Relevant features are (typically) exactly those that most aid prediction.
• But not always: highly correlated features may be redundant, yet both interesting as “causes”.
  – e.g. smoking in the morning, smoking at night
[Page 11]

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Kernel methods
• Miscellaneous topics
• Summary
[Page 12]

Filtering
• Simple techniques for weeding out irrelevant features without fitting a model
[Page 13]

Filtering
• Basic idea: assign a score $S(f)$ to each feature $f$ indicating how “related” $x_f$ and $y$ are.
  – Intuition: if $x_{if} = y_i$ for all $i$, then $x_f$ is good no matter what our model is; it contains all the information about $y$.
  – Many popular scores [see Yang and Pedersen ’97]
    • Classification with categorical data: chi-squared, information gain
      – Can use binning to make continuous data categorical
    • Regression: correlation, mutual information
    • Markov blanket [Koller and Sahami ’96]
• Then somehow pick how many of the highest-scoring features to keep (nested models)
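As a concrete sketch (not from the slides), scikit-learn's `SelectKBest` implements exactly this score-then-keep-the-top-k pattern; the dataset and the choice k = 2 are arbitrary illustrations, and chi-squared is really meant for non-negative, count-like features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 non-negative features, 3 classes

# Score each feature independently against y; keep the k highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)                      # per-feature chi-squared scores
print(selector.get_support(indices=True))    # indices of the kept features
```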
[Page 14]

Comparison of filtering methods for text categorization [Yang and Pedersen ’97]
[Page 15]

Filtering
• Advantages:
  – Very fast
  – Simple to apply
• Disadvantages:
  – Doesn’t take into account which learning algorithm will be used.
  – Doesn’t take into account correlations between features
    • This can be an advantage if we’re only interested in ranking the relevance of features, rather than performing prediction.
    • Also a significant disadvantage (see homework)
• Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features
  – Caveat here too: apparently useless features can be useful when grouped with others
[Page 16]

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary
[Page 17]

Model Selection
• Choosing between possible models of varying complexity
  – In our case, a “model” means a set of features
• Running example: the linear regression model
[Page 18]

Linear Regression Model
• Data: $x_1, \dots, x_m \in \mathbb{R}^n$
• Response: $y_1, \dots, y_m \in \mathbb{R}$
• Parameters: $w \in \mathbb{R}^{n+1}$
• Model / prediction rule: $\hat{y} = w^T x$
  – Assume each $x$ is augmented to $(1, x_1, \dots, x_n)$ so the constant term is absorbed into $w$ as $w_0$
• Recall that we can fit the model by minimizing the squared error:
  $$\hat{w} = \arg\min_w \sum_{i=1}^m (y_i - w^T x_i)^2$$
• Can be interpreted as maximum likelihood with Gaussian noise: $y_i = w^T x_i + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
[Page 19]

Least Squares Fitting
[Figure: observations plotted against the fitted prediction line; the vertical gaps are the errors, or “residuals”.]
• Sum squared error: $E(w) = \sum_{i=1}^m (y_i - w^T x_i)^2$
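A minimal numerical sketch of this fit with NumPy (the synthetic data and true weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
X_aug = np.hstack([np.ones((m, 1)), X])      # absorb the constant term as w0
w_true = np.array([2.0, 1.0, -3.0, 0.5])     # hypothetical true parameters
y = X_aug @ w_true + rng.normal(scale=0.1, size=m)

# Minimize the sum of squared residuals ||y - X_aug w||^2.
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_hat)                                 # close to w_true
```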
[Page 20]

Model Selection
• Consider a reduced model with only those features $x_f$ for $f \in s$, where $s \subseteq \{1, \dots, n\}$
• Squared error is now $E(w_s) = \sum_{i=1}^m (y_i - w_s^T x_{i,s})^2$
• We want to pick out the “best” $s$. Maybe this means the one with the lowest training error $\hat{E}(s) = \min_{w_s} E(w_s)$?
• Note that $\hat{E}(s') \le \hat{E}(s)$ whenever $s' \supseteq s$
  – Just zero out terms in $w_{s'}$ to match $w_s$.
• Generally speaking, training error will only go up in a simpler model. So why should we use one?
[Page 21]

Overfitting example 1
[Figure: a degree-15 polynomial fit through a handful of data points.]
• This model is too rich for the data
• It fits the training data well, but doesn’t generalize.
[Page 22]

Overfitting example 2
• Generate 2000 feature vectors $x_i$, i.i.d.
• Generate 2000 responses $y_i$, i.i.d., completely independent of the $x_i$’s
  – We shouldn’t be able to predict $y$ at all from $x$
• Find $\hat{w} = \arg\min_w E(w)$
• Use this to predict $y_i$ for each $x_i$ by $\hat{y}_i = \hat{w}^T x_i$

It really looks like we’ve found a relationship between $x$ and $y$! But no such relationship exists, so $\hat{w}$ will do no better than random on new data.
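A small simulation of this effect (the dimensions are chosen for illustration, not taken from the slides): when the number of features approaches the number of data points, least squares fits the training noise almost perfectly, and the apparent relationship vanishes on fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 90                       # many features relative to data points
X = rng.normal(size=(m, n))
y = rng.normal(size=m)               # y is completely independent of X

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

train_corr = np.corrcoef(X @ w_hat, y)[0, 1]

# Fresh data from the same (still unrelated) distributions:
X_new = rng.normal(size=(m, n))
y_new = rng.normal(size=m)
test_corr = np.corrcoef(X_new @ w_hat, y_new)[0, 1]

print(f"train correlation: {train_corr:.2f}")   # near 1: looks like a relationship
print(f"test correlation:  {test_corr:.2f}")    # near 0: it was noise
```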
[Page 23]

Model evaluation
• Moral 1: In the presence of many irrelevant features, we might just fit noise.
• Moral 2: Training error can lead us astray.
• To evaluate a feature set $s$, we need a better scoring function $K(s)$
  – We’ve seen that the training error $\hat{E}(s)$ is not appropriate.
• We’re not ultimately interested in training error; we’re interested in test error (error on new data).
• We can estimate test error by pretending we haven’t seen some of our data.
  – Keep some data aside as a validation set. If we don’t use it in training, then it’s a fair test of our model.
[Pages 24–27]

K-fold cross validation
• A technique for estimating test error
• Uses all of the data to validate
• Divide the data into K groups $X_1, \dots, X_K$.
• Use each group in turn as a validation set, training on the others, then average all the validation errors.
[Figure, repeated across four slides: groups X1–X7; in each frame a different group is held out for testing while the remaining groups are used for learning.]
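An illustrative sketch with scikit-learn (K = 7 to match the figure; the data and model are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=140, n_features=10, noise=5.0, random_state=0)

# Each of the K folds is held out once for validation; errors are averaged.
kf = KFold(n_splits=7, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")

print(scores)             # one (negated) validation error per fold
print(-scores.mean())     # the cross-validation estimate of test error
```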
[Page 28]

Model Search
• We have an objective function $K(s)$
  – Time to search for a good model.
• This is known as a “wrapper” method
  – The learning algorithm is a black box
  – Just use it to compute the objective function, then do search
• Exhaustive search is expensive
  – There are $2^n$ possible subsets $s$
• Greedy search is common and effective
[Page 29]

Model search

Forward selection:
  Initialize s = {}
  Do:
    Add the feature to s which improves K(s) most
  While K(s) can be improved

Backward elimination:
  Initialize s = {1, 2, …, n}
  Do:
    Remove the feature from s which improves K(s) most
  While K(s) can be improved

• Backward elimination tends to find better models
  – Better at finding models with interacting features
  – But it is frequently too expensive to fit the large models at the beginning of the search
• Both can be too greedy. (A runnable sketch of forward selection follows below.)
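A minimal runnable sketch of forward selection, using a cross-validated score as K(s); the dataset, model, and scoring choice are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

def K(s):
    """Objective: cross-validated R^2 of the model restricted to feature set s."""
    return cross_val_score(LinearRegression(), X[:, sorted(s)], y, cv=5).mean()

s, best = set(), -np.inf
while len(s) < X.shape[1]:
    # Try adding each remaining feature; keep the one that improves K(s) most.
    score, f = max((K(s | {f}), f) for f in range(X.shape[1]) if f not in s)
    if score <= best:
        break                  # no single addition improves K(s): stop
    s.add(f)
    best = score

print(sorted(s), best)
```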
[Page 30]

Model search
• More sophisticated search strategies exist
  – Best-first search
  – Stochastic search
  – See “Wrappers for Feature Subset Selection”, Kohavi and John, 1997
• For many models, search moves can be evaluated quickly without refitting
  – E.g. for the linear regression model: add the feature that has the most covariance with the current residuals
• RapidMiner can do feature selection with cross-validation and either forward selection or backward elimination.
• Other objective functions exist which add a model-complexity penalty to the training error
  – AIC: subtract a penalty of $k$ (the number of parameters) from the log-likelihood
  – BIC: subtract a penalty of $\frac{k}{2}\log m$ instead
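A small sketch of these criteria (assuming Gaussian noise, so the log-likelihood reduces, up to model-independent constants, to a function of the residual sum of squares; the data are made up):

```python
import numpy as np

def aic_bic(X, y):
    """AIC/BIC scores for a Gaussian linear model; lower is better."""
    m, k = X.shape
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    loglik = -0.5 * m * np.log(rss / m)        # profiled Gaussian log-likelihood
    return -(loglik - k), -(loglik - 0.5 * k * np.log(m))

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 5))
y = 2.0 * X[:, 0] + rng.normal(size=m)         # only feature 0 is relevant

print(aic_bic(X[:, :1], y))                    # reduced (true) model wins
print(aic_bic(X, y))                           # full model pays the penalty on k
```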
[Page 31]

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary
[Page 32]

Regularization
• In certain cases, we can move model selection into the induction algorithm
  – Only have to fit one model; more efficient.
• This is sometimes called an embedded feature selection algorithm
[Page 33]

Regularization
• Regularization: add a model-complexity penalty to the training error:
  $$\hat{w} = \arg\min_w \sum_{i=1}^m (y_i - w^T x_i)^2 + C \|w\|$$
  for some constant $C$
• Now a single model fit performs the model selection as well.
• Regularization forces weights to be small, but does it force weights to be exactly zero?
  – $w_f = 0$ is equivalent to removing feature $f$ from the model
[Pages 34–40]

L1 vs L2 regularization
• To minimize the penalized error $E(w) + C\|w\|$, we can solve by (e.g.) gradient descent.
• Minimization is a tug-of-war between the two terms.
[Figure, built up over several slides: contours of the squared error plotted against the constraint region of the penalty, a diamond for L1 and a circle for L2.]
• L1: $w$ is forced into the corners of the diamond, so many components are 0
  – The solution is sparse
• L2 regularization does not promote sparsity
• Even without sparsity, regularization promotes generalization: it limits the expressiveness of the model
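An illustrative comparison on synthetic data (the penalty strength alpha = 1.0 is arbitrary), showing that the L1 penalty zeroes coefficients out while the L2 penalty merely shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty

print(np.sum(lasso.coef_ == 0))        # several coefficients exactly zero: sparse
print(np.sum(ridge.coef_ == 0))        # typically none: L2 shrinks without zeroing
```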
[Page 41]

Lasso Regression [Tibshirani ’94]
• Simply linear regression with an L1 penalty for sparsity:
  $$\hat{w} = \arg\min_w \sum_{i=1}^m (y_i - w^T x_i)^2 + C \|w\|_1$$
• Two big questions:
  1. How do we perform this minimization?
     • With the L2 penalty it’s easy; we saw this in a previous lecture
     • With L1 it’s not a least-squares problem any more
  2. How do we choose C?
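For question 2, one standard answer (an illustrative sketch, not the method named on the next slide) is to cross-validate over a grid of penalty strengths, as scikit-learn's `LassoCV` does:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=20, n_informative=4,
                       noise=2.0, random_state=0)

# Fit the lasso along a path of candidate penalties and keep the one
# with the best 5-fold cross-validation error.
model = LassoCV(cv=5).fit(X, y)

print(model.alpha_)                    # the chosen penalty strength
print((model.coef_ != 0).sum())        # number of features kept
```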
[Page 42]

Least-Angle Regression
• Up until a few years ago this was not trivial
  – Fitting the model: an optimization problem, harder than least squares
  – Cross-validation to choose C: must fit the model for every candidate C value
• Not so with LARS! (Least Angle Regression, Hastie et al., 2004)
  – Finds the trajectory of $w$ for all possible C values simultaneously, as efficiently as least squares
  – Can choose exactly how many features are wanted
[Figure taken from Hastie et al. (2004); see sklearn.linear_model.Lars.]
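A sketch of computing the LARS path with scikit-learn (the dataset is a placeholder); each column of the returned coefficient matrix is the solution at one breakpoint of the regularization path:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# One pass computes the entire piecewise-linear coefficient trajectory.
alphas, active, coefs = lars_path(X, y, method="lasso")

print(coefs.shape)     # (n_features, n_breakpoints)
print(active)          # the order in which features enter the model
```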
[Page 43]

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary
[Page 44]

Summary
The methods covered, in roughly increasing order of computational cost:
• Filtering
• L1 regularization (embedded methods)
• Kernel methods
• Wrappers:
  – Forward selection
  – Backward selection
  – Other search
  – Exhaustive
[Page 45]

Summary: Filtering
• A good preprocessing step
• Information-based scores seem most effective
  – Information gain
  – More expensive: Markov blanket [Koller & Sahami ’96]
• Fails to capture relationships between features
[Page 46]

Summary: L1 regularization (embedded methods)
• Fairly efficient
  – LARS-type algorithms now exist for many linear models
  – Ideally, use cross-validation to determine the regularization coefficient
• Not applicable to all models
• Linear methods can be limited
  – Common approach: fit a linear model initially to select features, then fit a nonlinear model with the new feature set
[Page 47]

Summary: Kernel methods
• Expand the expressiveness of linear models
• Very effective in practice
• Useful when a similarity kernel is natural to define
• Not as interpretable
  – They don’t really perform feature selection as such
• Achieve parsimony through a different route: sparsity in the data
[Page 48]

Summary: Wrappers
• Most directly optimize prediction performance
• Can be very expensive, even with greedy search methods
• Cross-validation is a good objective function to start with
[Page 49]

Summary: Forward and backward selection
• Too greedy: they ignore relationships between features
• An easy baseline
• Can be generalized in many interesting ways
  – Stagewise forward selection
  – Forward-backward search
  – Boosting
[Page 50]

Summary: Other search strategies
• Generally more effective than greedy search
[Page 51]

Summary: Exhaustive search
• The “ideal”
• Very seldom done in practice
• With a cross-validation objective, there’s a chance of over-fitting
  – Some subset might randomly perform quite well in cross-validation