Transcript of COMP 551 - Applied Machine Learning, Lecture 8: Feature design (wlh/comp551/slides/08-features.pdf), 45 pages.

Page 1:

COMP 551 - Applied Machine Learning
Lecture 8 --- Feature design
William L. Hamilton (with slides and content from Joelle Pineau and Jackie Cheung)
* Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.

Page 2:

Upcoming tutorial hours

William L. Hamilton, McGill University and Mila 2

§ Tutorial hours on scikit-learn (all in Trottier 3120).
§ September 30, 6-8pm
§ October 1st, 3:30-5:30pm
§ October 2nd, 4-6pm
§ October 4th, 6-8pm

Page 3:

Overfitting in decision trees

§ Decision tree construction proceeds until all leaves are “pure”, i.e. all examples in a leaf are from the same class.

§ As the tree grows, generalization performance can start to degrade, because the algorithm starts including irrelevant attributes/tests and fitting outliers.

§ How can we avoid this?

William L. Hamilton, McGill University and Mila 3

Page 4:

Avoiding Overfitting

§ Objective: Remove some nodes to get better generalization.

§ Early stopping: Stop growing the tree when further splitting the data does not improve the information gain on the validation set.

§ Post pruning: Grow a full tree, then prune the tree by eliminating lower nodes that have low information gain on the validation set.

§ In general, post pruning is better. It allows you to deal with cases where a single attribute is not informative, but a combination of attributes is informative.

William L. Hamilton, McGill University and Mila 4

Page 5:

Reduced-error pruning (aka post pruning)

§ Split the data set into a training set and a validation set

§ Grow a large tree (e.g. until each leaf is pure)

§ For each node:
  - Evaluate the validation set accuracy of pruning the subtree rooted at the node
  - Greedily remove the node that most improves validation set accuracy, with its corresponding subtree
  - Replace the removed node by a leaf with the majority class of the corresponding examples

§ Stop when pruning starts hurting the accuracy on the validation set.
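Below is a minimal sketch of this grow-then-prune-by-validation recipe using scikit-learn. Note that scikit-learn does not implement reduced-error pruning directly, so the sketch substitutes cost-complexity pruning and simply picks the pruning strength that maximizes validation accuracy; the dataset is just a stand-in example.

```python
# Sketch: post-pruning a decision tree by validation accuracy (scikit-learn).
# scikit-learn offers cost-complexity pruning rather than reduced-error pruning;
# here we grow a full tree, then choose the pruning strength (ccp_alpha) that
# gives the best accuracy on a held-out validation set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0)            # grown until leaves are pure
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)                            # validation accuracy after pruning
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(f"best ccp_alpha = {best_alpha:.4f}, validation accuracy = {best_acc:.3f}")
```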

William L. Hamilton, McGill University and Mila 5

Page 6:

Example: Effect of reduced-error pruning

William L. Hamilton, McGill University and Mila 6

Page 7:

Advantages of decision trees

§ Provide a general representation of classification rules.
§ Scaling / normalization not needed, as we use no notion of “distance” between examples.

§ The learned function, y=f(x), is easy to interpret.

§ Fast learning algorithms (e.g. C4.5, CART).

§ Good accuracy in practice – many applications in industry!

William L. Hamilton, McGill University and Mila 7

Page 8:

Limitations

§ Sensitivity:

§ Exact tree output may be sensitive to small changes in data

§ With many features, tests may not be meaningful

§ Good for learning (nonlinear) piecewise axis-orthogonal decision boundaries, but not for learning functions with smooth, curvilinear boundaries.

§ Some extensions use non-linear tests in internal nodes.


Using higher-order features

[Figure 4.1: scatter plots of data points from three classes, plotted as the digits 1, 2, and 3, with decision boundaries; see the caption below.]

FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space X_1, X_2, X_1X_2, X_1^2, X_2^2. Linear inequalities in this space are quadratic inequalities in the original space.

… transformation h(X), where h : IR^p → IR^q with q > p, and will be explored in later chapters.

4.2 Linear Regression of an Indicator Matrix

Here each of the response categories are coded via an indicator variable. Thus if G has K classes, there will be K such indicators Y_k, k = 1, . . . , K, with Y_k = 1 if G = k else 0. These are collected together in a vector Y = (Y_1, . . . , Y_K), and the N training instances of these form an N × K indicator response matrix Y. Y is a matrix of 0’s and 1’s, with each row having a single 1. We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by

Ŷ = X(X^T X)^{-1} X^T Y.   (4.3)

Chapter 3 has more details on linear regression. Note that we have a coefficient vector for each response column y_k, and hence a (p+1) × K coefficient matrix B̂ = (X^T X)^{-1} X^T Y. Here X is the model matrix with p+1 columns corresponding to the p inputs, and a leading column of 1’s for the intercept.

A new observation with input x is classified as follows:

• compute the fitted output f̂(x)^T = (1, x^T) B̂, a K-vector;

• identify the largest component and classify accordingly:

Ĝ(x) = argmax_{k ∈ G} f̂_k(x).   (4.4)


Quadratic discriminant analysis

•  LDA assumes all classes have the same covariance matrix, Σ.

•  QDA allows different covariance matrices, Σy for each class y.

•  Linear discriminant function:

   δ_y(x) = x^T Σ^{-1} μ_y − (1/2) μ_y^T Σ^{-1} μ_y + log P(y)

•  Quadratic discriminant function:

   δ_y(x) = −(1/2) log|Σ_y| − (1/2) (x − μ_y)^T Σ_y^{-1} (x − μ_y) + log P(y)

–  QDA has more parameters to estimate, but greater flexibility to estimate the target function. Bias-variance trade-off.
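As a quick illustration of this trade-off, here is a sketch comparing the two classifiers in scikit-learn on synthetic two-class data whose classes have different covariances; the data and all names below are made up for illustration.

```python
# Sketch: LDA (one shared covariance) vs. QDA (one covariance per class).
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two classes with clearly different covariance structure.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 2], [[2.0, 1.5], [1.5, 2.0]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)      # assumes a shared Σ
qda = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)   # estimates Σ_y per class
print("LDA accuracy:", lda.score(X_te, y_te))
print("QDA accuracy:", qda.score(X_te, y_te))
```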

William L. Hamilton, McGill University and Mila 8

Page 9:

MiniProject 2

William L. Hamilton, McGill University and Mila 9

§ Details are now available: https://cs.mcgill.ca/~wlh/comp551/files/miniproject2_spec.pdf

§ Due Oct. 18th at 11:59pm.

Page 10:

MiniProject 2

William L. Hamilton, McGill University and Mila 10

Kaggle leaderboard

[Kaggle leaderboard screenshot, annotated with: your accuracy, the random baseline, the TA baseline, and the accuracy of the 3rd best group.]

Page 11:

Steps to solving a supervised learning problem

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs.

• This defines the input space X and output space Y.

3. Choose a class of hypotheses / representations H.

• E.g. linear functions.

4. Choose an error function (cost function) to define best hypothesis.

• E.g. Least-mean squares.

5. Choose an algorithm for searching through space of hypotheses.

So far: we have been focusing on this.

Today: deciding on what the inputs are.

With a focus on feature extraction for language problems!

William L. Hamilton, McGill University and Mila 11

Page 12:

Feature extraction steps

“raw” features → Feature construction → constructed features → Feature selection → feature subset → Predictor

William L. Hamilton, McGill University and Mila 12

Page 13:

A few strategies

§ Use domain knowledge to construct “ad hoc” features.
§ Normalization across different features, e.g. centering and scaling with x_i = (x’_i – μ_i) / σ_i.
§ Normalization across different data instances, e.g. counts/histogram of pixel colors.
§ Non-linear expansions when first-order interactions are not enough for good results, e.g. products x1x2, x1x3, etc.
§ Other functions of features (e.g. sin, cos, log, exponential, etc.)
§ Regularization (lasso, ridge).
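A small sketch of the centering/scaling and non-linear expansion strategies with scikit-learn; the toy array below is invented for illustration.

```python
# Sketch: centering/scaling and non-linear feature expansion with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 100.0]])  # toy data, 2 raw features

# Normalization across features: x_i <- (x_i - mu_i) / sigma_i
X_std = StandardScaler().fit_transform(X)

# Non-linear expansion: add squares and pairwise products of the inputs.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_std)
print(X_poly.shape)  # (3, 5): x1, x2, x1^2, x1*x2, x2^2
```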

William L. Hamilton, McGill University and Mila 13

Page 14:

Feature construction

Why do we do feature construction?

§ Increase predictor performance.
§ Reduce time / memory requirements.
§ Improve interpretability.

But: Don’t lose important information!

Problem: we may end up with lots of possibly irrelevant, noisy, redundant features. (Here, “noisy” is in the sense that they can lead the predictor astray.)

William L. Hamilton, McGill University and Mila 14

Page 15:

Applications with lots of features

§ Any kind of task involving images or videos - object recognition, face recognition. Lots of pixels!

§ Classifying from gene expression data. Lots of different genes!

§ Number of data examples: 100

§ Number of variables: 6000 to 60,000

§ Natural language processing tasks. Lots of possible words!

William L. Hamilton, McGill University and Mila 15

Page 16:

Features for modelling natural language

§ Words

§ TF-IDF

§ N-grams

§ Syntactic features

§ Word embeddings (not in this lecture…)

§ Useful Python package for implementing these: http://www.nltk.org/

William L. Hamilton, McGill University and Mila 16

Page 17:

Words

§ Binary (present or absent)

§ Absolute frequency

§ i.e., raw count

§ Relative frequency
  § i.e., proportion: count / document length

William L. Hamilton, McGill University and Mila 17

Page 18:

Multinomial vs Bernoulli Naïve Bayes

§ Note that using binary vs. count features can have implications for your model!
§ In Lecture 5 we discussed the Bernoulli (i.e., binary) Naïve Bayes:

§ Assumes the features are binary counts.

§ But we can also have a Multinomial Naïve Bayes:

§ Assumes the features are integer counts.

Bernoulli distribution:

P(x | y = k) = ∏_{j=1}^{m} θ_{j,k}^{x_j} (1 − θ_{j,k})^{(1 − x_j)}

Multinomial distribution:

P(x | y = k) = [ (Σ_{j=1}^{m} x_j)! / ∏_{j=1}^{m} x_j! ] ∏_{j=1}^{m} θ_{j,k}^{x_j}

Homework: what is the maximum likelihood estimate for the multinomial Naïve Bayes?
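A brief sketch of how this choice plays out in scikit-learn, using binary versus count bag-of-words features; the toy documents and labels are made up.

```python
# Sketch: binary vs. count features with Bernoulli vs. Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["the quick brown fox", "the the the lazy dog", "quick quick fox"]
labels = [0, 1, 0]

counts = CountVectorizer().fit_transform(docs)              # integer counts
binary = CountVectorizer(binary=True).fit_transform(docs)   # 0/1 presence

mnb = MultinomialNB().fit(counts, labels)   # models integer counts
bnb = BernoulliNB().fit(binary, labels)     # models binary presence/absence
print(mnb.predict(counts), bnb.predict(binary))
```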

William L. Hamilton, McGill University and Mila 18

Page 19:

More options for words

§ Stopwords

§ Common words like “the”, “of”, “about” are unlikely to be informative about the contents of a document.

§ Standard practice to remove them, though they can be useful (e.g., for authorship identification)

§ Lemmatization
  § Inflectional morphology: changes to a word required by the grammar of a language
  § e.g., “perplexing”, “perplexed”, “perplexes”
  § (Much worse in languages other than English, Chinese, Vietnamese)

§ Lemmatize to recover the canonical form; e.g., “perplex”
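A minimal sketch of both steps using NLTK; it assumes the `stopwords` and `wordnet` resources have already been downloaded, and the token list is a made-up example.

```python
# Sketch: stopword removal and lemmatization with NLTK.
import nltk
# nltk.download("stopwords"); nltk.download("wordnet")  # one-time downloads
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

tokens = ["the", "chickens", "were", "crossing", "the", "roads"]
stop = set(stopwords.words("english"))
content = [t for t in tokens if t not in stop]        # drop common function words

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in content]   # e.g., "chickens" -> "chicken"
print(lemmas)
```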

William L. Hamilton, McGill University and Mila 19

Page 20:

Term weighting

§ Not all words are equally important.

§ What do you know about an article if it contains the word “the”? What about “penguin”?

William L. Hamilton, McGill University and Mila 20

Page 21:

TF*IDF (Salton, 1988)

§ Term Frequency times Inverse Document Frequency
§ A term is important/indicative of a document if it:
  1. Appears many times in the document
  2. Is a relatively rare word overall
§ TF is usually just the count of the word
§ IDF is a little more complicated:
  § IDF(t, corpus) = log( #(Docs in Corpus) / #(Docs with term t) )
  § Can use a separate large training corpus for this
§ Originally designed for document retrieval
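As an illustration, a small sketch computing these weights; note that scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF above (noted in the comments), and the documents are made up.

```python
# Sketch: TF-IDF weighting. scikit-learn's TfidfVectorizer uses a smoothed IDF,
# log((1 + #docs) / (1 + #docs with term)) + 1, rather than the plain log ratio
# above, and L2-normalizes each document vector by default.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the penguin waddled across the ice",
    "the stock market fell again",
    "penguin colonies are found in antarctica",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)

# Words that appear everywhere (like "the") get low IDF; words that are rare in
# the corpus (like "penguin") get higher weight.
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))
```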

William L. Hamilton, McGill University and Mila 21

Page 22:

N-grams

§ Use sequences of words, instead of individual words
§ e.g., … quick brown fox jumped …
§ Unigrams (i.e. words)
  § quick, brown, fox, jumped
§ Bigrams
  § quick_brown, brown_fox, fox_jumped
§ Trigrams
  § quick_brown_fox, brown_fox_jumped

§ Usually stop at N <= 3, unless you have lots and lots of data
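A short sketch extracting unigram-to-trigram count features with scikit-learn; the example sentence is a placeholder.

```python
# Sketch: unigram, bigram, and trigram count features.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox jumped over the lazy dog"]
vectorizer = CountVectorizer(ngram_range=(1, 3))   # unigrams up to trigrams
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# includes e.g. "quick", "quick brown", "quick brown fox", ...
```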

William L. Hamilton, McGill University and Mila 22

Page 23:

Rich linguistic features

§ Syntactic
  § Extract features from a parse tree of a sentence
  § [SUBJ The chicken] [VERB crossed] [OBJ the road].
§ Semantic
  § e.g., Extract the semantic roles in a sentence
  § [AGENT The chicken] [VERB crossed] [LOC the road].
  § e.g., Features are synonym clusters (“chicken” and “fowl” are the same feature) → WordNet
§ Trade-off: Rich, descriptive features might be more discriminative, but are hard (expensive, noisy) to get!

William L. Hamilton, McGill University and Mila 23

Page 24:

Sentiment and affective lexicons

§ Words have different connotations and associated meanings.
  § “hate” vs. “love”
§ There are collections of “sentiment lexicons”, which map words to different sentiment values. E.g.,
  § Bing Liu’s “Opinion Lexicon” (6000 positive/negative words)
  § Warriner et al’s “Affective Rating Dataset” (continuous scores for 14,000 words)
    § Ratings for valence (“how positive vs. negative”) and arousal (“how strong”).
§ Other types of lexicons exist. E.g.,
  § LIWC contains various categories (“anger”, “cognitive processes”)
  § Brysbaert et al’s “concreteness lexicon”

William L. Hamilton, McGill University and Mila 24

Page 25:

Feature selection

• Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.

[Diagram: data matrix X with n examples and m features, reduced to m’ selected features.]

William L. Hamilton, McGill University and Mila 25

Page 26:

Feature selection techniques

§ Principal Component Analysis (PCA)

§ Also called (Truncated) Singular Value Decomposition, or Latent Semantic Indexing in NLP

§ Variable Ranking

§ Think of features as random variables

§ Find how strongly they are associated with the output prediction, and remove the ones that are not highly associated, either before training or during training

§ Representation learning techniques [future lectures]

William L. Hamilton, McGill University and Mila 26

Page 27:

Principal Component Analysis (PCA)

§ Idea: Project data into a lower-dimensional sub-space, R^m → R^m’, where m’ < m.

William L. Hamilton, McGill University and Mila 27

Page 28:

Principal Component Analysis (PCA)

§ Idea: Project data into a lower-dimensional sub-space, R^m → R^m’, where m’ < m.

§ Consider a linear mapping, x_i → W^T x_i
  § W is the compression matrix with dimension R^{m×m’}.
§ Assume there is a decompression matrix U with dimension R^{m’×m}.

William L. Hamilton, McGill University and Mila 28

Page 29:

Principal Component Analysis (PCA)

§ Idea: Project data into a lower-dimensional sub-space, R^m → R^m’, where m’ < m.

§ Consider a linear mapping, x_i → W^T x_i
  § W is the compression matrix with dimension R^{m×m’}.
  § Assume there is a decompression matrix U with dimension R^{m’×m}.

§ Solve the following problem: argmin_{W,U} Σ_{i=1:n} ||x_i − U W^T x_i||²

William L. Hamilton, McGill University and Mila 29

Page 30:

Principal Component Analysis (PCA)

§ Solve the following problem: argmin_{W,U} Σ_{i=1:n} ||x_i − U W^T x_i||²
§ Equivalently: argmin_{W,U} ||X − X W U^T||²

§ Solution is given by the eigen-decomposition of X^T X.
§ W is the m×m’ matrix whose columns are the first m’ eigenvectors of X^T X (sorted in descending order by the magnitude of the eigenvalue).
§ Equivalently: W is the m×m’ matrix containing the first m’ right singular vectors of X.

§ Note: The columns of W are orthogonal!
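A small numerical sketch of this recipe with NumPy (a hand-rolled PCA on made-up data; in practice sklearn.decomposition.PCA implements the same idea):

```python
# Sketch: PCA as low-rank compression/decompression, via the SVD of centered X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n=100 examples, m=5 features (toy data)
X_centered = X - X.mean(axis=0)      # center before applying PCA

m_prime = 2
# Columns of V (the right singular vectors) are the eigenvectors of X^T X.
U_svd, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
W = Vt[:m_prime].T                   # m x m' compression matrix

Z = X_centered @ W                   # compressed representation, n x m'
X_reconstructed = Z @ W.T            # decompression (here U = W, since W is orthonormal)
print("reconstruction error:", np.linalg.norm(X_centered - X_reconstructed))
```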

William L. Hamilton, McGill University and Mila 30

Page 31:

Principal Component Analysis (PCA)

§ W is given by the eigen-decomposition of X^T X.
§ Select the projection dimension, m’, using cross-validation.

§ Typically “center” the examples before applying PCA (i.e., subtract the mean).

§ Interpretation: The first column of W is the direction with maximal variance in the data; the second column is the orthogonal direction with second-most variance, etc.

William L. Hamilton, McGill University and Mila 31

Page 32:

Eigenfaces

§ Turk & Pentland (1991) used the PCA method to capture face images.

§ Assume all faces are about the same size

§ Represent each face image as a data vector.

§ Each eigenvector is an image, called an Eigenface.

[Figure: the leading eigenfaces and the average image.]

William L. Hamilton, McGill University and Mila 32

Page 33:

Training images

Figure 23.2: Images of faces extracted from the Yale data set. Top-left: the original images in R^{50×50}. Top-right: the images after dimensionality reduction to R^10 and reconstruction. Middle row: an enlarged version of one of the images before and after PCA. Bottom: the images after dimensionality reduction to R^2. The different marks indicate different individuals.

Some images of faces are depicted on the top-left side of Figure 23.2. Using PCA, we reduced the dimensionality to R^10 and reconstructed back to the original dimension, which is 50^2. The resulting reconstructed images are depicted on the top-right side of Figure 23.2. Finally, on the bottom of Figure 23.2 we depict a 2-dimensional representation of the images. As can be seen, even from a 2-dimensional representation of the images we can still roughly separate different individuals.

Original images in R^{50×50}.


Projection to R^2. Different marks indicate different individuals.

Projection to R^10 and reconstruction.

William L. Hamilton, McGill University and Mila 33

Page 34:

Other feature selection methods

§ The goal: Find the input representation that produces the best generalization error.

§ Two classes of approaches:
  § Wrapper & Filter methods: Feature selection is applied as a pre-processing step.

§ Embedded methods: Feature selection is integrated in the learning (optimization) method, e.g. regularization.

William L. Hamilton, McGill University and Mila 34

Page 35:

Variable Ranking

§ Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the m’ highest-ranked features.

§ Pros / cons:

§ Need to select a scoring function.

§ Must select subset size (m’): cross-validation

§ Simple and fast – just need to compute a scoring function m times and sort m scores.

William L. Hamilton, McGill University and Mila 35

Page 36:

Scoring function: Correlation Criteria

(slide by Michel Verleysen)

R(j) = Σ_{i=1:n} (x_ij − x̄_j)(y_i − ȳ) / sqrt( Σ_{i=1:n} (x_ij − x̄_j)² · Σ_{i=1:n} (y_i − ȳ)² )

William L. Hamilton, McGill University and Mila 36
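A sketch of ranking features by the absolute value of this correlation score, using NumPy on synthetic data where only two features actually matter:

```python
# Sketch: rank features by absolute Pearson correlation with the target.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 6
X = rng.normal(size=(n, m))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=n)  # only features 0 and 3 matter

Xc = X - X.mean(axis=0)
yc = y - y.mean()
# R(j) = sum_i (x_ij - mean_j)(y_i - mean_y) / sqrt(sum_i (x_ij - mean_j)^2 * sum_i (y_i - mean_y)^2)
R = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())
ranking = np.argsort(-np.abs(R))
print("features ranked by |R(j)|:", ranking)
```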

Page 37:

Scoring function: Mutual information

§ Think of Xj and Y as random variables.

§ Mutual information between variable Xj and target Y:

§ Empirical estimate from data (assume discretized variables):

I(j) = ∫_{X_j} ∫_{Y} p(x_j, y) log [ p(x_j, y) / (p(x_j) p(y)) ] dx dy

I(j) = Σ_{X_j} Σ_{Y} P(X_j = x_j, Y = y) log [ P(X_j = x_j, Y = y) / (P(X_j = x_j) P(Y = y)) ]

William L. Hamilton, McGill University and Mila 37

Page 38:

Nonlinear dependencies with MI

§ Mutual information identifies nonlinear relationships between variables.

§ Example:
  § x uniformly distributed over [-1, 1]
  § y = x² + noise
  § z uniformly distributed over [-1, 1]
  § z and x are independent
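A sketch of this exact example: the linear correlation between x and y is near zero, but the estimated mutual information is clearly positive, while for the independent variable z it is near zero. It uses scikit-learn's nearest-neighbour MI estimator rather than the discretized estimate above.

```python
# Sketch: mutual information detects the nonlinear dependence y = x^2 + noise.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, size=n)
z = rng.uniform(-1, 1, size=n)               # independent of everything else
y = x**2 + rng.normal(scale=0.05, size=n)

features = np.column_stack([x, z])
print("correlation(x, y):", np.corrcoef(x, y)[0, 1])   # ~0 despite the dependence
print("MI estimates [x, z]:", mutual_info_regression(features, y, random_state=0))
```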

William L. Hamilton, McGill University and Mila 38

Page 39:

Variable Ranking

§ Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the m’ highest-ranked features.

§ Pros / cons:

§ Need to select a scoring function.

§ Must select subset size (m’): cross-validation

§ Simple and fast – just need to compute a scoring function m times and sort m scores.

§ Scoring function is defined for individual features (not subsets).

William L. Hamilton, McGill University and Mila 39

Page 40:

Best-Subset selection

§ Idea: Consider all possible subsets of the features, measure performance on a validation set, and keep the subset with the best performance.

§ Pros / cons?

§ We get the best model!

§ Very expensive to compute, since there is a combinatorial number of subsets.

William L. Hamilton, McGill University and Mila 40

Page 41:

Search space of subsets

n features, 2^n – 1 possible feature subsets!

Kohavi-John, 1997

William L. Hamilton, McGill University and Mila 41

Page 42:

Subset selection in practice

§ Formulate as a search problem, where the state is the feature set that is used, and search operators involve adding or removing features from the set

§ Constructive methods like forward/backward search

§ Local search methods, genetic algorithms

§ Use domain knowledge to help you group features together, to reduce size of search space

§ e.g., In NLP, group syntactic features together, semantic features, etc.
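As one concrete option, greedy forward search is available in scikit-learn as SequentialFeatureSelector; a sketch on a stand-in dataset:

```python
# Sketch: greedy forward feature selection (a constructive search over subsets).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Greedily add the feature that most improves the cross-validated score,
# until 5 features have been chosen.
selector = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```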

William L. Hamilton, McGill University and Mila 42

Page 43:

Regularization

§ Idea: Modify the objective function to constrain the model choice. Typically by adding a term (Σ_{j=1:m} |w_j|^p)^{1/p}.
§ Linear regression → Ridge regression, Lasso

§ Challenge: Need to adapt the optimization procedure (e.g. handle a non-smooth or non-convex objective).

§ Regularization can be viewed as a form of feature selection.

§ This approach is often used for very large natural (non-constructed) feature sets, e.g. images, speech, text, video.
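A minimal sketch of this embedded view of feature selection: with an L1 penalty (Lasso), many coefficients are driven exactly to zero, which amounts to discarding those features. The data below is synthetic and the names are placeholders.

```python
# Sketch: L1 regularization (Lasso) as embedded feature selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m = 200, 20
X = rng.normal(size=(n, m))
true_w = np.zeros(m)
true_w[[0, 3, 7]] = [3.0, -2.0, 1.5]           # only 3 relevant features
y = X @ true_w + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)          # non-zero coefficients = selected features
print("selected features:", selected)
```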

William L. Hamilton, McGill University and Mila 43

Page 44:

Steps to solving a supervised learning problem

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs.

• This defines the input space X and output space Y.

3. Choose a class of hypotheses / representations H.
   • E.g. linear functions.

4. Choose an error function (cost function) to define best hypothesis.
   • E.g. Least-mean squares.

5. Choose an algorithm for searching through space of hypotheses.

(If doing k-fold cross-validation, re-do feature selection for each fold.)

Evaluate on validation set

Evaluate final model/hypothesis on test set

William L. Hamilton, McGill University and Mila 44

Page 45:

Final comments

Classic paper on this:
I. Guyon, A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003), pp. 1157-1182.

http://machinelearning.wustl.edu/mlpapers/paper_files/GuyonE03.pdf

(and references therein.)

More recently, there has been a move towards learning the features end-to-end, using neural network architectures (more on this later in the course).

William L. Hamilton, McGill University and Mila 45