Lab 1

Lab 1: Getting started with Basic Learning Machines and the Overfitting Problem

Transcript of Lab 1

Page 1: Lab 1

1

Lab 1

Getting started with Basic Learning Machines and the Overfitting Problem

Page 2: Lab 1

2

Lab 1

Polynomial regression

Page 3: Lab 1

3

Matlab: POLY_GUI

• The code implements the ridge regression algorithm: $w = \arg\min_w \sum_i (1 - y_i f(x_i))^2 + \gamma \|w\|^2$

$f(x) = w_1 x + w_2 x^2 + \dots + w_n x^n = w \mathbf{x}^T$

$\mathbf{x} = [x, x^2, \dots, x^n]$

$w^T = X^+ Y$

$X^+ = X^T (X X^T + \gamma I)^{-1} = (X^T X + \gamma I)^{-1} X^T$

$X = [\mathbf{x}(1); \mathbf{x}(2); \dots; \mathbf{x}(p)]$ (matrix (p, n))

• The leave-one-out (LOO) error is obtained with the PRESS statistic (Predicted REsidual Sums of Squares):

• LOO error $= (1/p) \sum_k \left[ r_k / (1 - (X X^+)_{kk}) \right]^2$
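The formulas on this slide can be sketched in a few lines of base Matlab. This is only an illustration under stated assumptions, not the poly_gui source: the toy data, the variable names (gamma_reg, Xplus, loo_residuals) and the choice of degree are made up for the example. It builds the polynomial design matrix, computes the regularized pseudo-inverse, and checks the PRESS shortcut against explicit leave-one-out retraining.

```matlab
% Toy 1-D regression problem (illustrative data, not from the lab).
p = 20;                                  % number of training examples
n = 5;                                   % model degree
gamma_reg = 0.01;                        % ridge coefficient (gamma on the slide)
x = linspace(-1, 1, p)';                 % scalar inputs
y = sin(pi * x) + 0.1 * cos(7 * x);      % smooth target plus a deterministic perturbation

% Design matrix X (p-by-n): row k is [x_k, x_k^2, ..., x_k^n].
X = zeros(p, n);
for d = 1:n
    X(:, d) = x .^ d;
end

% Regularized pseudo-inverse X+ = (X'X + gamma I)^(-1) X' and weights w' = X+ y.
Xplus = (X' * X + gamma_reg * eye(n)) \ X';
w = Xplus * y;

% Training residuals r_k and PRESS leave-one-out residuals r_k / (1 - (X X+)_kk).
r = y - X * w;
H = X * Xplus;
loo_residuals = r ./ (1 - diag(H));
mse_train = mean(r .^ 2);
loo_error = mean(loo_residuals .^ 2);

% Explicit leave-one-out retraining, for comparison with the shortcut.
loo_explicit = zeros(p, 1);
for k = 1:p
    keep = [1:k-1, k+1:p];
    wk = (X(keep, :)' * X(keep, :) + gamma_reg * eye(n)) \ (X(keep, :)' * y(keep));
    loo_explicit(k) = y(k) - X(k, :) * wk;
end
fprintf('train MSE = %g, LOO error = %g\n', mse_train, loo_error);
```

Because each diagonal term (X X+)_kk lies between 0 and 1, every PRESS residual is at least as large in magnitude as the corresponding training residual, which is one reason the LOO error cannot fall below the training error in this setting.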

Page 4: Lab 1

4

Matlab: POLY_GUI

Page 5: Lab 1

5

Matlab: POLY_GUI

• At the prompt type: poly_gui;

• Vary the parameters. Refrain from hitting “CV”. Explain what happens in the following situations:

– Sample num. << Target degree (small noise)

– Large noise, small sample num.

– Target degree << Model degree

• Why is the LOO error sometimes larger than the training and test error?

• Are there local minima in the LOO error? Is the LOO error flat near the optimum?

• Propose ways of getting a better solution.

Page 6: Lab 1

6

CLOP Data Objects

The poly_gui emulates CLOP objects of type “data”:

X = rand(10,5)
Y = rand(10,1)
D = data(X,Y) % constructor
methods(D)
get_x(D)
get_y(D)
plot(D);

Page 7: Lab 1

7

CLOP Model Objects

poly_ridge is a “model” object.

P = poly_ridge; h = plot(P);
D = gene(P); plot(D, h);
[resu, P] = train(P, D); mse(resu)
Dt = gene(P);
[tresu, P] = test(P, Dt); mse(tresu)
plot(P, h);

Page 8: Lab 1

8

Lab 1

Support Vector Machines

Page 9: Lab 1

9

Support Vector Classifier

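Figure: two-class data in the plane x = [x1, x2]. The decision boundary f(x) = 0 separates the region f(x) > 0 from the region f(x) < 0.

$f(x) = \sum_{i \in SV} \alpha_i y_i k(x, x_i)$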

Boser-Guyon-Vapnik-1992

Page 10: Lab 1

10

Matlab: SVC_GUI

• At the prompt type: svc_gui;

• The code implements the Support Vector Machine algorithm with kernel $k(s, t) = (1 + s \cdot t)^q \exp(-\gamma \|s - t\|^2)$

• Regularization similar to ridge regression:

Hinge loss: $L(x_i) = \max(0, 1 - y_i f(x_i))$

Empirical risk: $\sum_i L(x_i)$

$w = \arg\min_w (1/C) \|w\|^2 + \sum_i L(x_i)$, where 1/C plays the role of the shrinkage.
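To make the objective concrete, here is a minimal sketch in base Matlab that evaluates the kernel, the hinge loss and the regularized risk for a fixed linear discriminant. It is not the svc_gui code: the function handles (kernel, hinge, objective), the toy patterns and the value of C are assumptions chosen for illustration, and no optimization is performed.

```matlab
% Kernel k(s, t) = (1 + s.t)^q * exp(-gam * ||s - t||^2) for row vectors s and t.
kernel = @(s, t, q, gam) (1 + s * t') .^ q .* exp(-gam * sum((s - t) .^ 2));

% Hinge loss per example and regularized risk for f(x) = x * w'.
hinge = @(y, f) max(0, 1 - y .* f);
objective = @(w, X, y, C) (1 / C) * (w * w') + sum(hinge(y, X * w'));

X = [2 0; 0 2; -2 0; 0 -2];   % four toy patterns, one per row
y = [1; 1; -1; -1];           % class labels
w = [0.5 0.5];                % a fixed weight vector (not optimized)
C = 10;

f = X * w';                   % discriminant values
losses = hinge(y, f);         % zero for every example with y * f >= 1
J = objective(w, X, y, C);    % (1/C) * ||w||^2 + empirical risk

k_poly = kernel([1 2], [3 4], 1, 0);   % gam = 0: pure polynomial kernel
k_rbf  = kernel([1 2], [1 2], 0, 1);   % q = 0: pure Gaussian kernel
```

Setting gam = 0 or q = 0 switches off one of the two factors, which is how a single kernel formula covers both the polynomial and the Gaussian case.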

Page 11: Lab 1

11

Lab 1

More loss functions…

Page 12: Lab 1

12

Loss Functions

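Figure: loss L(y, f(x)) plotted against the margin z = y f(x), for z from -1 to 2. The decision boundary is at z = 0 and the margin at z = 1; examples with z < 0 are misclassified, examples with z > 0 are well classified. Curves shown:

0/1 loss

square loss $(1 - z)^2$

SVC loss, exponent 1: $\max(0, 1 - z)$

SVC loss, exponent 2: $\max(0, 1 - z)^2$

logistic loss $\log(1 + e^{-z})$

Adaboost loss $e^{-z}$

Perceptron loss $\max(0, -z)$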
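As a quick numerical companion to the loss functions listed on this slide, the sketch below evaluates each one on a few margin values. The variable names are illustrative, and the convention that the 0/1 loss is zero exactly at z = 0 is an assumption.

```matlab
z = [-1 0 1 2];                       % a few margin values z = y * f(x)

zero_one   = double(z < 0);           % 0/1 loss (assumed 0 at z = 0)
square     = (1 - z) .^ 2;            % square loss
svc1       = max(0, 1 - z);           % SVC (hinge) loss
svc2       = max(0, 1 - z) .^ 2;      % squared hinge loss
logistic   = log(1 + exp(-z));        % logistic loss
adaboost   = exp(-z);                 % Adaboost (exponential) loss
perceptron = max(0, -z);              % Perceptron loss

% One row per loss, one column per margin value.
disp([z; zero_one; square; svc1; svc2; logistic; adaboost; perceptron]);
```

The table makes one difference visible: the square loss also penalizes well-classified examples with z > 1, while the hinge and Perceptron losses are exactly zero there.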

Page 13: Lab 1

13

Exercise: Gradient Descent

• Linear discriminant $f(x) = \sum_j w_j x_j$

• Functional margin $z = y f(x)$, $y = \pm 1$

• Compute $\partial z / \partial w_j$

• Derive the learning rules $\Delta w_j = -\eta \, \partial L / \partial w_j$ corresponding to the following loss functions:

square loss $(1 - z)^2$

SVC loss $\max(0, 1 - z)$

logistic loss $\log(1 + e^{-z})$

Adaboost loss $e^{-z}$

Perceptron loss $\max(0, -z)$
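As a hedged illustration of how the exercise can be approached (one possible worked case, not an official solution), the chain rule gives every learning rule the same shape. Here $\eta$ denotes a learning rate, a symbol assumed for this sketch, and $L'(z)$ is the derivative of the loss with respect to the margin $z$. Only the square loss and the Perceptron loss are worked out; the remaining losses follow the same pattern.

```latex
% Derivative of the margin z = y \sum_j w_j x_j with respect to w_j:
\frac{\partial z}{\partial w_j} = y\,x_j
\qquad\Longrightarrow\qquad
\Delta w_j = -\eta\,\frac{\partial L}{\partial w_j} = -\eta\,L'(z)\,y\,x_j .

% Square loss, L(z) = (1 - z)^2, so L'(z) = -2(1 - z):
\Delta w_j = 2\eta\,(1 - z)\,y\,x_j = 2\eta\,\bigl(y - f(x)\bigr)\,x_j
\quad \text{(using } y^2 = 1 \text{)} .

% Perceptron loss, L(z) = \max(0, -z), so L'(z) = -1 for z < 0 and 0 for z > 0:
\Delta w_j =
\begin{cases}
\eta\,y\,x_j & \text{if } z < 0 \text{ (misclassified example)},\\
0            & \text{if } z > 0 .
\end{cases}
```

The square-loss rule is the familiar least-mean-squares update: the weight moves in proportion to the prediction error $y - f(x)$.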

Page 14: Lab 1

14

Exercise: Dual Algorithms

• From the $\Delta w_j$ derive the $\Delta w$

• $w = \sum_i \alpha_i x_i$

• From the $\Delta w$, derive the $\Delta \alpha_i$ of the dual algorithms.

Page 15: Lab 1

15

Summary

• Modern ML algorithms optimize a penalized risk functional:

Page 16: Lab 1

16

Lab 2

Getting started with CLOP

Page 17: Lab 1

17

Lab 2

CLOP tutorial

Page 18: Lab 1

18

What is CLOP?

• CLOP = Challenge Learning Object Package.

• Based on the Spider developed at the Max Planck Institute.

• Two basic abstractions:

– Data object

– Model object

• Put the CLOP directory in your path.

• At the prompt type: use_spider_clop;

• If you have used poly_gui before, type: clear classes

Page 19: Lab 1

19

CLOP Data Objects

At the Matlab prompt:

addpath(<clop_dir>);
use_spider_clop;
X=rand(10,8);
Y=[1 1 1 1 1 -1 -1 -1 -1 -1]';
D=data(X,Y); % constructor
[p,n]=get_dim(D)
get_x(D)
get_y(D)

Page 20: Lab 1

20

CLOP Model Objects

D is a data object previously defined.

model = kridge; % constructor
[resu, model] = train(model, D);
resu, model.W, model.b0
Yhat = D.X*model.W' + model.b0
testD = data(rand(3,8), [-1 -1 1]');
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)
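The last call on this slide reports a balanced error rate (BER). The sketch below shows the usual definition, the average of the error rates of the two classes, in base Matlab. It is an assumption about what the measure computes, written for illustration with made-up scores and labels, and it is not the CLOP implementation of balanced_errate.

```matlab
scores = [0.9; -0.2; 0.4; 0.6; -0.7; 0.3];   % illustrative discriminant values
labels = [1; 1; 1; 1; -1; -1];               % true classes (4 positive, 2 negative)

predicted = sign(scores);                    % a score of exactly 0 counts as an error
pos = (labels == 1);
neg = (labels == -1);

err_pos = mean(predicted(pos) ~= 1);         % error rate on the positive class
err_neg = mean(predicted(neg) ~= -1);        % error rate on the negative class
ber = 0.5 * (err_pos + err_neg);             % balanced error rate
plain_error = mean(predicted ~= labels);     % ordinary error rate, for comparison
```

With unbalanced classes the two measures differ: here one error in each class gives a plain error of 2/6 but a BER of 0.375, because the mistake in the small class weighs more.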

Page 21: Lab 1

21

Hyperparameters and Chains

A model often has hyperparameters:

default(kridge)
hyper = {'degree=3', 'shrinkage=0.1'};
model = kridge(hyper);

Models can be chained:

model = chain({standardize,kridge(hyper)});
[resu, model] = train(model, D);
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)

Page 22: Lab 1

22

Hyper-parameters

• Kernel methods: kridge and svc:

$k(x, y) = (\mathrm{coef0} + x \cdot y)^{\mathrm{degree}} \exp(-\mathrm{gamma} \, \|x - y\|^2)$

$k_{ij} = k(x_i, x_j)$

$k_{ii} \leftarrow k_{ii} + \mathrm{shrinkage}$

• Naïve Bayes: naive: none

• Neural network: neural: units, shrinkage, maxiter

• Random Forest: rf (Windows only): mtry
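To show how the kernel hyperparameters act, the sketch below builds the kernel matrix for a tiny data set, adds the shrinkage to its diagonal, and solves the resulting linear system as in kernel ridge regression. It uses base Matlab only. The toy data, the variable names (gamma_k, Kreg, alpha) and the bias-free solve are assumptions for illustration; the actual kridge object may differ in its details.

```matlab
coef0 = 1; degree = 3; gamma_k = 0; shrinkage = 0.1;   % illustrative hyperparameter values

X = [0 0; 1 0; 0 1; 1 1];          % four patterns (XOR layout), one per row
y = [-1; 1; 1; -1];                % class labels
p = size(X, 1);

% Squared Euclidean distances ||x_i - x_j||^2 between all pairs of patterns.
sq = sum(X .^ 2, 2);
D2 = repmat(sq, 1, p) + repmat(sq', p, 1) - 2 * (X * X');

% Kernel matrix k_ij = (coef0 + x_i . x_j)^degree * exp(-gamma * ||x_i - x_j||^2).
K = (coef0 + X * X') .^ degree .* exp(-gamma_k * D2);

% Shrinkage is added to the diagonal: k_ii <- k_ii + shrinkage.
Kreg = K + shrinkage * eye(p);

% Kernel ridge coefficients and fitted values on the training patterns.
alpha = Kreg \ y;
fitted = K * alpha;
```

The XOR labels are not linearly separable in the input space, but with degree = 3 the fitted values have the same signs as the labels; a larger shrinkage pulls the fitted values toward zero.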

Page 23: Lab 1

23

Exercise

• Here are some of the pattern recognition CLOP objects: @rf @naive @svc @neural @gentleboost @lssvm @gkridge @kridge @klogistic @logitboost

• Try at the prompt: example(neural)

• Try other pattern recognition objects

• Try different sets of hyperparameters, e.g., example(svc({'gamma=1', 'shrinkage=0.001'}))

• Remember: use default(method) to get the HP.

Page 24: Lab 1

24

Lab 2

Example: Digit Recognition

Subset of the MNIST data of LeCun and Cortes used for the NIPS2003 challenge

Page 25: Lab 1

25

data(X, Y)

% Go to the Gisette directory:
cd('GISETTE')

% Load “validation” data:
Xt=load('gisette_valid.data');
Yt=load('gisette_valid.labels');

% Create a data object and examine it:
Dt=data(Xt, Yt);
browse(Dt, 2);

% Load “training” data (longer):
X=load('gisette_train.data');
Y=load('gisette_train.labels');
[p, n]=get_dim(Dt);
D=train(subsample(['p_max=' num2str(p)]), data(X, Y));
clear X Y Xt Yt

% Save for later use:
save('gisette', 'D', 'Dt');

Page 26: Lab 1

26

model(hyperparam)

% Define some hyperparameters:
hyper = {'degree=3', 'shrinkage=0.1'};

% Create a kernel ridge regression model:
model = kridge(hyper);

% Train it and test it:
[resu, Model] = train(model, D);
tresu = test(Model, Dt);

% Visualize the results:
roc(tresu);
idx=find(tresu.X.*tresu.Y<0); % indices of misclassified test examples
browse(get(Dt, idx), 2);

Page 27: Lab 1

27

Exercise

• Here are some pattern recognition CLOP objects: @rf @naive @gentleboost @svc @neural @logitboost @kridge @lssvm @klogistic

• Instantiate a model with some hyperparameters (use default(method) to get the HP)

• Vary the HP and the number of training examples (Hint: use get(D, 1:n) to restrict the data to n examples).

Page 28: Lab 1

28

chain({model1, model2,…})

% Combine preprocessing and kernel ridge regression:
my_prepro=normalize;
model = chain({my_prepro,kridge(hyper)});

ensemble({model1, model2,…})

% Combine replicas of a base learner:
for k=1:10
    base_model{k}=neural;
end
model=ensemble(base_model);

Page 29: Lab 1

29

Exercise

• Here are some preprocessing CLOP objects: @normalize @standardize @fourier

• Chain a preprocessing and a model, e.g.,

model=chain({fourier, kridge('degree=3')});
my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
model=chain({normalize, my_classif});

• Train, test, visualize the results. Hint: you can browse the preprocessed data:

browse(train(standardize, D), 2);

Page 30: Lab 1

30

Summary

% After creating your complex model, just one command: train

model=ensemble({chain({standardize,kridge(hyper)}),chain({normalize,naive})});

[resu, Model] = train(model, D);

% After training your complex model, just one command: test

tresu = test(Model, Dt);

% You can use a “cv” object to perform cross-validation:

cv_model=cv(model);
[resu, Model] = train(cv_model, D);
roc(resu);

Page 31: Lab 1

31

Lab 3

Getting started with Feature Selection

Page 32: Lab 1

32

POLY_GUI again…

clear classes

poly_gui;

•Check the “Multiplicative updates” (MU) box.

•Play with the parameters.

•Try CV

•Compare with no MU

Page 33: Lab 1

33

Lab 3

Exploring feature selection methods

Page 34: Lab 1

34

Re-load the GISETTE data

% Start CLOP:
clear classes
use_spider_clop;

% Go to the Gisette directory:
cd('GISETTE')
load('gisette');

Page 35: Lab 1

35

Visualization

1) Create a heatmap of the data matrix or a subset:

show(D);
show(get(D,1:10, 1:2:500));

2) Look at individual patterns:

browse(D);
browse(D, 2); % For 2d data
% Display feature positions:
browse(D, 2, [212, 463, 429, 239]);

3) Make a scatter plot of a few features:

scatter(D, [212, 463, 429, 239]);

Page 36: Lab 1

36

Example

my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});

model=chain({normalize, s2n('f_max=100'), my_classif});

[resu, Model] = train(model, D);
tresu = test(Model, Dt);
roc(tresu);
% Show the misclassified first
[s,idx]=sort(tresu.X.*tresu.Y);
browse(get(Dt, idx), 2, Model{2});

Page 37: Lab 1

37

Some Filters in CLOP

Univariate:

• @s2n (Signal to noise ratio.)

• @Ttest (T statistic; similar to s2n.)

• @Pearson (Uses Matlab corrcoef. Gives the same results as Ttest when classes are balanced.)

• @aucfs (ranksum test)

Multivariate:

• @relief (no elimination of redundancy)

• @gs (Gram-Schmidt orthogonalization; complementary features)

Page 38: Lab 1

38

Exercise

• Change the feature selection algorithm

• Visualize the features

• What can you say of the various methods?

• Which one gives the best results for 2, 10, 100 features?

• Can you improve by changing the preprocessing? (Hint: try @pc_extract)

Page 39: Lab 1

39

Lab 3

Feature significance

Page 40: Lab 1

40

T-test

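• Normally distributed classes, equal variance $\sigma^2$ unknown; estimated from data as $\sigma^2_{within}$.

• Null hypothesis $H_0$: $\mu_+ = \mu_-$

• T statistic: If $H_0$ is true,

$t = (\mu_+ - \mu_-) / (\sigma_{within} \sqrt{1/m_+ + 1/m_-}) \sim \mathrm{Student}(m_+ + m_- - 2 \text{ d.f.})$

Figure: class-conditional densities $P(X_i | Y = -1)$ and $P(X_i | Y = 1)$ along the feature axis $x_i$, with means $\mu_-$, $\mu_+$ and standard deviations $\sigma_-$, $\sigma_+$.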
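The T statistic of this slide can be computed for a single feature as in the sketch below, written in base Matlab with made-up sample values. The variable names are illustrative and this is not the code of the CLOP Ttest object. The p-value is left out because it requires the Student cumulative distribution, which in Matlab comes from the Statistics Toolbox.

```matlab
x_pos = [2; 4];                 % feature values of the examples of class +1
x_neg = [0; 1; 2];              % feature values of the examples of class -1

m_pos = numel(x_pos);
m_neg = numel(x_neg);
mu_pos = mean(x_pos);
mu_neg = mean(x_neg);

% Pooled within-class variance with m+ + m- - 2 degrees of freedom.
ss_within = sum((x_pos - mu_pos) .^ 2) + sum((x_neg - mu_neg) .^ 2);
dof = m_pos + m_neg - 2;
sigma_within = sqrt(ss_within / dof);

% T statistic: difference of class means in units of its standard error.
t = (mu_pos - mu_neg) / (sigma_within * sqrt(1 / m_pos + 1 / m_neg));
```

Ranking features by the absolute value of t gives a univariate filter: a large |t| means the two class means are far apart relative to the within-class spread.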

Page 41: Lab 1

41

Evaluation of pval and FDR

• Ttest object:

– computes pval analytically

– FDR~pval*nsc/n

• probe object:

– takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)

– pval~nsp/np

– FDR~pval*nsc/n

Page 42: Lab 1

42

Analytic vs. probe

Figure: FDR (vertical axis, 0 to 1) as a function of feature rank (horizontal axis, 0 to 5000).

Page 43: Lab 1

43

Example

[resu, FS] = train(Ttest, D);
[resu, PFS] = train(probe(Ttest), D);

figure('Name', 'pvalue');
plot(get_pval(FS, 1), 'r');
hold on; plot(get_pval(PFS, 1));
figure('Name', 'FDR');
plot(get_fdr(FS, 1), 'r');
hold on; plot(get_fdr(PFS, 1));

Page 44: Lab 1

44

Exercise

• What could explain differences between the pvalue and fdr with the analytic and probe method?

• Replace Ttest with chain({rmconst('w_min=0'), Ttest})

• Recompute the pvalue and fdr curves. What do you notice?

• Choose an optimum number fnum of features based on pvalue or FDR. Visualize with browse(D, 2,FS, fnum);

• Create a model with fnum. Is fnum optimal? Do you get something better with CV?

Page 45: Lab 1

45

Lab 3

Local feature selection

Page 46: Lab 1

46

Exercise

Consider the 1 nearest neighbor algorithm. We define the following score:

Where s(k) (resp. d(k)) is the index of the nearest neighbor of xk belonging to the same class (resp. different class) as xk.

Page 47: Lab 1

47

Exercise

1. Motivate the choice of such a cost function to approximate the generalization error (qualitative answer)

2. How would you derive an embedded method to perform feature selection for 1 nearest neighbor using this functional?

3. Motivate your choice (what makes your method an ‘embedded method’ and not a ‘wrapper’ method)

Page 48: Lab 1

48

Relief

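Figure: a pattern with its nearest hit (nearest neighbor of the same class, at distance Dhit) and its nearest miss (nearest neighbor of a different class, at distance Dmiss).

Relief = <Dmiss/Dhit>

Local_Relief = Dmiss/Dhit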
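The two scores can be sketched in base Matlab as follows. This is a literal reading of the slide under several assumptions, not the CLOP relief or local_relief code: nearest hits and misses are found with the Euclidean distance over all features, Dhit and Dmiss are the distances projected on each feature, and the angle brackets are taken to mean an average over patterns. The toy data are chosen so that no projected Dhit is zero.

```matlab
% Toy data: feature 1 separates the classes, feature 2 is mostly noise.
X = [0 0.0; 1 0.3; 2 0.1; 5 0.2; 6 0.0; 7 0.4];
Y = [1; 1; 1; -1; -1; -1];
[p, n] = size(X);

local_relief = zeros(p, n);                 % one score per pattern and per feature
for k = 1:p
    d = sqrt(sum((X - repmat(X(k, :), p, 1)) .^ 2, 2));   % distances to pattern k
    d(k) = Inf;                             % a pattern is not its own neighbor
    same = (Y == Y(k));
    d_hit = d;   d_hit(~same) = Inf;        % candidates of the same class
    d_miss = d;  d_miss(same) = Inf;        % candidates of the other class
    [~, s] = min(d_hit);                    % index of the nearest hit
    [~, m] = min(d_miss);                   % index of the nearest miss
    Dhit = abs(X(k, :) - X(s, :));          % per-feature distance to the hit
    Dmiss = abs(X(k, :) - X(m, :));         % per-feature distance to the miss
    local_relief(k, :) = Dmiss ./ Dhit;
end
relief = mean(local_relief, 1);             % average over patterns
```

On this toy set the first feature gets a much larger Relief score than the second, as expected. A practical version would need to guard against Dhit = 0, which this sketch does not do.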

Page 49: Lab 1

49

Exercise

[resu, FS] = train(relief, D);browse(D, 2,FS, 20);[resu, LFS] = train(local_relief,D);browse(D, 2,LFS, 20);

•Propose a modification to the nearest neighbor algorithm that uses features relevant to individual patterns (like those provided by “local_relief”).

•Do you anticipate such an algorithm to perform better than the non-local version using “relief”?

Page 50: Lab 1

50

Epilogue

Becoming a pro and playing with other datasets

Page 51: Lab 1

51

Some CLOP objects

Basic learning machines

Feature selection, pre- and post- processing

Compound models

Page 52: Lab 1

52

http://clopinet.com/challenges/

• Challenges in

– Feature selection

– Performance prediction

– Model selection

– Causality

• Large datasets

Page 53: Lab 1

53

MADELON: Best BER = 6.22 ± 0.57%, n0=20 (4%), BER0=7.33%

my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})

DOROTHEA: Best BER = 8.54 ± 0.99%, n0=1000 (1%), BER0=12.37%

my_model=chain({TP('f_max=1000'), naive, bias});

Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A. Pletscher, Georg Schneider and Markus Uhr, Pattern Recognition Letters, Volume 28, Issue 12, 1 September 2007, Pages 1438-1444.

Dataset    Size      Type            Features  Training Examples  Validation Examples  Test Examples
Arcene     8.7 MB    Dense           10000     100                100                  700
Gisette    22.5 MB   Dense           5000      6000               1000                 6500
Dexter     0.9 MB    Sparse integer  20000     300                300                  2000
Dorothea   4.7 MB    Sparse binary   100000    800                350                  800
Madelon    2.9 MB    Dense           500       2000               600                  1800

Class taught at ETH, Zurich, winter 2005

Task of the students:

• Baseline method provided, BER0 performance and n0 features.

• Get BER<BER0 or BER=BER0 but n<n0.

• Extra credit for beating best challenge entry.

Figure: one sample visualization per dataset, with panels labeled GISETTE, DOROTHEA, DEXTER, MADELON and ARCENE. One panel shows a text excerpt:

NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world’s largest electronic agency securities broker, today announced tha

DEXTER: Best BER = 3.30 ± 0.40%, n0=300 (1.5%), BER0=5%

my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})

GISETTE: Best BER = 1.26 ± 0.14%, n0=1000 (20%), BER0=1.80%

my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});

ARCENE: Best BER = 11.9 ± 1.2%, n0=1100 (11%), BER0=14.7%

my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})

NIPS 2003 Feature Selection Challenge

Page 54: Lab 1

54

NIPS 2006 Model Selection Game

First place: Juha Reunanen, cross-indexing-7

Dataset   CLOP models selected
ADA       2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA      6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA      3*{norm,svc(degree=1),bias}
NOVA      5*{norm,gentleboost(kridge),bias}
SYLVA     4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}

sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)

Second place: Hugo Jair Escalante Balderas, BRun2311062

Dataset   CLOP models selected
ADA       {sns, std, norm, neural(units=5), bias}
GINA      {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA      {std, norm, gentleboost(kridge), bias}
NOVA      {norm,gentleboost(neural), bias}
SYLVA     {std, norm, neural(units=1), bias}

sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)

Note: entry Boosting_1_001_x900 gave better results, but was older.

Figure: one sample visualization per dataset, with panels labeled NOVA, GINA, HIVA, ADA and SYLVA. One panel shows a text excerpt:

Subject: Re: Goalie masks
Lines: 21

Tom Barrasso wore a great mask, one time, last season. It was all black, with Pgh city scenes on it. The "Golden Triangle" graced the top, along with a steel mill on one side and the Civic Arena on the other. On the back of the helmet was the old Pens' logo the current (at the time) Pens logo, and a space for the "new" logo.

Lori

Dataset   Domain               Feature #   Training #   Validation #   Test #
ADA       Marketing            48          4147         415            41471
GINA      Digit recognition    970         3153         315            31532
HIVA      Drug discovery       1617        3845         384            38449
NOVA      Text classification  16969       1754         175            17537
SYLVA     Ecology              216         13086        1309           130857

Proc. IJCNN07, Orlando, FL, Aug, 2007:

PSMS for Neural Networks, H. Jair Escalante, Manuel Montes y Gómez, and Luis Enrique Sucar

Model Selection and Assessment Using Cross-indexing, Juha Reunanen