
1

Introduction to Predictive Learning

Electrical and Computer Engineering

LECTURE SET 2

Basic Learning Approaches and Complexity Control

2

OUTLINE

2.0 Objectives

2.1 Terminology and Basic Learning Problems

2.2 Basic Learning Approaches

2.3 Generalization and Complexity Control

2.4 Application Example

2.5 Summary

3

2.0 Objectives

1. To quantify the notions of explanation, prediction and model
2. Introduce terminology
3. Describe basic learning methods

• Past observations ~ data points
• Explanation (model) ~ function

Learning ~ function estimation
Prediction ~ using the estimated model to make predictions

4

2.0 Objectives (cont'd)

• Example: classification, given training samples and a model
  Goal 1: explanation of the training data
  Goal 2: generalization (for future data)
• Learning is ill-posed

5

Learning as Induction

Induction ~ function estimation from data

Deduction ~ prediction for new inputs

6

2.1 Terminology and Learning Problems

• Input and output variables

[Figure: a system with input x, output y, and unobserved factors z; scatter plot of observed (x, y) samples]

• Learning ~ estimation of the mapping F: x → y

• Statistical dependency vs causality

7

2.1.1 Types of Input and Output Variables

• Real-valued

• Categorical (class labels)

• Ordinal (or fuzzy) variables

• Aside: fuzzy sets and fuzzy logic

[Figure: fuzzy membership functions LIGHT, MEDIUM, HEAVY over Weight (lbs); membership value on the vertical axis, weight from 75 to 225 lbs on the horizontal axis]

8

Data Preprocessing and Scaling

• Preprocessing is required with observational data (step 4 in the general experimental procedure)
  Examples: …
• Basic preprocessing includes
  - summary univariate statistics: mean, standard deviation, min and max values, range, boxplot, computed independently for each input/output variable
  - detection (and removal) of outliers
  - scaling of input/output variables (may be required for some learning algorithms)
• Visual inspection of data is tedious but useful
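The basic preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the lecture's exact procedure: the 3-standard-deviation outlier rule, the helper names, and the sample values are assumptions made for the example.

```python
# Minimal preprocessing sketch: summary statistics, a crude outlier screen,
# and scaling of each variable to the [0, 1] range.
import numpy as np

def summarize(x):
    """Basic univariate summary for one input/output variable."""
    return {"mean": x.mean(), "std": x.std(), "min": x.min(),
            "max": x.max(), "range": x.max() - x.min()}

def remove_outliers(X, n_std=3.0):
    """Keep rows whose every coordinate lies within n_std standard deviations."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < n_std).all(axis=1)]

def minmax_scale(X):
    """Scale each column to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

# Example: two-column data (body weight in kg, brain weight in g)
data = np.array([[1.35, 8.1], [465.0, 423.0], [36.33, 119.5], [62.0, 1320.0]])
print(summarize(data[:, 0]))
print(minmax_scale(remove_outliers(data)))
```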

9

Example Data Set: animal body & brain weight

 #   Animal             Body weight (kg)   Brain weight (g)
 1   Mountain beaver             1.350              8.100
 2   Cow                       465.000            423.000
 3   Gray wolf                  36.330            119.500
 4   Goat                       27.660            115.000
 5   Guinea pig                  1.040              5.500
 6   Diplodocus              11700.000             50.000
 7   Asian elephant           2547.000           4603.000
 8   Donkey                    187.100            419.000
 9   Horse                     521.000            655.000
10   Potar monkey               10.000            115.000
11   Cat                         3.300             25.600
12   Giraffe                   529.000            680.000
13   Gorilla                   207.000            406.000
14   Human                      62.000           1320.000

10

Example Data Set (cont'd)

 #   Animal             Body weight (kg)   Brain weight (g)
15   African elephant         6654.000           5712.000
16   Triceratops              9400.000             70.000
17   Rhesus monkey               6.800            179.000
18   Kangaroo                   35.000             56.000
19   Hamster                     0.120              1.000
20   Mouse                       0.023              0.400
21   Rabbit                      2.500             12.100
22   Sheep                      55.500            175.000
23   Jaguar                    100.000            157.000
24   Chimpanzee                 52.160            440.000
25   Brachiosaurus           87000.000            154.500
26   Rat                         0.280              1.900
27   Mole                        0.122              3.000
28   Pig                       192.000            180.000

11

Original Unscaled Animal Data: what points are outliers?

12

Animal Data with outliers removed and scaled to the [0, 1] range: humans appear in the top left corner

[Figure: scatter plot of the scaled data; Body weight on the horizontal axis, Brain weight on the vertical axis, both in [0, 1]]

13

2.1.2 Supervised Learning: Regression

• Data in the form (x, y), where
  - x is a multivariate input (i.e. a vector)
  - y is a univariate output ('response')
• Regression: y is real-valued
  Estimation of a real-valued function x → y

[Figure: scatter plot of noisy univariate training samples (x, y), x in [0, 1]]

14

2.1.2 Supervised Learning: Classification

• Data in the form (x, y), where
  - x is a multivariate input (i.e. a vector)
  - y is a univariate output ('response')
• Classification: y is categorical (a class label)
  Estimation of an indicator function x → y

15

2.1.2 Unsupervised Learning

• Data in the form (x), where
  - x is a multivariate input (i.e. a vector)

• Goal 1: data reduction or clustering
  Clustering = estimation of a mapping x → c (cluster index)

16

Unsupervised Learning (cont’d)

• Goal 2: dimensionality reduction

Finding low-dimensional model of the data

17

2.1.3 Other (nonstandard) learning problems

• Multiple model estimation:

18

OUTLINE

2.0 Objectives

2.1 Terminology and Learning Problems

2.2 Basic Learning Approaches

- Parametric Modeling

- Non-parametric Modeling

- Data Reduction

2.3 Generalization and Complexity Control

2.4 Application Example

2.5 Summary

19

2.2.1 Parametric Modeling

Given training data $(\mathbf{x}_i, y_i),\ i = 1, 2, \dots, n$

(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)

• Example: linear regression $F(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$,
  with parameters chosen to minimize the squared fitting error
  $\sum_{i=1}^{n} \left( y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \right)^2 \to \min$
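A minimal sketch of this parametric approach, assuming NumPy: the linear model F(x) = (w · x) + b is fit by least squares, i.e. by minimizing the squared fitting error above. The helper name fit_linear and the synthetic data are illustrative only.

```python
# Parametric modeling by least squares: specify F(x) = (w . x) + b,
# then estimate (w, b) from the training data.
import numpy as np

def fit_linear(X, y):
    """Return (w, b) minimizing sum_i (y_i - (w . x_i) - b)^2."""
    A = np.column_stack([X, np.ones(len(X))])     # append constant column for b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
    return coef[:-1], coef[-1]

# Synthetic training data (x_i, y_i), i = 1..n
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 1))
y = 2.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=10)
w, b = fit_linear(X, y)
print("w =", w, "b =", b)
```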

20

Parametric Modeling (cont'd)

Given training data $(\mathbf{x}_i, y_i),\ i = 1, 2, \dots, n$

(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)

• Example: univariate classification

21

2.2.2 Non-Parametric Modeling

Given training data $(\mathbf{x}_i, y_i),\ i = 1, 2, \dots, n$,
estimate the model (for a given $\mathbf{x}_0$) as a 'local average' of the data.

Note: need to define 'local' and 'average'

• Example: k-nearest-neighbors regression
  $f(\mathbf{x}_0) = \frac{1}{k} \sum_{j=1}^{k} y_j$,
  where the sum is over the $k$ training samples closest to $\mathbf{x}_0$
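A minimal k-nearest-neighbors regression sketch, assuming NumPy: the estimate at a query point x0 is the local average of the k training responses whose inputs are closest in Euclidean distance. The function name knn_regress and the synthetic data are illustrative.

```python
# k-NN regression: 'local' = k nearest points, 'average' = mean of their y-values.
import numpy as np

def knn_regress(X, y, x0, k):
    """Predict f(x0) as the mean y over the k nearest training inputs."""
    d = np.linalg.norm(X - x0, axis=1)      # distances to all training points
    nearest = np.argsort(d)[:k]             # indices of the k closest samples
    return y[nearest].mean()

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, size=10)
print(knn_regress(X, y, np.array([0.5]), k=4))
```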

22

2.2.3 Data Reduction Approach

Given training data, estimate the model as a 'compact encoding' of the data.

Note: 'compact' ~ number of bits needed to encode the model

• Example: piece-wise linear regression
  How many parameters are needed for a two-linear-component model?

23

Example: piece-wise linear regression vs linear regression

[Figure: piece-wise linear fit vs a single linear fit to noisy (x, y) data, x in [0, 1]]

24

Data Reduction Approach (cont’d)

Data Reduction approaches are commonly used for unsupervised learning tasks.

• Example: clustering.

Training data encoded by 3 points (cluster centers)

Issues:
- How to find the centers?
- How to select the number of clusters?
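One common way to find the centers is the k-means algorithm; the sketch below (assuming NumPy) is an illustrative implementation under stated assumptions, not necessarily the method used in the lecture. The number of clusters is left as an input parameter.

```python
# k-means as 'compact encoding': the data set is represented by a few centers.
import numpy as np

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Return cluster centers and the cluster index assigned to each sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(n_clusters)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.2, 0.5, 0.8)])
centers, labels = kmeans(X, n_clusters=3)
print(centers)
```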

25

Inductive Learning Setting

• Induction and deduction in philosophy:
  All observed swans are white (data samples).
  Therefore, all swans are white.
• Model estimation ~ inductive step, i.e. estimating a function from data samples
• Prediction ~ deductive step

• Discussion: which of the 3 modeling approaches follow inductive learning?
• Do humans implement inductive inference?

26

OUTLINE

2.0 Objectives

2.1 Terminology and Learning Problems

2.2 Modeling Approaches & Learning Methods

2.3 Generalization and Complexity Control

- Prediction Accuracy (generalization)

- Complexity Control: examples

- Resampling

2.4 Application Example

2.5 Summary

27

2.3.1 Prediction Accuracy

• Inductive learning ~ function estimation
• All modeling approaches implement 'data fitting' ~ explaining the data
• BUT the true goal is prediction
• Two possible goals of learning:
  - estimation of the 'true function'
  - good generalization for future data
• Are these two goals equivalent?
• If not, which one is more practical?

28

Explanation vs Prediction

(a) Classification (b) Regression

29

Inductive Learning Setting

• The learning machine observes samples (x, y) and returns an estimated response
  $\hat{y} = f(\mathbf{x}, w)$
• Recall 'first-principles' vs 'empirical' knowledge
  Two modes of inference: identification vs imitation
• Risk functional:
  $R(w) = \int \mathrm{Loss}\big(y, f(\mathbf{x}, w)\big)\, dP(\mathbf{x}, y) \to \min$

30

Discussion

• The mathematical formulation is useful for quantifying
  - explanation ~ fitting error (on training data)
  - generalization ~ prediction error (on future data)
• Natural assumptions:
  - the future is similar to the past: stationary P(x, y), i.i.d. data
  - a fixed discrepancy measure or loss function, e.g. MSE
• What if these assumptions do not hold?

31

Example: Regression

Given: training data $(\mathbf{x}_i, y_i),\ i = 1, 2, \dots, n$

Find a function $f(\mathbf{x}, w)$ that minimizes the squared error over a large number (N) of future samples:
$\frac{1}{N} \sum_{k=1}^{N} \left[ y_k - f(\mathbf{x}_k, w) \right]^2 \to \min$,
i.e. (as N grows) $\int \left( y - f(\mathbf{x}, w) \right)^2 dP(\mathbf{x}, y) \to \min$

BUT future data is unknown ~ P(x, y) is unknown

[Figure: scatter plot of noisy training samples (x, y), x in [0, 1]]

32

2.3.2 Complexity Control: parametric modeling

Consider regression estimation:
• Ten training samples from $y = x^2 + \xi$, where $\xi \sim N(0, \sigma^2)$, $\sigma^2 = 0.25$
• Fitting linear and second-order polynomial models:

33

Complexity Control: local estimation

Consider regression estimation:
• Ten training samples from $y = x^2 + \xi$, where $\xi \sim N(0, \sigma^2)$, $\sigma^2 = 0.25$
• Using k-nn regression with k = 1 and k = 4:

34

Complexity Control (cont’d)

• Complexity (of admissible models) affects generalization (for future data)

• Specific complexity indices:
  - parametric models: ~ number of parameters
  - local modeling: size of the local region
  - data reduction: number of clusters

• Complexity control = choosing good complexity (~ good generalization) for the given (training) data

35

How to Control Complexity?

• Two approaches: analytic and resampling
• Analytic criteria estimate the prediction error as a function of the fitting error and the model complexity.
  For regression problems the estimate takes the form
  $R_{est} = r(p, n) \cdot R_{emp}$,
  where $p = \mathrm{DoF}/n$, $n$ ~ sample size, $\mathrm{DoF}$ ~ degrees of freedom

Representative analytic criteria for regression:
• Schwartz Criterion: $r(p, n) = 1 + \frac{\ln n}{2} \cdot \frac{p}{1 - p}$
• Akaike's FPE: $r(p) = \frac{1 + p}{1 - p}$
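A small numerical sketch of these criteria, using the penalty-factor forms shown above; the sample size, degrees of freedom, and fitting error below are hypothetical values chosen only to show how the estimated risk is computed.

```python
# Analytic complexity control: estimated risk = r(p, n) * R_emp, with p = DoF / n.
import math

def fpe(p):
    """Akaike's final prediction error penalty factor: (1 + p) / (1 - p)."""
    return (1.0 + p) / (1.0 - p)

def schwartz(p, n):
    """Schwartz criterion penalty factor: 1 + (ln n / 2) * p / (1 - p)."""
    return 1.0 + 0.5 * math.log(n) * p / (1.0 - p)

n, dof, r_emp = 25, 5, 0.07        # hypothetical sample size, DoF, and fitting error
p = dof / n
print("FPE estimate of prediction risk:", fpe(p) * r_emp)
print("Schwartz estimate of prediction risk:", schwartz(p, n) * r_emp)
```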

36

2.3.3 Resampling

• Split the available data into two sets: training + validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the prediction error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error

BUT the results are sensitive to how the data is split

37

K-fold cross-validation

1. Divide the training data Z into k randomly selected, disjoint subsets {Z_1, Z_2, ..., Z_k} of size n/k

2. For each 'left-out' validation set Z_i:
   - use the remaining data to estimate the model $\hat{y} = f_i(\mathbf{x})$
   - estimate the prediction error on Z_i:
     $r_i = \frac{k}{n} \sum_{(\mathbf{x}, y) \in Z_i} \left( f_i(\mathbf{x}) - y \right)^2$

3. Estimate the average prediction risk as
   $R_{cv} = \frac{1}{k} \sum_{i=1}^{k} r_i$
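A minimal k-fold cross-validation sketch, assuming NumPy. The fit/predict callables, the polynomial-degree selection loop, and the synthetic sine-squared data are illustrative assumptions, not the exact experiment reported on the next slide.

```python
# k-fold cross-validation for model selection.
import numpy as np

def kfold_risk(X, y, fit, predict, k=5, seed=0):
    """Estimate prediction risk (MSE) by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)               # k disjoint validation sets
    risks = []
    for i in range(k):
        val = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])          # estimate model on remaining data
        err = predict(model, X[val]) - y[val]    # prediction error on fold Z_i
        risks.append(np.mean(err ** 2))
    return np.mean(risks)                        # R_cv = average over folds

# Example: choose the polynomial degree m by 5-fold cross-validation
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=25)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 0.2, size=25)
for m in range(1, 6):
    r = kfold_risk(x[:, None], y,
                   fit=lambda Xt, yt, m=m: np.polyfit(Xt[:, 0], yt, m),
                   predict=lambda c, Xv: np.polyval(c, Xv[:, 0]))
    print(f"degree {m}: estimated risk {r:.4f}")
```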

38

Example of model selection (1)

• 25 samples are generated as $y = \sin^2(2\pi x) + \xi$, with x uniformly sampled in [0, 1] and noise $\xi \sim N(0, 1)$
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation. The curve shows the selected polynomial model, along with the training (*) and validation (*) data points, for one partitioning.

  m   Estimated R via cross-validation
  1   0.1340
  2   0.1356
  3   0.1452
  4   0.1286
  5   0.0699
  6   0.1130
  7   0.1892
  8   0.3528
  9   0.3596
 10   0.4006

39

Example of model selection (2)

• Same data set, but estimated using k-nn regression.
• The optimal value k = 7 is chosen via 5-fold cross-validation. The curve shows the k-nn model, along with the training (*) and validation (*) data points, for one partitioning.

  k   Estimated R via cross-validation
  1   0.1109
  2   0.0926
  3   0.0950
  4   0.1035
  5   0.1049
  6   0.0874
  7   0.0831
  8   0.0954
  9   0.1120
 10   0.1227

40

More on Resampling

• Leave-one-out (LOO) cross-validation
  - extreme case of k-fold with k = n (the number of samples)
  - efficient use of data, but requires n model estimates
• The final (selected) model depends on:
  - the random data
  - the random partitioning of the data into k subsets (folds)
  so the same resampling procedure may yield different model selection results
• Some applications may use a non-random split of the data into (training + validation)
• Model selection via resampling is based on the estimated prediction risk (error).
• Does this estimated error reflect the true prediction accuracy of the final model?

41

Resampling for estimating true risk

• The prediction risk (test error) of a method can also be estimated via resampling
• Partition the data into: training / validation / test
• Test data should never be used for model estimation
• Double resampling method:
  - inner resampling for complexity control
  - outer resampling for estimating the prediction performance of the method
• Estimation of the prediction risk (test error) is critical for comparing different learning methods
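A rough double-resampling sketch, assuming NumPy: an outer 5-fold loop estimates the test error of the method, while an inner 6-fold loop on each outer training set selects the complexity parameter (here, the neighborhood size k of a simple k-NN classifier). All names, the candidate k values, and the synthetic data are assumptions for illustration.

```python
# Double resampling: outer folds estimate test error, inner folds select complexity.
import numpy as np

def double_resampling(X, y, candidates, fit_and_score, outer_k=5, inner_k=6, seed=0):
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(X)), outer_k)
    test_errors = []
    for i in range(outer_k):
        test = outer_folds[i]
        train = np.hstack([outer_folds[j] for j in range(outer_k) if j != i])
        inner_folds = np.array_split(rng.permutation(train), inner_k)

        def inner_cv(c):
            errs = []
            for m in range(inner_k):
                val = inner_folds[m]
                tr = np.hstack([inner_folds[j] for j in range(inner_k) if j != m])
                errs.append(fit_and_score(c, X[tr], y[tr], X[val], y[val]))
            return np.mean(errs)

        best = min(candidates, key=inner_cv)      # complexity chosen by inner CV
        test_errors.append(fit_and_score(best, X[train], y[train], X[test], y[test]))
    return np.mean(test_errors)                   # estimated test error of the method

def knn_error(k, Xtr, ytr, Xte, yte):
    """Classification error of a k-NN rule fit on (Xtr, ytr), tested on (Xte, yte)."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    votes = ytr[np.argsort(d, axis=1)[:, :k]]
    pred = (votes.mean(axis=1) > 0.5).astype(int)   # majority vote for 0/1 labels
    return np.mean(pred != yte)

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(double_resampling(X, y, candidates=[1, 7, 15, 31], fit_and_score=knn_error))
```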

42

Example of model selection for a k-NN classifier via 6-fold cross-validation: Ripley's data.

[Figure: optimal decision boundary for k = 14]

43

Example of model selection for a k-NN classifier via 6-fold cross-validation: Ripley's data.

[Figure: optimal decision boundary for k = 50]

Which one is better, k = 14 or k = 50?

44

Estimating test error of a method

• For the same example (Ripley's data), what is the true test error of the k-NN method?
• Use double resampling, i.e. 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

 Fold #    k    Validation error   Test error
   1      20        11.76%            14%
   2       9         0%                8%
   3       1        17.65%            10%
   4      12         5.88%            18%
   5       7        17.65%            14%
 mean               10.59%            12.8%

• Note: the optimal k-values differ, and the errors vary across folds, due to the high variability of the random partitioning of the data

45

Estimating test error of a method (cont'd)

• Another realization of double resampling, i.e. 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

 Fold #    k    Validation error   Test error
   1       7        14.71%            14%
   2      31         8.82%            14%
   3      25        11.76%            10%
   4       1        14.71%            18%
   5      62        11.76%             4%
 mean               12.35%            12%

• Note: the predicted average test error (12%) is usually higher than the minimized validation error (11%) used for model selection

46

2.4 Application Example

• Why financial applications?
  - "the market is always right" ~ loss function
  - lots of historical data
  - modeling results are easy to understand
• Background on mutual funds
• Problem specification + experimental setup
• Modeling results
• Discussion

47

OUTLINE

2.0 Objectives

2.1 Terminology and Basic Learning Problems

2.2 Basic Learning Approaches

2.3 Generalization and Complexity Control

2.4 Application Example

2.5 Summary

48

2.4.1 Background: pricing mutual funds

• Mutual fund trivia and recent scandals
• Mutual fund pricing:
  - priced once a day (after market close)
  - the NAV (net asset value) is unknown when an order is placed
• How can the NAV be estimated accurately?
  Approach 1: estimate the holdings of the fund (~200-400 stocks), then compute the NAV
  Approach 2: estimate the NAV via correlations between the NAV and major market indices (learning)

49

2.4.2 Problem specs and experimental setup

• Domestic fund: Fidelity OTC (FOCPX)
• Possible inputs: SP500, DJIA, NASDAQ, ENERGY SPDR
• Data encoding:
  Output ~ % daily change of the fund's NAV
  Inputs ~ % daily changes of the market indices
• Modeling period: 2003
• Issues: choice of modeling method? selection of input variables? experimental setup?

50

Experimental Design and Modeling Setup

Possible variable selection (all variables represent % daily price changes):

 Mutual fund (Y)   X1      X2      X3
 FOCPX             ^IXIC   --      --
 FOCPX             ^GSPC   ^IXIC   --
 FOCPX             ^GSPC   ^IXIC   XLE

• Modeling method: linear regression
• Data obtained from Yahoo Finance
• Time period for modeling: 2003

51

Specification of Training and Test Data

Year 2003 is divided into two-month periods: (1, 2) (3, 4) (5, 6) (7, 8) (9, 10) (11, 12).
Each model is estimated on a two-month training period and evaluated on the following two-month test period.

Two-month training/test set-up: a total of 6 regression models for 2003.
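The rolling two-month set-up can be sketched roughly as follows, assuming NumPy; the daily %-change arrays, the month labels, and the exact train/test pairing are hypothetical stand-ins (the slides state only that six regression models were built for 2003).

```python
# Rolling two-month training/test regression models on daily % changes.
import numpy as np

def two_month_model(X, y, months, train_months, test_months):
    """Fit a linear model on the training months, report test MSE on the test months."""
    tr = np.isin(months, train_months)
    te = np.isin(months, test_months)
    A = np.column_stack([np.ones(tr.sum()), X[tr]])
    w, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    pred = np.column_stack([np.ones(te.sum()), X[te]]) @ w
    return w, np.mean((pred - y[te]) ** 2)

# Hypothetical daily % changes for 2003 (stand-ins for ^GSPC, ^IXIC and FOCPX)
rng = np.random.default_rng(0)
n = 250
months = rng.integers(1, 13, size=n)          # calendar month of each trading day
X = rng.normal(0, 1.0, size=(n, 2))           # [gspc, ixic] daily % changes
y = -0.03 + 0.17 * X[:, 0] + 0.77 * X[:, 1] + rng.normal(0, 0.3, n)

# One rolling pass: train on months (1, 2), test on months (3, 4), and so on
for start in (1, 3, 5, 7, 9):
    w, mse = two_month_model(X, y, months, [start, start + 1], [start + 2, start + 3])
    print(f"train {start},{start+1} -> test {start+2},{start+3}: w0={w[0]:.3f}, MSE={mse:.3f}")
```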

52

Results for Fidelity OTC Fund (GSPC+IXIC)

 Coefficients               w0       w1 (^GSPC)   w2 (^IXIC)
 Average                   -0.027      0.173        0.771
 Standard deviation (SD)    0.043      0.150        0.165

Average model: Y = -0.027 + 0.173·^GSPC + 0.771·^IXIC
^IXIC is the main factor affecting FOCPX's daily price change.
Prediction error: MSE (GSPC+IXIC) = 5.95%
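A tiny sketch of how the reported average model would be applied: convert daily % changes of ^GSPC and ^IXIC into a predicted % change of the FOCPX NAV, then compound the predictions into a synthetic price series of the kind plotted on the next slide. The index values used here are made up.

```python
# Applying the average 2003 model: Y = -0.027 + 0.173*^GSPC + 0.771*^IXIC (in %).
import numpy as np

def predict_nav_change(gspc_pct, ixic_pct):
    """Predicted % daily change of FOCPX NAV from index % changes."""
    return -0.027 + 0.173 * gspc_pct + 0.771 * ixic_pct

gspc = np.array([0.5, -0.3, 1.2])          # hypothetical daily % changes
ixic = np.array([0.8, -0.1, 1.5])
pred = predict_nav_change(gspc, ixic)
synthetic_nav = 100.0 * np.cumprod(1 + pred / 100.0)   # synthetic series starting at 100
print(pred, synthetic_nav)
```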

53

Results for Fidelity OTC Fund (GSPC+IXIC)

Daily closing prices for 2003: NAV vs the synthetic model

[Figure: FOCPX vs Model(GSPC+IXIC); daily account value (roughly 80 to 140) plotted against date, 1-Jan-03 through 17-Dec-03]

54

Results for Fidelity OTC Fund (GSPC+IXIC+XLE)

 Coefficients               w0       w1 (^GSPC)   w2 (^IXIC)   w3 (XLE)
 Average                   -0.029      0.147        0.784        0.029
 Standard deviation (SD)    0.044      0.215        0.191        0.061

Average model: Y = -0.029 + 0.147·^GSPC + 0.784·^IXIC + 0.029·XLE
^IXIC is the main factor affecting FOCPX's daily price change.
Prediction error: MSE (GSPC+IXIC+XLE) = 6.14%

55

Results for Fidelity OTC Fund (GSPC+IXIC+XLE)

Daily closing prices for 2003: NAV vs the synthetic model

[Figure: FOCPX vs Model(GSPC+IXIC+XLE); daily account value (roughly 80 to 140) plotted against date, 1-Jan-03 through 17-Dec-03]

56

Effect of Variable Selection

Different linear regression models for FOCPX:
• Y = -0.035 + 0.897·^IXIC
• Y = -0.027 + 0.173·^GSPC + 0.771·^IXIC
• Y = -0.029 + 0.147·^GSPC + 0.784·^IXIC + 0.029·XLE
• Y = -0.026 + 0.226·^GSPC + 0.764·^IXIC + 0.032·XLE - 0.06·^DJI

These models have different prediction errors (MSE):
• MSE (IXIC) = 6.44%
• MSE (GSPC + IXIC) = 5.95%
• MSE (GSPC + IXIC + XLE) = 6.14%
• MSE (GSPC + IXIC + XLE + DJIA) = 6.43%

(1) Variable selection is a form of complexity control
(2) Good selection can be performed by domain experts
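A comparison of this kind can be sketched as follows, assuming NumPy: fit a linear model for each nested set of inputs and compare out-of-sample MSE. The synthetic daily %-change series and the simple half/half split are assumptions standing in for the actual 2003 data and two-month set-up.

```python
# Variable selection as complexity control: compare MSE across nested input subsets.
import numpy as np

def linreg_mse(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training set; return test-set MSE."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ w
    return np.mean((pred - y_test) ** 2)

# Hypothetical daily % changes standing in for the real 2003 series
rng = np.random.default_rng(3)
n = 250
ixic = rng.normal(0, 1.2, n)
gspc = 0.8 * ixic + rng.normal(0, 0.5, n)
xle = rng.normal(0, 1.0, n)
focpx = 0.9 * ixic + rng.normal(0, 0.3, n)        # fund tracks NASDAQ most closely

subsets = {"IXIC": [ixic],
           "GSPC+IXIC": [gspc, ixic],
           "GSPC+IXIC+XLE": [gspc, ixic, xle]}
half = n // 2                                      # simple train/test split
for name, cols in subsets.items():
    X = np.column_stack(cols)
    print(name, linreg_mse(X[:half], focpx[:half], X[half:], focpx[half:]))
```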

57

Discussion

• Many funds simply mimic major indices, so statistical NAV models can be used for ranking/evaluating mutual funds
• Statistical models can also be used for
  - hedging risk, and
  - overcoming restrictions on trading (market timing) of domestic funds
• Since about 70% of funds under-perform their benchmark indices, index funds are often the better choice

58

Summary

• Inductive learning ~ function estimation
• Goal of learning (empirical inference): to act/perform well, not system identification
• Important concepts:
  - training data, test data
  - loss function, prediction error (aka risk)
  - basic learning problems
  - basic learning methods
• Complexity control and resampling
• Estimating prediction error via resampling