04/10/2019
MALIS: Supervised Learning
Maria A. Zuluaga
Data Science Department
Basics & terminology
• Let $y \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^d$ be related by:
MALIS 2019 2
$$y = f(\mathbf{x})$$
Goal: to predict $y$ using $\mathbf{x}$, but we don't know the true relationship $f$ between $\mathbf{x}$ and $y$
• $y$ – output; also called target or label, or dependent variable
• $\mathbf{x}$ – input; also called feature vector, or independent variable
Basics: Data
• To discover the relationship between $\mathbf{x}$ and $y$ we have access to a set of $N$ inputs $\mathbf{x}_n$ and the corresponding set of outputs $y_n$
• The paired input–output set $\{(\mathbf{x}_n, y_n)\}$, $n = 1, \dots, N$, is denoted the training set.
Basics: Prediction
• The task of learning consists of finding a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ that can make a good prediction of $y$, denoted $\hat{y}$:
$$\hat{y} = f(\mathbf{x})$$
• We will make use of the data and of any prior knowledge we might have about $f$. Examples:
  • $y \geq 0$
  • Continuity and smoothness of the function
Final goal: $\hat{y} \approx y$ for unseen $(\mathbf{x}, y)$ pairs
An example: 100m at the Olympics
[Plot: men's 100m winning time (s) vs. Olympic year]
Year  Time
1896  12.00
1900  11.00
1904  11.00
1906  11.20
1908  10.80
1912  10.80
1920  10.80
1924  10.60
1928  10.80
1932  10.32
1936  10.30
1948  10.30
1952  10.40
1956  10.50
1960  10.20
1964  10.00
1968  9.95
1972  10.14
1976  10.06
1980  10.25
1984  9.99
1988  9.92
1992  9.96
1996  9.84
2000  9.87
2004  9.85
2008  9.69
Winning times of the men's 100m at the Olympics, 1896 to 2008.
Can we use this information to predict the times of London 2012 and Rio 2016?
Adapted from M. Filippone – Advanced Statistical Inference course
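For later use, the table above can be loaded directly as arrays. This is a small sketch (the variable names are my own, not from the course notebook):

```python
import numpy as np

# Winning times of the men's 100m at the Olympics, 1896-2008 (table above)
years = np.array([1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932,
                  1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                  1984, 1988, 1992, 1996, 2000, 2004, 2008], dtype=float)
times = np.array([12.0, 11.0, 11.0, 11.2, 10.8, 10.8, 10.8, 10.6, 10.8, 10.32,
                  10.3, 10.3, 10.4, 10.5, 10.2, 10.0, 9.95, 10.14, 10.06, 10.25,
                  9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])

print(years.shape, times.shape)  # (27,) (27,) -- one pair per Olympic games
```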
100m at the Olympics: Review of concepts
• $y = f(x)$
• $y$ – output: the winning time
• $x$ – input: the Olympic year
• $f$ – the unknown function relating $x$ and $y$
• $N$ – the number of training points
• $(x_n, y_n)$ – a training pair
• $\hat{y}$ – the prediction
100m at the Olympics: Assumptions
• Do we have any prior knowledge about $f$?
Supervised learning basics & the Olympics
• We have identified our data
• We assumed that there is an unknown function � that maps the Olympic year (�) to the men’s Olympic 100m winning time (�)
• We assumed that:
  • $f$ is a decreasing function: $f(x_n) \geq f(x_{n+1})$
  • $y > 0$
  • $x$ and $y$ have a linear relationship
Wrap up
All that is left is to find $f$
Linear models for regression
Linear regression
• A linear regressor (or predictor) is a linear function of $\mathbf{x}$. It has the form:
$$y = f(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i = f(\mathbf{x}; \mathbf{w})$$
where $\mathbf{w} = \{w_i\}$ are the parameters of the model
Source: M. Filippone – Advanced Statistical Inference course
• Olympic games example:
$$y = f(x) = w_0 + w_1 x = f(x; w_0, w_1)$$
Model fitting
• Finding a good estimate of $f(\mathbf{x})$ amounts to fitting the linear model to our training data in order to estimate the model's parameters
• How can we achieve this?
• Reminder:
$$\hat{y} = f(\mathbf{x}) = \hat{w}_0 + \sum_{i=1}^{d} \hat{w}_i x_i$$
$$\hat{y} \approx y$$
Loss functions
• A loss (or risk) function $\ell: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ quantifies how well (or how badly) $\hat{y}$ approximates $y$
• The lower the value of $\ell(\hat{y}, y)$, the better the approximation
• $\ell(y, \hat{y}) = 0$ when $\hat{y} = y$
• Typically (but not always) $\ell(\hat{y}, y) \geq 0$ for all $y, \hat{y}$
• Examples:
  • Quadratic loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$ (also known as squared error)
  • Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$
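The two example losses can be written as tiny functions (a sketch; the function names are my own choice):

```python
# Quadratic loss: squares the error, so large mistakes are penalized heavily
def quadratic_loss(y, y_hat):
    return (y - y_hat) ** 2

# Absolute loss: penalizes the error linearly, so it is less sensitive to outliers
def absolute_loss(y, y_hat):
    return abs(y - y_hat)

# Both are zero for a perfect prediction and grow with the error
print(quadratic_loss(10.0, 9.5))  # 0.25
print(absolute_loss(10.0, 9.5))   # 0.5
```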
Empirical risk | average loss | error
• How can we use the loss to find the best predictor according to our data?
• We compute the average loss over all the data points:
$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n)$$
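The average loss can be computed directly once a loss function is chosen (a sketch; the helper name and toy numbers are my own):

```python
import numpy as np

def average_loss(y, y_hat, loss):
    """Empirical risk: the loss averaged over all N training points."""
    return np.mean([loss(yn, yn_hat) for yn, yn_hat in zip(y, y_hat)])

# Toy targets and predictions, with the quadratic loss from the previous slide
y     = np.array([12.0, 11.0, 10.8])
y_hat = np.array([11.8, 11.1, 10.8])
print(average_loss(y, y_hat, lambda a, b: (a - b) ** 2))
```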
Optimization
• In principle, one could do trial and error to find the $\mathbf{w}$ that makes $\mathcal{L}$ minimal.
• Formalizing it through a mathematical expression:
$$\operatorname*{argmin}_{\mathbf{w}} \mathcal{L} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n)$$
Optimization: Least squares
• Method for estimating the unknown parameters � of a linear model.
• It minimizes the sum of the squares of the differences between the observed outputs (��) and those predicted by the model.
$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$$
$$\operatorname*{argmin}_{\mathbf{w}} \mathcal{L} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$$
Parenthesis – Calculus refresher: Finding maxima and minima
• Let's say we want to find the $z$ that minimizes (or maximizes) a function $f(z)$
• At a minimum or a maximum the gradient must be zero.
• The gradient is given by the first derivative of the function: $\frac{df}{dz}$
• Setting it to zero and solving for $z$ gives the candidate points.
Adapted from: M. Filippone – Advanced Statistical Inference course
Parenthesis – Calculus refresher: Finding maxima and minima (cont.)
• How do we know whether it is a minimum or a maximum?
• At a minimum the gradient must be increasing.
• Take the second derivative: $\frac{d^2 f}{dz^2} > 0$ at a minimum, $\frac{d^2 f}{dz^2} < 0$ at a maximum
Adapted from: M. Filippone – Advanced Statistical Inference course
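As a concrete instance of the recipe above (my own example, not from the slides):

```latex
f(z) = z^2 - 4z + 1, \qquad
\frac{df}{dz} = 2z - 4 = 0 \;\Rightarrow\; z = 2, \qquad
\frac{d^2 f}{dz^2} = 2 > 0,
```

so $z = 2$ is a minimum, with $f(2) = -3$.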
Back to optimization
• Let's use least squares to find the $\mathbf{w}$ in the Olympics problem
• Recall:
$$\hat{y} = w_0 + w_1 x = f(x; w_0, w_1)$$
(Note: for simplicity we will drop the explicit $f(\cdot)$ notation)
• We need to find the minimum of:
$$\operatorname*{argmin}_{w_0, w_1} \mathcal{L} = \operatorname*{argmin}_{w_0, w_1} \frac{1}{N} \sum_{n=1}^{N} \big(y_n - (w_0 + w_1 x_n)\big)^2$$
• Multi-variable calculus: find the best values for $w_0$ and $w_1$
• Exercise: estimate $\hat{w}_0$ and $\hat{w}_1$
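Carrying out the exercise (a standard result, consistent with the matrix solution derived later): setting $\partial\mathcal{L}/\partial w_0 = 0$ and $\partial\mathcal{L}/\partial w_1 = 0$ and solving gives

```latex
\hat{w}_1 = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x},
```

where $\bar{x}$ and $\bar{y}$ denote the sample means of the inputs and outputs.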
Parenthesis - Matrix notation
• $\mathbf{w}$, $\mathbf{x}$ and $\mathbf{y}$ can have large dimensions
• The current notation can be cumbersome to handle.
• Matrix notation:
$$\mathbf{w} = \begin{bmatrix} w_0 \\ \vdots \\ w_d \end{bmatrix} \qquad
\mathbf{X} = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1d} \\
1 & x_{21} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & \cdots & x_{Nd}
\end{bmatrix} \qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$$
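In code, building the design matrix $\mathbf{X}$ amounts to prepending a column of ones so that $w_0$ (the intercept) is handled by the same matrix product as the other weights (a sketch with made-up inputs):

```python
import numpy as np

# N x d raw inputs (here d = 1: the Olympic year)
x = np.array([[1896.0], [1900.0], [1904.0]])

# N x (d + 1) design matrix: a leading column of ones for the intercept w_0
X = np.hstack([np.ones((x.shape[0], 1)), x])
print(X)
```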
Matrix cheat sheet
Source: http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf
B – constant matrix
b – constant vector
Least squares revisited
• Using the matrix notation, the least squares formulation is:
$$\operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} (\mathbf{y} - \mathbf{X}\mathbf{w})^{\top} (\mathbf{y} - \mathbf{X}\mathbf{w})$$
• Taking the derivative and setting it to zero:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} \frac{1}{N} (\mathbf{y} - \mathbf{X}\mathbf{w})^{\top} (\mathbf{y} - \mathbf{X}\mathbf{w}) = 0
\;\Longrightarrow\; \hat{\mathbf{w}} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$
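The closed-form solution can be checked numerically. A minimal sketch (note: `np.linalg.solve` is used rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent here):

```python
import numpy as np

# Least squares via the normal equations: solve (X^T X) w = X^T y
def fit_least_squares(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny check on data that lies exactly on the line y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(fit_least_squares(X, y))  # approximately [1. 2.]
```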
Recap
• We introduced the basic terminology used in supervised learning
• We introduced linear regression models
• We introduced the concept of loss and how it is used to fit a model to the training data
• Using least squares, we found a general expression to obtain the unknown parameters of a linear regressor
$$\hat{\mathbf{w}} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$
Predictions
• Once we have estimated $\hat{\mathbf{w}}$, how do we use our model to predict new values?
$$\hat{\mathbf{y}}_{new} = \mathbf{X}_{new} \hat{\mathbf{w}}$$
where $\mathbf{X}_{new}$ is a set of "unseen" input data
• The matrix is constructed in the same way as for the training set.
• Question: what would $\mathbf{X}_{new}$ be for our Olympics problem?
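As a sketch with hypothetical numbers (not the Olympics fit): $\mathbf{X}_{new}$ is built exactly like the training design matrix, and prediction is a single matrix product.

```python
import numpy as np

w_hat = np.array([2.0, 0.5])                   # hypothetical fitted parameters
X_new = np.array([[1.0, 4.0], [1.0, 6.0]])     # two "unseen" inputs, ones column first
y_hat_new = X_new @ w_hat                      # y_hat_new = X_new w_hat
print(y_hat_new)  # [4. 5.]
```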
Back to our example
• It is all about making use of the expression we obtained
• 01_linear_models.ipynb
The fitted model
[Plot: real values and model predictions vs. $x$]
Model: $y = 35.8517 - 0.0131x$
The fitted model
Year Prediction Real time
2012 9.60 9.63
2016 9.55 9.81
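Putting the pieces together on the Olympics data reproduces the slide's predictions (a sketch; the exact fitted coefficients depend on floating-point details, but the predictions land close to the values in the table above):

```python
import numpy as np

# Training data: winning times of the men's 100m at the Olympics, 1896-2008
years = np.array([1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932,
                  1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                  1984, 1988, 1992, 1996, 2000, 2004, 2008], dtype=float)
times = np.array([12.0, 11.0, 11.0, 11.2, 10.8, 10.8, 10.8, 10.6, 10.8, 10.32,
                  10.3, 10.3, 10.4, 10.5, 10.2, 10.0, 9.95, 10.14, 10.06, 10.25,
                  9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])

# Design matrix with a leading column of ones, then the closed-form solution
X = np.column_stack([np.ones_like(years), years])
w_hat = np.linalg.solve(X.T @ X, X.T @ times)

# Predict the "unseen" years: London 2012 and Rio 2016
X_new = np.column_stack([np.ones(2), [2012.0, 2016.0]])
preds = X_new @ w_hat
print(preds)  # close to 9.60 and 9.55, as on the slide
```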
What can we say about our model?
• Is a straight line too simple? Should we try to fit a more complex model?
• Is it really always decreasing?
• We also said that it cannot be negative (so it cannot always decrease)
• Are we being too precise?
Models & assumptions
"All models are wrong, but some are useful" – G. Box
How useful our model is depends on what we are trying to answer.
Adding polynomial features
• One could consider a higher-order model by using polynomial features
• $n$th-order model:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$$
• This is still considered a linear model, as it remains linear in the weights associated with the features.
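Polynomial features fit into exactly the same least-squares machinery: each column of the design matrix is a power of the scalar input, and the model stays linear in the weights (a sketch; the helper name is my own):

```python
import numpy as np

# Each column of the design matrix is a power x^k, for k = 0, ..., order.
# The model is nonlinear in x but still linear in the weights w.
def poly_design_matrix(x, order):
    return np.column_stack([x ** k for k in range(order + 1)])

x = np.array([1.0, 2.0, 3.0])
Phi = poly_design_matrix(x, 2)   # columns: 1, x, x^2
print(Phi)
```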
01_linear_models.ipynb
Recap
• We introduced the basic terminology used in supervised learning
• We introduced linear regression models
• We introduced the concept of loss and how it is used to fit a model to the training data
• Using least squares, we found a general expression to obtain the unknown parameters of a linear regressor
• We have reviewed the use of polynomial features
• We have introduced the problem of model selection
A final note on ML
• Statistical learning (or modeling) arose from statistics.
• It emphasizes models and their interpretability, as well as precision and uncertainty.
• Machine learning has a greater emphasis on large scale applications and prediction accuracy.
However, there is much overlap between them:
• Both deal with supervised and unsupervised problems
• The distinction between them is becoming more and more blurred, and they "cross over" a lot.
Further reading: Bzdok et al. Statistics vs Machine Learning. Nature methods (2018)
Further reading and useful material
Source Chapters
The Elements of Statistical Learning Ch. 2 and 3
Pattern Recognition and Machine Learning Sec 1.5.5, Ch. 3
Nature Methods Statistics vs Machine Learning. Bzdok et al
The Matrix Cookbook All
Introduction to Applied Linear Algebra Part III – Least Squares
Warning: Notation might vary among the different sources