04/10/2019
MALIS: Supervised Learning
Maria A. Zuluaga
Data Science Department
Basics & terminology
• Let $y \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^d$ be related by:
MALIS 2019 2
$$y = f(\mathbf{x})$$
Goal: to predict $y$ using $\mathbf{x}$, but we don't know the true relationship $f$ between $\mathbf{x}$ and $y$
• $y$ – output; also called target or label, or dependent variable
• $\mathbf{x}$ – input; also called feature vector, or independent variable
Basics: Data
• To discover the relationship between $\mathbf{x}$ and $y$ we have access to a set of $N$ inputs $\mathbf{x}_n$ and the corresponding set of outputs $y_n$
• The paired input–output set $\{(\mathbf{x}_n, y_n)\}$, $n = 1, \dots, N$, is denoted the training set.
Basics: Prediction
• The task of learning consists of finding a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ that can make a good prediction of $y$, denoted $\hat{y}$:
$$\hat{y} = f(\mathbf{x})$$
• We will make use of the data and of any prior knowledge we might have about $f$. Examples:
  • $y \geq 0$
  • Continuity and smoothness of the function
Final goal: $\hat{y} \approx y$ for unseen $(\mathbf{x}, y)$ pairs
An example: 100m at the Olympics
[Plot: men's 100m winning time (s) vs. Olympic year]
Year  Time
1896  12.00
1900  11.00
1904  11.00
1906  11.20
1908  10.80
1912  10.80
1920  10.80
1924  10.60
1928  10.80
1932  10.32
1936  10.30
1948  10.30
1952  10.40
1956  10.50
1960  10.20
1964  10.00
1968  9.95
1972  10.14
1976  10.06
1980  10.25
1984  9.99
1988  9.92
1992  9.96
1996  9.84
2000  9.87
2004  9.85
2008  9.69
Winning times of the men's 100m at the Olympics, 1896 to 2008.
Can we use this information to predict the times of London 2012 and Rio 2016?
Adapted from M. Filippone – Advanced Statistical Inference course
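For later use, the table above can be loaded directly as arrays. This is a small sketch (the variable names are my own, not from the course notebook):

```python
import numpy as np

# Winning times of the men's 100m at the Olympics, 1896-2008 (table above)
years = np.array([1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932,
                  1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                  1984, 1988, 1992, 1996, 2000, 2004, 2008], dtype=float)
times = np.array([12.0, 11.0, 11.0, 11.2, 10.8, 10.8, 10.8, 10.6, 10.8, 10.32,
                  10.3, 10.3, 10.4, 10.5, 10.2, 10.0, 9.95, 10.14, 10.06, 10.25,
                  9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])

print(years.shape, times.shape)  # (27,) (27,) -- one pair per Olympic games
```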
100m at the Olympics: Review of concepts
• $y = f(x)$
• $y$ – output: the winning time
• $x$ – input: the Olympic year
• $f$ – the unknown function relating $x$ and $y$
• $N$ – the number of training points
• $(x_n, y_n)$ – a training pair
• $\hat{y}$ – the prediction
100m at the Olympics: Assumptions
• Do we have any prior knowledge about $f$?
Supervised learning basics & the Olympics
• We have identified our data
• We assumed that there is an unknown function � that maps the Olympic year (�) to the men’s Olympic 100m winning time (�)
• We assumed that:
  • $f$ is a decreasing function: $f(x_n) \geq f(x_{n+1})$
  • $y > 0$
  • $x$ and $y$ have a linear relationship
Wrap up
All that is left is to find $f$
Linear models for regression
Linear regression
• A linear regressor (or predictor) is a linear function of $\mathbf{x}$. It has the form:
$$y = f(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i = f(\mathbf{x}; \mathbf{w})$$
where $\mathbf{w} = \{w_i\}$ are the parameters of the model
Source: M. Filippone – Advanced Statistical Inference course
• Olympic games example:
$$y = f(x) = w_0 + w_1 x = f(x; w_0, w_1)$$
Model fitting
• Finding a good estimate of $f(\mathbf{x})$ amounts to fitting the linear model to our training data in order to estimate the model's parameters
• How can we achieve this?
• Reminder:
$$\hat{y} = f(\mathbf{x}) = \hat{w}_0 + \sum_{i=1}^{d} \hat{w}_i x_i$$
$$\hat{y} \approx y$$
Loss functions
• A loss (or risk) function $\ell: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ quantifies how well (or how badly) $\hat{y}$ approximates $y$
• The lower the value of $\ell(\hat{y}, y)$, the better the approximation
• $\ell(y, \hat{y}) = 0$ when $\hat{y} = y$
• Typically (but not always) $\ell(\hat{y}, y) \geq 0$ for all $y, \hat{y}$
• Examples:
  • Quadratic loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$ (also known as squared error)
  • Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$
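The two example losses can be written as tiny functions (a sketch; the function names are my own choice):

```python
# Quadratic loss: squares the error, so large mistakes are penalized heavily
def quadratic_loss(y, y_hat):
    return (y - y_hat) ** 2

# Absolute loss: penalizes the error linearly, so it is less sensitive to outliers
def absolute_loss(y, y_hat):
    return abs(y - y_hat)

# Both are zero for a perfect prediction and grow with the error
print(quadratic_loss(10.0, 9.5))  # 0.25
print(absolute_loss(10.0, 9.5))   # 0.5
```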
Empirical risk | average loss | error
• How can we use the loss to find the best predictor according to our data?
• We compute the average loss over all the data points:
$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n)$$
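The average loss can be computed directly once a loss function is chosen (a sketch; the helper name and toy numbers are my own):

```python
import numpy as np

def average_loss(y, y_hat, loss):
    """Empirical risk: the loss averaged over all N training points."""
    return np.mean([loss(yn, yn_hat) for yn, yn_hat in zip(y, y_hat)])

# Toy targets and predictions, with the quadratic loss from the previous slide
y     = np.array([12.0, 11.0, 10.8])
y_hat = np.array([11.8, 11.1, 10.8])
print(average_loss(y, y_hat, lambda a, b: (a - b) ** 2))
```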
Optimization
• In principle, one could do trial and error to find the $\mathbf{w}$ that makes $\mathcal{L}$ minimal.
• Formalizing it through a mathematical expression:
$$\operatorname*{argmin}_{\mathbf{w}} \mathcal{L} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n)$$
Optimization: Least squares
• Method for estimating the unknown parameters � of a linear model.
• It minimizes the sum of the squares of the differences between the observed outputs (��) and those predicted by the model.
$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$$
$$\operatorname*{argmin}_{\mathbf{w}} \mathcal{L} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$$
Parenthesis – Calculus refresher: Finding maxima and minima
• Let's say we want to find the $z$ that minimizes (or maximizes) a function $f(z)$
• At a minimum or a maximum the gradient must be zero.
• The gradient is given by the first derivative of the function: $\frac{df}{dz}$
• Setting it to zero and solving for $z$ gives the candidate points.
Adapted from: M. Filippone – Advanced Statistical Inference course
Parenthesis – Calculus refresher: Finding maxima and minima (cont.)
• How do we know whether it is a minimum or a maximum?
• At a minimum the gradient must be increasing.
• Take the second derivative: $\frac{d^2 f}{dz^2} > 0$ at a minimum, $\frac{d^2 f}{dz^2} < 0$ at a maximum
Adapted from: M. Filippone – Advanced Statistical Inference course
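As a concrete instance of the recipe above (my own example, not from the slides):

```latex
f(z) = z^2 - 4z + 1, \qquad
\frac{df}{dz} = 2z - 4 = 0 \;\Rightarrow\; z = 2, \qquad
\frac{d^2 f}{dz^2} = 2 > 0,
```

so $z = 2$ is a minimum, with $f(2) = -3$.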
Back to optimization
• Let's use least squares to find the $\mathbf{w}$ in the Olympics problem
• Recall:
$$\hat{y} = w_0 + w_1 x = f(x; w_0, w_1)$$
(Note: for simplicity we will drop the explicit $f(\cdot)$ notation)
• We need to find the minimum of:
$$\operatorname*{argmin}_{w_0, w_1} \mathcal{L} = \operatorname*{argmin}_{w_0, w_1} \frac{1}{N} \sum_{n=1}^{N} \big(y_n - (w_0 + w_1 x_n)\big)^2$$
• Multi-variable calculus: find the best values for $w_0$ and $w_1$
• Exercise: estimate $\hat{w}_0$ and $\hat{w}_1$
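Carrying out the exercise (a standard result, consistent with the matrix solution derived later): setting $\partial\mathcal{L}/\partial w_0 = 0$ and $\partial\mathcal{L}/\partial w_1 = 0$ and solving gives

```latex
\hat{w}_1 = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x},
```

where $\bar{x}$ and $\bar{y}$ denote the sample means of the inputs and outputs.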
Parenthesis - Matrix notation
• $\mathbf{w}$, $\mathbf{x}$ and $\mathbf{y}$ can have large dimensions
• The current notation can be cumbersome to handle.
• Matrix notation:
$$\mathbf{w} = \begin{bmatrix} w_0 \\ \vdots \\ w_d \end{bmatrix} \qquad
\mathbf{X} = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1d} \\
1 & x_{21} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & \cdots & x_{Nd}
\end{bmatrix} \qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$$
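In code, building the design matrix $\mathbf{X}$ amounts to prepending a column of ones so that $w_0$ (the intercept) is handled by the same matrix product as the other weights (a sketch with made-up inputs):

```python
import numpy as np

# N x d raw inputs (here d = 1: the Olympic year)
x = np.array([[1896.0], [1900.0], [1904.0]])

# N x (d + 1) design matrix: a leading column of ones for the intercept w_0
X = np.hstack([np.ones((x.shape[0], 1)), x])
print(X)
```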
Matrix cheat sheet
Source: http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf
B – constant matrix
b – constant vector
Least squares revisited
• Using the matrix notation, the least squares formulation is:
$$\operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} (\mathbf{y} - \mathbf{X}\mathbf{w})^{\top} (\mathbf{y} - \mathbf{X}\mathbf{w})$$
• Taking the derivative and setting it to zero:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} \frac{1}{N} (\mathbf{y} - \mathbf{X}\mathbf{w})^{\top} (\mathbf{y} - \mathbf{X}\mathbf{w}) = 0
\;\Longrightarrow\; \hat{\mathbf{w}} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$
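The closed-form solution can be checked numerically. A minimal sketch (note: `np.linalg.solve` is used rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent here):

```python
import numpy as np

# Least squares via the normal equations: solve (X^T X) w = X^T y
def fit_least_squares(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny check on data that lies exactly on the line y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(fit_least_squares(X, y))  # approximately [1. 2.]
```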
Recap
• We introduced the basic terminology used in supervised learning
• We introduced linear regression models
• We introduced the concept of loss and how it is used to fit a model to the training data
• Using least squares, we found a general expression to obtain the unknown parameters of a linear regressor
$$\hat{\mathbf{w}} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$
Predictions
• Once we have estimated $\hat{\mathbf{w}}$, how do we use our model to predict new values?
$$\hat{\mathbf{y}}_{new} = \mathbf{X}_{new} \hat{\mathbf{w}}$$
where $\mathbf{X}_{new}$ is a set of "unseen" input data
• The matrix is constructed in the same way as for the training set.
• Question: what would $\mathbf{X}_{new}$ be for our Olympics problem?
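As a sketch with hypothetical numbers (not the Olympics fit): $\mathbf{X}_{new}$ is built exactly like the training design matrix, and prediction is a single matrix product.

```python
import numpy as np

w_hat = np.array([2.0, 0.5])                   # hypothetical fitted parameters
X_new = np.array([[1.0, 4.0], [1.0, 6.0]])     # two "unseen" inputs, ones column first
y_hat_new = X_new @ w_hat                      # y_hat_new = X_new w_hat
print(y_hat_new)  # [4. 5.]
```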
Back to our example
• It is all about making use of the expression we obtained
• 01_linear_models.ipynb
The fitted model
[Plot: real values and model predictions vs. $x$]
Model: $y = 35.8517 - 0.0131x$
The fitted model
Year Prediction Real time
2012 9.60 9.63
2016 9.55 9.81
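Putting the pieces together on the Olympics data reproduces the slide's predictions (a sketch; the exact fitted coefficients depend on floating-point details, but the predictions land close to the values in the table above):

```python
import numpy as np

# Training data: winning times of the men's 100m at the Olympics, 1896-2008
years = np.array([1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932,
                  1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                  1984, 1988, 1992, 1996, 2000, 2004, 2008], dtype=float)
times = np.array([12.0, 11.0, 11.0, 11.2, 10.8, 10.8, 10.8, 10.6, 10.8, 10.32,
                  10.3, 10.3, 10.4, 10.5, 10.2, 10.0, 9.95, 10.14, 10.06, 10.25,
                  9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])

# Design matrix with a leading column of ones, then the closed-form solution
X = np.column_stack([np.ones_like(years), years])
w_hat = np.linalg.solve(X.T @ X, X.T @ times)

# Predict the "unseen" years: London 2012 and Rio 2016
X_new = np.column_stack([np.ones(2), [2012.0, 2016.0]])
preds = X_new @ w_hat
print(preds)  # close to 9.60 and 9.55, as on the slide
```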
What can we say about our model?
• Is a straight line too simple? Should we try to fit a more complex model?
• Is it really always decreasing?
• We also said that it cannot be negative (so it cannot always decrease)
• Are we being too precise?
Models & assumptions
"All models are wrong, but some are useful" – G. Box
How useful our model is depends on what we are trying to answer.
Adding polynomial features
• One could consider a higher-order model by using polynomial features
• $n$th-order model:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$$
• This is still considered a linear model, as it remains linear in the weights associated with the features.
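Polynomial features fit into exactly the same least-squares machinery: each column of the design matrix is a power of the scalar input, and the model stays linear in the weights (a sketch; the helper name is my own):

```python
import numpy as np

# Each column of the design matrix is a power x^k, for k = 0, ..., order.
# The model is nonlinear in x but still linear in the weights w.
def poly_design_matrix(x, order):
    return np.column_stack([x ** k for k in range(order + 1)])

x = np.array([1.0, 2.0, 3.0])
Phi = poly_design_matrix(x, 2)   # columns: 1, x, x^2
print(Phi)
```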
01_linear_models.ipynb
Recap
• We introduced the basic terminology used in supervised learning
• We introduced linear regression models
• We introduced the concept of loss and how it is used to fit a model to the training data
• Using least squares, we found a general expression to obtain the unknown parameters of a linear regressor
• We have reviewed the use of polynomial features
• We have introduced the problem of model selection
A final note on ML
• Statistical learning (or modeling) arose from statistics.
• It emphasizes models and their interpretability, as well as precision and uncertainty.
• Machine learning has a greater emphasis on large scale applications and prediction accuracy.
However, there is much overlap between them:
• Both deal with supervised and unsupervised problems
• The distinction between them is becoming more and more blurred, and they "cross over" a lot.
Further reading: Bzdok et al. Statistics vs Machine Learning. Nature methods (2018)
Further reading and useful material
Source Chapters
The Elements of Statistical Learning Ch. 2 and 3
Pattern Recognition and Machine Learning Sec 1.5.5, Ch. 3
Nature Methods Statistics vs Machine Learning. Bzdok et al
The Matrix Cookbook All
Introduction to Applied Linear Algebra Part III – Least Squares
Warning: Notation might vary among the different sources