Regularization Methods for Linear Regression
Mathilde Mougeot
ENSIIE
2017-2018
Agenda
Lessons
• 3 plenary lessons (MRR)
• 3 Practical work sessions using R (mrr)
• 10 Project sessions (ipR)
Before next session, install on your computer
1 R software, https://www.r-project.org/
2 Rstudio, https://www.rstudio.com/
Documents are available (at this stage)
• https://sites.google.com/site/MougeotMathilde/teaching
A word on data and predictive models

Data are everywhere:
• Industry (temperature, IR sensors...)
• Finance: transactions
• Marketing: consumer data
• On your phone (GPS, mail, music...)
→ Databases are available everywhere: from small data sets to Big Data
A word on data and predictive models

Nowadays, predictive models are crucial for monitoring and diagnosis:
• Industry: health monitoring, energy...
• Finance: forecast of the evolution of the market
• Marketing: scoring
• Health
→ Machine learning models are used to mine and operate on the data.
Regularization Methods for Linear Regression

- Linear regression and regularized linear regression belong to the predictive model family.
- Linear regression is an old model, but still very useful! (Gauss, 1795; Legendre, 1805)

Outline of the lesson:
• Motivations
• Ordinary Least Squares (OLS) (geometrical approach)
• The linear model (probabilistic approach)
• Using R software for modeling
Evolution of the average age of the French population

[Figure: average age of the French population over time.]

Modeling the average age of the French population

[Figure.]
Application: Social Networks

Facebook users (millions):

date       users (millions)
31/12/04     0
31/12/05     5
31/12/06    10
31/3/07     20
30/12/07    60
30/6/08    100
30/12/08   150
30/3/09    170
30/6/09    200
30/9/09    250
30/12/09   300
30/3/10    400
30/6/10    500
30/6/11    750
Evolution of the number of Facebook users

[Figure: number of Facebook users over time.]

Motivations: investment, forecast
Modeling the evolution of the number of Facebook users

[Figure.]
Introduction: Regression model

• $(Y, X)$: couple of variables
  - $Y$: target, quantitative variable
  - $X = (X_1, X_2, \dots, X_p)$: covariates, quantitative variables

→ The goal is to propose a regression model to explain $Y$ given $X$. The parameters of the model are computed using a set of data:

$$Y = F_{data}(X) = F_{data}(X_1, \dots, X_p)$$

In this case, $F$ is a linear function.

• Questions:
  - What are the performances of this model?
  - What are the main explanatory variables?
  - Is it possible to use the model to predict new values? To forecast?
  - Can we improve the model?
Boston Housing Data

The original data are n = 506 observations on p = 14 variables, medv (median value) being the target variable:

crim     per capita crime rate by town
zn       proportion of residential land zoned for lots over 25,000 sq.ft
indus    proportion of non-retail business acres per town
chas     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox      nitric oxides concentration (parts per 10 million)
rm       average number of rooms per dwelling
age      proportion of owner-occupied units built prior to 1940
dis      weighted distances to five Boston employment centres
rad      index of accessibility to radial highways
tax      full-value property-tax rate per USD 10,000
ptratio  pupil-teacher ratio by town
b        1000(B − 0.63)² where B is the proportion of blacks by town
lstat    percentage of lower status of the population
medv     median value of owner-occupied homes in USD 1000's
Boston Housing Data

The data:

nr  crim   zn  indus chas  nox   rm    age   dis   rad  tax  ptratio  b      lstat  medv
1   0.006  18  2.3   0     0.53  6.57  65.2  4.09  1    296  15.3     396.9  4.9    24.0
2   0.027  0   7.0   0     0.46  6.42  78.9  4.96  2    242  17.8     396.9  9.1    21.6
3   0.027  0   7.0   0     0.46  7.18  61.1  4.96  2    242  17.8     392.8  4.0    34.7
4   0.032  0   2.1   0     0.45  6.99  45.8  6.06  3    222  18.7     394.6  2.9    33.4
5   0.069  0   2.1   0     0.45  7.14  54.2  6.06  3    222  18.7     396.9  5.3    36.2
... ...    ... ...   ...   ...   ...   ...   ...   ...  ...  ...      ...    ...    ...
Boston Housing Data

Different points of view may exist for studying these data, with the model $Y = F_{data}(X)$:

• Mining approach
  − Evaluate the performances of the model
  − What are the most important variables? (variable selection)
  → sparse models: less complex, best performances

• Predictive approach
  − Inference and simulation
  → point estimation for new values of the covariates
  → confidence interval computation
Outline

• Applications
• Ordinary Least Squares (OLS) / Moindres Carrés Ordinaires (MCO)
• Linear Model
• Regularization methods: ridge, lasso
Ordinary Least Squares (OLS)
Ordinary Least Squares (OLS)

• Values/variables:
  - $Y \in \mathbb{R}$: value / target variable
  - $X = (X^1, \dots, X^p) \in \mathbb{R}^p$: values / covariates
• Data: $S = \{(x_i, y_i),\ i = 1, \dots, n,\ y_i \in \mathbb{R},\ x_i \in \mathbb{R}^p\}$
• Goal: model $Y$ linearly with $X$, with a "small" $\varepsilon$ term
• $Y = \beta_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
• If $X_1 = 1$, we can write $Y = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
• $Y = \sum_{j=1}^p \beta_j X_j + \varepsilon$ ($p$ parameters $\beta_j$, $p$ variables, "one constant" inside)
Ordinary Least Squares (OLS): Simple Linear Regression Model
Simple linear regression: example

We only have one covariate (X) to explain the target variable (Y). The scatter plot:

[Scatter plot of y (about 12 to 24) against x (about 5 to 10).]
Simple linear regression: example

For all observation couples $(Y_i, X_i)$, $1 \le i \le n$, the goal here is to minimize $\sum_{i=1}^n (Y_i - (\beta_1 + \beta_2 X_i))^2$.

[The same scatter plot with the fitted regression line.]
Ordinary Least Squares: simple linear model

Formalism:
• Distance to a single point: $(y_i - \beta_1 - \beta_2 x_i)^2$
• Distance to the whole sample: $\sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2$
→ Best line: the intercept $\beta_1$ and slope $\beta_2$ such that $\sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2$ is minimum, among all possible values of $\beta_1$ and $\beta_2$.

OLS estimator: the values estimated by OLS (the estimates) for $\beta_1$ and $\beta_2$ satisfy
$$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \underset{(\beta_1,\beta_2) \in \mathbb{R}^2}{\operatorname{argmin}} \left\{ \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2 \right\}$$
Ordinary Least Squares: simple linear model

OLS estimator (observations):
$$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \underset{(\beta_1,\beta_2) \in \mathbb{R}^2}{\operatorname{argmin}} \left\{ \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2 \right\}$$

OLS estimator (vector notations $y$, $x$, $1_n$):
$$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \underset{(\beta_1,\beta_2) \in \mathbb{R}^2}{\operatorname{argmin}}\ \|y - \beta_2 x - \beta_1 1_n\|_2^2$$

OLS estimator (matrix notations, $Y$ of size $(n,1)$, $X$ of size $(n,2)$):
$$\hat\beta^{ols} = \underset{\beta \in \mathbb{R}^2}{\operatorname{argmin}}\ \|Y - X\beta\|_2^2$$
Ordinary Least Squares: simple linear model

Theorem: the OLS estimators have the following expressions:
$$\hat\beta_1 = \bar y - \hat\beta_2 \bar x, \qquad \hat\beta_2 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$

Proof: by zeroing the derivative of the objective function, which is convex.
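As a numerical check of this theorem, here is a minimal R sketch on simulated data (all names and values are hypothetical, not from the course):

```r
set.seed(1)
n <- 100
x <- runif(n, 5, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 1)

# Closed-form OLS estimates: slope = Cov(x,y)/Var(x), intercept = ybar - slope*xbar
beta2 <- cov(x, y) / var(x)
beta1 <- mean(y) - beta2 * mean(x)
c(beta1, beta2)

# Same values from lm()
coef(lm(y ~ x))
```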
Ordinary Least Squares: simple linear model

For the simple linear model, the correlation coefficient may be very useful:
• $r(x, y)$: (linear) correlation coefficient
$$r(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)}\sqrt{\mathrm{var}(y)}}$$
• $|r(x, y)| = 1$ if and only if $Y = aX + b$, a linear relation between $Y$ and $X$

R-square, used in multiple regression:
• $R^2 = \mathrm{Var}(\hat Y)/\mathrm{Var}(Y)$
• $R^2 \in [0, 1]$
• Simple regression: $R^2 = r^2$
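A minimal R check, on simulated data with hypothetical names, that the R² reported by `lm` in a simple regression equals the squared correlation:

```r
set.seed(2)
x <- rnorm(200)
y <- 1 + 0.8 * x + rnorm(200)

r  <- cor(x, y)                      # correlation coefficient
R2 <- summary(lm(y ~ x))$r.squared   # R-square of the simple regression
c(r^2, R2)                           # identical up to rounding
```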
The value of a correlation coefficient may be non-informative

Illustration of two joint distributions, each with a correlation coefficient of about 0.999:
[Two scatter plots: left, n = 500, correlation 0.9998; right, n = 500, correlation 0.9990. The two point clouds have very different shapes.]
→ The scatter plots show two very different distributions. Always look at the data!
Always study the data carefully.
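A simulated sketch of this warning, in the spirit of the figure above (hypothetical data; the exact values differ from the slide): two samples with essentially the same correlation, about 0.999, but very different point clouds.

```r
set.seed(3)
n <- 500

# Cloud 1: a genuinely linear relation with small noise
x1 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = 0.04)

# Cloud 2: a tight cluster plus one far-away point that drives the correlation
x2 <- c(rnorm(n - 1, sd = 0.01), 50)
y2 <- c(rnorm(n - 1, sd = 0.01), 50)

c(cor(x1, y1), cor(x2, y2))          # both close to 0.999
par(mfrow = c(1, 2)); plot(x1, y1); plot(x2, y2)
```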
Ordinary Least Squares (OLS): Multiple Linear Regression Model
Ordinary Least Squares: multiple linear model

• We suppose $Y = \sum_{j=1}^p \beta_j X_j + \varepsilon$ and $S = \{(x_i, y_i),\ i = 1, \dots, n,\ y_i \in \mathbb{R},\ x_i \in \mathbb{R}^p\}$
• The quadratic error is defined by
$$E(\beta) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n \Big(y_i - \sum_j x_i^j \beta_j\Big)^2$$
with matrix notation:
$$E(\beta) = (Y - X\beta)^T (Y - X\beta)$$
• Goal: minimize the error $E(\beta)$ on the data set $S$, i.e. compute $\hat\beta \in \mathbb{R}^p$:
$$\hat\beta = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}}\ E(\beta)$$
Ordinary Least Squares: multiple regression model

• We aim to compute $\hat\beta$ that minimizes
$$E(\beta) = \|Y - X\beta\|_2^2 = (Y - X\beta)^T (Y - X\beta)$$
• Assumption: $X^T X$ invertible ($n \ge p$).

Theorem:
$$\hat\beta_{MCO} = (X^T X)^{-1} X^T Y$$
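A minimal sketch checking the theorem on simulated data (hypothetical design and names): solving the normal equations directly reproduces the `lm` coefficients.

```r
set.seed(4)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # first column = intercept
beta <- c(1, -2, 0.5)
Y <- drop(X %*% beta) + rnorm(n, sd = 0.3)

# beta_hat = (X'X)^{-1} X'Y via solve()
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)
cbind(beta_hat, coef(lm(Y ~ X[, -1])))  # identical columns
```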
MCO (OLS)

• Estimation: $\hat\beta = (X^T X)^{-1} X^T Y$
• Prediction: knowing $\hat\beta$ and given $X_1, \dots, X_p$, the prediction of the target can be computed: $\hat Y = \sum_j \hat\beta_j X_j$, i.e.
$$\hat Y = X\hat\beta = X (X^T X)^{-1} X^T Y$$
• $P$: projection matrix onto the hyperplane (hat matrix),
$$P = X (X^T X)^{-1} X^T, \qquad P^2 = P$$
• Residuals: $\hat\varepsilon = Y - \hat Y$
• Remark: no assumption on the law or distribution of $\varepsilon$
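A short sketch, on simulated data, verifying numerically that the hat matrix $P$ is idempotent and that $\hat Y = PY$:

```r
set.seed(5)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
Y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

P <- X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix
max(abs(P %*% P - P))                         # ~ 0: P is idempotent
max(abs(P %*% Y - fitted(lm(Y ~ X[, -1]))))   # ~ 0: Y_hat = P Y
```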
Ordinary Least Squares: geometrical interpretation

$$Y = \sum_{j=1}^p X_j \hat\beta_j + \hat\varepsilon$$

with $Y \in \mathbb{R}^n$; the fitted part $\hat Y = \sum_j X_j \hat\beta_j$ lies in the $p$-dimensional subspace spanned by the columns of $X$, and the residual $\hat\varepsilon$ in its $(n-p)$-dimensional orthogonal complement.
Ordinary Least Squares: properties

• Orthogonality:
  - $\hat Y \perp \hat\varepsilon$
  - $X_j \perp \hat\varepsilon$ for all $j \in [1, \dots, p]$: $\langle X^j, \hat\varepsilon \rangle = 0$
• Residual average:
  - $\sum_i \hat\varepsilon_i = 0$ if there is an intercept in the model ($X^1 = (1, 1, \dots, 1)$)
    → the average point belongs to the hyperplane
  - $\bar{\hat Y} = \bar Y$
• Analysis of variance (ANOVA, Pythagoras):
  - $\mathrm{var}(Y) = \mathrm{var}(\hat Y) + \mathrm{var}(\hat\varepsilon)$
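These properties can be checked numerically; a minimal sketch on simulated data:

```r
set.seed(6)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)                 # model with an intercept

e <- residuals(fit); yhat <- fitted(fit)
sum(e)                           # ~ 0: residuals sum to zero (intercept present)
sum(yhat * e)                    # ~ 0: fitted values orthogonal to residuals
c(var(y), var(yhat) + var(e))    # variance decomposition (Pythagoras)
```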
Multiple linear model: example with R

```r
> head(mydata, 3)
      y   x1   x2   x3
1 -2.20 0.38 0.98 0.46
2 -1.75 0.11 0.62 0.37
3 -0.24 0.80 0.59 0.87
...
> modlm = lm(y ~ x1 + x2 + x3, data = mydata)

Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)

Coefficients:
(Intercept)       x1       x2       x3
    0.02754  1.98163 -3.03612  0.01903
```
Multiple linear model: example with R

```r
> modlm = lm(y ~ x1 + x2 + x3, data = mydata)
> summary(modlm)

Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
  -0.29  -0.075 -0.0035   0.073   0.281

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)   0.02754    0.01503    1.833   0.0674 .
x1            1.98163    0.01577  125.652   <2e-16 ***
x2           -3.03612    0.01621 -187.286   <2e-16 ***
x3            0.01903    0.01576    1.208   0.2277
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1009 on 496 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
F-statistic: 1.707e+04 on 3 and 496 DF, p-value: < 2.2e-16
```
Ordinary Least Squares: quality of the adjustment

• $R^2$, R-square (French: coefficient de détermination)
• $R^2 = \mathrm{var}(\hat Y)/\mathrm{var}(Y)$; remark: no unit
• $\cos^2 \omega = R^2 = \dfrac{\|\hat Y - \bar Y 1_n\|^2}{\|Y - \bar Y 1_n\|^2}$, where $\omega$ is the angle between the centered vector $(Y - \bar Y 1_n)$ and its centered prediction $(\hat Y - \bar{\hat Y} 1_n)$
• $\mathrm{var}(\hat\varepsilon) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2 = (1 - R^2)\,\mathrm{var}(Y)$, in units of $Y^2$
Ordinary Least Squares: quality of the adjustment

• R-square (for few variables):
  - $R^2 = \cos^2 \omega = \mathrm{Var}(\hat Y)/\mathrm{Var}(Y)$
  - $R^2 \in [0, 1]$
  - $R^2$ increases mechanically with the number of variables
• The adjusted R-squared is sometimes preferred (penalization by the number of variables):
  - $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p}$
  - $R^2_{adj}$ may be negative
• Residual study:
  - $\hat\varepsilon_i = y_i - \hat y_i$, $\forall i \in 1..n$
  - visualization of $(\hat\varepsilon_i, i)$ and of $(\hat\varepsilon_i, \hat y_i)$: homoscedastic vs heteroscedastic behavior
• Prediction: visualization of $(\hat y_i, y_i)$, $\forall i \in 1..n$, and comparison with the first bisector (the $y = x$ line).
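A minimal sketch computing $R^2$ and the adjusted $R^2$ by hand and comparing with `summary.lm` (simulated data; here $p$ counts the intercept, as in the slide's formula):

```r
set.seed(7)
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + rnorm(n)
fit <- lm(y ~ x1 + x2); p <- 3   # intercept + 2 covariates

R2    <- var(fitted(fit)) / var(y)
R2adj <- 1 - (1 - R2) * (n - 1) / (n - p)
c(R2, summary(fit)$r.squared)          # match
c(R2adj, summary(fit)$adj.r.squared)   # match
```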
Ordinary Least Squares: quality of the adjustment

Graphics of $(\hat y_i, y_i)$, $1 \le i \le n$: VERY USEFUL.
[Four scatter plots of y against modlm$fit (observed vs. predicted values).]
Ordinary Least Squares: Studentized residual graph

$$\frac{\hat\varepsilon_i}{SE} = \frac{y_i - \hat y_i}{SE} \quad \text{(a unit-free term)}$$

[Residual graph: studentized residuals against $y$ (12 to 24), scattered between −4 and 4.]

→ Random distribution: there is no information left to capture.
Ordinary Least Squares: Studentized residual graph

$$\frac{\hat\varepsilon_i}{SE} = \frac{y_i - \hat y_i}{SE} \quad \text{(unit-free)}$$

[Residual graph against the observation index (0 to 100); a few flagged points take large values.]

→ Large values for some points? Outlier detection?
Ordinary Least Squares: Studentized residual graph

$$\frac{\hat\varepsilon_i}{SE} = \frac{y_i - \hat y_i}{SE} \quad \text{(unit-free)}$$

[Residual graph as a function of $\hat Y$ (50 to 200): the residuals show a clear structure.]

→ There is still some information in the residuals.
→ The model needs to be changed.
Ordinary Least Squares: curse of dimensionality

Data set: $\{(y_i, x_i),\ 1 \le i \le n\}$; one target variable, one covariate.

[Scatter plot of $y$ against $x_1$; $Y \sim \mathcal{N}(0,1)$, $X \sim \mathcal{N}(0,1)$, $n = 50$, $p = 1+1$.]

→ $R^2 = 0.07$, $R^2_{adj} = -0.02$
Ordinary Least Squares: illustration of the impact of the number of covariates on the model

Initial data and OLS line:

[Scatter plot of mco$fit against $y$; $Y \sim \mathcal{N}(0,1)$, $X \sim \mathcal{N}(0,1)$, $n = 50$, $p = 1+1$.]

→ $R^2 = 0.07$, $R^2_{adj} = -0.02$
Ordinary Least Squares: illustration of the impact of the number of covariates on the model

Initial data; 48 more covariates drawn from $\mathcal{N}(0,1)$ are added to the initial data set.

[Scatter plot of mco$fit against $y$; $n = 50$, $p = 1+48$: the fit now looks almost perfect.]

→ $R^2 = 0.99$, $R^2_{adj} = 0.93$
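A simulated sketch of this inflation effect, in the spirit of the slides (pure-noise target and covariates; the exact values will differ from the figures):

```r
set.seed(8)
n <- 50
y <- rnorm(n)                      # pure-noise target
x <- rnorm(n)                      # one useless covariate
summary(lm(y ~ x))$r.squared       # small

X <- matrix(rnorm(n * 48), n, 48)  # 48 pure-noise covariates, p = 1 + 48
fit <- lm(y ~ X)
c(summary(fit)$r.squared,          # mechanically close to 1
  summary(fit)$adj.r.squared)      # penalized, much less flattering
```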
Ordinary Least Squares: need to change the model (1/2)

Initial data set: [figure]

Ordinary Least Squares: need to change the model (2/2)

Logarithmic transformation: [figure]
MCO (OLS) regression, some limits:

If $X^T X$ is not invertible:
• $n \gg p$, collinearity between some $X_j$:
  - pseudo-inverse; the solution is not unique
  - variable selection
• $p \gg n$, when the number of variables is larger than the number of observations:
  - regularization methods
  - ridge (L2), lasso (L1)
  - variable selection

OLS model: point estimation.
OLS when $X^T X$ is non-invertible → pseudo-inverse computation

Solution ($n > p$, $X^T X$ non-invertible with rank $k$, $k < p$):

$$X^T X = U \Sigma^2 U^T = U \begin{pmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \ddots & 0 & 0 \\ 0 & 0 & \sigma_k^2 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} U^T = U_k \Sigma_k^2 U_k^T$$

$$(X^T X)^{*-1} = U_k \Sigma_k^{-2} U_k^T \quad \text{with} \quad \Sigma_k^2 = \begin{pmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_k^2 \end{pmatrix}$$

$$\hat\beta = (X^T X)^{*-1} X^T Y$$

The solution is not unique.
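A minimal sketch of this computation in R, on a deliberately rank-deficient design (one column duplicated; the rank threshold and all names are assumptions): the truncated-SVD pseudo-inverse yields one particular least-squares solution.

```r
set.seed(9)
n <- 30
x1 <- rnorm(n); x2 <- x1                 # exact collinearity: rank-deficient design
X <- cbind(1, x1, x2)
Y <- 1 + 2 * x1 + rnorm(n, sd = 0.1)

s <- svd(t(X) %*% X)
k <- sum(s$d > 1e-8)                     # numerical rank k < p
Uk <- s$u[, 1:k, drop = FALSE]
XtX_pinv <- Uk %*% diag(1 / s$d[1:k], nrow = k) %*% t(Uk)  # U_k Sigma_k^{-2} U_k'
beta_hat <- XtX_pinv %*% t(X) %*% Y      # one (non-unique) least-squares solution
beta_hat
```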
Outline

• Motivations
• Ordinary Least Squares
• Linear Model
• Penalized regression: ridge, lasso
Linear Model

Probabilistic assumption on the residuals

The probabilistic assumption on the residuals allows us to introduce:
- statistical tests on the values of the coefficients (to study whether a variable is useful or not to explain the target),
- a test of whether the whole linear model is useful (all coefficients equal to zero).
Linear model

• We write: $Y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
• We have: $\varepsilon_i = Y_i - \sum_j X_i^j \beta_j$ with density $f(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\varepsilon^2/(2\sigma^2)}$, i.i.d.
• Residual density and maximum likelihood estimation:
$$f(\varepsilon_1, \dots, \varepsilon_n) = \prod_i f(\varepsilon_i) = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{\sum_i \varepsilon_i^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{\|Y - X\beta\|^2}{2\sigma^2}}$$
• The goal is to compute $\hat\beta, \hat\sigma^2$, solutions of the maximum likelihood estimation (MLE). The MLE and the OLS have the same solution:
  - $\hat\beta = (X^T X)^{-1} X^T Y$
  - $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2$
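Note that the MLE of $\sigma^2$ divides by $n$, while the residual standard error reported by `summary.lm` divides by $n - p$ (see the law of $\hat\sigma^2$ below); a minimal check on simulated data:

```r
set.seed(10)
n <- 100; x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x); p <- 2

e <- residuals(fit)
sigma2_mle <- sum(e^2) / n         # maximum likelihood estimate
sigma2_hat <- sum(e^2) / (n - p)   # unbiased estimate used by summary.lm
c(sigma2_mle, sigma2_hat, summary(fit)$sigma^2)
```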
Linear model: what are the laws of the estimators?

$Y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

Laws of the estimators:
• Law of $\hat\beta$?
• Law of $\hat Y$?
• Law of $\hat\sigma^2$?

Benefits:
→ allows computing confidence intervals for $\hat\beta$ and $\hat Y$;
→ allows testing the parameters.
Law of $\hat\beta$:

$$\hat\beta \sim \mathcal{N}\big(\beta,\ \sigma^2 (X^T X)^{-1}\big)$$

Expectation and variance of $\hat\beta$: from $\hat\beta = (X^T X)^{-1} X^T Y$ and $Y = X\beta + \varepsilon$,

• $\mathbb{E}(\hat\beta) = \beta$: unbiased estimator, $\mathbb{E}(\hat\beta) - \beta = 0$
• $\mathrm{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$:
$$\mathrm{Var}(\hat\beta) = (X^T X)^{-1} X^T [\mathrm{Var}(Y)]\, X (X^T X)^{-1} = (X^T X)^{-1} X^T [\mathrm{Var}(\varepsilon)]\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$
• $\mathbb{E}[(\hat\beta - \beta)^2] = \mathrm{Var}(\hat\beta) + 0$

Recall: $\mathrm{Var}(aY) = a\,\mathrm{Var}(Y)\,a^T$.
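A minimal numerical check, on simulated data, that `vcov` of an `lm` fit equals $\hat\sigma^2 (X^T X)^{-1}$ (with $\hat\sigma^2$ the unbiased estimate):

```r
set.seed(11)
n <- 80; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)
V <- sigma2_hat * solve(t(X) %*% X)   # sigma^2 (X'X)^{-1}
max(abs(V - vcov(fit)))               # ~ 0
```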
Law of $\hat Y$:

$$\hat Y \sim \mathcal{N}\big(X\beta,\ \sigma^2 X (X^T X)^{-1} X^T\big)$$

Expectation and variance of $\hat Y = X\hat\beta$:

• $\mathbb{E}(\hat Y) = X\beta$: $\mathbb{E}(\hat Y) = \mathbb{E}(X\hat\beta) = X\,\mathbb{E}(\hat\beta) = X\beta = \mathbb{E}(Y)$
• $\mathrm{Var}(\hat Y) = \sigma^2 X (X^T X)^{-1} X^T$:
$$\mathrm{Var}(\hat Y) = \mathrm{Var}(X\hat\beta) = X\,\mathrm{Var}(\hat\beta)\,X^T = \sigma^2 X (X^T X)^{-1} X^T$$
Law of $\hat\varepsilon$:

$$\hat\varepsilon \sim \mathcal{N}\big(0,\ \sigma^2 (I_n - X (X^T X)^{-1} X^T)\big)$$

Expectation and variance of $\hat\varepsilon = Y - \hat Y$:

• $\mathbb{E}(\hat\varepsilon) = 0$
• $\mathrm{Var}(\hat\varepsilon) = \sigma^2 (I_n - X (X^T X)^{-1} X^T)$:
$$\mathrm{Var}(\hat\varepsilon) = \mathrm{Var}(Y - \hat Y) = \mathrm{Var}(Y - X\hat\beta) = \sigma^2 I_n - X\,\mathrm{Var}(\hat\beta)\,X^T = \sigma^2 (I_n - X (X^T X)^{-1} X^T)$$

Recall: $\mathrm{Var}(aY) = a\,\mathrm{Var}(Y)\,a^T$.
Linear model: law of the estimators

Under the assumption that the $\varepsilon_i$ are i.i.d. with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$:

Theorem: if $p \le n$ and $X^T X$ is invertible, the vector $\begin{pmatrix} \hat\beta \\ \hat\varepsilon \end{pmatrix}$ of dimension $(p + n)$ is a Gaussian vector with mean $\begin{pmatrix} \beta \\ 0 \end{pmatrix}$ and variance

$$\sigma^2 \begin{pmatrix} (X^T X)^{-1} & 0 \\ 0 & I_n - X (X^T X)^{-1} X^T \end{pmatrix}$$
Law of $\hat\sigma^2$:

$$\frac{n-p}{\sigma^2}\,\hat\sigma^2 \sim \chi^2_{n-p}$$

We note $\hat\sigma^2 = \frac{\|\hat\varepsilon\|^2}{n-p}$, with $\|\hat\varepsilon\|^2 = \sum_{i=1}^n \hat\varepsilon_i^2$; $\|\hat\varepsilon\|^2$ follows a $\sigma^2 \chi^2(n-p)$ law (Cochran theorem).

Then the expectation of $\hat\sigma^2 = \frac{\|\hat\varepsilon\|^2}{n-p}$ is $\sigma^2$ (since $\mathbb{E}(\chi^2(n-p)) = n-p$).

We deduce the law of $\hat\sigma^2$: $\hat\sigma^2 \sim \frac{\sigma^2}{n-p}\,\chi^2(n-p)$.

Recall (Student theorem): if $U \sim \mathcal{N}(0,1)$ and $V \sim \chi^2(d)$, with $U$ and $V$ independent, then $Z = \frac{U}{\sqrt{V/d}}$ follows a Student law with parameter $d$.
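A small Monte Carlo sketch of this law, under an assumed simulation setup: draw many regressions with known $\sigma$ and compare $(n-p)\hat\sigma^2/\sigma^2$ with the $\chi^2(n-p)$ density.

```r
set.seed(12)
n <- 30; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

stat <- replicate(5000, {
  y <- drop(X %*% c(1, -1, 0.5)) + rnorm(n, sd = sigma)
  e <- residuals(lm(y ~ X[, -1]))
  sum(e^2) / sigma^2                  # = (n-p) * sigma2_hat / sigma^2
})
c(mean(stat), n - p)                  # the mean of a chi2(n-p) is n-p
hist(stat, freq = FALSE, breaks = 40)
curve(dchisq(x, df = n - p), add = TRUE)
```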
Significance test of $\beta_j$, $\sigma^2$ unknown

• Student statistic: $T$
• Significance test (two-sided):
  - $H_0$: $\beta_j = 0$
  - $H_1$: $\beta_j \neq 0$
• Decision with risk $\alpha$: reject $H_0$ if
  - $\left|\dfrac{\hat\beta_j}{\sqrt{\hat\sigma^2 S_{j,j}}}\right| > t_{n-p}(1 - \alpha/2)$, with $S_{j,j}$ the $j$-th diagonal term of $(X^T X)^{-1}$
  - equivalently, if the p-value $< \alpha$
• Conclusion:
  - $\hat\beta_j$ is significantly different from zero
  - $X_j$ has an influence in the model

Not true if there is collinearity between the variables.
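A minimal sketch recomputing the Student statistic and its p-value by hand and comparing with the `summary.lm` table (simulated data):

```r
set.seed(13)
n <- 60; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2); p <- 3

X <- model.matrix(fit)
S <- solve(t(X) %*% X)
sigma2_hat <- sum(residuals(fit)^2) / (n - p)

j <- 2                                    # test beta_{x1}
t_j <- coef(fit)[j] / sqrt(sigma2_hat * S[j, j])
pval <- 2 * (1 - pt(abs(t_j), df = n - p))
c(t_j, pval)                              # matches summary(fit)$coefficients[j, 3:4]
```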
Global significance of the model

Test of the model with risk $\alpha$:
$H_0$: $\beta_2 = \beta_3 = \dots = \beta_p = 0$
$H_1$: $\exists j \in \{2, \dots, p\}$, $\beta_j \neq 0$

Statistic:
$$F = \frac{n-p}{p-1} \cdot \frac{\|\hat Y - \bar Y\|^2}{\|Y - \hat Y\|^2} \sim \mathrm{Fisher}(p-1,\ n-p)$$

Remark: $\frac{n-p}{p-1} \frac{\|\hat Y - \bar Y\|^2}{\|Y - \hat Y\|^2} = \frac{SSE/(p-1)}{SSR/(n-p)}$ (E: estimated; R: residuals)

Decision rule:
• if $F_{obs} > q_{F_\alpha}$, $H_0$ is rejected: there exists a coefficient that is not zero; the regression is "useful"
• if $F_{obs} \le q_{F_\alpha}$, $H_0$ is accepted: all the coefficients are supposed to be null; the regression is not "useful"
Global significance of the model

• Fisher statistic
• Significance test:
  - $H_0$: $\beta_2 = \dots = \beta_p = 0$
  - $H_1$: $\exists \beta_j \neq 0$
• Decision with risk $\alpha$: reject $H_0$ if
  - $\frac{n-p}{p-1} \cdot \frac{R^2}{1-R^2} > f_{p-1,\,n-p}(1-\alpha)$
  - or if the p-value $< \alpha$

→ The linear model has an added value.
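A minimal check, on simulated data, that this R²-based formula reproduces the F statistic printed by `summary.lm`:

```r
set.seed(14)
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + rnorm(n)
fit <- lm(y ~ x1 + x2); p <- 3

R2 <- summary(fit)$r.squared
F_manual <- (n - p) / (p - 1) * R2 / (1 - R2)
c(F_manual, summary(fit)$fstatistic["value"])   # match
```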
Global significance of the model

Remark 1, on the Fisher statistic:
$$F = \frac{n-p}{p-1} \cdot \frac{R^2}{1-R^2}$$
The $R^2$ coefficient increases mechanically with the number of variables.

Remark 2: the adjusted $R^2$ may be used:
$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p}$$
The $R^2_{adj}$ does not increase mechanically with the number of variables.
Boston Housing Data

The original data are $n = 506$ observations on $p = 14$ variables, medv (median value of owner-occupied homes) being the target variable; the variables are described earlier in this lesson.
Boston Housing Data

The data:

nr  crim     zn  indus  chas  nox    rm     age   dis     rad  tax  ptratio  b       lstat  medv
1   0.00632  18  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     396.90  4.98   24.0
2   0.02731  0   7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.90  9.14   21.6
3   0.02729  0   7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
4   0.03237  0   2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
5   0.06905  0   2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.90  5.33   36.2
... ...      ... ...    ...   ...    ...    ...   ...     ...  ...  ...      ...     ...    ...
Boston Housing Data

MCO (OLS) in R:

```r
library(mlbench)

# Data
data(BostonHousing)
tab = BostonHousing; names(tab)
target = "medv"; Y = tab[, target]; X = tab[, names(tab) != target]; names(X)

# OLS fit
resfit = lsfit(x = X, y = Y, intercept = TRUE)
resfit$coef
hist(resfit$res)
```

Estimated coefficients (rounded):

```
  Cst  crim    zn indus chas    nox   rm  age   dis  rad   tax ptratio     b lstat
36.45 -0.10 0.046 0.020 2.68 -17.76 3.80 0.00 -1.47 0.30 -0.01   -0.95 0.009 -0.52
```
Boston Housing Data

Linear model in R:

```r
reslm = lm(medv ~ ., data = tab)
summary(reslm)
```

Results ($n = 506$, $p = 14$):

```
Residuals:
    Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas1        2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
```
Precautions

• Multicollinearity: the OLS solution requires computing $(X^T X)^{-1}$. When the determinant of this matrix is very close to zero, the problem is ill-conditioned.
• Choice of variables: the coefficient of determination $R^2$ increases with the number of variables; if $p = n$, then $R^2 = 1$, which is not necessarily meaningful.
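A minimal sketch of the multicollinearity warning, on simulated data: as two covariates become nearly identical, the condition number of $X^T X$ explodes and the OLS coefficients become unstable.

```r
set.seed(15)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-4)   # almost perfectly collinear with x1
X <- cbind(1, x1, x2)
y <- 1 + x1 + rnorm(n)

kappa(t(X) %*% X)                # huge condition number: ill-conditioned problem
coef(lm(y ~ x1 + x2))            # wildly large, offsetting coefficients
```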
Demonstration in R