Slides econ-lm


Transcript of Slides econ-lm

Page 1: Slides econ-lm

Arthur Charpentier, Econometrics & Machine Learning - 2017

Econometrics & Machine Learning

A. Charpentier, E. Flachaire & A. Ly

https://arxiv.org/abs/1708.06992

@freakonometrics, freakonometrics.hypotheses.org

Page 2: Slides econ-lm

Probabilistic Foundations of Econometrics

see Haavelmo (1944) and Duo (1997)

Page 3: Slides econ-lm

Probabilistic Foundations of Econometrics

There is a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that the data $(y_i,\boldsymbol{x}_i)$ can be seen as realizations of random variables $(Y_i,\boldsymbol{X}_i)$.

Consider a conditional distribution for $Y\mid\boldsymbol{X}$, e.g.

$$(Y\mid \boldsymbol{X}=\boldsymbol{x})\ \overset{\mathcal{L}}{\sim}\ \mathcal{N}\big(\mu(\boldsymbol{x}),\sigma^2\big)\quad\text{where }\mu(\boldsymbol{x})=\beta_0+\boldsymbol{x}^{\top}\boldsymbol{\beta},\ \boldsymbol{\beta}\in\mathbb{R}^p,$$

for the linear model, with some extension when the distribution is in the exponential family (GLM).

Then use maximum likelihood for inference, to derive confidence intervals (quantification of uncertainty), etc.

Importance of unbiased estimators (Cramér-Rao lower bound for the variance).
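A minimal sketch (Python, simulated data with hypothetical parameter values) of this approach: the Gaussian log-likelihood of $(Y\mid\boldsymbol{X}=\boldsymbol{x})$ is maximised numerically with scipy, and the resulting estimate of $\boldsymbol{\beta}$ coincides with the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# simulated data (hypothetical values): y = beta0 + x'beta + eps
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0])            # (beta0, beta1, beta2)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

Z = np.column_stack([np.ones(n), X])              # design matrix with intercept

def neg_loglik(theta):
    """Negative Gaussian log-likelihood; theta = (beta0, beta, log sigma)."""
    beta, log_sigma = theta[:-1], theta[-1]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - Z @ beta
    return 0.5 * np.sum(resid ** 2 / sigma2 + np.log(2 * np.pi * sigma2))

beta_mle = minimize(neg_loglik, x0=np.zeros(p + 2), method="BFGS").x[:-1]

# closed-form least-squares solution, identical to the Gaussian MLE of beta
beta_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(np.round(beta_mle, 3), np.round(beta_ols, 3))
```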

Page 4: Slides econ-lm

Loss Functions

Gneiting (2009), in a statistical context: a statistic $T$ is said to be elicitable if there is $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ such that

$$T(Y)=\underset{x\in\mathbb{R}}{\text{argmin}}\left\{\int_{\mathbb{R}}\ell(x,y)\,dF(y)\right\}=\underset{x\in\mathbb{R}}{\text{argmin}}\left\{\mathbb{E}\big[\ell(x,Y)\big]\right\}\quad\text{where }Y\overset{\mathcal{L}}{\sim}F$$

(e.g. $T(Y)=\mathbb{E}[Y]$ with $\ell=\ell_2$, or $T(Y)=\text{median}(Y)$ with $\ell=\ell_1$).
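A quick numerical check of this (a sketch with a hypothetical skewed sample): minimising the empirical counterpart of $\mathbb{E}[\ell(x,Y)]$ over a grid of candidate values $x$ recovers the mean for the $\ell_2$ loss and the median for the $\ell_1$ loss.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=2000)        # hypothetical skewed sample, Y ~ F
grid = np.linspace(0.0, 10.0, 2001)              # candidate values x

# empirical versions of E[l(x, Y)] for the l2 and l1 losses
risk_l2 = ((grid[:, None] - y[None, :]) ** 2).mean(axis=1)
risk_l1 = np.abs(grid[:, None] - y[None, :]).mean(axis=1)

print(grid[risk_l2.argmin()], y.mean())          # l2 minimiser is (close to) the mean
print(grid[risk_l1.argmin()], np.median(y))      # l1 minimiser is (close to) the median
```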

In machine learning, we want to solve

$$m^{\star}=\underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i,m(\boldsymbol{x}_i)\big)\right\}$$

using optimization techniques, in a not-too-complex space $\mathcal{M}$.
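A sketch of this empirical-risk-minimisation view (hypothetical data, squared loss, $\mathcal{M}$ taken to be cubic polynomials): $m^{\star}$ is obtained by numerically minimising the total loss over the parameters of $m$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)                 # hypothetical nonlinear signal
y = np.sin(x) + rng.normal(scale=0.2, size=100)

def m(theta, x):
    # M: cubic polynomials, parametrised by theta
    return theta[0] + theta[1] * x + theta[2] * x ** 2 + theta[3] * x ** 3

def empirical_risk(theta):
    # sum_i l(y_i, m(x_i)) with the squared loss
    return np.sum((y - m(theta, x)) ** 2)

theta_star = minimize(empirical_risk, x0=np.zeros(4)).x
print(np.round(theta_star, 3))
```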

Page 5: Slides econ-lm

Overfit ?

Underfit: true model $y_i=\beta_0+\boldsymbol{x}_1^{\top}\boldsymbol{\beta}_1+\boldsymbol{x}_2^{\top}\boldsymbol{\beta}_2+\varepsilon_i$ vs. fitted model $y_i=b_0+\boldsymbol{x}_1^{\top}\boldsymbol{b}_1+\eta_i$.

$$\boldsymbol{b}_1=(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{y}=\boldsymbol{\beta}_1+\underbrace{(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{X}_2\boldsymbol{\beta}_2}_{\boldsymbol{\beta}_{12}}+\underbrace{(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{\varepsilon}}_{\boldsymbol{\nu}}$$

i.e. $\mathbb{E}[\boldsymbol{b}_1]=\boldsymbol{\beta}_1+\boldsymbol{\beta}_{12}\neq\boldsymbol{\beta}_1$, unless $\boldsymbol{X}_1\perp\boldsymbol{X}_2$ (see the Frisch-Waugh theorem).

Overfit: true model $y_i=\beta_0+\boldsymbol{x}_1^{\top}\boldsymbol{\beta}_1+\varepsilon_i$ vs. fitted model $y_i=b_0+\boldsymbol{x}_1^{\top}\boldsymbol{b}_1+\boldsymbol{x}_2^{\top}\boldsymbol{b}_2+\eta_i$.

In that case $\mathbb{E}[\boldsymbol{b}_1]=\boldsymbol{\beta}_1$, but the estimator is no longer efficient.
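A small simulation sketch (hypothetical design with correlated regressors) illustrating both cases: omitting a relevant $\boldsymbol{X}_2$ biases $b_1$ through $\beta_{12}$, while including an irrelevant $\boldsymbol{X}_2$ leaves $b_1$ unbiased but inflates its variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 200, 2000
beta1, beta2 = 2.0, 1.5                          # hypothetical true coefficients

b1_under, b1_over, b1_correct = [], [], []
for _ in range(n_sim):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)     # correlated regressors
    eps = rng.normal(size=n)

    # underfit: y depends on x2, but x2 is omitted from the fitted model
    y = beta1 * x1 + beta2 * x2 + eps
    b1_under.append(np.polyfit(x1, y, 1)[0])

    # overfit: y does not depend on x2, but x2 is included anyway
    y = beta1 * x1 + eps
    Z = np.column_stack([np.ones(n), x1, x2])
    b1_over.append(np.linalg.lstsq(Z, y, rcond=None)[0][1])
    b1_correct.append(np.polyfit(x1, y, 1)[0])        # correctly specified model

print(np.mean(b1_under))                         # ~ beta1 + 0.7 * beta2 = 3.05 (biased)
print(np.mean(b1_over), np.std(b1_over))         # ~ 2.0, but with a larger variance...
print(np.mean(b1_correct), np.std(b1_correct))   # ...than under the correct model
```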

Page 6: Slides econ-lm

Occam’s Razor and Parsimony

Importance of a penalty in order to avoid too complex a model (overfit).

Consider some linear predictor, $\widehat{\boldsymbol{y}}=\boldsymbol{H}\boldsymbol{y}$ where $\boldsymbol{H}=\boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}$, or more generally $\widehat{\boldsymbol{y}}=\boldsymbol{S}\boldsymbol{y}$ for some smoothing matrix $\boldsymbol{S}$.

In econometrics, consider some ex-post penalty for model selection,

$$\text{AIC}=-2\log\mathcal{L}+2p=\text{deviance}+2p$$

where $p$ is the dimension of the model (i.e. $\text{trace}(\boldsymbol{S})$ in a more general setting).
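For instance, a sketch of ex-post selection with AIC among nested Gaussian linear models (simulated data, hypothetical coefficients): the deviance $-2\log\mathcal{L}$ decreases mechanically as regressors are added, and the $2p$ term penalises the extra dimensions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # only 2 useful regressors

def gaussian_aic(y, Z):
    """AIC = -2 log L + 2p for a Gaussian linear model fitted by least squares."""
    p = Z.shape[1] + 1                       # coefficients + error variance
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    sigma2 = resid @ resid / len(y)          # MLE of the error variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * p

for k in range(1, 6):                        # nested models with the first k regressors
    Z = np.column_stack([np.ones(n), X[:, :k]])
    print(k, round(gaussian_aic(y, Z), 2))   # typically minimised around k = 2
```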

In machine learning, the penalty is added to the objective function, see Ridge or LASSO regression,

$$(\beta_{0,\lambda},\boldsymbol{\beta}_{\lambda})=\underset{(\beta_0,\boldsymbol{\beta})}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i,\beta_0+\boldsymbol{x}_i^{\top}\boldsymbol{\beta}\big)+\lambda\|\boldsymbol{\beta}\|\right\}$$
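A sketch of the Ridge case (squared loss, $\ell_2$ penalty, hypothetical sparse signal, intercept omitted for brevity): the closed-form solution shrinks the coefficients towards zero as $\lambda$ grows; LASSO replaces $\|\boldsymbol{\beta}\|_2^2$ by $\|\boldsymbol{\beta}\|_1$ and sets some coefficients exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])    # sparse hypothetical signal
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    """argmin_beta sum_i (y_i - x_i'beta)^2 + lam * ||beta||_2^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 10.0, 100.0, 1000.0]:
    print(lam, np.round(ridge(X, y, lam), 2))       # coefficients shrink as lambda grows
```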

Goodhart's law: “when a measure becomes a target, it ceases to be a good measure”.

Page 7: Slides econ-lm

Boosting, or learning from previous errors

Construct a sequence of models

$$m^{(k)}(\boldsymbol{x})=m^{(k-1)}(\boldsymbol{x})+\alpha\cdot f^{\star}(\boldsymbol{x})$$

where

$$f^{\star}=\underset{f\in\mathcal{W}}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i-m^{(k-1)}(\boldsymbol{x}_i),f(\boldsymbol{x}_i)\big)\right\}$$

for some set of weak learners $\mathcal{W}$.

Problem: when to stop, to avoid overfitting...
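A sketch of this residual-fitting scheme ($\ell_2$-boosting with hypothetical data, using depth-one regression trees as the weak learners $\mathcal{W}$; sklearn's DecisionTreeRegressor is assumed available): each step fits a weak learner to the current residuals and adds a shrunken version of it to the model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, size=300)).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)    # hypothetical data

alpha, n_steps = 0.1, 200                  # learning rate and number of boosting steps
m_k = np.full_like(y, y.mean())            # m^(0): constant model

for k in range(n_steps):
    residuals = y - m_k                    # errors of the previous model m^(k-1)
    f_star = DecisionTreeRegressor(max_depth=1).fit(x, residuals)   # weak learner
    m_k = m_k + alpha * f_star.predict(x)  # m^(k) = m^(k-1) + alpha * f*

print(np.mean((y - m_k) ** 2))             # in-sample error keeps decreasing with k,
                                           # hence the need for a stopping rule
```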
