Slides econ-lm


Transcript of Slides econ-lm

Page 1: Slides econ-lm

Arthur Charpentier, Econometrics & Machine Learning - 2017

Econometrics & Machine Learning

A. Charpentier, E. Flachaire & A. Ly

https://arxiv.org/abs/1708.06992

@freakonometrics, freakonometrics.hypotheses.org

Page 2: Slides econ-lm

Probabilistic Foundations of Econometrics

see Haavelmo (1944) and Duo (1997)

Page 3: Slides econ-lm

Probabilistic Foundations of Econometrics

There is a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that the data $(y_i,\boldsymbol{x}_i)$ can be seen as realizations of random variables $(Y_i,\boldsymbol{X}_i)$.

Consider a conditional distribution for $Y\mid\boldsymbol{X}$, e.g.

$$(Y\mid \boldsymbol{X}=\boldsymbol{x})\ \overset{\mathcal{L}}{\sim}\ \mathcal{N}\big(\mu(\boldsymbol{x}),\sigma^2\big)\quad\text{where }\mu(\boldsymbol{x})=\beta_0+\boldsymbol{x}^{\top}\boldsymbol{\beta},\ \boldsymbol{\beta}\in\mathbb{R}^p,$$

for the linear model, with some extension when the distribution is in the exponential family (GLM).

Then use maximum likelihood for inference, to derive confidence intervals (quantification of uncertainty), etc.

Importance of unbiased estimators (Cramér-Rao lower bound for the variance).
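A minimal sketch (Python, simulated data with hypothetical parameter values) of this approach: the Gaussian log-likelihood of $(Y\mid\boldsymbol{X}=\boldsymbol{x})$ is maximised numerically with scipy, and the resulting estimate of $\boldsymbol{\beta}$ coincides with the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# simulated data (hypothetical values): y = beta0 + x'beta + eps
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0])            # (beta0, beta1, beta2)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

Z = np.column_stack([np.ones(n), X])              # design matrix with intercept

def neg_loglik(theta):
    """Negative Gaussian log-likelihood; theta = (beta0, beta, log sigma)."""
    beta, log_sigma = theta[:-1], theta[-1]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - Z @ beta
    return 0.5 * np.sum(resid ** 2 / sigma2 + np.log(2 * np.pi * sigma2))

beta_mle = minimize(neg_loglik, x0=np.zeros(p + 2), method="BFGS").x[:-1]

# closed-form least-squares solution, identical to the Gaussian MLE of beta
beta_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(np.round(beta_mle, 3), np.round(beta_ols, 3))
```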

Page 4: Slides econ-lm

Loss Functions

Gneiting (2009), in a statistical context: a statistic $T$ is said to be elicitable if there is $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ such that

$$T(Y)=\underset{x\in\mathbb{R}}{\text{argmin}}\left\{\int_{\mathbb{R}}\ell(x,y)\,dF(y)\right\}=\underset{x\in\mathbb{R}}{\text{argmin}}\left\{\mathbb{E}\big[\ell(x,Y)\big]\right\}\quad\text{where }Y\overset{\mathcal{L}}{\sim}F$$

(e.g. $T(Y)=\mathbb{E}[Y]$ with $\ell=\ell_2$, or $T(Y)=\text{median}(Y)$ with $\ell=\ell_1$).
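A quick numerical check of this (a sketch with a hypothetical skewed sample): minimising the empirical counterpart of $\mathbb{E}[\ell(x,Y)]$ over a grid of candidate values $x$ recovers the mean for the $\ell_2$ loss and the median for the $\ell_1$ loss.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=2000)        # hypothetical skewed sample, Y ~ F
grid = np.linspace(0.0, 10.0, 2001)              # candidate values x

# empirical versions of E[l(x, Y)] for the l2 and l1 losses
risk_l2 = ((grid[:, None] - y[None, :]) ** 2).mean(axis=1)
risk_l1 = np.abs(grid[:, None] - y[None, :]).mean(axis=1)

print(grid[risk_l2.argmin()], y.mean())          # l2 minimiser is (close to) the mean
print(grid[risk_l1.argmin()], np.median(y))      # l1 minimiser is (close to) the median
```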

In machine learning, we want to solve

$$m^{\star}=\underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i,m(\boldsymbol{x}_i)\big)\right\}$$

using optimization techniques, in a not-too-complex space $\mathcal{M}$.
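A sketch of this empirical-risk-minimisation view (hypothetical data, squared loss, $\mathcal{M}$ taken to be cubic polynomials): $m^{\star}$ is obtained by numerically minimising the total loss over the parameters of $m$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)                 # hypothetical nonlinear signal
y = np.sin(x) + rng.normal(scale=0.2, size=100)

def m(theta, x):
    # M: cubic polynomials, parametrised by theta
    return theta[0] + theta[1] * x + theta[2] * x ** 2 + theta[3] * x ** 3

def empirical_risk(theta):
    # sum_i l(y_i, m(x_i)) with the squared loss
    return np.sum((y - m(theta, x)) ** 2)

theta_star = minimize(empirical_risk, x0=np.zeros(4)).x
print(np.round(theta_star, 3))
```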

Page 5: Slides econ-lm

Overfit ?

Underfit: true model $y_i=\beta_0+\boldsymbol{x}_1^{\top}\boldsymbol{\beta}_1+\boldsymbol{x}_2^{\top}\boldsymbol{\beta}_2+\varepsilon_i$ vs. fitted model $y_i=b_0+\boldsymbol{x}_1^{\top}\boldsymbol{b}_1+\eta_i$.

$$\boldsymbol{b}_1=(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{y}=\boldsymbol{\beta}_1+\underbrace{(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{X}_2\boldsymbol{\beta}_2}_{\boldsymbol{\beta}_{12}}+\underbrace{(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{\varepsilon}}_{\boldsymbol{\nu}}$$

i.e. $\mathbb{E}[\boldsymbol{b}_1]=\boldsymbol{\beta}_1+\boldsymbol{\beta}_{12}\neq\boldsymbol{\beta}_1$, unless $\boldsymbol{X}_1\perp\boldsymbol{X}_2$ (see the Frisch-Waugh theorem).

Overfit: true model $y_i=\beta_0+\boldsymbol{x}_1^{\top}\boldsymbol{\beta}_1+\varepsilon_i$ vs. fitted model $y_i=b_0+\boldsymbol{x}_1^{\top}\boldsymbol{b}_1+\boldsymbol{x}_2^{\top}\boldsymbol{b}_2+\eta_i$.

In that case $\mathbb{E}[\boldsymbol{b}_1]=\boldsymbol{\beta}_1$, but the estimator is no longer efficient.
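A small simulation sketch (hypothetical design with correlated regressors) illustrating both cases: omitting a relevant $\boldsymbol{X}_2$ biases $b_1$ through $\beta_{12}$, while including an irrelevant $\boldsymbol{X}_2$ leaves $b_1$ unbiased but inflates its variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 200, 2000
beta1, beta2 = 2.0, 1.5                          # hypothetical true coefficients

b1_under, b1_over, b1_correct = [], [], []
for _ in range(n_sim):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)     # correlated regressors
    eps = rng.normal(size=n)

    # underfit: y depends on x2, but x2 is omitted from the fitted model
    y = beta1 * x1 + beta2 * x2 + eps
    b1_under.append(np.polyfit(x1, y, 1)[0])

    # overfit: y does not depend on x2, but x2 is included anyway
    y = beta1 * x1 + eps
    Z = np.column_stack([np.ones(n), x1, x2])
    b1_over.append(np.linalg.lstsq(Z, y, rcond=None)[0][1])
    b1_correct.append(np.polyfit(x1, y, 1)[0])        # correctly specified model

print(np.mean(b1_under))                         # ~ beta1 + 0.7 * beta2 = 3.05 (biased)
print(np.mean(b1_over), np.std(b1_over))         # ~ 2.0, but with a larger variance...
print(np.mean(b1_correct), np.std(b1_correct))   # ...than under the correct model
```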

Page 6: Slides econ-lm

Occam’s Razor and Parsimony

Importance of a penalty in order to avoid too complex a model (overfit).

Consider some linear predictor, $\widehat{\boldsymbol{y}}=\boldsymbol{H}\boldsymbol{y}$ where $\boldsymbol{H}=\boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}$, or more generally $\widehat{\boldsymbol{y}}=\boldsymbol{S}\boldsymbol{y}$ for some smoothing matrix $\boldsymbol{S}$.

In econometrics, consider some ex-post penalty for model selection,

$$\text{AIC}=-2\log\mathcal{L}+2p=\text{deviance}+2p$$

where $p$ is the dimension of the model (i.e. $\text{trace}(\boldsymbol{S})$ in a more general setting).
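For instance, a sketch of ex-post selection with AIC among nested Gaussian linear models (simulated data, hypothetical coefficients): the deviance $-2\log\mathcal{L}$ decreases mechanically as regressors are added, and the $2p$ term penalises the extra dimensions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # only 2 useful regressors

def gaussian_aic(y, Z):
    """AIC = -2 log L + 2p for a Gaussian linear model fitted by least squares."""
    p = Z.shape[1] + 1                       # coefficients + error variance
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    sigma2 = resid @ resid / len(y)          # MLE of the error variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * p

for k in range(1, 6):                        # nested models with the first k regressors
    Z = np.column_stack([np.ones(n), X[:, :k]])
    print(k, round(gaussian_aic(y, Z), 2))   # typically minimised around k = 2
```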

In machine learning, the penalty is added to the objective function, see Ridge or LASSO regression,

$$(\beta_{0,\lambda},\boldsymbol{\beta}_{\lambda})=\underset{(\beta_0,\boldsymbol{\beta})}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i,\beta_0+\boldsymbol{x}_i^{\top}\boldsymbol{\beta}\big)+\lambda\|\boldsymbol{\beta}\|\right\}$$
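A sketch of the Ridge case (squared loss, $\ell_2$ penalty, hypothetical sparse signal, intercept omitted for brevity): the closed-form solution shrinks the coefficients towards zero as $\lambda$ grows; LASSO replaces $\|\boldsymbol{\beta}\|_2^2$ by $\|\boldsymbol{\beta}\|_1$ and sets some coefficients exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])    # sparse hypothetical signal
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    """argmin_beta sum_i (y_i - x_i'beta)^2 + lam * ||beta||_2^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 10.0, 100.0, 1000.0]:
    print(lam, np.round(ridge(X, y, lam), 2))       # coefficients shrink as lambda grows
```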

Goodhart's law: “when a measure becomes a target, it ceases to be a good measure”.

Page 7: Slides econ-lm

Boosting, or learning from previous errors

Construct a sequence of models

$$m^{(k)}(\boldsymbol{x})=m^{(k-1)}(\boldsymbol{x})+\alpha\cdot f^{\star}(\boldsymbol{x})$$

where

$$f^{\star}=\underset{f\in\mathcal{W}}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i-m^{(k-1)}(\boldsymbol{x}_i),f(\boldsymbol{x}_i)\big)\right\}$$

for some set of weak learners $\mathcal{W}$.

Problem: when to stop, to avoid overfitting...
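A sketch of this residual-fitting scheme ($\ell_2$-boosting with hypothetical data, using depth-one regression trees as the weak learners $\mathcal{W}$; sklearn's DecisionTreeRegressor is assumed available): each step fits a weak learner to the current residuals and adds a shrunken version of it to the model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, size=300)).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)    # hypothetical data

alpha, n_steps = 0.1, 200                  # learning rate and number of boosting steps
m_k = np.full_like(y, y.mean())            # m^(0): constant model

for k in range(n_steps):
    residuals = y - m_k                    # errors of the previous model m^(k-1)
    f_star = DecisionTreeRegressor(max_depth=1).fit(x, residuals)   # weak learner
    m_k = m_k + alpha * f_star.predict(x)  # m^(k) = m^(k-1) + alpha * f*

print(np.mean((y - m_k) ** 2))             # in-sample error keeps decreasing with k,
                                           # hence the need for a stopping rule
```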
