Applied Econometrics

15 July 2010

Contents

1 Introduction and Motivation
  1.1 Three motivations for the CLRM

2 The Classical Linear Regression Model (CLRM): Parameter Estimation by OLS

3 Assumptions of the CLRM

4 Finite sample properties of the OLS estimator

5 Hypothesis Testing under Normality

6 Confidence intervals
  6.1 Simulation

7 Goodness-of-fit measures

8 Introduction to large sample theory
  8.1 A. Modes of stochastic convergence
  8.2 B. Law of Large Numbers (LLN)
  8.3 C. Central Limit Theorems (CLTs)
  8.4 D. Useful lemmas of large sample theory
  8.5 Large sample properties for the OLS estimator

9 Time Series Basics (Stationarity and Ergodicity)

10 Generalized Least Squares

11 Multicollinearity

12 Endogeneity

13 IV estimation

14 Questions for Review


1 Introduction and Motivation

What is econometrics? Econometrics = economic statistics/data analysis ∩ economic theory ∩ mathematics

Conceptual view:

• Data perceived as realizations of random variables

• Parameters are real numbers, not random variables

• Joint distributions of random variables depend on parameters

General regression equation. Generalization: y_i = β_1 x_{i1} + β_2 x_{i2} + ... + β_K x_{iK} + ε_i (linear model), with:
y_i = dependent variable (observable)
x_{ik} = explanatory variable (observable)
ε_i = unobservable component
Index of observations i = 1, 2, ..., n (for individuals, time, etc.) and regressors k = 1, 2, ..., K

\[ \underbrace{y_i}_{(1\times1)} = \underbrace{\beta'}_{(1\times K)}\underbrace{x_i}_{(K\times1)} + \underbrace{\varepsilon_i}_{(1\times1)} \tag{1} \]

\[ \beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_K\end{pmatrix} \quad\text{and}\quad x_i = \begin{pmatrix}x_{i1}\\ \vdots\\ x_{iK}\end{pmatrix} \tag{2} \]

Structural parameters=suggested by economic theory

The key problem of econometrics: we deal with non-experimental data. Unobservable variables, interdependence, endogeneity, (reverse) causality.

1.1 Three motivations for the CLRM

Theory: Glosten/Harris model. Notation:

• Transaction price: Pt


• Indicator of transaction type:

\[ Q_t = \begin{cases} +1 & \text{buyer-initiated trade} \\ -1 & \text{seller-initiated trade} \end{cases} \tag{3} \]

• Trade volume: vt

• Drift parameter: µ

• Earnings/costs of the market maker: c (operational costs, asymmetric information costs)

• Private information: Qtzt

• Public information: εt

Efficient/'fair' price:

\[ m_t = \underbrace{\mu + m_{t-1} + \varepsilon_t}_{\text{random walk with drift}} + Q_t z_t, \qquad \text{with } z_t = z_0 + z_1 v_t \]

Market maker sets:

• Sell price (ask): P^a_t = µ + m_{t−1} + ε_t + z_t + c

• Buy price (bid, i.e. the price the market maker pays): P^b_t = µ + m_{t−1} + ε_t − z_t − c

• ⇒ Spread: the market maker anticipates the price impact

Transactions occur at the ask or bid price:

\[ P_t - P_{t-1} = \mu + m_{t-1} + \varepsilon_t + Q_t z_t + cQ_t - [m_{t-1} + cQ_{t-1}] \tag{4} \]

\[ \Delta P_t = \mu + z_0 Q_t + z_1 v_t Q_t + c\,\Delta Q_t + \varepsilon_t \tag{5} \]

• Observable: ∆Pt, Qt, vt → xi = [1, Qt, vtQt,∆Qt]′ and yi = ∆Pt

• Unobservable: εt

• Estimation of unknown structural parameters β = (µ, z0, z1, c)′
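As a quick illustration (not part of the original notes), here is a minimal Python/NumPy sketch of how the structural parameters (µ, z_0, z_1, c) of equation (5) could be recovered from simulated trade data with the OLS formula b = (X'X)^{-1}X'y. The simulated data-generating process and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated trade data (assumed DGP, for illustration only)
mu, z0, z1, c = 0.0, 0.05, 0.01, 0.03          # "true" structural parameters
Q = rng.choice([-1.0, 1.0], size=n)            # trade direction indicator
v = rng.exponential(1.0, size=n)               # trade volume
eps = rng.normal(0.0, 0.1, size=n)             # public information shock
dQ = np.diff(Q, prepend=Q[0])
dP = mu + z0 * Q + z1 * v * Q + c * dQ + eps   # equation (5)

# Regressor vector [1, Q_t, v_t*Q_t, dQ_t]' and OLS estimate b = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), Q, v * Q, dQ])
b = np.linalg.solve(X.T @ X, X.T @ dP)
print("estimates of (mu, z0, z1, c):", b)
```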

Theory: Asset pricing. Finance theory: investors are compensated for holding risk; they demand an expected return beyond the risk-free rate.

x_{j,t+1}: payoff of risky asset j; f_{1,t+1}, ..., f_{K,t+1}: risk factors → linear (risk) factor models:

\[ \underbrace{E(R^e_{j,t+1})}_{\text{expected excess return of asset } j} = \underbrace{\beta_j'}_{\text{exposure of asset } j\text{ to factor risk}}\;\underbrace{\lambda}_{\text{price of factor risk}} \]

with β_j = (β_{j1}, ..., β_{jK})′ and λ = (λ_1, ..., λ_K)′

\[ R^e_{j,t+1} = \underbrace{R_{j,t+1}}_{\text{gross return}} - \underbrace{R_{f,t+1}}_{\text{risk-free rate}} \tag{6} \]

with gross return R_{j,t+1} = x_{j,t+1}/P_{j,t} = (P_{j,t+1} + d_{j,t+1})/P_{j,t}

Single risk factor: β_j = Cov(R^e_j, f)/Var(f)

Some AP models:

• CAPM: f = R^{em} = (R^m − R^f) (market risk)

• Fama-French (FF): f = (R^{em}, HML, SMB)′

• → f_1, ..., f_K are excess returns themselves

Then λ = [E(f_1), ..., E(f_K)]′

CAPM: E(R^e_{j,t+1}) = β_j E(R^{em}_{t+1}); FF: E(R^e_{j,t+1}) = β_{j1}E(R^{em}_{t+1}) + β_{j2}E(HML_{t+1}) + β_{j3}E(SMB_{t+1})

To estimate the risk loadings β we formulate a sampling (regression, "compatible") model:

\[ R^e_{j,t+1} = \beta_1 R^{em}_{t+1} + \beta_2 HML_{t+1} + \beta_3 SMB_{t+1} + \varepsilon_{j,t+1} \tag{7} \]

Assume E(ε_{j,t+1}|R^{em}_{t+1}, HML_{t+1}, SMB_{t+1}) = 0 (implies E(ε_{j,t+1}) = 0). This model is compatible with the theoretical model, which becomes clear when you take expected values on both sides.

This sampling model does not contain a constant: β_{j0} = 0 (from theory) = a testable restriction.

Theory: Mincer equation

ln(WAGEi) = β1 + β2Si + β3TENUREi + β4EXPRi + εi (8)

Notation:

• Logarithm of the wage rate: ln(WAGEi)

• Years of schooling: Si

• Experience in the current job: TENUREi

• Experience in the labor market: EXPRi


→ Estimation of the parameters β_k, where β_2 is the return to schooling.

Statistical specification: E(y_i|x_i) = x_i′β, with x_i′ = (x_{i1}, ..., x_{iK}) and β = (β_1, ..., β_K)′ (a linear function of the x's).

Marginal effect:

\[ \frac{\partial E(y_i|x_i)}{\partial x_{iK}} = \beta_K \tag{9} \]

"Compatible" regression model: y_i = x_i′β + ε_i with E(y_i|x_i) = x_i′β → the LTE (law of total expectation) implies:

\[ E(\varepsilon_i|x_i) = 0 \quad\text{and}\quad Cov(\varepsilon_i, x_{ik}) = 0 \;\;\forall k \tag{10} \]

The specification E(y_i|x_i) = x_i′β is ad hoc → alternative: non-parametric regression (leaves the functional form open).

Justifiable by a normality assumption:

\[ \begin{pmatrix} Y \\ X \end{pmatrix} \sim BVN\left[\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \sigma^2_y & \sigma_{xy} \\ \sigma_{xy} & \sigma^2_x \end{pmatrix}\right] \tag{11} \]

\[ \underbrace{\begin{pmatrix} Y \\ X_1 \\ \vdots \\ X_K \end{pmatrix}}_{(K+1\times1)} \sim MVN\left[\underbrace{\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}}_{(K+1\times1)}, \begin{pmatrix} \sigma^2_y & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_x \end{pmatrix}\right] \tag{12} \]

with µ_x = E(x) = [E(x_1), ..., E(x_K)]′ and Σ_x = Var(x) a K×K matrix.

\[ E(y|x) = \mu_y + \Sigma_{yx}\Sigma_x^{-1}(x - \mu_x) = \underbrace{\alpha + \beta'x}_{\text{linear conditional mean}} \]

with α = µ_y − β′µ_x and β = Σ_x^{-1}Σ_{xy}

\[ Var(y|x) = \underbrace{\sigma^2_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}}_{\text{does not depend on } x,\ \text{``homoscedasticity''}} \]

Rubin's causal model = regression analysis of experiments. STAR experiment = small-class experiment.

Treatment: binary variable D_i ∈ {0, 1} (class size). Outcome: y_i (SAT scores). Does D_i → y_i? (causal effect)

Potential outcomes:

One of the outcomes is hypothetical:

\[ \begin{cases} y_{1i} & \text{if } D_i = 1 \\ y_{0i} & \text{if } D_i = 0 \end{cases} \tag{13} \]


Actual outcome: observed y_i = y_{0i} + (y_{1i} − y_{0i})·D_i, where (y_{1i} − y_{0i}) is the causal effect (the effect may be identical or different across individuals i).

Uses causality explicitly, in contrast to the other models.

\[ y_i = \underbrace{\alpha}_{E(y_{0i})} + \underbrace{\rho}_{y_{1i}-y_{0i}}D_i + \underbrace{\eta_i}_{y_{0i}-E(y_{0i})} + \underbrace{z_i'\gamma}_{\text{STAR controls (gender, race, free lunch)}} \tag{14} \]

ρ is constant across i. Goal: estimate ρ.

STAR: experiment → random assignment of i to treatment: E(y_{0i}|D_i = 1) = E(y_{0i}|D_i = 0). In a non-experiment → selection bias: E(y_{0i}|D_i = 1) − E(y_{0i}|D_i = 0) ≠ 0.

\[ E(y_i|D_i = 1) = \alpha + \rho + E(\eta_i|D_i = 1), \quad\text{where } \alpha + E(\eta_i|D_i = 1) = E(y_{0i}|D_i = 1) \tag{15} \]

as well as

\[ E(y_i|D_i = 0) = \alpha + E(\eta_i|D_i = 0), \quad\text{where } \alpha + E(\eta_i|D_i = 0) = E(y_{0i}|D_i = 0) \tag{16} \]

\[ \rightarrow E(y_i|D_i = 1) - E(y_i|D_i = 0) = \rho + \underbrace{\text{selection bias}}_{E(y_{0i}|D_i=1) - E(y_{0i}|D_i=0)} \]

Another way to avoid selection bias, apart from experiments, is to use natural (quasi-)experiments.
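To make the selection-bias formula concrete, a small Python sketch (my own illustration, not from the notes) simulating potential outcomes: under random assignment the difference in means recovers ρ, while self-selection on an unobservable adds a selection bias. The DGP and the value ρ = 3 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
ability = rng.normal(0.0, 1.0, n)              # unobserved heterogeneity
y0 = 50 + 5 * ability + rng.normal(0, 1, n)    # potential outcome without treatment
y1 = y0 + 3.0                                  # true causal effect rho = 3

# Randomized assignment: E(y0|D=1) = E(y0|D=0), no selection bias
D_rand = rng.integers(0, 2, n)
y_rand = np.where(D_rand == 1, y1, y0)
print(y_rand[D_rand == 1].mean() - y_rand[D_rand == 0].mean())   # close to 3

# Self-selection on ability: E(y0|D=1) != E(y0|D=0) -> selection bias added to rho
D_sel = (ability + rng.normal(0, 1, n) > 0).astype(int)
y_sel = np.where(D_sel == 1, y1, y0)
print(y_sel[D_sel == 1].mean() - y_sel[D_sel == 0].mean())       # 3 + selection bias
```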

2 The Classical Linear Regression Model (CLRM): Parameter Estimation by OLS

Classical linear regression model:

\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_K x_{iK} + \varepsilon_i = \underbrace{x_i'}_{(1\times K)}\underbrace{\beta}_{(K\times1)} + \varepsilon_i \tag{17} \]

y = (y_1, ..., y_n)′: dependent variable, observed
x_i′ = (x_{i1}, ..., x_{iK}): explanatory variables, observed
β′ = (β_1, ..., β_K): unknown parameters
ε_i: disturbance component, unobserved
→ b′ = (b_1, ..., b_K): estimate of β′
→ e_i = y_i − x_i′b: estimated residual


Based on i.i.d. sampling, which refers to independence across the observations i, not across the x's. Preferred technique: least squares (instead of maximum likelihood or method of moments). Two-sidedness (ambiguity): the x's can be treated as random variables or as realisations.

For convenience we introduce matrix notation:

\[ \underbrace{y}_{(n\times1)} = \underbrace{X}_{(n\times K)}\underbrace{\beta}_{(K\times1)} + \underbrace{\varepsilon}_{(n\times1)} \tag{18} \]

Constant: (x11, ..., xn1)’=(1,...,1)’

Writing extensively: A system of linear equations

\[ y_1 = \beta_1 + \beta_2 x_{12} + ... + \beta_K x_{1K} + \varepsilon_1 \tag{19} \]
\[ y_2 = \beta_1 + \beta_2 x_{22} + ... + \beta_K x_{2K} + \varepsilon_2 \tag{20} \]
\[ \vdots \]
\[ y_n = \beta_1 + \beta_2 x_{n2} + ... + \beta_K x_{nK} + \varepsilon_n \tag{22} \]

OLS estimation in detail. Estimate β by choosing b that minimizes the sum of squared residuals:

\[ \underset{b}{\operatorname{argmin}}\; S(b) = \sum e_i^2 = \sum (y_i - x_i'b)^2 = \sum (y_i - b_1 x_{i1} - ... - b_K x_{iK})^2 \]

Differentiate with respect to b1, ..., bK →FOC

\[ \frac{\partial S(b)}{\partial b_1} = -2\sum (y_i - x_i'b) = 0 \;\Rightarrow\; \frac{1}{n}\sum e_i = 0 \tag{24} \]
\[ \frac{\partial S(b)}{\partial b_2} = -2\sum (y_i - x_i'b)x_{i2} = 0 \;\Rightarrow\; \frac{1}{n}\sum e_i x_{i2} = 0 \tag{25} \]
\[ \vdots \]
\[ \frac{\partial S(b)}{\partial b_K} = -2\sum (y_i - x_i'b)x_{iK} = 0 \;\Rightarrow\; \frac{1}{n}\sum e_i x_{iK} = 0 \tag{27} \]


→ System of K equations in K unknowns: solve for b (the OLS estimator).

"Characteristics": ē = 0, and the sample covariances of e_i with x_{i2}, ..., x_{iK} are zero.

With one regressor: y_i = β_1 + β_2 x_{i2} + ε_i and

\[ b_2 = \frac{\sum y_i x_{i2} - n\,\bar{y}\,\bar{x}_2}{\sum x_{i2}^2 - n\,\bar{x}_2^2} = \frac{\text{sample covariance}}{\text{sample variance}} \]

The solution is more complicated for K ≥ 2 → the system of K equations is solved by matrix algebra. With e = y − Xb, the FOC can be rewritten:

\[ \sum e_i = 0 \tag{29} \]
\[ \sum x_{i2}e_i = 0 \tag{30} \]
\[ \vdots \]
\[ \sum x_{iK}e_i = 0 \tag{32} \]
\[ \Rightarrow X'e = 0 \tag{33} \]

Extensively:

\[ \begin{pmatrix} 1 & \cdots & 1 \\ x_{12} & \cdots & x_{n2} \\ \vdots & & \vdots \\ x_{1K} & \cdots & x_{nK} \end{pmatrix}\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = 0 \tag{34} \]

\[ \Rightarrow X'e = X'(y - Xb) = \underbrace{X'y}_{K\times1} - \underbrace{X'X}_{K\times K}\underbrace{b}_{K\times1} = 0 \]

X′X has rank K (full rank), so [X′X]^{-1} exists and premultiplying by [X′X]^{-1} gives:
[X′X]^{-1}X′y − [X′X]^{-1}X′Xb = 0
[X′X]^{-1}X′y − Ib = 0
→ b = [X′X]^{-1}X′y

Alternative notation:

\[ b = \Big(\frac{1}{n}X'X\Big)^{-1}\frac{1}{n}X'y = \underbrace{\Big(\frac{1}{n}\sum x_i x_i'\Big)^{-1}}_{\text{matrix of sample means}}\;\underbrace{\frac{1}{n}\sum x_i y_i}_{\text{vector of sample means}} \tag{35} \]

Questions: properties? Unbiased, efficient, consistent?


3 Assumptions of the CLRM

The four core assumptions of the CLRM

1.1 Linearity in parameters: y_i = x_i′β + ε_i. This is not too restrictive, because reformulations are possible, e.g. using logs or quadratics.

1.2 Strict exogeneity: E(ε_i|X) = E(ε_i|x_{11}, ..., x_{1K}, ..., x_{i1}, ..., x_{iK}, ..., x_{n1}, ..., x_{nK}) = 0
Implications:
a) E(ε_i|X) = 0 → E(ε_i|x_{ik}) = 0 (by the LIE) → E(ε_i) = 0 (by the LTE)
b) E(ε_i x_{jk}) = 0 and, together with a), Cov(ε_i, x_{jk}) = 0 ∀ i, j, k (use LTE, LIE)

⇒ unconditional moment restrictions (compare to the OLS FOC)

Examples where this may be violated:

• ln(wage_i) = β_1 + β_2 S_i + ... + ε_i (expected sign of β_2: +) → omitted ability in ε_i

• crime_i (in district i) = β_1 + β_2 Police_i + ... + ε_i (expected sign: −) → omitted social factors in ε_i

• unempl_i (in country i) = β_1 + β_2 Lib_i + ... + ε_i (expected sign: −) → common macro shocks in ε_i

When E(ε_i|X) ≠ 0: endogeneity (prevents us from estimating the β's consistently).

Discussion: endogeneity and sample selection bias
Rubin's causal model: y_i = α + ρD_i + η_i;
E(y_i|D_i = 1) − E(y_i|D_i = 0) = ρ + [E(y_{0i}|D_i = 1) − E(y_{0i}|D_i = 0)] (selection bias)

D_i and {y_{0i}, y_{1i}} (partly unobservable) are assumed independent → {y_{0i}, y_{1i}} ⊥ D_i

With independence: E(y_{0i}|D_i = 1) = E(y_{0i}|D_i = 0) = E(y_{0i}), i.e. E(η_i|D_i) = E(η_i) = 0 (here η_i plays the role of ε_i and D_i the role of x_{ik})


Independence is normally not the case because of endogeneity and sample selection bias.

Conditional independence assumption (CIA): {y_{0i}, y_{1i}} ⊥ D_i | x_i → the selection bias vanishes after conditioning on x_i. How do we operationalize the CIA? By adding "control variables" to the right-hand side of the equation.

Example: Mincer-type regression

ln(wage_i) = β_1 + β_2 Highschool_i + β_3 Ten_i + β_4 Exp_i + β_5 Ability_i + β_6 Family_i + ε_i (Ability_i, Family_i: control variables)

→ We assume ε_i ⊥ Highschool_i | Ability, Family, ..., which justifies E(ε_i|Highschool, ...) = 0.
The CIA justifies the inclusion of control variables and E(ε_i|X) = 0.

Matching= sorting individuals into groups and then comparing the outcomes

1.3 No exact multicollinearity: P(rank(X) = K) = 1 (the event rank(X) = K is Bernoulli distributed, i.e. random)

⇒ No linear dependencies in the data matrix, otherwise (X′X)^{-1} does not exist
⇒ Does not refer to a (merely) high correlation between the X's

1.4 Spherical disturbances: Var(ε_i|X) = E(ε_i²|X) = Var(ε_i) = σ² ∀i → homoscedasticity (relates to the MVN)
Cov(ε_i, ε_j|X) = E(ε_i ε_j|X) = E(ε_i ε_j) = 0 → no serial correlation

For ε = (ε_1, ..., ε_n)′:

\[ E[\varepsilon\varepsilon'|X] = \begin{pmatrix} E(\varepsilon_1^2|X) & \cdots & \cdots \\ E(\varepsilon_2\varepsilon_1|X) & \ddots & \vdots \\ \cdots & \cdots & E(\varepsilon_n^2|X) \end{pmatrix} = \begin{pmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{pmatrix} = Var(\varepsilon|X) \tag{36} \]

By the LTE: E(εε′) = Var(ε) = σ²I_n → Var(ε_i) = E(ε_i²) = σ² and Cov(ε_i, ε_j) = E(ε_i ε_j) = 0 ∀ i ≠ j

Interpreting the parameters β in different types of linear equations:

Linear model: y_i = β_1 + β_2 x_{i2} + ... + β_K x_{iK} + ε_i. A one-unit increase in the independent variable x_{iK} increases the dependent variable by β_K units.

Semi-log form: ln(y_i) = β_1 + β_2 x_{i2} + ... + β_K x_{iK} + ε_i. A one-unit increase in the independent variable increases the dependent variable by approximately 100·β_k percent.

Log-linear model: ln(y_i) = β_1 ln(x_{i1}) + β_2 ln(x_{i2}) + ... + β_K ln(x_{iK}) + ε_i. A one-percent increase in x_{iK} increases the dependent variable y_i by approximately β_k percent.
E.g. y_i = A x_{i1}^α x_{i2}^γ ε_i (Cobb-Douglas) → ln y_i = ln A + α ln x_{i1} + γ ln x_{i2} + ln ε_i


Before the OLS proofs, a useful tool: the law of total expectation (LTE)

\[ E(y|X = x) = \int_{-\infty}^{+\infty} y\, f_{y|x}(y|x)\, dy = \int y\,\frac{f_{xy}(x,y)}{f_x(x)}\, dy \tag{37} \]

Using the random variable x:

\[ E(y|x) = \underbrace{\int y\,\frac{f_{xy}(x,y)}{f_x(x)}\, dy}_{g(x)} \tag{38} \]

\[ E_x(g(x)) = \int g(x) f_x(x)\, dx = \int\Big[\int y\,\frac{f_{xy}(x,y)}{f_x(x)}\, dy\Big] f_x(x)\, dx = \int y\int f_{xy}(x,y)\, dx\, dy = \int y\, f_y(y)\, dy = E(y) \tag{39} \]

\[ \Rightarrow E_x[\underbrace{E_{y|x}[y|x]}_{\text{measurable function of } x}] = E_y(y) \;\Leftrightarrow\; E[E(y|x)] = E(y) \tag{40} \]

Notes:

• also works when x is a vector

• forecasting interpretation

LTE extension: law of iterated expectations (LIE)

\[ E_{z|x}[\underbrace{E_{y|x,z}(y|x,z)}_{\text{measurable function of } x,z}|x] = E(y|x) \tag{41} \]

Other important laws
Double expectation theorem (DET):

\[ E_x[E_{y|x}(g(y)|x)] = E_y(g(y)) \tag{42} \]

Generalized DET:

\[ E_x[E_{y|x}(g(x,y)|x)] = E_{x,y}(g(x,y)) \tag{43} \]

Linearity of conditional expectations:

\[ E_{y|x}[g(x)y|x] = g(x)E_{y|x}[y|x] \tag{44} \]


4 Finite sample properties of the OLS estimator

Finite Sample Properties of b = (X ′X)−1X ′y

1. With 1.1-1.3, and holding for any sample size: E[b|X] = β and by the LTE E[b] = β → unbiasedness

2. With 1.1-1.4: Var[b|X] = σ²[X′X]^{-1} (important for testing, depends on the data) → conditional variance

3. With 1.1-1.4: OLS is efficient: Var[b|X] ≤ Var[β̃|X] for any other linear unbiased estimator β̃ → Gauss-Markov theorem

4. ⇒ OLS is BLUE

Starting point: the sampling error

\[ b - \beta = \begin{pmatrix} b_1 - \beta_1 \\ \vdots \\ b_K - \beta_K \end{pmatrix} = [X'X]^{-1}X'y - \beta = [X'X]^{-1}X'[X\beta + \varepsilon] - \beta = [X'X]^{-1}X'\varepsilon \tag{45} \]

Defining A = [X′X]^{-1}X′, so that:

\[ A\varepsilon = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{K1} & \cdots & a_{Kn} \end{pmatrix}\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix} \tag{46} \]

where A is treated as constant (conditional on X).

Derive unbiasedness
Step 1:

\[ E(b - \beta|X) = E(A\varepsilon|X) = A\underbrace{E(\varepsilon|X)}_{=0 \text{ (assumption 1.2)}} = 0 \tag{47} \]

Step 2: "b is conditionally unbiased"

\[ E(b - \beta|X) = E(b|X) - E(\beta|X) = E(b|X) - \beta \overset{\text{Step 1}}{=} 0 \;\Leftrightarrow\; E(b|X) = \beta \tag{48} \]

Step 3: "OLS is unbiased"

\[ E_X[E(b|X)] \overset{\text{LTE}}{=} E(b) \overset{\text{Step 2}}{=} \beta \tag{49} \]


Derive the conditional variance

\[ Var(b|X) = Var(b-\beta|X) \overset{\text{sampling error}}{=} Var(A\varepsilon|X) = A\,Var(\varepsilon|X)\,A' = A\,E(\varepsilon\varepsilon'|X)\,A' \overset{1.4}{=} A\,\sigma^2 I_n\,A' = \sigma^2 AA' \tag{50} \]

Using [BA]′ = A′B′ and inserting for A:

\[ \sigma^2[X'X]^{-1}X'X[X'X]^{-1} = \sigma^2[X'X]^{-1} = Var(b|X) \tag{51} \]

And we have:

\[ Var(b|X) = \begin{pmatrix} Var(b_1|X) & Cov(b_1,b_2|X) & \cdots \\ Cov(b_2,b_1|X) & \ddots & \vdots \\ \cdots & \cdots & Var(b_K|X) \end{pmatrix} \tag{52} \]

Derive the Gauss-Markov theorem: Var(β̃|X) ≥ Var(b|X) for any linear unbiased estimator β̃. This means that the matrix difference Var(β̃|X) − Var(b|X) is positive semi-definite.

Insertion: A and B are matrices of the same size. A ≥ B if A − B is positive semi-definite. A K×K matrix C is psd if x′Cx ≥ 0 for all K×1 vectors x ≠ 0.

\[ a'[Var(\tilde{\beta}|X) - Var(b|X)]a \ge 0 \quad \forall a \ne 0 \tag{53} \]

If this is true for all a, it is true in particular for a = [1, 0, ..., 0]′, which leads to Var(β̃_1|X) ≥ Var(b_1|X). This works also for b_2, ..., b_K with the respective a. This means that each conditional variance of β̃_k is at least as large as the respective conditional variance of b_k.

The proof:
Note that b = Ay.
β̃ is linear in y and unbiased, and β̃ = Cy with C a function of X. Define D = C − A ⇔ C = D + A.

Step 1:

\[ \tilde{\beta} = (D + A)y = Dy + Ay = D(X\beta + \varepsilon) + b = DX\beta + D\varepsilon + b \tag{54} \]

Step 2:

\[ E(\tilde{\beta}|X) = DX\beta + E(D\varepsilon|X) + E(b|X) = DX\beta + D\underbrace{E(\varepsilon|X)}_{=0} + \beta = DX\beta + \beta \tag{55} \]


As we are working with an unbiased estimator: DX = 0.

Step 3: Going back to Step 1:

\[ \tilde{\beta} = D\varepsilon + b \tag{56} \]
\[ \underbrace{\tilde{\beta} - \beta}_{\text{sampling error of }\tilde{\beta}} = D\varepsilon + \underbrace{b - \beta}_{\text{sampling error of } b\,=\,A\varepsilon} = [D + A]\varepsilon \tag{57} \]

Step 4:

\[ Var(\tilde{\beta}|X) = Var(\tilde{\beta} - \beta|X) = Var[(D+A)\varepsilon|X] = [D+A]\,Var(\varepsilon|X)\,[D+A]' = \sigma^2[D+A][D+A]' \tag{58} \]
\[ = \sigma^2[DD' + AD' + DA' + AA'] \tag{59} \]

We have:
AA′ = [X′X]^{-1}X′X[X′X]^{-1} = [X′X]^{-1}
AD′ = [X′X]^{-1}X′D′ = [X′X]^{-1}[DX]′ = 0 (as DX = 0)
DA′ = D[[X′X]^{-1}X′]′ = DX[X′X]^{-1} = 0 (as DX = 0)

\[ Var(\tilde{\beta}|X) = \sigma^2[DD' + [X'X]^{-1}] \tag{60} \]

Step 5: It must therefore hold that

\[ a'[\sigma^2[DD' + [X'X]^{-1}] - \sigma^2[X'X]^{-1}]a \ge 0 \tag{61} \]
\[ a'[\sigma^2 DD']a \ge 0 \tag{62} \]

We have to show that a′DD′a ≥ 0 ∀a, i.e. that DD′ is positive semi-definite. Setting z = D′a, we get z′z = Σz_i² ≥ 0 ∀a, so the above is true.

The OLS estimator is BLUE

• OLS is linear: Holds under assumption 1.1

• OLS is unbiased: Holds under assumption 1.1-1.3

• OLS is the best estimator: Holds under the Gauss Markov theorem


OLS anatomy

• ŷ_i ≡ x_i′b; ŷ = Xb; e = y − ŷ

• P = X[X′X]^{-1}X′ "projection matrix"; use: ŷ = Py = Xb (lies in the space spanned by the columns of X); P is symmetric and idempotent (PP = P)

• M = I_n − P "residual maker"; use: e = My; M is symmetric and idempotent

• y = Py + My [X′e = 0 from the FOC] (e is orthogonal to the columns of X and to the space they span) → ŷ′e = 0

• e = Mε → e′e = Σe_i² = ε′Mε

• with a constant:

  – Σe_i = 0 [FOC]

  – ȳ = x̄′b (for x̄ the vector of regressor sample means)

  – (1/n)Σŷ_i = (1/n)Σy_i = ȳ
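A small Python check (my own illustration, with simulated data) of the OLS anatomy above: the projection matrix P and residual maker M are idempotent, X′e = 0, the fitted values are orthogonal to the residuals, and the residuals sum to zero when a constant is included.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # includes a constant
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix
M = np.eye(n) - P                      # residual maker
y_hat, e = P @ y, M @ y

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # both idempotent
print(np.allclose(X.T @ e, 0))                        # X'e = 0 (FOC)
print(np.allclose(y_hat @ e, 0))                      # fitted values orthogonal to residuals
print(np.isclose(e.sum(), 0))                         # sum of residuals = 0 (constant included)
```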

5 Hypothesis Testing under Normality

Economic theory provides hypotheses about parameters. E.g. the asset pricing example: R^e_t = α + βR^{em}_t + ε_t → hypothesis implied by the APT: α = 0. If the theory is right → testable implications. In order to test we need the distribution of b; hypotheses cannot be tested without distributional assumptions about ε.

In addition to 1.1-1.4 we assume that ε_1, ..., ε_n|X are normally distributed.
Distributional assumption (assumption 1.5): normality of the conditional distribution, ε|X ∼ MVN(0, σ²I_n) (can be dispensed with later).
→ ε_i|X ∼ N(0, σ²)

Useful tools/results:

• Fact 1: If x_i ∼ N(0,1), i = 1, ..., m, with the x_i independent → y = Σ_{i=1}^m x_i² ∼ χ²(m)
  If x ∼ N(0,1), y ∼ χ²(m) and x, y independent: t = x/√(y/m) ∼ t(m)

• Fact 2 omitted

• Fact 3: If x and y are independent, so are f(x) and g(y)

• Fact 4: Let x ∼ MVN(µ, Σ) with Σ nonsingular: (x − µ)′Σ^{-1}(x − µ) ∼ χ² (a random variable, with df equal to the dimension of x)

• Fact 5: W ∼ χ²(a), g ∼ χ²(b), W and g independent: (W/a)/(g/b) ∼ F(a, b)


• Fact 6: x ∼ MVN(µ, Σ) and y = c + Ax, with c, A a non-random vector/matrix → y ∼ MVN(c + Aµ, AΣA′)

We use the sampling error b − β = (X′X)^{-1}X′ε.

Assuming ε|X ∼ MVN(0, σ²I_n) from 1.5 and using Fact 6:

\[ b - \beta|X \sim MVN\big((X'X)^{-1}X'E(\varepsilon|X),\; (X'X)^{-1}X'\sigma^2 I_n X(X'X)^{-1}\big) \tag{64} \]
\[ b - \beta|X \sim MVN\big(0,\; \sigma^2(X'X)^{-1}\big) \tag{65} \]
\[ b - \beta|X \sim MVN\big(E(b-\beta|X),\; Var(b-\beta|X)\big) \tag{66} \]

Note that Var(b_k|X) = σ²((X′X)^{-1})_{kk}.

By the way: e|X ∼ MVN(0, σ²M) → show this using e = Mε and the fact that M is symmetric and idempotent.

Testing hypotheses about individual parameters (t-test)
Null hypothesis: H_0: β_k = β̄_k (a hypothesized value, a real number; β̄_k is often 0 and is suggested by theory).
Alternative hypothesis: H_A: β_k ≠ β̄_k

We control the type-1 error (rejecting when H_0 is true) by fixing a significance level. We then aim for high power (rejecting when H_0 is false), i.e. a low type-2 error (not rejecting when H_0 is false).

Construction of the test statistic:
By Fact 6: b_k − β_k ∼ N(0, σ²((X′X)^{-1})_{kk}), where ((X′X)^{-1})_{kk} is the k-th row, k-th column element of (X′X)^{-1}. The OLS estimator is conditionally normally distributed if ε|X is multivariate normal:

b_k|X ∼ N(β_k, σ²((X′X)^{-1})_{kk})

Under H_0: β_k = β̄_k, so b_k|X ∼ N(β̄_k, σ²((X′X)^{-1})_{kk}), i.e. if H_0 is true, E(b_k) = β̄_k:

\[ z_k = \frac{b_k - \bar{\beta}_k}{\sqrt{\sigma^2((X'X)^{-1})_{kk}}} \sim N(0,1) \tag{67} \]

• Standard normally distributed under the null hypothesis

• We don’t say anything about the distribution under the alternative hypothesis

• Distribution under H0 does not depend on X

• Value of the test statistic depends on X and on σ2 whereas σ2 is unknown


Call s² an unbiased estimate of σ²:

\[ t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{s^2((X'X)^{-1})_{kk}}} \sim t(n-K), \quad\text{or approximately } N(0,1) \text{ for } (n-K) > 30 \tag{68} \]

The nuisance parameter σ² can be estimated:

\[ \sigma^2 = E(\varepsilon_i^2|X) = Var(\varepsilon_i|X) = E(\varepsilon_i^2) = Var(\varepsilon_i) \tag{69} \]

We don't know ε_i, but we use the estimate e_i = y_i − x_i′b:

\[ \hat{\sigma}^2 = \widehat{Var}(e_i) = \frac{1}{n}\sum\Big(e_i - \frac{1}{n}\sum e_i\Big)^2 = \frac{1}{n}\sum e_i^2 = \frac{1}{n}e'e \tag{70} \]

σ̂² is a biased estimator:

\[ E(\hat{\sigma}^2|X) = \frac{n-K}{n}\sigma^2 \tag{71} \]

Proof: Hayashi p. 30-31. Use:

• Σe_i² = e′e = ε′Mε

• trace(A) = Σa_ii (sum of the diagonal elements of the matrix)

Note that for n → ∞, σ̂² is asymptotically unbiased, as

\[ \lim_{n\to\infty}\frac{n-K}{n} = 1 \tag{72} \]

An estimator whose bias and variance both vanish as n → ∞ is what we call a consistent estimator.

An unbiased estimator of σ² (see Hayashi p. 30-31):
For s² = (1/(n−K))Σe_i² = (1/(n−K))e′e we get an unbiased estimator:

\[ E(s^2|X) = \frac{1}{n-K}E(e'e|X) = \sigma^2 \tag{73} \]
\[ E(E(s^2|X)) = E(s^2) = \sigma^2 \tag{74} \]

Using this provides an unbiased estimator of Var(b|X) = σ²(X′X)^{-1}:

\[ \widehat{Var}(b|X) = s^2(X'X)^{-1} \tag{75} \]

t-statistic under H_0:

\[ t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{(\widehat{Var}(b|X))_{kk}}} = \frac{b_k - \bar{\beta}_k}{SE(b_k)} \sim t(n-K) \tag{76} \]


Hayashi p. 36-37 shows that with the unbiased estimator of σ² we have to replace the normal distribution of the test statistic by the t-distribution. Sketch of the proof:

\[ t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}\cdot\sqrt{\frac{\sigma^2}{s^2}} = \frac{z_k}{\sqrt{\frac{e'e/\sigma^2}{n-K}}} = \frac{z_k}{\sqrt{q/(n-K)}} \tag{77} \]

We need to show that q = e′e/σ² ∼ χ²(n−K) and that q and z_k are independent.

Decision rule for the t-test

1. State your null hypothesis: H_0: β_k = β̄_k (often β̄_k = 0); H_A: β_k ≠ β̄_k

2. Given β̄_k, the OLS estimate b_k and s², compute t_k = (b_k − β̄_k)/SE(b_k)

3. Fix the significance level α of the two-sided test

4. Fix the non-rejection and rejection regions

5. ⇒ Decision
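A minimal Python/SciPy sketch of this decision rule (my own illustration; the helper name t_test and the simulated data are assumptions). It computes b, s², SE(b_k), the t-statistic, the critical value and the two-sided p-value.

```python
import numpy as np
from scipy import stats

def t_test(X, y, k, beta_bar=0.0, alpha=0.05):
    """Two-sided t-test of H0: beta_k = beta_bar in y = X*beta + eps (assumptions 1.1-1.5)."""
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)        # OLS estimate
    e = y - X @ b                                # residuals
    s2 = e @ e / (n - K)                         # unbiased estimate of sigma^2
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[k, k])
    t_stat = (b[k] - beta_bar) / se
    crit = stats.t.ppf(1 - alpha / 2, df=n - K)  # critical value t_{alpha/2}(n-K)
    p_val = 2 * stats.t.sf(abs(t_stat), df=n - K)
    return b[k], se, t_stat, crit, p_val

# Example with simulated data: reject H0 if |t| > crit (equivalently p < alpha)
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=n)
print(t_test(X, y, k=1))
```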

Remark: √(σ²[(X′X)^{-1}]_{kk}) is the standard deviation of b_k|X; √(s²[(X′X)^{-1}]_{kk}) is the standard error of b_k|X.

We want a test that keeps its size and at the same time has a high power.

Find critical values such that:

\[ Prob[\underbrace{-t_{\alpha/2}(n-K)}_{\text{real number; lower critical value}} < \underbrace{t(n-K)}_{\text{r.v.; t-stat}} < \underbrace{t_{\alpha/2}(n-K)}_{\text{real number; upper critical value}}] = 1 - \alpha \]

Example: n−K = 30, α = 0.05:
t_{α/2}(n−K) = t_{.975}(30) = 2.042. The probability that the t-stat takes on a value of 2.042 or smaller is 97.5%.
−t_{α/2}(n−K) = t_{.025}(30) = −2.042. The probability that the t-stat takes on a value of −2.042 or smaller is 2.5%.

Conventional α’s: 0.01; 0.05; 0.1

• Use stars

• Use p-values: compute the quantile taking your t-stat as the x (Interpretation!)

• Use t-stat

• Report standard error

• Report confidence intervals (Interpretation!)


F-test/Wald test: modelling more complex ("linear") hypotheses = testing joint hypotheses.

Example 1: G/H model: ΔP_i = µ + cΔQ_i + z_0 Q_i + z_1 v_i Q_i + ε_i. Hypothesis: no informational content of trades. H_0: z_0 = 0 and z_1 = 0; H_A: z_0 ≠ 0 or z_1 ≠ 0.

Example 2: Mincer equation: ln(wage_i) = β_1 + β_2 S_i + β_3 tenure_i + β_4 exper_i + ε_i. Hypothesis: one more year of schooling has the same effect as one more year of tenure, AND experience has no effect. H_0: β_2 = β_3 and β_4 = 0; H_A: β_2 ≠ β_3 or β_4 ≠ 0.

For the F-test, write the null hypothesis as a system of linear equations:

\[ \underbrace{R}_{\#r\times K}\underbrace{\beta}_{K\times1} = \underbrace{r}_{\#r\times1}, \qquad \begin{pmatrix} R_{11} & \cdots & R_{1K} \\ \vdots & & \vdots \\ R_{\#1} & \cdots & R_{\#K} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} = \begin{pmatrix} r_1 \\ \vdots \\ r_{\#} \end{pmatrix} \tag{78} \]

with:
#r = number of restrictions
R = matrix of real numbers
r = vector of real numbers

Example 1:

\[ \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} \mu \\ c \\ z_0 \\ z_1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{79} \]

Example 2:

\[ \begin{pmatrix} 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{80} \]

Use Fact 6: Y=A*Z (Y ∼MVN(Aµ,AΣA′))

In our context: replace β = (β_1, ..., β_K)′ by the estimator b = (b_1, ..., b_K)′ and consider Rb (b is conditionally normally distributed under 1.1-1.5); under H_0, Rβ = r.

For the Wald test you only need the unrestricted parameter estimates. For the other two principles (see below), you need both the restricted and the unrestricted parameter estimates.


The Wald test keeps its size and gives the highest power. The restrictions are fixed by H_0.

Definition of the F-test statistic:
E(Rb|X) = R·E(b|X) = Rβ (under 1.1-1.3)
Var(Rb|X) = R·Var(b|X)·R′ = Rσ²(X′X)^{-1}R′ (under 1.1-1.4)
Rb|X ∼ MVN(Rβ, Rσ²(X′X)^{-1}R′)

Using Fact 4 to construct the test for H_0: Rβ = r against H_A: Rβ ≠ r.

Under the null hypothesis:
E(Rb|X) = r (the hypothesized value)
Var(Rb|X) = Rσ²(X′X)^{-1}R′ (unaffected by the hypothesis)
Rb|X ∼ MVN(r, Rσ²(X′X)^{-1}R′) (this is where the hypothesis goes in)

Problem: Rb is a random vector whose distribution depends on X. ⇒ We want a random variable whose distribution under H_0 does not depend on X. With Fact 4: (Rb − E(Rb|X))′(Var(Rb|X))^{-1}(Rb − E(Rb|X)). This leads to the Wald statistic:

\[ (Rb - r)'[R\sigma^2(X'X)^{-1}R']^{-1}(Rb - r) \sim \chi^2(\#r) \tag{81} \]

The smallest possible value is zero (when the restrictions are met exactly and the parentheses are zero). Thus we always have a one-sided test here.

Replace σ² by its unbiased estimate s² = (1/(n−K))Σe_i² = (1/(n−K))e′e and divide by #r (to find the distribution using Fact 5):

\[ F = (Rb - r)'[R\,s^2(X'X)^{-1}R']^{-1}(Rb - r)/\#r \sim F(\#r, n-K) \tag{82} \]
\[ F = \frac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)/\#r}{(e'e)/(n-K)} \tag{83} \]
\[ = (Rb - r)'[R\,\widehat{Var}(b|X)\,R']^{-1}(Rb - r)/\#r \sim F(\#r, n-K) \tag{84} \]

The F-test is one-sided.

For applied work: as n → ∞, s² (and σ̂²) get close to σ², so we can use the χ² approximation.

Decision rule for the Wald-test

1. Specify H0 in the form Rβ = r and HA : Rβ 6= r

2. Calculate F-statistic


3. Look up entry in the table of the F-distribution for #r and n-K at given significance level

4. Null is not rejected on the significance level α for F less than Fα(#r, n−K)
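The decision rule can be sketched in a few lines of Python/SciPy (my own illustration; the function name wald_f_test and the simulated example are assumptions). It implements equation (84) and Example 2 above.

```python
import numpy as np
from scipy import stats

def wald_f_test(X, y, R, r):
    """F-statistic for H0: R*beta = r, i.e. (Rb-r)'[R Varhat(b|X) R']^{-1}(Rb-r)/#r (eq. 84)."""
    n, K = X.shape
    num_r = R.shape[0]                               # number of restrictions #r
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (n - K)
    var_b = s2 * np.linalg.inv(X.T @ X)              # Varhat(b|X) = s^2 (X'X)^{-1}
    d = R @ b - r
    F = d @ np.linalg.solve(R @ var_b @ R.T, d) / num_r
    p_val = stats.f.sf(F, num_r, n - K)              # one-sided test
    return F, p_val

# Example 2: H0: beta_2 = beta_3 and beta_4 = 0
rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.8, 0.8, 0.0]) + rng.normal(size=n)
R = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(wald_f_test(X, y, R, np.zeros(2)))   # large p-value: H0 is not rejected
```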

Alternative representation of the Wald/F-statistic:
Minimization of the unrestricted sum of squared residuals: min Σ(y_i − x_i′b)² → SSR_U
Minimization of the restricted sum of squared residuals (constraints imposed): min Σ(y_i − x_i′b)² → SSR_R
SSR_U ≤ SSR_R (always!)

\[ F\text{-ratio:}\quad F = \frac{(SSR_R - SSR_U)/\#r}{SSR_U/(n-K)} \tag{85} \]

This F-ratio is equivalent to the Wald ratio. Problem: for the F-ratio we need two estimates of b.

This is also known as the "likelihood-ratio principle", in contrast to the Wald principle.

(Third principle: using only the restricted estimates, the "Lagrange multiplier principle".)

6 Confidence intervals

Duality of t-test and confidence interval
Probability of non-rejection for the t-test:

\[ P(\underbrace{-t_{\alpha/2}(n-K) \le t_k \le t_{\alpha/2}(n-K)}_{\text{do not reject if this event occurs}}) = \underbrace{1 - \alpha}_{\text{if } H_0 \text{ is true}} \tag{86} \]

(t_k is a random variable.) In (1−α)·100% of our samples t_k lies inside the critical values.

Rewrite (86):

\[ P(b_k - SE(b_k)t_{\alpha/2}(n-K) \le \beta_k \le b_k + SE(b_k)t_{\alpha/2}(n-K)) = 1 - \alpha \tag{87} \]

(b_k and SE(b_k) are random variables.)

The confidence interval
Confidence interval for β_k (if H_0 is true):

\[ P(b_k - SE(b_k)t_{\alpha/2}(n-K) \le \beta_k \le b_k + SE(b_k)t_{\alpha/2}(n-K)) = 1 - \alpha \tag{88} \]

The values inside the bounds of the confidence interval are the values for which you cannot reject the null hypothesis. In (1−α)·100% of our samples the bounds would enclose β_k.


The confidence bounds are random variables!
b_k − SE(b_k)t_{α/2}(n−K): lower bound of the 1−α confidence interval.
b_k + SE(b_k)t_{α/2}(n−K): upper bound of the 1−α confidence interval.

Wrong interpretation: the true parameter β_k lies with probability 1−α within the bounds of the confidence interval. Problem: the confidence bounds are not fixed; they are random!

H_0 is rejected at significance level α if the hypothesized value does not lie within the bounds of the 1−α confidence interval.

[Figure]

Count the number of times that β_k is inside the confidence interval. Over many samples, this relative frequency approaches the confidence level 1−α. (β_k not being inside the confidence interval is equivalent to the event that we reject H_0 using the t-statistic.)

This works the same way if β_k ≠ β̄_k.

On the correct interpretation of confidence intervals. Given:

• b

• s² and SE(b_k)

• t_{α/2}(n−K)

⇒ compute (b_k − β̄_k)/SE(b_k) (reject if it lies outside the critical values).
Rephrasing: for which values β̄_k do we (not) reject?
b_k − SE(b_k)·t_{α/2}(n−K) < β̄_k < b_k + SE(b_k)·t_{α/2}(n−K): do not reject
β̄_k < b_k − SE(b_k)·t_{α/2}(n−K) or β̄_k > b_k + SE(b_k)·t_{α/2}(n−K): reject

We want small confidence intervals, because they allow us to reject hypotheses (narrow range = high power). We achieve this by increasing n (decreasing the standard error) or by increasing α.

6.1 Simulation

Frequentists work with the idea that we can conduct experiments over and over again.

To avoid type-2 errors we need smaller standard errors (a narrower distribution), and thus we have to increase the sample size.

We run a simulation with multiple draws. To demonstrate unbiasedness, we compute the mean over all draws and compare it to the true parameters. To get closer, we have to increase the number of draws.

→ Increasing n leads to a smaller standard error.
→ Increasing the number of draws leads to more precise simulation estimates, but the standard error doesn't decrease.


You can use a confidence interval to determine the power of a test: determine in what fraction of the samples a wrong (hypothesized) parameter value lies outside the confidence interval; this is the power of the test (the true parameter should lie inside in 1−α of the samples). To increase the power, increase the sample size.

The closer the hypothesized value and the true value are, the lower the power of the test.
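The following Python sketch (my own illustration; the DGP and the "wrong" value 0.8 are assumptions) runs exactly this kind of Monte Carlo: it records how often the 1−α interval covers the true β_k and how often it excludes a wrong hypothesized value (the power).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_draws, alpha = 50, 2000, 0.05
beta = np.array([1.0, 0.5])                  # assumed "true" parameters of the DGP
cover_true, reject_wrong = 0, 0

for _ in range(n_draws):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (n - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    lo, hi = b[1] - se * t_crit, b[1] + se * t_crit
    cover_true += (lo <= beta[1] <= hi)      # should happen in ~(1-alpha) of the samples
    reject_wrong += not (lo <= 0.8 <= hi)    # power against the wrong value 0.8

print("coverage:", cover_true / n_draws, "power vs 0.8:", reject_wrong / n_draws)
```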

7 Goodness-of-fit measures

Goodness-of-fit measures are needed when choosing between conflicting equations (theories).

Uncentered R²: R²_uc

"Variability" of y_i: Σy_i² = y′y

Decomposition: y′y = (Xb + e)′(Xb + e) = (ŷ + e)′(ŷ + e) = ŷ′ŷ + 2e′ŷ + e′e = ŷ′ŷ (explained variation) + e′e (unexplained variation, SSR)

(e′ŷ = (ŷ′e)′ = ((Xb)′e)′ = (b′X′e)′ = 0, as X′e = 0)

[Figure]

R²_uc = ŷ′ŷ/(y′y) (share of explained variation) = 1 − e′e/(y′y) = 1 − Σe_i²/Σy_i² (one minus the share of unexplained variation)

A good model explains much, and therefore the residual variation is very small compared to the explained variation.

Centered R²: R²_c

Use the centered R² if there is a constant in the model (x_{i1} = 1).

Variance of y_i: (1/n)Σ(y_i − ȳ)²

Decomposition: Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² (variation of the predicted values) + Σe_i² (SSR)

Proof:
Σ(y_i − ȳ)² = Σ(ŷ_i + e_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σe_i² + 2Σ(ŷ_i − ȳ)e_i
= Σ(ŷ_i − ȳ)² + Σe_i² + 2b′X′e − 2ȳΣe_i = Σ(ŷ_i − ȳ)² + Σe_i² (since X′e = 0 and Σe_i = 0)

\[ R^2_c = \frac{\frac{1}{n}\sum(\hat{y}_i - \bar{y})^2}{\frac{1}{n}\sum(y_i - \bar{y})^2} = 1 - \frac{\frac{1}{n}\sum e_i^2}{\frac{1}{n}\sum(y_i - \bar{y})^2} \]

Both the uncentered and the centered R² lie in the interval [0,1], but they describe different models and are not comparable. The centered R² of a model with only a constant is 0, whereas the uncentered R² is not 0 as long as the constant is ≠ 0 (but then we explain only the level of y and not its variation). Without a constant the centered R² cannot be compared, because Σe_i = 0 only holds with a constant.

• R²: coefficient of determination; = [corr(y, ŷ)]²

• Explanatory power of the regression beyond the constant

• The test that β_2 = ... = β_K = 0 is the same as the test that R² = 0 (R² is a random variable/estimate)

• → Overall F-test:
  SSR_R: y_i = β_0 + ε_i → SSR_R = Σ(y_i − ȳ)²
  SSR_UR: y_i = β_0 + β_1 x_1 + ... + β_k x_k + ε_i → SSR_UR = Σe_i²
  → F = [(SSR_R − SSR_UR)/k] / [SSR_UR/(n − k − 1)]

Trick to increase R²: add more regressors (k ↑).

Model comparison:

• Conduct an F-test between the parsimonious and the extended model

• Use model selection criteria, which give a penalty for parameterization

Selection criterion:

\[ R^2_{adj} = 1 - \frac{SSR/(n-k)}{SST/(n-1)} = 1 - \underbrace{\frac{n-1}{n-k}}_{>1 \text{ for } k>1}\cdot\frac{SSR}{SST} \tag{89} \]

(can become negative)

Alternative model selection criteria
Implement "Occam's razor" (the simpler explanation, i.e. the parsimonious model, tends to be the right one):

Akaike criterion (from information theory): log[SSR/n] + 2k/n (minimize; can be negative)
Schwarz criterion (from Bayesian theory): log[SSR/n] + k·log(n)/n (minimize; can be negative)
Schwarz tends to penalize more heavily for larger n (= favors parsimonious models).
→ No single criterion is preferred (all three are generally accepted), but one has to argue why a particular one is used.
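A compact Python sketch (my own illustration; the helper name fit_stats and the simulated data are assumptions) computing the centered R², adjusted R², Akaike and Schwarz criteria exactly as defined above, and comparing a parsimonious with an extended model.

```python
import numpy as np

def fit_stats(X, y):
    """Centered R^2, adjusted R^2, Akaike and Schwarz criteria as defined in the notes."""
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    ssr = e @ e
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ssr / sst
    r2_adj = 1 - (ssr / (n - k)) / (sst / (n - 1))
    aic = np.log(ssr / n) + 2 * k / n
    sic = np.log(ssr / n) + k * np.log(n) / n
    return r2, r2_adj, aic, sic

rng = np.random.default_rng(6)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)          # x2 is irrelevant in the assumed DGP
small = np.column_stack([np.ones(n), x1])
big = np.column_stack([np.ones(n), x1, x2])
print(fit_stats(small, y))
print(fit_stats(big, y))   # R^2 rises mechanically; adjusted R^2 / AIC / SIC penalize k
```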

8 Introduction to large sample theory

Basic concepts of large sample theory

Using large sample theory we can dispense with basic assumptions of finite sample theory:
1.2 E(ε_i|X) = 0: strict exogeneity
1.4 Var(ε|X) = σ²I_n: homoscedasticity
1.5 ε|X ∼ N(0, σ²I_n): normality of the error term

Approximate/asymptotic distributions of b and of the t- and F-statistics can be obtained.

Contents:
A. Stochastic convergence
B. Law of large numbers
C. Central limit theorems
D. Useful lemmas

A-D: pillars and building blocks of modern applied econometrics → CAN (consistent, asymptotically normal) property of OLS (and other estimators).

8.1 A. Modes of stochastic convergence

First: non-stochastic convergence. {c_n} = (c_1, c_2, ...) = sequence of real numbers.

Q: Can you find an n(ε) such that |c_n − c| < ε for all n ≥ n(ε), for every ε > 0?
A: Yes? Then {c_n} converges to c.
Other notations: lim_{n→∞} c_n = c; c_n → c; c = lim c_n

Examples:
c_n = 1 − 1/n, so that c_n → 1
c_n = exp(−n), so that c_n → 0
c_n = (1 + a/n)^n, so that c_n → exp(a)

Second: stochastic convergence. {z_n} = (z_1, z_2, ...) with z_1, z_2, ... being random variables.

Illustration: all z_n are iid, e.g. N(0,1). They can be described using their distribution function.

Example 1: almost sure convergence (→a.s.):

{u_n}: sequence of random variables (u_i iid with E(u_i) = µ < ∞, Var(u_i) < ∞)

z_n = (1/n)Σu_i, so that {z_n} = {u_1, ½(u_1 + u_2), ..., (1/100)(u_1 + ... + u_100), ...}

[Figure]

For every possible realization (a sequence of real numbers) of {z_n}, the limit is µ.

Mathematically: lim_{n→∞} z_n = µ almost surely ("almost sure convergence"), also written z_n →a.s. µ.

a.s. is the strongest mode of stochastic convergence.

More formally:
Q: Can you find an n(ε, ω) such that |z_n − µ| < ε for all n > n(ε, ω), for all ε > 0 and for all ω?
A: Yes? Then z_n →a.s. µ


Note: the sample mean sequence {z_n} is a special case of →a.s. convergence. Generally: P(lim_{n→∞} z_n = α) = 1; z_n →a.s. α; "z_n converges almost surely to α".

Example 2: convergence in probability (→p):

[Figure] n = 5: 1/3 of the realizations within the limits; n = 10: 2/3; n = 20: 3/3.

When the number of replications → ∞, the relative frequency of realizations within the limits (over the number of replications) converges to a probability.

Formalization:
Q: Does lim_{n→∞} P[|z_n − α| < ε] = 1 hold for all ε > 0?
A: Yes? Then z_n →p α; plim z_n = α; "z_n converges in probability to α".
If you define P[|z_n − α| < ε] as a sequence of probabilities p_n, you can express this as p_n → 1.

A different illustration (reaching towards consistency): [Figure]

Example 3: convergence in mean square (→m.s., similar to →p):

lim_{n→∞} E[(z_n − α)²] = 0. →m.s. implies →p (m.s. is the stronger concept).

It implies that lim_{n→∞} E(z_n − α) = 0 and lim_{n→∞} Var(z_n − α) = 0.

Additional remarks on the modes of convergence:

• α (the limit) can also be a random variable (not so important for us), so that z_n →p z, with z a r.v. (also for a.s., m.s.)

• Extends to vector/matrix sequences of random variables: Z_{n1} →p α_1, Z_{n2} →p α_2, ...
  Notation: Z_n →p α (element-wise convergence). This is used later to write: b_n →p β

Convergence in distribution (→d)

Illustration: [Figure] f_z: limiting distribution of z_n

Definition: Denote by F_n the cdf of z_n: P(z_n ≤ a) = F_n(a). {z_n} converges in distribution to a r.v. z if the cdf F_n of z_n converges to the cdf of z at each point (of continuity) of F_z.

z_n →d z; F_z: limiting distribution of z_n

If the distribution of z is known, we write e.g. z_n →d N(0, 1).


Extension to a sequence of random vectors:

\[ \underbrace{z_n}_{(k\times1)} \to_d \underbrace{z}_{(k\times1)}; \quad z_n = \begin{pmatrix} z_{n1} \\ \vdots \\ z_{nk} \end{pmatrix} \text{ and } z = \begin{pmatrix} z_1 \\ \vdots \\ z_k \end{pmatrix} \tag{90} \]

F_n of z_n converges at each point to the cdf of z:
cdf of z: F_z(a) = P[z_1 ≤ a_1, ..., z_k ≤ a_k]
cdf of z_n: F_{z_n}(a) = P[z_{n1} ≤ a_1, ..., z_{nk} ≤ a_k]

8.2 B. Law of Large Numbers (LLN)

Weak law of large numbers (Khinchin):
{z_i} is a sequence of iid random variables with E(z_i) = µ < ∞ → compute (1/n)Σz_i = z̄_n. Then:

lim_{n→∞} P[|z̄_n − µ| > ε] = 0 (ε > 0), or plim z̄_n = µ; z̄_n →p µ

Extensions: The WLLN also holds for

1. Multivariate extension (sequence of random vectors {Z_i}):

\[ \underbrace{\begin{pmatrix} z_{11} \\ \vdots \\ z_{1k} \end{pmatrix}}_{i=1},\; \underbrace{\begin{pmatrix} z_{21} \\ \vdots \\ z_{2k} \end{pmatrix}}_{i=2},\; \ldots \tag{91} \]

E(Z_i) = µ < ∞, µ being the vector of expectations of the rows, [E(z_1), ..., E(z_k)]′.

Compute sample means for each element of the sequence, "over the rows":

\[ \frac{1}{n}\sum_{i=1}^n Z_i = \Big[\frac{1}{n}\sum z_{1i}, \ldots, \frac{1}{n}\sum z_{ki}\Big]' \tag{92} \]

Element-wise convergence: (1/n)ΣZ_i →p µ

2. Relaxed independence:
Relax iid, allow for dependence in {Z_i} through Cov(Z_i, Z_{i−j}) ≠ 0, j ≠ 0 (especially important in time series; draws still come from the same distribution) → ergodic theorem.

3. Functions of z_i:

\[ \underbrace{h(z_i)}_{\text{measurable function}}, \quad\text{e.g. the new sequence } \{\ln(z_i)\} \tag{93} \]


If {z_i} allows application of the LLN and E(h(z_i)) < ∞, we have: (1/n)Σh(z_i) →p E(h(z_i)) (the LLN can also be applied to the new sequence).

Application:
{z_i} iid, E(z_i) < ∞
h(z_i) = (z_i − µ)²
E(h(z_i)) = Var(z_i) = σ²
(1/n)Σ(z_i − µ)² →p Var(z_i)

4. Vector-valued functions f(Z_i): one or many arguments and one or many returns.

{Z_i} → {f(Z_i)} (with f(Z_i) = f(z_{i1}, ..., z_{ik})). If {Z_i} allows application of the LLN and E(f(Z_i)) < ∞:

\[ \frac{1}{n}\sum f(Z_i) \to_p E(f(Z_i)) \tag{94} \]

Example:
{Z_i} = {(z_{i1}, z_{i2})} vector sequence
f(Z_i) = f(z_{i1}, z_{i2}) = (z_{i1}z_{i2}, z_{i1}, z_{i2})
Apply the above result (WLLN).

8.3 C. Central Limit Theorems (CLTs)

Univariate CLT:
If {z_i} is iid, E(z_i) = µ < ∞, Var(z_i) = σ² < ∞ and the WLLN works, then:

√n(z̄_n − µ) →d N(0, σ²)

(Without the √n, the variance and the mean would go to zero as n goes to infinity; this is also related to "square-root-n consistent" estimators.)

The CLT implies the LLN.

Other ways of writing this: √n(z̄_n − µ) ∼a N(0, σ²).
If y_n = √n(z̄_n − µ), then z̄_n = y_n/√n + µ and thus z̄_n ∼a N(µ, σ²/n).
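A tiny Python simulation (my own illustration; the choice of an exponential distribution is an assumption) showing the univariate CLT at work: the standardized sample mean of a strongly skewed distribution behaves approximately like N(0, σ²).

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 5000
mu, sigma2 = 1.0, 1.0                      # mean and variance of an Exp(1) draw

# z_i drawn from a skewed exponential distribution, far from normal
z_bar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (z_bar - mu)           # sqrt(n)(z_bar - mu)

print("mean:", stat.mean(), "variance:", stat.var())                   # approx. 0 and sigma^2
print("P(stat <= 1.645):", (stat <= 1.645 * np.sqrt(sigma2)).mean())   # approx. 0.95 under N(0, sigma^2)
```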

Multivariate CLT:
{Z_i} is iid and can be drawn from ANY distribution, but the n needed for convergence may differ.

\[ \begin{pmatrix} z_{11} & z_{21} & \cdots & z_{n1} \\ \vdots & \vdots & & \vdots \\ z_{1k} & z_{2k} & \cdots & z_{nk} \end{pmatrix} = (Z_1, Z_2, \ldots, Z_n) \tag{95} \]

E(Z_1) = E(Z_2) = ... = µ = [E(z_1), ..., E(z_k)]′ and

\[ Var(Z_i) = \begin{pmatrix} Var(z_{i1}) & \cdots & \\ \vdots & \ddots & \vdots \\ Cov(z_{i1}, z_{ik}) & \cdots & Var(z_{ik}) \end{pmatrix} = \Sigma < \infty \tag{96} \]

and the WLLN works. Then: √n(z̄_n − µ) →d MVN(0, Σ)

Other ways of writing this: √n(z̄_n − µ) ∼a MVN(0, Σ), or z̄_n ∼a MVN(µ, Σ/n).
iid can be relaxed, but not as far as with the WLLN.

8.4 D. Useful lemmas of large sample theory

Lemmas 1 & 2: "continuous mapping theorem" (the probability limit →p and →a.s. pass through continuous functions)

Lemma 1:
z_n can be a scalar, vector or matrix sequence. If z_n →p α and a(·) is a continuous function which does not depend on n, then:
a(z_n) →p a(α), or plim_{n→∞} a(z_n) = a(plim_{n→∞} z_n)

Examples:
x_n →p α ⇒ ln(x_n) →p ln(α)
x_n →p β and y_n →p γ ⇒ x_n + y_n →p β + γ

Lemma 2:
z_n →d z, then: a(z_n) →d a(z) (for z: different degrees of difficulty to find the limiting distribution)

Example:
z_n →d z ∼ N(0,1) ⇒ z_n² →d χ²(1) (as [N(0,1)]² = χ²(1))

Lemma 3:
x_n (m×1) →d x and y_n →p α (y_n treated as non-stochastic in the limit, α a vector of real numbers), then:
x_n + y_n →d x + α

Example:
x_n →d N(0,1), y_n →p α ⇒ x_n + y_n →d N(α, 1)


Lemmas 4 & 5 (& 6): "Slutsky's theorem"

Lemma 4:
x_n →d x and y_n →p 0, then: x_n·y_n →p 0

Lemma 5:
x_n (m×1) →d x and A_n (k×m) →p A, then:
A_n x_n →d A x (A treated as non-random, x as a random vector)

Example:
x_n →d MVN(0, Σ), A_n →p A ⇒ A_n x_n →d MVN(0, AΣA′)

Lemma 6:
x_n →d x and A_n →p A, then:
x_n′A_n^{-1}x_n →d x′A^{-1}x (x is random and A is non-random)

Example:
x_n →d x ∼ MVN(0, A) and A_n →p A ⇒ x_n′A_n^{-1}x_n →d x′A^{-1}x ∼ χ²(rows(x))

8.5 Large sample properties for the OLS estimator

(2.1) maintained: Linearity: y_i = x_i′β + ε_i, ∀i = 1, 2, ..., n
(2.2) Assumptions regarding the dependence of {y_i, x_i} (we drop iid)
(2.3) Replacing strict exogeneity: orthogonality/predetermined regressors: E(x_{ik}ε_i) = 0. If x_{i1} = 1 ⇒ E(ε_i) = 0 ⇒ Cov(x_{ik}, ε_i) = 0 ∀i, k. If violated: endogeneity
(2.4) Rank condition: E(x_i x_i′) = Σ_xx (K×K) is non-singular.

Remarks on the assumptions:
{y_i, x_i} allows application of a WLLN (so the DGP is not necessarily iid).
We assume {y_i, x_i} are (jointly) stationary and ergodic (identically distributed but not independent).

\[ \{x_i\varepsilon_i\} = \left\{\begin{pmatrix} x_{i1}\varepsilon_i \\ \vdots \\ x_{ik}\varepsilon_i \end{pmatrix}\right\} = \{g_i\} \tag{97} \]

allows application of a central limit theorem.
{g_i} is a martingale difference sequence (m.d.s.): E(g_i|g_{i−1}, ...) = 0 (g_i is uncorrelated with its past = zero autocorrelation).

Assuming iid {y_i, x_i, ε_i} covers this (iid is a special case), but it is not necessary (and not useful in time series).


Overview
We get for b = (X′X)^{-1}X′y:

\[ b_n = \Big[\frac{1}{n}\sum x_i x_i'\Big]^{-1}\frac{1}{n}\sum x_i y_i \tag{98} \]

WLLN and continuous mapping & Slutsky theorems: b_n →p β (consistency = asymptotically unbiased)

√n(b_n − β) →d MVN(0, Avar(b)), or b ∼a MVN(β, Avar(b)/n) (approximate distribution)

⇒ b_n is consistent and asymptotically normal (CAN). (We lose unbiasedness and efficiency here.)

We show that b_n = (X′X)^{-1}X′y is consistent:
b_n = [(1/n)Σx_i x_i′]^{-1}(1/n)Σx_i y_i
→ b_n − β (sampling error) = [(1/n)Σx_i x_i′]^{-1}(1/n)Σx_i ε_i

We show: b_n →p β.

When the sequence {y_i, x_i} allows application of the WLLN:

\[ \frac{1}{n}\sum x_i x_i' \to_p E(x_i x_i') = \begin{pmatrix} E(x_{i1}^2) & \cdots & E(x_{i1}x_{ik}) \\ \vdots & & \vdots \\ E(x_{i1}x_{ik}) & \cdots & E(x_{ik}^2) \end{pmatrix} \tag{99} \]

So by Lemma 1: [(1/n)Σx_i x_i′]^{-1} →p [E(x_i x_i′)]^{-1}

⇒ (1/n)Σx_i ε_i →p E(x_i ε_i) = 0 (we assume predetermined regressors)

Lemma 1 implies:
b_n − β = [(1/n)Σx_i x_i′]^{-1}(1/n)Σx_i ε_i →p E(x_i x_i′)^{-1}E(x_i ε_i) = E(x_i x_i′)^{-1}·0 = 0
⇒ b_n = (X′X)^{-1}X′y is consistent.

If the assumption E(x_i ε_i) = 0 does not hold, we have an inconsistent estimator (= the death of an estimator).

We show that b_n = (X′X)^{-1}X′y is asymptotically normal:
The sequence {g_i} = {x_i ε_i} allows applying a CLT for (1/n)Σx_i ε_i = ḡ:

√n(ḡ − E(g_i)) →d MVN(0, E(g_i g_i′))

\[ \sqrt{n}(b_n - \beta) = \underbrace{\Big[\frac{1}{n}\sum x_i x_i'\Big]^{-1}}_{A_n}\;\underbrace{\sqrt{n}\,\bar{g}}_{x_n} \]

(starting point: the sampling error expression times √n)

Applying Lemma 5:
A_n = [(1/n)Σx_i x_i′]^{-1} →p A = Σ_xx^{-1}
x_n = √n ḡ →d x ∼ MVN(0, E(g_i g_i′))

\[ \Rightarrow \sqrt{n}(b_n - \beta) \to_d MVN\big(0,\; \underbrace{\Sigma_{xx}^{-1}E(g_i g_i')\Sigma_{xx}^{-1}}_{Avar(b)}\big) \]

⇒ Practical use: b_n ∼a MVN(β, Avar(b)/n) ⇒ b_n is CAN

In detail: CLT applied to {x_i ε_i}:
√n·(1/n)Σx_i ε_i = √n ḡ

CLT: √n(ḡ − E(g_i)) →d MVN(0, E[(g_i − E(g_i))(g_i − E(g_i))′]), the variance being Var(g_i).

With predetermined x_i: E(x_i ε_i) = 0 → E(g_i) = 0 ⇒ Var(g_i) = E(g_i g_i′) = E(ε_i² x_i x_i′)

How to estimate Avar(b) = Σ_xx^{-1} E(g_i g_i′) Σ_xx^{-1}:

1. Σ_xx = E(x_i x_i′) → S_xx = (1/n)Σx_i x_i′ →p E(x_i x_i′) = Σ_xx (also for the inverse, using Lemma 1)

2. E(g_i g_i′) → Ŝ = ? →p E(g_i g_i′) = E(ε_i² x_i x_i′)

→ Consistent estimate for Avar(b): Âvar(b) = S_xx^{-1} Ŝ S_xx^{-1} →p Σ_xx^{-1} E(g_i g_i′) Σ_xx^{-1}

So how do we estimate S = E(ε_i² x_i x_i′)?

Recall: if Z_i (a vector) allows application of the LLN → (1/n)Σf(Z_i) →p E(f(Z_i)).
Our Z_i = [ε_i, x_i′]′:
→ (1/n)Σε_i² x_i x_i′ →p E(ε_i² x_i x_i′)

Problem: ε_i is unknown → we use e_i = y_i − x_i′b:
→ Ŝ = (1/n)Σe_i² x_i x_i′ →p E(ε_i² x_i x_i′) = E(g_i g_i′)

Hayashi p. 123 shows that Ŝ is a consistent estimator.

Result:

\[ \widehat{Avar}(b) = \Big[\frac{1}{n}\sum x_i x_i'\Big]^{-1}\frac{1}{n}\sum e_i^2 x_i x_i'\Big[\frac{1}{n}\sum x_i x_i'\Big]^{-1} \to_p E(x_i x_i')^{-1}E(g_i g_i')E(x_i x_i')^{-1} = Avar(b) \]

→ Avar(b) is estimated without assumption 1.4 (homoscedasticity).

Developing a test statistic under the assumption of conditional homoscedasticity:
Assumption: E(ε_i²|x_i) = σ² (> 0 and < ∞) (we do not have to condition on all x's, only on x_i)
→ E[E(ε_i²|x_i)] = σ² = E(ε_i²) = Var(ε_i)

Âvar(b) = [(1/n)Σx_i x_i′]^{-1} σ̂² (1/n)Σx_i x_i′ [(1/n)Σx_i x_i′]^{-1} = σ̂²[(1/n)Σx_i x_i′]^{-1}

with Ŝ = σ̂²·(1/n)Σx_i x_i′ and σ̂² = (1/n)Σe_i²; Ŝ is still a consistent estimate if E(ε_i²|x_i) = σ².

This is the asymptotic variance-covariance matrix Avar(b). The variance of b, V̂ar(b) = Âvar(b)/n = σ̂²[Σx_i x_i′]^{-1} = σ̂²(X′X)^{-1}, is the same as before.

⇒ When using the above assumption we get the same results as in finite sample theory, but we also have the general result for Avar(b) without this assumption:

1. If you assume conditional homoscedasticity (cf. 1.4):
Âvar(b) = σ̂²[(1/n)Σx_i x_i′]^{-1}
We estimate Var(b): Âvar(b)/n = σ̂²[(1/n)Σx_i x_i′]^{-1}/n = σ̂²(X′X)^{-1}
σ̂² = (1/n)Σe_i² (consistent but biased)
Instead of σ̂² we can use s² = (1/(n−K))Σe_i² →p σ² (consistent and unbiased)

2. If you don't assume homoscedasticity:
Âvar(b) = [(1/n)Σx_i x_i′]^{-1}[(1/n)Σe_i² x_i x_i′][(1/n)Σx_i x_i′]^{-1}

In EViews: the heteroscedasticity-consistent variance-covariance matrix is used to compute t-stats, standard errors and F-tests which are robust (= we don't need the homoscedasticity assumption).

3. Wrongly assuming conditional homoscedasticity: t-stats, standard errors and F-tests are wrong (more precisely: inconsistent).

White standard errors: adjusting the test statistics to make them robust against violations of conditional homoscedasticity.

t-ratio:

\[ t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{\Big[\tfrac{1}{n}\big[\tfrac{1}{n}\sum x_i x_i'\big]^{-1}\tfrac{1}{n}\sum e_i^2 x_i x_i'\big[\tfrac{1}{n}\sum x_i x_i'\big]^{-1}\Big]_{kk}}} \sim_a N(0,1) \tag{100} \]

Holds under H_0: β_k = β̄_k.

F-ratio (Wald):

W = (Rb − r)′[R (Âvar(b)/n) R′]^{-1}(Rb − r) ∼a χ²(#r)

Holds under H_0: Rβ − r = 0; allows for linear restrictions on β.
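A short Python sketch (my own illustration; the function name and the heteroscedastic DGP are assumptions) computing both conventional and White (HC0-type) standard errors via the robust Âvar(b) formula above.

```python
import numpy as np

def ols_with_white_se(X, y):
    """OLS with conventional and heteroscedasticity-robust (White) standard errors."""
    n, K = X.shape
    Sxx_inv = np.linalg.inv(X.T @ X / n)
    b = Sxx_inv @ (X.T @ y / n)                             # b = (X'X)^{-1}X'y
    e = y - X @ b
    s2 = e @ e / (n - K)
    var_conv = s2 * np.linalg.inv(X.T @ X)                  # assumes conditional homoscedasticity
    S_hat = (X * (e ** 2)[:, None]).T @ X / n               # (1/n) sum e_i^2 x_i x_i'
    var_white = (Sxx_inv @ S_hat @ Sxx_inv) / n             # robust Avarhat(b)/n
    return b, np.sqrt(np.diag(var_conv)), np.sqrt(np.diag(var_white))

# Heteroscedastic example: the error variance depends on the regressor
rng = np.random.default_rng(8)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + np.abs(x) * rng.normal(size=n)
b, se_conv, se_white = ols_with_white_se(X, y)
print(b, se_conv, se_white)    # the two sets of standard errors differ under heteroscedasticity
```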

9 Time Series Basics (Stationarity and Ergodicity)

Dependence in the data: there is a certain degree of dependence in the data in time series analysis; only one realization of the data generating process is given. The CLT and WLLN in their basic forms rely on iid data, but real-world data exhibit dependence.

Examples: inflation rates, stock market returns.

Stochastic process: a sequence of random variables/vectors indexed by time, {z_1, z_2, ...} or {z_i} with i = 1, 2, ...
A realization/sample path: one possible outcome of the process (Hayashi p. 97).

Distinguish between:

• Realization (sample path) = sequence of real numbers


• Process {Zi} = sequence of random variables (not a real number)

Hayashi p. 99f:
Annual inflation rate = a sequence of real numbers. [Figure]
vs. three realizations (r) of the stochastic process of US inflation rates = a sequence of random variables. [Figure]

Ensemble mean: (1/R)Σ_r infl_{1995,r} →p E[infl_{1995}] (by the LLN; here R = 3).

What can we infer from one realization about the process? We can learn something when the DGP has certain properties/restrictions. Condition: stationarity of the process.

Stationarity: we draw from the same distribution over and over again, but the draws may be dependent. Additionally, the dependence doesn't change over time.

Strict stationarity: the joint distribution of z_i, z_{i1}, ..., z_{ir} depends only on the relative positions i_1 − i, ..., i_r − i (distances), but not on i itself. In other words: the joint distribution of (z_i, z_{ir}) is the same as the joint distribution of (z_j, z_{jr}) if i − i_r = j − j_r.

Weak stationarity:
E(z_i) = µ (finite/exists, does not depend on i)
Cov(z_i, z_{i−j}) depends on j (distance), but not on i (absolute position) → implies Var(z_i) = σ² (does not depend on i)
(We only need the first two moments, not the whole distribution.)
→ If all moments exist: strict stationarity implies weak stationarity.

Illustration:
[Figure] It seems that E(z_i) depends on i.

[Figure] It seems that E(z_i) = µ does not depend on i; the realization oscillates around the expected value (some force drags it back, although there are stretches above or below the mean). Var(z_i) = σ² does not depend on i → this second graph seems to fulfil the first two requirements.

We now turn to the assumption that Cov(z_i, z_{i−j}) only depends on the distance j (j = 0 → Var(z_i)). [Figure]

The ensemble means are still constant, but Var(z_i) increases with i.
Example: random walk: z_i = z_{i−1} + ε_i, with ε_i ∼ N(0, σ²) iid.

→ Stationarity is testable.
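A small Python sketch (my own illustration; parameter values are assumptions) contrasting a stationary AR(1) process with a random walk: the ensemble variance of the AR(1) settles down, while that of the random walk grows with i, as described above.

```python
import numpy as np

rng = np.random.default_rng(12)
reps, T = 2000, 200
eps = rng.normal(size=(reps, T))

# Stationary AR(1): z_i = 0.5 z_{i-1} + eps_i   vs.   random walk: z_i = z_{i-1} + eps_i
ar1 = np.zeros((reps, T))
rw = np.zeros((reps, T))
for t in range(1, T):
    ar1[:, t] = 0.5 * ar1[:, t - 1] + eps[:, t]
    rw[:, t] = rw[:, t - 1] + eps[:, t]

# Ensemble variances across realizations at two different dates i
print("AR(1) variance at t=50, t=199:", ar1[:, 50].var(), ar1[:, 199].var())   # roughly constant
print("RW variance at t=50, t=199:   ", rw[:, 50].var(), rw[:, 199].var())     # grows with t
```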


Ergodicity: if two random variables are spaced far apart in the process, the dependence has to decrease with the distance (restricted memory of the process; in the limit the past is forgotten). A stationary process is called ergodic if asymptotic independence holds:

\[ \lim_{n\to\infty} E[f(z_i, ..., z_{i+k})\,g(z_{i+n}, ..., z_{i+n+l})] = E[f(z_i, ..., z_{i+k})]\cdot E[g(z_{i+n}, ..., z_{i+n+l})] \tag{101} \]

Ergodic processes are stationary processes which allow the application of an LLN. Ergodic theorem: if the sequence {z_i} is stationary and ergodic with E(z_i) = µ, then

\[ \bar{z}_n = \frac{1}{n}\sum z_i \to_p \mu \tag{102} \]

So for an ergodic process we are back to the situation where one realization is enough to draw inferences from.

Martingale difference sequence:
Stationarity and ergodicity are not enough for applying the CLT. To derive the CAN property of the OLS estimator we assume: {g_i} = {x_i ε_i} is a stationary and ergodic martingale difference sequence (m.d.s.):
E(g_i | g_{i−1}, g_{i−2}, ...) = 0 (conditioning on the history of the process) ⇒ E(g_i) = 0 by the LTE.

Derive the asymptotic distribution of b_n = (X′X)^{-1}X′y. We assume {g_i} = {x_i ε_i} is a stationary and ergodic m.d.s.:

\[ g_i = \begin{pmatrix} \varepsilon_i\cdot1 \\ \varepsilon_i x_{i2} \\ \vdots \\ \varepsilon_i x_{ik} \end{pmatrix} \tag{103} \]

Why? Because there exists a CLT for stationary and ergodic m.d.s.
E(g_i|g_{i−1}, ...) = 0 with x_{i1} = 1 ⇒ E(ε_i|g_{i−1}, ...) = 0 ⇒ by the LTE: E(ε_i) = 0, and also by the LIE: E(ε_i ε_{i−j}) = 0 for all j = 1, 2, ..., so that Cov(ε_i, ε_{i−j}) = 0.

Summary: we assume for large-sample OLS:

{y_i, x_i} stationary and ergodic → LLN (ergodic theorem): (1/n)Σx_i x_i′ →p E(x_i x_i′) and (1/n)Σx_i y_i →p E(x_i y_i)
⇒ Consistency of b = (X′X)^{-1}X′y

g_i = ε_i x_i is a stationary and ergodic m.d.s. → E(g_i|g_{i−1}, ...) = 0 → E(g_i) = [E(ε_i), ..., E(ε_i x_{ik})]′ = 0
→ CLT for stationary and ergodic m.d.s. applied to {g_i}, with (1/n)Σε_i x_i = ḡ_n:

\[ \sqrt{n}(\bar{g}_n - \underbrace{E(g_i)}_{=0}) \to_d MVN(0, \underbrace{E(g_i g_i')}_{Var(g_i)}) \]

⇒ Distribution of b
⇒ b = [(1/n)Σx_i x_i′]^{-1}(1/n)Σx_i y_i is CAN

Notes: the assumption that {ε_i x_i} is a stationary and ergodic m.d.s. rules out Cov(ε_i, ε_{i−j}) ≠ 0.

Ruling out this autocorrelation is fine for cross-sectional data, but less so for time series. It can be relaxed (but is more complex): HAC (heteroscedasticity and autocorrelation consistent) inference. ⇒ Robust inference.

Robust standard errors/t-stats etc. (HAC) and those based on the more restrictive assumptions should not differ too much (large differences often appear when one tries to account for autocorrelation).

10 Generalized Least Squares

Assumptions of GLS:
Linearity: y_i = x_i′β + ε_i
Full rank: rank(X) = K (with probability 1)
Strict exogeneity: E(ε_i|X) = 0 → E(ε_i) = 0 and Cov(ε_i, x_{ik}) = E(ε_i x_{ik}) = 0

Not assumed: Var(ε|X) = σ²I_n. Instead:

\[ Var(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \begin{pmatrix} Var(\varepsilon_1|X) & Cov(\varepsilon_1,\varepsilon_2|X) & \cdots & Cov(\varepsilon_1,\varepsilon_n|X) \\ Cov(\varepsilon_1,\varepsilon_2|X) & Var(\varepsilon_2|X) & \cdots & \vdots \\ \vdots & & \ddots & \vdots \\ Cov(\varepsilon_1,\varepsilon_n|X) & \cdots & \cdots & Var(\varepsilon_n|X) \end{pmatrix} \tag{104} \]

→ Var(ε|X) = E(εε′|X) = σ²V(X) (σ² is a positive real number extracted from Var(ε|X))

Consequences:

• b_n = (X′X)^{-1}X′y is no longer BLUE, but still unbiased

• Var(b_n|X) ≠ σ²(X′X)^{-1}

Deriving the GLS estimator
Derived under the assumption that V(X) is known, symmetric and positive definite. Then V = PΛP′ (with P = matrix of eigenvectors of V (n×n), Λ = diagonal matrix with the eigenvalues λ on the diagonal) = "spectral decomposition".
⇒ C = Λ^{-1/2}P′ (C is (n×n) and nonsingular)
⇒ V(X)^{-1} = C′C
Transformation: ỹ = Cy, X̃ = CX, ε̃ = Cε ⇒ ỹ = X̃β + ε̃


Cy = CXβ + Cε, i.e. ỹ = X̃β + ε̃

Least squares estimation of β using the transformed data:

\[ \hat{\beta}_{GLS} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y} = (X'C'CX)^{-1}X'C'Cy = \Big(X'\tfrac{1}{\sigma^2}V^{-1}X\Big)^{-1}X'\tfrac{1}{\sigma^2}V^{-1}y = [X'[Var(\varepsilon|X)]^{-1}X]^{-1}X'[Var(\varepsilon|X)]^{-1}y \]

The GLS estimator is BLUE, because for the transformed model:

• linearity holds

• X̃ has full rank

• strict exogeneity holds

• Var(ε̃|X̃) = σ²I_n

• Var(β̂_GLS|X) = σ²[X′V^{-1}X]^{-1}

• assumptions 1.1-1.4 are fulfilled for this model

Problems:

• Difficult to work out the asymptotic properties of β̂_GLS

• In real-world applications Var(ε|X) is not known (and this time we have many more unknowns)

• If Var(ε|X) is estimated, the BLUE property of β̂_GLS is lost

Special case of GLS: weighted least squares

\[ E(\varepsilon\varepsilon'|X) = Var(\varepsilon|X) = \sigma^2\begin{pmatrix} V_1(X) & 0 & \cdots & 0 \\ 0 & V_2(X) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & V_n(X) \end{pmatrix} = \sigma^2 V(X) \tag{105} \]

As V(X)^{-1} = C′C:

\[ C = \begin{pmatrix} \frac{1}{\sqrt{V_1(X)}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{V_2(X)}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{V_n(X)}} \end{pmatrix} = \begin{pmatrix} \frac{1}{s_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{s_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{s_n} \end{pmatrix} \tag{106} \]

\[ \Rightarrow \underset{b}{\operatorname{argmin}}\sum\Big(\frac{y_i}{s_i} - \beta_1\frac{1}{s_i} - \beta_2\frac{x_{i2}}{s_i} - \ldots - \beta_K\frac{x_{iK}}{s_i}\Big)^2 \]

Observations are weighted by their standard deviations (a higher penalty for errors where the variance is small). For this we would need the conditional variances Var(ε_i|x_i). We assume that we can use a variance conditioned on x_i.
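A minimal Python sketch (my own illustration; the skedastic function V(x_i) = x_i² and all names are assumptions) showing that dividing each observation by s_i = √V(x_i) and running OLS on the transformed data gives exactly the (diagonal-V) GLS formula.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
V = x ** 2                                   # assumed skedastic function V(x_i) = x_i^2
y = 2.0 + 1.0 * x + np.sqrt(V) * rng.normal(size=n)

# WLS: divide each observation by s_i = sqrt(V(x_i)) and run OLS on the transformed data
s = np.sqrt(V)
X_t, y_t = X / s[:, None], y / s
b_wls = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)

# Equivalent GLS formula: (X' V^{-1} X)^{-1} X' V^{-1} y  (V diagonal here)
b_gls = np.linalg.solve(X.T @ (X / V[:, None]), X.T @ (y / V))
print(b_wls, b_gls)    # identical up to numerical precision
```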


Notes on GLS: if the regressors X are not strictly exogenous, GLS may be inconsistent (OLS does not have this problem: robustness property of OLS).

Cross-section data: E(ε_i|X) = 0 is better justified and E(ε_i ε_j) = 0, but:
Conditional heteroscedasticity: Var(ε_i|x_i) = σ²V(x_i) ⇒ WLS using σ²V(x_i).
Functional form of V(x_i)? Typically not known; it has to be assumed, e.g.:

V(x_i) = z_i′α, with z_i observables and α parameters to be estimated

⇒ Feasible GLS (FGLS) (α and β are estimated in one step)
→ The finite sample properties of GLS are lost.

→ For large sample properties: FGLS may be (asymptotically) more efficient than OLS, but the functional form of V(x_i) has to be correctly specified. If this is not the case, FGLS may be even less efficient than OLS. ⇒ We face a trade-off between consistency and efficiency.

One can also work out the distribution of the GLS estimator assuming a stationary and ergodic m.d.s., and one again obtains a normal distribution.

11 Multicollinearity

Exact multicollinearity: expressing a regressor as a linear combination of (an)other regressor(s).

rank(X) ≠ K: no full rank ⇒ assumption 1.3 or 2.4 is violated and (X′X)^{-1} does not exist. Example: seasonal dummies / sex dummies together with a constant.

Often economic variables are correlated to some degree. Example: tenure and work experience.

The BLUE result is not affected. Large sample results are not affected. Var(b|X) is affected.

Effects of Multicollinearity and solution to the problemEffects:

• Coefficients may have high standard errors and low significance levels

• Estimates may have the wrong sign (compared to a theoretical sign)

• Small changes in the data produces wide swings in the parameter


Recall: Var(b_k|X) = σ²[(X′X)^{-1}]_{kk}. Suppose X contains a constant and two regressors x_1, x_2. Then

\[ Var(b_k|X) = \frac{\sigma^2}{(1 - r^2_{x_1,x_2})\sum(x_{ik} - \bar{x}_k)^2} \]

(with r² the squared empirical correlation of the regressors and the second factor the sample variation of the k-th regressor).

In general: \( [(X'X)^{-1}]_{kk} = \frac{1}{(1 - R^2_{k\cdot})\sum(x_{ik} - \bar{x}_k)^2} \) (with R²_{k·} the explained variance of the regression of x_k on the rest of the data matrix, excluding x_k).

⇒ The stronger the correlation between the regressors, the smaller the denominator and the higher the variance of b_k: less precision, and we can hardly reject the null hypothesis (small t-stat).
⇒ The smaller σ², the smaller Var(b_k).
⇒ The higher n, the higher the sample variation in the regressor x_k; thus a large n can compensate for high correlation: the larger the sample variation, the smaller Var(b_k).

Solutions:

• Increasing precision by implementing more data (= higher n→higher variance of xk) (costly!)

• Building a better fitting model that leaves less unexplained (= smaller σ2)

• Excluding some regressors (Omitted variables bias!) (= smaller correlation)
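The factor 1/(1 − R²_{k·}) from the variance formula above is the classic variance inflation factor; a short Python sketch (my own illustration; the helper name variance_inflation and the simulated data are assumptions) computes it by regressing each regressor on the others.

```python
import numpy as np

def variance_inflation(X):
    """1/(1 - R^2_k.) for each regressor (skipping the constant in column 0)."""
    n, K = X.shape
    vifs = []
    for k in range(1, K):
        others = np.delete(X, k, axis=1)
        g = np.linalg.lstsq(others, X[:, k], rcond=None)[0]   # regress x_k on the rest
        resid = X[:, k] - others @ g
        r2_k = 1 - (resid @ resid) / ((X[:, k] - X[:, k].mean()) ** 2).sum()
        vifs.append(1 / (1 - r2_k))
    return vifs

# Two highly correlated regressors inflate Var(b_k)
rng = np.random.default_rng(10)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)            # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
print(variance_inflation(X))                   # large values signal multicollinearity
```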

12 Endogeneity

Omitted variable bias
Correctly specified model: y = X1β1 + X2β2 + ε
Regression of y on X1 only → X2 gets into the error term: y = X1β1 + ε*, with ε* = X2β2 + ε.
→ Omitted variable bias:
b1 = (X1'X1)⁻¹X1'y = (X1'X1)⁻¹X1'(X1β1 + X2β2 + ε) = β1 + (X1'X1)⁻¹X1'X2 β2  [regression of X2 on X1]  + (X1'X1)⁻¹X1'ε  [= 0 if E(ε|X) = 0]

The OLS estimator is biased: if β2 ≠ 0 (otherwise X2 would not be part of the model in the first place) and (X1'X1)⁻¹X1'X2 ≠ 0 (the regression of X2 on X1 gives non-zero coefficients), then (X1'X1)⁻¹X1'X2 β2 ≠ 0.

Recall:
1. E(εi|X) = 0 = strict exogeneity → E(εi·xi) = 0
2. E(εi·xik) = 0 ∀k = large sample OLS assumption; predetermined/orthogonal regressors
Endogeneity: E(εi·xi) ≠ 0, or E(εi·xik) ≠ 0 for some k.
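A short simulation sketch of the omitted variable bias; the chosen coefficients and the correlation between X1 and the omitted X2 are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)

n, reps = 500, 2000
beta1, beta2 = 1.0, 2.0
b1_short = np.empty(reps)

for r in range(reps):
    x2 = rng.normal(size=n)
    x1 = 0.6 * x2 + rng.normal(size=n)        # X1 correlated with the omitted X2
    eps = rng.normal(size=n)
    y = beta1 * x1 + beta2 * x2 + eps
    X1 = x1[:, None]
    b1_short[r] = np.linalg.solve(X1.T @ X1, X1.T @ y)[0]   # regression of y on X1 only

# the bias equals (X1'X1)^{-1} X1'X2 * beta2, so the mean is clearly away from beta1 = 1
print("mean of b1 over replications:", b1_short.mean())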


Endogeneity bias: working example
Simultaneous equations model of market equilibrium (structural form):
qdi = α0 + α1 pi + ui  (ui = demand shifter, e.g. preferences)
qsi = β0 + β1 pi + vi  (vi = supply shifter, e.g. temperature)
Assumptions: E(ui) = 0, E(vi) = 0, E(ui·vi) = 0 (covariance)
Markets clear: qdi = qsi.
Our data only show the market-clearing points. It is not possible to estimate α0, α1, β0, β1 (levels and slopes), as we do not know whether changes in the market equilibrium are due to supply or demand shocks. We observe many equilibria, but we cannot recover the slopes of the demand and the supply curve from the data alone.

Here: Simultaneous equation bias

From structural form to reduced form
Solving for qi and pi (qdi = qsi = qi) yields the reduced form (only exogenous terms on the right-hand side):
pi = (β0 − α0)/(α1 − β1) + (vi − ui)/(α1 − β1)
qi = (α1β0 − α0β1)/(α1 − β1) + (α1vi − β1ui)/(α1 − β1)
Both price and quantity are functions of the two error terms (no regression can be run at this point, as we do not observe the error terms): vi = supply shifter and ui = demand shifter.
Endogeneity: correlation between errors and regressors; the regressors are not predetermined, as the price is a function of these error terms.

Calculating the covariance of pi and the demand shifter ui (using the reduced form):
Cov(pi, ui) = −Var(ui)/(α1 − β1)

If endogeneity is present, the OLS estimator is not consistent. The FOCs in the simple regression context yield:
α̂1 = [(1/n) Σ (qi − q̄)(pi − p̄)] / [(1/n) Σ (pi − p̄)²] →p Cov(pi, qi)/Var(pi)  (formula for estimation)
But here: Cov(pi, qi) = α1 Var(pi) + Cov(pi, ui)  (from the structural form)
⇒ Cov(pi, qi)/Var(pi) = α1 + Cov(pi, ui)/Var(pi) ≠ α1
⇒ OLS is not consistent.
The same holds for β1.

Instruments for the market model
Properties of the instruments: uncorrelated with the errors (instruments are predetermined) and correlated with the endogenous regressors.

New model:
qdi = α0 + α1 pi + ui
qsi = β0 + β1 pi + β2 xi + δi
qdi = qsi
with: E(δi) = 0


Cov(xi, δi) = 0
Cov(xi, ui) = 0
Cov(xi, pi) = [β2/(α1 − β1)] Var(xi) ≠ 0

→ xi is an instrument for pi ⇒ yields the new reduced form:
pi = (β0 − α0)/(α1 − β1) + [β2/(α1 − β1)] xi + (δi − ui)/(α1 − β1)
qi = (α1β0 − α0β1)/(α1 − β1) + [α1β2/(α1 − β1)] xi + (α1δi − β1ui)/(α1 − β1)

Cov(xi, qi) = [α1β2/(α1 − β1)] Var(xi) = α1 Cov(xi, pi)  (from the reduced form)

⇒ Using method of moments theory (not regression): α̂1 = Ĉov(xi, qi)/Ĉov(xi, pi); by the WLLN α̂1 →p α1 (compare the estimator [(1/n) Σ xi yi]/[(1/n) Σ xi zi]).
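A sketch of this method-of-moments/IV idea for the market model; the parameter values and the error distributions are illustrative assumptions, not from the lecture:

import numpy as np

rng = np.random.default_rng(3)

n = 100_000
alpha0, alpha1 = 10.0, -1.0            # demand: q = alpha0 + alpha1*p + u
beta0, beta1, beta2 = 2.0, 1.0, 0.5    # supply: q = beta0 + beta1*p + beta2*x + delta

x = rng.normal(size=n)                 # instrument: shifts supply, unrelated to u and delta
u = rng.normal(size=n)
d = rng.normal(size=n)

# market clearing (reduced form): solve the two structural equations for p and q
p = (beta0 - alpha0 + beta2 * x + d - u) / (alpha1 - beta1)
q = alpha0 + alpha1 * p + u

# OLS of q on p is inconsistent for the demand slope alpha1 ...
cpq = np.cov(p, q)
alpha1_ols = cpq[0, 1] / cpq[0, 0]

# ... while the method-of-moments / IV estimate Cov(x,q)/Cov(x,p) is consistent
alpha1_iv = np.cov(x, q)[0, 1] / np.cov(x, p)[0, 1]

print("OLS:", alpha1_ols, " IV:", alpha1_iv, " true:", alpha1)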

A simple macroeconomic model: Haavelmo
Aggregated consumption function: Ci = α0 + α1 Yi + ui
GDP identity: Yi = Ci + Ii
Yi affects Ci, but at the same time Ci influences Yi.
Reduced form: Yi = α0/(1 − α1) + [1/(1 − α1)] Ii + ui/(1 − α1)
→ Ci cannot be regressed on Yi, as the regressor is correlated with the residual:
Cov(Yi, ui) = Var(ui)/(1 − α1) > 0  (using the reduced form)
→ The OLS estimator is inconsistent: upwards biased (as Cov(Ci, Yi) = α1 Var(Yi) + Cov(Yi, ui) from the structural form):
Cov(Ci, Yi)/Var(Yi) = α1 + Cov(Yi, ui)/Var(Yi) ≠ α1
Valid instrument for income Yi: investment Ii
Cov(Ii, ui) = 0
Cov(Ii, Yi) = Var(Ii)/(1 − α1) ≠ 0  (from the reduced form)

Errors in variables
The explanatory variable is measured with error (e.g. reporting errors).
Classical example: Friedman's permanent income hypothesis.
Permanent consumption is proportional to permanent income: Ci* = k·Yi*
Observed variables: Yi = Yi* + yi, Ci = Ci* + ci
⇒ Ci = k·Yi + ui with ui = ci − k·yi
Endogeneity due to the measurement errors → solution: IV estimation; here: xi = 1.

13 IV estimation

General solution to the endogeneity problem: IV
Note the change in notation: zi = regressors, xi = instruments.


Linear regression: yi = zi'δ + εi
But the assumption of predetermined regressors does not hold: E(zi·εi) ≠ 0.
To get consistent estimators, instrumental variables are needed: xi = [xi1, ..., xiK]'.

xi is correlated with the endogenous regressors, but uncorrelated with the error term.

Assumptions for IV estimation
3.1 Linearity: yi = zi'δ + εi

3.2 Ergodic stationarity: K instruments and L regressors. The data sequence wi = {yi, zi, xi} is stationary and ergodic [an LLN can be applied as with OLS].
We focus on K = L → IV (the other possibility, K > L, leads to GMM).

3.3 Orthogonality conditions: E(xi·εi) = 0
E(xi1(yi − zi'δ)) = 0
...
E(xiK(yi − zi'δ)) = 0
→ E(xi(yi − zi'δ)) = 0 ⇒ E(xi·εi) = E(gi) = 0

Later we also assume that a CLT can be applied to gi as for OLS

3.4 Rank condition for identification: rank(Σxz) = L, with

E(xi zi') = [E(xi1 zi1) ... E(xi1 ziL); ... ; E(xiK zi1) ... E(xiK ziL)] = Σxz  (K×L)   (107)

(= full column rank). Σxz⁻¹ exists if K = L.

Core question: Do moment conditions provide enough information to determine δ uniquely?

Some notes on IV estimation
Start from: gi = xi(yi − zi'δ) = g(wi, δ)  (with wi = the data).
Assumption 3.3, E(gi) = 0 or E(g(wi, δ)) = 0, is the „orthogonality condition": E(xi(yi − zi'δ)) = 0 = a system of K equations relating the L unknown parameters δ to K unconditional moments.

Core question: „Are the population moment conditions sufficient to determine δ uniquely?"
δ is a solution to the system of equations. But is δ the only solution? Let δ̃ be a hypothetical value that solves E(g(wi, δ̃)) = 0:
E(xi yi) − E(xi zi')δ̃ = 0, i.e. σxy (K×1) − Σxz (K×L)·δ̃ = 0
Σxz·δ̃ = σxy  (a system of linear equations Ax = b)
Necessary & sufficient condition that δ is the only solution ⇒ Σxz has full column rank (L) ⇒ the reason for Assumption 3.4.


⇒Another necessary condition (but not sufficient): Order condition: K(#instruments)≥ L(#regressors)

Deriving the IV estimator (K = L)
E(xi εi) = E(xi(yi − zi'δ)) = 0
E(xi yi) − E(xi zi'δ) = 0
δ = [E(xi zi')]⁻¹ E(xi yi)
δ̂IV = [(1/n) Σ xi zi']⁻¹ [(1/n) Σ xi yi] →p δ

To show this, proceed as for OLS and start from the sampling error δ̂IV − δ.

Alternative notation: δ̂IV = (X'Z)⁻¹X'y

If K=L the rank condition implies that Σ−1xz exists and the system is exactly identified.

Applying the WLLN, the CLT and the lemmas, it can be shown that the IV estimator δ̂IV is CAN. To show this, proceed as for OLS and start from √n(δ̂IV − δ). Assume that a CLT can be applied to {gi}:
√n((1/n) Σ gi − E(gi)) →d MVN(0, E(gi gi'))
⇒ √n(δ̂IV − δ) →d MVN(0, Avar(δ̂IV))  (find an expression for the variance and for the estimator of the variance)

So: show consistency and asymptotic normality of δ̂IV. δ̂IV is CAN: δ̂IV ∼a MVN(δ, Avar(δ̂IV)/n)

Relating the OLS to the IV estimator:
If the regressors are predetermined, E(εi·zi) = 0 → set xi = zi (use the regressors as instruments):
δ̂IV = [(1/n) Σ zi zi']⁻¹ (1/n) Σ zi yi = OLS estimator
⇒ OLS is a special case of IV.
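A minimal sketch of the IV formula δ̂IV = (X'Z)⁻¹X'y in matrix form, with one endogenous regressor and one instrument plus a constant; the data-generating process is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(4)

n = 50_000
w = rng.normal(size=n)                      # exogenous instrument
common = rng.normal(size=n)                 # source of endogeneity
eps = 0.8 * common + rng.normal(size=n)
p = 0.5 * w + common + rng.normal(size=n)   # correlated with eps through `common`
delta = np.array([1.0, 2.0])
y = delta[0] + delta[1] * p + eps

Z = np.column_stack([np.ones(n), p])        # regressor matrix (n x L)
X = np.column_stack([np.ones(n), w])        # instrument matrix (n x K), here K = L

delta_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)   # inconsistent
delta_iv = np.linalg.solve(X.T @ Z, X.T @ y)    # (X'Z)^{-1} X'y, consistent

print("OLS:", delta_ols, " IV:", delta_iv)

With X = Z this reduces exactly to the OLS formula, which is the special case noted above.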


14 Questions for Review

First set
1. Ragnar Frisch's statement in the first Econometrica issue claims that Econometrics is not (only) economic statistics and not (only) mathematics applied to economics. What is it, then?

Econometrics=Economic statistics (Basis) ∩Economic theory (Basis and application) ∩Mathematics (Methodology)

2. How does the fundamental asset value evolve over time in the Glosten/Harris model presented in the lecture? Which parts of the final equation are associated with public and which parts with private information that influences the fundamental asset value?

mt [fundamental asset value] = µ + mt−1 + εt  [random walk with drift]  + Qt z0 + Qt z1 vt  [private info]   (108)

∆Pt = µ [public info] + z0 Qt + z1 vt Qt + c ∆Qt [private info] + εt [public info]   (109)

3. Explain the components of the equations that give the market maker buy (bid) and the market maker sell (ask) price in the Glosten/Harris model.

Pt^{a,b} = µ [drift parameter] + mt−1 [last fair price] + εt [public info] ± zt [volume] ± c [earnings/costs]   (110)

4. Explain how in the Glosten/Harris model the market maker anticipates the impact that a trade event exerts on the fundamental asset price when setting buy and sell prices.

The market maker sells at the ask price, thus zt and c enter positively into the price. This means the market maker adds a volume component and his costs to the price to be rewarded appropriately for his work.

5. Why should it be interesting for a) an investor and b) for a stock exchange to estimate the parameters of the Glosten/Harris model (z0, z1, c, µ)?

a) To find out the fundamental value and thus prevent the investor from paying too much.
b) To trade with efficient prices (avoid bubbles, avoid exploitation by the market maker).


6. What does the (bid-ask) spread mean? What are explanations for the existence of a positive spread, and what could be the reason that the spread differs when the same asset is traded in different market environments?

Pt^a − Pt^b = 2zt + 2c   (111)

Positive spread: earnings of the market maker to earn a living and to cover costs (operational costs, protection/storage costs).
Market environment: influences costs and trade volume, e.g. in a crisis.

7. Which objects (variables and parameters) in the Glosten/Harris model are a) observable to the market maker, but not to the econometrician and b) observable to the econometrician?

a) costs/earnings c
b) ∆Pt, Qt, vt

8. The final equation of the Glosten/Harris model contains observable objects, unknown parameters and an unobservable component. What are these? What is the meaning of the unobservable variable in the final equation?

Observable: ∆Pt, Qt, vt
Unobservable: εt
Parameters: µ, z0, z1, c

9. Why did we transform the two equations for the market maker's buy and sell price that the market maker posts at point in time t into the equation for ∆Pt?

So that we can estimate the structural parameters

10. In the "Mincer equation" derived from human capital theory, what are the observable variables and what would a sensible interpretation of the unobservable component be?

Observable: Wage, S, Tenure, Exp
Unobservable: εi (e.g. ability, luck, motivation, gender)

11. Why should the government (ministry of labour or education) be interested in estimating the parameters of the Mincer equation from human capital theory?

To determine how many years people should go to school (costs!)

12. Explain why we can conceive ln(WAGEi), Si, TENUREi, and EXPRi in the Mincer equation as random variables.

The individuals are randomly drawn. The variables are the results of a random experiment and thus random variables.

13. An empirical researcher using a linear regression model is confronted with the critique „this is measurement without theory". What is meant by that and why is this a problem?


A correlation is found, but it may be too small or without meaning for theory. Always start with a theoretical assumption. Without a theoretical assumption behind the statistics, we cannot test hypotheses. Causality is a problem, because we cannot refer to a model/reasons for the form of our regression.

14. The researcher counters that argument by stating that he is only interested in prediction and causality issues do not matter for him. Exploiting a correlation would be just fine for his purposes. Provide a critical assessment of this statement.

To act upon such a test, causality is needed. Correlation could be due to self-selection, unobserved variables, reversed causality, interdependence, endogeneity.

15. A possibility to motivate the linear regression model is to assume that E(Y|X) = x'β. What does this imply for the disturbance term ε?

ε uncorrelated with the X’s and its expectation value is zero.

16. Some researchers argue that nonparametric analysis provides a better way to model the conditional mean of the dependent variable. What can be said in favor of it and what is the drawback?

E(yi|xi) = xi'β = assuming linearity (parametric) vs. E(yi|xi) = f(xi, ϑ) = no functional form assumed (nonparametric).
In favor: no linearity assumed (robust).
Drawback: data intensive (with a lot of variety in X), „curse of dimensionality".

17. The parameters of asset pricing models relating expected excess returns to risk factors can be estimated by regression analysis. Describe the idea. Which implications for the „compatible regression model" follow from theory with respect to the regression parameters and the distribution of the disturbance term? How can we test the parameter restriction?

E(R^e_{j,t+1}) = βj · E(R^e_{m,t+1})   (112)

Compatible model assumes that (un)conditional mean of εj is 0:

R^e_{j,t+1} = βj · R^e_{m,t+1} + ε_{j,t+1}   (113)

Parameter restriction???

18. Despite the fact that the Fama-French model is in line with the fundamental equation from finance theory, which states that E(R^e_j) = β'λ, it is nevertheless prone to the „measurement without theory" critique. Why is that so? What do β and λ stand for in the first place?

β = exposure of the asset's payoff to factor-k risk
λ = price of factor-k risk

Causality problem: if risk increases, does that cause the expected return to increase, or is that because of other factors (unobserved variables, reversed causality, ...)?


Addendum: If one has a stable correlation, we can predict without causality, e.g. animals leaving the shore and tsunamis.

Second set
1. Explain the key difference between experimental data and the „process-produced" economic data that is typically available.

In „process-produced“ data: self selection (individuals not randomly assigned)

2. Why is the Tennessee STAR study a proper experiment while the SQ study of the sociology student is (probably) not?

STAR participants randomly assigned to the two groups (small or big classes). SQ study with self-selection bias and without control group and pre-test

3. Explain the term „selection bias". Write down the expression and explain its terms to your fellow student.

Selection bias:
Self-selection: individuals decide themselves which group they want to be part of.
Bias: the parameters are biased because of self-selection; errors in choosing the individuals/groups that participate.

4. Comparing the average observable outcome yi (e.g. entry income) of a random sample of individuals who received a treatment di (e.g. attended an SQ course) with the average outcome of a random sample who did not receive a treatment (did not attend an SQ course) yields a consistent estimate of E(yi|di = 1) − E(yi|di = 0) (by a law of large numbers). Under which conditions does this yield a consistent estimate of a treatment effect ρ assumed to be identical for all i? What causes a bias of the treatment effect? Write down an explicit expression of this bias. How can this bias be avoided?

The individuals have to be randomly assigned to either treatment or non-treatment. Otherwise E(b − β) ≠ 0, i.e.
E(yi|di = 1) − E(yi|di = 0) = ρ + selection bias, where the selection bias equals E(y0i|di = 1) − E(y0i|di = 0).

5. Consider the police-crime example. Put it in a Rubin causal framework. Why would we suspect a selection bias, i.e. E(y0i|di = 1) − E(y0i|di = 0) ≠ 0?

Yi = y0i + (y1i − y0i)Di
Yi: crime; Di: 0 = police presence does not change, 1 = police presence increases; y1i = crime if Di = 1 and y0i = crime if Di = 0.
Social factors (part of εi) and police are correlated. Police officers are not randomly assigned to the districts.

6. What is a quasi-experiment?


= a natural experiment:
Without assigning individuals to groups, those groups form in the way they would in an experiment (random assignment).

7. In the lecture I promoted the „experimental ideal" for the measurement of causal effects. For instance, the US senate passed (in 2004) a law that educational studies have to be based on experimental or quasi-experimental designs. Discuss the pros and cons of setting up an experiment in a socio-economic context. Give arguments against conducting such experiments, which do a good thing as they help to estimate the true causal effect of treatments.

Pro: measures causal effects; self-selection bias avoided.
Contra: unethical, costly.

8. Take the Mincer equation as an example. Why are the years of schooling „endogenous", i.e. the result of an economic decision process that is not made explicit in the equation? What is the role of the disturbance term ε in this context? Place this endogeneity in the context of the US senate's law mentioned above.

Years of schooling depend on ability, which is part of ε (Si could be written as a function of IQ). → No decision based on the Mincer equation.

9. Estimating the parameters of the Mincer equation yields an estimate of the coefficient that multiplies with the years of schooling of 0.12. Interpret that number economically. When would the US senate rely on this result when it has to decide on budget measures that involve endowing the US education system with tax money? What are the concerns? What should be done to convince the senate that the result is trustworthy?

An additional year of schooling increases the wage by 12%. The US senate would require an experiment, which would ask some students to leave school earlier or stay longer than planned (random assignment). As this is unethical, the selection bias cannot be avoided. One might want to collect as many pieces of information as possible to avoid endogeneity.

10. Provide another economic example (like the crime example or the liberalization example, but one that is not given in the lecture) where endogeneity is present. Cast your example in a CLRM. What is the economic interpretation of ε in your model?

???

11. There is a special name for model parameters which have an economic meaning. What do we call them?

Structural parameters

12. Based on a sample of firms, a researcher estimates by OLS the coefficients of a Cobb-Douglas production function. The CD function is nonlinear, but the regression coefficients are computed despite the first CLRM assumption of linearity. How is this possible? Estimates of the production function parameters yield numbers equal to 0.62 and 0.41, respectively. How do you interpret these estimates?


Log-linear model: taking logs makes the Cobb-Douglas function linear in the parameters.
A 1% increase in capital leads to a 0.62% increase in output (holding the other input constant).

13. Prove your algebraic knowledge and write the CLRM in matrix and vector notation. Indicate the dimensions of scalars, vectors and matrices explicitly!

[y1; ...; yn]  =  [1 x12 ... x1K; ... ; 1 xn2 ... xnK] · [β1; ...; βK]  +  [ε1; ...; εn]   (114)
 (n×1)              (n×K)                                  (K×1)            (n×1)

14. Which objects in the CLRM are observable and which are not?

Observable: y, X
Unobservable: β, ε

15. In the lecture we had a possible choice of b = (X'X)⁻¹X'y. Write out that expression in detail using the definition of the matrix X and the vector y. Can you write the right-hand side expression using xi = (xi1, xi2, ..., xiK)' and yi instead of the matrix X and the vector y?

b = ( [1 ... 1; x12 ... xn2; ... ; x1K ... xnK] · [1 x12 ... x1K; ... ; 1 xn2 ... xnK] )⁻¹ · [1 ... 1; x12 ... xn2; ... ; x1K ... xnK] · [y1; ...; yn]   (115)

b = [(1/n) Σ xi xi']⁻¹ (1/n) Σ xi yi   (116)
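A quick numerical check (Python/NumPy) that the matrix form (115) and the summation form (116) give identical numbers; the simulated data are an assumption for illustration only:

import numpy as np

rng = np.random.default_rng(5)

n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# matrix form: b = (X'X)^{-1} X'y
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# summation form: b = (1/n sum x_i x_i')^{-1} (1/n sum x_i y_i)
Sxx = sum(np.outer(xi, xi) for xi in X) / n
sxy = sum(xi * yi for xi, yi in zip(X, y)) / n
b_sums = np.linalg.solve(Sxx, sxy)

print(np.allclose(b_matrix, b_sums))   # True: the two expressions are identical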

16. Using observed data on explanatory variables and the dependent variable you can compute the OLS estimator as b = (X'X)⁻¹X'y. This yields a K-dimensional column vector of real numbers. That's an easy task using a modern computer. Explain why we conceive the OLS estimator also as a vector of random variables (a random vector), and that the concrete computation of b yields a particular realization of this vector of random variables. Explain the source of randomness of b.

b is a random vector because it is computed as a function of X and y, and X and y are random variables because we assume they are drawn randomly. When individuals are drawn, those are the realizations, and we can compute the realized b.

17. Linearity (yi = xi'β + εi) seems a restrictive assumption. Provide an example (not from the lecture) where an initially nonlinear relation of dependent variable and regressors can be reformulated such that an equation linear in parameters results. Give another example (not in the lecture) where this is not feasible.


y = x^α z^β → ln y = α ln x + β ln z (linear in the parameters after taking logs);
y = x^α + z^β as an example that cannot be transformed into an equation linear in the parameters.

18. When are two random variables orthogonal?

For random variables: E(xy) = 0.
For vectors: x'y = 0.

19. Strict exogeneity is a restrictive assumption. In another assignment you will be asked to prove the result (given in the lecture) that strict exogeneity implies that the regressors and the disturbance term are uncorrelated (or, equivalently, have zero covariance or, also equivalently, are orthogonal). Explain, using one of the economic examples given in the lecture (crime example or liberalization example), that assuming that error term and regressor(s) are uncorrelated is doubtful from an economic perspective. (You have to give the ε an economic meaning, like in the wage example where ε measured unobserved ability of an individual.)

Social factors as part of εi influence both crime and police.

20. Why is Z = rank(X) a random variable? Which values can Z take (X is an n×K matrix where n ≥ K)? What do we assume w.r.t. the probability function of Z? Why do we make this assumption in the first place and why is it important for the computation of the OLS estimator?

rank(X) is a function of X. Z = rank(X) can take the values 1, ..., K and is a discrete random variable. We assume a degenerate probability distribution: probability 1 for full rank and 0 for all other outcomes, i.e. P(Z = K) = 1. Otherwise (X'X)⁻¹ need not exist and the OLS estimator could not be computed.

21. How many random variables are contained in the equations
a) Y = Xβ + ε: Y (n r.v.), X (n·K r.v.), ε (n r.v.); β is a vector of fixed parameters.
b) Y = Xb + e: Y (n r.v.), X (n·K r.v.), b (K r.v.), e (n r.v.).

Third set

1. Show the following results explicitly by writing out in detail the expectations for the case of continuous and discrete random variables.
Useful:

E(g(x, y)) = ∫∫ g(x, y) fxy(x, y) dx dy   (117)

E(g(x, y)|x) = ∫ g(x, y) [fxy(x, y)/fx(x)] dy   (118)

a) EX[EY|X(Y|X)] = EY(Y)  (Law of Total Expectations, LTE)
EY|X(Y|X) = ∫ y fxy(x, y)/fx(x) dy
EX[EY|X(Y|X)] = ∫ [∫ y fxy(x, y)/fx(x) dy] fx(x) dx = ∫ y ∫ fxy dx dy = ∫ y fy(y) dy = EY(Y)

51

Page 52: Applied Eco No Metrics

Applied Econometrics – 14 Questions for Review

b) EX[EY|X(g(Y)|X)] = EY(g(Y))  (Double Expectation Theorem, DET)
EY|X(g(Y)|X) = ∫ g(y) fxy/fx(x) dy
EX[EY|X(g(Y)|X)] = ∫ [∫ g(y) fxy/fx(x) dy] fx(x) dx = ∫ g(y) ∫ fxy dx dy = ∫ g(y) fy(y) dy = EY(g(Y))

c) EXZ[EY|X,Z(Y|X,Z)] = EY(Y)  (LTE)
EY|X,Z(Y|X,Z) = ∫ y fxyz(x, y, z)/fxz(x, z) dy
EXZ[EY|X,Z(Y|X,Z)] = ∫∫ [∫ y fxyz(x, y, z)/fxz(x, z) dy] fxz(x, z) dx dz = ∫ y ∫∫ fxyz dx dz dy = ∫ y fy(y) dy = EY(Y)

d) EZ|X[EY|X,Z(Y|X,Z)|X] = EY|X(Y|X)  (Law of Iterated Expectations, LIE)
EY|X,Z(Y|X,Z) = ∫ y fxyz(x, y, z)/fxz(x, z) dy
EZ|X[EY|X,Z(Y|X,Z)|X] = ∫ [∫ y fxyz(x, y, z)/fxz(x, z) dy] fxz(x, z)/fx(x) dz = ∫ y [1/fx(x)] ∫ fxyz(x, y, z) dz dy = ∫ y fxy(x, y)/fx(x) dy = EY|X(Y|X)

e) EX[EY|X(g(X,Y)|X)] = EXY(g(X,Y))  (generalized DET)
EY|X(g(X,Y)|X) = ∫ g(x, y) fxy(x, y)/fx(x) dy
EX[EY|X(g(X,Y)|X)] = ∫ [∫ g(x, y) fxy(x, y)/fx(x) dy] fx(x) dx = ∫∫ g(x, y) fxy(x, y) dx dy = EXY(g(x, y))

f) EY|X[g(X)Y|X] = g(X)EY|X(Y|X)  (Linearity of Conditional Expectations)
EY|X[g(X)Y|X] = ∫ g(x) y fxy(x, y)/fx(x) dy = g(x) ∫ y fxy(x, y)/fx(x) dy = g(x) EY|X(Y|X)

2. Show that for K = 2, E(εi|xi1, xi2) = 0 implies
a) E(εi) = 0
E(εi|xi1, xi2) = 0 → E(εi|xi1) = 0 by the LIE
E(εi|xi1) = 0 → E(εi) = 0 by the LTE

b) E(εi xi1) = 0
Show: E(εi·xi1) = Exi1[Eεi|xi1(εi·xi1|xi1)] by the DET = Exi1[xi1·E(εi|xi1)] = E(0) = 0 by linearity of conditional expectations (using E(εi|xi1) = 0).
Show: Cov(εi, xi1) = E(xi1·εi) − E(xi1)E(εi) = 0 − 0 = 0.

c) Show that E(εi|X) = 0 implies that Cov(εi, xik) = 0 ∀ i, k.
EX[E(εi|X)] = E(εi) = 0 by the LTE
E(εi·xik) = EX[E(εi xik|X)] by the DET = EX[xik E(εi|X)] = E(0) = 0 by linearity of conditional expectations
Cov(xik, εi) = E(xik εi) − E(xik)E(εi) = 0


3. Explain how the positive semi-definiteness of the difference of the two variance-covariance matrices of two alternative estimators is related to the relative efficiency of one estimator vs. another.

Positive semi-definiteness: a'[Var(β̃|X) − Var(b|X)]a ≥ 0.
When positive semi-definiteness holds, we can choose a' = [1, 0, ..., 0], a' = [0, 1, 0, ..., 0], etc., so that Var(β̃1|X) ≥ Var(b1|X), Var(β̃2|X) ≥ Var(b2|X), and so on for all bk.
Thus we can conclude (relative) efficiency when the difference matrix is positive semi-definite.

4. Show by writing in detail for K = 2, β̃ = (β̃1, β̃2)', that a'[Var(β̃|X) − Var(b|X)]a ≥ 0 ∀ a ≠ 0 implies that Var(β̃1|X) ≥ Var(b1|X) for a = (1, 0)'.

a'[Var(β̃|X) − Var(b|X)]a ≥ 0   (119)

(1, 0) [ [Var(β̃1|x)  Cov(β̃1, β̃2|x); Cov(β̃2, β̃1|x)  Var(β̃2|x)] − [Var(b1|x)  Cov(b1, b2|x); Cov(b2, b1|x)  Var(b2|x)] ] (1, 0)'   (120)

= (1, 0) [Var(β̃1|x) − Var(b1|x)   Cov(β̃1, β̃2|x) − Cov(b1, b2|x); Cov(β̃2, β̃1|x) − Cov(b2, b1|x)   Var(β̃2|x) − Var(b2|x)] (1, 0)'   (121)

(1, 0) [a b; c d] (1, 0)' = [a, b] (1, 0)' = a = Var(β̃1|x) − Var(b1|x) ≥ 0   (122)

→ Var(β̃1|x) ≥ Var(b1|x)

5. We require a variance-covariance matrix to be symmetric and positive definite. Why is the latter a sensible restriction? Definition of a positive (semi-)definite matrix: A is a symmetric matrix, x is a nonzero vector.
1. If x'Ax > 0 for all nonzero x, then A is positive definite.
2. If x'Ax ≥ 0 for all nonzero x, then A is positive semi-definite.

They are symmetric by construction

X ∼ (µ, Σ), a' = [a1, ..., an] → Z = a'X with Var(Z) = a'Var(X)a, which has to be > 0 (a variance cannot be negative).

6. How does the conditional independence assumption (CIA) justify the inclusion of additional regressors (control variables) in addition to the „key regressor", i.e. the explanatory variable of most interest (e.g. small class, schooling, foreign indebtedness)? In which way can one, by invoking the CIA, get closer to the „experimental ideal"?

CIA in general: p(a|b, c) = p(a|c), or a ⊥ b | c: a is conditionally independent of b given c.
→ εi ⊥ x1 | x2, x3, ...
The more control variables, the closer we come to the ideal that εi ⊥ x1.


7. The CIA motivates the idea of a „matching estimator" briefly discussed in the lecture. What is the simple basic idea of it?

To measure treatment effects without random assignment: matching means sorting individuals into similar groups, where one group received treatment, and then comparing the outcomes.

E(yi|xi, di = 1) − E(yi|xi, di = 0) = local average treatment effect.
Sort groups according to the x's, replace the expectations by the sample means for the individuals, and the difference gives the treatment effect for a certain group.

Exi[E(yi|xi, di = 1) − E(yi|xi, di = 0)] = average treatment effect.
Averaging over all x's, we get the treatment effect over all groups.

Fourth set
0. Where does the null hypothesis enter into the t-statistic?

In the numerator, where we replace βk by its hypothesized value β̄k.

1. Explain what unbiasedness, efficiency, and consistency mean. Give an explanation of how we can use these concepts to assess the quality of the OLS estimator b = (X'X)⁻¹X'y. (We will be more specific about consistency in later lectures, this question is just to make you aware of the issue.)

Unbiasedness, E(b) = β: if an estimator is unbiased, its probability distribution has an expected value equal to the parameter it is supposed to estimate.

Efficiency, Var(β̃|X) ≥ Var(b|X) for any other linear unbiased estimator β̃: we would like the distribution of the sample estimates to be as tightly distributed around the true β as possible.

Consistency, P(|b − β| > ε) → 0 as n → ∞: b is asymptotically unbiased.

⇒ b is BLUE = best (efficiency) linear unbiased estimator

2. Refresh your basic mathematics and statistics:
a) v = a'Z. Z is an (n×1) vector of random variables, a' is a (1×n) vector of real numbers.
b) V = AZ. Z is an (n×1) vector of random variables, A is an (m×n) matrix of real numbers.
Compute for v the mean and variance, as well as the mean vector and variance-covariance matrix of V.

a) E(v) = E(a'Z) = a'E(Z), with a' (1×n) and E(Z) (n×1)
Var(v) = Var(a'Z) = a'Var(Z)a, with Var(Z) (n×n) and a (n×1)

b) E(V) = E(AZ) = A·E(Z), with A (m×n) and E(Z) (n×1)
Var(V) = Var(AZ) = A·Var(Z)·A', with Var(Z) (n×n) and A' (n×m)


3. Under which assumptions can the OLS estimator b = (X ′X)−1X ′y be called BLUE?

Best (Gauss-Markov theorem): 1.1–1.4
Linear: 1.1
Unbiased: 1.1–1.3

4. Which additional assumption is used to establish that the OLS estimator is conditionally normally distributed? Which key results from mathematical statistics have we employed to show the normality of the OLS estimator b = (X'X)⁻¹X'y which results from this assumption?

Adding assumption 1.5: ε|X ∼ MVN(0, σ²In).
As we know, b − β = (X'X)⁻¹X'ε, and using Fact 6:
b − β|X ∼ MVN(E(b − β|X), Var(b − β|X))
b − β|X ∼ MVN(0, σ²(X'X)⁻¹)
b|X ∼ MVN(β, σ²(X'X)⁻¹)

5. Why are the elements of b random variables in the first place?

The elements of b are functions of y and X, which are the outcomes of a random experiment (random sampling).

6. For which purpose is it important to know the distribution of the parameter estimate in the first place?

Hypothesis testing (including confidence intervals, critical values, p-values)

7. What does the expression „a random variable is distributed under the null hypothesis as ....“ mean?

A random variable has a distribution, as the realizations are only one of the ways the estimate can turn out. "Under the null hypothesis" means that we derive the distribution of the statistic assuming that the null hypothesis (e.g. βk = β̄k) is true.

8. Have we said something about the unconditional distribution of the OLS estimate b? Have we said something about the unconditional means? The unconditional variances and covariances?

Unconditional distribution: no
Unconditional means: yes (LTE)
Unconditional variances: no

9. What can we say about the conditional (on X) and unconditional distribution of the t- and z-statistic (under the null hypothesis)? What is their respective mean and variance?

The type of distribution stays the same when conditioning (it does not depend on X).
zk: N(0, 1), standardized with σ²[(X'X)⁻¹]kk.
tk: t(n−K), standardized with s²[(X'X)⁻¹]kk.
Both have mean zero under the null.

10. Explain

• what is meant by a type 1 error and a type 2 error?

• what is meant by „significance level“ = „size“.

• what is meant by the „power of a test“?

55

Page 56: Applied Eco No Metrics

Applied Econometrics – 14 Questions for Review

• Type 1 error α: rejected although correct; Type 2 error β: accepted although wrong

• Probability of making Type 1 error

• Power of a test = 1 − P(Type 2 error): the probability of rejecting the null when it is false

11. What are the two key properties that a good statistical test has to provide?

Tests should have a high power (=reject a false null) and should keep its size

12. There are two main schools of statistics: frequentist and Bayesian. Describe their main differences!

Frequentists: fix the probability (significance level) → H0 can be rejected ⇒ rejection approach (favored here).
Bayesians: the probability changes with the evidence we get from the data → H0 cannot be rejected, but we can get closer to it ⇒ confirmation approach.

13. What is the null and the alternative hypothesis of the t-test presented in the lecture? How is the test statistic constructed?

H0: βk = β̄k, HA: βk ≠ β̄k
bk|X ∼ N(βk, σ²[(X'X)⁻¹]kk); under H0: bk|X ∼ N(β̄k, σ²[(X'X)⁻¹]kk)
zk = (bk − β̄k)/√(σ²[(X'X)⁻¹]kk) ∼ N(0, 1)

14. On which grounds could you criticize a researcher for choosing a significance level of 0.00000001 (or 0.95, respectively)? Give examples of more „conventional" significance levels.

0.00000001: the hypothesis is almost never rejected (probability of a type 1 error too low, at the cost of a high type 2 error probability).
0.95: the hypothesis is rejected far too often (type 1 error too high).
Conventional levels: 10%, 5%, 1%.

15. Assume the value of a t-test statistic equals 3. The number of observations n in the study is 105, the number of explanatory variables is 5. Give a quick interpretation of this result.

t-stat = 3; n = 105 and K = 5 leads to df = 100.
We can use the standard normal distribution as df > 30.
For a 5% significance level the critical values are ±2 (rule of thumb), so we would reject the hypothesis.

16. Assume the value of a t-test statistic equals 0.4 and n − K = 100. Give a quick interpretation of this result.

We can’t reject the H0 on a 5% significance level (t-stat approximately normal and thus we can usethe rule of thumb for the critical values)

17. In a regression analysis with K = 4 you obtain parameter estimates b2 = 0.2 and b3 = 0.04. You are interested in testing (separately) the hypotheses β2 = 0.1 against β2 ≠ 0.1 and β3 = 0 against β3 ≠ 0. You obtain


s²(X'X)⁻¹ = [ 0.00125  0.023   0.0017   0.0005
              0.023    0.0016  0.015    0.0097
              0.0017   0.015   0.0064   0.0006
              0.0005   0.0097  0.00066  0.001  ]   (123)

where s² is an unbiased estimate of σ². Compute the standard errors of the two estimates, the two t-statistics and the associated p-values. Note that the t-test (as we have defined it) is two-sided. I assume that you are familiar with p-values, students of Quantitative Methods definitely should be. Otherwise, refresh your knowledge!

b2 = 0.2, β̄2 = 0.1 → standard error of b2 = √0.0016 = 0.04
b3 = 0.04, β̄3 = 0 → standard error of b3 = √0.0064 = 0.08

t-stat 2: (0.2 − 0.1)/0.04 = 2.5
t-stat 3: (0.04 − 0)/0.08 = 0.5

p-value 2 (two-sided): 2·(1 − 0.9938) = 0.0124
p-value 3 (two-sided): 2·(1 − 0.6915) = 0.617
Computing the p-value amounts to computing the implied α (the smallest significance level at which we would still reject).
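The same computations as a small Python sketch, using the diagonal elements 0.0016 and 0.0064 from the matrix above and the normal approximation for the two-sided p-values:

import numpy as np
from scipy.stats import norm

# diagonal elements of s^2 (X'X)^{-1} taken from the exercise
var_b2, var_b3 = 0.0016, 0.0064
b2, b3 = 0.2, 0.04
beta2_H0, beta3_H0 = 0.1, 0.0

se2, se3 = np.sqrt(var_b2), np.sqrt(var_b3)
t2 = (b2 - beta2_H0) / se2          # 2.5
t3 = (b3 - beta3_H0) / se3          # 0.5

# two-sided p-values using the normal approximation (as in the answer above)
p2 = 2 * (1 - norm.cdf(abs(t2)))
p3 = 2 * (1 - norm.cdf(abs(t3)))
print(se2, se3, t2, t3, p2, p3)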

18. List the assumptions that are necessary for the result that the z-statistic zk = (bk − β̄k)/√(σ²[(X'X)⁻¹]kk) is standard normally distributed under the null hypothesis, i.e. zk ∼ N(0, 1).

Required: 1.1–1.5. In addition: n − K > 30, or Var(b|X) known.

19. Is the t-statistic „nuisance parameter-free“?

Yes, as the nuisance parameter σ² is replaced by s².
Definition: a nuisance parameter is any parameter which is not of immediate interest but must be accounted for in the analysis of those parameters which are of interest.

20. I argue that using the quantile table of the standard normal distribution instead of the quantile table of the t-distribution doesn't make much of a difference in many applications: the respective quantiles are very similar anyway. Do you agree? Discuss.

This is correct for n − K > 30. For lower values of n − K, the t-distribution is more spread out in the tails.

21. I argue that with respect to the distribution of the t-statistic under the null, substituting for σ² the estimator σ̂² = (1/n)e'e instead of the unbiased estimator s² = (1/(n−K))e'e does not make much of a difference. When would you subscribe to that argument? Discuss.

For n → ∞ the bias factor (n − K)/n of σ̂² vanishes. Thus the statement is fine for large n. For small n the unbiased estimator should be used.

22.a. Show that P = X(X'X)⁻¹X' and M = In − P are symmetric (i.e. A = A') and idempotent (i.e. A = A·A). A useful result is: (BA)' = A'B'.


b. Show that ŷ = Xb = Py and e = y − Xb = My are orthogonal vectors, i.e. ŷ'e = 0.
c. Show that e = Mε and Σ ei² = e'e = ε'Mε.
d. Show that if xi1 = 1 we have Σ ei = 0 (use the OLS FOC) and that ȳ = x̄'b (i.e. the regression hyperplane passes through the sample means).

a.
P symmetric: [X(X'X)⁻¹X']' = [X']'[(X'X)⁻¹]'[X]' = X[(X'X)']⁻¹X' = X(X'X)⁻¹X' = P
M symmetric: [In − X(X'X)⁻¹X']' = In' − [X(X'X)⁻¹X']' = In − X(X'X)⁻¹X' = M
P idempotent: X(X'X)⁻¹X'·X(X'X)⁻¹X' = X(X'X)⁻¹(X'X)(X'X)⁻¹X' = X(X'X)⁻¹X' = P
M idempotent: [In − X(X'X)⁻¹X'][In − X(X'X)⁻¹X'] = In − X(X'X)⁻¹X' − X(X'X)⁻¹X' + X(X'X)⁻¹X' = In − X(X'X)⁻¹X' = M

b.
ŷ'e = (Py)'(My) = y'[X(X'X)⁻¹X']'[y − X(X'X)⁻¹X'y] = y'X(X'X)⁻¹X'y − y'X(X'X)⁻¹X'X(X'X)⁻¹X'y = 0

c.
Mε = [In − X(X'X)⁻¹X']ε = ε − X(X'X)⁻¹X'ε = (y − Xβ) − X(X'X)⁻¹X'(y − Xβ) = y − Xβ − X(X'X)⁻¹X'y + Xβ = y − X(X'X)⁻¹X'y = y − Xb = e
e'e = (Mε)'Mε = ε'M'Mε = ε'Mε

d.
Starting point (OLS FOC): X'(y − Xb) = X'y − X'Xb = 0

[1 ... 1; x12 ... xn2; ... ; x1K ... xnK] · [y1; ...; yn] − [1 ... 1; x12 ... xn2; ... ; x1K ... xnK] · [1 x12 ... x1K; ... ; 1 xn2 ... xnK] · [b1; ...; bK] = 0   (124)

First row only: Σ yi − [n, Σ xi2, ..., Σ xiK] · b = 0 → ȳ = b1 + b2 x̄2 + ... + bK x̄K   (125)

yi = xi'b + ei = ŷi + ei
(1/n) Σ yi = (1/n) Σ ŷi + (1/n) Σ ei, with (1/n) Σ ei = 0 if xi1 = 1
⇒ ȳ equals the mean of the fitted values ŷi.
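A numerical sketch verifying the properties shown in a)–d); the simulated design matrix is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(6)

n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat, e = X @ b, y - X @ b

print(np.allclose(P, P.T), np.allclose(P @ P, P))         # P symmetric and idempotent
print(np.allclose(M, M.T), np.allclose(M @ M, M))         # M symmetric and idempotent
print(np.allclose(y_hat, P @ y), np.allclose(e, M @ y))   # fitted values = Py, residuals = My
print(np.isclose(y_hat @ e, 0.0))                         # orthogonality y_hat'e = 0
print(np.isclose(e.sum(), 0.0), np.isclose(y.mean(), y_hat.mean()))  # with a constant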

Fifth set
1. What is the difference between the standard deviation of a parameter estimate bk and the standard error of the estimate?

The standard deviation is the square root of the true variance; the standard error is the square root of the estimated variance (and thus depends on the sample).

2. Discuss the pros and cons of alternative ways to present the results of a t-test:
a) parameter estimate and
*** for a significant parameter estimate (at α = 1%)
** for a significant parameter estimate (at α = 5%)
* for a significant parameter estimate (at α = 10%)
b) parameter estimate and p-value
c) parameter estimate and t-statistic
d) parameter estimate and parameter standard error
e) your preferred choice

a. Pro: choice for all conventional significance levels and easy to recognize; Con: not as much information as with a p-value (condensed information).
b. Pro: the p-value carries information for any significance level; Con: not as quickly recognizable as stars; only for a specific null hypothesis.
c. Pro: the t-stat is the basis for p-values; Con: no clear decision possible based only on the information given (a table is needed); only for a specific null hypothesis.
d. Pro: is the basis for t-statistics and p-values, shows the precision, and can be used to test different null hypotheses; Con: no immediate decision possible based on the information given (table, t-stat needed).
e. I would take case b., because it gives the most information very quickly.

4. Consider Fama-French’s asset pricing model and its „compatible“ regression representation (seeyour lecture notes of the first week). Suppose you want to test the restriction that none of these threerisk factors play a role in explaining the expected excess return of the asset (that is the parametersin the regression equation are all zero) State the null- and alternative hypothesis in proper statisticalterms and construct the Wald statistic for that hypothesis, i.e. define R and r.


R^e_{j,t+1} = β1 R^e_{m,t+1} + β2 HML_{t+1} + β3 SMB_{t+1} + ε_{j,t+1}

Hypotheses: H0: β1 = β2 = β3 = 0; HA: at least one ≠ 0, or equivalently H0: Rβ = r; HA: Rβ ≠ r

R = [1 0 0; 0 1 0; 0 0 1],  β = (β1, β2, β3)',  r = (0, 0, 0)',  b = (b1, b2, b3)'   (126)

5. How is the result from multivariate statistics that (z − µ)'Σ⁻¹(z − µ) ∼ χ²(rows(z)) with z ∼ MVN(µ, Σ) used to construct the Wald statistic to test linear hypotheses about the parameters?

We compute E(Rb|X) = R·E(b|X) = Rβ and Var(Rb|X) = R·Var(b|X)·R' = Rσ²(X'X)⁻¹R', so Rb|X ∼ MVN(Rβ, Rσ²(X'X)⁻¹R').
Under the null hypothesis:
E(Rb|X) = r (the hypothesized value)
Var(Rb|X) = Rσ²(X'X)⁻¹R' (unaffected by the hypothesis)
Rb|X ∼ MVN(r, Rσ²(X'X)⁻¹R') (this is where the null hypothesis goes in)
Using Fact 4: [Rb − E(Rb|X)]'[Var(Rb|X)]⁻¹[Rb − E(Rb|X)] = [Rb − r]'[Rσ²(X'X)⁻¹R']⁻¹[Rb − r] ∼ χ²(#r)
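A minimal sketch of this Wald test in Python, using the estimated variance s²(X'X)⁻¹ in place of σ²(X'X)⁻¹; the simulated data and the tested restrictions are illustrative assumptions:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)          # estimated Var(b|X)

# H0: beta2 = beta3 = 0, i.e. R beta = r (false here, since beta2 = 1 in the simulation)
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)

diff = R @ b - r
W = diff @ np.linalg.solve(R @ V @ R.T, diff)
p_value = chi2.sf(W, df=len(r))          # chi-square with #restrictions degrees of freedom
print(W, p_value)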

Sixth set
Suppose you have estimated a parameter vector b = (0.55, 0.37, 1.46, 0.01)' with an estimated variance-covariance matrix.

V̂ar(b|X) = [ 0.01    0.023   0.0017  0.0005
             0.023   0.0025  0.015   0.0097
             0.0017  0.015   0.64    0.0006
             0.0005  0.0097  0.0006  0.001  ]   (127)

a) Compute the 95% confidence interval for each parameter bk.
b) What does the specific confidence interval computed in a) tell you?
c) Why are the bounds of a confidence interval for βk random variables?
d) Another estimation yields an estimate bk with corresponding standard error se(bk). You conclude from computing the t-statistic tk = (bk − β̄k)/se(bk) that you can reject the null hypothesis H0: βk = β̄k at the α% significance level. Now you compute the (1 − α)% confidence interval. Will β̄k lie inside or outside the confidence interval?

a. Confidence intervals (bounds as realizations of random variables):
b1: 0.55 ± 1.96·0.1 = [0.354; 0.746]
b2: 0.37 ± 1.96·0.05 = [0.272; 0.468]
b3: 1.46 ± 1.96·0.8 = [−0.108; 3.028]
b4: 0.01 ± 1.96·0.032 = [−0.0527; 0.0727]


b. The first two intervals exclude zero (reject H0: βk = 0), the other two do not, at the 5% significance level for a two-sided test.

c. The bounds are functions of the data (random variables): se(bk) = √(s²[(X'X)⁻¹]kk).

d. It lies outside, as we rejected H0 based on the t-statistic.
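The interval computations from a) as a short Python sketch, using the estimates and the variance-covariance matrix (127) given above:

import numpy as np
from scipy.stats import norm

b = np.array([0.55, 0.37, 1.46, 0.01])
V = np.array([[0.01,   0.023,  0.0017, 0.0005],
              [0.023,  0.0025, 0.015,  0.0097],
              [0.0017, 0.015,  0.64,   0.0006],
              [0.0005, 0.0097, 0.0006, 0.001 ]])

se = np.sqrt(np.diag(V))
z = norm.ppf(0.975)                       # ~1.96 for a 95% interval
lower, upper = b - z * se, b + z * se
for k in range(len(b)):
    print(f"b{k+1}: [{lower[k]:.3f}, {upper[k]:.3f}]")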

2. Suppose computing the lower bound of the 95% confidence interval yields bk − tα/2(n − K)·se(bk) = −0.01. The upper bound is bk + tα/2(n − K)·se(bk) = 0.01. Which of the following statements are correct?

• With probability of 5% the true parameter βk lies in the interval −0.01 and 0.01.

• The null hypothesis H0: βk = β̄k cannot be rejected for values −0.01 ≤ β̄k ≤ 0.01 at the 5% significance level.

• The null hypothesis H0: βk = 1 can be rejected at the 5% significance level.

• The true parameter βk is with probability 1 − α = 0.95 greater than −0.01 and smaller than 0.01.

• The stochastic bounds of the 1 − α confidence interval overlap the true parameter with probability 1 − α.

• If the hypothesized parameter value β̄k falls within the range of the 1 − α confidence interval computed from the estimates bk and se(bk), then we do not reject H0: βk = β̄k at the significance level of 5%.

• false

• correct

• correct

• false

• correct (if we suppose that the null hypothesis is true)

• correct

Seventh Set
1.a) Show that if the regression includes a constant:
yi = β1 + β2 xi2 + ... + βK xiK + εi
then the variance of the dependent variable can be written as:

(1/N) Σi (yi − ȳ)² = (1/N) Σi (ŷi − ȳ)² + (1/N) Σi ei²   (128)


Hint: ȳ = mean of the ŷi.
b) Take your result from a) and formulate an expression for the coefficient of determination R².
c) Suppose you estimated a regression with an R² = 0.63. Interpret this value.
d) Suppose you estimate the same model as in c) without a constant. You know that you cannot compute a meaningful centered R². Therefore, you compute the uncentered R²uc = ŷ'ŷ/y'y = 0.84. Compare the two goodness-of-fit measures in c) and d). Would you conclude that the constant can be excluded because R²uc > R²?

a)
(1/N) Σ (yi − ȳ)²  [SST]   (129)
= (1/N) Σ (ŷi + ei − ȳ)² = (1/N) Σ ((ŷi − ȳ) + ei)²   (130)
= (1/N) Σ (ŷi − ȳ)² + (1/N) Σ ei² + (2/N) Σ ŷi ei − (2/N) ȳ Σ ei   (131)
= (1/N) Σ (ŷi − ȳ)² + (1/N) Σ ei² + (2/N) b' Σ xi ei  [= 0 by the FOC]  − (2/N) ȳ Σ ei  [= 0 with a constant]   (132)
= (1/N) Σ (ŷi − ȳ)²  [SSE]  + (1/N) Σ ei²  [SSR]   (133)

(This is the desired result, since ȳ equals the mean of the ŷi if xi1 = 1.)

b)
R² = SSE/SST = 1 − SSR/SST   (134)

c) 63% of the variance of the dependent variable is explained by the regression (by the variation of the fitted values).

d) R² and R²uc can never be compared, as they are based on different models.

2. In a hedonic price model the price of an asset is explained by its characteristics. In the following we assume that the housing price can be explained by its size sqrft (measured in square feet), the number of bedrooms bdrms and the size of the lot lotsize (also measured in square feet). Therefore, we estimate the following equation with OLS:
log(price) = β1 + β2 log(sqrft) + β3 bdrms + β4 log(lotsize) + ε
Results of the estimation can be found in the following table:
(a) Interpret the estimates b2 and b3.
(b) Compute the missing values for Std. Error and t-Statistic in the table and comment on the statistical significance of the estimated coefficients (H0: βj = 0 vs. H1: βj ≠ 0, j = 0, 1, 2, 3).
(c) Test the null hypothesis H0: β1 = 1 vs. H1: β1 ≠ 1.
(d) Compute the p-value for the estimate b2 and interpret the result.


(e) What is the null hypothesis of that specific F- statistic missing in the table? How does it relateto the R2? Compute the missing value of the statistic and interpret the result.(f) Interpret the value of R-squared.(g) An alternative specification of the model that excludes the lot size as an explanatory variableprovides you with values for the Akaike information criterion of -0.313 and a Schwartz criterion of-0.229. Which specification would you prefer?

a) Everything else constant, an increase of 1% in size leads to a 0.7% increase in price.
Everything else constant, a one-unit increase in bedrooms leads to a 3.696% increase in price.

b) Starting point: t-stat = coefficient/std. error, so SE(b2) = 0.0929 and t-stat(b3) = 1.3425.
The constant, log(sqrft) and log(lotsize) are significant at the 5% significance level; bdrms is not significant.

c) (−1.29704 − 1)/0.65128 = −3.527. As the critical value is ±1.98, we reject at the 5% significance level.

d) b2 = 0.70023 (log(sqrft)); t-stat = 0.70023/0.0929 ≈ 7.54 ⇒ the two-sided p-value is essentially 0 in the N(0,1) table.

e) H0: β2 = β3 = β4 = 0, i.e.

[0 1 0 0; 0 0 1 0; 0 0 0 1] · (β1, β2, β3, β4)' = (0, 0, 0)'   (135)

The F-statistic relates to the R² as

F = (R²/#r) / ((1 − R²)/(n − K)) = (0.64297/3) / ((1 − 0.64297)/84) ≈ 50.4   (136)

with #r = 3 restrictions, so we reject at any conventional significance level.

Alternative: obtain the restricted SSR as SSRR = Σ (yi − ȳ)² = n·(sample variance of yi) and use the SSR-based form of the F-statistic.

f) 64% of the variance of the dependent variable is explained by the regression.

g) −0.313 vs. −0.4968 for Akaike and −0.229 vs. −0.3842 for Schwarz. We prefer the first model (with lot size), as both AIC and SIC are smaller.
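A short sketch of the F-statistic computation from part (e), assuming n = 88 (so that n − K = 84) and three restrictions:

import numpy as np
from scipy.stats import f

R2, n, K = 0.64297, 88, 4          # n - K = 84 as used above (n = 88 is an assumption)
q = 3                              # number of restrictions: beta2 = beta3 = beta4 = 0

F = (R2 / q) / ((1 - R2) / (n - K))   # F-statistic expressed through R^2
p_value = f.sf(F, q, n - K)
print(F, p_value)                     # ~50.4, p-value essentially zero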


3. What can you do to narrow the parameter confidence bounds? Discuss the possibilities:
- Increase α
- Increase n
Can you explain how the effect of an increase of the sample size on the confidence bounds works?

- Increase α: not good, as the type 1 error increases.
- Increase n: the standard error decreases → the bounds narrow.

4. When would you use the uncentered R²uc and when would you use the centered R²? Why is the uncentered R²uc higher than the centered R²? What is the range of R²uc and R²?

R²uc: without a constant → higher (as the level is explained too).
R²c: with a constant → lower (explains only the variance around the level).
The range for both is from 0 to 1.

5. How do you interpret a R2 of 0.38?

38% of the variance of the dependent variable is explained by the regression.

6. Why would you use an adjusted R²adj? What is the idea behind the adjustment of the R²adj? Which values can the R²adj take?

R²adj: any value between −∞ and 1. It has a penalty term for the use of degrees of freedom (heavy parameterization) (Occam's razor).

7. What is the intuition behind the computation of AIC and SBC?

Find the smallest SSR, but add a penalty term for the use of degrees of freedom (= heavy parameterization).

Eighth set
1. A researcher conducts an OLS regression with K = 4 with a computer software that is unfortunately not able to report p-values. Besides the four coefficients and their standard errors, the program only reports the t-statistics that test the null hypothesis H0: βk = 0 against HA: βk ≠ 0 for k = 1, 2, 3, 4. Interpret the t-statistics below and compute the associated p-values. (Interpret the p-values for a reader who works with a significance level α = 5%.)
a) t1 = −1.99
b) t2 = 0.99
c) t3 = −3.22
d) t4 = 2.3

The t-stat shows where we are on the x-axis of the cdf.
Assumption: n − K > 30. Then we reject for β1, β3, β4 using the critical value ±1.96.

P-values (two-sided, using the standard normal cdf Φ):
a) Φ(1.99) = 0.9767, so 2·(1 − 0.9767) = 0.0466 → reject
b) Φ(0.99) = 0.8389, so 2·(1 − 0.8389) = 0.3222 → do not reject
c) Φ(3.22) is close to one, so the p-value is close to zero → reject
d) Φ(2.30) = 0.9893, so 2·(1 − 0.9893) = 0.0214 → reject


2. Explain in your own words: What is convergence in probability? What is almost sure convergence? Which concept is stronger?

Convergence in probability: the probability that the n-th element of a sequence of random variables lies within given bounds around a certain number (e.g. the mean) converges to 1 as n goes to infinity.

Almost sure convergence: the probability that the sequence converges to a certain number (e.g. the mean) is equal to 1.

→ Almost sure convergence is stronger, as it requires Zn itself to converge to α, whereas convergence in probability only requires Zn to lie within bounds around α with probability approaching one.

3. Illustrate graphically the concept of convergence in probability. Illustrate graphically a random sequence that does not converge in probability.

See the lecture notes.

4. Review the notion of non-stochastic convergence. In which way do we refer to the notion of non-stochastic convergence when we consider almost sure convergence, convergence in probability and convergence in mean square? What are the non-stochastic sequences in each of the three modes of stochastic convergence?

In non-stochastic convergence we study what happens to a sequence as n goes to infinity (there is no randomness in the elements of the sequence).
In →a.s., →p and →m.s. we have n going to infinity as the non-stochastic component; as the stochastic component we have to consider the different "worlds" of infinite draws.

Examples of the underlying non-stochastic sequences: for →p, a sequence of probabilities, which converges to 1; for →a.s., the sequence of realizations for a fixed draw, which converges to the limit (with probability 1); for →m.s., the sequence of mean squared errors, which converges to 0.

5. Explain in your own words: What does convergence in mean square mean? Does convergence in mean square imply convergence in probability? Or does convergence in probability imply convergence in mean square?

Convergence in mean square: the expected squared deviation E[(Zn − α)²] goes to 0 as n grows to infinity (so bias and variance vanish). Convergence in mean square implies convergence in probability.

6. Illustrate graphically the concept of convergence in distribution. What does convergence in distribution mean? Think of an example and provide a graphical illustration where the c.d.f. of the sequence of random variables does not converge in distribution.

Convergence in distribution: the sequence Zn converges to Z if the distribution function Fn of Zn converges to the cdf of Z at each (continuity) point of FZ.
With this mode of convergence, we increasingly expect the next outcome in a sequence of random experiments to be better and better modeled by a given probability distribution.

Examples where convergence in distribution does not work:
1. We draw random variables with probability 0.5 from a normal distribution with either mean 0 or mean 1.
2. The distribution of bn − β, as in the limit it does not have a (non-degenerate) distribution (only a point).
3. We draw from a distribution whose variance does not exist; computing the means does not have a limit distribution.

7. I argue that extending convergence almost surely/in probability/in mean square to vector sequences (or matrices) does not increase complexity, as the basic concept remains the same. However, extending convergence in distribution of a random sequence to a random vector sequence entails an increase of complexity. Why?

For →a.s., →p and →m.s., extending the concepts to vectors only requires element-wise convergence.
For →d, the joint distribution matters: all K elements of the vector have to lie below their respective thresholds at the same time, for every element of the sequence Zn.

8. Convergence in distribution implies the existence of a limit distribution. What do we mean by that?

{Zn} has a distribution Fn. The limit distribution F of Zn is the distribution to which we assume Fn converges at every point (F and Fn belong to the same class of distributions; we just adjust the parameters).

9. Suppose that the random sequence zn converges in distribution to z, where z is a χ²(2) random variable. Write this formally using two alternative notations.

zn →d z ∼ χ²(2)
zn →d χ²(2)

10. What assumptions have to be fulfilled so that you can apply Khinchin's WLLN?

{Zi} is a sequence of iid random variables with E(Zi) = µ < ∞.

11. I argue that extending the WLLN to vector random sequences does not add complexity to the case of a scalar sequence, it just saves space because of notational convenience. Why?

For a vector we have to use element-wise convergence (of the means towards the expected values). We use the same argument as for convergence in probability (we just have to define the means and expectations row by row for element-wise convergence).

Ninth set
1. What does the Lindeberg-Levy (LL) central limit theorem state? What assumptions have to be fulfilled so that you can apply the LL CLT?

Formula: √n(z̄n − µ) →d N(0, σ²).

The scaled difference between the sample mean of a sequence and its expected value converges in distribution to a normal distribution, and thus the sample mean itself is approximately normal: z̄n ∼a N(µ, σ²/n).

Assumptions:


• {Zi} iid

• E(Zi) = µ <∞

• V ar(Zi) = σ2 <∞

• WLLN can be applied
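A small Monte Carlo sketch of the Lindeberg-Levy CLT: means of iid draws from a clearly non-normal (exponential) distribution are standardized and behave approximately like N(0, 1); the particular distribution and sample sizes are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(8)

# draws from a clearly non-normal distribution (exponential: mu = sigma^2 = 1)
mu, sigma2, n, reps = 1.0, 1.0, 200, 5000
z = rng.exponential(scale=1.0, size=(reps, n))

zbar = z.mean(axis=1)
standardized = np.sqrt(n) * (zbar - mu) / np.sqrt(sigma2)

# by the Lindeberg-Levy CLT the standardized means are approximately N(0, 1)
print("mean:", standardized.mean(), " variance:", standardized.var())
# compare a tail probability with the normal benchmark of about 0.025
print("P(standardized > 1.96):", (standardized > 1.96).mean())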

2. Name the concept that is associated with the following shorthand notations and explain their meaning:
a. zn →d N(0, 1): convergence in distribution
b. plim_{n→∞} zn = α: convergence in probability
c. zn →m.s. α: convergence in mean square
d. zn →d z: convergence in distribution
e. √n(z̄n − µ) →d MVN(0, Σ): Lindeberg-Levy CLT (multivariate CLT)
f. yn →p α: convergence in probability
g. zn ∼a N(µ, σ²/n): approximate (asymptotic) distribution, following from the univariate CLT
h. zn →a.s. α: almost sure convergence
i. zn² →d χ²(1): convergence in distribution / Lemma 2

3. Apply the „useful lemmas" of the lecture. State the name of the respective lemma(s) that you use. Whenever matrix multiplications or summations are involved, assume that the respective operations are possible. The underscore means that we deal with vector or matrix sequences. µ and A indicate vectors or matrices of real numbers. Im is the identity matrix of dimension m:
a. zn →p α, then: plim exp(zn) = ?
b. zn →d z ∼ N(0, 1), then: zn² →d ?
c. zn →d z ∼ MVN(µ, Σ), An →p A, then: An(zn − µ) ∼a ?
d. xn →d MVN(µ, Σ), yn →p α, An →p A, then: An xn + yn →d ?
e. xn →d x, yn →p 0 (op(1)), then: xn + yn →d ?
f. xn →d x ∼ MVN(µ, Σ), yn →p 0, then: plim xn·yn = ?
g. xn →d MVN(0, Im), An →p A, then: zn = An xn →d ? and then: zn'(An An')⁻¹ zn ∼a ?

a. Using Lemma 1: plim exp(zn) = exp(α) (if exp(·) does not depend on n)
b. Using Lemma 2: zn² →d [N(0, 1)]² = χ²(1)
c. Using Lemma 2: zn − µ →d MVN(0, Σ), and then using Lemma 5: An(zn − µ) ∼a MVN(0, AΣA')
d. Using Lemma 5: An xn →d A·x = MVN(Aµ, AΣA'), and then using Lemma 3: An xn + yn →d A·x + α = MVN(Aµ + α, AΣA')
e. Using Lemma 3: xn + yn →d x + 0 = x
f. Using Lemma 4: xn·yn →p 0
g. Using Lemma 5: zn = An xn →d A·x = MVN(0, AA'), and then using Lemma 6: zn'(An An')⁻¹ zn ∼a (Ax)'(AA')⁻¹(Ax), which by Fact 4 is χ²(rows(z))

4. When large sample theory is used to derive the properties of the OLS estimator, the set of assumptions for the finite sample properties of the OLS estimator is altered. Which assumptions are retained? Which are replaced?

Two are retained:
(1.1) → (2.1) Linearity
(1.3) → (2.4) Rank condition

Four are replaced:
iid → (2.2) dependencies allowed
(1.2) → (2.3) strict exogeneity replaced by orthogonality: E(xik·εi) = 0
(1.4) → dropped
(1.5) → dropped

5. What are the properties of bn = (X'X)⁻¹X'y under the „new" assumptions?
bn →p β (consistency)
bn ∼a MVN(β, Avar(b)/n) (approximate normal distribution)
→ Unbiasedness & efficiency are lost.

6. What does CAN mean?
CAN = Consistent and Asymptotically Normal.

7. Does consistency of bn need an iid sample of dependent variable and regressors?
No; iid is a special case. A stationary & ergodic sample (with {gi} a martingale difference sequence) is sufficient, i.e. identically distributed but not necessarily independent observations.

8. Where does a WLLN and where does a CLT come into play when deriving the properties of b = (X'X)⁻¹X'y?
Deriving consistency: we use Lemma 1 twice. To apply Lemma 1 to [(1/n) Σ xi xi'], we have to ensure that (1/n) Σ xi xi' →p E(xi xi'), which we do by applying the WLLN.
Deriving the distribution: we define (1/n) Σ xi εi = ḡ and apply the CLT to get √n(ḡ − E(gi)) →d MVN(0, E(gi gi')). This is then used together with Lemma 5.

Tenth Set
1. At which stage of the derivation of the consistency property of the OLS estimator do we have to invoke a WLLN?

2. What does it mean when an estimator has the CAN property?

3. At which stage of the derivation of the asymptotic normality of the OLS estimator do we have to invoke a WLLN and when a CLT?

1.–3.: see the Ninth set.


4. Which of the useful lemmas 1-6 is used at which stage of a) the consistency proof and b) the asymptotic normality proof?

a) We use Lemma 1 to show that, together with the WLLN, [(1/n) Σ xi xi']⁻¹ →p [E(xi xi')]⁻¹.

b) We know that √n(b − β) = [(1/n) Σ xi xi']⁻¹ √n ḡ = An·xn, with
An →p A = Σxx⁻¹ and xn →d x ∼ MVN(0, E(gi gi')),
so we can apply Lemma 5: √n(b − β) →d MVN(0, Σxx⁻¹ E(gi gi') Σxx⁻¹).

5. Explain the difference of the assumptions regarding the variances of the disturbances in the finite sample context and using asymptotic reasoning.

Finite sample variance: E(εi²|X) = σ² (assumption 1.4)
Large sample variance: E(εi²|xi) = σ²

→ We only condition on xi and not on all x’s at the same time

6. There is a special case when the finite sample variances of the OLS estimator based on finite sample assumptions and based on large sample theory assumptions (almost) coincide. When does this happen?

When we have only one observation i.

7. How would you counter the argument of someone who says that working with the variance-covariance estimate s²(X′X)−1 is quite OK, as it is mainly consistency of the parameter estimates that counts?

Using s²(X′X)−1 we would have to assume conditional homoskedasticity, whereas (X′X)−1S(X′X)−1 allows us to get rid of that assumption. Consistent point estimates alone are not enough: standard errors, t- and Wald tests are based on the variance estimate, so an inconsistent variance estimate invalidates inference under heteroskedasticity.
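A sketch of this point on simulated data (the heteroskedastic design is an assumption made for illustration): the conventional s²(X′X)−1 and the robust sandwich (X′X)−1S(X′X)−1 give noticeably different standard errors, so it is inference, not the point estimates, that is at stake.

```python
# Conventional vs. heteroskedasticity-robust variance estimates on an invented DGP.
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
w = rng.uniform(0.5, 3.0, size=n)
X = np.column_stack([np.ones(n), w])
eps = rng.normal(0, 1, size=n) * w            # conditional variance = w_i^2 (heteroskedastic)
y = X @ np.array([1.0, 2.0]) + eps

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

V_conv = (e @ e) / (n - 2) * XtX_inv          # s^2 (X'X)^-1, valid only under homoskedasticity
S_hat = (X * e[:, None]**2).T @ X             # sum_i e_i^2 x_i x_i'
V_robust = XtX_inv @ S_hat @ XtX_inv          # sandwich (X'X)^-1 S (X'X)^-1

print("conventional SEs:", np.sqrt(np.diag(V_conv)))
print("robust SEs:      ", np.sqrt(np.diag(V_robust)))
```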

8. Consider the following assumptions:
(a) linearity
(b) rank condition: the KxK matrix E(xixi′) = Σxx is nonsingular
(c) predetermined regressors: E(gi) = 0 where gi = xi·εi
(d) gi is a martingale difference sequence with finite second moments

i) Show that under those assumptions, the OLS estimator is approximately normally distributed:

√n(b − β) →d N(0, Σxx−1 E(εi² xixi′) Σxx−1)   (137)

Starting point: the sampling error.
We have √n(b − β) = [1/n Σ xixi′]−1 · √n ḡ, with An = [1/n Σ xixi′]−1 and xn = √n ḡ. For these:
An →p A = Σxx−1 (only possible if Σxx is nonsingular)
xn →d x ∼ MVN(0, E(gigi′)) (as √n(ḡ − E(gi)) →d MVN(0, E(gigi′)) by the CLT)
So we can apply Lemma 5: √n(b − β) →d MVN(0, Σxx−1 E(gigi′) Σxx−1) = MVN(0, Avar(b))
→ bn ∼a MVN(β, Avar(b)/n)
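A Monte Carlo sketch of this result (the heteroskedastic DGP is an illustrative assumption): the simulated variance of √n(b − β) for the slope should be close to a plug-in value of the corresponding element of Avar(b) = Σxx−1 E(εi² xixi′) Σxx−1.

```python
# Sampling distribution of sqrt(n)*(b - beta) under heteroskedasticity (invented DGP).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 5_000
beta = np.array([1.0, 2.0])

slope_dev = np.empty(reps)
for r in range(reps):
    w = rng.uniform(0.5, 3.0, size=n)
    X = np.column_stack([np.ones(n), w])
    y = X @ beta + rng.normal(0, 1, size=n) * w     # Var(eps_i | w_i) = w_i^2
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    slope_dev[r] = np.sqrt(n) * (b[1] - beta[1])

print("simulated mean of sqrt(n)(b2 - beta2):", slope_dev.mean(), "(should be near 0)")
print("simulated variance:", slope_dev.var())

# Plug-in value of [Sxx^-1 E(eps^2 x x') Sxx^-1]_(2,2) from one very large sample:
m = 1_000_000
w = rng.uniform(0.5, 3.0, size=m)
Xm = np.column_stack([np.ones(m), w])
Sxx_inv = np.linalg.inv(Xm.T @ Xm / m)
S = (Xm * (w**2)[:, None]).T @ Xm / m               # E(eps^2 x x'), since E(eps^2|w) = w^2
print("plug-in Avar of the slope:", (Sxx_inv @ S @ Sxx_inv)[1, 1])
```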

ii) Further, show that assumption (d) implies that the εi are serially uncorrelated, i.e. E(εi · εi−j) = 0.

E(gi|gi−1, ..., g1) = 0. As xi1 = 1 we focus on the first row:
E(εi|εi−1, ..., ε1, εi−1xi−1,2, ..., ε1x1K) = 0
By the LIE, E[E(y|x,z)|x] = E(y|x), with x = (εi−1, ..., ε1), y = εi and z = (εi−1xi−1,2, ..., ε1x1K):
E(εi|εi−1, ..., ε1) = 0
By the LTE: E(εi) = 0

The second step is in exercise 19.

9. Show that the test statistic

tk = (bk − βk) / √(Âvar(bk)/n)   (138)

converges in distribution to a standard normal distribution. Note that bk is the k-th element of b and Âvar(bk) is the (k,k) element of the KxK matrix Âvar(b), the estimator of Avar(b). Use the facts that √n(bk − βk) →d N(0, Avar(bk)) and Âvar(b) →p Avar(b). Why is the latter true? Hint: use continuous mapping and the Slutzky theorem (the "useful lemmas")!

tk = (bk − βk)/√(Âvar(bk)/n) = √n(bk − βk)/√(Âvar(bk)) ∼a N(0,1)

We know:
1. √n(bk − βk) →d N(0, Avar(bk)) by the CLT and from the joint distribution of the estimators.
2. 1/√(Âvar(bk)) →p 1/√(Avar(bk)) by Lemma 1 and from the derivation of Âvar(b).
⇒ Lemma 5, with a = 1/√(Avar(bk)) and x ∼ N(0, Avar(bk)):
tk = √n(bk − βk)/√(Âvar(bk)) →d N(E(a·x), Var(a·x)) = N(0, a²·Var(x)) = N(0, Avar(bk)/(√(Avar(bk)))²) = N(0,1)

10. Show that the Wald statistic:

W = (Rb − r)′ [R (Âvar(b)/n) R′]−1 (Rb − r)   (139)

converges in distribution to a Chi-square with degrees of freedom equal to the number of restrictions. As a hint, rewrite the equation above as W = cn′ Qn−1 cn. Use Hayashi's lemma 2.4(d) and the footnote on page 41.

W = [Rb − r]′ [R (Âvar(b)/n) R′]−1 [Rb − r]
Under the null: Rb − r = Rb − Rβ = R(b − β)
cn = R·√n(bn − β) →d c ∼ MVN(0, R·Avar(b)·R′) (Lemma 2)
Qn = R·Âvar(b)·R′ →p Q = R·Avar(b)·R′ = Var(c) (derivation of Âvar(b) and Lemma 1)
⇒ Wn = cn′ Qn−1 cn →d c′ Q−1 c = c′ [Var(c)]−1 c ∼ χ²(#r), where #r = rows(c) (Lemma 6 + Fact 4)
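A sketch of the robust Wald test on simulated data (the DGP and the restrictions R, r are hypothetical choices for illustration): since the imposed restrictions are true in the simulation, W should behave like a χ²(2) draw and exceed the 5% critical value of 5.99 only rarely.

```python
# Robust Wald test of the (true) joint restriction beta_2 = 2 and beta_3 = 0.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
w1, w2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), w1, w2])
y = 1.0 + 2.0 * w1 + 0.0 * w2 + rng.normal(0, 1, size=n) * (1 + np.abs(w1))  # heteroskedastic

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
Avar_b_over_n = XtX_inv @ ((X * e[:, None]**2).T @ X) @ XtX_inv   # robust Avar(b)/n

R = np.array([[0.0, 1.0, 0.0],       # restriction: beta_2 = 2
              [0.0, 0.0, 1.0]])      # restriction: beta_3 = 0
r = np.array([2.0, 0.0])
c = R @ b - r
W = c @ np.linalg.inv(R @ Avar_b_over_n @ R.T) @ c
print("W =", W, " -> compare with the 5% critical value of chi^2(2), which is 5.99")
```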

11. What is an ensemble mean?

An ensemble consists of a large number of mental copies of the process. The ensemble mean is obtained by averaging across all these copies (e.g. across all ensemble forecasts), which has the effect of filtering out features that are less predictable.

12. Why do we need stationarity and ergodicity in the first place?

Especially for time series we want to get rid of the iid assumption, as we observe dependencies. Using stationarity & ergodicity allows dependencies, but we still draw from the same distribution, and the dependence between draws that are far apart from each other must die out. Moreover, mean and variance do not change over time → we need this for correct estimates and valid tests.

13. Explain the concepts of weak stationarity and strong stationarity.

Strong stationarity: the joint distribution of (zi, zir) is the same as the joint distribution of (zj, zjr) if i − ir = j − jr → only the relative position matters.

Weak stationarity: only the first two moments are constant (not the whole distribution), and cov(zi, zi−j) only depends on j.

15. When is a stationary process ergodic?

If the dependence decreases with distance so that:
lim n→∞ E[f(zi, zi+1, ..., zi+k) · g(zi+n, zi+n+1, ..., zi+n+l)] = E[f(zi, zi+1, ..., zi+k)] · E[g(zi+n, zi+n+1, ..., zi+n+l)]
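A simulation sketch tying this to the ensemble mean of question 11 (the stationary AR(1) process and its parameters are illustrative choices): for an ergodic process, the time average over one long realization and the ensemble average over many independent copies both converge to the same mean.

```python
# Ergodicity sketch: time average of one long AR(1) path vs. ensemble average over copies.
import numpy as np

rng = np.random.default_rng(5)
phi, mu, sigma = 0.8, 2.0, 1.0          # stationary, ergodic AR(1): |phi| < 1

def ar1_path(T):
    # start from the stationary distribution so every z_t has the same marginal
    z = np.empty(T)
    z[0] = mu + rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, T):
        z[t] = mu + phi * (z[t-1] - mu) + rng.normal(0, sigma)
    return z

time_avg = ar1_path(100_000).mean()                                 # one long realization
ensemble_avg = np.mean([ar1_path(100)[-1] for _ in range(10_000)])  # many copies, one date
print("time average:    ", time_avg)
print("ensemble average:", ensemble_avg, " (both should be close to mu = 2.0)")
```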

16. Which assumptions have to be fulfilled to apply Kinchin's WLLN? Which of these assumptions are weakened by the ergodic theorem? Which assumption is used instead?

Assumptions:
{Zi} iid → instead: stationarity & ergodicity
E(Zi) = µ < ∞

17. Which assumptions have to be fulfilled to apply the Lindeberg-Levy CLT? Is stationarity and ergodicity sufficient to apply a CLT? What property of the sequence {gi} = {εixi} do we assume to apply a CLT?

Assumptions:
{Zi} iid → instead: stationary & ergodic m.d.s. (stationarity & ergodicity alone are not sufficient)
E(Zi) = µ < ∞
Var(Zi) = σ² < ∞

18. What do you call a stochastic process for which E(gi|gi−1, gi−2, ...) = 0?


A martingale difference sequence (m.d.s.).

19. Show that if a constant is included in the model, it follows from E(gi|gi−1, gi−2, ...) = 0 that cov(εi, εi−j) = 0 for all j ≠ 0. Hint: Use the law of iterated expectations.

From exercise 8: E(εi|εi−1, ...) = E(εi) = 0
Cov(εi, εi−j) = E(εi · εi−j) − E(εi)·E(εi−j), and E(εi) = 0, so it remains to show that E(εi · εi−j) = 0:
E(εi · εi−j) = E[E(εi · εi−j | εi−1, ..., εi−j, ..., ε1)] (by the LTE, backwards)
= E[εi−j · E(εi | εi−1, ..., εi−j, ..., ε1)] (by linearity of conditional expectations)
= E[εi−j · 0] = 0 (see above)
→ Cov(εi, εi−j) = 0

20. When using the heteroskedasticity-consistent covariance matrix

Avar(b) = Σxx−1 E(εi² xixi′) Σxx−1   (140)

which assumption regarding the covariances of the disturbances εi and εi−j do we make?

Cov(εi, εi−j) = 0

Eleventh Set

1. When and why would you use GLS? Describe the limiting nature of the GLS approach.

GLS is used to estimate the unknown parameters of a linear regression model when the errors are heteroskedastic and/or correlated across observations (cf. WLS).
Assumptions:

• Linearity: yi = xi′β + εi

• Full rank

• Strict exogeneity

• Var(ε|X) = σ²V(X). V(X) has to be known, symmetric and positive definite.

Limits:
V(X) is normally not known, and thus we even have to estimate more. If Var(ε|X) is estimated, the BLUE property is lost and βGLS might even be worse than OLS.
If X is not strictly exogenous, GLS might be inconsistent.
Large sample properties are more difficult to obtain.

2. A special case of the GLS approach is weighted least squares (WLS). What difficulties could arise in a WLS estimation? How are the weights constructed?

Observations are weighted by the inverse of their standard deviations, so that precise observations get more weight (a higher penalty for large residuals). WLS is used when all the off-diagonal entries of Σ (the variance-covariance matrix) are zero.
Here: Var(εi|X) = σ²V(xi) with V(xi) = Zi′α (V(xi) is typically unknown).
→ So again the variance has to be estimated, and thus our estimates might be inconsistent and even less efficient than OLS. A feasible-WLS sketch follows below.
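The feasible-WLS sketch referred to above (the DGP and the variance model are illustrative assumptions, not the lecture's specification): squared OLS residuals are used to estimate V(xi), and each observation is then divided by its estimated standard deviation.

```python
# Feasible WLS: estimate V(x_i) from squared OLS residuals, then reweight by 1/sqrt(V_hat).
import numpy as np

rng = np.random.default_rng(6)
n = 2_000
w = rng.uniform(1.0, 4.0, size=n)
X = np.column_stack([np.ones(n), w])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, size=n) * w   # sd of eps_i is w_i (assumed)

# Step 1: OLS to get residuals
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols

# Step 2: model the variance, here V(x_i) = alpha_1 + alpha_2 * w_i^2 (a modelling choice)
Z = np.column_stack([np.ones(n), w**2])
alpha = np.linalg.lstsq(Z, e**2, rcond=None)[0]
V_hat = np.clip(Z @ alpha, 1e-3, None)        # guard against non-positive fitted variances

# Step 3: WLS = OLS on data divided by the estimated standard deviations
sw = 1.0 / np.sqrt(V_hat)
b_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print("OLS:", b_ols, "  WLS:", b_wls)
```

If the variance model is badly misspecified, the reweighted estimator can indeed be less efficient than plain OLS, which is exactly the difficulty mentioned above.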

3. When does exact multicollinearity occur? What happens to the OLS estimator in this case?

Occurs when one regressor is a linear combination of other regressors: rank(X) ≠ K.
Then assumption 1.3/2.4 is violated and (X′X)−1, and thus the OLS estimator, cannot be computed. We can still estimate the linear combination of the parameters, though.

4. How is the OLS estimator and its standard error affected by (not exact) multicollinearity?

• BLUE result is not affected

• Var(b|X) is affected: coefficients have high standard errors (effect: wide confidence intervals and low t-stats and thus difficulties with rejection; see the simulation sketch after this list)

• Estimates may have the wrong sign

• Small changes in the data produce wide swings in the parameters
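The simulation sketch referred to in the list above (design and parameter values are made up): as the correlation between the two regressors approaches one, the reported standard errors blow up even though OLS remains unbiased.

```python
# Standard-error inflation under (near) multicollinearity: two regressors with correlation rho.
import numpy as np

rng = np.random.default_rng(7)
n = 500
for rho in (0.0, 0.9, 0.99, 0.999):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    W = rng.multivariate_normal(np.zeros(2), cov, size=n)
    X = np.column_stack([np.ones(n), W])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s2 = ((y - X @ b) @ (y - X @ b)) / (n - 3)
    se = np.sqrt(np.diag(s2 * XtX_inv))       # conventional OLS standard errors
    print(f"rho={rho}: b2={b[1]: .2f} (se {se[1]:.2f}), b3={b[2]: .2f} (se {se[2]:.2f})")
```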

5. Which steps can be taken to overcome the problem of multicollinearity?

• Increase n (and thus the variance of the x’s)

• Get a smaller σ² (= a better fitting model)

• Get smaller correlation (=exclude some regressors, but: omitted variable bias)

Other ways:

• Don’t do anything (OLS is still BLUE)

• Joint hypothesis testing

• Use empirical data

• Combine the multicollinear variables

Twelfth Set

1. When does an omitted regressor not cause biased estimates?

With an omitted regressor we get: b1 = β1 + (X1′X1)−1X1′X2 β2 + (X1′X1)−1X1′ε, where the last term is zero under strict exogeneity.
→ No bias if β2 = 0, i.e. X2 is not part of the model in the first place,
→ or if (X1′X1)−1X1′X2 = 0, i.e. the regression of X2 on X1 gives you zero coefficients.
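A simulation sketch of both escape routes (the DGP is an illustrative assumption): the short regression's slope on X1 is biased only when β2 ≠ 0 and X2 is correlated with X1.

```python
# Omitted-variable bias: b1 is biased only if beta2 != 0 AND X2 is correlated with X1.
import numpy as np

rng = np.random.default_rng(8)
n, reps = 1_000, 2_000

def short_reg_slope(corr, beta2):
    x2 = rng.normal(size=n)
    x1 = corr * x2 + np.sqrt(1 - corr**2) * rng.normal(size=n)   # corr(x1, x2) = corr
    y = 1.0 + 1.0 * x1 + beta2 * x2 + rng.normal(size=n)
    X1 = np.column_stack([np.ones(n), x1])                       # regression that omits x2
    return np.linalg.lstsq(X1, y, rcond=None)[0][1]

for corr, beta2 in [(0.0, 1.0), (0.6, 0.0), (0.6, 1.0)]:
    slopes = [short_reg_slope(corr, beta2) for _ in range(reps)]
    print(f"corr(X1,X2)={corr}, beta2={beta2}: mean b1 = {np.mean(slopes):.3f} (true 1.0)")
```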

2. Explain the term endogeneity bias. Give a practical economic example when an endogeneity bias occurs.


Endogeneity = no strict exogeneity, i.e. no predetermined regressors: the x's are correlated with the error term → the estimator is biased.
Example (Haavelmo): Ci = α0 + α1Yi + ui (Ii is part of ui and correlated with Yi)

3. What is the solution to the endogeneity problem in a linear regression framework?

IV estimation: use instrumental variables that are correlated with the (endogenous) regressors but uncorrelated with ui.

4. Use the same tools used to derive the CAN property of the OLS estimator to derive the CAN property of the IV estimator. Start with the sampling error and make your assumptions on the way (applicability of WLLN, CLT, ...).

Compute the sampling error:
δ̂ − δ = (X′Z)−1X′y − δ
= (X′Z)−1X′(Zδ + ε) − δ
= (X′Z)−1X′Zδ + (X′Z)−1X′ε − δ
= δ + (X′Z)−1X′ε − δ
= (X′Z)−1X′ε = [1/n Σ xizi′]−1 (1/n Σ xiεi)

Consistency: δ̂ = [1/n Σ xizi′]−1 (1/n Σ xiyi) →p δ
1. 1/n Σ xizi′ →p E(xizi′) (WLLN), and by Lemma 1: [1/n Σ xizi′]−1 →p [E(xizi′)]−1
2. 1/n Σ xiεi →p E(xiεi) = 0 (WLLN)
⇒ δ̂ − δ = [1/n Σ xizi′]−1 (1/n Σ xiεi) →p [E(xizi′)]−1 · E(xiεi) = 0

Asymptotic normality: √n(δ̂ − δ) →d MVN(0, Avar(δ̂))
√n · sampling error: √n(δ̂ − δ) = [1/n Σ xizi′]−1 · √n ḡ, with An = [1/n Σ xizi′]−1 and xn = √n ḡ = √n (1/n Σ xiεi).
1. An = [1/n Σ xizi′]−1 →p A = [E(xizi′)]−1
2. xn = √n ḡ →d x ∼ MVN(0, E(gigi′)) (apply the CLT, as E(gi) = 0; cf. p. 33)
⇒ Lemma 5: √n(δ̂ − δ) →d MVN(0, [E(xizi′)]−1 E(gigi′) [E(xizi′)−1]′), with Avar(δ̂) = [E(xizi′)]−1 E(gigi′) [E(xizi′)−1]′
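A sketch of the just-identified IV estimator δ̂ = (X′Z)−1X′y on a made-up endogenous design (variable names and parameter values are illustrative; zi are the regressors and xi the instruments, as above). OLS is reported alongside to show its inconsistency.

```python
# IV vs OLS on a simple endogenous design: the regressor w is correlated with the error
# through a common component u; the instrument q shifts w but not the error.
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
u = rng.normal(size=n)                               # unobserved confounder
q = rng.normal(size=n)                               # instrument
w = 1.0 + 0.8 * q + u + 0.5 * rng.normal(size=n)     # endogenous regressor
eps = u + 0.5 * rng.normal(size=n)                   # error shares u with w
y = 1.0 + 2.0 * w + eps

Z = np.column_stack([np.ones(n), w])                 # regressors z_i
X = np.column_stack([np.ones(n), q])                 # instruments x_i

delta_ols = np.linalg.lstsq(Z, y, rcond=None)[0]
delta_iv = np.linalg.solve(X.T @ Z, X.T @ y)         # (X'Z)^-1 X'y, just-identified IV
print("OLS (inconsistent):", delta_ols)
print("IV  (consistent):  ", delta_iv, " true delta = [1, 2]")
```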

5. When would you use an IV estimator instead of an OLS estimator? (Hint: Which assumption of the OLS estimation is violated and what is the consequence?)

When xi and εi are correlated (assumption 2.3 is violated) → OLS is then biased and inconsistent.

6. Describe the basic idea of instrumental variables estimation. How are the unknown parameters related to the data generating process?

IV can be used to obtain consistent estimators in the presence of omitted variables.
xi and zi are correlated; xi and ui are not correlated → xi = instrument. The parameters are then pinned down by the orthogonality condition E(xi · (yi − zi′δ)) = 0.


7. Which assumptions are necessary to derive the instrumental variables estimator and the CAN property of the IV estimator?

Assumptions:
3.1 Linearity
3.2 Ergodic stationarity
3.3 Orthogonality conditions
3.4 Rank condition for identification (full rank)

8. Where do the assumptions enter the derivation of the instrumental variables estimator and the CAN property of the IV estimator?

Derivation:
E(xi · εi) = E(xi(yi − zi′δ)) = 0, using 3.1 (linearity, εi = yi − zi′δ) and 3.3 (orthogonality)
E(xiyi) − E(xizi′)δ = 0
δ = [E(xizi′)]−1 E(xiyi), which requires 3.4 (E(xizi′) invertible)
δ̂ →p δ by 3.2 (ergodic stationarity)

9. Show that the OLS estimator can be conceived as a special case of the IV estimator.

Relation: if E(εi · zi) = 0, the regressors are predetermined and can serve as their own instruments, xi = zi. Then:
δ̂ = [1/n Σ zizi′]−1 (1/n Σ ziyi) = OLS estimator

10. What are possible sources of endogeneity?

see additional page
