MULTIPLE REGRESSION ANALYSIS: SPECIFICATION AND DATA ISSUES (Chapter 9)
I. Introduction
Failure of the zero conditional mean assumption
  Correlation between the error, u, and one or more explanatory variables
  Why variables can be endogenous
  Possible remedies
Functional form misspecification
  If an omitted variable is a function of an explanatory variable in the model, the model suffers from functional form misspecification
Using proxy variables to address omitted variable bias
Measurement error
  Not all variables are measured accurately
II. Functional Form
A regression model can suffer from misspecification when it doesn't properly account for the relationship between the dependent and explanatory variables.
$wage = \beta_0 + \beta_1 educ + \beta_2 exper + u$
  Omits $exper^2$ or $exper \cdot educ$
  Omitting such a term can lead to biased estimates of the coefficients on all regressors
Using wage rather than log(wage) when the latter is the form that satisfies the Gauss-Markov (GM) assumptions
  Using the wrong form of a variable to relate LHS and RHS can lead to biased estimates of the coefficients on all regressors
II. Functional Form
We can change a linear relationship by:
  using logs on the RHS, LHS, or both
  using quadratic forms of the x's
  using interactions of the x's
How do we know if we've gotten the right functional form for our model?
  Use an F-test for joint exclusion restrictions to detect misspecification
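As a reminder of what the joint exclusion test computes, the usual F statistic can be written as below. The labels $SSR_r$ (restricted model, extra terms excluded), $SSR_{ur}$ (unrestricted model, extra terms included), q (number of restrictions), and k (number of regressors in the unrestricted model) are notation assumed here, not taken from the slides.

```latex
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} \sim F_{q,\; n-k-1} \quad \text{under } H_0
```

A large F (small p-value) suggests that the excluded squares, logs, or interactions belong in the model.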
II. Functional Form
Ex: Model of Crime
Quadratics or not?
  Each of the squared terms is individually significant, and they are jointly significant (F = 31.37, df = 3; 2,713)
Adding squares makes interpretation more difficult:
  Before, the intuitive (-) sign on pcnv suggested that the conviction rate has a deterrent effect on crime.
  Now, the level is positive and the quadratic is negative: at low levels conviction has no deterrent effect; it is only effective at high levels.
Note: Don't square qemp86, because it's a discrete variable taking only a few values.
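The "positive level, negative quadratic" pattern can be made concrete with the turning point of a quadratic. This is a generic sketch in assumed notation ($\beta_1$ on x, $\beta_2$ on $x^2$, turning point $x_{turn}$), not the estimated crime equation.

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots
\quad\Rightarrow\quad
\frac{\partial y}{\partial x} = \beta_1 + 2\beta_2 x,
\qquad
x_{turn} = -\frac{\beta_1}{2\beta_2}
```

With $\beta_1 > 0$ and $\beta_2 < 0$, the partial effect of x on y is positive below $x_{turn}$ and negative above it, which is why a deterrent effect would only show up at high values of x.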
II. Functional Form
How do you know what to try?
  Use economic theory to guide you
  Think about the interpretation
    Does it make more sense for x to affect y in percentage terms (use logs) or in absolute terms?
    Does it make more sense for the partial effect of x1 on y to vary with x1 (quadratic), to vary with x2 (interaction), or to be fixed?
II. Ramsey's RESET
We know how to test joint exclusion restrictions for higher-order terms or interactions.
  It can be tedious to add and test extra terms
  We may find that a squared term matters when using logs would really be even better
A test of functional form is Ramsey's regression specification error test (RESET)
  Intuition: if the specification is okay, no nonlinear functions of the independent variables should be significant when added to the original equation.
  Cost: degrees of freedom
II. Ramsey's RESET
RESET relies on a trick similar to the special form of the White test
  Instead of adding functions of the x's directly, we add and test functions of $\hat{y}$
  $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + error$
  Don't use the equation above for parameter estimates, only to test the inclusion of the extra terms
  $H_0: \delta_1 = 0, \delta_2 = 0$, tested with $F \sim F_{2,\,n-k-3}$
  A significant F-statistic suggests there is some sort of functional form problem
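The RESET steps can be sketched in a few lines of standard-library Python. This is a minimal illustration on simulated data, not the procedure used for the slides' examples: the helper names (`solve`, `ols`, `reset_f`), the data-generating process, and the seed are all assumptions made for the sketch, and it stops at the F statistic rather than computing a p-value.

```python
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, fitted values, SSR)."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve(XtX, Xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return b, fitted, ssr

def reset_f(X, y):
    """RESET F statistic with yhat^2 and yhat^3 added; X must include a constant column."""
    n, cols = len(X), len(X[0])               # cols = k + 1
    _, yhat, ssr_r = ols(X, y)                # restricted: original equation
    X_aug = [r + [f ** 2, f ** 3] for r, f in zip(X, yhat)]
    _, _, ssr_ur = ols(X_aug, y)              # unrestricted: extra terms included
    df = n - cols - 2                         # n - k - 3
    return ((ssr_r - ssr_ur) / 2) / (ssr_ur / df)

# Hypothetical simulated data: one correctly specified model, one that omits x1^2.
random.seed(0)
n = 500
x1 = [random.uniform(0, 2) for _ in range(n)]
x2 = [random.uniform(0, 2) for _ in range(n)]
u = [random.gauss(0, 0.5) for _ in range(n)]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
y_linear = [1 + 0.5 * a + 0.5 * b + e for a, b, e in zip(x1, x2, u)]
y_curved = [1 + 0.5 * a + 0.5 * b + a * a + e for a, b, e in zip(x1, x2, u)]

f_linear = reset_f(X, y_linear)   # linear model is correct here
f_curved = reset_f(X, y_curved)   # linear model omits x1^2 here
print(f_linear, f_curved)
```

In this setup the statistic for the misspecified equation should be far larger than for the correctly specified one. In practice the F statistic would be compared with an $F_{2,\,n-k-3}$ critical value.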
II. Ramsey's RESET
Ex: Housing Price Equation (n = 88)
$price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u$
  RESET statistic (up to $\hat{y}^3$) = 4.67, distributed $F_{2,82}$, with p-value .012
  Evidence of functional form misspecification
$lprice = \beta_0 + \beta_1 llotsize + \beta_2 lsqrft + \beta_3 bdrms + u$
  RESET statistic (up to $\hat{y}^3$) = 2.56, distributed $F_{2,82}$, with p-value .084
  No evidence of functional form misspecification at the 5% level
On the basis of RESET, the log equation is preferred.
  But just because the log equation "passed" RESET, does that mean it's the right specification?
  Should still use economic theory to determine whether the functional form makes sense.
III. Proxy Variables
Previously, we assumed functional form misspecification could be resolved because you had the relevant data.
What if the model is misspecified because no data are available on an important x variable?
  $\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$
  Would like to hold ability fixed, but have no measure of it.
  Excluding it causes the parameter estimates to be biased.
Potential solution: obtain a proxy variable for the omitted variable
III. Proxy Variables
A proxy variable is something related to the unobserved variable that we'd like to control for in our analysis, but can't.
  Ex: IQ as a proxy for ability
$x_3^* = \delta_0 + \delta_3 x_3 + v_3$, where * denotes the unobserved variable
  $v_3$ signals that $x_3$ and $x_3^*$ are not exactly related
  $\delta_0$ allows different scales to be compared (i.e., the IQ scale may not be how ability is measured)
  Just substitute $x_3$ for $x_3^*$ in $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3^* + u$
III. Proxy Variables
What do we need for this solution to give us unbiased estimates of $\beta_1$ and $\beta_2$? We need assumptions on $u$ and $v_3$.
1.) $u$ is uncorrelated with $x_1, x_2, x_3^*$ (standard)
  Also means $u$ is uncorrelated with $x_3$: once $x_1, x_2, x_3^*$ are included, $x_3$ is irrelevant (i.e., $x_3$ doesn't affect y other than through $x_3^*$)
2.) $v_3$ is uncorrelated with $x_1, x_2, x_3$.
  For $v_3$ to be uncorrelated with $x_1$ and $x_2$, $x_3$ must be a good proxy for $x_3^*$
  Formally, this means $E(x_3^* \mid x_1, x_2, x_3) = E(x_3^* \mid x_3) = \delta_0 + \delta_3 x_3$
  Once $x_3$ is controlled for, the expected value of $x_3^*$ does not depend on $x_1, x_2$
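III. Proxy Variables
$E(abil \mid educ, exper, IQ) = E(abil \mid IQ) = \delta_0 + \delta_3 IQ$
  Implies that average ability only changes with IQ, and not with educ and exper (once IQ is included).
So we are really running: $y = (\beta_0 + \beta_3\delta_0) + \beta_1 x_1 + \beta_2 x_2 + \beta_3\delta_3 x_3 + (u + \beta_3 v_3)$
  Redefined intercept, error term, and $x_3$ coefficient
Can rewrite as: $y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + e$
  Unbiased estimates of $\alpha_0$, $\alpha_1 = \beta_1$, $\alpha_2 = \beta_2$, $\alpha_3$
  Won't get the original $\beta_0$ or $\beta_3$.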
III. Proxy Variables
IQ as a proxy for ability
Want to estimate the return to education
  6.5% when running the regression without the ability proxy
  5.4% when IQ is included
Interacting educ*IQ allows for the possibility that returns to education differ across ability levels.
  The interaction is not significant, though.
III. Proxy Variables
A proxy variable can still lead to bias if the assumptions are not satisfied
Say $x_3^* = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \delta_3 x_3 + v_3$ (violation)
Then we are running: $y = (\beta_0 + \beta_3\delta_0) + (\beta_1 + \beta_3\delta_1) x_1 + (\beta_2 + \beta_3\delta_2) x_2 + \beta_3\delta_3 x_3 + (u + \beta_3 v_3)$
  Bias will depend on the signs of $\beta_3$ and $\delta_j$
  In the wage example, can reasonably assume $\delta_1 > 0$ and $\beta_3 > 0$, so that the return to education is biased upward even when using the proxy variable.
  This bias may be smaller than the omitted variable bias, though (if $x_3^*$ is less related to $x_1$ once $x_3$ is controlled for)
III. Lagged Dependent Variables
What if there are unobserved variables, and you can't find reasonable proxy variables?
Can include a lagged dependent variable to account for omitted variables that contribute to both past and current levels of y
  Must think past and current y are related for this to make sense
  Allows you to account for historical factors that cause current differences in the dependent variable
III. Lagged Dependent Variables
Ex: Model of Crime: effect of expenditure on crime
$crime = \beta_0 + \beta_1 unem + \beta_2 expend + u$
  Concern: cities that have lots of crime react by spending more on crime prevention, which gives biased estimates
  Coefficients on unem and expend are not intuitive
$crime = \beta_0 + \beta_1 unem + \beta_2 expend + \beta_3 crime_{-1} + u$
  Lagged value controls for the fact that cities with high historical crime rates may spend more on crime prevention
  Coefficient estimates are now more intuitive
IV. Properties of OLS under Measurement Error
Sometimes we have the variable we want, but we think it is measured with error
  E.g., how many hours did you work last year, how many weeks did you use child care when your child was young
When we use an imprecise measure of a variable in our regression, the model contains measurement error (m.e.).
Consequences of m.e.
  Similar to those of omitted variable bias
  Often the variable with measurement error is the one whose effect we're interested in measuring
  There are some conditions under which we still get unbiased results
  Measurement error in y is different from measurement error in x
IV. Measurement Error in a Dependent Variable
Let $y^*$ denote the variable we'd like to explain, like annual savings.
  Model: $y^* = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$
  Most often, respondents are not perfect in their reporting; reported savings is denoted $y$
  Define measurement error as observed minus actual: $e_0 = y - y^*$
  Thus, we are really estimating: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u + e_0$
IV. Measurement Error in a Dependent Variable
When will OLS produce unbiased results?
  Have assumed $u$ has zero mean and that $x_j$ and $u$ are uncorrelated
  Need to assume $e_0$ also has zero mean (otherwise this just biases the estimate of $\beta_0$), but more importantly that $e_0$ and $x_j$ are uncorrelated.
  That is, the measurement error in y is statistically independent of each explanatory variable. As a result, estimates are unbiased.
With $u$ and $e_0$ uncorrelated (the usual assumption), $Var(u + e_0) = \sigma_u^2 + \sigma_{e_0}^2 > \sigma_u^2$
  When there is m.e. in the LHS variable, we get larger variances for the OLS estimators.
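A small Monte Carlo sketch can illustrate both claims: with measurement error in y that is independent of x, the slope estimate stays centered on the true value, but its sampling spread grows. Everything here (the function names, a true slope of 2, the noise levels, the seed) is an assumption chosen for illustration, not something from the slides.

```python
import random
import statistics

def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / Var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(noise_sd, reps=400, n=100, beta1=2.0, seed=1):
    """Return (mean, std. dev.) of the slope estimates across replications."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y_star = [1.0 + beta1 * xi + rng.gauss(0, 1) for xi in x]   # true y*
        y = [ys + rng.gauss(0, noise_sd) for ys in y_star]          # e0 independent of x
        estimates.append(slope(x, y))
    return statistics.fmean(estimates), statistics.pstdev(estimates)

mean_clean, sd_clean = simulate(noise_sd=0.0)   # y measured without error
mean_noisy, sd_noisy = simulate(noise_sd=2.0)   # y measured with error
print(mean_clean, sd_clean)
print(mean_noisy, sd_noisy)
```

Both means should sit close to the assumed true slope of 2, while the standard deviation of the estimates is noticeably larger in the noisy case.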
IV. Measurement Error in a Dependent Variable
Savings Function
  $sav^* = \beta_0 + \beta_1 inc + \beta_2 size + \beta_3 educ + \beta_4 age + u$
  $e_0 = sav - sav^*$
  Is m.e. correlated with the RHS variables?
  May think families with higher incomes or more education are more likely to report savings accurately.
  Never know if that's true, so assume there is no systematic relationship: i.e., the wealthy or more educated are just as likely to misreport as the non-wealthy or less educated
Scrap Rates
  $\log(scrap^*) = \beta_0 + \beta_1 grant + u$
  Error assumed to be multiplicative: $y = y^* \cdot a_0$, where $e_0 = \log(a_0)$, so $\log(scrap) = \log(scrap^*) + e_0$
  $\log(scrap) = \beta_0 + \beta_1 grant + u + e_0$
  It's possible that firms that receive a grant are more likely to underreport their scrap rate to make the grant look more effective, so they get more in the future.
  Can't verify whether that's true, so assume no relationship: i.e., measurement error is not correlated with grant.
IV. Measurement Error in an Explanatory Variable
More complicated when measurement error occurs in the explanatory variable(s)
Model: $y = \beta_0 + \beta_1 x_1^* + u$
  $x_1^*$ is not observed; instead we only observe $x_1$
  Define m.e. as $e_1$ = observed - actual = $x_1 - x_1^*$
Assume
  $E(e_1) = 0$ (not a strong assumption)
  $E(y \mid x_1^*, x_1) = E(y \mid x_1^*)$: $x_1$ doesn't affect y after controlling for $x_1^*$, which means $u$ is uncorrelated with $x_1$ and $x_1^*$ (similar to the proxy variable assumption)
Now we are estimating $y = \beta_0 + \beta_1 x_1 + (u - \beta_1 e_1)$
IV. Measurement Error in an Explanatory Variable
What kind of results will OLS give us?
  Depends on our assumption about the correlation between $e_1$ and $x_1$
Suppose $Cov(x_1, e_1) = 0$
  OLS remains unbiased
  Variances are larger (since $Var(u - \beta_1 e_1) = \sigma_u^2 + \beta_1^2 \sigma_{e_1}^2$)
  The assumption that $Cov(x_1, e_1) = 0$ is analogous to the proxy variable assumption.
IV. Measurement Error in an Explanatory Variable
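What if that's not the case?
Suppose only that $Cov(x_1^*, e_1) = 0$
  Called the classical errors-in-variables (CEV) assumption
  More realistic than assuming $Cov(x_1, e_1) = 0$
This means: $Cov(x_1, e_1) = E(x_1 e_1) - E(x_1)E(e_1) = E[(x_1^* + e_1) e_1] = E(x_1^* e_1) + E(e_1^2) = 0 + \sigma_{e_1}^2 \neq 0$
This means $x_1$ is correlated with the composite error $u - \beta_1 e_1$, so the estimate is biased and inconsistent:
\[
\operatorname{plim}(\hat{\beta}_1)
= \beta_1 + \frac{Cov(x_1, u - \beta_1 e_1)}{Var(x_1)}
= \beta_1 - \frac{\beta_1 \sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2}
= \beta_1 \left( 1 - \frac{\sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right)
= \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right)
= \beta_1 \frac{Var(x_1^*)}{Var(x_1)}
\]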
IV. Measurement Error in an Explanatory Variable
Economics 20 - Prof. Anderson
Notice that the multiplicative term $Var(x_1^*)/Var(x_1) < 1$
  Means the estimate is biased toward zero, called attenuation bias
  True regardless of whether $\beta_1$ is (+) or (-)
  A larger $Var(x_1^*)/Var(x_1)$ (closer to 1) means the inconsistency of OLS is small, because the variation in the "noise" (a.k.a. m.e.) is small relative to the variation in the true value.
It's more complicated in a multiple regression, but we can still expect attenuation bias under the classical errors-in-variables assumption.
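Attenuation bias is easy to see in a simulation. The sketch below assumes a true slope of 2 and equal variances for the true regressor and the measurement error, so the ratio $Var(x_1^*)/Var(x_1)$ is 1/2 and the slope on the mismeasured regressor should settle near 1. The variable names, parameter values, and seed are illustrative assumptions.

```python
import random
import statistics

def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / Var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(42)
n, beta1 = 20000, 2.0
sd_true, sd_err = 1.0, 1.0          # std. dev. of x1* and of e1 (assumed values)

x_star = [rng.gauss(0, sd_true) for _ in range(n)]            # true regressor x1*
y = [1.0 + beta1 * xs + rng.gauss(0, 1) for xs in x_star]     # y depends on x1*
x_obs = [xs + rng.gauss(0, sd_err) for xs in x_star]          # x1 = x1* + e1 (CEV)

b_true = slope(x_star, y)   # regression on the true regressor
b_obs = slope(x_obs, y)     # regression on the mismeasured regressor
predicted = beta1 * sd_true ** 2 / (sd_true ** 2 + sd_err ** 2)  # beta1 * Var(x1*)/Var(x1) = 1.0
print(b_true, b_obs, predicted)
```

Raising `sd_err` relative to `sd_true` pushes the estimate further toward zero, and lowering it moves the estimate back toward the true slope.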
IV. Measurement Error in an Explanatory Variable
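$y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + \beta_3 x_3 + u$
  Assume $u$ is uncorrelated with $x_1^*, x_1, x_2, x_3$
If we assume $e_1$ is uncorrelated with $x_1, x_2, x_3$, then we get $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u - \beta_1 e_1$
  and get consistent estimates
But if $e_1$ is uncorrelated with $x_2, x_3$ but not necessarily $x_1$ (as under CEV, where $e_1$ is uncorrelated with $x_1^*$), we get
\[
\operatorname{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{\sigma_{r_1^*}^2}{\sigma_{r_1^*}^2 + \sigma_{e_1}^2} \right),
\quad \text{where } r_1^* \text{ is the error from } x_1^* = \alpha_0 + \alpha_1 x_2 + \alpha_2 x_3 + r_1^*
\]
If $x_1^*$ is uncorrelated with $x_2, x_3$, we get consistent estimates of $\beta_2, \beta_3$
If this doesn't hold, then the other estimates will be inconsistent (size and direction are indeterminate)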
IV. Measurement Error in an Explanatory Variable
Ex: GPA with measurement error
$colGPA = \beta_0 + \beta_1 faminc^* + \beta_2 hsGPA + \beta_3 SAT + \beta_4 smoke + u$
  faminc* is actual annual family income
  $faminc = faminc^* + e_1$
  Assuming CEV holds, we get an OLS estimator of $\beta_1$ that is attenuated (biased toward zero).
$colGPA = \beta_0 + \beta_1 faminc + \beta_2 hsGPA + \beta_3 SAT + \beta_4 smoke^* + u$
  $smoke = smoke^* + e_1$
  CEV is unlikely to hold, because those who don't smoke are very unlikely to misreport. Those who do smoke can misreport, so the error and the actual number of times smoked (smoke*) are correlated.
  Deriving the implications of measurement error when CEV doesn't hold is difficult and beyond the scope of the text.
V. Missing Data, Nonrandom Samples, Outlying Observations
Introduction to data problems that can violate MLR.2 (random sampling) of the G-M assumptions
  Cases when data problems have no effect on OLS estimates
  Other cases when we get biased estimates
Missing Data
  Generally collect data from a random sample of observations (people, schools, firms)
  Discover that information on key variables is missing for some of these observations
V. Missing Data – Is it a Problem?
Consequences
  If any observation is missing data on one of the variables in the model, it can't be used
Data missing at random
  If data are missing at random, using a sample restricted to observations with no missing values will be fine
  Simply reduces the sample size, thus reducing the precision of the estimates
V. Missing Data – Is it a Problem?
Data not missing at random
  A problem can arise if the data are missing systematically
    High-income individuals refuse to provide income data
    Low-education people generally don't report education
    People with high IQ are more likely to report IQ
When missing data does not lead to bias
  Sample chosen on the basis of the independent variables
  Ex: savings, income, age, size for the population of people 35 years and older
  No bias because E(savings | income, age, size) is the same for any subset of the population described by income, age, size in these data.
V. Nonrandom Samples
When missing data leads to bias
  If the sample is chosen on the basis of the y variable, then we have sample selection bias
  Ex: estimating wealth based on education, experience, and age. Only those with wealth below 250k are included
  OLS gives biased estimates because E(wealth | educ, exper, age) is not the same as the expected value conditional on wealth being less than 250k.
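The contrast between selecting on an explanatory variable and selecting on y can be illustrated with a short simulation. This is a sketch under assumed values (true slope 2, cutoffs at x < 0 and y < 1, a fixed seed); none of these come from the slides.

```python
import random
import statistics

def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / Var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(7)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
pairs = list(zip(x, y))

on_x = [(a, b) for a, b in pairs if a < 0]      # sample chosen on the x variable
on_y = [(a, b) for a, b in pairs if b < 1.0]    # sample chosen on the y variable

b_full = slope(x, y)
b_on_x = slope(*zip(*on_x))   # should stay close to the true slope
b_on_y = slope(*zip(*on_y))   # should fall noticeably below the true slope
print(b_full, b_on_x, b_on_y)
```

Selecting on x only shrinks the sample, whereas selecting on y changes the conditional expectation being estimated, which is the wealth-below-250k problem described above.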
V. Outliers /Influential Observations
Sometimes an individual observation can be very different from the others
  "Influential" for the estimates if dropping that observation (or observations) from the analysis changes the key OLS estimates by a lot
  Particularly important with small data sets
  OLS is susceptible to outliers because, by definition, it minimizes the sum of squared residuals, and an outlier will have a "large" residual.
Causes of outliers
  Errors in data entry: one reason why looking at summary statistics is important
  Sometimes the observation will just truly be very different from the others
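A toy example shows how much one point can move an OLS slope in a small sample. The data below are made up for illustration: eight points lying exactly on y = 2x, plus one hypothetical outlier at (20, 5).

```python
def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * v for v in x]              # points lie exactly on y = 2x

x_out = x + [20]                    # add one hypothetical outlying observation
y_out = y + [5]

b_without = slope(x, y)             # slope without the outlier
b_with = slope(x_out, y_out)        # slope with the outlier included
print(b_without, b_with)
```

Comparing the two slopes is the same "with and without" check the slides recommend reporting to readers.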
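V. Outliers /Influential Observations
Example: R&D Intensity & Firm Size
$rdintens = \beta_0 + \beta_1 sales + \beta_2 profmarg + u$
Full sample (standard errors in parentheses):
\[
\widehat{rdintens} = \underset{(.586)}{2.625} + \underset{(.000044)}{.000053}\, sales + \underset{(.0462)}{.0446}\, profmarg,
\qquad n = 32,\; R^2 = .0761,\; \bar{R}^2 = .0124
\]
One observation dropped:
\[
\widehat{rdintens} = \underset{(.592)}{2.297} + \underset{(.000084)}{.000186}\, sales + \underset{(.0445)}{.0478}\, profmarg,
\qquad n = 31,\; R^2 = .1728,\; \bar{R}^2 = .1137
\]
The coefficient on sales more than triples, and is now statistically significant.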
V. Outliers
Not unreasonable to fix observations where it’s clear there was just an extra zero entered or left off, etc.
Not unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without the outliers
Can use Stata to investigate outliers graphically