Statistical Inference and Regression Analysis: GB.3302.30


Professor William Greene
Stern School of Business
IOMS Department; Department of Economics


Statistics and Data Analysis
Part 7-1: Regression Diagnostics

Using the Residuals
How do you know the model is good? Various diagnostics will be developed over the semester. But the first place to look is at the residuals.

Residuals Can Signal a Flawed Model
Standard application: a cost function for the output of a production process. Compare a linear equation to a quadratic model (in logs). (123 American electric utilities.)

Electricity Cost Function

Candidate Model for Cost
log C = a + b log q + e

Reading the plot from left to right: most of the points in the first region lie above the regression line, most in the middle region lie below it, and most in the last region lie above it again. That alternating pattern is the signature of curvature that the linear model misses.

A Better Model?

Log Cost = α + β1 logOutput + β2 [logOutput]² + ε

Candidate Models for Cost
The quadratic equation is the appropriate model:
log c = a + b1 log q + b2 (log q)² + e
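A small simulation (synthetic data standing in for the 123 utilities, with an illustrative true quadratic cost function) shows how the linear fit's residuals betray the missing quadratic term:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: the true cost function is quadratic in log output.
logq = rng.uniform(0.0, 5.0, 123)
logc = 2.0 + 0.4 * logq + 0.06 * logq**2 + rng.normal(0.0, 0.15, 123)

# Linear candidate: log c = a + b log q + e
X1 = np.column_stack([np.ones_like(logq), logq])
e1 = logc - X1 @ np.linalg.lstsq(X1, logc, rcond=None)[0]

# Quadratic candidate: log c = a + b1 log q + b2 (log q)^2 + e
X2 = np.column_stack([np.ones_like(logq), logq, logq**2])
e2 = logc - X2 @ np.linalg.lstsq(X2, logc, rcond=None)[0]

# The linear model's residuals still carry the curvature; the quadratic's do not.
print(np.corrcoef(e1, logq**2)[0, 1])   # noticeably nonzero
print(np.corrcoef(e2, logq**2)[0, 1])   # essentially zero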

Missing Variable Included

Residuals from the quadratic cost model

Residuals from the linear cost model

Unusual Data Points

Outliers have (what appear to be) very large disturbances. Two examples: wolf weight vs. tail length, and the 500 most successful movies.

Outliers

99.5% of observations will lie within the mean ± 3 standard deviations. We show (a + bx) ± 3s below. Titanic is 8.1 standard deviations from the regression! Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.) These observations might deserve a close look.
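A sketch of this screen in code (hypothetical arrays x and y; k = 3 reproduces the ±3 standard error band):

import numpy as np

def flag_outliers(x, y, k=3.0):
    """Flag observations outside the band (a + b*x) +/- k*s."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    s = np.sqrt(e @ e / (len(y) - 2))   # standard error of the regression
    return np.abs(e) > k * s            # True marks a candidate outlier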


log Price = a + b log Area + e
Prices paid at auction for Monet paintings vs. surface area (in logs). Not an outlier: Monet chose to paint a small painting. Possibly an outlier: why was the price so low?

What to Do About Outliers
(1) Examine the data.
(2) Are they due to measurement or obvious coding errors? Delete the observations.
(3) Are they just unusual observations? Do nothing.
(4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large; 10 wolves is not.)
(5) Question why you think it is an outlier. Is it really?

Regression Options

Diagnostics

On Removing Outliers
Be careful about singling out particular observations this way. The resulting model might be a product of your opinions, not of the real relationship in the data.

Removing outliers might create new outliers that were not outliers before.

Statistical inferences from the model will be incorrect.

Statistics and Data Analysis
Part 7-2: Statistical Inference

b As a Statistical Estimator
What is the interest in b? It estimates β = dE[y|x]/dx: the effect of a policy variable on the expectation of a variable of interest, the effect of medication dosage on disease response, and many others.

Application: Health Care Data
German Health Care Usage Data: 27,326 observations on German households, 1984-1994.

DOCTOR = 1(number of doctor visits > 0)
HOSPITAL = 1(number of hospital visits > 0)
HSAT = health satisfaction, coded 0 (low) to 10 (high)
DOCVIS = number of doctor visits in last three months
HOSPVIS = number of hospital visits in last calendar year
PUBLIC = insured in public health insurance = 1; otherwise = 0
ADDON = insured by add-on insurance = 1; otherwise = 0
INCOME = household nominal monthly net income in German marks / 10,000
HHKIDS = children under age 16 in the household = 1; otherwise = 0
EDUC = years of schooling
AGE = age in years
MARRIED = marital status

Regression?
Population relationship: Income = α + β Health + ε.
For this population: Income = .31237 + .00585 Health + ε, so E[Income | Health] = .31237 + .00585 Health.

Distribution of Health

Distribution of Income

Average Income | Health

Health (HSAT):   0    1    2     3     4     5     6     7     8     9    10
N_j:           447  255  642  1173  1390  4233  2570  4191  6172  3061  3192

b is a statistic
It is random because it is a sum of the ε_i. It has a distribution, like any sample statistic.

Sampling Experiment
500 samples of N = 52 drawn from the 27,326 (using a random number generator to simulate N observation numbers from 1 to 27,326). Compute b with each sample. Histogram of the 500 values.
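In code, the experiment looks like this (a synthetic stand-in for the 27,326-household population; the intercept, slope, and noise level are illustrative, not the actual data):

import numpy as np

rng = np.random.default_rng(1)

N_POP = 27326
health = rng.integers(0, 11, N_POP).astype(float)          # HSAT-like, 0-10
income = 0.31237 + 0.00585 * health + rng.normal(0.0, 0.15, N_POP)

slopes = []
for _ in range(500):                     # 500 samples of N = 52
    idx = rng.integers(0, N_POP, 52)     # simulated observation numbers
    x, y = health[idx], income[idx]
    slopes.append(np.cov(x, y)[0, 1] / np.var(x, ddof=1))  # slope estimate b

print(np.mean(slopes))   # centers near the population slope, 0.00585
print(np.std(slopes))    # the sampling variability visible in the histogram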


Conclusions
Sampling variability. Seems to center on β. Appears to be normally distributed.

Distribution of the Slope Estimator, b
Assumptions:
(Model) Regression: y_i = α + β x_i + ε_i
(Crucial) Exogenous data: the data x and the noise ε are independent; E[ε|x] = 0, or Cov(ε, x) = 0
(Temporary) Var[ε|x] = σ², not a function of x (homoscedastic)
Results: What are the properties of b?


(1) b is unbiased and linear in ε
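A sketch of the standard algebra behind this claim, in the notation above:

b = Σ_i (x_i - x̄)(y_i - ȳ) / Σ_i (x_i - x̄)²
  = β + Σ_i w_i ε_i,   where w_i = (x_i - x̄) / Σ_j (x_j - x̄)²

So b is a linear function of the ε_i, and E[b|x] = β + Σ_i w_i E[ε_i|x] = β.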

(2) b is efficient
Gauss-Markov Theorem: like Rao-Blackwell. (Proof in Greene.)

Variance of b is smallest among linear unbiased estimators.

(3) b is consistent

Consistency: N = 52 vs. N = 520

a is unbiased and consistent

Covariance of a and b

Inference about β
We have derived the expected value and variance of b. b is a point estimator; we are looking for a way to form a confidence interval. We need a distribution and a pivotal statistic to use.

Normality

Confidence Interval

Estimating σ²

Usable Confidence Interval
Use s instead of σ. Use the t distribution instead of the normal. The critical t depends on the degrees of freedom.
b - t s_b < β < b + t s_b
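A minimal sketch of the interval computation (hypothetical arrays x and y; scipy supplies the critical t):

import numpy as np
from scipy import stats

def slope_ci(x, y, level=0.95):
    """t-based confidence interval for the slope in y = a + b*x + e."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    s2 = e @ e / (n - 2)                          # s^2 estimates sigma^2
    s_b = np.sqrt(s2 / np.sum((x - np.mean(x))**2))
    t = stats.t.ppf(0.5 + level / 2.0, df=n - 2)  # critical t with n-2 df
    return coef[1] - t * s_b, coef[1] + t * s_b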

Slope Estimator

Regression Results
-----------------------------------------------------------------------------
Ordinary least squares regression
LHS = BOX    Mean                = 20.72065
             Standard deviation  = 17.49244
             No. of observations = 62
                                     DegFreedom   Mean square
Regression Sum of Squares =  7913.58      1       7913.57745
Residual   Sum of Squares =  10751.5     60        179.19235
Total      Sum of Squares =  18665.1     61        305.98555
Standard error of e = 13.38627     Root MSE      = 13.16860
Fit   R-squared     =   .42398     R-bar squared =   .41438
Model test F[1, 60] = 44.16247     Prob F > F*   =   .00000
--------+--------------------------------------------------------------------
        |              Standard           Prob.       95% Confidence
    BOX | Coefficient    Error       t   |t|>T*          Interval
--------+--------------------------------------------------------------------
Constant|  -14.3600**    5.54587  -2.59   .0121     -25.2297    -3.4903
CNTWAIT3|   72.7181***  10.94249   6.65   .0000      51.2712    94.1650
--------+--------------------------------------------------------------------
Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
-----------------------------------------------------------------------------

Hypothesis Test about β
Outside the confidence interval is the rejection region for hypothesis tests about β. For the internet buzz regression, the confidence interval is 51.2712 to 94.1650. The hypothesis that β equals zero is rejected.

Statistics and Data Analysis
Part 7-3: Prediction

Predicting y Using the Regression
Actual y0 is α + β x0 + ε0. The prediction is ŷ0 = a + b x0 + 0 (the prediction of ε0 is 0). The error is y0 - ŷ0 = (α - a) + (β - b) x0 + ε0. The variance of the error is Var[a] + x0² Var[b] + 2 x0 Cov[a, b] + Var[ε0].

Prediction Variance
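For the simple regression, the standard forecast variance (estimated by replacing σ² with s²) is:

Var[y0 - ŷ0] = σ² [ 1 + 1/N + (x0 - x̄)² / Σ_i (x_i - x̄)² ]

s_forecast is the square root of the estimate. This is the formula the example on the next slide uses.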

Quantum of Solace
Actual Box = $67.528882M
a = -14.36, b = 72.7181, N = 62, s_b = 10.94249, s² = 13.3863²
buzz = 0.76, prediction = 40.906
Mean buzz = 0.4824194; Σ(buzz - mean)² = 1.49654
s_forecast = 13.8314252
Confidence interval = 40.906 ± 2.003(13.831425) = 13.239 to 68.527.
(Note: the confidence interval contains the actual value.)
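The slide's arithmetic in a short script (all numbers taken from the slide and the regression output above):

import numpy as np

a, b = -14.36, 72.7181
n, s2 = 62, 13.3863**2             # s^2 = squared standard error of e
buzz0, xbar, sxx = 0.76, 0.4824194, 1.49654

pred = a + b * buzz0                                          # 40.906
s_forecast = np.sqrt(s2 * (1 + 1/n + (buzz0 - xbar)**2 / sxx))
lo, hi = pred - 2.003 * s_forecast, pred + 2.003 * s_forecast
print(pred, s_forecast, lo, hi)    # the interval contains the actual $67.5M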

Forecasting Out of Sample
Per capita gasoline consumption vs. per capita income, 1953-2004. How do we predict G for 2012? You would first need to predict income for 2012. How should we do that?

Regression Analysis: G versus Income
The regression equation is
G = 1.93 + 0.000179 Income

Predictor        Coef      SE Coef       T      P
Constant       1.9280     0.1651     11.68  0.000
Income      0.00017897  0.00000934   19.17  0.000

S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%

The Extrapolation Penalty

The interval is narrowest at x* = x̄, the center of our experience. The interval widens as we move away from the center of our experience, reflecting the greater uncertainty:
(1) uncertainty about the prediction of x, and
(2) uncertainty that the linear relationship will continue to hold as we move farther from the center.

Normality
Normality is necessary for t statistics and confidence intervals. The residuals reveal whether the disturbances are normal. Standard tests and devices apply.

Normally Distributed Residuals?

--------------------------------------
Kolmogorov-Smirnov test of F(e) vs. Normal[.00000, 13.27610²]
K-S test statistic  =  .1810063
95% critical value  =  .1727202
99% critical value  =  .2070102
The normality hypothesis should be rejected.
--------------------------------------
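The same check is easy to run in code (a sketch; e is the residual vector, and, as on the slide, the scale of the reference normal is estimated from the residuals):

import numpy as np
from scipy import stats

def ks_normal(e):
    """Kolmogorov-Smirnov test of residuals against Normal(0, s^2)."""
    s = np.std(e, ddof=1)                   # scale fitted from the residuals
    stat, pval = stats.kstest(e, 'norm', args=(0.0, s))
    return stat, pval

# Reject normality when the statistic exceeds the critical value
# (approximately 1.358/sqrt(N) at the 95% level for large N).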

Nonnormal Disturbances
Appeal to the central limit theorem: use the standard normal instead of t. t is essentially normal if N > 50.

Statistics and Data Analysis
Part 7-4: Multiple Regression

Box Office and Movie Buzz

Box Office and Budget

Budget and Buzz Effects

An Enduring Art Mystery

Why do larger paintings command higher prices?
The Persistence of Memory. Salvador Dalí, 1931.
The Persistence of Statistics. Rice, 2007.
The graphics show the relative sizes of the two works.

The Data

Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)

Monet in Large and Small

Log of $price = a + b log surface area + e
Sale prices of 328 signed Monet paintings. The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

Monet Regression: There seems to be a regression. Is there a theory?

How Much for the Signature?
The sample also contains 102 unsigned paintings.
Average sale price, signed:     $3,364,248
Average sale price, not signed: $1,832,712
The average price of signed Monets is almost twice that of unsigned ones.

Can we separate the two effects?
Average Prices      Small        Large
Unsigned          346,845    5,795,000
Signed            689,422    5,556,490

What do the data suggest?
(1) The size effect is huge.
(2) The signature effect is confined to the small paintings.

A Multiple Regression

Ln Price = α + β1 ln Area + β2 Signed + ε, where Signed = 0 if unsigned, 1 if signed. β2 is the signature effect.

Monet Multiple Regression
Regression Analysis: ln (US$) versus ln (SurfaceArea), Signed
The regression equation is
ln (US$) = 4.12 + 1.35 ln (SurfaceArea) + 1.26 Signed

Predictor            Coef     SE Coef      T      P
Constant           4.1222    0.5585      7.38  0.000
ln (SurfaceArea)   1.3458    0.08151    16.51  0.000
Signed             1.2618    0.1249     10.11  0.000

S = 0.992509   R-Sq = 46.2%   R-Sq(adj) = 46.0%

Interpretation (to be explored as we develop the topic):
(1) The elasticity of price with respect to surface area is 1.3458, which is very large.
(2) The signature multiplies the price by exp(1.2618), about 3.5, for any given size.
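A sketch of the same fit with numpy (price, area, and signed are hypothetical arrays; signed is the 0/1 dummy):

import numpy as np

def monet_fit(price, area, signed):
    """Regress ln(price) on ln(area) and a signature dummy."""
    y = np.log(price)
    X = np.column_stack([np.ones(len(y)), np.log(area), signed])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef    # [constant, area elasticity, signature effect]

# The signature premium as a multiplier: np.exp(coef[2]) -- about 3.5 here.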

Ceteris Paribus in Theory
Demand for gasoline: G = f(price, income)

Demand (price) elasticity:
eP = % change in G given a % change in P, holding income constant.

How do you do that in the real world? How do we measure the percentage changes? And how do we change price while holding income constant?

The Real World Data

U.S. Gasoline Market, 1953-2004

Shouldn't Demand Curves Slope Downward?

A Thought Experiment
The main driver of gasoline consumption is income, not price. Income is growing over time. We are not holding income constant when we change price! How do we do that?

How to Hold Income Constant? Multiple Regression Using Price and Income
Regression Analysis: G versus GasPrice, Income

The regression equation is
G = 0.134 - 0.00163 GasPrice + 0.000026 Income

Predictor        Coef        SE Coef       T      P
Constant       0.13449     0.02081      6.46  0.000
GasPrice     -0.0016281    0.0004152   -3.92  0.000
Income        0.00002634   0.00000231  11.43  0.000

It looks like the theory works.

Statistics and Data Analysis
Linear Multiple Regression Model

Classical Linear Regression Model
The model is y = f(x1, x2, …, xK, β1, β2, …, βK) + ε, a multiple regression model. Important examples: marginal cost in a multiple-output setting; separate age and education effects in an earnings equation. Denote (x1, x2, …, xK) as x; a boldface symbol denotes a vector.

Form of the model: E[y|x] = a linear function of x.

Dependent and independent variables. Independent of what? Think in terms of autonomous variation. Can y just change? What causes the change?

Model Assumptions: Generalities
Linearity means linear in the parameters. We'll return to this issue shortly.
Identifiability: it is not possible, in the context of the model, for two different sets of parameters to produce the same value of E[y|x] for all x vectors. (It is possible for some x.)
The conditional expected value of the deviation of an observation from the conditional mean function is zero.
The form of the variance of the random variable around the conditional mean is specified.
The nature of the process by which x is observed is specified.
Assumptions are made about the specific probability distribution.

Linearity of the Model
f(x1, x2, …, xK, β1, β2, …, βK) = x1β1 + x2β2 + … + xKβK
Notation: x1β1 + x2β2 + … + xKβK = x′β. A boldface letter indicates a column vector; x denotes a variable, a function of a variable, or a function of a set of variables. There are K variables on the right-hand side of the conditional mean function. The first variable is usually a constant term. (Wisdom: models should have a constant term unless the theory says they should not.)
E[y|x] = β1·1 + β2·x2 + … + βK·xK. (β1·1 is the intercept term.)

Linearity
Linearity means linear in the parameters, not in the variables:
E[y|x] = β1 f1(…) + β2 f2(…) + … + βK fK(…), where fk(…) may be any function of data.
Examples: logs and levels in economics; time trends, and time trends in loglinear models (rates of growth); dummy variables; quadratics, power functions, log-quadratics, trig functions, interactions, and so on.

Linearity
Simple linear model: E[y|x] = α + βx.
Quadratic model: E[y|x] = α + β1 x + β2 x².
Loglinear model: E[ln y|x] = α + Σk βk ln xk.
Semilog: E[y|x] = α + Σk βk ln xk.
All are linear in the parameters. An infinite number of variations.
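All of these fit with the same least squares machinery; only the design matrix changes. A sketch (x and y are hypothetical arrays):

import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients for any design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Quadratic:  E[y|x] = a + b1*x + b2*x^2
X_quadratic = lambda x: np.column_stack([np.ones_like(x), x, x**2])
# Semilog:    E[y|x] = a + b*ln(x)
X_semilog = lambda x: np.column_stack([np.ones_like(x), np.log(x)])
# Loglinear:  E[ln y|x] = a + b*ln(x) -- same design, but fit ols(X, np.log(y))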

Matrix Notation

Notation
Define column vectors of the N observations on y and the K x variables.

The assumption means that the rank of the matrix X is K.No linear dependencies => FULL COLUMN RANK of the matrix X.

Uniqueness of the Conditional Mean
The conditional mean relationship must hold for any set of N observations, i = 1, …, N. Assume that N ≥ K (justified later).
E[y1|x] = x1′β
E[y2|x] = x2′β
…
E[yN|x] = xN′β

All N observations at once: E[y|X] = Xβ = E.

Uniqueness of E[y|X]
Now suppose there is a γ that produces the same expected value: E[y|X] = Xγ = E.

Let δ = γ - β. Then Xδ = Xγ - Xβ = E - E = 0.

Is this possible? X is an N×K matrix (N rows, K columns). What does Xδ = 0 mean? We assume this is not possible. This is the full rank assumption. Ultimately, it will imply that we can estimate β. This requires N ≥ K.

An Unidentified (But Valid) Theory of Art Appreciation

Enhanced Monet Area Effect Model: Height and Width Effects
Log(Price) = β1 + β2 log Area + β3 log AspectRatio + β4 log Height + β5 Signature + ε
(Aspect Ratio = Height/Width.) Since Area = Height × Width, log Height = ½(log Area + log AspectRatio): an exact linear dependency, so this X does not have full column rank and the model is unidentified.
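A quick numerical check of the dependency (synthetic heights and widths; numpy's rank computation exposes the failure of full column rank):

import numpy as np

rng = np.random.default_rng(2)
height = rng.uniform(20, 200, 100)
width = rng.uniform(20, 200, 100)
area, aspect = height * width, height / width

X = np.column_stack([np.ones(100), np.log(area), np.log(aspect), np.log(height)])

# log(height) = 0.5*log(area) + 0.5*log(aspect) exactly, so X has only
# three linearly independent columns and the coefficients are not identified.
print(np.linalg.matrix_rank(X))   # 3, not 4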

Conditional Homoscedasticity and Nonautocorrelation
Disturbances provide no information about each other:
Var[εi | X] = σ²,  Cov[εi, εj | X] = 0 for i ≠ j.

Heteroscedasticity
Regression of the log of per capita gasoline use on the log of per capita income, the gasoline price, and the number of cars per capita, for 18 OECD countries over 19 years. The standard deviation varies by country. Countries are ordered by the standard deviation of their 19 residuals.

Autocorrelation
log G = β1 + β2 log Pg + β3 log Y + β4 log Pnc + β5 log Puc + ε

Autocorrelation Results from an Incomplete Model

Normal Distribution of ε
Used to facilitate finite sample derivations of certain test statistics.

Observations are independent

The assumption will be unnecessary; we will use the central limit theorem for the statistical results we need.

The Linear Model
y = Xβ + ε, N observations, K columns in X (usually including a column of ones for the intercept).
Standard assumptions about X.
Standard assumptions about ε|X: E[ε|X] = 0, E[ε] = 0, and Cov[ε, x] = 0.

Regression: E[y|X] = Xβ

Statistics and Data Analysis
Least Squares

Vocabulary
Some terms to be used in the discussion:
Population characteristics and entities vs. sample quantities and analogs.
Residuals and disturbances.
Population regression line and sample regression.
Objective: learn about the conditional mean function; estimate β and σ².
First step: the mechanics of fitting a line to a set of data.

Least Squares

Matrix Results


Moment Matrices

Least Squares Normal Equations
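In code, the normal equations X′Xb = X′y can be solved directly (a sketch; in practice a QR- or SVD-based routine such as np.linalg.lstsq is better conditioned than forming X′X explicitly):

import numpy as np

def ols_normal_equations(X, y):
    """Solve the least squares normal equations X'X b = X'y.
    Requires X to have full column rank."""
    XtX = X.T @ X          # K x K moment matrix
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)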

Second Order Conditions

Does b Minimize e′e?

Positive Definite Matrix
