A Non-technical Introduction to Regression - Personal WWW Pages

90
A Non-technical Introduction to Regression () Introductory Econometrics: Topic 1 1 / 90

Transcript of A Non-technical Introduction to Regression - Personal WWW Pages

Page 1: A Non-technical Introduction to Regression - Personal WWW Pages

A Non-technical Introduction to Regression

() Introductory Econometrics: Topic 1 1 / 90

Page 2: A Non-technical Introduction to Regression - Personal WWW Pages

This first set of lectures will (quickly) go through basic data analysisup to regression in a reltatively non-technical fashion

Based on Chapters 1 and 2 of the textbook

This material should mostly be a review of material you should knowfrom your previous study (e.g. in your second year course).

Since you have covered this material before, I will go through thismaterial quickly, with a focus on the most important tool of theapplied economist: regression.

Please read through chapters 1 and 2, particularly if you need somereview of this material.

() Introductory Econometrics: Topic 1 2 / 90

Page 3: A Non-technical Introduction to Regression - Personal WWW Pages

Types of Economic Data

This section introduces types of data used by economists and definesthe notation and terminology associated with them

Time Series DataCommon in macroeconomics and finance

Examples: Gross Domestic product (GDP), stock prices, interestrates, exchange rates (called “time series variables”or simply“variables”

Data is collected at specific points in time (e.g. every month, everyday or every year).

Yt is the observation on variable Y at time t.

A time series runs from period t = 1 to t = T .

() Introductory Econometrics: Topic 1 3 / 90

Page 4: A Non-technical Introduction to Regression - Personal WWW Pages

Cross-sectional Data

Characterized by individual units such as companies, people orcountries.

E.g. the wage of each of 100 individuals in a survey.

Note: ordering does not matter (unlike with time series data).

Yi is observation for individual i for i = 1 to N.

Note: often we have quantitative data (e.g. wages are measured inpounds so data will be a number).

Sometimes data is qualitative data. E.g. in survey may ask whethereach worker is Male or Female.

Econometricians convert qualitative answers into numeric data (e.g.Male=1, Female=0)

Variables which take on only values 0 or 1 are referred to as dummyvariables.

() Introductory Econometrics: Topic 1 4 / 90

Page 5: A Non-technical Introduction to Regression - Personal WWW Pages

Panel Data

Sometimes have data with both time series and a cross-sectionalcomponent.

Survey 100 workers every year for 5 years

Such data is referred to as panel data.

We will not have time to cover panel data in this course, but readChapter 8 of the textbook if you want to learn more (e.g. if you areusing panel data in your dissertation)

() Introductory Econometrics: Topic 1 5 / 90

Page 6: A Non-technical Introduction to Regression - Personal WWW Pages

Graphical Methods

Time Series GraphsMonthly time series data from January 1947 through October 1996on the U.K. pound/U.S. dollar exchange rate is plotted in Figure 1.1.

1940 1950 1960 1970 1980 1990 2000100

150

200

250

300

350

400

450Figure 1.1: Time Series Plot of UK Pound to US Dollar Exchange Rate

Year

£/$ 

exch

ange

 rate

() Introductory Econometrics: Topic 1 6 / 90

Page 7: A Non-technical Introduction to Regression - Personal WWW Pages

Histograms (frequency distributions)

Commonly used with cross-sectional dataExample: real GDP per capita in 1992 for 90 countries.Frequency table counts how many countries have GDP falling indifferent class intervals or bins

Table 1.1: Frequency Table for GDP per capita DataClass Interval Frequency0 to $2, 000 33$2, 001 to $4, 000 22$4, 001 to $6, 000 7$6, 001 to $8, 000 3$8, 001 to $10, 000 4$10, 001 to $12, 000 2$12, 001 to $14, 000 9$14, 001 to $16, 000 6$16, 001 to $18, 000 4

() Introductory Econometrics: Topic 1 7 / 90

Page 8: A Non-technical Introduction to Regression - Personal WWW Pages

Histogram makes a bar chart out of a frequency table

­5 0 5 10 15 200

5

10

15

20

25

30

35Figure 1.2: Histogram of GDP per capita for 90 Countries

GDP per capita (thousands of US dollars)

Freq

uenc

y

Histogram is closely related to the idea of a distribution (a point wewill build on later)

() Introductory Econometrics: Topic 1 8 / 90

Page 9: A Non-technical Introduction to Regression - Personal WWW Pages

XY- Plots (scatter diagrams)

Used to shed light on relationship between two variables.

Example: deforestation versus population density

0 500 1000 1500 2000 2500 30000

1

2

3

4

5

6Figure 1.3: XY­Plot of Population Density Against Deforestation

Population per thousand hectares

Aver

age 

annu

al fo

rest

 loss

 (%)

Nicaragua

() Introductory Econometrics: Topic 1 9 / 90

Page 10: A Non-technical Introduction to Regression - Personal WWW Pages

Descriptive Statistics

Graphs have an immediate visual impact that is useful for livening upan essay or report.

Usually need to be numerically precise

Almost all of what we do in this course will involve numerical (asopposed to graphical) summaries such as Descriptive Statistics

These require some mathematical tools

Before descriptive statistics, introduce some maths from Appendix Aof textbook

() Introductory Econometrics: Topic 1 10 / 90

Page 11: A Non-technical Introduction to Regression - Personal WWW Pages

Mathematical Basics

The level of mathematics used in this course not too high and youshould have learned in your second year course.

But let me provide a brief summary of the basic mathematical toolsused in the course

The Equation of a Straight LineLet Y and X be two variables.

Any straight line relationship between the variables can be written interms of the equation:

Y = α+ βX

α and β are coeffi cients which determine a particular line.

Any line can be defined by its intercept and slope.

α is the intercept and β the slope.

Figure A.1 has an intercept of one (α = 1) and a slope of two(β = 2).

() Introductory Econometrics: Topic 1 11 / 90

Page 12: A Non-technical Introduction to Regression - Personal WWW Pages

0 1 2 3 4 5 6 7 8 9 100

5

10

15

20

25Figure A.1: A Straight Line

X

Y

Intercept= α

Slope of Line = β

() Introductory Econometrics: Topic 1 12 / 90

Page 13: A Non-technical Introduction to Regression - Personal WWW Pages

Mathematical Basics (cont.): Logarithms

For reasons which will be explained in the course, sometimes we willwant to work with the natural logarithm of A instead of A itself

Notation: ln (A).

Formal definition provided on page 335 of textbook.

Key thing: ln (A) is a function of A that can be calculated by thecomputer

The inverse of the natural logarithm function is the exponentialfunction, denoted by exp (A) (i.e. exp (ln (A)) = A)

Properties that we will use:

ln (AB) = ln (A) + ln (B)ln(AB)= B ln (A)

ln (1+ A) ≈ A if A is small.

() Introductory Econometrics: Topic 1 13 / 90

Page 14: A Non-technical Introduction to Regression - Personal WWW Pages

Mathematical Basics (cont.): Summation

Remember:Yi is observation for individual i for i = 1 to N.

Example: wages for 100 individuals

Greek letter ∑, is the summation (or “adding up”) operator andsuperscripts and subscripts on ∑ indicate the observations that arebeing added up.

Example:100

∑i=1Yi

This adds up the wages for all of the 100 individuals.

() Introductory Econometrics: Topic 1 14 / 90

Page 15: A Non-technical Introduction to Regression - Personal WWW Pages

Properties of the Summation Operator

Let Xi and Yi for i = 1, ..,N be observations on two variables and cbe a constant

N

∑i=1cXi = c

N

∑i=1Xi

N

∑i=1(Xi + Yi ) =

N

∑i=1Xi +

N

∑i=1Yi

N

∑i=1c = cN

() Introductory Econometrics: Topic 1 15 / 90

Page 16: A Non-technical Introduction to Regression - Personal WWW Pages

Back to Descriptive Statistics (or Summary Statistics)

Ex (continued): real GDP per capita across the 90 countries.

Histogram in Figure 1.2 plots the distribution of this variable

Common to present mean and standard deviation to summarizenumerically main features of the data

Mean is a measure of location (intuition: mean, average, typicalvalue, middle of distribution are similar concepts)

Remember: Y1, ..,YN are N different observations on our variable(N = 90 countries)

This is the sample.

The mean is given by:

Y =∑Ni=1 YiN

() Introductory Econometrics: Topic 1 16 / 90

Page 17: A Non-technical Introduction to Regression - Personal WWW Pages

For real GDP data, Y = $5, 443.80Mean or average can hide variation across countries.

Other useful summary statistics are the minimum and maximum.

Minimum GDP per capita is $408 (Chad)Maximum is $17, 945 (U.S.).Difference between minimum and maximum is one measure ofdispersion

Dispersion = measure of variability, how spread out, how unequal, etc.

() Introductory Econometrics: Topic 1 17 / 90

Page 18: A Non-technical Introduction to Regression - Personal WWW Pages

More common measure of dispersion is the standard deviation:

s =

√∑Ni=1

(Yi − Y

)2N − 1

Note: also common to use variance which is s2

In GDP example: s = $5, 369.496Diffi cult to interpret in an absolute sense, but useful for relativecomparisons

When comparing two different distributions, the one with the smallerstandard deviation will exhibit less dispersion.

() Introductory Econometrics: Topic 1 18 / 90

Page 19: A Non-technical Introduction to Regression - Personal WWW Pages

Correlation

Correlation numerically measures the degree of association or thestrength of the relationship between two variables.

Let X and Y be two variables (e.g. population density anddeforestation, respectively)

The formula to calculate the correlation between X and Y is:

r =∑Ni=1

(Yi − Y

) (Xi − X

)√∑Ni=1

(Yi − Y

)2√∑Ni=1

(Xi − X

)2Don’t worry about memorizing this: computer will calculate it for you

Note: if unclear from context, use subscripts: rXY is the correlationbetween variables X and Y

() Introductory Econometrics: Topic 1 19 / 90

Page 20: A Non-technical Introduction to Regression - Personal WWW Pages

Properties of Correlation

r lies between −1 and 1.Positive values of r indicate a positive correlation between X and Y .Negative values indicate a negative correlation. r = 0 indicates thatX and Y are uncorrelated.

Larger positive values of r indicate stronger positive correlation.r = 1 indicates perfect positive correlation.

Larger negative values of r indicate stronger negative correlation.r = −1 indicates perfect negative correlation.The correlation between Y and X is the same as the correlationbetween X and Y .

The correlation between any variable and itself is 1.

() Introductory Econometrics: Topic 1 20 / 90

Page 21: A Non-technical Introduction to Regression - Personal WWW Pages

Example: relationship between deforestation (Y ) and populationdensity (X ).

Figure 1.3 plotted this data in an XY-plot

r = 0.66.

Positive relationship between deforestation and population density.

How do you interpret the precise number?

r2 measures the proportion of the cross-country variability indeforestation that is explained by the variabillity in population density.

Crrelation is a numerical measure of the degree to which patterns inX and Y correspond.

Since 0.662 = 0.44, 44% of cross-country variance in deforestationcan be explained by variance in population density.

() Introductory Econometrics: Topic 1 21 / 90

Page 22: A Non-technical Introduction to Regression - Personal WWW Pages

Understanding Why Variables are Correlated

Care must always be taken with interpreting correlations (orregression results)

Correlation does not necessarily imply causality

Example: smoking causes lung cancer, but drinking alcohol does not

How would these facts reveal themselves in a data set?

X = number of cigarettes smoked

Y = lung cancer rate

() Introductory Econometrics: Topic 1 22 / 90

Page 23: A Non-technical Introduction to Regression - Personal WWW Pages

We would find rXY > 0

This correlation does reflect causality

Let Z be a measure of alcohol consumption

In practice, we often find rXZ > 0 (smokers tend to drink more thannon-smokers)

This will often lead to rYZ > 0 (correlation between drinking and lungcancer positive)

But this correlation does not reflect causality

() Introductory Econometrics: Topic 1 23 / 90

Page 24: A Non-technical Introduction to Regression - Personal WWW Pages

Understanding Correlation Through XY-plots

Figure 1.3 (plot of deforestation versus population) is type of graphwe obtain with moderately correlated variables.

Figure 1.6 shows what happens with perfectly correlated variableswith r = 1)

All the points lie exactly on a straight line.

Stronger correlations imply stronger patterns (less scattering of datapoints) in an XY plot

() Introductory Econometrics: Topic 1 24 / 90

Page 25: A Non-technical Introduction to Regression - Personal WWW Pages

­5 ­4 ­3 ­2 ­1 0 1 2 3 4 5­2.5

­2

­1.5

­1

­0.5

0

0.5

1

1.5

2

2.5Figure 1.6: XY­Plot of Two Perfectly Correlated Variables

X

Y

() Introductory Econometrics: Topic 1 25 / 90

Page 26: A Non-technical Introduction to Regression - Personal WWW Pages

Figure 1.7 plots two completely uncorrelated variables (r = 0).

Note that the points are randomly scattered over the entire graph.

­3 ­2.5 ­2 ­1.5 ­1 ­0.5 0 0.5 1 1.5 2­2.5

­2

­1.5

­1

­0.5

0

0.5

1

1.5

2

2.5Figure 1.7: XY­Plot of Unorrelated Variables

X

Y

() Introductory Econometrics: Topic 1 26 / 90

Page 27: A Non-technical Introduction to Regression - Personal WWW Pages

Correlation Between Several Variables

When we have many variables often want to obtain correlationsbetween each paper

E.g. with X , Y and Z , then there are three possible correlations(rXY , rXZ and rYZ ).

Can put them in a correlation matrix.

Table 1.2: A Correlation MatrixX Y Z

X 1.000Y rXY 1.000Z rXZ rYZ 1.000

() Introductory Econometrics: Topic 1 27 / 90

Page 28: A Non-technical Introduction to Regression - Personal WWW Pages

Regression

Regression is the most important tool of the applied economist

Used to understand the relationships between many variables.

We begin with simple regression to understand the relationshipbetween two variables, X and Y.

Regression can be understood as a Best Fitting Line

Example: see Figure 2.1 which is XY-plot of X = output versus Y =costs of production for 123 electric utility companies in the U.S. in1970.

The microeconomist will want to understand the relationship betweenoutput and costs.

Regression fits a line through the points in the XY-plot that bestcaptures the relationship between output and costs.

() Introductory Econometrics: Topic 1 28 / 90

Page 29: A Non-technical Introduction to Regression - Personal WWW Pages

0 10 20 30 40 50 60 70 800

50

100

150

200

250

300Figure 2.1: XY­plot of Output versus Costs

Output (Millions of KwH)

Cos

ts (M

illio

ns o

f $)

() Introductory Econometrics: Topic 1 29 / 90

Page 30: A Non-technical Introduction to Regression - Personal WWW Pages

Simple Regression: Some Theory

Question: What do we mean by “best fitting” line?

Assume a linear relationship between X = output and Y = costs

Y = α+ βX

where α is the intercept of the line and β its slope.

() Introductory Econometrics: Topic 1 30 / 90

Page 31: A Non-technical Introduction to Regression - Personal WWW Pages

Even if straight line relationship were true, never get all points on anXY-plot lying precisely on it due to measurement error.

True relationship probably more complicated, straight line may just bean approximation.

Important variables which affect Y may be omitted.

Due to these factors we add an error, ε, which yields the regressionmodel :

Y = α+ βX + ε

() Introductory Econometrics: Topic 1 31 / 90

Page 32: A Non-technical Introduction to Regression - Personal WWW Pages

What we know: X and Y .

What we do not know: α, β and ε.

Regression analysis uses data (X and Y ) to make a guess orestimate of what α and β are.

Notation: α and β are the estimates of α and β.

() Introductory Econometrics: Topic 1 32 / 90

Page 33: A Non-technical Introduction to Regression - Personal WWW Pages

Distinction Between Errors and Residuals

We have data for i = 1, ., ,N individuals (or countries, or companies,etc.).

Individual observations are denoted using subscripts: Yi fori = 1, ..,N and Xi for i = 1, ..,N

True Regression Line holds for every observation:

Yi = α+ βXi + εi

Error for i th individual can be written as:

εi = Yi − α− βXi

() Introductory Econometrics: Topic 1 33 / 90

Page 34: A Non-technical Introduction to Regression - Personal WWW Pages

If we replace α and β by estimates, we get fitted (or estimated)regression line:

Yi = α+ βXi

and residuals are given by

εi = Yi − α− βXi

Residuals measure distance that each observation is from the fittedregression line.

A “good fitting” regression line will have observations lying near theregression line and, thus, residuals will be small.

() Introductory Econometrics: Topic 1 34 / 90

Page 35: A Non-technical Introduction to Regression - Personal WWW Pages

Derivation of OLS Estimator

How do we choose and α and β?

A regression line which fits well will make residuals as small aspossible.

Usual way of measuring size of the residuals is the sum of squaredresiduals (SSR), which can be written in the following (equivalent)ways:

SSR = ∑Ni=1 ε2i

= ∑Ni=1

(Yi − α− βXi

)2= ∑N

i=1

(Yi − Yi

)2ordinary least squares (OLS) estimator finds values of α and β whichminimize SSR

Formula for OLS estimator will be discussed later (in practice,econometrics software packages will calculate α and β.)

() Introductory Econometrics: Topic 1 35 / 90

Page 36: A Non-technical Introduction to Regression - Personal WWW Pages

Jargon of Regression

Y = dependent variable.

X = explanatory (or independent) variable.

α and β are coeffi cients.

α and β and are OLS estimates of coeffi cients

“Run a regression of Y on X”

() Introductory Econometrics: Topic 1 36 / 90

Page 37: A Non-technical Introduction to Regression - Personal WWW Pages

Interpreting OLS Estimates

Remember fitted regression line is

Yi = α+ βXi

Interpretation of α is estimated value of Y if X = 0.

Example: X = lot size, Y = house price. α = estimated value of ahouse with lot size = 0 (not of interest since houses with lot sizeequal zero do not exist).

β is usually (but not always) the coeffi cient of most interest.

The following are a few different ways of interpreting β.

β is slope of the best fitting straight line through an XY-plot such asFigure 2.2

() Introductory Econometrics: Topic 1 37 / 90

Page 38: A Non-technical Introduction to Regression - Personal WWW Pages

0 10 20 30 40 50 60 70 800

50

100

150

200

250

300

350Figure 2.2: XY ­plot of  Output v ersus Costs with Fitted Regression Line

Output (Millions of  KwH)

Cos

ts (M

illio

ns o

f $)

() Introductory Econometrics: Topic 1 38 / 90

Page 39: A Non-technical Introduction to Regression - Personal WWW Pages

β can be interpreted as a derivative:

β =dYidXi

.

β is the marginal effect of X on Y .

It is a measure of how much the explanatory variable influences thedependent variable.

β is measure of how much Y tends to change when X is changed byone unit.

The definition of “unit”depends on the particular data set beingstudied.

() Introductory Econometrics: Topic 1 39 / 90

Page 40: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Costs of production in the electric utilityindustry data set

Using data set in Figures 2.1 and 2.2 we find β = 4.79.

A measure of how much costs tend to change when output changesby a small amount.

Costs are measured in terms of millions of dollars and output ismeasured as millions of kilowatt hours of electricity produced.

Thus: if output is increased by one million kilowatt hours (i.e. achange of one unit in the explanatory variable), costs will tend toincrease by $4, 790, 000.

() Introductory Econometrics: Topic 1 40 / 90

Page 41: A Non-technical Introduction to Regression - Personal WWW Pages

Measuring the Fit of a Regression Model

The most common measure of fit is R2.

Intuition: “Variability”= (e.g.) how costs vary across companies

Total variability in dependent variable Y =

Variability explained by the explanatory variable (X ) in the regression

+Variability that cannot be explained and is left as error.

R2 measures the proportion of the variability in Y that can beexplained by X .

() Introductory Econometrics: Topic 1 41 / 90

Page 42: A Non-technical Introduction to Regression - Personal WWW Pages

Formalizing the Definition of R-squared

Remember that variance is measure of dispersion or variability.

Variance of any variable can be estimated by:

var (Y ) =∑Ni=1

(Yi − Y

)2N − 1

where Y = ∑Ni=1 YiN is the mean, or average value, of the variable.

Total sum of squares (TSS) is proportional to variance of dependentvariable:

TSS =N

∑i=1

(Yi − Y

)2

() Introductory Econometrics: Topic 1 42 / 90

Page 43: A Non-technical Introduction to Regression - Personal WWW Pages

The following is not hard to prove:

TSS = RSS + SSR

RSS is regression sum of squares, a measure of the explanationprovided by the regression model:

RSS =N

∑i=1

(Yi − Y

)2SSR is the sum of squared residuals.Formalizes idea that “variability in Y can be broken into explainedand unexplained parts”We can now define our measure of fit:

R2 =RSSTSS

or, equivalently,

R2 = 1− SSRTSS

.

() Introductory Econometrics: Topic 1 43 / 90

Page 44: A Non-technical Introduction to Regression - Personal WWW Pages

Note that TSS , RSS and SSR are all sums of squared numbers and,hence, are all non-negative.

This implies TSS ≥ RSS and TSS ≥ SSR. Using these facts, it canbe seen that 0 ≤ R2 ≤ 1.Intuition: small values of SSR indicate that residuals are small and,hence, regression model is fitting well.

Thus, values of R2 near 1 imply a good fit and that R2 = 1 implies aperfect fit.

Intuition: RSS measures how much of the variation in Y theexplanatory variables explain. If RSS is near zero, then we have littleexplanatory power (a bad fit) and R2 near zero.

() Introductory Econometrics: Topic 1 44 / 90

Page 45: A Non-technical Introduction to Regression - Personal WWW Pages

Example: In regression of Y = cost of production on X = output forthe 123 electric utility companies, R2 = .92. The fit of the regressionline is quite good.

92% of the variation in costs across companies can be explained bythe variation in output.

In simple regression (but not multiple regression), R2 is thecorrelation between Y and X squared.

() Introductory Econometrics: Topic 1 45 / 90

Page 46: A Non-technical Introduction to Regression - Personal WWW Pages

Basic Statistical Concepts in the Regression Model

α and β or only estimates of α and β. How accurate are theestimates?

This can be investigated through confidence intervals.

Closely related to the confidence interval is the concept of ahypothesis test.

Intuition relating to confidence intervals and hypothesis tests givenhere, formal derivation provided later.

() Introductory Econometrics: Topic 1 46 / 90

Page 47: A Non-technical Introduction to Regression - Personal WWW Pages

Confidence Intervals

Example: β = 4.79 is the point estimate of β in the regression ofcosts of production on output using our electric utility industry data

Point estimate is best guess of what β is.

Confidence intervals provide interval estimates which give a range inwhich you are highly confident that β must lie.

Example: If confidence interval is [4.53, 5.05]

“We are confident that β is greater than 4.53 and less than 5.05”

() Introductory Econometrics: Topic 1 47 / 90

Page 48: A Non-technical Introduction to Regression - Personal WWW Pages

We can obtain different confidence intervals corresponding todifferent levels of confidence.

95% confidence interval: “we are 95% confident that β lies in theinterval”

90% confidence interval: “we are 90% confident that β lies in theinterval”, etc..

The degree of confidence (e.g. 95%) is referred to as the confidencelevel.

Example: for the electric utility data set, the 95% confidence intervalfor β is [4.53, 5.05].

"We are 95% confident that the marginal effect of output on costs isat least 4.53 and at most 5.05".

() Introductory Econometrics: Topic 1 48 / 90

Page 49: A Non-technical Introduction to Regression - Personal WWW Pages

Hypothesis Testing

Hypothesis testing involves specifying a hypothesis to test. This isreferred to as the null hypothesis, H0.

It is compared to an alternative hypothesis, H1.

E.g. H0 : β = 0 vs. H1 : β 6= 0 is common (and Gretl will print outresults for this hypothesis test)

Many economic questions of interest have form: “Does theexplanatory variable have an effect on the dependent variable?”or,equivalently, “Does β = 0 in the regression of Y on X?”

() Introductory Econometrics: Topic 1 49 / 90

Page 50: A Non-technical Introduction to Regression - Personal WWW Pages

Aside on Confidence Intervals and Hypothesis Testing

Hypothesis testing and confidence intervals are closely related.

Can test whether β = 0 by checking whether the confidence intervalfor β contains zero.

If it does not then we can “reject the hypothesis that β = 0”

or conclude “X has significant explanatory power for Y”

or “β is significantly different from zero”or “β is statisticallysignificant”.

If confidence interval does include zero then change the word “reject”to “accept”and “has significant explanatory power”with “does nothave significant explanatory power”, and so on.

() Introductory Econometrics: Topic 1 50 / 90

Page 51: A Non-technical Introduction to Regression - Personal WWW Pages

Confidence interval approach to hypothesis testing is equivalent toapproach to hypothesis testing discussed next

Just as confidence intervals came with various levels of confidence(e.g. 95%), hypothesis tests come with various levels of significance.

Level of significance is 100% minus the confidence level.

E.g. if 95% confidence interval does not include zero, then you maysay “I reject the hypothesis that β = 0 at 5% level of significance”(i.e. 100%-95%=5%)

() Introductory Econometrics: Topic 1 51 / 90

Page 52: A Non-technical Introduction to Regression - Personal WWW Pages

Hypothesis Testing (continued)

First step: specify a hypothesis to test and choosing a significancelevel.

E.g. H0: β = 0 and the 5% level of significance.

Second step: calculate a test statistic and compare it to a criticalvalue (a concept we will define in Chapter 3).

E.g. For H0: β = 0, the test statistic is known as a t-statistic (ort-ratio or t-stat):

t =β

sbwhere we will explain sb later.

() Introductory Econometrics: Topic 1 52 / 90

Page 53: A Non-technical Introduction to Regression - Personal WWW Pages

Idea underlying hypothesis testing is that we accept H0 if the value ofthe test statistic is consistent with what could plausibly happen if H0is true.

If H0 is true, then we would expect β to be small (i.e. if β = 0 thenexpect β near zero).

But if β is large this is evidence against H0.

Formally test statistic is large or small relative to “critical value takenfrom statistical tables of the Student-t distribution” (define later).

() Introductory Econometrics: Topic 1 53 / 90

Page 54: A Non-technical Introduction to Regression - Personal WWW Pages

For empirical practice, often do not need critical value since P-valuefor this and other tests produced by computer packages.

P-value is level of significance at which you can reject H0.

E.g. with 5% level of significance and software package gives P-valueof 0.05 then reject H0.

If the P-value is less than 0.05 then you can also reject H0.

Students often want to interpret the P-value as measuring theprobability that β = 0.

E.g. if P-value less than 0.05 one wants to say "There is less than a5% probability that β = 0 and, since this is very small, I can rejectthe hypothesis that β = 0."

This is not formally correct. But, it does provide some informalintuition to motivate why small P-values lead you to reject H0.

() Introductory Econometrics: Topic 1 54 / 90

Page 55: A Non-technical Introduction to Regression - Personal WWW Pages

Hypothesis Testing involving R-squared: The F-statistic

Another popular hypothesis to test is H0: R2 = 0.

If R2 = 0 then X does not have any explanatory power for Y .

Note: for simple regression, this is equivalent to a test of β = 0.

However, for multiple regression (which we will discuss shortly), thetest of R2 = 0 will be different than tests of whether regressioncoeffi cients equal zero.

Same strategy: calculate a test statistic and compare to a criticalvalue.

Or most software will also calculate a P-value which directly gives ameasure of the plausibility of H0 : R2 = 0

() Introductory Econometrics: Topic 1 55 / 90

Page 56: A Non-technical Introduction to Regression - Personal WWW Pages

Test statistic is called the F-statistic:

F =(N − 2)R2(1− R2)

The appropriate statistical table for obtaining the critical value isF-distribution (to be explained later)

Or if the P-value for the F-test is less than 5% (i.e. 0.05), weconclude R2 6= 0.If the P-value for the F-test is greater than or equal to 5% , weconclude R2 = 0.

Of course, you can use levels of significance other than 5%.

() Introductory Econometrics: Topic 1 56 / 90

Page 57: A Non-technical Introduction to Regression - Personal WWW Pages

Computer packages typically provide the following:

β, the OLS estimate of β.

The 95% confidence interval, which gives an interval where we are95% confident β will lie.

Standard deviation (or standard error) of β, sb , which is a measure ofhow accurate β.

The test statistic, t, for testing H0: β = 0.

The P-value for testing H0: β = 0.

R2 which measures the proportion of the variability in Y explained byX

The F-statistic and P-value for testing H0 : R2 = 0.

() Introductory Econometrics: Topic 1 57 / 90

Page 58: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Cost of Production in the Electric Utility Industry

Regression of Y = the costs of production and X = output ofelectricity by 123 electric utility companies.

Table 2.1 presents regression results in the form they would beproduced by most software packages.

Table 2.1: Regression Results Using Electric Utility Data Set

Variable CoeffStandError

t-stat P-value95% conf.interval

Intercept 2.19 1.88 1.16 0.25 [−1.53, 5.91]Output 4.79 0.13 36.36 5× 10−67 [4.53, 5.05]

R2 = 0.92 and the P-value for testing H0 : R2 = 0 is 5.4× 10−67.

() Introductory Econometrics: Topic 1 58 / 90

Page 59: A Non-technical Introduction to Regression - Personal WWW Pages

Multiple Regression

Multiple regression same as simple regression except manyexplanatory variables.

Intuition and derivation of multiple and simple regression very similar.

We will emphasise only the few differences between simple andmultiple regression.

() Introductory Econometrics: Topic 1 59 / 90

Page 60: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Explaining House Prices

Data on N = 546 houses sold in Windsor, Canada.

Dependent variable, Y , is the sales price of the house in Canadiandollars.

Four explanatory variables:

X1= the lot size of the property (in square feet)

X2 = the number of bedrooms

X3 = the number of bathrooms

X4 = the number of storeys (excluding the basement).

() Introductory Econometrics: Topic 1 60 / 90

Page 61: A Non-technical Introduction to Regression - Personal WWW Pages

OLS Estimation of the Multiple Regression Model

With k explanatory variables model is:

Yi = α+ β1X1i + β2X2i + ..+ βkXki + εi

i subscripts to denote observations, i = 1, ..,N.

With multiple regression have to estimate α and β1, .., βk .

OLS estimates are found by choosing the values of α and β1, β2, .., βkthat minimize the SSR:

SSR =N

∑i=1

(Yi − α− β1X1i − β2X2i − ..− βkXki

)2Computer packages like Gretl will calculate OLS estimates.

() Introductory Econometrics: Topic 1 61 / 90

Page 62: A Non-technical Introduction to Regression - Personal WWW Pages

Statistical Aspects of Multiple Regression

Largely the same as for simple regression.

Formulae for confidence intervals, test statistics, etc. have only minormodifications.

R2 is still a measure of fit.

Can test R2 = 0 in same manner as for simple regression.

If you find R2 6= 0 then conclude that explanatory variables togetherprovide significant explanatory power (Note: this does not necessarilymean each individual explanatory variable is significant).

Confidence intervals can be calculated for each individual coeffi cientas before.

Can test βj = 0 for each individual coeffi cient (j = 1, 2, .., k) asbefore.

Note: have a confidence interval and a test statistic for eachcoeffi cient.

() Introductory Econometrics: Topic 1 62 / 90

Page 63: A Non-technical Introduction to Regression - Personal WWW Pages

Interpreting OLS Estimates in the Multiple RegressionModel

Mathematical Intuition: Total vs. partial derivative

Simple regression:

β =dYdX

Multiple Regression:

βj =∂Y∂Xj

for the j th coeffi cient j = 1, .., k.

() Introductory Econometrics: Topic 1 63 / 90

Page 64: A Non-technical Introduction to Regression - Personal WWW Pages

Interpreting OLS Estimates in the Multiple RegressionModel

Verbal intuition: with simple regression β is the marginal effect of Xon Y

Multiple regression: βj is the marginal effect of Xj on Y , ceterisparibus

βj is the effect of a small change in the jth explanatory variable on

the dependent variable, holding all the other explanatory variablesconstant.

() Introductory Econometrics: Topic 1 64 / 90

Page 65: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Explaining House Prices (continued)

Multiple regression results using the house price data set:

Table 2.2: Multiple Regression Using House Price Data Set

Variable Coeffi cient t-stat P-value95% conf.interval

Intercept −4009.55 −1.11 0.27 [−11087, 3068]Lot Size 5.43 14.70 2× 10−41 [4.70, 6.15]# bedrm 2824.61 2.33 0.02 [439, 5211]# bathrm 17105.17 9.86 3× 10−21 [13698, 20512]# storeys 7634.90 7.57 1× 10−13 [5655, 9615]

Furthermore, R2 = 0.54 and the P-value for testing H0 : R2 = 0 is1.2× 10−88.

() Introductory Econometrics: Topic 1 65 / 90

Page 66: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Explaining House Prices (continued)

How can we interpret the fact that β1 = 5.43?

An extra square foot of lot size will tend to add $5.43 onto the priceof a house, ceteris paribus.

For houses with the same number of bedrooms, bathrooms andstoreys, an extra square foot of lots size will tend to add $5.43 ontothe price of a house.

If we compare houses with the same number of bedrooms, bathroomsand storeys, those with larger lots tend to be worth more. Inparticular, an extra square foot in lot size is associated with anincreased price of $5.43.

() Introductory Econometrics: Topic 1 66 / 90

Page 67: A Non-technical Introduction to Regression - Personal WWW Pages

Confidence interval for β1: “I am 95% confident that the marginaleffect of lot size on house price (holding other explanatory variablesconstant) is at least $4.70 and at most $6.15”Hypothesis testing: “Since the P-value for testing H0 : β1 = 0 is lessthan 0.05, we can conclude that β1 is significant at the 5% level ofsignificance"

Can make similar statements for the other coeffi cients.

Since R2 = 0.54 can say: “54% of the variability in house prices canbe explained by the four explanatory variables”

Since the P-value for testing H0 : R2 = 0 is less than 0.05, we canconclude that the explanatory variables (jointly) have significantexplanatory power at the 5% level of significance

() Introductory Econometrics: Topic 1 67 / 90

Page 68: A Non-technical Introduction to Regression - Personal WWW Pages

Which Explanatory Variables to Choose in a MultipleRegression Model?

We will relate this question to topics of omitted variables bias andmulticollinearity.

First note that there are two important considerations which pull inopposite directions.

It is good to include all variables which help explain the dependentvariable (include as many explanatory variables as possible).

Including irrelevant variables (i.e. ones with no explanatory power)will lead to less precise estimates (include as few explanatory variablesas possible).

Playing off these two competing considerations is an important aspectof any empirical exercise. Hypothesis testing procedures can help withthis.

() Introductory Econometrics: Topic 1 68 / 90

Page 69: A Non-technical Introduction to Regression - Personal WWW Pages

Omitted Variables Bias

To illustrate this problem we use the house price data set.

A simple regression of Y = house price on X = number of bedroomsyields a coeffi cient estimate of 13, 269.98.

But in multiple regression (see Table 2.2), coeffi cient on number ofbedrooms was 2, 824.61.

Why are these two coeffi cients on the same explanatory variable sodifferent? i.e. 13, 269.98 is much bigger than 2, 824.61.

() Introductory Econometrics: Topic 1 69 / 90

Page 70: A Non-technical Introduction to Regression - Personal WWW Pages

Answer 1: They just come from two different regressions whichcontrol for different explanatory variables (different ceteris paribusconditions).

Answer 2:

Imagine a friend asked: “I have 2 bedrooms and I am thinking ofbuilding a third, how much will it raise the price of my house?”

Simple regression: “Houses with 3 bedrooms tend to cost $13,269.98more than houses with 2 bedrooms”

Does this mean adding a 3rd bedroom will tend to raise price of houseby $13,269.98? Not necessarily, other factors influence house prices.

() Introductory Econometrics: Topic 1 70 / 90

Page 71: A Non-technical Introduction to Regression - Personal WWW Pages

Houses with three bedrooms also tend to be desirable in other ways(e.g. bigger, with larger lots, more bathrooms, more storeys, etc.).Call these “good houses”.

Simple regression notes “good houses” tend to be worth more thanothers.

Number of bedrooms is acting as a proxy for all these “good house”characteristics and hence its coeffi cient becomes very big (13,269.98)in simple regression.

Multiple regression can estimate separate effects due to lot size,number of bedroom, bathrooms and storeys.

Tell your friend: “Adding a third bedroom will tend to raise yourhouse price by $2,824.61”.

Multiple regressions which contains all (or most) of housecharacteristics will tend to be more reliable than simple regressionwhich only uses one characteristic.

() Introductory Econometrics: Topic 1 71 / 90

Page 72: A Non-technical Introduction to Regression - Personal WWW Pages

Take a look at the correlation matrix for this data set:

Table 2.3: Correlations Matrix for House Price Data SetPrice Lot Size # bed # bath # storey

Price 1Lot Size 0.54 1# bed 0.37 0.15 1# bath 0.52 0.19 0.37 1# storey 0.42 0.08 0.41 0.32 1

Positive correlations between explanatory variables indicate thathouses with more bedrooms also tend to have larger lot size, morebathrooms and more storeys.

() Introductory Econometrics: Topic 1 72 / 90

Page 73: A Non-technical Introduction to Regression - Personal WWW Pages

Omitted Variable Bias

“Omitted variable bias” is a statistical term for these issues.

IF

1. We exclude explanatory variables that should be present in theregression,

AND

2. these omitted variables are correlated with the includedexplanatory variables,

THEN

3. the OLS estimates of the coeffi cients on the includedexplanatory variables will be biased.

() Introductory Econometrics: Topic 1 73 / 90

Page 74: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Explaining House Prices (continued)

Simple regression used Y = house prices and X = number ofbedrooms.

Many important determinants of house prices omitted.

Omitted variables were correlated with number of bedrooms.

Hence, the OLS estimate from the simple regression of 13, 269.98 wasbiased.

() Introductory Econometrics: Topic 1 74 / 90

Page 75: A Non-technical Introduction to Regression - Personal WWW Pages

Practical Advice for Selecting Explanatory Variables

Include (insofar as possible) all explanatory variables which you thinkmight possibly explain your dependent variable. This will reduce therisk of omitted variable bias.

However, including irrelevant explanatory variables reduces accuracyof estimation and increases confidence intervals.

So do t-tests (or other hypothesis tests) to decide whether variablesare significant. Run a new regression omitting the explanatoryvariables which are not significant.

() Introductory Econometrics: Topic 1 75 / 90

Page 76: A Non-technical Introduction to Regression - Personal WWW Pages

Multicollinearity

Intuition: If explanatory variables are highly correlated with oneanother then regression model has trouble telling which individualvariable is explaining Y .

Symptom: Individual coeffi cients may look insignificant, butregression as a whole may look significant (e.g. R2 big, F-stat big,but t-stats on individual coeffi cients small).

Looking at a correlation matrix for explanatory variables can often behelpful in revealing extent and source of multicollinearity problem.

() Introductory Econometrics: Topic 1 76 / 90

Page 77: A Non-technical Introduction to Regression - Personal WWW Pages

Example of Multicollinearity

Y = exchange rate

Explanatory variable(s) = interest rate

X1 = bank prime rate

X2 = Treasury bill rate

Using both X1 and X2 will probably cause multicollinearity problem

Solution: Include either X1 or X2 but not both.

In some cases this “solution”will be unsatisfactory if it causes you todrop out explanatory variables which economic theory says should bethere.

() Introductory Econometrics: Topic 1 77 / 90

Page 78: A Non-technical Introduction to Regression - Personal WWW Pages

Multiple Regression with Dummy Variables

Dummy variable is either 0 or 1.

Use to turn qualitative (Yes/No) data into 1/0.

Example: Explaining House Prices (continued)

Data set has 5 potential dummy explanatory variables

D1 = 1 if the house has a driveway (= 0 if it does not)

D2 = 1 if the house has a recreation room (= 0 if not)

D3 = 1 if the house has a basement (= 0 if not)

D4 = 1 if the house has gas central heating (= 0 if not)

D5 = 1 if the house has air conditioning (= 0 if not)

() Introductory Econometrics: Topic 1 78 / 90

Page 79: A Non-technical Introduction to Regression - Personal WWW Pages

Simple Regression with a Dummy Variable

One dummy explanatory variable, D:

Yi = α+ βDi + εi

for i = 1, ..,N observations.

OLS estimation produces α and β, and fitted regression line is:

Yi = α+ βDi

Since Di is either 0 or 1, we either have Yi = α or Yi = α+ β.

() Introductory Econometrics: Topic 1 79 / 90

Page 80: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Explaining House Prices (continued)

Regress Y = house price on D = dummy for air conditioning (=1 ifhouse has air conditioning, = 0 otherwise).

Fitted regression line is:

Yi = 59884.85+ 25995.74Di

Average price of house with air conditioning is $85, 881Average price of house without air conditioning is $59, 885Remember, however, omitted variables bias (this simple regression nodoubt suffers from it)

() Introductory Econometrics: Topic 1 80 / 90

Page 81: A Non-technical Introduction to Regression - Personal WWW Pages

Multiple Regression with Dummy Variables

Yi = α+ β1D1i + ..+ βkDki + εi

Example: Explaining House Prices (continued)

Regress Y = house price on D1 = driveway dummy and D2 = recroom dummy.

() Introductory Econometrics: Topic 1 81 / 90

Page 82: A Non-technical Introduction to Regression - Personal WWW Pages

Fitted regression line:

Yi = 47099.08+ 21159.91D1i + 16023.69D2i

Putting in either 0 or 1 values for the dummy variables, we obtain thefitted values for Y for the four categories of houses:

Houses with a driveway and recreation room (D1 = 1 and D2 = 1)have Yi = 47099+ 21160+ 16024 = $84, 283.Houses with a driveway but no recreation room (D1 = 1 and D2 = 0)have Yi = 47099+ 21160 = $68, 259.Houses with a recreation room but no driveway (D1 = 0 and D2 = 1)have Yi = 47099+ 16024 = $63, 123.Houses with no driveway and no recreation room (D1 = 0 andD2 = 0) have Yi = $47, 099.Multiple regression with dummy variables may be used to classify thehouses into different groups and to find average house prices for eachgroup.

() Introductory Econometrics: Topic 1 82 / 90

Page 83: A Non-technical Introduction to Regression - Personal WWW Pages

Multiple Regression with Dummy and non-DummyExplanatory Variables

E.g. one dummy variable (D) and one regular non-dummyexplanatory variable (X ):

Yi = α+ β1Di + β2Xi + εi

Example: Explaining House Prices (continued)Regress Y on D = air conditioning dummy and X = lot size.Obtain α = 32, 693, β1 = 20175 and β2 = 5.638.Get two different fitted regression lines

Yi = α+ β1 + β2Xi = 52868+ 5.638Xi

if Di = 1 (i.e. the ith house has an air conditioner) and

Yi = α+ β2Xi = 32693+ 5.638Xi

if Di = 0 (i.e. the house does not have an air conditioner).Note that the two regression lines have the same slope and only differin their intercepts.

() Introductory Econometrics: Topic 1 83 / 90

Page 84: A Non-technical Introduction to Regression - Personal WWW Pages

Interacting Dummy with non-Dummy ExplanatoryVariables

Consider the following regression model:

Yi = α+ β1Di + β2Xi + β3Zi + εi

where D and X are dummy and non-dummy explanatory variablesand Z = DX .How do we interpret results from a regression of Y on D,X and Z?Note that Zi is either 0 (for observations with Di = 0) or Xi (forobservations with Di = 1).Fitted regression lines for individuals with Di = 0 and Di = 1 are:If Di = 0 then Yi = α+ β2XiIf Di = 1 then Yi =

(α+ β1

)+(

β2 + β3

)Xi

Two different regression lines corresponding to D = 0 and D = 1exist and have different intercepts and different slopes.Marginal effect of X on Y is different for observations with Di = 0than with Di = 1.

() Introductory Econometrics: Topic 1 84 / 90

Page 85: A Non-technical Introduction to Regression - Personal WWW Pages

Example: Explaining House Prices (continued)

Regress Y = house price on D = air conditioner dummy, X = lot sizeand Z = D × Xα = 35684, β1 = 7613, β2 = 5.02 and β3 = 2.25.

Marginal effect of lot size on housing is 7.27 (i.e. β2 + β3) for houseswith air conditioners and only 5.02 for houses without.

() Introductory Econometrics: Topic 1 85 / 90

Page 86: A Non-technical Introduction to Regression - Personal WWW Pages

Working with Dummy Dependent Variables

Example: Dependent variable is a transport choice.

1 = “Yes I take my car to work”

0 = “No I do not take my car to work”

We will not discuss this case in this course.

Note only the following points:

There are some problems with OLS estimation. But OLS estimationmight be adequate in many cases.

Better estimation methods are “Logit” and “Probit” available inmany software packages.

() Introductory Econometrics: Topic 1 86 / 90

Page 87: A Non-technical Introduction to Regression - Personal WWW Pages

Chapter Summary

This non-technical introduction to regression, you should be able toget started in actually doing some empirical work (at least withcross-sectional data).The major points covered in this chapter include:Economic data comes in many forms. Common types are time series,cross-sectional and panel data.Simple graphical techniques, including histograms and XY-plots, areuseful ways of summarizing the information in a data set.Many numerical summaries can be used. The most important are themean, a measure of the location of a distribution, and the standarddeviation, a measure of how spread out or dispersed a distribution is.Correlation is a numerical measure of the relationship or associationbetween two variables.There are many reasons two variables might be correlated with eachother. However, correlation does not necessarily imply causalitybetween two variables.

() Introductory Econometrics: Topic 1 87 / 90

Page 88: A Non-technical Introduction to Regression - Personal WWW Pages

Simple regression quantifies effect of an explanatory variable, X , on adependent variable, Y , through a regression line Y = α+ βX .

Estimation of α and β involves choosing estimates which produces the"best fitting" line through an XY graph. These are called ordinaryleast squares (OLS) estimates, are labelled α and β and are obtainedby minimizing the sum of squared residuals (SSR).

Regression coeffi cients should be interpreted as marginal effects (i.e.as measures of the effect on Y of a small change in X ).

R2 is a measure of how well the regression line fits the data.

The confidence interval provides an interval estimate of any coeffi cient(e.g. an interval for β in which you can be confident β lies).

A hypothesis test of β = 0 used to find out whether explanatoryvariable belongs in regression. Hypothesis test can either be done bycomparing a test statistic (i.e. the t-stat) to a critical value takenfrom statistical tables or by examining P-value. If P-value is less than0.05 then you can reject the hypothesis at the 5% level of significance.

() Introductory Econometrics: Topic 1 88 / 90

Page 89: A Non-technical Introduction to Regression - Personal WWW Pages

The multiple regression model has more than one explanatoryvariable. The basic intuition (e.g. OLS estimates, confidenceintervals, etc.) is the same as for the simple regression model.However, with multiple regression the interpretation of regressioncoeffi cients is subject to ceteris paribus conditions.

If important explanatory variables are omitted from the regression andare correlated with included explanatory variables, omitted variablesbias occurs.

If explanatory variables are highly correlated with one another,coeffi cient estimates and statistical tests may be misleading. This isreferred to as the multicollinearity problem.

() Introductory Econometrics: Topic 1 89 / 90

Page 90: A Non-technical Introduction to Regression - Personal WWW Pages

The statistical techniques associated with the use of dummyexplanatory variables are exactly the same as with non-dummyexplanatory variables.

A regression involving only dummy explanatory variables classifies theobservations into various groups (e.g. houses with air conditionersand houses without). Interpretation of results is aided by carefulconsideration of what the groups are.

A regression involving dummy and non-dummy explanatory variablesclassifies the observations into groups and says that each group willhave a regression line with a different intercept. All these regressionlines have the same slope.

Regression involving dummy, non-dummy and interaction (i.e.dummy times non-dummy variables) explanatory variables classifiesthe observations into groups and says that each group will have adifferent regression line with different intercept and slope.

() Introductory Econometrics: Topic 1 90 / 90