Fitting the Data
Econ 140, Lecture 2
Today’s Plan
• Finishing off the examples from Lecture 1
• Introducing different types of data
• Fitting the data
– One of the most important lectures of the course
– There will be a question on this on a midterm and the final! (Almost guaranteed!)
– You can find this material in Appendix 4.2
Experimental vs Observational
• Because of financial, practical, and ethical concerns, experiments in economics are rare (examples: SIME/DIME, Tennessee STAR).
• Economists tend to use observational data, obtained from real-world behavior and collected through surveys or administrative records.
• Observational data pose problems: causal effects are hard to estimate without random assignment, and data definitions often do not quite match what economic theory requires.
• Much of econometrics is devoted to estimation methods that deal with the problems encountered with observational data.
Cross-Section Data
• We have already seen 2 examples of cross-section data:
– Wages and years of education
– Voting polls in Florida
• Cross-section data sets provide information about individual/agent behavior at a moment in time
• The Current Population Survey (CPS) is a cross-section survey that generates monthly detail about the US work force
• Data on counties, states, or even countries at a moment in time are also cross-section data.
Time Series Data Sets (1)
• Time series data sets provide information about individual/agent behavior over time
– A time unit of observation (day, week, month, year) defines a time series
• We hear about time series data every day:
– Nasdaq
– Financial Times Stock Exchange Index (FTSE)
– Dow Jones
– Government data: GDP/Unemployment/Inflation
Time Series Data Sets (2)
• Composition of the unit can change
– FTSE gives information on the top 100 stocks each day, not necessarily the same 100 stocks every day
– CPS gives data each month on the number of people who are unemployed, not the same people (we hope!) from month to month.
• Characteristics of time series data sets
– set of observations over time
– composition of unit can change
– compositional changes are dealt with using weighting schemes (Lecture 3)
Longitudinal Data Sets
• Longitudinal data sets provide information on a particular group of individuals/agents over time.
• For example: following the Econ 140, Fall 2002 class over time. Alternatively, a set of firms over time.
• Example we will use: Production functions (Cobb-Douglas) - following firms over time.
• Book example: Traffic Deaths and Alcohol Taxes - following states over time.
Ordinary Least Squares (OLS)
• Learning how to calculate a straight line (Appendix 4.2)
– Recall the scatter plot of earnings vs. years of education: there was a mess of data!
• We can use Ordinary Least Squares (OLS) to fit a straight line through these data points
– This line is called the least squares line or line of best fit
– Why is it called the ‘least squares line’?
– The least squares line minimizes the errors: OLS picks the line with the smallest sum of squared vertical distances between the data points and the line
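To make the idea of "least squares" concrete, here is a minimal Python sketch that scores two candidate lines by their sum of squared errors. The five data points and both candidate lines are hypothetical and chosen only for illustration; they are not the course data set.

```python
def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between each Y and the line a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (NOT the course data set): years of education and earnings.
xs = [8, 10, 12, 14, 16]
ys = [6.0, 9.0, 10.0, 14.0, 15.0]

# Candidate 1: a flat line at the average of Y (slope zero).
sse_flat = sum_squared_errors(10.8, 0.0, xs, ys)
# Candidate 2: an upward-sloping line.
sse_sloped = sum_squared_errors(-3.0, 1.15, xs, ys)

print("flat line SSE:  ", sse_flat)
print("sloped line SSE:", sse_sloped)
```

The candidate with the smaller sum of squared errors is the better fit in the least-squares sense; OLS finds the line for which this sum is as small as possible.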
Two Parts to OLS
1. Derive estimators for a (intercept) and b (slope coefficient)
– this means using differential calculus!
2. Calculate values for a and b from data
– this means mechanically using the derived formulas for a & b
• How to calculate a regression line through a mass of data points that do not necessarily lie on a straight line?
• Each data point (X,Y) has a value.
OLS Line
• We’ll call the regression line $\hat{Y}$
– this is an estimate of the true $Y$
• The errors $e_i$ will be the difference between $\hat{Y}_i$ and $Y_i$
– errors can be positive or negative
• We can write the following general equations:
$$\hat{Y}_i = a + bX_i$$
$$e_i = Y_i - \hat{Y}_i$$
Where i = 1 … n.
OLS Line
• A data set example is available at the course web site. It consists of five points. Using that output I can calculate the regression equation to be:
$$\hat{Y} = 3.8 + 0.9X$$
• Keeping this equation in mind we can find estimates of a and b given our general formulas for $Y$ and $\hat{Y}$
• We derive a and b from two different types of regression equations:
a from $Y_i = a + e_i$
b from $Y_i = bX_i + e_i$
OLS Line: Deriving a (1)
• We can rewrite $Y_i = a + e_i$ as $e_i = Y_i - a$
– we could write the objective function for a as:
$$g(a) = \sum_{i=1}^{n} e_i$$
• Go back to the regression analysis example: notice that the sum of errors is zero!
– Why? The positive and negative errors from the line of best fit always cancel out
– For a minimum you need a first order condition (FOC) set to zero.
– We need a FOC for OLS that is set to zero, not zero to start with!
OLS Line: Deriving a (2)
• We can’t just minimize the sum of the errors because
$$\sum_{i=1}^{n} e_i = 0$$
• Instead, we have to minimize the sum of the errors squared (hence, least squares):
$$g(a) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a)^2$$
where $e_i = Y_i - a$
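A small Python sketch can show why the raw sum of errors is not a useful objective. The two sets of errors below are hypothetical numbers chosen for illustration: both sum to zero because positives and negatives cancel, yet one set is clearly a much worse fit. Squaring removes the cancellation.

```python
# Hypothetical errors from two different fits (illustrative numbers only).
small_errors = [0.5, -0.5, 0.2, -0.2]   # points close to the line
large_errors = [5.0, -5.0, 2.0, -2.0]   # points far from the line

# The raw sums cannot tell the two fits apart: both are zero.
sum_small = sum(small_errors)
sum_large = sum(large_errors)

# The sums of squares can: every squared error is non-negative, so nothing cancels.
sq_small = sum(e ** 2 for e in small_errors)
sq_large = sum(e ** 2 for e in large_errors)

print("sum of errors:        ", sum_small, sum_large)
print("sum of squared errors:", sq_small, sq_large)
```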
OLS Line: Deriving a (3)
• Differentiate with respect to a to find the formula for the OLS estimator a
• Note that you set the first order condition to zero to find a minimum: $-2\sum_{i=1}^{n} e_i = 0$
(don’t worry about the second order derivative, which will be positive).
• Remember that $e_i = Y_i - a$, so the condition becomes $\sum_{i=1}^{n} Y_i - na = 0$
• Solve for a: $a = \sum_{i=1}^{n} Y_i / n = \bar{Y}$.
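The result that the intercept-only estimator is the sample mean of Y can be checked numerically. The sketch below is a minimal illustration using hypothetical Y values (not the course data set): it computes a as the mean, confirms the errors sum to zero, and shows that moving a away from the mean in either direction raises the sum of squared errors.

```python
def intercept_only_estimate(ys):
    """OLS estimate of a in Y_i = a + e_i: the sample mean of Y."""
    return sum(ys) / len(ys)

def sse_intercept_only(a, ys):
    """Sum of squared errors g(a) = sum of (Y_i - a)^2."""
    return sum((y - a) ** 2 for y in ys)

ys = [6.0, 9.0, 10.0, 14.0, 15.0]  # hypothetical values for illustration

a_hat = intercept_only_estimate(ys)
errors = [y - a_hat for y in ys]

# First order condition: the errors sum to zero (up to rounding) at a_hat.
print("a_hat =", a_hat, " sum of errors =", sum(errors))

# g(a) is smallest at a_hat; nudging a up or down makes it larger.
sse_at_min = sse_intercept_only(a_hat, ys)
sse_below = sse_intercept_only(a_hat - 0.5, ys)
sse_above = sse_intercept_only(a_hat + 0.5, ys)
print(sse_below, sse_at_min, sse_above)
```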
OLS Line: Deriving b (1)
• Now consider the slope regression where
$$\hat{Y}_i = bX_i \quad \text{and} \quad e_i = Y_i - bX_i$$
• We use the same principles as before:
$$g(b) = \sum e_i^2 = \sum (Y_i - bX_i)^2$$
$$-2\sum X_i (Y_i - bX_i) = -2\sum X_i e_i = 0$$
Note: this condition only holds if there’s no correlation between X and the errors
So:
$$\sum X_i Y_i - b\sum X_i^2 = 0$$
$$b = \frac{\sum X_i Y_i}{\sum X_i^2}$$
(keep in mind that this expression only holds for the regression of a zero intercept and non-zero slope)
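As a rough check on the zero-intercept formula, b equal to the sum of X times Y divided by the sum of X squared, the following Python sketch computes b for a hypothetical data set (the numbers are made up for illustration) and verifies that the first order condition, sum of X times e equal to zero, holds at that value.

```python
def slope_through_origin(xs, ys):
    """OLS slope for the zero-intercept model Y_i = b*X_i + e_i."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("all X values are zero; the slope is not defined")
    return sum_xy / sum_xx

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b_hat = slope_through_origin(xs, ys)
errors = [y - b_hat * x for x, y in zip(xs, ys)]

# First order condition for b: sum of X_i * e_i should be zero (up to rounding).
foc = sum(x * e for x, e in zip(xs, errors))
print("b_hat =", b_hat, " sum of X*e =", foc)
```

Note that the errors themselves need not sum to zero here, because this model has no intercept.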
OLS Line: Collect a & b
• We know a regression line with a non-zero intercept and a non-zero slope coefficient looks like:
$$\hat{Y}_i = a + bX_i$$
• We also know:
$$e_i = Y_i - \hat{Y}_i$$
• From the derivations of a and b we have the necessary first order conditions:
$$\sum_{i=1}^{n} e_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} e_i X_i = 0$$
OLS Line: Collect a & b (2)
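• Plugging $e_i = Y_i - a - bX_i$ into the FOC from the derivation of b:
$$\sum_{i=1}^{n} X_i Y_i - a\sum_{i=1}^{n} X_i - b\sum_{i=1}^{n} X_i^2 = 0$$
• Plugging the same expression into the FOC from our derivation of a:
$$\sum_{i=1}^{n} Y_i - na - b\sum_{i=1}^{n} X_i = 0$$
Solving for a:
$$a = \bar{Y} - b\bar{X}$$
• Substituting this a into the first equation (using $\sum X_i = n\bar{X}$) and solving for b:
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$$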
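The two formulas, b equal to (sum of XY minus n times X-bar times Y-bar) divided by (sum of X squared minus n times X-bar squared) and a equal to Y-bar minus b times X-bar, translate directly into code. The sketch below is one possible implementation; the function name `ols_fit` and the data are hypothetical and not part of the course materials. It also checks that both first order conditions hold for the fitted line.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = sum_xx - n * x_bar ** 2
    if denom == 0:
        raise ValueError("X has no variation; the slope is not defined")
    b = (sum_xy - n * x_bar * y_bar) / denom
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data (NOT the course data set).
xs = [8, 10, 12, 14, 16]
ys = [6.0, 9.0, 10.0, 14.0, 15.0]

a, b = ols_fit(xs, ys)
errors = [y - (a + b * x) for x, y in zip(xs, ys)]

# Both first order conditions should hold up to rounding error.
foc_a = sum(errors)                               # sum of e_i
foc_b = sum(x * e for x, e in zip(xs, errors))    # sum of X_i * e_i
print("a =", a, " b =", b)
print("sum e =", foc_a, " sum X*e =", foc_b)
```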
Example
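• From the data set posted on the web
• To calculate the regression line you need:
$$\sum_{i=1}^{n} Y_i, \quad \sum_{i=1}^{n} X_i, \quad \sum_{i=1}^{n} X_i Y_i, \quad \sum_{i=1}^{n} X_i^2, \quad n$$
• Solve for a & b given the formulas:
$$b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$$
$$a = \frac{\sum_{i=1}^{n} Y_i}{n} - b\,\frac{\sum_{i=1}^{n} X_i}{n}$$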
Example (2)
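$$\sum_{i=1}^{n} Y_i = 46, \quad \sum_{i=1}^{n} X_i = 30, \quad \sum_{i=1}^{n} X_i Y_i = 285, \quad \sum_{i=1}^{n} X_i^2 = 190, \quad n = 5$$
$$b = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2} = \frac{285 - 5\left(\frac{30}{5}\right)\left(\frac{46}{5}\right)}{190 - 5\left(\frac{30}{5}\right)^2} = 0.9$$
$$a = \frac{\sum Y_i}{n} - b\,\frac{\sum X_i}{n} = 9.2 - 0.9(6) = 3.8$$
$$\hat{Y} = 3.8 + 0.9X$$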
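The arithmetic on this slide can be reproduced from the summary sums alone. The sketch below assumes the slide's values are sum of Y = 46, sum of X = 30, sum of XY = 285, sum of X squared = 190, and n = 5; the function name `ols_from_sums` is hypothetical. The individual data points are not needed, since the formulas depend only on these five numbers.

```python
def ols_from_sums(n, sum_x, sum_y, sum_xy, sum_xx):
    """Return (a, b) for Y-hat = a + b*X using only the summary sums."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

# Sums as read from the slide (five data points).
a, b = ols_from_sums(n=5, sum_x=30, sum_y=46, sum_xy=285, sum_xx=190)

# Round to two decimals to hide floating-point noise.
print(round(a, 2), round(b, 2))  # 3.8 0.9
```

This matches the regression equation reported earlier for the five-point data set, with intercept 3.8 and slope 0.9.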
Wrap Up
• Introduced three data types: cross-section, time series, and longitudinal
• Used the OLS technique to derive formulas for an intercept and a slope coefficient
– We estimated the regression lines $\hat{Y}_i = a$ and $\hat{Y}_i = bX_i$
– We found FOCs $\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} e_i X_i = 0$
• Then we put everything together to estimate $\hat{Y}_i = a + bX_i$