. Please start your Daily Portfolio Introduction to Statistics for the Social Sciences SBS200,...

Post on 31-Mar-2015


Please start your Daily Portfolio

Introduction to Statistics for the Social Sciences

SBS200, COMM200, GEOG200, PA200, POL200, or SOC200
Lecture Section 001, Summer Session II, 2013

9:00 - 11:20am Monday - Friday
Room 312 Social Sciences (Monday – Thursday)

Room 480 Marshall Building (Fridays)

http://www.youtube.com/watch?v=oSQJP40PcGI

My last name starts with a letter somewhere between

A. A – D
B. E – L
C. M – R
D. S – Z

Please click in

Please double check that all cell phones and other electronic devices are turned off and stowed away

Homework due – Wednesday

On class website: Please print and complete homework worksheet #13

Multiple Regression

Schedule of readings

Before Friday

Please read chapters 10 – 14

Please read Chapters 17 and 18 in Plous
Chapter 17: Social Influences
Chapter 18: Group Judgments and Decisions

Study Guide is online

Next couple of lectures 7/30/13

Use this as your study guide

Simple and Multiple Regression
Using correlation for predictions

r versus r²

Regression uses the predictor variable (independent) to make predictions about the predicted variable (dependent)

Coefficient of correlation is the name for “r”
Coefficient of determination is the name for “r²”

(remember r² is always positive – no direction info)

Standard error of the estimate is our measure of the variability of the dots around the regression line

(average deviation of each data point from the regression line – like standard deviation)

Coefficient of regression will be “b” for each variable (like slope)

Other Problems

The expected frequency of teeth brushing for having one cavity is

Frequency of teeth brushing = 5.5 + (-.91) Cavities

If “Cavities” = 3, what is the prediction for “Frequency of teeth brushing”?

Frequency of teeth brushing = 5.5 + (-.91) Cavities
Frequency of teeth brushing = 5.5 + (-.91) (3)
Frequency of teeth brushing = 5.5 + (-2.73) = 2.77
(3.0, 2.77)

Prediction line
Y’ = a + b1X1
(a is the Y-intercept, b1 is the slope)

If number of cavities = 3, frequency of teeth brushing will be 2.77
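The prediction above can be checked with a few lines of Python. This is a minimal sketch: the function name `predict_brushing` is illustrative, and the intercept (5.5) and slope (-0.91) are the values from the slide's equation.

```python
def predict_brushing(cavities, intercept=5.5, slope=-0.91):
    """Return the predicted brushing frequency Y' = a + b1 * X."""
    return intercept + slope * cavities


# Prediction for a person with 3 cavities
print(round(predict_brushing(3), 2))  # 2.77
```

Changing the argument gives the prediction for any other number of cavities on the same line.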

Review

r = -0.85
b1 = -0.91 (slope)
b0 = 5.5 (intercept)

Draw a regression line and regression equation

Prediction line
Y’ = b1X1 + b0
Y’ = (-.91)X1 + 5.5

Correlation - let’s predict how often they brushed their teeth

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

Find prediction line
Y’ = b1 X + b0
Y’ = (-0.91) X + 5.5

Plot line - predict Y’ from X
- Pick an X
- Pick another X

Let’s try X of 1
Y’ = (-0.91) 1 + 5.5 = 4.59 (plot 1, 4.59)

Let’s try X of 5
Y’ = (-0.91) 5 + 5.5 = 0.95 (plot 5, 0.95)

Review

r = -0.85
b1 = -0.91
b0 = 5.5

Y’ = b1 X + b0
Y’ = (-0.91) X + 5.5

Y’ = (-0.91) 1 + 5.5 = 4.59
Y’ = (-0.91) 3 + 5.5 = 2.77
Y’ = (-0.91) 2 + 5.5 = 3.68
Y’ = (-0.91) 3 + 5.5 = 2.77
Y’ = (-0.91) 5 + 5.5 = 0.95

X   Y
1   5
3   4
2   3
3   2
5   1

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

Review

Correlation - Evaluating the prediction line

Does the prediction line perfectly predict the Ys from the Xs?

No, let’s see

How much “error” is there? Exactly?

Prediction line
Y’ = b1X1 + b0
Y’ = (-.91)X1 + 5.5

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

Residuals

The green lines show how much “error” there is in our prediction line: how much we are wrong in our predictions

Correlation

Perfect correlation = +1.00 or -1.00

The more closely the dots approximate a straight line (the less spread out they are), the stronger the relationship is.

One variable perfectly predicts the other

No variability in the scatterplot

The dots approximate a straight line

Any residuals?

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

• Shorter green lines suggest better prediction – smaller error

• Longer green lines suggest worse prediction – larger error

• Why are green lines vertical? Remember, we are predicting the variable on the Y axis. So, error would be how wrong we are about Y (vertical)

How well does the prediction line predict the Ys from the Xs?

Residuals

A note about curvilinear relationships and patterns of the residuals

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

• Slope doesn’t give “variability” info
• Intercept doesn’t give “variability” info

• Correlation “r” does give “variability” info

How well does the prediction line predict the Ys from the Xs?

Residuals

• Residuals do give “variability” info

What if we want to know the “average deviation score”? Finding the standard error of the estimate (line)

Standard error of the estimate:

• a measure of the average amount of predictive error
• the average amount that Y’ scores differ from Y scores
• a mean of the lengths of the green lines

Standard error of the estimate (line)

Sound familiar??


r = -0.85
b1 = -0.91
b0 = 5.5

Y’ = b1 X + b0
Y’ = (-0.91) X + 5.5

Y’ = (-0.91) 1 + 5.5 = 4.59
Y’ = (-0.91) 3 + 5.5 = 2.77
Y’ = (-0.91) 2 + 5.5 = 3.68
Y’ = (-0.91) 3 + 5.5 = 2.77
Y’ = (-0.91) 5 + 5.5 = 0.95

X   Y   Y’     Y - Y’
1   5   4.59    0.41
3   4   2.77    1.23
2   3   3.68   -0.68
3   2   2.77   -0.77
5   1   0.95    0.05
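The residual column (Y - Y’) can be reproduced with a short Python sketch. The five (X, Y) pairs are the ones in the table; the function name `residuals` is illustrative, and the intercept and slope are the slide's rounded values.

```python
xs = [1, 3, 2, 3, 5]  # number of cavities (X)
ys = [5, 4, 3, 2, 1]  # times per day teeth are brushed (Y)


def residuals(xs, ys, intercept=5.5, slope=-0.91):
    """Return Y - Y' for each (X, Y) pair, where Y' = intercept + slope * X."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Print one row per observation: X, Y, predicted Y', residual
for x, y, e in zip(xs, ys, residuals(xs, ys)):
    print(x, y, round(y - e, 2), round(e, 2))
```

A positive residual means the person brushed more often than the line predicted; a negative one means less often.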

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

These are our “predicted values” for each X score

A note on adding up deviations

Residuals: 0.41, 1.23, -0.68, -0.77, 0.05


X   Y   Y’     Y - Y’   (Y - Y’)²
1   5   4.59    0.41    0.168
3   4   2.77    1.23    1.513
2   3   3.68   -0.68    0.462
3   2   2.77   -0.77    0.593
5   1   0.95    0.05    0.0025

[Scatterplot: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5), with residuals .41, 1.23, -.68, -.77, and 0.05 marked]

Sum of squared deviations = 2.739

√(2.739 / 3) ≈ 0.96

This is like the average (or standard) size of our residuals: the “Standard Error of the Estimate”
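The calculation can be sketched in Python. The divisor n - 2 follows the worked copier example later in this lecture (784.211 divided by 10 - 2); the function names are illustrative, and the predicted values are the rounded Y’ scores from the table.

```python
import math


def sum_squared_residuals(ys, predicted):
    """Add up the squared vertical distances (Y - Y')^2."""
    return sum((y - p) ** 2 for y, p in zip(ys, predicted))


def standard_error_of_estimate(ys, predicted):
    """Square root of the summed squared residuals divided by n - 2."""
    return math.sqrt(sum_squared_residuals(ys, predicted) / (len(ys) - 2))


ys = [5, 4, 3, 2, 1]                        # observed Y scores
predicted = [4.59, 2.77, 3.68, 2.77, 0.95]  # Y' scores from the table

print(round(sum_squared_residuals(ys, predicted), 3))
print(round(standard_error_of_estimate(ys, predicted), 2))
```

Squaring before summing keeps positive and negative residuals from cancelling each other out.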

Is the regression line better than just guessing the mean of the Y variable?

How much does the information about the relationship actually help?

[Two scatterplots: Number of cavities (X axis, 0 to 5) vs. Number of times per day teeth are brushed (Y axis, 0 to 5)]

Which minimizes error better?

How much better does the regression line predict the observed results?

r² Wow!

What is r²?

r² = The proportion of the total variance in one variable that is predictable from its relationship with the other variable

Examples

If mother’s and daughter’s heights are correlated with an r = .8, then what amount (proportion or percentage) of variance of mother’s height is accounted for by daughter’s height?

.64 because (.8)² = .64

If mother’s and daughter’s heights are correlated with an r = .8, then what proportion of variance of mother’s height is not accounted for by daughter’s height?

.36 because (1.0 - .64) = .36
or
36% because 100% - 64% = 36%

If ice cream sales and temperature are correlated with an r = .5, then what amount (proportion or percentage) of variance of ice cream sales is accounted for by temperature?

.25 because (.5)² = .25

If ice cream sales and temperature are correlated with an r = .5, then what amount (proportion or percentage) of variance of ice cream sales is not accounted for by temperature?

.75 because (1.0 - .25) = .75
or
75% because 100% - 25% = 75%
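The same arithmetic works for any correlation, as this small Python sketch shows. The function name `variance_explained` is illustrative.

```python
def variance_explained(r):
    """Return (r squared, 1 - r squared): explained and unexplained proportions."""
    r_squared = r ** 2
    return r_squared, 1 - r_squared


for r in (0.8, 0.5, -0.85):
    explained, unexplained = variance_explained(r)
    print(r, round(explained, 4), round(unexplained, 4))
```

Because r is squared, a negative correlation such as -0.85 still gives a positive proportion, which is why r² carries no direction information.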

regression equations

Questions on homework?

The relationship between hours worked and weekly pay is strong and positive. This correlation is significant, r(3) = 0.92; p < 0.05

r = +0.92

positive, strong

updown

Slope: 6.0857
Intercept: 55.286

y' = 6.0857x + 55.286

207.43
85.71

r² = .846231 or 85%

85% of the total variance of “weekly pay” is accounted for by “hours worked”

For each additional hour worked, weekly pay will increase by $6.09

[Scatterplot: Number of Operators (X axis, 4 to 8) vs. Wait Time (Y axis, 280 to 400)]

r = -.73

The relationship between wait time and number of operators working is negative and strong. This correlation is not significant, r(3) = -0.73; n.s.

negative, strong

As the number of operators increases, wait time decreases

Intercept: 458
Slope: -18.5

y' = -18.5x + 458

365 seconds
328 seconds

r² = .53695 or 54%

The proportion of total variance of wait time accounted for by number of operators is 54%.

For each additional operator added, wait time will decrease by 18.5 seconds

Critical r = 0.878
No, we do not reject the null

[Scatterplot: Median Income (X axis, 45 to 66) vs. Percent of BAs (Y axis, 21 to 39)]

r = 0.8875

The relationship between median income and percent of residents with a BA degree is strong and positive. This correlation is significant, r(8) = 0.89; p < 0.05.

positive, strong

As median income goes up, so does the percent of residents who have a BA degree

Intercept: 3.1819
Slope: 0.0005

y' = 0.0005x + 3.1819

25% of residents
35% of residents

r² = .78766 or 79%

The proportion of total variance of % of BAs accounted for by median income is 79%.

For each additional $1 in income, percent of BAs increases by .0005

Percent of residents with a BA degree

n = 10, df = 8

Critical r = 0.632
Yes, we reject the null

[Scatterplot: Median Income (X axis, 45 to 66) vs. Crime Rate (Y axis, 12 to 30)]

r = -0.6293

The relationship between crime rate and median income is negative and moderate. This correlation is not significant, r(8) = -0.63; n.s. [0.6293 is not bigger than critical of 0.632].

negative, moderate

As median income goes up, crime rate tends to go down

Intercept: 4662.5
Slope: -0.0499

y' = -0.0499x + 4662.5

2,417 thefts
1,418.5 thefts

r² = .396 or 40%

The proportion of total variance of thefts accounted for by median income is 40%.

For each additional $1 in income, thefts go down by .0499

Crime Rate

n = 10, df = 8

Critical r = 0.632
No, we do not reject the null

Example of Simple Regression

The manager of a copier company wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.

What are we predicting?

Correlation: Independent and dependent variables

• When used for prediction we refer to the predicted variable as the dependent variable and the predictor variable as the independent variable

[Data table of sales representatives, including Soni, Mark, Tom, Susan, Jeff, and Carlos, with columns labeled Independent Variable and Dependent Variable]

Who sold the most copiers?

Who sold the fewest copiers?

Correlation Coefficient – Excel Example


0.759014

Interpret r = 0.759

• Positive relationship between the number of sales calls and the number of copiers sold.

• Strong relationship

• Remember, we have not demonstrated cause and effect here, only that the two variables—sales calls and copiers sold—are related.


• Does this correlation reach significance?

• n = 10, df = 8

• alpha = .05

• Observed r is larger than critical r (0.759 > 0.632) therefore we reject the null hypothesis.

• r (8) = 0.759; p < 0.05
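The decision rule used here and in the homework answers can be written as a one-line comparison. This is a sketch only: the critical values are looked up in a table, as on the slides, and passed in by hand; the function name `is_significant` is illustrative.

```python
def is_significant(observed_r, critical_r):
    """Reject the null hypothesis when |observed r| exceeds the critical r."""
    return abs(observed_r) > critical_r


print(is_significant(0.759, 0.632))    # copier example, df = 8
print(is_significant(-0.6293, 0.632))  # crime rate and median income, df = 8
print(is_significant(-0.73, 0.878))    # wait time and operators, df = 3
```

The absolute value is used because the sign of r gives the direction of the relationship, not its strength.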

Coefficient of Determination – Excel Example

r = 0.759014

Interpret r² = 0.576 (0.759² = 0.576)

• We can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.

• Remember, we lose the directionality of the relationship with r²

Find Regression Equation – Excel Example


Regression Equation - Example

State the regression equation
Y’ = a + bx
Y’ = 18.9476 + 1.1842x

What is the expected number of copiers sold by a representative who made 20 calls?

Solve for some value of Y’
Y’ = 18.9476 + 1.1842 (20)
Y’ = 42.63

If you make this many calls
You probably sell this much

Interpret the slope
Y’ = 18.9476 + 1.1842x
“For each additional sales call made we sell 1.1842 more copiers”

Regression Equation - Example

What is the expected number of copiers sold by a representative who made 40 calls?

Solve for some value of Y’
Y’ = 18.9476 + 1.1842 (40)
Y’ = 66.3156

If you make this many calls
You probably sell this much
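Both copier predictions come from the same equation, so they can be checked together. A minimal Python sketch, with an illustrative function name and the intercept and slope from the regression equation above:

```python
def predict_copiers(calls, intercept=18.9476, slope=1.1842):
    """Return the expected number of copiers sold, Y' = a + b * x."""
    return intercept + slope * calls


print(round(predict_copiers(20), 2))  # 42.63
print(round(predict_copiers(40), 4))  # 66.3156

# One more call raises the prediction by the slope
print(round(predict_copiers(21) - predict_copiers(20), 4))  # 1.1842
```

The last line illustrates the slope interpretation: each additional sales call adds about 1.1842 copiers to the prediction.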

An example for The Standard Error of Estimate

The standard error of estimate measures the scatter, or dispersion, of the observed values around the line of regression

A formula that can be used to compute the standard error:

Standard error of the estimate (line)
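A standard form of this formula, consistent with the worked numbers later in this lecture (784.211 divided by 10 - 2, then a square root giving 9.901), is sketched below. The symbol on the left is a conventional label assumed here, not taken from the slide.

```latex
s_{y \cdot x} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}}
```

Here Y is an observed score, Y’ is the score predicted by the regression line, and n is the number of paired observations.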

Regression Analysis – Least Squares Principle

When we calculate the regression line we try to:
• minimize the distance between predicted Ys and actual (data) Y points (length of green lines)
• remember, because the negative and positive values cancel each other out, we have to square those distances (deviations)
• so we are trying to minimize the “sum of squares of the vertical distances between the actual Y values and the predicted Y values”
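The principle can be illustrated with a short Python sketch that finds the least-squares slope and intercept directly. It uses the textbook closed-form solution, not Excel, and the data are the five cavity and brushing pairs from the earlier example; the function name is illustrative.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 3, 2, 3, 5]  # number of cavities
ys = [5, 4, 3, 2, 1]  # times per day teeth are brushed
slope, intercept = least_squares(xs, ys)
print(round(slope, 2), round(intercept, 1))  # -0.91 5.5
```

The result matches the b1 = -0.91 and b0 = 5.5 used for the teeth brushing prediction line.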

The Standard Error of Estimate

Step 1: List all the Y data points

Step 2: Find all the predicted Y’ data points

Step 3: Find deviations

Step 4: Square and add up deviations

Then simply plug in the numbers and solve for the standard error of the estimate

√(784.211 / (10 - 2)) = 9.901

Remember, conceptually this is like the average of the length of those green lines

Writing Assignment - 5 Questions

1. What is regression used for?
• Include an example

2. What is a residual? How would you find it?

3. What is Standard Error of the Estimate (How is it related to residuals?)

4. Give one fact about r²

5. How is a regression line like a mean?

1. What is regression used for?
• Include an example

Regressions are used to take advantage of relationships between variables described in correlations. We choose a value on the independent variable (on x axis) to predict values for the dependent variable (on y axis).

2. What is a residual? How would you find it?

Residuals are the difference between our predicted y (y’) and the actual y data points. Once we choose a value on our independent variable and predict a value for our dependent variable, we look to see how close our prediction was. We are measuring how “wrong” we were, or the amount of “error” for that guess.

Y – Y’

3. What is Standard Error of the Estimate (How is it related to residuals?)

• The average length of the residuals
• The average error of our guess
• The average length of the green lines
• The standard deviation of the regression line

4. Give one fact about r²

5. How is a regression line like a mean?

Correlation - the prediction line - what is it good for?

Prediction line

• makes the relationship easier to see (even if specific observations - dots - are removed)

• identifies the center of the cluster of (paired) observations

• identifies the central tendency of the relationship (kind of like a mean)

• can be used for prediction

• should be drawn to provide a “best fit” for the data

• should be drawn to provide maximum predictive (explanatory) power for the data

• should be drawn to provide minimum predictive error

r²

Some useful terms

• Regression uses the predictor variable (independent) to make predictions about the predicted variable (dependent)

• Coefficient of correlation is the name for “r”
• Coefficient of determination is the name for “r²” (remember it is always positive – no direction info)

• Standard error of the estimate is our measure of the variability of the dots around the regression line (average deviation of each data point from the regression line – like standard deviation)

Correlation: Independent and dependent variables

• When used for prediction we refer to the predicted variable as the dependent variable and the predictor variable as the independent variable

[Two example data sets, each labeled with its Dependent Variable and Independent Variable]

What are we predicting?

Multiple regression equations

Prediction line Y’ = b1X1 + b0
We can predict amount of crime in a city from
• the number of bathrooms in city
How many dependent variables? 1
How many independent variables? 1

Prediction line Y’ = b1X1 + b2X2 + b0
We can predict amount of crime in a city from
• the number of bathrooms in city
• the amount spent on education in city

Prediction line Y’ = b1X1 + b2X2 + b3X3 + b0
We can predict amount of crime in a city from
• the number of bathrooms in city
• the amount spent on education in city
• the amount spent on after-school programs
How many dependent variables? 1
How many independent variables? 3

Multiple regression

• Used to describe the relationship between several independent variables and a dependent variable.

Prediction line Y’ = b1X1 + b2X2 + b3X3 + b0

Can we predict amount of crime in a city from the number of bathrooms and the amount spent on education and on after-school programs?

• X1, X2 and X3 are the independent variables.
• Y is the dependent variable (amount of crime)
• b0 is the Y-intercept
• b1 is the net change in Y for each unit change in X1, holding X2 and X3 constant. It is called a regression coefficient.

Multiple regression will use multiple independent variables to predict the single dependent variable
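A multiple regression prediction is just the intercept plus one slope-times-value term per independent variable. The Python sketch below shows the pattern; the function name is illustrative, and the coefficients in the usage example are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def predict_multiple(b0, coefficients, values):
    """Return Y' = b0 + b1*X1 + b2*X2 + ... for any number of predictors."""
    if len(coefficients) != len(values):
        raise ValueError("need one value per regression coefficient")
    return b0 + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical three-predictor model: 10 + 2*3 + (-1)*4 + 0.5*2
print(predict_multiple(10.0, [2.0, -1.0, 0.5], [3, 4, 2]))  # 13.0

# Simple regression is the one-predictor case (teeth brushing example)
print(round(predict_multiple(5.5, [-0.91], [3]), 2))  # 2.77
```

However many independent variables go in, a single predicted value of the dependent variable comes out.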

[Scatterplot: Expenses per year (X axis) vs. Yearly Income (Y axis). If you spend this much, you probably make this much]

The predicted variable goes on the “Y” axis and is called the dependent variable.

The predictor variable goes on the “X” axis and is called the independent variable.

[3D plot: Dependent Variable (Predicted) against Independent Variable 1 (Predictor) and Independent Variable 2 (Predictor). If you spend this much and save this much, you probably make this much]

Regression Plane for a 2-Independent Variable Linear Regression Equation

Multiple regression equations

Can use variables to predict
• behavior of stock market
• probability of accident
• amount of pollution in a particular well
• quality of a wine for a particular year
• which candidates will make best workers

Can we predict heating cost?

Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace.

To investigate, Salisbury's research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January

Multiple Linear Regression - Example


The Multiple Regression Equation – Interpreting the Regression Coefficients

b1 = The regression coefficient for mean outside temperature (X1) is -4.583.

The coefficient is negative and shows a negative correlation between heating cost and temperature.

As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information. If we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of $4.583 in monthly heating cost.

b2 = The regression coefficient for attic insulation (X2) is -14.831.

The coefficient is negative and shows a negative correlation between heating cost and insulation.

The more insulation in the attic, the less the cost to heat the home. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the cost to heat the home to decline $14.83 per month, regardless of the outside temperature or the age of the furnace.

b3 = The regression coefficient for age of furnace (X3) is 6.101.

The coefficient is positive and shows a positive correlation between heating cost and the age of the furnace.

As the age of the furnace goes up, the cost to heat the home increases.

Specifically, for each additional year older the furnace is, we expect the cost to increase $6.10 per month.

Applying the Model for Estimation

What is the estimated heating cost for a home if:
• the mean outside temperature is 30 degrees,
• there are 5 inches of insulation in the attic, and
• the furnace is 10 years old?
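The estimate can be set up in Python as a sketch. The three regression coefficients are the ones interpreted above; the intercept b0 is not stated in this part of the lecture, so it is left as an argument to be filled in from the regression output. Function and variable names are illustrative.

```python
# Regression coefficients from the slides (dollars per unit of each predictor)
B_TEMPERATURE = -4.583   # per degree of mean outside temperature
B_INSULATION = -14.831   # per inch of attic insulation
B_FURNACE_AGE = 6.101    # per year of furnace age


def predictor_contribution(temperature, insulation, furnace_age):
    """Return b1*X1 + b2*X2 + b3*X3, the part of Y' that excludes the intercept."""
    return (B_TEMPERATURE * temperature
            + B_INSULATION * insulation
            + B_FURNACE_AGE * furnace_age)


def estimated_heating_cost(intercept, temperature, insulation, furnace_age):
    """Return Y' = b0 + b1*X1 + b2*X2 + b3*X3; intercept must be supplied."""
    return intercept + predictor_contribution(temperature, insulation, furnace_age)


# 30 degrees, 5 inches of insulation, 10-year-old furnace
print(round(predictor_contribution(30, 5, 10), 3))  # -150.635
```

Adding the intercept from the regression output to this contribution gives the estimated monthly heating cost for the home.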