Presentation2 stats

REGRESSION MODELS By: Ayush Sharma 09 Mickey Haldia 19 Prerna Makhijani 29 Sanoj George 39 Sushant Jaggi 49 Nitish Dorle 59

Transcript of Presentation2 stats

Page 1: Presentation2 stats

REGRESSION MODELS

By:

Ayush Sharma 09

Mickey Haldia 19

Prerna Makhijani 29

Sanoj George 39

Sushant Jaggi 49

Nitish Dorle 59

Page 2: Presentation2 stats

Example

Year Population on Farm (in millions)

1935 32.1

1940 30.5

1945 24.4

1950 23.0

1955 19.1

1960 15.6

1965 12.5

Page 3: Presentation2 stats

Scatter Plot

[Figure: scatter plot of population on farm (in millions) against year. x-axis: 1930 to 1970; y-axis: 0 to 35.]

Page 4: Presentation2 stats

Correlation Coefficient (r)

It is a measure of strength of the linear relationship between two variables and is calculated using the following formula:
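
The formula image did not survive in this transcript. As a minimal sketch, assuming the standard Pearson definition r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²), the coefficient for the farm population example can be computed as follows. The function name `pearson_r` is an illustrative choice, not something taken from the slides.

```python
from math import sqrt

# Farm population data from the example slide
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(years, population)
print(round(r, 3))  # -0.993
```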

Page 5: Presentation2 stats

Interpretation

After calculating we find r = -0.993

There is a strong negative correlation.

Page 6: Presentation2 stats

Coefficient of Determination

Squaring the correlation coefficient (r) gives the proportion of the variation in the y-variable that is explained by the variation in the x-variable.

To relate x and y, the regression equation is calculated using the least squares technique.

Regression equation: Y’ = a + bX

Slope of the regression line: b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

Intercept of the regression line: a = Ȳ − bX̄
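
A minimal sketch of the least squares calculation for the farm population example, assuming the usual closed-form estimates (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope · x̄). The helper name `least_squares` is hypothetical.

```python
# Farm population data from the example slide
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions


def least_squares(x, y):
    """Return (a, b) for the fitted line y' = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = least_squares(years, population)
print(round(b, 4), round(a, 2))  # -0.6707 1330.35
```

The slope says the farm population fell by roughly 0.67 million per year over the sampled period.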

Page 7: Presentation2 stats

To continue with the example: we found r = -0.993. Squaring the unrounded value of r gives the

Coefficient of Determination (r²) = 0.987

[Figure: regression line fitted to the scatter plot. x-axis: Year; y-axis: Population on Farm (in millions). Fitted line: f(x) = −0.670714285714286 x + 1330.35, R² = 0.986600589014608.]

Page 8: Presentation2 stats

Interpretation

We conclude that 98.7% of the variation in farm population is explained by the year.

Here population is the dependent variable (y-axis) and the year is the independent variable (x-axis).

Page 9: Presentation2 stats

Assumptions of the Regression Model

The following assumptions are made about the errors:

a) The errors are independent

b) The errors are normally distributed

c) The errors have a mean of zero

d) The errors have a constant variance (regardless of the value of X)
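
As a rough sketch of how these assumptions are usually examined, the residuals (observed minus predicted) of the farm population fit can be listed against X. The intercept and slope below are the values shown on the regression chart earlier in this deck. Note that with an intercept in the model the residuals average to zero by construction, so the informative check is whether they show a pattern against X, not whether their mean is zero.

```python
# Farm population data from the example slide
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions

# Fitted line taken from the regression chart: y' = a + b * x
a = 1330.35
b = -0.670714285714286

# Residual = observed value minus value predicted by the line
residuals = [y - (a + b * x) for x, y in zip(years, population)]

for x, e in zip(years, residuals):
    print(x, round(e, 2))

mean_residual = sum(residuals) / len(residuals)
print(abs(mean_residual) < 1e-6)  # True
```

With only seven observations, this listing can suggest an obvious pattern but cannot confirm normality or constant variance.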

Page 10: Presentation2 stats

Patterns of Indicating Errors

[Figure: plots of Error (vertical axis) against X (horizontal axis) illustrating error patterns]

Page 11: Presentation2 stats

Estimating the Variance

The error variance is measured by the mean squared error (MSE):

s² = MSE = SSE / (n − k − 1)

where n = number of observations in the sample

k = number of independent variables

Therefore the standard deviation will be

s = sqrt(MSE)
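
A short sketch applying these formulas to the farm population example from the earlier slides (n = 7 observations, k = 1 independent variable). The fitted intercept and slope are the ones shown on the regression chart; the variable names are illustrative.

```python
from math import sqrt

# Farm population data from the example slide
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions

# Fitted line from the regression chart: y' = a + b * x
a = 1330.35
b = -0.670714285714286

n = len(years)  # number of observations
k = 1           # number of independent variables

# Sum of squared errors between observed and predicted values
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(years, population))
mse = sse / (n - k - 1)  # estimate of the error variance
s = sqrt(mse)            # standard deviation of the regression

print(round(mse, 3), round(s, 3))  # 0.855 0.925
```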

Page 12: Presentation2 stats

Multiple Regression Analysis

More than one independent variable:

Y = β0 + β1X1 + β2X2 + … + βkXk + ϵ

where

Y = dependent variable (response variable)

Xi = ith independent variable (predictor variable or explanatory variable)

β0 = intercept (value of Y when all Xi = 0)

βi = coefficient of the ith independent variable

k = number of independent variables

ϵ = random error

To estimate the values of these coefficients, a sample is taken and the following equation is developed:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

where

Ŷ = predicted value of Y

b0 = sample intercept (and is an estimate of β0)

bi = sample coefficient of the ith variable (and is an estimate of βi)

Page 13: Presentation2 stats

Testing the Model for Significance

• MSE and the coefficient of determination (r²) do not provide a good measure of accuracy when the sample size is small

• In this case, it is necessary to test the model for significance

• The linear model is given by

Y = β0 + β1X + ε

Null hypothesis H0: β1 = 0 (there is no linear relationship between X and Y)

Alternative hypothesis H1: β1 ≠ 0 (there is a linear relationship between X and Y)

Page 14: Presentation2 stats

Steps in Hypothesis Test for a Significant Regression Model

1. Specify the null and alternative hypotheses.

2. Select the level of significance (α). Common values are 0.01 and 0.05.

3. Calculate the value of the test statistic using the formula:

F = MSR/MSE

4. Make a decision using one of the following methods:

a) Reject H0 if F calculated > F table

b) Reject H0 if p-value < α

Page 15: Presentation2 stats

Triple A Construction Example

Step 1:

H0 :β1 = 0, (no linear relationship between X and Y)

H1 :β1 ≠ 0, (linear relationship between X and Y)

Step 2:

Select α = 0.05

Page 16: Presentation2 stats

Triple A Construction Example

Step 3: Calculate the value of the test statistic

MSR = SSR/k

= 15.6250/1

= 15.6250

F = MSR/MSE

= 15.6250/1.7188

= 9.09

Page 17: Presentation2 stats

Triple A Construction Example

Step 4: Reject the null hypothesis if the test statistic is greater than the F value from the table.

To find the table value, we need:

Level of Significance (α) = 0.05

df1 = k = 1

df2 = n – k – 1 = 4

where k = number of independent variables

n = sample size

Using these values, we find

Ftable = 7.71

Hence, we reject H0 because 9.09 > 7.71
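
The four steps of the Triple A Construction example can be sketched as below. All numbers are the ones stated on the slides; the sample size n = 6 is not stated directly and is inferred here from df2 = n − k − 1 = 4 with k = 1. The critical value is the table value quoted above, not something computed by the code.

```python
# Values stated in the Triple A Construction example
ssr = 15.6250   # sum of squares due to regression
mse = 1.7188    # mean squared error
k = 1           # number of independent variables
n = 6           # sample size (inferred from df2 = n - k - 1 = 4)
alpha = 0.05    # level of significance chosen in Step 2
f_table = 7.71  # table value for alpha = 0.05, df1 = 1, df2 = 4

# Step 3: test statistic
msr = ssr / k
f_stat = msr / mse

# Step 4: decision rule
df1 = k
df2 = n - k - 1
reject_h0 = f_stat > f_table

print(df1, df2, round(f_stat, 2), reject_h0)  # 1 4 9.09 True
```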

Page 18: Presentation2 stats

Selling Price ($)   Square Footage   Age   Condition
95000               1926             30    Good
119000              2069             40    Excellent
124800              1720             30    Excellent
135000              1396             15    Good
142800              1706             32    Mint
145000              1847             38    Mint
159000              1950             27    Mint
165000              2323             30    Excellent
182000              2285             26    Mint
183000              3752             35    Good
200000              2300             18    Good
211000              2525             17    Good
215000              3800             40    Excellent
219000              1740             12    Mint

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.819680305
R Square            0.671875802
Adjusted R Square   0.612216857
Standard Error      24312.60729
Observations        14

ANOVA
             df   SS            MS        F        Significance F
Regression   2    13313936968   6.7E+09   11.262   0.002178765
Residual     11   6502131603    5.9E+08
Total        13   19816068571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   146630.89      25482.08287      5.75427   0.0001    90545.20735   202717      90545         202717
SF          43.819366      10.28096507      4.26218   0.0013    21.19111495   66.448      21.191        66.448
AGE         -2898.686      796.5649421      -3.639    0.0039    -4651.91386   -1145       -4651.9       -1145.5

The p-values are used to test the individual variables for significance

The coefficient of determination r²

The regression coefficients

Jenny Wilson Realty
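
As a hedged sketch of where such output comes from, the two-variable model (square footage and age) can be fitted to the table above by solving the normal equations (XᵀX)b = Xᵀy. The slides used a spreadsheet regression tool; the pure-Python helpers `solve` and `ols` below are illustrative stand-ins, not the presenters' method. If the table was transcribed faithfully, the printed values should agree with the summary output up to rounding.

```python
# Jenny Wilson Realty data from the table: (price, square footage, age)
data = [
    (95000, 1926, 30), (119000, 2069, 40), (124800, 1720, 30),
    (135000, 1396, 15), (142800, 1706, 32), (145000, 1847, 38),
    (159000, 1950, 27), (165000, 2323, 30), (182000, 2285, 26),
    (183000, 3752, 35), (200000, 2300, 18), (211000, 2525, 17),
    (215000, 3800, 40), (219000, 1740, 12),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


y = [row[0] for row in data]
x = [row[1:] for row in data]
b0, b1, b2 = ols(x, y)

n, k = len(y), 2
y_bar = sum(y) / n
predicted = [b0 + b1 * sf + b2 * age for sf, age in x]
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst
f_stat = ((sst - sse) / k) / (sse / (n - k - 1))

print(round(b0, 2), round(b1, 4), round(b2, 3))
print(round(r_squared, 4), round(f_stat, 2))
```

Solving the normal equations directly is fine for a problem this small; for larger or ill-conditioned data a library least squares routine would be the safer choice.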

Page 19: Presentation2 stats

Binary or Dummy Variables

Indicator variable: assigned a value of 1 if a particular condition is met, 0 otherwise

The number of dummy variables must equal one less than the number of categories of a qualitative variable

The Jenny Wilson Realty example:

– X3 = 1 for excellent condition, 0 otherwise

– X4 = 1 for mint condition, 0 otherwise
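
A minimal sketch of this encoding for the Jenny Wilson Realty condition column. Three categories (Good, Excellent, Mint) need two dummy variables, so "Good" acts as the baseline with X3 = X4 = 0. The function name `encode_condition` is hypothetical.

```python
# Condition column of the Jenny Wilson Realty data, in table order
conditions = [
    "Good", "Excellent", "Excellent", "Good", "Mint", "Mint", "Mint",
    "Excellent", "Mint", "Good", "Good", "Good", "Excellent", "Mint",
]


def encode_condition(condition):
    """Return (X3, X4); the baseline category "Good" maps to (0, 0)."""
    label = condition.strip().lower()
    x3 = 1 if label == "excellent" else 0
    x4 = 1 if label == "mint" else 0
    return x3, x4


dummies = [encode_condition(c) for c in conditions]
print(dummies[:5])  # [(0, 0), (1, 0), (1, 0), (0, 0), (0, 1)]
```

Adding a third dummy for "Good" would make the columns sum to the intercept column, which is why one category is always left out.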

Page 20: Presentation2 stats

Selling Price ($)   Square Footage   Age   X3 (Exc.)   X4 (Mint)   Condition
95000               1926             30    0           0           Good
119000              2069             40    1           0           Excellent
124800              1720             30    1           0           Excellent
135000              1396             15    0           0           Good
142800              1706             32    0           1           Mint
145000              1847             38    0           1           Mint
159000              1950             27    0           1           Mint
165000              2323             30    1           0           Excellent
182000              2285             26    0           1           Mint
183000              3752             35    0           0           Good
200000              2300             18    0           0           Good
211000              2525             17    0           0           Good
215000              3800             40    1           0           Excellent
219000              1740             12    0           1           Mint

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.94762
R Square            0.89798
Adjusted R Square   0.85264
Standard Error      14987.6
Observations        14

ANOVA
             df   SS            MS      F         Significance F
Regression   4    17794427451   4E+09   19.8044   0.000174421
Residual     9    2021641120    2E+08
Total        13   19816068571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   121658         17426.61432      6.9812    6.5E-05   82236.71393   161080      82236.71      161080
SF          56.4276        6.947516792      8.122     2E-05     40.71122594   72.144      40.71123      72.144
AGE         -3962.82       596.0278736      -6.6487   9.4E-05   -5311.12866   -2614.5     -5311.129     -2614.5
X3 (Exc.)   33162.6        12179.62073      2.7228    0.0235    5610.432651   60714.9     5610.433      60715
X4 (Mint)   47369.2        10649.26942      4.4481    0.0016    23278.92699   71459.6     23278.93      71460

The coefficient of age is negative, indicating that the price decreases as a house gets older

Jenny Wilson Realty

Page 21: Presentation2 stats

Model Building

The value of r² can never decrease when more variables are added to the model

Adjusted r² is often used to determine if an additional independent variable is beneficial

The adjusted r² is

Adjusted r² = 1 − (1 − r²)(n − 1) / (n − k − 1)

A variable should not be added to the model if it causes the adjusted r² to decrease
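
A quick check of this rule on the two Jenny Wilson Realty models, assuming the standard definition adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1). The r² values are the ones reported in the two summary outputs; the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r-squared for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Model with square footage and age only (k = 2)
two_var = adjusted_r_squared(0.671875802, n=14, k=2)

# Model with the two condition dummy variables added (k = 4)
four_var = adjusted_r_squared(0.89798, n=14, k=4)

print(round(two_var, 4), round(four_var, 4))  # 0.6122 0.8526
print(four_var > two_var)  # True
```

Both results match the "Adjusted R Square" rows of the summary outputs, and the increase suggests the condition variables are worth keeping.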

Page 22: Presentation2 stats

Multiple Regression

Sales/Decision to buy = B0 + B1 * Price

Sales/Decision to buy = B0 + B1 * (Price)³ + B2 * (Design)² + B3 * (Performance)

L = (Price)³

M = (Design)²

N = (Performance)

Sales/Decision to buy = B0 + B1 * L + B2 * M + B3 * N
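
The substitution can be sketched as a simple variable transformation: once Price and Design are replaced by L and M, the model is linear in the new variables and ordinary multiple regression applies. This assumes the "3" and "2" on the slide are exponents. The coefficients and input values below are made up purely for illustration.

```python
def transform(price, design, performance):
    """Map the raw predictors to the substituted variables (L, M, N)."""
    return price ** 3, design ** 2, performance


def predict(coefficients, price, design, performance):
    """Evaluate B0 + B1*L + B2*M + B3*N for one observation."""
    b0, b1, b2, b3 = coefficients
    l, m, n = transform(price, design, performance)
    return b0 + b1 * l + b2 * m + b3 * n


# Hypothetical coefficients (B0, B1, B2, B3), not estimated from any data
coefficients = (10.0, -0.5, 2.0, 1.5)

print(transform(2, 3, 5))              # (8, 9, 5)
print(predict(coefficients, 2, 3, 5))  # 31.5
```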

Page 23: Presentation2 stats

Pitfalls In Regression

A high correlation does not mean one variable is causing a change in another. (For example, some regressions have shown a significantly positive relation between individuals' college GPA and future salary.)

The regression equation should not be used with values of the independent variable that are above or below the range found in the sample.

The number of independent variables that should be used in the model is limited by the number of observations.

Page 24: Presentation2 stats

Thank

You!!!