Statistics ECA Report

BUS105

STATISTICS

End-of-Course Assignment

January Semester Assessment

Submitted by:

Muhamad Fauzi Bin Mat Isa

14 March 2009

Table of Contents

Question 1

Question 2

Question 3

Question 4

References

Qn1

[Figure: scatterplot of Family_Income (vertical axis) against House_Value (horizontal axis)]

[Figure: scatterplot of Family_Income (vertical axis) against Age_Head (horizontal axis)]

a.

The scatterplot shows that families owning higher-valued houses tend to have higher family incomes. The two variables appear to have a strong positive (direct) linear relationship.

b.

The scatterplot shows a wide spread of points. It suggests that the family income tends to increase as the age of the head of the household increases. The two variables appear to have a weak positive linear relationship.

c.

[Figure: scatterplot of Family_Income (vertical axis) against Mortgage_Payment (horizontal axis)]

The scatterplot suggests that the family income tends to decrease as the current monthly mortgage payment increases. The two variables appear to have a weak negative (inverse) linear relationship.

Qn2

a.

Correlations

                                      Family_Income   House_Value
Family_Income   Pearson Correlation   1               .720(**)
                Sig. (2-tailed)                       .000
                N                     25              25
House_Value     Pearson Correlation   .720(**)        1
                Sig. (2-tailed)       .000
                N                     25              25

** Correlation is significant at the 0.01 level (2-tailed).

From the SPSS output, the coefficient of correlation, r = 0.720, is positive, so there is a direct relationship between the value of the house and the family income. The coefficient of correlation ranges from -1.00 to +1.00, where -1.00 and +1.00 indicate perfect correlation and 0 indicates no linear relationship. A value of 0.720 therefore indicates a moderately strong positive association between the two variables: as the value of the house increases, the family income tends to increase.

b.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .720(a)   .518       .497                .7457

a Predictors: (Constant), House_Value

From the SPSS output, the coefficient of determination, r², is 0.518. It can be verified manually as (0.720)² = 0.518. This means that about 52% of the variation in the family income is explained by (accounted for by) the variation in the value of the house.

c.

A test of significance for the coefficient of correlation may be used to determine whether the computed r could have come from a population in which the two variables are not related, i.e. a population with zero correlation.

Step 1: Stating the hypothesis:

H0: ρ = 0 (The correlation in the population is zero)

H1: ρ ≠ 0 (The correlation in the population is not zero)

The null hypothesis H0 is that there is no correlation in the population and the alternate H1 that there is a correlation. The two-tailed test is used due to the way H1 is stated.

Step 2: Stating the level of significance:

The test will be done using the 0.05 significance level, α = 0.05

Step 3: Using the appropriate test statistic:

The test statistic follows the t distribution with n - 2 degrees of freedom, and the formula is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$$

Step 4: Stating the decision rule:

H0 is rejected if t > tα/2,n-2 or t < -tα/2,n-2

t > t0.025,23 or t < -t0.025,23

t > 2.069 or t < -2.069

[Figure: t distribution with rejection regions below -2.069 and above 2.069]

Step 5: Calculation of the test value and decision making:

Computing t:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.720\sqrt{25-2}}{\sqrt{1-(0.720)^{2}}} = 4.976$$

The computed t (4.976) is within the rejection region, since 4.976 > 2.069. Thus, H0 is rejected. This means that the correlation in the population is not zero, i.e. there is a correlation between the value of the house and the family income.
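The arithmetic of this test can be re-checked with a short Python sketch. It is only an illustration, not part of the SPSS output: the function name is hypothetical, and the critical value 2.069 is simply copied from the decision rule rather than looked up programmatically.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.720           # Pearson correlation from the SPSS output
n = 25              # number of observations
t_critical = 2.069  # two-tailed critical value for alpha = 0.05, df = 23 (from the report)

t = correlation_t_statistic(r, n)
reject_h0 = abs(t) > t_critical  # two-tailed decision rule

print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

Because the test is two-tailed, the comparison uses the absolute value of t, so the same check would also catch a large negative t.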

d.

To find the expected family income when the value of the house is $175,000, the following formula is used:

General Form of Linear Regression Ŷ = a + bX

In order to utilise the equation, the following formulas are used to find the values of a and b:

Slope of the regression line $b = r\,\frac{s_y}{s_x}$

Y-Intercept $a = \bar{Y} - b\bar{X}$

Using the SPSS descriptive statistics, the values can be substituted into the formulas for a and b.

Descriptive Statistics

                     N    Sum     Mean     Std. Deviation
Family_Income        25   998.2   39.928   1.0514
House_Value          25   3849    153.96   28.841
Valid N (listwise)   25

Slope of the regression line:

$$b = r\,\frac{s_y}{s_x} = 0.720 \times \frac{1.0514}{28.841} = 0.026$$

Y-Intercept:

$$a = \bar{Y} - b\bar{X} = 39.928 - 0.026(153.96) = 35.925$$

Thus, the expected family income when the value of the house is $175,000:

General Form of Linear Regression Ŷ = a + bX

Ŷ = 35.925 + 0.026(175) = 40.475

Hence, the expected family income is $40,475 when the value of the house is $175,000.
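The manual computation above can be reproduced with a few lines of Python. This is a sketch under the assumption that the slope is rounded to three decimals before it is used, as in the report; the variable names are illustrative only.

```python
# Summary statistics taken from the SPSS "Descriptive Statistics" table.
r = 0.720                        # correlation between Family_Income and House_Value
mean_y, sd_y = 39.928, 1.0514    # Family_Income (in $000)
mean_x, sd_x = 153.96, 28.841    # House_Value (in $000)

# Slope b = r * (s_y / s_x), rounded to three decimals as in the manual working.
b = round(r * sd_y / sd_x, 3)

# Intercept a = mean(Y) - b * mean(X).
a = mean_y - b * mean_x

# Predicted family income for a house valued at $175,000 (X = 175).
y_hat = a + b * 175

print(f"b = {b:.3f}, a = {a:.3f}, predicted income = {y_hat:.3f}")
```

Using the unrounded slope instead would give an intercept closer to the SPSS value, which is why the manual and SPSS intercepts differ slightly.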

Coefficients(a)

                  Unstandardized        Standardized                     95% Confidence Interval
                  Coefficients          Coefficients                     for B
Model             B        Std. Error   Beta           t        Sig.     Lower Bound   Upper Bound
1 (Constant)      35.889   .826                        43.440   .000     34.180        37.598
  House_Value     .026     .005         .720           4.971    .000     .015          .037

a Dependent Variable: Family_Income

From the SPSS output, the values for a and b in the estimated linear regression equation are found in the B column of the Coefficients table, where a = 35.889 and b = 0.026.

Thus, the estimated regression equation is Ŷ = 35.889 + 0.026X. There is a slight difference between the intercept from the manual computation (35.925) and from SPSS (35.889) because the manual computation uses the slope rounded to 0.026.

Qn3

MULTIPLE REGRESSION ANALYSIS

From the SPSS output, the values for b1 to b4 in the estimated multiple regression equation are found in the B column of the Coefficients table:

Coefficients(a)

                     Unstandardized        Standardized
                     Coefficients          Coefficients
Model                B        Std. Error   Beta           t        Sig.
1 (Constant)         35.635   1.345                       26.490   .000
  House_Value        .025     .005         .677           4.540    .000
  Age_Head           .007     .027         .037           .273     .788
  Mortgage_Payment   -.001    .001         -.081          -.557    .584
  Gender             .716     .285         .345           2.507    .021

a Dependent Variable: Family_Income

After substituting the b1 to b4 values, the multiple regression equation is as follows:

Ŷ = 35.635 + 0.025X1 + 0.007X2 – 0.001X3 + 0.716X4

Where:

Ŷ = the family income (S$)

X1 = the value of the house

X2 = the age of the head of the household

X3 = the current monthly mortgage payment

X4 = 0 if a female is the head of the household, 1 if a male is the head of the household

Interpretations of the coefficients

b1: The coefficient for the value of the house (X1) is positive, indicating a direct relationship: when the value of the house increases, the family income increases as well. For each additional $1,000 in the value of the house, the family income is expected to increase by $25, with the rest of the variables held constant.

b2: The coefficient for the age of the head of the household (X2) is positive, indicating a direct relationship: with an older head, the family income is higher. For each additional year of age of the head, the family income is expected to increase by $7, provided the rest of the variables are held constant.

b3: The coefficient for the current monthly mortgage payment (X3) is negative, indicating an inverse relationship: as the current monthly mortgage payment increases, the family income decreases. For an increase of $100 in the mortgage payment, with the other variables held constant, the estimated decrease in Ŷ is 0.10. Since the family income is measured in thousands of dollars (Ŷ = 40.475 corresponds to $40,475), this is a decrease of about $100.

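b4: With the other variables held constant, the income of a family headed by a male is on average 0.716 higher than that of a family headed by a female. Since the family income is measured in thousands of dollars, this is a difference of about $716.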
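The way each coefficient enters a prediction can be sketched in Python as follows. The coefficients are those reported in the SPSS Coefficients table; the function name and the example household (house value 150, age 45, mortgage payment 300) are hypothetical and chosen only to illustrate the gender coefficient.

```python
# Coefficients from the SPSS output (B column of the Coefficients table).
CONSTANT = 35.635
B_HOUSE_VALUE = 0.025
B_AGE_HEAD = 0.007
B_MORTGAGE = -0.001
B_GENDER = 0.716  # X4 = 1 for a male head of household, 0 for a female head


def predict_income(house_value: float, age_head: float,
                   mortgage_payment: float, male_head: bool) -> float:
    """Estimated family income from the fitted multiple regression equation."""
    return (CONSTANT
            + B_HOUSE_VALUE * house_value
            + B_AGE_HEAD * age_head
            + B_MORTGAGE * mortgage_payment
            + B_GENDER * (1 if male_head else 0))


# Hypothetical household, identical in every respect except the gender of the head.
income_female = predict_income(150, 45, 300, male_head=False)
income_male = predict_income(150, 45, 300, male_head=True)

print(f"female head: {income_female:.3f}, male head: {income_male:.3f}")
print(f"difference:  {income_male - income_female:.3f}")
```

Holding the other three inputs fixed, the two predictions differ by exactly the gender coefficient, which is what "with the other variables held constant" means in the interpretation of b4.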

MODEL FIT

The model is fitted using the least squares criterion to develop the following equation:

Ŷ = a + b1X1 + b2X2 + b3X3 + … + bkXk

Because the calculations are tedious to do by hand, the SPSS package is used to perform them.

From the SPSS output, the ANOVA table generated the following values:

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   17.357           4    4.339         9.461   .000(a)
  Residual     9.173            20   .459
  Total        26.530           24

a Predictors: (Constant), Gender, Mortgage_Payment, Age_Head, House_Value

b Dependent Variable: Family_Income

The SSR (the sum of squares due to regression) and SST (the total sum of squares) are used to compute the multiple coefficient of determination (R²):

R² = SSR/SST = 17.357/26.530 = 0.654

The computed result agrees with the R² value that appears in the “Model Summary” of the SPSS output below.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .809(a)   .654       .585                .6772

a Predictors: (Constant), Gender, Mortgage_Payment, Age_Head, House_Value

The adjusted R² is computed manually to confirm the adjusted R² generated by the SPSS package. It can be computed through the following formula

$$R^{2}_{adj} = 1 - (1 - R^{2})\,\frac{n-1}{n-p-1}$$

where n = number of observations, and p = number of independent variables.

$$R^{2}_{adj} = 1 - (1 - 0.654)\,\frac{25-1}{25-4-1} \approx 0.585$$

From the result, the computed value agrees with the SPSS output above.
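Both goodness-of-fit figures can be re-derived from the ANOVA table with a brief Python sketch. The sums of squares are copied from the SPSS output; the variable names are illustrative, and the adjusted R² uses the usual formula in terms of n observations and p independent variables.

```python
# Sums of squares from the SPSS ANOVA table.
ssr = 17.357   # regression sum of squares
sse = 9.173    # residual (error) sum of squares
sst = 26.530   # total sum of squares

n = 25  # number of observations
p = 4   # number of independent variables

# Sanity check: regression and residual sums of squares add up to the total.
assert abs((ssr + sse) - sst) < 1e-9

# Multiple coefficient of determination.
r_squared = ssr / sst

# Adjusted R-squared penalises the number of independent variables.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
```

Rounded to three decimals, the two values match the R Square and Adjusted R Square columns of the Model Summary.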

Based on the R² value, 65.4% of the variability in the family income is explained by the estimated multiple regression equation with the value of the house, the age of the head of the household, the current monthly mortgage payment and the gender of the head of the household as the independent variables. After adjusting the coefficient of determination for the number of independent variables in the model, the percentage of variability explained by the model is still moderately high (58.5%). Thus, on this basis (without performing residual analysis), the estimated multiple regression equation fits fairly well.

GLOBAL TEST (F-TEST)

The F-test is used to investigate whether any of the independent variables have significant coefficients. As such, the null and alternate hypotheses are:

H0 : β1 = β2 = β3 = β4 = 0

H1 : not all β’s equal to 0

Under the null hypothesis, all the regression coefficients are zero, which would mean that the independent variables are of no use in estimating the dependent variable. If the null hypothesis is rejected, the alternate hypothesis is accepted: at least one of the coefficients is not zero, so at least one of the variables is significant in explaining the family income.

The F distribution is used as the test statistic and the value of F is found by the following equation:

$$F = \frac{SSR/k}{SSE/[n-(k+1)]}$$

where k, the number of independent variables, is the degrees of freedom for the numerator = 4, and n - (k+1) is the degrees of freedom for the denominator = 25 – (4+1) = 20.

The F test is a one-tailed test. From Appendix B.4 of the textbook, the F critical value at the 0.05 significance level with (4, 20) degrees of freedom is 2.87.

H0 is rejected if p-value < 0.05 or if F > 2.87

Step 5: Determine whether to reject H0

Using the values in the ANOVA table of the SPSS output, the F value is computed as:

$$F = \frac{SSR/k}{SSE/[n-(k+1)]} = \frac{17.357/4}{9.173/20} = 9.461$$

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   17.357           4    4.339         9.461   .000(a)
  Residual     9.173            20   .459
  Total        26.530           24

a Predictors: (Constant), Gender, Mortgage_Payment, Age_Head, House_Value

b Dependent Variable: Family_Income

Since the computed F value, which matches the SPSS output (9.461), exceeds the critical value of 2.87, and the p-value (.000) is less than 0.05, H0 is rejected and H1 is accepted. In conclusion, at least one of the regression coefficients is not equal to zero.
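The global test can be checked numerically with a small Python sketch. It assumes the sums of squares from the ANOVA table and takes the critical value 2.87 directly from the report rather than computing it; the variable names are illustrative.

```python
# Values from the SPSS ANOVA table.
ssr = 17.357   # regression sum of squares
sse = 9.173    # residual sum of squares
n = 25         # number of observations
k = 4          # number of independent variables

df_numerator = k               # 4
df_denominator = n - (k + 1)   # 20

msr = ssr / df_numerator       # mean square due to regression
mse = sse / df_denominator     # mean square error

f_statistic = msr / mse
f_critical = 2.87              # critical value quoted in the report (alpha = 0.05, df = (4, 20))

reject_h0 = f_statistic > f_critical  # one-tailed (upper-tail) decision rule

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F = {f_statistic:.3f}, reject H0: {reject_h0}")
```

The F test is upper-tailed, so only values of F larger than the critical value lead to rejection of H0.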

TESTING OF INDIVIDUAL REGRESSION COEFFICIENTS

The purpose of this testing is to determine whether any of the independent variables are unimportant and could be dropped from the regression model. The t distribution will be used as the test statistic, and the test is two-tailed. Thus the null and alternate hypotheses are as follows:

H0 : β1 = 0    H0 : β2 = 0    H0 : β3 = 0    H0 : β4 = 0

H1 : β1 ≠ 0    H1 : β2 ≠ 0    H1 : β3 ≠ 0    H1 : β4 ≠ 0

For each variable, the null hypothesis is rejected when there is sufficient evidence that its coefficient is not equal to zero. In that case the alternate hypothesis is accepted, and the variable is significant in explaining the dependent variable, i.e. the family income. If the null hypothesis is not rejected, the variable is a candidate to be dropped from the regression model.

The formula for testing an individual regression coefficient is as follows:

$$t = \frac{b_i - 0}{s_{b_i}}$$

where $b_i$ is the estimated regression coefficient and $s_{b_i}$ is its standard error.

[Figure: F distribution, df = (4, 20), critical value 2.87]



A two-tailed test is used. The degrees of freedom are n - (k+1) = 20. Thus, from Appendix B.2 of the textbook, the critical value of t is 2.086.

H0 is rejected if t > 2.086 or t < -2.086

Step 5: Calculation of the test value and decision making:

The values of t1, t2, t3 and t4 are derived from the SPSS output below.

Coefficients(a)

                     Unstandardized        Standardized
                     Coefficients          Coefficients
Model                B        Std. Error   Beta           t        Sig.
1 (Constant)         35.635   1.345                       26.490   .000
  House_Value        .025     .005         .677           4.540    .000
  Age_Head           .007     .027         .037           .273     .788
  Mortgage_Payment   -.001    .001         -.081          -.557    .584
  Gender             .716     .285         .345           2.507    .021

a Dependent Variable: Family_Income

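The t-value for the value of the house:

$$t_1 = \frac{b_1 - 0}{s_{b_1}} = \frac{0.025}{0.005} = 5.0$$

The t-value for the age of the head of the household:

$$t_2 = \frac{b_2 - 0}{s_{b_2}} = \frac{0.007}{0.027} = 0.259$$

The t-value for the current monthly mortgage payment:

$$t_3 = \frac{b_3 - 0}{s_{b_3}} = \frac{-0.001}{0.001} = -1.0$$

The t-value for the gender of the head of the household:

$$t_4 = \frac{b_4 - 0}{s_{b_4}} = \frac{0.716}{0.285} = 2.51$$

[Figure: Scatterplot of Regression Standardized Residual (vertical axis) against Regression Standardized Predicted Value (horizontal axis). Dependent Variable: Family_Income]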

Based on the SPSS output and the computed values, the t-ratios for the value of the house and the gender of the head of the household exceed the critical value of 2.086, whereas the t-ratios for the age of the head of the household and the current monthly mortgage payment do not fall in the rejection region. This indicates that the independent variables value of the house and gender of the head of the household should be retained, and the other two variables should be dropped.
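The retain-or-drop decision can be illustrated with a short Python sketch. It uses the rounded B and Std. Error values from the Coefficients table, so the t-ratios match the manual values rather than the slightly different SPSS t column; the dictionary layout and names are illustrative only.

```python
# (B, Std. Error) pairs from the SPSS Coefficients table, rounded as printed.
coefficients = {
    "House_Value": (0.025, 0.005),
    "Age_Head": (0.007, 0.027),
    "Mortgage_Payment": (-0.001, 0.001),
    "Gender": (0.716, 0.285),
}

t_critical = 2.086  # two-tailed critical value, alpha = 0.05, df = 20 (from the report)

# t-ratio for H0: beta_i = 0 is (b_i - 0) / s_bi.
t_ratios = {name: (b - 0) / se for name, (b, se) in coefficients.items()}

# A variable is retained when its t-ratio falls in either rejection region.
retained = [name for name, t in t_ratios.items() if abs(t) > t_critical]
dropped = [name for name in t_ratios if name not in retained]

for name, t in t_ratios.items():
    print(f"{name:17s} t = {t:6.3f}")
print("retain:", retained)
print("drop:  ", dropped)
```

The absolute value is used because the test is two-tailed: a large negative t-ratio would be just as significant as a large positive one.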

RESIDUAL ANALYSIS

Besides using the coefficient of multiple determination (R²) to determine the fit of the model, a more effective method, residual analysis, is used to validate the model. A high R² does not guarantee that the model fits the data well, and a model that does not fit the data well cannot provide a reliable answer to the underlying questions at hand.

A residual scatterplot can be used to assess both linearity and homoscedasticity. From the SPSS output, the following scatterplot is generated.

[Figure: Histogram of Regression Standardized Residual (horizontal axis) against Frequency (vertical axis). Dependent Variable: Family_Income. Mean = -9.19E-15, Std. Dev. = 0.913, N = 25]

The points in the plot seem to be fluctuating randomly in the horizontal band around zero. The residual plots show a random distribution of positive and negative values across the entire range of the variable plotted on the horizontal axis. The points are scattered and there is no obvious pattern. Thus, the plot supports the assumption of linearity. Also, the scatterplot does not suggest violations of the assumptions of zero means and constant variance of the random errors.

The SPSS package also generated a histogram to check whether the residuals are normally distributed. The histogram shows that the residuals are roughly symmetrically distributed and clustered around zero, which is consistent with an approximately normal distribution. However, one residual is unusually large and should be investigated.

Qn 4

The findings suggest that the gender of the head of the household is a significant variable in predicting the family income. On average, a household headed by a male earns about $716 more than one headed by a female, with the other variables held constant. Its regression coefficient has the highest value among the variables. The analysis reflects the present environment, where the male workforce earns more than the female workforce. Although the Bank should focus its mortgage loans business on males, it is also prudent to target females, as there is a significant rise in the female workforce and their earnings are getting closer to those of males.


The next significant variable is the value of the house that a family owns. From the regression coefficient, the family income increases by $25 for every $1,000 increase in the value of the house. Unsurprisingly, a family that makes higher monthly mortgage payments tends to own a house of significant value. The Bank should target families that are scouting for their dream home. However, families may not want to incur any more debt due to the current economic climate.

Nevertheless, the figures produced are based on very basic parameters. Factors such as the type of home, location and number of people in the household are important information that would improve the quality of the data and thus increase the accuracy of the tests. Such information is vital, as it would make the tests more comprehensive in identifying reliable predictors. It would help the Bank come up with a solid business plan for its mortgage loans activities.

Another aspect to be considered is that these tests only determine the possible existence of a relationship between the independent and dependent variables. The tests are not a causal analysis but merely tools to determine correlation between the variables.

Word Count : 300

References

Lind, D.A., Marchal, W.G., & Wathen, S.A. (2008), Basic Statistics for Business and Economics, 6th International Edition, McGraw-Hill Companies Inc., New York, USA
