Statistics ECA Report
BUS105
STATISTICS
End-of-Course Assignment
January Semester Assessment
Submitted by:
Muhamad Fauzi Bin Mat Isa
14 March 2009
Table of Contents
Question 1
Question 2
Question 3
Question 4
References
Qn1
[Scatterplot: Family_Income (vertical axis, 37.0 to 41.0) against House_Value (horizontal axis, 100 to 200)]
[Scatterplot: Family_Income (vertical axis, 37.0 to 41.0) against Age_Head (horizontal axis, 40 to 55)]
a.
The scatterplot shows that the higher the value of the house, the higher the income of the family that owns it tends to be. The two variables appear to have a strong positive (direct) linear relationship.
b.
The scatterplot shows a wide spread. It indicates that the older the head of the household, the higher the family income tends to be. The two variables appear to have a weak positive linear relationship.
c.
[Scatterplot: Family_Income (vertical axis, 37.0 to 41.0) against Mortgage_Payment (horizontal axis, 200 to 500)]
The scatterplot shows that the higher the current monthly mortgage payment, the lower the family income tends to be. The two variables appear to have a weak negative (inverse) linear relationship.
Qn2
a.
Correlations
                                      Family_Income   House_Value
Family_Income   Pearson Correlation   1               .720(**)
                Sig. (2-tailed)                       .000
                N                     25              25
House_Value     Pearson Correlation   .720(**)        1
                Sig. (2-tailed)       .000
                N                     25              25
** Correlation is significant at the 0.01 level (2-tailed).
From the SPSS output, the correlation coefficient r = 0.720 indicates a positive (direct) relationship between the value of the house and the family income. The coefficient of correlation ranges from -1.00 to +1.00, where either extreme indicates a perfect correlation and 0 indicates none, so a value of 0.720 is evidence of a moderately strong association. In other words, as the value of the house increases, the family income tends to increase.
b.
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .720(a)   .518       .497                .7457
a Predictors: (Constant), House_Value
From the SPSS output, the coefficient of determination r2 is 0.518. Calculated manually, it is (0.720)2 = 0.518. This means that about 52% of the variation in the family income is explained (accounted for) by the variation in the value of the house.
c.
A test of significance for the coefficient of correlation is used to determine whether the computed r could have come from a population in which the two variables are not related, i.e. a population with zero correlation.
Step 1: Stating the hypothesis:
H0: ρ = 0 (The correlation in the population is zero)
H1: ρ ≠ 0 (The correlation in the population is not zero)
The null hypothesis H0 is that there is no correlation in the population and the alternate H1 that there is a correlation. The two-tailed test is used due to the way H1 is stated.
Step 2: Stating the level of significance:
The test will be done using 0.05 significance level, α= 0.05
Step 3: Using the appropriate test statistic:
The test statistic follows the t distribution with n - 2 degrees of freedom, and the formula is
$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$
Step 4: Stating the decision rule:
H0 is rejected if t > t(α/2, n-2) or t < -t(α/2, n-2)
i.e. t > t(0.025, 23) or t < -t(0.025, 23)
i.e. t > 2.069 or t < -2.069
[Figure: t distribution with rejection regions below -2.069 and above 2.069]
Step 5: Calculation of test value, critical value and decision making:
Computing t:
$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \dfrac{0.720\sqrt{25-2}}{\sqrt{1-(0.720)^{2}}} = 4.976$
The computed t (4.976) falls in the rejection region, so H0 is rejected. This means that the correlation in the population is not zero, i.e. there is a correlation between the value of the house and the family income.
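The five steps above can be checked with a short script. This is only a sketch: the values r = 0.720, n = 25 and the critical value 2.069 are copied from the text, and the function name is an illustrative choice rather than anything taken from SPSS.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.720           # sample correlation from the SPSS output
n = 25              # number of observations
t_critical = 2.069  # two-tailed critical value quoted in the text (alpha = 0.05, df = 23)

t_value = correlation_t_statistic(r, n)
reject_h0 = abs(t_value) > t_critical  # two-tailed decision rule

print(f"t = {t_value:.3f}, reject H0: {reject_h0}")
```

The critical value is hard-coded from the table lookup in the text so that the sketch needs nothing beyond the standard library.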
d.
To find the expected family income when the value of the house is $175,000, the following formula is used:
General Form of Linear Regression Ŷ = a + bX
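In order to use the equation, the following formulas are used to find the values of a and b:
Slope of the regression line: $b = r\,\dfrac{s_Y}{s_X}$
Y-intercept: $a = \bar{Y} - b\bar{X}$
where $s_Y$ and $s_X$ are the standard deviations, and $\bar{Y}$ and $\bar{X}$ the means, of the family income and the house value. The values needed for these formulas are taken from the SPSS output.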
Descriptive Statistics
                     N    Sum     Mean     Std. Deviation
Family_Income        25   998.2   39.928   1.0514
House_Value          25   3849    153.96   28.841
Valid N (listwise)   25
Slope of the regression line: $b = r\,\dfrac{s_Y}{s_X} = 0.720 \times \dfrac{1.0514}{28.841} = 0.026$
Y-intercept: $a = \bar{Y} - b\bar{X} = 39.928 - 0.026(153.96) = 35.925$
Thus, the expected family income when the value of the house is $175,000:
General Form of Linear Regression Ŷ = a + bX
Ŷ = 35.925 + 0.026(175) = 40.475
Hence, the expected family income is $40,475 when the value of the house is $175,000.
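The manual working above can be reproduced from the summary statistics alone. The sketch below assumes the slope is obtained as b = r(sY/sX), which matches the value 0.026 in the text, and rounds b to three decimals before computing a, as the manual working appears to do. Variable names are illustrative.

```python
# Summary statistics taken from the SPSS output quoted in the text.
r = 0.720             # correlation between house value and family income
mean_income = 39.928  # mean family income ($000)
sd_income = 1.0514    # standard deviation of family income
mean_house = 153.96   # mean house value ($000)
sd_house = 28.841     # standard deviation of house value

# Slope: b = r * (s_Y / s_X), rounded to three decimals as in the manual working.
b_exact = r * sd_income / sd_house
b = round(b_exact, 3)

# Intercept: a = mean(Y) - b * mean(X)
a = mean_income - b * mean_house

# Predicted family income for a house valued at $175,000 (X = 175).
house_value = 175
predicted_income = a + b * house_value

# Intercept obtained when the slope is not rounded first.
a_unrounded = mean_income - b_exact * mean_house

print(f"b = {b:.3f}, a = {a:.3f}, predicted income = {predicted_income:.3f}")
print(f"intercept with unrounded slope = {a_unrounded:.3f}")
```

Leaving the slope unrounded gives an intercept of about 35.89, which suggests that the gap between the manual intercept and the SPSS intercept comes from rounding b.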
Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients                     95% Confidence Interval for B
Model             B         Std. Error          Beta                        t        Sig.     Lower Bound   Upper Bound
1  (Constant)     35.889    .826                                            43.440   .000     34.180        37.598
   House_Value    .026      .005                .720                        4.971    .000     .015          .037
a Dependent Variable: Family_Income
From the SPSS output, the values for a and b in the estimated linear regression equation are found in the B column of the Coefficients table (highlighted in the original output), where a = 35.889 and b = 0.026.
Thus, the estimated regression equation is Ŷ = 35.889 + 0.026X. There is a slight difference between the intercept from the manual computation (35.925) and the SPSS intercept (35.889), which is due to rounding.
Qn3
MULTIPLE REGRESSION ANALYSIS
From the SPSS output, the values for b1 to b4 in the estimated multiple regression equation are found in the B column of the Coefficients table (highlighted in the original output):
Coefficients(a)
                       Unstandardized Coefficients   Standardized Coefficients
Model                  B         Std. Error          Beta                        t        Sig.
1  (Constant)          35.635    1.345                                           26.490   .000
   House_Value         .025      .005                .677                        4.540    .000
   Age_Head            .007      .027                .037                        .273     .788
   Mortgage_Payment    -.001     .001                -.081                       -.557    .584
   Gender              .716      .285                .345                        2.507    .021
a Dependent Variable: Family_Income
After substituting the b1 to b4 values, the multiple regression equation is as follows:
Ŷ = 35.635 + 0.025X1 + 0.007X2 - 0.001X3 + 0.716X4
Where:
Ŷ = the family income (S$)
X1 = the value of the house
X2 = the age of the head of the household
X3 = the current monthly mortgage payment
X4 = 0 if a female is the head of the household, 1 if a male is the head of the household
Interpretations of the coefficients
b1: The coefficient of the value of the house (X1) is positive, indicating a direct relationship. When the value of the house increases, the family income increases as well. For each additional $1,000 in the value of the house, the family income is expected to increase by $25 when the other variables are held constant.
b2: The coefficient of the age of the head of the household (X2) is positive, indicating a direct relationship. With an older head, the family income also increases. For each additional year of age of the head, the family income is expected to increase by $7, provided the other variables are held constant.
b3: The coefficient of the current monthly mortgage payment (X3) is negative, indicating an inverse relationship. As the current monthly mortgage payment increases, the family income decreases. Since the family income is recorded in thousands of dollars, a $100 increase in the mortgage payment, with the other variables held constant, corresponds to an estimated decrease of 0.001 × 100 = 0.1, i.e. about $100, in the family income.
b4: With the other variables held constant, the family income of a household headed by a male is on average $716 (0.716 in thousands of dollars) higher than that of a household headed by a female.
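The interpretation of the coefficients can be illustrated by evaluating the fitted equation for a pair of households. This is a sketch only: the coefficients are the rounded SPSS values from the Coefficients table, the household used (house value 175, age 45, mortgage payment 300) is hypothetical, and income is assumed to be in thousands of dollars, as in the earlier conversion of 40.475 to $40,475.

```python
def predict_family_income(house_value, age_head, mortgage_payment, male_head):
    """Estimated family income ($000) from the fitted multiple regression equation.

    house_value is in $000, age_head in years, mortgage_payment in dollars,
    and male_head is 1 for a male head of household, 0 for a female head.
    """
    return (35.635
            + 0.025 * house_value
            + 0.007 * age_head
            - 0.001 * mortgage_payment
            + 0.716 * male_head)


# Hypothetical household, chosen only for illustration.
male = predict_family_income(175, 45, 300, 1)
female = predict_family_income(175, 45, 300, 0)

print(f"male head: {male:.3f}, female head: {female:.3f}")
print(f"difference: {male - female:.3f}")  # equals the Gender coefficient
```

Holding the other three inputs fixed and changing only male_head isolates the Gender coefficient, which is what "when the other variables are held constant" means in practice.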
MODEL FIT
The model is fitted using the least squares criterion to develop the following equation:
Ŷ = a + b1X1 + b2X2 + b3X3 + … + bkXk
Because the calculation is tedious to perform by hand, the SPSS package is used instead.
From the SPSS output, the ANOVA table generated the following values:
ANOVA(b)
Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   17.357           4    4.339         9.461   .000(a)
   Residual     9.173            20   .459
   Total        26.530           24
a Predictors: (Constant), Gender, Mortgage_Payment, Age_Head, House_Value
b Dependent Variable: Family_Income
The SSR (the sum of squares due to regression) and SST (the total sum of squares) are used to compute the multiple coefficient of determination (R2):
R2 = SSR/SST = 17.357/26.530 = 0.654
The computed result agrees with the R2 value shown in the "Model Summary" of the SPSS output below.
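As a quick check of this ratio, the sums of squares from the ANOVA table can be plugged in directly. The sketch below uses only the three figures quoted in the table; the variable names are illustrative.

```python
# Sums of squares from the ANOVA table quoted in the text.
ssr = 17.357  # regression sum of squares
sse = 9.173   # residual sum of squares
sst = 26.530  # total sum of squares

# The total sum of squares should equal regression plus residual.
decomposition_ok = abs((ssr + sse) - sst) < 1e-9

# Multiple coefficient of determination.
r_squared = ssr / sst

print(f"SSR + SSE = SST: {decomposition_ok}")
print(f"R2 = {r_squared:.3f}")
```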
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .809(a)   .654       .585                .6772
a Predictors: (Constant), Gender, Mortgage_Payment, Age_Head, House_Value
The adjusted R2 is computed manually to confirm the adjusted R2 generated by the SPSS package. It can be computed through the following formula:
$\bar{R}^{2} = 1 - (1 - R^{2})\dfrac{n-1}{n-p-1}$
where n = number of observations, and p = number of independent variables.
$\bar{R}^{2} = 1 - (1 - 0.654)\dfrac{25-1}{25-4-1} = 0.585$
From the result, the computed value agrees with the SPSS output above.
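The adjustment can be sketched as a small function. It assumes the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), with n and p as defined above; the function name is illustrative and the inputs are the R Square value and sample size from the SPSS output.

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjust R2 for the number of independent variables p, given n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


r_squared = 0.654  # R Square from the Model Summary
n = 25             # number of observations
p = 4              # number of independent variables

adj = adjusted_r_squared(r_squared, n, p)
print(f"adjusted R2 = {adj:.3f}")
```

Because (n - 1)/(n - p - 1) is greater than 1 whenever p is at least 1, the adjusted value is always below R2, and the gap widens as more independent variables are added to a small sample.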
Based on the R2 value, 65.4% of the variability in the family income is explained by the estimated multiple regression equation with the value of the house, the age of the head of the household, the current monthly mortgage payment and the gender of the head of the household as the independent variables. After adjusting the coefficient of determination for the number of independent variables in the model, the percentage of variability explained by the model is still moderately high (58.5%). Thus, on this basis (without performing residual analysis), the estimated multiple regression equation fits fairly well.
GLOBAL TEST (F-TEST)
The F-test is used to investigate whether any of the independent variables have significant coefficients. As such, the null and alternate hypotheses are:
H0 : β1 = β2 = β3 = β4 = 0
H1 : not all β's equal to 0
Under the null hypothesis, all the regression coefficients are zero, which would mean that the independent variables are of no use in estimating the dependent variable. Under the alternate hypothesis, at least one of the coefficients is not zero, i.e. at least one of the variables is significant in explaining the family income.
The F distribution is used as the test statistic and the value of F is found by the following equation:
$F = \dfrac{SSR/k}{SSE/[n-(k+1)]}$
where SSR is the regression sum of squares, SSE is the residual sum of squares, k is the number of independent variables and n is the number of observations.
Step 1: Stating the hypothesis:
H0 : β1 = β2 = β3 = β4 = 0
H1 : not all β's equal to 0
Step 2: Stating the level of significance:
The test will be done using 0.05 significance level, α= 0.05
k is the degrees of freedom for the numerator = 4
n-(k+1) is the degrees of freedom for the denominator = 25 - (4+1) = 20
From the Appendix B.4 of the text book, the F critical value is 2.87
Step 3: Using the appropriate test statistic:
The one-tailed test F distribution is used as the test statistic.
Step 4: Stating the decision rule:
H0 is rejected if p-value < 0.05 or if F > 2.87
Step 5: Determine whether to reject H0
From the computed value using the SPSS data in the ANOVA table, the F value is:
$F = \dfrac{SSR/k}{SSE/[n-(k+1)]} = \dfrac{17.357/4}{9.173/20} = 9.461$
ANOVA(b)
Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   17.357           4    4.339         9.461   .000(a)
   Residual     9.173            20   .459
   Total        26.530           24
a Predictors: (Constant), Gender, Mortgage_Payment, Age_Head, House_Value
b Dependent Variable: Family_Income
Since the computed F value (9.461), which matches the SPSS output, exceeds the critical value of 2.87, and the p-value (.000) is below 0.05, H0 is rejected in favour of H1. In conclusion, at least one of the regression coefficients is not equal to zero.
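The global test can be reproduced from the ANOVA figures. In this sketch the sums of squares, n, k and the critical value 2.87 are all copied from the text, and the variable names are illustrative.

```python
# Figures from the ANOVA table and the text.
ssr = 17.357       # regression sum of squares
sse = 9.173        # residual sum of squares
n = 25             # number of observations
k = 4              # number of independent variables
f_critical = 2.87  # critical F quoted in the text (alpha = 0.05, df = 4 and 20)

msr = ssr / k              # mean square regression
mse = sse / (n - (k + 1))  # mean square error (residual)
f_value = msr / mse

reject_h0 = f_value > f_critical  # one-tailed decision rule
print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F = {f_value:.3f}, reject H0: {reject_h0}")
```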
TESTING OF INDIVIDUAL REGRESSION COEFFICIENTS
The testing is to determine if any of the independent variables are considered unimportant and to be dropped from the regression model. The t distribution will be used as the test statistic. It will also be a two-tailed test. Thus the null and alternate hypotheses are as follows:
H0 : β1 = 0    H1 : β1 ≠ 0
H0 : β2 = 0    H1 : β2 ≠ 0
H0 : β3 = 0    H1 : β3 ≠ 0
H0 : β4 = 0    H1 : β4 ≠ 0
For each variable, rejecting the null hypothesis implies that its coefficient is not zero, i.e. the variable is significantly related to the dependent variable (the family income). If the null hypothesis is not rejected, the variable should be considered for removal from the regression model.
The formula for testing an individual regression coefficient is as follows:
$t = \dfrac{b_i - 0}{s_{b_i}}$
where $b_i$ is the estimated coefficient and $s_{b_i}$ is its standard error.
Step 1: Stating the hypothesis:
H0 : β1 = 0    H1 : β1 ≠ 0
H0 : β2 = 0    H1 : β2 ≠ 0
H0 : β3 = 0    H1 : β3 ≠ 0
H0 : β4 = 0    H1 : β4 ≠ 0
Step 2: Stating the level of significance:
The test will be done using 0.05 significance level, α= 0.05
Step 3: Using the appropriate test statistic:
The two-tailed test is used. The degrees of freedom are n-(k+1) = 20. Thus, from Appendix B.2 of the text book, the critical value of t is 2.086.
Step 4: Stating the decision rule:
H0 is rejected if t > 2.086 or t < -2.086
Step 5: Calculation of test value, critical value and decision making:
The values of t1, t2, t3 and t4 are derived from the B and Std. Error columns of the SPSS output.
Coefficients(a)
                       Unstandardized Coefficients   Standardized Coefficients
Model                  B         Std. Error          Beta                        t        Sig.
1  (Constant)          35.635    1.345                                           26.490   .000
   House_Value         .025      .005                .677                        4.540    .000
   Age_Head            .007      .027                .037                        .273     .788
   Mortgage_Payment    -.001     .001                -.081                       -.557    .584
   Gender              .716      .285                .345                        2.507    .021
a Dependent Variable: Family_Income
The t-value for the value of the house:
$t_1 = \dfrac{b_1}{s_{b_1}} = \dfrac{0.025}{0.005} = 5.0$
The t-value for the age of the head of the household:
$t_2 = \dfrac{b_2}{s_{b_2}} = \dfrac{0.007}{0.027} = 0.259$
The t-value for the current monthly mortgage payment:
$t_3 = \dfrac{b_3}{s_{b_3}} = \dfrac{-0.001}{0.001} = -1.0$
The t-value for the gender of the head of the household:
$t_4 = \dfrac{b_4}{s_{b_4}} = \dfrac{0.716}{0.285} = 2.51$
Based on the SPSS output and the computed values, the t-ratios for the value of the house and the gender of the head of the household exceed the critical value of 2.086, whereas the t-ratios for the age of the head of the household and the current monthly mortgage payment do not fall in the rejection region. This indicates that the value of the house and the gender of the head of the household should be retained in the model, and the other two variables should be dropped.
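The same screening can be sketched in a few lines, assuming each t-ratio is the coefficient divided by its standard error. The B and Std. Error values are the rounded figures from the Coefficients table, and the critical value 2.086 is the one quoted in the text.

```python
# (B, Std. Error) pairs from the Coefficients table quoted in the text.
coefficients = {
    "House_Value": (0.025, 0.005),
    "Age_Head": (0.007, 0.027),
    "Mortgage_Payment": (-0.001, 0.001),
    "Gender": (0.716, 0.285),
}
t_critical = 2.086  # two-tailed, alpha = 0.05, df = 20 (value quoted in the text)

# t-ratio for each variable: coefficient divided by its standard error.
t_ratios = {name: b / se for name, (b, se) in coefficients.items()}

# A variable is retained when its t-ratio falls in either rejection region.
retained = [name for name, t in t_ratios.items() if abs(t) > t_critical]
dropped = [name for name in t_ratios if name not in retained]

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.3f}")
print("retain:", retained)
print("drop:", dropped)
```

The ratios computed from the rounded table values differ slightly from the t column that SPSS reports (for example 5.0 against 4.540 for House_Value), but the retain or drop decision is the same for every variable.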
RESIDUAL ANALYSIS
Besides using the coefficient of multiple determination (R2) to assess the fit of the model, residual analysis is a more effective method for validating a model. A high R2 does not guarantee that the model fits the data well, and a model that does not fit the data well cannot answer the underlying questions at hand.
A residual scatterplot can be used to assess both linearity and homoscedasticity. From the SPSS output, the following scatterplot is generated.
[Scatterplot: Regression Standardized Residual (vertical axis) against Regression Standardized Predicted Value (horizontal axis); Dependent Variable: Family_Income]
The points in the plot fluctuate randomly within a horizontal band around zero. The residual plot shows a random distribution of positive and negative values across the entire range of the variable plotted on the horizontal axis. The points are scattered and there is no obvious pattern. Thus, the plot supports the assumption of linearity. Also, the scatterplot does not suggest violations of the assumptions of zero mean and constant variance of the random errors.
[Histogram: Frequency of Regression Standardized Residual; Dependent Variable: Family_Income; Mean = -9.19E-15, Std. Dev. = 0.913, N = 25]
The SPSS package also generated a histogram to check whether the residuals are normally distributed. The histogram shows that the residuals are roughly symmetrically distributed and clustered around zero, which is consistent with the assumption of normality. However, one residual is unusually large and should be investigated.
Qn 4
The findings suggest that the gender of the head of the household is a significant variable in predicting the family income. On average, a household headed by a male has a family income about $716 higher than one headed by a female, and gender has the largest regression coefficient among the variables. The analysis reflects the present environment, in which the male workforce earns more than the female workforce. Although the Bank should focus its mortgage loan business on males, it is also prudent to target females, as there is a significant rise in the female workforce and their earnings are getting closer to those of males.
The next significant variable is the value of the house that a family owns. From the regression coefficient, the family income increases by $25 for every $1,000 increase in the value of the house. Unsurprisingly, a family that pays a higher monthly mortgage tends to own a house of significant value. The Bank should target families that are scouting for their dream home. However, families may not want to incur any more debt due to the current economic climate.
Notwithstanding, the figures produced are based on very basic parameters. Factors such as the type of home, the location and the number of people in the household are important information that would improve the quality of the data and thus increase the accuracy of the tests. Such information is vital, as it would make the tests more comprehensive in identifying reliable predictors. It would help the Bank come up with a solid business plan for its mortgage loan activities.
Another aspect to be considered is that these tests are to determine the possible existence of relationship between the independent and dependent variables. The tests are not a causal analysis but merely tools to determine correlation between the variables.
Word Count : 300
References
Lind, D.A., Marchal, W.G., & Wathen, S.A. (2008). Basic Statistics for Business and Economics, 6th International Edition. McGraw-Hill Companies Inc., New York, USA.