AMS394 Midterm Solution - Stony Brookzhu/ams394/Midterm_all_R.docx · Web view(3) Summary: From the...

AMS394 Midterm Solution, R

################## Question 1 Version 1 - apples.xls #################

growth <- c(2569,2928,2865,3844,3027,2336,3211,3037,
            2074,2885,3378,3906,2782,3018,3383,3447,
            2505,2315,2667,2390,3021,3085,3308,3231,
            2838,2351,3001,2439,2199,3318,3601,3291,
            1532,2552,3083,2330,2079,3366,2416,3100)
types <- c(rep("1",8), rep("2",8), rep("3",8), rep("4",8), rep("5",8))
apples <- data.frame(types, growth)
results1 = aov(growth ~ types, data = apples)
summary(results1)

##             Df  Sum Sq Mean Sq F value Pr(>F)
## types        4 1356282  339071   1.314  0.284
## Residuals   35 9030048  258001

(1) Summary: The p-value is 0.284, which is greater than 0.05, so we fail to reject the null hypothesis that all five types have the same mean growth.
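As a check on how the F value and p-value in the ANOVA table are obtained, the following sketch recomputes them from the sums of squares and degrees of freedom printed above. The variable names are illustrative and not part of the original solution.

```r
# Values copied from the ANOVA table above
ss_types <- 1356282; df_types <- 4
ss_resid <- 9030048; df_resid <- 35

# Mean squares are sums of squares divided by their degrees of freedom
ms_types <- ss_types / df_types
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under F(4, 35)
f_value <- ms_types / ms_resid
p_value <- pf(f_value, df_types, df_resid, lower.tail = FALSE)
round(c(F = f_value, p = p_value), 3)
```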

TukeyHSD(results1, conf.level = 0.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = growth ~ types, data = apples)
##
## $types
##         diff        lwr      upr     p adj
## 2-1  132.000  -598.1767 862.1767 0.9847805
## 3-1 -161.875  -892.0517 568.3017 0.9678464
## 4-1  -97.375  -827.5517 632.8017 0.9952020
## 5-1 -419.875 -1150.0517 310.3017 0.4749275
## 3-2 -293.875 -1024.0517 436.3017 0.7750989
## 4-2 -229.375  -959.5517 500.8017 0.8937492
## 5-2 -551.875 -1282.0517 178.3017 0.2136349
## 4-3   64.500  -665.6767 794.6767 0.9990363
## 5-3 -258.000  -988.1767 472.1767 0.8463381
## 5-4 -322.500 -1052.6767 407.6767 0.7109124

(2) Summary: Tukey's multiple comparisons of means confirm that the means are NOT significantly different for any pair of types. This is consistent with the results in part (a).

boxplot(growth ~ types, data = apples, main = "Side-by-Side Boxplots")

par(mfrow=c(2,2))
plot(results1)

(3) Summary: From the boxplots, there is an indication of unequal variance, which conflicts with the equal variance assumption. The residual plot shows some concern of unequal variance as well.
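The visual impression of unequal variance can be backed up numerically. The sketch below is not part of the original solution: it prints the per-type standard deviations and runs Bartlett's test (`bartlett.test` from base R's stats package) as one possible formal check. Using `factor(rep(1:5, each = 8))` for the grouping is a shorthand assumed here for the same labels as above.

```r
growth <- c(2569, 2928, 2865, 3844, 3027, 2336, 3211, 3037,
            2074, 2885, 3378, 3906, 2782, 3018, 3383, 3447,
            2505, 2315, 2667, 2390, 3021, 3085, 3308, 3231,
            2838, 2351, 3001, 2439, 2199, 3318, 3601, 3291,
            1532, 2552, 3083, 2330, 2079, 3366, 2416, 3100)
types <- factor(rep(1:5, each = 8))  # same grouping as in the solution

# Spread within each type
group_var <- tapply(growth, types, var)
group_sd <- sqrt(group_var)
round(group_sd, 1)

# With equal group sizes, the residual mean square in the ANOVA table
# is the average of the group variances
mean(group_var)

# Bartlett's test of homogeneity of variances (sensitive to non-normality)
bt <- bartlett.test(growth, types)
bt$p.value
```

A small p-value from Bartlett's test would support the concern raised by the boxplots; a large one would suggest the differences in spread are within what chance alone could produce for groups of eight.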

types2 <- growth[9:16]
types5 <- growth[33:40]

From the box-plots in part (c), we find the fastest growing is type 2, and the slowest growing is type 5.

shapiro.test(types2)

##
##  Shapiro-Wilk normality test
##
## data:  types2
## W = 0.95359, p-value = 0.7473

shapiro.test(types5)

##
##  Shapiro-Wilk normality test
##
## data:  types5
## W = 0.95931, p-value = 0.8035

Summary: For both samples, the Shapiro-Wilk test fails to reject the null hypothesis of normality, so we treat both samples as normal and continue with the independent samples t-test.

var.test(types2, types5)

##
##  F test to compare two variances
##
## data:  types2 and types5
## F = 0.82805, num df = 7, denom df = 7, p-value = 0.8098
## alternative hypothesis: true ratio of variances is not equal to 1

t.test(types2, types5, var.equal=T)

##
##  Two Sample t-test
##
## data:  types2 and types5
## t = 1.9029, df = 14, p-value = 0.07782
## alternative hypothesis: true difference in means is not equal to 0

(4) Summary: The F-test shows that the variances can be considered equal. Therefore, we adopted the pooled-variance t-test and found no significant difference in means between types 2 and 5 at alpha = 0.05.
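To make the pooled-variance t-test concrete, here is a minimal sketch that computes the statistic by hand and compares it with `t.test(..., var.equal = TRUE)`. The names `sp2`, `t_manual`, and `p_manual` are illustrative only.

```r
types2 <- c(2074, 2885, 3378, 3906, 2782, 3018, 3383, 3447)  # growth[9:16]
types5 <- c(1532, 2552, 3083, 2330, 2079, 3366, 2416, 3100)  # growth[33:40]
n1 <- length(types2)
n2 <- length(types5)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(types2) + (n2 - 1) * var(types5)) / (n1 + n2 - 2)

# t statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_manual <- (mean(types2) - mean(types5)) / sqrt(sp2 * (1 / n1 + 1 / n2))
p_manual <- 2 * pt(abs(t_manual), df = n1 + n2 - 2, lower.tail = FALSE)

# Built-in equivalent
builtin <- t.test(types2, types5, var.equal = TRUE)
c(manual = t_manual, builtin = unname(builtin$statistic))
```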

################ Question 1 Version 2 - startup.xls ##################

cost <- c( 80,125, 35, 58,110,140, 97, 50, 65, 79,
          150, 40,120, 75,160, 60, 45,100, 86, 87,
           48, 35, 95, 45, 75,115, 42, 78, 65,125,
          100, 96, 35, 99, 75,150, 45,100,120, 50,
           45, 80, 30, 35, 30, 28, 25, 75, 48, 50)
business <- c(rep("1",10), rep("2",10), rep("3",10), rep("4",10), rep("5",10))
startup <- data.frame(business, cost)
results2 = aov(cost ~ business, data = startup)
summary(results2)

##             Df Sum Sq Mean Sq F value Pr(>F)
## business     4  14487    3622   3.308 0.0185 *
## Residuals   45  49272    1095
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(1) Summary: The p-value is 0.0185, which is less than 0.05, so we reject the null hypothesis that all five businesses have the same mean cost.

TukeyHSD(results2, conf.level = 0.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = cost ~ business, data = startup)
##
## $business
##      diff       lwr        upr     p adj
## 2-1   8.4 -33.64813 50.4481262 0.9790864
## 3-1 -11.6 -53.64813 30.4481262 0.9339754
## 4-1   3.1 -38.94813 45.1481262 0.9995543
## 5-1 -39.3 -81.34813  2.7481262 0.0771419
## 3-2 -20.0 -62.04813 22.0481262 0.6610178
## 4-2  -5.3 -47.34813 36.7481262 0.9963506
## 5-2 -47.7 -89.74813 -5.6518738 0.0190102
## 4-3  14.7 -27.34813 56.7481262 0.8569366
## 5-3 -27.7 -69.74813 14.3481262 0.3469488
## 5-4 -42.4 -84.44813 -0.3518738 0.0472130

(2) Summary: Tukey's multiple comparisons of means show that businesses 2 and 5, and businesses 4 and 5, are significantly different at the familywise error rate of 0.05. This is consistent with the results in part (a).
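Rather than reading the significant pairs off the printed table, they can be filtered programmatically. This is a sketch, not the original solution's code: it refits the same one-way model with `business` declared explicitly as a factor (an assumption made here for clarity) and keeps only rows whose adjusted p-value is below 0.05.

```r
cost <- c( 80, 125,  35,  58, 110, 140,  97,  50,  65,  79,
          150,  40, 120,  75, 160,  60,  45, 100,  86,  87,
           48,  35,  95,  45,  75, 115,  42,  78,  65, 125,
          100,  96,  35,  99,  75, 150,  45, 100, 120,  50,
           45,  80,  30,  35,  30,  28,  25,  75,  48,  50)
business <- factor(rep(1:5, each = 10))

fit <- aov(cost ~ business)

# TukeyHSD returns a list with one matrix per factor; columns are
# diff, lwr, upr, and "p adj"
tk <- TukeyHSD(fit, conf.level = 0.95)$business
sig <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]
sig
```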

boxplot(cost ~ business, data = startup, main = "Side-by-Side Boxplots")


(3) Summary: From the boxplots, there is an indication of unequal variance, which conflicts with the equal variance assumption. The residual plot shows some concern of unequal variance as well.

business4 <- cost[31:40]
business5 <- cost[41:50]

From the box-plots in part (c), we find the most expensive is business 4 (it has the highest median cost) and the least expensive is business 5, so these two are compared below.

shapiro.test(business4)

##
##  Shapiro-Wilk normality test
##
## data:  business4
## W = 0.94227, p-value = 0.5786

shapiro.test(business5)

##
##  Shapiro-Wilk normality test
##
## data:  business5
## W = 0.85582, p-value = 0.06811

Summary: The Shapiro-Wilk test does not reject normality for business 4. For business 5 the p-value (0.06811) is less than 0.1, so at the 0.1 level we conclude that business 5 does not follow a normal distribution. We therefore continue with the Wilcoxon Rank Sum Test.

### NOTE: IF YOU USE SIGNIFICANCE LEVEL OF 0.05 FOR NORMALITY TEST, THEN YOU CAN CONCLUDE THAT BOTH SAMPLES FOLLOW NORMAL DISTRIBUTION AND CONTINUE WITH VARIANCE TEST AND T-TEST. THE FINAL RESULT IS THE SAME. THERE IS SOME AMBIGUITY FOR THE NORMALITY TEST HERE.

### NOTE: USING SIGNIFICANCE LEVEL OF 0.1 FOR NORMALITY IS MORE ROBUST.
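The decision path described in the notes (normality check, then either a variance test plus t-test or a rank-sum test) can be sketched as a small function. `compare_two_groups` and its arguments are hypothetical names introduced for illustration; the thresholds are parameters so that both the 0.1 and the 0.05 choices can be tried.

```r
# Hypothetical helper, not part of the original solution: chooses the
# two-sample test from the Shapiro-Wilk and F-test results.
compare_two_groups <- function(x, y, alpha_normal = 0.1, alpha_var = 0.05) {
  p_norm <- c(shapiro.test(x)$p.value, shapiro.test(y)$p.value)
  if (any(p_norm < alpha_normal)) {
    # At least one sample looks non-normal: use the rank-sum test
    return(wilcox.test(x, y, exact = FALSE))
  }
  # Both samples pass: pooled t-test if variances look equal, Welch otherwise
  equal_var <- var.test(x, y)$p.value >= alpha_var
  t.test(x, y, var.equal = equal_var)
}

business4 <- c(100, 96, 35, 99, 75, 150, 45, 100, 120, 50)  # cost[31:40]
business5 <- c(45, 80, 30, 35, 30, 28, 25, 75, 48, 50)      # cost[41:50]

res_10 <- compare_two_groups(business4, business5, alpha_normal = 0.1)
res_05 <- compare_two_groups(business4, business5, alpha_normal = 0.05)
c(res_10$method, res_05$method)
c(res_10$p.value, res_05$p.value)
```

With the normality threshold at 0.1 the function takes the Wilcoxon branch, and at 0.05 it takes the t-test branch, which mirrors the two routes the notes accept.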

wilcox.test(business4, business5, exact=F)

##
##  Wilcoxon rank sum test with continuity correction
##
## data:  business4 and business5
## W = 86, p-value = 0.007153
## alternative hypothesis: true location shift is not equal to 0

(4) Summary: The Wilcoxon Rank Sum Test gives a p-value of 0.007153, so we reject the null hypothesis and conclude that there is a significant difference in location (typical cost) between business 4 and 5.
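The W statistic reported by `wilcox.test` can be reproduced by counting pairs: it is the number of (business 4, business 5) pairs in which the business 4 value is larger, with ties counted as one half. The sketch below, with illustrative variable names, shows this count next to the built-in value.

```r
business4 <- c(100, 96, 35, 99, 75, 150, 45, 100, 120, 50)  # cost[31:40]
business5 <- c(45, 80, 30, 35, 30, 28, 25, 75, 48, 50)      # cost[41:50]

# Compare every business4 value with every business5 value
greater <- sum(outer(business4, business5, ">"))
ties <- sum(outer(business4, business5, "=="))
w_manual <- greater + 0.5 * ties

# Built-in statistic for comparison
w_builtin <- unname(wilcox.test(business4, business5, exact = FALSE)$statistic)
c(manual = w_manual, builtin = w_builtin)
```

Out of 100 possible pairs, a value of 86 means business 4 is the more expensive one in the large majority of comparisons, which is why the test rejects.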

### T-TEST WILL BE ACCEPTED HERE AND WILL DRAW THE SAME CONCLUSION.

################# Question 2 Version 1 - advertise.xls ###############

inquiries <- c(11, 8, 6, 8,10,12,13,11, 4, 3, 5, 6,
                9,10,10,12, 7, 8,11, 9, 5, 8, 6, 7,
                8, 9, 9,11, 7, 8,10, 9, 5, 9, 7, 6,
                4, 5, 3, 5, 9, 6, 8, 8, 7, 6, 6, 5,
               13,12,11,14,10, 9, 9, 8,12,10,11,12)
day <- c(rep("monday",12), rep("tuesday",12), rep("wednesday",12), rep("thursday",12), rep("friday",12))
type <- c(rep(c(rep("news",4),rep("business",4),rep("sport",4)),5))
advertise <- data.frame(day, type, inquiries)
fit1 <- lm(inquiries ~ day + type + day*type, data = advertise)
anova(fit1)

## Analysis of Variance Table
##
## Response: inquiries
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## day        4 146.833  36.708 20.9098 8.518e-10 ***
## type       2  53.733  26.867 15.3038 8.503e-06 ***
## day:type   8 135.767  16.971  9.6669 1.125e-07 ***
## Residuals 45  79.000   1.756
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(1) Summary: We see significant interaction and main effects.

results3 = aov(inquiries ~ day + type + day*type, data = advertise)
TukeyHSD(results3, conf.level = 0.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## $day
##                           diff        lwr        upr     p adj
## monday-friday      -2.83333333 -4.3703256 -1.2963411 0.0000395
## thursday-friday    -4.91666667 -6.4536589 -3.3796744 0.0000000
## tuesday-friday     -2.41666667 -3.9536589 -0.8796744 0.0004870
## wednesday-friday   -2.75000000 -4.2869923 -1.2130077 0.0000659
## thursday-monday    -2.08333333 -3.6203256 -0.5463411 0.0032396
## tuesday-monday      0.41666667 -1.1203256  1.9536589 0.9378121
## wednesday-monday    0.08333333 -1.4536589  1.6203256 0.9998683
## tuesday-thursday    2.50000000  0.9630077  4.0369923 0.0002978
## wednesday-thursday  2.16666667  0.6296744  3.7036589 0.0020424
## wednesday-tuesday  -0.33333333 -1.8703256  1.2036589 0.9717587
##
## $type
##                diff       lwr        upr     p adj
## news-business  -0.2 -1.215478  0.8154783 0.8823137
## sport-business -2.1 -3.115478 -1.0845217 0.0000260
## sport-news     -1.9 -2.915478 -0.8845217 0.0001241

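(2) Summary: Using Tukey's multiple comparisons of means, the pairs above with an adjusted p-value below 0.05 (highlighted in the original document) are significantly different in means: every pair of days that involves Friday or Thursday, and the sport-business and sport-news pairs.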

par(mfrow=c(1,1))
plot(inquiries ~ day + type, data = advertise)

par(mfrow=c(1,1))
interaction.plot(advertise$day, advertise$type, advertise$inquiries)


(3) Summary: (a) From the boxplots, there is an indication of unequal variance across days of the week, which conflicts with the equal variance assumption. There is no such concern across type levels. (b) The profile plot indicates that the main effects and the interaction are significant. (c) The residual plot looks fine.
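The profile plot drawn by `interaction.plot` shows the mean number of inquiries for each day and type combination. The sketch below tabulates those cell means directly, which can make the interaction easier to read than the plot alone. The `rep(..., each = ...)` calls are a compact way, assumed here, of building the same labels as in the solution.

```r
inquiries <- c(11,  8,  6,  8, 10, 12, 13, 11,  4,  3,  5,  6,
                9, 10, 10, 12,  7,  8, 11,  9,  5,  8,  6,  7,
                8,  9,  9, 11,  7,  8, 10,  9,  5,  9,  7,  6,
                4,  5,  3,  5,  9,  6,  8,  8,  7,  6,  6,  5,
               13, 12, 11, 14, 10,  9,  9,  8, 12, 10, 11, 12)
day <- rep(c("monday", "tuesday", "wednesday", "thursday", "friday"), each = 12)
type <- rep(rep(c("news", "business", "sport"), each = 4), times = 5)

# Mean inquiries for each day (rows) by type (columns)
cell_means <- tapply(inquiries, list(day, type), mean)
round(cell_means, 2)

# Within each day, how far each type sits from that day's average;
# under no interaction these rows would all look alike
round(sweep(cell_means, 1, rowMeans(cell_means)), 2)
```

If the lines in the profile plot were parallel, the second table would have roughly identical rows; rows that differ markedly correspond to the significant day:type interaction in the ANOVA table.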

################ Question 2 Version 2 - traps.xls ###################

moths <- c(28,19,32,15,13,35,22,33,21,17,32,29,16,18,20,
           39,12,42,25,21,36,38,44,27,22,37,40,18,28,36,
           44,21,38,32,29,42,17,31,29,37,35,39,41,31,34,
           17,12,23,19,14,18,27,15,29,16,22,25,14,16,19)
location <- c(rep("top",15), rep("middle",15), rep("lower",15), rep("ground",15))
lure <- c(rep(c(rep("scent",5), rep("sugar",5), rep("chemical",5)),4))
traps <- data.frame(location, lure, moths)
fit2 <- lm(moths ~ location + lure + location*lure, data = traps)
anova(fit2)

## Analysis of Variance Table
##
## Response: moths
##               Df  Sum Sq Mean Sq F value    Pr(>F)
## location       3 1981.38  660.46 10.4503 2.094e-05 ***
## lure           2  113.03   56.52  0.8943    0.4156
## location:lure  6  114.97   19.16  0.3032    0.9322
## Residuals     48 3033.60   63.20
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(1) Summary: From the anova table, we see significant location main effect. However, the lure main effect and interaction effect are NOT significant.

results4 = aov(moths ~ location + lure + location*lure, data = traps)
TukeyHSD(results4, conf.level = 0.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## $location
##                     diff        lwr         upr     p adj
## lower-ground   14.266667   6.541043 21.99229029 0.0000622
## middle-ground  11.933333   4.207710 19.65895695 0.0008571
## top-ground      4.266667  -3.458957 11.99229029 0.4633359
## middle-lower   -2.333333 -10.058957  5.39229029 0.8523085
## top-lower     -10.000000 -17.725624 -2.27437638 0.0063580
## top-middle     -7.666667 -15.392290  0.05895695 0.0524653
##
## $lure
##                 diff       lwr      upr     p adj
## scent-chemical -2.75 -8.829984 3.329984 0.5224665
## sugar-chemical  0.30 -5.779984 6.379984 0.9921811
## sugar-scent     3.05 -3.029984 9.129984 0.4512753

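(2) Summary: Using Tukey's multiple comparisons of means, the pairs above with an adjusted p-value below 0.05 (highlighted in the original document) are significantly different in means: lower-ground, middle-ground, and top-lower. No pair of lures differs significantly, which is consistent with the results in (1).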

par(mfrow=c(1,1))
plot(moths ~ location + lure, data = traps)

par(mfrow=c(1,1))
interaction.plot(traps$location, traps$lure, traps$moths)


(3) Summary: (a) From the boxplots, there is an indication of unequal variance across locations, which conflicts with the equal variance assumption. There is no such concern across lure levels. (b) The profile plot indicates that only the location main effect is significant. (c) The residual plot looks fine.

##################### Question 3 - greens.xls #######################

setwd("/Users/Hongyu/AMS394 Midterm Data Files")
data <- read.csv("greens.csv", header = T)
fit.all <- lm(data$X1 ~ data$X2+data$X3+data$X4+data$X5+data$X6+factor(data$X7))
summary(fit.all)

##
## Call:
## lm(formula = data$X1 ~ data$X2 + data$X3 + data$X4 + data$X5 +
##     data$X6 + factor(data$X7))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -26.701  -8.733  -2.642   2.710  38.518
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -14.75269   34.67987  -0.425 0.675327
## data$X2           16.95540    3.56757   4.753 0.000138 ***
## data$X3            0.18061    0.05794   3.117 0.005675 **
## data$X4           10.76534    2.58403   4.166 0.000524 ***
## data$X5           13.68904    1.77391   7.717 2.85e-07 ***
## data$X6           -5.54066    1.84383  -3.005 0.007281 **
## factor(data$X7)2   3.36159    9.08534   0.370 0.715474
## factor(data$X7)3  -9.48834    9.29635  -1.021 0.320240
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.55 on 19 degrees of freedom
## Multiple R-squared:  0.9939, Adjusted R-squared:  0.9917
## F-statistic: 442.2 on 7 and 19 DF,  p-value: < 2.2e-16

par(mfrow=c(1,1))
plot(fit.all)

Full Model Summary:

(a) The estimated model is X1(hat) = -14.75269 + 16.95540 X2 + 0.18061 X3 + 10.76534 X4 + 13.68904 X5 - 5.54066 X6 + 3.36159 Dummy1 - 9.48834 Dummy2, where Dummy1 and Dummy2 are the indicators for X7 = 2 and X7 = 3. (b) The categorical variable X7 appears insignificant in our model. (c) The fraction of variance explained (multiple R-squared) is 0.9939, so we conclude our model is adequate. (d) The residual plot seems to show non-linearity and heteroscedasticity, which conflicts with the underlying assumptions.

library(MASS)
step <- stepAIC(fit.all, direction="both")

## Start:  AIC=161.22
## data$X1 ~ data$X2 + data$X3 + data$X4 + data$X5 + data$X6 + factor(data$X7)
##
##                   Df Sum of Sq     RSS    AIC
## - factor(data$X7)  2     690.2  6541.4 160.23
## <none>                          5851.2 161.22
## - data$X6          1    2780.8  8632.1 169.72
## - data$X3          1    2992.2  8843.4 170.37
## - data$X4          1    5345.1 11196.4 176.74
## - data$X2          1    6956.1 12807.3 180.37
## - data$X5          1   18339.1 24190.4 197.54
##
## Step:  AIC=160.23
## data$X1 ~ data$X2 + data$X3 + data$X4 + data$X5 + data$X6
##
##                   Df Sum of Sq     RSS    AIC
## <none>                          6541.4 160.23
## + factor(data$X7)  2     690.2  5851.2 161.22
## - data$X3          1    2862.7  9404.1 168.03
## - data$X6          1    3020.9  9562.3 168.48
## - data$X4          1    6454.6 12996.0 176.77
## - data$X2          1    6508.4 13049.8 176.88
## - data$X5          1   18327.4 24868.8 194.29

step$anova # display results

## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## data$X1 ~ data$X2 + data$X3 + data$X4 + data$X5 + data$X6 + factor(data$X7)
##
## Final Model:
## data$X1 ~ data$X2 + data$X3 + data$X4 + data$X5 + data$X6
##
##
##                Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                      19   5851.247 161.2215
## 2 - factor(data$X7)  2 690.1637        21   6541.410 160.2319

Stepwise Summary: After stepwise variable selection, the categorical variable X7 was dropped. From the coefficient table of the full model, the intercept does not appear to be significant either (p = 0.675), so the intercept is dropped as well in the reduced model.
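The AIC values in the stepwise path can be reproduced from the residual sums of squares. For a linear model, the criterion used in the stepwise output equals n * log(RSS / n) + 2 * k up to an additive constant, where k is the number of estimated coefficients. In the sketch below, `aic_from_rss` is an illustrative helper name, and n = 27 is inferred from the output (19 residual degrees of freedom plus 8 coefficients in the full model) rather than read from the data file.

```r
# Illustrative helper: AIC for a linear model from its residual sum of squares
aic_from_rss <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 27  # inferred: 19 residual df + 8 coefficients in the full model

# Full model: 8 coefficients; model without X7: 6 coefficients
aic_full <- aic_from_rss(5851.247, n, k = 8)
aic_reduced <- aic_from_rss(6541.410, n, k = 6)
round(c(full = aic_full, reduced = aic_reduced), 2)
```

Dropping X7 raises the RSS, but it saves two parameters, and the penalty saved (4) outweighs the loss in fit, so the AIC falls and the smaller model is preferred.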

fit.sub <- lm(data$X1 ~ -1+data$X2+data$X3+data$X4+data$X5+data$X6)
summary(fit.sub)

##
## Call:
## lm(formula = data$X1 ~ -1 + data$X2 + data$X3 + data$X4 + data$X5 +
##     data$X6)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.199 -10.612  -4.643   6.718  40.992
##
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)
## data$X2 16.00513    3.48131   4.597  0.00014 ***
## data$X3  0.18276    0.05534   3.302  0.00324 **
## data$X4 10.64650    2.07623   5.128 3.87e-05 ***
## data$X5 12.95663    1.44262   8.981 8.21e-09 ***
## data$X6 -6.33702    0.46018 -13.771 2.71e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.4 on 22 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9974
## F-statistic:  2093 on 5 and 22 DF,  p-value: < 2.2e-16

par(mfrow=c(1,1))
plot(fit.sub)

Reduced Model Summary: (a) The reduced estimated model is X1(hat) = 16.00513 X2 + 0.18276 X3 + 10.64650 X4 + 12.95663 X5 - 6.33702 X6. (b) The reported R-squared has increased slightly to 0.9979, and we conclude our model is adequate. (c) The residual plot seems to show heteroscedasticity, which conflicts with the underlying assumptions.
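One caveat when comparing the two R-squared values: for a model fitted without an intercept, `summary()` in R reports R-squared relative to the uncentered total sum of squares, sum(y^2), instead of the usual sum((y - mean(y))^2). The two figures are therefore not on the same scale. The sketch below illustrates this with a small simulated data set; the data are hypothetical and unrelated to greens.csv.

```r
# Hypothetical data, used only to illustrate the R-squared definitions
set.seed(1)
x <- 1:20
y <- 50 + 2 * x + rnorm(20)

fit_no_int <- lm(y ~ -1 + x)
rss <- sum(resid(fit_no_int)^2)

# What summary() reports for a no-intercept model
r2_reported <- summary(fit_no_int)$r.squared

# Uncentered and centered versions computed by hand
r2_uncentered <- 1 - rss / sum(y^2)
r2_centered <- 1 - rss / sum((y - mean(y))^2)

round(c(reported = r2_reported, uncentered = r2_uncentered,
        centered = r2_centered), 4)
```

A fairer comparison between the full and reduced models would use a quantity defined the same way for both, such as the residual standard error (17.55 versus 17.4 in the outputs above).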
