Statistical assignment January 2011 - IBP Union


Statistics Exam – IBP 2nd year January 2011

In the following questions 1 – 5 a significance level of 5% (α=0,05) will be used, unless otherwise stated in the questions.

Question 1

Descriptive statistics are provided for the two variables that will be our response variables in the paper: Depress2 and Employed. Furthermore, we have computed descriptive statistics for the difference between Depress2 and Depress1, which will be investigated further in the paper.

Descriptive statistics for Depress2

[Histogram of Depress2]

Moments
Mean             36,97125
Std Dev          11,535707
Std Err Mean     0,2354716
Upper 95% Mean   37,432999
Lower 95% Mean   36,509501
N                2400

Quantiles
100,0%   maximum    76
90,0%               52
75,0%    quartile   45
50,0%    median     37
25,0%    quartile   29
0,0%     minimum    2

Depress2 is a quantitative variable and is depicted with a histogram. The data appear to be approximately normally distributed, with a mean of 36.97 and a median of 37, only 0.03 points apart. The close match of the mean and median reflects an approximately symmetric distribution.

Descriptive statistics for Employed

[Bar chart of Employed (no, yes)]

Frequencies
Level   Count   Prob
No      1506    0,62750
Yes     894     0,37250
Total   2400    1,00000

Employed is a categorical variable and is illustrated in a bar chart. The data show that less than half of the sample (37%) is employed after the treatment period.

Descriptive statistics for the change between Depress2 and Depress1

[Histogram of Depress2 - Depress1]

Quantiles
100,0%   maximum    22
75,0%    quartile   -1
50,0%    median     -7
25,0%    quartile   -13
0,0%     minimum    -35

Moments
Mean             -7,177083
Std Dev          8,4828184
Std Err Mean     0,1731548
Upper 95% Mean   -6,837535
Lower 95% Mean   -7,516632
N                2400

The histogram above shows the difference between the Depress2 and Depress1 scores. A negative number indicates a lower score in Depress2, meaning better health after the job-seeking period.


The data creates a nice bell shape, which reflects normally distributed data. The mean and the median are very similar, with a mean of -7,2 and a median of -7. This indicates an approximately symmetric distribution.

Question 2

Proportion of jobseekers

When examining the proportion of jobseekers in each of the two groups, the explanatory variable is treatment and the response variable is employed.

Contingency table, treatment by employed.

[Plot: proportion employed (no, yes; axis from 0,00 to 1,00) by treatment (control, intervention)]

The contingency table gives us the sample proportion employed in the intervention group (p̂intervention = 0.409) and in the control group (p̂control = 0.336). The difference between the sample proportions is p̂1 − p̂2 = 0.409 − 0.336 = 0.073. This tells us that the intervention group has a higher employment rate and that the difference in employment between the two groups is 7.3 percentage points.

Confidence interval

Assumptions for a confidence interval:
- Categorical response variable for two groups
- Independent random samples for the two groups
- Large enough sample sizes so that there are at least 10 “successes” and at least 10 “failures” in each group

The formula for the confidence interval: (p̂1 − p̂2) ± z × se

For a 95% confidence interval, z = 1.96.


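Contingency table cells: Count, Total %, Row %

treatment      employed = no   employed = yes   Total
control        797             403              1200
  Total %      33,21           16,79            50,00
  Row %        66,42           33,58
intervention   709             491              1200
  Total %      29,54           20,46            50,00
  Row %        59,08           40,92
Total          1506            894              2400
  Total %      62,75           37,25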

Description                            Risk Difference   Lower 95%   Upper 95%
P(yes|intervention) - P(yes|control)   0,07333           0,03476     0,11191


SAS JMP calculates a 95% confidence interval with a lower endpoint of 0.035 and an upper endpoint of 0.112. Because both endpoints are positive, we can infer that p1 − p2 is positive, i.e. the difference in employment rate between the two groups is statistically significant.

We are 95% confident that the employment rate of the intervention group is between 3.5 and 11.2 percentage points higher than that of the control group.
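As a cross-check, the interval can be recomputed by hand from the counts in the contingency table using the formula (p̂1 − p̂2) ± z × se. The following is a minimal sketch; the function name `diff_proportion_ci` is chosen here for illustration and is not part of the JMP output.

```python
import math

# Counts taken from the contingency table (treatment by employed).
EMPLOYED_INTERVENTION, N_INTERVENTION = 491, 1200
EMPLOYED_CONTROL, N_CONTROL = 403, 1200


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Return (difference, lower, upper) for p1 - p2 using the large-sample formula."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference between two independent sample proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


diff, lower, upper = diff_proportion_ci(
    EMPLOYED_INTERVENTION, N_INTERVENTION, EMPLOYED_CONTROL, N_CONTROL
)
print(f"difference = {diff:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

With z = 1.96 this should land very close to the JMP endpoints of roughly 0.035 and 0.112.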

Bivariate analysis

Assumptions:
- Categorical response variable for the two groups
- Independent random samples, either from random sampling or a randomized experiment
- n1 and n2 are large enough so that there are at least five successes and five failures in each group

Hypotheses

Null hypothesis, H0: p1 = p2, i.e. p1 − p2 = 0. No difference in the proportions, i.e. no association between treatment and employed.

Alternative hypothesis, Ha: p1 ≠ p2. A difference in the proportions, i.e. an association between treatment and employed, meaning job training has an effect on how likely people are to get a job.

With 1 degree of freedom, the critical chi-square value at the 5% significance level is 3.84 (Table C, App A-4, Agresti et al.). Using SAS JMP we get a chi-square of 13.804, which is much higher than 3.84 and is evidence against the null hypothesis.

Test               ChiSquare   Prob>ChiSq
Likelihood Ratio   13,821      0,0002*
Pearson            13,804      0,0002*

The p-value of 0.0002 further underlines this point by being smaller than the significance level. This allows us to reject the null hypothesis, because a test statistic this extreme would be very unlikely if the null hypothesis were true. The strength of the association is analyzed with the relative risk.
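The Pearson chi-square reported by JMP can be reproduced from the observed counts. The sketch below computes the expected counts under H0 and sums (observed − expected)² / expected; the helper name `pearson_chi_square` is illustrative, and the p-value shortcut via `math.erfc` is valid only for 1 degree of freedom.

```python
import math

# Observed counts from the contingency table (treatment by employed).
OBSERVED = {
    "control": {"no": 797, "yes": 403},
    "intervention": {"no": 709, "yes": 491},
}


def pearson_chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    n = sum(row_totals.values())

    chi2 = 0.0
    for row, cells in table.items():
        for col, count in cells.items():
            expected = row_totals[row] * col_totals[col] / n  # count expected under H0
            chi2 += (count - expected) ** 2 / expected
    return chi2


chi2 = pearson_chi_square(OBSERVED)
# With df = 1 the statistic is a squared standard normal, so
# P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
```

The statistic should agree with the Pearson row of the JMP test table, and the p-value should round to 0.0002.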

Relative Risk

Description                          Relative Risk   Lower 95%   Upper 95%
P(no|control)/P(no|intervention)     1,124118        1,056611    1,195939
P(no|intervention)/P(no|control)     0,889586        0,836163    0,946422
P(yes|control)/P(yes|intervention)   0,820774        0,739215    0,911331
P(yes|intervention)/P(yes|control)   1,218362        1,097296    1,352786

Intervention/control = 1.218362 ≈ 1.22. The proportion of subjects in the intervention group that got employed is 1.22 times the proportion that got employed in the control group. We are 95% confident that the proportion of subjects in the intervention group that got employed is between 1.10 and 1.35 times the proportion of the control group that got employed.
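The relative risk and its interval can also be checked by hand. The sketch below uses the usual large-sample interval on the log scale; that JMP uses this same method is an assumption made here, supported only by the fact that the numbers agree closely. The function name `relative_risk_ci` is illustrative.

```python
import math


def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Relative risk (x1/n1) / (x2/n2) with a log-scale large-sample interval."""
    rr = (x1 / n1) / (x2 / n2)
    # Standard error of ln(RR) for two independent binomial samples.
    se_log = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# P(yes|intervention) / P(yes|control), counts from the contingency table.
rr, rr_lower, rr_upper = relative_risk_ci(491, 1200, 403, 1200)
print(f"relative risk = {rr:.4f}, 95% CI = [{rr_lower:.4f}, {rr_upper:.4f}]")
```

The result should be close to the last row of the Relative Risk table (1,218362 with limits 1,097296 and 1,352786).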

Question 3:

In question three we want to examine the effect of the treatment on the depression score. We will start by estimating the mean improvement in the depression score. In order to do this we made a new column in JMP containing the values of Depress2-depress1.


Assumptions:
- Random sample
- Quantitative response variable for two groups
- Close to normal population distribution for each group

Hypotheses:

Null hypothesis, H0: µintervention = µcontrol, i.e. µintervention − µcontrol = 0. There is no effect of the treatment on the change in depression score.

Alternative hypothesis, Ha: µintervention ≠ µcontrol, i.e. µintervention − µcontrol ≠ 0. The alternative hypothesis is two-sided and checks whether the treatment has either a positive or a negative effect.

Mean improvement and confidence interval

Fit y by x, with treatment as x-variable and depress2-depress1 as y-variable.

Level          Number   Mean      Std Error   Lower 95%   Upper 95%
control        1200     -2,881    0,21118     -3,29       -2,47
intervention   1200     -11,473   0,21118     -11,89      -11,06

Intervention group: The 95% CI of the mean change is [-11.89; -11.06], meaning that after the treatment the mean depression score has fallen by between 11.06 and 11.89 points on the 0-100 depression scale.

Control group: The 95% CI of the mean change is [-3.29; -2.47], meaning that after the study was conducted the mean depression score had fallen by between 2.47 and 3.29 points on the scale.

Neither confidence interval covers 0, and therefore both groups show an improvement in their depression score.

Test statistic and confidence interval for difference in mean improvement

t-test: intervention - control

Difference     -8,5925   t Ratio      -28,7711
Std Err Dif    0,2986    DF           2398
Upper CL Dif   -8,0069   Prob > |t|   0,0000*
Lower CL Dif   -9,1781   Prob > t     1,0000
Confidence     0,95      Prob < t     0,0000*

The 95% confidence interval is [-9.1781; -8.0069]. This interval does not contain zero, and we are therefore 95% confident that the mean change for the intervention group is between 8.0069 and 9.1781 points lower than the mean change for the control group. We also get the t-score (t Ratio) of -28.77 from the table above. From the table of critical t-score values (Table B, App A-3, Agresti et al.) we get critical values of ±1.96 for a 95% confidence level with more than 100 df. Since the observed t-score lies far beyond the critical value, this is strong evidence against the null hypothesis.
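The t ratio and the confidence limits follow directly from the difference and its standard error. The sketch below uses the rounded values printed in the t-test table and the critical value 1.96 quoted from the textbook table, so small deviations from the JMP output in the third decimal are expected.

```python
# Values copied from the t-test table (intervention - control).
DIFFERENCE = -8.5925
STD_ERR_DIF = 0.2986
T_CRITICAL = 1.96  # critical value for 95% confidence with more than 100 df

# Test statistic: estimated difference divided by its standard error.
t_ratio = DIFFERENCE / STD_ERR_DIF

# Confidence interval: estimate plus/minus critical value times standard error.
ci_lower = DIFFERENCE - T_CRITICAL * STD_ERR_DIF
ci_upper = DIFFERENCE + T_CRITICAL * STD_ERR_DIF

print(f"t ratio = {t_ratio:.2f}")
print(f"95% CI = [{ci_lower:.3f}; {ci_upper:.3f}]")
```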

P-value

Analysis of Variance
Source      DF     Sum of Squares   Mean Square   F Ratio    Prob > F
treatment   1      44298,63         44298,6       827,7789   <,0001*
Error       2398   128329,11        53,5
C. Total    2399   172627,74

The P-value is very small, much smaller than our significance level of 5%. It would therefore be highly unlikely, presuming that H0 is true, to obtain a test statistic as extreme as or more extreme than the one we observe.
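The F ratio in the Analysis of Variance table is the treatment mean square divided by the error mean square, and with only two groups it equals the square of the t ratio. A minimal sketch of that arithmetic, using the values printed in the tables above:

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
SS_TREATMENT, DF_TREATMENT = 44298.63, 1
SS_ERROR, DF_ERROR = 128329.11, 2398

ms_treatment = SS_TREATMENT / DF_TREATMENT
ms_error = SS_ERROR / DF_ERROR  # pooled variance estimate, about 53.5
f_ratio = ms_treatment / ms_error

# With two groups, F equals t squared (t Ratio from the t-test table).
T_RATIO = -28.7711
t_squared = T_RATIO ** 2

print(f"mean square error = {ms_error:.2f}")
print(f"F ratio = {f_ratio:.2f}, t squared = {t_squared:.2f}")
```

This is why the t-test and the one-way ANOVA lead to the same p-value here.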


If H0 is true, there is less than a 0.0001 probability of observing a difference in mean improvement at least as extreme as -8.5925 (from the t-test table). Therefore H0 has to be rejected.

The H0 is rejected and as shown above the mean of the intervention group is significantly lower than the mean of the control group. Thereby it has been shown that the treatment has an effect on the depression level of the jobseekers. The control group also experienced a positive effect on the depression scale after doing the job-seeking, but it was lower than the intervention group’s improvement in depression. In general this could be due to the fact that even though they get no job training some of the individuals might still have found a job which might have led to their depression improvement, or that the mere fact that they have been searching for a job has reduced their depression.

Question 4:

We now want to describe how the variables Treatment, Age, Gender, Marital, Education and Depress1 predict Depress2.

$Y_{\text{depress2}} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \varepsilon$

Where Y = Depress2, X1 = Treatment, X2 = Age, X3 = Gender, X4 = Marital, X5 = Education and X6 = Depress1

Assumptions:
- Linearity
- Residuals are normally distributed and have a constant standard deviation
- Randomness: this we assume from the data set
- Independence of residuals: this we assume from the data set

Hypotheses

Variable       H0       HA
X1 Treatment   β1 = 0   β1 ≠ 0
X2 Age         β2 = 0   β2 ≠ 0
X3 Gender      β3 = 0   β3 ≠ 0
X4 Marital     β4 = 0   β4 ≠ 0
X5 Education   β5 = 0   β5 ≠ 0
X6 Depress1    β6 = 0   β6 ≠ 0

Test of assumptions

1. Linearity

To check for linearity in the quantitative predictor variables, we plot the studentized residuals against each variable and look at how they are scattered.

[Plots: studentized residuals of depress2 by age; studentized residuals of depress2 by depress1]

The residuals are randomly scattered around the plot, which indicates a linear relationship. There are some outliers, but none that seem to be tilting the distribution.

2. Residuals are normally distributed and have a constant standard deviation

To check that the residuals are normally distributed, we make a distribution plot of the studentized residuals. To check whether the residuals have a constant standard deviation, we plot the studentized residuals of depress2 against the predicted depress2.

[Plots: distribution of studentized residuals of depress2; studentized residuals of depress2 by predicted depress2]

The residuals are normally distributed, so this assumption is verified.

The residuals are randomly distributed in the scatter plot. They are not arranged in either a funnel or a “U”. This leads us to verify the assumption that the residuals have a constant standard deviation.

C) Which variables influence Depress2 significantly?

Effect Tests
Source              Nparm   DF   Sum of Squares   F Ratio    Prob > F
treatment           1       1    44552,57         889,0117   <,0001*
gender              1       1    63,57            1,2685     0,2602
age                 1       1    107,69           2,1490     0,1428
marital             2       2    417,51           4,1656     0,0156*
education           4       4    1019,74          5,0870     0,0004*
depress1            1       1    144817,74        2889,725   0,0000*
marital*education   8       8    193,36           0,4823     0,8695
age*education       4       4    74,72            0,3728     0,8282

We have reduced the model on the basis of the p-values in the Effect Tests. We have removed the variables one by one while checking for a significant p-value. Our final model will follow below.

We have checked for interactions between some parameters in the model, but none of them were found to be significant, since their p-values are higher than the significance level. If we had found a significant interaction, it could have been added as an extra parameter in the model using the expression + β7X4X5.


Final model

Model: $Y = \alpha + \beta_1 X_1 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \varepsilon$

Where: Y = Depress2, X1 = Treatment, X4 = Marital, X5 = Education, X6 = Depress1

Summary of Fit
RSquare                      0,625305
RSquare Adj                  0,624052
Root Mean Square Error       7,073076
Mean of Response             36,97125
Observations (or Sum Wgts)   2400

Our final model has an adjusted R-squared of 0.624052, meaning that we can explain 62% of the variation in the data with the variables we are looking at, i.e. treatment, marital status, education and the score on depress1.

The model has a Root Mean Square Error of 7.073, which means that we can predict a person's score on the depress2 test to within roughly 2 root mean square errors (2 × 7.073 = 14.146 points).
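The adjusted R-squared and the "2 root mean square errors" figure can be recomputed from the Summary of Fit. The sketch below assumes the standard adjustment formula 1 − (1 − R²)(n − 1)/(n − k − 1), with k = 8 estimated parameters besides the intercept (1 for treatment, 2 for marital, 4 for education, 1 for depress1); the count k is read off the Effect Tests and is an interpretation made here, not a number stated in the report.

```python
# Values from the Summary of Fit.
R_SQUARE = 0.625305
N_OBSERVATIONS = 2400
RMSE = 7.073076

# Assumed number of estimated parameters excluding the intercept:
# treatment (1) + marital (2) + education (4) + depress1 (1).
N_PARAMETERS = 8

adjusted_r_square = 1 - (1 - R_SQUARE) * (N_OBSERVATIONS - 1) / (
    N_OBSERVATIONS - N_PARAMETERS - 1
)
two_rmse = 2 * RMSE  # rough half-width of a 95% prediction band

print(f"adjusted R-squared = {adjusted_r_square:.6f}")
print(f"2 x RMSE = {two_rmse:.3f}")
```

With this many observations the adjustment is tiny, which is why RSquare and RSquare Adj differ only in the third decimal.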

Parameter Estimates

Term                          Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept                     0,8307344   0,68496     1,21      0,2253     -0,512443   2,1739115
treatment[control]            4,3115686   0,144498    29,84     <,0001*    4,0282136   4,5949237
marital[divorced]             0,809593    0,244577    3,31      0,0009*    0,3299879   1,2891982
marital[married]              -0,231537   0,20284     -1,14     0,2538     -0,629298   0,1662244
education[bachelor]           -0,021226   0,326693    -0,06     0,9482     -0,661857   0,6194051
education[high school]        -0,245024   0,277971    -0,88     0,3782     -0,790113   0,3000652
education[less high school]   0,8985871   0,254746    3,53      0,0004*    0,3990421   1,3981321
education[master]             -0,977022   0,318318    -3,07     0,0022*    -1,601229   -0,352815
depress1                      0,8201893   0,015198    53,97     0,0000*    0,7903864   0,8499923

The variables education and marital status are significant for depress2, but their significance depends on the level the person belongs to. For example, a bachelor education has no significant effect on the depress2 score because its p-value is higher than the significance level; only the “less high school” and “master” education levels have a significant influence. This is underlined by the confidence intervals: those for e.g. “bachelor” and “high school” cover zero, so their effects are insignificant.

Prediction Expression


0,83073437135977
+ Match(treatment)
    "control"            4,31156864229634
    "intervention"       -4,3115686422963
    else                 .
+ Match(marital)
    "divorced"           0,80959302766865
    "married"            -0,2315366920202
    "single"             -0,5780563356485
    else                 .
+ Match(education)
    "bachelor"           -0,0212257209321
    "high school"        -0,2450240009236
    "less high school"   0,89858709160685
    "master"             -0,9770219435782
    "other"              0,34468457382711
    else                 .
+ 0,82018932153956 * depress1

We have selected person no. 1205. His data look like this:

Treatment      Marital   Education          Depress1   Depress2
intervention   married   less high school   42         30

The model for person 1205 looks like this:

depress2 = 0.83 - 4.311 - 0.232 + 0.899 + 0.820*42 = 31.626

This means that our model estimates person 1205 to have a depress2 score of 31.626.

Prediction error, E = y - ŷ = 30 - 31.626 = -1.626
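The prediction expression can be written as a small lookup-based function. The sketch below copies the coefficients from the JMP prediction expression (decimal commas converted to points); the function name `predict_depress2` is illustrative. Because it uses the unrounded coefficients, it gives about 31.634 rather than the hand-rounded 31.626, which is also the midpoint of the prediction interval JMP reports for this person.

```python
# Coefficients from the JMP prediction expression for depress2.
INTERCEPT = 0.83073437135977
TREATMENT = {"control": 4.31156864229634, "intervention": -4.3115686422963}
MARITAL = {
    "divorced": 0.80959302766865,
    "married": -0.2315366920202,
    "single": -0.5780563356485,
}
EDUCATION = {
    "bachelor": -0.0212257209321,
    "high school": -0.2450240009236,
    "less high school": 0.89858709160685,
    "master": -0.9770219435782,
    "other": 0.34468457382711,
}
DEPRESS1_SLOPE = 0.82018932153956


def predict_depress2(treatment, marital, education, depress1):
    """Predicted depress2 score from the final linear model."""
    return (
        INTERCEPT
        + TREATMENT[treatment]
        + MARITAL[marital]
        + EDUCATION[education]
        + DEPRESS1_SLOPE * depress1
    )


# Person 1205: intervention, married, less high school, depress1 = 42, observed depress2 = 30.
predicted = predict_depress2("intervention", "married", "less high school", 42)
residual = 30 - predicted  # prediction error y - y_hat
print(f"predicted = {predicted:.3f}, prediction error = {residual:.3f}")
```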

95% Prediction Interval

If we apply a 95% prediction interval for person 1205, his lower and upper values look like this:

Lower 95%   17.7472729
Upper 95%   45.5210624

The Prediction Interval for person 1205 is between 17.7 – 45.5. This means that for a person with the same x-values, the depress2 score will with 95% probability be within the range of 17.7 and 45.5.

Question 5

When examining the relationship between treatment and probability of being employed the response variable in the model is categorical, thus we are using a logistic regression model:

$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$ OR $p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

Parameter Estimates
Term                 Estimate     Std Error   ChiSquare   Prob>ChiSq   Lower 95%    Upper 95%
Intercept            0,52466476   0,0423769   153,29      <,0001*      0,44160765   0,60772187
treatment[control]   0,15725336   0,0423769   13,77       0,0002*      0,07419625   0,24031047


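Thus the logistic regression prediction equation is:

$p = \frac{e^{0.52 + 0.16x}}{1 + e^{0.52 + 0.16x}}$

p: the probability modelled by SAS JMP, which is the probability of not being employed (level "no"); the probability of being employed is 1 − p
x: treatment, coded by JMP as +1 for control and −1 for intervention
β: the parameter estimate of treatment

Null hypothesis, H0: β = 0, meaning that there is no effect of treatment.

Alternative hypothesis, Ha: β ≠ 0. The alternative hypothesis is two-sided, to check whether treatment has either a positive or a negative effect.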

Effect Likelihood Ratio Tests
Source      Nparm   DF   L-R ChiSquare   Prob>ChiSq
treatment   1       1    13,8212639      0,0002*

The P-value is below the significance level; we can therefore reject the null hypothesis and conclude that treatment has an effect on employment.
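To see how the fitted equation maps to probabilities, the sketch below plugs the two parameter estimates into the logistic function. It assumes that JMP codes treatment as +1 for control and −1 for intervention and models the level "no" (not employed); both assumptions are inferred from the output, and they reproduce the observed row proportions 0.664 and 0.591.

```python
import math

# Parameter estimates from the logistic regression with treatment only.
INTERCEPT = 0.52466476
BETA_TREATMENT = 0.15725336  # estimate for treatment[control]

# Assumed JMP effect coding: +1 for control, -1 for intervention.
CODING = {"control": 1, "intervention": -1}


def prob_not_employed(treatment):
    """Modelled probability of the level 'no' (not employed)."""
    logit = INTERCEPT + BETA_TREATMENT * CODING[treatment]
    return math.exp(logit) / (1 + math.exp(logit))


p_no_control = prob_not_employed("control")
p_no_intervention = prob_not_employed("intervention")
print(f"P(no | control) = {p_no_control:.3f}, P(yes | control) = {1 - p_no_control:.3f}")
print(f"P(no | intervention) = {p_no_intervention:.3f}, "
      f"P(yes | intervention) = {1 - p_no_intervention:.3f}")
```

With a single categorical predictor the fitted probabilities equal the sample proportions from the contingency table.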

To know how big the effect of treatment is we use the odds ratio.

[Prediction Profiler 1, treatment = control: P(employed = no) = 0,664, P(employed = yes) = 0,336]
[Prediction Profiler 2, treatment = intervention: P(employed = no) = 0,591, P(employed = yes) = 0,409]

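Odds Ratio

Odds Ratio   Lower 95%   Upper 95%
1,369584     1,159968    1,617078

When calculating the odds ratio, the proportions from Prediction Profiler 1 and Prediction Profiler 2 are used.

$\text{odds ratio} = \frac{p_a / (1 - p_a)}{p_b / (1 - p_b)} = \frac{0.409 / 0.591}{0.336 / 0.664} = 1.37$

where $p_a$ is the proportion employed in the intervention group and $p_b$ is the proportion employed in the control group.

The odds of becoming employed are 37% higher in the intervention group than in the control group.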


The 95% confidence interval for the odds ratio is 1.16 to 1.62, meaning that the odds of getting a job are between 16% and 62% higher for the intervention group than for the control group.
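The odds ratio and its interval can be recomputed from the four cell counts of the contingency table. The sketch below uses the usual large-sample interval on the log scale; that JMP computes its limits this way is an assumption made here, supported only by the close numerical agreement. The function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(yes_a, no_a, yes_b, no_b, z=1.96):
    """Odds ratio of 'yes' for group a versus group b with a log-scale interval."""
    odds_ratio = (yes_a / no_a) / (yes_b / no_b)
    # Standard error of ln(odds ratio): square root of summed reciprocal counts.
    se_log = math.sqrt(1 / yes_a + 1 / no_a + 1 / yes_b + 1 / no_b)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Group a = intervention (491 employed, 709 not), group b = control (403 employed, 797 not).
odds_ratio, or_lower, or_upper = odds_ratio_ci(491, 709, 403, 797)
print(f"odds ratio = {odds_ratio:.4f}, 95% CI = [{or_lower:.4f}, {or_upper:.4f}]")
```

Using the exact counts instead of the rounded proportions gives 1.3696 rather than 1.37 computed from 0.409 and 0.336.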

Effect Likelihood Ratio Tests
Source      Nparm   DF   L-R ChiSquare   Prob>ChiSq
treatment   1       1    14,4764507      0,0001*
gender      1       1    0,10442317      0,7466
age         1       1    28,6072613      <,0001*
marital     2       2    1,53508031      0,4642
education   4       4    40,1492772      <,0001*
depress1    1       1    74,644566       <,0001*

Gender and marital status are not significant due to their high p-value so we take them out of the analysis. The two variables are taken out one at a time, and the model is re-run to check how the p-value/significance changes when the first variable is taken out (see appendix). We get a new table:

Parameter Estimates
Term                          Estimate     Std Error   ChiSquare   Prob>ChiSq
Intercept                     -2,4005607   0,307261    61,04       <,0001*
treatment[control]            0,16699719   0,0438888   14,48       0,0001*
age                           0,02547128   0,0048067   28,08       <,0001*
education[bachelor]           -0,2944387   0,0963714   9,33        0,0022*
education[high school]        0,07211461   0,0846015   0,73        0,3940
education[less high school]   0,26658157   0,0789517   11,40       0,0007*
education[master]             -0,4029125   0,0937793   18,46       <,0001*
depress1                      0,04052786   0,0046736   75,20       <,0001*

Prediction Expression

-2,4005607286241
+ Match(treatment)
    "control"            0,16699719196192
    "intervention"       -0,1669971919619
    else                 .
+ 0,02547128484272 * age
+ Match(education)
    "bachelor"           -0,2944387172438
    "high school"        0,07211460518324
    "less high school"   0,26658157484467
    "master"             -0,4029125108276
    "other"              0,35865504804347
    else                 .
+ 0,04052786339678 * depress1

The regression model is expressed as:

$p = \frac{e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4}}{1 + e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4}}$


Where: p = the modelled probability, X1 = treatment, X2 = Age, X3 = Education, X4 = Depress1

We have used person no. 1803 to estimate the probability of being employed:

Treatment      Age   Education   Depress1   Employed
Intervention   60    Bachelor    48         No

SAS JMP models the probability of not being employed at the end of the study, so p in the expression above is P(not employed), and a higher value of the linear predictor leads to a lower chance of being employed.

$\hat{p} = \frac{e^{-2.4 - 0.167 + 0.025 \cdot 60 - 0.294 + 0.041 \cdot 48}}{1 + e^{-2.4 - 0.167 + 0.025 \cdot 60 - 0.294 + 0.041 \cdot 48}}$

p̂(not employed) = 0.647
p̂(employed) = 1 − 0.647 = 0.353

From SAS JMP we get P(not employed) = 0.648 and P(employed) = 0.352. Our estimate suggests that the person has a 35.3% chance of being employed at the end of the study.
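The same calculation can be done with the unrounded coefficients from the prediction expression, which explains the small gap between the hand calculation (0.647) and the JMP value (0.648). This is a minimal sketch; the function name `prob_not_employed` is illustrative, and the coefficients are copied from the output above with decimal commas converted to points.

```python
import math

# Coefficients from the JMP prediction expression (logistic model for 'not employed').
INTERCEPT = -2.4005607286241
TREATMENT = {"control": 0.16699719196192, "intervention": -0.1669971919619}
AGE_SLOPE = 0.02547128484272
EDUCATION = {
    "bachelor": -0.2944387172438,
    "high school": 0.07211460518324,
    "less high school": 0.26658157484467,
    "master": -0.4029125108276,
    "other": 0.35865504804347,
}
DEPRESS1_SLOPE = 0.04052786339678


def prob_not_employed(treatment, age, education, depress1):
    """Modelled probability of not being employed at the end of the study."""
    linear_predictor = (
        INTERCEPT
        + TREATMENT[treatment]
        + AGE_SLOPE * age
        + EDUCATION[education]
        + DEPRESS1_SLOPE * depress1
    )
    return math.exp(linear_predictor) / (1 + math.exp(linear_predictor))


# Person 1803: intervention, age 60, bachelor, depress1 = 48.
p_not_employed = prob_not_employed("intervention", 60, "bachelor", 48)
p_employed = 1 - p_not_employed
print(f"P(not employed) = {p_not_employed:.3f}, P(employed) = {p_employed:.3f}")
```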

Odds Ratios for treatment - For employed odds of no versus yes
Level1         /Level2   Odds Ratio   Reciprocal
intervention   control   0,7160578    1,3965353

The odds of not getting employed in the intervention group are 0.716 times (71.6% of) the odds in the control group. This can also be stated using the reciprocal value, which says that the odds of getting employed are 39.7% higher in the intervention group than in the control group.
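This adjusted odds ratio can be recovered from the treatment coefficient of the final logistic model. The sketch assumes JMP's +1/−1 coding of treatment (as the Match terms in the prediction expression suggest), so the two groups differ by twice the coefficient on the log-odds scale.

```python
import math

# Estimate for treatment[control] in the final logistic model (odds of 'no' versus 'yes').
BETA_TREATMENT_CONTROL = 0.16699719

# Assumed coding: control = +beta, intervention = -beta, so the log-odds
# difference intervention minus control is -2 * beta.
odds_ratio_intervention_vs_control = math.exp(-2 * BETA_TREATMENT_CONTROL)
reciprocal = 1 / odds_ratio_intervention_vs_control

print(f"odds ratio (intervention / control) = {odds_ratio_intervention_vs_control:.4f}")
print(f"reciprocal = {reciprocal:.4f}")
```

The reciprocal of about 1.397 is slightly larger than the unadjusted odds ratio of 1.37, because this model also controls for age, education and depress1.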
