Introduction to Statistics - Petra Christian...
Introduction to Statistics
Learning about the World
Siana Halim
TOPICS
Inferences about Means
Comparing Means
Paired Samples and Blocks
Comparing Counts
References:
• De Veaux, Velleman, Bock, Stats: Data and Models, Pearson Addison Wesley, International Edition, 2005
• John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, 1995
Inferences about Means
Motor vehicle crashes are the leading cause of death for people of every age between 4 and 33 years old.
Speeding is a contributing factor in 29% of all fatal accidents.
Siwalankerto is a busy street that passes through a residential neighborhood. Residents there are concerned that vehicles traveling on Siwalankerto often exceed the posted speed limit of 30 km/hour.
Speed
29 29 24
34 34 34
34 32 36
28 31 31
30 27 34
29 37 36
38 29 21
31 26
[Figure: histogram of the speeds; horizontal axis: Speed, vertical axis: # of Cars]
We are interested both in estimating the true mean speed and in testing whether it exceeds the posted speed limit.
A Sampling Distribution for Means
When the conditions are met, the standardized sample mean,
\[ t = \frac{\bar{y} - \mu}{SE(\bar{y})}, \qquad SE(\bar{y}) = \frac{s}{\sqrt{n}}, \]
follows a Student’s t-model with n-1 degrees of freedom. We also use this model to obtain a P-value for testing the hypothesis
\[ H_0: \mu = \mu_0 \]
One-sample t-interval
When the conditions are met, the confidence interval for the population mean, μ, is
\[ \bar{y} \pm t^*_{n-1} \times SE(\bar{y}) \]
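As a rough illustration of the one-sample formulas, the sketch below applies them to the 23 Siwalankerto speeds listed earlier, testing against the posted limit of 30 km/hour. It assumes a 95% confidence level and takes the critical value t*_22 ≈ 2.074 from a standard t table; the function name `one_sample_t` and its parameters are illustrative and do not come from the slides.

```python
import math
import statistics

# Speeds (km/hour) of the 23 cars recorded on Siwalankerto.
speeds = [29, 34, 34, 28, 30, 29, 38, 31,
          29, 34, 32, 31, 27, 37, 29, 26,
          24, 34, 36, 31, 34, 36, 21]

def one_sample_t(data, mu0, t_star):
    """Return (mean, SE, t statistic, confidence interval).

    t_star is the critical value for n-1 degrees of freedom. It is an
    input here (read from a t table), not computed.
    """
    n = len(data)
    y_bar = statistics.mean(data)
    s = statistics.stdev(data)      # sample SD (divisor n-1)
    se = s / math.sqrt(n)           # SE(y_bar) = s / sqrt(n)
    t = (y_bar - mu0) / se          # standardized sample mean under H0
    ci = (y_bar - t_star * se, y_bar + t_star * se)
    return y_bar, se, t, ci

# Assumption: 95% confidence, so t*_22 is about 2.074 (table value).
y_bar, se, t_stat, ci = one_sample_t(speeds, mu0=30, t_star=2.074)
print(f"mean = {y_bar:.2f}, SE = {se:.3f}, t = {t_stat:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Turning the t statistic into a P-value still requires the tail area of the t-model with 22 degrees of freedom, from a table or statistics software.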
Sample Size
How large a sample do we need? The simple answer is “more”. But more data cost money, effort and time, so how much is enough?
Suppose your computer takes 2 hours on average to download a movie. You hear about a program that claims to download movies in less than an hour. You’re interested enough to spend $29.95 for it, but only if it really delivers.
So you get the free evaluation copy and test it by downloading 10 different movies. Of course, the mean downloading time is not exactly 1 hour as claimed. Observations vary. If the margin of error was 8 minutes, though, you’d probably be able to decide whether the software is worth the money. Doubling the sample size would require another 10 hours of testing and reduce your margin of error to a bit under 6 minutes. You’d need to decide whether that was worth the effort.
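The "bit under 6 minutes" figure follows from the margin of error shrinking in proportion to 1/√n. A minimal sketch, assuming the sample standard deviation stays the same and ignoring the small drop in the critical value t* as the degrees of freedom grow (the function name is illustrative):

```python
import math

def scaled_margin_of_error(me_old, n_old, n_new):
    """Approximate new margin of error when only the sample size changes.

    Assumes ME is proportional to 1/sqrt(n), i.e. the same s and
    (approximately) the same critical value t*.
    """
    return me_old * math.sqrt(n_old / n_new)

# 8-minute margin of error from 10 movies; double the sample to 20.
me_20 = scaled_margin_of_error(8, n_old=10, n_new=20)
print(f"approximate margin of error with 20 movies: {me_20:.2f} minutes")
```

Because t* also gets slightly smaller with more degrees of freedom, the actual margin of error would be a little below this approximation.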
Comparing Two Means
Should we buy generic rather than brand name batteries?
Brand Name Generic
194.0 190.7
205.5 203.5
199.2 203.5
172.4 206.5
184.0 222.5
169.5 209.4
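Plot the data
Comparing the Means
The population model parameter of interest is the difference between the means, μ1 - μ2.
\[ SD(\bar{y}_1 - \bar{y}_2) = \sqrt{Var(\bar{y}_1) + Var(\bar{y}_2)} = \sqrt{\left(\frac{\sigma_1}{\sqrt{n_1}}\right)^2 + \left(\frac{\sigma_2}{\sqrt{n_2}}\right)^2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
Because the true standard deviations are unknown, we estimate them with the sample standard deviations to obtain the standard error:
\[ SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]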
A sampling distribution for the difference between two means
When the conditions are met, the standardized sample difference between the means of two independent groups,
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2)}{SE(\bar{y}_1 - \bar{y}_2)}, \]
can be modeled by a Student’s t-model with a number of degrees of freedom found with a special formula:
\[ df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2} \]
Assumptions and Conditions
1. Independence Assumption:
Randomization condition
10% Condition
2. Normal Population Assumption
3. Independent Group Assumption
To use the two-sample t methods, the two groups we are comparing must be independent of each other.
No statistical test can verify this assumption. You have to think about how the data were collected.
Two-sample t-interval
When the conditions are met, we are ready to find the confidence interval for the difference between means of two independent groups, μ1-μ2. The confidence interval is
\[ (\bar{y}_1 - \bar{y}_2) \pm t^*_{df} \times SE(\bar{y}_1 - \bar{y}_2), \qquad SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
The critical value t*_df depends on the particular confidence level that you specify and on the number of degrees of freedom, which we get from the sample sizes and a special formula.
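To make the interval concrete, here is a sketch that applies the two-sample formulas to the battery lifetimes tabulated earlier (generic minus brand name). It assumes a 95% confidence level and uses the table value t*_9 ≈ 2.262, since the special-formula degrees of freedom come out close to 9; the helper name `se_and_df` is illustrative, not from the slides.

```python
import math
import statistics

# Battery lifetimes from the table above (units are not given in the slides).
brand = [194.0, 205.5, 199.2, 172.4, 184.0, 169.5]
generic = [190.7, 203.5, 203.5, 206.5, 222.5, 209.4]

def se_and_df(a, b):
    """Standard error of mean(a) - mean(b) and the special-formula df."""
    va = statistics.variance(a) / len(a)    # s1^2 / n1
    vb = statistics.variance(b) / len(b)    # s2^2 / n2
    se = math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return se, df

se, df = se_and_df(generic, brand)
diff = statistics.mean(generic) - statistics.mean(brand)

# Assumption: 95% confidence; df is close to 9, and the table value
# t*_9 is about 2.262.
t_star = 2.262
ci = (diff - t_star * se, diff + t_star * se)
print(f"difference = {diff:.2f}, SE = {se:.2f}, df = {df:.2f}")
print(f"approximate 95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

An interval that lies entirely above 0 would suggest that the generic batteries last longer on average, subject to the assumptions and conditions listed earlier.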
Testing the Difference between Two Means
Price Offered For a Used Camera ($)
Buying from a Friend    Buying from a Stranger
275                     260
300                     250
260                     175
300                     130
255                     200
275                     225
290                     240
300
Two-sample t-test for the difference between the means of two independent groups
The conditions for the two-sample t-test for the difference between the means of two independent groups are the same as for the two-sample t-interval. We test the hypothesis
\[ H_0: \mu_1 - \mu_2 = \Delta_0, \]
where the hypothesized difference is almost always 0, using the statistic
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - \Delta_0}{SE(\bar{y}_1 - \bar{y}_2)}, \qquad SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
When the conditions are met and the null hypothesis is true, this statistic can be closely modeled by a Student’s t-model with a number of degrees of freedom given by a special formula. We use that model to obtain a P-value.
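The test statistic and its degrees of freedom can be sketched on the used-camera prices from the table above. This is an illustrative calculation under the assumption Δ0 = 0; the function name `two_sample_t` is hypothetical, and the P-value step is left to a t table or statistics software.

```python
import math
import statistics

# Prices offered for a used camera ($), from the table above.
friend = [275, 300, 260, 300, 255, 275, 290, 300]
stranger = [260, 250, 175, 130, 200, 225, 240]

def two_sample_t(a, b, delta0=0.0):
    """Return (t statistic, df) for H0: mu_a - mu_b = delta0."""
    va = statistics.variance(a) / len(a)    # s1^2 / n1
    vb = statistics.variance(b) / len(b)    # s2^2 / n2
    se = math.sqrt(va + vb)                 # SE of the difference in means
    t = (statistics.mean(a) - statistics.mean(b) - delta0) / se
    # Special-formula degrees of freedom (generally not an integer).
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = two_sample_t(friend, stranger)
print(f"t = {t_stat:.3f} with df = {df:.2f}")
```

The P-value is then the appropriate tail area of a t-model with these degrees of freedom.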
Pooled variance t-test and confidence interval for the difference between the means of two independent groups
The conditions for the pooled-variance t-test for the difference between the means of two independent groups (commonly called a “pooled t-test”) are the same as for the two-sample t-test, with the additional assumption that the variances of the two groups are the same. We test the hypothesis
\[ H_0: \mu_1 - \mu_2 = \Delta_0, \]
where the hypothesized difference is almost always 0, using the statistic
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - \Delta_0}{SE_{pooled}(\bar{y}_1 - \bar{y}_2)}, \qquad SE_{pooled}(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_{pooled}^2}{n_1} + \frac{s_{pooled}^2}{n_2}} = s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]
where the pooled variance is
\[ s_{pooled}^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} \]
When the conditions are met and the null hypothesis is true, this statistic can be closely modeled by a Student’s t-model with (n1-1)+(n2-1) degrees of freedom. We use that model to obtain a P-value.
The corresponding confidence interval is
\[ (\bar{y}_1 - \bar{y}_2) \pm t^*_{df} \times SE_{pooled}(\bar{y}_1 - \bar{y}_2) \]
where the critical value t* depends on the confidence level and is found with (n1-1)+(n2-1) degrees of freedom.
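For comparison, the pooled formulas can be sketched on the same used-camera prices. This is only an arithmetic illustration: the two groups have visibly different spreads, so the equal-variance assumption is doubtful for these data. The function name `pooled_t` is hypothetical.

```python
import math
import statistics

# Prices offered for a used camera ($), from the table above.
friend = [275, 300, 260, 300, 255, 275, 290, 300]
stranger = [260, 250, 175, 130, 200, 225, 240]

def pooled_t(a, b, delta0=0.0):
    """Return (t statistic, pooled SE, df) for H0: mu_a - mu_b = delta0."""
    n1, n2 = len(a), len(b)
    df = (n1 - 1) + (n2 - 1)
    # Pooled variance: df-weighted average of the two sample variances.
    s2_pooled = ((n1 - 1) * statistics.variance(a)
                 + (n2 - 1) * statistics.variance(b)) / df
    se = math.sqrt(s2_pooled / n1 + s2_pooled / n2)
    t = (statistics.mean(a) - statistics.mean(b) - delta0) / se
    return t, se, df

t_pooled, se_pooled, df_pooled = pooled_t(friend, stranger)
print(f"pooled t = {t_pooled:.3f}, SE = {se_pooled:.2f}, df = {df_pooled}")
```

The pooled degrees of freedom are always the whole number (n1-1)+(n2-1), whereas the special formula of the unpooled method usually gives a smaller, fractional value.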
Is the Pool All Wet?
So when should you use pooled-t methods rather than two-sample t methods? Never!
When the variances of the two groups are in fact equal, the two methods give pretty much the same result.
Pooled methods have a small advantage (slightly narrower C.I., slightly more powerful tests), mostly because their d.f. is usually a bit larger, but the advantage is slight.
When the variances are not equal, the pooled methods are just not valid, and can give poor results.