Copyright (c) Bani K. Mallick. STAT 651 Lecture #15.
Date posted: 21-Dec-2015
STAT 651
Lecture #15
Topics in Lecture #15
Some basic probability
The binomial distribution
Inference about a single population proportion
Book Sections Covered in Lecture #15
Chapters 4.7-4.8
Chapter 10.2
Lecture 14 Review: Nonparametric Methods
Replace each observation by its rank in the pooled data
Do the usual ANOVA F-test
Kruskal-Wallis
Lecture 14 Review: Nonparametric Methods
Once you have decided that the populations are different in their means, there is no version of an LSD
You simply have to do each comparison in turn
This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go
Categorical Data
Not all experiments are based on numerical outcomes
We will deal with categorical outcomes, i.e., outcomes where the result for each individual is a category
The simplest categorical variable is binary:
Success or failure
Male or female
Categorical Data
For example, consider flipping a fair coin, and let
X = 0 means “tails”
X = 1 means “heads”
Categorical Data
The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$
Note that because it is a Greek symbol, it represents something to do with a population
For coin flipping, if you flipped all the fair coins in the world (the population), the fraction of the times they turn up heads equals $\pi$
Categorical Data
The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$
The fraction of the sample of size n who are “successes” is going to be denoted by $\hat{\pi}$
We want to relate $\hat{\pi}$ to $\pi$
Let X = number of successes in the sample. The fraction $\hat{\pi}$ = (# successes)/n = X / n
Categorical Data
Suppose you flip a coin 10 times, and get 6 heads.
The proportion of heads = 0.60
The percentage of heads = 60%
Categorical Data
The number of successes X in n experiments, each with probability of success $\pi$, is called a binomial random variable
There is a formula for this:
$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
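As a rough illustration, the binomial formula can be evaluated directly in a few lines of Python. This is only a sketch: the function name binomial_pmf and the argument name pi are chosen for this example and are not part of the lecture.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Pr(X = k) for n independent trials, each with success probability pi."""
    # comb(n, k) equals n! / (k! (n-k)!)
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

# Example: chance of exactly 6 heads in 10 flips of a fair coin
print(binomial_pmf(6, 10, 0.5))  # 0.205078125
```

Summing binomial_pmf(k, n, pi) over k = 0, 1, ..., n should give 1, which is a quick sanity check on any such implementation.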
Categorical Data
$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
The idea is to relate the sample fraction to the population fraction using this formula
Key Point: if we knew $\pi$, then we could entirely characterize the fraction of experiments that have k successes
Categorical Data
The probability that the coin lands on heads will be denoted by the Greek symbol $\pi$
Suppose you flip a coin 2 times, and count the number of heads.
So here, X = number of heads that arise when you flip a coin 2 times
X takes on the values 0, 1 and 2
$\hat{\pi}$ takes on the values 0/2, 1/2, 2/2
Categorical Data: What the binomial formula does
The experiment results in 4 equally likely outcomes: each occurs ¼ of the time
                     Tails on toss #1    Heads on toss #1
Tails on toss #2            ¼                   ¼
Heads on toss #2            ¼                   ¼
Categorical Data
Heads = “success”:
                     Tails on toss #1    Heads on toss #1
Tails on toss #2            ¼                   ¼
Heads on toss #2            ¼                   ¼
$\Pr(X = 0) = \Pr(\hat{\pi} = 0/2) = 1/4$
$\Pr(X = 1) = \Pr(\hat{\pi} = 1/2) = 1/2$
$\Pr(X = 2) = \Pr(\hat{\pi} = 2/2) = 1/4$
The binomial formula can be used to give these results without thinking
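The table-counting argument can also be checked by brute force. The sketch below lists the 4 equally likely outcomes of two tosses and counts how many give each number of heads. The 0/1 coding follows the lecture (0 = tails, 1 = heads); the variable names are just for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of 2 tosses: 0 = tails, 1 = heads
outcomes = list(product([0, 1], repeat=2))

# X = number of heads in each outcome
counts = Counter(sum(outcome) for outcome in outcomes)

# Each outcome has probability 1/4, so Pr(X = k) = (# outcomes with k heads) / 4
prob = {k: Fraction(c, len(outcomes)) for k, c in counts.items()}
for k in sorted(prob):
    print(f"Pr(X = {k}) = {prob[k]}")
```

This prints 1/4, 1/2 and 1/4 for k = 0, 1 and 2, matching the table.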
Categorical Data
0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
n=2, k=1, k! = 1, n! = 2, (n-k)! = 1
$\pi^{k} = .5$, and $(1-\pi)^{n-k} = .5$
$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k} = \frac{2}{1 \times 1} \times .5 \times .5 = 1/2$
The binomial formula gives the answer ½, which we know to be correct
Categorical Data
Roll a fair die
First Die      1     2     3     4     5     6
Probability   1/6   1/6   1/6   1/6   1/6   1/6
Every outcome is equally likely, so what are the probabilities?
What is the chance of rolling a 1 or a 2? 2/6 = 1/3
Categorical Data
Now roll two fair dice
                            First Die
                  1      2      3      4      5      6
Second Die   1   1/36   1/36   1/36   1/36   1/36   1/36
             2   1/36   1/36   1/36   1/36   1/36   1/36
             3   1/36   1/36   1/36   1/36   1/36   1/36
             4   1/36   1/36   1/36   1/36   1/36   1/36
             5   1/36   1/36   1/36   1/36   1/36   1/36
             6   1/36   1/36   1/36   1/36   1/36   1/36
Every combination is equally likely, so what are the probabilities?
Define a success as rolling a 1 or a 2.
What is the chance of two successes? 4/36 = 1/9
What is the chance of two failures? 16/36 = 4/9
Categorical Data
So, a success occurs when you roll a 1 or a 2
Pr(success on a single die) = 2/6 = 1/3 = $\pi$
Pr(2 successes) = 1/3 x 1/3 = 1/9
Use the binomial formula: Pr(X = k) when n = 2 and k = 2
k! = 2, n! = 2, (n-k)! = 0! = 1,
$\pi^{k} = 1/9$, and $(1-\pi)^{n-k} = 1$
$\Pr(X = k) = \Pr(\hat{\pi} = k/n) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k} = 1/9$
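To see the formula and the 36-cell table agree for every value of k, here is a small sketch that counts successes over all (first die, second die) pairs and compares the result with the binomial formula. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product
from math import comb

pi = Fraction(1, 3)  # Pr(success on one die) = Pr(roll a 1 or a 2)

# Count successes in each of the 36 equally likely (first die, second die) pairs
table = {0: 0, 1: 0, 2: 0}
for first, second in product(range(1, 7), repeat=2):
    successes = (first <= 2) + (second <= 2)
    table[successes] += 1

for k in range(3):
    by_counting = Fraction(table[k], 36)
    by_formula = comb(2, k) * pi**k * (1 - pi)**(2 - k)
    print(k, by_counting, by_formula)
```

Both columns come out as 4/9, 4/9 and 1/9 for k = 0, 1 and 2.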
Categorical Data
In other words, the binomial formula works in these simple cases, where we can draw nice tables
Now think of rolling 4 dice, and ask for the chance that 3 of the 4 dice show a 1 or a 2
Too big a table: need a formula
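For the 4-dice question, the table would need 6 x 6 x 6 x 6 = 1296 cells, but the formula needs only one line. A minimal sketch, using exact fractions so that no rounding is involved:

```python
from fractions import Fraction
from math import comb

n, k = 4, 3              # 4 dice, exactly 3 successes
pi = Fraction(1, 3)      # Pr(a single die shows a 1 or a 2)

prob = comb(n, k) * pi**k * (1 - pi)**(n - k)
print(prob, float(prob))
```

The exact answer is 8/81, a little under 10%.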
Categorical Data
Does it matter what you call a “success” and what you call a “failure”?
No, as long as you keep track
For example, in a class experiment many years ago, men were asked whether they preferred to wear boxers or briefs
This is binary, because there are only 2 outcomes
“success” = ?????
Categorical Data
Binary experiments have sampling variability, just like sample means, etc.
Experiment: “success” = being under 5’10” in height
First 6 men with SSN < 5
First 6 men with SSN > 5
Note how the number of “successes” was not the same! (I might have to do this a few times)
Categorical Data
The sample fraction is a random variable
This means that if I do the experiment over and over, I will get different values.
These different values have a standard deviation.
Categorical Data
The sample fraction has a standard error
Its standard error is $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
Note how if you have a bigger sample, the standard error decreases
The standard error is biggest when $\pi$ = 0.50.
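One way to make the standard error concrete is to repeat a binary experiment many times on a computer and compare the spread of the sample fractions with the formula sqrt(pi(1 - pi)/n). The sketch below does this. The values pi = 0.3, n = 25, the number of repetitions and the seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import pstdev

def standard_error(pi, n):
    """Standard error of the sample fraction: sqrt(pi * (1 - pi) / n)."""
    return sqrt(pi * (1 - pi) / n)

def simulate_fractions(pi, n, reps, seed=0):
    """Repeat the n-trial experiment reps times; return the sample fractions."""
    rng = random.Random(seed)
    return [sum(rng.random() < pi for _ in range(n)) / n for _ in range(reps)]

fractions = simulate_fractions(pi=0.3, n=25, reps=5000)
print("formula   :", standard_error(0.3, 25))
print("simulation:", pstdev(fractions))

# A larger sample gives a smaller standard error; pi = 0.5 gives the largest
print(standard_error(0.3, 100), standard_error(0.5, 25))
```

The two printed values should be close but not identical, since the simulation itself has sampling variability.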
Categorical Data
The sample fraction has a standard error
Its standard error is $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
The estimated standard error based on the sample is $\hat{\sigma}_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
Categorical Data
It is possible to make confidence intervals for the population fraction if the number of successes > 5, and the number of failures > 5
If this is not satisfied, consult a statistician
Under these conditions, the Central Limit Theorem says that the sample fraction is approximately normally distributed (in repeated experiments)
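Categorical Data
A $(1-\alpha)100\%$ CI for the population fraction $\pi$ is
$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}}$, where $\hat{\sigma}_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
$z_{\alpha/2}$ is found by looking up $1-\alpha/2$ in Table 1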
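The interval is easy to compute once the sample fraction and n are known. The following is a minimal sketch, not the SPSS procedure: the function name proportion_ci is made up for this example, and the normal quantile comes from Python's standard library rather than from Table 1.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(pi_hat, n, confidence=0.95):
    """Large-sample (1 - alpha)100% CI for a population fraction."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2}; about 1.96 for 95%
    se_hat = sqrt(pi_hat * (1 - pi_hat) / n)     # estimated standard error
    return pi_hat - z * se_hat, pi_hat + z * se_hat

# Hypothetical data: 40 successes in n = 100 trials
lower, upper = proportion_ci(40 / 100, 100)
print(round(lower, 3), round(upper, 3))
```

For the hypothetical data the interval is roughly 0.304 to 0.496. The sketch assumes the usual condition holds, namely more than 5 successes and more than 5 failures.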
Categorical Data
Often, you will only know the sample proportion/percentage and the sample size
Computing the confidence interval for the population proportion: two ways
By hand
By SPSS (this is a pain if you do not have the data entered already)
Because you may need to do this by hand, I will make you do this.
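Categorical Data
$(1-\alpha)100\%$ CI for the population fraction
95% CI, $z_{\alpha/2}$ = 1.96
n = 25, $\hat{\pi}$ = 0.30
$\hat{\sigma}_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n} = \sqrt{.3(1-.3)/25} = 0.09165$
$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165$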
Categorical Data
$(1-\alpha)100\%$ CI for the population fraction
$\hat{\pi} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}} = 0.30 \pm 1.96 \times 0.09165 = 0.30 \pm 0.18 = [0.12, 0.48]$
Interpretation? The proportion of successes in the population is from 0.12 to 0.48 (12% to 48%) with 95% confidence
Categorical Data
You can use SPSS as long as the number of successes and the number of failures both exceed 5
To get the confidence intervals, you first have to define a numeric version of your variable that classifies whether an observation is a success or failure.
You then compute the 1-sample confidence interval from “descriptives” “Explore”: Demo
Categorical Data
If you set up your data in SPSS, the “mean” will be the proportion/fraction/percentage of 1’s
Data = 0 1 1 1 0 0 0 1 0 0
n = 10
Mean = 4/10 = .40
$\hat{\pi}$ = .40
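The same point can be checked outside SPSS. A tiny sketch using the ten data values from the slide, coded 1 for success and 0 for failure:

```python
data = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]  # 1 = success, 0 = failure

n = len(data)
pi_hat = sum(data) / n   # the mean of 0/1 data is the fraction of 1's
print(n, pi_hat)         # 10 0.4
```

This is why a 0/1 coding lets any routine that reports a mean, and a confidence interval for a mean, be reused for proportions.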
Boxers versus briefs for males
Case Processing Summary
                                              Cases
                                 Valid          Missing          Total
                               N    Percent    N    Percent    N    Percent
Boxers or Briefs Preference   188   100.0%     0     .0%      188   100.0%
In this output, boxers = 1 and briefs = 0
Boxers versus briefs for males: what % prefer boxers? In the sample, 46.81%. In the population???
Descriptives: Boxers or Briefs Preference (Gender: Male; Numeric Boxers: 0 = Briefs, 1 = Boxers)
                                                  Statistic    Std. Error
Mean                                               .4681       3.649E-02
95% Confidence Interval for Mean   Lower Bound     .3961
                                   Upper Bound     .5401
5% Trimmed Mean                                    .4645
Median                                             .0000
Variance                                           .250
Std. Deviation                                     .5003
Minimum                                            .00
Maximum                                            1.00
Range                                              1.00
Interquartile Range                                1.0000
Skewness                                           .129        .177
Kurtosis                                           -2.005      .353
In this output, boxers = 1 and briefs = 0. The proportion of 1’s is the mean
Boxers versus briefs for males: what % prefer boxers? Between 39.61% and 54.01%
Boxers versus briefs
In the sample, 46.81% of the men preferred boxers to briefs: 53.19% preferred briefs.
Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI)
Is there enough evidence to conclude that men generally prefer briefs?
No, since 50% is in the CI! This means that it is plausible (at 95% confidence) that 50% prefer boxers and 50% prefer briefs, i.e., $\pi$ = 0.50.
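As a cross-check on the SPSS output, the by-hand interval can be computed from the sample size and the sample fraction alone. The sketch below assumes that 88 of the 188 men preferred boxers. That count is not given in the lecture; it is inferred here because 88/188 rounds to .4681.

```python
from math import sqrt

n = 188
successes = 88                 # assumed count: 88/188 rounds to .4681
pi_hat = successes / n

se_hat = sqrt(pi_hat * (1 - pi_hat) / n)   # estimated standard error
lower = pi_hat - 1.96 * se_hat
upper = pi_hat + 1.96 * se_hat

print(f"pi_hat = {pi_hat:.4f}, CI = [{lower:.4f}, {upper:.4f}]")
print("0.50 inside the CI:", lower <= 0.50 <= upper)
```

The by-hand interval is roughly 0.397 to 0.539, very close to the SPSS interval of .3961 to .5401. The small gap arises because SPSS treats the 0/1 data as an ordinary sample mean, with a standard deviation that divides by n - 1 and a t rather than a z multiplier. Either way, 0.50 lies inside the interval.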
Sample Size Calculations
The standard error of the sample fraction is $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
If you want a $(1-\alpha)100\%$ CI to be $\hat{\pi} \pm E$, you should set
$E = z_{\alpha/2}\sqrt{\pi(1-\pi)/n}$
Sample Size Calculations
$E = z_{\alpha/2}\sqrt{\pi(1-\pi)/n}$
This means that
$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$
Sample Size Calculations
$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$
The small problem is that you do not know $\pi$. You have two choices:
Make a guess for $\pi$
Set $\pi$ = 0.50 and calculate (most conservative, since it results in the largest sample size)
Most polling operations make the latter choice, since it is most conservative
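Sample Size Calculations: Examples
$n = z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{E^{2}}$
Set E = 0.04, 95% CI, you guess that $\pi$ = 0.30:
$n = 1.96^{2} \times \frac{.3(1-.3)}{.04^{2}} = 504.2$, so n is about 504 (505 if you always round up)
You have no good guess, so set $\pi$ = 0.50:
$n = 1.96^{2} \times \frac{.5(1-.5)}{.04^{2}} = 600.25$, so n = 601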
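The sample size formula is simple to wrap in a helper. This is a sketch under stated assumptions: the function name sample_size is invented for the example, z defaults to 1.96 (a 95% CI), and the result is always rounded up so that the margin of error is at most E. With that rounding rule the first example gives 505 rather than 504.

```python
from math import ceil

def sample_size(E, pi=0.5, z=1.96):
    """Smallest n with z * sqrt(pi * (1 - pi) / n) <= E."""
    return ceil(z**2 * pi * (1 - pi) / E**2)

print(sample_size(0.04, pi=0.30))  # guess pi = 0.30
print(sample_size(0.04))           # no good guess: pi = 0.50
```

Leaving pi at its default of 0.50 gives the conservative answer, since pi(1 - pi) is largest there.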