1 Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 5: Generalisability of...
1
Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 5:
Generalisability of Social Research and the Role of Inference
Dr Gwilym Pryce
2
Coin tossing experiment: is this a fair coin?
(A) Toss   Outcome (1 if H)   (B) Running total that are heads   (B) / (A) Proportion of total so far that are heads
1          1                  1                                  1
2          0                  1                                  0.50
3          0                  1                                  0.33
4          1                  2                                  0.5
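The running proportion in the table can be reproduced with a few lines of Python. This is only a sketch: the function name `running_proportions` is made up for illustration, and the input is the four outcomes shown on the slide.

```python
from itertools import accumulate

def running_proportions(outcomes):
    """Return the proportion of heads after each toss (1 = head, 0 = tail)."""
    running_totals = accumulate(outcomes)  # column (B) of the table
    # Divide each running total by the toss number, column (A)
    return [total / toss for toss, total in enumerate(running_totals, start=1)]

# The four tosses shown in the table: H, T, T, H
tosses = [1, 0, 0, 1]
proportions = running_proportions(tosses)
print([round(p, 2) for p in proportions])  # [1.0, 0.5, 0.33, 0.5]
```

Feeding in a much longer list of simulated tosses shows how slowly the proportion settles towards 0.5 for a fair coin.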
3
Coin tossing experiment: is this a fair coin?
4
Coin tossing example:
5
Implications of fair coin experiment:
If we want to survey a sample of people as a means of saying something about the relative size of a particular group in the population
– e.g. the true proportion of owner-occupier (OO) households facing repayment difficulties
it might take a large sample before the true value of a proportion emerges
6
Social Research:
We usually only have a sample from which we want to infer something about the population
– proportion (e.g. % with Mortgage Payment Protection Insurance, MPPI)
– mean (e.g. average income)
I.e. we want to be able to ‘generalise’
7
Statistical Inference: allows us to generalise from our sample to the population in a systematic way:
– Assumes that each member of the ‘population’ has an equal chance of entering our sample
– If this assumption holds, Statistical Inference takes into account random variation from sample to sample
– Allows us to derive a ‘confidence interval’ for the population mean or proportion
8
CIs allow us to make the following types of statement:
E.g. CI for a mean:
– 95% sure that the average age of a homeless person in Glasgow is between 37 and 45 years.
E.g. CI for a proportion:
– 95% sure that the proportion of one adult households with no children that have MPPI is between 25% and 28%
9
We usually have different numbers of observations in different groups:
– confidence intervals for the proportions & means based on those different groups will vary in width purely because of sample size
– we may have a very large overall sample but still end up with very wide confidence intervals for the population mean or proportion of a particular group.
10
Take-up of MPPI by Household Structure
Household Composition               % mortgagors with MPPI
One adult                           26.6
Two adults                          30.0
Three or more adults                28.6
One adult one child                 23.6
One adult two children              23.0
One adult three or more children    15.4
Two adults one child                29.2
Two adults two children             27.1
Two adults three children           24.1
Three adults one child              29.7
Three adults two children           29.0
Three adults three children         23.8
UK                                  27.2
(Source: FRS 1994/95 and 1995/96 combined; sample size of mortgagors = 18,566)
11
Take-up of MPPI by Household Structure
Household Composition               % mortgagors with MPPI   Size of Group as a % of All Mortgagors
One adult                           26.6                     13.4
Two adults                          30.0                     25.2
Three or more adults                28.6                     8.3
One adult one child                 23.6                     1.8
One adult two children              23.0                     1.8
One adult three or more children    15.4                     0.6
Two adults one child                29.2                     13.0
Two adults two children             27.1                     19.1
Two adults three children           24.1                     7.0
Three adults one child              29.7                     3.3
Three adults two children           29.0                     1.2
Three adults three children         23.8                     0.6
UK                                  27.2                     100
(Source: FRS 1994/95 and 1995/96 combined; sample size of mortgagors = 18,566)
12
Household Category                  % mortgagors with MPPI   Size of Group as % of All Mortgagors   Sample size
One adult                           26.6%                    13.4                                   2488
Two adults                          30.0%                    25.2                                   4679
Three or more adults                28.6%                    8.3                                    1541
One adult one child                 23.6%                    1.8                                    334
One adult two children              23.0%                    1.8                                    334
One adult three or more children    15.4%                    0.6                                    111
Two adults one child                29.2%                    13.0                                   2414
Two adults two children             27.1%                    19.1                                   3546
Two adults three children           24.1%                    7.0                                    1300
Three adults one child              29.7%                    3.3                                    613
Three adults two children           29.0%                    1.2                                    223
Three adults three children         23.8%                    0.6                                    111
UK                                  27.2%                    100.0                                  18566
13
Q/ Given:
• a sample size of 111 for HHs with 1 adult and 3 children, of which 15.4% have MPPI,
what do you think the 95% confidence interval would be for the population % with MPPI?
14
Household Category                  Proportion of mortgagors with MPPI   Sample size   95% CI Lower   95% CI Upper
One adult                           26.6%                                2488          24.9%          28.3%
Two adults                          30.0%                                4679          28.7%          31.3%
Three or more adults                28.6%                                1541          26.3%          30.9%
One adult one child                 23.6%                                334           19.1%          28.2%
One adult two children              23.0%                                334           18.5%          27.5%
One adult three or more children    15.4%                                111           8.7%           22.1%
Two adults one child                29.2%                                2414          27.4%          31.0%
Two adults two children             27.1%                                3546          25.6%          28.6%
Two adults three children           24.1%                                1300          21.8%          26.4%
Three adults one child              29.7%                                613           26.1%          33.3%
Three adults two children           29.0%                                223           23.0%          35.0%
Three adults three children         23.8%                                111           15.9%          31.7%
UK                                  27.2%                                18566         26.6%          27.8%
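The slides do not say how the lower and upper limits were calculated. As a sketch, assuming the usual normal-approximation interval (sample proportion ± 1.96 standard errors), the figures for the smallest group and for the UK as a whole appear to be reproduced. The function name `proportion_ci` is hypothetical.

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation CI for a proportion; z = 1.96 gives roughly 95%.

    Assumption: this is the textbook formula, not one stated in the slides.
    """
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# One adult, three or more children: 15.4% with MPPI, sample size 111
lower, upper = proportion_ci(0.154, 111)
print(f"{lower:.1%} to {upper:.1%}")  # 8.7% to 22.1%

# Whole UK sample: 27.2% with MPPI, sample size 18566
uk_lower, uk_upper = proportion_ci(0.272, 18566)
print(f"{uk_lower:.1%} to {uk_upper:.1%}")  # 26.6% to 27.8%
```

The two proportions are of similar order, so the much wider interval for the small group comes almost entirely from its sample size of 111 against 18,566.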
15
Why not use intuition?
Q/ In a group of 25 people, what is the probability that at least two of them will have the same birthday?
Q/ What’s the probability in a group of 60?
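The two birthday questions can be answered exactly with a short calculation. The sketch below assumes 365 equally likely birthdays and ignores leap years and seasonal patterns; the function name `prob_shared_birthday` is made up for illustration.

```python
def prob_shared_birthday(group_size, days_in_year=365):
    """P(at least two people share a birthday), assuming equally likely days."""
    prob_all_different = 1.0
    for k in range(group_size):
        # Person k+1 must avoid the k birthdays already taken
        prob_all_different *= (days_in_year - k) / days_in_year
    return 1.0 - prob_all_different

for size in (25, 60):
    # Roughly 0.57 for a group of 25 and 0.99 for a group of 60
    print(size, round(prob_shared_birthday(size), 3))
```

Most people guess far lower than this, which is the point of the slide: intuition is a poor guide to probabilities.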
16
17
How does inference work?
Central Limit Theorem:
– If we were able to take repeated samples, we would find that the means from each of those samples would be normally distributed.
– Similarly, if we take repeated samples and compute the sample proportion for each, we would find that the sample proportions would have a normal distribution.
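The idea of repeated sampling can be tried out in a small simulation. This is an illustrative sketch, not part of the lecture: the population is assumed to be exponential with mean 1 (clearly non-normal), and the sample size of 100 and the 2,000 repetitions are arbitrary choices.

```python
import random
import statistics

def sample_means(num_samples, sample_size, rng):
    """Draw repeated samples from a skewed population and return their means."""
    # Assumed population: exponential with mean 1 and standard deviation 1.
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(num_samples=2000, sample_size=100, rng=rng)

centre = statistics.fmean(means)  # should be close to the population mean, 1
spread = statistics.stdev(means)  # should be close to 1 / sqrt(100) = 0.1
# Share of sample means within 1.96 standard deviations of the centre
within = sum(abs(m - centre) <= 1.96 * spread for m in means) / len(means)
print(round(centre, 2), round(spread, 2), round(within, 2))
```

If the sample means are roughly normal, about 95% of them should fall within 1.96 standard deviations of their centre, even though the individual observations are heavily skewed.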
18
CLT: Distribution of means from repeated samples is normal if n is large
As more samples are taken, normal distribution of mean emerges
[Figure: six histograms of NORM_2 (sample means), built from progressively more samples]
19
E.g. Even though GDP pc is not normally distributed, means from repeated samples are.
20
[Figure: histogram of LTY (all new borrowers): Std. Dev = 3.36, Mean = 2.3, N = 70289]
[Figure: three histograms of LTY (Sample):
 Std. Dev = 1.03, Mean = 2.23, N = 497
 Std. Dev = 5.08, Mean = 2.47, N = 491
 Std. Dev = .77, Mean = 2.20, N = 494]
Population: • non-normal
Samples: • non-normal
21
[Figure: three histograms of LTY (Sample):
 Std. Dev = 1.03, Mean = 2.23, N = 497
 Std. Dev = 5.08, Mean = 2.47, N = 491
 Std. Dev = .77, Mean = 2.20, N = 494]
Sampling Distribution of the mean: • normal (i.e. means from repeated samples will be normally distributed)
As more samples are taken, normal distribution of mean emerges
[Figure: six histograms of NORM_2 (sample means), built from progressively more samples]
22
Mean of the sampling distribution of means = population mean
As more samples are taken, normal distribution of mean emerges
[Figure: six histograms of NORM_2 (sample means), shown alongside the histogram of LTY (all new borrowers): Std. Dev = 3.36, Mean = 2.3, N = 70289]
23
Normal Distribution and CLT:
CLT: the distribution of sample means is normal
Also: population mean = mean of all sample means
These two properties allow us to compute confidence intervals because:
– Statisticians have worked out the probabilities associated with the normal curve
24
Suppose we know the sampling distribution of the mean:
– we can then say where 95% of sample means lie:
• e.g. 95% of sample mean LTYs lie between 1.2 and 3.3
– That is, about 1.1 either side of the population LTY of 2.3
25
But to say that the sample mean lies within 1.1 of the population mean μ is the same as saying that μ is within 1.1 of the sample mean.
– So 95% of all samples will capture the true population mean in the interval (sample mean ± 1.1)
Put another way, there are only 2 possibilities:
• Either the interval (sample mean ± 1.1) contains μ
• Or our sample was one of the few samples (i.e. one of the 5%) for which the sample mean is not within 1.1 of μ
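The claim that about 95% of samples capture the population mean can be checked by simulation. The sketch below is illustrative only: the exponential population with mean 1, the sample size of 200 and the use of each sample's own standard deviation to build the interval are all assumptions, not details from the slides.

```python
import random
import statistics
from math import sqrt

def interval_covers(sample, true_mean, z=1.96):
    """True if (sample mean ± z standard errors) contains true_mean."""
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / sqrt(len(sample))
    return abs(mean - true_mean) <= z * standard_error

rng = random.Random(1)  # fixed seed so the run is repeatable
true_mean = 1.0         # mean of the assumed exponential(1) population
trials = 2000

# Count how many of the repeated samples produce an interval containing the mean
hits = sum(
    interval_covers([rng.expovariate(1.0) for _ in range(200)], true_mean)
    for _ in range(trials)
)
coverage = hits / trials
print(round(coverage, 3))  # should come out near, not exactly at, 0.95
```

The remaining share of trials corresponds to the "one of the few samples" case: the interval was built correctly but the sample mean happened to fall unusually far from the population mean.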
26
CLT Applies also to proportions:
MPPI example:
– For single parent HHs with 3 children, 95% sure that the population proportion for MPPI take-up lies between 8.7% and 22.1%
• Either the interval 8.7% to 22.1% contains the population proportion
• Or our sample was one of the few samples (i.e. one of the 5%) for which the interval does not contain the population proportion
27
Testing hypotheses
Sometimes we want to use our sample to test a particular hypothesis about the population:
• Average age that Glaswegians first have sex is below 15 years.
• MPPI take-up has now reached the government target of 50% of all mortgage borrowers
• On average men earn more than women
• A higher proportion of smokers get lung cancer than non-smokers
28
The procedure for hypothesis testing
First establish a null hypothesis, H0:
• This usually says that something is equal to something (sometimes this is the opposite of the hypothesis we’d like to prove, but not always):
H0: Age 1st have sex = 15 years
H0: MPPI take-up = 50%
H0: Ave. male wage = Ave. female wage
H0: % smokers that get lung cancer = % non-smokers that get lung cancer
29
Then state the Alternative Hypothesis, H1:
H1 usually says how we think the outcome will go (but not always) and has to be a statement that includes “not”, “>”, “<” or “≠”
H1: Age people 1st have sex < 15 years
H1: MPPI take-up ≠ 50%
H1: Ave. male wage > Ave. female wage
H1: % smokers that get lung cancer > % non-smokers that get lung cancer
30
We usually write the alternative hypothesis under the null:
H0: Age 1st have sex = 15 years
H1: Age people 1st have sex < 15 years
H0: MPPI take-up = 50%
H1: MPPI take-up ≠ 50%
H0: Ave. male wage = Ave. female wage
H1: Ave. male wage > Ave. female wage
H0: % smokers that get lung cancer =
% non-smokers that get lung cancer
H1: % smokers that get lung cancer >
% non-smokers that get lung cancer
31
And then…
Find the probability of false rejection of H0:
• I.e. if we reject H0, what are the chances that we have done so incorrectly?
This particular probability has a special name: “significance level”.
If we say that our alternative hypothesis is “statistically significant”, we mean that the chances of false rejection of the null hypothesis are small.
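As a sketch of how such a probability might be computed for a proportion, the example below applies a two-sided z-test to H0: MPPI take-up = 50%, using the UK figures from the FRS table (27.2% of 18,566 mortgagors). The z-test and the function name `proportion_z_test` are assumptions for illustration; the slides do not specify a test procedure.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(p_hat, n, p_null):
    """Two-sided z-test of H0: population proportion = p_null."""
    standard_error = sqrt(p_null * (1 - p_null) / n)  # computed under H0
    z = (p_hat - p_null) / standard_error
    # Probability of a result at least this extreme if H0 were true
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# H0: MPPI take-up = 50%; sample: 27.2% of 18,566 mortgagors have MPPI
z, p_value = proportion_z_test(0.272, 18566, 0.50)
print(round(z, 1), p_value < 0.05)
```

Here the sample proportion is so far below 50%, relative to its standard error, that rejecting H0 carries a negligible chance of being a false rejection.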
32
Summary:
Social Research is usually based on samples
We usually want to use our sample to say something about the population
– I.e. we want to be able to generalise
How precisely we can estimate the population mean or proportion depends on our sample size and the variation within the sample
Using the CLT, statistical inference offers a systematic way of establishing:
– the range of values in which the population mean or proportion is likely to lie (‘a confidence interval’).
– whether a hypothesis about a mean or a proportion is likely to hold in the population.