1 Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 5: Generalisability of...

1

Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 5:

Generalisability of Social Research and the Role of Inference

Dr Gwilym Pryce

2

Coin tossing experiment: is this a fair coin?

(A) Toss   Outcome (1 if H)   (B) Running total that are heads   (B) / (A) Proportion of total so far that are heads
1          1                  1                                  1
2          0                  1                                  0.50
3          0                  1                                  0.33
4          1                  2                                  0.5

3

Coin tossing experiment: is this a fair coin?

4

Coin tossing example:

5

Implications of fair coin experiment:

If we want to survey a sample of people as a means of saying something about the relative size of a particular group in the population, e.g. the true proportion of OO (owner-occupier) households facing repayment difficulties, it might take a large sample before the true value of a proportion emerges.
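The running-proportion column from the coin tossing table can be reproduced with a short simulation. This is only an illustrative sketch: the function name `running_proportions`, the seed and the number of tosses are assumptions made here, not part of the lecture.

```python
import random

def running_proportions(outcomes):
    """Return the proportion of heads so far after each toss (1 = head, 0 = tail)."""
    proportions = []
    heads_so_far = 0
    for toss_number, outcome in enumerate(outcomes, start=1):
        heads_so_far += outcome                          # column (B)
        proportions.append(heads_so_far / toss_number)   # column (B) / (A)
    return proportions

# The first four tosses shown on the slide: H, T, T, H
print(running_proportions([1, 0, 0, 1]))

# A simulated fair coin (seed fixed only so the run is repeatable)
rng = random.Random(1)
tosses = [rng.randint(0, 1) for _ in range(10_000)]
props = running_proportions(tosses)
for n in (10, 100, 1_000, 10_000):
    print(n, round(props[n - 1], 3))
```

In runs like this the proportion tends to wander in the early tosses and settles near 0.5 only as the number of tosses grows, which is the point being made about sample size.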

6

Social Research:

We usually only have a sample from which we want to infer something about the population:
– proportion (e.g. % with mortgage payment protection insurance, MPPI)
– mean (e.g. average income)

I.e. we want to be able to ‘generalise’.

7

Statistical Inference: allows us to generalise from our sample to the population in a systematic way:
– Assumes that each member of the ‘population’ has an equal chance of entering our sample.
– If this assumption holds, Statistical Inference takes into account random variation from sample to sample.
– Allows us to derive a ‘confidence interval’ for the population mean or proportion.

8

CIs allow us to make the following types of statement:

E.g. CI for a mean:
– 95% sure that the average age of a homeless person in Glasgow is between 37 and 45 years.

E.g. CI for a proportion:
– 95% sure that the proportion of one adult households with no children that have MPPI is between 25% and 28%.

9

We usually have different numbers of observations in different groups:
– confidence intervals for the proportions & means based on those different groups will vary purely due to sample size
– we may have a very large overall sample but may end up with very wide confidence intervals for the population mean or proportion for a particular group.

10

Take-up of MPPI by Household Structure

Household Composition               % mortgagors with MPPI
One adult                           26.6
Two adults                          30.0
Three or more adults                28.6
One adult one child                 23.6
One adult two children              23.0
One adult three or more children    15.4
Two adults one child                29.2
Two adults two children             27.1
Two adults three children           24.1
Three adults one child              29.7
Three adults two children           29.0
Three adults three children         23.8
UK                                  27.2

(Source: FRS 1994/95 and 1995/96; combined sample size of mortgagors = 18,566)

11

Take-up of MPPI by Household Structure

Household Composition               % mortgagors with MPPI   Size of group as a % of all mortgagors
One adult                           26.6                     13.4
Two adults                          30.0                     25.2
Three or more adults                28.6                     8.3
One adult one child                 23.6                     1.8
One adult two children              23.0                     1.8
One adult three or more children    15.4                     0.6
Two adults one child                29.2                     13.0
Two adults two children             27.1                     19.1
Two adults three children           24.1                     7.0
Three adults one child              29.7                     3.3
Three adults two children           29.0                     1.2
Three adults three children         23.8                     0.6
UK                                  27.2                     100

(Source: FRS 1994/95 and 1995/96 combined; sample size of mortgagors = 18,566)

12

Household Category                  % mortgagors with MPPI   Size of group as % of all mortgagors   Sample size
One adult                           26.6%                    13.4                                   2488
Two adults                          30.0%                    25.2                                   4679
Three or more adults                28.6%                    8.3                                    1541
One adult one child                 23.6%                    1.8                                    334
One adult two children              23.0%                    1.8                                    334
One adult three or more children    15.4%                    0.6                                    111
Two adults one child                29.2%                    13.0                                   2414
Two adults two children             27.1%                    19.1                                   3546
Two adults three children           24.1%                    7.0                                    1300
Three adults one child              29.7%                    3.3                                    613
Three adults two children           29.0%                    1.2                                    223
Three adults three children         23.8%                    0.6                                    111
UK                                  27.2%                    100.0                                  18566

13

Q/ Given:
• a sample size of 111 for HHs with 1 adult and 3 children, of which 15.4% have MPPI,

what do you think the 95% confidence interval would be for the population % with MPPI?

14

Household Category                  Proportion of mortgagors with MPPI   Sample size   95% CI lower   95% CI upper
One adult                           26.6%                                2488          24.9%          28.3%
Two adults                          30.0%                                4679          28.7%          31.3%
Three or more adults                28.6%                                1541          26.3%          30.9%
One adult one child                 23.6%                                334           19.1%          28.2%
One adult two children              23.0%                                334           18.5%          27.5%
One adult three or more children    15.4%                                111           8.7%           22.1%
Two adults one child                29.2%                                2414          27.4%          31.0%
Two adults two children             27.1%                                3546          25.6%          28.6%
Two adults three children           24.1%                                1300          21.8%          26.4%
Three adults one child              29.7%                                613           26.1%          33.3%
Three adults two children           29.0%                                223           23.0%          35.0%
Three adults three children         23.8%                                111           15.9%          31.7%
UK                                  27.2%                                18566         26.6%          27.8%
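The slides do not state which formula produced the lower and upper limits. As a sketch, the usual normal-approximation interval for a proportion, p ± 1.96 × sqrt(p(1 − p)/n), appears to reproduce the rows checked below. The function name `proportion_ci` and the choice of z = 1.96 are assumptions made for this illustration.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a proportion (assumed formula)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error

# (group, sample proportion with MPPI, sample size) taken from the table
groups = [
    ("One adult", 0.266, 2488),
    ("One adult three or more children", 0.154, 111),
    ("Three adults three children", 0.238, 111),
    ("UK", 0.272, 18566),
]

for name, p_hat, n in groups:
    lower, upper = proportion_ci(p_hat, n)
    print(f"{name}: {100 * lower:.1f}% to {100 * upper:.1f}%")
```

Because n sits under the square root, the two groups with only 111 observations get intervals many times wider than the UK-wide one, even though the overall sample is large.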

15

Why not use intuition?

Q/ In a group of 25 people, what is the probability that at least two of them will have the same birthday?

Q/ what’s the probability in a group of 60?
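The two birthday questions can be answered with a few lines of arithmetic. This sketch assumes 365 equally likely birthdays and ignores leap years; the function name `prob_shared_birthday` is made up for the example.

```python
def prob_shared_birthday(group_size, days=365):
    """P(at least two people share a birthday), assuming equally likely days."""
    p_all_different = 1.0
    for k in range(group_size):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_all_different *= (days - k) / days
    return 1.0 - p_all_different

for size in (25, 60):
    print(size, round(prob_shared_birthday(size), 3))
```

Under these assumptions the probability is roughly 57% for a group of 25 and above 99% for a group of 60, which is much higher than most people guess.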

17

How does inference work?

Central Limit Theorem:
– if we were able to take repeated samples, we would find that the means from each of those samples would be normally distributed.
– Similarly, if we take repeated samples and compute the sample proportion for each, we would find that the sample proportions would have a normal distribution.

18

CLT: Distribution of means from repeated samples is normal if n is large

As more samples are taken, normal distribution of mean emerges

[Figure: six histograms of NORM_2, built up from progressively more samples]

19

E.g. Even though GDP pc is not normally distributed, means from repeated samples are.
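A small simulation can illustrate the claim. The sketch below does not use GDP data: it assumes a made-up, strongly right-skewed population (exponential with mean 1.0), and the sample size of 100 and the 2,000 repeated samples are arbitrary choices. Skewness is used as a rough indicator of non-normality (a normal distribution has skewness 0).

```python
import random
import statistics

def skewness(values):
    """Population skewness: about 0 for a symmetric, normal-looking distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_means(n, reps, rng, pop_mean=1.0):
    """Means of `reps` independent samples of size `n` from an exponential population."""
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1 / pop_mean) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(42)  # fixed seed only for repeatability
population_draws = [rng.expovariate(1.0) for _ in range(20_000)]
means = sample_means(n=100, reps=2_000, rng=rng)

print("skewness of individual values:", round(skewness(population_draws), 2))
print("skewness of sample means:     ", round(skewness(means), 2))
print("mean of sample means:         ", round(statistics.fmean(means), 3))
```

The individual values are heavily skewed, while the sample means are close to symmetric and centre on the population mean.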

20

Population:
• non-normal

[Figure: histogram of LTY (all new borrowers): Mean = 2.3, Std. Dev = 3.36, N = 70289]

Samples:
• non-normal

[Figure: three histograms of LTY (Sample):
 Mean = 2.23, Std. Dev = 1.03, N = 497
 Mean = 2.47, Std. Dev = 5.08, N = 491
 Mean = 2.20, Std. Dev = 0.77, N = 494]

21

[Figure: three histograms of LTY (Sample):
 Mean = 2.23, Std. Dev = 1.03, N = 497
 Mean = 2.47, Std. Dev = 5.08, N = 491
 Mean = 2.20, Std. Dev = 0.77, N = 494]

Sampling Distribution of the mean:
• normal (i.e. means from repeated samples will be normally distributed)

[Figure: six histograms of NORM_2]

22

Mean of the sampling distribution of means = population mean


[Figure: six histograms of NORM_2 (sampling distribution of the mean), marked as equal in mean to the histogram of LTY (all new borrowers): Mean = 2.3, Std. Dev = 3.36, N = 70289]

23

Normal Distribution and CLT:

CLT: the distribution of sample means is normal

Also: population mean = mean of all sample means

These two properties allow us to compute confidence intervals because:
– Statisticians have worked out the probabilities associated with the normal curve

24

Suppose we know the sampling distribution of the mean:
– we can then say where 95% of sample means lie:
• e.g. 95% of sample mean LTYs lie between 1.2 and 3.3
– That is, roughly 1.1 either side of the population mean LTY of 2.3

25

But, to say that the sample mean lies within 1.1 of the population mean μ is the same as saying that μ is within 1.1 of the sample mean.
– So 95% of all samples will capture the true population mean in the interval (sample mean ± 1.1)

Put another way, there are only 2 possibilities:
• Either the interval (sample mean ± 1.1) contains μ
• Or our sample was one of the few samples (i.e. one of the 5%) for which the sample mean is not within 1.1 of μ
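The "only 2 possibilities" argument can be checked by simulation: draw many samples, build the interval (sample mean ± margin) for each, and count how often it contains the population mean. This is a sketch under stated assumptions: the population is a hypothetical exponential distribution with mean 2.3 (so its standard deviation is also 2.3), not the lecture's LTY data, and the sample size, number of repetitions and function name `coverage` are illustrative.

```python
import math
import random
import statistics

def coverage(pop_mean, pop_sd, n, reps, draw, z=1.96):
    """Share of repeated samples whose interval (mean ± margin) contains pop_mean."""
    margin = z * pop_sd / math.sqrt(n)   # half-width taken from the sampling distribution
    hits = 0
    for _ in range(reps):
        sample_mean = statistics.fmean([draw() for _ in range(n)])
        if sample_mean - margin <= pop_mean <= sample_mean + margin:
            hits += 1
    return hits / reps

rng = random.Random(7)   # fixed seed only for repeatability
mu = 2.3                 # hypothetical population mean
share = coverage(pop_mean=mu, pop_sd=mu, n=200, reps=1_000,
                 draw=lambda: rng.expovariate(1 / mu))
print(f"Share of intervals containing the population mean: {share:.3f}")
```

The share should come out close to 0.95, with the remaining samples being the "few samples" whose mean happens to land far from the population mean.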

26

CLT Applies also to proportions:

MPPI example:
– For single parent HHs with 3 children, 95% sure that the population proportion for MPPI take up lies between 8.7% and 22.1%

• Either the interval 8.7% to 22.1% contains the population proportion

• Or our sample was one of the few samples (i.e. one of the 5%) for which the interval does not contain the population proportion

27

Testing hypotheses

Sometimes we want to use our sample to test a particular hypothesis about the population:

• Average age that Glaswegians first have sex is below 15 years.

• MPPI take-up has now reached the government target of 50% of all mortgage borrowers

• On average men earn more than women

• A higher proportion of smokers get lung cancer than non-smokers

28

The procedure for hypothesis testing

First establish a null hypothesis, H0:
• This usually says that something is equal to something (sometimes this is the opposite of the hypothesis we’d like to prove but not always):

H0: Age 1st have sex = 15 years

H0: MPPI take-up = 50%

H0: Ave. male wage = Ave. female wage

H0: % smokers that get lung cancer = % non-smokers that get lung cancer

29

Then state the Alternative Hypothesis, H1:

H1 usually says how we think the outcome will go (but not always) and has to be a statement that includes “not”, “>”, “<” or “≠”

H1: Age people 1st have sex < 15 years

H1: MPPI take-up ≠ 50%

H1: Ave. male wage > Ave. female wage

H1: % smokers that get lung cancer > % non-smokers that get lung cancer

30

We usually write the alternative hypothesis under the null:

H0: Age 1st have sex = 15 years

H1: Age people 1st have sex < 15 years

H0: MPPI take-up = 50%

H1: MPPI take-up ≠ 50%

H0: Ave. male wage = Ave. female wage

H1: Ave. male wage > Ave. female wage

H0: % smokers that get lung cancer = % non-smokers that get lung cancer

H1: % smokers that get lung cancer > % non-smokers that get lung cancer
31

And then…

Find the probability of false rejection of H0:
• I.e. if we reject H0, what are the chances that we have done so incorrectly?

This particular probability has a special name: “significance level”.

If we say that our alternative hypothesis is “statistically significant” we mean that the chances of false rejection of the null hypothesis are small.
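As an illustration of the procedure, the MPPI hypothesis (H0: take-up = 50%) can be checked against the UK sample figures from the earlier table (27.2% of 18,566 mortgagors). The lecture does not name a specific test; the sketch below assumes a one-sample z-test for a proportion with a two-sided alternative, and the function name is made up for the example.

```python
import math

def one_sample_proportion_z_test(p_hat, n, p0):
    """Two-sided z-test of H0: population proportion = p0 (normal approximation)."""
    se_under_h0 = math.sqrt(p0 * (1 - p0) / n)    # standard error if H0 were true
    z = (p_hat - p0) / se_under_h0
    p_value = math.erfc(abs(z) / math.sqrt(2))    # P(|Z| >= |z|) for a standard normal Z
    return z, p_value

z, p_value = one_sample_proportion_z_test(p_hat=0.272, n=18566, p0=0.5)
print(f"z = {z:.1f}, two-sided p-value = {p_value:.3g}")

# Compare with a chosen cut-off; 5% is a common convention, assumed here
print("Reject H0 at the 5% level:", p_value < 0.05)
```

The sample proportion lies dozens of standard errors below 50%, so the chance of seeing such a result if H0 were true is negligible and H0 would be rejected at any conventional level for this (1994/95 and 1995/96) sample.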

32

Summary:

Social Research is usually based on samples.

We usually want to use our sample to say something about the population
– I.e. we want to be able to generalise

How precisely we can estimate the population mean or proportion depends on our sample size and the variation within the sample.

Using the CLT, statistical inference offers a systematic way of establishing:
– the range of values in which the population mean or proportion is likely to lie (‘a confidence interval’).
– Whether a hypothesis about a mean or a proportion is likely to hold in the population.
