Biostatistics for biomedical profession

Biostatistics for biomedical profession Lecture II BIMM34 Karin Källen & Linda Hartman November 2015 2015-11-04 1


Page 1: Biostatistics for biomedical profession

Biostatistics for biomedical profession Lecture II BIMM34

Karin Källen & Linda Hartman

November 2015

2015-11-04

Page 2: Biostatistics for biomedical profession

Lecture 2

Repetition, Lecture I:

- Types of variables (binary/nominal/ordinal/discrete/continuous)
- Descriptive statistics
  - Central tendency measures (mean, median)
  - Dispersion measures (standard deviation, percentiles)
- Graphical presentation
  - Barplot
  - Histogram
  - Boxplot

Lecture 2 Subject

Normal distribution

Reference interval, Confidence interval, SEM

Population, samples, generalisability

Page 3: Biostatistics for biomedical profession


Page 4: Biostatistics for biomedical profession

Types of data

Categorical
• Binary/dichotomous: 2 categories
• Nominal: >=2 categories
• Ordinal: order matters

Quantitative
• Discrete: only whole numbers as values
• Continuous: data that can take any value

Page 5: Biostatistics for biomedical profession


Page 6: Biostatistics for biomedical profession

Central tendency measures

• Median
- The middle observation if data are sorted

• Mean
- The sum of the observations divided by the number of observations

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

• Mode
- The most frequently occurring value


What to do if there is an even number of observations? If the number of observations is even, take the mean of the middle two.
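The three measures can be checked with a few lines of Python. This is only a minimal sketch using the standard `statistics` module; the numbers are illustrative and not taken from the lecture data.

```python
import statistics

# Illustrative observations (an even number of them, so the median
# is the mean of the two middle values after sorting).
observations = [2, 5, 3, 6, 1, 7]

mean_value = statistics.mean(observations)      # sum divided by count
median_value = statistics.median(observations)  # mean of 3 and 5 here
mode_value = statistics.mode([1, 2, 2, 3])      # most frequent value

print(mean_value, median_value, mode_value)
```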

Page 7: Biostatistics for biomedical profession

Central tendency: mean or median?

The choice depends on the distribution of the data:
• Symmetric data
• Asymmetric data
• Ordinal data

Figure: symmetric distribution; asymmetric distribution (positive skew)

Page 8: Biostatistics for biomedical profession

Central tendency measures, Summary

Type of data       Central tendency measure
Symmetric data     Mean
Asymmetric data    Median
Ordinal            Median
Nominal            -

Page 9: Biostatistics for biomedical profession

Measures of dispersion /spread

Describes the variability around the central part of the distribution

Figure: small spread / low variability versus big spread / high variability

Page 10: Biostatistics for biomedical profession

Measures of dispersion

• Standard deviation – roughly the typical distance of the observations from the mean value

• Percentiles & quartiles – split the sorted data in fixed proportions

• Range – the difference between min and max

Page 11: Biostatistics for biomedical profession

Measures of dispersion

• Standard deviation – variability around the mean

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$$

• Variance = (standard deviation)²

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$$
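As a sketch of how the standard deviation formula translates into code, the function below follows it term by term (sum of squared deviations divided by N − 1, then the square root). The function name `sample_sd` and the data are illustrative assumptions.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation with the N - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))

data = [2, 5, 3, 6, 1, 7]  # illustrative values
s = sample_sd(data)
variance = s ** 2

# statistics.stdev uses the same N - 1 definition.
print(s, variance, statistics.stdev(data))
```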

Page 12: Biostatistics for biomedical profession

Summary: summary measures

Type of data       Central tendency measure    Dispersion measure
Symmetric data     Mean                        Standard deviation
Asymmetric data    Median                      Percentiles (e.g. QL and QU)
Ordinal            Median                      Percentiles
Nominal            -                           -

Page 13: Biostatistics for biomedical profession


Page 14: Biostatistics for biomedical profession

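Figure: histogram, bar plot and box plot

Box plot, from bottom to top:
• Outliers (outside the inner fence)
• Lower whisker, ending at the lowest ”normal” value
• Lower quartile, QL
• Median
• Upper quartile, QU
• Upper whisker, ending at the highest ”normal” value
• Outliers (outside the inner fence)

Normal value: a value no more than 1.5 x interquartile range (QU - QL) below QL or above QU, i.e. within the inner fences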
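The box plot rule can be sketched in Python as follows. The data are made up, and the quartile method (`method="inclusive"` in `statistics.quantiles`) is an assumption: statistical programs use slightly different quartile conventions, so the fences may differ a little between tools.

```python
import statistics

# Illustrative data with one extreme value; not taken from the lecture.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartile conventions differ between programs; "inclusive" is one choice.
q_lower, median, q_upper = statistics.quantiles(data, n=4, method="inclusive")
iqr = q_upper - q_lower

# Inner fences: 1.5 x IQR below the lower and above the upper quartile.
lower_fence = q_lower - 1.5 * iqr
upper_fence = q_upper + 1.5 * iqr

normal_values = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers end at the lowest and highest "normal" values.
print("whiskers:", min(normal_values), max(normal_values))
print("outliers:", outliers)
```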

Page 15: Biostatistics for biomedical profession

Lecture 2


Lecture 2 Subject

Normal distribution

Reference interval, Confidence interval, SEM

Population, samples, generalisability

Page 17: Biostatistics for biomedical profession

The (perfect) normal distribution


• The mean, median, and mode all have the same value

• The curve is symmetric around the mean; the skewness and excess kurtosis are 0

• The curve approaches the X-axis asymptotically

• Mean ± 1 SD covers 2*34.1% ≈ 68% of data
• Mean ± 1.96 SD (approximately 2 SD) covers 2*47.5% = 95% of data
• Mean ± 3 SD covers 99.7% of data
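These coverage figures can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is an illustrative assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Proportion of a normal distribution within mean +/- k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"mean +/- {k} SD covers {100 * coverage(k):.1f}% of data")
```

Note that exactly ±2 SD covers slightly more than 95%; the 95% figure belongs to ±1.96 SD.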

Page 18: Biostatistics for biomedical profession

The (perfect) normal distribution


Exercise: What is the theoretical proportion of babies who have a birth weight between -2SD and +2SD, assuming that birth weight is normally distributed?

Approximately 95% (exactly 95% lie within ±1.96 SD)

Page 19: Biostatistics for biomedical profession

To produce a standardized normal distribution


Page 20: Biostatistics for biomedical profession

To produce a standardized normal distribution

1. Subtract the mean from the observed values

Page 21: Biostatistics for biomedical profession

To produce a standardized normal distribution

1. Subtract the mean from the observed values

2. Divide the resulting values by the standard deviation

The results are called standard scores / z-scores: z = (value - mean) / SD

Page 22: Biostatistics for biomedical profession

Exercise:
1. Estimate the value corresponding to a z-score of -2 and +2, respectively, for maternal height, based on the following sample measurements: Mean: 165.8 cm, SD: 6.8 cm
2. Estimate the z-score corresponding to a maternal height of 173 cm.

Answers:
1. -2 z-scores: 165.8 – 2*6.8 = 152.2 cm; +2 z-scores: 165.8 + 2*6.8 = 179.4 cm
2. (173.0 - 165.8) cm / 6.8 cm ≈ 1.1

Page 23: Biostatistics for biomedical profession

Maternal height example Continued….


173 cm corresponds to +1.1 SD among Swedish women giving birth
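The standardisation steps and the maternal height exercise can be written as two small functions. This is a sketch only; the function names `z_score` and `value_from_z` are illustrative, while the mean and SD are the sample values given in the exercise.

```python
def z_score(value, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (value - mean) / sd

def value_from_z(z, mean, sd):
    """Convert a z-score back to the original measurement scale."""
    return mean + z * sd

MEAN_HEIGHT_CM = 165.8  # sample mean from the exercise
SD_HEIGHT_CM = 6.8      # sample SD from the exercise

z_173 = z_score(173.0, MEAN_HEIGHT_CM, SD_HEIGHT_CM)
low = value_from_z(-2, MEAN_HEIGHT_CM, SD_HEIGHT_CM)
high = value_from_z(+2, MEAN_HEIGHT_CM, SD_HEIGHT_CM)

print(f"z(173 cm) = {z_173:.1f}; -2 SD = {low:.1f} cm; +2 SD = {high:.1f} cm")
```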

Page 24: Biostatistics for biomedical profession

Lecture 2


Subject

Normal distribution

Reference interval, Confidence interval, SEM

Population, samples, generalisability

Page 25: Biostatistics for biomedical profession

Reference interval


The interval within which a certain percentage (usually 95%) of the observations lies is called a reference interval. E.g., in the previous example regarding maternal height, the mean and approximate 95% reference interval (mean ± 2 SD) is 165.8 cm (152.2 – 179.4)
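A reference interval for approximately normal data can be sketched as below. The function name `reference_interval` is an illustrative assumption; the example compares the rounded multiplier 2 used in the maternal height example with the more exact 1.96.

```python
def reference_interval(mean, sd, multiplier=1.96):
    """Interval mean +/- multiplier * SD, assuming normally distributed data."""
    return mean - multiplier * sd, mean + multiplier * sd

# Maternal height example: mean 165.8 cm, SD 6.8 cm.
approx_low, approx_high = reference_interval(165.8, 6.8, multiplier=2)
exact_low, exact_high = reference_interval(165.8, 6.8)

print(f"mean +/- 2 SD:    {approx_low:.1f} to {approx_high:.1f} cm")
print(f"mean +/- 1.96 SD: {exact_low:.1f} to {exact_high:.1f} cm")
```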

Page 26: Biostatistics for biomedical profession

Confidence interval

• Often, we would like to know the precision of a certain summary measurement.

• An interval within which a certain parameter lies (with e.g. 95% certainty) is called a confidence interval.

• In order to estimate the confidence interval for a certain mean value, we have to estimate the variance (or the standard deviation) of the mean.

• How to estimate the variation of a certain mean?

Page 27: Biostatistics for biomedical profession

Exercise

1. What does the theoretical distribution from rolling a die look like?

2. What is the expected mean from rolling a die 100 times?

3. What does the theoretical distribution from the means of 100 series of dice rolls look like, if each series is based on 100 rolls?

Page 28: Biostatistics for biomedical profession

The distribution of 20 throws of a die (an example from a computer simulation):

Page 29: Biostatistics for biomedical profession

The distribution of 5000 dice rolls (computer simulation):

Rectangular distribution. Expected mean = (low + high)/2 = (1 + 6)/2 = 3.5

Page 30: Biostatistics for biomedical profession

Distribution of ten thousand mean values


Each mean value is based on 10 dice rolls

Page 31: Biostatistics for biomedical profession

Exercise… continued

1. What does the theoretical distribution from rolling a die look like? Rectangular

2. What is the expected mean from rolling a die 100 times? 3.5

3. What does the theoretical distribution from the means of 100 series of dice rolls look like, if each series is based on 100 rolls? Normal distribution
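The dice exercise can be repeated as a small computer simulation. The sketch below is not the simulation used for the slides; the seed, the number of series and the function name `mean_of_rolls` are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(2015)  # fixed seed so the run is repeatable

def mean_of_rolls(n_rolls):
    """Mean of n_rolls throws of a fair six-sided die."""
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return sum(rolls) / n_rolls

# 1000 series, each based on 100 rolls.
series_means = [mean_of_rolls(100) for _ in range(1000)]

print("mean of the means:", statistics.mean(series_means))
print("SD of the means:  ", statistics.stdev(series_means))
```

Single rolls follow a rectangular distribution, but a histogram of `series_means` should look roughly bell-shaped and centred near 3.5.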

Page 32: Biostatistics for biomedical profession

The distribution of the sample mean

The Central Limit Theorem: If the sample size is large enough, the mean of many random variables, independently drawn from the same distribution, is approximately normally distributed, irrespective of the form of the original distribution.

A crucial limitation that is often neglected: ”Large enough” depends on the distribution that the samples are drawn from.

Page 33: Biostatistics for biomedical profession

Estimating the spread of the mean

Exercise: One curve shows the distribution of ten thousand mean values, where each mean is based on 10 rolls. One curve shows the distribution of ten thousand mean values, where each mean is based on 100 rolls. Which is which?

Legend: _______ Means based on 10 rolls; _ _ _ _ _ Means based on 1000 rolls

Page 34: Biostatistics for biomedical profession

Thus, we know that we can estimate the spread of the mean. This is called the Standard error of the mean (SEM)

Let

s = the population standard deviation (in practice estimated from the sample), n = sample size

Then

SEM = s/√n
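A short derivation shows where the square root of n comes from. It assumes n independent observations $X_1, \dots, X_n$ from the same distribution with standard deviation $\sigma$, which is estimated by s in practice.

```latex
\begin{align*}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\,\sigma^2}{n^2}
   = \frac{\sigma^2}{n} \\
\operatorname{SD}(\bar{X})
  &= \frac{\sigma}{\sqrt{n}}
   \quad\Longrightarrow\quad
   \mathrm{SEM} = \frac{s}{\sqrt{n}}
\end{align*}
```

The second equality uses independence: the variance of a sum of independent variables is the sum of their variances.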

Page 35: Biostatistics for biomedical profession

Cotinine is the main metabolite of nicotine. The graph shows the cotinine levels in the umbilical artery.

Mean = 31.2, Median = 0.2

Normal distribution?

Page 36: Biostatistics for biomedical profession

Cotinine levels in the umbilical artery, continued

• By looking at the histogram, it is obvious that the cotinine levels are not normally distributed.

• Besides visual inspection and a comparison between mean and median, there are other methods to decide whether the distribution is normal or not.

Page 37: Biostatistics for biomedical profession

Child cotinine levels, continued…


Exercise: Normal distribution? NO!

Page 38: Biostatistics for biomedical profession

Difference between distribution in the population, and distribution of means of samples, by sample size

Figure: population distribution, and distribution of means from 1000 samples, by different sample sizes (N=10, N=50, N=100)

Page 39: Biostatistics for biomedical profession

Difference between distribution in the population, and distribution of means of samples, by sample size

Figure: population distribution, and distribution of means from 1000 samples, by different sample sizes (N=10, N=50, N=100)

Let m = mean(X) and s = Std(X) in the population.

Exercise: If $\bar{X}$ is based on N observations, calculate the expected
• mean($\bar{X}$) (= mean of means)
• Std($\bar{X}$) (= standard deviation of means = standard error of the mean = SEM)
in the 3 histograms.

Population: Mean = 31.2 u, Median = 0.2 u, s = 74.4 u

The expected mean of means equals the population mean, m = 31.2 u, for every N.

SEM = s/√N:
• N=10: SEM = 23.5 u
• N=50: SEM = 10.5 u
• N=100: SEM = 7.4 u
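The three SEM values follow directly from SEM = s/√N with s = 74.4 u. A minimal sketch of the arithmetic (the variable names are illustrative):

```python
import math

POPULATION_SD = 74.4  # s for the cotinine levels, in units u

# SEM = s / sqrt(N) for each sample size shown in the histograms.
sems = {n: POPULATION_SD / math.sqrt(n) for n in (10, 50, 100)}

for n, sem in sems.items():
    print(f"N={n}: SEM = {sem:.1f} u")
```

Multiplying the sample size by 10 shrinks the SEM by a factor of √10 ≈ 3.2, not by a factor of 10.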

Page 40: Biostatistics for biomedical profession

Confidence interval and Reference interval

A confidence interval tells us within which interval the ’true’ value of a parameter probably lies.

E.g., a 95% confidence interval tells us between which limits the ’true’ value (with 95% certainty) lies.

Repetition: for normally distributed data, 95% of the data lie within mean +/- 2 SD (1.96 exactly).

A large-sample 95% CI for the mean can be constructed as: (mean-1.96*SEM to mean+1.96*SEM)

Page 41: Biostatistics for biomedical profession

Large sample Confidence interval for the mean

Exercise: Assume that you have got the following estimates for maternal height from a sample. Mean: 165.8 cm, SD: 6.8 cm. Construct a 95% CI for the mean if…
1. the sample is based on 300 observations
2. the sample is based on 10 000 observations

• Mean (95% CI) = Mean +/- 1.96*SEM
• 1. SEM = 6.8/√300 ≈ 0.39: Mean (95% CI) = 165.8 (165.0-166.6)
• 2. SEM = 6.8/√10 000 = 0.068: Mean (95% CI) = 165.8 (165.7-165.9)
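The two answers can be verified with a short function. This is a sketch of the large-sample formula only (mean ± 1.96·SEM); the function name `large_sample_ci` is an illustrative assumption.

```python
import math

def large_sample_ci(mean, sd, n, z=1.96):
    """Large-sample 95% CI for the mean: mean +/- z * sd / sqrt(n)."""
    sem = sd / math.sqrt(n)
    return mean - z * sem, mean + z * sem

# Maternal height: mean 165.8 cm, SD 6.8 cm.
ci_300 = large_sample_ci(165.8, 6.8, 300)
ci_10000 = large_sample_ci(165.8, 6.8, 10_000)

print("n=300:   ", [round(limit, 1) for limit in ci_300])
print("n=10000: ", [round(limit, 1) for limit in ci_10000])
```

The interval narrows as n grows, because only the SEM depends on the sample size; the SD of the heights themselves does not change.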

Page 42: Biostatistics for biomedical profession

The distribution of the mean, if the mean is based on (at least) ≈100 observations (from the same distribution, not necessarily normal):

For n>=100, the t-distribution ≈ the normal distribution (sometimes called the Z-distribution)

Page 43: Biostatistics for biomedical profession

The distribution of the mean, if the mean is based on 7 observations from a normal distribution:

t-distribution with Df=”Degrees of freedom”= n-1=6

For n>=100, the t-distribution ≈ the normal distribution (sometimes called the Z-distribution)

Page 44: Biostatistics for biomedical profession

The distribution of the mean, if the mean is based on 4 observations from a normal distribution:

n=4, t-distribution with Df=”Degrees of freedom”= n-1=3

n=7, t-distribution with Df=”Degrees of freedom”= n-1=6

For n>=100, the t-distribution ≈ the normal distribution (sometimes called the Z-distribution)

Page 45: Biostatistics for biomedical profession

Confidence interval for the mean

• Repetition: for a large sample, 95% CI = $\left(\bar{X} - 1.96\cdot\frac{s}{\sqrt{N}}\ \text{to}\ \bar{X} + 1.96\cdot\frac{s}{\sqrt{N}}\right)$

• For small samples we must account for uncertainty in the estimate of the standard deviation $s$

• 95% CI = $\left(\bar{X} - t_{N-1}\cdot\frac{s}{\sqrt{N}}\ \text{to}\ \bar{X} + t_{N-1}\cdot\frac{s}{\sqrt{N}}\right)$, where N-1 = ”degrees of freedom”

Degrees of freedom    t-constant for 95% CI
5                     2.57
9                     2.26
19                    2.09
29                    2.05
49                    2.01
99                    1.98
∞                     1.96

Page 46: Biostatistics for biomedical profession

Exercise: Construct a 95% confidence interval for the mean of the following series: 2 5 3 6 1 7

• Mean = 24/6 = 4

• s = $\sqrt{\frac{(2-4)^2+(5-4)^2+(3-4)^2+(6-4)^2+(1-4)^2+(7-4)^2}{6-1}} = \sqrt{\frac{28}{5}}$ = √5.6 ≈ 2.4

• SEM = s/√n = √5.6/√6 ≈ 0.97

• t-constant for 95% CI with n-1 = 5 degrees of freedom: t = 2.57

• Upper 95% CL: mean + t*SEM: 4 + 2.57*0.97 ≈ 6.5
• Lower 95% CL: mean – t*SEM: 4 - 2.57*0.97 ≈ 1.5

• Mean (95% CI): 4.0 (1.5 - 6.5)
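The same calculation can be sketched in Python. To stay within the standard library, the t-constants are typed in from the lecture's table (only a subset, rounded to two decimals) rather than computed; the dictionary and function names are illustrative assumptions.

```python
import math
import statistics

# t-constants for a 95% CI, copied from the table (subset, two decimals).
T_CONSTANT_95 = {5: 2.57, 9: 2.26, 19: 2.09, 49: 2.01, 99: 1.98}

def small_sample_ci(values):
    """95% CI for the mean: mean +/- t(n-1) * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    t = T_CONSTANT_95[n - 1]  # raises KeyError if df is not in the table
    return mean - t * sem, mean + t * sem

lower, upper = small_sample_ci([2, 5, 3, 6, 1, 7])
print(f"Mean (95% CI): 4.0 ({lower:.1f} - {upper:.1f})")
```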


Page 47: Biostatistics for biomedical profession

Reference intervals and confidence intervals

The reference interval reflects the interval within which 95% of the population (or values) lies

Example: Lower limit: mean – 1.96 * s; Upper limit: mean + 1.96 * s

A 95% confidence interval tells us between which limits the ’true’ mean (with 95% certainty) lies:

(Large sample: mean-1.96*SEM to mean+1.96*SEM), where SEM = s/√n

Page 48: Biostatistics for biomedical profession

Exercise: The middle line shows the mean head circumference by gestational age at birth. Does the distance between the upper and lower lines represent the reference interval, or the confidence interval? The graph is based on approx. 1 000 000 births.

Figure: head circumference (cm) by gestational age (completed weeks). Mean and 95% reference interval

Page 49: Biostatistics for biomedical profession

Figure: head circumference (cm) by gestational age (completed weeks), for the same exercise as on the previous slide. Mean with 95% CI

Page 50: Biostatistics for biomedical profession

The 95% CI is often represented by error bars

Statistical inference: Is there a true difference between the means?

Page 51: Biostatistics for biomedical profession

Lecture 2


Subject

Normal distribution

Reference interval, Confidence interval, SEM

Population, samples, generalisability

Page 52: Biostatistics for biomedical profession

Generalisability

Population

Sample

Inferential statistics

Descriptive statistics


Page 53: Biostatistics for biomedical profession

Inferential statistics….

… are used to determine the probability (or likelihood) that a conclusion based on analysis of data from a sample is true.

The sample describes the individuals/animals/cell cultures in the study, the population describes the hypothetical (infinite) number of individuals/animals/cell cultures to whom you wish to generalize.

Page 54: Biostatistics for biomedical profession

Random error

Any measurement based on a biological sample, even though the sample is selected by a process of random sampling, will differ from the ’true’ value as a result of random processes.

Page 55: Biostatistics for biomedical profession

Exercise: (You do not have enough information to answer these questions, but you have some information that could help you on the way.)

• Given the previous example of height, would it be possible to draw conclusions regarding the height in the population? The sample contains women only…

• Given the previous example of maternal height, would it be possible to draw conclusions regarding the height among women in the population? The sample contains women who gave birth only…

• Conclusion: Be sure that the method you use for choosing your sample corresponds with the objectives of your study.

Page 56: Biostatistics for biomedical profession

Next Lecture


Repetition, Lecture I and 2:

• Hypothesis testing (error of type I and type II, p-value)

• T-test

• ANOVA