Biostatistics for biomedical profession

Biostatistics for biomedical profession Lecture II BIMM34

Karin Källen & Linda Hartman

November 2015

2015-11-04

Lecture 2

Repetition, Lecture I:
- Types of variables (binary/nominal/ordinal/discrete/continuous)
- Descriptive statistics
  - Central tendency measures (mean, median)
  - Dispersion measures (standard deviation, percentiles)
- Graphical presentation
  - Barplot
  - Histogram
  - Boxplot

Lecture 2 Subject
- Normal distribution
- Reference interval, Confidence interval, SEM
- Population, samples, generalisability


Types of data

• Categorical
  – Binary/dichotomous: 2 categories
  – Nominal: >=2 categories
  – Ordinal: order matters
• Quantitative
  – Discrete: only whole numbers as values
  – Continuous: data that can take any value





Central tendency measures

• Median
  – The middle observation if data are sorted
• Mean
  – The sum of the observations divided by the number of observations
• Mode
  – The most frequently occurring value

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

What to do if there is an even number of observations? If the number of observations is even, take the mean of the middle two.
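The three measures above can be tried out on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the numbers are made up for illustration and are not taken from the lecture.

```python
import statistics

# Hypothetical observations (illustration only, not lecture data)
values_odd = [2, 5, 3, 6, 1, 7, 5]   # 7 observations
values_even = [2, 5, 3, 6, 1, 7]     # 6 observations

mean_odd = statistics.mean(values_odd)        # sum divided by the number of observations
median_odd = statistics.median(values_odd)    # middle observation of the sorted data
median_even = statistics.median(values_even)  # mean of the two middle observations
mode_odd = statistics.mode(values_odd)        # most frequently occurring value

print("mean:", round(mean_odd, 2))
print("median (odd N):", median_odd)
print("median (even N):", median_even)
print("mode:", mode_odd)
```

With an even number of observations, `statistics.median` averages the two middle values of the sorted data, which is the rule stated on the slide.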

Central tendency: mean or median?

The choice depends on the distribution of the data:
• Symmetric data
• Asymmetric data
• Ordinal data

[Figure: a symmetric distribution and an asymmetric distribution (positive skew)]

Central tendency measures, Summary

Type of data      Central tendency measure
Symmetric data    Mean
Asymmetric data   Median
Ordinal           Median
Nominal           -

Measures of dispersion / spread

Describes the variability around the central part of the distribution

[Figure: two distributions, one with small spread / low variability and one with big spread / high variability]

Measures of dispersion

• Standard deviation
  – The typical (root-mean-square) deviation from the mean value
• Percentiles & quartiles
  – Split the data in fixed proportions
• Range
  – The difference between min and max

Measures of dispersion

• Standard deviation – variability around the mean

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$$

• Variance = (standard deviation)²

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$$

Summary: Summary measures

Type of data      Central tendency measure   Dispersion measure
Symmetric data    Mean                       Standard deviation
Asymmetric data   Median                     Percentiles (e.g. QL and QU)
Ordinal           Median                     Percentiles
Nominal           -                          -
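As a rough illustration of the dispersion measures in the table, the sketch below computes them with Python's standard `statistics` module on a made-up series. Note that several conventions exist for computing quartiles; the `method="inclusive"` choice here is an assumption, and statistical packages may give slightly different quartile values.

```python
import statistics

# Hypothetical observations (illustration only)
values = [2, 5, 3, 6, 1, 7]

s = statistics.stdev(values)            # sample standard deviation (divides by N - 1)
variance = statistics.variance(values)  # sample variance = s squared

# Quartiles: lower quartile (QL), median, upper quartile (QU)
q_l, q_m, q_u = statistics.quantiles(values, n=4, method="inclusive")

value_range = max(values) - min(values)  # difference between max and min

print("standard deviation:", round(s, 2))
print("variance:", round(variance, 2))
print("QL, median, QU:", q_l, q_m, q_u)
print("range:", value_range)
```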


Histogram, bar plot, box plot

[Figure: box plot annotated with: lowest "normal" value (end of whisker), lower quartile QL, median, upper quartile QU, highest "normal" value (inner fence, end of whisker), outliers]

Normal value: a value no more than 1.5 x interquartile range below QL or above QU
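The rule for "normal" values in a box plot can be sketched in a few lines of Python. The data and the helper name `inner_fences` are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; plotting software may compute quartiles slightly differently.

```python
import statistics

def inner_fences(data):
    """Return (lower fence, upper fence): QL - 1.5*IQR and QU + 1.5*IQR."""
    q_l, _, q_u = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q_u - q_l  # interquartile range
    return q_l - 1.5 * iqr, q_u + 1.5 * iqr

# Hypothetical observations with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

low, high = inner_fences(data)
outliers = [x for x in data if x < low or x > high]
normal_values = [x for x in data if low <= x <= high]

# The whiskers end at the lowest and highest "normal" values
print("whiskers:", min(normal_values), "to", max(normal_values))
print("outliers:", outliers)
```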




The normal distribution

Everyone recognizes a normal (= Gaussian) distribution, but what is it?

[Figure: normal distribution curve. Image source: http://www.statlect.com/ucdnrm1.htm]

The (perfect) normal distribution

• The mean, median, and mode all have the same value

• The curve is symmetric around the mean; the skewness and (excess) kurtosis are 0

• The curve approaches the X-axis asymptotically

• Mean ± 1 SD covers 2*34.1% = 68.2% of data
• Mean ± 2 SD covers about 95% of data (2*47.5% = 95% lies within ±1.96 SD exactly)
• Mean ± 3 SD covers 99.7% of data


Exercise: What is the theoretical proportion of babies who have a birth weight between -2 SD and +2 SD, assuming that birth weight is normally distributed?

Answer: about 95%
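The coverage figures for the normal distribution can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `coverage` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def coverage(k):
    """Proportion of a normal distribution lying within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"mean ± {k} SD covers {100 * coverage(k):.1f}% of the data")
```

Run as written, this shows that ±1.96 SD gives 95.0% while ±2 SD gives slightly more (about 95.4%), which is why "2 SD" is used as a convenient approximation of 1.96 SD.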

To produce a standardized normal distribution

1. Subtract the mean from the original (raw) values

2. Divide the resulting values by the standard deviation


Standard scores / z-scores

Exercise:
1. Estimate the values corresponding to z-scores of -2 and +2, respectively, for maternal height, based on the following sample measurements: Mean: 165.8 cm, SD: 6.8 cm
2. Estimate the z-score corresponding to a maternal height of 173 cm.

Answers:
1. • -2 z-scores: 165.8 – 2*6.8 = 152.2 cm • +2 z-scores: 165.8 + 2*6.8 = 179.4 cm
2. • (173.0 - 165.8) cm / 6.8 cm ≈ 1.1
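The two standardization steps, and their inverse, can be written as a pair of small functions. This is only a sketch: the function names `z_score` and `value_from_z` are made up, and the numbers are the maternal height estimates from the exercise.

```python
def z_score(value, mean, sd):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    return (value - mean) / sd

def value_from_z(z, mean, sd):
    """Inverse operation: the raw value that corresponds to a given z-score."""
    return mean + z * sd

MEAN_CM = 165.8  # sample mean of maternal height (from the exercise)
SD_CM = 6.8      # sample standard deviation (from the exercise)

lower = value_from_z(-2, MEAN_CM, SD_CM)
upper = value_from_z(+2, MEAN_CM, SD_CM)
z_173 = z_score(173.0, MEAN_CM, SD_CM)

print(f"z = -2 corresponds to {lower:.1f} cm")
print(f"z = +2 corresponds to {upper:.1f} cm")
print(f"173 cm corresponds to z = {z_173:.1f}")
```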

Maternal height example, continued

173 cm corresponds to +1.1 SD among Swedish women giving birth




Reference interval

The interval within which a certain percentage (usually 95%) of the values lies is called a reference interval. E.g., in the previous example regarding maternal height, the mean and 95% reference interval are 165.8 cm (152.2 – 179.4 cm).

Confidence interval

• Often, we would like to know the precision of a certain summary measure.

• An interval within which a certain parameter lies (with e.g. 95% certainty) is called a confidence interval.

• In order to estimate the confidence interval for a certain mean value, we have to estimate the variance (or the standard deviation) of the mean.

• How to estimate the variation of a certain mean?

Exercise

1. What does the theoretical distribution from rolling a die look like?

2. What is the expected mean from rolling a die 100 times?

3. What does the theoretical distribution of the means of 100 series of dice rolls look like, if each series is based on 100 rolls?

The distribution of 20 throws of a die (an example from a computer simulation):

Rectangular distribution
Expected mean = (low + high)/2 = (1 + 6)/2 = 3.5

The distribution of 5000 dice rolls (computer simulation):

Distribution of ten thousand mean values

Each mean value is based on 10 dice rolls
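A simulation of this kind is easy to reproduce. The sketch below is not the lecturers' simulation; it is a minimal stand-in using Python's `random` module with an arbitrary fixed seed, so exact numbers will differ from the slides.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (arbitrary choice) so the run is repeatable

def mean_of_rolls(n_rolls):
    """Roll a fair six-sided die n_rolls times and return the mean."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n_rolls))

# 5000 single rolls: roughly rectangular (uniform) over 1..6
single_rolls = [rng.randint(1, 6) for _ in range(5000)]
counts = {face: single_rolls.count(face) for face in range(1, 7)}

# Ten thousand mean values, each based on 10 rolls: roughly bell-shaped
means_of_10 = [mean_of_rolls(10) for _ in range(10_000)]

print("counts per face:", counts)
print("mean of the means:", round(statistics.fmean(means_of_10), 2))
print("SD of the means:", round(statistics.stdev(means_of_10), 2))
```

The single rolls are spread evenly over the six faces, while the means cluster around 3.5 with a much smaller spread than the individual rolls.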

Exercise… continued

1. What does the theoretical distribution from rolling a die look like?
   Rectangular

2. What is the expected mean from rolling a die 100 times?
   3.5

3. What does the theoretical distribution of the means of 100 series of dice rolls look like, if each series is based on 100 rolls?
   Normal distribution

The distribution of the sample mean

The Central Limit Theorem: If the sample size is large enough, the mean of many random variables, independently drawn from the same distribution, is distributed approximately normally, irrespective of the form of the original distribution.

A crucial limitation that is often neglected: "large enough" depends on the distribution that the samples are drawn from.

http://en.wikipedia.org/wiki/Mean

http://en.wikipedia.org/wiki/Random_variables

Estimating the spread of the mean

Exercise: One curve shows the distribution of ten thousand mean values, where each mean is based on 10 rolls. One curve shows the distribution of ten thousand mean values, where each mean is based on 100 rolls. Which is which?

[Figure legend: solid line: means based on 10 rolls; dashed line: means based on 1000 rolls]

Thus, we know that we can estimate the spread of the mean. This is called the standard error of the mean (SEM).

Standard error of the mean (SEM)

Let
s = the population standard deviation
n = sample size

Then

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$

Cotinine is the main metabolite of nicotine. The graph shows the cotinine levels in the umbilical artery.

Mean = 31.2, Median = 0.2

Normal distribution?

Cotinine levels in the umbilical artery, continued

• By looking at the histogram, it is obvious that the cotinine levels are not normally distributed.

• Besides visual inspection and a comparison between mean and median, there are other methods to decide whether the distribution is normal or not.

Child cotinine levels, continued…

Exercise: Normal distribution? NO!

Difference between the distribution in the population and the distribution of sample means, by sample size

[Figure: histogram of the population distribution, followed by histograms of the distribution of means from 1000 samples, for sample sizes N=10, N=50 and N=100]


Let m = mean(X), s = Std(X) in the population.

Exercise: If $\bar{X}$ is based on N observations, calculate the expected
• mean($\bar{X}$) (= mean of means)
• Std($\bar{X}$) (= standard deviation of means = standard error of the mean = SEM)
in the 3 histograms.

Population: Mean = 31.2 u, Median = 0.2 u, s = 74.4 u

$$\mathrm{SEM} = \frac{s}{\sqrt{N}}$$

• N=10: SEM = 23.5 u
• N=50: SEM = 10.5 u
• N=100: SEM = 7.4 u
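The three SEM values follow directly from SEM = s/√N with s = 74.4 u. A minimal sketch, with the function name `sem` chosen here for illustration:

```python
import math

def sem(s, n):
    """Standard error of the mean: standard deviation divided by sqrt(sample size)."""
    return s / math.sqrt(n)

S_COTININE = 74.4  # standard deviation of cotinine in the population (from the slide)

sem_by_n = {n: sem(S_COTININE, n) for n in (10, 50, 100)}

for n, value in sem_by_n.items():
    print(f"N = {n:3d}: SEM = {value:.1f} u")
```

Because of the square root, the sample size must be multiplied by 100 to reduce the SEM by a factor of 10.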

Confidence interval and Reference interval

A confidence interval tells us within which interval the 'true' value of a parameter probably lies.

E.g., a 95% confidence interval tells us between which limits the 'true' value (with 95% certainty) lies.

Repetition: 95% of the data will lie within +/- 2 SD of the mean (1.96 exactly).

A large-sample 95% CI for the mean could be constructed as: (mean - 1.96*SEM to mean + 1.96*SEM)

Large sample Confidence interval for the mean

Exercise: Assume that you have got the following estimates for maternal height from a sample. Mean: 165.8 cm, SD: 6.8 cm. Construct a 95% CI for the mean if
1. the sample is based on 300 observations
2. the sample is based on 10 000 observations

• Mean (95% CI) = Mean +/- 1.96*SEM
• 300 observations: Mean (95% CI) = 165.8 (165.0 - 166.6)
• 10 000 observations: Mean (95% CI) = 165.8 (165.7 - 165.9)
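The exercise can be reproduced with a short function. This is a sketch of the large-sample formula only (mean ± 1.96 × SEM); the function name `large_sample_ci` is made up for illustration.

```python
import math

def large_sample_ci(mean, sd, n, z=1.96):
    """Large-sample 95% CI for the mean: mean ± z * SEM, with SEM = sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

MEAN_CM = 165.8  # sample mean of maternal height (from the exercise)
SD_CM = 6.8      # sample standard deviation (from the exercise)

ci_300 = large_sample_ci(MEAN_CM, SD_CM, 300)
ci_10000 = large_sample_ci(MEAN_CM, SD_CM, 10_000)

print(f"n = 300:    {MEAN_CM} ({ci_300[0]:.1f} - {ci_300[1]:.1f})")
print(f"n = 10 000: {MEAN_CM} ({ci_10000[0]:.1f} - {ci_10000[1]:.1f})")
```

The mean and SD are the same in both cases; only the sample size changes, and the larger sample gives the narrower interval.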

The distribution of the mean, if the mean is based on (at least) ≈100 observations (from the same distribution, not necessarily normal):

For n >= 100, the t-distribution ≈ the normal distribution (sometimes called the Z-distribution)

The distribution of the mean, if the mean is based on 7 observations from a normal distribution:

t-distribution with Df = "degrees of freedom" = n-1 = 6


The distribution of the mean, if the mean is based on 4 observations from a normal distribution:

n=4, t-distribution with Df = "degrees of freedom" = n-1 = 3

n=7, t-distribution with Df = "degrees of freedom" = n-1 = 6


Confidence interval for the mean

• Repetition: For a large sample

$$95\%\ \mathrm{CI} = \left(\bar{X} - 1.96\cdot\frac{s}{\sqrt{N}}\ \text{to}\ \bar{X} + 1.96\cdot\frac{s}{\sqrt{N}}\right)$$

• For small samples we must account for uncertainty in the estimate of the standard deviation $s$

$$95\%\ \mathrm{CI} = \left(\bar{X} - t_{N-1}\cdot\frac{s}{\sqrt{N}}\ \text{to}\ \bar{X} + t_{N-1}\cdot\frac{s}{\sqrt{N}}\right)$$

N-1 = "degrees of freedom"

Degrees of freedom   t-constant for 95% CI
5                    2.57
9                    2.26
19                   2.09
29                   2.05
49                   2.01
99                   1.98
∞                    1.96

Exercise: Construct a 95% confidence interval for the mean of the following series: 2 5 3 6 1 7

• Mean = 24/6 = 4

• $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2} = \sqrt{28/5} = \sqrt{5.6} \approx 2.37$

• SEM = s/√n = 2.37/√6 ≈ 0.97

• t' = 2.57 (t-constant for N-1 = 5 degrees of freedom)

• Upper 95% CL: mean + t'*SEM: 4 + 2.57*0.97 ≈ 6.5
• Lower 95% CL: mean – t'*SEM: 4 - 2.57*0.97 ≈ 1.5

• Mean (95% CI): 4.0 (1.5 - 6.5)
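The worked exercise can be traced step by step in code. This sketch uses only the Python standard library, so the t-constant is not computed but copied from the lecture's table (2.57 for 5 degrees of freedom); the dictionary `T_95` and the function name `small_sample_ci` are made up for illustration.

```python
import math
import statistics

# t-constants for a 95% CI, copied from the table (keys: degrees of freedom)
T_95 = {5: 2.57, 9: 2.26, 19: 2.09}

def small_sample_ci(values):
    """95% CI for the mean using the t-constant for N - 1 degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample SD, divides by N - 1
    sem = s / math.sqrt(n)         # standard error of the mean
    t = T_95[n - 1]                # raises KeyError if df is not in the table
    return mean - t * sem, mean + t * sem

series = [2, 5, 3, 6, 1, 7]
lower, upper = small_sample_ci(series)

print(f"mean = {statistics.mean(series):.1f}")
print(f"95% CI: {lower:.1f} - {upper:.1f}")
```

Using 1.96 instead of 2.57 here would give a noticeably narrower interval, which would overstate the precision of a mean based on only six observations.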


Reference intervals and confidence intervals

The reference interval reflects the interval within which 95% of the population (or values) lies

Example: Lower limit: mean – 1.96 * s; Upper limit: mean + 1.96 * s

A 95% confidence interval tells us between which limits the 'true' mean (with 95% certainty) lies:

(Large sample: mean - 1.96*SEM to mean + 1.96*SEM) (SEM = s/√n)
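The difference between the two intervals shows up clearly when the sample size changes. The sketch below reuses the maternal height estimates from earlier in the lecture (mean 165.8 cm, SD 6.8 cm); the function names are made up for illustration, and the large-sample formula is assumed throughout.

```python
import math

def reference_interval(mean, sd, z=1.96):
    """Interval expected to contain 95% of individual values: mean ± z * SD."""
    return mean - z * sd, mean + z * sd

def confidence_interval(mean, sd, n, z=1.96):
    """Large-sample 95% CI for the mean: mean ± z * SD / sqrt(n)."""
    sem = sd / math.sqrt(n)
    return mean - z * sem, mean + z * sem

MEAN_CM, SD_CM = 165.8, 6.8  # maternal height estimates from the lecture

ri = reference_interval(MEAN_CM, SD_CM)
ci_by_n = {n: confidence_interval(MEAN_CM, SD_CM, n) for n in (300, 10_000)}

print(f"reference interval: {ri[0]:.1f} - {ri[1]:.1f} cm (does not depend on n)")
for n, (low, high) in ci_by_n.items():
    print(f"95% CI, n = {n}: {low:.1f} - {high:.1f} cm")
```

The reference interval describes the spread of individuals and stays the same however many are measured, while the confidence interval describes the uncertainty of the mean and shrinks as n grows.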

Exercise: The middle line shows the mean head circumference by gestational age at birth. Does the distance between the upper and lower lines represent the reference interval or the confidence interval? The graph is based on approx. 1 000 000 births.

[Figure: head circumference (cm) against gestational age (completed weeks). Caption: Mean and 95% reference interval]

[Figure: head circumference (cm) against gestational age (completed weeks). Caption: Mean with 95% CI]

The 95% CIs are often represented by error bars

Statistical inference: Is there a true difference between the means?




Generalisability

[Diagram labels: Population, Sample, Inferential statistics, Descriptive statistics]

Inferential statistics…

… are used to determine the probability (or likelihood) that a conclusion based on analysis of data from a sample is true.

The sample describes the individuals/animals/cell cultures in the study; the population describes the hypothetical (infinite) number of individuals/animals/cell cultures to whom you wish to generalize.

Random error

Any measurement based on a biological sample, even if the sample is selected by a process of random sampling, will differ from the 'true' value as a result of random processes.

Exercise: (You do not have enough information to answer these questions, but you have some information that could help you on the way.)

• Given the previous example of height, would it be possible to draw conclusions regarding the height in the population?
  – The sample contains women only…

• Given the previous example of maternal height, would it be possible to draw conclusions regarding the height among women in the population?
  – The sample contains women who gave birth only…

• Conclusion: Be sure that the method you use for choosing your sample corresponds with the objectives of your study.

Next Lecture

• Repetition, Lecture I and II

• Hypothesis testing (type I and type II errors, p-value)

• T-test

• ANOVA
