Biostatistics for biomedical profession
Biostatistics for biomedical profession Lecture II BIMM34
Karin Källen & Linda Hartman
November 2015
2015-11-04
Lecture 2
Repetition, Lecture I:
- Types of variables (binary/nominal/ordinal/discrete/continuous)
- Descriptive statistics
  - Central tendency measures (mean, median)
  - Dispersion measures (standard deviation, percentiles)
- Graphical presentation
  - Barplot
  - Histogram
  - Boxplot
Lecture 2 Subject
Normal distribution
Reference interval, Confidence interval, SEM
Population, samples, generalisability
Types of data
Categorical
- Binary/dichotomous: 2 categories
- Nominal: >=2 categories
- Ordinal: order matters
Quantitative
- Discrete: only whole numbers as values
- Continuous: data that can take any value
Central tendency measures
• Median
‐ The middle observation if data are sorted
• Mean
‐ The sum of the observations divided by the number of observations
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
• Mode
‐ The most frequently occurring value
What to do if there is an even number of observations? If the number of observations is even, take the mean of the middle two as the median.
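The three measures above can be sketched with Python's standard `statistics` module. The data values are made up for illustration; they are not from the lecture.

```python
import statistics

# Hypothetical observations, chosen only to illustrate the definitions.
odd_series = [2, 5, 3, 6, 1]        # 5 observations
even_series = [2, 5, 3, 6, 1, 7]    # 6 observations

# Mean: sum of the observations divided by the number of observations.
mean_even = statistics.mean(even_series)

# Median, odd count: the middle observation of the sorted data (1 2 3 5 6).
median_odd = statistics.median(odd_series)

# Median, even count: the mean of the two middle observations (3 and 5).
median_even = statistics.median(even_series)

# Mode: the most frequently occurring value.
mode_value = statistics.mode([1, 2, 2, 3])

print(mean_even, median_odd, median_even, mode_value)
```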
Central tendency, mean or median
The choice depends on the distribution of the data:
• Symmetric data
• Asymmetric data
• Ordinal data
[Figure: a symmetric distribution and an asymmetric distribution (positive skew)]
Central tendency measures, Summary
Type of data       Central tendency measure
Symmetric data     Mean
Asymmetric data    Median
Ordinal            Median
Nominal            -
Measures of dispersion / spread
Describes the variability around the central part of the distribution
[Figure: one distribution with small spread / low variability and one with big spread / high variability]
Measures of dispersion
• Standard deviation – Roughly, the typical deviation from the mean value
• Percentiles & quartiles – Split the data in fixed proportions
• Range – The difference between min and max
Measures of dispersion
• Standard deviation – variability around the mean
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$$
• Variance = (standard deviation)^2
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$$
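As a minimal sketch, the two formulas can be written out directly and checked against the standard library. The function names `sample_variance` and `sample_sd` are illustrative, and the data are hypothetical.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 5, 3, 6, 1, 7]  # hypothetical observations
variance = sample_variance(data)
sd = sample_sd(data)

# statistics.variance and statistics.stdev also divide by N - 1.
print(variance, statistics.variance(data))
print(sd, statistics.stdev(data))
```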
Summary: Summary measures
Type of data       Central tendency measure   Dispersion measure
Symmetric data     Mean                       Standard deviation
Asymmetric data    Median                     Percentiles (e.g. QL and QU)
Ordinal            Median                     Percentiles
Nominal            -                          -
Histogram, bar plot and box plot
Parts of a box plot:
• Lower quartile, QL
• Median
• Upper quartile, QU
• Whiskers: extend to the lowest and the highest "normal" value (inner fences)
• Outliers: values outside the inner fences
Normal value: value within 1.5 x interquartile range of the box
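The box plot quantities can be sketched as follows. This assumes the common convention that the inner fences lie 1.5 x IQR below QL and above QU, and it uses the "inclusive" quartile method of Python's `statistics.quantiles`. Statistical packages differ in how they compute quartiles, so other software may give slightly different values. The function name and the data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, whisker ends and outliers for a box plot."""
    data = sorted(values)
    q_l, median, q_u = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q_u - q_l                      # interquartile range
    lower_fence = q_l - 1.5 * iqr        # inner fences (assumed convention)
    upper_fence = q_u + 1.5 * iqr
    normal = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "QL": q_l,
        "median": median,
        "QU": q_u,
        "whisker_low": normal[0],     # lowest "normal" value
        "whisker_high": normal[-1],   # highest "normal" value
        "outliers": outliers,
    }


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```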
The normal distribution
Everyone recognizes a normal (=Gaussian) distribution, but what is it?
The (perfect) normal distribution
• The mean, median, and mode all have the same value
• The curve is symmetric around the mean; the skewness and the excess kurtosis are 0
• The curve approaches the X-axis asymptotically
• Mean ± 1 SD covers 2*34.1% of data
• Mean ± 2 SD covers approximately 2*47.5% = 95% of data (the exact limits for 95% are -1.96 SD and +1.96 SD)
• Mean ± 3 SD covers 99.7% of data
Exercise: What is the theoretical proportion of babies who have a birth weight between -2 SD and +2 SD, assuming that birth weight is normally distributed?
Approximately 95%
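The coverage figures can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Proportion of a normal distribution within mean +/- k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2, 3):
    print(f"mean +/- {k} SD covers {100 * coverage(k):.1f}% of data")
```

Run this way, ±2 SD comes out slightly above 95%, while ±1.96 SD gives 95% to within rounding, which is why 1.96 is used when constructing 95% intervals.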
To produce a standardized normal distribution
1. Subtract the mean from the absolute values
2. Divide the values by the standard deviation
The resulting values are called standard scores / z-scores
Exercise:
1. Estimate the values corresponding to z-scores of -2 and +2, respectively, for maternal height, based on the following sample measurements: Mean: 165.8 cm, SD: 6.8 cm
2. Estimate the z-score corresponding to a maternal height of 173 cm.
Answers:
• -2 z-scores: 165.8 – 2*6.8 = 152.2 cm
• +2 z-scores: 165.8 + 2*6.8 = 179.4 cm
• (173.0 - 165.8) cm / 6.8 cm ≈ 1.1
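The two directions of the z-score calculation can be sketched as a pair of small functions. The names `z_score` and `value_at_z` are illustrative; the mean and SD are the sample values given in the exercise.

```python
MEAN_HEIGHT_CM = 165.8   # sample mean from the exercise
SD_HEIGHT_CM = 6.8       # sample standard deviation from the exercise


def z_score(value, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (value - mean) / sd


def value_at_z(z, mean, sd):
    """Inverse operation: the measurement that corresponds to a z-score."""
    return mean + z * sd


low = value_at_z(-2, MEAN_HEIGHT_CM, SD_HEIGHT_CM)
high = value_at_z(+2, MEAN_HEIGHT_CM, SD_HEIGHT_CM)
z_173 = z_score(173.0, MEAN_HEIGHT_CM, SD_HEIGHT_CM)

print(f"-2 z-scores: {low:.1f} cm, +2 z-scores: {high:.1f} cm")
print(f"173 cm corresponds to z = {z_173:.1f}")
```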
Maternal height example Continued….
173 cm corresponds to +1.1 SD among Swedish women giving birth
Reference interval
The interval within which a certain percentage (usually 95%) of the values lies is called a reference interval. E.g., in the previous example regarding maternal height, the mean and 95% reference interval (mean ± 2 SD) are 165.8 cm (152.2 – 179.4 cm).
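A reference interval under the normal assumption is simply mean ± z·SD. In the sketch below the function name and the `z` parameter are illustrative: `z=2` reproduces the rounded limits quoted above, and `z=1.96` gives the exact 95% limits.

```python
def reference_interval(mean, sd, z=1.96):
    """Interval expected to contain about 95% of values, assuming normality."""
    return mean - z * sd, mean + z * sd


# Maternal height example: mean 165.8 cm, SD 6.8 cm.
low_2sd, high_2sd = reference_interval(165.8, 6.8, z=2)
low_exact, high_exact = reference_interval(165.8, 6.8)

print(f"mean +/- 2 SD:    {low_2sd:.1f} - {high_2sd:.1f} cm")
print(f"mean +/- 1.96 SD: {low_exact:.1f} - {high_exact:.1f} cm")
```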
Confidence interval
• Often, we would like to know the precision of a certain summary measurement.
• An interval within which a certain parameter lies (with e.g. 95% certainty) is called a confidence interval.
• In order to estimate the confidence interval for a certain mean value, we have to estimate the variance (or the standard deviation) of the mean.
• How to estimate the variation of a certain mean?
Exercise
1. What does the theoretical distribution from rolling a die look like?
2. What is the expected mean from rolling a die 100 times?
3. What does the theoretical distribution from the means of 100 series of dice rolls look like, if each series is based on 100 rolls?
The distribution of 20 throws with a die: (an example from a computer simulation)
Rectangular distribution. Expected mean = (low+high)/2
The distribution of 5000 dice rolls (computer simulation):
Distribution of ten thousand mean values
Each mean value is based on 10 dice rolls
Exercise, continued
1. What does the theoretical distribution from rolling a die look like? Rectangular
2. What is the expected mean from rolling a die 100 times? 3.5
3. What does the theoretical distribution from the means of 100 series of dice rolls look like, if each series is based on 100 rolls? Normal distribution
The distribution of the sample mean
The Central Limit Theorem: If the sample size is large enough, the mean of many random variables, independently drawn from the same distribution, is distributed approximately normally, irrespective of the form of the original distribution.
A crucial limitation that is often neglected:
"Large enough" depends on the distribution that the samples are drawn from.
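The dice simulations shown earlier can be imitated with a short script. This is a sketch, not the simulation used for the lecture figures: the function name, the seed and the use of Python's `random` module are assumptions.

```python
import random
import statistics


def simulate_means(rolls_per_mean, n_means, seed=0):
    """Simulate n_means mean values, each based on rolls_per_mean dice rolls."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_means):
        rolls = [rng.randint(1, 6) for _ in range(rolls_per_mean)]
        means.append(statistics.fmean(rolls))
    return means


means_10 = simulate_means(rolls_per_mean=10, n_means=10_000)
means_100 = simulate_means(rolls_per_mean=100, n_means=10_000)

# Both sets of means centre on the expected value (1 + 6) / 2 = 3.5,
# but the means based on more rolls vary less.
print(statistics.fmean(means_10), statistics.stdev(means_10))
print(statistics.fmean(means_100), statistics.stdev(means_100))
```

Plotting a histogram of `means_10` or `means_100` should give a roughly bell-shaped curve, even though a single roll has a rectangular distribution.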
Estimating the spread of the mean
Exercise: One curve shows the distribution of ten thousand mean values, where each mean is based on 10 rolls. One curve shows the distribution of ten thousand mean values, where each mean is based on 100 rolls. Which is which?
_______ Means based on 10 rolls
_ _ _ _ _ Means based on 1000 rolls
Thus, we know that we can estimate the spread of the mean. This is called the
Standard error of the mean (SEM)
Let
s = the population standard deviation
n = sample size
Then
$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$
Cotinine is the main metabolite of nicotine. The graph shows the cotinine levels in the umbilical artery.
Mean = 31.2 Median=0.2
Normal distribution?
Cotinine levels in the umbilical artery, continued
• By looking at the histogram, it is obvious that the cotinine levels are not normally distributed.
• Besides visual inspection and a comparison between mean and median, there are other methods to decide whether the distribution is normal or not.
Child cotinine levels, continued…
Exercise: Normal distribution? NO!
Difference between distribution in the population, and distribution of means of samples, by sample size
[Figure: the population distribution, and the distributions of means from 1000 samples for sample sizes N=10, N=50 and N=100]
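Let m = mean(X), s = Std(X) in the population.
Exercise: If $\bar{X}$ is based on N observations, calculate the expected
• mean($\bar{X}$) (= mean of means)
• Std($\bar{X}$) (= standard deviation of means = standard error of the mean = SEM)
in the 3 histograms.
Population: Mean = 31.2 u, Median = 0.2 u, s = 74.4 u
The expected mean of means equals the population mean, m = 31.2 u, for every sample size.
$$\mathrm{SEM} = \frac{s}{\sqrt{N}}$$
N=10: SEM = 23.5 u
N=50: SEM = 10.5 u
N=100: SEM = 7.4 u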
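The three SEM values follow directly from s/√N. A minimal sketch, with an illustrative function name:

```python
import math


def standard_error_of_mean(sd, n):
    """SEM = standard deviation divided by the square root of the sample size."""
    return sd / math.sqrt(n)


POPULATION_SD = 74.4  # cotinine example, in the unit "u" used above

sem_by_n = {n: standard_error_of_mean(POPULATION_SD, n) for n in (10, 50, 100)}
for n, sem in sem_by_n.items():
    print(f"N={n}: SEM = {sem:.1f} u")
```

Because of the square root, a tenfold larger sample shrinks the SEM by a factor of about 3.2, not 10.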
Confidence interval and Reference interval
A confidence interval tells us within which interval the 'true' value of a parameter probably lies.
E.g., a 95% confidence interval tells us between which limits the 'true' value (with 95% certainty) lies.
Repetition: In a normal distribution, 95% of the data will lie within the mean +/- 2 SD (1.96 exactly).
A large sample 95% CI for the mean could be constructed as: (mean-1.96*SEM to mean+1.96*SEM)
Large sample Confidence interval for the mean
Exercise: Assume that you have got the following estimates for maternal height from a sample. Mean: 165.8 cm, SD: 6.8 cm. Construct a 95% CI for the mean if…
1. the sample is based on 300 observations
2. the sample is based on 10 000 observations
Answers:
• Mean (95% CI) = Mean +/- 1.96*SEM
• 300 observations: Mean (95% CI) = 165.8 (165.0-166.6)
• 10 000 observations: Mean (95% CI) = 165.8 (165.7-165.9)
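The two answers can be reproduced with a short function. This is a sketch of the large-sample formula only (mean ± 1.96·SEM); the function name is illustrative.

```python
import math


def large_sample_ci(mean, sd, n, z=1.96):
    """95% CI for the mean when n is large: mean +/- z * SEM."""
    sem = sd / math.sqrt(n)
    return mean - z * sem, mean + z * sem


ci_300 = large_sample_ci(165.8, 6.8, 300)
ci_10000 = large_sample_ci(165.8, 6.8, 10_000)

print(f"n=300:    165.8 ({ci_300[0]:.1f}-{ci_300[1]:.1f})")
print(f"n=10000:  165.8 ({ci_10000[0]:.1f}-{ci_10000[1]:.1f})")
```

The mean and SD are the same in both cases; only the sample size changes, and the larger sample gives the narrower interval.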
The distribution of the mean, if the mean is based on (at least) ≈100 observations (from the same distribution, not necessarily normal):
For n>=100, the T-distribution ≈ the Normal distribution (sometimes called Z-distribution)
The distribution of the mean, if the mean is based on 7 observations from a normal distribution:
T-distribution, Df="Degrees of freedom" = n-1 = 6
The distribution of the mean, if the mean is based on 4 observations from a normal distribution:
n=4, T-distribution with Df="Degrees of freedom" = n-1 = 3
n=7, T-distribution with Df="Degrees of freedom" = n-1 = 6
n>=100, the T-distribution ≈ the Normal distribution (sometimes called Z-distribution)
Confidence interval for the mean
• Repetition: For a large sample, 95% CI = $\left(\bar{X} - 1.96\cdot\frac{s}{\sqrt{N}} \text{ to } \bar{X} + 1.96\cdot\frac{s}{\sqrt{N}}\right)$
• For small samples we must account for uncertainty in the estimate of the standard deviation $s$
• 95% CI = $\left(\bar{X} - t_{N-1}\cdot\frac{s}{\sqrt{N}} \text{ to } \bar{X} + t_{N-1}\cdot\frac{s}{\sqrt{N}}\right)$
N-1 = "degrees of freedom"

Degrees of freedom   t-constant for 95% CI
5                    2.57
9                    2.26
19                   2.09
29                   2.05
49                   2.01
99                   1.98
∞                    1.96
Exercise: Construct a 95% confidence interval for the mean of the following series: 2 5 3 6 1 7
• Mean = 24/6 = 4
• s = $\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$ = √(28/5) = √5.6 ≈ 2.4
• SEM = s/√n = √5.6/√6 ≈ 0.97
• t-constant for 5 degrees of freedom: t' = 2.57
• Upper 95% CL: mean + t'*SEM: 4 + 2.57*0.97 ≈ 6.5
• Lower 95% CL: mean – t'*SEM: 4 - 2.57*0.97 ≈ 1.5
• Mean (95% CI): 4.0 (1.5 - 6.5)
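The worked exercise can be retraced step by step in code. This sketch looks the t-constant up in a small dictionary holding a few values from the table of t-constants. The dictionary, the function name and the restriction to those sample sizes are assumptions made for illustration; a statistics package would normally supply the t quantile.

```python
import math
import statistics

# A few t-constants for a 95% CI, keyed by degrees of freedom (N - 1).
T_CONSTANT_95 = {5: 2.57, 9: 2.26, 19: 2.09}


def small_sample_ci(values):
    """Return (mean, lower limit, upper limit) of a t-based 95% CI."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)        # divides by N - 1
    sem = sd / math.sqrt(n)
    t = T_CONSTANT_95[n - 1]             # KeyError if df is not in the table
    return mean, mean - t * sem, mean + t * sem


mean, lower, upper = small_sample_ci([2, 5, 3, 6, 1, 7])
print(f"Mean (95% CI): {mean:.1f} ({lower:.1f} - {upper:.1f})")
```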
Reference intervals and confidence intervals
The reference interval is the interval within which 95% of the population (or values) lies.
Example: Lower limit: mean – 1.96 * s, Upper limit: mean + 1.96 * s
A 95% confidence interval tells us between which limits the 'true' mean (with 95% certainty) lies:
(Large sample: mean-1.96*SEM to mean+1.96*SEM) (SEM = s/√n)
Exercise: The middle line shows the mean head circumference by gestational age at birth. Does the distance between the upper and lower lines represent the reference interval, or the confidence interval? The graph is based on approx. 1 000 000 births.
[Figure: head circumference (cm) by gestational age (completed weeks), mean and 95% reference interval]
[Figure: head circumference (cm) by gestational age (completed weeks), mean with 95% CI]
The 95% CI is often represented by error bars
Statistical inference: Is there a true difference between the means?
Generalisability
Population
Sample
Inferential statistics
Descriptive statistics
Inferential statistics are used to determine the probability (or likelihood) that a conclusion based on analysis of data from a sample is true.
The sample describes the individuals/animals/cell cultures in the study, the population describes the hypothetical (infinite) number of individuals/animals/cell cultures to whom you wish to generalize.
Random error
Any measurement based on a biological sample, even though the sample is selected by a process of random sampling, will differ from the 'true' value as a result of random variation.
Exercise: (You do not have enough information to answer these questions, but you have some information that could help you on the way.)
• Given the previous example of maternal height, would it be possible to draw conclusions regarding the height in the population? The sample contains women only…
• Given the previous example of maternal height, would it be possible to draw conclusions regarding the height among women in the population? The sample contains women who gave birth only…
• Conclusion: Be sure that the method you use for choosing your sample corresponds with the objectives of your study.
Next Lecture
Repetition, Lecture I and 2:
• Hypothesis testing (error of type I and type II, p-value)
• T-test
• ANOVA