Biostatistics in Practice
![Page 1: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/1.jpg)
Biostatistics in Practice
Peter D. Christenson, Biostatistician
http://gcrc.LABioMed.org/Biostat
Session 2: Summarization of Quantitative Information
![Page 2: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/2.jpg)
Topics for this Session
Experimental Units
Independence of Measurements
Graphs: Summarizing Results
Graphs: Aids for Analysis
Summary Measures
Confidence Intervals
Prediction Intervals
![Page 3: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/3.jpg)
Most Practical from this Session
Geometric Means
Confidence Intervals
Reference Ranges
Justify Methods from Graphs
![Page 4: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/4.jpg)
Experimental Units
Independence of Measurements
![Page 5: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/5.jpg)
Statistical Independence
Experimental units are the smallest independent entities for addressing a scientific question in an analysis of an experiment.
“Independent” refers to the measurement that is made and the question, not the units.
Definition: If knowledge of the value for a unit does not provide information about another unit’s value, given other factors (and the overall mean) in the analysis of the experiment, then the units are independent for this measurement.
There may be a hierarchy of units.
![Page 6: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/6.jpg)
Importance of Independence
Many basic statistical methods require that measurements are independent for the analysis to be valid.
Other methods can incorporate the lack of independence.
There can be some subjectivity regarding independence. Statistical methods use models. Models can be wrong.
![Page 7: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/7.jpg)
Example: Units and Independence
Ten mice receive treatment A, each is bled, and blood samples are each divided into 3 aliquots. The same is done for 10 mice on treatment B.
1. A serum hormone is measured in the 60 aliquots and compared between A and B.
The aliquots for a mouse are not independent.
The unit is a mouse.
A summary statistic from each mouse’s 3 aliquots (e.g., maximum or mean) is independent of the other mice’s summaries.
N=10 and 10, not 30 and 30.
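The point above can be sketched in code: collapse each mouse's aliquots to one summary value before comparing treatments, so that each mouse contributes a single data point. The mouse IDs and hormone readings below are invented for illustration only, and the helper name `per_mouse_summary` is hypothetical.

```python
from statistics import mean

# Hypothetical hormone readings: 3 aliquots per mouse (values invented for illustration).
aliquots_by_mouse = {
    "A01": [4.1, 4.3, 4.0],
    "A02": [5.2, 5.0, 5.1],
    "B01": [6.3, 6.1, 6.4],
    "B02": [5.8, 6.0, 5.9],
}

def per_mouse_summary(readings, summarize=mean):
    """Collapse each mouse's aliquots to one value (one data point per experimental unit)."""
    return {mouse: summarize(values) for mouse, values in readings.items()}

mouse_means = per_mouse_summary(aliquots_by_mouse)

# Group sizes count mice, not aliquots.
group_a = [value for mouse, value in mouse_means.items() if mouse.startswith("A")]
group_b = [value for mouse, value in mouse_means.items() if mouse.startswith("B")]
print(len(group_a), len(group_b))
```

Passing `summarize=max` would use the maximum instead of the mean, the other summary mentioned above.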
![Page 8: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/8.jpg)
Example, Continued
2. One of the 30 A aliquots is further divided into 25 parts and 5 different in vitro challenges are each made to a random set of 5 of the parts. The same is done for a single B aliquot.
For this challenge experiment, each part is a unit, the values of challenge response are independent, and N=25+25.
For comparing A and B, there are only N=1+1 experimental units, the two mice.
![Page 9: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/9.jpg)
Experimental Units in Case Study
![Page 10: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/10.jpg)
Experimental Units in Case Study
There is a nested hierarchy of several "levels" of data: Schools, children within the schools, and diets received by every child. What would you use for the "N" for this study?
Which outcomes do you intuitively think are correlated (in common language)? Results from one child's three diets? Results from children in the same school? Schools?
![Page 11: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/11.jpg)
Experimental Units in Case Study
N = Number of children
Results from one child's three diets cannot be modeled as independent.
Results from children in the same school also could be “correlated” (dependent). They can be modeled as independent, if the effect of school is included in the analysis. Knowing one child’s score and the school mean gives no info on another child’s score.
![Page 12: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/12.jpg)
Units and Analysis in the Case Study
N = Number of children
Analysis:
This method is a complex generalization of methods we discuss in Session 3.
For any method, though, you need to inform the software of the correct experimental units. For some experiments, it is obvious and implicit.
![Page 13: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/13.jpg)
Graphs:
Summarizing Results
![Page 14: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/14.jpg)
Common Graphical Summaries
| Graph Name | Y-axis | X-axis |
| --- | --- | --- |
| Histogram | Count or % | Category |
| Scatterplot | Continuous | Continuous |
| Dot Plot | Continuous | Category |
| Box Plot | Percentiles | Category |
| Line Plot | Mean or value | Category |
| Kaplan-Meier | Probability | Time |
Many of the examples are from StatisticalPractice.com
![Page 15: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/15.jpg)
Data Graphical Displays
Histogram: Summarized*
Scatter plot: Raw Data
* The raw-data version of a histogram is a stem-and-leaf plot. We will see one later.
![Page 16: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/16.jpg)
Data Graphical Displays
Dot Plot: Raw Data
Box Plot: Summarized
![Page 17: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/17.jpg)
Data Graphical Displays
Line or Profile Plot
Summarized - bars can represent various types of ranges
![Page 18: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/18.jpg)
Data Graphical Displays
Kaplan-Meier Plot
[Figure: Kaplan-Meier survival estimate. Y-axis: Survival Probability (0.00 to 1.00); X-axis: Years (0 to 20).]
This is not necessarily 35% of subjects
Probability of Surviving 5 years is 0.35
![Page 19: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/19.jpg)
Graphs:
Aids for Analysis
![Page 20: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/20.jpg)
Graphical Aids for Analysis
Most statistical analyses involve modeling.
Parametric methods (t-test, ANOVA, χ²) have stronger requirements than non-parametric (rank-based) methods.
Every method is based on data satisfying certain requirements.
Many of these requirements can be assessed with some useful common graphics.
![Page 21: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/21.jpg)
Look at the Data for Analysis Requirements
What do we look for?
In Histograms (one variable):
Ideal: Symmetric, bell-shaped.
Potential Problems:
• Skewness.
• Multiple peaks.
• Many values at, say, 0, and bell-shaped otherwise.
• Outliers.
![Page 22: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/22.jpg)
Example Histogram: OK for Typical* Analyses
• Symmetric.
• One peak.
• Roughly bell-shaped.
• No outliers.
*Typical: mean, SD, confidence intervals, to be discussed in later slides.
![Page 23: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/23.jpg)
Histograms: Not OK for Typical Analyses
Skewed
[Figure: histogram of Intensity (x-axis 0 to 8) against Frequency (0 to 150).]
Need to transform intensity to another scale, e.g. Log(intensity).
Multi-Peak
[Figure: histogram of Tumor Volume (x-axis 20 to 120) against Frequency (0 to 20).]
Need to summarize with percentiles, not mean.
![Page 24: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/24.jpg)
Histograms: Not OK for Typical Analyses
Truncated Values
[Figure: histogram of Assay Result (x-axis 0 to 10) against Frequency (0 to 60), with the LLOQ marked. Undetectable in 28 samples (<LLOQ).]
Need to use percentiles for most analyses.
Outliers
[Figure: histogram of Expression LogRatio (x-axis 0 to 8) against Frequency (0 to 100).]
Need to use median, not mean, and percentiles.
![Page 25: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/25.jpg)
Look at the Data for Analysis Requirements
What do we look for?
In Scatter Plots (two variables):
Ideal: Football-shaped; ellipse.
Potential Problems:
• Outliers.
• Funnel-shaped.
• Gap with no values for one or both variables.
![Page 26: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/26.jpg)
Example Scatter Plot: OK for Typical Analyses
![Page 27: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/27.jpg)
Scatter Plot: Not OK for Typical Analyses
Gap and Outlier
[Figure: scatter plot of nRBC Count (y-axis 0 to 150) against EPO (x-axis 0 to 400).]
All Subjects: r = 0.54 (95% CI: 0.27 to 0.73), p = 0.0004
EPO < 150: r = 0.23 (95% CI: -0.11 to 0.52), p = 0.17
EPO > 300: r = -0.04 (95% CI: -0.96 to 0.96), p = 0.96
Consider analyzing subgroups.
Funnel-Shaped
Should transform y-value to another scale, e.g. logarithm.
Ott, Amer J Obstet Gyn 2005;192:1803-9. Ferber et al, Amer J Obstet Gyn 2004;190:1473-5.
![Page 28: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/28.jpg)
Summary Measures
![Page 29: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/29.jpg)
Common Summary Measures
Mean and SD or SEM
Geometric Mean
Z-Scores
Correlation
Survival Probability
Risks, Odds, and Hazards
![Page 30: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/30.jpg)
Summary Statistics: One Variable
Data Reduction to a few summary measures.
Basic: Need Typical Value and Variability of Values
Typical Values (“Location”):
• Mean for symmetric data.
• Median for skewed data.
• Geometric mean for some skewed data - details in later slides.
![Page 31: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/31.jpg)
Summary Statistics: Variation in Values
• Standard Deviation, SD ≈ 1.25 × (average |deviation| of values from their mean) for bell-shaped data.
• Standard by convention, but its values are not intuitive.
• SD of what? E.g., SD of individuals, or of group means.
• Fundamental, critical measure for most statistical methods.
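The "SD is about 1.25 times the average absolute deviation" relation can be checked numerically. The sketch below uses simulated bell-shaped data (not data from the slides); for an exactly normal distribution the ratio is √(π/2) ≈ 1.2533.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
# Simulated bell-shaped values; the mean of 50 and SD of 10 are arbitrary choices.
values = [random.gauss(50.0, 10.0) for _ in range(20000)]

center = mean(values)
sd = stdev(values)
mean_abs_dev = mean([abs(v - center) for v in values])

ratio = sd / mean_abs_dev                    # observed SD / average |deviation|
theoretical_ratio = math.sqrt(math.pi / 2)   # exact value for a normal distribution
print(round(ratio, 2), round(theoretical_ratio, 2))
```

For skewed or heavy-tailed data the ratio can differ noticeably from 1.25, so the approximation is best read as a rule for roughly normal data.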
![Page 32: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/32.jpg)
Examples: Mean and SD
A: [Figure: histogram of Time (x-axis 35 to 95) against Frequency (0 to 25).] Mean = 60.6 min. SD = 9.6 min.
B: [Figure: histogram of OD (x-axis 10 to 20) against Frequency (0 to 15).] Mean = 15.1 SD = 2.8
Note that the entire range of data in A is about 6 SDs wide; this is the source of the name of the “Six Sigma” process used in quality control and business.
![Page 33: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/33.jpg)
Examples: Mean and SD
Skewed
[Figure: histogram of Intensity (x-axis 0 to 8) against Frequency (0 to 150).] Mean = 1.0 min. SD = 1.1 min.
Multi-Peak
[Figure: histogram of Tumor Volume (x-axis 20 to 120) against Frequency (0 to 20).] Mean = 70.3 SD = 22.3
![Page 34: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/34.jpg)
Summary Statistics: Rule of Thumb
For bell-shaped distributions of data (“normally” distributed):
• ~ 68% of values are within mean ±1 SD
• ~ 95% of values are within mean ±2 SD: the “(Normal) Reference Range”
• ~ 99.7% of values are within mean ±3 SD
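These percentages follow from the normal distribution itself. As a minimal check, the fraction of a normal distribution within mean ± k SD equals erf(k/√2); the function name `fraction_within` is chosen here for illustration.

```python
import math

def fraction_within(k_sd):
    """Fraction of a normal distribution lying within mean ± k_sd standard deviations."""
    return math.erf(k_sd / math.sqrt(2))

coverage = {k: fraction_within(k) for k in (1, 2, 3)}
for k, frac in coverage.items():
    print(f"within ±{k} SD: {100 * frac:.1f}%")
```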
![Page 35: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/35.jpg)
Summary Statistics: Geometric means
Commonly used for skewed data.
1. Take logs of individual values.
2. Find, say, mean ±2 SD → mean and (low, up) of the logged values.
3. Find antilogs of mean, low, up. Call them GM, low2, up2 (back on original scale).
4. GM is the “geometric mean”. The interval (low2, up2) is skewed about GM (corresponds to graph).
[See next slide]
![Page 36: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/36.jpg)
Geometric Means
These are flipped histograms rotated 90º, with box plots.
Any log base can be used.
GM = exp(4.633) = 102.8
low2 = exp(4.633 - 2*1.09) = 11.6
up2 = exp(4.633 + 2*1.09) = 909.6
[Figure: rotated histograms with box plots, with ≈ 11.6, ≈ 102.8, and ≈ 909.6 marked.]
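The geometric-mean recipe can be sketched as follows, assuming natural logs (any base works, as noted above). The first helper works from raw values; the second back-transforms a log-scale mean and SD, and is used here to reproduce the slide's numbers from the log-scale mean 4.633 and SD 1.09. Both function names are illustrative.

```python
import math
from statistics import mean, stdev

def geometric_summary(values, k=2.0):
    """Return (GM, low2, up2): antilogs of mean and mean ± k*SD of the logged values."""
    logs = [math.log(v) for v in values]
    log_mean, log_sd = mean(logs), stdev(logs)
    return back_transform(log_mean, log_sd, k)

def back_transform(log_mean, log_sd, k=2.0):
    """Antilog a log-scale mean and its ± k*SD limits back to the original scale."""
    return (
        math.exp(log_mean),
        math.exp(log_mean - k * log_sd),
        math.exp(log_mean + k * log_sd),
    )

# Log-scale mean and SD taken from the slide above.
gm, low2, up2 = back_transform(4.633, 1.09)
print(round(gm, 1), round(low2, 1), round(up2, 1))
```

The interval is asymmetric about GM on the original scale because it is symmetric on the log scale.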
![Page 37: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/37.jpg)
Confidence Intervals
Reference ranges - or Prediction Intervals - are for individuals.
Contains values for 95% of individuals.
_____________________________________
Confidence intervals (CI) are for a summary measure (parameter) for an entire population.
Contains the (still unknown) summary measure for “everyone” with 95% certainty.
![Page 38: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/38.jpg)
Z-Score = (Measure - Mean)/SD
[Figure: two histograms of Time (x-axis 35 to 95) against Frequency (0 to 25). Original scale: Mean = 60.6 min., SD = 9.6 min. Standardized scale: Z-Score = (Time-60.6)/9.6, Mean = 0, SD = 1. Z-scores of -2, 0, 2 correspond to times of about 41, 61, 79 min.]
Standardize a measure to have mean=0 and SD=1.
Z-scores make different measures comparable.
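A minimal sketch of standardization, using the mean (60.6 min) and SD (9.6 min) from the Time example. The list `times` holds hypothetical values, included only to show that a standardized sample has mean 0 and SD 1.

```python
from statistics import mean, stdev

def z_score(value, center, sd):
    """Standardize one value: number of SDs above (+) or below (-) the center."""
    return (value - center) / sd

def standardize(values):
    """Z-scores of a sample relative to its own mean and SD."""
    center, sd = mean(values), stdev(values)
    return [z_score(v, center, sd) for v in values]

# Mean 60.6 min and SD 9.6 min are from the Time example.
z_low = z_score(41.4, 60.6, 9.6)    # two SDs below the mean
z_high = z_score(79.8, 60.6, 9.6)   # two SDs above the mean

times = [48.0, 55.5, 60.0, 63.5, 72.0]  # hypothetical times in minutes
z_times = standardize(times)
print(round(z_low, 2), round(z_high, 2))
```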
![Page 39: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/39.jpg)
Outcome Measure in Case Study
GHA = Global Hyperactivity Aggregate
For each child at each time:
Z1 = Z-Score for ADHD from Teachers
Z2 = Z-Score for WWP from Parents
Z3 = Z-Score for ADHD in Classroom
Z4 = Z-Score for Conner on Computer
All have higher values ↔ more hyperactive.
Z’s make each measure scaled similarly.
GHA = Mean of Z1, Z2, Z3, Z4
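As a sketch of the aggregate, GHA is the plain average of the four Z-scores. The Z-score values below are invented, and the slide does not say which reference mean and SD the study used to compute each Z, so that step is left out.

```python
from statistics import mean

def global_hyperactivity_aggregate(z1, z2, z3, z4):
    """GHA for one child at one time: the mean of the four component Z-scores."""
    return mean([z1, z2, z3, z4])

# Hypothetical Z-scores for one child at one time point.
gha = global_hyperactivity_aggregate(0.5, -0.2, 1.1, 0.2)
print(round(gha, 2))
```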
![Page 40: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/40.jpg)
Confidence Interval for Population Mean
95% Reference range - or Prediction Interval - or “Normal Range”, if subjects are normal, is
sample mean ± 2(SD)
_____________________________________
95% Confidence interval (CI) for the (true, but unknown) mean for the entire population is
sample mean ± 2(SD/√N)
SD/√N is called “Std Error of the Mean” (SEM)
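The difference between the two intervals is only the divisor √N. The sketch below uses the mean and SD from the earlier Time example; the sample size N = 100 is an assumption made for illustration, since the slides do not state it.

```python
import math

def reference_range(sample_mean, sd, multiplier=2.0):
    """Approximate 95% range for individuals: mean ± 2 SD."""
    half_width = multiplier * sd
    return (sample_mean - half_width, sample_mean + half_width)

def confidence_interval(sample_mean, sd, n, multiplier=2.0):
    """Approximate 95% CI for the population mean: mean ± 2 SEM, with SEM = SD/sqrt(N)."""
    sem = sd / math.sqrt(n)
    half_width = multiplier * sem
    return (sample_mean - half_width, sample_mean + half_width)

ref = reference_range(60.6, 9.6)
ci = confidence_interval(60.6, 9.6, n=100)  # N = 100 is hypothetical
print(ref, ci)
```

The reference range stays about the same width as N grows, while the confidence interval narrows.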
![Page 41: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/41.jpg)
Confidence Interval: More Details
Confidence interval (CI) for the (true, but unknown) mean for the entire population is
95%, N=100: sample mean ± 1.98(SD/√N)
95%, N= 30: sample mean ± 2.05(SD/√N)
90%, N=100: sample mean ± 1.66(SD/√N)
99%, N=100: sample mean ± 2.63(SD/√N)
If N is small (N<30?), the data need a normal, bell-shaped distribution. Otherwise, skewness is OK. This is not true for the PI, where percentiles are needed.
![Page 42: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/42.jpg)
Confidence Interval: Case Study
Confidence Interval:
-0.14 ± 1.99(1.04/√73) =
-0.14 ± 0.24 → -0.38 to 0.10
Prediction Interval:
-0.14 ± 1.99(1.04) =
-0.14 ± 2.07 → -2.21 to 1.93
[Figure: excerpt of Table 2 with the values 0.13, -0.12, -0.37 highlighted and the annotation “close to Adjusted CI”.]
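The arithmetic for the case study can be reproduced directly from the figures quoted on the slide (mean -0.14, SD 1.04, N = 73, multiplier 1.99). The variable names are illustrative.

```python
import math

# Figures quoted on the slide.
mean_gha, sd_gha, n, multiplier = -0.14, 1.04, 73, 1.99

ci_half = multiplier * sd_gha / math.sqrt(n)   # half-width for the population mean
pred_half = multiplier * sd_gha                # half-width for an individual

ci = (mean_gha - ci_half, mean_gha + ci_half)
pred = (mean_gha - pred_half, mean_gha + pred_half)

print(f"CI: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"PI: {pred[0]:.2f} to {pred[1]:.2f}")
```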
![Page 43: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/43.jpg)
CI for the Antibody Example
Prediction interval (PI):
GM = exp(4.633) = 102.8
low2 = exp(4.633 - 2*1.09) = 11.6
up2 = exp(4.633 + 2*1.09) = 909.6
So, there is 95% assurance that an individual is between 11.6 and 909.6, the PI.
Confidence interval (CI):
GM = exp(4.633) = 102.8
low2 = exp(4.633 - 2*1.09/√394) = 92.1
up2 = exp(4.633 + 2*1.09/√394) = 114.8
So, there is 95% certainty that the population geometric mean is between 92.1 and 114.8, the CI.
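The confidence limits for the antibody example come from dividing the log-scale SD by √N before back-transforming. A short sketch using the slide's values (log-scale mean 4.633, SD 1.09, N = 394); the function name is illustrative.

```python
import math

def log_scale_ci(log_mean, log_sd, n, multiplier=2.0):
    """CI computed on the log scale with SEM = SD/sqrt(N), then antilogged."""
    sem = log_sd / math.sqrt(n)
    return (
        math.exp(log_mean - multiplier * sem),
        math.exp(log_mean + multiplier * sem),
    )

ci_low, ci_high = log_scale_ci(4.633, 1.09, 394)
print(round(ci_low, 1), round(ci_high, 1))
```

Because the interval is built on the log scale, it describes the geometric mean on the original scale rather than the arithmetic mean.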
![Page 44: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/44.jpg)
Summary Statistics: Two Variables (Correlation)
• Always look at scatterplot.
• Correlation, r, ranges from -1 (perfect inverse relation) to +1 (perfect direct relation). Zero = no linear relation.
• Specific to the ranges of the two variables.
• Typically, cannot extrapolate to populations with other ranges.
• Measures association, not causation.
We will examine details in Session 5.
![Page 45: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/45.jpg)
Correlation Depends on Range of Data
Graph B contains only the points from graph A that are in the ellipse.
Correlation is reduced in graph B.
Thus: correlation between two quantities may be quite different in different study populations.
[Figure: scatter plots A and B.]
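The effect of a restricted range on correlation can be illustrated with simulated data. This sketch restricts x to a narrow band rather than to an ellipse as in the figure, which is a simplifying assumption; the data, seed, and noise level are arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is repeatable
xs = [random.gauss(0.0, 1.0) for _ in range(5000)]
ys = [x + random.gauss(0.0, 0.75) for x in xs]  # y depends on x plus noise

r_full = pearson_r(xs, ys)

# Keep only points whose x lies in a narrow band (a restricted study population).
narrow = [(x, y) for x, y in zip(xs, ys) if abs(x) < 0.5]
r_narrow = pearson_r([x for x, _ in narrow], [y for _, y in narrow])

print(round(r_full, 2), round(r_narrow, 2))
```

The relationship between x and y is identical in both sets of points; only the range of x differs.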
![Page 46: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/46.jpg)
Correlation and Measurement Precision
A lack of correlation for the subpopulation with 5<x<6 may be due to inability to measure x and y well.
Lack of evidence of association is not evidence of lack of association.
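[Figure: scatter plot with regions labeled A and B and the band 5 < x < 6 marked on the x-axis; annotation: r = 0 for the subpopulation.]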
![Page 47: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/47.jpg)
Summary Statistics: Survival Probability
[Figure: Kaplan-Meier survival estimate. Y-axis: Survival Probability (0.00 to 1.00); X-axis: Years (0 to 20).]
Example: 100 subjects start a study. Nine subjects drop out at 2 years, 7 drop out at 4 years, and 20, 20, and 17 die in the intervals 0-2, 2-4, 4-5 yrs.
Don’t know vital status of 16 subjects at 5 years.
Then, the 0-2 yr interval has 80/100 surviving.
The 2-4 interval has 51/71 surviving; 4-5 has 27/44 surviving.
So, 5-yr survival prob is (80/100)(51/71)(27/44) = 0.35.
The Kaplan-Meier method actually uses finer subdivisions than 0-2, 2-4, 4-5 years, with exact death times.
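The product of interval survival fractions in the example can be written as a short calculation. This is a sketch of the coarse three-interval version only, not the full Kaplan-Meier estimator with exact death times; the function name is illustrative.

```python
def survival_probability(intervals):
    """Multiply per-interval survival fractions; each item is (number at risk, deaths)."""
    prob = 1.0
    for at_risk, deaths in intervals:
        prob *= (at_risk - deaths) / at_risk
    return prob

# From the example: dropouts leave the risk set before the next interval starts.
intervals = [
    (100, 20),  # 0-2 yrs: 100 at risk, 20 deaths -> 80 survive
    (71, 20),   # 2-4 yrs: 80 - 9 dropouts = 71 at risk, 20 deaths -> 51 survive
    (44, 17),   # 4-5 yrs: 51 - 7 dropouts = 44 at risk, 17 deaths -> 27 survive
]
five_year = survival_probability(intervals)
print(round(five_year, 2))
```

Only 27 of the original 100 subjects are known to be alive at 5 years, which is why the estimated probability is not simply a percentage of subjects.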
![Page 48: Biostatistics in Practice](https://reader036.fdocuments.in/reader036/viewer/2022062517/56813b48550346895da42d90/html5/thumbnails/48.jpg)
Summary Statistics: Relative Likelihood of an Event
Compare groups A and B on mortality.
Relative Risk = Prob_A[Death] / Prob_B[Death], where Prob[Death] ≈ Deaths per 100 Persons
Odds Ratio = Odds_A[Death] / Odds_B[Death], where Odds = Prob[Death] / Prob[Survival]
Hazard Ratio ≈ I_A[Death] / I_B[Death], where I = Incidence = Deaths per 100 Person-Days
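The three ratios can be compared on one set of counts. All numbers below are hypothetical, and the last function approximates the hazard ratio by a ratio of incidence rates, as on the slide, rather than fitting a survival model.

```python
def relative_risk(deaths_a, n_a, deaths_b, n_b):
    """Ratio of death probabilities in groups A and B."""
    return (deaths_a / n_a) / (deaths_b / n_b)

def odds_ratio(deaths_a, n_a, deaths_b, n_b):
    """Ratio of odds, where odds = Prob[Death] / Prob[Survival]."""
    odds_a = deaths_a / (n_a - deaths_a)
    odds_b = deaths_b / (n_b - deaths_b)
    return odds_a / odds_b

def incidence_rate_ratio(deaths_a, person_days_a, deaths_b, person_days_b):
    """Ratio of deaths per unit of person-time; a rough stand-in for a hazard ratio."""
    return (deaths_a / person_days_a) / (deaths_b / person_days_b)

# Hypothetical counts: 20 of 100 die in group A, 10 of 100 die in group B.
rr = relative_risk(20, 100, 10, 100)
oratio = odds_ratio(20, 100, 10, 100)
# Hypothetical follow-up: 30,000 person-days in A and 36,000 in B.
hr_approx = incidence_rate_ratio(20, 30000, 10, 36000)
print(round(rr, 2), round(oratio, 2), round(hr_approx, 2))
```

With these counts the three measures give different values for the same data, so a report should always state which one is being used.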