Biostatistics in Practice
Session 2: Summarization of Quantitative Information
Youngju Pak, Biostatistician
http://research.LABioMed.org/Biostat
Topics for this Session
Experimental Units
Independence of Measurements
Graphs: Summarizing Results
Graphs: Aids for Analysis
Summary Measures
Confidence Intervals
Prediction Intervals
Experimental Units
Independence of Measurements
Units and Independence
Experiments may be designed such that each measurement does not give additional independent information.
Many basic statistical methods require that measurements be “independent” for the analysis to be valid.
In mathematics, two events are independent if and only if the occurrence of one event makes it neither more nor less probable that the other occurs.
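In symbols, the definition above is usually written as follows. The event names A and B are standard notation and do not appear on the original slide.

```latex
P(A \cap B) = P(A)\,P(B)
\quad\Longleftrightarrow\quad
P(B \mid A) = P(B) \ \text{ when } P(A) > 0 .
```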
[Diagram: a sample is drawn from the population by a sampling mechanism (random sample or convenience sample). The sample estimate of the population parameter, together with a confidence interval for the population parameter, is used to infer the population parameter.]
Summarizing the Data with Descriptive Statistics
Experimental Units in Case Study
What is the experimental unit in this study?
1. School
2. Child
3. Parent
4. GHA score (results from three diets)
Are all GHA scores independent (e.g., 153 x 3 groups = 459 GHA scores for 3-4 year old children)?
The analysis MUST incorporate this possible correlation (clustering) if it exists.
Common Descriptive Statistics used
Sample Mean and Standard Deviation (SD)
Sample Median and Inter-Quartile Range (IQR)
Sample Correlation
Sample Survival Probability
Sample Risks & Odds
Mean vs. Median (measures of central tendency)
• Mean
– What most people think of as “average”
– Easy to calculate
– Easily distorted
– Be cautious with SKEWED data
– Calculate: sum of data / number of data points
• Median
– Relatively easy to obtain
– Not affected by extreme values, so it is considered a “ROBUST” statistic
– Calculate:
  • Sort data
  • If odd number of points, the middle is the median
  • Otherwise, the median is the average of the middle two numbers
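A minimal Python sketch of the two calculations, using made-up numbers (not from the case study), illustrates why the median is called robust: one extreme value moves the mean a great deal and the median only slightly.

```python
from statistics import mean, median

# Hypothetical data: six typical values, then the same values plus one extreme.
values = [4, 5, 5, 6, 6, 7]
with_outlier = values + [60]

mean_plain = mean(values)              # 33 / 6 = 5.5
median_plain = median(values)          # even count: average of middle two, (5 + 6) / 2
mean_outlier = mean(with_outlier)      # 93 / 7, pulled upward by the 60
median_outlier = median(with_outlier)  # odd count: the middle sorted value, 6

print(mean_plain, median_plain)
print(round(mean_outlier, 2), median_outlier)
```

Here the mean more than doubles while the median moves from 5.5 to 6.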
Standard Deviation (SD) & Inter-Quartile Range (IQR) (measuring the variability or scatter of the data)
• Inter-Quartile Range (IQR) = 75th percentile (Q3) - 25th percentile (Q1), where 25% of the data < Q1 and 75% of the data < Q3
• SD is usually used for normally distributed data (bell-shaped, symmetric around the mean)
• IQR is usually used when the data distribution is skewed.
• Range = Max - Min
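The three spread measures can be sketched in Python as follows. The data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: software packages differ slightly in how they interpolate percentiles, so Q1 and Q3 may not match other tools exactly.

```python
from statistics import quantiles, stdev

# Hypothetical, right-skewed data (one large value).
data = [2, 4, 4, 5, 7, 9, 11, 14, 30]

sd = stdev(data)                                       # sample SD (divides by n - 1)
q1, q2, q3 = quantiles(data, n=4, method="inclusive")  # quartile cut points
iqr = q3 - q1                                          # Q3 - Q1
data_range = max(data) - min(data)                     # Max - Min

print(round(sd, 2), iqr, data_range)
```

The single value of 30 inflates both the SD and the range, while the IQR depends only on the middle half of the data.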
Summarization of the Case Study
How are the outcome measures summarized? e.g., Table 2:
Summary Statistics: Relative Likelihood of an Event
Compare groups A and B on mortality.
Relative Risk = ProbA[Death] / ProbB[Death], where Prob[Death] ≈ Deaths per 100 Persons
Odds Ratio = OddsA[Death] / OddsB[Death], where Odds = Prob[Death] / Prob[Survival]
Hazard Ratio ≈ IA[Death] / IB[Death], where I = Incidence = Deaths per 100 Person-Days
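As a rough illustration of the three ratios, here is a Python sketch with invented counts for groups A and B. The function names and all numbers are assumptions for illustration only. The last quantity is an incidence rate ratio, which, as the slide indicates, only approximates a hazard ratio.

```python
def risk(deaths, persons):
    """Probability of death: deaths per person."""
    return deaths / persons

def odds(prob):
    """Odds = Prob[Death] / Prob[Survival]."""
    return prob / (1 - prob)

def incidence(deaths, person_days):
    """Incidence: deaths per person-day of follow-up."""
    return deaths / person_days

# Hypothetical data: 20 of 100 die in group A, 10 of 100 die in group B.
p_a = risk(20, 100)
p_b = risk(10, 100)

relative_risk = p_a / p_b            # 0.20 / 0.10
odds_ratio = odds(p_a) / odds(p_b)   # (0.20 / 0.80) / (0.10 / 0.90)

# Hypothetical follow-up: 5000 person-days in A, 6000 person-days in B.
rate_ratio = incidence(20, 5000) / incidence(10, 6000)

print(round(relative_risk, 2), round(odds_ratio, 2), round(rate_ratio, 2))
```

With these counts the relative risk is 2, while the odds ratio is somewhat larger; the two are close only when the event is rare in both groups.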
Summarizing the Data with Graphs
Data Graphical Displays
Many of the following examples are from StatisticalPractice.com
Histogram (summarized*) and scatter plot (raw data)
* Raw data version is a stem-leaf plot. We will see one later.
Data Graphical Displays
Dot plot (raw data) and box plot (summarized)
Bar Charts
Pie Charts
Data Graphical Displays
Line or Profile Plot
Summarized - bars can represent various types of ranges
Data Graphical Displays
Kaplan-Meier Plot

Interval (Start-End) | # At Risk at Start of Interval | # Censored During Interval | # At Risk at End of Interval | # Who Died at End of Interval | Proportion Surviving This Interval | Cumulative Survival at End of Interval
0-1   | 7 | 0 | 7 | 1 | 6/7 = 0.86 | 0.86
1-4   | 6 | 2 | 4 | 1 | 3/4 = 0.75 | 0.86 * 0.75 = 0.64
4-10  | 3 | 1 | 2 | 1 | 1/2 = 0.5  | 0.86 * 0.75 * 0.5 = 0.32
10-12 | 1 | 0 | 1 | 0 | 1/1 = 1.0  | 0.86 * 0.75 * 0.5 * 1.0 = 0.32

(Source: www.cancerguide.org)
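The cumulative survival column is a running product of the per-interval proportions. A minimal Python sketch, using the at-risk and death counts from the table above (the function and variable names are illustrative, not from the source):

```python
# (number at risk at end of interval, number who died at end of interval)
intervals = {
    "0-1": (7, 1),
    "1-4": (4, 1),
    "4-10": (2, 1),
    "10-12": (1, 0),
}

def kaplan_meier(rows):
    """Return cumulative survival at the end of each interval."""
    cumulative = 1.0
    curve = {}
    for label, (at_risk, died) in rows.items():
        surviving = (at_risk - died) / at_risk  # proportion surviving this interval
        cumulative *= surviving                 # multiply into the running product
        curve[label] = cumulative
    return curve

survival = kaplan_meier(intervals)
for label, value in survival.items():
    print(label, round(value, 3))
```

Carried at full precision, the cumulative values are about 0.857, 0.643, 0.321 and 0.321. Censored subjects do not count as deaths; they only reduce the number at risk in later intervals.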
Graphs: Aids for Analysis
Graphical Aids for Analysis
Most statistical analyses involve modeling.
Parametric methods (t-test, ANOVA, χ²) have stronger requirements than non-parametric (rank-based) methods.
Every method is based on data satisfying certain requirements.
Many of these requirements can be assessed with some useful common graphics.
Look at the Data for Analysis Requirements
What do we look for?
In Histograms (one variable):
Ideal: Symmetric, bell-shaped.
Potential Problems:
• Skewness.
• Multiple peaks.
• Many values at, say, 0, and bell-shaped otherwise.
• Outliers.
Example Histogram: OK for Typical* Analyses
• Symmetric.
• One peak.
• Roughly bell-shaped.
• No outliers.
*Typical: mean, SD, confidence intervals, to be discussed in later slides.
Z-Score = (Measure - Mean)/SD
[Figure: histogram of Time (x-axis, 35 to 95 min.) against Frequency (y-axis, 0 to 25), shown before and after standardizing. Mean = 60.6 min., SD = 9.6 min.]
Standardizes a measure to have mean = 0 and SD = 1.
Z-scores make different measures comparable.
Z-Score = (Time - 60.6)/9.6
Z-scores of -2, 0 and 2 correspond to times of about 41, 61 and 79 min.
After standardizing: Mean = 0, SD = 1
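The standardization on this slide can be sketched in Python. The first call uses the slide's mean and SD; the list `times` is hypothetical data added only to show that standardized values end up with mean 0 and SD 1.

```python
from statistics import mean, stdev

def z_score(value, center, spread):
    """Z-Score = (Measure - Mean) / SD."""
    return (value - center) / spread

# Slide values: Mean = 60.6 min., SD = 9.6 min.
z_slide = z_score(79.8, 60.6, 9.6)   # a time 2 SDs above the mean

# Hypothetical times (minutes), standardized with their own mean and SD.
times = [48, 55, 58, 60, 61, 63, 66, 74]
m, s = mean(times), stdev(times)
z_times = [z_score(t, m, s) for t in times]

print(round(z_slide, 2))
print(round(mean(z_times), 6), round(stdev(z_times), 6))
```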
Outcome Measure in Case Study
GHA = Global Hyperactivity Aggregate
For each child at each time:
Z1 = Z-Score for ADHD from Teachers
Z2 = Z-Score for WWP from Parents
Z3 = Z-Score for ADHD in Classroom
Z4 = Z-Score for Conner on Computer, where weekly score = changes from T0
All have higher values ↔ more hyperactive.
Z’s make each measure scaled similarly.
GHA = Mean of Z1, Z2, Z3, Z4
Summary Statistics: Rule of Thumb
For bell-shaped distributions of data (“normally” distributed):
• ~ 68% of values are within mean ±1 SD
• ~ 95% of values are within mean ±2 SD “(Normal) Reference Range”
• ~ 99.7% of values are within mean ±3 SD
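These percentages can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution lying within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within ±{k} SD: {share:.4f}")
```

The exact 95% multiplier is about 1.96 rather than 2, which is why ±2 SD covers slightly more than 95%.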
Histograms: Not OK for Typical Analyses
Skewed
[Figure: histogram of Intensity (x-axis, 0 to 8) against Frequency (y-axis, 0 to 150).]
Need to transform intensity to another scale, e.g. Log(intensity).
Multi-Peak
[Figure: histogram of Tumor Volume (x-axis, 20 to 120) against Frequency (y-axis, 0 to 20).]
Need to summarize with percentiles, not mean.
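To illustrate the log transformation suggested for skewed data, here is a small Python sketch. The intensity values are made up, and base 2 is an arbitrary choice for readability (the slide does not specify a base). A gap between mean and median is used as a crude sign of skewness.

```python
from math import log2
from statistics import mean, median

# Hypothetical right-skewed intensities: a few large values stretch the right tail.
intensity = [1, 2, 4, 8, 16, 32, 64]
log_intensity = [log2(v) for v in intensity]   # 0, 1, 2, ..., 6

# For symmetric data the mean and median coincide.
raw_gap = mean(intensity) - median(intensity)              # large and positive
log_gap = mean(log_intensity) - median(log_intensity)      # zero after transforming

print(round(raw_gap, 2), log_gap)
```

On the log scale these values are evenly spaced, so mean-based summaries become reasonable again.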
Look at the Data for Analysis Requirements
What do we look for?
In Scatter Plots (two variables):
Ideal: Football-shaped; ellipse.
Potential Problems:
• Outliers.
• Funnel-shaped.
• Gap with no values for one or both variables.
Example Scatter Plot: OK for Typical Correlation Analyses
Summary Statistics: Two Variables (Correlation)
• Always look at scatterplot.
• Correlation, r, ranges from -1 (perfect inverse relation) to +1 (perfect direct relation). Zero = no linear relation.
• Specific to the ranges of the two variables.
• Typically, cannot extrapolate to populations with other ranges.
• Measures association, not causation.
We will examine details in Session 5.
Correlation Depends on Range of Data
Graph B contains only the points from graph A that are in the ellipse.
Correlation is reduced in graph B.
Thus: correlation between two quantities may be quite different in different study populations.
[Figure: scatter plots A and B.]
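The effect of range on correlation can be reproduced with a small simulation. This is a sketch under stated assumptions: the data are simulated (not the slide's), the model y = x + noise is invented, and the subset is chosen by restricting x to a narrow band rather than by the ellipse drawn on the slide.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated population: x spread over 0..10, y = x plus noise with SD 2.
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [x + rng.gauss(0, 2) for x in xs]
r_full = pearson_r(xs, ys)

# Same points, restricted to a narrow range of x.
subset = [(x, y) for x, y in zip(xs, ys) if 4 < x < 6]
r_restricted = pearson_r([x for x, _ in subset], [y for _, y in subset])

print(round(r_full, 2), round(r_restricted, 2))
```

The relationship between x and y is identical in both sets of points; only the spread of x differs, and the correlation in the restricted set is markedly lower.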
Correlation and Measurement Precision
A lack of correlation for the subpopulation with 5<x<6 may be due to inability to measure x and y well.
Lack of evidence of association is not evidence of lack of association.
[Figure: scatter plot of y against x contrasting the overall data with the subpopulation 5 < x < 6, for which r = 0.]
Confidence Interval (CI)
• How well does your sample mean (m) reflect the true (or population) mean? How confident are you? 95%?
• A confidence interval (CI) is an inferential statistic that estimates the true, unknown parameter with an interval rather than a single value.
Confidence Interval for Population Mean
95% Reference range, or “Normal Range”, is
sample mean ± 2(SD)
95% Confidence interval (CI) for the (true, but unknown) mean for the entire population is
sample mean ± 2(SD/√N)
SD/√N is called “Std Error of the Mean” (SEM)
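The practical difference between the two intervals is that the reference range describes individuals and does not shrink with sample size, while the CI describes the mean and narrows as N grows. A Python sketch with hypothetical summary statistics (mean 60, SD 10) and the slide's multiplier of 2:

```python
from math import sqrt

def reference_range(mean, sd, k=2.0):
    """Approximate 95% range for individual values: mean ± k * SD."""
    return (mean - k * sd, mean + k * sd)

def confidence_interval(mean, sd, n, k=2.0):
    """Approximate 95% CI for the population mean: mean ± k * SD / sqrt(N)."""
    sem = sd / sqrt(n)  # standard error of the mean
    return (mean - k * sem, mean + k * sem)

# Hypothetical summary statistics, evaluated at three sample sizes.
sample_mean, sample_sd = 60.0, 10.0
ref_low, ref_high = reference_range(sample_mean, sample_sd)

ci_widths = {}
for n in (25, 100, 400):
    low, high = confidence_interval(sample_mean, sample_sd, n)
    ci_widths[n] = high - low
    print(n, (low, high))

print((ref_low, ref_high))
```

Quadrupling N halves the width of the CI, because the SEM scales with 1/√N.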
Confidence Interval: Case Study
Confidence Interval:
-0.14 ± 1.99(1.04/√73) =
-0.14 ± 0.24 → -0.38 to 0.10
Table 2
Normal Range:
-0.14 ± 1.99(1.04) =
-0.14 ± 2.07 → -2.21 to 1.93
Table 2 values: 0.13, -0.12, -0.37 (adjusted CI), close to the interval computed above.
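The arithmetic on this slide can be reproduced in Python from the summary values it quotes (mean -0.14, SD 1.04, N = 73, multiplier 1.99). The variable names are illustrative. The multiplier is taken from the slide as given; it is presumably a t-distribution value for this sample size, which is an assumption and not stated in the source.

```python
from math import sqrt

# Values quoted on the slide.
mean_gha = -0.14
sd_gha = 1.04
n = 73
multiplier = 1.99

sem = sd_gha / sqrt(n)  # standard error of the mean

ci = (mean_gha - multiplier * sem, mean_gha + multiplier * sem)
normal_range = (mean_gha - multiplier * sd_gha, mean_gha + multiplier * sd_gha)

ci_rounded = tuple(round(v, 2) for v in ci)
normal_range_rounded = tuple(round(v, 2) for v in normal_range)

print(round(multiplier * sem, 2), ci_rounded)
print(round(multiplier * sd_gha, 2), normal_range_rounded)
```

The half-widths come out as 0.24 and 2.07, matching the slide.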