Bio Statistics Lecture 1
-
Upload
rohit-pandey -
Category
Documents
-
view
219 -
download
0
Transcript of Bio Statistics Lecture 1
-
7/28/2019 Bio Statistics Lecture 1
1/100
Copyright 2009, The Johns Hopkins University and John McGready. All rights reserved. Use of thesematerials permitted only in accordance with license rights granted. Materials provided AS IS; no
representations or warranties provided. User assumes all responsibility for use, and all liability relatedthereto, and must independently review all materials for accuracy and efficacy. May contain materials
owned by others. User is responsible for obtaining permissions for use from third parties as needed.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your useof this material constitutes acceptance of that license and the conditions of use of materials on this site.
-
7/28/2019 Bio Statistics Lecture 1
2/100
John McGreadyJohns Hopkins University
Describing Data: Part I
-
7/28/2019 Bio Statistics Lecture 1
3/100
Lecture Topics
What role does statistics have in public health?
Types of data: continuous, binary, categorical, time-to event
Continuous data: numerical summary measures
Continuous data: visual summary measure
Sample data versus population (process) level data
3
-
7/28/2019 Bio Statistics Lecture 1
4/100
What Role Does Biostatistics Play in Public Health (Why DoI Need this Stuff?)
Section A
-
7/28/2019 Bio Statistics Lecture 1
5/100
Data Is Everywhere!
Data is utilized and summarized frequently in research literature
FromArchives of Surgeryarticle, August 2000:- Hypothesis: Surgeon-directed institutional peer review,
associated with positive physician feedback, can decrease themorbidity and mortality rates associated with carotid
endarterectomy.- Results: Stroke rate decreased from 3.8% (1993-1994) to 0%
(1997-1998). The mortality rate decreased from 2.8%(1993-1994) to 0% (1997-1998). (Average) length of staydecreased from 4.7 days (1993-1994) to 2.6 days (1997-1998).The (average) total cost decreased from $13,344 (1993-1994) to$9,548 (1997-1998).
5
Source: Olcott IV, C., et al. (2004). Institutional peer review can reduce the risk and cost of carotid endarterectomy
Arch Surg, 135: 939-942.
-
7/28/2019 Bio Statistics Lecture 1
6/100
Data Is Everywhere!
Data is utilized and summarized with statistics frequently in popularmedia
From cnn.com, Monday July 8th, 2008:- For the first time, an influential doctors group is
recommending that some children as young as eight be given
cholesterol-fighting drugs to ward off future heart problems. . . With one-third of U.S. children overweight and about 17percent obese, the new recommendations are important, saidDr. Jennifer Li, a Duke University children's heart specialist.
6
-
7/28/2019 Bio Statistics Lecture 1
7/100
Data Is Everywhere!
Data is utilized and summarized with statistics frequently in popularmedia
From Washington Post, June 27th, 2008:-The number of young homosexual men being newly diagnosed
with HIV infection is rising by 12 percent a year, with the
steepest upward trend in young black men, according to a newreport.
7
-
7/28/2019 Bio Statistics Lecture 1
8/100
Data Provides Information
Good data can be analyzed and summarized to provide usefulinformation
Bad data can be analyzed and summarized to provide incorrect/harmful/non-informative information
8
-
7/28/2019 Bio Statistics Lecture 1
9/100
Steps in a Research Project
Planning/design of study
Data collection
Data analysis
Presentation
Interpretation
Biostatistics CAN play a role in each of these steps! (but sometimesis only called upon for the data analysis part)
9
-
7/28/2019 Bio Statistics Lecture 1
10/100
Biostatistics Issues
Planning/design of studies- Primary question(s) of interest:
Quantifying information about a single group? Comparing multiple groups?
- Sample size How many subjects needed total? How many in each of the groups to be compared?
- Selecting study participants Randomly chosen from master list? Selected from a pool of interested persons? Take whoever shows up?
- If group comparison of interest, how to assign to groups?
10
-
7/28/2019 Bio Statistics Lecture 1
11/100
Biostatistics Issues
Data collection
Data analysis- What statistical methods are appropriate given the data
collected?
- Dealing with variability (both natural and sampling related): Important patterns in data are obscured by variability Distinguish real patterns from random variation
- Inference: using information from the single study coupled withinformation about variability to make statement about thelarger population/process of interest
11
-
7/28/2019 Bio Statistics Lecture 1
12/100
Biostatistics Issues
Presentation- What summary measures will best convey the main messages
in the data about the primary (and secondary) researchquestions of interest
- How to convey/ rectify uncertainty in estimates based on thedata
Interpretation- What do the results mean in terms of practice, the program,
the population etc.?
12
-
7/28/2019 Bio Statistics Lecture 1
13/100
1954 Salk Polio Vaccine Trial
13
Source: Meier, P. (1972), The Biggest Public Health Experiment Ever: The 1954 Field Trial of the Salk Poliomyelitis
Vaccine, InJ. Tanur (Editor), Statistics: A Guide to the Unknown. Holden-Day.
-
7/28/2019 Bio Statistics Lecture 1
14/100
Design: Features of the Polio Trial
Comparison group
Randomized
Placebo controls
Double blind
Objectivethe groups should be equivalent except for the factor(vaccine) being investigated
14
-
7/28/2019 Bio Statistics Lecture 1
15/100
Analysis Question
Question- There were almost twice as many polio cases in the placebo
compared to the vaccine group
- Could the results be due to chance?
15
-
7/28/2019 Bio Statistics Lecture 1
16/100
Such Great Imbalance by Chance?
Polio cases- Vaccine82- Placebo162
Statistical methods tell us how to make these probabilitycalculations
16
-
7/28/2019 Bio Statistics Lecture 1
17/100
Section B
Types of Data
-
7/28/2019 Bio Statistics Lecture 1
18/100
-
7/28/2019 Bio Statistics Lecture 1
19/100
Categorical Data
Categorical data (place individuals in categories)
Nominal categorical data: no inherent order to categories- Race/ethnicity- Country of birth- Religious affiliation
Ordinal categorical data: order to categories- Income level categorized into four categories, least to greatest- Degree of agreement, five categories from strongly disagree to
strongly agree
4
-
7/28/2019 Bio Statistics Lecture 1
20/100
Continuous Data
Continuous data (finer measurements)- Blood pressure, mmHg- Weight, pounds (kilograms, ounces, etc.)- Height, feet (centimeters, inches, etc.)- Age, years (months)- Income level, dollars/year (Euro by year, etc.)
5
-
7/28/2019 Bio Statistics Lecture 1
21/100
Time to Event Data
Data that is a hybrid of continuous data and binary data- Whether an event occurs and time to the occurrence (or time tolast follow-up without occurrence)
6
-
7/28/2019 Bio Statistics Lecture 1
22/100
Different Methods for Different Data Types
To compare the number of polio cases in the two treatment arms ofthe Salk Polio vaccine, you could use . . .- Fishers Exact Test- Chi-Square Test
To compare blood pressures in a clinical trial evaluating two bloodpressure-lowering medications, you could use . . .- 2-Sample t-Test- Wilcoxon Rank Sum Test
7
-
7/28/2019 Bio Statistics Lecture 1
23/100
Section C
Continuous Data: Numerical Summary Measures; SampleEstimates versus Population Measures
-
7/28/2019 Bio Statistics Lecture 1
24/100
Summarizing and Describing Continuous Data
Measures of the center of data- Mean- Median
Measure of data variability- Standard deviation (variance)- Range
3
-
7/28/2019 Bio Statistics Lecture 1
25/100
Sample Mean: The Average or Arithmetic Mean
Add up data, then divide by sample size (n)
The sample size n is the number of observations (pieces of data)
4
-
7/28/2019 Bio Statistics Lecture 1
26/100
Mean, Example
Five systolic blood pressures (mmHg) (n = 5)- 120, 80, 90, 110, 95
Can be represented with math type notation:- x1= 120, x2 = 80, . . . x5 = 95
The sample mean is easily computed by adding up the five valuesand dividing by fivein statistical notation the sample mean is
frequently represented by a letter with a line over it
- For example (pronounced x bar)-
5
-
7/28/2019 Bio Statistics Lecture 1
27/100
Mean, Example
Five systolic blood pressures (mmHg) (n = 5)-120, 80, 90, 110, 95
6
-
7/28/2019 Bio Statistics Lecture 1
28/100
Notes on Sample Mean
Generic formula representation
In the formula to find the mean, we use the summation sign- This is just mathematical shorthand for add up all of the
observations
7
-
7/28/2019 Bio Statistics Lecture 1
29/100
Notes on Sample Mean
Also called sample average or arithmetic mean
Sensitive to extreme values- One data point could make a great change in sample mean
Why is it called the sample mean?- To distinguish it from population mean (will discuss at end of
this section)
8
-
7/28/2019 Bio Statistics Lecture 1
30/100
Sample Median
The median is the middle number (also called the 50th percentile)- Other percentiles can be computed as well, but are notmeasures of center
80 90 95 110 120
9
-
7/28/2019 Bio Statistics Lecture 1
31/100
80 90 95 110 200
Sample Median
The sample median is not sensitive to extreme values- For example, if 120 became 200, the median would remain thesame, but the mean would change to 115
10
-
7/28/2019 Bio Statistics Lecture 1
32/100
80 90 95 110 120 125
Median
Sample Median
If the sample size is an even number
11
-
7/28/2019 Bio Statistics Lecture 1
33/100
Describing Variability
Sample variance (s2)
Sample standard deviation (s or SD)
The sample variance is the average of the square of the deviationsabout the sample mean
12
-
7/28/2019 Bio Statistics Lecture 1
34/100
Describing Variability
The sample standard deviation is the square root ofs2
13
-
7/28/2019 Bio Statistics Lecture 1
35/100
Describing Variability
Recall, the five systolic blood pressures (mm Hg) with sample mean( ) of 99 mmHg
Five systolic blood pressures (mmHg) (n = 5)- 120, 80, 90, 110, 95
14
-
7/28/2019 Bio Statistics Lecture 1
36/100
Describing Variability
Example: n = 5 systolic blood pressures (mm Hg)
15
-
7/28/2019 Bio Statistics Lecture 1
37/100
Describing Variability
Sample variance
Sample standard deviation (s)
s = 15.97 (mmHg)
16
-
7/28/2019 Bio Statistics Lecture 1
38/100
Notes on s
The bigger s is, the more variability there is
s measures the spread about the mean
s can equal 0 only if there is no spread-
All n observations have the same value
The units ofs are the same as the units of the data (for example,mm Hg)
Often abbreviated SD or sd
s2 is the best estimate from the sample of the population variance2; s is the best estimate of the population standard deviation
17
-
7/28/2019 Bio Statistics Lecture 1
39/100
Population Versus Sample
Sample: a subset (part) of a larger group (population) from whichinformation is collected to learn about the larger group- For example, sample of blood pressures n = five 18-year-old
male college students in the United States
Population: the entire group for which information is wanted- For example, the blood pressure of all 18-year-old male collegestudents in the United States
18
-
7/28/2019 Bio Statistics Lecture 1
40/100
Random Sampling
For studies it is optimal if the sample which provides the data isrepresentative of the population under study- Certainly not always possible!
For this term, we will make this assumption unless otherwisespecified
One way of getting a representative sample: simple randomsampling
- A sampling scheme in which every possible sub-sample of size nfrom a population is equally likely to be selected
- How to do it? More detail in second half of term, but think ofthe names in a hat idea
19
-
7/28/2019 Bio Statistics Lecture 1
41/100
The sample summary measures (mean, median, sd) are calledstatistics, and are just estimates of their population (process)counterparts
Assuming the sample is representative of the population from whichit is taken (for example, a randomly drawn sample) these sample
estimates should be good estimates of true quantities
Population Versus Sample
20
Population
Population (true) mean: Population (true) SD:
Sample
Sample mean:Sample SD: s
-
7/28/2019 Bio Statistics Lecture 1
42/100
Population Versus Sample
For example, we will never know the population mean but wouldlike to know it
We draw a sample from the population
We calculate the sample mean
How close is to ?
Statistical theory allow us to estimate how close is to usinginformation computed from the same single sample we use toestimate
21
-
7/28/2019 Bio Statistics Lecture 1
43/100
The Role of Sample Size on Sample Estimates
Increasing sample size, increases Goodness of sample statistics asestimates for their population counterparts- Sample mean based on random sample of 1,000 observations is
better estimate of true (population) mean than sample mean
than sample mean based on random sample of 100 from same
population
- Same logic applies to sample standard deviation estimates- Well define better estimate in the third lecture
22
-
7/28/2019 Bio Statistics Lecture 1
44/100
The Role of Sample Size on Sample Estimates
Increasing sample size does not dictate how sample estimates fromtwo different representative samples of difference size willcompare in value!
Researcher can not systematically decrease (or increase) value ofsample estimates such as mean and standard deviation by taking
larger samples!
23
-
7/28/2019 Bio Statistics Lecture 1
45/100
The Role of Sample Size on Sample Estimates
Extreme values, both larger and smaller, are actually more likely inlarger samples- The smaller and larger extremes in larger samples they
balance each other out
- This balancing act tends to keep the mean in a steady stateas sample size increasesit tends to be about the same
In addition, non-extreme values (values closer to mean) are alsomore likely in larger samples
- Hence, sample SD also stays balanced, i.e., does notsystematically increase/decrease with larger samples
24
-
7/28/2019 Bio Statistics Lecture 1
46/100
SD: Why Do We Divide by n1 Instead of n?
We really want to replace with in the formula for s2
Since we dont know , we use
But generally, tends to be smaller than
- To compensate, we divide by a smaller number: n1 instead ofn This will be explored further in an optional component of the third
lecture
25
-
7/28/2019 Bio Statistics Lecture 1
47/100
n1
n1 is called the degrees of freedom of the variance or SD
Why?- The sum of the deviations is zero- The last deviation can be found once we know the other n1- Only n1 of the squared deviations can vary freely
The term degrees of freedom arises in other areas of statistics
It is not always n1, but it is in this case
26
-
7/28/2019 Bio Statistics Lecture 1
48/100
Why SD as Measure of Variation
Why note use the range of the data for example?
- Range = maximum minimum What happens to the sample maximum and minimum as sample size
increases?- As it turns out, as sample size increases, the maximum tends to
increase, and the minimum tends to decrease: Extreme valuesare more likely with larger samples!
- This will tend to increase the range systematically withincreased sample size
27
-
7/28/2019 Bio Statistics Lecture 1
49/100
Section D
Visually Displaying Continuous Data: Histograms
-
7/28/2019 Bio Statistics Lecture 1
50/100
Pictures of Data: Continuous Variables
Histograms-
Means and medians and standard deviations do not tell thewhole story
- Differences in shape of the distribution- Histograms are a way of displaying the distribution of a set of
data by charting the number (or percentage) of observations
whose values fall within pre-defined numerical ranges
3
k
-
7/28/2019 Bio Statistics Lecture 1
51/100
4
How to Make a Histogram
Consider the following data collected from the 1995 StatisticalAbstracts of the United States
- For each of the 50 United States, the proportion of individualsover 65 years of age has been recorded
H M k Hi
-
7/28/2019 Bio Statistics Lecture 1
52/100
How to Make a Histogram
5Source: Statistical Abstracts of the United States (1995).
H t M k Hi t
-
7/28/2019 Bio Statistics Lecture 1
53/100
How to Make a Histogram
AK 4.6
FL 18.4
6Source: Statistical Abstracts of the United States (1995).
H t M k Hi t
-
7/28/2019 Bio Statistics Lecture 1
54/100
How to Make a Histogram
Break the data range into mutually exclusive, equally sized bins:here each is 1% wide
Count the number of observations in each bin
7
H t M k Hi t
-
7/28/2019 Bio Statistics Lecture 1
55/100
How to Make a Histogram
Draw the histogram
Label scales
8
15
Pi t f D t Hi t
-
7/28/2019 Bio Statistics Lecture 1
56/100
Pictures of Data: Histograms
Suppose we have a sample of blood pressure data on a sample of113 men
Sample mean ( ) : 123.6 mmHg
Sample Median (m): 123.0 mmHg
Sample sd: (s): 12.9 mmHg
9
Pi t f D t Hi t g
-
7/28/2019 Bio Statistics Lecture 1
57/100
Histogram of the Systolic Blood Pressure for
sample of 113 men. Each bar spans a width offive mmHg on the horizontal axis. The height of
each bar represents the number of individuals
with SBP in that range.
Pictures of Data: Histograms
10
Pict res of Data: Histograms
-
7/28/2019 Bio Statistics Lecture 1
58/100
Another histogram of the blood pressure of 113
men. In this graph, each bar has a width of 20mmHg and there are a total of only four bars
making it hard to characterize the distribution
of blood pressures in the sample.
Pictures of Data: Histograms
11
Pictures of Data: Histograms
-
7/28/2019 Bio Statistics Lecture 1
59/100
Yet another histogram of the same BP
information on 113 men. Here, the bin width isone mmHg, perhaps giving more detail than is
necessary.
Pictures of Data: Histograms
12
Other Examples
-
7/28/2019 Bio Statistics Lecture 1
60/100
Another way to present the data in a histogram
is to label the y-axis with relative frequenciesas opposed to counts. The height of each bar
represents the percentage of individuals in thesample with BP in that range. The bar heights
should add to one.
Other Examples
13
Intervals
-
7/28/2019 Bio Statistics Lecture 1
61/100
How many intervals (bins) should you have in a histogram?-
There is no perfect answer to this- Depends on sample size n- Rough rule of thumb: # Intervals
Intervals
14
-
7/28/2019 Bio Statistics Lecture 1
62/100
Section E
Stem and Leaf Plots, Box Plots
-
7/28/2019 Bio Statistics Lecture 1
63/100
Sample 113 Men
Suppose we took another look at our random sample of 113 men andtheir blood pressure measurements
One tool for visualizing the data is the histogram
3
-
7/28/2019 Bio Statistics Lecture 1
64/100
4
Histogram: BP for 113 males
-
7/28/2019 Bio Statistics Lecture 1
65/100
Sample 113 Men: Stem and Leaf
Another common tool for visually displaying continuous data is thestem and leaf plot
Very similar to a histogram- Like a histogram on its side- Allows for easier identification of individual values in the
sample
5
-
7/28/2019 Bio Statistics Lecture 1
66/100
Stem and Leaf: BP for 113 Males
6
-
7/28/2019 Bio Statistics Lecture 1
67/100
Stem and Leaf: BP for 113 Males
7
Stems
-
7/28/2019 Bio Statistics Lecture 1
68/100
Stem and Leaf: BP for 113 Males
8
Leaves
-
7/28/2019 Bio Statistics Lecture 1
69/100
Stem and Leaf: BP for 113 Males
9
-
7/28/2019 Bio Statistics Lecture 1
70/100
Stem and Leaf: BP for 113 Males
10
-
7/28/2019 Bio Statistics Lecture 1
71/100
Sample 113 Men: Stem and Boxplot
Another common visual display tool is the boxplot
- Gives good insight into distribution shape in terms of skewnessand outlying values (extremes: values different than most ofthe rest of the data)
- Very nice tool for easily comparing distribution of continuousdata in multiple groupscan be plotted side by side
11
-
7/28/2019 Bio Statistics Lecture 1
72/100
Boxplot: BP for 113 Males
12
B l BP f 113 M l
-
7/28/2019 Bio Statistics Lecture 1
73/100
13
SampleMedian
Blood
Pressure
Boxplot: BP for 113 Males
B l BP f 113 M l
-
7/28/2019 Bio Statistics Lecture 1
74/100
Boxplot: BP for 113 Males
14
75th Percentile
of Sample
25th Percentile
of Sample
B l t BP f 113 M l
-
7/28/2019 Bio Statistics Lecture 1
75/100
Boxplot: BP for 113 Males
15
Largest
Observation
Smallest
Observation
B l t BP f 113 M l
-
7/28/2019 Bio Statistics Lecture 1
76/100
Boxplot: BP for 113 Males
16
75th Percentile
of Sample
25th Percentile
of Sample
H it l L th f St f 1 000 P ti t
-
7/28/2019 Bio Statistics Lecture 1
77/100
17
Hospital Length of Stay for 1,000 Patients
Suppose we took a representative sample of discharge records from1,000 patients discharged from a large teaching hospital in a singleyear
How could we visualize this data?
Hi t L th f St
-
7/28/2019 Bio Statistics Lecture 1
78/100
Histogram: Length of Stay
18
B l t L gth f St
-
7/28/2019 Bio Statistics Lecture 1
79/100
Boxplot: Length of Stay
19
Boxplot: Length of Stay
-
7/28/2019 Bio Statistics Lecture 1
80/100
Boxplot: Length of Stay
20
Boxplot: Length of Stay
-
7/28/2019 Bio Statistics Lecture 1
81/100
Boxplot: Length of Stay
21
Largest
Non-Outlier
Smallest
Non-Outlier
Boxplot: Length of Stay
-
7/28/2019 Bio Statistics Lecture 1
82/100
Boxplot: Length of Stay
22
Large Outliers
Stem and Leaf: Length of Stay
-
7/28/2019 Bio Statistics Lecture 1
83/100
Stem and Leaf: Length of Stay
23
Side by Side Distribution Comparison
-
7/28/2019 Bio Statistics Lecture 1
84/100
24
Side by Side Distribution Comparison
Side by side histograms of length of stay for female and malepatients in sample
Side by Side Distribution Comparison
-
7/28/2019 Bio Statistics Lecture 1
85/100
Side by Side Distribution Comparison
Side by side boxplots of length of stay for female and male patientsin sample
25
-
7/28/2019 Bio Statistics Lecture 1
86/100
Section F
Samples versus Populations, Part 2: Sample Distribution
versus Underlying Population Distribution
Sample Distribution
-
7/28/2019 Bio Statistics Lecture 1
87/100
In research, samples are taken from larger population
If the sample is taken randomly, the sample characteristics willimperfectly mimic the population characteristics
The characteristics include the mean, median and sd (but also thedistribution of individual values)
3
Example 1: Blood Pressure in Males
-
7/28/2019 Bio Statistics Lecture 1
88/100
Histogram of BP values for random sample of 113 men
4
Example 1: Blood Pressure in Males
-
7/28/2019 Bio Statistics Lecture 1
89/100
Histogram of BP values for random sample of 500 men
5
Example 1: Blood Pressure in Males
-
7/28/2019 Bio Statistics Lecture 1
90/100
Histogram of BP values for male population
6
The Histogram and the Probability Density
-
7/28/2019 Bio Statistics Lecture 1
91/100
7
Theprobability densityis a smooth idealized curve that shows theshape of the distribution in the population
This is generally a theoretical distribution that we can never see:we can only estimate it from the distribution presented by a
representative (random) sample from the population
Areas in an interval under the curve represent the percentage of thepopulation in the interval
The distributions shown are indicative of a symmetric, bell shapeddistribution for blood pressure measurements in men
-
7/28/2019 Bio Statistics Lecture 1
92/100
Example 2: Hospital Length of Stay
-
7/28/2019 Bio Statistics Lecture 1
93/100
Histogram of LOS values for 500 patients
9
Example 2: Hospital Length of Stay
-
7/28/2019 Bio Statistics Lecture 1
94/100
Histogram of LOS values for all patients
10
Common Shapes of the Distribution
-
7/28/2019 Bio Statistics Lecture 1
95/100
Some shapes of data distributions
Symmetricaland bell
shaped
Positivelyskewed or
skewed to the
right
Negativelyskewed or
skewed to the
left
A B C
11
Shapes of the Distribution
-
7/28/2019 Bio Statistics Lecture 1
96/100
Some possible shapes for frequency distributions
Bimodal ReverseJ-shaped
Uniform
A B C
12
Distribution Characteristics
-
7/28/2019 Bio Statistics Lecture 1
97/100
Mode: Peak(s)
Median: Equal areas point
Mean: Balancing point
MeanMode
Median
13
Shapes of Distributions
-
7/28/2019 Bio Statistics Lecture 1
98/100
Symmetric (right and left sides are mirror images)- Left tail looks like right tail- Mean = Median = Mode
MeanMedian Mode
14
Shapes of Distributions
-
7/28/2019 Bio Statistics Lecture 1
99/100
Right skewed(positively skewed)- Long right tail- Mean > Median
MeanMode
Median
15
Shapes of Distributions
-
7/28/2019 Bio Statistics Lecture 1
100/100
Left skewed(negatively skewed)- Long left tail- Mean < Median
ModeMedianMean