Biostatistics I
PubH 6450
Fall 2005
April 18, 2023 2
PubH 6450 – Biostatistics I
Instructor: Susan Telke
• email: [email protected] (biostat.umn.edu)
• office hours: 3:20pm – 4:20pm (T and TH) in the lecture hall, or by appointment in A349 Mayo building
Teaching Assistants (emails at biostat.umn.edu):
• Pei Li – email: [email protected]
• Xiaoxiao Kong – email: [email protected]
• Jianmin Liu – email: [email protected]
• Xiaobo Liu – email: [email protected]
• Ran Li – email: [email protected]
• Jia Xu – email: [email protected]
• Jay Pottala – email: [email protected]
Book for 6450
Introduction to the Practice of Statistics (Moore and McCabe)
Web Page
http://www.biostat.umn.edu/~susant/PH6450DESC.html
Information on the web:
1. General class information
2. Syllabus
3. Course notes (updated weekly)
4. Homework
5. Computer Help
Computer Labs
• Mayo C381 (Biostatistics Lab): Teaching Assistants will hold computer sessions in the Mayo lab to help you with your homework assignments.
• Diehl Hall (Medical Library)
PC SAS
Primary computing environment will be PC SAS
• PC SAS is available in computing lab Mayo C381
• PC SAS can be purchased at the bookstore (one year agreement is about $50).
• SAS (not PC SAS) is available using the UNIX version of SAS by telnet to the biostat workstation saturn.
Exams and Homework
• There will be weekly homework assignments.
• There will be two midterms and one final exam.
• Students who get an “A” on all exams get an “A” in the course.
• For all other students the midterms account for 25% each and the final accounts for 30% of the course grade. The remaining 20% is based on homework (best 9).
Introduction to PubH 6450
The study of statistics explores the collection, organization, analysis and interpretation of numerical data.
When the focus of the analysis is on the biological and health sciences it is called Biostatistics.
Trial by Jury: A Familiar Scenario
• You have a crime.
• You have a suspect.
• A police investigation collects evidence against the suspect.
• A prosecutor presents summarized evidence to a jury.
Trial by Jury: The Process
The jury reaches a verdict based on their judgment of the evidence presented.
Rules for determining a verdict:
• The accused is innocent until proven guilty
• The evidence must be sufficient to convict beyond all reasonable doubt
• Decision must be unanimous
Trial by Jury: The Need
Why is the Trial by Jury process needed?
The truth is unknown or uncertain because of:
• Variability: Every case is different.
• Incomplete information: Some evidence may be missing.
Trial by Jury: Rationale
Trial by Jury is the way our society deals with uncertainties related to criminal justice.
Its goal is to minimize errors/mistakes within the limits of human understanding.
It is impossible to eliminate all mistakes in verdicts made based on uncertain, incomplete evidence.
Trial by Jury: Dealing with Uncertainty
• A hypothesis (assumption) is stated: “Every person is innocent until proven guilty”
• Data is collected: Evidence against the hypothesis, not against the suspect.
• A verdict is reached based on the evidence about whether the hypothesis should be rejected. (If the hypothesis is rejected, the verdict is guilty.)
Trial by Jury: Elements of a Successful Trial
• A probable cause (a crime and a suspect).
• A thorough investigation (by police).
• An efficient presentation (by D.A.’s office attorneys: organization and summarization of evidence).
• A fair & impartial assessment by the jury.
Trial by Jury: How does this relate to Biostatistics?
• A probable cause: The crime is lung cancer & the suspect is cigarette smoking.
• A thorough investigation: A clinical trial or case control study to gather information.
• An efficient presentation: Using biostatistics tools to organize and summarize data.
• A fair & impartial assessment by the jury: Making proper statistical inference based on data collected.
Areas of Biostatistics
Experimental Designs:
• How will the data be collected?
Descriptive Statistics:
• Organization of data
• Summary statistics of data
• Effective graphical representation of data
Statistical Inference:
• The science of drawing statistical conclusions from specific data using a knowledge of probability.
Goals …
By the end of the course you should be able to use the following aspects of statistical thinking:
• Critically read the literature in your field that makes use of statistical analysis.
• Read about new statistical techniques and understand how they may apply to your field.
• Create and analyze descriptive statistics based on data.
• Develop hypotheses and use appropriate statistics to evaluate these hypotheses.
The Language of Statistics: Definitions
• Population: The entire group of people, animals or things about which we want information. (e.g. population of the U.S.)
• Individuals (units): The objects described by a set of data. (e.g. People)
• Sample: A part of the population from which we actually collect information, used to draw conclusions about the whole population. (e.g. sample = 1000 people)
The Language of Statistics: Definitions
• Variable: Any characteristic of an individual. A variable can take different values for different individuals. Also, a variable can take different values for the same individual at different times. (e.g. Height, age, gender)
Two “Types” of Variables
• Quantitative Variable: measures that are recorded on a naturally occurring numerical scale. Operations such as adding and “averaging” make sense. (e.g. Height, time, test scores)
• Qualitative Variable (Categorical): Variables that are classified into one of a group of categories. Arithmetic operations do NOT make sense with this type of variable. (e.g. Geographical location, gender)
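As a rough illustration of the distinction above, the sketch below averages a quantitative variable and counts the categories of a categorical one. The five records and the variable names (`heights_cm`, `regions`) are made up for this example and do not come from the course material.

```python
from collections import Counter
from statistics import mean

# Hypothetical records for five individuals (made-up values, illustration only).
heights_cm = [162.0, 175.5, 168.0, 181.0, 158.5]               # quantitative
regions = ["Midwest", "South", "Midwest", "West", "Midwest"]   # categorical

# Averaging is meaningful for a quantitative variable.
average_height = mean(heights_cm)

# For a categorical variable we count how often each category occurs instead.
region_counts = Counter(regions)

print(f"Average height: {average_height:.1f} cm")
print("Region counts:", dict(region_counts))
```

Trying to "average" the region labels has no sensible meaning, which is the point of the slide's warning about arithmetic on categorical variables.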
Examples:
• Age in years
• ID #
• Temperature in degrees
• Political party
• Smoking status
• Length in cm
• Gender
• Blood pressure
Group Work!
Two Methods for Describing Sets of Data
Exploratory Data Analysis: examining data in order to describe their main features.
• Graphical
• Numerical
Displaying Distributions with Graphs
Distribution: The distribution of a variable tells us what values it takes on and how often it takes on these values.
Describing Categorical Variables with Graphs
Bar Graphs
[Bar graph: Percent of Children Living in Crack/cocaine Households; categories Black, White, American Indian, Other; vertical axis 0% to 80%]
Race              Percent
Black             70%
White             18%
American Indian   8%
Other             4%
NOTE: 668 children living in crack/cocaine households were categorized based on race
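A bar graph of a categorical variable can be approximated in plain text. The sketch below is one possible way to do it, using the percentages from the table above; the function name `text_bar_chart` and its `scale` parameter are assumptions made for this illustration.

```python
# Percentages taken from the slide's table (668 children in total).
percent_by_race = {"Black": 70, "White": 18, "American Indian": 8, "Other": 4}

def text_bar_chart(percents, scale=2):
    """Return one line per category, with one '#' per `scale` percentage points."""
    width = max(len(name) for name in percents)
    lines = []
    for name, pct in percents.items():
        bar = "#" * (pct // scale)
        lines.append(f"{name:<{width}} | {bar} {pct}%")
    return lines

for line in text_bar_chart(percent_by_race):
    print(line)
```

The bar lengths carry the comparison between categories; the order of the categories is arbitrary because the variable is categorical.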
Describing Categorical Variables with Graphs
Pie Chart
[Pie chart: Percent of Children Living in Crack/cocaine Households; slices Black, White, American Indian, Other]
Race              Percent
Black             70%
White             18%
American Indian   8%
Other             4%
NOTE: 668 children living in crack/cocaine households were categorized based on race
Describing Quantitative Data
• Stemplots
• Histograms
• Time Plots
• Box Plots (section 1.2)
Stemplots
Quick, easy way to see the distribution of 40 or fewer data points
How to make a stemplot:
• Create Leaf
• Order Data
• Arrange Stems
• Place Leaves
Stemplots: An Example
Average Monthly Temperature. Source: World Almanac 1996 p.180
City          JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
Duluth          6   12   23   38   50   59   65   63   54   44   28   14
Minneapolis    11   18   29   46   59   68   73   71   61   50   33   19
Stemplots: An Example
0 | 6
1 | 12489
2 | 389
3 | 38
4 | 46
5 | 00499
6 | 1358
7 | 13
Ordered data:
6, 11, 12, 14, 18, 19, 23, 28, 29,
33, 38, 44, 46, 50, 50, 54, 59, 59,
61, 63, 65, 68, 71, 73
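The stemplot above can be reproduced with a short script. This is a minimal sketch that assumes non-negative whole-number data, with the tens digit as the stem and the ones digit as the leaf; the function name `stemplot` is chosen for this illustration.

```python
from collections import defaultdict

# Average monthly temperatures from the slide (Duluth, then Minneapolis).
duluth = [6, 12, 23, 38, 50, 59, 65, 63, 54, 44, 28, 14]
minneapolis = [11, 18, 29, 46, 59, 68, 73, 71, 61, 50, 33, 19]

def stemplot(values):
    """Return (stem, leaves) rows using tens as stems and ones as leaves."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):          # order the data
        stem, leaf = divmod(value, 10)    # split each value into stem and leaf
        leaves_by_stem[stem].append(leaf)
    # List every stem from smallest to largest, even one with no leaves.
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return [(s, "".join(str(leaf) for leaf in leaves_by_stem[s])) for s in stems]

for stem, leaves in stemplot(duluth + minneapolis):
    print(f"{stem} | {leaves}")
```

Sorting first means the leaves on each stem come out in increasing order, which matches the "Order Data" step on the earlier slide.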
Histograms
Histograms are useful to display the distribution of large amounts of data.
Steps for creating a histogram:
1. Divide range into classes of equal width
2. Count number of observations in each class
3. Draw histogram
Histogram: An Example
Weights of 92 Penn State Students:
Females
140 120 130 138 121 125 116 145 150 112 125 130 120 130 131 120
118 125 135 125 118 122 115 102 115 150 110 116 108 95 125 133
110 150 108
Males
140 145 160 190 155 165 150 190 195 138 160 155 153 145 170 175
175 170 180 135 170 157 130 185 190 155 170 155 215 150 145 155
155 150 155 150 180 160 135 160 130 155 150 148 155 150 140 180
190 145 150 164 140 142 136 123 155
Histogram of Weights
[Histogram: Weights of Penn State Students (All Students); horizontal axis: Weight in Lbs., classes <110, 110-119, 120-129, 130-139, 140-149, 150-159, 160-169, 170-179, 180-189, 190+; vertical axis: Number of Students, 0 to 25]
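The class counts behind a histogram like this one can be tallied directly from the listed weights. The sketch below uses the same class labels as the horizontal axis; the helper name `class_label` is an assumption for this illustration, and the text bars only approximate the drawn figure.

```python
# Weights in pounds, copied from the example slide.
females = [140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130,
           120, 130, 131, 120, 118, 125, 135, 125, 118, 122, 115, 102,
           115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108]
males = [140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155,
         153, 145, 170, 175, 175, 170, 180, 135, 170, 157, 130, 185,
         190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150,
         180, 160, 135, 160, 130, 155, 150, 148, 155, 150, 140, 180,
         190, 145, 150, 164, 140, 142, 136, 123, 155]

def class_label(weight):
    """Map a weight to the class labels used on the histogram's horizontal axis."""
    if weight < 110:
        return "<110"
    if weight >= 190:
        return "190+"
    lower = (weight // 10) * 10
    return f"{lower}-{lower + 9}"

labels = ["<110"] + [f"{lo}-{lo + 9}" for lo in range(110, 190, 10)] + ["190+"]
counts = {label: 0 for label in labels}
for weight in females + males:
    counts[class_label(weight)] += 1

# One '*' per student gives a sideways text histogram.
for label in labels:
    print(f"{label:>7} | {'*' * counts[label]} ({counts[label]})")
```

Note that the first and last classes are open-ended, so they are not the same width as the ten-pound classes in between.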
Number of Intervals
There is no clear-cut rule on the number of intervals or classes that should be used.
• Too many intervals – the data may not be summarized enough for a clear visualization of how they are distributed.
• Too few intervals – the data may be over-summarized and some of the details of the distribution may be lost.
Pictures of Data: Histograms
Blood pressure data on a sample of 113 men
Histogram of the Systolic Blood Pressure for 113 men. Each bar spans a width of 5 mmHg on the horizontal axis. The height of each bar represents the number of individuals with SBP in that range.
[Histogram; horizontal axis: Systolic BP (mmHg), 80 to 160; vertical axis: Number of Men, 0 to 20]
Pictures of Data: Histograms
Another histogram of the blood pressure of 113 men. In this graph, each bar has a width of 20 mmHg, and there are a total of only 4 bars, making it hard to characterize the distribution of blood pressures in the sample.
[Histogram; horizontal axis: Systolic BP (mmHg), 80 to 160; vertical axis: Number of Men, 0 to 60]
Pictures of Data: Histograms
Yet another histogram of the same BP information on 113 men. Here, the bin width is 1 mmHg, perhaps giving more detail than is necessary.
[Histogram; horizontal axis: Systolic BP (mmHg), 80 to 160; vertical axis: Number of Men, 0 to 6]
Width of Intervals
Without some specific reason (e.g. showing infant death) the intervals should all be the same width.
Common width: $W = \frac{R}{k}$
• R = range of the data
• k = the number of intervals
Consideration when Determining Width
• Width should be chosen so that it is convenient to use or easy to recognize (multiples of 5 or 1).
• The beginning of the first interval must be low enough so that the first interval includes the smallest observation.
• If the data has x decimal places, the interval limits should also have x decimal places.
Data Example
Weight in pounds of 57 school children at a day-care center:
68 63 42 27 30 36 28 32 79 27
22 23 24 25 44 65 43 23 74 51
36 42 28 31 28 25 45 12 57 51
12 32 49 38 42 27 31 50 38 21
16 24 69 47 23 22 43 27 49 28
23 19 46 30 43 49 12
Data Example – Step 1
From the data we have:
• Minimum = 12
• Maximum = 79
• R = 79 - 12 = 67
If we use k = 5 and k = 15 we get:
• W = 67/5 = 13.4
• W = 67/15 ≈ 4.5
Since the dataset is not large, we will choose W = 10 to have fewer intervals.
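The arithmetic in Step 1 can be checked with a few lines of code. This is a minimal sketch using the 57 weights from the data slide; the names `data_range` and `widths` are chosen for this illustration.

```python
# Weights in pounds of the 57 day-care children, copied from the data slide.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 23, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]

# R = maximum - minimum
data_range = max(weights) - min(weights)

# W = R / k for the two candidate numbers of intervals
widths = {k: data_range / k for k in (5, 15)}

print(f"min = {min(weights)}, max = {max(weights)}, R = {data_range}")
for k, width in widths.items():
    print(f"k = {k:2d}: W = {data_range}/{k} = {width:.1f}")
```

Neither computed width is a convenient number, which is why the slide settles on a round width of 10 instead.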
Data Example – Step 2
Next we have to construct the intervals.
With W = 10 and minimum = 12, choose the first interval to start at 10.
INTERVALS (in lbs):
10-19
20-29
30-39
40-49
50-59
60-69
70-79
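One way to generate these intervals programmatically is sketched below. It assumes whole-number data and, unless told otherwise, starts the first interval at the largest multiple of the width that does not exceed the minimum; the function name `make_intervals` is hypothetical.

```python
def make_intervals(minimum, maximum, width, start=None):
    """Return (low, high) whole-number intervals of equal width covering the data."""
    if start is None:
        # Round the minimum down to a multiple of the width (12 -> 10 for width 10).
        start = (minimum // width) * width
    intervals = []
    low = start
    while low <= maximum:
        intervals.append((low, low + width - 1))
        low += width
    return intervals

intervals = make_intervals(minimum=12, maximum=79, width=10)
print(", ".join(f"{low}-{high}" for low, high in intervals))
```

Starting at a round number keeps the interval limits easy to read while still including the smallest observation in the first interval.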
Data Example – Step 3
Examine the values one at a time and tally the number in each interval.
Data Example – Step 4
Calculate Relative Frequencies:
$\text{Relative freq.} = \frac{\text{frequency in interval}}{\text{\# obs in dataset}}$
Frequency Table
Weight Interval (lbs)   Frequency   Relative Frequency (%)
10-19
20-29
30-39
40-49
50-59
60-69
70-79
Total
Frequency Table
Weight Interval (lbs)   Frequency   Relative Frequency (%)
10-19                   5
20-29                   19
30-39                   10
40-49                   13
50-59                   4
60-69                   4
70-79                   2
Total                   57
Frequency Table
Weight Interval (lbs)   Frequency   Relative Frequency (%)
10-19                   5           8.8
20-29                   19          33.3
30-39                   10          17.5
40-49                   13          22.8
50-59                   4           7.0
60-69                   4           7.0
70-79                   2           3.5
Total                   57          100
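The tally (Step 3) and the relative frequencies (Step 4) that fill this table can be reproduced as follows. This is a sketch using the slide's data and intervals; relative frequency is expressed as a percentage to match the table's last column.

```python
# Weights in pounds of the 57 day-care children, copied from the data slide.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 23, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]
intervals = [(low, low + 9) for low in range(10, 80, 10)]  # 10-19, ..., 70-79

# Step 3: examine the values one at a time and tally each into its interval.
frequency = {interval: 0 for interval in intervals}
for w in weights:
    for low, high in intervals:
        if low <= w <= high:
            frequency[(low, high)] += 1
            break

# Step 4: relative frequency = frequency in interval / number of observations.
n = len(weights)
relative = {interval: 100 * count / n for interval, count in frequency.items()}

for low, high in intervals:
    print(f"{low}-{high}  {frequency[(low, high)]:3d}  {relative[(low, high)]:5.1f}")
print(f"Total  {n:3d}  {sum(relative.values()):5.1f}")
```

The rounded percentages in the table add up to 99.9 rather than exactly 100; the Total row reports 100 because the unrounded values sum to 100.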
Histogram
[Histogram: Weights of Daycare Children; horizontal axis: Weight Range, classes 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79; vertical axis: Number of Children, 0 to 20]
• The horizontal scale represents the value of the variable
• The vertical scale represents the frequency or relative frequency in each interval
• Rectangular bars are joined together
Consider Distributions
• If the data are homogeneous, the graphs usually show a unimodal pattern with one peak in the middle.
• The plots can be used to determine if the data are symmetric. A symmetric distribution is one in which the distribution has the same shape on both sides of the peak.
Shapes of the Distribution
Three common shapes of frequency distributions:
[Figure: three panels labeled A, B, C]
• Symmetrical and bell shaped
• Positively skewed or skewed to the right
• Negatively skewed or skewed to the left
Shapes of the Distribution
Three less common shapes of frequency distributions:
[Figure: three panels labeled A, B, C]
• Bimodal
• Reverse J-shaped
• Uniform
Timeplots
• Data is displayed over time.
• Data may show seasonal or yearly patterns, or changes in environment over time.
• Timeplot data can give different impressions depending on the scales used on the x and y axes.
Timeplots: An Example
Time series data can display effects of changes in government policy. The table shows the data on motor vehicle deaths in the U.S. (death rate per 100 million miles driven).
Timeplots: An Example
Year  Rate    Year  Rate    Year  Rate    Year  Rate
1960  5.1     1970  4.7     1980  3.3     1990  2.1
1962  5.1     1972  4.3     1982  2.8     1992  1.7
1964  5.4     1974  3.5     1984  2.6     1994  2.7
1966  5.5     1976  3.2     1986  2.5     1996  1.7
1968  5.2     1978  3.3     1988  2.3     1998  1.6
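The table can be turned into a rough text timeplot. The sketch below copies the rates exactly as they appear in the table and places one marker per year, one column for each 0.1 of rate; the function name `text_timeplot` is an assumption for this illustration.

```python
# Death rates per 100 million miles driven, copied as they appear in the table.
rates = {
    1960: 5.1, 1962: 5.1, 1964: 5.4, 1966: 5.5, 1968: 5.2,
    1970: 4.7, 1972: 4.3, 1974: 3.5, 1976: 3.2, 1978: 3.3,
    1980: 3.3, 1982: 2.8, 1984: 2.6, 1986: 2.5, 1988: 2.3,
    1990: 2.1, 1992: 1.7, 1994: 2.7, 1996: 1.7, 1998: 1.6,
}

def text_timeplot(series):
    """Return one line per year; the '*' sits further right for higher rates."""
    lines = []
    for year in sorted(series):
        offset = round(series[year] * 10)   # one column per 0.1 of rate
        lines.append(f"{year} | {' ' * offset}* {series[year]}")
    return lines

for line in text_timeplot(rates):
    print(line)
```

Here time runs down the page rather than across it, so a falling death rate shows up as markers drifting to the left. Changing the column scale changes the visual impression, which echoes the earlier point about axis scales.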
Timeplots: An Example
[Timeplot: Death Rates Over Time; horizontal axis: Year, 1960 to 1996; vertical axis: Death Rate/100 Million Miles, 0 to 6]
Timeplots: An Example
1. During these years, safety requirements for motor vehicles became stricter and interstate highways replaced old roads.
2. In 1974 the national speed limit was lowered to 55 miles an hour. In the mid 1980’s most states raised speed limits to 65 miles an hour. Some say lower speed limits saved lives. Is this evident in our plot?