MMI 409: Introduction To Biostatistics
D’Agostino, R.B., Sullivan, L.M., & Beiser, A.S. (2006). Introductory applied biostatistics. Belmont, CA: Brooks/Cole, Cengage Learning. [ISBN-13: 978-0534423995]
Kranzler, J.H. (2009). Statistics for the terrified (4th ed.). Pearson Prentice Hall. [ISBN-13: 978-0-13-193011-7]
Contents
Chapter 1: Introduction
Chapter 2: Summarizing Data
Chapter 3: Probability
Chapter 4: Sampling Distribution
Chapter 5: Statistical Inference: Procedures for µ
Chapter 6: Statistical Inference: Procedures for µ1 – µ2
Chapter 9: Analysis of Variance
Chapter 7: Categorical Data
Chapter 10: Correlation and Regression
Chapter 12: Nonparametric Tests
Chapter 1: Introduction
1.1 Vocabulary
Page 1.2
- Statistical analysis: the analysis of characteristics of subjects of interest
- Subjects: the units on which characteristics are measured. E.g. human beings, cells, blood
samples
- Variables = Characteristics: measurable properties. E.g. age, blood pressure
- Need to define variables explicitly in each application along with appropriate measurement
units
- Data element / data point: a representation of a particular measurable characteristic, or
variable. E.g. Subjects’ ages measured in years
- Need to define the population of interest explicitly.
- Population: The collection of all subjects of interest
- Sample: A subset of the population of interest
Page 1.3
- Many different samples can be selected from any given population
- N = Population size: The number of subjects in a population
- n = sample size: The number of subjects in a sample
- Parameter = Population parameter: any descriptive measure based on a population. E.g. N
- Statistics = sample statistics: any descriptive measure based on a sample. E.g. n
- X = data elements or observations deriving from either a population or a sample
- X is a variable name or placeholder representing the characteristic of interest
Page 1.4
1.2 Population Parameters
- µ = population mean: the typical value of the characteristic in the population (e.g. the typical number of visits that subjects made)
Page 1.5
- µ = ΣX / N, where X = the characteristic under consideration and N = population size
- Population range: The difference between the largest (maximum) score and the smallest (minimum) score
- (X – µ): deviations from the mean = distances between each observation and the population mean
Page 1.6
- Sum of deviations from the mean = 0
- Variance: the average squared deviation from the mean
- σ² = Σ(X – µ)² / N = population variance: average of squared deviations from the mean. E.g. The number of primary-care visits in 3 years is, on average, 32 visits squared from the mean of 8.
- σ = population standard deviation = √σ² = √(Σ(X – µ)² / N). E.g. the typical deviation in the number of primary-care visits in 3 years is about 5.7 visits from the mean of 8
- Population standard deviation is useful in comparing populations with respect to a particular
characteristic. E.g. if σ1 = 5.7 and σ2 = 2.3, the number of visits are more dispersed in
population 1 as compared to population 2. In population 2, the reported number of visits are
more tightly clustered
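The population formulas above can be sketched in a few lines of Python; the visit counts below are hypothetical (the notes only report µ = 8 and σ ≈ 5.7):

```python
# Population mean, variance, and standard deviation (divide by N, not n - 1).
def population_mean(xs):
    return sum(xs) / len(xs)

def population_variance(xs):
    mu = population_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)  # average squared deviation

def population_sd(xs):
    return population_variance(xs) ** 0.5

visits = [2, 5, 8, 11, 14]          # hypothetical population, N = 5
print(population_mean(visits))      # 8.0
print(population_variance(visits))  # 18.0
```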
1.3 Sampling and Sample Statistics
- Use a sample instead of the population because:
o It may be impossible or impractical to analyze the entire population
o The population may be too large to measure a specific characteristic
o It may be too costly to measure certain characteristics on each subject
o It may be too time consuming to analyze the entire population
Page 1.8
- Statistical inference: Assume that what we observe in the sample is similar to what we would observe in the population
Page 1.9
- Sample mean = x̄ : summing the values and dividing by the sample size
Page 1.10
- Sampling distribution of the sample means: The enumeration of all possible sample means
- Sampling distribution: The listing of all values of a statistic based on all possible samples
generated under a particular sampling strategy
1.4 Statistical Inference
Page 1.11
- In statistical inference, we have a single sample and wish to make inferences about unknown
population parameters based on sample statistics
- Take a sample of subjects from the population and compute various statistics (e.g. the sample
mean, x ).
- Inferences about a population are valid if the sample is a random sample.
- Random sample: All samples have an equal chance of being selected
- The larger the sample size, the better the inference about the population parameter
Page 1.12
- On the average, the sample mean (x̄) is equal to the population mean (µ)
- The sample mean is an unbiased estimator of the population mean
- On average, the sample range is not equal to the population range.
- The sample range is a biased estimator of the population range.
Page 1.16
Chapter 2: Summarizing Data
- Before statistical inference, we must understand how to appropriately summarize our sample
Page 1.17
2.1 Background
2.1.1 Vocabulary
- Data element = data point: a representation of a particular measurable characteristic, or variable.
- E.g. Subjects’ ages measured in years
- Data elements are measured on subjects or units of measurement
- E.g. Time it takes each animal to complete a specific task
o Time to complete a specific task = variable of interest
o Observed time (in minutes) = data elements
- Subjects = units of measurement are specific to each application and must be defined explicitly
- Population: The collection of all subjects of interest
- Sample: A subset of the population of interest
2.1.2 Classification of Variables
- Continuous (or measurement ) variables assume any value between the minimum and
maximum value of a particular measurement scale. E.g. Age, height, weight
Page 2.18
- Discrete variables take on a limited number of values or categories
- Ordinal variables: Take on a limited number of values or categories, but the categories are
ordered. E.g. Symptom severity: minimal, moderate, severe
- Categorical variables: Take on a limited number of values or categories, but the categories are
un-ordered. E.g. Gender
- Parameter = Population parameter: any descriptive measure based on a population. E.g. N
- Statistics = sample statistics: any descriptive measure based on a sample. E.g. n
- Data set: Collection of data elements
Page 2.19
2.2 Descriptive Statistics and Graphical Methods
- Sample mean = x̄ = ΣXᵢ / n : Sum all of the observations and divide by the sample size
- (X – x̄) : Deviations from the mean
Page 2.22
- Dispersion: measures of dispersion address whether the data elements are tightly clustered together or widely spread about the mean
- Mean absolute deviation (MAD) = Σ|X – x̄| / n
- Sample variance (definitional formula) = s² = Σ(X – x̄)² / (n – 1)
- Sample standard deviation = s = √s²
- Standard deviation is very useful for comparing samples
Page 2.24
- Computational formula for the sample variance = s² = [ΣX² – (ΣX)² / n] / (n – 1)
Page 2.25
- Standard summary statistics: Sample size (n), sample mean (x ) and sample standard deviation
(s) provide information on the number of subjects in the sample, the location, and the
dispersion of the sample
- Descriptive statistics should include no more than one decimal place beyond that observed in
the original data elements
- Sample median for odd number of samples: The middle value
- Sample median for even number of samples: Mean of the two middle values
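These sample summaries are all available in Python's standard library; a minimal sketch with hypothetical data:

```python
import statistics

data = [2, 3, 5, 7, 8, 10]        # hypothetical sample, n = 6
n = len(data)
xbar = statistics.mean(data)      # sample mean
s = statistics.stdev(data)        # sample SD (divides by n - 1)
med = statistics.median(data)     # even n: mean of the two middle values
print(n, med)   # 6 6.0
```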
Page 2.26
- Mean and median are statistics that measure the average or typical value of a particular
characteristic
- Median is particularly useful when there are extreme values in the sample
- When sample mean and sample median are very different, it suggests that extreme values are
affecting the mean
- Position of the median when the number of observations is odd = (n + 1) / 2
- Positions of the 2 middle values when the number of observations is even = n / 2 and n / 2 + 1
- The first quartile = Q1: The sample value that holds approximately 25% of the data elements at
or below it
- The second quartile = Q2: the median
- The third quartile = Q3: The sample value that holds approximately 25% of the data elements at
or above it
Page 2.27
- Position of the quartiles when the number of observations is odd: [k], where k = (n + 3) / 4 and [k] is the greatest integer less than or equal to k
- Position of the quartiles when the number of observations is even: [k], where k = (n + 2) / 4
- Quartiles are very informative statistics for understanding the distribution of a particular
characteristic
- Mode: Most frequent value
- A sample with no repeated values has no mode
- Minimum and maximum values are useful for identifying outliers
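The quartile position rules can be sketched as follows; interpreting the bracketed value [k] as rounding down to the nearest integer is an assumption about the notes' notation:

```python
import math

def quartile_position(n):
    """Position [k] of Q1 per the notes' odd/even rules (interpretation)."""
    k = (n + 3) / 4 if n % 2 == 1 else (n + 2) / 4
    return math.floor(k)

# For n = 9 (odd): k = 3.0, so Q1 sits at position 3 of the sorted data.
print(quartile_position(9))   # 3
print(quartile_position(10))  # 3
```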
Page 2.28
- Range: Maximum minus the minimum
Page 2.30
- Uncorrected sum of squares = uncorrected SS = sum of the squared observations = ΣX²
- Corrected sum of squares = corrected SS = Σ(X – x̄)²
Page 2.31
- Coefficient of variation = Coeff Variation = ratio of the sample standard deviation to the sample mean, expressed as a percentage = CV = (s / x̄) × 100
- Standard error of the mean = Std Error Mean = s / √n
- If there are no outliers, the mean and the standard deviation are the most appropriate
measures of location and dispersion
- If there are outliers, the median and the interquartile deviation are the most appropriate
measures of location and dispersion
- Interquartile range = IQR = Q3 – Q1
- Interquartile deviation = (Q3 – Q1) / 2
- Quantiles = Percentiles = The kth quantile is the score that holds k% of the data below it
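The dispersion statistics above can be computed directly; note that `statistics.quantiles` uses its own default ("exclusive") method, which may differ slightly from the textbook's position rule. The data are hypothetical:

```python
import statistics

data = [4, 6, 6, 8, 10, 14]
n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)

cv = (s / xbar) * 100                          # coefficient of variation, in percent
se = s / n ** 0.5                              # standard error of the mean
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles (library's default method)
iqr = q3 - q1
iq_dev = iqr / 2                               # interquartile deviation
print(q1, q2, q3)   # 5.5 7.0 11.0
```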
Page 2.36
- Empirical Rule:
o Approximately 68% of the observations fall between x̄ – s and x̄ + s
o Approximately 95% of the observations fall between x̄ – 2s and x̄ + 2s
o Nearly all (approximately 99.7%) of the observations fall between x̄ – 3s and x̄ + 3s
Page 2.39
- Outliers:
o Observations outside the range x̄ ± 3s
o Observations above Q3 + 1.5(Q3 – Q1) or below Q1 – 1.5(Q3 – Q1)
- If there are no outliers, the most appropriate measure of location is the sample mean and the measure of dispersion is the sample standard deviation
- If there are outliers, the most appropriate measure of location is the median and the most appropriate measure of dispersion is the interquartile deviation, IQR / 2 = (Q3 – Q1) / 2
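Both outlier rules can be applied mechanically; the data below are hypothetical, built so a single extreme value (50) trips both rules:

```python
import statistics

data = [5] * 20 + [50]
xbar, s = statistics.mean(data), statistics.stdev(data)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

z_rule = [x for x in data if abs(x - xbar) > 3 * s]                       # x̄ ± 3s rule
iqr_rule = [x for x in data if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr]  # quartile rule
print(z_rule, iqr_rule)   # [50] [50]
```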
- Box and whisker plot: Informative graphical display for continuous variables incorporates the
minimum and maximum, the median, and the quartiles
- The shorter the sections of the plot, the more tightly the observations are clustered
Page 2.42
- Stem and leaf plot
o Each data element is split into 2 pieces: the stem and the leaf
o The leading digits denote the stem and the trailing digits denote the leaf
o Best results are achieved when there are between 6 and 12 unique stems
o The stem and leaf plot is essentially a histogram
Page 2.45
2.2.3 Numerical summaries for discrete variables
- Frequency = f : The number of times each data element (or response option) is observed
- Frequency distribution table: Table that displays each response option and its frequency
- Sum of the frequencies = sample size
- Relative frequency = f / n, where n = sample size
Page 2.47
- Cumulative frequency indicates the number of data elements at or better than each response
option
- Cumulative relative frequency = Ratio of the cumulative frequency to the sample size
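A frequency distribution table for an ordinal variable can be built with the standard library; the symptom-severity responses below are hypothetical:

```python
from collections import Counter
from itertools import accumulate

responses = ["minimal", "minimal", "moderate", "severe", "moderate", "minimal"]
order = ["minimal", "moderate", "severe"]     # the ordered response options
n = len(responses)

counts = Counter(responses)
f = [counts[opt] for opt in order]            # frequencies, in scale order
rel = [fi / n for fi in f]                    # relative frequencies f / n
cum = list(accumulate(f))                     # cumulative frequencies
cum_rel = [c / n for c in cum]                # cumulative relative frequencies
assert cum[-1] == n                           # sum of frequencies = sample size
print(f, cum)   # [3, 2, 1] [3, 5, 6]
```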
Page 2.48
- Numeric values are assigned to each of the response options to produce an ordered scale
- These numeric values are then used to generate summary statistics
Page 2.49
- Sometimes, continuous data are organized into categories as a means of summarizing
- The most useful presentations include between 6 and 12 distinct categories or classes
Page 2.51
2.2.4 Graphical summaries for discrete variables
- Bar chart: Graphical display for categorical variables
o Breaks between rectangles indicate distinct, unordered response options
- Histogram: Graphical display for ordinal variables
o No breaks between rectangles suggesting order in the response options
Page 2.67
2.5 Analysis of Framingham Heart Study Data
- Investigate risk factors for cardiovascular disease
- Cohort underwent extensive clinical examinations every 2 years
- Exams included measurements: height, weight, systolic and diastolic blood pressure, laboratory
and blood tests to measure cholesterol levels, echocardiography to measure heart function
- Offspring and spouses are enrolled and participated in examination every 4 years
- 3rd generation enrolled for investigation into genetic factors associated with cardiovascular and
other diseases
- Participants contributed up to 3 examination cycles of data, approximately 6 years apart, and follow-up data on incident (newly developed) cardiovascular disease, cerebrovascular disease, and death are available over a 24-year follow-up period
Page 3.88
Chapter 3: Probability
- An experiment: A process by which a measurement is taken or observations are made, or a procedure that generates outcomes
Page 3.89
- Deterministic experiments: The same outcome is observed each time the experiment is
performed
- Random experiments: one of several possible outcomes is observed each time the experiment is
performed
- Probability theory is concerned with random experiments
- Sample space = S: the enumeration of all possible outcomes of an experiment. E.g. for an experiment with 6 possible numbers, 1 to 6, S = {1, 2, 3, 4, 5, 6}
- Events: Collections of outcomes. Denoted with capital letters
- Simple events: Individual outcomes
Page 3.90
- Probability: The likelihood that the outcome occurs
- P(outcome): Probability of outcome
- P(event): Probability of event
- ΣP(outcomes) = 1
- P(outcome) = 1 / N, where N denotes the total number of outcomes in the experiment
- P(event) = (# of outcomes in the event) / N
Page 3.91
- Complement of an event: consists of all outcomes in the sample space that are not in the event. E.g. A′ = {2, 3, 4}
- Complement Rule: P(A′) = 1 – P(A)
Page 3.92
- Union of two events (∪, read “or”): consists of outcomes in either event (or in both events)
- Intersection of two events (∩, read “and”): consists of outcomes that are in both events
- Union of 2 events = Addition Rule: P(A or B) = P(A) + P(B) – P(A and B)
- Mutually exclusive: 2 events have no outcomes in common
Page 3.93
- When 2 events are mutually exclusive, the probability of their intersection is 0: P(A and B) = 0
- Conditional probability = P(A | B): the probability that A occurs given that B has occurred
- Conditional Probability Rule: P(A | B) = P(A and B) / P(B)
Page 3.94
- Independent: The probability of one is not influenced by the occurrence or nonoccurrence of
the other
Page 3.95
- Independence: 2 events A and B are independent if
o P(A) = P(A | B), or
o P(B) = P(B | A), or
o P(A ∩ B) = P(A) × P(B)
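These rules can be verified on a small sample space of equally likely outcomes (the events A and B here are illustrative, not from the notes):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}              # e.g. "even outcome"
B = {4, 5, 6}              # e.g. "outcome greater than 3"

def P(event):
    """P(event) = # outcomes in event / N, for equally likely outcomes."""
    return Fraction(len(event & S), len(S))

assert P(S - A) == 1 - P(A)                   # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)     # addition rule
cond = P(A & B) / P(B)                        # conditional probability P(A | B)
print(cond)   # 2/3  (not equal to P(A) = 1/2, so A and B are not independent)
```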
Page 3.96
- What is the probability that a patient who is at low risk for depression is clinically depressed?
o = P(clinically depressed | low risk for depression) = P(clinically depressed and low risk) / P(low risk)
- Is the clinical assessment of depression independent of the patient’s risk for depression?
o i.e. does P(clinical assessment of depression) equal P(clinical assessment of depression | risk of depression)?
o P(clinical assessment of depression) = (total depressed) / N
o Compare with P(clinical assessment of depression | low risk)
Page 3.98
- Determining sampling strategy
o Whether individuals are sampled from a population with replacement or without
replacement
o Whether the order in which individuals are sampled is important
Page 3.99
- Permutations of n individuals from a population of size N = sampling without replacement, order important; the number of distinct arrangements = NPn = N! / (N – n)!
o E.g. 6 numbers, pick 2 at a time
Page 3.100
- Combinations of n individuals from a population of size N = sampling without replacement, order is NOT important = NCn = N! / (n! (N – n)!)
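Both counts are built into Python's `math` module; a sketch using the notes' "6 numbers, pick 2 at a time" example:

```python
import math

N, n = 6, 2
perms = math.perm(N, n)   # order matters:     N! / (N - n)!
combs = math.comb(N, n)   # order irrelevant:  N! / (n! (N - n)!)
print(perms, combs)   # 30 15
```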
Page 3.102
- Probability distribution = Probability model: The listing of the values of the random variable and
corresponding probabilities
Page 3.103
- Binomial distribution model has 3 specific attributes:
o Each performance of the experiment, called a trial, results in only one of 2 possible outcomes, which we call either a success or a failure
o The probability of a success on each trial is constant, denoted p, with 0 <= p <= 1
o The trials are independent
Page 3.104
- Binomial distribution formula for the discrete random variable X, which denotes the number of successes out of n trials:
o P(X = x) = nCx · pˣ(1 – p)⁽ⁿ⁻ˣ⁾
o P(X = x) = [n! / (x! (n – x)!)] · pˣ(1 – p)⁽ⁿ⁻ˣ⁾
Where x = # successes of interest (x = 0, 1, …, n)
p = P(success) on any trial
n = # trials
nCx = the number of combinations of x successes in n trials
- The binomial distribution is an appropriate model for the number of successes in a simple
random sample of size n with replacement drawn from a population where there are 2 possible
outcomes (success and failure)
Page 3.105
- When the probability of success is high, we are more likely to observe successes on most trials than to observe none
Page 3.108
- In the binomial distribution
o Mean = µ = n · p
o Variance = σ² = n · p · (1 – p)
- The mean number of successes represents the typical number of successes
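A quick check of the binomial formula and its mean and variance; n = 5 and p = 0.5 are illustrative values, not taken from the notes:

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.5
assert abs(sum(binom_pmf(x, n, p) for x in range(n + 1)) - 1) < 1e-12  # sums to 1
mean = n * p                 # 2.5
var = n * p * (1 - p)        # 1.25
print(binom_pmf(2, n, p))    # 0.3125
```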
- Normal distribution: A characteristic (continuous variable) is said to follow a normal distribution
if the distribution of values is bell shaped
- The total area under the normal curve is 1.0
Page 3.109
- The normal distribution is one where values in the center of the distribution are more likely than
values at the extreme
- Normal probability distribution:
o f(x) = (1 / (σ√(2π))) · e^(–(1/2)((X – µ)/σ)²)
o where x is a continuous random variable
o µ = mean of the random variable X
o σ = standard deviation of the random variable X
- Attributes of the normal distribution
o Symmetric about the mean
o The mean = the median = the mode
o The mean and variance, µ and σ², completely characterize the normal distribution
o P(µ – σ < X < µ + σ) = 0.68
o P(µ – 2σ < X < µ + 2σ) = 0.95
o P(µ – 3σ < X < µ + 3σ) = 0.99
o P(a < X < b) = the area under the normal curve from a to b
- X ~ N(µ, σ²), where ~ is read “is distributed as”: the random variable X follows a normal distribution with mean µ and variance σ²
Page 3.113
- To calculate probabilities for any normal distribution X, we first standardize using
o Z = (X – µ) / σ
o where Z follows the standard normal distribution
o Locate P(Z < 0.5) from Table B.2 = 0.6915
o Determine P(Z > 0.5) = 1 – 0.6915 = 0.3085
Page 3.119
- For the normal distribution, percentiles: the kth percentile is the score that holds k% of the scores below it
o X = µ + Zσ
µ = mean of the random variable X
σ = standard deviation of the random variable X
Z = value from the standard normal distribution for the desired percentile
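`statistics.NormalDist` can stand in for Table B.2 here; the µ = 100, σ = 15 percentile example is hypothetical:

```python
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)
p = std.cdf(0.5)                 # P(Z < 0.5) ≈ 0.6915
upper = 1 - p                    # P(Z > 0.5) ≈ 0.3085

# Percentile via X = µ + Zσ, e.g. the 90th percentile of N(µ=100, σ²=15²)
z90 = std.inv_cdf(0.90)          # ≈ 1.2816
x90 = 100 + z90 * 15             # ≈ 119.2
print(round(p, 4), round(x90, 1))
```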
Page 3.121
- As the number of trials, n, in binomial applications becomes large, the distribution of the
number of successes, x, is approximately normally distributed
- The normal approximation is appropriate if np >= 5 and n(1 – p) >= 5
- For the binomial distribution
o µ = n · p
o σ = √(n · p · (1 – p))
o Z = (X – µ) / σ
o With the continuity correction: Z = ((X ± 0.5) – n · p) / √(n · p · (1 – p))
- A continuity correction of 0.5 is generally applied prior to implementing the formula
Page 4.150
Chapter 4: Sampling distribution
- Sampling distribution of sample means: The collection of all possible sample means
- For simple random samples (srs) with replacement:
o µx̄ = µ, where µx̄ is the mean of the sample means
o σx̄ = σ / √n, where σx̄ is the standard deviation of the sample means
- Standard error = standard deviation of the sample means
- For simple random samples (srs) without replacement:
o µx̄ = µ
o σx̄ = (σ / √n) · √((N – n) / (N – 1))
o If the population size N is large or the ratio of the sample size to the population size, n / N, is small, then √((N – n) / (N – 1)) will be close to 1 and the standard error equals the with-replacement value, σ / √n
Page 4.152
- We make inferences about population parameters based on sample statistics
- When the sample size is increased, the standard error is reduced
Page 4.153
- Central Limit Theorem: If we take simple random samples of size n with replacement from the population, then for large n the sampling distribution of the sample means is approximately normally distributed with
o µx̄ = µ
o σx̄ = σ / √n
o where, in general, n >= 30 is sufficiently large
- The CLT is important because it lets us make inferences about the population mean µ based on the value of a single sample mean
- If the population is normal, then the results hold for any sample size n
- If the population is binomial, then the following criteria are required, np > 5, and n (1-p) > 5
- The CLT tells us that we can compute probabilities about the sample mean even if the population is not normal; we can compute probabilities about normal random variables by standardizing (transforming to Z)
Page 4.153
- For a normal population distribution, the sampling distributions of the sample means are normal for each sample size considered
- In non-normal cases, the sampling distributions of the sample means approach normality as the sample size increases (e.g. n >= 30)
Page 4.159
- Without knowing the population distribution, as long as the sample is sufficiently large, we can
appeal to the Central Limit Theorem to make probabilistic statements about the relationship
between the sample mean and the unknown population mean
- Distribution of sample means of size n:
o Z = (x̄ – µx̄) / σx̄
o Z = (x̄ – µ) / (σ / √n)
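A small simulation illustrates the CLT claim: means of samples drawn from a decidedly non-normal (uniform) population cluster around µ. The population and sample size here are illustrative; note `random.sample` draws without replacement, so the finite-population correction applies to the spread:

```python
import random
import statistics

random.seed(1)
population = list(range(1, 101))          # uniform on 1..100, N = 100
mu = statistics.mean(population)          # 50.5
sigma = statistics.pstdev(population)     # population SD (divides by N)

n = 36
means = [statistics.mean(random.sample(population, n)) for _ in range(2000)]
# Mean of the sample means ≈ µ; their SD ≈ (σ/√n)·√((N−n)/(N−1)).
print(statistics.mean(means), mu)
```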
Page 5.174
Chapter 5 Statistical Inference: Procedures for µ
- A point estimate for a population parameter is the “best” single number estimate of that
parameter
- A confidence interval estimate is a range of values for the population parameter with a level of
confidence attached (e.g. 95% confidence that the range or interval contains the unknown
parameter)
Page 5.175
- The point estimate for the population mean µ is the “best” single valued estimate of the
population mean derived from the sample.
- The “best” single number estimate for the population mean is the value of the sample mean
- µ̂ = x̄, where µ̂ is the estimate of the population mean
- The sample mean is an unbiased estimator of the population mean
- Unbiasedness is a desirable property in a point estimate
Page 5.176
- If the standard deviation of the sample means is 0.95, then sample means typically fall about 0.95 units from the population mean µ
- Standard error: a measure of the variation in the sample means, denoted σx̄
- As the sample size increases, the standard error decreases, resulting in a more precise estimate
of the population mean.
Page 5.177
- Once a point estimate is determined, the next issue of concern is how close that estimate is to
the true, unknown value.
- The standard error addresses this issue
- The confidence interval incorporates both the point estimate and the standard error
- Confidence Interval (CI): a range of values that is likely to cover the true parameter (e.g. µ).
- CIs start with the point estimate and build in a component that addresses how close the point estimate is to the true, unknown parameter.
- This component is called the margin of error
- The general form of a confidence interval is point estimate ± margin of error
- In developing confidence intervals, we select a level of confidence that reflects the likelihood
the confidence interval contains the true, unknown parameter
- With a 95% confidence interval, we are 95% confident that the range of values will contain the
true unknown parameter
- In normal distribution, P(-1.96 < Z < 1.96) = 0.95
- For large samples, Z = (x̄ – µx̄) / σx̄
Page 5.178
- The 95% confidence interval for µ = x̄ ± 1.96 σx̄
o = x̄ ± 1.96 σ / √n
o where x̄ is the point estimate
o 1.96 σ / √n is the margin of error
o If x̄ = 37.85, σ = 9.5, n = 100, then the 95% confidence interval is 35.99 to 39.71
o The margin of error is wider for smaller sample sizes
o If we have 40 different random samples of the same size, in theory, 95% of 40 = 38 will cover the true mean µ
o The general form of the confidence interval for µ is
o x̄ ± Z 1-(α/2) σ / √n
o where Z 1-(α/2) is the value from the standard normal distribution with area in the lower tail equal to 1 – α/2, corresponding to the 100(1 – α)% confidence interval
o α refers to the total area in the 2 tails of the standard normal distribution
o Z 1-(α/2) = Z 0.975 for the 95% confidence interval
Page 5.180
o For the 95% confidence interval (i.e. total tail area = 0.05, 100(1 – 0.05)% = 95%)
o α = 0.05, Z 1-(α/2) = Z 1-(0.05/2) = Z 0.975 = 1.96
o As the confidence level increases, the value of Z 1-(α/2) increases and the margin of error increases
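The notes' numerical example (x̄ = 37.85, σ = 9.5, n = 100) can be reproduced directly:

```python
from statistics import NormalDist

xbar, sigma, n = 37.85, 9.5, 100
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96
moe = z * sigma / n**0.5                  # margin of error
lo, hi = xbar - moe, xbar + moe
print(round(lo, 2), round(hi, 2))   # 35.99 39.71
```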
Page 5.182
- If the sample size is large, then the sample standard deviation can be used to estimate σ
- σ̂ = s
- When the population standard deviation (σ) is unknown and the sample size is small (n < 30), the Central Limit Theorem no longer applies and we cannot assume that the distribution of the sample mean is approximately normal. We instead use a value from the t distribution
- t values are indexed by their degrees of freedom (df)
- df = n – 1
- As the degrees of freedom increase, the t values approach the standard normal values
- Confidence interval for µ when σ is unknown and n < 30:
o x̄ ± t 1-(α/2) · s / √n, df = n – 1
Page 5.185
- The margin of error is small when the sample size is large and / or when the standard error is
small
- In experimental design, the number of subjects selected depends on practical considerations
(e.g. time and / or financial constraints), or on statistical considerations.
- In order to determine the sample size:
o How much error can be tolerated in the estimate (i.e. how close must the estimate be to
the true mean µ)?
o What level of confidence is desired?
Page 5.186
- n = (Z 1-(α/2) · σ / E)²
o where E = margin of error
o This is the minimum number of subjects required to ensure a margin of error equal to E in the confidence interval for µ with the specified level of confidence
Page 5.187
- When σ is unknown,
o We can substitute a value derived from a similar application reported in the literature
o We can generate an estimate based on a previous application
o We can conduct a pilot study
o We can use an educated guess
- For a best guess, use the Empirical Rule, where the range is about 6 standard deviations
- Based on the Empirical Rule, a conservative estimate of the standard deviation (dividing by 4 rather than 6 yields a larger σ̂, hence a larger sample size):
o σ̂ = range / 4
Page 5.188
- To ensure a smaller margin of error, more subjects are needed in the sample
Page 5.190
- Hypothesis Tests about µ
- In hypothesis testing, an explicit statement or hypothesis is generated about the population
parameter
- Null hypothesis = H0 : µ = 130 (mean remains unchanged)
- Alternative Hypothesis = H1 : µ > 130
Page 5.191
- If n = 10 and x̄ = 130, H0 is most likely true
- If x̄ = 150, H1 is most likely true
- If x̄ = 135, then how likely are we to observe a sample mean of 135 from a population with mean µ = 130?
o We must determine a critical value such that if our sample mean is less than the critical value, we will conclude that H0 is true, and if our sample mean is greater than the critical value, we will conclude that H1 is true
Page 5.192
o For n >= 30, Z = (x̄ – µx̄) / σx̄ = (x̄ – µ0) / (σ / √n)
o Here µ0 is the mean specified in H0
o Z is close to zero when x̄ is close to µ0 = 130; then H0 is most likely true
o Z is large when x̄ is larger than µ0; then H1 is most likely true
o In hypothesis testing, we need to determine the point at which Z is too large; that point is the critical value of Z
o n = 108; σ = 15; under the null hypothesis (when µ = 130), the distribution of the sample means is approximately normal
o µx̄ = µ
o σx̄ = σ / √n
o µx̄ = µ0 = 130
o σx̄ = σ / √n = 15 / √108 = 1.44
o For x̄ > 132.88: Z = (x̄ – µ0) / (σ / √n) = (132.88 – 130) / (15 / √108) = 2
o P (x̄ > 132.88) = P (Z > 2) = 0.0228
o If x̄ > 132.88 and we reject the null hypothesis, the probability that we are making a mistake is 2.28%
o We must decide what level of error we can tolerate
o Level of significance = α: the probability of rejecting H0 when H0 is true = false positive,
generally in the range of 0.01 to 0.10
o Small levels of significance are purposely selected
o Once a level of significance is selected, a decision rule is formulated.
o The decision rule is a formal statement of the criteria used to draw a conclusion in the
hypothesis test.
o E.g. if level of significance = α = 0.05 = 5% chance of rejecting H0 when H0 is true
o See graph on page 5.193
Page 5.193
o P ( Z > ?) = 0.05
o P ( Z > 1.645) = 0.05
o The region in the tail end of the curve (shaded) is called the rejection region (critical
region)
o Decision Rule:
Reject H0 if Z >= 1.645
Do not reject H0 if Z < 1.645
o Test statistic: Z = (x̄ – µ0) / (σ / √n)
 If x̄ = 135, then Z = (135 – 130) / (15 / √108) = 3.46
o Compare the test statistic to the decision rule
 Test statistic 3.46 >= 1.645, therefore reject H0
 Therefore, we have significant evidence, at level of significance α = 0.05, that the mean has increased (alternative hypothesis).
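The one-sided Z test above can be sketched with the example's values (H0: µ = 130, H1: µ > 130, σ = 15, n = 108, x̄ = 135):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 130.0, 15.0, 108, 135.0, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)    # 1.645 for a one-sided test at alpha = 0.05
z_stat = (xbar - mu0) / (sigma / sqrt(n))   # test statistic
reject = z_stat >= z_crit                   # decision rule
```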
Page 5.194
- When σ is unknown and the sample size is small, the test statistic follows a t distribution
- The critical values must be selected from the t table
- When the test indicates a rejection of H0, a very strong statement is made
- When the test does not indicate a rejection of H0, a very weak statement is made
- Type I error: Reject H0 when H0 is true. Probability = α = P (Type I error) = P (Reject H0 | H0 true) = False positive
- Type II error: Do not reject H0 when H0 is false. Probability = β = P (Type II error) = P (Do not reject H0 | H0 false) = False negative
Page 5.195
- Because we select a level of significance, α, a priori, we know that the probability of committing
a Type I error is small
- When we reject H0, we can be confident that a correct decision has been made, since the
chance of committing a Type I error is small.
- When we do not reject H0, we cannot be confident that a correct decision has been made, because the chance of committing a Type II error is unknown
Page 5.201
- P value = the exact level of significance of the data
- P value is the smallest level of significance that would lead to the rejection of the null hypothesis
- Reject H0 if p value <= α
- H0 : µ = 30
- H1: µ ≠ 30
- α = 0.05
- n = 25, x = 35, s = 10
- SAS produces t = 2.5 and p value = 0.0194
- The p value is the probability of observing a value as extreme or more extreme than the observed test statistic: P ( t > 2.50 or t < –2.50)
- In a 2 tail test, P (t > 2.50) = 0.0097, so the p value = 2 * 0.0097 = 0.0194
- The critical value for a 2 sided test with α = 0.05 and df = 24 is t = 2.064
- Since t = 2.5 > 2.064, we reject H0 and conclude the mean is significantly different from 30
- Also, because p value (0.0194) <= α (0.05), we reject H0
- If α = 0.01, we do not reject H0
- When comparing the test statistic to a critical value, actual t statistics are compared
- When comparing the p value to the level of significance, areas in the tails of the distribution are compared
Page 5.203
- If a one-sided test is desired:
o Reject H0 if (p value / 2) <= α
Computing P Values by Hand
- At the 5% level of significance
- H0: µ = 130
- H1: µ > 130
- At α = 0.05, P (Zcritical > 1.645) = 0.05
- The decision rule:
o Reject H0 if Z test statistic ≥ Z critical = 1.645
o Do not reject H0 if Z test statistic < Z critical
o Z test statistic = (x̄ – µ0) / (σ / √n) = (135 – 130) / (15 / √108) = 3.46
o Therefore reject H0
o We could have selected a smaller level of significance and still reached the same
conclusion
- How to calculate the p value
o For the p value, find P (Z > 3.46)
o At α = 0.005, Z critical = 2.576, we can reject H0
o At α = 0.001, Z critical = 3.090, we can reject H0
o At α = 0.0001, Z critical = 3.719, we cannot reject H0 because 3.46 < 3.719
o Therefore, the smallest tabled level of significance at which we can still reject H0 is 0.001
o The significance of the data, or the p value, is less than 0.001
o Therefore, p < 0.001
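Rather than bracketing the p value with table lookups, the exact upper-tail area can be computed directly:

```python
from statistics import NormalDist

# Exact one-sided p value for the observed test statistic Z = 3.46
z_stat = 3.46
p_value = 1 - NormalDist().cdf(z_stat)   # upper-tail area, consistent with p < 0.001
```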
Page 5.204
- The probability of a Type II error, β, is difficult to control because it depends on several factors
- One of the factors is α: β decreases as α increases
- One must weigh the choice of a lower β = P (Type II error), which is desirable, against a higher
level of significance, which is undesirable
- In hypothesis testing, we are concerned with the power of a test = 1- β
- Power of a test: the ability to detect, i.e. reject, a false null hypothesis
- Power = 1 – β = P (Reject H0| H0 false)
- As power increases, β decreases, resulting in a better test
- Power is a function of:
o n = sample size
o α = level of significance = P (Type I error)
o ES = the effect size = the standardized difference in means specified under H0 and H1
- The power of a particular test is higher (or better) with a larger sample size, a larger level of significance, and a larger effect size
- If H0: µ = 100
- H1: µ > 100
- α = 0.005 (Z critical = 2.576)
- Reject H0 if Z test statistic ≥ Z critical = 2.576
- σx̄ = 6, Z test statistic = (x̄ – µ0) / σx̄
- If true mean = 110
- α = P (Type I error) = P (Reject H0 | H0 true)
- β = P (Type II error) = P (Do not reject H0 | H0 false)
- Power = 1 – β = P (Reject H0 | H0 false)
Page 5.206
- See graph
- α is the area under the left most curve (H0 true) where we reject H0
- β is the area under the rightmost curve (H1 true) where we do not reject H0
- Power is (1 – β), the remaining area under the rightmost curve (H1 true)
- When α increases from 0.05 to 0.10, the power increases. A better test is more powerful
- A larger sample size ensures a more powerful test
Page 5.207
- If the mean in H1 increases from 110 to 120, at α = 0.05, the power increases: power increases as the difference between the means specified in H0 and H1 increases
- This difference is called the detectable difference
- The power of a two sided test for µ:
o Power = P ( Z > Z 1-(α/2) – |µ0 – µ1| / (σ / √n) )
- The power of a one sided test for µ:
o Power = P ( Z > Z (1-α) – |µ0 – µ1| / (σ / √n) )
- If n = 20, σ = 9.5,
- H0 : µ = 80
- H1 : µ ≠ 80
- α = 0.05
- Find the power of the test if µ = 85
- Power = 0.6554
- There is a 65% probability that this test will detect a difference of 5 units (85 – 80)
- If n = 20, µ = 83, Power = 0.2912
Page 5.209
- There is only a 29% probability that the test will detect a difference of 3 units in mean
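The two-sided power formula can be sketched with the example's numbers, taking σ = 9.5 (the value consistent with the stated power of about 0.65; the slight difference from 0.6554 comes from table rounding):

```python
from math import sqrt
from statistics import NormalDist

# Power of a two-sided test: P( Z > z_{1-alpha/2} - |mu0 - mu1| / (sigma / sqrt(n)) )
z = NormalDist()
mu0, mu1, sigma, n, alpha = 80.0, 85.0, 9.5, 20, 0.05
shift = abs(mu0 - mu1) / (sigma / sqrt(n))       # standardized detectable difference
power = 1 - z.cdf(z.inv_cdf(1 - alpha / 2) - shift)
```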
- The sample size required to ensure a specific level of power in a 2 sided test:
o n = ((Z 1-(α/2) + Z (1-β)) / ES)²
o Where Z (1-α/2) is the value from the standard normal distribution with lower tail area
equal to 1 – α/2
o Where Z (1-β) is the value from the standard normal distribution with lower tail area
equal to 1 – β
o ES is the effect size, defined as the standardized difference in means under the null and alternative hypotheses
 ES = |µ0 – µ1| / σ
- The sample size required to ensure a specific level of power in a one sided test:
o n = ((Z (1-α) + Z (1-β)) / ES)²
o Where Z (1-α) is the value from the standard normal distribution with lower tail area
equal to 1 – α
o Where Z (1-β) is the value from the standard normal distribution with lower tail area
equal to 1 – β
o ES is the effect size, defined as the standardized difference in means under the null and alternative hypotheses
 ES = |µ0 – µ1| / σ
Page 5.210
- With β= 0.20, Power = 1 – β = 0.8, there is an 80% chance of rejecting a false null hypothesis
relative to a specific effect size
- The effect size reflects the magnitude of a clinically important difference in means
- If H0 : µ = 100
- H1: µ ≠100
- A difference of 5 units in the mean score is considered a clinically meaningful difference
- If the true mean is less than 95 or greater than 105, we do not want to fail to reject H0
- α = 0.05
- σ = 9.5
- How many subjects would be required to ensure that the probability of detecting a 5 unit difference is 80% (power = 0.80)?
- 2 sided test
- ES = |µ0 – µ1| / σ = (105 – 100) / 9.5 = 0.526
- n = ((Z 0.975 + Z 0.80) / 0.526)² = 29
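The sample-size calculation above can be reproduced directly from the example's values:

```python
from math import ceil
from statistics import NormalDist

# Sample size for 80% power, two-sided test: n = ((z_{1-a/2} + z_{1-b}) / ES)^2
z = NormalDist()
mu0, mu1, sigma, alpha, beta = 100.0, 105.0, 9.5, 0.05, 0.20
ES = abs(mu0 - mu1) / sigma                      # effect size
n = ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)) / ES) ** 2)
```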
Page 6.232
Chapter 6: Statistical inference: Procedures for µ1 – µ2
- 2 independent samples procedure: the 2 comparison groups are physically distinct (they are comprised of different individuals)
- 2 dependent samples procedure: each individual serves as his or her own control
Page 6.233
- In 2 independent samples procedures, the parameter of interest is the difference in population
means: (µ1 – µ2).
- Confidence intervals are concerned with estimating µ1 – µ2, the difference in means
- Both the null and alternative hypothesis are concerned with the difference in means,
o e.g. H0: µ1 – µ2 = 0 (no difference in means) versus H1: µ1 – µ2 ≠ 0
- In 2 dependent samples procedures, the parameter of interest is the mean difference: µd
- Confidence intervals are concerned with estimating µd, the mean difference
Page 6.235
- Case 1: 2 Independent Populations – Population Variances Known
o 2 independent populations are compared with respect to their means, and the population variances (σ1² and σ2²) are known
o E.g. male vs female
o E.g. Age < 30 vs Age >= 30
o H0: µ1 – µ2 = 0 (H0: µ1 = µ2)
o Test statistic: Z = (x̄1 – x̄2) / √(σ1²/n1 + σ2²/n2)
o Confidence interval: (x̄1 – x̄2) ± Z 1-(α/2) * √(σ1²/n1 + σ2²/n2)
o We are 95% confident that the difference in mean is between a and b
o If we were to run a test of hypothesis, the hypotheses would be H0: µ1 – µ2 = 0, H1: µ1 –
µ2 ≠ 0
o If the confidence interval estimate contains the value specified in H0, then we do not
reject H0
o If the result is 22.6 ± 1.54, we can conclude that there is a statistically significant difference in means because the confidence interval estimate does not include 0
o If H0: µ1 = µ2 and H1: µ1 > µ2 (or µ1 – µ2 > 0) (upper tail), α = 0.05
o Independent populations with known variances
o Decision rule: Reject H0 if Z ≥ 1.645; do not reject H0 if Z < 1.645
o Test statistic = 8.26
o Therefore reject H0: µ1 is significantly higher than µ2
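A minimal sketch of the Case 1 test with hypothetical values (the means, variances, and sample sizes below are not from the text):

```python
from math import sqrt
from statistics import NormalDist

# Two independent samples, population variances known (all values hypothetical)
x1bar, x2bar = 110.0, 104.0
sigma1, sigma2, n1, n2 = 12.0, 10.0, 50, 60
z_stat = (x1bar - x2bar) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z_crit = NormalDist().inv_cdf(0.95)     # upper-tail test at alpha = 0.05
reject = z_stat >= z_crit
```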
Page 6.239
- Case 2: 2 Independent Populations – Population Variances Unknown but Assumed Equal
o There are 2 independent populations, and the population variances are unknown but assumed to be equal
o σ1² and σ2² unknown but assumed equal, n1 >= 30 and n2 >= 30
 Test statistic: Z = (x̄1 – x̄2) / (Sp √(1/n1 + 1/n2))
 Confidence interval: (x̄1 – x̄2) ± Z 1-(α/2) * Sp √(1/n1 + 1/n2)
o σ1² and σ2² unknown but assumed equal, n1 < 30 or n2 < 30
 Test statistic: t = (x̄1 – x̄2) / (Sp √(1/n1 + 1/n2)) , df = n1 + n2 – 2
 Confidence interval: (x̄1 – x̄2) ± t 1-(α/2) * Sp √(1/n1 + 1/n2)
o Sp is the pooled estimate of the common standard deviation:
 Sp = √( ((n1 – 1)s1² + (n2 – 1)s2²) / (n1 + n2 – 2) )
 Sp is a weighted average of the sample variances
 If the sample sizes are equal, then Sp = √((s1² + s2²) / 2)
o As a rule of thumb, if the sample sizes are equal, or if the sample sizes are unequal but the sample variances are close in value (defined as 0.5 <= s1²/s2² <= 2), then we can assume the population variances are equal
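The rule of thumb and the pooled estimate can be sketched together with hypothetical samples (the sample sizes and SDs below are illustrative, not from the text):

```python
from math import sqrt

# Pooled SD for two hypothetical small samples with similar variances
n1, s1, n2, s2 = 15, 4.0, 12, 5.0
ratio = s1**2 / s2**2                     # variance ratio; within [0.5, 2] -> pool
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
df = n1 + n2 - 2                          # degrees of freedom for the pooled t test
```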
Page 6.242
- At α = 0.05, two sided test, df = 25, from Appendix B.3, t critical = 2.060
- t calculated = 2.22; therefore, reject H0 because t calculated > t critical
- To find the p value: at α = 0.02, df = 25, t critical = 2.485; we cannot reject H0 at that level, therefore 0.02 < p value < 0.05
Page 6.244
- Case 3: 2 Independent Populations, Population Variances possibly unequal
- In order to determine if the population variances are equal, we conduct a preliminary test of H0: σ1² = σ2² against H1: σ1² ≠ σ2²
Page 6.245
- The preliminary test is always a two sided test
- The test statistic is determined by the ratio of the sample variances, which follows an F distribution:
o F = s1² / s2²
- The F distribution has 2 degrees of freedom, denoted df1, and df2, called the numerator and
denominator degrees of freedom
- Numerator degrees of freedom = df1 = n1 – 1
- Denominator degrees of freedom = df2 = n2 – 1
- If the test statistic, F, is close to 1, then H0: σ1² = σ2² is most likely true
- Decision rule: Reject H0 if F ≤ 1 / F 1-(α/2) (df2, df1) or if F ≥ F 1-(α/2) (df1, df2)
- Do not reject H0 if 1 / F 1-(α/2) (df2, df1) < F < F 1-(α/2) (df1, df2)
- The critical values from the F distribution are found in Table B.4
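A sketch of the preliminary F test with hypothetical sample variances (the numbers below are illustrative; the critical values would come from Table B.4):

```python
# Preliminary F test for equality of variances (hypothetical sample variances)
s1_sq, n1 = 25.0, 21
s2_sq, n2 = 16.0, 16
F = s1_sq / s2_sq                 # ratio of sample variances
df1, df2 = n1 - 1, n2 - 1         # numerator and denominator degrees of freedom
# Compare F to 1 / F_{1-a/2}(df2, df1) and F_{1-a/2}(df1, df2) from Table B.4;
# an F near 1 supports H0 (equal variances)
```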
Page 6.247
- If σ1² ≠ σ2², then:
o σ1² and σ2² unknown and assumed unequal, n1 >= 30 and n2 >= 30
 Test statistic: Z = (x̄1 – x̄2) / √(s1²/n1 + s2²/n2)
 Confidence interval: (x̄1 – x̄2) ± Z 1-(α/2) * √(s1²/n1 + s2²/n2)
o σ1² and σ2² unknown and assumed unequal, n1 < 30 or n2 < 30
 Test statistic: t = (x̄1 – x̄2) / √(s1²/n1 + s2²/n2)
 df = (s1²/n1 + s2²/n2)² / ( (s1²/n1)²/(n1 – 1) + (s2²/n2)²/(n2 – 1) )
 Confidence interval: (x̄1 – x̄2) ± t 1-(α/2) * √(s1²/n1 + s2²/n2)
 Or, letting ci = si²/ni (i = 1, 2), df = (c1 + c2)² / ( c1²/(n1 – 1) + c2²/(n2 – 1) )
Page 6.259
- Case 4: 2 Dependent Populations: The Data are matched or paired
- 2 dependent samples test: A single sample of subjects is drawn and evaluated, each subject is
weighed at the outset, instructed to follow the treatment, and then weighed again after 4 weeks
of treatment. 2 weights are measured on each subject, one pretreatment and the second post
treatment. The samples are dependent or matched by subject
- Dependent samples test:
o Samples are matched or paired, n (# pairs) >= 30
 Z = (x̄d – µd) / (sd / √n)
 Confidence interval = x̄d ± Z 1-(α/2) * sd / √n
o Samples are matched or paired, n (# pairs) < 30
 t = (x̄d – µd) / (sd / √n) , df = n – 1
 Confidence interval = x̄d ± t 1-(α/2) * sd / √n
o x̄d: the mean of the difference scores
o sd: the standard deviation of the difference scores
Page 6.261
o Difference scores can be computed by subtracting the first measurement from the
second, or vice versa
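A sketch of the paired computation with hypothetical pre/post weights (six pairs, invented for illustration):

```python
from math import sqrt

# Paired-samples t statistic from hypothetical pre/post measurements
pre  = [200, 195, 210, 188, 202, 199]
post = [192, 190, 204, 185, 195, 196]
d = [b - a for a, b in zip(pre, post)]        # post minus pre difference scores
n = len(d)
dbar = sum(d) / n                             # mean difference
sd = sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))   # SD of differences
t_stat = (dbar - 0) / (sd / sqrt(n))          # H0: mu_d = 0, df = n - 1
```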
Page 6.264
- Crossover design: Each participant is given each treatment under investigation, as opposed to
only a single treatment
- Carryover Effects: When subjects learn from one treatment and therefore improve under the
second treatment solely due to the carryover from the first treatment and not due to the
second treatment itself
Page 6.266
- In 2 sample applications:
- The effect size: the standardized difference in the values of the parameter of interest specified under H0 and H1
- ES = | (µ1 – µ2)H1 – (µ1 – µ2)H0 | / σ
- When H0: µ1 – µ2 = 0, ES = | (µ1 – µ2)H1 | / σ = |µ1 – µ2| / σ
- Where σ = the common standard deviation (i.e. σ1 = σ2 = σ)
- The power of a test is higher (or better) with larger sample sizes (n1 and n2)
- Power = P ( Z > Z 1-(α/2) – |µ1 – µ2| / √(2σ²/n) )
o Where µ1 – µ2 is the difference in means under H1
o σ is the common standard deviation
o n is the common sample size (n1 = n2 = n)
- The sample sizes are assumed equal
Page 6.268
- The sample size required to ensure a specific level of power in a 2 sided, 2 independent samples
test:
o ni = 2 * ((Z 1-(α/2) + Z (1-β)) / ES)²
o where ni is the minimum number of subjects required in sample i (i = 1, 2)
o ES = |µ1 – µ2| / σ
o Total subjects = n1 + n2
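A sketch of the two-sample sample-size formula with hypothetical planning values (means, σ, and power target below are illustrative):

```python
from math import ceil
from statistics import NormalDist

# Per-group sample size for a two-sided, two independent samples test (80% power)
z = NormalDist()
mu1, mu2, sigma, alpha, beta = 100.0, 105.0, 10.0, 0.05, 0.20
ES = abs(mu1 - mu2) / sigma                    # effect size
ni = ceil(2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)) / ES) ** 2)
total = 2 * ni                                 # total subjects = n1 + n2
```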
Page 9.408
Chapter 9: Analysis of Variance
- ANOVA: Analysis of Variance is one of the most widely used statistical techniques for testing the equality of population means
- ANOVA is used to test the equality of more than 2 treatment means
o H0: µ1 = µ2 = µ3 = … = µk
o H1: means not all equal
o Where k = number of populations under consideration (k > 2)
- Assumptions:
o k independent populations
o Random samples from each of the k (k > 2) populations under consideration
o Large samples (ni >= 30, where i = 1, 2, …, k) or normal populations
o Equal population variances (i.e. σ1² = σ2² = σ3² = … = σk²)
Page 9.409
- In ANOVA, we compare the variation within samples to the variation between samples to assess the equality of the population means
- If the observations within a sample are similar in value (i.e. small within sample variation), and the means differ across samples (large between sample variation), then a real difference is said to exist in the population means
Page 9.411
- To test H0, we compute 2 estimates of the common population variance (σ2).
- Estimate of the within treatment variation: The first estimate is independent of H0
o We do not assume that the population means are equal and we treat each sample
separately
- Estimate of the between treatment variation: The second estimate is based on the assumption
that H0 is true (ie population means are equal) and we pool all data together
- For the estimate of the within treatment variation, we assume that σ1² = σ2² = σ3² = … = σk² = σ²
- sw² = Σ(j=1 to k) sj² / k = (s1² + s2² + … + sk²) / k , where sw² is the within treatment variance (within variation)
Page 9.412
- The estimate of the between treatment variation depends on the assumptions of ANOVA (σ1² = σ2² = σ3² = … = σk² = σ²) and also on the assumption that H0 is true (i.e. H0: µ1 = µ2 = µ3 = … = µk)
- If the population means are all equal, then each sample mean x̄j is an estimate of the common population mean µ
- The x̄j can be viewed as a simple random sample of size k from a population with mean µx̄j = µ and variance σx̄j² = σ²/n
- The variance of the k sample means is:
o sx̄j² = Σ(j=1 to k) (x̄j – x̄)² / (k – 1) , where x̄ is the overall mean
o Since sx̄j² ≈ σx̄j² = σ²/n
o n * sx̄j² ≈ σ²
o So sb² = n * sx̄j² , where sb² is the between treatment variance (between variation)
- The test statistic in ANOVA is based on the ratio of the 2 estimates:
o F = sb² / sw²
- If the 2 estimates of σ² are close in value, then F ≈ 1 and we cannot reject H0
- If the variation between samples (sb²) is large and the variation within samples (sw²) is small (F >> 1), then we reject H0
Page 9.413
- We need 2 degrees of freedom:
o Numerator degrees of freedom = df1 = k – 1
o Denominator degrees of freedom = df2 = k (n – 1) = kn – k = N – k
o Where k = number of populations or treatments
o n = sample size per treatment
o we sometimes let N = kn, where N = the total sample size
- Reject H0 if F calculated ≥ F (df1, df2) = F (k – 1, N – k)
- Xij = ith observation (row) in the jth treatment (column)
- A “.” in place of i or j denotes summation over that index
o E.g. X.1 = sum of the observations in treatment 1
o X1. = sum of the first observations over treatments
o x̄.1 = sample mean in treatment 1
o x̄.. = overall sample mean
Page 9.414
- F = sb² / sw²
o sb² = between variation = MSb = mean square between
o sw² = within variation = MSw = mean square within
- sb² = MSb = SSb / (k – 1)
o SSb = sum of squares between = Σ nj (x̄.j – x̄..)² , df = k – 1
- sw² = MSw = SSw / (N – k)
o SSw = sum of squares within = Σ (xij – x̄.j)² , df = N – k
- SSTotal = sum of squares total = Σ (xij – x̄..)² , df = N – 1
- Where:
o xij = ith observation in the jth treatment
o x̄.j = sample mean of the jth treatment
o x̄.. = overall sample mean
o k = # of treatments
o N = Σ nj = total sample size
o nj = number of observations in treatment j
Page 9.415
- Generally, to organize computations in ANOVA applications, an ANOVA table is constructed

Analysis of Variance (ANOVA) table
Source of Variation | Sum of Squares (SS)       | df    | Mean Squares (MS)         | F
Between             | SSb = Σ nj (x̄.j – x̄..)²   | k – 1 | sb² = MSb = SSb / (k – 1) | F = sb²/sw² = MSb/MSw
Within              | SSw = Σ (xij – x̄.j)²      | N – k | sw² = MSw = SSw / (N – k) |
Total               | SSTotal = Σ (xij – x̄..)²  | N – 1 |                           |
- The within treatment sum of squares, SSw = SSerror (also called the sum of squares due to error), is computed by summing the squared differences between each observation and its treatment mean
Page 9.416
- To compute the within sums of squares, SSw, we construct the following table, with one column of deviations and one column of squared deviations per treatment:

Treatment A                | Treatment B                | Treatment C
(Xi1 – x̄.1)  (Xi1 – x̄.1)²  | (Xi2 – x̄.2)  (Xi2 – x̄.2)²  | (Xi3 – x̄.3)  (Xi3 – x̄.3)²

- SSw = Σ (xij – x̄.j)² = Σ (Xi1 – x̄.1)² + Σ (Xi2 – x̄.2)² + Σ (Xi3 – x̄.3)²
- SSTotal = the sum of the squared differences between each observation and the overall mean
- SSTotal = SSb + SSw
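The sums of squares above can be computed from scratch for a tiny hypothetical data set (three treatments of three observations each, invented for illustration):

```python
# One-way ANOVA sums of squares computed from scratch (hypothetical data, k = 3)
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
k = len(groups)
N = sum(len(g) for g in groups)
grand = sum(x for g in groups for x in g) / N        # overall mean x-bar..
means = [sum(g) / len(g) for g in groups]            # treatment means x-bar.j
SSb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
SSw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
SStotal = sum((x - grand) ** 2 for g in groups for x in g)
F = (SSb / (k - 1)) / (SSw / (N - k))                # MSb / MSw
```

Note that SSb + SSw reproduces SStotal, matching the partition stated above.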
- A summary statistics table also helps with the above calculations:

        Treatment A | Treatment B | Treatment C
nj      n1          | n2          | n3
Σ Xij   Σ Xi1       | Σ Xi2       | Σ Xi3
x̄.j     x̄.1         | x̄.2         | x̄.3
Page 9.419
- When the sample sizes are equal, x̄.. = (x̄.1 + x̄.2 + x̄.3) / 3
Page 9.424
- Fixed effects models: the treatment groups under study represent all treatments of interest
o Concluding statement: there is significant evidence of a difference in means among the treatments studied
o Our conclusions apply only to the treatments studied
- Random effects models: we randomly select k treatments for the investigation from a larger pool of available treatments
o Concluding statement: there is significant evidence of a difference in means among ALL treatments, though we studied only a subset
- The formulas in this chapter are for fixed effects models only
Page 9.425
- η²: the ratio of the variation due to the treatments (SSb) to the total variation
o η² = SSb / SSTotal , where 0 ≤ η² ≤ 1
o Values of η2 that are closer to 1 imply that more variation in the data is attributable to
the treatments
o E.g. if η2 = 0.629, then 62.9% of the variation in the times to relief is due to the
medication (i.e Drug A, B, or C)
- Once we reject H0 in an ANOVA application, we say there is a significant difference among the treatment means (they are not all equal)
- Then we can test specific hypotheses comparing certain treatments
- Pairwise comparison: compare 2 of the treatments at a time (e.g. H0: µ1 = µ2)
- We can also compare the mean time to relief for patients assigned to A or B to the mean time for C (H0: (µ1 + µ2)/2 = µ3)
- These comparisons are called contrasts
- Multiple Comparison Procedures (MCPs): Statistical procedures for contrasts comparison
Page 9.426
- In applications involving k treatments, there are as many as k(k – 1)/2 possible pairwise comparisons
- In the worst case, the Type I error can be as large as α * (k(k – 1)/2)
- Error rate per comparison (ER_PC) = P (Type I error) on any one test or comparison
- In general, the error rate per comparison is 0.05
- Error rate per experiment (ER_PE) = the number of Type I errors we expect to make in any
experiment under H0
- E.g. 100 two independent samples t tests at α = 0.05: ER_PE = 100 * 0.05 = 5; that is, we expect 5 tests to be significant by chance (5 Type I errors in the experiment)
- ER_PE is a frequency
- Familywise error rate (FW_ER) = P (at least 1 Type I error) in the experiment
- FW_ER = 1 – (1 – αi)^c , where αi is the error rate per comparison (ER_PC) and c is the number of contrasts, or comparisons, in the experiment
o E.g. 5 treatments, α = 0.05
o Pairwise comparisons = k(k – 1)/2 = 5(5 – 1)/2 = 10 distinct comparisons
o ER_PC = 0.05
o ER_PE = 10 * 0.05 = 0.5
o FW_ER = 1 – (1 – 0.05)^10 = 0.401
o We expect to make 0.5 Type I errors solely by chance
o The probability of at least one Type I error is large (a 40.1% chance)
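The error-rate arithmetic for the k = 5 example can be reproduced directly:

```python
# Multiple-comparison error rates for k = 5 treatments at alpha = 0.05
k, alpha = 5, 0.05
c = k * (k - 1) // 2            # number of distinct pairwise comparisons
ER_PE = c * alpha               # expected number of Type I errors per experiment
FW_ER = 1 - (1 - alpha) ** c    # P(at least one Type I error)
```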
Page 9.427
- The Scheffe procedure: an MCP that controls the familywise error rate
- P (Type I error) is controlled (and equal to α) over the family of all comparisons
- Scheffe procedure:
o Set up the hypotheses:
 H0: µi = µj , H1: µi ≠ µj , where µi and µj are two of the k treatment means compared following a significant ANOVA
o Select the appropriate test statistic:
 F = (x̄.i – x̄.j)² / (sw² (1/ni + 1/nj)) = (x̄.i – x̄.j)² / (MSerror (1/ni + 1/nj))
 Where sw² is the estimate of the within variation = mean square within = mean square error (from the ANOVA table)
Page 9.428
o Decision rule:
 Reject H0 if F ≥ (k – 1) * F (k – 1, N – k)
 Do not reject H0 if F < (k – 1) * F (k – 1, N – k)
o Example 1: ANOVA performed, rejected H0: µ1 = µ2 = µ3
 Compare the medications taken 2 at a time (i.e. pairwise comparisons)
 x̄.1 = 33.0, s.1 = 5.7; x̄.2 = 26.0, s.2 = 4.2; x̄.3 = 20.0, s.3 = 3.5
 Drug A vs Drug B:
 H0: µ1 = µ2 , H1: µ1 ≠ µ2 , α = 0.05
 F = (x̄.i – x̄.j)² / (MSerror (1/ni + 1/nj))
 Reject H0 if F ≥ (k – 1) * F (k – 1, N – k) = (3 – 1) * F (3 – 1, 15 – 3) = 2 * 3.89 = 7.78
 sw² = MSw = SSw / (N – k) = MSerror = 20.83
 F calculated = (33.0 – 26.0)² / (20.83 * (1/5 + 1/5)) = 5.88
 Do not reject H0 since 5.88 < 7.78
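The Drug A vs Drug B comparison can be reproduced with the example's numbers (k = 3, N = 15, five subjects per drug):

```python
# Scheffe pairwise comparison, Drug A vs Drug B, using the example's values
xbar_i, xbar_j = 33.0, 26.0
MS_error, n_i, n_j = 20.83, 5, 5
F = (xbar_i - xbar_j) ** 2 / (MS_error * (1 / n_i + 1 / n_j))
F_crit = (3 - 1) * 3.89          # (k - 1) * F(k - 1, N - k) = 2 * F(2, 12)
reject = F >= F_crit             # 5.88 < 7.78, so H0 is not rejected
```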
Page 9.431
SAS indicates which means are significantly different from others by assigning
letters (Scheffe Groupings) to each treatment
If the same letters are assigned to different treatments, the treatments means
are not significantly different
If different letters are assigned to the treatments, the treatment means are
significantly different
The treatments A and B are assigned the letter A, therefore treatments A and B
are not significantly different
The treatments B and C are assigned the letter B, therefore treatments B and C
are not significantly different
The treatments A and C are different (A assigned letter A, C assigned letter B)
o Example 2: Compare urban students to the rural and suburban students combined
 H0: (µ1 + µ2)/2 = µ3 , H1: (µ1 + µ2)/2 ≠ µ3
 H0: (µ1 + µ2)/2 – µ3 = 0, therefore the weights (coefficients) for the sample means are ½, ½, and –1
 F = ((x̄.1 + x̄.2)/2 – x̄.3)² / (MSerror ((½)²(1/n1) + (½)²(1/n2) + (–1)²(1/n3)))
 where each term in the denominator is weighted by the squared coefficient associated with its population mean
Page 9.432
 Reject H0 if F ≥ (k – 1) * F (k – 1, N – k) = (3 – 1) * F (3 – 1, 28 – 3) = 2 * F (2, 25) = 2 * 3.39 = 6.78
F calculate = 9.13
Reject H0 since 9.13 > 6.78
- The Tukey procedure (the Studentized Range test): a popular, widely applied MCP that also controls the familywise error rate
- Appropriate for pairwise comparisons
- It does not handle general contrasts
- It is a less conservative procedure (i.e. it has better statistical power) than the Scheffe procedure when there are a large number of pairwise comparisons
- Steps:
o Treatments are ordered according to the magnitude of their respective sample means:
o x̄1' = largest sample mean, x̄2' = second largest sample mean, …, x̄k' = smallest sample mean
o First comparison involves a comparison of the treatments with the largest and smallest
sample means
o If this test is significant, then a test comparing the treatment with the largest sample
mean to the treatment with the next to smallest sample mean is performed, and so on
o If all tests with largest sample are significant, then test the treatment with the second
largest mean with the smallest sample mean
o Stop testing when a nonsignificant result is found
Page 9.431
o Example:
H0: µ1' = µk', H1: µ1' ≠ µk'
Test statistic: qk = (x̄1' – x̄k') / √(sw²/n) = (x̄1' – x̄k') / √(MSerror/n), where n is the
number of observations in each treatment
From Table B.6 Critical Values of the Studentized Range Distribution, get the
critical value for the Tukey test
The critical value depends upon the level of significance α, the number of
treatments involved in the analysis k and the error degrees of freedom from the
ANOVA table
Reject H0 if qk ≥ qα (k, dferror)
Do not reject H0 if qk < qα (k, dferror)
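A minimal sketch of the q statistic in Python; the numeric values (means 33.0 and 24.0, MSerror = 20.83, n = 5) are hypothetical, chosen only for illustration:

```python
import math

def tukey_q(larger_mean, smaller_mean, ms_error, n):
    """Studentized range statistic q for the Tukey procedure (equal group sizes n)."""
    return (larger_mean - smaller_mean) / math.sqrt(ms_error / n)

# Hypothetical values; compare q to q_alpha(k, df_error) from Table B.6 to decide on H0
q = tukey_q(33.0, 24.0, 20.83, 5)
```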
Page 9.437
- Assess changes in a particular measure over time
- Measurements Xsj = measurement (X) on the sth subject (s = 1 to n) on the jth occasion (j = 1 to
k)
- There is one sample of subjects and repeated measures are taken on these subjects
- Introducing a dependency among the measurements
- Steps
o H0: µ1 = µ2 =…= µk, H1: means not all equal
o F = sb²/sw² = MSb/MSw
o Analysis includes an additional source of variation – variation due to the subjects
o Decision rule: Select the appropriate critical value from the F distribution table B.4, with
degrees of freedom df1 = k – 1, df2 = (n – 1)*(k – 1)
Page 9.438
- Repeated Measures Analysis of Variance (ANOVA) table

Source of Variation  | Sum of Squares (SS)                           | df            | Mean Squares (MS)               | F
Between Subjects     | SSSubject = Σ k(x̄s. – x̄..)²                   | n – 1         |                                 |
Between Treatments   | SSb = Σ nj(x̄.j – x̄..)²                        | k – 1         | sb² = MSb = SSb/(k – 1)         | F = sb²/sw² = MSb/MSw
Within               | SSw = SSTotal – SSSubject – SSb               | (n – 1)*(k – 1) | sw² = MSw = SSw/((n – 1)*(k – 1)) |
Total                | SSTotal = Σ(xsj – x̄..)² = Σ(xsj²) – (Σxsj)²/N | nk – 1        |                                 |
- Where x𝑠𝑗 = measurement on the sth subject in the jth treatment
- x̄s. = sample mean of the sth subject
- x̄.j = sample mean of the jth treatment
- x̄.. = overall sample mean
- k = # of measurements per subject
- n = number of subjects
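The table above can be verified numerically. The sketch below computes each sum of squares for a small hypothetical data set (4 subjects measured on 3 occasions) and checks that SSTotal = SSSubject + SSb + SSw:

```python
# data[s][j] = measurement on subject s at occasion j (hypothetical values)
data = [
    [10, 12, 14],
    [11, 13, 16],
    [ 9, 11, 12],
    [12, 14, 15],
]
n = len(data)        # number of subjects
k = len(data[0])     # number of occasions (measurements per subject)

grand = sum(sum(row) for row in data) / (n * k)                       # overall mean
subj_means = [sum(row) / k for row in data]                           # x-bar s.
occ_means = [sum(data[s][j] for s in range(n)) / n for j in range(k)] # x-bar .j

ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_subject = k * sum((m - grand) ** 2 for m in subj_means)
ss_b = n * sum((m - grand) ** 2 for m in occ_means)
ss_w = ss_total - ss_subject - ss_b

ms_b = ss_b / (k - 1)
ms_w = ss_w / ((n - 1) * (k - 1))
F = ms_b / ms_w
# Compare F to the critical value with df1 = k - 1, df2 = (n - 1)*(k - 1)
```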
Chapter 7 Categorical Data Page 7.294
- Statistical inference techniques applied to discrete variables are concerned with the proportions
of subjects in each response category
- Population Proportion, p = Number of successes / Population size = X/N
Page 7.295
- 2 types of estimates for the population proportion: The point estimate and the confidence
interval estimate
- Point estimate is the best single number estimate of the population proportion and is given by
the sample proportion = p^
- The confidence interval estimate is a range of plausible values for the population proportion
- p̂ = Number of successes in the sample / n = X/n
- Standard error(p̂) = √(p(1 – p)/n), where p = population proportion
- For unknown p, and large samples, standard error(p̂) = √(p̂(1 – p̂)/n)
- For applications with binomial variables, a large sample is one with at least 5 successes and 5
failures: min(np̂, n(1 – p̂)) ≥ 5
Page 7.296
- When the sample size is large:
- Z test statistic = (p̂ – p0) / √(p0(1 – p0)/n)
- Confidence Interval = p̂ ± Z(1-α/2) * √(p̂(1 – p̂)/n)
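The one-sample Z test and confidence interval can be sketched in Python; the data (60 successes in n = 100, testing p0 = 0.5 at the 95% level) are hypothetical:

```python
import math

def prop_test(x, n, p0, z_crit=1.96):
    """Large-sample Z test and CI for a single proportion.
    The test uses the SE under H0; the CI uses the estimated SE."""
    p_hat = x / n
    se0 = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    z = (p_hat - p0) / se0
    ci = (p_hat - z_crit * se, p_hat + z_crit * se)
    return z, ci

z, (lo, hi) = prop_test(x=60, n=100, p0=0.5)
```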
Page 7.301
- Cross tabulation tables are constructed to display the data
- Cross tabulation tables are also called R X C (R by C) tables, where R denotes the number of rows
in the table and C denotes the number of columns
- The probability of success or outcome, is called the risk of outcome
- Effect measures are used to compare risks of outcomes between populations (or between
treatments)
Page 7.303
- A diagnostic test is a tool used to detect outcome or events that are not directly observable
- The diagnostic test will indicate an event when the event is present
- The diagnostic test will indicate a nonevent when the event is absent
- Sensitivity = P (Positive Test | Disease)
- Specificity = P ( Negative Test | No Disease)
- Predictive value positive = PV+ = P(Disease | Positive Test)
- Predictive value negative = PV- = P(No Disease | Negative Test)
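The four diagnostic-test measures follow directly from a 2x2 table of test result by disease status; the counts below are hypothetical:

```python
# Hypothetical 2x2 screening table:
#                 Disease   No Disease
# Positive test      90          30
# Negative test      10         270
tp, fp, fn, tn = 90, 30, 10, 270

sensitivity = tp / (tp + fn)   # P(Positive Test | Disease)
specificity = tn / (tn + fp)   # P(Negative Test | No Disease)
pv_pos = tp / (tp + fp)        # P(Disease | Positive Test)
pv_neg = tn / (tn + fn)        # P(No Disease | Negative Test)
```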
Page 7.305
- The point estimate for the risk difference, or difference in independent proportions, is given by
p̂1 – p̂2, where p̂i is the sample proportion in population i
- If the samples from both populations are sufficiently large
(min(n1p̂1, n1(1 – p̂1), n2p̂2, n2(1 – p̂2)) ≥ 5), then:
- Confidence interval = (p̂1 – p̂2) ± Z(1-α/2) * √(p̂1(1 – p̂1)/n1 + p̂2(1 – p̂2)/n2),
where p̂1 = X1/n1, p̂2 = X2/n2
- Z = (p̂1 – p̂2) / √(p̂1(1 – p̂1)/n1 + p̂2(1 – p̂2)/n2), where p̂1 = X1/n1, p̂2 = X2/n2
- Or Z = (p̂1 – p̂2) / √(p̂(1 – p̂)(1/n1 + 1/n2))
o where p̂1 = X1/n1, p̂2 = X2/n2
o p̂ = (X1 + X2)/(n1 + n2)
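The pooled-proportion form of the two-sample Z statistic can be sketched in Python; the counts (40/100 vs 25/100) are hypothetical:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled Z statistic for H0: p1 = p2 (risk difference of zero)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                          # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_prop_z(40, 100, 25, 100)
```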
Page 7.306
- Allocation scheme 2-to-1: where twice as many participants are randomized to the
investigational treatment as the control (e.g. 250 in treatment group, 125 in control group)
Page 7.307
- The 2 sample confidence interval concerning (p1 – p2) estimates the difference in proportions, as
opposed to the value of either proportion
- The result is significant if the 95% CI does not include zero (or, more generally, the value stated
in the null hypothesis); in that case the P value will be less than 0.05. Similarly, if the 99% CI
does not include zero, P < 0.01.
Page 7.310
- For the tests of hypothesis in the presence of multinomial data in one sample or two or more
sample applications, we can use 2 tests, called the goodness of fit test and the test of
independence
- Both tests involve a test statistic that follows a chi-square distribution (χ2)
Page 7.311
- Goodness of Fit Test:
- H0 : p1 = p2 = p3 (population proportions are equal)
- H1: H0 is false
- In the multinomial setting, the test statistic is based on the observed frequencies, or numbers
of subjects in each response category
- If the proportions of patients in each response category were equal (p1 = p2 = p3 = 0.33), then
we would have expected 33 patients to select each timeslot
- χ2 tests are based on the agreement between expected (under H0) and observed (sample)
frequencies
- χ2 = Σ (O – E)²/E
o Where the sum is over the k response categories
o O = observed frequencies
o E = expected frequencies
- The statistic follows a χ2 distribution and has df = k -1, where k = number of response categories
- We reject the null hypothesis if the value of the test statistic is large
- The test statistic is large when the observed and expected frequencies are not similar
- Critical value from χ2 table in Table B.5
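The goodness of fit computation can be sketched in Python; the observed counts are hypothetical, with equal proportions expected under H0:

```python
# Observed counts in k = 3 response categories (hypothetical); H0: p1 = p2 = p3
observed = [45, 33, 22]
n = sum(observed)
expected = [n / len(observed)] * len(observed)   # equal expected counts under H0

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
# Compare chi_sq to the critical value from Table B.5 with df = k - 1
```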
Page 7.317
- Test of Independence:
- Applications involving 2 or more samples or 2 categorical variables, where interest lies in
evaluating whether these 2 categorical variables are related (dependent) or unrelated
(independent)
- Site vs Treatment in a 3 x 3 table
- Want to test the hypothesis that the 2 variables are independent
Page 7.318
- H0: Site and Treatment Regimen are independent (No relationship between site and treatment
regimen)
- H1: H0 is false (Site and treatment regimen are related)
- Test statistic is based on the observed frequencies, or the number of subjects in each cell of the
contingency table
- If the null hypothesis is true, we would expect the distribution of patients by treatment regimen
to be similar across sites
Page 7.319
- To compute the test statistic we must compute the expected frequencies (ie the expected
numbers of patients in each cell of the table)
- Expected cell frequency = n * P(cell)
- Expected cell frequency = (Row Total * Column Total)/n
- χ2 statistic for test of independence = Σ (O – E)²/E
- The test statistic follows a chi-squared distribution and has df = (R – 1) * (C – 1), where R = # of
rows and C = # of columns
- Reject H0 if the test statistic is large
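The expected-frequency and χ2 computations can be sketched in Python for a hypothetical 2x3 contingency table:

```python
# Hypothetical 2x3 contingency table: rows = site, columns = treatment regimen
table = [
    [30, 20, 10],
    [20, 30, 40],
]
n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(table[r][c] for r in range(len(table))) for c in range(len(table[0]))]

chi_sq = 0.0
for r, row in enumerate(table):
    for c, obs in enumerate(row):
        exp = row_totals[r] * col_totals[c] / n   # expected = row total * column total / n
        chi_sq += (obs - exp) ** 2 / exp

df = (len(table) - 1) * (len(table[0]) - 1)       # (R - 1) * (C - 1)
```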
Page 7.326
- χ2 tests are valid when the expected frequencies in each cell are greater than or equal to 5
Page 7.328
- Sample size requirement to produce an estimate for p with a certain level of precision
- n = p(1 – p) * (Z(1-α/2)/E)²
o Where p = population proportion and E is the desired margin of error
- p can be estimated from a previous study
- If p is not available, then p * (1 – p) is maximized at p = 0.50
Page 7.329
- The most conservative estimate of n is produced by substituting p = 0.5
- n = 0.5 * (1 – 0.5) * (Z(1-α/2)/E)²
- n = 0.25 * (Z(1-α/2)/E)²
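A minimal sketch of the conservative sample-size formula in Python; the margin of error E = 0.05 and z = 1.96 (95% confidence) are illustrative choices:

```python
import math

def sample_size_proportion(E, z=1.96, p=0.5):
    """Minimum n to estimate p within margin of error E, where z = Z(1-alpha/2).
    p = 0.5 is the most conservative choice when no prior estimate exists."""
    return math.ceil(p * (1 - p) * (z / E) ** 2)

n = sample_size_proportion(E=0.05)   # 0.25 * (1.96/0.05)^2 = 384.16, round up to 385
```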
Page 7.331
- H0 : p1 = p2 or H0 : p1 - p2 = 0
- The effect size (ES) is defined as the difference in the proportions under H0 and H1
- ES = | p1 - p2 |
- The sample size required to ensure a specific level of power in a 2-sided, 2 independent samples
test for proportions is
- ni = ((√(2p̄q̄) * Z(1-α/2) + √(p1q1 + p2q2) * Z(1-β)) / ES)²
o Where ni is the minimum number of subjects required in sample i
o p1 = proportion of successes in population 1, q1 = 1 – p1
o p2 = proportion of successes in population 2, q2 = 1 – p2
o p̄ = (p1 + p2)/2
o q̄ = 1 – p̄
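The two-sample size formula can be sketched in Python; the proportions 0.30 vs 0.15 and 80% power (z_beta = 0.84) are hypothetical choices:

```python
import math

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Minimum n per group for a 2-sided test of H0: p1 = p2.
    z_alpha = Z(1-alpha/2); z_beta = Z(1-beta), about 0.84 for 80% power."""
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    es = abs(p1 - p2)                  # effect size |p1 - p2|
    n = ((math.sqrt(2 * p_bar * q_bar) * z_alpha
          + math.sqrt(p1 * q1 + p2 * q2) * z_beta) / es) ** 2
    return math.ceil(n)

n_per_group = sample_size_two_proportions(0.30, 0.15)
```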
Chapter 10: Correlation and Regression Page 10.466
- In correlation and regression applications, we consider the relationship between 2 continuous
variables
- We measure both variables on each of n randomly selected subjects
- The variable name X is used to represent the independent or predictor variable
- The variable name Y is used to represent the dependent, outcome or response variable
- The goal of correlation analysis is to understand the nature and strength of the association
between the 2 measurement variables,
- A first step in understanding the relationship between 2 variables is through a scatter diagram.
- A scatter diagram is a plot of the X,Y pairs recorded in each of the n subjects
- The X variable is plotted on the horizontal axis
- The Y variable is plotted on the vertical axis
- Positive / direct linear relationship = positive slope straight line
- Negative / inverse linear relationship = negative slope straight line
- The population correlation coefficient, ρ ( rho), quantifies the nature and strength of the linear
relationship between X and Y
- -1 <= ρ <= 1
- The sign of the correlation coefficient indicates the nature of the relationship between X and Y
- The magnitude of the correlation coefficient indicates the strength of the linear association
Page 10.467
- ρ = -1 : A perfect inverse relationship between X and Y
- ρ = -0.5: a moderate inverse relationship between X and Y
- ρ = 1: A perfect direct relationship between X and Y
- ρ = 0.5: a moderate direct relationship between X and Y
- ρ = 0: No linear association between X and Y
Page 10.468
- When ρ = 0, there can still be an association between X and Y, just not a linear one
Page 10.469
- The correlation coefficient can be affected by truncation. E.g. measuring SAT scores in high
school versus GPA in college: individuals with poor SAT scores are not in college, so they are
excluded from the sample.
- A correlation between 2 variables may turn out to be zero due to a confounding variable : a
variable that affects either X or Y or both
- The sample correlation coefficient, r = Cov(X,Y) / √(Var(X) * Var(Y))
o Where Cov(X,Y) is the covariance of X and Y
o Var(X) is the sample variance of X = Σ(X – x̄)²/(n – 1)
o Var(Y) is the sample variance of Y = Σ(Y – Ȳ)²/(n – 1)
o Cov(X,Y) = Σ(X – x̄)(Y – Ȳ)/(n – 1)
Page 10.472
- A correlation of 0.3 or larger in absolute value is usually indicative of a meaningful or important
relationship
- Statistical inference concerning ρ
o H0: ρ = 0 (no linear association)
o H1: ρ ≠ 0 (linear association)
Page 10.473
o Test statistic: t = r√(n – 2)/√(1 – r²), df = n – 2
o The t statistic can be used regardless of the sample size
o Decision rule: Reject H0 if t test statistic ≥ t critical or t test statistic ≤ -t critical
o Conclusion: we have significant evidence to reject H0 and conclude there is a
significant linear association between X and Y
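The sample correlation and its t test can be sketched in Python; the (X, Y) pairs below are hypothetical:

```python
import math

# Hypothetical paired measurements on n subjects
X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 7]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (n - 1)
var_x = sum((x - xbar) ** 2 for x in X) / (n - 1)
var_y = sum((y - ybar) ** 2 for y in Y) / (n - 1)
r = cov / math.sqrt(var_x * var_y)

# t statistic for H0: rho = 0, with df = n - 2
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
```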
Page 10.477
- Simple Linear Regression: The independent and dependent variables must be specified
- In regression analysis we address the following issues:
o 1. What mathematical equation best describes the relationship between X and Y (e.g. a
line or a curve of some form)?
o 2. How do we estimate the equation that describes the relationship between X and Y?
o 3. Is the form specified in (1) appropriate?
- Simple linear regression equation: The equation of a line relating X and Y:
o Y = β0 + β1X + ε
o Where Y is the dependent variable
o X is the independent variable
o β0 is the Y intercept (the value of Y when X = 0)
o β1 is the slope (the expected change in Y relative to one unit change in X)
o ε is the random error
- Least squares estimates
o β̂1 = estimate of β1 = r * √(Var(Y)/Var(X))
o β̂0 = Ȳ – β̂1 * x̄
o The estimate of the simple linear regression equation:
Ŷ = β̂0 + β̂1X
Where Ŷ is the expected value of Y for a given value of X
Page 10.478
- If Ŷ = β̂0 + β̂1X = 40.98 + 3.61X
- β̂1 = 3.61: a 1 unit increase in variable X is associated with a 3.61 unit increase in Y
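The least squares estimates above can be sketched in Python with hypothetical data; note that r * √(Var(Y)/Var(X)) is algebraically the same as Cov(X,Y)/Var(X):

```python
import math

# Hypothetical (X, Y) pairs
X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 7]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

var_x = sum((x - xbar) ** 2 for x in X) / (n - 1)
var_y = sum((y - ybar) ** 2 for y in Y) / (n - 1)
cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (n - 1)
r = cov / math.sqrt(var_x * var_y)

b1 = r * math.sqrt(var_y / var_x)   # slope estimate; equals Cov(X,Y)/Var(X)
b0 = ybar - b1 * xbar               # intercept estimate

def predict(x):
    """Y-hat, the expected value of Y for a given X."""
    return b0 + b1 * x
```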
Page 10.481
- A Covariance Matrix contains covariances between the variables in the row and columns of the
matrix
- The covariance between a variable and itself is identical to the variance of that variable
- Cov(A,A) = Var(A)
- Cov(A,B) = Cov(B,A)
- The Correlation Matrix contains the sample correlations (r) between variables in the rows and
columns of the matrix
- Under each sample correlation is a 2 sided p value for testing H0: ‘ρ = 0
- The correlation between a variable and itself is always 1
- Corr(A,B) = Corr(B,A)
- Corr(A,B) is significant at p value < 0.05
- The ANOVA table to test whether the regression is significant or not, which provides estimates
of the parameters of the regression equation (ie estimates of the intercept and the slope)
- In regression analysis, the Total sum of squares = the variation in the dependent variable Y
- The total variation is partitioned into 2 components: The Model and Error
- The Model sum of squares = the regression sums of squares : The variation in Y accounted for by
the regression
- The Error sum of squares = the residual sum of squares: The variation in Y not accounted for by
the regression.
- The p value is used to test whether the regression is significant
- When there is a single independent variable (simple linear regression), this p value is equal to
the p value in the correlation analysis, and equal to the p values used to test if the slope is
significant
- Of the Y intercept and the slope, it is of interest to test whether the slope is significantly
different from zero
- If p value < 0.05, the slope (reflecting the change in Y associated with 1 unit change in X) is
significant
Page 10.482
- If a regression is run and it has been established that the regression is significant, we want to
quantify how much variation in the dependent variable is “explained” by the independent
variable
- The coefficient of determination, R2, is the ratio of the regression (or Model) sums of squares to
the total sums of squares
- R² = SSmodel/SSTotal = SSregression/SSTotal, where 0 <= R² <= 1
- Higher values of R² imply that more variation in the dependent variable is “explained” by the
independent variable
- If R2 = 0.7395, then 73.9% of the variation in the dependent variable is explained by the
independent variable
- Values of R2 > 0.1 can be clinically significant
Page 10.485
- Multiple Regression Analysis
- We consider applications involving one single continuous dependent variable Y, and multiple
independent variables denoted X1, X2, etc.
- Y = β0 + β1 X1 + β2 X2 + … + βp Xp + ε
o Where Y is the dependent variable
o X1 to Xp are the independent variables
o β0 is the intercept
o βi are the slope coefficients, also called the regression parameters (ie the expected
change in Y relative to a 1 unit change in Xi)
o ε is the random error
Page 10.486
- In multiple regression analysis, the total sums of squares = the variation in the dependent
variable Y
- The ANOVA table is used to perform a global test – to test if the collection of variables is
significant
- The p value is used to test whether the set of independent variables is significant
Page 10.487
- The estimate of the regression parameter for an indicator variable, e.g. 4.40: On average, males
(MALE = 1) have SBP 4.40 units higher than females (MALE = 0), holding other variables constant
- R2 = 0.2257, the 3 variables account for 22.6% of the variation in the dependent variable
Page 10.487
- Logistic Regression Analysis
- We consider applications involving a single dependent variable Y, which is dichotomous
(success vs failure)
- ln(p/(1 – p)) = β0 + β1X1 + β2X2 + … + βpXp
Chapter 12: Non-parametric Tests Page 12.546
- Tests of hypothesis assume that the characteristic under investigation is approximately normally
distributed or rely on large samples for the application of the Central Limit Theorem
- Techniques in Chapter 5, 6, 9 are parametric procedures
- Parametric Procedures: Based on assumptions regarding the distributional form of the analytic
variable (e.g. that the analytic variable follows a normal distribution)
- Non-parametric procedures: Methods that do not require such assumptions
Page 12.547
- The Sign Test: 2 Dependent Samples Test:
o H0: The median difference is zero
o Find the differences between baseline and post treatment (Baseline – Post Treatment)
o Record the sign of the difference (+, -, or 0)
Page 12.548
o The sign test is based on the binomial distribution
o Negative Difference: Post treatment higher than baseline
o Positive Difference: Post treatment lower than baseline
o We expect to see 50% + signs and 50% - signs
o H0 : Probability of a + sign = 0.5
o H1 : The drug lowers blood pressure (one sided test)
o From the experiment, we get 7 out of 10 + signs
o Therefore, what is the probability of getting 7 or more successes out of 10?
o From the binomial table, find P(X ≥ 7) at n = 10, p = 0.5
o P value = P(X ≥ 7) = P(X=7) + P(X=8) + P(X=9) + P(X=10) = 0.1719 (one sided test)
Page 12.549
o Reject H0 if p value < 0.05
o In this case, p value = 0.1719 > 0.05, therefore do not reject H0
o Therefore, we do not have significant evidence to show that the drug lowers the blood
pressure
o For differences of 0, we can either ignore them, or
o count each 0 as half a + sign and half a - sign
o The p value for P(X ≥ 7.5) = P(X=8) + P(X=9) + P(X=10) = 0.0547 = marginally significant
o The p value corresponding to a 2 sided test is P(X ≥ 7 or X ≤ 3) = 2 * P(X ≥ 7) = 2 * 0.1719
= 0.3438
o In a parametric test, we also compute differences but focus on the magnitude of
those differences.
o In the Sign test we do not capture the magnitude of the differences, only the direction.
o The sign test makes no distinction between a subject with a substantial reduction and
one with a trivial reduction
o The sign test has lower power than competing procedures
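The binomial p value from the example (7 "+" signs out of 10, H0: P(+) = 0.5) can be computed exactly with a short Python sketch:

```python
from math import comb

# One-sided sign test: 7 "+" signs out of n = 10 non-zero differences, H0: P(+) = 0.5
n, successes = 10, 7
p_value = sum(comb(n, x) for x in range(successes, n + 1)) / 2 ** n
# p_value = 0.171875, which is 0.1719 rounded, matching the table lookup above
two_sided = 2 * p_value
```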
- Wilcoxon Signed Rank Test: 2 Dependent Samples Test
o H0: Median difference is zero
Page 12.550
o Assign ranks to the absolute values of the difference scores from smallest to largest
o If there are ties in the absolute values of the differences, then the mean rank is assigned
to both
o Rank until all non zero differences are ranked
o Reattach the signs of the differences to the ranks
Page 12.551
o If H0 is true, then the sum of positive ranks = sum of negative ranks
o Test in Wilcoxon Signed-Rank is the sum of the positive ranks, called T.
o If all the differences are negative, then all the signs will be negative, so all the signed
ranks will be negative and T = 0
o If all differences are positive, then T = sum of positive ranks = n(n + 1)/2
o The sum of positive ranks, T, ranges from 0 to n(n + 1)/2
o The median value = n(n + 1)/4
o The observed sum of positive ranks = T = 40.5.
o The theoretical maximum = 45, median = 22.5
o Convert T to a Z score, and then produce a p value from the standard normal distribution
o Z = (T – median) / standard error
o Z = (T – n(n + 1)/4) / √(n(n + 1)(2n + 1)/24)
Where n = number of non-zero differences
o Z = 2.13
o The one sided p value = P(Z ≥ 2.13) = 1 – 0.9834 = 0.0166 < 0.05, therefore reject H0.
o The median measure is lower post treatment as compared to baseline
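The normal approximation above can be sketched in Python; with T = 40.5 and n = 9 it reproduces the Z of about 2.13 from the worked example:

```python
import math

def signed_rank_z(T, n):
    """Normal approximation for the Wilcoxon signed-rank statistic T,
    where n is the number of non-zero differences."""
    median = n * (n + 1) / 4
    se = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (T - median) / se

# Values from the example: T = 40.5, n = 9
z = signed_rank_z(40.5, 9)
```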
Page 12.554
o The Signed-Rank test incorporates the relative magnitude of the values through ranks
o Ranks are useful in the presence of outliers
o Tests based on ranks do not capture the absolute magnitude of the differences as
compared to parametric tests
o From SAS, the Shapiro-Wilk test, tests H0 = normal distribution. If p value < 0.05, reject
H0 = Not normal distribution = use non parametric tests
- Wilcoxon Rank Sum Test for 2 Independent Samples Test
o H0: medians of the 2 samples are equal
Page 12.555
o Small samples and the Analytic variable is ordinal, therefore use a non-parametric test
o Pooled the data from 2 groups, rank from low to high
o Assign ranks from lowest to highest
o When there are ties, the mean ranks are assigned
o If there is difference between treatments, we expect clustering of lower ranks in one
group and higher ranks in the other
Page 12.556
o The Wilcoxon Rank Sum test statistic is S, the smaller of the sums of the ranks in the
groups.
o Sum of all ranks is N(N + 1)/2
o Convert the smaller sum of ranks, S, to a Z score , and then produces a p value from the
standard normal distribution
o Z = (S – n1(n1 + n2 + 1)/2) / √(n1n2(n1 + n2 + 1)/12)
o S = 16, n1 = n2 = 4, Z = -0.58
o P(Z ≤ -0.58) = 0.2810 = one sided p value
o 2 sided p value = 2 * 0.2810 = 0.5620 > 0.05
o Therefore we cannot reject H0.
o We do not have significant evidence to show a difference between the 2 treatments
o Rank Sum test does not use the actual observed values
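The rank sum Z approximation can be sketched in Python; with S = 16 and n1 = n2 = 4 it reproduces the Z of about -0.58 from the worked example:

```python
import math

def rank_sum_z(S, n1, n2):
    """Normal approximation for the Wilcoxon rank sum statistic S
    (the smaller of the two rank sums, from the group of size n1)."""
    mean = n1 * (n1 + n2 + 1) / 2
    se = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (S - mean) / se

# Values from the example: S = 16, n1 = n2 = 4
z = rank_sum_z(16, 4, 4)
```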
Page 12.558
- The Kruskal – Wallis Test: k Independent Samples Test
o Based on assigning ranks to the observed values and then comparing the observed sums
of ranks to what would be expected if there were no difference among groups
Page 12.560
- Spearman Correlation: Correlation Between Variables
o Correlation based on ranks
o Let X = # of cigarettes per day, Y = Number of hours exercise per week
o Replace the raw X and Y scores with ranks
o Ranks are assigned for each variable, considered separately
o When scores are tied, the mean rank is applied to each of the tied values
Page 12.561
o Spearman correlation = rs = Cov(Rx, Ry) / √(Var(Rx) * Var(Ry))
Where Rx and Ry are the ranks of X and Y
Variance of Rx = Var(Rx) = Σ(Rx – R̄x)²/(n – 1)
Page 12.562
Variance of Ry = Var(Ry) = Σ(Ry – R̄y)²/(n – 1)
Page 12.563
Cov(Rx, Ry) = Σ(Rx – R̄x)(Ry – R̄y)/(n – 1)
rs = -0.453
There is a moderately strong inverse association between the 2 variables
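Spearman correlation is just the Pearson correlation computed on ranks, with ties given mean ranks; the data below (cigarettes per day vs exercise hours) are hypothetical, not the textbook's:

```python
import math

def mean_ranks(values):
    """Ranks from 1..n; tied values receive the mean of the ranks they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(X, Y):
    """Spearman rs = Pearson correlation of the rank vectors Rx and Ry."""
    rx, ry = mean_ranks(X), mean_ranks(Y)
    n = len(X)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / (n - 1)
    var_x = sum((a - mx) ** 2 for a in rx) / (n - 1)
    var_y = sum((b - my) ** 2 for b in ry) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: X = cigarettes per day, Y = hours of exercise per week
rs = spearman([0, 5, 10, 20, 30], [7, 5, 5, 2, 1])
```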
Page 12.566
- Guidelines for determining when to use a non-parametric procedure
o When substantive knowledge of the analytic variable suggests non-normality
Some variables do not follow a normal distribution (e.g. ordinal variables with
few response options)
o Observed non-normality in the sample data.
Visual inspection of the distribution of a variable might suggest that the variable
is highly skewed
Comparison of measures of central tendency, such as the mean and median,
which are equal in symmetric distributions, can help determine the extent of
skewness
A standard deviation larger than the mean also suggests non-normality
A formal test for normality might indicate that a variable does not follow a
normal distribution