Biostatistics in Practice Youngju Pak Biostatistician Peter D. Christenson Session 1: Quantitative...

Biostatistics in Practice

Youngju Pak, Biostatistician

Peter D. Christenson

http://research.LABioMed.org/Biostat

Session 1: Quantitative and Inferential Issues I

Why Statistics?

• "For Today’s Graduate, Just One Word: Statistics," NY Times, Aug 5, 2009

• "I keep saying that the sexy job in the next 10 years will be statisticians," said Hal Varian, chief economist at Google.

• "I am not much given to regret, so I puzzled over this one a while. Should have taken much more statistics in college, I think. :)" —Max Levchin, Paypal Co-founder, Slide founder

Who am I?

• Dr. Youngju Pak

• Originally from South Korea

• PhD-Biostatistics, MS-Stat., BA-Stat.

• Assistant Professor of Biostatistics at MU until 2012

• Joined LA BioMed in March 2013

• Practicing biostatistics since 2000

Who are you?

• Name

• Career Aspirations

Class Webpage & Session Schedule

• Class webpage: select "Courses" at http://research.LABioMed.org/biostat (use Internet Explorer; Chrome does not work well with this website)

• All class materials are posted, and will be updated, on the class webpage

• There will be some pop quizzes

• There will be some HW assignments

• The TOP THREE will be announced and rewarded at the last session

Session 1 Objectives

• General quantitative needs in biological research

• Overview of statistical issues using a published paper

• How to run the statistical software MYSTAT

General Quantitative Needs

Descriptive: Appropriate summarization to meet scientific questions: e.g.,

• changes, or % changes, or reaching threshold?

• mean, or minimum, or range of response?

• average time to death, or chances of dying by a fixed time?

General Quantitative Needs, Cont’d

• Inferential: Could results be spurious, a fluke, due to “natural” variations or chance?

• Inferential statistics: 95% confidence intervals, p-values, etc.

• Sensitivity/Power: How many subjects are needed?
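As a rough illustration of the "how many subjects" question, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2(z_(1-α/2) + z_(power))² σ² / δ². The function name and the example inputs (a difference of half a standard deviation) are illustrative assumptions, not values from the course or the case study.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation. delta is the difference to detect and sigma is the
    common standard deviation; both are hypothetical inputs in this sketch.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Detecting a difference of half a standard deviation
n_half_sd = subjects_per_group(delta=0.5, sigma=1.0)
print(n_half_sd)
```

Halving the difference to detect roughly quadruples the required number of subjects, which is why the target effect size matters so much in planning.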

Professional Statistics Software Package

Output

Enter code; syntax.

Stored data; accessible.

Microsoft Excel for Statistics

• Primarily for descriptive statistics.

• Limited output.

• No analyses for %s.

Free Statistics Software: Mystat (www.systat.com)

Free Study Size Software (www.stat.uiowa.edu/~rlenth/Power)

Session 1 Objectives

• General quantitative needs in biological research

• Overview of statistical issues using a published paper

• How to run the statistical software MYSTAT

Statistical Issues

•Subject selection

•Randomization

•Efficiency from study design

•Summarizing study results

Paper with Common Statistical Issues

Case Study:

McCann, et al., Lancet 2007 Nov 3;370(9598):1560-7

• Food additives and hyperactive behaviour in 3-year-old and 8/9-year-old children in the community: a randomised, double-blinded, placebo-controlled trial.

• Objective: test whether intake of artificial food colors and additives (AFCA) affects childhood behavior

• Target population: children aged 3-4 and 8-9 years

• Study design: randomized, double-blinded, controlled, crossover trial

• Sample size: 153 (3 years), 144 (8-9 years) in Southampton, UK

• Sampling: stratified sampling based on SES

• Baseline measure: 24h recall by the parent of the child’s pretrial diet

• Group: three groups (mix A, mix B, placebo)

• Outcomes: ADHD rating scale IV by teachers, WWP hyperactivity score by parents, classroom observation code, Conners continuous performance test II (CPTII) GHA score

Statistical Issues


•Randomization



Representative or Random Samples

How were the children to be studied selected (second column on the first page)? The authors purposely selected "representative" social classes.

Is this better than a "randomly" chosen sample that ignores social class?

Often hear: Non-random = Non-scientific.

Case Study: Participant Selection

No mention of random samples.

Case Study: Participant Selection

It may be that only a few schools are needed to get sufficient individuals. If, among all possible schools, there are few that are lower SES, none of these schools may be chosen.

So, a random sample of schools is chosen from the lower SES schools, and another random sample from the higher SES schools.

Selection by Over-Sampling

It is not necessary that the % lower SES in the study is the same as in the population. There may still be too few subjects in a rare subgroup to get reliable data.

We can “over-sample” a rare subgroup, and then weight the overall results by the proportions of the subgroups in the population. The CDC NHANES (http://www.cdc.gov/nchs/nhanes.htm) studies do this.
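The weighting step can be sketched as follows. All numbers here are hypothetical and chosen only to show the arithmetic: the lower SES subgroup is 10% of the population but is over-sampled to 50% of the study. The function name is also an assumption for illustration.

```python
def weighted_overall_mean(subgroup_means, population_shares):
    """Combine subgroup means using population proportions as weights."""
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(subgroup_means[g] * population_shares[g] for g in subgroup_means)


# Hypothetical numbers, for illustration only
sample_means = {"lower_SES": 12.0, "higher_SES": 8.0}        # mean outcome per subgroup
sample_sizes = {"lower_SES": 100, "higher_SES": 100}          # lower SES over-sampled to 50%
population_shares = {"lower_SES": 0.10, "higher_SES": 0.90}  # mix in the population

# The unweighted mean reflects the study mix, not the population mix
n_total = sum(sample_sizes.values())
unweighted = sum(sample_means[g] * sample_sizes[g] for g in sample_means) / n_total
weighted = weighted_overall_mean(sample_means, population_shares)
print(unweighted, weighted)
```

The unweighted mean is pulled toward the over-sampled subgroup; the weighted mean restores each subgroup to its population share.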

Statistical Issues


•Randomization



Basic Study Designs

1. Prospective (longitudinal): Risk Factor (2014) → Disease status (2020)

2. Retrospective (case-control): Disease status (2014) → Risk Factor (2000)

3. Cross-sectional: Disease status (2014), Risk Factor (2014)

4. Experimental or randomized controlled: Risk Factor (2014) → Disease status (2020), with assignment of Risk Factor

Random Samples vs. Randomization

We have been discussing the selection of subjects to study, often a random sample.

An observational study would, well, just observe them. An interventional study assigns each subject to one or more treatments in order to compare treatments. Randomization refers to making these assignments in a random way.
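A minimal sketch of random assignment, assuming a simple parallel-group study in which each subject receives one treatment. The function name, subject IDs, and seed are hypothetical; this is not the allocation procedure used in the case study, which was a crossover trial.

```python
import random
from collections import Counter


def randomize(subject_ids, treatments, seed=None):
    """Randomly assign subjects to treatments in (near-)equal numbers."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)  # random order removes any influence of enrollment order
    # Deal the shuffled subjects out to the treatments in turn
    return {sid: treatments[i % len(treatments)] for i, sid in enumerate(ids)}


assignment = randomize(range(1, 13), ["A", "B", "placebo"], seed=2014)
group_sizes = Counter(assignment.values())
for sid in sorted(assignment):
    print(sid, assignment[sid])
```

Because the shuffle, not the investigator, decides who gets which treatment, the assignment cannot depend on anything known about the subject at enrollment.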

Why Randomize?

So that groups will be similar except for the intervention.

So that, when enrolling, we will not unconsciously choose an “appropriate” treatment for a particular subject.

Minimizes the chances of introducing bias when attempting to systematically remove it, as in plant yield example.

Case Study: Crossover Design

Each child is studied on 3 occasions under different diets.

Is this better than three separate groups of children?

Why, intuitively?

How could you scientifically prove your intuition?

Statistical Issues

• Subject selection

• Randomization

• Efficiency from study design

• Summarizing study results

Blocked vs. Unblocked Studies

AKA matched vs. unmatched. AKA paired vs. unpaired.

Block = Pair = Set receiving all treatments. Set could be an individual at multiple times (pre and post), or left and right arms for sunscreen comparison; twins or family; centers in multi-center study, etc. Block ↔ Homogeneous.

Blocking is efficient because treatment differences are usually more consistent among subjects than each separate treatment is.
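The efficiency claim can be checked with a small simulation. The sketch below assumes made-up values: subjects differ a lot from one another (SD 10), while the within-subject noise is small (SD 1) and the true treatment difference is 1. It compares the standard error of the B-A difference when the pairing is used and when it is ignored.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(1)
n = 200
true_effect = 1.0  # hypothetical treatment difference B - A

# Each subject has an own baseline level (large between-subject spread)
baseline = [rng.gauss(50, 10) for _ in range(n)]
resp_a = [b + rng.gauss(0, 1) for b in baseline]                # response under A
resp_b = [b + true_effect + rng.gauss(0, 1) for b in baseline]  # response under B, same subjects

# Paired analysis: work with the within-subject differences
diffs = [y - x for x, y in zip(resp_a, resp_b)]
se_paired = stdev(diffs) / sqrt(n)

# Unpaired analysis: treat A and B as if they came from separate groups
se_unpaired = sqrt(stdev(resp_a) ** 2 / n + stdev(resp_b) ** 2 / n)

print(round(mean(diffs), 2), round(se_paired, 3), round(se_unpaired, 3))
```

The differences cancel each subject's baseline, so their spread reflects only the within-subject noise; the unpaired standard error still carries the full between-subject variation.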

Potential Efficiency Due to Pairing

[Figure: dot plots of responses under treatments A and B. Unpaired: A and B are separate groups. Paired: A and B are in a paired set, with the difference Δ=B-A shown for each pair.]

Statistical Issues

• Subject selection

• Randomization

• Efficiency from study design

• Summarizing study results

Outcome Measures

Generally, how were the outcome measures defined (third page)?

They are more complicated here than for most studies.

What are the units (e.g., kg, mmol, $, years)?

Outcome measures are specific and pre-defined. Aims and goals may be more general.

Summarization of Data with Descriptive Statistics

What is the difference between Table 1 and Table 2 in terms of methods used to summarize the data?

Variable

• Categorical (Qualitative)

– Nominal: categories are mutually exclusive and unordered. Examples: gender, blood group, eye colour, marital status

– Ordinal: categories are mutually exclusive and ordered. Examples: disease stage, education level, 5-point Likert scale

• Numerical (Quantitative)

– Counts: integer values. Examples: days sick per year, number of pregnancies, number of hospital visits

– Measured (continuous): takes any value in a range of values. Examples: weight in kg, height in feet, age (in years)

Types of Data

It is critical to identify the type of data, since both the choice of an appropriate statistical test and the way to summarize the data depend on the type of data.

Describing categorical & quantitative data

• Categorical Data

– Binary, nominal, or ordinal data

• Disease status (yes, no)

• Education level

• The assignment of the treatment

• Cancer stage

• Marital status

– Frequency tables (one-, two-, or multi-way tables) are usually used

• Quantitative Data

– Counts or continuous data

• Weight

• Blood pressure

• Age

• Length of hospital stay in days

• The total number of ER visits per year

– Means or medians are used to measure central tendency.

– Standard deviations or percentiles are used to measure variability.

– When data are skewed, medians and percentiles are better summary statistics.
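To see why medians and percentiles suit skewed data, the sketch below summarizes a small, made-up set of hospital stays in which one stay is very long. The data are hypothetical; only Python's standard `statistics` module is used.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical lengths of hospital stay in days (right-skewed: one very long stay)
stay_days = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

# Quartiles: 25th, 50th, and 75th percentiles
q1, q2, q3 = quantiles(stay_days, n=4)

print("mean  :", mean(stay_days))
print("median:", median(stay_days))
print("SD    :", round(stdev(stay_days), 1))
print("quartiles:", q1, q2, q3)
```

The single 40-day stay pulls the mean well above the typical stay and inflates the standard deviation, while the median and quartiles still describe where most of the values lie.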

How to display Data

• A picture is worth a thousand words !

• To get a ‘feel’ for the data.

• Categorical data

– Frequency tables, contingency tables (cross tables), bar charts, pie charts

• Quantitative data

– Dot plots, histograms, box-whisker plots, scatter plots

Frequency Tables

How Arrived

         Frequency   Percent   Valid Percent   Cumulative Percent
BUS             11      13.9            13.9                 13.9
CAR             66      83.5            83.5                 97.5
WALK             2       2.5             2.5                100.0
Total           79     100.0           100.0

Gender

         Frequency   Percent   Valid Percent   Cumulative Percent
F               46      58.2            58.2                 58.2
M               33      41.8            41.8                100.0
Total           79     100.0           100.0
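The percent and cumulative percent columns can be reproduced from the frequencies alone. The sketch below starts from the "How Arrived" counts shown in the table; the function name is an assumption for illustration and this is not how MYSTAT or SPSS computes its output internally.

```python
def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows, running = [], 0
    for category, freq in counts.items():
        running += freq  # running total gives the cumulative percent
        rows.append((category, freq,
                     round(100 * freq / total, 1),
                     round(100 * running / total, 1)))
    return rows


# Counts taken from the "How Arrived" frequency table
how_arrived = {"BUS": 11, "CAR": 66, "WALK": 2}
table = frequency_table(how_arrived)
for row in table:
    print(*row)
```

Note that the cumulative percent is computed from the running count rather than by adding rounded percents, which is why CAR shows 97.5 and not 13.9 + 83.5 = 97.4.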

Contingency Tables (Crosstabulations)

How Arrived * Gender Crosstabulation (Count)

                  Gender
How Arrived       F      M   Total
BUS               6      5      11
CAR              38     28      66
WALK              2      0       2
Total            46     33      79
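A crosstabulation is just a count of each combination of two categorical variables, plus the row and column totals. The sketch below rebuilds one record per person from the cell counts in the How Arrived by Gender table and then tabulates them again; the variable names are illustrative.

```python
from collections import Counter

# Cell counts from the How Arrived * Gender crosstabulation
cell_counts = {("BUS", "F"): 6, ("BUS", "M"): 5,
               ("CAR", "F"): 38, ("CAR", "M"): 28,
               ("WALK", "F"): 2, ("WALK", "M"): 0}

# One (how_arrived, gender) record per person, as the raw data would look
records = [cell for cell, n in cell_counts.items() for _ in range(n)]

# Cross-tabulate: count each combination, then each margin
crosstab = Counter(records)
row_totals = Counter(how for how, _ in records)
col_totals = Counter(gender for _, gender in records)

for how in ("BUS", "CAR", "WALK"):
    print(how, crosstab[(how, "F")], crosstab[(how, "M")], row_totals[how])
print("Total", col_totals["F"], col_totals["M"], len(records))
```

The row totals match the "How Arrived" frequency table and the column totals match the "Gender" frequency table, so the one-way tables are the margins of the two-way table.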

Bar Charts

Pie Charts

Histograms

• To catch the patterns of the data

• Divide up the data points into several mutually exclusive intervals

– Categorize the data points.
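The binning step behind a histogram can be sketched in a few lines. The ages below are hypothetical, and the function name, bin width, and text-based bars are choices made only for this illustration.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in each interval [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)
        if 0 <= i < n_bins:  # values outside the covered range are ignored
            counts[i] += 1
    return counts


# Hypothetical ages in years
ages = [3, 4, 8, 9, 9, 12, 15, 17, 21, 22, 25, 38]
counts = histogram_counts(ages, low=0, width=10, n_bins=4)
for i, c in enumerate(counts):
    print(f"{i * 10:2d}-{i * 10 + 9:2d} | " + "*" * c)
```

Each value falls in exactly one interval, so the bar heights add up to the number of data points.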

Scatter plots

• Usually used to illustrate a relationship between two variables.

Box-Whisker Plots

What have we learned today?

Assignments

• HW #1 is posted on the course website

• Pre-step for HW #1

– Install MYSTAT on your laptop, or on a computer in your school computer lab with permission from your school (ask Ms. Aberle for help)

– Download Survey.sav (SPSS data file) from the course website (under Session 1)

• Submit a hard copy of the completed HW at the next session.

• Read the article, focusing on the contents of Tables 3 & 4 and Figure 4.
