21st-Century Statistics: We've never done it that way before!


Page 1

21st-Century Statistics: We've never done it that way before!

Robin High

Statistical Programmer and Consultant

Information Services

Page 2

Relevant Topics

Research with Small Samples (Exact Tests)

Survey Research

A. Design

B. Analysis (Frequencies, Means, Regression)

Generalized Linear Models

Linear Models

Page 3

Where to Find More “Information”

www.uoregon.edu/~robinh/statistics.html

GENERALIZED LINEAR MODELS

http://www.uoregon.edu/~robinh/genmod_sas.html

LINEAR MODELS: IS THE LINEAR MODEL PROCEDURE GLM OBSOLETE?

http://www.uoregon.edu/~robinh/mixed_sas.html

Page 4

1908: Student’s t-test

• William Sealy Gosset (1876-1937), who worked at the Guinness Brewery in Dublin, published under the pseudonym “Student”.

• Derivation of the t-distribution was published as The Probable Error of a Mean.

Page 5

1912-1922

Sir Ronald A. Fisher (1890-1962)

Notable Contributions to Statistics (among many)

Maximum Likelihood Estimation

Analysis of Variance

Page 6

Maximum Likelihood Estimation to Extract “Information”

• 1921 – Fisher utilized information supplied by the data to estimate unknown parameters

• The Variance/Covariance matrix shows how this information has been utilized

• To this day the name "Fisher Information Matrix" has been attached to it

Page 7

Maximum Likelihood Estimation: Find the estimate of the parameter that maximizes the likelihood function
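In symbols, the idea can be sketched as follows for independent observations y_1, ..., y_n with density f(y; θ). The notation is an assumption made for illustration and is not taken from the slides.

```latex
% Likelihood and log-likelihood for independent observations
L(\theta) = \prod_{i=1}^{n} f(y_i;\theta), \qquad
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(y_i;\theta)

% The maximum likelihood estimate is the value of theta at the peak
\hat{\theta} = \arg\max_{\theta}\, \ell(\theta)

% Fisher information, and the approximate variance of the estimate in large samples
I(\theta) = -\,\mathrm{E}\!\left[\frac{\partial^{2} \ell(\theta)}{\partial\theta\,\partial\theta^{\top}}\right], \qquad
\widehat{\mathrm{Var}}(\hat{\theta}) \approx I(\hat{\theta})^{-1}

% Worked case: normal data with known variance sigma^2
\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}
\;\;\Longrightarrow\;\; \hat{\mu} = \bar{y}, \qquad \mathrm{Var}(\hat{\mu}) = \frac{\sigma^{2}}{n}
```

In the normal example the information is n/σ², so its inverse gives the familiar variance of the sample mean.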

Page 8

Generalized Linear Models

• Data analysis methods introduced by Nelder and Wedderburn (1972)

• Developed a unified theory for a variety of statistical models

• Maximum likelihood estimation works directly with a model chosen for the specific characteristics of the data

• OUTCOMES follow a distribution that is a member of the exponential family:

Normal, Binomial, Poisson, Negative Binomial, Inverse Gaussian, Gamma
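As a rough illustration of choosing the distribution to suit the data, a count outcome might be fit with PROC GENMOD as sketched below. The data set name `counts` and the variables `y` and `group` are hypothetical and do not come from the slides.

```sas
/* Minimal sketch: Poisson regression for a count response.        */
/* counts, y, and group are hypothetical names for illustration.   */
PROC GENMOD DATA=counts;
  CLASS group;
  MODEL y = group / DIST=poisson LINK=log;  /* distribution and link chosen for counts */
RUN;
```

Other members of the family are requested the same way, for example DIST=negbin, DIST=gamma, or DIST=binomial, each with a link function suited to the response.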

Page 9

One "problematic" topic of statistical analysis:

Analysis of data from non-normal distributions as if they were normal

The Choice Today

Select an appropriate model based on the distributional assumptions of the data

http://www.uoregon.edu/~robinh/seven_prblms.txt

Page 10

Linear Statistical Models

• T-Test

• Multiple Linear Regression

• Analysis of Variance

• Analysis of Covariance

• All these techniques utilize “least squares” to estimate parameters

• Computing different types of “Sums of Squares” is fundamental

Page 11

ANALYSIS OF VARIANCE

Assumptions

1. Independent observations (one observation for each subject, with subjects randomly assigned to one of two or more classification groups)

2. Residuals of the model are normally distributed

3. Residuals have equal variances within each classification group (i.e., homoskedasticity)
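One way to look at assumption 2 is sketched below: save the residuals from the one-way model and examine them. It assumes a data set named `indata` with variables `group` and `y`, as in the SAS examples later in the deck. The output data set `resid_out` and the variable `resid` are names made up for illustration.

```sas
/* Sketch: save residuals from the one-way ANOVA model */
PROC GLM DATA=indata;
  CLASS group;
  MODEL y = group;
  OUTPUT OUT=resid_out RESIDUAL=resid;   /* resid_out and resid are hypothetical names */
RUN;
QUIT;

/* Normality tests and a normal Q-Q plot of the residuals */
PROC UNIVARIATE DATA=resid_out NORMAL;
  VAR resid;
  QQPLOT resid / NORMAL(MU=EST SIGMA=EST);
RUN;
```

Assumption 1 is a matter of study design and cannot be checked from the residuals alone.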

Page 12

• The science of statistics allows us to interpret the following statement:

“Before you become too entranced with gorgeous gadgets and mesmerizing video displays, let me remind you that information is not knowledge, knowledge is not wisdom, and wisdom is not foresight. Each grows out of the other, and we need them all.” (Arthur C. Clarke)

• The bold-face words (information, knowledge, wisdom, foresight) are now interpreted through a simple example

Page 13

Example

Design an experiment to evaluate two teaching methods

Two groups of randomly assigned subjects

Group 1: assigned the standard teaching method

Group 2: assigned the new teaching method

y: Assessment (a continuous response)

Want to test hypotheses:

Null: H0: mu_1 = mu_2

Alternative: HA: mu_1 <> mu_2

Page 14

Group      y    Group      y    Group      y    Group      y    Group      y
    1  27.30        1  27.32        1  27.65        1  27.43        1  27.80
    1  28.10        1  27.93        1  27.59        1  27.44        1  27.91
    1  27.40        1  27.32        1  27.73        1  27.42        1  27.61
    1  27.70        1  27.62        1  27.79        1  27.65        1  27.64
    1  28.00        1  27.34        1  27.73        1  27.83        1  27.51
    1  28.10        1  27.63        1  27.24        1  27.81        1  27.67
    1  27.40        1  27.42        1  28.16        1  28.07        1  27.69
    1  27.10        1  27.53        1  28.01        1  27.73        1  27.68
    2  28.40        2  28.11        2  28.26        2  27.96        2  28.63
    2  27.80        2  27.91        2  28.30        2  28.10        2  28.03
    2  28.10        2  28.43        2  28.39        2  28.53        2  27.98
    2  28.30        2  27.84        2  27.88        2  28.44        2  28.17
    2  27.90        2  27.63        2  27.99        2  28.07        2  28.17
    2  27.60        2  28.31        2  28.03        2  28.32        2  27.89
    2  28.50        2  28.33        2  28.33        2  28.13        2  27.85
    2  27.90        2  27.90        2  27.41        2  27.85        2  28.43
    2  28.40        2  28.19        2  28.20        2  28.15        2  27.84
    2  27.70        2  28.36        2  27.92        2  28.02        2  27.65

Information

Looking at the data alone is too overwhelming to interpret.

Page 15

Knowledge

• Apply statistical methods to interpret data

• Make table of summary statistics

• Make side-by-side box plots of data

• Compute p-value from two-sample t-test
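These three steps might look like the following in SAS. This is only a sketch. It assumes the data sit in a data set named `indata` with variables `group` and `y`, matching the later examples. PROC MEANS is used here for the summary table, which may not be the procedure that produced the table in the slides.

```sas
/* Summary statistics by group: N, mean, and variance */
PROC MEANS DATA=indata N MEAN VAR MAXDEC=4;
  CLASS group;
  VAR y;
RUN;

/* Side-by-side box plots of the response for the two groups */
PROC SGPLOT DATA=indata;
  VBOX y / CATEGORY=group;
RUN;

/* Two-sample t-test (pooled and Satterthwaite results are both printed) */
PROC TTEST DATA=indata;
  CLASS group;
  VAR y;
RUN;
```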

Page 16

Summary Statistics

------------------------------
|         |         y        |
|         |------------------|
|         | N | Mean |  Var  |
|---------+---+------+-------|
|group    |   |      |       |
|1        | 40| 27.65| 0.0676|
|2        | 50| 28.09| 0.0739|
------------------------------

Page 17

Graphical Display

Page 18

Wisdom

• Observe a difference in means of 0.44, with group 2 larger than group 1

• The two-sample t-test produces a highly significant p-value (<.0001)

• Is this difference meaningful relative to your expectations of what a substantive improvement should be?
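For reference, the pooled two-sample t statistic can be reproduced approximately from the summary statistics (n = 40 and 50, means 27.65 and 28.09, variances 0.0676 and 0.0739). Because the means and variances are rounded, the result is approximate.

```latex
% Pooled variance from the two group variances
s_p^{2} = \frac{(n_1-1)s_1^{2} + (n_2-1)s_2^{2}}{n_1+n_2-2}
        = \frac{39(0.0676) + 49(0.0739)}{88} \approx 0.0711

% Standard error of the difference in means
\mathrm{SE} = \sqrt{s_p^{2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}
            = \sqrt{0.0711\left(\frac{1}{40}+\frac{1}{50}\right)} \approx 0.0566

% Test statistic on n_1 + n_2 - 2 = 88 degrees of freedom
t = \frac{\bar{y}_2-\bar{y}_1}{\mathrm{SE}} = \frac{28.09-27.65}{0.0566} \approx 7.8
```

A t statistic of about 7.8 on 88 degrees of freedom is consistent with the reported p-value below .0001. The statistic says nothing about whether a difference of 0.44 matters in practice.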

Page 19

Foresight

• If the difference is meaningful, is it worth the time and expense required to implement the new program?

Page 20

Statistical Computations

• Examples shown with SAS code

• Similar results for other programs (SPSS, S-PLUS, Stata, etc.)

• Point-and-click interface with statistical programs is NOT recommended!

• Syntax is used because “writing program code is a good way to debug your thinking” (Bill Venables, 1997), and because the code clearly documents your analysis as you proceed

Page 21

1980s: T-test with PROC GLM

PROC GLM DATA=indata;
  CLASS group;
  MODEL y = group / solution ss3;          /* Type III sums of squares */
  ESTIMATE 'group 1 vs 2' group 1 -1;      /* difference between the two group means */
  LSMEANS group / stderr pdiff tdiff;
RUN;

Computes Sums of Squares and Mean Squares to form F-tests

Page 22

1990s: T-test with PROC MIXED

PROC MIXED DATA=indata METHOD=type3;       /* Type III method gives ANOVA-style output */
  CLASS group;
  MODEL y = group / solution;
  ESTIMATE 'group 1 vs 2' group 1 -1;      /* difference between the two group means */
  LSMEANS group / diff;
RUN;

Computes Sums of Squares, Mean Squares, and F-tests just like PROC GLM

Page 23

• Why do we need yet another statistical procedure for ANOVA?

• Is PROC GLM really obsolete?

Page 24

Real World Situations

• GLM was designed primarily as a “fixed” effects analysis tool and still works fine for these situations

• It’s for the “other” situations that you should now consider alternatives

• Truly independent observations are rarely achieved

Page 25

And there are many “what ifs”

• Unequal variances??

One or more group variances are notably different from the others

Prior solution with GLM: Levene’s test to detect unequal variances and weighted least squares for estimation
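A sketch of that older two-step route is shown below, assuming the same `indata` data set with variables `group` and `y`. The weighted data set `wdata` and the weight variable `w` are hypothetical names. Weighting each observation by the reciprocal of its group's sample variance is one common choice and is not necessarily the weighting the author had in mind.

```sas
/* Step 1: Levene's test for homogeneity of variance (one-way model) */
PROC GLM DATA=indata;
  CLASS group;
  MODEL y = group;
  MEANS group / HOVTEST=LEVENE;
RUN;
QUIT;

/* Step 2: attach weights = 1 / (sample variance of the observation's group). */
/* PROC SQL remerges the group-level variance back onto every row.           */
PROC SQL;
  CREATE TABLE wdata AS
  SELECT *, 1 / VAR(y) AS w
  FROM indata
  GROUP BY group;
QUIT;

/* Step 3: weighted least squares fit */
PROC GLM DATA=wdata;
  CLASS group;
  MODEL y = group / SOLUTION;
  WEIGHT w;
RUN;
QUIT;
```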

Page 26

MIXED: Unequal Variance Model

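PROC MIXED DATA=indat;
  CLASS group;
  MODEL y = group / solution;
  REPEATED / GROUP=group;                  /* separate residual variance for each group */
  LSMEANS group / diff;
RUN;

• With GROUP=group, the REPEATED statement estimates a separate residual variance for each group, which essentially does weighted least squares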

Page 27

RANDOM Effects??

Experimental data frequently are collected in clusters by design or the nature of the study

Subjects who belong to the same cluster are influenced by a common factor

Prior solution: GLM can work with random effects, though it is not very elegant

Page 28

MIXED: RANDOM Intercepts Model

PROC MIXED DATA=indat;
  CLASS group site;
  MODEL y = group / solution;
  RANDOM int / SUBJECT=site;               /* can also enter this statement as: RANDOM site; */
  LSMEANS group / diff;
RUN;

Page 29

REPEATED Measurements??

• Can achieve greater efficiencies by collecting multiple observations from subjects

• Subjects tend to respond alike across trials

• Along with Random effects, this leads to a Within Subject Covariance Matrix

• Limited number of possibilities under GLM:
  * Independence (NOT good)
  * Compound Symmetry (i.e., equal variances and equal covariances)
  * Unstructured (Multivariate)

Page 30

Variety of ways to work with the Within Subject Covariance Matrix

• MIXED can work with it the same as GLM

• The structure and content of the within-subject covariance matrix can take many forms

• In the 21st Century you can consider many additional possibilities

Page 31

GLM: REPEATED MEASURES

1980s data structure

Data placed in a multivariate format:

id  group  time1  time2  time3
 1      1      5      6      8
 2      1      4      .      7
 3      2      6      8      9

PROC GLM DATA=mult_dat;
  CLASS group;
  MODEL time1 time2 time3 = group / nouni;
  REPEATED time 3 / PrintE summary;
RUN;

Page 32

MIXED: REPEATED MEASURES

2000: Data are placed in univariate format:

id  group  time  y
 1      1     1  5
 1      1     2  6
 1      1     3  8
 2      1     1  4
 2      1     2  .
 2      1     3  7

PROC MIXED DATA=indat;
  CLASS group id time;
  MODEL y = group time group*time / solution;
  /* autoregressive; naming time keeps the AR(1) spacing correct when an observation is missing */
  REPEATED time / SUBJECT=id TYPE=ar(1) rcorr;
RUN;
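Moving from the multivariate layout used by GLM to the univariate layout used by MIXED takes a short DATA step. The sketch below assumes the wide data set is `mult_dat` with variables `id`, `group`, and `time1` to `time3`, as in the GLM example, and writes the long data set `indat`.

```sas
/* Sketch: reshape one row per subject into one row per subject and time */
DATA indat;
  SET mult_dat;
  ARRAY t{3} time1-time3;      /* the three repeated measurements */
  DO time = 1 TO 3;
    y = t{time};               /* missing values are carried along as '.' */
    OUTPUT;
  END;
  KEEP id group time y;
RUN;
```

Rows with a missing y stay in the data set. PROC MIXED skips those rows when fitting but still uses the subject's other observations.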

Page 33

Advantages of MIXED Model

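• Uses all available observations from each subject, so estimates remain valid when data are Missing at Random (the GLM repeated measures analysis drops any subject with a missing value)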

• Greater variety and much more flexibility for working with within-subject covariance structures (e.g., type=cs or type=ar(1) or type=un among many others)

• Random effects allow one to compute more appropriate standard errors, F-tests, t-tests, and their associated p-values
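One way to use this flexibility is to fit the same mean model under two candidate covariance structures and compare the fit statistics that PROC MIXED prints (AIC, AICC, BIC, where smaller values indicate a better trade-off between fit and complexity). The sketch below assumes the univariate data set `indat` from the repeated measures example. The output data set names `fit_cs`, `fit_ar1`, and `fit_all` are made up for illustration, and comparing by information criteria is a common practice rather than something the slides prescribe.

```sas
/* Compound symmetry */
ODS OUTPUT FitStatistics=fit_cs;
PROC MIXED DATA=indat;
  CLASS group id time;
  MODEL y = group time group*time;
  REPEATED time / SUBJECT=id TYPE=cs;
RUN;

/* First-order autoregressive */
ODS OUTPUT FitStatistics=fit_ar1;
PROC MIXED DATA=indat;
  CLASS group id time;
  MODEL y = group time group*time;
  REPEATED time / SUBJECT=id TYPE=ar(1);
RUN;

/* Stack the two fit-statistics tables and label each row by structure */
DATA fit_all;
  LENGTH structure $8;
  SET fit_cs(IN=from_cs) fit_ar1;
  IF from_cs THEN structure = 'cs';
  ELSE structure = 'ar(1)';
RUN;

PROC PRINT DATA=fit_all;
RUN;
```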

Page 34

Other Advantages

• Options that print the actual within subject covariance and correlation matrices (e.g., rcorr and vcorr)

• Combined with the Output Delivery System (ODS), MIXED offers a powerful way to compute means and differences in means with p-values, and then present them in tables or graphs

Page 35

One Important Feature

• Utilizes Restricted Maximum Likelihood (REML) to estimate parameters for models with complex covariance structures

• Can only find the peak of the likelihood function (next slide) through iterative search methods

• Models with more complex covariance structures don’t produce “Sums of Squares”

Page 36

Maximum Likelihood Estimation

Page 37

MIXED as a Study Planning Tool

Assume you will have three groups with these anticipated means

• Group 1 = 10, Group 2 = 15, Group 3 = 20

• n=10 subjects per group

• Standard Deviation of response = 4

• What is the “power” of this design to detect differences in these 3 means?

Page 38

ODS OUTPUT TESTS3=tst3;

PROC MIXED DATA=means NOprofile;
  CLASS group;
  MODEL mn_y = group;
  PARMS (16) / NOiter;                     /* hold the residual variance at 16 (SD = 4) */
RUN;

With an intermediate computation you find:

Power = 0.7614
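The slides do not show the intermediate computation. A commonly used version, sketched here, takes the F statistic saved in `tst3`, turns it into a noncentrality parameter, and evaluates the noncentral F distribution. The significance level of 0.05 and the data set and variable names `power`, `alpha`, `ncp`, and `fcrit` are assumptions. The value obtained depends on the contents of the `means` data set, which the slides do not show.

```sas
/* Sketch: power from the Type 3 F-test saved by ODS OUTPUT TESTS3=tst3 */
DATA power;
  SET tst3;                                    /* columns include NumDF, DenDF, FValue */
  alpha = 0.05;                                /* assumed significance level */
  ncp   = NumDF * FValue;                      /* noncentrality parameter */
  fcrit = FINV(1 - alpha, NumDF, DenDF, 0);    /* critical value under the null */
  power = 1 - PROBF(fcrit, NumDF, DenDF, ncp); /* P(F > fcrit) under the alternative */
RUN;

PROC PRINT DATA=power;
RUN;
```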

Page 39

And where does this lead in the 21st Century?

• Many Training Opportunities

• Awareness of Many Possibilities

• Better (more appropriate) statistical tests

• Collaboration essential

Page 40

GLIMMIX (2008)

Utilizes features of Generalized Linear Models (for a variety of distributions) containing Random Effects (within and between subjects)
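As a rough sketch of what this combination looks like, the random intercepts model shown earlier for MIXED could be carried over to a count outcome. The data set `counts` and the variables `y`, `group`, and `site` are hypothetical names, not taken from the slides.

```sas
/* Sketch: Poisson model with a random intercept for each site */
PROC GLIMMIX DATA=counts;
  CLASS group site;
  MODEL y = group / DIST=poisson LINK=log SOLUTION;
  RANDOM INTERCEPT / SUBJECT=site;
  LSMEANS group / DIFF ILINK;    /* ILINK reports means on the original (count) scale */
RUN;
```

The MODEL options mirror PROC GENMOD and the RANDOM statement mirrors PROC MIXED, which is the sense in which GLIMMIX joins the two.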