JSM 2015, Session #256 Resampling as a Way to Test for Differences in Educational Outcomes for Paired Cohorts Observed over Several Years by William M. Goodman, Ph.D. University of Ontario Institute of Technology

Page 1:

JSM 2015, Session #256

Resampling as a Way to Test for Differences in Educational Outcomes for Paired Cohorts Observed over Several Years

by William M. Goodman, Ph.D.
University of Ontario Institute of Technology

Page 2:

Outline

• Introduce the motivating case. (It’s in education evaluation, but other applications are possible.)
• Introduce the proposed method. (It provides a Significance Test and an Effect Size Measure.)
• Explore whether ANOVA could work instead.
• Discuss: Robustness, Power, and other considerations
• Concluding example

Page 3:

The Motivating Case¹

• Studies compared the academic performance of “Pathway” students with that of “Traditional” students.
• “Traditional” students enter their program in Year 1.
• “Pathway” students join a program “mid-stream”, with a prior (not necessarily related) 2-year diploma from a community college. (Sometimes with “bridge” courses.)

[Table: Average marks on “grade-point” scale]

1. Table originally presented at the Student Pathways in Higher Education Conference, Ontario Council for Articulation and Transfer, Feb. 2013.

Page 4:

The Motivating Case (cont’d)

• Suppose these data are representative of similar contexts in the future (so that we can make inferences).

Page 5:

Consider in the Business program:

• “By eye”, the Pathway cohort had higher GPAs in every year for which comparisons could be made. But is that apparent pattern significant? And what would be a good measure of the effect size?

• The displayed numbers appear “paired”, yet something like a paired t test would not apply: the displayed values are averages, and allowance needs to be made for the different n’s and standard deviations in the cells.

Page 6:

• If we had the raw data underlying the cells, ANOVA would be applicable:

We’re looking for a Main Effect for the row variable (Traditional versus Pathway).

[Figure: columns of raw data values for the cells, each continuing “…. Etc.”]

• …But that is often not the case. For example:
  – Working from already-aggregated data (e.g. a table in a paper)
  – Privacy/permissions issues (e.g. requesting data from the Registrar)

Page 7:

Summary data for each cell (year i, cohort j): n_{year i, cohort j}, s_{year i, cohort j}, and GPAs (i.e. means)_{year i, cohort j}

n_{yr2,T}  n_{yr3,T}  n_{yr4,T}  n_{yr5,T}
n_{yr2,P}  n_{yr3,P}  n_{yr4,P}  n_{yr5,P}

s_{yr2,T}  s_{yr3,T}  s_{yr4,T}  s_{yr5,T}
s_{yr2,P}  s_{yr3,P}  s_{yr4,P}  s_{yr5,P}

• What if we could get the summary data, but not the raw data?

• Can we run an ANOVA given only the summary data?
  – Yes, if One-Way, or Two-Way Balanced. (n/a here)
  – Two-Way Unbalanced? …Well, in theory: …

Page 8:

One 2x2 case has been worked out and posted* at: www.stat.ufl.edu/~winner/cases/ethicgen.ppt

*{Thanks to Dr. Larry Winner, University of Florida}

I’ve not found an expanded example or computerized version. And Dr. Winner is not (yet, to my knowledge) working on these.
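
For the One-Way case, an ANOVA can be computed from summary data alone using the standard sums-of-squares identities. The sketch below is only an illustration of that point: the function name and the three-group numbers are hypothetical and are not taken from the talk or from Winner’s example. It returns the F statistic and its degrees of freedom; a p-value would additionally need an F-distribution routine.

```python
def one_way_anova_from_summary(ns, means, sds):
    """Return (F, df_between, df_within) using only each group's n, mean, and sd."""
    k = len(ns)
    total_n = sum(ns)
    # Grand mean is the n-weighted mean of the group means.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    # Between-groups sum of squares needs only n's and means.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Within-groups sum of squares needs only n's and standard deviations.
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_between = k - 1
    df_within = total_n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical summary data for three groups (illustrative values only).
f_stat, df_b, df_w = one_way_anova_from_summary(
    ns=[10, 10, 10], means=[2.0, 3.0, 4.0], sds=[1.0, 1.0, 1.0]
)
print(f_stat, df_b, df_w)
```

The same shortcut does not carry over directly to the unbalanced Two-Way layout, which is the situation in the motivating case.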

Page 9:

Introduction to Template (with Winner’s example²):

Input the known summary data here. (Values are scores from an ‘ethics test’.)

Treatments 1, 0 are Genders: Female, Male
Control for Ranks: F1, F2 are Officers, Enlisted

2. Summary data (only) provided as Exhibit 3 in the paper: White, R.D. “Are women more ethical….” Journal of Public Administration Research and Theory. 1999. URL: www.jstor.org/stable/1181652

Page 10:

Introduction to Template (with Winner’s example):

Calculated values for each column:
• Total n’s
• Difference in means for each column

Sample Statistic: n-Weighted Mean of the Differences

[Template table: columns are Ranks; rows are M and F]
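
As a rough illustration of the Sample Statistic, the sketch below computes an n-weighted mean of the in-column differences in means. It assumes the weight for each column is that column’s total n (both cells combined), which is how the “Total n’s” line is read here; the dictionary keys and the numbers are hypothetical, not values from the template.

```python
def weighted_mean_of_differences(cells):
    """cells: one dict per column, with n1/mean1 (Treatment 1) and n0/mean0 (Treatment 0)."""
    total_n = 0
    weighted_sum = 0.0
    for col in cells:
        col_n = col["n1"] + col["n0"]        # total n for the column (assumed weight)
        diff = col["mean1"] - col["mean0"]   # in-column difference in means
        weighted_sum += col_n * diff
        total_n += col_n
    return weighted_sum / total_n


# Hypothetical two-column summary table (illustrative values only).
example_cells = [
    {"n1": 10, "mean1": 3.0, "n0": 30, "mean0": 2.5},
    {"n1": 20, "mean1": 2.8, "n0": 40, "mean0": 2.6},
]
sample_stat = weighted_mean_of_differences(example_cells)
print(round(sample_stat, 4))
```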

Page 11:

Introduction to Template (with Winner’s example):

Calculated values (cont’d):
• Weighted means for columns
• Pooled standard deviations for columns

[Template table: columns are Ranks; rows are M and F]

For a “No Difference” test for scores based on Treatment Level, the assumption is that, for each column (Fi), both cells’ values are generated by the same, common mean and standard deviation.
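
One plausible way to obtain the common values for a column from its two cells is sketched below. It assumes the weighted mean is the n-weighted mean of the two cell means and that “pooled standard deviation” means the usual two-sample pooled estimate; the slides do not spell out the exact formulas, and the function name and numbers are hypothetical.

```python
import math


def pooled_column_values(n1, mean1, sd1, n0, mean0, sd0):
    """Return (weighted mean, pooled sd) for one column of the summary table."""
    # n-weighted mean of the two cell means.
    col_mean = (n1 * mean1 + n0 * mean0) / (n1 + n0)
    # Conventional pooled variance (assumed here, not confirmed by the slides).
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n0 - 1) * sd0 ** 2) / (n1 + n0 - 2)
    return col_mean, math.sqrt(pooled_var)


# Hypothetical cell values for one column (illustrative only).
col_mean, col_sd = pooled_column_values(
    n1=10, mean1=3.0, sd1=1.0, n0=30, mean0=2.0, sd0=1.0
)
print(col_mean, col_sd)
```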

Page 12:

Resampling Steps

1. Simulate the null assumption (re-stated below) in action: Randomly generate* values for each cell in the source table, based on assumed, common values for mean and std. dev. for cells in the same column.

Each column names a cell position in the source table.

*(Presumed: Distributions are normal.)

For a “No Difference” test for scores based on Treatment Level, the assumption is that, for each column (Fi), both cells’ values are generated by the same, common mean and standard deviation.

Page 13:

Resampling Steps

2. For each TiFj column at lower left, interpret the first n_{Treatment,Factor} random entries in the column as the new (re)sample for that specific, corresponding cell in the table.

Observe, empirically, and record the means for each cell in the just-resampled table.

Page 14:

Resampling Steps

3. The output from Step 2 (re-displayed) represents one sample that might be generated from the underlying population, if the null hypothesis is true.

For this resample, calculate and record the n-Weighted Mean of the in-column Differences of the means.

Page 15:

Resampling Steps

4. Re-sampling Steps 1 – 3 are repeated 5000 times.

Each time, add the re-sample’s outcome to the bottom of the list of all outcomes.

The resulting list of outcomes from this loop approximates the expected sampling distribution for the “Wt’dMeanOfDiff’s” statistic, under null assumptions.

Page 16:

Resampling Steps

5. The p-value for the resample test (two tailed) is the proportion of outputs for the weighted mean of differences whose magnitude (±) is at least that of the Sample Statistic being tested.

This is the re-sampling-based estimate for the p-value.
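
Steps 1 to 5 can be sketched in a few lines of code. This is an illustrative reading of the steps, not the template used in the talk: it assumes total-n weights, the conventional pooled standard deviation for each column, and normal draws, and the dictionary keys and data values are hypothetical.

```python
import math
import random


def weighted_mean(values, weights):
    """n-weighted mean of per-column values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def resample_test(cells, n_resamples=5000, seed=None):
    """Return (sample statistic, two-tailed resampling p-value)."""
    rng = random.Random(seed)
    col_ns = [c["n1"] + c["n0"] for c in cells]
    observed = weighted_mean([c["mean1"] - c["mean0"] for c in cells], col_ns)

    # Null assumption: both cells in a column share one mean and one std. dev.
    null_params = []
    for c in cells:
        mu = (c["n1"] * c["mean1"] + c["n0"] * c["mean0"]) / (c["n1"] + c["n0"])
        var = ((c["n1"] - 1) * c["sd1"] ** 2 + (c["n0"] - 1) * c["sd0"] ** 2) / (
            c["n1"] + c["n0"] - 2
        )
        null_params.append((mu, math.sqrt(var)))

    outcomes = []
    for _ in range(n_resamples):  # Step 4: repeat Steps 1-3
        diffs = []
        for c, (mu, sd) in zip(cells, null_params):
            # Steps 1-2: draw n normal values per cell and record each cell mean.
            mean1 = sum(rng.gauss(mu, sd) for _ in range(c["n1"])) / c["n1"]
            mean0 = sum(rng.gauss(mu, sd) for _ in range(c["n0"])) / c["n0"]
            diffs.append(mean1 - mean0)
        # Step 3: n-weighted mean of the in-column differences for this resample.
        outcomes.append(weighted_mean(diffs, col_ns))

    # Step 5: proportion of outcomes at least as large in magnitude as observed.
    extreme = sum(1 for x in outcomes if abs(x) >= abs(observed))
    return observed, extreme / n_resamples


# Hypothetical summary table with two columns (illustrative values only).
cells = [
    {"n1": 12, "mean1": 3.1, "sd1": 0.6, "n0": 30, "mean0": 2.8, "sd0": 0.7},
    {"n1": 15, "mean1": 2.9, "sd1": 0.5, "n0": 25, "mean0": 2.7, "sd0": 0.6},
]
stat, p_value = resample_test(cells, n_resamples=2000, seed=2015)
print(stat, p_value)
```

Because the draws are random, the p-value changes slightly from run to run unless a seed is fixed.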

Page 17:

Comparison of Results to Winner’s ANOVA-based (as developed for the 2 x 2 case)

*(Exact “p-values” produced by the method will vary slightly if the program is re-run.)

**(Not formally tested by this method. But a graph can convey sufficient information.)

Page 18:

Comparison of Results to Winner’s ANOVA-based (as developed for the 2 x 2 case)

***(For Resampling, you could reverse which variable is interpreted as the Treatment (Rows), and re-run.)

Page 19:

Comparison of Results to Winner’s ANOVA-based (as developed for the 2 x 2 case)

**** NOTE: This method does not “control for” the effect of the column variable (Rank) in the ordinary-least-squares sense of parsing up the error terms.

Yet the method does control for Rank analogously to how a paired t-test controls for a second variable: Calculations of differences based on treatment are not made all at once, but by columns, relative to each column’s specific Treatment 0 values for levels of the Factor variable.

Page 20:

Effect Size Measures

• An intuitive, sample-based estimate for: How large is the likely ‘true’, pair-wise difference in outcomes, when compared by treatment levels?

• Does not depend on the resampling (can be estimated directly)

Page 21:

Effect Size Measures

• Estimate a standard error for each column’s variation:

(Pooled s) / sqrt(n_{Column Fi} / 2)

• Aggregate the standard error estimate by an n-weighted mean of the column estimates.

• Conventionally: Estimate the magnitude of the effect based on:

z* = (Sample Stat (compared to zero)) / (standard error estimate)

{Repeated testing confirms this estimate closely matches a resampling-based estimate for the sample stat’s standard error.}
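
The z* calculation can be sketched directly from summary data. The block below is an illustration under stated assumptions: total-n weights for both the Sample Statistic and the aggregated standard error, and the conventional pooled standard deviation for each column. The function name, dictionary keys, and numbers are hypothetical.

```python
import math


def z_star(cells):
    """Return (z*, aggregated standard error) from per-column summary data."""
    col_ns = [c["n1"] + c["n0"] for c in cells]
    total_n = sum(col_ns)

    # Sample Statistic: n-weighted mean of the in-column differences in means.
    sample_stat = sum(
        n * (c["mean1"] - c["mean0"]) for n, c in zip(col_ns, cells)
    ) / total_n

    # Per-column standard error: (pooled s) / sqrt(n_column / 2).
    col_ses = []
    for n, c in zip(col_ns, cells):
        pooled_var = (
            (c["n1"] - 1) * c["sd1"] ** 2 + (c["n0"] - 1) * c["sd0"] ** 2
        ) / (n - 2)
        col_ses.append(math.sqrt(pooled_var) / math.sqrt(n / 2))

    # Aggregate by an n-weighted mean of the column estimates.
    se = sum(n * s for n, s in zip(col_ns, col_ses)) / total_n
    return sample_stat / se, se


# Hypothetical two-column summary table (illustrative values only).
cells = [
    {"n1": 25, "mean1": 3.0, "sd1": 1.0, "n0": 25, "mean0": 2.5, "sd0": 1.0},
    {"n1": 25, "mean1": 2.9, "sd1": 1.0, "n0": 25, "mean0": 2.4, "sd0": 1.0},
]
z, se = z_star(cells)
print(round(z, 3), round(se, 3))
```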

Page 22:

Effect Size Measures

Caution: It is not recommended to use the z* estimate, parametrically, to generate the p-value for the hypothesis test.

[Figure: Normal Probability Plot for the Distribution of Resampled Sample Stats. Plot title: “Probability Plot of NormalexampleSampleStats, Normal - 95% CI”; horizontal axis: NormalexampleSampleStats; vertical axis: Percent; legend: Mean 0.001478, StDev 0.08596, N 5000, AD 7.521, P-Value <0.005]

The sampling distributions of (differences of sample stat from zero) appear to have a ‘peaked’ distribution, which veers from normal in the tails, precisely where conventional p-value estimates are focused. The resampling method does not require that the sampling distribution be normal (or t, etc.) in the tails.

Page 23:

Robustness and Power

• {Fuller details will be provided in the written, Proceedings version of this paper.}

• Robustness:
  – The resamples generated for the hypothesis test presume normal distributions for data in the columns.
  – What if they are not?

• Upon testing, the actual Type 1 error rates by the new method appeared to equal or stay close to the nominal α presumed by the method for these underlying distributions:
  – (a) censored normal (e.g. if raw grades are percents, and many students fail, but all grades below 50% are classified as GPA = “0”);
  – (b) uniform;
  – (c) skewed
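
A Type 1 error check of this kind can be sketched as a small simulation: generate raw data with no true treatment difference from a non-normal distribution, reduce it to summary data, run the normal-based resampling test, and count rejections. The sketch below is not the study reported in the talk; the cell sizes, the uniform(0, 4) source distribution, the small trial counts, and all names are hypothetical choices made to keep the example fast.

```python
import math
import random
import statistics


def resample_p_value(cells, n_resamples, rng):
    """Normal-based resampling test on summary data (one dict per column)."""
    col_ns = [c["n1"] + c["n0"] for c in cells]

    def stat(diffs):
        return sum(n * d for n, d in zip(col_ns, diffs)) / sum(col_ns)

    observed = stat([c["mean1"] - c["mean0"] for c in cells])
    null_params = []
    for n, c in zip(col_ns, cells):
        mu = (c["n1"] * c["mean1"] + c["n0"] * c["mean0"]) / n
        var = ((c["n1"] - 1) * c["sd1"] ** 2 + (c["n0"] - 1) * c["sd0"] ** 2) / (n - 2)
        null_params.append((mu, math.sqrt(var)))

    extreme = 0
    for _ in range(n_resamples):
        diffs = []
        for c, (mu, sd) in zip(cells, null_params):
            m1 = sum(rng.gauss(mu, sd) for _ in range(c["n1"])) / c["n1"]
            m0 = sum(rng.gauss(mu, sd) for _ in range(c["n0"])) / c["n0"]
            diffs.append(m1 - m0)
        if abs(stat(diffs)) >= abs(observed):
            extreme += 1
    return extreme / n_resamples


def type1_error_rate(draw, cell_ns, n_trials, n_resamples, alpha, seed):
    """Share of null-true trials in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        cells = []
        for n1, n0 in cell_ns:
            # Both cells in a column come from the same source: the null is true.
            a = [draw(rng) for _ in range(n1)]
            b = [draw(rng) for _ in range(n0)]
            cells.append({
                "n1": n1, "mean1": statistics.fmean(a), "sd1": statistics.stdev(a),
                "n0": n0, "mean0": statistics.fmean(b), "sd0": statistics.stdev(b),
            })
        if resample_p_value(cells, n_resamples, rng) <= alpha:
            rejections += 1
    return rejections / n_trials


# Hypothetical setup: uniform "grade-point" data on [0, 4], two columns.
rate = type1_error_rate(
    draw=lambda rng: rng.uniform(0.0, 4.0),
    cell_ns=[(10, 15), (12, 8)],
    n_trials=100, n_resamples=200, alpha=0.05, seed=1,
)
print(rate)
```

With so few trials the estimated rate is noisy; a serious check would use many more trials and resamples, and would swap in censored-normal or skewed `draw` functions for cases (a) and (c).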

Page 24:

Robustness and Power

• Power:

• Simplifying assumptions:
  – Source data are normally distributed.
  – Common standard deviation σ for all cells, each run.
  – In-column “True” differences were of the same size (in units of the common σ).

Page 25:

Robustness and Power

• For realistic sample sizes for such problems, the average cell n has little impact on power.

• If (right graph) the true average column difference of means is approaching 1/2 a standard deviation (relative to the common σ in the table’s cells), it will most likely be picked up.

• Expressed in terms of aggregate standard errors for the column differences: A difference will likely be picked up if the z* to be detected is at least 1.5 standard errors.

[Graphs of power; axis label: “(relative to common σ)”]

Page 26:

Other Issues

• The usual caveats apply about observation-based studies.

• For the educational application (depending on the study and data access), the same individual students may contribute multiple marks for the different years of the study, or even within single years. (Violating independence assumptions.)
  – This limitation would apply for a variety of methods.

Page 27:

Program in Action

Page 28:

If anyone would like a copy of these slides or of the template used for running the model, please write me at: [email protected]

Thank You!