WILCOXON SIGNED-RANK TEST

Dawn VanLeeuwen, Professor, Economics, Applied Statistics & International Business Department, New Mexico State University

Martha Archuleta, Professor, Nutrition, Dietetics & Food Sciences Dept., Utah State University

Cooking Schools Improve Nutrient Intake Patterns of People with Type 2 Diabetes

Journal of Nutrition Education and Behavior, 44(4):319-325.

• Study objective: Determine if cooking classes offered by the Cooperative Extension Service improved nutrient intake patterns in people with type 2 diabetes.

• Main outcome measures: 3-day food records pre and post cooking school participation. Analyzed macronutrients (CHO, fat, protein), fiber, cholesterol, and sodium.

Note: P values for differences are from Wilcoxon signed-rank test (n=117)

• “Program efficacy was assessed using the Wilcoxon signed-rank test to compare differences between the pre-training and post-training nutrient consumption variables.”*

• In addition, the main table reported medians and interquartile ranges (IQRs).

*Archuleta, M., VanLeeuwen, D., Halderson, K., Jackson, K., Bock, M. A., Eastman, W., Powell, J., Titone, M., Marr, C., and L. Wells. 2012. Cooking Schools Improve Nutrient Intake Patterns of People with Type 2 Diabetes. Journal of Nutrition Education and Behavior, 44(4):319-325.

We will answer the following questions:

• Why use the Wilcoxon signed-rank test instead of a paired t-test?

• Why report the median and IQR instead of the mean and standard deviation?

We will briefly discuss measuring program efficacy using alternatives to the post-pre difference.

• post/pre ratio

• Log(post/pre ratio)

Learning Objectives:

• Understand when a nonparametric test such as the Wilcoxon signed-rank test might be more appropriate than a paired t-test.

• Understand why statistics based on the five-number summary might be more appropriate for skewed data than the mean±SD.

• Be aware of alternatives (such as the post/pre ratio) to the post-pre difference as the measure of program efficacy.

Measures of Center and Measures of Variability

• Most commonly used and reported measure of central tendency is the mean.

• Most commonly used and reported measure of variability is the standard deviation (SD).

• These are the best choices for normally distributed data.

• They are also pretty good whenever the distribution is unimodal and symmetric.

But what about asymmetric or skewed data? Consider the following datasets where 95% of the data remain the same:

• Sample with n=200 from a normal distribution (sample mean=0.000; sample median=0.076; SD=9.690; range -22.779 to 23.283; IQR=12.941).

• Bottom 5% of data values changed (sample mean=-2.348; sample median=0.076; SD=19.082; range -126.083 to 23.283; IQR=12.941).

• Bottom 5% of data values changed more (sample mean=-10.486; sample median=0.076; SD=50.877; range -229.276 to 23.283; IQR=12.941).

Resistance

Statistics that are affected by changes to a small portion (e.g., the bottom 5%) of the data:

• The mean goes from 0 to -2.348 to -10.486.

• The SD goes from 9.690 to 19.082 to 50.877.

• The minimum value goes from -22.779 to -126.083 to -229.276.

• The range is also greatly affected.

Resistant statistics:

• The median remains constant at 0.076.

• The IQR remains constant at 12.941.
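The resistance comparison above can be sketched in a few lines of Python. The seed, the simulated normal sample, and the shift of 100 units are assumptions chosen for illustration; this is not the data set behind the numbers on the slide.

```python
import random
import statistics


def summarize(values):
    """Return the mean, SD, median, and IQR of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": q2,
        "iqr": q3 - q1,
    }


# Hypothetical sample: 200 draws from a normal distribution (seed is arbitrary).
rng = random.Random(1)
y = sorted(rng.gauss(0, 10) for _ in range(200))

# Push the bottom 5% of the values further into the lower tail.
k = len(y) // 20
y_shifted = [v - 100 for v in y[:k]] + y[k:]

base = summarize(y)
shifted = summarize(y_shifted)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {base[name]:9.3f} -> {shifted[name]:9.3f}")
```

Only the mean and SD move; the median and IQR depend on the middle of the ordered sample, which the change leaves untouched.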

The Five Number Summary

• Minimum

• Q1 – the 25th percentile

• Q2 – the 50th percentile or median

• Q3 – the 75th percentile

• Maximum
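A minimal sketch of the five number summary in Python follows. The function name and the toy data are hypothetical. Note that Python's default quantile rule is not the same as the SAS "Definition 5" rule used in the listings later in these slides, so quartiles may differ slightly between the two.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))


# Toy data, for illustration only.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7])
print(summary)
```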

• Recall that the SD is $SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$

• The SD is impacted by outliers and changes to small portions of data.

• The inter-quartile range = IQR = Q3 - Q1.

• The IQR is a resistant measure of variability.

• The IQR is unaffected by outliers.

• The IQR provides the range over which the “middle” 50% of the data are distributed.

• For symmetric, unimodal distributions use mean±SD.

• For other distributions or when data include a few extreme outliers, consider using the five number summary or median ± IQR.

• Other options such as the so-called “outlier strategy” (Ramsey and Schafer, 2002) exist.

Example Data Descriptive Statistics

Five number summary, IQR, mean, SD and range.

Variable      min     Q1    Q2    Q3    max    IQR    mean     SD   range
y          -22.78  -6.67  0.08  6.28  23.28  12.94   -0.00   9.69   46.06
y2        -126.08  -6.67  0.08  6.28  23.28  12.94   -2.35  19.08  149.37
y4        -229.28  -6.67  0.08  6.28  23.28  12.94  -10.49  50.88  252.56

Paired t-tests

• Generally considered “robust” to departures from normality.

• But robustness depends on:

• form and the degree of the departure

• sample size

• But may not be “resistant” to outliers.

• Used to draw inference to the mean not the median.

Nonparametric alternatives to the paired t-test

• “Nonparametric” because no specific distributional assumptions are required.

• The Sign test.

• The Wilcoxon signed-rank test.

• Typically used to draw inference to something other than the mean (e.g., the median).

The Sign Test

• Compares number of positive differences to the number of negative differences.

• Uses no information about the magnitudes of the differences.

• Tests the null hypothesis that the median difference is zero.
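The sign test described above can be sketched as an exact two-sided binomial calculation. The function name is hypothetical, and dropping zero differences before counting is an assumption (a common convention, not something the slides state).

```python
from math import comb


def sign_test_p(diffs):
    """Exact two-sided sign test p-value for paired differences."""
    nonzero = [d for d in diffs if d != 0]  # assumption: zeros are discarded
    n = len(nonzero)
    n_pos = sum(d > 0 for d in nonzero)
    k = min(n_pos, n - n_pos)
    # Under the null, each nonzero difference is positive with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test_p([2, 4, 3]))
```

Because only the signs enter the calculation, any three positive differences give the same p-value regardless of their size.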

The Wilcoxon Signed-Rank Test

• Ranks magnitudes of differences then attaches a sign to the rank.

• The p-value can be computed from an exact permutation test or from a normal approximation.

• Assuming symmetry, tests the null hypothesis that the median difference is zero; tests whether the two measurement occasions tend to differ.
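The ranking and sign-attaching steps above can be sketched as an exact permutation test in Python. This is an illustration under stated assumptions, not the routine SAS uses: the function names are hypothetical, zero differences are dropped, tied magnitudes get average ranks, and all 2^n sign patterns are enumerated, which is only practical for small n.

```python
from itertools import product


def midranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def signed_rank_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value by full enumeration."""
    nonzero = [d for d in diffs if d != 0]  # assumption: zeros are discarded
    ranks = midranks([abs(d) for d in nonzero])
    center = sum(ranks) / 2  # null expectation of the positive-rank sum
    observed = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w_plus - center) >= abs(observed - center):
            extreme += 1
    return extreme / 2 ** len(ranks)


print(signed_rank_exact_p([2, 4, 3]))
```

For a sample as large as the n=117 in the paper, full enumeration is not feasible, which is where the normal approximation or a dedicated exact algorithm comes in.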

What is the trade-off? That is, why do we ever use parametric methods?

Parametric versus Nonparametric example

Consider the following data:

Pre Post Difference

3 5 2

2 6 4

4 7 3

• P-values:

• T-test p=0.0351

• Sign test p=0.25

• Wilcoxon signed-rank test p=0.25

Parametric versus Nonparametric example: effect size increased

Consider the following data:

Pre Post Difference

3 6 3

2 7 5

4 8 4

• P-values:

• T-test p=0.0202

• Sign test p=0.25

• Wilcoxon signed-rank test p=0.25
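The paired t-test p-values in the two small examples can be checked with a short Python sketch. It relies on the closed-form t distribution with 2 degrees of freedom, so it is valid only for three pairs; the function name is hypothetical.

```python
import math
import statistics


def paired_t_three_pairs(pre, post):
    """Paired t statistic and two-sided p-value for exactly three pairs (df = 2)."""
    diffs = [b - a for a, b in zip(pre, post)]
    if len(diffs) != 3:
        raise ValueError("the closed-form p-value used here requires n = 3")
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t = statistics.mean(diffs) / std_error
    # For df = 2 the t CDF is 1/2 + t / (2 * sqrt(t^2 + 2)),
    # so the two-sided p-value is 1 - |t| / sqrt(t^2 + 2).
    p = 1 - abs(t) / math.sqrt(t * t + 2)
    return t, p


t1, p1 = paired_t_three_pairs([3, 2, 4], [5, 6, 7])
t2, p2 = paired_t_three_pairs([3, 2, 4], [6, 7, 8])
print(round(p1, 4), round(p2, 4))
```

The t-test p-value drops as the differences grow, while the sign and signed-rank p-values stay at 0.25 because the signs and ranks are the same in both examples.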

• When assumptions of parametric methods are met, nonparametric statistics have less power.

• Nonparametric methods may have no power to detect differences with small sample sizes.

• Nonparametric methods lose some or all of the information contained in the magnitudes of the differences, so power may or may not go up with increasing effect size.

• A nonparametric statistic may not be testing the hypothesis you are interested in.

• Hypotheses about means may not be equivalent to hypotheses about medians.
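The point about small samples can be made concrete. Assuming no zero differences and no ties, the most extreme outcome of an exact sign or signed-rank test is "all differences share one sign", which has probability 1/2^n under the null in each direction. The sketch below (function names are hypothetical) computes that floor on the two-sided p-value.

```python
def min_two_sided_p(n):
    """Smallest possible exact two-sided p-value with n nonzero, untied differences."""
    return min(1.0, 2 / 2 ** n)


def smallest_n_below(alpha=0.05):
    """Smallest n for which the test can reach a p-value below alpha at all."""
    n = 1
    while min_two_sided_p(n) >= alpha:
        n += 1
    return n


print(min_two_sided_p(3), smallest_n_below(0.05))
```

With n=3 the p-value can never fall below 0.25, no matter how large the differences are.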

Example from the paper: Cholesterol

• Analysis run using SAS version 9.3 software (SAS Institute Inc., 2010).

proc univariate data=chol normal plot;

var diff;

run;

The UNIVARIATE Procedure

Variable: diff

Moments

N 117 Sum Weights 117

Mean -56.61433 Sum Observations -6623.8767

Std Deviation 227.291988 Variance 51661.6478

Skewness -3.6938459 Kurtosis 28.4466869

Uncorrected SS 6367757.48 Corrected SS 5992751.14

Coeff Variation -401.4743 Std Error Mean 21.0131517

Basic Statistical Measures

Location Variability

Mean -56.6143 Std Deviation 227.29199

Median -25.8967 Variance 51662

Mode . Range 2438

Interquartile Range 193.37333

Tests for Location: Mu0=0

Test          Statistic        p Value
Student's t   t   -2.69423     Pr > |t|     0.0081
Sign          M   -9.5         Pr >= |M|    0.0957
Signed Rank   S   -962.5       Pr >= |S|    0.0083

Tests for Normality

Test                 Statistic           p Value
Shapiro-Wilk         W      0.714433     Pr < W      <0.0001
Kolmogorov-Smirnov   D      0.145671     Pr > D      <0.0100
Cramer-von Mises     W-Sq   0.852518     Pr > W-Sq   <0.0050
Anderson-Darling     A-Sq   5.277337     Pr > A-Sq   <0.0050

Quantiles (Definition 5)

Quantile Estimate

100% Max 658.3233

99% 326.8267

95% 148.9833

90% 125.6067

75% Q3 54.2433

50% Median -25.8967

25% Q1 -139.1300

10% -290.7033

5% -362.8733

1% -531.8267

0% Min -1780.0167

Extreme Observations

Lowest Highest

Value Obs Value Obs

-1780.017 78 151.563 34

-531.827 90 165.793 82

-427.047 92 190.080 55

-400.927 9 326.827 36

-378.150 107 658.323 44
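As a quick consistency check on the listing above, the standard error and Student's t statistic can be recomputed from the reported moments. This Python sketch only re-derives numbers already shown in the SAS output; the variable names are hypothetical.

```python
import math

# Values copied from the Moments table for the variable diff.
n = 117
mean_diff = -56.61433
sd_diff = 227.291988

std_error = sd_diff / math.sqrt(n)  # reported as 21.0131517
t_stat = mean_diff / std_error      # reported as -2.69423

print(round(std_error, 4), round(t_stat, 4))
```

The t statistic is built from the mean and SD, the two statistics that the strong skewness (-3.69) and the extreme minimum of -1780.017 affect most.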

Note: P values for differences are from Wilcoxon signed-rank test (n=117)

QUESTIONS??

Addressing non-normality by using an alternative measure of program efficacy.

Alternatives to the post-pre difference:

• post/pre ratio

• No difference indicated by a ratio of 1

• Log(post/pre ratio)=log(post) – log(pre)

• No difference indicated by log transformed ratio of 0

Caution when interpreting!!!

Interpreting the log(post/pre ratio)

• Mean(log(ratio))≠ log(mean(ratio))

• Median(log(ratio))=log(median(ratio))

• But if log(ratio) has a symmetric distribution

• Mean(log(ratio))=Median(log(ratio))=log(median(ratio))

• And exp(log(median(ratio)))=median(ratio)

• (Ramsey & Schafer, 2002)

Backtransforming by exponentiating (i.e., computing exp(mean(log(ratio)))) gives a value that can be interpreted as an estimate of the median ratio! And effects on ratios are multiplicative effects.
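The identities above can be illustrated with a small Python sketch. The ratios are made-up values, not data from the study, and an odd sample size is used so that the sample median is a single observation (with an even sample size the equality for medians holds only approximately in the sample).

```python
import math
import statistics

# Hypothetical post/pre ratios, for illustration only.
ratios = [0.5, 0.8, 1.0, 1.25, 4.0]
logs = [math.log(r) for r in ratios]

mean_of_logs = statistics.mean(logs)
log_of_mean = math.log(statistics.mean(ratios))
median_of_logs = statistics.median(logs)
log_of_median = math.log(statistics.median(ratios))

# Back-transforming the mean log ratio gives the geometric mean of the ratios.
back_transformed = math.exp(mean_of_logs)

print(round(mean_of_logs, 4), round(log_of_mean, 4))
print(median_of_logs, log_of_median)
print(round(back_transformed, 4))
```

The log is a monotone transformation, so it carries the median along with it; it does not do the same for the mean.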

Consider the log transformed post/pre ratio for cholesterol

proc univariate data=Chol plots cibasic normal;

var lnratio;

run;

The UNIVARIATE Procedure

Variable: lnratio

Moments

N 117 Sum Weights 117

Mean -0.103481 Sum Observations -12.107272

Std Deviation 0.57419433 Variance 0.32969913

Skewness -0.3700547 Kurtosis 1.17139014

Uncorrected SS 39.4979708 Corrected SS 38.2450987

Coeff Variation -554.87922 Std Error Mean 0.05308428

Basic Statistical Measures

Location Variability

Mean -0.10348 Std Deviation 0.57419

Median -0.11872 Variance 0.32970

Mode . Range 3.38208

Interquartile Range 0.70305

Basic Confidence Limits Assuming Normality

Parameter Estimate 95% Confidence Limits

Mean -0.10348 -0.20862 0.00166

Std Deviation 0.57419 0.50886 0.65893

Variance 0.32970 0.25894 0.43419

Tests for Location: Mu0=0

Test          Statistic        p Value
Student's t   t   -1.94937     Pr > |t|     0.0537
Sign          M   -9.5         Pr >= |M|    0.0957
Signed Rank   S   -681.5       Pr >= |S|    0.0635

Tests for Normality

Test                 Statistic           p Value
Shapiro-Wilk         W      0.979803     Pr < W      0.0751
Kolmogorov-Smirnov   D      0.049231     Pr > D      >0.1500
Cramer-von Mises     W-Sq   0.056903     Pr > W-Sq   >0.2500
Anderson-Darling     A-Sq   0.46553      Pr > A-Sq   >0.2500

Quantiles (Definition 5)

Quantile Estimate

100% Max 1.278154

99% 1.181760

95% 0.825164

90% 0.547117

75% Q3 0.253496

50% Median -0.118722

25% Q1 -0.449552

10% -0.757795

5% -0.952930

1% -1.633783

0% Min -2.103925

Extreme Observations

Lowest Highest

Value Obs Value Obs

-2.10393 78 0.88754 98

-1.63378 90 1.06476 45

-1.58529 92 1.13813 44

-1.53629 9 1.18176 36

-0.99624 40 1.27815 64

Estimates and Confidence Intervals

• Estimated mean for the log ratio: -0.103481

• Estimated median post/pre ratio: exp(-0.103481)=0.9017

• 95% CI for mean log ratio: (-0.20862, 0.00166)

• 95% CI for median post/pre ratio: (0.8117, 1.0017)

• Interpretive statement: The 95% confidence interval for the median ratio estimates that the post-program cholesterol intake is between 0.81 and 1.00 times the pre-program cholesterol intake.
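The back-transformation behind these estimates is a single exponentiation of each value on the log scale. The Python sketch below uses the mean and confidence limits copied from the SAS listing for lnratio; the variable names are hypothetical.

```python
import math

# Values copied from the listing for the log ratio (natural log scale).
mean_log_ratio = -0.103481
ci_log_ratio = (-0.20862, 0.00166)

# Exponentiate to move from the log-ratio scale back to the ratio scale.
median_ratio_estimate = math.exp(mean_log_ratio)
ci_ratio = tuple(math.exp(limit) for limit in ci_log_ratio)

print(round(median_ratio_estimate, 4))
print(tuple(round(limit, 4) for limit in ci_ratio))
```

An interval that contains 0 on the log scale always contains 1 on the ratio scale, since exp(0)=1.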

Analyses of the difference and the log ratio disagree!!

• The CIs suggest no significant difference, as the intervals contain 0 (log ratio) and 1 (ratio).

• The t-test for the log ratio is also not significant (P=0.0537) versus Wilcoxon signed rank for the difference (P=0.0083)

• But the two analyses do not draw inference to the same thing: One method draws inference to a median difference while the other draws inference to the median ratio.

Should we use the analysis of the log ratio?

• Log post/pre ratio appears roughly normal although there still appear to be outliers.

• The Shapiro-Wilk test (P=0.0751), while not significant, is close enough to significance that it does not show the data are clearly consistent with a sample from a normal distribution.

• What really were the research objectives? Are we interested in change measured as a difference or a ratio? Is the median ratio an acceptable measure of central tendency? What about the median difference? If not should we have used the paired t-test and relied on robustness properties of the t-test?

In conclusion, when dealing with skewed distributions or a few extreme outliers…

• Descriptive statistics based on the five number summary may be more meaningful than the mean±SD.

• Nonparametric tests may be more appropriate than the usual t-tools as long as hypotheses tested are of interest.

• Alternative measures of program efficacy might allow use of the usual t-tools but may require alternative interpretations.

References

• Archuleta, M., VanLeeuwen, D., Halderson, K., Jackson, K., Bock, M. A., Eastman, W., Powell, J., Titone, M., Marr, C., and L. Wells. 2012. Cooking Schools Improve Nutrient Intake Patterns of People with Type 2 Diabetes. Journal of Nutrition Education and Behavior, 44(4):319-325.

• Ramsey, F. L., and D. W. Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis, 2nd Ed. Pacific Grove, CA: Duxbury.

QUESTIONS??
