  • CHAPTER 7

    POWER COMPARISONS FOR UNIVARIATE TESTS FOR NORMALITY

    "Depending on the nature of the alternative distribution and on the sample size, the various procedures show to better or worse advantage."

    Shapiro, Wilk and Chen, 1968

    The most frequent measure of the value of a test for normality is its power, the ability to detect when a sample comes from a non-normal distribution. All else being equal (which decidedly never happens) the test of choice is the most powerful. However, in addition to power, which depends on both the alternative distribution and sample size, choice of test when assessing normality can be based on a variety of other reasons, including ease of computation and availability of critical values. Ideally, one would prefer the most powerful test for all situations, while in reality no such test exists.

    7.1 Power of Tests for Univariate Normality

    Often, while the specific alternative is not known, some general characteristics of the data may be known in advance (e.g., skewness). If not, there may be limited concerns about the types of departures from normality. For example, regression residuals which are symmetric but have short tails are usually not of interest, so only a test which has high power at detecting skewed and long-tailed symmetric alternatives need be considered. Therefore, it is important to be able to identify which tests are competitively powerful under certain specific situations, in case some information is known concerning the alternative.

    Copyright 2002 by Marcel Dekker, Inc. All Rights Reserved.

    Table 7.1 Univariate tests for normality discussed in this chapter.

    Test Symbol    Test Name                               Reference
    a              Geary's test                            Section 3.3.1
    A²             Anderson-Darling test                   Section 5.1.4
    √b₁            skewness                                Section 3.2.1
    b₂             kurtosis                                Section 3.2.2
    χ²             chi-squared test                        Section 5.2
    D              D'Agostino's D                          Section 4.3.2
    D*             Kolmogorov-Smirnov                      Section 5.1.1
    Eₖ             Tietjen-Moore test for > 1 outlier      Section 6.2.6
    EDF tests                                              Section 5.1
    F₁, F₂         LaBreque's tests                        Section 2.3.3
    k²             P-P correlation test                    Section 2.3.2
    Kₘₙ            sample entropy test                     Section 4.4.1
    K²             joint kurtosis/skewness test            Section 3.2.3
    Lₖ             Grubbs' test for > 1 outlier            Section 6.2.4
    MPLSI tests                                            Section 4.2
    r              probability plot correlation            Section 2.3.2
    R              rectangular skewness/kurtosis test      Section 3.2.3
    Sₛ             omnibus MPLSI test                      Section 4.2.6
    T              Grubbs' outlier test                    Section 3.4.1, 6.2.1
    T₁ₙ, T₂ₙ       Locke and Spurrier tests                Section 4.3.1
    T₁, T₂         Oja's test                              Section 4.3.3
    T*             Locke and Spurrier test                 Section 4.3.1
    u              range test                              Section 4.1.2, 6.2.3
    U              Uthoff's test                           Section 3.4.2, 4.1.3
    U²             Watson's test                           Section 5.1.3
    V              Kuiper's V                              Section 5.1.2
    W              Wilk-Shapiro test                       Section 2.3.1
    W'             Shapiro-Francia test                    Section 2.3.2
    W²             Cramer-von Mises test                   Section 5.1.3
    z              Lin and Mudholkar's test                Section 4.4.2

    It is also important to know which tests have decent power under all types of alternatives, for those instances where no a priori information is available. It is also useful to have tests which can be used as substitutes for each other. Therefore, we have compiled the results of many studies into a summary of the power of tests for normality. We then make recommendations for testing based on different scenarios.

    7.1.1 Background of Power Comparison Simulations

    The advent of computers, along with the seminal papers on tests for normality by Shapiro and Wilk (1965) and Shapiro, Wilk and Chen (1968), essentially set the standards for the development and presentation of new tests for normality (as well as tests for other distributions). In general, theoretical power calculations for specific tests are either difficult or intractable; in cases where power could be estimated, it was usually based on asymptotic approaches (e.g., Geary, 1947). Thus, simulation became the vehicle of convenience for estimating power and the comparison of tests.
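    As a rough illustration of how such simulation studies estimate power, the sketch below draws repeated samples from an alternative distribution and records how often a test rejects. It is a minimal sketch under stated assumptions, not a reproduction of any cited study: the test is an upper-tailed sample kurtosis test, its critical value is itself simulated under normality, and the sample size (20), level (0.05), replication count (2000), seed and all function names are illustrative choices.

```python
import random


def kurtosis_b2(xs):
    """Sample kurtosis b2 = m4 / m2**2, with moments about the mean (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2


def upper_critical_value(stat, n, alpha, reps, rng):
    """Simulated upper-tail critical value of `stat` for normal samples of size n."""
    null_values = sorted(
        stat([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
    )
    index = min(reps - 1, int((1.0 - alpha) * reps))
    return null_values[index]


def estimate_power(stat, draw, n, crit, reps, rng):
    """Fraction of simulated samples whose statistic exceeds the critical value."""
    rejections = sum(
        stat([draw(rng) for _ in range(n)]) > crit for _ in range(reps)
    )
    return rejections / reps


def double_exponential(rng):
    """One draw from a standard double exponential (Laplace) distribution."""
    magnitude = rng.expovariate(1.0)
    return magnitude if rng.random() < 0.5 else -magnitude


def standard_normal(rng):
    """One draw from the null (normal) distribution."""
    return rng.gauss(0.0, 1.0)


# Illustrative settings only (assumptions, not taken from any study).
rng = random.Random(2002)
crit = upper_critical_value(kurtosis_b2, n=20, alpha=0.05, reps=2000, rng=rng)
power = estimate_power(kurtosis_b2, double_exponential, n=20, crit=crit,
                       reps=2000, rng=rng)
size = estimate_power(kurtosis_b2, standard_normal, n=20, crit=crit,
                      reps=2000, rng=rng)
print("critical value:", round(crit, 3))
print("estimated size:", size)
print("estimated power:", power)
```

    With a fixed seed the run is reproducible. The estimated power against the double exponential should exceed the estimated size under normality, but the exact figures depend on the seed and on the number of replications.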

    At the time, there were relatively few tests for normality: besides W, there were only four moment-type tests (√b₁, b₂, u, and a) and the more general goodness of fit tests (e.g., χ² and EDF tests). The choice of test was actually more limited than that since the χ² and EDF tests were only valid for simple hypotheses, which is not a common practical situation. Also, W had only been developed for sample sizes up to 50.

    Shortly after the introduction of W, Lilliefors (1967) presented some distributional results for the Kolmogorov-Smirnov test, D*, for a composite normal null hypothesis. In hindsight it can be stated that this was not useful since this test is almost universally not recommended as a test for normality because of its poor power properties.

    In 1971, D'Agostino introduced his D statistic, for use as an omnibus test in samples of size over 50. Shapiro and Francia (1972), Weisberg and Bingham (1975) and Filliben (1975) suggested correlation tests similar in construction to W which also overcame the sample size limitation of W. Between the introduction of W in 1965 and the probability plot correlation test in 1975, there were essentially no other new tests for normality introduced, making Filliben's (1975) simulation the last word in power comparisons at the time, since he included all of the well-known tests (excluding EDF tests) for normality. The only exceptions seem to be those tests developed by Uthoff (1968; 1973).

    The use of EDF tests as tests for normality did not become popular until about that time, when Stephens (1974) not only developed null distributions for composite EDF tests for the normal distribution, but also identified relationships of critical values with functions of sample size, making these tests more widely available and applicable. A comparison of the power of these tests with W showed that at least some of the EDF tests were useful as tests for normality. This nearly doubled the number of tests that could be used in power comparisons.

    The complexity of power comparisons, which were to become almost mandatory when presenting new tests, was also increased by the number of alternatives used to compare tests in the earlier studies. Shapiro, Wilk and Chen (1968) used 45 parameterizations of 12 different alternative distributions. Pearson, D'Agostino and Bowman (1977) presented power estimates for 58 parameterizations of 12 different distributions. While both of these studies only included a small number of useful tests (Pearson, D'Agostino and Bowman included only four omnibus tests and four directional tests) and did not include composite hypothesis EDF tests, they set a standard which would be difficult to measure up to, given space limitations in journals. Not only would a power comparison use up a lot of space when it included all tests (or at least all that had shown some useful characteristics), but very little additional information would be gained from ensuing publications, which would also have to include all tests (plus one new one) and the large range of alternatives.

    These difficulties gave rise to the practice of comparing a new test with a small subset of tests for normality and/or alternatives, during the time when the development of tests for normality flourished, from 1975 to the middle of the 1980s. For example, Locke and Spurrier (1976) only compared their two tests (T₁ₙ and T₂ₙ) with √b₁. Although this is an extreme example of the limitations on power comparisons, there were very few large scale comparisons which could be used to directly compare a large number of tests for a broad range of alternatives.

    In addition, there was no common standard for the design of the power comparisons. Different studies used different sample sizes and α levels. Reliability of the estimated power differed between studies, because different numbers of replications were used. Tests were sometimes used as two-tailed and sometimes as one-tailed tests; sometimes it was not stated how many tails were used. In some comparisons the estimated power of a new test was based on a new simulation, while the power estimates for the comparison tests were obtained from a previously published study.
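    To see why the number of replications matters for reliability, note that a simulated power is a proportion of rejections. Assuming independent replications (an assumption made here for illustration), its binomial standard error is sqrt(p(1 - p)/N). The short sketch below evaluates this for a few hypothetical replication counts; the function name is illustrative.

```python
import math


def power_standard_error(power, replications):
    """Binomial standard error of a power estimate from independent replications."""
    return math.sqrt(power * (1.0 - power) / replications)


# Worst case is a true power of 0.5; replication counts are hypothetical.
for reps in (200, 1000, 10000):
    print(reps, round(power_standard_error(0.5, reps), 4))
# prints:
# 200 0.0354
# 1000 0.0158
# 10000 0.005
```

    Under this approximation, an estimate based on 200 replications carries roughly seven times the standard error of one based on 10000, so small reported differences in power between studies should be read with that uncertainty in mind.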

    7.1.2 Power Comparisons: Long-Tailed Symmetric Alternatives

    Shapiro and Wilk (1965) compared W, √b₁, b₂, and u. They included the χ² and EDF tests in their comparison but assumed known parameters so that the tests could be used with a simple hypothesis; therefore, those results will not be discussed here. Their simulation only included 200 samples of size 20. For the three long-tailed symmetric alternatives they used, W was seen to be generally competitive with b₂, although neither was decisively better. In their more extensive simulation, Shapiro, Wilk and Chen (1968) used the same tests but included ten long-tailed symmetric alternatives and five sample sizes between 10 and 50. W and b₂ were again competitive, with b₂ tending to be more powerful for the larger sample sizes.

    D'Agostino (1971) compared D with the simulation results of Shapiro, Wilk and Chen (1968). Using only samples of size 50, he determined that D was competitive with W and b₂ for long-tailed symmetric distributions. D'Agostino and Rosman (1974) only compared Geary's test, W and D, but used samples of size 20, 50 and 100; they found that both D and Geary's a worked well for long-tailed symmetric alternatives. Hogg (1972) showed virtually no difference between Uthoff's (1968) U, asymptotically equivalent to a, and b₂ for logistic and double exponential alternatives. Smith (1975), using the same tests as Hogg, only considered symmetric long-tailed stable alternatives, and suggested that U be used, especially as tail heaviness increases.

    Csorgo, Seshadri and Yalovsky (1973) showed that a Kolmogorov-Smirnov test based on a sample characterization of normality yields results comparable to those of W for small samples. Stephens (1974) compared composite hypothesis EDF tests to W and D and showed that the Anderson-Darling and Cramer-von Mises tests were comparable to both tests, with A² being slightly better than W and W², and not quite as good as D. The Kolmogorov-Smirnov test D* always performed worse than the other tests. Green and Hegazy (1976) compared D, W and some modified EDF tests, with D nearly always being most powerful for the Cauchy and double exponential alternatives for samples between 5 and 80.

    Filliben (1975) showed little difference between a, D, W, W', b₂ and r for samples of size 20; for samples of size 50, b₂ and W did not seem competitive, and a was marginally the best test. Gastwirth and Owens (1977) indicated that a seemed better than b₂ for long-tailed symmetric alternatives. For samples of size 20, Spiegelhalter (1977) showed that the MPLSI test for a double exponential alternative dominated W and b₂ for several long-tailed symmetric alternatives.

    Of the tests introduced by LaBreque (1977), F₁ was always most powerful when compared to W, A², a and b₂, sometimes by an appreciable amount. His results were based only on samples of size 12 and 30.

    Pearson, D'Agostino and Bowman (1977) showed that for omnibus tests, the combined skewness and kurtosis test K² was more powerful than W or D; however, when used in a directional manner, D was better than all of the omnibus tests, but there was no real difference between D and a directional (upper-tailed) b₂ test. White and MacDonald (1980) also indicated that the power of D slightly exceeded both W and b₂. Spiegelhalter (1980) showed a possible slight advantage of his omnibus test Sₛ over W and b₂ in some circumstances. Oja (1981) showed T₂ and Locke and Spurrier's (1977) T* tests to be essentially equivalent in power, and better than W and b₂ for samples of size 20. A modification of T₂ (Oja, 1983) which is easier to calculate showed slight loss of power relative to T₂, but still exceeded that of the other tests. The MPLSI test for a Cauchy alternative (Franck, 1981) had better performance than other tests for stable alternatives, at least for samples of size 20; for samples of size 50 not much difference was demonstrated.

    For scale contaminated normal alternatives, which have population kurtosis greater than 3, kurtosis and other absolute moment tests with exponent greater than 3 had higher power, on average, than other tests, including D, a, U and W (Thode, Smith and Finch, 1983). However, D, a and U had nearly equivalent power to the absolute moment tests for the heavier tailed mixtures.

    Thode (1985) showed that a, D and U were the best tests for detecting double exponential alternatives.

    Looney and Gulledge (1984; 1985) showed essentially no difference in power among correlation tests based on different plotting positions except, notably, W', which had slightly lower power than the others. Gan and Koehler (1990) also showed that the correlation test based on the plotting position i/(n + 1) had slightly higher power than A² and W. Tests based on normalized spacings (Lockhart, O'Reilly and Stephens, 1986) were not as good as either A² or W.

    7.1.3 Power Comparisons: Short-Tailed Symmetric Alternatives

    For samples of size 10 to 50, Shapiro and Wilk (1965) and Shapiro, Wilk and Chen (1968) showed that u usually dominated both b₂ and W when the alternative was short-tailed and symmetric. This may not be surprising since u is the likelihood ratio test for a uniform alternative to normality.
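    For readers who want to compute the range test, the sketch below assumes the usual definition u = (largest observation - smallest observation)/s, with s the sample standard deviation using divisor n - 1. The definition is stated here as an assumption (the statistic itself is defined elsewhere in the book), and the function name is illustrative.

```python
import statistics


def range_statistic_u(xs):
    """Range test statistic: sample range divided by the sample standard deviation.

    Assumes s uses the n - 1 divisor, as statistics.stdev does.
    """
    return (max(xs) - min(xs)) / statistics.stdev(xs)


# Evenly spaced, uniform-like data: range 4, s = sqrt(2.5).
print(round(range_statistic_u([1, 2, 3, 4, 5]), 4))  # prints 2.5298
```

    Because short-tailed samples have a small range relative to their standard deviation, small values of u are the ones that point toward short-tailed alternatives; critical values would have to come from tables or simulation.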

    Using samples of size 50, D'Agostino (1971) indicated that D usually had lower power than u, W and b₂. D'Agostino and Rosman (1974) only compared Geary's test, W and D, using samples of size 20, 50 and 100; they found that a single-sided test based on Geary's a worked best for short-tailed symmetric alternatives, compared to the other two tests. Hogg (1972) showed dominance of u for a uniform alternative.

    EDF tests based on characterizations of normality (Csorgo, Seshadri and Yalovsky, 1973) were never as powerful as W for the short-tailed alternatives they used; however, they did not use any other tests for normality in their comparison. Using only the uniform distribution as a short-tailed alternative, Stephens (1974) showed that W dominated D and the EDF tests. Green and Hegazy (1976) compared D, W and some modified EDF tests which were more powerful than both D and W; however, again only the uniform alternative was used, and u was not included.

    Filliben (1975) showed that u was most powerful for all of the short-tailed alternatives he used in his simulation study, with b₂ being a somewhat distant second. For samples of size 20, Spiegelhalter (1977) showed that the MPLSI test for a uniform alternative dominated W and b₂ for the two short-tailed alternatives he used; however, this test is equivalent to u.

    Of the tests compared by LaBreque (1977), u was always most powerful; however, his only short-tailed symmetric alternative was the uniform, and his results were based only on samples of size 12 and 30.

    Pearson, D'Agostino and Bowman (1977) showed that for omnibus tests, W was somewhat better than K² and R (their notation for the rectangular bivariate joint skewness and kurtosis test, Section 3.2.3), while a two-tailed D had poor power. As a directional test, b₂ had appreciably higher power than D; however, u was not considered in this comparison. Spiegelhalter's (1980) omnibus test Sₛ was comparable to W and b₂ for the two alternatives (uniform and Tukey(0.7)) he included in his study for samples of size 20 and 50. Oja (1981) showed T₂ and Locke and Spurrier's (1977) T* tests to be essentially equivalent in power, and better than W for samples of size 20. A modification of T₂ (Oja, 1983) showed a slight increase in power over T₂.

    For selected sample sizes, Thode (1985) showed that for a uniform alternative the best tests were u, the lower-tailed Grubbs' T, and absolute moment tests with moment greater than 2 (including b₂, the absolute fourth moment test). No other short-tailed symmetric alternatives were examined.

    Of all correlation tests based on different plotting positions, W had the highest power (Looney and Gulledge, 1984; 1985). Tests based on normalized spacings (Lockhart, O'Reilly and Stephens, 1986) were not as good as either A² or W. W had noticeably higher power than EDF tests, including A², and correlation tests based on P-P plots, such as k² (Gan and Koehler, 1990).

    7.1.4 Power Comparisons: Asymmetric Alternatives

    For samples of size 10 to 50, Shapiro and Wilk (1965) and Shapiro, Wilk and Chen (1968) showed a possible advantage of W over other tests, including the commonly used √b₁, against asymmetric alternatives. D'Agostino (1971) showed this also for samples of size 50, with D having poor power relative to W and √b₁.

    Tests based on characterization of normality (Csorgo, Seshadri and Yalovsky, 1973) had poor power for asymmetric alternatives, and here also W was slightly better than √b₁ for samples up to size 35. Stephens (1974) showed that W had higher power than composite EDF tests, although A² was somewhat competitive; D fared poorly in his comparison. For samples of size 90, Stephens used W' rather than W, and showed that it was slightly more powerful than A². For all sample sizes, D and the Kolmogorov-Smirnov tests did poorly, with W², Kuiper's V and U² having intermediate power.

    Filliben (1975) showed that for samples of size 20 and 50, the best tests were W, r and √b₁, in that order. Locke and Spurrier (1976) compared T₁ₙ, T₂ₙ and √b₁ for asymmetric alternatives, breaking them down into distributions with both tails light (e.g., beta distributions), both tails heavy (e.g., Johnson U), and one light tail and one heavy tail (e.g., gamma). For both tails heavy, √b₁ was best, while for other alternatives T₁ₙ was best, particularly for those alternatives with both tails light.

    For samples of size 12 and 30, LaBreque's (1977) F₂ was better than W when √β₁ was 2 or less, and W was slightly better otherwise. Slightly less powerful than these two tests, and essentially equivalent to each other, were A², √b₁ and LaBreque's F₁.

    Of the omnibus tests compared by Pearson, D'Agostino and Bowman (1977), W was by far the most powerful; it was also somewhat competitive with the single-tailed √b₁ and right angle tests, which were about equal in power. Against stable alternatives, Saniga and Miles (1979) showed that √b₁ was more powerful than W for samples of sizes between 10 and 100; they also included b₂, D, u and a joint skewness/kurtosis test for comparison. Using a χ² and lognormal distribution as alternatives, White and MacDonald (1980) showed that W and W' were equivalent in power for samples from 20 to 50, and were more powerful than √b₁. For samples of size 100, W was the most powerful test. For samples of size 20 and 30, Lin and Mudholkar (1980) showed the highest power for Vasicek's (1976) Kₘₙ for beta distributions, while for other alternatives either W or z had the highest power. EDF tests (Kolmogorov-Smirnov and Cramer-von Mises) and √b₁ were also compared in this simulation. Spiegelhalter's (1980) omnibus test was better than W for samples of size 20, while W had higher power than Sₛ for samples of size 50; √b₁ was less powerful than both tests for both sample sizes. Oja (1981) showed T₁ and Locke and Spurrier's (1977) T₁ₙ tests to be essentially equivalent in power, and better than W and √b₁ for samples of size 20. A modification of T₁ (Oja, 1983) which is easier to calculate showed power equivalent to T₁.

    In their comparison of correlation tests and W, Looney and Gulledge (1984; 1985) showed a slight advantage in power for W, although all correlation tests were about equivalent in power. For samples of size 20 and 40, the A² test based on normalized spacings showed a slight but consistent advantage over W (Lockhart, O'Reilly and Stephens, 1986). W and A² outperformed the P-P plot correlation test k² (Gan and Koehler, 1990).

    7.1.5 Recommendations for Tests of Univariate Normality

    On the basis of power, the choice of test is directly related to the information available or assumptions made concerning the alternative. The more specific the alternative, the more specific and more powerful the test will usually be; this will also result in the most reliable recommendation. The choice of test should also take into account ease or practicality of computation, and necessary tables (coefficients, if applicable, and critical values) should be available. All recommendations are based on the assumption that the parameters of the distribution are unknown.

    Regardless of the degree of knowledge concerning the distribution, it should be common practice to graphically inspect the data. Therefore, our first recommendation is to always inspect at least a histogram and a probability plot of the data.

    If the alternative is completely specified up to the parameters, then an optimal test for that distribution should be used, i.e., either a likelihood ratio test or a MPLSI test (Chapter 4), assuming such a test exists. For example, for an exponential distribution Grubbs' statistic is the MPLSI test. If no alternative-specific test can be used, then a related test may be available, e.g., Uthoff's U is the MPLSI test for a double exponential alternative, but since critical values are not readily available a single-sided test based on Geary's a could be used in its place. The next choice would be a directional test for the class of the alternative, e.g., a one-tailed √b₁ for a gamma alternative.

    If the shape and the direction of the shape (e.g., skewed to the left; symmetric with long tails) are assumed known in the event the alternative hypothesis is true, but a specific alternative is not, then usually a one-tailed test of the appropriate type will be more powerful than omnibus or bidirectional tests. Grubbs' statistic, W or one-tailed √b₁ will usually be among the most powerful of all tests for detecting a skewed distribution in a known direction. These are also the tests of choice (using the appropriate choice of critical values for Grubbs' statistic and √b₁) for skewed alternatives in which the direction of skewness is not prespecified.
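    The moment-type statistics named in these recommendations are simple to compute. The sketch below assumes the standard definitions √b₁ = m₃/m₂^(3/2), b₂ = m₄/m₂² and Geary's a = (mean absolute deviation)/√m₂, where mₖ is the k-th sample moment about the mean with divisor n. The function names are illustrative, and critical values for a one-tailed or two-tailed test would still have to come from tables or simulation.

```python
import math


def central_moment(xs, k):
    """k-th sample moment about the mean, using divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n


def sqrt_b1(xs):
    """Sample skewness: m3 / m2**1.5 (positive for a long right tail)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def b2(xs):
    """Sample kurtosis: m4 / m2**2 (about 3 for normal data)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


def geary_a(xs):
    """Geary's a: mean absolute deviation divided by sqrt(m2)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_abs_dev = sum(abs(x - mean) for x in xs) / n
    return mean_abs_dev / math.sqrt(central_moment(xs, 2))


data = [1, 2, 3, 4, 10]  # hypothetical right-skewed sample
print(round(sqrt_b1(data), 4))  # prints 1.1384
print(round(b2(data), 4))       # prints 2.788
print(round(geary_a(data), 4))  # prints 0.7589
```

    For a skewed alternative in a known direction, only one tail of √b₁ would be used (the upper tail for right skewness); the sign convention above makes that choice explicit.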

    Uthoff's U is the MPLSI test for a double exponential alternative, and Geary's test is asymptotically equivalent to U; therefore, these tests would be likely candidates for detecting long-tailed symmetric alternatives. D'Agostino's D is based on a kernel which indicates that it might also be appropriate for long-tailed symmetric alternatives. Many of the power comparisons described above have shown that these tests should be used, in a one-tailed manner, under these circumstances. LaBreque's F₁ needs to be investigated further for this class of alternatives. For short-tailed symmetric distributions, theoretical and simulation results indicate the best tests are u, one-tailed b₂ or Grubbs' statistic.

    If there is no prior knowledge about the possible alternatives, then an omnibus test would be most appropriate. A joint skewness and kurtosis test such as K² provides high power against a wide range of alternatives, as does the Anderson-Darling A². The Wilk-Shapiro W showed relatively high power among skewed and short-tailed symmetric alternatives when compared to other tests, and respectable power for long-tailed symmetric alternatives.

    Tests to be avoided for evaluation of normality include the χ² test and the Kolmogorov-Smirnov test D*. The χ² test, however, has often been shown to have among the highest power of all tests for a lognormal alternative. Half-sample methods (Stephens, 1978) and spacing tests also have poor power for testing for normality.

    7.2 Power of Outlier Tests

    Relative to tests for normality, there have been few power comparisons of tests for outliers. One reason may be that there are relatively few outlier tests, and each of the outlier tests has a specific function. For example, Grubbs' (1950, 1969) and Dixon's (1950, 1951) outlier tests are used to detect a single outlier, whereas Lₖ is a test for k > 1 outliers; therefore, a comparison between Tₙ and L₃, say, would be meaningless. Similarly, Lₖ and sequential procedures would not be comparable since for the former the number of outliers is prespecified while for the latter the number of outliers is tested for sequentially. Some of the tests that are usually not thought of as outlier tests (√b₁, b₂, u, tests for normal mixtures) are not often compared to tests labeled as outlier tests.

    7.2.1 Power Comparisons of Tests

    Ferguson (1961) compared the power of b₂ and √b₁ with Grubbs' and Dixon's outlier tests. He used normal random samples and added a fixed constant to one observation in each sample. He found virtually no difference in power between Grubbs' outlier test and √b₁, with Dixon's test being only slightly less powerful. Similarly, he computed b₂, T and Dixon's r for the same samples in order to determine the power when two-tailed tests were required. Here he found that b₂ and T were virtually identical and again Dixon's test was only slightly lower in power.

    In a second experiment Ferguson added a positive constant to two observations in the samples to determine the power of T and b₂, in particular because of the possibility of the masking effect on T. Kurtosis did significantly better than Grubbs' outlier test when there was more than one outlier in the sample; Dixon's test was not included in this experiment.

    Thode, Smith and Finch (1983) showed T to be one of the most powerful tests at detecting scale contaminated normal distributions, a commonly used model for generating samples with outliers. Samples were generated differently than those of Ferguson: contaminating observations were generated randomly so that none, one or more than one contaminating observation may have existed in each sample. T was shown to have 92% relative power to kurtosis over all parameterizations studied, where relative power was defined as the ratio of sample sizes needed for each test (b₂ to T) in order to obtain the same power. T significantly outperformed √b₁, the Dixon test and u, which had only 65% relative power to b₂.
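    A scale contaminated normal sample of the kind described above can be sketched as follows. This is only an illustration under stated assumptions, not the design of the cited study: each observation is drawn from N(0, scale²) with a given contamination probability and from N(0, 1) otherwise, and Grubbs' statistic is taken in its two-sided form T = max|xᵢ - x̄|/s with divisor n - 1 for s. The contamination probability, scale, seed and function names are hypothetical.

```python
import random
import statistics


def scale_contaminated_sample(n, contamination, scale, rng):
    """Draw n observations; each is N(0, scale^2) with probability
    `contamination` and N(0, 1) otherwise, so a sample may contain
    none, one or several contaminating observations."""
    return [
        rng.gauss(0.0, scale if rng.random() < contamination else 1.0)
        for _ in range(n)
    ]


def grubbs_t(xs):
    """Two-sided Grubbs statistic: largest absolute deviation from the
    mean, divided by the sample standard deviation (divisor n - 1)."""
    mean = sum(xs) / len(xs)
    s = statistics.stdev(xs)
    return max(abs(x - mean) for x in xs) / s


rng = random.Random(1983)  # arbitrary seed for reproducibility
sample = scale_contaminated_sample(n=25, contamination=0.1, scale=3.0, rng=rng)
print(len(sample), round(grubbs_t(sample), 3))

# A hand-checkable case: one value far from four identical ones.
print(round(grubbs_t([0, 0, 0, 0, 10]), 4))  # prints 1.7889
```

    Repeating the draw many times and counting how often T exceeds its critical value would give a power estimate for one parameterization; the same samples could be passed to other statistics so that tests are compared on identical data.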

    Whereas in the above three studies the measure of performance of a test was simply how often the null hypothesis was rejected, Tietjen and Moore (1972) and Johnson and Hunt (1979) examined other characteristics of outlier tests. Specifically, they were interested in the performance of the tests in detecting which and how many observations were identified as outliers using Eₖ or (sequentially) other tests for outliers.

    Johnson and Hunt (1979) claimed that T was superior to the Tietjen-Moore, W and Dixon tests when there was one extreme value in the sample tested. T did show loss of performance, especially compared to more general normality tests, when there was more than one outlier (Ferguson, 1961; Johnson and Hunt, 1979). In a comparison of a number of tests for normality and goodness of fit in the context of normal mixtures, Mendell, Finch and Thode (1993) showed that √b₁ was the most powerful test when more than one outlier was present (and they were all in the same direction).

    7.2.2 Recommendations for Outlier Tests

    When there is the possibility of a single outlier in a sample, T has consistently been shown to be better than other procedures. For more than one outlier, √b₁ and b₂ should be used when there are outliers in one or in both directions, respectively. However, if identification of the number of outliers is of concern, then a sequential procedure should be used.

    References

    Csorgo, M., Seshadri, V., and Yalovsky, M. (1973). Some exact tests for normality in the presence of unknown parameters. Journal of the Royal Statistical Society B 35, 507-522.

    D'Agostino, R.B. (1971). An omnibus test of normality for moderate and large size samples. Biometrika 58, 341-348.

    D'Agostino, R.B., and Rosman, B. (1974). The power of Geary's test of normality. Biometrika 61, 181-184.

    Dixon, W. (1950). Analysis of extreme values. Annals of Mathematical Statistics 21, 488-505.

    Dixon, W. (1951). Ratios involving extreme values. Annals of Mathematical Statistics 22, 68-78.

    Ferguson, T.S. (1961). On the rejection of outliers. Proceedings, Fourth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 253-287.

    Filliben, J.J. (1975). The probability plot coefficient test for normality. Technometrics 17, 111-117.

    Franck, W.E. (1981). The most powerful invariant test of normal versus Cauchy with applications to stable alternatives. Journal of the American Statistical Association 76, 1002-1005.

    Gan, F.F., and Koehler, K.J. (1990). Goodness-of-fit tests based on P-P probability plots. Technometrics 32, 289-303.

    Gastwirth, J.L., and Owens, M.E.B. (1977). On classical tests of normality. Biometrika 64, 135-139.

    Geary, R.C. (1947). Testing for normality. Biometrika 34, 209-242.

    Green, J.R., and Hegazy, Y.A.S. (1976). Powerful modified-EDF goodness of fit tests. Journal of the American Statistical Association 71, 204-209.

    Grubbs, F. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics 21, 27-58.

    Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics 11, 1-19.

    Hogg, R.V. (1972). More light on the kurtosis and related statistics. Journal of the American Statistical Association 67, 422-424.

    Copyright 2002 by Marcel Dekker, Inc. All Rights Reserved.

  • Johnson, B.A., and Hunt, H.H. (1979). Performance characteristics forcertain tests to detect outliers. Proceedings of the Statistical Comput-ing Section, Annual Meeting of the American Statistical Association,Washington, D.C.

    LaBreque, J. (1977). Goodness-of-fit tests based on nonlinearity in proba-bility plots. Technometrics 19, 293-306.

    Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov test for normalitywith mean and variance unknown. Journal of the American StatisticalAssociation 62, 399-402.

    Lin, C.-C., and Mudholkar, G.S. (1980). A simple test for normality againstasymmetric alternatives. Biometrika 67, 455-461.

    Locke, C., and Spurrier, J.D. (1976). The use of U-statistics for testingnormality against non-symmetric alternatives. Biometrika 63, 143-147.

    Locke, C., and Spurrier, J.D. (1977). The use of U-statistics for testing normality against alternatives with both tails heavy or both tails light. Biometrika 64, 638-640.

    Lockhart, R.A., O'Reilly, F.J., and Stephens, M.A. (1986). Tests of fit based on normalized spacings. Journal of the Royal Statistical Society B 48, 344-352.

    Looney, S.W., and Gulledge, Jr., T.R. (1984). Regression tests of fit and probability plotting positions. Journal of Statistical Computation and Simulation 20, 115-127.

    Looney, S.W., and Gulledge, Jr., T.R. (1985a). Probability plotting positions and goodness of fit for the normal distribution. The Statistician 34, 297-303.

    Mendell, N.R., Finch, S.J., and Thode, Jr., H.C. (1993). Where is the likelihood ratio test powerful for detecting two component normal mixtures? Biometrics 49, 907-915.

    Oja, H. (1981). Two location and scale free goodness of fit tests. Biometrika 68, 637-640.

    Oja, H. (1983). New tests for normality. Biometrika 70, 297-299.

    Pearson, E.S., D'Agostino, R.B., and Bowman, K.O. (1977). Tests for departure from normality: comparison of powers. Biometrika 64, 231-246.

    Saniga, E.M., and Miles, J.A. (1979). Power of some standard goodness of fit tests of normality against asymmetric stable alternatives. Journal of the American Statistical Association 74, 861-865.

    Shapiro, S.S., and Francia, R.S. (1972). Approximate analysis of variance test for normality. Journal of the American Statistical Association 67, 215-216.

    Shapiro, S.S., and Wilk, M.B. (1965). An analysis of variance test for normality (complete samples). Biometrika 52, 591-611.

    Shapiro, S.S., Wilk, M.B., and Chen, H.J. (1968). A comparative study of various tests for normality. Journal of the American Statistical Association 63, 1343-1372.

    Smith, V.K. (1975). A simulation analysis of the power of several tests for detecting heavy-tailed distributions. Journal of the American Statistical Association 70, 662-665.

    Spiegelhalter, D.J. (1977). A test for normality against symmetric alternatives. Biometrika 64, 415-418.

    Spiegelhalter, D.J. (1980). An omnibus test for normality for small samples. Biometrika 67, 493-496.

    Stephens, M.A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69, 730-737.

    Stephens, M.A. (1978). On the half-sample method for goodness of fit. Journal of the Royal Statistical Society B 40, 64-70.

    Thode, Jr., H.C. (1985). Power of absolute moment tests against symmetric non-normal alternatives. Ph.D. dissertation, University Microfilms, Ann Arbor, MI.

    Thode, Jr., H.C., Smith, L.A., and Finch, S.J. (1983). Power of tests of normality for detecting scale contaminated normal samples. Communications in Statistics - Simulation and Computation 12, 675-695.

    Tietjen, G.L., and Moore, R.H. (1972). Some Grubbs-type statistics for the detection of outliers. Technometrics 14, 583-597.

    Uthoff, V.A. (1968). Some scale and origin invariant tests for distributional assumptions. Ph.D. thesis, University Microfilms, Ann Arbor, MI.

    Uthoff, V.A. (1973). The most powerful scale and location invariant test of the normal versus the double exponential. Annals of Statistics 1, 170-174.

    Vasicek, O. (1976). A test for normality based on sample entropy. Journal of the Royal Statistical Society B 38, 54-59.

    Weisberg, S., and Bingham, C. (1975). An approximate analysis of variance test for non-normality suitable for machine calculation. Technometrics 17, 133-134.

    White, H., and MacDonald, G.M. (1980). Some large-sample tests for nonnormality in the linear regression model. Journal of the American Statistical Association 75, 16-31.
