
Chapter II-1

Experimental design: statistical considerations and analysis

James F. Campbell
USDA-ARS Grain Marketing and Production Research Center
Biological Research Unit
Manhattan, KS 66502, USA

Stephen P. Wraight
USDA-ARS, Plant, Soil & Nutrition Laboratory
Ithaca, NY 14853, USA

1 Introduction

In this chapter, information on how field experiments in invertebrate pathology are designed and the data collected, analyzed, and interpreted is presented. The approach will be to present this information in a step-by-step fashion that, hopefully, will emphasize the logical framework for designing and analyzing experiments. The practical and statistical issues that need to be considered along the way and the rationale and assumptions behind different designs or procedures will be given, rather than the nuts-and-bolts of specific types of analysis. We want to emphasize that we are not statisticians by training and strongly recommend consulting a statistician during the planning stages of any experiment.

Writing a chapter on this topic presents a number of difficulties. The first is the incredible breadth of the topic. Second, the wide range of statistical expertise of the target audience of this book means that there will be a correspondingly wide range of expectations over what should be included in this chapter. Third, making general recommendations is complicated by the many factors that need to be considered in selecting an experimental design and statistical analysis and the fact that there is often disagreement over the best statistical approaches to use for particular situations. We have geared this chapter for researchers with very limited statistical and experimental design experience. The roadmap provided here is intended to assist beginning field researchers to make decisions about designing and analyzing experiments, dealing with some of the real world problems that arise, finding sources of additional in-depth information, and, most importantly, communicating with statisticians.

There is a large body of information available on the design and analysis of agricultural and ecological experiments that is applicable to invertebrate pathology field experiments. Experimental design and analysis principles are similar in the laboratory and the field, but the experimental approaches taken for invertebrate pathology field experiments differ from those in the laboratory in a number of significant ways. Many laboratory studies focus on individuals, and their analyses emphasize detecting trends in susceptibility using such statistical tools as probit, survival or failure time, or logistic regression. In the field, researchers typically measure population changes and have to deal with more confounding factors (e.g., variation within the field site, insect movement, inability to sample all individuals exposed to a treatment) and interactions than in the laboratory.



2 Types of experiments and analyses

A Experiments

Experiments in all areas of science serve as important tools to test hypotheses. However, the experiment must be planned, performed, analyzed and interpreted in such a way that misleading interpretations resulting from faults in design and execution are minimized. One of the most important aspects of experimental design is determining the specific objectives of the experiment. A clear statement of the objectives will keep the experimental design focused and provide a means of assessing if the experiment, or a portion thereof, is worth performing. The statement can be in the form of questions to be answered, hypotheses to be tested, or effects to be estimated. The population about which generalizations from the results of the experiment are going to be made needs to be clearly defined and included in this statement. In addition, each treatment should be clearly defined and its role in reaching the objectives of the experiment determined. To provide useful results, experiments should have the following characteristics (Little and Hills, 1978): (1) the experimental design should be as simple as possible and still meet the objectives of the experiment; (2) the experiment should be able to measure differences with a degree of precision that is desired by the experimenter (e.g., by using an appropriate experimental design and level of replication); (3) the experiment should be planned to minimize the chance of systematic error (e.g., by appropriate blocking of experimental treatments); (4) the range of validity of the conclusions from an experiment should be as wide as feasible (e.g., by repeating the experiment in space and time or by performing a factorial experiment); and (5) the degree of uncertainty in the experiment should be determined.

Experiments can be divided into two broad categories: mensurative and manipulative (Hurlbert, 1984). Mensurative experiments involve measurements of some parameter taken at one or more points in time or space and, typically, the experimenter does not manipulate or perturb the system. Statistical comparisons may or may not be necessary for these types of experiments. In mensurative experiments, analysis is typically performed to determine how a population conforms to a predicted value (goodness-of-fit), if there are differences among systems, or how variables are correlated.

Manipulative experiments are those where treatments are assigned to experimental units or plots and these assignments can be randomized. Manipulative experiments are discussed extensively in most books on experimental design and are probably the most common experiments in invertebrate pathology. Manipulative experiments have an advantage over mensurative experiments in that confounding variables can be better controlled. However, both types of experiments are useful and have important roles in the development of theory.

B Hypothesis testing

Hypothesis testing is central to the scientific method, but there are two types of hypothesis testing that need to be considered: experimental hypothesis testing and statistical hypothesis testing. Experimental hypotheses involve the system that the researcher is investigating. For example, an experimental hypothesis might be that a particular pathogen will reduce host populations under a certain set of environmental conditions. Statistical hypothesis tests are used as tools to test experimental hypotheses. In testing hypotheses, statistical procedures are used to draw inferences about a test population by examining samples from that population. Commonly, inferences are drawn with regard to population means, and the first step in hypothesis testing is to formulate a precise statement or hypothesis about the mean. Based on the results of the experiment, the hypothesis is either falsified (the prediction is not met) or not falsified. If the hypothesis is falsified, the set of assumptions is altered and the theory revised.

The term assumption can also have multiple meanings. First, explanatory assumptions are those made about the universe and involve the theory that generated the hypothesis that the experiment is testing. Second, simplifying assumptions are made for analytical convenience. These assumptions have either been tested by previous experiments or are robust, based on independent experiments or analyses. Third, statistical assumptions underlie the statistical procedure used. Statistical assumptions are the basis for the stated probability that the observed parameter deviates significantly from the predicted value. Meeting these statistical assumptions is critical if the statistical procedure is to provide a correct conclusion, but these assumptions are often not fully understood or tested by researchers.

After establishing the specific question or hypothesis about what will happen under a certain set of experimental conditions, it is necessary to test it. This involves creating the conditions required by the hypothesis and observing the results. In all statistical hypothesis testing, the nature of the hypothesis is an important consideration. Proving that a hypothesis is true is difficult because, based on inductive reasoning, the results for every possible circumstance need to be observed or inferred. This requirement is usually impossible to meet, because there is always the potential that an additional experiment could disprove the hypothesis (Underwood, 1990). Consequently, a falsification procedure is employed with the objective being to disprove the hypothesis. This is more straightforward because, once disproved, additional experimentation would not alter the conclusion. Accordingly, a hypothesis that is actually the opposite of the one we want to test is proposed and is termed the null hypothesis (abbreviated H0 and called null because it comprises a statement of no difference, e.g., no difference between two means). Experiments then attempt to disprove the null hypothesis. If a statistical test leads to a conclusion to reject the null hypothesis, then an alternative hypothesis is needed. This alternative hypothesis (HA) will usually not specify a single value, but, rather, is formulated to account for all possible outcomes not stated by the null hypothesis. Thus, H0 and HA together account for all possible outcomes, and rejection of H0 leads to acceptance of HA. See Sokal and Rohlf (1995) and Zar (1999) for more extensive discussions of hypothesis testing. Experiments should be designed so that there are only two possible outcomes: either the null hypothesis or the alternative hypothesis is accepted.

C Type I and type II errors

Two types of error, type I and type II, can arise over the decision to reject a null hypothesis being tested. A type I error occurs when the null hypothesis is rejected when it is actually true (i.e., a false positive). A type II error is the failure to falsify the null hypothesis when it is actually false (i.e., a false negative). Statistical procedures are designed to indicate the likelihood or probability of these errors. The ultimate goal of the experiment needs to be considered in setting type I and type II error rates and interpreting results (Scheiner, 1993).

Most researchers are familiar with the probability of committing a type I error; this is indicated by the α value (a threshold level typically chosen by the experimenter before analysis) and the P value (estimated from the analysis) reported with statistical tests. However, the power of a test is also important to consider when designing an experiment and interpreting the results. The power of an experiment is the probability of correctly rejecting a false null hypothesis and is one minus the probability of committing a type II error (1 − β). At some point, the acceptable probability of committing a type I error became conventionalized at 0.05. While it is important that some sort of objective criterion be used to avoid altering expectations to meet results, and being 95% confident is a reasonable standard, the actual probability used should be based on the biology of the system and the requirements of the investigator. For example, in the medical field a 99% confidence level is often used, but in field experiments the replication needed to obtain that level of confidence may not be possible and that level of confidence unnecessary. Prior to starting an experiment, it is desirable to balance the probability of committing a type I error, the power of the test, and the number of replicates. The objectives of the experiment and the biology of the system also need to be kept in mind when designing statistical hypothesis tests. See Toft and Shea (1983), Young and Young (1991), and Shrader-Frechette and McCoy (1992) for more discussion of this issue.
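
As an illustration only (not from the original text), the balancing of alpha, power, and replication described above can be sketched in Python; the statsmodels library and the effect size, alpha, and power targets used here are assumptions chosen for the example:

    # Sketch: replicates needed for a two-sample t-test at a given alpha and power.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    # Solve for replicates per group to detect a standardized effect size
    # (Cohen's d) of 1.0 with alpha = 0.05 and power = 0.80 (illustrative values).
    n_per_group = analysis.solve_power(effect_size=1.0, alpha=0.05, power=0.80)
    print(round(n_per_group))  # approximate number of replicates per treatment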

When a value of α is set for a statistical test, the accuracy of that value depends upon the validity of the statistical assumptions. When the assumptions do not hold, the true alpha level may be higher or lower than the nominal value set by the researcher. When the actual alpha is higher than nominal, the test is described as liberal. A liberal test declares more significant differences than it should (rejects a true null hypothesis too frequently). When alpha is lower than nominal, the test is conservative and declares fewer differences than it should. Conservative tests are also described as having low power to detect real differences.

3 Choosing a statistical analysis

Choosing the type of analysis to perform can be a difficult problem and should be considered during the planning stages of an experiment. There are a large number of approaches, and selection of the best test depends on the type of data collected, the objective of the experiment, and the assumptions of the analysis. Statistical tests can be divided into two categories: parametric and nonparametric. Parametric tests are based on the assumption that the data are sampled from a normal or Gaussian distribution (e.g., t-test, Analysis of Variance). Nonparametric, or distribution-free, tests do not make assumptions about any specific population distribution and typically involve analyzing the ranking of data (e.g., Wilcoxon test, Kruskal-Wallis test). Nonparametric tests may be based on an assumption that the samples being compared came from populations with similar shapes and dispersions. However, these tests are often applied to data with heterogeneous variances, as they are less affected by differences in population dispersions than ANOVA (Sokal and Rohlf, 1995; Zar, 1999).

Nonparametric tests tend to be less powerful than their parametric counterparts if the data are normally distributed or even approximately normally distributed, and it is therefore best to use parametric approaches where justified. If a large number of data points are collected, it may be possible to determine if the distribution is normal by plotting the data or performing a test such as the Kolmogorov-Smirnov test (Zar, 1999). If data from a previous experiment are available, then these can also be used to help determine if the population has a normal distribution. The decision of whether to use parametric or nonparametric tests is difficult for small data sets because the nature of the distribution cannot be determined; ultimately, the parametric test may not be robust and the nonparametric test may not be powerful. It is sometimes suggested that nonparametric tests are the only option for testing of ordinal-scale (ranked) data, but this is incorrect. Parametric tests may be applied to such data provided the usual assumptions hold (Zar, 1999).
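
As an illustrative sketch only, a normality check of the kind described above might be run in Python, assuming scipy is available; the data values are hypothetical:

    import numpy as np
    from scipy import stats

    data = np.array([12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.8, 12.9, 11.7, 10.4])

    # Shapiro-Wilk test, often preferred for small-to-moderate samples.
    w_stat, p_shapiro = stats.shapiro(data)

    # Kolmogorov-Smirnov test against a normal with the sample mean and SD
    # (strictly, estimating the parameters from the data calls for the
    # Lilliefors correction, so treat this P value as approximate).
    ks_stat, p_ks = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))

    print(p_shapiro, p_ks)  # small P values suggest departure from normality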

It is noteworthy that ratio data are not normally distributed, and in some cases, nonparametric analyses of ratios may be markedly more powerful than parametric analyses. Of considerable relevance to the present topic are ratios derived from two independent normally distributed variables. Such data comprise a heavy-tailed distribution known as the Cauchy distribution. Prominent examples of Cauchy-distributed data are LC50 values, which are generated from regression analyses essentially as the intercept divided by the slope. Outliers (aberrant data points) tend to occur frequently in heavy-tailed distributions, and in the presence of outliers, application of traditional analysis of variance (ANOVA) results in a conservative test. Nonparametric tests, in this case, exhibit substantially greater power than standard ANOVA, and are strongly recommended (Anderson and Lydic, 1977; Randles and Wolfe, 1979; Zimmerman, 2001). More information on nonparametric approaches can be obtained in Krauth (1988), Siegel and Castellan (1988), Daniel (1990), and Conover (1999).

A Comparison of two treatments

Many statistical comparisons involve only two treatments. These types of comparisons are relatively simple to perform, and the analyses are well covered in most general statistics texts (e.g., Sokal and Rohlf, 1995; Zar, 1999). Different approaches are used for paired versus unpaired data and normal versus nonnormal population distributions.

1 Comparing two unpaired groups

When comparing two unpaired groups, an unpaired t-test can be used for situations where a parametric analysis is appropriate and a Mann-Whitney test where a nonparametric analysis is appropriate. The t-test produces results mathematically identical to analysis of variance (ANOVA), and the statistical assumptions are the same for both tests; these are discussed in section B. Motulsky (1995) describes a procedure for estimating the power of a t-test. The Mann-Whitney test does not assume a normal distribution, but does assume that the samples are selected randomly from a larger population, the measurements were obtained independently, and that the distributions of the two populations are similar. When variances or standard deviations are not equal, Day and Quinn (1989) recommend the Welch (1938) t-test, generalized with Satterthwaite's degrees of freedom (Winer, 1971), as a parametric procedure and the Fligner-Policello test (Fligner and Policello, 1981) for nonparametric comparisons.
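
A minimal sketch of these unpaired comparisons in Python, assuming scipy is available and using hypothetical plot-level data:

    from scipy import stats

    control = [14.0, 11.5, 13.2, 12.8, 15.1, 12.2]
    treated = [9.4, 10.1, 8.7, 11.0, 9.9, 10.6]

    t, p = stats.ttest_ind(control, treated)                       # standard t-test
    t_w, p_w = stats.ttest_ind(control, treated, equal_var=False)  # Welch's t-test
    u, p_u = stats.mannwhitneyu(control, treated, alternative='two-sided')

    print(p, p_w, p_u)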

Resampling statistics (e.g., Monte Carlo methods such as randomization and bootstrapping) are a nonparametric approach that is becoming more commonly used in agricultural and ecological experimental analysis. Resampling approaches use the entire set of data that has been collected to produce new samples of simulated data and then compare the actual results to the simulated data set. A resampling test can be constructed for almost any statistical inference, and such tests are computationally easier and have fewer assumptions than most traditional tests. These approaches are not typically covered in general biostatistics texts, but there is a growing number of statistical texts dealing with this type of analysis (e.g., Dixon, 1993; Efron and Tibshirani, 1993; Manly, 1997).
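
As a sketch of the resampling idea (illustrative only, with hypothetical data), a bootstrap interval for the difference between two means can be built with numpy alone:

    import numpy as np

    rng = np.random.default_rng(1)
    control = np.array([14.0, 11.5, 13.2, 12.8, 15.1, 12.2])
    treated = np.array([9.4, 10.1, 8.7, 11.0, 9.9, 10.6])

    diffs = []
    for _ in range(10000):
        # Resample each group with replacement; record the mean difference.
        d = (rng.choice(control, control.size, replace=True).mean()
             - rng.choice(treated, treated.size, replace=True).mean())
        diffs.append(d)

    lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% percentile interval
    print(lo, hi)  # an interval excluding zero suggests a treatment effect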

2 Comparing two paired groups

Paired tests can be used whenever the value of one replicate in the first group is expected to be more similar to a particular replicate in the second group than to a randomly selected replicate in the second group. This occurs when measurements are taken from a replicate before and after the treatment is applied, when replicates are matched for certain variables, and when an experiment is run multiple times with a control and treated replicate performed in parallel. Paired statistical tests such as the paired t-test and the Wilcoxon paired-sample test can be used under these circumstances (e.g., Motulsky, 1995; Sokal and Rohlf, 1995; Zar, 1999). Paired statistical tests have an advantage over unpaired tests because they distinguish variation among replicates from variation due to differences between treatments. For a parametric approach, the paired t-test can be used. Assumptions of this test include that the pairs are selected randomly from a larger population, the samples are paired or matched, each pair is selected independently of the others, and the differences in the population are normally distributed. A nonparametric approach that can be used is the Wilcoxon paired-sample test. This nonparametric test has the same assumptions as the paired t-test, with the exception of the normal distribution.
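
A minimal sketch of these paired comparisons in Python (scipy assumed; the before/after values for each replicate are hypothetical):

    from scipy import stats

    before = [25, 31, 28, 22, 30, 27]
    after = [18, 24, 23, 17, 26, 20]

    t, p = stats.ttest_rel(before, after)   # paired t-test (parametric)
    w, p_w = stats.wilcoxon(before, after)  # Wilcoxon paired-sample test

    print(p, p_w)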

3 Comparing two proportions

Many measurements collected in invertebrate pathology experiments can be expressed as proportions, which do not have normal distributions. For example, experimental treatments are applied, and after a certain exposure period the numbers of live and dead individuals are counted and the result expressed as a proportion or percentage. Two commonly used tests for comparing the proportion responding to different treatments are Fisher's exact test and the chi-square test (Motulsky, 1995; Zar, 1999). The chi-square test is easier to calculate than Fisher's test, but is not as accurate. Both tests use contingency tables to estimate the probability of obtaining the observed results. Contingency tables typically consist of rows that represent exposure to treatments (e.g., pathogen dosage) and columns that represent alternative outcomes (e.g., number of individuals live or dead). The rows must be mutually exclusive and the columns must be mutually exclusive. Each cell of the table contains the number of subjects matching the combination of row and column categories. Both tests have the following assumptions: random sampling of the data, the data must form a contingency table (i.e., values must be numbers of subjects observed and the categories defining the rows and columns must be mutually exclusive), the subjects are independent, and the measurements are independent. The chi-square test should be used for large samples, i.e., when the total number of subjects is more than 20 and no expected value is less than five (but see section B.2.). For small to moderate samples, Fisher's exact test is recommended; however, the method generally requires a computer. If a computer program is not available, the chi-square test can be used with Yates' correction (Zar, 1999). There are several approaches that can be used to calculate the power of comparisons of proportions (Cohen, 1988; Motulsky, 1995).
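
A minimal sketch of a 2 x 2 contingency analysis in Python (scipy assumed; counts are hypothetical numbers of dead and live insects per treatment):

    from scipy import stats

    #                  dead  live
    table = [[34, 16],   # pathogen-treated
             [12, 38]]   # control

    odds_ratio, p_fisher = stats.fisher_exact(table)
    # chi2_contingency applies Yates' correction to 2 x 2 tables by default.
    chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=True)

    print(p_fisher, p_chi2, expected)  # check expected counts against the rules above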

McNemar's test can be used for comparing paired samples when the collected data are dichotomous and two treatments are compared (e.g., the number of live and dead insects in response to application of a pathogen compared to the number of live and dead among controls) (Zar, 1999). It can also be used for comparisons of measurements made before and after some event (e.g., the number of live and dead insects in a population before and after application of a pathogen). McNemar's test uses a table similar to a contingency table, but the rows and columns are not independent. This test calculates a chi-square value and, like the chi-square test, should not be used for small samples.
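
A minimal sketch of McNemar's test (the statsmodels library is assumed; the paired counts are hypothetical):

    from statsmodels.stats.contingency_tables import mcnemar

    #                 after: dead  after: live
    table = [[20, 5],    # before: dead
             [15, 60]]   # before: live

    result = mcnemar(table, exact=False, correction=True)  # chi-square form
    print(result.statistic, result.pvalue)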

Unfortunately, chi-square and related tests are among the most misused in biometrics (see Hurlbert, 1984). The most common misapplication in microbial control research involves sampling from unreplicated treatment versus control plots. Samples from a single field plot to which a treatment has been applied are correlated pseudoreplicates or subsamples, not true replicates, and generally should not form the basis of a statistical analysis. This approach is acceptable only in cases where no treatment has been applied (e.g., to compare rates of natural disease prevalence at two different sites).

Misapplication of chi-square testing is also common in cases where treatments have been replicated, but where replicates comprise groups of individuals. In pathogen efficacy testing, insects are commonly treated and maintained in groups (e.g., groups of insects in Petri dishes, or groups of insects infesting field plots). Subsequently, these groups, or samples from these groups, might be pooled and subjected to chi-square analysis. However, as in the previous example, this approach violates the assumption of independence (that each insect represents an independent replicate). Nevertheless, this assumption is often overlooked in cases where absence of significant unexplained error can be demonstrated; data are pooled, but only if the responses among the similarly treated replicates are homogeneous. This is readily determined by a preliminary chi-square test for heterogeneity (described in most standard biometry texts). Pooling in spite of significant heterogeneity puts any subsequent chi-square analysis at risk of declaring significant differences that may be the result of factors other than the applied treatment (e.g., lack of proper randomization). It can certainly be argued that treatment of any pseudoreplicates as true replicates carries similar risk, regardless of the degree of heterogeneity; however, heterogeneity testing is considered an adequate safeguard by many statisticians, as long as the data are carefully examined for systematic errors. An analogous example is the test for heterogeneity (goodness-of-fit) in probit analysis. If significant heterogeneity is not detected, individual test subjects that have been treated in groups are accepted as independent replicates, and a large-sample t value is used for calculation of confidence intervals (Finney, 1971). Sokal and Rohlf (1995) describe a procedure for analysis of replicated tests of goodness of fit and discuss consequences of pooling data without testing for heterogeneity.

The reader is alerted to the many seemingly routine applications of pooling for chi-square analysis, such as that employed for detection of synergism or antagonism (chi-square test of independence) between control agents (Robertson and Preisler, 1992; Sokal and Rohlf, 1995). Pooling may be questionable whenever subjects have been treated in groups and especially in the absence of heterogeneity testing. One should obviously avoid chi-square analysis of pseudoreplicated data whenever possible. ANOVA based on the truly replicated groups (e.g., % mortality among individuals in each group) is a sound alternative following application of any necessary transformation.
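
As a sketch of the safeguard described above (illustrative only; scipy assumed, counts hypothetical), replicate groups can be tested for heterogeneity before any pooling:

    import numpy as np
    from scipy import stats

    # Live/dead counts from four similarly treated replicate groups.
    reps = np.array([[40, 10],
                     [35, 15],
                     [38, 12],
                     [42, 8]])

    # Chi-square test of heterogeneity among replicates (replicate x outcome).
    chi2, p_het, dof, _ = stats.chi2_contingency(reps)

    if p_het > 0.05:
        print('pooled counts:', reps.sum(axis=0))  # pooling may be defensible
    else:
        # Heterogeneous: analyze group-level % mortality by ANOVA instead,
        # after any necessary transformation (e.g., arcsine square root).
        print('replicates heterogeneous; do not pool')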

B Comparing more than two groups

Many statistical analyses require the comparison of more than two treatments or groups, particularly in field experiments. The problem with using multiple two-treatment comparisons is that the probability of obtaining a significant P value by chance (i.e., the probability of committing a type I error) increases with each comparison. The best approach is to compare all the groups at once using analysis of variance or an equivalent type of approach.

1 Analysis of variance

Analysis of Variance (ANOVA) is a general test that can make many types of comparisons and, as the name implies, analyzes the variance among values. In-depth coverage of ANOVA is beyond the scope of this chapter. More detailed discussions of ANOVA can be obtained by reading Scheffé (1959), Sokal and Rohlf (1995), Underwood (1997), Zar (1999) and many other general statistics books. The emphasis taken here will be on the underlying assumptions of ANOVA and how they influence experimental design and analysis. The ANOVA procedure has several assumptions that need to be taken into account in the design and analysis. Failure to meet the assumptions of a statistical test can affect the level of significance of the test. Some of the common violations of ANOVA assumptions in agricultural and ecological experiments are discussed by Gomez and Gomez (1983) and Underwood (1997), respectively. The major assumptions of ANOVA are discussed below.

a Normally distributed data

Fortunately, the results of an ANOVA are generally not greatly affected by violations of this assumption. This is particularly true when there are either many treatments, many replicates per treatment, or the experiment is balanced. Tests for normality are not very useful unless the number of replicates is quite large, but plotting the relationship between the sample means and the variances can often enable the researcher to determine how normal the distribution of the data is. Outliers in the data can affect the normality of the distribution, and the data should be checked for the presence of outliers. If the data are highly skewed, a log transformation can make the data more normally distributed. When this is not possible, nonparametric tests should be used instead of ANOVA.

b Homogeneity of variances

If the variances within treatments are different among treatments, pooling them is not justified and the validity of the ANOVA is called into question. Heterogeneity of variance can affect both the type I and type II error rates. Sampling from populations with unequal variances increases the probability that the populations will be declared statistically different, relative to sampling from populations with equal variances. There are a number of tests for the heterogeneity of variances, the most useful of which is probably Cochran's (1951) test (Underwood, 1997). If Cochran's test is significant, it indicates the potential for serious problems with the ANOVA.
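
Cochran's test is not offered by the common Python libraries; as an illustrative alternative (scipy assumed, data hypothetical), two widely used homogeneity-of-variance tests can be sketched as follows:

    from scipy import stats

    t1 = [12.0, 13.1, 11.4, 12.7, 13.0]
    t2 = [15.2, 19.8, 12.1, 22.4, 16.7]
    t3 = [11.9, 12.4, 12.1, 11.7, 12.6]

    w, p_levene = stats.levene(t1, t2, t3)      # robust to nonnormality
    b, p_bartlett = stats.bartlett(t1, t2, t3)  # sensitive to nonnormality

    print(p_levene, p_bartlett)  # small P values signal unequal variances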

If the variances are not equal, it is important to determine why this has occurred, because the experimental units were assumed to represent samples from a single population. Differences in variation may indicate that (1) the treatments are affecting the variance differently among treatments (which would be something interesting to investigate further), (2) the experimental units were not from the same population, (3) test subjects were not randomly assigned to the treatments, or (4) errors in data collection may have skewed the distribution of the data in some treatments. If the variances are not equal, several approaches can be used to try to meet the assumption of homogeneity of variances: data could be separated into groups with similar variances, the data could be transformed, or a procedure of weighting means according to their variances could be performed (Underwood, 1997).

c Additivity of the main effects

In two-way or multi-way ANOVA, the effects of the factor levels must be additive, meaning that the levels of each factor must differ, on average, by a constant value. Violations of this assumption are generally reflected in highly significant interactions and, notably, may lead to false conclusions in analysis of synergism/antagonism. A common failure to meet this assumption occurs when factor effects are multiplicative rather than additive, and this problem is most readily corrected by log transformation. A test devised by Tukey (1949) and detailed by Sokal and Rohlf (1995) is useful for determining if a significant interaction is the result of nonadditivity.

d Independence of data within and among treatments

Independence indicates that there is no relationship between the size of the error terms and the experimental grouping. This is a reason to avoid having all plots receiving a given treatment occupying adjacent positions in the field. Adjacent portions of a field are more closely related to each other than randomly scattered plots. Lack of independence in sampling is perhaps the greatest problem faced in biological experiments (Hurlbert, 1984). However, independence of the data can be achieved with an understanding of the biology of the system and proper planning and effort by the experimenter. Some examples of the biological basis for nonindependence of data and how it impacts analysis are provided by Underwood (1997).

The best way to deal with non-independence is to give careful consideration to the biology of the system when designing the experiment. Pilot studies and previous research can also be helpful in assessing the independence of sampled data. However, some violations of the independence of data are not necessarily biologically based. A positive correlation between means and variances can be encountered when there is a wide range of sample means. For example, if comparing insect densities with means ranging from 5 to 500, the variation around the average of 5 would be expected to be less than that around the average of 500. This would violate the assumption of the ANOVA. If this occurs, the data frequently can be transformed so that the assumption of independence is supported.

2 Alternatives to analysis of variance

There are a number of alternatives to ANOVA that can be tried if the assumptions of the ANOVA are not supported and problems cannot be corrected using transformation. Day and Quinn (1989) list four parametric alternatives to ANOVA, but recommend the W test (Welch, 1951), which can be used when the assumption of homogeneous variances is questioned. These alternatives work for one-way ANOVA designs when variances are heterogeneous, but there are no robust alternatives for nested or factorial ANOVAs (Day and Quinn, 1989; Weerahandi, 1995). A non-parametric procedure for one-way designs that is widely used is the Kruskal-Wallis test. This test is appropriate if the distribution of the data is not normal but the variances of the data (or transformed data) are equal. A robust version of the Kruskal-Wallis test that requires only that the variances be symmetrical, not necessarily equal, is provided by Rust and Fligner (1984). Nonparametric analysis of two-way designs without interaction (randomized blocks) can be achieved using the Friedman test.

The above-mentioned nonparametric alternatives are valid only for one-way ANOVA or two-way ANOVA without interaction. There are, unfortunately, few options for nonparametric analysis of factorial tests with interaction. An extension of the Kruskal-Wallis test in which data are simply transformed to single-array ranks and subjected to standard ANOVA (Iman, 1974; Conover and Iman, 1976; Scheirer et al., 1976) has been recommended in the past (e.g., see Sokal and Rohlf, 1995); however, a theoretical study by Thompson (1991) revealed that it is not reliable. Conover (1999) recommends use of aligned-rank procedures; however, these tests are not strictly distribution free (nonparametric), and Mansouri and Chang (1995) found they performed poorly with respect to interaction testing for Cauchy-distributed data. Conover (1999) also recommends analysis by standard ANOVA of both the unranked and rank-transformed data and comparison of the results; if the results produced by the two procedures are nearly identical, the analysis of the unranked data can be considered valid.
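
A minimal sketch of the one-way and randomized-block nonparametric tests named above, together with Conover's ranked-versus-unranked comparison (scipy assumed; all data hypothetical):

    from scipy import stats

    g1 = [12, 15, 11, 14, 13]
    g2 = [22, 25, 19, 24, 21]
    g3 = [16, 18, 14, 17, 15]

    h, p_kw = stats.kruskal(g1, g2, g3)            # Kruskal-Wallis (one-way)
    q, p_fr = stats.friedmanchisquare(g1, g2, g3)  # Friedman (randomized blocks)

    # Conover's check: run ANOVA on the raw data and on single-array ranks;
    # nearly identical results support the analysis of the unranked data.
    ranks = stats.rankdata(g1 + g2 + g3)
    f_raw, p_raw = stats.f_oneway(g1, g2, g3)
    f_rnk, p_rnk = stats.f_oneway(ranks[0:5], ranks[5:10], ranks[10:15])

    print(p_kw, p_fr, p_raw, p_rnk)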

Proportions can be transformed (by arcsine) to normalize the distribution and analyzed using ANOVA. However, chi-square analysis can also be used when more than two groups of proportional data need to be compared. The chi-square test can be used for contingency tables with more than two rows or columns if the following assumptions are met (Motulsky, 1995; Zar, 1999): data are randomly selected from populations, the data form a contingency table, and the observations are independent (see previous discussion of pseudoreplication problems in chi-square analysis). The use of the chi-square test, as discussed above, has several simplifying assumptions that make it valid only for large data sets. A general rule is that 80% of the expected values must be greater than or equal to five and all must be greater than or equal to 2 (greater than or equal to 1 if the table has more than 30 df) (Motulsky, 1995). However, Zar (1999) and Conover (1999) suggest guidelines that are less conservative. If these rules are violated, it may be possible to combine two or more rows or columns to increase the expected values. Three-dimensional contingency tables (e.g., species, location, and disease) can also be used to analyze factorial experiments (Zar, 1999).

3 Multiple comparison tests

After rejecting the null hypothesis of a multifactor ANOVA, it may be necessary to determine which means are actually different from each other. There are many procedures for determining which of the many alternatives to the null hypothesis is best supported. However, selecting a procedure can be confusing because of considerable controversy about the use of various techniques (Day and Quinn, 1989; Underwood, 1997).

An essential point when considering multiple comparisons is to determine if specific comparisons can be determined prior to the start of an experiment (i.e., planned or a priori). This means that the potential outcome needs to be specified before the experiment is conducted and should be based on the knowledge of the researcher and the particular question being asked. These comparisons can be performed regardless of the results of the ANOVA but should not have been suggested by the results and should test completely separate hypotheses (i.e., planned orthogonal comparisons) (Sokal and Rohlf, 1995). For example, in an experiment where the effects of a formulation technique on pathogen persistence are being studied, a specific hypothesis to be tested could be that survival is greater in treatments with the formulation than in the control. In this case, we have established before the start of the experiment what we expect our alternative(s) to the null hypothesis to be. Thus, fewer than all possible comparisons will be performed, and this reduces the likelihood of making a type I error.

The specific number of planned comparisons depends on the hypotheses about the data, but the number should be limited. It is not proper to consider the decision to compare all of the means with every other mean as a planned comparison. The number of comparisons should not exceed the total number of treatments (means) minus 1 (Sokal and Rohlf, 1995). A priori comparisons have many advantages and are generally more powerful than unplanned comparisons (Winer, 1971; Sokal and Rohlf, 1995). Researchers should attempt to ask focused questions that can be analyzed using planned comparisons when possible to take advantage of the benefits of this type of analysis.

Planned comparisons use the per-comparison error rate and can be pairwise comparisons, where two means or contrasts (i.e., combinations of means) are compared. As long as each comparison addresses a separate biological question, a per-comparison error rate is appropriate and the statistical approaches listed for the comparison of two treatments can be used. The least significant difference (LSD) test can be used for a limited number of a priori comparisons, but should not be used for a posteriori comparisons because the experiment-wise type I error rate is uncontrolled (Day and Quinn, 1989).

If the experiment does not have a clear, expected pattern to the results, or the question being asked is more general or exploratory, a posteriori or unplanned comparisons are needed for multiple comparisons. For example, in an experiment comparing a variety of species and/or strains of pathogens to control a certain pest, a researcher may not have any specific predictions and is just interested in how the different pathogens rank. This area of statistics has been quite controversial, and there is considerable disagreement over the best approaches and even whether these tests should be performed at all. Typically, unplanned comparisons are analyzed by comparing all the possible pairs of means using a multiple comparison procedure. Unplanned comparisons should be run as a two-stage or protected procedure: only when the null hypothesis is rejected using ANOVA are the multiple comparison procedures performed.

It is important to assess whether multiple comparison procedures are necessary and whether the assumptions of the analyses are met by the data. Often experimenters do not have a specific rationale for using a certain procedure, but the different mean comparison procedures have different levels of robustness to violations of assumptions and different probabilities of committing type I and type II errors. The assumption of equal variances is especially important. Multiple comparisons are also often performed in situations where they are unnecessary. For example, in situations where different doses or distances are being compared, the relationship between levels of treatment may be explained using regression analysis. In these situations, the trend may be more important than whether successive treatments are different.

The different types of analyses fall into two categories: repeated pairwise comparisons or multivariate approaches. Concerns about using many pairwise comparisons arise because of the potential problems of excessive type I errors. If repeated analyses at a pre-set probability of a type I error (α) are performed, the set of tests has a greater probability of error than each individual test (Ryan, 1959). This problem is even greater when there are multiple factors in an experiment or multiple experiments are compared because of the introduction of additional sources of error. The Bonferroni or Dunn-Sidak methods can be used to correct the significance level of multiple pairwise tests but tend to be very conservative (Rice, 1989; Sokal and Rohlf, 1995). The problem is to determine the appropriate error rate for a given experiment.

A number of a posteriori tests are commonly used for unplanned comparisons of means in invertebrate pathology field experiments. These tests all have different levels of power, control of the type I error rate, and underlying assumptions that need to be taken into account. Two tests that have been used traditionally as a posteriori tests, but which are no longer considered valid because of problems with the control of the type I error rate, are the LSD test and Duncan's (1955) multiple range test (Scheffé, 1959; Day and Quinn, 1989). The Student-Newman-Keuls (SNK) test can also have problems controlling the type I error under certain situations and is not recommended by Day and Quinn (1989), but is recommended by Underwood (1997) for many situations because of its level of power. Several approaches that are generally considered to be valid are Tukey's honestly significant difference (HSD) method or Scheffé's test for equal sample sizes and the Tukey-Kramer method for unequal sample sizes (Sokal and Rohlf, 1995). Day and Quinn (1989) recommend what they call the Ryan's Q test (Einot and Gabriel, 1975) for parametric unplanned multiple comparisons because it is powerful, controls the experiment-wise error rate, and is easy to use. Day and Quinn (1989) also recommend the Joint-Rank Ryan test with treatments ordered by their joint-rank sums (Campbell et al., 1985) as a nonparametric procedure for unplanned multiple comparisons. Dunnett's test used in a stepwise manner (Miller, 1981) is a useful technique when the researcher is just interested in comparing experimental manipulations to a control. See Day and Quinn (1989), Sokal and Rohlf (1995), and Underwood (1997) for more details on the specific tests mentioned above, among other procedures, and under what conditions different tests are appropriate.
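
A minimal sketch of unplanned pairwise comparisons via Tukey's HSD (statsmodels assumed; data hypothetical):

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    response = np.array([12, 14, 13, 22, 25, 23, 17, 16, 18])
    groups = np.array(['control'] * 3 + ['strainA'] * 3 + ['strainB'] * 3)

    result = pairwise_tukeyhsd(response, groups, alpha=0.05)
    print(result)  # pairwise differences with adjusted confidence intervals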

C Analysis of trends

1 Correlation and regression

Often a researcher is interested in the relationship among variables, not just in whether their effects differ significantly from one another. Regression analysis can be used to address questions concerning the functional relationship between one variable and another [i.e., the dependence of a variable (y) on an independent variable (x)]. The purpose of this type of analysis is to evaluate the impact of a variable on a particular outcome. For example, regression analysis could be used to determine how increasing pathogen density impacts host mortality. Correlation analysis can be used for determining the amount of association between variables without the assumption of causation (i.e., to determine the amount of interdependence between variables x and y). The purpose of this type of analysis is usually to measure the strength of relationships between two continuous variables, and it is good for developing hypotheses rather than testing causation. Correlation analysis could be used to investigate the relationships between disease prevalence in an insect population and characteristics of the insect's host plant such as age. Most general statistics books cover the techniques of correlation and linear regression in considerable detail (e.g., Snedecor and Cochran, 1989; Sokal and Rohlf, 1995; Zar, 1999).

Correlation analysis is commonly used in mensurative experiments when relationships need to be determined between measured variables. Correlation analysis does not assume a causal relationship between the variables (i.e., the relationship does not change, regardless of which variable is on the x or y axis). Correlation should be used when the causal relationship is not known or assumed or if the two variables are both effects of the same cause (e.g., the relationship between insect weight and length). Correlation analysis should also be used when neither of the variables is under the control of the investigator (i.e., when the variables are random rather than fixed). The assumptions for correlation analysis are that: (1) the replicates are randomly selected from a larger population, (2) each replicate has both x and y values, (3) each replicate is independent, (4) the x and y values are measured independently, (5) x values are measured and not controlled, (6) x and y are each sampled from a normal distribution, and (7) covariation is linear. If the assumption that x and y are from a population with a normal distribution is not supported, there are nonparametric alternatives for calculating the correlation coefficient. The Spearman rank correlation is one of the most widely used nonparametric correlation analyses, and it has the same assumptions as above, except for the normality assumption (Sokal and Rohlf, 1995; Zar, 1999).
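
A minimal sketch of parametric and rank correlation (scipy assumed; the paired measurements, e.g., host-plant age and disease prevalence, are hypothetical):

    from scipy import stats

    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [0.05, 0.09, 0.11, 0.18, 0.17, 0.24]

    r, p_pearson = stats.pearsonr(x, y)      # assumes bivariate normality
    rho, p_spearman = stats.spearmanr(x, y)  # rank-based; drops that assumption

    print(r, rho)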

For manipulative experiments, regression analysis is generally more appropriate because the researcher has set the values of x (e.g., dose). Using regression analysis, the relationship between two variables is explained using a linear or more complex model that relates the value of one variable (dependent) as a function of the other (independent). Regression analysis can be used for a number of applications, but care needs to be taken to ensure that the independent variable is measured with minimal error. Sokal and Rohlf (1995) discuss a number of uses for regression analysis, a couple of which are mentioned here. One of the most common is the study of causation, where the variation in y is caused by changes in another variable (x). Regression analysis can also be used to estimate functional relationships between variables. There are a number of statistical methods for analyzing regressions, including ANOVA. There are also many types of regression, including linear regression, multiple regression, logistic regression and nonlinear regression.

Linear regression is one of the most commonly used regression approaches, and it involves the calculation of the slope and intercept of the relationship. Usually, if the researcher thinks that there is a linear functional relationship between two variables, the analysis will address one of two hypotheses: (1) there is some slope (β) to the relationship that is different from zero (β > 0 or β < 0), or (2) there is a particular relationship (β = z). In the first case, the null hypothesis is that β = 0, and in the second case, the null hypothesis is that β = z. The null hypothesis tests for regressions are similar to those used for means.

Linear regression has the following assumptions: (1) the roles of x and y are asymmetrical (y depends on x), (2) the relationship between x and y is linear, (3) the variability of values around the line follows a normal distribution, (4) the standard deviation of the variability is the same regardless of the value of x (i.e., homoscedasticity), (5) x is fixed and measured without error, or the imprecision in the measurement of x is small compared to that of y, and (6) each xy pair is randomly and independently sampled from the population. Some of these assumptions are under the control of the researcher and can be addressed using good experimental design. In field applications especially, it is unlikely that x is fixed because of the imprecision inherent in applying treatments, but in most field trials the imprecision in treatment application (under investigator control) is small relative to the imprecision in the measurement of the response to the treatment (y). The independence assumption is frequently violated, particularly when using repeated sampling of an experimental unit. Dealing with the assumptions of linear regression in ecological experiments is discussed in Underwood (1997).
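
A minimal sketch of a simple linear regression testing the null hypothesis that β = 0 (scipy assumed; doses and responses hypothetical):

    from scipy import stats

    dose = [0, 1, 2, 4, 8, 16]          # independent variable, set by the researcher
    mortality = [5, 9, 14, 22, 41, 73]  # dependent variable (% mortality)

    fit = stats.linregress(dose, mortality)
    # fit.pvalue tests H0: slope = 0; fit.stderr supports a test of beta = z.
    print(fit.slope, fit.intercept, fit.rvalue**2, fit.pvalue)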

2 Repeated measures

For some experiments, the researcher is interested in the impact of a pathogen application on an insect population over time, and it may be desirable or necessary to make repeated measures from the same experimental units. However, repeated sampling of an experimental unit raises a number of statistical concerns, especially the lack of independence among sampling times. Repeated measures designs are quite common in laboratory experiments that measure survival of insects after exposure to a pathogen, and the use of survival analysis is often appropriate (Kalbfleisch and Prentice, 1980). Repeated measures can be analyzed using ANOVA (Snedecor and Cochran, 1989; Crowder and Hand, 1990; Winer et al., 1991), and the experimental design is similar to that for a split-plot design, which is discussed in more detail in a later section. However, there are a number of problems with this approach. One potential problem is the assumption that there is no interaction between sampling times and experimental units within a treatment. If an interaction exists, then there is no overall interpretation of the interactions between sampling time and treatment using repeated measures ANOVA (Underwood, 1997). Another underlying assumption with split-plot and repeated measures ANOVA is equality of the variances of the differences for all pairs of levels of the repeated measures factor (Stevens, 1996). This requirement is called sphericity, and violation of this assumption is common in repeated measures analyses due to correlations among the data. Many software packages calculate corrections for violation of sphericity (e.g., the Greenhouse-Geisser and Huynh-Feldt estimators). These corrections reduce the degrees of freedom for the F test to hold alpha near nominal (Girden, 1992).

There are a number of alternative approaches for analysis of time series data that may be more appropriate than ANOVA. The first is to avoid repeated sampling altogether and set up enough experimental units so that separate and independent samples can be taken at each time. This would be a two-factor design (e.g., treatment and sampling time), and it is straightforward to analyze, but the number of experimental units needed may not be feasible. Second, if experimental units are repeatedly sampled, each sampling time can be analyzed separately. Thus, at each sampling time the data conform to a single-factor analysis of variance. The probabilities of committing type I errors should be adjusted because of the series of comparisons that will be performed. If the number of sampling times is small, this adjustment could be made using the Bonferroni procedure (also called the Dunn test), where the acceptable alpha value (e.g., 0.05) is divided by the number of sampling times. Power of this test is rapidly lost as the number of comparisons increases.
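
A minimal sketch of this second approach, with a Bonferroni-adjusted alpha applied to a separate one-way ANOVA at each sampling time (scipy assumed; data hypothetical):

    from scipy import stats

    # Three treatments sampled at three times (independent samples per time).
    samples_by_time = {
        3:  ([12, 14, 13], [9, 8, 10], [5, 6, 4]),
        7:  ([11, 13, 12], [7, 8, 6], [3, 4, 2]),
        14: ([12, 12, 14], [6, 7, 5], [2, 3, 1]),
    }

    alpha_adj = 0.05 / len(samples_by_time)  # Bonferroni (Dunn) adjustment
    for day, groups in samples_by_time.items():
        f, p = stats.f_oneway(*groups)
        print(day, p, p < alpha_adj)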

A third approach can be used if the researcher is just interested in differences in the temporal trend among treatments. In this case, each experimental unit is an independent measure of the temporal trend and can be analyzed using techniques such as linear regression or nonlinear curve fitting. Repeated measures data can also be analyzed using multivariate techniques such as MANOVA (multivariate analysis of variance). Using this approach, each measurement (e.g., sampling time) for an experimental unit is treated as a different dependent variable. MANOVA is also a useful approach if more than one variable is measured at each sampling time. A strong advantage of MANOVA is that it does not depend on the sphericity assumption; however, it does have limitations due to constraints on the number of levels of sampling (i.e., within-subject sampling) and problems with low power when sample sizes are small. Stevens (1996) recommends application of MANOVA in addition to sphericity-corrected ANOVA (with α for each test set at 0.025) to substantiate test results. Issues regarding the use of multivariate vs. univariate techniques for repeated measures analysis are also discussed in von Ende (1993) and Underwood (1997).

The best approach for dealing with repeated measures data is probably that of Mixed Linear Models, a generalization of the mixed-model ANOVA (Crowder and Hand, 1990; Littell et al., 1996). This statistically sophisticated approach is useful because it can more accurately handle the inclusion of random and fixed effects in the same model; typically, experimental units are considered to be random effects and sampling date a fixed effect. This technique can also handle missing data and many levels of repeated factors that can cause problems for some of the other approaches. However, mixed linear models can be complicated to perform because different forms of the variance-covariance matrix can provide different results. Therefore, the goodness-of-fit measure for the different matrices needs to be determined and the best one selected for analysis.
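
A minimal sketch of a mixed linear model for repeated measures (statsmodels and pandas assumed; the column names, plot identifiers, and values are hypothetical, with plot as a random effect and treatment and day as fixed effects):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        'plot':      ['p1', 'p1', 'p2', 'p2', 'p3', 'p3', 'p4', 'p4'],
        'treatment': ['T', 'T', 'T', 'T', 'C', 'C', 'C', 'C'],
        'day':       [3, 7, 3, 7, 3, 7, 3, 7],
        'density':   [14.0, 9.0, 13.0, 8.5, 15.0, 14.5, 16.0, 15.0],
    })

    # Random intercept for each plot; treatment, day, and their interaction fixed.
    model = smf.mixedlm('density ~ treatment * day', df, groups=df['plot'])
    result = model.fit()
    print(result.summary())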

D Analysis of covariance

Analysis of covariance (ANCOVA) encompasses a large number of statistical methodologies. The principle behind ANCOVA is to use the information about the relationship of the variable (y) to the covariate (x) to estimate the values of the variable in each treatment if all measurements had been performed at the same value of x. This allows a test of the null hypothesis of no differences among y, having removed any scatter due to x (Underwood, 1997). ANCOVA is particularly useful for situations where the initial allocation of experimental units to treatments is not equally representative, for example, if experimental units differ in some variable other than the treatment that may contribute to observed differences among treatments. If these variables can be determined in advance, they can potentially be controlled when designing the experiment, but these variables may not be known in advance or it may not be possible to control them. ANCOVA can be used to control for this variation after the experiment is performed. Another use for analysis of covariance is when we know in advance that the variable we are interested in is correlated with another variable. For example, growth rate measurements will often depend on the original size of the animal. If we estimate the relationship between the two, we can use this information to improve the precision of the estimate of the differences between treatments.

In ANCOVA, the regression relationship between the variable (y) and the covariate (x) is determined. It is also possible to use a series of covariates. This relationship is used to adjust the data to a chosen value of the covariate. This involves fitting a series of three regression models and then comparing the deviations from each of the models. In the first model, a separate regression is constructed for each treatment. In the second model, a common regression is made for the treatments by constructing a regression model with the smallest squared deviations over all the treatments. In the third model, all the data are combined and a total regression, including all of the treatments, is constructed. The deviations from this third model should be larger than in the second model. If the deviation is similar between models 2 and 3, there is no difference among treatments. If there is sufficient similarity of the slopes among the treatments so that a common regression can be fitted to the data, then the data can be adjusted. The adjustment is to move the mean value of the variable in every treatment from its value at the mean of the covariate in its treatment to its predicted value at the mean of the covariates in all treatments. Adjusted means can be compared as if there were no influence of the covariate.

ANCOVA cannot, or need not, be performed under certain circumstances. If there are no differences among the treatments in the mean values of the covariates, there is no reason to perform the analysis of covariance. A preliminary ANOVA on the covariates from the different treatments can be performed and ANCOVA only performed if there is a significant difference among treatments. If there is heterogeneity of slopes among the treatments, the means of the treatments cannot be compared using a uniform adjustment procedure because there is no common slope (Underwood, 1997). However, this variation in slopes may be of biological interest even if it complicates statistical analysis. Thus, if the relationship between pathogen dose and host population density differed among habitats, this would be an interesting result and require additional experiments to further study the phenomenon.
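
The preliminary test for homogeneity of slopes can be sketched as a comparison of a parallel-slopes and a separate-slopes model; the example below uses statsmodels in Python, with hypothetical column names y, x, and treatment:

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# df is assumed to have hypothetical columns 'y' (response),
# 'x' (covariate), and 'treatment'.
def test_slope_homogeneity(df):
    common = smf.ols("y ~ C(treatment) + x", data=df).fit()    # parallel slopes
    separate = smf.ols("y ~ C(treatment) * x", data=df).fit()  # separate slopes
    # F test of the treatment-by-covariate interaction; a significant result
    # means the slopes are heterogeneous and adjusted means cannot be compared.
    print(anova_lm(common, separate))
    return common  # the ANCOVA model to use if slopes are homogeneous
```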

If the regressions are similar and there are significant differences among adjusted means of the treatments, multiple comparison procedures are needed to determine which alternative hypothesis is supported. The procedures, like those following ANOVA, may be a priori or a posteriori. A useful a priori test is the Dunn-Bonferroni test and a useful a posteriori test is the Bryant, Paulson, and Tukey test. These and other procedures are described by Huitema (1980).

ANCOVA combines the assumptions of both regression and analysis of variance, and there are also some assumptions specific to ANCOVA. These include the homogeneity of regressions (i.e., the slopes are parallel) and the independence of treatment and covariate. These assumptions are discussed in more detail in Underwood (1997). The ANCOVA approach can still be used if there is more than one covariate and/or if non-linear regression is used to determine the relationship between variable and covariate. These more complex approaches are discussed in Huitema (1980).

4 Experimental treatments, material and units

The experimental units and the treatments that they receive are two fundamental components of an experimental design, and there are theoretical and practical considerations involved in making decisions about both components (Mead and Curnow, 1983). Selection of these components of the design has an impact on what experimental design will be used and how the data will be analyzed and interpreted. Determining the structure of the experimental units involves identifying the units and describing the patterns of variation among them. This variation is taken into account in experimental design by replication, blocking, and randomization. Determining the structure of the treatments involves choosing the different treatments to be included in the experiment. Parameters such as the type of analysis to be performed, the factorial structure of the design, whether the treatments are qualitative or quantitative, and the nature of suitable control treatments need to be taken into account in choosing treatments.

The reason for doing a manipulative experiment is to examine the effects of two or more treatments. Treatment in its broadest sense includes applied treatments, such as different concentrations of a control agent; inherent treatments, such as strains or varieties of crops; and controls. To compare two or more treatments, the effect of the treatment on experimental units needs to be observed. These experimental units may be a field, a section of a field, a group of plants or insects, or even individual plants or insects. Because the responses of these experimental units to a treatment differ, we need two or more experimental units per treatment. Some large-scale studies cannot be truly replicated, but may be analyzed using methods discussed later.

A Selection of treatments

The selection of treatments is often not given the level of thought that it should receive. This is unfortunate because this step influences many of the subsequent steps of experimental design and analysis. Keep in mind that the objectives of the experiment can help with treatment selection and enable the use of a more powerful statistical test. The more focused the objectives of the experiment, the fewer treatments needed, and the experiment may also be easier to design and analyze. However, many experimental situations exist where large numbers of treatments are needed (e.g., screening large numbers of pathogen strains or large factorial experiments). The influence of the number of treatments on block design will be discussed later. In treatment selection, it is also important to keep in mind that not all treatments need to have equal status in the experiment. The treatments can be compared with different levels of precision, and this can influence the number of replicates of each treatment and how the experiment is designed.

Manipulative experiments need one or more controls. The control is the baseline to which the other treatments are compared. The control may be experimental units with no applied treatment, application of a sham treatment, or the use of a standard control tactic or technique to which other techniques are being compared. Sham treatments receive everything applied to the other treatments except the ingredient being tested. Common sham treatments include application of so-called spray carriers or formulation blanks. When using sham treatments it is useful, when possible, to also include controls that receive no applications to determine the impact of the other ingredients. The use of standard grower practices as a control is also recommended when testing material in a grower's field or orchard. This type of control has two uses: it provides an additional realistic control treatment and, if the grower complains that the experimental treatment has caused harm to the crop, it may provide evidence that the damage was due to factors other than the experimental treatments (Gomez and Gomez, 1983).

Because before-and-after (i.e., paired) comparisons are statistically powerful, it is often desirable to take measurements of all experimental units (both control and experimental treatments) before and after applying experimental materials. This type of approach is particularly relevant when studying the impact of a pathogen on a host population, but the biology of the target invertebrate and the pathogen need to be considered so that before and after measurements can be meaningfully compared. This approach is appropriate when a fast-acting agent is used and the same generation of hosts is measured at both sampling times, or the post-treatment population comprises the direct offspring of the treated population. This approach is not valid if there is significant insect movement into or out of the test plots, as may occur, e.g., if a pathogen is slow-acting and considerable time elapses between samples. Generally, as the length of time between measurements increases, the two measurements become less related and this statistical approach less applicable. Obviously, these types of problems will influence the interpretation of the results of unpaired comparisons as well. Henderson and Tilton (1955) developed a modification of Abbott's formula as a method for calculating the percent control due to treatment based on before-and-after counts.
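
A minimal sketch of the Henderson-Tilton calculation (the counts below are invented for illustration):

```python
def henderson_tilton(t_before, t_after, c_before, c_after):
    """Percent control from before-and-after counts (Henderson and Tilton, 1955).

    t_* : insect counts in the treated plot before/after application
    c_* : insect counts in the control plot before/after application
    """
    return 100.0 * (1.0 - (t_after * c_before) / (t_before * c_after))

# Hypothetical counts: treated plot drops 100 -> 20 while control drops 90 -> 60
print(henderson_tilton(100, 20, 90, 60))  # 70.0% control
```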

It is usually desirable to use treatments that are 'realistic.' For applied experiments, this means using (1) experimental materials, especially for controls, that are in common usage, (2) application techniques that are appropriate for a particular crop, or (3) application rates that are economically feasible. In more basic experiments, realistic treatments might mean using pathogen densities that are within the range of naturally occurring populations. However, attempts to use realistic treatments should not stand in the way of using treatments that are appropriate for addressing the hypothesis being tested or that will provide insight into the biology of the system.

B Selection of experimental material

Many of the parameters associated with selecting the experimental material are context specific, and decisions are made based on the experimenter's experience with the system and the pathogen. Three major considerations when selecting experimental material are:

1. Experimental material should be consistent with the population or system about which generalizations are to be made. Ideally, the experimental material should be drawn from the same population about which generalizations based on the results of the experiment are to be made. For example, if making recommendations about commercial products, the products and application rates that a grower would actually use should be selected as the experimental materials.

2. The experimental material should be homogeneous across all experimental units. The key points for maintaining homogeneity are: (1) making sure that the sample of material applied to all experimental units in a treatment is drawn from the same population; (2) if the material comes from several production batches or sources, then it should be mixed well before applying; (3) material should be applied in a consistent manner among the treatments; and (4) some assessment of the variation in the experimental material (e.g., viability, infectivity, etc.) should be made before application in the field.

3. The experimental material should be applied uniformly across all experimental units. Application rates of microbial control agents are often reported as viable propagules, infectious units, or colony-forming units per unit area. These dosages are typically based on an applied amount of pathogen suspension, and the concentration of the active ingredient in the suspension is routinely determined using hemacytometer or Petroff-Hausser counting chambers. In making counts, a common error is to consider the numbers from the individual counting units (etched squares) in a single chamber sample as true replicates. These counts are, in fact, pseudoreplicates that do not take into account the potentially considerable error associated with extraction of a minute sample (< 10 μl) from a suspension and loading it into a counting chamber. Valid estimates of concentrations and standard deviations can only be obtained from counts of multiple samples, preferably using multiple counting chambers to account for chamber- or coverslip-related errors. Independently prepared and quantified suspensions should be applied to replicate plots, and if possible, application rates should be sampled in the field. For example, slides, coverslips, or Petri dishes containing selective media can be positioned in the field to confirm uniformity of spray deposition or characterize the level of variability potentially resulting from numerous factors, especially variable weather conditions or spray equipment malfunctions.
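
The point can be illustrated with a small sketch (the spore counts are invented): treating all etched squares as replicates understates the loading error, whereas summarizing each independently loaded chamber sample first does not:

```python
import numpy as np

# Hypothetical spore counts: rows are independently loaded chamber samples,
# columns are the etched squares read within each load.
counts = np.array([[41, 38, 45, 40],
                   [52, 49, 47, 55],
                   [36, 39, 33, 37]], dtype=float)

# Pseudoreplicated: treating all 12 squares as replicates ignores loading error
pseudo_sd = counts.ravel().std(ddof=1)

# Better: one value per independent chamber load, then mean and SD across loads
load_means = counts.mean(axis=1)
print(load_means.mean(), load_means.std(ddof=1), pseudo_sd)
```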

C Selection of experimental unit

Experiments involve taking measurements from an experimental unit. The experimental unit is the smallest division of the experimental material to which a treatment is applied; it is the unit that receives a treatment and is replicated. Replicates are experimental units that have received the same treatment. In field experiments, an experimental unit can be a section of land that contains plants and insects, a plant or group of plants, or an insect aggregation or nest. An experimental unit can also be an individual insect, but this is rarely used in field experiments. The experimental unit may not be the unit that is actually measured during the experiment. For example, subsamples may be taken from an experimental unit to calculate an estimate of the impact of a treatment on the experimental unit. Failure to accurately define the experimental unit can lead to problems of pseudoreplication (Hurlbert, 1984).

1 Size and shape of experimental unit

The size and shape of an experimental unit influence the precision of the experiment; variability between experimental units generally decreases with increase in plot size, but variability within units can increase. Above a certain size, the rate of decrease in variability among units is reduced and there is a diminishing return to increasing unit size. How the experimental unit is sampled will also influence how the increase in experimental unit size affects precision.

Increasing the number of objects being sampled (e.g., insects) per experimental unit increases precision. As the experimental unit increases in size, it becomes increasingly difficult to sample accurately. Also, more than one type of measurement is often made from a sampling unit (i.e., insect density, disease prevalence, plant damage, etc.), and the influence of experimental unit size on all of the measured variables needs to be considered.

The optimum size and shape of an experimental unit depend on a number of parameters, many of which are practical or biological in nature. Cultural practices and application equipment can impact the decision on what size and shape of experimental unit to use. For example, aerial or tractor applications of treatment material can limit how small an experimental unit can be, whereas hand applications can limit how large an experimental unit can be. Limitations on the amount of sampling effort, space, treatment material, and other parameters also influence the size of experimental units.

The nature of the experimental material (i.e., the pathogen) and the target organism should both be considered when determining the size and shape of an experimental unit. The spatial distribution and movement patterns of arthropods should be considered because the efficiency of sampling techniques and the variance of the means depend on the density within an experimental unit. If the density of arthropods per experimental unit is too low, the data may not be normally distributed and many sampled units may not have insects. If the density is too high, sub-sampling of the experimental unit may be necessary. The feasibility of accurately sub-sampling an experimental unit also limits the size of the unit. The experimental unit should be large enough to contain the normal movement patterns of the arthropods that are being measured; otherwise the effects of the treatments may be diluted and inaccurate conclusions may be drawn from the analysis. How the application of a treatment will impact insect movement should also be considered. If these factors cannot be addressed by the experimental design, then they should be acknowledged when interpreting the results of an experiment.


The influence of the shape of the experimental unit is likely to increase as size increases. For small-scale field experiments, the shape of the plot may have little influence. For larger experimental units, the considerations discussed later for determining the shape of experimental blocks should be taken into account. Factors such as the distribution and movement patterns of the target organism, or gradients in environmental factors such as soil type, can have implications for the best experimental unit shape. Again, biological considerations should be taken into account, but practical constraints may limit the alternatives available in experimental unit shape.

Statistical considerations also play a role in determining experimental unit size. In some cases, such as incomplete block experimental designs, the number of treatments and replications is rigidly set. More commonly, constraints on space or resources will limit experimental unit size. In blocking experiments, the size of the blocks coupled with the number of treatments constrains the size of the experimental units. A number of methods have been proposed for determining an optimal size of an experimental unit (Federer, 1955), but because practical concerns and biology are often more important in this determination, these methods are not particularly useful.

2 Borders

Sampling units should be independent from each other but, with very mobile insects for example, this independence can be difficult or impossible to obtain. In invertebrate pathology, two concerns need to be addressed. First, the application of the treatment needs to be confined to each experimental unit. For instance, spray applications can drift between sampling units. Second, treatment effects need to be confined to the experimental unit. For example, after application of a control agent, insects may move among sampling units, and immigration into and emigration from the experiment may occur. Attempts to control these effects include using an appropriate size and shape of experimental unit and introducing a border area between sampling units that is not sampled. In field trials with small experimental units, physical barriers may be used to prevent movement of arthropods among experimental units (e.g., screen cages to confine insect movement, barren soil strips to inhibit walking insects) or to prevent movement of the experimental material among experimental units (e.g., metal barriers to impede nematode movement in the soil). In field trials with larger experimental units, or where barriers are impractical or undesirable, border areas are often used. Whether border areas are treated or left untreated can influence the sampling unit. Edge effects in field trials are often underappreciated, and it may be desirable to analyze data with and without samples taken near the edges. In most cases, border areas should be planted and treated the same way as the experimental units because of the microclimatic effects that they can have on plants and animals in the targeted areas.

3 Subsampling experimental units

Frequently in field experiments it is difficult to measure all of the possible results of a particular treatment within an experimental unit. As the size of the experimental unit increases, the ability to sample all of the subjects becomes more difficult. The method of sampling an experimental unit should yield a sample value that is as close as possible to the value that would be obtained if the whole sampling unit were measured. The difference between these two measurements is the sampling error. The smaller the sampling error, the better the sampling technique. To develop a plot sampling technique, the following parameters need to be specified: sampling unit, number of samples, and sampling design (Gomez and Gomez, 1983).

a Number and size of subsamples

The unit on which the actual measurement is made is the sampling unit. A sampling unit could be a leaf, a whole plant, a volume of soil, or a certain area of the experimental unit. To facilitate sampling, the sampling units should be easy to identify. The size of the unit should be appropriate for the type of measurement being taken. For example, if performing a total count of all insects present, the size of the sampling unit will depend on the distribution of the insects in the experimental units and the constraints on the number of insects that can be counted in a reasonable period of time. The precision of the estimate and the cost of generating that measurement need to be balanced.

The number of samples is determined by the amount of variability among sampling units within the same experimental unit (sampling variance) and the degree of precision wanted by the researcher. The number of samples required from an experimental unit can be estimated for a certain degree of precision expressed as the margin of error of the plot mean or treatment mean. Gomez and Gomez (1983) present the details of these techniques. Many of the techniques discussed below for estimating the number of replicates can be used for estimating the number of samples. Krebs (1989) also lists a number of approaches and a stepwise empirical approach for estimating the number of samples when repeatedly measuring from a site.

b Sampling design

Sampling design determines how the sampling units are distributed within an experimental unit. Taking a representative sample is not necessarily the same as taking a random sample. Random samples do avoid biases introduced by the experimenter selecting the sampling units directly. However, when additional information is available, other sampling techniques may be more representative of the experimental unit. Four sampling designs are presented here, and further information on these and other techniques can be found in Gomez and Gomez (1983), Krebs (1989), and Underwood (1997). These same techniques also apply to developing sampling schemes in large-scale unreplicated or mensurative experiments.

1. Random sampling. This is the simplest and most widely used sampling design. In this design, there is only one type of sampling unit and all of the sampling units are known. Sampling units to be measured can be selected using probability sampling (Krebs, 1989). In many situations, e.g., in field or forest sampling, it is rarely possible to enumerate all sample units and select a truly random sample. A common approach in these cases is to devise a sampling scheme that directs the sampler to a random location where the sample is then taken. For example, the sampler might be instructed to walk a randomly determined number of meters or paces down a randomly selected crop row or transect line to locate a specific plant to be sampled. If the sample unit does not comprise the whole plant, the sampler might be given additional instructions for random selection (or at least unbiased selection) of a specific sample unit from the plant. Random samples are on average representative but, depending on the sample size, may not be representative of a particular plot.

2. Stratified random sampling. This method of sampling is similar to blocking of experimental units and is a powerful tool in sampling design. In stratified sampling, the experimental unit is divided into non-overlapping subpopulations (strata) and each of the subpopulations is sampled using a random sampling technique. The rationale behind dividing the experimental unit into strata is that some pattern of variation within the unit, typically in population density, exists. If done properly, stratification can increase the precision of the estimate. The number of samples per stratum can be proportional (equal percentage of samples from each stratum) or optimal (the number of samples per stratum varies and depends on prior information) (Krebs, 1989); a proportional-allocation sketch is given after this list. In optimal sampling, more samples can be taken if a stratum is larger, more variable, or cheaper to sample than another stratum [see Krebs (1989) for details].

3. Multistage sampling. In this process, sampling units are selected randomly, elements within these units are selected randomly, and measurements are taken from these elements. For example, several trees may be selected randomly from an experimental unit, a number of leaves are selected from random locations on each tree, and insects are counted on each of the leaves. Additional subsampling can also be performed to make even more subdivisions.

4. Systematic sampling. This is a useful technique because it is simple and can sample evenly across an experimental unit (Krebs, 1989). A common technique of systematic sampling is the centric systematic area-sample. In this technique, the experimental unit is divided into N sampling units and a sample is taken from the center of each sampling unit. As the individual sampling units are not identified for random selection, special care is required to avoid bias in the final selection of a sample. If, for example, insects or insect damage is prominently visible to the sampler, a method must be devised to ensure blind (unbiased) selection of the sample (unless the protocol specifically calls for sampling of infested or damaged sample units). Regardless of the sampling method used, all treatments within an experimental block should be sampled by a single individual to account for sampler bias. Randomization is generally preferable to systematic sampling when feasible.
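
As promised above, a minimal sketch of proportional allocation for stratified random sampling (the strata sizes and total sample size are invented):

```python
import numpy as np

def proportional_allocation(strata_sizes, n_total):
    """Number of samples per stratum, proportional to stratum size."""
    sizes = np.asarray(strata_sizes, dtype=float)
    exact = n_total * sizes / sizes.sum()
    alloc = np.floor(exact).astype(int)
    # Hand out the samples lost to rounding, largest fractional parts first
    for i in np.argsort(exact - alloc)[::-1][: n_total - alloc.sum()]:
        alloc[i] += 1
    return alloc

# Hypothetical plot divided into three strata covering 50, 30, and 20 quadrats
print(proportional_allocation([50, 30, 20], 20))  # [10  6  4]
```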

c Pseudoreplication

When setting up an experiment with subsampling, or analyzing the results, it is important to differentiate between true replicates (experimental units) and subsamples. As related earlier in the discussion of chi-square testing, pseudoreplication involves subsampling from a single experimental unit. These subsamples are not true replicates and are not statistically independent (Hurlbert, 1984). Subsamples taken from a single experimental unit thus do not increase the degrees of freedom available for statistical tests. The simplest approach is to use only the mean of the subsamples from an experimental unit in statistical tests. An alternative approach is to use an analysis that can take into account the levels of independence, such as nested analysis of variance, but this does not increase the power to find differences among treatments. Pseudoreplication is of special concern when using large-scale mensurative experiments. For example, spraying a section of forest with a pathogen and then taking samples of disease prevalence at several locations is not true replication. In order for a treatment to be truly replicated, each replicate plot must be independently treated with an independently prepared pathogen preparation (the applied treatment must be replicated). Strict replication may not be feasible in every field test situation, depending on such factors as the type and size of spray equipment used, the number of treatments called for within a given time frame, and the amount of pathogen preparation available. In such cases, true replication may be achieved by repeating the experiment over time. Considering the degree to which pathogen efficacy is affected by weather conditions, repetition over multiple field seasons is highly recommended regardless of pseudoreplication problems. Whenever tests involve subsampling, this should be clearly reported in the methods of the research report.

The classic paper by Hurlbert (1984) should be consulted for excellent coverage of these issues.
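
A minimal sketch of the simplest approach, collapsing subsamples to experimental-unit means before a one-way ANOVA; the pandas DataFrame columns treatment, plot, and infected are hypothetical:

```python
from scipy import stats

# df is assumed to have hypothetical columns 'treatment', 'plot' (the true
# experimental unit), and 'infected' (one row per subsample within a plot).
def anova_on_unit_means(df):
    # Collapse subsamples to a single mean per experimental unit first,
    # so plots (not subsamples) supply the error degrees of freedom.
    unit_means = (df.groupby(["treatment", "plot"])["infected"]
                    .mean().reset_index())
    groups = [g["infected"].values
              for _, g in unit_means.groupby("treatment")]
    return stats.f_oneway(*groups)
```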

5 Select an experimental design

Experimental design is simply the logical structure of the experiment (Fisher, 1951), and a large number of different designs have been developed over the years to aid in the process of assigning treatments to experimental units in manipulative experiments. In single factor experiments, one factor is varied and all others are held constant (e.g., comparing the efficacy of several species of pathogen with that of a chemical pesticide standard). In more complex experimental designs, more than one factor is varied (e.g., comparing how irrigation rate influences the efficacy of several species of pathogen). The factors being compared can be either fixed or random. Fixed factors are generally selected by the experimenter, or all levels of a factor are included in the experiment. The levels of random factors in the experiment are randomly selected from the set of all possible levels. A number of statistical approaches can be used to analyze simple experimental designs such as the completely randomized design, but ANOVA is the most useful analysis for more complicated designs and will be the approach emphasized in this section. More detailed information on the setup and analysis of experimental designs can be found in Federer (1955), Cochran and Cox (1957), Little and Hills (1978), Gomez and Gomez (1983), Mead and Curnow (1983), and Pearce (1983).

When designing an experiment, it is important to consider the nature of the site where the experiment is to be performed. Field sites often have permanent features that may influence the response of the experimental units to a treatment. These features include soil depth, moisture, air movement, border areas, insect density, etc. Some features of the site can be identified and their effects controlled in the experimental design, but others are less apparent and/or difficult to control. Often more than one feature will vary at a field site, and the relative importance of each feature will differ. This makes it difficult to develop an experimental design that controls for all features of the site. It takes experience with the system to make decisions about how to handle these sources of variation. Reviewing the results of previous experiments can provide insights into the relative importance of these various features on the system of interest. Understanding the variance present in the field site will influence what experimental design is appropriate and how experimental units and treatments are allocated.

A Single factor experimental designs

1 Completely randomized design

The completely randomized design is the simplest type of experimental design. Treatments are randomly assigned to each experimental unit so each unit has the same chance of receiving a treatment. This design is appropriate where the experimental area is virtually uniform, or where variation is suspected but has no pattern or its pattern is unknown to the experimenter. The completely randomized design is most useful for small-scale field trials in relatively uniform environments. The advantages of this design are that (1) it is simple to perform, (2) the placement of the experimental units is flexible, and (3) the degrees of freedom for estimating experimental error are maximized and the F value in an ANOVA required for determining statistical significance is minimized. A principal disadvantage of this design is that the process of randomization can assign most or all of the replicates of a particular treatment to experimental units that are different in some way from the others in the experiment. Thus, randomization can introduce a systematic bias, especially when the number of replicates is low. For this design, differences among experimental units are included in the experimental error. Using more complex experimental designs, the sources of variation among the experimental units can be identified and incorporated into the analysis, and smaller statistically significant differences among treatments can be detected.

The number of experimental units will be the number of treatments multiplied by the number of replicates. Before assigning treatments, each experimental unit is given a number. Treatments are assigned to the experimental units by selecting random numbers from a random numbers table or some other source of random numbers. There are two sources of variation among the experimental units in a completely randomized design: treatment variation and experimental error. The relative size of the two is used to determine if differences among treatments are due to chance (i.e., the difference is significant if the treatment variation is sufficiently larger than the experimental error).
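
Equivalently, a pseudorandom number generator can perform the assignment; a minimal sketch (treatment names are invented):

```python
import numpy as np

def randomize_crd(treatments, n_reps, seed=None):
    """Completely randomized design: shuffle treatment labels over all units."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(treatments, n_reps)  # every treatment x replicate
    rng.shuffle(labels)
    return {unit + 1: trt for unit, trt in enumerate(labels)}

# Hypothetical trial: 3 treatments x 4 replicates on 12 numbered plots
print(randomize_crd(["control", "Bt", "fungus"], 4, seed=42))
```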

2 Randomized complete block design

In the randomized complete block design, the treatments are assigned randomly to a group of experimental units that is termed a block. The objective of this approach is to minimize the probability of placing most or all replicates in locations that are unique in some way other than treatment, by minimizing the variability among experimental units within a block and maximizing the variability among blocks. This is one of the most widely used experimental designs in agricultural research. The advantage of this approach is that experimental error is reduced because the block variability is removed from the experimental error term in an ANOVA. This design becomes more efficient, relative to the completely randomized design, at detecting differences among treatments as the variability among blocks increases. If there are no differences among blocks, this design will not contribute to the precision in detection of treatment differences. Therefore, it is most effective when there is a predictable pattern of variability.

Two important parameters in blocking are the selection of a source or sources of variability to use for blocking, and the selection of block shape and orientation. To ensure that a block is as uniform as possible, the experimental units should be grouped on the basis of some parameter that will provide uniformity. Blocks should be kept compact, because as size increases so does the probability of within-block variability. The fewer the number of treatments, the more compact the blocks can be while still having at least one replicate of each treatment in each block. Generally, when variability is in one direction, blocks that are long and narrow, oriented perpendicular to the direction of the gradient, and close together are best. Blocks that are square are better when variability is not predictable. Keeping blocks the same size will also reduce within-block variability. The treatments are assigned randomly to each experimental unit in a block, and a separate randomization is performed for each block. For the analysis of variance, there are three sources of variability in this design: treatment, block, and experimental error.
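
A minimal sketch of the within-block randomization (treatment names invented):

```python
import numpy as np

def randomize_rcbd(treatments, n_blocks, seed=None):
    """Randomized complete block design: separate shuffle within each block."""
    rng = np.random.default_rng(seed)
    layout = {}
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)      # independent randomization per block
        layout[block] = order   # plot order within the block
    return layout

print(randomize_rcbd(["control", "Bt", "fungus"], n_blocks=4, seed=7))
```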

3 Latin square design

The Latin square design is not used very often in insect pathology field research but is applicable when there are two sources of variability that vary in different directions. Treatments are randomized into columns as well as rows. Experimental error from both of the sources of variation can therefore be removed. A limitation of this design is the requirement that the number of replicates equal the number of treatments; thus space rapidly becomes a limiting factor. The number of treatments that can be accommodated will depend on a variety of factors, such as the space available and the experimental unit size, but experiments with more than eight to twelve treatments are generally not practical. For ANOVA there are four sources of variation: row, column, treatment, and experimental error.

4 Incomplete block designs

When a large number of treatments are needed in an experiment, an incomplete block design may be appropriate. As the number of treatments increases, it becomes difficult to use a complete block design. This is because each block must contain at least one replicate of each treatment, and to accommodate additional treatments, the block size has to increase. As block size increases, the homogeneity of the block decreases and the experimental error increases. Incomplete block designs do not have all of the treatments represented in each block, and this enables the size of the blocks to remain small even with a large number of treatments. These types of designs do have some significant disadvantages: (1) the number of treatments and replicates is relatively inflexible, (2) there is unequal precision in the comparison of treatment means, and (3) the data analysis is more complex. However, the availability of sophisticated computer software capable of handling these analyses (general linear models) has made the latter issue less important. There are many incomplete block designs, with the balanced lattice design being the most commonly used, and more information on these designs can be found in Cochran and Cox (1957).

B Multiple factor designs

Factorial experimental designs are used for situations where two or more variable factors and their interactions need to be studied. The results of a single factor experiment are technically only applicable to the conditions present during the experiment. However, the way that an organism responds to an experimental treatment is often influenced by other factors. It is often desirable to determine how the response to a variable is influenced by one or more other factors. For example, it may be of interest to determine how the efficacy of an entomopathogenic nematode is influenced by the level of fertilizer applied to the soil. The two factors (frequently termed primary and secondary factor) interact if the effect of one factor changes with the level of another factor. When there is no interaction among factors, the simple or direct effect of the factor is the same as the average effect of the primary factor across the levels of the secondary factor (i.e., the main effect) and generalizations can be made. If there is an interaction, no generalizations can be made across levels of the secondary factor. Although interactions can cause statistical problems, they are often the most biologically interesting results of an experiment.

A complete factorial experiment is one in which all combinations of the selected levels of two or more factors are present. For example, if three species of entomopathogenic nematode and three levels of fertilizer were used in a factorial experiment, it would be a 3 × 3 factorial experiment with nine different treatments representing each possible combination of factors. The number of treatments increases rapidly with the addition of factors or levels of factors. Factorial experiments can become quite large and costly to perform, and this can limit their applicability. Running a single factor experiment first and then asking selected questions with a factorial experiment can be a preferred approach.
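
Enumerating the treatment combinations of a complete factorial is straightforward; a sketch using the nematode-by-fertilizer example (species and level names chosen for illustration):

```python
from itertools import product

# Hypothetical 3 x 3 factorial: every nematode species x fertilizer level
nematodes = ["S. carpocapsae", "S. feltiae", "H. bacteriophora"]
fertilizer = ["low", "medium", "high"]

treatments = list(product(nematodes, fertilizer))
print(len(treatments))  # 9 treatment combinations
for trt in treatments:
    print(trt)
```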

The term factorial refers to the method of determining the treatments, not to a specific experimental design. The completely randomized and randomized complete block designs can be used for factorial experiments. The analysis of variance does differ for factorial experiments, because the treatment sum of squares is partitioned into the main effects of the individual factors and their interactions. Because of the large number of treatments that can result from a factorial experiment, complete block designs may be too large to be used efficiently. Incomplete block designs like the balanced lattice design are not appropriate for factorial experiments, but there are comparable designs that can be used with factorial experiments. The most commonly used experimental design for factorial experiments is the split-plot design.

1 Split-plot design

The basic split-plot design involves assigning the treatments of one factor to main plots (the main plot factor) and the treatments of a second factor (the subplot factor) to subplots within each main plot. This design sacrifices precision in estimating the average effects of the treatments assigned to the main plots, but it often improves the precision of comparing the average effects of treatments in the subplots. When interactions exist, the precision is increased for comparisons of subplot treatments for a given main plot treatment. The experimental error for main plots is usually larger than the experimental error used to compare subplot treatments. The error term for subplot treatments is smaller than would be obtained if all treatment combinations were arranged in a randomized complete block design. Because of this variation in precision, the selection of which factors to assign to the main plot or subplot is very important. Several parameters are important in making this decision. Two parameters are statistical and deal with the degree of precision desired for either factor and the expected size of the main effects of each factor. The factor of greater interest or with the smaller expected main effect size should be the subplot factor. Management practices also influence the decision of which factors should be main plot or subplot factors. Using our above example, fertilizer level would be a likely main plot factor and nematode species a subplot factor.

There are two steps to the randomization of a split-plot design. Main plot factors can be arranged using a completely randomized or randomized complete block design. The randomization procedure of the main plot depends on the design that was selected. Subplot factors are randomized within each main plot, with separate randomizations carried out within each main plot. The analysis of variance of a split-plot design is divided into two analyses: the main-plot analysis and the subplot analysis.
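
A minimal sketch of this two-step randomization, using fertilizer as the hypothetical main plot factor and nematode species as the subplot factor (names invented):

```python
import numpy as np

def randomize_split_plot(main_trts, sub_trts, n_blocks, seed=None):
    """Split plot in randomized complete blocks: main plots randomized within
    each block, then subplot treatments randomized within each main plot."""
    rng = np.random.default_rng(seed)
    layout = {}
    for block in range(1, n_blocks + 1):
        mains = list(main_trts)
        rng.shuffle(mains)  # step 1: randomize main plots within the block
        # step 2: a separate subplot randomization inside every main plot
        layout[block] = {m: rng.permutation(sub_trts).tolist() for m in mains}
    return layout

print(randomize_split_plot(["low N", "high N"],
                           ["S. carpocapsae", "S. feltiae", "H. bacteriophora"],
                           n_blocks=3, seed=3))
```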

2 Designs with more than two factors

The number of factors included in a factorial experiment can be increased beyond two, but there is a rapid increase in the number of treatments and in the number and types of interactions that can occur. Sometimes these interactions are of interest, but often the complexity and cost of a large factorial experiment are prohibitive. There are a number of experimental designs that can be used for experiments with three or more factors, including the randomized complete block design, the split-split-plot design, and fractional factorial designs. Generally, it is preferable to use simpler experimental designs rather than multi-factorial designs. Information on these more complex designs can be obtained from the references cited at the beginning of this section.

6 Improving the precision of an experiment

The level of precision in sampling refers to the size of the confidence interval that, with a high probability, contains the true mean. The smaller the confidence interval, the greater the precision and the smaller the detectable difference between treatments. The level of precision is influenced by three things: the sample size, variation in the population, and the probability used to construct the confidence interval (Underwood, 1997). Methods that increase precision are designed to lower unaccountable variability per plot and to increase the effective number of replicates.


Precision can be improved by increased replication, careful selection of treatments, refinement of technique, selection of experimental materials, selection of experimental units, taking additional measurements, and planned grouping of experimental units. The goal in experimental design is to maximize precision within the constraints of cost (e.g., labor and time availability, space and material limitations).

A Replication

Before starting an experiment, the number of experimental units receiving the same treatment (replication or sample size) needs to be determined. How many replicates to use is perhaps one of the most frequently asked questions concerning experimental design and, unfortunately, it is a difficult one to answer. Determining the number of replicates involves balancing a number of factors, only some of which are statistical. Careful consideration should be given to replicate number because it influences the power and precision of the experiment. Enough replication is needed to obtain valid results, but not so much as to make the experiment impractical. The amount of replication is often determined by estimation by the experimenter, but with experience these estimates can be quite accurate. In addition, the amount of replication is constrained by space availability and economics. Some experimental designs, as already discussed, can influence decisions on the number of replicates.

The precision of an experiment can be increased by adding replication, but the degree of improvement falls off rapidly as the number of replicates increases. In general, field research with four to eight replicates will provide reasonable precision. However, estimating the amount of replication needed based on the magnitude of the difference the experimenter is interested in detecting, or on the desired power of the experiment, is a preferable approach when feasible. There are a number of statistical approaches to determine sample size, but they all require some estimate of the expected variation in the data based on previous data or pilot studies. The experimenter also needs to have some idea of the size of the difference between treatments that must be detectable. The calculated replicate number is still only an estimate because it is based on estimates and arbitrarily set values, and it should not necessarily be followed blindly. If sufficient resources are lacking to perform the experiment at the level of precision needed, it is best to determine this before conducting the experiment.

1 Estimating the number of replicates needed

There are different ways of estimating the number of replicates needed for an experiment. Simple formulae are available for calculating the appropriate sample size for comparing two means or two proportions (Motulsky, 1995; Zar, 1999). To estimate the optimal number of replicates prior to performing an ANOVA, we can use the following equation and the process of iteration (Zar, 1999).

φ = √[nδ² / (2ks²)]    (1)

In equation (1) the variables are: φ, a value related to the noncentrality parameter; k, the number of treatments used in the ANOVA; s², the error MS from an ANOVA of a similar experiment; δ, the minimum detectable difference that the experimenter wants to be able to detect; and n, the number of replicates. This approach requires the results of an ANOVA performed on a similar system. The process of determining the number of replicates involves making an initial guess and then repeatedly refining that estimate. To do this, select a power for the experiment, e.g., 80%, and an initial estimate of the number of replicates, and solve for φ. The parameter φ is then plotted on the graphs prepared by Pearson and Hartley (1951; also included in Zar, 1999) to determine the power of the test. Each graph is for a different ν₁ (group df), and the position where φ (on the x-axis) intersects the curve for a given ν₂ (error df) provides a value for power (y-axis) (Zar, 1999). If the resulting power is above or below 80%, the number of replicates can be adjusted appropriately and a new value of φ calculated. This process is repeated until the required power is achieved, and this value of n is used for the number of replicates.
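
Where the Pearson and Hartley charts are unavailable, the same iteration can be carried out numerically: for k groups of n replicates, power is the probability that a noncentral F variate with ν₁ = k − 1, ν₂ = k(n − 1), and noncentrality λ = kφ² exceeds the critical F. A minimal sketch using SciPy, with φ taken from equation (1) (the δ and s² values below are invented):

```python
from scipy.stats import f as f_dist, ncf

def replicates_for_power(k, delta, s2, alpha=0.05, target=0.80, n_max=200):
    """Smallest n per treatment giving the target power, via equation (1)."""
    for n in range(2, n_max + 1):
        df1, df2 = k - 1, k * (n - 1)
        nc = n * delta**2 / (2 * s2)          # lambda = k * phi^2, phi from eq. (1)
        f_crit = f_dist.ppf(1 - alpha, df1, df2)
        power = ncf.sf(f_crit, df1, df2, nc)  # P(noncentral F > critical F)
        if power >= target:
            return n, power
    return None

print(replicates_for_power(k=4, delta=2.0, s2=1.5))
```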


2 Different amounts of replication among treatments

In most experiments, the design uses an equal number of replicates for all treatments. In some situations, not all comparisons of treatments are of equal interest, and this can influence the number of replicates that are needed for each treatment. For example, frequently there are two comparisons of interest: one among experimental treatments and one between each of these treatments and the control treatment. In this situation it may be desirable to have two replicates of the control and one of each of the other treatments in a block. This will result in different power levels for each type of comparison. This consideration is also of importance when there is not sufficient experimental material, or there are not enough experimental units, for equal replication.

B Statistical power

Prior to running a field experiment, the power of the proposed experiment should be determined. The results of this determination can help in deciding if an experimental design needs to be modified, or even if the experiment should be run at all. If the power is not adequate, a number of parameters associated with an experiment can be modified to increase the power of the test (e.g., increase the sample size, increase the difference among population means, decrease the number of groups, decrease the variability within populations, or use a larger value of α).

A number of procedures have been described for estimating the power, required sample size, or detectable difference among means for ANOVA, and most biostatistics books present methods for determining the power of an experimental test. The technique described above for estimating replication can also be modified to estimate power. Here, techniques described in Zar (1999) that use the graphs of Pearson and Hartley (1951) are described. The term φ can be calculated for an ANOVA in several ways. If the mean squares of an ANOVA from a similar experiment are available, then the following formula can be used.

φ = √[(k − 1)(groups MS) / (ks²)]    (2)

The variables in formula (2) are: φ, a value related to the noncentrality parameter; k, the number of treatments used in the ANOVA; groups MS, the groups mean square from an ANOVA for a similar experiment; and s², the error MS from an ANOVA for a similar experiment. The estimated power can then be determined by converting φ to a power measurement using the figures cited above.

If the variability among populations is expressed in terms of deviations of the k population means, μᵢ, from the overall mean of all populations, μ, then the following calculation can be used.

φ = √[n Σᵢ₌₁ᵏ (μᵢ − μ)² / (ks²)]    (3)

A third approach to determining the power of an experiment is to specify the smallest difference between two population means that we want to detect. This term is the minimum detectable difference (δ). We can then calculate φ using the following formula, which is similar to equation (1).

φ = √[nδ² / (4ks²)]    (4)

The variables in formula (4) and the method of calculating power are the same as in equation (1). This formula can also be used to estimate the power of comparing two means by substituting another error term for the error MS.

7 Issues associated with data analysis

A Correction for control mortality

In many experiments it is desirable to correct the mortality in the experimental treatments for the mortality that occurs in the control treatment. When the treatments consist of a graded series, the response can be corrected for control mortality using probit analysis (Finney, 1971). When there is a small number of treatments, or the treatments are not related in a series, correction for control mortality has traditionally involved the use of Abbott's formula (Abbott, 1925). However, Abbott's formula is not a complete correction for control mortality because it does not include an estimate of the variance in the control mortality (Rosenheim and Hoy, 1989). It is common practice to retain the estimate of variance for the uncorrected treatment mortality and use it for the corrected treatment mortality, but this can lead to misleading conclusions. Rosenheim and Hoy (1989) presented a technique to correct for control mortality that provides an approximate confidence interval for the corrected treatment mortality. An alternative, but more complex, approach that provides a better approximation was pointed out by Koopman (1994). This approach derives the point estimate and confidence intervals directly from likelihood theory for binomial data (Finney, 1971; Gart and Nam, 1988). Another approach is to use a resampling technique such as bootstrapping to calculate the variance estimate of the index (Manly, 1997). Any of these approaches to variance estimation should be used rather than the original Abbott's formula alone when correcting for control mortality.
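
A minimal sketch of Abbott's correction with one simple parametric bootstrap interval, in the spirit of the resampling approach of Manly (1997); the mortality counts are invented:

```python
import numpy as np

def abbott_bootstrap(t_dead, t_n, c_dead, c_n, n_boot=10000, seed=0):
    """Abbott-corrected mortality with a percentile bootstrap CI."""
    rng = np.random.default_rng(seed)
    pt, pc = t_dead / t_n, c_dead / c_n
    corrected = (pt - pc) / (1 - pc)          # Abbott (1925)
    # Resample individual fates in the treated and control groups
    bt = rng.binomial(t_n, pt, n_boot) / t_n
    bc = rng.binomial(c_n, pc, n_boot) / c_n
    boot = (bt - bc) / (1 - bc)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return corrected, (lo, hi)

# Hypothetical counts: 38/50 dead in treatment, 8/50 dead in control
print(abbott_bootstrap(38, 50, 8, 50))
```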

B Calculating the power of the ANOVA

It is useful to determine the power of an ANOVA, or any other statistical procedure, after it has been performed. This is especially useful if the null hypothesis is not rejected, because we would like to know how likely the test was to detect a true difference among the population means. This can be done in a manner similar to that described above for estimating power before running an ANOVA. The following equation, which is similar to equation (2), can be used.

φ = √[(k − 1)(groups MS − s²) / (ks²)]    (5)

C Data transformation

Data transformation is a common method used to overcome violations of the assumptions of a statistical test. During data transformation, the original data are converted to a new scale that is expected to be more consistent with the assumptions of the analysis. Because all data are treated the same, comparisons among treatments remain valid. However, data transformations are performed only so that the analysis will be valid, not so that more desirable results are obtained. Moreover, it is important to recognize that applying a transformation to data that already satisfy the assumptions of an analysis will produce data that violate the assumptions and may lead to erroneous conclusions.

After a valid transformation has been selected, all analyses are conducted in the transformed scale. For presentation purposes, results are most easily comprehended if converted back to the original scale; however, the back-transformation is not straightforward in all respects. Means can be directly back-transformed to yield correctly weighted means in the original scale; however, this is not the case with standard errors, as they are symmetrical (expressible as the familiar ± values) only in the transformed scale. Instead, confidence limits must be calculated in the transformed scale and then converted back to the original scale, a process that generates asymmetrical confidence intervals. Any consideration of means from the ANOVA as being "correctly" weighted depends, of course, on the reliability of all the usual ANOVA assumptions, and in reporting results it is desirable, space permitting, to present both the weighted and unweighted means. The standard errors of the unweighted means (the original data) are especially useful as indicators of potential problems (heterogeneity of variances, outliers, etc.).
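
A minimal sketch of this back-transformation for log-transformed data (the counts are invented); note that the resulting confidence interval is asymmetrical about the back-transformed mean:

```python
import numpy as np
from scipy import stats

# Hypothetical insect counts from one treatment
y = np.array([12, 30, 7, 55, 18, 24], dtype=float)

z = np.log10(y)                    # analyze in the transformed scale
mean_z = z.mean()
half = stats.t.ppf(0.975, len(z) - 1) * stats.sem(z)

back_mean = 10 ** mean_z           # back-transformed (geometric) mean
ci = (10 ** (mean_z - half), 10 ** (mean_z + half))
print(back_mean, ci)               # asymmetrical interval in the original scale
```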

Some common data transformations for dealing with violations of ANOVA assumptions are listed below. More detail on data transformations is available in many general statistics books (e.g., Gomez and Gomez, 1983; Pearce, 1983; Sokal and Rohlf, 1995; Zar, 1999). After the data are transformed, the assumptions of the analysis should be tested again to make sure that the adjustment is adequate. Statistical approaches whose assumptions are not violated by the data should also be considered as an alternative to transformation.

1 The log transformation

The log transformation is useful for data where the means and variances are not independent and/or the effects are multiplicative rather than additive.
This generally occurs with data that are ratios of two variables or are whole numbers that vary over a wide range of values (e.g., the number of insects per sampling unit). Data are transformed by taking the logarithm of each data point. Logarithms of any base can be used. Data with negative values or zeros cannot be transformed this way. If there are zeros or values less than 1.0 in the data set, a positive value, such as 1.0 or 0.1, should be added to all of the data before transforming. If most of the data are large, adding a constant such as 1.0 makes little difference. However, if some samples are small (< 0.1) with some zeros and others are large (> 10), adding 1.0 can make a big difference in their relative magnitudes. In this situation, a small number such as 0.1 should be added. More on the influence of adding a constant to log transformations can be found in McArdle and Gaston (1992).
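The influence of the added constant is easy to check directly. In the sketch below (with made-up values), adding 1.0 compresses the small samples much more than adding 0.1 does:

```python
import numpy as np

small = np.array([0.0, 0.05, 0.2])     # samples near zero, including a zero
large = np.array([20.0, 50.0, 120.0])  # much larger samples

for c in (1.0, 0.1):
    print(c, np.log10(small + c), np.log10(large + c))
# Adding 1.0 compresses all the small values toward log10(1) = 0 while
# barely changing the large ones; adding 0.1 preserves more of the
# relative spread among the small samples.
```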

2 The square-root transformation

The square root transformation is useful when the data are not normally distributed but follow a Poisson distribution. In this case, the relationship between mean and variance (mean = variance) violates the homogeneity of variance assumption. This type of distribution occurs frequently when the data are based on counts and there is a low probability of an event occurring for any individual sample. This type of distribution is typical of insect counts per plant, quadrat, or net sweep. Causes other than Poisson distributions can also generate data where the variance is proportional to the mean. This transformation can also be used for percentage data if the values are between 0 and 30% or between 70 and 100%. Plotting the variances against the means can be useful for determining if a transformation is necessary. Adding 0.5 or 1.0 to the data before taking the square root can reduce the heterogeneity, especially when the data are close to zero (i.e., < 1.0).
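A quick numerical check of the mean-variance relationship, and of the transformation itself, might look as follows (the counts are hypothetical):

```python
import numpy as np

# Hypothetical counts: 3 treatments x 4 replicate plots
counts = np.array([[0, 2, 1, 4],
                   [3, 6, 2, 5],
                   [10, 14, 8, 12]])

means = counts.mean(axis=1)
variances = counts.var(axis=1, ddof=1)
print(variances / means)  # ratios near 1 are consistent with Poisson data

transformed = np.sqrt(counts + 0.5)  # sqrt(Y + 0.5) for counts near zero
print(transformed.var(axis=1, ddof=1))  # variances should be more homogeneous
```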

3 The arcsine transformation

The arcsine transformation is useful for data based on percentages or proportions derived from counts (e.g., the proportion of insects that are infected). This type of data is binomial in distribution rather than normal. In this type of distribution, variances tend to be small at values close to 0% or 100% and large near 50%, and this can lead to heterogeneity of the variances. To make the distribution more normal, the arcsine of the square root of the data is taken. In this transformation, the proportion is used, not the percentage. If the majority of the percentage data lie between 30 and 70%, the data probably do not need to be transformed. The arcsine transformation is not as effective for data near the extreme values of 0% or 100%. If the actual proportions are known, the transformation is improved by replacing 0/n with 1/(4n) and n/n with 1 − 1/(4n) (Bartlett, 1947). Even greater improvements are realized with a number of other transformations (see Sokal and Rohlf, 1995; Zar, 1999).
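A sketch of the transformation, including Bartlett's (1947) endpoint replacements, assuming counts of infected insects out of n per plot (the values are hypothetical):

```python
import numpy as np

def arcsine_sqrt(infected, n):
    """Arcsine square-root transform of proportions, replacing 0/n with
    1/(4n) and n/n with 1 - 1/(4n) before transforming (Bartlett, 1947)."""
    p = infected / n
    p = np.where(infected == 0, 1.0 / (4 * n), p)
    p = np.where(infected == n, 1.0 - 1.0 / (4 * n), p)
    return np.arcsin(np.sqrt(p))

infected = np.array([0, 3, 17, 20])  # infected insects out of 20 per plot
print(arcsine_sqrt(infected, n=20))  # radians; analyses use this scale
```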

4 The Box-Cox transformation

A common approach for transforming data to normality involves raising the data to some power. Box and Cox (1964) developed a procedure to identify the best transformation from a defined family of power transformations. This is accomplished through an iterative process and, for practical application, requires a computer. The Box-Cox algorithm is included in many statistical software packages. If not available, Sokal and Rohlf (1995) recommend trying the series √Y, ln Y, 1/√Y, and 1/Y for distributions skewed to the right, and Y² or Y³ for distributions skewed to the left. In regression, the reciprocal transformation, 1/Y or 1/(Y + 1) to allow for zero values, is often effective for linearization of the hyperbolic curves that characterize many rate phenomena such as eggs produced/female/day (Sokal and Rohlf, 1995). It is possible to apply the Box-Cox transformation to either the independent or dependent variable or to both variables simultaneously in regression analysis.
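For example, the maximum likelihood estimate of the Box-Cox exponent can be obtained with the boxcox function in the scipy package (the data below are made up):

```python
import numpy as np
from scipy import stats

y = np.array([1.2, 0.8, 3.5, 2.2, 9.7, 4.1, 1.6, 6.3])  # right-skewed data

# boxcox() estimates the power (lambda) by maximum likelihood;
# the data must be strictly positive
y_transformed, lam = stats.boxcox(y)
print(round(lam, 2))
# lambda near 0 corresponds to a log transformation, 0.5 to a
# square root, and -1 to a reciprocal.
```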

5 No transformation is possible

Sometimes the violations of ANOVA assumptions, such as heterogeneous variances, cannot be corrected with data transformation. However, if the data are balanced and there are a large number of treatments and replicates (n > 5),
then ANOVA is relatively robust to departures from the assumptions, and in some cases it may be worthwhile to proceed with the ANOVA. If violation of an assumption is known to produce a liberal test (e.g., analysis of data with unequal variances), it is reasonable to accept an ANOVA finding of no significant difference. Similarly, if a violation creates a conservative test (e.g., analysis of Cauchy-distributed data), it is reasonable to accept a result of significant difference. In each case, however, the alternative result must be viewed with considerable caution. Uses of nonparametric tests as alternatives to ANOVA in cases where the underlying assumptions cannot be met were discussed previously.

D Missing data, outliers, and other ‘mistakes’

Mistakes happen in field experiments, and unlike in many laboratory experiments, it may not be possible to repeat the experiment. Every effort should be made to avoid losing data, but clearly, some things are out of an experimenter's control. Missing data occur when an experimental unit is excluded from analysis for reasons that have nothing to do with the experimental treatments. Missing data can result from a number of different situations, but care needs to be taken in deciding when data should not be used in an analysis. Discussed below are some of the common causes of missing data, along with ways to deal with them.

1. In many cases the decision on what experimental units should be excluded is clear-cut. For example, if a treatment was not properly applied (e.g., no application or application of an incorrect concentration) to an experimental unit, that unit should be excluded. An exception would be if the treatment was applied improperly to all replicates and the researcher wanted to retain the modified treatment. Another example would be data or samples that are lost after they are collected in the field. These data can be excluded from analysis.

2. Careful thought should be given to the decision to exclude data when the cause is less clear-cut, because an incorrect choice could lead to inaccurate conclusions being drawn from the experiment. For example, if all or most of the crop plants are lost from an experimental unit, it needs to be determined if this loss is in any way treatment-related. If it is not treatment-related, the replicate could be declared missing data, but if this cannot be determined clearly, the data should be included.

3. In some cases, experimental units are mixed (e.g., the samples from two experimental units are not labeled and their origin cannot be determined). Common sense may be all that is needed to determine if an experimental unit can be assigned to a particular treatment with reasonable certainty or if it should be thrown out. Some mathematical tests for dealing with this problem are provided by Pearce (1983).

4. If so many replicates are lost that a treatment has only one replicate remaining, the treatment should be excluded. Similarly, a block should be excluded if all replicates of all treatments but one are lost.

5. After the data have been recorded, some data points may appear illogical (e.g., have values that are beyond the normal range for a particular material). Performing an objective check for outliers in the data can be useful. However, illogical data should not be excluded just because they do not meet expectations. Only illogical data that result from some identifiable type of error should be manipulated or excluded from analysis.

6. If detected early enough, some errors in the data (e.g., misread observations, improper use of equipment, transcription errors) can be corrected by repeating the measurement or adjusting the data. Check the data immediately after taking measurements so that errors may be detected and corrected before the opportunity is lost.

7. If there is a good reason to think that a data point is wrong, but it is not different from the rest of the data, it should be left in the analysis.

8. If there are objective reasons to think that an extreme data point is flawed, there are a number of approaches to assess if it is an outlier and should be excluded (Dunn and Clark, 1987; Hoaglin et al., 1983). However, any data set is likely to have some outliers (i.e., data points that are not typical of the other data collected), and it is therefore important to have some objective criteria for exclusion (Underwood, 1981). Often, examination of outliers can provide useful insights into some interesting aspects of the biology of the system. It is this variation that is often of greatest interest when trying to understand how a system works.

9. If an experimental unit or replicate is to be excluded from analysis, there are a variety of procedures to deal with the missing data [see Pearce (1983), Gomez and Gomez (1983), or Smith (1981) for more detailed information]. One approach is to analyze the data without the excluded replicates and use an analysis that is appropriate for unbalanced data (e.g., a General Linear Models procedure, as sketched below). This is by far the most common approach and is the least contentious. A second approach is to estimate missing data values to fill the gaps left by the missing data. A third approach is to replace the missing values with approximate values and use the analysis of covariance to make adjustments. The latter two approaches are more useful for complex experimental designs that have partitioning of treatment effects.
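A minimal sketch of the first approach, assuming a randomized complete block trial with one plot lost from treatment B; the data are hypothetical, and statsmodels is used here simply as one available general linear models implementation:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical field data with one replicate lost from treatment B
df = pd.DataFrame({
    "treatment": ["A"]*4 + ["B"]*3 + ["C"]*4,
    "block":     [1, 2, 3, 4, 1, 2, 4, 1, 2, 3, 4],
    "mortality": [0.42, 0.51, 0.38, 0.47, 0.65, 0.70, 0.61,
                  0.15, 0.22, 0.18, 0.20],
})

# A general linear model handles the unbalanced design directly;
# Type II sums of squares are appropriate without interaction terms
model = smf.ols("mortality ~ C(treatment) + C(block)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```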

8 Interpretation and presentation of results

A Biological versus statistical significance

Statistical analyses provide P values, but often a researcher is interested in determining if the obtained P value is statistically significant by using statistical hypothesis testing. This decision is based on comparing the P value calculated from the data to a predetermined value of alpha (typically 0.05) that is based on the consequences of type I and type II errors. Values of P less than alpha are considered statistically significant, and the null hypothesis is rejected (a difference is declared). The determination of statistical significance has its advantages in situations where a dichotomous decision needs to be made. However, the disadvantage is that this conclusion can be misinterpreted (Jones and Matloff, 1986). If the results of an experiment are statistically significant, one of three possibilities is true (Motulsky, 1995): (1) the null hypothesis is actually true and the significant result is a coincidence (a type I error); (2) the null hypothesis is false and the populations are different in some biologically meaningful way; and (3) the null hypothesis is false, but the populations are not different in any meaningful biological way. With large sample sizes, even very small differences between populations will be declared significantly different. The scientific importance of the conclusion depends on the size of the difference between populations and the significance of this difference in the biology of the organism or the requirements of the researcher. The fact that the results of a statistical analysis are significant does not necessarily mean that the results of the experiment are biologically or economically meaningful. On the other hand, with small or even moderate sample sizes, differences that are in fact important may fail to be declared significant. Thus, the fact that the results of a statistical analysis are not significant does not necessarily mean that the results are not biologically or economically significant. A common example is statistical analysis of crop yields. Small differences in yield may not be detectable even in a well-designed experiment; yet the difference could be of considerable economic importance to a grower. Multiple repetitions of a test may be required to obtain sufficient statistical power to detect small but important differences.

In biological experiments, it is not always necessary to reach a sharp decision from each P value; instead, the P value itself can be used as an indication of the degree of confidence in the result. However, there is disagreement over how to use P values. Many researchers and statisticians believe that how close the P value is to alpha does not matter: if it is below the threshold it is significant, and if it is above, it is not significant. Others think that differences in P values do provide important information and that more confidence can be placed in the results of an experiment with P = 0.001 than in an experiment with P = 0.049. Sometimes results of statistical analyses are presented with a scale of significance levels: P < 0.05 being significant (*), P < 0.01 being highly significant (**), and P < 0.001 being extremely significant (***). Regardless of the approach that a researcher chooses, differences of questionable biological relevance and P values near the value of alpha may warrant further experimentation before reaching scientific conclusions. Calculating the power of the analysis can also help with the interpretation of nonsignificant results.

B Presentation of results

In publications, the details needed to repeat the experimental procedure are usually reported, but the details needed to understand the experimental design and statistical analysis are often not as clearly stated. Not reporting procedures adequately is one of the most common statistical problems (Fowler, 1990). Some things that are important to state when reporting the results of an experiment are listed below.

1. Clearly state the objectives of the experiment to enable the reader to determine the relationship between the logical structure, the data collected, and the conclusions reached.

2. Report clearly what statistical methods were used and why they were chosen so that readers can draw their own conclusions.

3. The means, sample sizes, and standard errors should be the minimum amount of summary information reported.

4. State how experimental units were selected and sampled so that the reader can determine if pseudoreplication may have occurred.

5. If questionable data (outliers) are adjusted or eliminated from analysis, these manipulations should be described in the research report.

6. State the assumptions of the tests and how these influenced the selection of transformation procedures and statistical procedures.

7. Cite the software package that performed the analysis if a computer was used, or the reference for the procedure if done by hand.

8. Describe completely the experimental material and how it was handled (e.g., commercial source, generations in laboratory culture, storage time and conditions).

9. Distinguish between statistical and biological or economic significance when presenting the results of an experiment. Performing a power analysis before running an experiment can help with this problem.

9 Special cases

A Large-scale unreplicated trials

Some important questions can only be addressed on large scales that are difficult or impossible to replicate and randomize. These types of experiments may be natural experiments (e.g., perturbations of the system that result from unexpected events such as natural movement of a pathogen into a particular area) or experimenter manipulations (e.g., classical biological control introductions, large-scale applications). Smaller scale experiments that can be replicated and randomized may be used, but these may not provide accurate predictions of large-scale phenomena. Some large-scale studies can be analyzed using techniques developed for time series data. However, even on large scales, replication is preferred if at all possible.

Most of these types of experiments have a before-and-after component to them; measurements are made of the system before and after an intervention. Time series data are obtained by repeated subsampling from the same experimental unit through time. Time series analysis techniques can be used to determine if an abnormal change followed a manipulation of the system, compared to the normal variation in the experimental unit. Determining the amount of normal variation through time is more difficult than determining the experimental error in a replicated and randomized experiment. The different samples that are taken through time are not independent from each other and are not the same as replicates. Good design and analysis of unreplicated experiments are difficult and time consuming. Rasmussen et al. (1993) provide an excellent introduction to this type of analysis. More detailed information can be obtained from Box and Jenkins (1976) and other references cited in Rasmussen et al. (1993).

These designs are prone to having random events influence interpretation. Some ways around this are to have multiple interventions or to switch back and forth between treatment and control. The probability of the response being coincidental is reduced with each observed response. Using paired units, where one unit receives a treatment and the other does not, is also a useful approach (Stewart-Oaten et al., 1986). In this design, the control unit provides support that the response in the perturbed unit is not due to random changes. Two important factors need to be considered in designing this type of experiment: (1) what are the experimental units, and (2) how many and how often should
the experimental units be sampled. Treatment and control experimental units should be similar to each other, and they should represent the systems to which the results will be generalized. The samples should be evenly spaced through time, with the time interval determined by the rate of change in the population being studied and the specific questions being asked. For example, insect populations with large fluctuations in density will need to be studied over longer periods of time than insects with more stable population densities. Generally, the greater the number of sampling times, the stronger the statistical analysis will be (usually 50 or more samples are needed). The number of samples taken from an experimental unit at a particular point in time will depend on the characteristics of the experimental units and the question of interest.

There are many different statistical approaches for the analysis of time-ordered sequences of observations. Rasmussen et al. (1993) describe the autoregressive integrated moving average (ARIMA) models, which are a subset of the available statistical approaches. These models describe a wide range of processes and are appropriate for evenly spaced intervals. The analysis of time series data from unreplicated experiments can be more subjective than other more traditional statistical approaches, and caution must be used in analysis and interpretation. Consultation with a statistician is strongly recommended.
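As a simplified illustration (not the full procedure of Rasmussen et al., 1993), the sketch below fits an ARIMA model with a step-change exogenous variable to simulated weekly pest counts; the counts, the intervention point, and the model order are all hypothetical:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)

# Hypothetical weekly pest counts: 30 pre-treatment, 20 post-treatment weeks
n_pre, n_post = 30, 20
y = np.concatenate([20 + rng.normal(0, 2, n_pre),
                    12 + rng.normal(0, 2, n_post)])    # apparent drop after spray
step = np.concatenate([np.zeros(n_pre), np.ones(n_post)])  # intervention dummy

# The ARIMA errors absorb the serial correlation in the counts, while the
# exogenous step variable estimates the size of the intervention effect
results = ARIMA(y, exog=step, order=(1, 0, 0)).fit()
print(results.params)   # includes the estimated step change
print(results.pvalues)
```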

B Meta-analysis

The ability to generalize and summarize is an essential part of statistical analysis and of science in general. If a number of independent experiments have been performed, it may be desirable to have a statistical synthesis of this research; this type of analysis is termed meta-analysis. This approach is somewhat controversial, but has been used in fields such as medicine and the social sciences and is becoming more frequently used in ecology (Gurevitch and Hedges, 1993). Progress in science depends on the ability to reach general conclusions from a body of research. Meta-analysis provides a way to reach general conclusions based on quantitative techniques, addressing such questions as: how large is the effect, how frequently does it occur, and what is the difference in the magnitude of the effect among different studies?

Counting up the number of statistically significant results from various studies to gain insight into the importance of an effect (i.e., vote counting) is commonly done but is subject to serious flaws. This is because the significance level of a study is the result of the magnitude of the effect and the sample size. Small studies are less likely to produce significant results (lower power), and therefore, vote counting is strongly biased toward finding no effect (Gurevitch and Hedges, 1993). Even review articles that summarize results of previous studies can be subject to the bias associated with vote counting, because these qualitative summaries are based on the significance of the outcomes of previous studies without considering sample size and statistical power.

Meta-analysis begins by representing the outcome of each experiment by a quantitative index of the effect size. Effect size is chosen to reflect differences between experimental and control treatments in a way that is independent of sample size and scale of measurement. An effect size can be determined by dividing the difference between the treatment and the control by the pooled standard deviation. The average magnitude of the effect across all studies is determined, and whether that effect is significantly different from zero is tested. This procedure is not subject to the problems associated with vote counting, but some specificity and fine detail are lost in the analysis.
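The calculation can be sketched as follows from the summary statistics of three hypothetical trials. Note that formal meta-analysis weights each study by the inverse of its variance and attaches a confidence interval to the mean effect (Hedges and Olkin, 1985); this simplified version weights by sample size only.

```python
import numpy as np

def effect_size(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference: (treatment mean - control mean)
    divided by the pooled standard deviation."""
    pooled_sd = np.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                        / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Hypothetical summary statistics from three independent field trials
d = np.array([effect_size(8.1, 5.2, 2.0, 1.8, 10, 10),
              effect_size(6.4, 5.9, 2.5, 2.3, 25, 25),
              effect_size(9.0, 4.8, 3.1, 2.9, 6, 6)])
n = np.array([20, 50, 12])  # total sample size of each trial

print(np.average(d, weights=n))  # sample-size-weighted mean effect size
```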

The calculations involved in meta-analysis are relatively simple, but the gathering and handling of data can be complex. Cooper (1989), Cooper and Hedges (1993), and Light and Pillemer (1984) deal with some of the issues associated with handling data for meta-analysis. The problems associated with publication bias (e.g., not publishing negative results) and ways to deal with these issues have also been discussed (Hedges and Olkin, 1985; Cooper and Hedges, 1993). Gurevitch and Hedges (1993) provide a good introduction to the use of meta-analysis and some of the issues related to its use for ecological studies, which also have relevance to many agricultural situations.

10 References

Abbott, W. S. 1925. A method of computing the effectiveness of an insecticide. J. Econ. Entomol. 18, 265–267.
Anderson, D. E. and Lydic, R. 1977. On the effect of using ratios in the analysis of variance. Biobehav. Rev. 1, 225–229.
Bartlett, M. S. 1947. The use of transformations. Biometrics 3, 39–52.
Box, G. E. P. and Cox, D. R. 1964. An analysis of transformations. J. R. Stat. Soc., Ser. B 26, 211–243.
Box, G. E. P. and Jenkins, G. M. 1976. "Time Series Analysis: Forecasting and Control." Holden-Day, San Francisco.
Campbell, G. and Skillings, J. H. 1985. Nonparametric stepwise multiple comparison procedures. J. Am. Stat. Assoc. 80, 998–1003.
Cochran, W. G. 1951. Testing a linear relation among variances. Biometrics 7, 17–32.
Cochran, W. G. and Cox, G. M. 1957. "Experimental Designs." John Wiley & Sons, New York.
Cohen, J. 1988. "Statistical Power Analysis for the Behavioral Sciences." Lawrence Erlbaum, Hillsdale.
Connor, E. F. and Simberloff, D. 1986. Competition, scientific method and null models in ecology. Am. Sci. 75, 155–162.
Conover, W. J. 1999. "Practical Nonparametric Statistics, Third Edition." John Wiley and Sons, New York.
Conover, W. J. and Iman, R. L. 1976. On some alternative procedures using ranks for the analysis of experimental designs. Commun. Stat. Theory and Methods A5, 1349–1368.
Cooper, H. M. 1989. "Integrating Research: A Guide for Literature Reviews." Sage Publications, Newbury Park, CA.
Cooper, H. M. and Hedges, L. V. 1993. "Handbook of Research Synthesis." Russell Sage Foundation, New York.
Crowder, M. J. and Hand, D. J. 1990. "Analysis of Repeated Measures." Chapman and Hall, London.
Daniel, W. W. 1990. "Applied Nonparametric Statistics, Second Edition." PWS-KENT Publishing Co., Boston.
Day, R. W. and Quinn, G. P. 1989. Comparisons of treatments after an analysis of variance in ecology. Ecol. Monogr. 59, 433–463.
Dixon, P. M. 1993. The bootstrap and the jackknife: describing the precision of ecological indices. In "Design and Analysis of Ecological Experiments" (S. M. Scheiner and J. Gurevitch, Eds.), pp. 290–318. Chapman and Hall, New York.
Duncan, D. B. 1955. Multiple range and multiple F tests. Biometrics 11, 1–42.
Dunn, O. J. and Clark, V. A. 1987. "Applied Statistics: Analysis of Variance and Regression." John Wiley & Sons, New York.
Efron, B. and Tibshirani, R. 1993. "Introduction to the Bootstrap." Chapman & Hall, London.
Einot, I. and Gabriel, K. R. 1975. A study of the powers of several methods of multiple comparisons. J. Am. Stat. Assoc. 70, 574–583.
Federer, W. T. 1955. "Experimental Design." MacMillan Company, New York.
Finney, D. J. 1971. "Probit Analysis." Cambridge University Press, Cambridge.
Fisher, R. A. 1951. "Design of Experiments." Oliver and Boyd, Edinburgh.
Fligner, M. A. and Policello, G. E. 1981. Robust rank procedures for the Behrens-Fisher problem. J. Am. Stat. Assoc. 76, 162–168.
Fowler, N. 1990. The 10 most common statistical errors. Bull. Ecol. Soc. Am. 71, 161–164.
Gart, J. J. and Nam, J. 1988. Approximate interval estimation of the ratio of binomial parameters: a review and corrections for skewness. Biometrics 44, 323–338.
Girden, E. R. 1992. "ANOVA: Repeated Measures." Sage Publications, Newbury Park, CA.
Gomez, K. A. and Gomez, A. A. 1983. "Statistical Procedures for Agricultural Research." John Wiley & Sons, New York.
Gurevitch, J. and Hedges, L. V. 1993. Meta-analysis: combining the results of independent experiments. In "Design and Analysis of Ecological Experiments" (S. M. Scheiner and J. Gurevitch, Eds.), pp. 378–398. Chapman and Hall, New York.
Hedges, L. V. and Olkin, I. 1985. "Statistical Methods for Meta-Analysis." Academic Press, New York.
Henderson, C. F. and Tilton, E. W. 1955. Tests with acaricides against the brown wheat mite. J. Econ. Entomol. 48, 157–161.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1983. "Understanding Robust and Exploratory Data Analysis." John Wiley and Sons, New York.
Hollander, M. and Wolfe, D. A. 1973. "Nonparametric Statistical Methods." John Wiley & Sons, New York.
Huitema, B. E. 1980. "The Analysis of Covariance and Alternatives." Wiley Interscience, New York.
Hurlbert, S. H. 1984. Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54, 187–211.
Iman, R. L. 1974. A power study of a rank transform for the two-way classification model when interaction may be present. Can. J. Stat. Section C Applications 2, 227–239.
Jones, D. and Matloff, N. 1986. Statistical hypothesis testing in biology: a contradiction in terms. J. Econ. Entomol. 79, 1156–1160.
Kalbfleisch, J. D. and Prentice, R. L. 1980. "The Statistical Analysis of Failure Time Data." John Wiley and Sons, New York.
Koopman, P. A. R. 1994. Confidence intervals for the Abbott's formula correction of bioassay data for control response. J. Econ. Entomol. 87, 833.
Krauth, J. 1988. "Distribution-free Statistics: An Application-oriented Approach." Elsevier, Amsterdam.
Krebs, C. J. 1989. "Ecological Methodology." Harper & Row, New York.
Light, R. J. and Pillemer, D. B. 1984. "Summing Up: The Science of Reviewing Research." Harvard University Press, Cambridge, MA.
Littell, R. C., Milliken, G. A., Stroup, W. W., and Wolfinger, R. D. 1996. "SAS System for Mixed Models." SAS Institute Inc., Cary, NC.
Little, T. M. and Hills, F. J. 1978. "Agricultural Experimentation." John Wiley & Sons, New York.
Manly, B. F. J. 1997. "Randomization, Bootstrap and Monte Carlo Methods in Biology." CRC Press, Boca Raton.
Mansouri, H. and Chang, G.-H. 1995. A comparative study of some rank tests for interaction. Comput. Stat. Data Anal. 19, 85–96.
McArdle, B. H. and Gaston, K. J. 1992. Comparing population variabilities. Oikos 64, 610–612.
Mead, R. and Curnow, R. N. 1983. "Statistical Methods in Agricultural and Experimental Biology." Chapman and Hall, London.
Miller, R. G. 1981. "Simultaneous Statistical Inference." Springer-Verlag, Berlin.
Motulsky, H. 1995. "Intuitive Biostatistics." Oxford University Press, New York.
Pearce, S. C. 1983. "The Agricultural Field Experiment." John Wiley and Sons, New York.
Pearson, E. S. and Hartley, H. O. 1951. Charts for the power function for analysis of variance tests, derived from the non-central F-distribution. Biometrika 38, 112–130.
Randles, R. H. and Wolfe, D. A. 1979. "Introduction to the Theory of Nonparametric Statistics." John Wiley and Sons, New York.
Rasmussen, P. W., Heisey, D. M., Nordheim, E. V., and Frost, T. M. 1993. Time-series intervention analysis: unreplicated large-scale experiments. In "Design and Analysis of Ecological Experiments" (S. M. Scheiner and J. Gurevitch, Eds.), pp. 138–158. Chapman and Hall, New York.
Rice, W. R. 1989. Analyzing tables of statistical tests. Evolution 43, 223–225.
Robertson, J. L. and Preisler, H. K. 1992. "Pesticide Bioassays with Arthropods." CRC Press, Boca Raton, FL.
Rosenheim, J. A. and Hoy, M. A. 1989. Confidence intervals for the Abbott's formula correction of bioassay data for control response. J. Econ. Entomol. 82, 331–335.
Rust, S. W. and Fligner, M. A. 1984. A modification of the Kruskal-Wallis statistic for the generalized Behrens-Fisher problem. Commun. Stat. Part A – Theory and Methods 13, 2013–2027.
Ryan, T. A. 1959. Multiple comparisons in psychological research. Psych. Bull. 56, 26–47.
Scheffé, H. 1959. "The Analysis of Variance." John Wiley and Sons, New York.
Scheiner, S. M. 1993. Introduction: theories, hypotheses, and statistics. In "Design and Analysis of Ecological Experiments" (S. M. Scheiner and J. Gurevitch, Eds.), pp. 1–13. Chapman and Hall, New York.
Scheirer, C. J., Ray, W. S., and Hare, N. 1976. The analysis of ranked data derived from completely randomized factorial designs. Biometrics 32, 429–434.
Shrader-Frechette, K. S. and McCoy, E. D. 1992. Statistics, costs and rationality in ecological inference. Trends Ecol. Evol. 7, 96–99.
Siegel, S. and Castellan, N. J., Jr. 1988. "Nonparametric Statistics for the Behavioral Sciences." McGraw-Hill, New York.
Smith, P. L. 1981. The use of analysis of covariance to analyze data from designed experiments with missing or mixed-up values. Appl. Stat. 30, 1–8.
Snedecor, G. W. and Cochran, W. G. 1989. "Statistical Methods." Iowa State University Press, Ames, IA.
Sokal, R. R. and Rohlf, F. J. 1995. "Biometry: The Principles and Practice of Statistics in Biological Research, Third Edition." W. H. Freeman, San Francisco.
Stevens, J. 1996. "Applied Multivariate Statistics for the Social Sciences, Third Edition." Lawrence Erlbaum Associates, Mahwah, NJ.
Stewart-Oaten, A., Murdoch, W. W., and Parker, K. R. 1986. Environmental impact assessment: "pseudoreplication" in time. Ecology 67, 929–940.
Thompson, G. L. 1991. A note on the rank transform for interactions. Biometrika 78, 697–701.
Toft, C. A. and Shea, P. J. 1983. Detecting community-wide patterns: estimating power strengthens statistical inference. Am. Nat. 122, 618–625.
Tukey, J. W. 1949. One degree of freedom for non-additivity. Biometrics 5, 232–242.
Underwood, A. J. 1981. Techniques of analysis of variance in experimental marine biology and ecology. Ocean. Marine Biol. Annu. Rev. 19, 513–605.
Underwood, A. J. 1990. Experiments in ecology and management: their logics, functions and interpretations. Aust. J. Ecol. 15, 365–389.
Underwood, A. J. 1997. "Experiments in Ecology: Their Logical Design and Interpretation Using Analysis of Variance." Cambridge University Press, Cambridge, UK.
von Ende, C. 1993. Repeated-measures analysis: growth and other time-dependent measures. In "Design and Analysis of Ecological Experiments" (S. M. Scheiner and J. Gurevitch, Eds.), pp. 113–137. Chapman and Hall, New York.
Weerahandi, S. 1995. ANOVA under unequal error variances. Biometrics 51, 589–599.
Welch, B. L. 1938. The significance of the difference between two means when the population variances are unequal. Biometrika 29, 28–35.
Welch, B. L. 1951. On the comparison of several mean values: an alternative approach. Biometrika 38, 330–336.
Winer, B. J. 1971. "Statistical Principles in Experimental Design." McGraw-Hill, New York.
Winer, B. J., Brown, D. R., and Michel, K. M. 1991. "Statistical Principles in Experimental Design." McGraw-Hill, New York.
Young, L. J. and Young, J. H. 1991. Alternative view of statistical hypothesis testing. Environ. Entomol. 20, 1241–1245.
Zar, J. H. 1999. "Biostatistical Analysis, Fourth Edition." Prentice-Hall, Upper Saddle River, NJ.
Zimmerman, D. W. 2001. Increasing the power of the ANOVA F test for outlier-prone distributions by modified ranking methods. J. Gen. Psychol. 122, 83–94.