Hypothesis-Driven and Exploratory Data Analysis

The 14th-century maxim known as Ockham's Razor, paraphrased by Jefferys and Berger (1992) as "It

is vain to do with more what can be done with less", is usually applied to the interpretation of

scientific results. However, it applies equally well to the choice of analysis. Thus, if one has a very simple

ecological data set, consisting of few species and few samples, ordination is not worthwhile. In such a

case, the data are easiest to interpret in a simple table.

In a typical data set, however, there are dozens of species and samples. It is impossible for the human

mind to simultaneously contemplate dozens of dimensions. The purpose of ordination is to assist the

implementation of Ockham's Razor: a few dimensions are easier to understand than many dimensions.

A good ordination technique will be able to determine the most important dimensions (or gradients) in

a data set, and ignore "noise" or chance variation.

Both direct and indirect gradient analysis have the potential to reduce the dimensionality of a data set.

However, reduction of dimensionality is not the only reason to use ordination. Before the

development of CCA, most widely-used ordination techniques were indirect, and the primary goal of

ordination was considered "exploratory" (Gauch 1982). It was the job of the ecologist to use his or her

knowledge and intuition to collect and interpret data; pure objectivity could potentially interfere with

the ability to distinguish important gradients. Ordination was often considered as much an art as a

science.

Once CCA was available, multivariate direct gradient analysis became feasible. It became possible to

rigorously test statistical hypotheses and go beyond mere "exploratory" analysis. However, testing hypotheses requires complete objectivity, which results in repeatability and falsifiability. The two

basic motivations for multivariate direct gradient analysis, hypothesis testing and exploratory analysis,

conflict with each other to some extent:

Table 1. Hypothesis-driven analysis, exploratory analysis, and their major characteristics and

motivations. This table applies to regression techniques and indirect gradient analysis in addition to

CCA.

HYPOTHESIS-DRIVEN | EXPLORATORY
Motivating question: "Can I reject the null hypothesis that species are unrelated to a postulated environmental factor or factors?" | Motivating question: "How can I optimally explain or describe variation in my data set?"
objective | subjective
sites must be representative of universe: random, stratified random, regular placement | sites can be "encountered" or subjectively located
analyses must be planned a priori | "data diving" permissible; post-hoc analyses, explanations, hypotheses OK
p-values meaningful | p-values only a rough guide
stepwise techniques not valid without cross-validation | stepwise techniques (e.g. forward selection) valid and useful

To perform a hypothesis-driven analysis, one must be very specific about the analyses one wishes to

perform. The null hypothesis must be clearly stated, and the data must be collected in a repeatable

manner. Usually, the sampling design will involve random, stratified random, or regular distribution of

study plots. If there is any subjectivity involved in locating or orienting study plots, the results are

technically not valid. All of the analyses, including variations of data transformation and use of different ordination options (e.g. detrending or not), must be planned in advance, or else the user runs

the risk of "data diving" or "data mining", i.e. getting an artificially significant result because so many

options are tried. Stepwise techniques (discussed later) are automated forms of "data diving", and will

typically also lead to incorrect statistical inference (Cliff 1987, Draper and Smith 1981). The reward

for rigorously adhering to these rather stringent criteria is that the statistical inference (i.e. the p-value)

is valid.
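
The inflation of significance caused by trying many options can be illustrated with a short simulation. The Python sketch below is not part of the original analysis. It assumes, for simplicity, that each option yields an independent test of a true null hypothesis, so that each p-value is uniformly distributed between 0 and 1. In practice, variations on one data set are correlated, so the inflation is usually smaller than shown here, but it does not disappear. The function names and parameter values are hypothetical.

```python
import random


def familywise_error_rate(n_options, alpha=0.05):
    """Chance of at least one p < alpha among n independent tests of a true null."""
    return 1.0 - (1.0 - alpha) ** n_options


def simulate_data_diving(n_options, alpha=0.05, n_trials=20000, seed=0):
    """Estimate how often the best of n_options p-values falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Assumption: under a true null, a valid p-value is uniform on [0, 1].
        best_p = min(rng.random() for _ in range(n_options))
        if best_p < alpha:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(familywise_error_rate(k), 3), round(simulate_data_diving(k), 3))
```

Under these assumptions, trying ten options gives roughly a 40% chance of at least one p-value below 0.05, even though no real relationship exists.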

Exploratory analyses might lack statistical rigor, but they are still a mainstay of vegetation research.

The purpose of exploratory analysis is to find pattern in nature, which is an inherently subjective

enterprise. Exploratory analyses incorporate the wisdom, skill, and intuition of the investigator into

the experiment. Unless you can find another investigator with identical wisdom, skill and intuition, the analyses are not strictly repeatable, and are hence not falsifiable. While it is possible to perform

exploratory analyses on sample plots located according to a rigorous, objective sampling design, such

careful placement is not necessary. Indeed, an exploratory analysis can be aided if the investigator

subjectively places study plots in locations he or she considers to be important or interesting.

Orienting plots within vegetation which appears homogeneous is highly subjective, but very useful in

evaluating differences between plots.

With exploratory analysis, "data diving" (e.g. using different transformations of species abundances,

adjusting ordination options, selecting different subsets of environmental variables, or selecting

different subsets of study plots) is no longer to be avoided. Instead, it is a way for the investigator to

learn more about the data set. Stepwise analysis is a form of automated data diving. It is useful as a tool to help discover "important" or "interesting" variables.

Ecologists are often misled into thinking that p-values from stepwise methods have a rigorous

meaning, and that the results of stepwise methods give the best possible model. Such thinking is false.
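
Why p-values from a selection step lack rigorous meaning can be seen in a simplified sketch. The Python example below is an illustration under stated assumptions, not the procedure of any particular ordination program: it uses a single response variable rather than a species matrix, performs only the first step of forward selection (choosing the candidate variable most strongly correlated with the response), and works on pure noise. It compares a naive permutation p-value, computed for the chosen variable as if it had been specified in advance, with a p-value that repeats the selection inside every permutation. All names and default values are hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def selection_p_values(n_plots=30, n_variables=20, n_permutations=199, seed=1):
    """Return (index of chosen variable, naive p-value, selection-aware p-value)."""
    rng = random.Random(seed)
    # Pure noise: the response is unrelated to every candidate variable.
    response = [rng.gauss(0, 1) for _ in range(n_plots)]
    variables = [[rng.gauss(0, 1) for _ in range(n_plots)]
                 for _ in range(n_variables)]

    # One step of forward selection: keep the strongest candidate.
    strengths = [abs(pearson_r(v, response)) for v in variables]
    best = max(range(n_variables), key=lambda j: strengths[j])
    observed = strengths[best]

    naive_hits = 0
    honest_hits = 0
    for _ in range(n_permutations):
        shuffled = response[:]
        rng.shuffle(shuffled)
        # Naive: test only the variable that was already selected.
        if abs(pearson_r(variables[best], shuffled)) >= observed:
            naive_hits += 1
        # Selection-aware: redo the selection step on the permuted data.
        if max(abs(pearson_r(v, shuffled)) for v in variables) >= observed:
            honest_hits += 1

    naive_p = (naive_hits + 1) / (n_permutations + 1)
    honest_p = (honest_hits + 1) / (n_permutations + 1)
    return best, naive_p, honest_p


if __name__ == "__main__":
    print(selection_p_values())
```

Because both p-values are computed from the same permutations, the selection-aware value can never be smaller than the naive one. The gap between them is the part of the apparent significance that comes from the selection itself.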

It is possible to combine exploratory analysis and hypothesis-driven analysis into a larger study. One

way of doing this is to perform a 2-phase study, in which the first phase is an exploratory analysis,

perhaps involving subjectively located plots and employing many variations on analysis. The patterns

found in the first phase are then posed as hypotheses for the second phase. The second phase involves

the collection of fresh data from objectively located plots, and an entirely planned data analysis.

A second way to combine the two major types of analysis is through data set subdivision. The data set

is randomly divided into two subsets: an exploratory subset and a confirmatory subset (alternatively

called model building and model validation, respectively). Many, varied analyses can be performed on

the exploratory subset (including stepwise analysis) - and such analyses can be based upon intuition,

hunches, or superstition. If interesting patterns are found with respect to particular environmental

variables, and using particular data transformations, these patterns can be statistically tested using the


confirmatory subset. To use data set subdivision properly, samples must be objectively located.
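
The random division itself is simple to carry out. The following Python sketch is one possible way to do it, assuming the samples are identified by plot numbers and that an even split is wanted; the function name, the fraction, and the seed are hypothetical choices, not recommendations from the text.

```python
import random


def split_exploratory_confirmatory(sample_ids, exploratory_fraction=0.5, seed=42):
    """Randomly divide sample identifiers into two non-overlapping subsets."""
    rng = random.Random(seed)  # fixed seed so the division can be reproduced
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_exploratory = round(len(shuffled) * exploratory_fraction)
    exploratory = sorted(shuffled[:n_exploratory])
    confirmatory = sorted(shuffled[n_exploratory:])
    return exploratory, confirmatory


if __name__ == "__main__":
    plots = list(range(1, 41))  # hypothetical plot numbers 1..40
    exploratory, confirmatory = split_exploratory_confirmatory(plots)
    print("exploratory: ", exploratory)
    print("confirmatory:", confirmatory)
```

The confirmatory subset should be set aside before any analysis begins and used only once, for the tests suggested by the exploratory subset.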

Literature cited

(see also selected references for self-education)

Cliff, N. 1987. Analyzing Multivariate Data. Harcourt Brace Jovanovich, Publishers, San Diego,

California.

Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.

Gauch, H. G., Jr. 1982. Multivariate Analysis and Community Structure. Cambridge University Press,

Cambridge.

Hallgren, E., M. W. Palmer, and P. Milberg. 1999. Data diving with cross validation: an investigation

of broad-scale gradients in Swedish weed communities. Journal of Ecology 87:1037-1051.

Jefferys, W. H., and J. O. Berger. 1992. Ockham's Razor and Bayesian Analysis. American Scientist 80:64-72.

This page was created and is maintained by Michael Palmer
