Hypothesis-Driven and Exploratory Data Analysis
8/12/2019 Hypothesis-Driven and Exploratory Data Analysis
The 14th-century maxim known as Ockham's Razor, paraphrased by Jefferys and Berger (1992) as "It
is vain to do with more what can be done with less", is usually applied to the interpretation of
scientific results. However, it applies equally well to choice of analysis. Thus if one has a very simple
ecological data set, consisting of few species and few samples, ordination is not worthwhile. In such a
case, the data are easiest to interpret in a simple table.
In a typical data set, however, there are dozens of species and samples. It is impossible for the human
mind to simultaneously contemplate dozens of dimensions. The purpose of ordination is to assist the
implementation of Ockham's Razor: a few dimensions are easier to understand than many dimensions.
A good ordination technique will be able to determine the most important dimensions (or gradients) in
a data set, and ignore "noise" or chance variation.
Both direct and indirect gradient analysis have the potential to reduce the dimensionality of a data set.
However, reduction of dimensionality is not the only reason to use ordination. Before the
development of CCA, most widely-used ordination techniques were indirect, and the primary goal of
ordination was considered "exploratory" (Gauch 1982). It was the job of the ecologist to use his or her
knowledge and intuition to collect and interpret data; pure objectivity could potentially interfere with
the ability to distinguish important gradients. Ordination was often considered as much an art as a
science.
Once CCA was available, multivariate direct gradient analysis became feasible. It became possible to
rigorously test statistical hypotheses and go beyond mere "exploratory" analysis. However, testing
hypotheses requires complete objectivity, which results in repeatability and falsifiability. The two
basic motivations for multivariate direct gradient analysis, hypothesis testing and exploratory analysis,
conflict with each other to some extent:
Table 1. Hypothesis-driven analysis, exploratory analysis, and their major characteristics and
motivations. This table applies to regression techniques and indirect gradient analysis in addition to
CCA.
HYPOTHESIS-DRIVEN | EXPLORATORY
Motivating Question: "Can I reject the null hypothesis that species are unrelated to a postulated environmental factor or factors?" | Motivating Question: "How can I optimally explain or describe variation in my data set?"
objective | subjective
sites must be representative of universe: random, stratified random, regular placement | sites can be "encountered" or subjectively located
analyses must be planned a priori | "data diving" permissible; post-hoc analyses, explanations, hypotheses OK
p-values meaningful | p-values only a rough guide
stepwise techniques not valid without cross-validation | stepwise techniques (e.g. forward selection) valid and useful
To perform a hypothesis-driven analysis, one must be very specific about the analyses one wishes to
perform. The null hypothesis must be clearly stated, and the data must be collected in a repeatable
manner. Usually, the sampling design will involve random, stratified random, or regular distribution of
study plots. If there is any subjectivity involved in locating or orienting study plots, the results are
technically not valid. All of the analyses, including variations of data transformation and use of
different ordination options (e.g. detrending or not), must be planned in advance, or else the user runs
the risk of "data diving" or "data mining", i.e. getting an artificially significant result because so many
options are tried. Stepwise techniques (discussed later) are automated forms of "data diving", and will
typically also lead to incorrect statistical inference (Cliff 1987, Draper and Smith 1981). The reward
for rigorously adhering to these rather stringent criteria is that the statistical inference (i.e. the p-value)
is valid.
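The cost of trying many options can be illustrated with a little arithmetic. The sketch below is not part of the original page. It assumes that the analyses are statistically independent and that every null hypothesis is true. Variants run on the same data are usually correlated, so the numbers are only a rough guide. The function name familywise_error_rate is ours.

```python
def familywise_error_rate(alpha: float, n_analyses: int) -> float:
    """Chance that at least one of n_analyses tests is 'significant'
    at level alpha when every null hypothesis is true.

    Assumes the tests are independent (an idealization)."""
    return 1.0 - (1.0 - alpha) ** n_analyses


# One planned analysis versus several unplanned variations.
for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
# 1 0.05
# 5 0.226
# 10 0.401
# 20 0.642
```

Under these assumptions, an investigator who tries ten variations and reports the best one has roughly a 40% chance of a "significant" result from noise alone.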
Exploratory analyses might lack statistical rigor, but they are still a mainstay of vegetation research.
The purpose of exploratory analysis is to find pattern in nature, which is an inherently subjective
enterprise. Exploratory analyses incorporate the wisdom, skill, and intuition of the investigator into
the experiment. Unless you can find another investigator with identical wisdom, skill and intuition, the
analyses are not strictly repeatable, and are hence not falsifiable. While it is possible to perform
exploratory analyses on sample plots located according to a rigorous, objective sampling design, such
careful placement is not necessary. Indeed, an exploratory analysis can be aided if the investigator
subjectively places study plots in locations he or she considers to be important or interesting.
Orienting plots within vegetation which appears homogeneous is highly subjective, but very useful in
evaluating differences between plots.
With exploratory analysis, "data diving" (e.g. using different transformations of species abundances,
adjusting ordination options, selecting different subsets of environmental variables, or selecting
different subsets of study plots) is no longer to be avoided. Instead, it is a way for the investigator to
learn more about the data set. Stepwise analysis is a form of automated data diving. It is useful as a
tool to help discover "important" or "interesting" variables.
Ecologists are often misled into thinking that p-values from stepwise methods have a rigorous
meaning, and that the results of stepwise methods give the best possible model. Such thinking is false.
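A small simulation can make the point concrete. The sketch below is not from the original page. It mimics only the first step of forward selection, using pure noise: a random response and several random "environmental variables", with the variable of largest absolute correlation selected. The plot and variable counts, the function names, and the use of a simple Pearson correlation in place of a full ordination are all illustrative assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_abs_r(rng, n_plots, n_vars):
    """Largest |r| between a noise response and n_vars noise predictors,
    i.e. the variable that step 1 of forward selection would pick."""
    y = [rng.gauss(0, 1) for _ in range(n_plots)]
    return max(
        abs(pearson_r([rng.gauss(0, 1) for _ in range(n_plots)], y))
        for _ in range(n_vars)
    )


def false_positive_rate(n_plots=20, n_vars=10, n_sims=2000, seed=1):
    """How often the selected noise variable looks 'significant' at 5%
    when judged as if it had been the only variable planned a priori."""
    rng = random.Random(seed)
    # Simulated null distribution of |r| for ONE pre-planned variable
    # gives the 5% critical value.
    single = sorted(best_abs_r(rng, n_plots, 1) for _ in range(n_sims))
    critical = single[int(0.95 * n_sims)]
    # Now let selection choose the best of n_vars noise variables.
    selected = [best_abs_r(rng, n_plots, n_vars) for _ in range(n_sims)]
    return sum(r > critical for r in selected) / n_sims


print(false_positive_rate())
```

With ten candidate variables the returned rate should be far above the nominal 0.05, even though no variable has any real relationship with the response.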
It is possible to combine exploratory analysis and hypothesis-driven analysis into a larger study. One
way of doing this is to perform a 2-phase study, in which the first phase is an exploratory analysis,
perhaps involving subjectively located plots and employing many variations on analysis. The patterns
found in the first phase are then posed as hypotheses for the second phase. The second phase involves
the collection of fresh data from objectively located plots, and an entirely planned data analysis.
A second way to combine the two major types of analysis is through data set subdivision. The data set
is randomly divided into two subsets: an exploratory subset and a confirmatory subset (alternatively
called model building and model validation, respectively). Many, varied analyses can be performed on
the exploratory subset (including stepwise analysis) - and such analyses can be based upon intuition,
hunches, or superstition. If interesting patterns are found with respect to particular environmental
variables, and using particular data transformations, these patterns can be statistically tested using the
confirmatory subset. To use data set subdivision properly, samples must be objectively located.
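The random division itself is simple to carry out. The sketch below is not from the original page. It assumes that plots are identified by arbitrary labels and that an even split is wanted. The function name split_plots and its parameters are hypothetical.

```python
import random


def split_plots(plot_ids, confirmatory_fraction=0.5, seed=None):
    """Randomly divide plots into (exploratory, confirmatory) subsets.

    The split is made once, before any analysis, and the confirmatory
    subset is left untouched until the hypotheses have been fixed."""
    rng = random.Random(seed)
    shuffled = list(plot_ids)
    rng.shuffle(shuffled)
    n_confirm = round(len(shuffled) * confirmatory_fraction)
    return shuffled[n_confirm:], shuffled[:n_confirm]


# Example: 40 plots numbered 1 to 40; recording the seed makes the split repeatable.
exploratory, confirmatory = split_plots(range(1, 41), seed=42)
print(len(exploratory), len(confirmatory))  # 20 20
```

Any amount of data diving can then be done on the exploratory plots, while only the analyses fixed in advance are run on the confirmatory plots.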
Literature cited
(see also selected references for self-education)
Cliff, N. 1987. Analyzing Multivariate Data. Harcourt Brace Jovanovich, Publishers, San Diego,
California.
Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.
Gauch, H. G., Jr. 1982. Multivariate Analysis and Community Structure. Cambridge University Press,
Cambridge.
Hallgren, E., M. W. Palmer, and P. Milberg. 1999. Data diving with cross validation: an investigation
of broad-scale gradients in Swedish weed communities. Journal of Ecology 87:1037-1051.
Jefferys, W. H., and J. O. Berger. 1992. Ockham's Razor and Bayesian Analysis. American Scientist 80:64-72.
This page was created and is maintained by Michael Palmer
Hypothesis-Driven and Exploratory Data Analysis http://ordination.okstate.edu/motivate
of 3 6/5/14, 6:44