Transcript of philosophy on data analysis and research
As I see it, useful application of data analysis is ultimately a problem of strategic matching under
uncertainty. This calls for deeper theory-building beyond simple (or complex) data analysis. We want
to know not merely the characteristics of voters or consumers, but what choices they face and
their elasticities of substitution among those choices, conditional on their understanding of the universe.
As an example, consider the problem of voters and elections. We might know the characteristics of
the voters and their likely choice if they were constrained to, say, the Democrats and Republicans. But
the stability of the match is predicated on how much they value their "likely choice" relative to the next
best alternative and the metrics they use to evaluate them. Was Eric Cantor defeated because he was a
Republican or because he was not conservative enough? No. He was defeated because enough voters
evaluated him and his potential replacements based on metrics other than partisanship and
conventionally defined "liberal-conservative" ideology. What are these metrics? That is something that
we actually need to analyze in greater depth, not only to understand what these are, but also to
understand how large a departure from the "obvious" patterns they represent. It is awareness of these
relatively obscure mismatches that allows someone to gain an advantage from better understanding of
the data. Bigger sample sizes, paired with careful and self-aware application of statistical techniques,
help separate these patterns from the noise.
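To make "separating a pattern from the noise" concrete, here is a minimal sketch of one standard tool for the job: a two-proportion z-test comparing a subgroup against a comparison group. All counts below are made-up, illustrative numbers, not real election data; the point is only that the same observed gap clears the noise threshold at a large sample size but not at a small one.

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Normal-approximation z-test for a difference in proportions.

    Returns (z, two-sided p-value). Adequate for moderately large
    samples; not a substitute for an exact test on tiny ones.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value; standard normal CDF via the error function.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Illustrative (made-up) numbers: the same 10-point gap between a
# subgroup and a comparison group, at two different sample sizes.
z_small, p_small = two_proportion_ztest(33, 60, 27, 60)         # n = 60 each
z_large, p_large = two_proportion_ztest(1100, 2000, 900, 2000)  # n = 2000 each
```

The identical gap fails conventional significance at n = 60 but clears it easily at n = 2000, which is exactly the sense in which bigger samples let the genuine exceptions stand out from sampling noise.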
Of course, this is not a uniquely "political" problem. This is the problem of reverse platoon splits in
baseball, female RPG gamers in the gaming industry, and all manner of odd couples in dating--basically
anything where there are "big obvious patterns" with significant (and potentially important) possibilities
of exceptions. If anything, being too attached to the truisms of a particular field gets in the way of
identifying these patterns. We "know" the obvious patterns in "politics," for example: that in the US, partisanship and
ideology are statistically the most important determinants of voter decision. Thus, we might conclude
(mistakenly, I insist) that Cantor was defeated because he was a Republican and/or too conservative.
Being too aware of the big patterns can prevent us from seeing the exception and countertrends.
(There was a nifty poli sci article showing this empirical pattern lately, although minus the theorizing.
Alas!) If the goal of data analysis is to identify and measure these exceptions and countertrends and
to conceptualize how to profit from them, the analyst must always be open to the possibility of caveats,
exceptions, and conditionalities, and constantly prepared to measure them even while aware of the
big patterns. As I see it, this is the essential requirement.
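One standard way to stay anchored on the big pattern while still measuring candidate exceptions is shrinkage (empirical-Bayes-style partial pooling): a subgroup estimate based on a small sample gets pulled hard toward the overall average, while a large sample mostly keeps its own signal. The sketch below is illustrative only; the prior strength and all rates are made-up numbers, not fitted values from any real platoon-split data.

```python
def shrink_rate(successes, trials, prior_mean, prior_strength=100.0):
    """Shrink an observed subgroup rate toward the overall ("big
    pattern") rate. Equivalent to a beta prior carrying
    `prior_strength` pseudo-observations at `prior_mean`.
    """
    return (successes + prior_strength * prior_mean) / (trials + prior_strength)

league_avg = 0.260  # the "big obvious pattern" (hypothetical overall rate)

# A .400 observed rate hinting at a reverse platoon split: 30 plate
# appearances barely move the estimate off the league average, while
# 600 appearances at the same observed rate move it substantially.
small = shrink_rate(12, 30, league_avg)    # observed .400, tiny sample
large = shrink_rate(240, 600, league_avg)  # observed .400, large sample
```

The design choice here is the essay's point in miniature: you neither ignore the exception nor take it at face value; the data earns its departure from the big pattern in proportion to the evidence behind it.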
My approach to research and data analysis follows from this philosophy. The first step is to understand
the substance as much as possible, with two mutually reinforcing goals. I want to know what the big
patterns are and where and when to expect the breaks from these patterns. This may require an
understanding of the substance at the theoretical, foundational level beyond just being aware of the
facts. The second step is to think about the statistical approach to the problem, recognizing that I
still may not know all the moving parts in the data. This should never be just a formulaic approach to a
big clump of data. If I had done my homework properly, I should have a good idea as to where the
exceptions and countertrends might pop up. I'd want to know whether that is in fact true and if so, how
significant (both in statistical and conventional sense) these countertrends are. After all, I'd like to know
if I can make money, win baseball games, or win elections based on these counterpatterns and how
much to invest in them. The third step is to reevaluate the data. While the basic data-cleanup should
have taken place before the second step (i.e., checking for obvious data formatting problems, accounting
for possible sampling bias and appropriately weighting the sample, etc.), there is always a chance that
there is some unexpected, potentially significant source of bias behind the observed result. (There
was a clinical trial of a new medicine that initially could not be replicated because some specific detail of
how the medicine was administered turned out to be important.) The fourth step, following up on the
reevaluation of the data, should be a reexamination of the theory, once again with the expectation that
the theory, even when consistent with the data, is likely to have room for exceptions and countertrends
that can be identified. That's a good thing: every good theory will be copied and used by everyone
sooner or later. You want to start thinking about how to stay ahead of the competition, by knowing
where to look for the next set of exceptions and countertrends that you could profit from once the
copycats are poaching on your first theory.
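The sample-weighting mentioned in the third step can be sketched as a simple post-stratification adjustment: reweight each group to its population share so that an over-sampled group does not distort the overall estimate. The group labels and all numbers below are hypothetical, chosen only to show the mechanics.

```python
def poststratify(sample_shares, population_shares, group_means):
    """Return the raw (sample-weighted) and adjusted
    (population-weighted) overall estimates from group-level means.
    """
    raw = sum(sample_shares[g] * group_means[g] for g in group_means)
    adjusted = sum(population_shares[g] * group_means[g] for g in group_means)
    return raw, adjusted

# Hypothetical survey: young voters are half the sample but only a
# fifth of the electorate, and they behave differently.
sample_shares     = {"young": 0.5, "old": 0.5}
population_shares = {"young": 0.2, "old": 0.8}
group_means       = {"young": 0.70, "old": 0.40}  # e.g., support rate

raw, adjusted = poststratify(sample_shares, population_shares, group_means)
```

The gap between the raw and adjusted figures is precisely the kind of "unexpected source of bias" the third step is meant to catch before a counterpattern is mistaken for a real one.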