Philosophy on data analysis and research

As I see it, useful application of data analysis is ultimately a problem of strategic matching under uncertainty. This calls for deeper theory-building beyond simple (or complex) data analysis. We want to know not merely the characteristics of the voters or the consumers, but what choices they face and their elasticities of substitution among those choices, conditional on their understanding of the universe. As an example, consider the problem of voters and elections. We might know the characteristics of the voters and their likely choice if they were constrained to, say, the Democrats and the Republicans. But the stability of the match is predicated on how much they value their "likely choice" relative to the next best alternative, and on the metrics they use to evaluate them. Was Eric Cantor defeated because he was a Republican, or because he was not conservative enough? No. He was defeated because enough voters evaluated him and his potential replacements on metrics other than partisanship and conventionally defined "liberal-conservative" ideology. What are these metrics? That is something we need to analyze in greater depth, not only to understand what they are, but also to understand how large a departure from the "obvious" patterns they represent. It is awareness of these relatively obscure mismatches that allows someone to gain an advantage from a better understanding of the data. Bigger sample sizes, paired with careful and self-aware application of statistical techniques, help separate these patterns from the noise.
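
To make that last point concrete, here is a minimal sketch (Python, simulated data, made-up variable names such as "service" and "primary_voter") of how one might probe for a departure of this kind: the "big pattern" is that partisanship predicts the vote, and the question is whether some subgroup is actually deciding on a different metric. An interaction term in a logistic regression is one simple way to ask, and rerunning the sketch at a smaller n shows why bigger samples are needed before the interaction separates from noise.

```python
# Illustrative sketch only: simulated voters for whom partisanship is the big
# pattern, but a subgroup ("primary_voter", a made-up label) weighs a different
# metric ("service") much more heavily.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000  # try 500 vs. 5000: the interaction only clears the noise with enough data

partisanship = rng.normal(size=n)             # alignment with the incumbent's party
service = rng.normal(size=n)                  # the "other" metric voters may care about
primary_voter = rng.binomial(1, 0.2, size=n)  # subgroup where the other metric dominates

# Simulated truth: partisanship matters for everyone; service matters mostly in the subgroup.
utility = 1.5 * partisanship + 0.2 * service + 1.2 * primary_voter * service
vote_incumbent = rng.binomial(1, 1 / (1 + np.exp(-utility)))

df = pd.DataFrame({
    "vote": vote_incumbent,
    "partisanship": partisanship,
    "service": service,
    "primary_voter": primary_voter,
})

# The interaction term is the "exception to the big pattern" we are probing for.
model = smf.logit("vote ~ partisanship + service * primary_voter", data=df).fit(disp=0)
print(model.summary())
```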

Of course, this is not a uniquely "political" problem. This is the problem of reverse platoon splits in baseball, female RPG gamers in the gaming industry, and all manner of odd couples in dating--basically anything where there are "big obvious patterns" alongside significant (and potentially important) possibilities of exceptions. If anything, being too attached to the truisms of a particular field gets in the way of identifying these patterns. We "know" the obvious patterns in "politics," for example: that in the US, partisanship and ideology are statistically the most important determinants of voter decisions. Thus, we might conclude (mistakenly, I insist) that Cantor was defeated because he was a Republican and/or too conservative. Being too aware of the big patterns can prevent us from seeing the exceptions and countertrends. (There was a nifty political science article showing this empirical pattern recently, although minus the theorizing. Alas!) If the goal of data analysis is to identify and measure these exceptions and countertrends and to conceptualize how to profit from them, the analyst must always be open to the possibility of caveats, exceptions, and conditionalities, and constantly prepared to measure them, even while remaining aware of the big patterns. As I see it, this is the essential requirement.
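
The baseball example above lends itself to the same treatment. The sketch below is again only illustrative, assuming a hypothetical plate-appearance log with columns named batter, pitcher_hand, batter_hand, and woba; it flags batters whose platoon split runs against the league-wide pattern, while insisting on a minimum sample before taking any reversal seriously.

```python
# Illustrative only: "pa" is an assumed plate-appearance log with columns
# batter, pitcher_hand ("L"/"R"), batter_hand ("L"/"R"), and woba (the outcome).
import pandas as pd

def reverse_platoon_candidates(pa: pd.DataFrame, min_pa: int = 150) -> pd.DataFrame:
    # Big pattern: batters generally do better against opposite-handed pitching.
    opposite = pa["pitcher_hand"] != pa["batter_hand"]
    pa = pa.assign(matchup=opposite.map({True: "opposite", False: "same"}))

    stats = pa.groupby(["batter", "matchup"])["woba"].agg(["mean", "count"]).unstack("matchup")
    mean_opp, mean_same = stats[("mean", "opposite")], stats[("mean", "same")]
    n_opp, n_same = stats[("count", "opposite")], stats[("count", "same")]

    out = pd.DataFrame({
        "split": mean_opp - mean_same,   # negative = runs against the league-wide pattern
        "pa_opposite": n_opp,
        "pa_same": n_same,
    })
    # Require a real sample on both sides before calling anything a reversal.
    enough = (out["pa_opposite"] >= min_pa) & (out["pa_same"] >= min_pa)
    return out[enough & (out["split"] < 0)].sort_values("split")
```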

My approach to research and data analysis follows from this philosophy. The first step is to understand the substance as much as possible, with two mutually reinforcing goals: I want to know what the big patterns are, and where and when to expect breaks from those patterns. This may require understanding the substance at a theoretical, foundational level, beyond just being aware of the facts. The second step is to think about the statistical approach to the problem, with the recognition that I still may not know all the moving parts in the data. This should never be just a formulaic approach applied to a big clump of data. If I have done my homework properly, I should have a good idea of where the exceptions and countertrends might pop up. I want to know whether that is in fact true and, if so, how significant (in both the statistical and the conventional sense) these countertrends are. After all, I would like to know whether I can make money, win baseball games, or win elections based on these counterpatterns, and how much to invest in them.
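
Once a countertrend shows up, the follow-up question is its size: a point estimate with a confidence interval (the statistical sense) and a rough translation into payoff (the conventional sense), since the latter is what governs how much to invest. A bare-bones sketch, with made-up numbers standing in for real subgroup data:

```python
# Illustrative numbers only: success rates inside and outside a hypothetical subgroup.
import numpy as np
from scipy import stats

wins_sub, n_sub = 130, 400        # subgroup where the countertrend is suspected
wins_rest, n_rest = 2100, 9600    # everyone else (the "big pattern")

p_sub, p_rest = wins_sub / n_sub, wins_rest / n_rest
diff = p_sub - p_rest

# Wald 95% confidence interval and two-sided p-value for the difference in proportions.
se = np.sqrt(p_sub * (1 - p_sub) / n_sub + p_rest * (1 - p_rest) / n_rest)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
p_value = 2 * stats.norm.sf(abs(diff) / se)
print(f"countertrend: {diff:+.3f} (95% CI {lo:+.3f} to {hi:+.3f}, p = {p_value:.4f})")

# Practical significance: translate the effect into an expected payoff, using
# assumed values for what one extra success is worth and how many cases exist.
value_per_case = 40.0
reachable_cases = 5000
print("payoff if the low end of the CI holds:",
      round(max(lo, 0) * reachable_cases * value_per_case, 2))
```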

The third step is to reevaluate the data. While the basic data cleanup should have taken place before the second step (i.e., checking for obvious data formatting problems, accounting for possible sampling bias and appropriately weighting the sample, etc.), there is always a chance that some unexpected source of bias led to the observed result, and that its effect is significant. (There was a clinical trial of a new medicine whose results initially could not be replicated because some specific detail of how the medicine was administered turned out to be important.)
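
On the weighting point specifically, one common and simple version of the fix is post-stratification: reweight the sample so that the cells you can observe match known population shares, then compute weighted rather than raw summaries. A sketch, assuming a hypothetical survey DataFrame with age_group and region columns and a table of population shares for those same cells:

```python
# Sketch of post-stratification weighting. "survey" and "population_share" are
# assumed inputs: the survey has age_group and region columns plus an outcome,
# and population_share maps each (age_group, region) cell to its census share.
import pandas as pd

def poststratify(survey: pd.DataFrame, population_share: pd.Series) -> pd.DataFrame:
    cells = ["age_group", "region"]
    sample_share = survey.groupby(cells).size() / len(survey)
    # Weight = how over- or under-represented each cell is in the sample.
    weights = (population_share / sample_share).rename("weight")
    return survey.join(weights, on=cells)

# Usage: a weighted mean of the outcome instead of the raw mean.
# weighted = poststratify(survey, population_share)
# estimate = (weighted["outcome"] * weighted["weight"]).sum() / weighted["weight"].sum()
```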

The fourth step, following up on the reevaluation of the data, should be a reexamination of the theory, once again with the expectation that the theory, even when consistent with the data, is likely to leave room for exceptions and countertrends that can be identified. That's a good thing: every good theory will be copied and used by everyone sooner or later. You want to start thinking about how to stay ahead of the competition by knowing where to look for the next set of exceptions and countertrends that you could profit from once the copycats are poaching on your first theory.