
6 Some Misuses of Statistics in Social Science

Recall from the Introduction that the best way to figure out if someone is using an erroneous statistical technique is to use such technique on a dataset for which you have the answer. The best way to know the exact properties is to generate it by Monte Carlo. So the technique throughout the chapter is to generate fat-tailed data, the properties of which we know with precision, and check how such standard and mechanistic methods detect the true properties, then show the wedge between observed and true properties.

Also recall from Chapter x that fat tails make it harder for someone to detect the true properties; for this we need a much, much larger dataset, more rigorous ranking techniques allowing inference in one direction not another (Chapter 2), etc. In a way this is an application of the theorems and rules of Chapter 2.
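As a concrete illustration, here is a minimal Monte Carlo sketch in Python of the chapter's method; the choice of generator (a Student-t with tail exponent 3, whose fourth moment is infinite) and the run counts are illustrative assumptions, not the chapter's own code:

```python
# Sketch: generate data whose true properties we know exactly, then watch
# a standard, mechanistic estimator mis-report them (assumed setup).
import numpy as np

rng = np.random.default_rng(1)
alpha = 3.0           # tail exponent of the generator; true kurtosis is infinite
n, runs = 1000, 1000

sample_kurtosis = []
for _ in range(runs):
    x = rng.standard_t(alpha, size=n)
    z = (x - x.mean()) / x.std()
    sample_kurtosis.append((z ** 4).mean())

# The observed kurtosis is finite and wildly unstable from run to run,
# while the true fourth moment does not exist: the observed/true wedge.
print("median sample kurtosis:", np.median(sample_kurtosis))
print("max sample kurtosis   :", np.max(sample_kurtosis))
```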

Figure 6.1: Fragilista Gintis (about The Black Swan): "the plural of anecdote is not data", a representative (but elementary) violation of probability theory. For large deviations, n = 1 is plenty of data (see maximum of divergence (Lévy, Petrov), or Kolmogorov-Smirnoff): looking at the extremum of a time series is not cherry picking. Remarkably, such imbeciles fall for the opposite mistake, the "n-large", in thinking that confirmatory observations provide "p-values". All these errors are magnified by fat tails.

6.1 Attribute Substitution

It occurs when an individual has to make a judgment (of a target attribute) that is complicated or complex, and instead substitutes a more easily calculated one. There have been many papers (Kahneman and Tversky [5], Hoggarth and Soyer [3], and comment [4]) showing how statistical researchers overinterpret their own findings, as simplification leads to the fooled by randomness effect.

Dan Goldstein and this author (Goldstein and Taleb [1]) showed how professional researchers and practitioners substitute norms in the evaluation of higher order properties of time series, mistaking $\|x\|_1$ for $\|x\|_2$ (that is, $\sum |x|$ for $\sqrt{\sum x^2}$). The common result is underestimating the randomness of the estimator $M$, in other words reading too much into it (and, what is worse, underestimation of the tails, since, as we saw in 1.5, the ratio $\frac{\sqrt{\sum x^2}}{\sum |x|}$ increases with "fat-tailedness" to become infinite under tail exponents $\alpha \le 2$). Standard deviation is usually explained and interpreted as mean deviation. Simply, people do not find it natural that a variation of, say, (-5, +10, -4, -3, 5, 8) in temperature over successive days should be mentally estimated by squaring the numbers, averaging them, then taking square roots; instead they just average the absolute values. But, what is key, they tend to do so while convincing themselves that they are using standard deviations.
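A short numerical sketch of the substitution (the distributions and sample size below are illustrative assumptions): the ratio $\sqrt{\sum x^2}/\sum |x|$, proxied here per observation by $\sqrt{E[x^2]}/E|x|$, grows as the tail exponent drops toward 2.

```python
# Sketch: mean absolute deviation vs root mean square under fatter tails.
import numpy as np

rng = np.random.default_rng(2)
n = 10 ** 6
for label, x in [("Gaussian", rng.standard_normal(n)),
                 ("Student-t, tail exponent 3", rng.standard_t(3, n)),
                 ("Student-t, tail exponent 2.1", rng.standard_t(2.1, n))]:
    ratio = np.sqrt(np.mean(x ** 2)) / np.mean(np.abs(x))
    print(f"{label:30s} sqrt(E[x^2])/E|x| = {ratio:.3f}")
# Gaussian: sqrt(pi/2) ~ 1.253; the ratio rises (and eventually diverges)
# as the tail exponent approaches 2.
```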


There is worse. Mindless applications of statistical techniques, without knowledge of the conditional nature of the claims, are widespread. But mistakes are often elementary, like lectures by parrots repeating "N of 1" or "p", or "do you have evidence of?", etc. Many social scientists need to have a clear idea of the difference between science and journalism, or the one between rigorous empiricism and anecdotal statements. Science is not about making claims about a sample, but using a sample to make general claims and discuss properties that apply outside the sample.

Take M' (short for $M^X_T(A, f)$), the estimator we saw above from the realizations (a sample path) for some process, and M* the "true" mean that would emanate from knowledge of the generating process for such a variable. When someone announces "The crime rate in NYC dropped between 2000 and 2010", the claim is limited to M', the observed mean, not M*, the true mean; hence the claim can be deemed merely journalistic, not scientific, and journalists are there to report "facts", not theories. No scientific and causal statement should be made from M' on "why violence has dropped" unless one establishes a link to M*, the true mean. M' cannot be deemed "evidence" by itself. Working with M' alone cannot be called "empiricism".

What we just saw is at the foundation of statistics (and, it looks like, science). Bayesians disagree on how M' converges to M*, etc., but never on this point. From his statements in a dispute with this author concerning his claims about the stability of modern times based on the mean casualty count in the past (Pinker [2]), Pinker seems to be aware that M' may have dropped over time (which is a straight equality), and sort of, perhaps, that we might not be able to make claims on M*, which might not have really been dropping.

In some areas not involving time series, the difference between M' and M* is negligible. So I rapidly jot down a few rules before showing proofs and derivations (limiting M' to the arithmetic mean, that is, $M' = M^X_T((-\infty, \infty), x)$).

Note again that $E$ is the expectation operator under the "real-world" probability measure $P$.

6.2 The Tails Sampling Property

From the derivations in 5.1, $E[|M' - M^*|]$ increases with fat-tailedness (the mean deviation of $M^*$ seen from the realizations in different samples of the same process). In other words, fat tails tend to mask the distributional properties. This is the immediate result of the problem of convergence by the law of large numbers.

6.2.1 On the difference between the initial (generator) and the "recovered" distribution

(Explanation of the method of generating data from a known distribution and comparing realized outcomes to expected ones)

Figure 6.2: Q-Q plot: fitting extreme value theory to data generated by its own process; the rest of course owing to sample insufficiency for extremely large values, a bias that typically causes the underestimation of tails, as the reader can see in the points tending to fall to the right.

6.2.2 Case Study: Pinker [2] Claims On The Stability of the Future Based on Past Data

When the generating process is power law with a low exponent, plenty of confusion can take place.

For instance, Pinker [2] claims that the generating process has a tail exponent ~1.16, but made the mistake of drawing quantitative conclusions from it about the mean from M', and built theories about a drop in the risk of violence that is contradicted by the data he was showing, since fat tails plus negative skewness/asymmetry = hidden and underestimated risks of blowup. His study is also missing the Casanova problem (next point), but let us focus on the error of being fooled by the mean of fat-tailed data.

The next two figures show the realizations of two subsamples, one before, and the other after the turkey problem, illustrating the inability of a set to naively deliver true probabilities through calm periods.

Figure 6.3: First 100 years (sample path): a Monte Carlo generated realization of a process for casualties from violent conflict of the "80/20 or 80/02 style", that is, tail exponent $\alpha = 1.15$.

Figure 6.4: The Turkey Surprise: now 200 years; the second 100 years dwarf the first. These are realizations of the exact same process, seen with a longer window and at a different scale.

The next simulation shows M1, the mean of casualties over the first 100 years across $10^4$ sample paths, and M2 the mean of casualties over the next 100 years.

Figure 6.5: Does the past mean predict the future mean? Not so. M1 for the first 100 years, M2 for the next century. Seen at a narrow scale.

Figure 6.6: Does the past mean predict the future mean? Not so. M1 for the first 100 years, M2 for the next century. Seen at a wider scale.



Figure 6.7: The same experiment seen with a thin-tailed distribution.

So clearly it is a lunacy to try to read much into the mean of a power law with a 1.15 exponent (and this is the mild case, where we know the exponent is 1.15; typically we have an error rate, and the metaprobability discussion in Chapter x will show the exponent to be likely lower, because of the possibility of error).

6.2.3 Claims Made From Power Laws

The Cederman graph, Figure 6.8, shows exactly how not to make claims upon observing power laws.

Figure 6.8: Cederman 2003, used by Pinker. I wonder if I am dreaming or if the exponent $\alpha$ really is = .41. Chapters x and x show why such inference is centrally flawed, since low exponents do not allow claims on the mean of the variable, except to say that it is very, very high and not observable in finite samples. Also, in addition to wrong conclusions from the data, take for now that the regression fits the small deviations, not the large ones, and that the author overestimates our ability to figure out the asymptotic slope.


6.3 A discussion of the Paretan 80/20 Rule

Next we will see how, when one hears about the Paretan 80/20 "rule" (or, worse, "principle"), it is likely to underestimate the fat tails effect outside some narrow domains. It can be more like 95/20 or even 99.9999/.0001, or eventually 100/$\epsilon$. Almost all economic reports applying power laws for "GINI" (Chapter x) or inequality miss the point. Even Pareto himself miscalibrated the rule.

As a heuristic, it is always best to assume underestimation of tail measurement. Recall that we are in a one-tailed situation, hence a likely underestimation of the mean.

Where does this 80/20 business come from? Assume $\alpha$ is the power law tail exponent, with exceedance probability $P_{X>x} = x_{min}^{\alpha}\, x^{-\alpha}$, $x \in (x_{min}, \infty)$. Simply, the top $p$ of the population gets $S = p^{\frac{\alpha-1}{\alpha}}$ of the share of the total pie, so

$$\alpha = \frac{\log(p)}{\log(p) - \log(S)}$$

which means that the exponent will be 1.161 for the 80/20 distribution.

Note that as $\alpha$ gets close to 1, the contribution explodes, as the mean becomes close to infinite.

Derivation: Start with the standard density $f(x) = \alpha\, x_{min}^{\alpha}\, x^{-\alpha-1}$, $x \ge x_{min}$ (and set $x_{min} = 1$ for simplicity).

1) The share attributed above $K$, $K \ge x_{min}$, becomes
$$\frac{\int_K^{\infty} x f(x)\, dx}{\int_{x_{min}}^{\infty} x f(x)\, dx} = K^{1-\alpha}$$

2) The probability of exceeding $K$:
$$\int_K^{\infty} f(x)\, dx = K^{-\alpha}$$

3) Hence $K^{-\alpha}$ of the population contributes $K^{1-\alpha} = p^{\frac{\alpha-1}{\alpha}}$ of the result.
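A quick check of the algebra above (plain arithmetic, nothing assumed beyond the formulas just derived):

```python
# alpha = log(p) / (log(p) - log(S)) and, inversely, S = p**((alpha-1)/alpha).
import math

def alpha_from_share(p, S):
    return math.log(p) / (math.log(p) - math.log(S))

def share_of_top(p, alpha):
    return p ** ((alpha - 1) / alpha)

print(alpha_from_share(0.20, 0.80))   # ~1.161, the "80/20" exponent
print(share_of_top(0.20, 1.161))      # ~0.80, recovering the share
print(share_of_top(0.20, 1.01))       # ~0.98: near alpha = 1, winner-take-all
```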

6.3.1 Why the 80/20 Will Be Generally an Error: The Problem of In-Sample Calibration

Vilfredo Pareto figured out that 20% of the land in Italy was owned by 80% of the people, and the reverse. He later observed that 20 percent of the peapods in his garden yielded 80 percent of the peas that were harvested. He might have been right about the peas; but he was most certainly wrong about the land.

Fitting in-sample frequencies for a power law does not yield the proper "true" ratio, since the sample is likely to be insufficient. One should fit a power law using extrapolative, not interpolative, techniques, such as methods based on log-log plotting or regressions. These latter methods are more informational, though with a few caveats, as they can also suffer from sample insufficiency.

Data with infinite mean, $\alpha \le 1$, will masquerade as finite variance in sample and show about 80% contribution to the top 20% quantile. In fact you are expected to witness in finite samples a lower contribution of the top 20%.

Let us see (Figure 6.9): generate $m$ samples of $\alpha = 1$ data $X_j = (x_{i,j})_{i=1}^{n}$, ordered $x_{i,j} \ge x_{i-1,j}$, and examine the distribution of the top $\nu$ contribution $Z_j^{\nu} = \frac{\sum_{i \le \nu n} x_{i,j}}{\sum_{i \le n} x_{i,j}}$, with $\nu \in (0,1)$.
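A sketch of that experiment (assumed parameters: $m = 2000$ sample sets of $n = 10^4$ points, $\nu = 0.2$; in the true, infinite-mean process the top share tends to one):

```python
# Sketch: in-sample contribution of the top 20% for alpha = 1 data.
import numpy as np

rng = np.random.default_rng(5)
alpha, m, n, nu = 1.0, 2000, 10 ** 4, 0.20

Z = np.empty(m)
for j in range(m):
    u = 1.0 - rng.random(n)                        # uniform on (0, 1]
    x = np.sort(u ** (-1.0 / alpha))               # Pareto(alpha = 1) sample
    Z[j] = x[int((1 - nu) * n):].sum() / x.sum()   # share of the top nu

print("median in-sample top-20% share:", round(float(np.median(Z)), 3))
# Far below the ~100% the infinite-mean generator implies: the in-sample
# "80/20" understates the true tail contribution.
```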

6.4 Survivorship Bias (Casanova) Property

$E(M' - M^*)$ increases under the presence of an absorbing barrier for the process. This is the Casanova effect, or fallacy of silent evidence, see The Black Swan, Chapter 8. (Fallacy of silent evidence: looking at history, we do not see the full story, only the rosier parts of the process; in the Glossary.)

History is a single sample path we can model as a Brownian motion, or something similar with fat tails (say Lévy flights). What we observe is one path among many "counterfactuals", or alternative histories. Let us call each one a "sample path", a succession of discretely observed states of the system between the initial state $S_0$ and $S_T$, the present state.

Arithmetic process: We can model it as $S(t) = S(t - \Delta t) + Z_{\Delta t}$, where $Z_{\Delta t}$ is noise drawn from any distribution.


Figure 6.9: The difference between the generated (ex ante) and recovered (ex post) processes; $\nu = 20/100$, $N = 10^7$. Even when it should be .0001/100, we tend to watch an average of a 75/20 distribution.

Geometric process: We can model it as $S(t) = S(t - \Delta t)\, e^{W_t}$, typically $S(t - \Delta t)\, e^{\mu \Delta t + s \sqrt{\Delta t}\, Z_t}$, but $W_t$ can be noise drawn from any distribution. Typically, $\log \frac{S(t)}{S(t - i\Delta t)}$ is treated as Gaussian, but we can use fatter tails. The convenience of the Gaussian is stochastic calculus and the ability to skip steps in the process, as $S(t) = S(t - \Delta t)\, e^{\mu \Delta t + s \sqrt{\Delta t}\, W_t}$, with $W_t \sim N(0,1)$, works for all $\Delta t$, even allowing for a single period to summarize the total.

The Black Swan made the statement that history is more rosy than the "true" history, that is, the mean of the ensemble of all sample paths.

Take an absorbing barrier $H$ as a level that, when reached, leads to extinction, defined as becoming unobservable or unobserved at period $T$.

Table 6.1: Counterfactual historical paths subjected to an absorbing barrier.

When you observe the history of a family of processes subjected to an absorbing barrier, i.e., you see the winners not the losers, there are biases. If the survival of the entity depends upon not hitting the barrier, then one cannot compute the probabilities along a certain sample path without adjusting.

The "true" distribution is the one for all sample paths; the "observed" distribution is the one of the succession of points $(S_{i\Delta t})_{i=1}^{T}$.


Bias in the measurement of the mean. In the presence of an absorbing barrier $H$ "below", that is, lower than $S_0$, the "observed mean" > "true mean".

Bias in the measurement of the volatility. The "observed" variance (or mean deviation) $\le$ the "true" variance.

The first two results are well known (see Brown, Goetzmann and Ross (1995)). What I will set to prove here is that fat-tailedness increases the bias.

First, let us pull out the "true" distribution using the reflection principle.

Table 6.2: The reflection principle (graph from Taleb, 1997). The number of paths that go from point $a$ to point $b$ without hitting the barrier $H$ is equivalent to the number of paths from the point $-a$ (equidistant to the barrier) to $b$.

Thus if the barrier is $H$ and we start at $S_0$, then we have two distributions: one $f(S)$, the other $f(S - 2(S_0 - H))$. By the reflection principle, the "observed" distribution $p(S)$ becomes:

$$p(S) = \begin{cases} f(S) - f\left(S - 2(S_0 - H)\right) & \text{if } S > H \\ 0 & \text{if } S < H \end{cases}$$

Simply, the nonobserved paths (the casualties "swallowed into the bowels of history") represent a mass of $1 - \int_H^{\infty} \left( f(S) - f\left(S - 2(S_0 - H)\right) \right) dS$ and, clearly, it is in this mass that all the hidden effects reside. We can prove that the missing mean is $\int_{-\infty}^{H} S \left( f(S) - f\left(S - 2(S_0 - H)\right) \right) dS$ and perturbate $f(S)$ using the previously seen method to "fatten" the tail.

Table 6.3: If you don't take into account the sample paths that hit the barrier, the observed distribution seems more positive, and more stable, than the "true" one.

The interesting aspect of the absorbing barrier (from below) is that it has the same effect as insufficient sampling of a left-skewed distribution under fat tails. The mean will look better than it really is.
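A minimal sketch of the effect (assumed parameters: arithmetic random walk started at $S_0 = 10$ with zero-mean noise, barrier at $H = 0$): conditioning on survival pushes the observed mean above the true ensemble mean, and fatter noise widens the gap.

```python
# Sketch: survivorship bias from an absorbing barrier "below".
import numpy as np

rng = np.random.default_rng(6)
S0, H, T, paths = 10.0, 0.0, 250, 20_000

for label, noise in [("thin tails (Gaussian)", rng.standard_normal((paths, T))),
                     ("fat tails (Student-t, 2.5)", rng.standard_t(2.5, (paths, T)))]:
    S = S0 + np.cumsum(noise, axis=1)
    alive = S.min(axis=1) > H                  # paths that never hit the barrier
    print(f"{label:28s} ensemble mean = {S[:, -1].mean():6.2f}   "
          f"survivors' mean = {S[alive, -1].mean():6.2f}   "
          f"survival rate = {alive.mean():.2f}")
```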

6.5 Left (Right) Tail Sample Insufficiency Under Negative (Positive) Skewness

$E[M' - M^*]$ increases (decreases) with negative (positive) skewness of the true underlying variable.

Some classes of payoff (those affected by Turkey problems) show better performance than the "true" mean. Others (entrepreneurship) are plagued with in-sample underestimation of the mean. A naive measure of a sample mean, even without an absorbing barrier, yields a higher observed mean than the "true" mean when the distribution is skewed to the left, and a lower one when the skewness is to the right.


Figure 6.10: The left tail has fewer samples (unseen rare events). The probability of an event falling below K in n samples is F(K), where F is the cumulative distribution.

This can be shown analytically, but a simulation works well.

To see how a distribution masks its mean because of sample insufficiency, take a skewed distribution with fat tails, say the standard Pareto distribution we saw earlier.

The "true" mean is known to be m= ↵↵�1

. Gener-ate a sequence (X

1,j , X2,j , ...,XN,j) of random samplesindexed by j as a designator of a certain history j. Mea-sure µj =

PN

i=1

Xi,j

N . We end up with the sequence ofvarious sample means (µj)

Tj=1

, which naturally shouldconverge to M with both N and T. Next we calculateµ the median value of

PTj=1

µj

M⇤T , such that P>µ = 1

2

where, to repeat, M* is the theoretical mean we expectfrom the generating distribution.

Figure 6.11: Median of $\frac{\sum_{j=1}^{T} \mu_j}{M^* T}$ in simulations ($10^6$ Monte Carlo runs). We can observe the underestimation of the mean of a skewed power law distribution as the $\alpha$ exponent gets lower. Note that lower $\alpha$ implies fatter tails.

Entrepreneurship is penalized by right tail insufficiency, making performance look worse than it is. Figures 0.1 and 0.2 can be seen in a symmetrical way, producing the exact opposite effect of negative skewness.
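A sketch reproducing the logic of Figure 6.11 (assumptions: exact Pareto samples with minimum 1, so $M^* = \alpha/(\alpha - 1)$; $N = 1000$, $T = 10^4$ histories, smaller than the text's $10^6$ runs, for speed):

```python
# Sketch: the median sample mean of a Pareto understates the true mean.
import numpy as np

rng = np.random.default_rng(7)
N, T = 1000, 10_000
for alpha in (1.25, 1.5, 2.0, 2.5):
    true_mean = alpha / (alpha - 1)
    u = 1.0 - rng.random((T, N))                 # uniform on (0, 1]
    mu = (u ** (-1.0 / alpha)).mean(axis=1)      # sample means, one per history
    print(f"alpha = {alpha:4.2f}   median(mu_j)/M* = {np.median(mu) / true_mean:.3f}")
# The ratio sits below 1 and falls as alpha drops: the typical in-sample
# mean of the skewed, fat-tailed variable masks the true mean.
```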

6.6 Why N=1 Can Be Very, Very Significant Statistically

The Power of Extreme Deviations: Under fat tails, large deviations from the mean are vastly more informational than small ones. They are not "anecdotal". (The last two properties correspond to the black swan problem, inherently asymmetric.)

We saw the point earlier (with the masquerade problem) in ??.??. The gist is as follows, worth repeating and applying to this context.

A thin-tailed distribution is less likely to deliver a single large deviation than a fat-tailed distribution is to deliver a series of long calm periods. Now add negative skewness to the issue, which makes large deviations negative and small deviations positive; a large negative deviation, under skewness, becomes extremely informational.

Mixing the arguments of ??.?? and ??.?? we get:

Asymmetry in Inference: Under both negative[positive] skewness and fat tails, negative [positive]deviations from the mean are more informationalthan positive [negative] deviations.

6.7 The Instability of Squared Variations in Regression Analysis

Probing the limits of a standardized method by arbitrage. We can easily arbitrage a mechanistic method of analysis by generating data, the properties of which are known by us, which we call "true" properties, and comparing these "true" properties to the properties revealed by analyses, as well as the confidence of the analysis about its own results in the form of "p-values" or other masquerades.

This is no different from generating random noise and asking the "specialist" for an analysis of the charts, in order to test his knowledge, and, even more importantly, asking him to give us a probability of his analysis being wrong. Likewise, this is equivalent to providing a literary commentator with randomly generated gibberish


and asking him to provide comments. In this section we apply the technique to regression analyses, a great subject of abuse by the social scientists, particularly when ignoring the effects of fat tails.

In short, we saw the effect of fat tails on higher moments. We will start with 1) an extreme case of infinite mean (in which we know that the conventional regression analyses break down), then generalize to 2) situations with finite mean but infinite variance, then 3) finite variance but infinite higher moments. Note that except for case 3, these results are "sort of" standard in the econometrics literature, except that they are ignored away through tweaking of the assumptions.

Fooled by $\alpha = 1$. Assume the simplest possible regression model, as follows. Let $y_i = \beta_0 + \beta_1 x_i + s\, z_i$, with $Y = (y_i)_{1 < i \le n}$ the set of $n$ dependent variables and $X = (x_i)_{1 < i \le n}$ the independent one; $Y, X \in \mathbb{R}$, $i \in \mathbb{N}$. The errors $z_i$ are independent but drawn from a standard Cauchy (symmetric, with tail exponent $\alpha = 1$), multiplied by the amplitude or scale $s$; we will vary $s$ across the thought experiment (recall that in the absence of variance and mean deviation we rely on $s$ as a measure of dispersion). Since all moments are infinite, $E[z_i^n] = \infty$ for all $n \ge 1$, we know ex ante that the noise is such that the "errors" or "residuals" have infinite means and variances; but the problem is that in finite samples the property doesn't show. The sum of squares will be finite.

The next figure shows the effect of a very expected large deviation, as can be expected from a Cauchy jump.

Figure 6.12: A sample regression path dominated by a large deviation. Most samples don't exhibit such a deviation, which is a problem. We know with certainty (an application of the zero-one laws) that these deviations are certain as $n \to \infty$; so if one picks an arbitrarily large deviation, such a number will be exceeded, with the result that the sum of all variations will come from a single large deviation.

Next we generate $T$ simulations (indexed by $j$) of $n$ pairs $(y_i, x_i)_{1 < i \le n}$ for increasing values of $x$, thanks to Cauchy distributed variables $z^{\alpha}_{i,j}$ multiplied by the scaling constant $s$, leaving us with a set $\left( \left( \beta_0 + \beta_1 x_i + s\, z^{\alpha}_{i,j} \right)_{i=1}^{n} \right)_{j=1}^{T}$. Using standard regression techniques of estimation we "regress" and obtain the standard equation $Y^{est} = \beta^{est}_0 + X \beta^{est}_1$, where $Y^{est}$ is the estimated $Y$, and $E$ a vector of unexplained residuals $E \equiv (\epsilon_{i,j}) \equiv \left( \left( y_{i,j} - \beta^{est}_0 - \beta^{est}_1 x_{i,j} \right)_{i=1}^{n} \right)_{j=1}^{T}$. We thus obtain $T$ simulated values of $\rho \equiv (\rho_j)_{j=1}^{T}$, where $\rho_j \equiv 1 - \frac{\sum_{i=1}^{n} \epsilon_{i,j}^2}{\sum_{i=1}^{n} (y_{i,j} - \bar{y}_j)^2}$, the R-square for a sample run $j$, where $\bar{y}_j = \frac{1}{n} \sum_{i=1}^{n} y_{i,j}$; in other words, 1 - (squared residuals)/(squared variations). We examine the distribution of the different realizations of $\rho$.
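A sketch of the experiment (assumed parameters: $\beta_0 = \beta_1 = 0$ so the true $R^2$ is exactly 0; $n = 100$ points; $s = 1$):

```python
# Sketch: in-sample R^2 under Cauchy residuals is essentially a lottery.
import numpy as np

rng = np.random.default_rng(8)
n, T, s = 100, 10_000, 1.0
x = np.arange(1, n + 1, dtype=float)

rho = np.empty(T)
for j in range(T):
    y = s * rng.standard_cauchy(n)               # true beta_0 = beta_1 = 0
    b1, b0 = np.polyfit(x, y, 1)                 # ordinary least squares fit
    resid = y - (b0 + b1 * x)
    rho[j] = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

print("quartiles of in-sample R^2:", np.percentile(rho, [25, 50, 75]))
print("share of runs with R^2 > 0.5:", (rho > 0.5).mean())
```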

Figure 6.13: The histograms showing the distribution of R-squares; $T = 10^6$ simulations. The "true" R-square should be 0. High scale of noise ($\alpha = 1$, $s = 5$).

Figure 6.14: The histograms showing the distribution of R-squares; $T = 10^6$ simulations. The "true" R-square should be 0. Low scale of noise ($\alpha = 1$, $s = .5$).

Figure 6.15: We can fit different regressions to the same story (which is no story). A regression that tries to accommodate the large deviation.

Figure 6.16: Missing the largest deviation (not necessarily voluntarily): the sample doesn't include the critical observation.

Arbitraging metrics. For a sample run which, typically, will not have a large deviation: R-squared: 0.994813 (when the "true" R-squared would be 0). The P-values are monstrously misleading.

        Estimate    Std Error     T-Statistic    P-Value
1       4.99        0.417         11.976         7.8 × 10^-33
x       0.10        0.00007224    1384.68        9.3 × 10^-11426

6.7.1 Application to Economic Variables

We saw in ??.?? that kurtosis can be attributable to 1 in 10,000 observations (>50 years of data), meaning it is unrigorous to assume anything other than that the data has "infinite" kurtosis. The implication is that even if the squares exist, i.e., $E[z_i^2] < \infty$, the distribution of $z_i^2$ has infinite variance, and is massively unstable. The "P-values" remain grossly miscomputed. The next graph shows the distribution of $\rho$ across samples.

Figure 6.17: Finite variance but infinite kurtosis ($\alpha = 3$).

6.8 Statistical Testing of Differences Between Variables

A pervasive attribute substitution: where $X$ and $Y$ are two random variables, the properties of $X - Y$, say the variance, probabilities, and higher order attributes, are markedly different from the difference in properties. So $E(X - Y) = E(X) - E(Y)$, but of course $Var(X - Y) \ne Var(X) - Var(Y)$, etc., for higher norms. It means that P-values are different, and of course the coefficient of variation ("Sharpe"). Where $\sigma$ is the standard deviation of the variable (or sample):


$$\frac{E(X - Y)}{\sigma(X - Y)} \ne \frac{E(X)}{\sigma(X)} - \frac{E(Y)}{\sigma(Y)}$$
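A tiny numerical reminder (illustrative; the correlated pair below is an arbitrary assumption): $Var(X - Y) = Var(X) + Var(Y) - 2\,Cov(X, Y)$, not $Var(X) - Var(Y)$.

```python
# Sketch: the difference of variables vs the difference of their properties.
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal(10 ** 6)
Y = 0.5 * X + rng.standard_normal(10 ** 6)       # correlated with X

print("Var(X - Y)      :", round(float(np.var(X - Y)), 3))           # ~1.25
print("Var(X) - Var(Y) :", round(float(np.var(X) - np.var(Y)), 3))   # ~-0.25
print("'Sharpe' of X-Y :", round(float(np.mean(X - Y) / np.std(X - Y)), 4))
```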

In Fooled by Randomness (2001):

    A far more acute problem relates to the outperformance, or the comparison, between two or more persons or entities. While we are certainly fooled by randomness when it comes to a single time series, the foolishness is compounded when it comes to the comparison between, say, two people, or a person and a benchmark. Why? Because both are random. Let us do the following simple thought experiment. Take two individuals, say, a person and his brother-in-law, launched through life. Assume equal odds for each of good and bad luck. Outcomes: lucky-lucky (no difference between them), unlucky-unlucky (again, no difference), lucky-unlucky (a large difference between them), unlucky-lucky (again, a large difference).

Ten years later (2011) it was found that 50% of neuroscience papers (peer-reviewed in "prestigious journals") that compared variables got it wrong.

    In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience.

In Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E. J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14(9), 1105-1107.

Fooled by Randomness was read by many professionals (to put it mildly); the mistake is still being made. Ten years from now, they will still be making the mistake.

6.9 Studying the Statistical Properties of Binaries and Extending to Vanillas

See the discussion in Chapter 7. A lot of nonsense in discussions of rationality facing "dread risk" (such as terrorism or nuclear events) is based on wrong probabilistic structures, such as comparisons of fatalities from falls from ladders to death from terrorism. The probability of falls from ladders doubling is 1 in $10^{20}$. Terrorism is fat-tailed: similar claims cannot be made.

A lot of unrigorous claims like "long shot bias" are also discussed there.

6.10 The Mother of All Turkey Problems: How Economics Time Series Econometrics and Statistics Don't Replicate

(Debunking a Nasty Type of PseudoScience)

Something Wrong With Econometrics, as Almost All Papers Don't Replicate. The next two reliability tests, one about parametric methods, the other about robust statistics, show that there is something wrong in econometric methods, fundamentally wrong, and that the methods are not dependable enough to be of use in anything remotely related to risky decisions.

6.10.1 Performance of Standard Parametric Risk Estimators, $f(x) = x^n$ (Norm L2)

With economic variables one single observation in 10,000, that is, one single day in 40 years, can explain the bulk of the "kurtosis", a measure of "fat tails", that is, both a measure of how much the distribution under consideration departs from the standard Gaussian, and of the role of remote events in determining the total properties. For the U.S. stock market, a single day, the crash of 1987, determined 80% of the kurtosis. The same problem is found with interest and exchange rates, commodities, and other variables. The problem is not just that the data had "fat tails", something people knew but sort of wanted to forget; it was that we would never be able to determine "how fat" the tails were within standard methods. Never.

The implication is that those tools used in economics that are based on squaring variables (more technically, the Euclidean, or L2, norm), such as standard deviation, variance, correlation, regression, the kind of stuff


you find in textbooks, are not valid scientifically (except in some rare cases where the variable is bounded). The so-called "p values" you find in studies have no meaning with economic and financial variables. Even the more sophisticated techniques of stochastic calculus used in mathematical finance do not work in economics except in selected pockets.

The results of most papers in economics based on these standard statistical methods are thus not expected to replicate, and they effectively don't. Further, these tools invite foolish risk taking. Neither do alternative techniques yield reliable measures of rare events, except that we can tell if a remote event is underpriced, without assigning an exact value.

From Taleb (2009), using log returns, $X_t \equiv \log \frac{P(t)}{P(t - i\Delta t)}$, take the measure $M^X_t \left( (-\infty, \infty), X^4 \right)$ of the fourth noncentral moment

$$M^X_t \left( (-\infty, \infty), X^4 \right) \equiv \frac{1}{n} \sum_{i=0}^{n} X^4_{t - i\Delta t}$$

and the n-sample maximum quartic observation $\mathrm{Max} \left( X^4_{t - i\Delta t} \right)_{i=0}^{n}$. $Q(n)$ is the contribution of the maximum quartic variations over $n$ samples:

$$Q(n) \equiv \frac{\mathrm{Max} \left( X^4_{t - i\Delta t} \right)_{i=0}^{n}}{\sum_{i=0}^{n} X^4_{t - i\Delta t}}$$

For a Gaussian (i.e., the distribution of the square of a Chi-square distributed variable), $Q(10^4)$, the maximum contribution, should be around $.008 \pm .0028$. Visibly we can see that the distribution of the 4th moment has the property

$$P \left( X > \max (x_i^4)_{i \le n} \right) \approx P \left( X > \sum_{i=1}^{n} x_i^4 \right)$$

Recall that, naively, the fourth moment expresses the stability of the second moment. And the second moment expresses the stability of the measure across samples.
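A sketch of the $Q(n)$ diagnostic on simulated data (the text's figures come from actual market returns; the generators below are illustrative):

```python
# Sketch: share of the single largest x^4 in the sum of all x^4.
import numpy as np

rng = np.random.default_rng(10)
n = 10 ** 4

def Q(x):
    x4 = x ** 4
    return x4.max() / x4.sum()

print("Gaussian      Q(10^4) =", round(float(Q(rng.standard_normal(n))), 4))  # ~.008
print("Student-t(3)  Q(10^4) =", round(float(Q(rng.standard_t(3, n))), 4))
# Under fat tails a single observation can carry most of the fourth moment,
# exactly the instability documented in the table below.
```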

Security             Max Q   Years
Silver               0.94    46
SP500                0.79    56
CrudeOil             0.79    26
Short Sterling       0.75    17
Heating Oil          0.74    31
Nikkei               0.72    23
FTSE                 0.54    25
JGB                  0.48    24
Eurodollar Depo 1M   0.31    19
Sugar #11            0.30    48
Yen                  0.27    38
Bovespa              0.27    16
Eurodollar Depo 3M   0.25    28
CT                   0.25    48
DAX                  0.20    18

Note that taking the snapshot at a different period would show extremes coming from other variables, while these variables, now showing high maxima for the kurtosis, would drop; a mere result of the instability of the measure across series and time.

Description of the dataset: all tradable macro markets data available as of August 2008, with "tradable" meaning actual closing prices corresponding to transactions (stemming from markets, not bureaucratic evaluations; includes interest rates, currencies, equity indices).

Figure 6.18: Max quartic across securities.

Figure 6.19: Kurtosis across nonoverlapping periods (EuroDepo 3M: annual kurtosis, 1981-2008).

Figure 6.20: Monthly delivered volatility in the SP500 (as measured by standard deviations). The only structure it seems to have comes from the fact that it is bounded at 0. This is standard.

Figure 6.21: Monthly volatility of volatility from the same dataset, predictably unstable.

6.10.2 Performance of Standard NonParametric Risk Estimators, $f(x) = x$ or $|x|$ (Norm L1), $A = (-\infty, K]$

Does the past resemble the future in the tails? The following tests are nonparametric, that is, entirely based on empirical probability distributions.

Figure 6.22: Comparing $M[t-1, t]$ and $M[t, t+1]$, where $\tau = 1$ year, 252 days, for macroeconomic data using extreme deviations, $A$ = ($-\infty$, -2 standard deviations (equivalent)], $f(x) = x$ (replication of data from The Fourth Quadrant, Taleb, 2009).

Figure 6.23: The "regular" is predictive of the regular, that is, mean deviation. Comparing $M[t]$ and $M[t + 1\ \mathrm{year}]$ for macroeconomic data using regular deviations, $A = (-\infty, \infty)$, $f(x) = |x|$.

Figure 6.24: Concentration of tail events without predecessors; concentration of tail events without successors. These are a lot worse for large deviations: $A$ = ($-\infty$, -4 standard deviations (equivalent)], $f(x) = x$.

So far we stayed in dimension 1. When we look at higher dimensional properties, such as covariance matrices, things get worse. We will return to the point with the treatment of model error in mean-variance optimization. When the $x_t$ are now in $\mathbb{R}^N$, the problems of sensitivity to changes in the covariance matrix make the estimator $M$ extremely unstable. Tail events for a vector are vastly more difficult to calibrate, and increase in dimensions.

The responses so far by members of the economics/econometrics establishment: "his books are too popular to merit attention", "nothing new" (sic), "egomaniac" (but I was told at the National Science Foundation that "egomaniac" does not appear to have a clear econometric significance). No answer as to why


Figure 6.25: Correlations are also problematic, which flows from the instability of single variances and the effect of multiplicationof the values of random variables.

they still use STD, regressions, GARCH, value-at-risk and similar methods.

Peso problem: Note that many researchers invoke "outliers" or the "peso problem" as acknowledging fat tails, yet ignore them analytically (outside of Poisson models that, as we will see, are not possible to calibrate except after the fact). Our approach here is exactly the opposite: do not push outliers under the rug; rather, build everything around them. In other words, just like the FAA and the FDA, who deal with safety by focusing on catastrophe avoidance, we will push the ordinary under the rug and retain extremes as the sole sound approach to risk management. And this extends beyond safety, since much of the analytics and policies that can be destroyed by tail events are unusable.

Peso problem attitude towards the Black Swan problem:

"(...) "black swans" (Taleb, 2007). Thesecultural icons refer to disasters that occurso infrequently that they are virtually im-possible to analyze using standard statisti-

cal inference. However, we find this per-spective less than helpful because it sug-gests a state of hopeless ignorance in whichwe resign ourselves to being buffeted andbattered by the unknowable."(Andrew Lo who obviously did not botherto read the book he was citing. The com-ment also shows the lack of common senseto look for robustness to these events).

Lack of Skin in the Game. Indeed one wonders why econometric methods can be used while being wrong, so shockingly wrong, how "University" researchers (adults) can partake of such a scam. Basically they capture the ordinary and mask higher order effects. Since blowups are not frequent, these events do not show in data and the researcher looks smart most of the time while being fundamentally wrong. At the source, researchers, "quant" risk managers, and academic economists do not have skin in the game, so they are not hurt by wrong risk measures: other people are hurt by them. And the scam should continue perpetually so long as people are


allowed to harm others with impunity. (More in Taleb and Sandis, 2013)

6.11 A General Summary of The Problem of Reliance on Past Time Series

The four aspects of what we will call the nonreplicability issue, particularly for measures that are in the tails. These are briefly presented here and developed more technically throughout the book:

a- Definition of statistical rigor (or Pinker Problem). The idea that an estimator is not about fitness to past data, but related to how it can capture future realizations of a process, seems absent from the discourse. Much of econometrics/risk management methods do not meet this simple point and the rigor required by orthodox, basic statistical theory.

b- Statistical argument on the limit of knowledge of tail events. Problems of replicability are acute for tail events. Tail events are impossible to price owing to the limitations from the size of the sample. Naively, rare events have little data, hence what estimator we may have is noisier.

c- Mathematical argument about statistical decidability. No probability without metaprobability. Metadistributions matter more with tail events, and with fat-tailed distributions.

1. The soft problem: we accept the probability distribution, but the imprecision in the calibration (or parameter errors) percolates in the tails.

2. The hard problem (Taleb and Pilpel, 2001, Taleb and Douady, 2009): we need to specify an a priori probability distribution from which we depend, or alternatively, propose a metadistribution with compact support.

3. Both problems are bridged in that a nested stochastization of standard deviation (or the scale of the parameters) for a Gaussian turns a thin-tailed distribution into a power law (and stochastization that includes the mean turns it into a jump-diffusion or mixed-Poisson).

d- Economic arguments: The Friedman-Phelps and Lucas critiques, Goodhart's law. Acting on statistical information (a metric, a response) changes the statistical properties of some processes.

6.12 Conclusion

This chapter introduced the problem of "surprises" from the past of time series, and the invalidity of a certain class of estimators that seem to only work in-sample. Before examining more deeply the mathematical properties of fat tails, let us look at some practical aspects.


D On the Instability of Econometric Data

Table D.1: Fourth noncentral moment at daily (K(1)), 10-day (K(10)), and 66-day (K(66)) windows for the random variables

Security                        K(1)    K(10)   K(66)   Max Quartic   Years
Australian Dollar/USD            6.3     3.8     2.9     0.12          22
Australia TB 10y                 7.5     6.2     3.5     0.08          25
Australia TB 3y                  7.5     5.4     4.2     0.06          21
BeanOil                          5.5     7.0     4.9     0.11          47
Bonds 30Y                        5.6     4.7     3.9     0.02          32
Bovespa                         24.9     5.0     2.3     0.27          16
BritishPound/USD                 6.9     7.4     5.3     0.05          38
CAC40                            6.5     4.7     3.6     0.05          20
Canadian Dollar                  7.4     4.1     3.9     0.06          38
Cocoa NY                         4.9     4.0     5.2     0.04          47
Coffee NY                       10.7     5.2     5.3     0.13          37
Copper                           6.4     5.5     4.5     0.05          48
Corn                             9.4     8.0     5.0     0.18          49
Crude Oil                       29.0     4.7     5.1     0.79          26
CT                               7.8     4.8     3.7     0.25          48
DAX                              8.0     6.5     3.7     0.20          18
Euro Bund                        4.9     3.2     3.3     0.06          18
Euro Currency/DEM previously     5.5     3.8     2.8     0.06          38
Eurodollar Depo 1M              41.5    28.0     6.0     0.31          19
Eurodollar Depo 3M              21.1     8.1     7.0     0.25          28
FTSE                            15.2    27.4     6.5     0.54          25
Gold                            11.9    14.5    16.6     0.04          35
Heating Oil                     20.0     4.1     4.4     0.74          31
Hogs                             4.5     4.6     4.8     0.05          43
Jakarta Stock Index             40.5     6.2     4.2     0.19          16
Japanese Gov Bonds              17.2    16.9     4.3     0.48          24
Live Cattle                      4.2     4.9     5.6     0.04          44
Nasdaq Index                    11.4     9.3     5.0     0.13          21
Natural Gas                      6.0     3.9     3.8     0.06          19
Nikkei                          52.6     4.0     2.9     0.72          23
Notes 5Y                         5.1     3.2     2.5     0.06          21
Russia RTSI                     13.3     6.0     7.3     0.13          17
Short Sterling                 851.8    93.0     3.0     0.75          17
Silver                         160.3    22.6    10.2     0.94          46
Smallcap                         6.1     5.7     6.8     0.06          17
SoyBeans                         7.1     8.8     6.7     0.17          47
SoyMeal                          8.9     9.8     8.5     0.09          48
Sp500                           38.2     7.7     5.1     0.79          56
Sugar #11                        9.4     6.4     3.8     0.30          48
SwissFranc                       5.1     3.8     2.6     0.05          38
TY10Y Notes                      5.9     5.5     4.9     0.10          27
Wheat                            5.6     6.0     6.9     0.02          49
Yen/USD                          9.7     6.1     2.5     0.27          38


7 On the Difference between Binary Prediction and True Exposure

(With Implications For Forecasting Tournaments and Decision Making Research)

There are serious statistical differences between predictions, bets, and exposures that have a yes/no type of payoff, the "binaries", and those that have varying payoffs, which we call the "vanilla". Real world exposures tend to belong to the vanilla category, and are poorly captured by binaries. Yet much of the economics and decision making literature confuses the two. Vanilla exposures are sensitive to Black Swan effects, model errors, and prediction problems, while the binaries are largely immune to them. The binaries are mathematically tractable, while the vanilla are much less so. Hedging vanilla exposures with binary bets can be disastrous, and because of the human tendency to engage in attribute substitution when confronted by difficult questions, decision-makers and researchers often confuse the vanilla for the binary.

7.1 Binary vs Vanilla Predictions and Exposures

Let $\Phi$ be the one-dimensional generalized payoff function considered as of time $t_0$ over a certain horizon $t$, for a variable $S$ with initial value $S_{t_0}$ and value $S_t$ at the time of the payoff.

$$\Phi^I_{t_0,t}(S_t, K, L, H) \equiv \begin{cases} (S_t - L)^+ - (S_t - H)^+ + P & \text{if } I = 1 \\ (H - S_t)^+ + (L - S_t)^+ - P & \text{if } I = -1 \end{cases} \qquad (7.1)$$

The indicator $I$ denotes whether the exposure is "positive" or "negative" with respect to the random variable $S$.

Definition 9. The "vanilla" corresponds to an exposure $\Phi^I_{t_0,t}(S_t, S_{t_0}, 0, \infty)$. A standard call option corresponds to a payoff $\Phi^I_{t_0,t}(S_t, S_{t_0}, 0, \infty)$.

Theorem 1. Every payoff can be

Binary: Binary predictions and exposures are about well defined discrete events, with yes/no types of answers, such as whether a person will win the election, a single individual will die, or a team will win a contest. We call them binary because the outcome is either 0 (the event does not take place) or 1 (the event took place), that is, the set {0, 1} or the set $\{a_L, a_H\}$, with $a_L < a_H$ any two discrete and exhaustive values for the outcomes. For instance, we cannot have five hundred people winning a presidential election. Or a single candidate running for an election has two exhaustive outcomes: win or lose.

Vanilla: "Vanilla" predictions and exposures, also known as natural random variables, correspond to situations in which the payoff is continuous and can take several values. The designation "vanilla" originates from definitions of financial contracts¹; it is fitting outside option trading because the exposures they designate are naturally occurring continuous variables, as opposed to the binary, which tends to involve abrupt institution-mandated discontinuities. The vanilla add a layer of complication: profits for companies or deaths due to terrorism or war can take many, many potential values. You can predict the company will be "profitable", but the profit could be $1 or $10 billion.

There is a variety of exposures closer to the vanilla, namely bounded exposures that we can subsume mathematically into the binary category.

¹ The "vanilla" designation comes from option exposures that are open-ended as opposed to the binary ones that are called "exotic".

The main errors are as follows.

• Binaries always belong to the class of thin-tailed distributions, because of boundedness, while the vanillas don't. This means the law of large numbers operates very rapidly there. Extreme events wane rapidly in importance: for instance, as we will see further down in the discussion of the Chernoff bound, the probability of a series of 1000 bets diverging more than 50% from the expected average is less than 1 in $10^{18}$, while the vanilla can experience wilder fluctuations with a high probability, particularly in fat-tailed domains. Comparing one to another can be a lunacy.

• The research literature documents a certain class of biases, such as "dread risk" or "long shot bias", which is the overestimation of some classes of rare events, but derived from binary variables, then falls for the severe mathematical mistake of extending the result to vanilla exposures. If ecological exposures in the real world tend to have vanilla, not binary properties, then much of these results are invalid.

Let us return to the point that the variations of the vanilla are not bounded, or have a remote boundary. Hence, the prediction of the vanilla is marred by Black Swan effects and needs to be considered from such a viewpoint. For instance, a few prescient observers saw the potential for war among the Great Powers of Europe in the early 20th century but virtually everyone missed the second dimension: that the war would wind up killing an unprecedented twenty million persons, setting the stage for both Soviet communism and German fascism and a war that would claim an additional 60 million, followed by a nuclear arms race from 1945 to the present, which might some day claim 600 million lives.

The Black Swan is Not About Probability But Payoff

In short, the vanilla has another dimension, the payoff, in addition to the probability, while the binary is limited to the probability. Ignoring this additional dimension is equivalent to living in a 3-D world but discussing it as if it were 2-D, promoting the illusion to all who will listen that such an analysis captures all worth capturing.

Now the Black Swan problem has been misunderstood. We are saying neither that there must be more volatility in our complexified world nor that there must be more outliers. Indeed, we may well have fewer such events, but it has been shown that, under the mechanisms of "fat tails", their "impact" gets larger and larger and more and more unpredictable. The main cause is globalization and the spread of winner-take-all effects across variables (just think of the Google effect), as well as the effect of the increased physical and electronic connectivity in the world, causing the weakening of the "island effect", a well established fact in ecology by which isolated areas tend to have more varieties of species per square meter than larger ones. In addition, while physical events such as earthquakes and tsunamis may not have changed much in incidence and severity over the last 65 million years (when the dominant species on our planet, the dinosaurs, had a very bad day), their effect is compounded by interconnectivity.

So there are two points here.

Binary predictions are more tractable than exposures. First, binary predictions tend to work; we can learn to be pretty good at making them (at least on short timescales and with rapid accuracy feedback that teaches us how to distinguish signals from noise, all possible in forecasting tournaments as well as in electoral forecasting; see Silver, 2012). Further, these are mathematically tractable: your worst mistake is bounded, since probability is defined on the interval between 0 and 1. But the applications of these binaries tend to be restricted to manmade things, such as the world of games (the "ludic" domain).

It is important to note that, ironically, not only do Black Swan effects not impact the binaries, but they even make them more mathematically tractable, as we will see further down.

Binary predictions are often taken as a substitute for vanilla ones. Second, most non-decision makers tend to confuse the binary and the vanilla. And well-intentioned efforts to improve performance in binary prediction tasks can have the unintended consequence of rendering us oblivious to catastrophic vanilla exposure.


Figure 7.1: Comparing digital payoff (above) to the vanilla (below). The vertical payoff shows $x_i$ ($x_1, x_2, \ldots$) and the horizontal shows the index $i = (1, 2, \ldots)$, as $i$ can be time, or any other form of classification. We assume in the first case payoffs of $\{-1, 1\}$, and open-ended (or with a very remote and unknown bound) in the second.

The confusion can be traced to attribute substitution and the widespread tendency to replace difficult-to-answer questions with much-easier-to-answer ones. For instance, the extremely-difficult-to-answer question might be whether China and the USA are on an historical trajectory toward a rising-power/hegemon confrontation with the potential to claim far more lives than the most violent war thus far waged (say 10X more than the 60M who died in World War II). The much-easier-binary-replacement questions, the sorts of questions likely to pop up in forecasting tournaments or prediction markets, might be whether the Chinese military kills more than 10 Vietnamese in the South China Sea or 10 Japanese in the East China Sea in the next 12 months, or whether China publicly announces that it is restricting North Korean banking access to foreign currency in the next 6 months.

The nub of the conceptual confusion is that although predictions and payoffs are completely separate mathematically, both the general public and researchers are under constant attribute-substitution temptation of using answers to binary questions as substitutes for exposure to vanilla risks.

We often observe such attribute substitution in financial hedging strategies. For instance, Morgan Stanley correctly predicted the onset of a subprime crisis, but


they had a binary hedge and ended up losing billions as the crisis turned out much deeper than predicted (Bloomberg Magazine, March 27, 2008).

Or, consider the performance of the best forecasters in geopolitical forecasting tournaments over the last 25 years (Tetlock, 2005; Tetlock & Mellers, 2011; Mellers et al, 2013). These forecasters may well be right when they say that the risk of a lethal confrontation claiming 10 or more lives in the East China Sea by the end of 2013 is only 0.04. They may be very "well calibrated" in the narrow technical sense that when they attach a 4% likelihood to events, those events occur only about 4% of the time. But framing a vanilla question as a binary question is dangerous because it masks exponentially escalating tail risks: the risks of a confrontation claiming not just 10 lives, but 1000 or 1 million. No one has yet figured out how to design a forecasting tournament to assess the accuracy of probability judgments that range between .00000001% and 1%, and if someone ever did, it is unlikely that anyone would have the patience, or lifespan, to run the forecasting tournament for the necessary stretches of time (requiring us to think not just in terms of decades, but centuries and millennia).

The deep ambiguity of objective probabilities at the extremes, and the inevitable instability in subjective probability estimates, can also create patterns of systematic mispricing of options. An option or option-like payoff is not to be confused with a lottery, and the "lottery effect" or "long shot bias" often discussed in the economics literature, which documents that agents overpay for these bets, should not apply to the properties of actual options.

In Fooled by Randomness, the narrator is asked "do you predict that the market is going up or down?" "Up", he said, with confidence. Then the questioner got angry when he discovered that the narrator was short the market, i.e., would benefit from the market going down. The trader had difficulty conveying the idea that someone could hold the belief that the market had a higher probability of going up, but that, should it go down, it would go down a lot. So the rational response was to be short.

This divorce between the binary (up is more likely) and the vanilla is very prevalent in real-world variables. Indeed we often see reports on how a certain financial institution "did not have a losing day in the entire quarter", only to see it go near-bust from a monstrously large trading loss. Likewise some predictors have an excellent record, except that following their advice would result in large losses, as they are rarely wrong, but when they miss their forecast, the results are devastating.

Remark: More technically, for a heavy-tailed distribution (defined as part of the subexponential family, see Taleb 2013), with at least one unbounded side to the random variable, the vanilla prediction record over a long series will be of the same order as the best or worst prediction, whichever is largest in absolute value, while no single outcome can change the record of the binary.

Another way to put the point: to achieve the reputation of "Savior of Western civilization", a politician such as Winston Churchill needed to be right on only one super-big question (such as the geopolitical intentions of the Nazis), and it matters not how many smaller errors that politician made (e.g. Gallipoli, gold standard, autonomy for India). Churchill could have a terrible Brier score (binary accuracy) and a wonderful reputation (albeit one that still pivots on historical counterfactuals).

Finally, one of the authors wrote an entire book (Taleb, 1997) on the hedging and mathematical differences between binary and vanilla. When he was an option trader, he realized that binary options have nothing to do with vanilla options, economically and mathematically. Seventeen years later people are still making the mistake.

7.2 A Semi-Technical Commentary on the Mathematical Differences

Chernoff Bound. The binary is subject to very tight bounds. Let $(X_i)_{1 \le i \le n}$ be a sequence of independent Bernoulli trials taking values in the set $\{0,1\}$, with $P(X = 1) = p$ and $P(X = 0) = 1 - p$. Take the sum $S_n = \sum_{1 \le i \le n} X_i$, with expectation $E(S_n) = np = \mu$.


Figure 7.2: Fatter and fatter tails: different values of $a$. Note that a higher peak implies a lower probability of leaving the $\pm 1\sigma$ tunnel.

Taking $\delta$ as a "distance from the mean", the Chernoff bound gives, for any $\delta > 0$:
$$P(S_n \ge (1+\delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$$
and, for $0 < \delta \le 1$,
$$P(S_n \ge (1+\delta)\mu) \le 2\, e^{-\frac{\mu \delta^2}{3}}.$$
Let us compute the probability of $n$ coin flips landing 50% above the true mean, with $p = \frac{1}{2}$, $\mu = \frac{n}{2}$, and $\delta = \frac{1}{2}$:
$$P\!\left(S_n \ge \frac{3}{2}\,\frac{n}{2}\right) \le 2\, e^{-\frac{\mu \delta^2}{3}} = 2\, e^{-n/24},$$
which for $n = 1000$ is of the order of one in $10^{18}$.
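As a sanity check (a sketch, not part of the original text), one can compare the bound to the exact binomial tail for a moderate $n$; the bound is loose, but it decays at the advertised exponential rate.

```python
from math import comb, exp

def binom_tail(n: int, p: float, k: int) -> float:
    """Exact P(S_n >= k) for S_n ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p, delta = 100, 0.5, 0.5            # ask for 50% more heads than the mean
mu = n * p                             # expectation of the sum: 50
k = int((1 + delta) * mu)              # threshold: 75 heads out of 100

print("exact tail    :", binom_tail(n, p, k))          # ~ 2.8e-07
print("Chernoff bound:", 2 * exp(-mu * delta**2 / 3))  # ~ 3.1e-02, much looser
```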

Fatter tails lower the probability of remote events (the binary) and raise the value of the vanilla.

The following intuitive exercise will illustrate what happens when one conserves the variance of a distribution but "fattens the tails" by increasing the kurtosis. The probability of a certain type of intermediate and large deviation drops, but their impact increases. Counterintuitively, the possibility of staying within a band increases.

Let $x$ be a Gaussian random variable with mean 0 (with no loss of generality) and standard deviation $\sigma$. Let $P_{>1\sigma}$ be the probability of exceeding one standard deviation: $P_{>1\sigma} = 1 - \frac{1}{2}\operatorname{erfc}\!\left(-\frac{1}{\sqrt{2}}\right)$, where erfc is the complementary error function, so $P_{>1\sigma} = P_{<-1\sigma} \simeq 15.86\%$, and the probability of staying within the "stability tunnel" between $\pm 1\sigma$ is $1 - P_{>1\sigma} - P_{<-1\sigma} \simeq 68.3\%$.

Let us fatten the tails in a variance-preserving manner, using the standard "barbell" method of a linear combination of two Gaussians with two standard deviations separated, $\sigma\sqrt{1+a}$ and $\sigma\sqrt{1-a}$, $a \in (0,1)$, where $a$ is the "vvol" (this is variance-preserving; technically of no big effect here, as a standard-deviation-preserving spreading gives the same qualitative result). Such a method leads to an immediate raising of the standard kurtosis by a factor of $(1 + a^2)$, since $\frac{E(x^4)}{E(x^2)^2} = 3(a^2 + 1)$, where $E$ is the expectation operator. Then:

$$P_{>1\sigma} = P_{<-1\sigma} = 1 - \frac{1}{4}\operatorname{erfc}\!\left(-\frac{1}{\sqrt{2}\sqrt{1-a}}\right) - \frac{1}{4}\operatorname{erfc}\!\left(-\frac{1}{\sqrt{2}\sqrt{1+a}}\right) \qquad (7.2)$$

So then, for different values of $a$ in Eq. 7.2, as we can see in Figure 7.2, the probability of staying inside $1\sigma$ rises: "rare" events become less frequent. Note that this example was simplified for ease of argument. In fact the "tunnel" inside of which fat-tailedness increases probabilities lies between $-\sqrt{\frac{1}{2}\left(5-\sqrt{17}\right)}\,\sigma$ and $\sqrt{\frac{1}{2}\left(5-\sqrt{17}\right)}\,\sigma$ (even narrower than $\pm 1\sigma$ in the example, as it numerically corresponds to the area between $-.66\sigma$ and $.66\sigma$), and the outer region, where fat-tailedness again raises probabilities, lies beyond $\pm\sqrt{\frac{1}{2}\left(5+\sqrt{17}\right)}\,\sigma$, that is, beyond $\pm 2.13\sigma$.
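A numerical check of this exercise (a sketch, assuming $\sigma = 1$ and the half-half mixture of Gaussians with standard deviations $\sqrt{1+a}$ and $\sqrt{1-a}$ described above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10**6

for a in (0.0, 0.25, 0.5, 0.75):
    # Variance-preserving barbell: each draw uses std sqrt(1+a) or sqrt(1-a)
    # with equal probability, so the total variance stays at 1.
    s = np.where(rng.random(n) < 0.5, np.sqrt(1 + a), np.sqrt(1 - a))
    x = s * rng.standard_normal(n)

    kurt = np.mean(x**4) / np.mean(x**2) ** 2    # should approach 3(1 + a^2)
    inside = np.mean(np.abs(x) < 1)              # mass in the +/- 1 sigma tunnel

    # Same quantity in closed form, from the two mixture components
    exact = 0.5 * ((2 * norm.cdf(1 / np.sqrt(1 + a)) - 1)
                   + (2 * norm.cdf(1 / np.sqrt(1 - a)) - 1))

    print(f"a={a:.2f}  kurtosis={kurt:5.2f}  P(|x|<1): mc={inside:.4f}  exact={exact:.4f}")
```

As $a$ rises, the kurtosis climbs from 3 toward about 4.7, while the probability of staying inside $\pm 1\sigma$ rises from 68.3% toward roughly 75%: fatter tails, yet fewer "ordinary" deviations.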

The law of large numbers works better with the binary than with the vanilla

Getting a bit more technical, the law of large numbers works much faster for the binary than for the vanilla (for which it may never work, see Taleb, 2013). The more convex the payoff, the more observations one needs to make a reliable inference. The idea is as follows, and can be illustrated by an extreme example of a very tractable binary and an intractable vanilla.

Let $x_t$ be the realization at period $t$ of the random variable $X \in (-\infty, \infty)$, which follows a Cauchy distribution with p.d.f. $f(x_t) \equiv \frac{1}{\pi\left((x_t - x_0)^2 + 1\right)}$. Let us set $x_0 = 0$ to simplify and make the exposure symmetric around 0. The vanilla exposure maps to the variable $x_t$ and has expectation $E(x_t) = \int_{-\infty}^{\infty} x_t f(x)\,dx$, which is undefined (i.e., will never converge to a fixed value). A bet at $x_0$ has a payoff mapped by a Heaviside theta function $\theta_{>x_0}(x_t)$, paying 1 if $x_t > x_0$ and 0 otherwise. The expectation of the payoff is simply $E(\theta(x)) = \int_{-\infty}^{\infty} \theta_{>x_0}(x) f(x)\,dx = \int_{x_0}^{\infty} f(x)\,dx$, which is simply $P(x > 0)$. So long as a distribution exists, the binary exists and is Bernoulli distributed, with probabilities of success and failure $p$ and $1-p$ respectively.

The irony is that the payoff of a bet on a Cauchy, admittedly the worst possible distribution to work with since it lacks both mean and variance, can be mapped by a Bernoulli distribution, about the most tractable of distributions. In this case the vanilla is the hardest thing to estimate, and the binary the easiest.

Set $S_n = \frac{1}{n}\sum_{i=1}^{n} x_{t_i}$, the average payoff of a variety of vanilla bets $x_{t_i}$ across periods $t_i$, and $S_n^{\theta} = \frac{1}{n}\sum_{i=1}^{n} \theta_{>x_0}(x_{t_i})$. No matter how large $n$, $S_n$ has the same properties (the exact same probability distribution) as $S_1$. On the other hand $\lim_{n\to\infty} S_n^{\theta} = p$; further, the preasymptotics of $S_n^{\theta}$ are tractable, since it converges to $\frac{1}{2}$ rather quickly and its standard deviation declines at speed $\sqrt{n}$, since $\sqrt{V(S_n^{\theta})} = \sqrt{\frac{V(S_1^{\theta})}{n}} = \sqrt{\frac{(1-p)p}{n}}$ (given that the moment generating function of the average is $M(z) = \left(p\, e^{z/n} - p + 1\right)^n$).
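This contrast is easy to see by simulation (a sketch, not from the original text): the running mean of Cauchy draws never settles, while the running frequency of the binary $\theta_{>0}$ converges quickly to $p = \frac{1}{2}$.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_cauchy(10**6)       # draws with x_0 = 0: symmetric exposure

n = np.arange(1, x.size + 1)
vanilla = np.cumsum(x) / n           # running average payoff of the vanilla
binary = np.cumsum(x > 0) / n        # running frequency of the binary bet

for k in (10**2, 10**4, 10**6):
    print(f"n={k:>7}  vanilla mean={vanilla[k-1]:+10.3f}  binary freq={binary[k-1]:.4f}")
```

The binary column homes in on 0.5 at the $\sqrt{(1-p)p/n}$ rate above; the vanilla column keeps jumping whenever a single monstrous draw lands, no matter how large $n$.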

Figure 7.3: The different classes of payoff $f(x)$ seen in relation to an event $x$. (When considering options, the vanilla can start at a given bet level, so the payoff would be continuous on one side but not the other.)

The binary necessarily has a thin-tailed distribution, regardless of domain

More generally, for the class of heavy-tailed distributions, in a long time series the sum is of the same order as the maximum, which cannot be the case for the binary:

$$\lim_{K\to\infty} \frac{P\left(\sum_{i=1}^{n} x_{t_i} > K\right)}{P\left(\max_{i \le n}\left(x_{t_i}\right) > K\right)} = 1 \qquad (7.3)$$

Compare this to the binary, for which

$$\lim_{K\to\infty} P\left(\max_{i \le n}\left(\theta(x_{t_i})\right) > K\right) = 0 \qquad (7.4)$$

The binary is necessarily a thin-tailed distribution, regardless of domain. We can assert the following:

• The sum of binaries converges at a speed faster than or equal to that of the vanilla.

• The sum of binaries is never dominated by a single event, while that of the vanilla can be (see the sketch below).
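Both assertions can be eyeballed in a few lines (a sketch; the tail exponent $\alpha = 1.1$ and the sample size are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 1.1, 10**4

x = rng.pareto(alpha, n) + 1    # fat-tailed vanilla payoffs, minimum value 1
b = rng.random(n) < 0.5         # binary payoffs in {0, 1}

print("vanilla: max/sum =", x.max() / x.sum())  # frequently a large share:
                                                # one draw can dominate the sum
print("binary : max/sum =", b.max() / b.sum())  # ~ 2/n: no single bet matters
```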

How is the binary more robust to model error?

In the more general case, the expected payoff of the vanilla is expressed as $\int_A x\, dF(x)$ (the unconditional shortfall), while that of the binary is $\int_A dF(x)$, where $A$ is the part of the support of interest for the exposure, typically $A \equiv [K, \infty)$ or $(-\infty, K]$. Consider model error as perturbations in the parameters that determine the calculations of the probabilities. In the case of the vanilla, the perturbation's effect on the probability is multiplied by a larger value of $x$.

As an example, define a slightly more complicated vanilla than before, with option-like characteristics, $V(\alpha, K) \equiv \int_K^{\infty} x\, p_{\alpha}(x)\, dx$ and $B(\alpha, K) \equiv \int_K^{\infty} p_{\alpha}(x)\, dx$, where $V$ is the expected payoff of the vanilla, $B$ is that of the binary, $K$ is the "strike" equivalent for the bet level, and, with $x \in [1, \infty)$, $p_{\alpha}(x)$ is the density of the Pareto distribution with minimum value 1 and tail exponent $\alpha$, so $p_{\alpha}(x) \equiv \alpha x^{-\alpha - 1}$.

Set the binary at .02, that is, a 2% probability of exceeding a certain number $K$; this corresponds to $\alpha = 1.2275$ and $K = 24.2$, so the binary is expressed as $B(1.2275, 24.2)$. Let us perturbate $\alpha$, the tail exponent, to double the probability from .02 to .04. The result is $\frac{B(1.01,\,24.2)}{B(1.2275,\,24.2)} = 2$. The corresponding effect on the vanilla is $\frac{V(1.01,\,24.2)}{V(1.2275,\,24.2)} = 37.4$. In this case the vanilla was $\sim 18$ times more sensitive than the binary.
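These numbers can be reproduced directly (a sketch using the closed forms of $B$ and $V$ under the Pareto density, which integrate to $K^{-\alpha}$ and $\frac{\alpha}{\alpha-1} K^{1-\alpha}$ respectively):

```python
from math import log

def B(alpha: float, K: float) -> float:
    """Binary payoff: P(X > K) under a Pareto with minimum value 1."""
    return K ** (-alpha)

def V(alpha: float, K: float) -> float:
    """Vanilla payoff: integral of x * alpha * x^(-alpha-1) over [K, inf)."""
    return alpha / (alpha - 1) * K ** (1 - alpha)

K = 24.2
a0 = log(1 / 0.02) / log(K)    # alpha matching the 2% exceedance: ~1.2275
a1 = log(1 / 0.04) / log(K)    # alpha doubling it to 4%:          ~1.0102

print("binary ratio :", B(a1, K) / B(a0, K))   # = 2.0 by construction
print("vanilla ratio:", V(a1, K) / V(a0, K))   # ~ 37: the multiplied effect
```

The vanilla ratio blows up because, as $\alpha$ drops toward 1, the conditional payoff above the strike explodes while the probability merely doubles.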

7.3 The Applicability of Some Psychological Biases

Without going through the specific papers identifying these biases, Table 7.1 shows the effect of the error across domains. We are not saying that the biases do not exist; rather, that if the error is derived in a binary environment, or one with a capped payoff, it does not port outside the domain in which it was derived.

Table 7.1: True and False Biases

Bias | Erroneous application | Justified application
Dread risk | Comparing terrorism to falls from ladders | Comparing risks of driving vs. flying
General overestimation of small probabilities | Bounded bets in laboratory settings | Open-ended payoffs in fat-tailed domains
Long shot bias | Lotteries | Financial payoffs
Precautionary principle | Volcano eruptions | Climatic issues

Acknowledgments

Bruno Dupire, Raphael Douady, Daniel Kahneman, Barbara Mellers.

References

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23, 493–507.

Mellers, B., et al. (2013). How to win a geopolitical forecasting tournament: The power of teaming and training. Unpublished manuscript, Wharton School, University of Pennsylvania, Team Good Judgment Lab.

Silver, N. (2012). The Signal and the Noise.

Taleb, N.N. (1997). Dynamic Hedging: Managing Vanilla and Exotic Options. Wiley.

Taleb, N.N. (2001/2004). Fooled by Randomness. Random House.

Taleb, N.N. (2013). Probability and Risk in the Real World, Vol. 1: Fat Tails. Freely available web book, www.fooledbyrandomness.com.

Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton: Princeton University Press.

Tetlock, P.E., Lebow, R.N., & Parker, G. (Eds.) (2006). Unmaking the West: What-If Scenarios That Rewrite World History. Ann Arbor, MI: University of Michigan Press.

Tetlock, P.E., & Mellers, B.A. (2011). Intelligent management of intelligence agencies: Beyond accountability ping-pong. American Psychologist, 66(6), 542–554.