The Search for Repeatable Performance
Campbell R. Harvey
Duke University, NBER and Man Group plc
February 20, 2017
International Finance
Campbell R. Harvey 2017
Source: https://xkcd.com/882/
Skip 17 panels of more negative tests, all p‐values>0.05
Examples in Financial Economics
The two sigma rule is only appropriate for a single test
• As we do more tests, there is a chance we find something “significant” (by the two sigma rule) but it is a fluke.
• Here is a simple way to see the impact of multiple tests for a two sigma test:

  # of tests      1    5    10   20   26   50   n
  Prob of fluke   5%   23%  40%  64%  74%  92%  1 − 0.95^n
XKCD Jelly Beans and Acne
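The “Prob of fluke” row can be checked in a few lines: with independent 5% tests, the chance of at least one false positive is 1 − 0.95^n.

```python
# Chance that at least one of n independent two-sigma (5%) tests is a
# false positive, reproducing the "Prob of fluke" row of the table.
def prob_fluke(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false discovery among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 5, 10, 20, 26, 50):
    print(f"{n:3d} tests -> {prob_fluke(n):.0%} chance of a fluke")
```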
Examples in Financial Economics
The promotional email:
• You get an email at the end of each month from an investment manager with “judge my record” as a slogan
• The email recommends either a long or a short position in the S&P
• After receiving 10 correct recommendations in a row, you switch your investment account to the new manager
Examples in Financial Economics
The promotional email
• Later, you find out (the hard way) how the strategy works
• Each month the manager sends out 100,000 emails: 50,000 saying long and 50,000 saying short
• The next month the manager sends only to those who got the correct prediction, so the next month is 25,000 long and 25,000 short recommendations
• 97 people will get 10 correct in a row (100,000 × 0.5^10)
• No skill here. It is random.
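The arithmetic behind the scheme, written out: the list halves every month, so a group of recipients is guaranteed, by construction, to see a perfect record.

```python
# Each month the manager's list splits in half: only recipients who got
# the correct call stay. After 10 months a small group has, by
# construction, seen 10 correct calls in a row -- with zero skill.
recipients = 100_000
for month in range(10):
    recipients //= 2  # keep only the half that received the correct call

print(recipients)               # 97 recipients saw a perfect record
print(100_000 * 0.5 ** 10)      # expected count: 97.65625
```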
Examples in Financial Economics
3.4 sigma strategy
• Profitable during the financial crisis
• Zero beta vs. market, value, size, and momentum
• Impressive performance recently
Campbell R. Harvey, “The Scientific Outlook in Financial Economics”, Presidential Address, American Finance Association, 2017
Examples in Financial Economics
Details
• Long tickers “S”
• Short tickers “U”
Examples in Financial Economics
The two sigma rule is only appropriate for a single test
• As we do more tests, there is a chance we find something “significant” (by the two sigma rule) but it is a fluke.
• Here is a simple way to see the impact of multiple tests for a two sigma test:

  # of tests      1    5    10   20   26   50   n
  Prob of fluke   5%   23%  40%  64%  74%  92%  1 − 0.95^n

XKCD Jelly Beans and Acne
Alphabet, i.e., ticker symbols
Examples in Financial Economics
Research
• One study looks at companies with meaningful ticker symbols, like Southwest’s LUV, and shows they outperform.1
• There is another study that argues that tickers that are easy to pronounce, like BAL vs. BDL, outperform in IPOs.2
• There is yet another study that suggests that tickers that are congruent with the company’s name outperform.3
1 Head, Smith and Watson, 2009; 2 Alter and Oppenheimer, 2006; 3 Srinivasan and Umashankar
Examples in Financial Economics
5 factors
Examples in Financial Economics
15 factors
Examples in Financial Economics
82 factors
Source: The Barra US Equity Model (USE4), MSCI (2014)
Examples in Financial Economics
400 factors!
Source: https://www.capitaliq.com/home/who‐we‐help/investment‐management/quantitative‐investors.aspx
Examples in Financial Economics
18,000 signals examined in Yan and Zheng (2015)
What’s going on?
Forces causing mistakes:
1. Failure to account for luck + evolutionary propensity not to account for luck
2. Failure in specifying and conducting scientific tests
3. Failure to take rare effects into account
A framework to separate luck from skill
Four research initiatives:*
1. Explicitly adjust for multiple tests (“Backtesting”)
2. Bootstrap (“Lucky Factors”)
3. Noise reduction (“Rethinking Performance Evaluation”)
4. Controlling for rare effects (“Scientific Outlook in Financial Economics”)
*Bibliography on last page. All my research at: https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=16198
Luck
• Why are we so easily fooled by randomness?
Terminology
Terminology
I thought this manager was skilled but that was a mistake: False Positive
Terminology
I didn’t invest in this manager but that was a mistake: False Negative
Type II is linked to Type I.
• For example, if all patients are declared pregnant, there is no Type II error.
Evolutionary Foundations
Rustling sound in the grass …
Evolutionary Foundations
Rustling sound in the grass …
Type I error
Evolutionary Foundations
Type II error
Evolutionary Foundations
Type II error
In these examples, the cost of a Type II error is large – potentially death.
Evolutionary Foundations
• High Type I error (low Type II error) animals survive
• This preference is passed on to the next generation
• This makes the case for an evolutionary predisposition for allowing high Type I errors
Evolutionary Foundations
B.F. Skinner, 1947
Pigeons put in a cage. Food delivered at regular intervals – feeding time has nothing to do with the behavior of the birds.
Evolutionary Foundations
Results
• Skinner found that the birds associated their behavior with food delivery
• One bird would turn counter‐clockwise
• Another bird would tilt its head back
Evolutionary Foundations
Results
• A good example of overfitting – you think there is a pattern but there isn’t
• Skinner’s paper: ‘Superstition’ in the Pigeon, Journal of Experimental Psychology (1948)
• But this applies not just to pigeons or gazelles…
Evolutionary Foundations
Klaus Conrad, 1958
Coins the term Apophänie. This is where you see a pattern and make an incorrect inference. He associated this with psychosis and schizophrenia.
Evolutionary Foundations
• Apophany is a Type I error (i.e., false insight)
• Epiphany is the opposite (i.e., true insight)
– Apophany may be interpreted as overfitting
K. Conrad, 1958. Die beginnende Schizophrenie. Versuch einer Gestaltanalyse des Wahns
“…nothing is so alien to the human mind as the idea of randomness.” – John Cohen
Evolutionary Foundations
• Sagan (1995):
– As soon as the infant can see, it recognizes faces, and we now know that this skill is hardwired in our brains.
C. Sagan, 1995. The Demon‐Haunted World
Evolutionary Foundations
• Sagan (1995):
– Those infants who a million years ago were unable to recognize a face smiled back less, were less likely to win the hearts of their parents, and less likely to prosper.
What about Finance?
Performance of the trading strategy is very impressive.
• SR = 1
• Consistent
• Drawdowns acceptable
Source: Man‐AHL Research
What about Finance?
[Chart: three of 200 random time‐series (mean = 0, volatility = 15%), labeled Sharpe = 1, Sharpe = 2/3, and Sharpe = 1/3]
Source: Man‐AHL Research
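The chart can be reproduced in miniature. A hypothetical sketch, assuming 10 years of monthly data: generate 200 pure-noise return series with 15% annualized volatility and report the best-looking Sharpe ratio.

```python
import random
import statistics

random.seed(42)
N_SERIES, N_MONTHS = 200, 120     # assumption: 10 years of monthly data
monthly_vol = 0.15 / 12 ** 0.5    # 15% annualized volatility

def annual_sharpe(returns):
    """Annualized Sharpe ratio of a monthly return series."""
    return statistics.mean(returns) / statistics.stdev(returns) * 12 ** 0.5

sharpes = [
    annual_sharpe([random.gauss(0.0, monthly_vol) for _ in range(N_MONTHS)])
    for _ in range(N_SERIES)
]
# Every series has a true Sharpe ratio of zero, yet the best one looks tradeable.
print(f"best of {N_SERIES} zero-mean series: Sharpe = {max(sharpes):.2f}")
```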
Other Sciences?
Particle Physics
• The Higgs boson was proposed in 1964 (the same year Sharpe published the CAPM)
• First tests of the CAPM in 1972; Nobel award in 1990
• Longer road for the Higgs: $5 billion to construct the LHC; “discovered” in 2012; Nobel in 2013
Other Sciences?
Particle Physics
• The testing method is very important
• The particle is rare and decays quickly; the key is measuring the decay signature
• The frequency is 1 in 10 billion collisions, and over a quadrillion collisions were conducted
• The problem is that the decay signature could also be caused by normal events from known processes
Other Sciences?
Particle Physics
• The two groups involved in testing (CMS and ATLAS) decided on what appears to be a tough standard: the t‐statistic must exceed 5 (i.e., 5‐sigma)
Other Sciences?
Genome‐Wide Association Studies
• Genetic association research is plagued by multiple testing
• Researchers try to link certain diseases to certain genes
• There are more than 20,000 human genes
• In addition, there is a massive number of combinations of genes
• For the first 10 years of publication of association studies, 98% of published results have been found to be false
– John Ioannidis: “…There are millions of scientists, some of whom run millions of analyses in each study they conduct. To avoid false‐positive results in genetics, the current goal for a p value should be less than 0.00000005.”*
*Approximately 5.3 sigma
Other Sciences?
Genome‐Wide Association Studies
• A recent paper in Nature claims two genetic linkages to Parkinson’s Disease
• Over 500,000 genetic sequences are tried
• By chance, thousands of sequences will appear to be linked to the disease
• The identified loci had t‐statistics > 5.3
1. Multiple Tests
• Provide a new framework to do multiple tests in the presence of correlations among tests and publication bias (hidden tests)
• Provide guidelines for future research
1. Multiple Tests: Number of Factors and Publications
[Chart: Factors and Publications – # of factors and # of papers per year, and cumulative # of factors]
1. Multiple Tests: How Many Discoveries Are False?
• In multiple testing, how many tests are likely to be false?
• In single testing (significance level = 5%), 5% is the “error rate” (false discoveries)
• In multiple testing, the false discovery rate (FDR) is usually much larger than 5%
1. Multiple Tests: Bonferroni's Method
• Here is a simple adjustment called the Bonferroni adjustment
• For a single test, you are tolerant of 5% false discoveries
• Hence, a p‐value of 5% or less means you declare a finding “true”
• Bonferroni simply multiplies the p‐value by the number of tests
1. Multiple Tests: Bonferroni's Method
• Bonferroni simply multiplies the p‐value by the number of tests
• In a single test, if you get a p‐value of 0.05 you declare “significant”
• Returning to the jelly beans, suppose the green jelly bean test has a p‐value of 0.04 – which appears “significant”
• Bonferroni adjustment: 20 × 0.04 = 0.80, which is “not significant” – not even close!
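The jelly-bean numbers, with the Bonferroni adjustment as described (raw p-value times the number of tests, capped at one):

```python
# Bonferroni: multiply the raw p-value by the number of tests (cap at 1).
def bonferroni(p_value: float, n_tests: int) -> float:
    return min(1.0, p_value * n_tests)

# 20 jelly-bean colors; green comes in at p = 0.04.
p_adj = bonferroni(0.04, 20)
print(f"adjusted p-value: {p_adj:.2f}")   # 0.80 -- not significant
print(p_adj < 0.05)                       # False
```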
1. Multiple Tests: Bonferroni's Method
• The stock market does better under Democratic presidents
• Difference “significant”: p‐value = 0.03
1. Multiple Tests: Bonferroni's Method
• However, there are many possible choices for the test – here are some:
[List: pairings of President + House + Senate party combinations]
• The Bonferroni adjustment eliminates the “significant” difference
Cocquemas and Whaley, 2016
1. Multiple Tests: Rewriting History
316 factors in 2012 if working papers are included
[Chart: cumulative # of factors, 1965–2025, plotted against t‐ratio, with significance thresholds for Bonferroni, Holm, and BHY, and t‐ratio = 1.96 (5%); labeled factors include MRT, EP, SMB, HML, MOM, LIQ, DEF, IVOL, SRV, CVOL, DCG, LRV]
1. Multiple Tests: Discussion
However:
• Dependence among test statistics is still not dealt with.
• The number of hidden tests seems too low.
1. Multiple Tests: A New Framework
No skill. Expected return = 0%
Skill. Expected return = 6%
1. Multiple Tests: Harvey, Liu and Zhu Approach
• Allows for correlation among strategy returns
• Allows for missing tests
Review of Financial Studies, 2016
1. Multiple Tests: Backtesting
• Due to data mining, a common practice in evaluating backtests of trading strategies is to discount Sharpe ratios by 50%
• The 50% haircut is only a rule of thumb; we develop an analytical way to determine the haircut
1. Multiple Tests: Backtesting
Method
• Suppose we observe a strategy with an attractive Sharpe Ratio
• This Sharpe Ratio directly implies a p‐value (which roughly tells you the probability that your strategy is a fluke)
• Suppose the p‐value is 0.01, which looks pretty good
1. Multiple Tests: Backtesting
Method
• However, suppose you tried 10 strategies and picked the best one
• The Bonferroni‐adjusted p‐value is 10 × 0.01 = 0.10, which would not be deemed “significant”
• Reverse engineer the 0.10 back to the “haircut” Sharpe Ratio*
*Note: t‐stat = SR × √T
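A sketch of that reverse-engineering, using the note that t-stat = SR × √T (with T in years for an annualized Sharpe ratio). The SR = 0.8, T = 10, 10-trial numbers are illustrative, not the paper's exact procedure.

```python
from statistics import NormalDist

norm = NormalDist()

def p_from_sharpe(sr: float, t_years: float) -> float:
    """Two-sided p-value implied by an annualized Sharpe ratio (t = SR*sqrt(T))."""
    return 2 * (1 - norm.cdf(sr * t_years ** 0.5))

def haircut_sharpe(sr: float, t_years: float, n_trials: int) -> float:
    """Bonferroni-adjust the p-value, then map it back to a Sharpe ratio."""
    p_adj = min(1.0, n_trials * p_from_sharpe(sr, t_years))
    return norm.inv_cdf(1 - p_adj / 2) / t_years ** 0.5

sr, T, trials = 0.8, 10, 10
print(f"raw p-value:    {p_from_sharpe(sr, T):.4f}")
print(f"haircut Sharpe: {haircut_sharpe(sr, T, trials):.2f}")  # well below 0.8
```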
1. Multiple Tests: Backtesting
Results: Percentage Haircut is Non‐Linear
Journal of Portfolio Management
2. Bootstrapping
The multiple testing approach has drawbacks:
• Need to know the number of tests
• Need to know the correlation among the tests
• With similar sample sizes, this approach does not impact the ordering of performance
2. Bootstrapping: Lucky Factors
Suppose we have 100 possible fund returns and 500 observations.
• Step 1. Strip out the alpha from all fund returns (e.g. regress on the benchmark and use the residuals). This means the alpha and t‐stat exactly equal zero – we have enforced “no skill”.
• Step 2. Bootstrap rows of the data to produce a new sheet 500×100* (note some rows are sampled more than once and some not sampled at all)
*500×101 with the benchmark included
2. Bootstrapping: Lucky Factors
• Step 3. Recalculate the alphas and t‐stats on the new data. Save the highest t‐statistic from the 100 funds. Note that in the unbootstrapped data, every t‐statistic is exactly zero.
• Step 4. Repeat steps 2 and 3 10,000 times.
• Step 5. Now that we have the empirical distribution of the max t‐statistic under the null of no skill, compare to the max t‐statistic in the real data.
2. Bootstrapping: Lucky Factors
• Step 5a. If the max t‐stat in the real data fails to exceed the threshold (95th percentile of the null distribution), stop (no fund has skill).
• Step 5b. If the max t‐stat in the real data exceeds the threshold, declare the fund, say F7, “true”
[Chart: bootstrap distribution of the max t‐stat; 95th percentile at t = 4.2]
2. Bootstrapping: Lucky Factors
• Step 6. Replace the F7 (no skill) with the actual F7 (positive alpha).
• Step 7. Note that 99 funds have zero alpha and one fund has positive alpha.
2. Bootstrapping: Lucky Factors
• Step 8. Repeat Steps 3–5, but now we save the “second to max” and compare it to the second‐highest t‐ratio in the real data.
• Step 9. Continue until the max ordered t‐statistic in the data fails to exceed the corresponding ordered statistic from the bootstrap.
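Steps 1 to 5 can be sketched on synthetic data. Assumptions in this sketch: demeaned returns stand in for benchmark residuals, the t-stat of the mean return stands in for a regression alpha t-stat, and the counts (20 funds, 500 bootstraps) are scaled down from the slide's 100 and 10,000.

```python
import random

random.seed(1)
N_FUNDS, N_OBS, N_BOOT = 20, 500, 500

def t_stat(x):
    """t-statistic of the mean of a return series."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    return m / (var ** 0.5 / n ** 0.5)

# Synthetic data: 19 zero-skill funds plus one genuinely skilled fund (#7).
funds = [[random.gauss(0.0, 0.05) for _ in range(N_OBS)] for _ in range(N_FUNDS)]
funds[7] = [random.gauss(0.015, 0.05) for _ in range(N_OBS)]

# Step 1: demean each fund, so every in-sample t-stat is exactly zero.
demeaned = [[r - sum(f) / N_OBS for r in f] for f in funds]

# Steps 2-4: resample rows, saving the max t-stat across funds each time.
max_ts = []
for _ in range(N_BOOT):
    rows = [random.randrange(N_OBS) for _ in range(N_OBS)]
    max_ts.append(max(t_stat([f[i] for i in rows]) for f in demeaned))

# Step 5: compare the real max t-stat to the 95th percentile of the null.
threshold = sorted(max_ts)[int(0.95 * N_BOOT)]
real_max = max(t_stat(f) for f in funds)
print(f"null 95th percentile: {threshold:.2f}  real max t-stat: {real_max:.2f}")
print("skill detected" if real_max > threshold else "no skill detected")
```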
2. Bootstrapping: Lucky Factors
[Flowchart: test candidate factors against the baseline model; if a factor is significant (Yes), add it to form an augmented model and repeat; if not (No), terminate to arrive at the final model]
2. Bootstrapping: Lucky Factors
• Addresses data mining directly
• Allows for cross‐correlation of the fund strategies because we are bootstrapping rows of data
• Allows for non‐normality in the data (no distributional assumptions imposed – we are resampling the original data)
• Potentially allows for time‐dependence in the data by changing to a block bootstrap
• Answers the questions: How many funds outperform? Which ones were just lucky?
3. Noise reduction: Rethinking
Issue
• Past alphas do a poor job of predicting future alphas (e.g., top‐quartile managers are about as likely to be in the top quartile next year as this year’s bottom‐quartile managers!)
3. Noise reduction: Rethinking
Issue
• This could be because all managers are unskilled – or it could be a result of a lot of noise in historical performance
3. Noise reduction: Rethinking
Goal
• Develop a metric that maximizes cross‐sectional predictability of performance
• Useful for separating “skill” vs. “luck” and “smart” vs. “not‐smart”
3. Noise reduction: Rethinking
Observed performance consists of four components:
• Alpha
• True factor premia
• Unmeasured risk (e.g., a low‐vol strategy having negative convexity)
• Noise (good or bad luck)
3. Noise reduction: Rethinking
Intuition
• The current alpha is overfit. The regression maximizes the time‐series R2 for a particular fund.
• This time‐series regression has nothing to do with cross‐sectional predictability.
• All of the noise will be put in the alpha.
• No surprise that past alphas have no ability to forecast future alphas
3. Noise reduction: Rethinking
Our approach
• We follow the machine learning literature and “regularize” the problem by imposing a parametric distribution on the cross‐section of alphas.
• Leads to lower time‐series R2 – but higher cross‐sectional R2
3. Noise reduction: Rethinking
• t‐stat = 3.9%/4.0% = 0.98 < 2.0
• alpha = 0 cannot be ruled out
3. Noise reduction: Rethinking
• Both t‐stats < 2.0
• alpha = 0 cannot be rejected for either
3. Noise reduction: Rethinking
• t‐stat < 2.0 for all funds
• alpha = 0 cannot be excluded for any fund
• However, the population mean seems to cluster around 4.0%. Should we declare all alphas as zero?
Estimated alphas cluster around 4.0%
3. Noise reduction: Rethinking
• Although no individual fund has a statistically significant alpha, the population mean seems to be well estimated at 4.0%.
• This might suggest grouping all funds into an index and estimating the alpha for the index. However, the index regression does not always work, as the next example shows.
3. Noise reduction: Rethinking
• Again, no fund generates a significant alpha individually
• An index fund that groups all funds together would indicate an approximately zero alpha for the index
• Fund alphas cluster into two groups. The two‐group classification seems more informative than declaring all alphas zero
3. Noise reduction: Rethinking
We assume that fund alphas are drawn from an underlying distribution (regularization)
– In Example 1, the distribution is a point mass at 4.0%; in Example 2, the distribution is a discrete distribution that has a mass of 0.5 at ‐4.0% and 0.5 at 4.0%
– We search for the best fitting distribution that describes the cross‐section of fund alphas using a generalized mixture distribution
3. Noise reduction: Rethinking
We refine the alpha estimate of each individual fund by drawing information from this underlying distribution
– In Example 1, knowing that most alphas cluster around 4.0% would pull our estimate of an individual fund’s alpha towards 4.0% and away from zero.
– In Example 2, knowing that alphas cluster at ‐4.0% and 4.0% with equal probabilities would pull our estimate of a negative alpha towards ‐4.0% and a positive alpha towards 4.0%, and both away from zero.
3. Noise reduction: Rethinking
Key idea: – We assume that true alphas follow a parametric distribution. We back out this distribution from the observed returns and use it to aid the inference of each individual fund.
Main difficulty: – We do not observe the true alphas. We only observe returns, which provide noisy information on true alphas.
Our approach:– We treat true alphas as missing observations and adapt the Expectation‐Maximization (EM) algorithm to uncover the true alphas.
3. Noise reduction: Rethinking
Iterative method:
– This method weights both the time‐series information for a particular fund’s alpha and the cross‐sectional information.
– This delivers a new estimate of alpha – the noise‐reduced alpha – a distribution for that individual fund’s alpha, as well as a cross‐sectional distribution. The shapes of the distributions are general (we use a generalized mixture distribution).
3. Noise reduction: Rethinking
• Estimate fund‐by‐fund OLS alphas, betas, and standard errors. Call these alpha0, beta0, sigma0 (denote the square of sigma as var0).
• Assume a two‐component population GMD and fit GMD0 based on the OLS alphas, i.e. each fund’s alpha0.
• This implies one set of five parameters: MU01, MU02, SIGMA01, SIGMA02, P0 (the mixing parameter). The first subscript denotes the iteration step.
• Also perturb these parameters to obtain 35 population GMDs as starting values (we want to minimize the chance we hit a local optimum).
– Note: population parameters are denoted in UPPER CASE and fund‐specific parameters in lower case.
3. Noise reduction: Rethinking
• Given fund‐specific alpha0, beta0 and var0, and the population GMD0, also fit fund‐specific GMDs, denoted gmd0 (again, lower case for fund‐specific).
• If the GMD is one component (i.e., a normal distribution), then the alpha for fund 1 also follows a one‐component GMD (i.e., a normal distribution).
• The mean of gmd0 is the precision‐weighted average:

  mean(gmd0) = [ alpha0/(var0/T) + MU0/VAR0 ] / [ 1/(var0/T) + 1/VAR0 ]

• Note that VAR0 is the variance of the population GMD (i.e. the cross‐sectional variance). Hence, if alpha0 is precisely estimated (high R2 and low var0/T), there is a greater weight placed on alpha0.
• This will be alpha1 for a candidate fund under a single‐component GMD.
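Numerically, the one-component weighting works like this (illustrative numbers, not the paper's): the fund's own alpha estimate and the population mean are combined in proportion to their precisions.

```python
def shrunk_alpha(alpha0, var0, T, MU0, VAR0):
    """Precision-weighted average of a fund's OLS alpha and the population mean."""
    w_fund = 1.0 / (var0 / T)   # precision of the fund's own estimate
    w_pop = 1.0 / VAR0          # precision of the population distribution
    return (alpha0 * w_fund + MU0 * w_pop) / (w_fund + w_pop)

# Precisely estimated fund (low var0/T): stays close to its own alpha0 of 8%.
a_precise = shrunk_alpha(alpha0=0.08, var0=0.01, T=100, MU0=0.04, VAR0=0.001)
# Noisy fund (high var0/T): pulled toward the population mean of 4%.
a_noisy = shrunk_alpha(alpha0=0.08, var0=0.25, T=100, MU0=0.04, VAR0=0.001)
print(f"precise fund: {a_precise:.3f}  noisy fund: {a_noisy:.3f}")
```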
3. Noise reduction: Rethinking
• If the GMD is two components, there are five parameters and, again, they will be a weighted average of the fund‐specific alpha0 parameters and the GMD0.
• The parameters governing this fund‐specific gmd will be conditional on the fund’s betas, standard error, and the GMD that governs the alpha population:

  mu01 = [ alpha0/(var0/T) + MU01/VAR01 ] / [ 1/(var0/T) + 1/VAR01 ]
  var01 = 1 / [ 1/(var0/T) + 1/VAR01 ]
  mu02 = [ alpha0/(var0/T) + MU02/VAR02 ] / [ 1/(var0/T) + 1/VAR02 ]
  var02 = 1 / [ 1/(var0/T) + 1/VAR02 ]
3. Noise reduction: Rethinking
Details of method:
• There is also a fifth parameter of the gmd, p0 (the drawing probability for the gmd component).
• Its formula is a function of the GMD’s P0 and is provided on p. 49 of our paper.
• The basic intuition is that we increase the drawing probability of the component that implies a mean closer to the mean of the population GMD. For example, we make p0 larger if alpha0 is closer to MU01 than to MU02.
3. Noise reduction: Rethinking
Details of method:
• For each fund’s gmd, we calculate its mean. We estimate new regressions where we constrain the intercepts to be the calculated means. This produces different estimates of the fund betas (beta1) and the standard errors (sigma1).
3. Noise reduction: Rethinking
Details of method:
• We fit a new GMD based on the cross‐section of gmds. For each fund, we randomly draw m = 10,000 alphas from its gmd. Suppose we have n funds in the cross‐section. We will then have m×n draws from the entire panel. We find the MLE of the GMD that best describes these m×n alphas.
• Recalculate the fund‐specific gmds (gmd1) and draw alpha2
• Continue to iterate until there is negligible change in the parameters of the GMD.
• Repeat the entire process 35 times with different initial GMD0s to ensure global convergence.
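A minimal sketch of this iteration on fake data, simplified in two ways: a one-component population (a plain normal rather than a generalized mixture) and mean returns in place of regression alphas. The E-step shrinks each fund toward the population fit; the M-step refits the population distribution.

```python
import random
import statistics

random.seed(3)
T, N = 120, 200
# Fake cross-section: true alphas drawn from N(4%, 2%), noisy fund returns.
true_alphas = [random.gauss(0.04, 0.02) for _ in range(N)]
returns = [[random.gauss(a, 0.15) for _ in range(T)] for a in true_alphas]

alpha0 = [statistics.mean(r) for r in returns]       # "OLS" alphas
var0 = [statistics.variance(r) for r in returns]

MU, VAR = statistics.mean(alpha0), statistics.variance(alpha0)
for _ in range(50):
    # E-step: each fund's posterior mean/variance given the population fit.
    post_mean = [(a * T / v + MU / VAR) / (T / v + 1 / VAR)
                 for a, v in zip(alpha0, var0)]
    post_var = [1 / (T / v + 1 / VAR) for v in var0]
    # M-step: refit the population distribution, including posterior
    # uncertainty so the variance estimate does not collapse.
    MU = statistics.mean(post_mean)
    VAR = statistics.variance(post_mean) + statistics.mean(post_var)

print(f"population alpha distribution: mean {MU:.3f}, sd {VAR ** 0.5:.3f}")
```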
3. Noise reduction: Rethinking
• An exemplar outperforming fund
3. Noise reduction: Rethinking
• In‐sample: 1984–2001; Out‐of‐sample: 2002–2011

  In‐sample t‐stat   NRA forecast error (%)*   OLS forecast error (%)*   # of funds
  (‐∞, ‐2.0)         3.29                      6.61                      64
  [‐2.0, ‐1.5)       3.09                      3.70                      75
  [‐1.5, 0)          2.75                      2.92                      565
  [0, 1.5)           2.61                      5.54                      610
  [1.5, 2.0)         2.38                      10.47                     87
  [2.0, +∞)          2.77                      12.02                     87
  Overall            2.71                      5.17                      1,488

*Mean absolute forecast errors.
4. P‐Hacking
• Choices are made in research that lead to a positive outcome
The Hurricane and Himmicane
Jung, Shavitt, Viswanathan and Hilbe, “Female hurricanes are deadlier than male hurricanes”, Proceedings of the National Academy of Sciences, 2014
Hypothesis: Sexism causes people to take hurricanes with female names less seriously.
http://www.pnas.org/content/111/24/8782
2015 Impact Factor = 9.4
Publishes 3,100 papers per year
The Hurricane and Himmicane
Jung, Shavitt, Viswanathan and Hilbe, 2014:“a hurricane with a relatively masculine name … is estimated to cause 15.15 deaths, whereas a hurricane with a relatively feminine name … is estimated to cause 41.84 deaths. Our model suggests that changing a severe hurricane’s name from Charley … to Eloise could nearly triple its death toll.”
“a hazardous form of implicit sexism”
The Hurricane and Himmicane
However, certain choices were made by the researchers:
• Why exclude named tropical storms? 18 Atlantic tropical storms caused 235 deaths, compared to 22 hurricanes causing 614 deaths
• Why exclude storms that do not make landfall?
• Why only count offshore fatalities if the storm comes onshore?
• Why exclude fatalities outside the US? In 1980, Hurricane Allen made landfall near Brownsville, TX on the border with Mexico. There were 269 deaths, but Jung et al. count only 2.
• Why not test the robustness of the results using Pacific hurricane data?
http://dx.doi.org/10.1016/j.wace.2015.11.006
The Hurricane and Himmicane
Gary Smith, “Hurricane names: A bunch of hot air?”, Weather and Climate Extremes, 2016 (Impact Factor = 1.4), accuses the authors of:
• Arbitrary exclusion of “outliers”
• Arbitrary construction of a masculinity‐femininity index on a scale of 1–11 (Sandy is considered strongly feminine – more than Edith)
• Estimating dozens of models and cherry‐picking the one that gives “significant” results
The Hurricane and Himmicane
Gary Smith, 2016, also accuses the authors of:
• Dropping a key variable, “years elapsed since the occurrence of the hurricane”
• Misspecifying the basic model by including monetary damages as an “explanatory variable” (monetary damage cannot ‘cause’ fatalities)
• “Data dredging”: “If you torture the data long enough, it will confess.”
• There are no significant differences when the sample is expanded and the specification corrected
4. P‐Hacking
• Sample selection
• “Outlier” exclusion
• Data transformation (scaling)
• Variable selection and combination
• Statistical test choice
• In‐sample/out‐of‐sample
Indeed, most academic and commercial research suffers from some form of p‐hacking.
5. Rare Effects
• Bonferroni correction increases the threshold as a result of multiple tests
• If an effect is rare, there will be a very high Type I error rate (a lot of false positives)
The Power Pose
Hypothesis: Standing in a posture of confidence affects testosterone and cortisol levels in the brain, leading to increased risk taking.
Evidence: Carney, Cuddy and Yap (2010), Psychological Science
Carney, Cuddy and Yap, 2010. Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance, Psychological Science 21(10), 1363–1368.
The Power Pose
Second most viewed TED talk in history
The Power Pose
New York Times Best Seller (reached #3)
The Power Pose
Simmons and Simonsohn, 2016, https://ssrn.com/abstract=2791272
The Power Pose
Also see Gelman and Fung, 2016: http://www.slate.com/articles/health_and_science/science/2016/01/amy_cuddy_s_power_pose_research_is_the_latest_example_of_scientific_overreach.html
The Power Pose
24 studies
The Power Pose
Dana Carney retracts:
http://faculty.haas.berkeley.edu/dana_carney/pdf_My%20position%20on%20power%20poses.pdf
Rare Effects: 500 Shades of Gray
An experiment conducted at the University of Virginia
• Hypothesis: Political extremists see only black and white – literally.
• Experiment: Show words in different shades of gray, then ask participants to try to match the color on a gradient.
• Afterwards, evaluate where their political beliefs place them on the spectrum and test the hypothesis that moderates are more accurate.
Nosek, Spies and Motyl (2012)
Rare Effects: 500 Shades of Gray
Hello
Drag the slider to match the color of the word
Rare Effects: 500 Shades of Gray
Group 1: Moderates
Group 2: Extremists
Rare Effects: 500 Shades of Gray
Dramatic results with a large sample of 2,000 participants
• Moderates were able to see significantly more shades of gray
• P‐value < 0.001, which is highly significant, implying less than a 0.1% chance of observing results this extreme under the null hypothesis of no effect
Rare Effects: 500 Shades of Gray
The researchers decided to replicate before submitting the results for publication in a top journal
• The replication saw no significant difference
• The p‐value was 0.59 (not even close to significant)
Rare Effects: 500 Shades of Gray
Lesson: If the hypothesis is unlikely, then we need to be especially careful. There will be a lot of false positives using standard testing procedures. Ideally, we incorporate information in the testing procedure when we know the effect is rare.
Baker‐Miller Pink
A.G. Schauss
• “Tranquilizing Effect of Color Reduces Aggressive Behavior and Potential Violence”*
• “Room Color and Aggression in A Criminal Detention Holding Cell: A Test of the ‘Tranquilizing Pink’ Hypothesis”**
– Named Baker‐Miller pink after the two Naval correctional institute directors who sponsored the experiment.
*http://www.orthomolecular.org/library/jom/1979/pdf/1979‐v08n04‐p218.pdf
**http://orthomolecular.org/library/jom/1981/pdf/1981‐v10n03‐p174.pdf
Last month Kendall Jenner, the reality television celebrity and half‐sister of Kim Kardashian, announced to the world that she had painted her living‐room wall the shade of pink used in some American police cells to calm rowdy detainees. (The Times, February 2, 2017)
A friend had told her that staring at this hue, which is known variously as “drunk‐tank” pink and Baker‐Miller pink, was “scientifically proven” to suppress the appetite.
Baker‐Miller Pink
Gilliam and Unruh
• “The Effects of Baker-Miller Pink on Biological, Physical and Cognitive Behavior”*
– Original effect debunked
– Yet the belief that the effect is true lives on …
* http://www.orthomolecular.org/library/jom/1988/pdf/1988‐v03n04‐p202.pdf
5. Rare Effects
Go to:
PollEv.com/finance663
• 1% of women aged 40–50 have breast cancer – a relatively rare event
• 90% chance of a true positive test from a mammogram
• 10% error rate (false positives) from a mammogram
What is the chance that a woman has breast cancer given a positive test?
5. Rare Effects
• A survey of doctors found a mean response of 75%
5. Rare Effects
• Sample size = 1,000, with 10 true cases
• Test is 90% accurate, so 9 of the 10 true cases test positive
• The 10% error rate produces 0.10 × 990 = 99 false positives
• Given a positive test result, the probability of cancer is 9/(9 + 99) ≈ 8.3%
• In other words, the probability of a false diagnosis is 99/108 ≈ 91.7%
Note: this is a simple application of Bayes' Rule
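The counting argument on this slide can be checked with a short Python sketch (the variable names are mine, not from the slides):

```python
# Mammogram example from the slides: 1% prevalence, 90% power,
# 10% false-positive rate, hypothetical sample of 1,000 women.
prevalence = 0.01    # P(cancer)
power = 0.90         # P(positive | cancer)
error_rate = 0.10    # P(positive | no cancer)

n = 1_000
true_cases = n * prevalence                       # 10 women with cancer
true_positives = power * true_cases               # 9 correctly flagged
false_positives = error_rate * (n - true_cases)   # 99 false alarms

# Bayes' Rule as a ratio of counts: P(cancer | positive test)
p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(round(p_cancer_given_positive, 3))      # 0.083 -- roughly 8%, not 75%
print(round(1 - p_cancer_given_positive, 3))  # 0.917 -- probability of a false diagnosis
```

The gap between the doctors' mean answer (75%) and the correct answer (about 8%) is entirely driven by the rarity of the condition.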
Rate of False Discoveries/Diagnoses
π is the prior probability (1% for breast cancer); α is the error rate; (1 − β) is the power (the probability the test will reject the null when the alternative is true)

Expected fraction of false discoveries = α(1 − π) / [α(1 − π) + (1 − β)π]

• Even with 100% power, if π is very small, then the expected fraction of false discoveries is very high (close to one).
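As a sketch, the expected false-discovery fraction implied by the counting argument is α(1 − π) / [α(1 − π) + (1 − β)π], where π is the prior probability, α the error rate, and (1 − β) the power; the function name below is mine:

```python
def false_discovery_rate(prior, alpha, power):
    """Expected fraction of positive tests that are false:
    alpha*(1 - prior) / (alpha*(1 - prior) + power*prior)."""
    false_pos = alpha * (1 - prior)   # fraction of true nulls flagged
    true_pos = power * prior          # fraction of true effects flagged
    return false_pos / (false_pos + true_pos)

# Breast-cancer numbers: prior 1%, error rate 10%, power 90%
print(round(false_discovery_rate(0.01, 0.10, 0.90), 3))   # 0.917
# Even with 100% power, a tiny prior keeps false discoveries near one:
print(round(false_discovery_rate(0.001, 0.10, 1.00), 3))  # 0.990
```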
The Case for Using Priors
Three experiments:*
1. The musicologist
2. The tea drinker
3. The bar patron
*Based on correspondence from Leonard Savage in 1962.
The Case for Using Priors
Musicologist claims to be able to identify from unlabeled scores whether Haydn or Mozart is the composer
Simple experiment: 10 pairs of scores. The musicologist gets 10/10 correct.
The Case for Using Priors
Tea drinker claims to be able to identify whether milk was in the tea cup before the tea was poured or added afterwards
Simple experiment: 10 pairs of tea cups. The tea drinker gets 10/10 correct.
The Case for Using Priors
Bar patron claims that alcohol enables him to foresee the future
Simple experiment: Flip a coin 10 times. The bar patron gets 10/10 correct.
The Case for Using Priors
All three experiments have an identical p-value: 0.5^10 = 0.000977 (i.e., p-value < 0.001)
• This means there is less than a 1 in 1,000 chance of observing these results if the null hypothesis (no ability to choose the correct answers) were true
• Though p‐values are identical, the results have different impacts on our beliefs
The Case for Using Priors
Three experiments:
1. The musicologist: We already know she is an expert. Indeed, it is not even clear that we need to do the experiment. Our beliefs are barely impacted.
2. The tea drinker: We might have been a bit skeptical of this long-time tea drinker. However, after these results, the plausibility of the claim is greatly strengthened and our beliefs shift.
3. The bar patron: The hypothesis is preposterous. A p-value of 0.001 or even lower would not change our beliefs.
The Bayesian Setup
Bayesian learning implies:
Posterior odds = Bayes factor × Prior odds
• where the Bayes factor is the ratio of the likelihood of the data under the null to the likelihood of the data under the alternative.
• The Bayes Factor tells us how much we are moved away from the prior given the evidence.
• A very small Bayes factor is supportive of the alternative.
• In practice, the full Bayes factor can be difficult to implement
A Simplified Approach
Minimum Bayes Factor
• The MBF is the lower bound among all Bayes factors.
• It is achieved when the prior distribution of the alternative hypotheses has all of its density at the maximum likelihood estimate of the data.
• It is the Bayes factor that provides the strongest evidence against the null hypothesis.
A Simplified Approach
Minimum Bayes Factor• It is very easy to calculate:
/
• So, if t‐stat = 2.0 (usually associated with p‐value of 0.05), the MBF is 0. 14.
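A quick numerical check of the MBF formula e^(−t²/2) (a minimal sketch; the function name is mine):

```python
import math

def minimum_bayes_factor(t_stat):
    """Minimum Bayes factor for a test statistic: exp(-t^2 / 2)."""
    return math.exp(-t_stat ** 2 / 2)

print(round(minimum_bayes_factor(2.0), 2))   # 0.14 -- the value quoted on the slide
print(round(minimum_bayes_factor(2.57), 4))  # 0.0368
```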
A Bayesianized P‐value
We can use the MBF to answer a key question:

Bayesianized p-value = (MBF × prior odds) / (1 + MBF × prior odds)

where prior odds = P(null)/P(alternative)
• The Bayesianized p‐value tells us the probability the null is true given the data
• It answers the right question
A Bayesianized P‐value
Example: Suppose our null hypothesis is that variable Y is not predicted by X. We run a regression of Y on X with 300 observations and find a “significant” coefficient with a t-statistic of 2.6, which has a p-value of 0.014.

MBF = e^(−2.6²/2) = 0.034

• Let’s assume prior odds are even, i.e., 1:1

Bayesianized p-value = (0.034 × 1) / (1 + 0.034 × 1) = 0.033

• However, if you think there are modest odds against the effect being real, say 2:1, the probability that the null is true increases to 0.064.
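The example above can be reproduced with a small sketch, assuming Bayesianized p-value = MBF × prior odds / (1 + MBF × prior odds) with MBF = e^(−t²/2); the function name and default are mine:

```python
import math

def bayesianized_p_value(t_stat, prior_odds_null=1.0):
    """P(null | data) under the MBF: mbf * odds / (1 + mbf * odds),
    where odds = P(null) / P(alternative) before seeing the data."""
    mbf = math.exp(-t_stat ** 2 / 2)
    posterior_odds = mbf * prior_odds_null
    return posterior_odds / (1 + posterior_odds)

print(round(bayesianized_p_value(2.6, 1.0), 3))  # 0.033 with even (1:1) prior odds
print(round(bayesianized_p_value(2.6, 2.0), 3))  # 0.064 with 2:1 odds against the effect
```

Note how the same t-statistic of 2.6 yields very different posterior probabilities once the prior odds change.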
A Bayesianized P‐value
There is another MBF that is not as generous to the alternative. It places the mass at the null with a symmetric and declining density. It is also very easy to calculate:

MBF = −e × p × ln(p)

where e is the base of the natural logarithm and p is the usual p-value (valid for p < 1/e)
• This type of prior might be more appropriate in certain situations in finance
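A sketch of this alternative bound, assuming the symmetric, declining-density form MBF = −e·p·ln(p), which is only defined for p < 1/e (the function name is mine):

```python
import math

def mbf_symmetric_declining(p_value):
    """Symmetric, declining-density minimum Bayes factor: -e * p * ln(p).
    Only meaningful for p in (0, 1/e), i.e. p below about 0.368."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("p-value must lie in (0, 1/e)")
    return -math.e * p_value * math.log(p_value)

# Less generous to the alternative than exp(-t^2/2):
print(round(mbf_symmetric_declining(0.05), 3))   # 0.407 (versus about 0.14)
print(round(mbf_symmetric_declining(0.014), 3))  # 0.162
```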
Examples in Practice
Prior category | Effect | Sample | Reported t-stat | Reported p-value | MBF | Prior odds-ratio (null/alternative) | Bayesianized p-value
A stretch | Clever tickers outperform (Head, Smith and Wilson, 2009) | 1984–2005 | 2.66 | 0.0079 | 0.0291 | 99/1 | 0.742
Perhaps | Size priced (Fama and French, 1992) | 1963–1990 | 2.58 | 0.0099 | 0.0359 | 4/1 | 0.125
Solid footing | Market beta priced (Fama and MacBeth, 1973) | 1935–1968 | 2.57 | 0.0100 | 0.0368 | 1/1 | 0.035
Final perspectives
A combination of the propensity for Type I errors, incorrect testing methods, and a lack of effort to reduce noise implies:
• Most published empirical research findings are likely false
• Most research conducted within companies is likely false
• Most managers are just “lucky”
• Most of the smart beta products are not “smart”
• There is no predictability in performance based on past performance
Final perspectives
• My research makes progress on the goal of identifying repeatable performance
• There are a host of other issues:
– Factor loadings are also noisy
– Ex-post factor loadings unfairly punish market timers
– It is essential to look beyond the Sharpe ratio and incorporate other information
Credits
Joint work with
Yan LiuTexas A&M University
Based on:
• “The Scientific Outlook in Financial Economics” https://ssrn.com/abstract=2893930 [Presidential Address]
and my joint work with Yan Liu:
• “… and the Cross-section of Expected Returns” http://ssrn.com/abstract=2249314 [Best paper in investment, WFA 2014]
• “Backtesting” http://ssrn.com/abstract=2345489 [Bernstein Fabozzi/Jacobs-Levy best paper, JPM 2016]
• “Evaluating Trading Strategies” http://ssrn.com/abstract=2474755 [Bernstein Fabozzi/Jacobs-Levy best paper, JPM 2015]
• “Lucky Factors” http://ssrn.com/abstract=2528780
• “Rethinking Performance Evaluation” http://ssrn.com/abstract=2691658
Appendix: Questionnaire
Hypothesis: Firms with small boards of directors outperform companies with large boards. A test for differences in mean returns (with controls) is significant with t = 2.7 (p-value = 0.01). Consider the following questions (True or False):
1. You have disproved the null hypothesis (no difference in mean performance).
2. You have found the probability of the null hypothesis being true.
3. You have proved your hypothesis that firms with small boards outperform firms with large boards.
4. You can deduce the probability of your hypothesis (small better than large) being true.
5. You know, when you reject the null hypothesis (of no difference), the probability that you are making a mistake.
6. You have a reliable finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of the occasions.
Appendix
The answer is “False” for each question.
• The p-value does not tell you whether the null hypothesis or the underlying experimental hypothesis is “true”. It is also incorrect to interpret the test as providing (1 − p-value) percent confidence that the effect being tested is true. Hence, both (1) and (3) are false.
• The p-value tells us the probability of observing an effect, D, or greater, given that the null hypothesis, H0, is true, i.e., p(D|H0). It does not tell us p(H0|D) – hence (2) is false.
• The p-value says nothing about the experimental hypothesis being true or false – hence (4) is false. Question (5) also refers to the probability of a hypothesis, which the p-value does not deal with. Hence, (5) is false.
• The complement of the p-value does not tell us the probability that a similar effect will hold up in the future unless we know the null is true – and we don’t. Hence, (6) is false.
Appendix
The p-value is P[D|H], not P[H|D]
• where H is the null hypothesis and D is the observed data
• It is routine to look at a low p‐value, like p=0.01 and conclude that there is only a 1% chance the null is true. That is incorrect.
Appendix
The p-value is P[D|H], not P[H|D]
• To see the large gap, consider the difference between:
– P[Death|Hanging] = 0.99
– P[Hanging|Death] = 0.01
It makes no sense to equate the two.
Appendix
In addition...
• The p-value is routinely used to choose among specifications, i.e., to choose the one with the lowest p-value. Comparing p-values across specifications has no statistical meaning.
• A low p-value, while rejecting the null hypothesis, tells us very little about the ability of the hypothesis to explain the data. That is, you might observe a low p-value but the model has a low R².
• Low p-values could be a result of not controlling for multiple testing.
• Low p-values could be a result of selection and/or p-hacking.
• Low p-values could be a result of a misspecified test.
• P-values crucially depend on the amount of data. It has been well known since Berkson (1938, 1942) that with enough data, you can reject any null hypothesis.
• P-values do not tell us about the size of the economic effect.