An algorithm that finds truth even if most people are wrong*
Drazen Prelec
Massachusetts Institute of Technology
Sloan School, Cambridge MA 02139
H. Sebastian Seung
Howard Hughes Medical Institute
and
Massachusetts Institute of Technology, Brain and Cognitive Sciences
Cambridge MA 02139
Abstract
Effective use of subjective judgment is essential in all fields of knowledge. We present a method for finding truth when the subjective judgments of multiple respondents are the only evidence available, and majority opinion may be wrong. Respondents are scored for their own judgments and for their metaknowledge of others' judgments. In a probabilistic model of belief formation, their scores converge to the subjective probabilities they implicitly assign to the truth. Hence, the judgments of the highest-scoring respondents reveal truth, irrespective of majority opinion. An experiment demonstrates that the method outperforms majority rule in selecting truth, and in identifying experts.
* Acknowledgments: Support by NSF SES-0519141, Institute for Advanced Study, and John Simon Guggenheim Foundation (Prelec). Hospitality of Aspen Center for Physics, Zagreb Math Dept, Tom Palfrey and Princeton Lab for Exp Soc Sci. Comments and discussion: Dilip Abreu, Shane Frederick, Michael Kearns, Eric Maskin, Danica Mijovic-Prelec, Whitman Richards, Rebecca Saxe, Ray Weaver.
Subjective judgment is an irreducible component of pure and applied knowledge (1). Experts estimate the likelihood of rare accidents, assess the evolutionary significance of bone fragments, pronounce on aesthetic value and give interpretation on points of law. Even the touchstone of scientific practice, peer review of articles and proposals, rests on referees' subjective assessments of originality and significance. But experts often disagree, and can be wrong (2, 3). The lack of accepted procedures for reconciling opinions feeds skepticism about subjective judgment, and promotes controversy when expertise is claimed but difficult to prove (4).

New Web-based technologies have made it easier to tap the knowledge of dispersed individuals, raising hopes that automated opinion aggregation mechanisms might provide an impartial, efficient alternative to expert judgment (2, 5). Indeed, electronic information (prediction) markets have proved remarkably accurate in some domains (6, 7). However, markets suffer from a fundamental limitation: they can only trade claims whose final value is defined by a public event. For example, one can organize a market to predict the outcome of a jury trial, which is of course a public event, but not to determine the defendant's true guilt or innocence (5).
We present a method for finding truth when subjective judgment remains the only source of evidence and there is a possibility that most people are wrong. The method selects the judgments of respondents who reveal superior metaknowledge, that is, knowledge of others' opinions. The explicit metaknowledge criterion distinguishes our method from other approaches, such as voting algorithms (8-13), multi-item analysis (14, 15), or Bayesian updating (1, 16-18).
The instrument for measuring metaknowledge is a mathematical formula called the Bayesian Truth Serum (BTS) (19). As input to the formula, respondents provide their judgments and also their predictions of the distribution of judgments in a sample of peers. These dual reports then undergo competitive, zero-sum scoring as a function of the sample statistics. Previously, it was shown that BTS scoring provides incentives for respondents to answer truthfully (19), but the problem of identifying judgments that are closer to objective, impersonal truth was left open. Here, the same formula is shown to select judgments that are objectively true, under fairly mild assumptions about how truth and belief are related. Hence, when these assumptions hold, the BTS scoring system provides an integrated solution to both problems: rewarding truthfulness, and aggregating judgments to discover truth (20).
Metaknowledge is an effective truth diagnostic whenever information is unevenly distributed among respondents, so that some pieces are widely shared but other pieces are known only by a minority (16, 21). As an elementary example, consider the (false) proposition that Chicago is the capital of the state of Illinois. Respondents might form different opinions about the truth of the proposition, depending on whether they knew: (a) that Chicago is a large city, (b) that it is located in the state of Illinois, (c) that Springfield is the actual capital of Illinois, and so on. If the typical person is aware of (a) and (b) but not of (c), then the majority of those queried might vote for the incorrect answer, that the proposition is True.

A democratic poll ignores the asymmetry in metaknowledge between respondents who know the right answer and those who do not. Those who know that Chicago is not the capital of Illinois can imagine that many others will be misled. A comparable insight into the opinions of others is not available to those who falsely believe the answer is Yes (22). Our scoring method in effect reweights the votes so as to reflect different levels of metaknowledge associated with each possible answer. If the method works as claimed, the true answer should emerge as the winner, regardless of how many respondents endorse it. Later on, this claim will be justified within a Bayesian model of belief formation, using the mathematical theorem about the BTS formula given in Box 1. But first we give a description of the method itself.
The scoring works at the level of a single question with m answers, which is posed to n respondents, indexed by $r, s, \ldots \in \{1, 2, \ldots, n\}$. The question should admit only one right answer, and it should be reasonable to believe that if one could somehow pool information from all respondents, then that collective knowledge would determine the right answer. For example,

(a) Chicago is the state capital of Illinois (m=2: True, False)
(b) The best current estimate of the global temperature increase between now and 2100 is _____ (m=5: < 2°C, 2°-4°C, 4°-6°C, 6°-8°C, > 8°C)
(c) On current evidence, the probability of intelligent extraterrestrial life in the universe is ____ (m=10: numerical probability scale, divided into 10 categories)

The respondent is asked to endorse the answer most likely to be true, and to predict the proportion of the sample that will endorse each possible answer. Let $x_k^r \in \{0,1\}$ indicate whether respondent r has endorsed answer k, and $y^r = (y_1^r, \ldots, y_m^r)$ her prediction of the sample proportions ($y_k^r \ge 0$, $\sum_k y_k^r = 1$). The truth-selection algorithm proceeds in four steps.

Step 1. Calculate the average $\bar{x}_k$ of the endorsements and the geometric mean $\bar{y}_k$ of the predictions:

$$\bar{x}_k = \frac{1}{n}\sum_{r=1}^{n} x_k^r, \qquad \log \bar{y}_k = \frac{1}{n}\sum_{r=1}^{n} \log y_k^r \qquad (1)$$

Step 2. Calculate the BTS score (23) of each individual r:

$$u^r = \sum_{k=1}^{m} x_k^r \log\frac{\bar{x}_k}{\bar{y}_k} + \sum_{k=1}^{m} \bar{x}_k \log\frac{y_k^r}{\bar{x}_k} \qquad (2)$$

Step 3. For each answer k, calculate the average BTS score $\bar{u}_k$ of all individuals endorsing answer k:

$$\bar{u}_k = \frac{1}{n \bar{x}_k}\sum_{r=1}^{n} x_k^r u^r \qquad (3)$$

Step 4. Select the answer k that maximizes $\bar{u}_k$.
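The four steps can be sketched directly in code. The following is a minimal NumPy sketch under our own naming conventions; it assumes every answer receives at least one endorsement, and trims zero/one predictions as in note 23.

```python
import numpy as np

def bts_select(x, y, eps=0.01):
    """Truth selection by BTS scoring (Steps 1-4).

    x : (n, m) 0/1 array; x[r, k] = 1 iff respondent r endorsed answer k.
    y : (n, m) array; y[r] is respondent r's predicted sample proportions.
    Assumes every answer is endorsed at least once.
    Returns the selected answer index and the per-answer averages u_bar.
    """
    y = np.clip(y, eps, 1 - eps)              # trim 0/1 predictions (note 23)
    n, m = x.shape
    x_bar = x.mean(axis=0)                    # Step 1: mean endorsements
    log_y_bar = np.log(y).mean(axis=0)        # Step 1: log geometric-mean predictions
    # Step 2: BTS score of each respondent (equation 2)
    u = (x * (np.log(x_bar) - log_y_bar)).sum(axis=1) \
        + (x_bar * (np.log(y) - np.log(x_bar))).sum(axis=1)
    # Step 3: average score of the endorsers of each answer (equation 3)
    u_bar = (x * u[:, None]).sum(axis=0) / (n * x_bar)
    # Step 4: the answer with the highest average score wins
    return int(np.argmax(u_bar)), u_bar

# Illustration with the Chicago numbers from the text: 60% endorse Yes and
# predict 89% Yes; 40% endorse No and predict 70% Yes (n = 10 for concreteness).
x = np.array([[1, 0]] * 6 + [[0, 1]] * 4)
y = np.array([[0.89, 0.11]] * 6 + [[0.70, 0.30]] * 4)
k, u_bar = bts_select(x, y)   # k = 1 ("No"); u_bar ≈ (-0.58, +0.87)
```

The slight difference from the −0.57 and +0.86 reported in the text comes from rounding the predictions to 89% and 70%.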
This algorithm will be validated both theoretically and experimentally, but first a few words about our methodology are in order. The theoretical justification is based on a model of belief formation presented below. While this model is intuitively reasonable, its quality as an approximation to real-world situations cannot be decided by purely theoretical means. Therefore we will also provide a simple empirical test. In this application the true answers are known independently, so that the accuracy of the method can be quantified.
The model formalizes the relation between truth and subjective beliefs of an ideal Bayesian respondent. The truth $W = i$ is associated with one of the m answers, and is drawn from a probability distribution $\Pr[W = k]$. Each respondent r then receives a signal $T^r$, representing information relevant to the question. Respondents are identical apart from the value of their signal, and signals of different respondents are presumed independent when conditioned on W. Therefore, the signal generation process is fully specified by a signal matrix $S_{kj} = \Pr[T^r = k \mid W = j]$.

The complete probability distribution over all variables, also called the common prior, is assumed to be common knowledge among respondents (24). In that case, Bayesian respondents can compute a belief matrix $B_{jk} = \Pr[W = j \mid T^r = k]$, which expresses subjective uncertainty about truth, and a metaknowledge matrix $M_{jk} = \Pr[T^s = j \mid T^r = k]$, which expresses uncertainty about signals received by others. We simplify the belief matrix by assuming that exactly one signal favors each possible answer. Then there is no ambiguity in taking $T^r = k$ to mean that respondent r believes that W = k is most likely to be true: $\Pr[W = k \mid T^r = k] > \Pr[W = j \mid T^r = k]$ for all j ≠ k.
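For concreteness, both matrices can be computed from a prior and a signal matrix by Bayes' rule. The sketch below uses the numbers from Box 2 (the variable names are ours):

```python
import numpy as np

p = np.array([0.75, 0.25])          # prior Pr[W = j]: 3:1 odds Chicago is the capital
S = np.array([[0.95, 0.60],         # S[k, j] = Pr[T^r = k | W = j]
              [0.05, 0.40]])        # row 0: null signal; row 1: "not the capital" signal

joint = S * p                       # joint[k, j] = Pr[T^r = k, W = j]
m_marg = joint.sum(axis=1)          # m_marg[k] = Pr[T^r = k]
B = (joint / m_marg[:, None]).T     # belief matrix B[j, k] = Pr[W = j | T^r = k]

# Metaknowledge matrix M[j, k] = Pr[T^s = j | T^r = k]: average the signal
# distribution Pr[T^s = j | W = w] over the posterior Pr[W = w | T^r = k].
M = S @ B

print(B[0])                         # ≈ [0.83, 0.27]: belief that W = Yes, per signal
print(M[0])                         # ≈ [0.89, 0.70]: predicted fraction endorsing Yes
```

These are exactly the 83%/27% beliefs and 89%/70% predictions that appear in Box 2.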
For dichotomous questions this completes the model. For m ≥ 3, one further assumption on the structure of the belief matrix is necessary. This is truth sensitivity,

$$\Pr[W = i \mid T^r = i] > \Pr[W = i \mid T^r = k] \qquad (4)$$

for any k different from the truth i. In other words, if we compare the degrees of belief in the true answer W = i, those who received the signal $T^r = i$ believe more strongly in the true answer than those who received other signals. The reasonableness of this assumption may be more evident when rewritten using Bayes' rule as

$$\frac{\Pr[T^r = i \mid W = i]}{\Pr[T^r = k \mid W = i]} > \frac{\Pr[T^r = i]}{\Pr[T^r = k]}$$

which asserts that conditioning upon truth strengthens belief in truth relative to belief in untruth (25).
Above we stated that respondents' signals are independent when conditioned on truth. From this assumption one might jump to the mistaken conclusion that the predictions of all respondents should be the same, regardless of the answers they endorse. In fact, the signals of the respondents have statistical dependencies induced by uncertainty about W. Calculation of the metaknowledge matrix by Bayes' rule therefore requires averaging over all possible values of W, giving rise to differences in metaknowledge and causing predictions of respondents to depend on which answer they endorse. This dependence is consistent with Bayesian reasoning and is also well corroborated by experiments (26).

Observe that the common prior contains information about counterfactual scenarios. For example, what fraction of people would believe that Chicago is the capital of Illinois, if Chicago were actually the capital of Illinois? What is the prior probability that Chicago is the capital of Illinois? Introspectively, these seem to be difficult and speculative questions, and our method does not require respondents to answer them. Instead, we only request that respondents pick the answer they believe most likely to be true, and predict the distribution of endorsements in the sample.
These paired reports are in turn directly related to the elements of the belief model. The endorsement of a respondent is identified with his signal. The predictions of a respondent are interpreted as a noisy report of the single column of the metaknowledge matrix associated with his signal. Therefore, if respondents are truthful, their predictions and their distribution of endorsements will estimate, respectively, the full metaknowledge matrix and a single column of the signal matrix, the column corresponding to the true answer.

Nothing in the belief model rules out the possibility that the largest entry in this column is associated with a wrong answer, which is to say, that a wrong answer will receive the most votes. The model is noncommittal about the circumstances that might give rise to such widespread false beliefs, but it is worth outlining an example for concreteness (Box 2). Suppose that most respondents only recognize Chicago as the largest city in Illinois. They are likely to conclude that Chicago is the state capital, whether or not that statement is true. A smaller number of respondents have more specific information, because they remember hearing that another city is the capital. This memory is not 100% reliable, but is sufficient to tip judgment against the proposition. These differences in knowledge could give rise to the numbers shown in Box 2.
To sharpen the truth-detection problem, the numbers have been fine-tuned to create a situation where respondents with the wrong signal should be more confident. Those endorsing Yes should infer from their signal (by applying Bayes' rule) that there is an 83% probability that they are correct, while those endorsing No should infer that there is only a 73% probability that they are correct. Hence, if opinions are weighted by confidence the collective decision would still be wrong, as would any decision reached by mathematical aggregation of subjective probabilities.
In the actual world Chicago is not the capital of Illinois. However, this model explains how most people might think that it is, even though they understand the informational mechanism and follow the rules of Bayesian inference. How does BTS discern the true answer in this situation? The user of BTS asks respondents to predict the beliefs of others. If respondents compute the metaknowledge matrix, those with the incorrect opinion will predict that 89% will agree with them, while those with the correct opinion will predict that only 30% will agree; that is, they expect to be in the minority. The user would record these two sets of predictions, and would also record that about 60% of the sample endorses the proposition that Chicago is the capital. Applying (3) to these statistics, she would arrive at a score of −0.57 for Yes and +0.86 for No, and conclude that majority opinion is wrong.
To prove that this will work in the general case, we must specify how actual reports relate to the probabilistic model in the large-sample limit. An ideal respondent, facing the scoring system, would carry out the Bayesian computations required to find the metaknowledge matrix. In reality, errors in these computations and noise in the reports are likely, but they are assumed to average out. First, the average log prediction of all respondents endorsing answer j converges to the jth column of the log metaknowledge matrix, $\log M_{kj}$. Second, the average endorsement frequencies converge to the column of the signal matrix associated with the true answer.
With these inputs, the BTS formula reveals, for each possible signal, the probability that that signal assigns to the true answer (modulo a constant). This is proven in the Theorem of Box 1, which states that the average score of respondents endorsing answer k is

$$\lim_{n \to \infty} \bar{u}_k = \log \Pr[W = i \mid T^r = k] + C, \qquad (5)$$

where W = i is the true answer, and C does not depend on k. The theorem proves that the BTS user can rank respondents according to how well they anticipate the truth. Indeed, the theorem goes further, proving that the asymptotic BTS score is identical to the score that respondents would have received with the logarithmic proper scoring rule (27), as administered by a user with independent access to truth.
Equation 5 is the central mathematical result of this paper. When combined with truth sensitivity it immediately implies that maximizing the average BTS score $\bar{u}_k$ with respect to k is a prescription for finding truth, in the limit of a large number of respondents. Therefore the truth-selection property of the BTS method is proven, at least for the particular model of belief formation considered here (28).
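The limiting claim can also be checked by simulation. Below is a sketch (our own construction) that samples n ideal Bayesian respondents from the Box 2 model, scores them with equations 1-3, and confirms that the BTS decision beats the majority decision:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.75, 0.25])                 # prior over W (Box 2)
S = np.array([[0.95, 0.60],
              [0.05, 0.40]])               # S[k, j] = Pr[T = k | W = j]
B = ((S * p) / (S * p).sum(axis=1, keepdims=True)).T   # Pr[W = j | T = k]
M = S @ B                                  # Pr[T^s = j | T^r = k]

truth, n = 1, 5000                         # actual world: Chicago is NOT the capital
signals = rng.choice(2, size=n, p=S[:, truth])
x = np.eye(2)[signals]                     # each respondent endorses her own signal
y = M[:, signals].T                        # ideal prediction: her column of M

x_bar = x.mean(axis=0)                     # Step 1
log_y_bar = np.log(y).mean(axis=0)
u = (x * (np.log(x_bar) - log_y_bar)).sum(axis=1) \
    + (x_bar * (np.log(y) - np.log(x_bar))).sum(axis=1)   # Step 2 (equation 2)
u_bar = (x * u[:, None]).sum(axis=0) / (n * x_bar)        # Step 3 (equation 3)

majority = int(np.argmax(x_bar))           # answer 0 (Yes): wrong, yet ~60% endorse it
bts_pick = int(np.argmax(u_bar))           # answer 1 (No): the true answer wins Step 4
```

With n = 5000 the averages in u_bar land close to the asymptotic values −0.57 and +0.86 derived in Boxes 1 and 2.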
Our theorem predicts that the BTS decision rule should outperform majority rule in recovering true answers. We test this with a survey of knowledge of US state capitals, presented to students at MIT (n=51) and Princeton (n=32) (details in SOM). The survey contained fifty questions, of the form "Is Chicago the capital of Illinois?", where the named city was always the most populous in the state. Students answered Yes or No, and predicted the distribution of Yes and No votes in the sample. Although elementary, these questions should give rise to differences in knowledge that the method can exploit. Given natural variation in backgrounds, it is likely that for most states some subset, the local experts, would know the correct answer. At the same time, their number might be small and their influence diluted by the uninformed majority. The survey tests our main theoretical claim: that BTS is able to reduce such dilution of expert votes.
The survey proved difficult: the average respondent was slightly above chance, endorsing 29.5 correct answers at MIT and 31 at Princeton. The collective judgment was slightly better: the majority decision was correct for 31 states at MIT, and for 36 states (and 4 ties) at Princeton. As implied by the Bayesian model, endorsements had a strong impact on predictions. Respondents who agreed that a city was the capital predicted on average that 70.3% of others would vote Yes, while those who denied the proposition predicted only 49.8% for Yes (29). Predictions made by respondents with correct answers were on average more accurate, closer to actual percentages in 37 states at MIT and 46 states at Princeton.
Can the BTS algorithm determine the true answers, as implied by theory? Following Steps 1-4, we find that in the MIT sample the BTS decision differs from the majority decision on 18 states: for 14 states (including Illinois) it corrects a wrong majority decision, while for 4 states it reverses a correct majority decision. The 14-4 split is significantly different from chance (p<.02, binomial test). The total number of mistakes drops from 19 to 9 (matched pair t49=2.45, p<.01). In the Princeton sample, the split is 12-4 in favor of BTS (p<.05, binomial test). If ties are coded as 0.5, the total number of mistakes drops from 12 to 6 (matched pair t49=1.69, p<.05).
BTS does, therefore, improve on the majority decision when applied to individual questions. However, BTS scores can also be used to identify expert respondents if knowledge correlates across multiple questions. BTS suggests two possible indices for rating the expertise of a single respondent r. The first, individual index is the BTS score $u^r$ of the respondent, averaged across all questions. The second, pooled index is the average BTS score of the answer endorsed by respondent r, $\sum_{k=1}^{m} x_k^r \bar{u}_k$, averaged over all questions. The two indices coincide in theory, because respondents endorsing the same answer should have identical BTS scores. An advantage of the pooled index is that it filters out individual differences in prediction competence (i.e., long-run calibration of predictions).
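A short sketch of the two indices, assuming the per-question quantities from equations 2 and 3 have already been stacked across Q questions (shapes and names are ours):

```python
import numpy as np

def expertise_indices(x_q, u_q, u_bar_q):
    """x_q: (Q, n, m) one-hot endorsements over Q questions;
    u_q: (Q, n) each respondent's BTS score per question (equation 2);
    u_bar_q: (Q, m) each answer's average BTS score per question (equation 3).
    Returns the (individual, pooled) expertise index of each respondent."""
    individual = u_q.mean(axis=0)        # own score, averaged over questions
    # pooled: score of the answer the respondent endorsed, averaged over questions
    pooled = np.einsum('qnm,qm->qn', x_q, u_bar_q).mean(axis=0)
    return individual, pooled
```

In the idealized model the two coincide; with noisy human predictions the pooled index averages away each respondent's own calibration.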
Figure 1 shows how the pooled BTS index of expertise correlates with actual knowledge of state capitals. As a benchmark, we included an index of conventional wisdom (CW), defined as the number of states for which a respondent votes with the majority opinion for that state. Because majority opinion is correct more than half the time, one might expect that respondents with high CW scores will also get more answers right. However, accuracy and CW are uncorrelated, as shown by the two left panels of Figure 1. CW is worthless as a predictor of genuine knowledge for this domain (30). By contrast, the panels on the right show that the BTS score is able to sort respondents by true accuracy (MIT: r=+.84, p<.0001; Princeton: r=+.94, p<.0001). The pooled BTS index is thus an excellent predictor of genuine knowledge.

The effectiveness of the BTS expertise index suggests hybrid decision policies. For example,
one could follow the majority opinion of expert subsets, where expertise is determined by BTS scores across all fifty questions. Judgments of other respondents would be ignored in this elitist version of majority rule. Figure 2 shows the performance of such policies for the two expertise indices displayed in Figure 1, as a function of subset size. Each line starts with a subset of one: the accuracy of the single respondent who scored highest on the corresponding criterion. For any size of expert subset (indicated by the numerical value on the x-axis) the y-axis gives the number of correct answers produced by majority rule applied to that subset.

The blue lines in the figure show that there is no benefit in eliminating respondents with unconventional judgments. However, if the sample is purified according to BTS scores (red lines), then small subsets achieve impressive accuracy levels. Retaining the votes of the top 20% of the sample results in 45 (MIT) or 47 (Princeton) correct majority opinions. The majority vote of the top three respondents by BTS score is a nearly perfect indicator of truth in either sample (47 at MIT, 48 at Princeton). These results highlight the latent superiority of expert subsets and establish the possibility of using purely internal criteria to select them. As the simple survey shows, it may be better to follow a handful of individuals with high metaknowledge than the majority opinion of the full sample.

In summary, we have presented here an algorithmic method for determining true answers, when
private opinions are the only source of information. The method aims to extend the advantages of competitive information aggregation mechanisms to subjective domains of knowledge, such as aesthetics or law. Like an information market, the method requires minimal inputs from the user: only the question and possible answers. Although the supporting theory assumes a Bayesian model of belief formation, the user is not challenged to specify prior probabilities and in fact need not be a Bayesian at all.

The algorithm is naturally suited to open-access implementation, which is often presented as a special virtue of markets (2, 6, 7). Since truth is not associated with a statistical average but with the opinions of winners, there is no need to screen for expertise. The less informed individuals will not disturb the outcome, but will only subsidize those who are more informed. Over time, self-selection and attrition should lead to a situation where experts predominate in the active player pool. In this sense, the BTS algorithm enforces a meritocratic outcome by an open democratic process.
Notes
1. R. M. Cooke, Experts in Uncertainty (Oxford University Press, New York, 1991).
2. J. Surowiecki, The Wisdom of Crowds (Random House, New York, 2004).
3. P. E. Tetlock, Expert Political Judgment: How Good Is It? How Can We Know? (Princeton University Press, Princeton, 2006).
4. For example, from the back cover of the recent best-seller (2): "Large groups of people are smarter than an elite few, no matter how brilliant — better at solving problems, fostering innovation, coming to wise decisions, even predicting the future."
5. C. R. Sunstein, Infotopia (Oxford University Press, New York, 2006).
6. D. M. Pennock, C. L. Giles, Science 291, 987 (2001).
7. J. Wolfers, E. Zitzewitz, J Econ Perspect 18, 107 (2004).
8. D. Austen-Smith, J. S. Banks, Am Pol Sci Rev 90, 34 (1995).
9. M. de Condorcet, in Condorcet: Selected Writings, K. M. Baker, Ed. (Bobbs-Merrill Co., Indianapolis, 1785/1976), pp. 33-70.
10. M. De Groot, J Am Stat Assoc 69 (1974).
11. B. Grofman, G. Owen, S. L. Feld, Theory and Decision 15 (1983).
12. R. Hastie, T. Kameda, Psychol Rev 112, 494 (2005).
13. K. K. Ladha, Am J Pol Sci 36, 617 (1992).
14. W. H. Batchelder, A. K. Romney, Psychometrika 53, 71 (1988).
15. J. S. Uebersax, J Am Stat Assoc 88, 421 (1993).
16. K.-Y. Chen, L. R. Fine, B. A. Huberman, Manage Sci 50, 983 (2004).
17. P. Morris, Manage Sci 20, 1233 (1977).
18. R. Winkler, Manage Sci 15, B61 (1968).
19. D. Prelec, Science 306, 462 (2004).
20. Interestingly, majoritarian decision rules need not be truthfulness-inducing even in the absence of any variation in preferences (8). Condorcet's jury theorem may be the first statistical argument for majority rule as an algorithm for arriving at the truth of a proposition (9). Condorcet assumes independence of judgments conditional on truth, and individual competence, that is, that each person has a greater than even chance of knowing the truth of a proposition. Majority opinion then converges on the true answer as the number of polled individuals increases. This remains an influential paradigm, with many refinements (10-14). Our framework denies the individual competence assumption.
21. R. Winkler, Manage Sci 27, 479 (1981).
22. S. Frederick, J Econ Perspect 19, 24 (2005).
23. The BTS score has two components: the first evaluates the person's answer, while the second evaluates the prediction. Answers score high to the extent that they are surprisingly common, that is, more common than collectively expected. Predictions are scored for accuracy, measured by the Kullback-Leibler divergence. For smaller samples, the formula can be adjusted to exclude a person's own answers from the statistics that determine his score (19). A respondent who believes that other respondents are truthful should behave truthfully provided a single answer has negligible impact on sample statistics (and his sole objective is to maximize expected score). Formally, truth-telling is a strict Bayesian Nash equilibrium in a large enough sample (19). The scoring formula is zero-sum. Zero-one predictions are trimmed, e.g., to .01 and .99.
24. J. Harsanyi, Manage Sci 14, 159 (1967).
25. Truth sensitivity holds automatically for the dichotomous case m = 2, as $\Pr[W = i \mid T^r = i] > 0.5$, i = 1, 2, jointly imply $\Pr[W = 1 \mid T^r = 1] > 0.5 > 1 - \Pr[W = 2 \mid T^r = 2] = \Pr[W = 1 \mid T^r = 2]$, and similarly for W = 2.
26. R. M. Dawes, J Exp Soc Psychol 25, 1 (1989).
27. L. J. Savage, J Am Stat Assoc 66, 783 (1971).
28. The superiority of this algorithm over majority decision should be detectable even with small samples. Observe that by Bayes' rule, the BTS score of equation 5 is the log ratio of $\Pr[T^r = k \mid W = i]$ to $\Pr[T^r = k]$. The first quantity is the endorsement frequency in a large sample. This is normalized (weighted inversely) by the second quantity, the probability of endorsements unconditioned on the truth. The normalization is what makes it possible for a minority opinion to win. If small-sample frequencies deviate substantially from $\Pr[T^r = k \mid W = i]$, this will potentially affect not only the BTS decision but also the majority decision, which goes exclusively by these frequencies. The benefit of weighting the frequencies accrues to BTS alone.
29. Pooling across all questions and both samples. For example, with the Chicago question those who endorsed Yes believed that 80% would agree, while those endorsing No predicted a 50-50% split; the actual percentage was 59% for Yes. The true answer has a very small positive impact on predictions (+2.5%), after controlling for the respondent's answer and subject fixed effects. This is consistent with our assumption that respondents' information can be approximated by a binary signal favoring one or the other answer. State-by-state statistics are in the SOM.
30. Average percent agreement with other respondents also fails to predict accuracy (MIT: r=+.09; Princeton: r=+.22). Another index of interest is skepticism, defined as the total number of No answers. Skepticism does correlate with accuracy (+.66 and +.60). However, in a simultaneous regression on both variables, the partial correlation of accuracy and BTS remains very high (+.71 and +.91), while the partial correlation of accuracy and skepticism disappears (+.26 and −.02). This rules out the interpretation that the objective accuracy of individuals with high BTS scores derives from a nonspecific, but correct, suspicion that the largest city is often not the capital. Additional indices, including principal component scores, are considered in the SOM, with similar results (Tables S5, S6).
Box 1. The BTS formula reveals the subjective probability of truth

Consider the model of belief formation described in the main text. Suppose that the average of n endorsements converges to the i-th column of the signal matrix:

$$\lim_{n \to \infty} \bar{x}_j = \lim_{n \to \infty} \frac{1}{n}\sum_{s=1}^{n} x_j^s = S_{ji} \qquad (a)$$

and that the average of log predictions converges to the logarithm of the metaknowledge matrix:

$$\lim_{n \to \infty} \frac{1}{n \bar{x}_j}\sum_{s=1}^{n} x_j^s \log y_k^s = \log M_{kj} \qquad (b)$$

Theorem. It follows from (a) and (b) above that the averages of BTS scores for experts endorsing the same answer j = 1, …, m,

$$\bar{u}_j = \lim_{n \to \infty} \frac{1}{n \bar{x}_j}\sum_{s=1}^{n} x_j^s u^s, \qquad (c)$$

converge to the log of the row of the belief matrix $B_{ij} = \Pr[W = i \mid T^r = j]$ corresponding to the true answer W = i, plus a constant C that is independent of j:

$$\lim_{n \to \infty} \bar{u}_j = \log \Pr[W = i \mid T^r = j] + C. \qquad (d)$$

Proof: Our task is to compute (c), the average of the BTS scores (see Eq. 2)

$$u^r = \sum_{k=1}^{m} x_k^r \log\frac{\bar{x}_k}{\bar{y}_k} - \sum_{k=1}^{m} \bar{x}_k \log\frac{\bar{x}_k}{y_k^r} \qquad (e)$$

over all experts who endorsed answer j, and then take the limit $n \to \infty$. Averaging the first term selects the k = j element from the summation, leaving $\log(\bar{x}_j / \bar{y}_j)$. In the limit, assumption (a) allows the replacement of $\bar{x}_j$ by $S_{ji}$. Furthermore, $\bar{y}_j$ is replaced using $\lim_{n \to \infty} \log \bar{y}_j = \sum_{k=1}^{m} S_{ki} \log M_{jk}$, which follows from assumption (b). Averaging the second term of (e) and taking the limit $n \to \infty$ replaces the predictions by the metaknowledge matrix, according to assumption (b). In this term also, the average endorsements $\bar{x}_k$ are replaced by $S_{ki}$. The final result is

$$\lim_{n \to \infty} \bar{u}_j = \sum_{k=1}^{m} S_{ki} \log\frac{S_{ji}}{M_{jk}} - \sum_{k=1}^{m} S_{ki} \log\frac{S_{ki}}{M_{kj}}.$$

By Bayes' rule $M_{jk} m_k = M_{kj} m_j$ (where $m_k = \Pr[T^r = k]$ is the marginal signal probability), yielding the desired average

$$\lim_{n \to \infty} \bar{u}_j = \log\frac{S_{ji}}{m_j} - \sum_{k=1}^{m} S_{ki} \log\frac{S_{ki}}{m_k}.$$

The second term does not depend on j. Application of Bayes' rule to the first term yields $S_{ji}/m_j \propto B_{ij}$. Equation (d) of the Theorem follows.
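The chain of equalities in the proof can be verified numerically on the Box 2 example (a sketch; the variable names are ours). The asymptotic scores come out to −0.57 and +0.86, and their difference from $\log B_{ij}$ is the same constant C for both signals:

```python
import numpy as np

p = np.array([0.75, 0.25])                 # prior over W (Box 2)
S = np.array([[0.95, 0.60],
              [0.05, 0.40]])               # S[k, j] = Pr[T = k | W = j]

joint = S * p
m = joint.sum(axis=1)                      # m[k] = Pr[T = k], marginal signal probability
B = (joint / m[:, None]).T                 # B[j, k] = Pr[W = j | T = k]
M = S @ B                                  # M[j, k] = Pr[T^s = j | T^r = k]

i = 1                                      # true answer: Chicago is not the capital
# Final line of the proof: lim u_j = log(S_ji / m_j) - sum_k S_ki log(S_ki / m_k)
u = np.log(S[:, i] / m) - np.sum(S[:, i] * np.log(S[:, i] / m))
# Same values from the two-sum expression one step earlier in the proof
u_alt = np.array([
    sum(S[k, i] * np.log(S[j, i] / M[j, k]) for k in range(2))
    - sum(S[k, i] * np.log(S[k, i] / M[k, j]) for k in range(2))
    for j in range(2)])

C = u - np.log(B[i, :])                    # equation (d): constant, independent of j
```

Both routes give u ≈ (−0.57, +0.86), matching the scores in Box 2, and the two entries of C agree.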
Box 2. Illustration of the theorem in the context of the Chicago-Illinois problem

Is Chicago the capital of Illinois?

                                  W = 1 (Yes)                      W = 2 (No)
Prior probabilities               .75                              .25
Signals T^r                "large city,    "large city,     "large city,    "large city,
(conditional on truth)      in Illinois"    in Illinois,     in Illinois"    in Illinois,
                                            not state                        not state
                                            capital"                         capital"
Signal probabilities           .95             .05               .60             .40
Beliefs conditional        83% that        27% that         83% that        27% that
on signal                  Chicago is      Chicago is       Chicago is      Chicago is
                           the capital     the capital      the capital     the capital
Endorsements                   Yes             No                Yes             No
Predictions conditional    89% for Yes,    70% for Yes,     89% for Yes,    70% for Yes,
on signal                  11% for No      30% for No       11% for No      30% for No
BTS score                     +0.06           −1.05            −0.57           +0.86

The prior and signal probabilities constitute the unobserved common prior; the endorsements, predictions, and scores are what the BTS user will observe. The prior odds are 3:1 that a prominent city like Chicago is a state capital. The last two columns represent the actual scenario, that Chicago is not the capital; the first two, the counterfactual scenario. Respondents receive either a null signal (no further information) or a signal suggesting a different capital. In the counterfactual scenario, there is a small probability (5%) of getting the wrong signal. In the actual scenario, the probability of getting the correct signal is 40%. Predicted percentages are derived by Bayes' rule: the null signal implies an 83% probability that Chicago is the capital, leading to an 89% = (.83)(.95) + (.17)(.60) prediction of the fraction voting Yes; the other signal implies a 27% probability that Chicago is the capital, leading to a 70% = (.27)(.95) + (.73)(.60) prediction for Yes. From these predictions and the 60-40 split in actual votes, the user calculates BTS scores of −.57 for Yes and +.86 for No, and concludes that the majority decision is incorrect. Counterfactually, were Chicago the capital of Illinois, the observed votes would split 95-5 in favor, leading to BTS scores of +.06 for Yes and −1.05 for No, again in favor of the correct answer.
Figure 1. Expertise correlates with BTS scores but not with Conventional Wisdom. Conventional wisdom is defined as the number of consensus answers (answers consistent with the majority). The two top panels are taken from the MIT study; the bottom panels from the Princeton study. The y-axis is the number of correct answers out of fifty. The x-axis is: (left panels) the number of states where a respondent's answer matches majority opinion; (right panels) for each subject r, the sum over 50 states of the BTS score for the answer that they endorsed (averaged across individuals who endorsed that answer), or $\sum_{k=1}^{m} x_k^r \bar{u}_k$.
Figure 2. High accuracy of BTS-selected expert subsets. Performance of expert subsets, selected by either BTS scores (red lines) or by Conventional Wisdom scores (blue lines). In each case the top-ranked respondents by the given criterion are retained in the subset. The right panel is the MIT study, the left panel the Princeton study. The y-axis is the number of correct answers for decisions reached by majority rule within a respondent subset (ties are credited 0.5). The x-axis is the number of respondents in the subset. The majority opinion of small BTS-selected subsets is more accurate than the majority decision (solid black horizontal line) and also more accurate than the BTS criterion applied to each question individually (dashed red horizontal line).

[Plot detail: y-axis "# correct by subset majority" runs from 20 to 50; x-axis "size of selected subset" runs to 51 (MIT) and 31 (Princeton). Legend: selected by BTS scores of endorsed answers; selected by individual BTS scores; selected by # votes with majority; selected by percent agreement with others.]