Piecewise Versus Total Support: How to Deal with...

Piecewise Versus Total Support:

How to Deal with Background Information in Likelihood Arguments

Benjamin C. Jantzen

Virginia Tech

The probabilistic notion of likelihood offers a systematic means of assessing “the relative merits

of rival hypotheses in the light of observational or experimental data that bear upon them.”1 In

particular, likelihood allows one to adjudicate among competing hypotheses by way of a two-

part principle:

Law of Likelihood (LL):2

(i) Evidence E supports hypothesis H1 over H2 just if 𝑃(𝐸|𝐻1) > 𝑃(𝐸|𝐻2), where

𝑃(𝐸|𝐻𝑖) is the likelihood of hypothesis Hi given evidence E.

(ii) The degree to which E supports H1 over H2 is measured by the likelihood ratio,

Λ = 𝑃(𝐸|𝐻1)𝑃(𝐸|𝐻2).

The claims sanctioned by LL are strictly comparative. The principle does not say what you

should believe or to what degree you should believe it. Rather, the notion of ‘supporting’ one

hypothesis over another is contrastive and perhaps best characterized as a relation of ‘favoring’.3

LL tells you how to determine the degree to which one hypothesis is favored over another on the

basis of some evidence, E, and nothing more. Proponents of the principle are adamant that LL 1 A. W. F. Edwards, Likelihood (Cambridge: Cambridge University Press, 1972) p. 1.

2 I am using Elliot Sober’s terminology here. The Law of Likelihood as I’ve presented it is to be

distinguished from the weaker “Likelihood Principle,” which in most formulations is equivalent

to part (i) of LL. I caution the reader that both the terms “Law of Likelihood” and “Likelihood

Principle” are used ambiguously in the philosophy of statistics and inductive inference literature.

3 Elliott Sober, Evidence and Evolution: The Logic Behind the Science (Cambridge: Cambridge

University Press, 2008).

cannot provide sufficient grounds for apportioning belief, only ranking hypotheses in a particular

evidentiary context.

While LL has been defended at length as a general tool for both formal and informal reasoning

about hypothesis ranking,4 there remains an important ambiguity its application. Intuitively, we

ought to make use of all available information when assessing the relative merits of two

hypotheses, not just the particular piece of evidence E under consideration. Any additional

information already in our possession prior to obtaining E is typically referred to as background

information. LL does not, on the face of it, tell us how to deal with such information. Some, most

prominently Elliott Sober5, have argued that we ought to condition on this additional information

when computing likelihoods. That is, if we denote the background information by B, then the

likelihood ratio we should use is Λ = 𝑃(𝐸|𝐻1,𝐵)𝑃(𝐸|𝐻2,𝐵). Taking this approach, however, means that Λ—

and thus our judgments concerning rival hypotheses H1 and H2—will depend on exactly which

information is taken to constitute background information, and which is considered evidence and

thus part of E. Under Sober’s interpretation, LL can be taken to yield different judgments for the

4 See, for instance, Edwards, Likelihood; Ian Hacking, Logic of Statistical Inference (Cambridge:

Cambridge University Press, 1965); Sober, Evidence and Evolution: The Logic Behind the

Science.

5 Elliott Sober, "The Design Argument," God and Design, ed. Neil Manson (New York, NY:

Routledge, 2003) 27-54; Sober, Evidence and Evolution: The Logic Behind the Science; Elliott

Sober, "Absence of Evidence and Evidence of Absence: Evidential Transitivity in Connection

with Fossils, Fishing, Fine-Tuning, and Firing Squads," Philosophical Studies 143 (2009): 63-90.

same data when the line between evidence and background information is moved. The use of LL

is thus encumbered by a “line-drawing problem.”6

This line-drawing problem also appears in a slightly different guise in the literature on statistical

inference. In this more restricted context, the problem manifests as an apparent ambiguity in the

likelihood function. Specifically, there appears to be no systematic way of deciding which

random variables and model parameters should be included in the likelihood function, and no

principled way of deciding on which side of the conditionalization bar these quantities belong if

included.7 As in the general case, the problem for the likelihoodist is to provide a principled

division of propositions into background and evidence.

A variety of solutions have been proposed to both versions of the problem of background

information, though not always in these terms. Some, e.g. Jonathan Weisberg,8 attempt to

provide a principled means of distinguishing evidence from background information. Others, e.g.

Matthew Kotzen,9 attempt to dissolve the problem by scrapping LL. In the context of statistical

6 M. Kotzen, "Selection Biases in Likelihood Arguments," The British journal for the philosophy

of science (2012).

7 See M. J. Bayarri, M. H. DeGroot and J. B. Kadane, "What Is the Likelihood Function?,"

Statistical Decision Theory and Related Topics Iv, eds. Shanti S. Gupta and James O. Berger,

vol. 1 (New York: Springer-Verlag, 1987) 3-27.

8 Jonathan Weisberg, "Firing Squads and Fine-Tuning: Sober on the Design Argument," British

Journal for the Philosophy of Science 56 (2005): 809-21.

9 Kotzen, "Selection Biases in Likelihood Arguments."

inference, a common strategy is to disambiguate the likelihood function by fiat.10 I argue that

none of these strategies is well-motivated. Background information is only problematic when

one fails to distinguish between two related questions: (i) Given that I know B, to what degree

does the additional piece of evidence E support H1 over H2? and (ii) to what degree does all the

evidence to hand—B and E—support H1 over H2? My aim is to demonstrate that, once these

questions are distinguished the very same considerations that motivate the adoption of LL entail

distinct answers to both questions, thus resolving any ambiguity over the treatment of

background information. Note that I am emphatically not offering a defense of LL as a general

inference procedure. Mine is the more modest goal of dissolving an apparent defect of LL using

the resources to which proponents of the principle already assent.

To draw out the distinction relevant to eliminating the problem of background information, I will

begin with a detailed example. I will then argue for an expression that represents the degree to

which a particular piece of evidence supports one hypothesis over another in context, and then

derive a related expression for the total support provided by all available evidence. Finally, I will

show how these new expressions dissolve ambiguities in the treatment of background

information by applying them to the so-called ‘fine-tuning argument’.

I. ILLUSTRATING THE PROBLEM

10 See, e.g., Jason Grossman, "The Likelihood Principle," Philosophy of Statistics, eds. Malcolm

R. Forster and Prasanta S. Bandyopadhyay (Oxford, UK; Burlington, MA: North-Holland,

2011).

To draw out the distinction which I claim obviates the problem of background information, it

will help to have a concrete example in mind. To avoid pre-conceived interpretations, I will

intentionally eschew standard examples, at least at the outset. So rather than treat of fish or firing

squads, I’ll consider carnivals.

Suppose that Albert finds himself on the midway of an old-fashioned carnival. He decides to

play one of the games—the one where contestants try to toss a ball into a milk-can. Albert is

savvy about carnival games; he knows they are often rigged. In a fair game, there is a 50%

chance of winning a prize. But when no authorities are around, there is an appreciable chance

that the carnie running the game will hand him a ball too big to fit in the can, making it

impossible to win. On the other hand, if there happens to be a police officer in sight the game is

likely to be rigged in Albert’s favor—the carnies want the police to think the games are fair, so

they arrange to let people win when the authorities are present. A set of probabilities reflecting

these facts is provided by the joint distribution of Table 1.

Table 1.

P = police present P = police absent

G = fair G = rigged G = fair G = rigged

O = lose 1/20 1/20 1/10 11/20

O = win 1/20 1/10 1/10 0

Knowing all of the probabilities in Table 1, Albert puts his money down, and promptly tosses a

ball into the can. Given that he has just won, what can Albert conclude about the game?

Specifically, does he now have grounds to favor the hypothesis that the game is fair over the

hypothesis that it is rigged? According to LL, Albert needs to compare two probabilities: the

probability that he would win given that the game is fair, P(win| fair) and the probability that he

would win given that the game is rigged P(win| rigged). Since P(win | fair) = 1/2 > P(win |

rigged) = 1/7, LL asserts that Albert’s success in the game supports the hypothesis of a fair

game—Albert has reason to think that he has played a fair game.

But suppose that, before he tosses the ball, Albert notices a police officer standing near the

booth. What can be said in light of this additional information? Here is where different

interpretations of LL begin to diverge. According to Sober’s approach, we must recognize two

sorts of propositions: evidence and background knowledge. Evidence is whatever fresh

information we are currently considering when applying LL to distinguish among hypotheses. It

appears to the left of the conditionalization bar when computing a likelihood. Background

knowledge constitutes whatever we already know about the world, and is presumed to belong on

the right side of the conditionalization bar. According to this view then, Albert should treat the

fact of the police officer’s presence as background knowledge and condition on this information.

The relevant likelihoods are now P(win| fair, present) = 1/2 and P(win| rigged, present) = 2/3.

With the additional information, he should now favor the hypothesis that the game is rigged—the

background information has reversed our ordering on hypotheses.

That we should take all available information into account when comparing hypotheses is not

especially controversial—most authors assume some sort of principle of total evidence.11 What

is controversial is how and whether ‘evidence’ should be distinguished from background

information. It is not clear why Albert should treat the information that a police officer was

present any differently than the information that he won the game. Albert might just have well

have treated the observation of the police officer as the evidence, and conditioned instead on the

fact that he won: P(present | rigged, win) = 1 > P(present | fair, win) = 1/3. In this way of

accounting for all the information, LL still favors the hypothesis that the game is rigged, but does

so to a much greater degree. Alternatively, Albert might have treated all the information at hand

as ‘evidence’ and compared the following likelihoods: P(win, present | fair) = 1/6 > P(win,

present | rigged) = 1/7. Taking this approach once again inverts the ordering of hypotheses, and

favors the hypothesis that the game was fair. It might appear then that LL must be modified in

order to provide a principled means of discriminating background information from evidence.

However, no such modification is required—a careful interpretation of LL as it stands obviates

the question of evidence versus background information.

II. THE PIECEWISE IMPACT OF EVIDENCE

To resolve the ambiguity over background information, we need to distinguish between two

questions: (i) to what degree does learning a particular fact in the context of an additional set of

facts support a given hypothesis? and (ii) to what degree does learning a particular fact in

conjunction with an additional set of facts support a given hypothesis? In terms of the midway

11 Rudolph Carnap, "On the Application of Inductive Logic," Philosophy and Phenomenological

Research 8, 1 (1947): 133-48.

example above, the distinction can be made as follows: (i) to what degree does winning the game

having already learned that a police officer is present support the hypothesis that the game is

fair? and (ii) to what degree does the full set of information at hand—that Albert has won the

game and that a police officer was present—support the hypothesis that the game is fair?

To address question (i), we need to examine the piecewise introduction of evidence, taking care

to note one important fact: learning the truth of a proposition (or the value of a random variable)

is effectively an intervention that changes the background distribution describing the ways the

world might be. To begin with, let’s assume that we are given a full joint distribution reflecting

all relevant aspects of the world and nothing else—there is nothing given that might qualify as

either evidence or background information. For ease of exposition, I will further assume that this

distribution is discrete, though nothing about my derivation hinges on this being the case.

Since all we have is the distribution and no information to sort out, LL can be applied

unambiguously upon obtaining our first piece of evidence, I1. According to LL, the degree to

which this information supports hypothesis H1 over H2 is given by the likelihood ratio Λ(𝐼1) =

𝑃(𝐼1|𝐻1) 𝑃(𝐼1|𝐻2)⁄ . Furthermore, on learning that I1 is the case, the space of possible events has

been reduced—acquiring information requires us to update the background distribution with

which we started. Specifically, the probability of I1 being the case must now be unity,

irrespective of the value it had prior to learning this outcome. One way to represent the change is

to construct a new event space by simply removing all the events incompatible with the fact that

I1 is the case while preserving the relative measure on all remaining events. That is, the new

distribution 𝑃1(α), where α is any event in the original event space compatible with I1, is

obtained from the old distribution by the following relation: 𝑃1(𝛼) = 𝑃(𝛼|𝐼1).12 In the midway

example, for instance, when Albert learned that a police officer was present he should have

replaced the original distribution of Table 1 with that of Table 2.

Table 2.

P = police present

G = fair G = rigged

O = lose 1/5 1/5

O = win 1/5 2/5

Once we realize that we are working with a new distribution, there is no need to draw a line

between background information and evidence—our prior information is reflected in the new

distribution. When additional evidence, I2, is acquired, we need only appeal to LL just as we did

at the outset. This time, however, we are assessing likelihoods with respect to the currently

applicable distribution 𝑃1(α). So the evidence I2, if we take LL seriously, supports H1 over H2

just if 𝑃1(𝐼2|𝐻1) > 𝑃1(𝐼2|𝐻2) and does so to a degree Λ(I2) = 𝑃1(𝐼2|𝐻1) 𝑃1(𝐼2|𝐻2)⁄ . In terms of

the original joint distribution, we can express this likelihood ratio as

Λ(I2) = 𝑃(𝐼2|𝐼1,𝐻1) 𝑃(𝐼2|𝐼1,𝐻2)⁄ .

12 This is simply the updating procedure recommended by Bayesian epistemology. It is invoked

here without any commitment to the subjective or objective status of priors.

As before, when we learn I2, we must update our distribution to reflect this restriction of the

possibilities. This new distribution 𝑃2(β) is obtained from the old distribution in the same way as

above: 𝑃2(𝛽) = 𝑃1(𝛽|𝐼2) = 𝑃(𝛽|𝐼2, 𝐼1). This is easy to generalize for an indefinite sequence of

evidence: once we’ve learned I1, I2, …, In-1, we should compute the likelihoods involving a new

piece of evidence In using the distribution 𝑃𝑛−1(𝛾) = 𝑃(𝛾|𝐼𝑛−1, … , 𝐼1). The new piece of

information In introduced in the context of prior information I1, I2,…, In-1 supports H1 over H2

just if 𝑃(𝐼𝑛|𝐼𝑛−1, … , 𝐼1,𝐻1) > 𝑃(𝐼𝑛|𝐼𝑛−1, … , 𝐼1,𝐻2) and does so to the degree

(1) Λ(𝐼𝑛) = 𝑃(𝐼𝑛|𝐼𝑛−1,…,𝐼1,𝐻1)𝑃(𝐼𝑛|𝐼𝑛−1,…,𝐼1,𝐻2).

The point is that whenever we acquire a piece of information we can apply LL without

modification, but must do so using a distribution that reflects all of the facts already in evidence.

Put this way, there is no ambiguity in using LL—we always compute a straightforward

likelihood. However, when this likelihood is expressed in terms of the original joint distribution

with which we started, each successive likelihood is conditioned on the previous facts. So by

applying LL and taking care to note the way in which the acquisition of information forces a

change in distribution, we have found that in order to determine the relative support of one

hypothesis over another provided some particular piece of evidence, we must use likelihoods

conditioned on all previously acquired facts.

Thus far, it may seem that I have been arguing for Sober’s interpretation of LL. However, Sober

seems to view the likelihood ratio (1) as representing the overall degree to which H1 is supported

over H2 once In is obtained. I have been urging that, if we take LL at face value, this is not how

we should interpret this expression. At every stage in the above derivation, we were applying LL

to determine the degree to which a particular piece of evidence supported one hypothesis over

another. Other information was relevant, but only in determining the epistemic context in which

this degree of support was determined. I am suggesting that Sober has the right expression but

gives it in answer to the wrong question—in what follows, I’ll show that LL leads us to a very

different expression for the degree of support for H1 over H2 provided by the totality of evidence.

III. TOTAL SUPPORT

There are two ways to argue for an expression of the likelihood ratio pertaining to the totality of

available evidence. In one approach, we could take the expression given in (1) for the degree to

which a particular piece of evidence supports H1 over H2 and couple this with a function for

combining likelihood ratios—a function measuring the overall degree to which two pieces of

evidence support H1 over H2. Strictly speaking, this means adding to LL since the principle does

not provide such a rule. However, there are some reasonable constraints we can put on such a

function without begging the question concerning background information. For starters,

whatever function f we choose should itself yield a likelihood ratio, meaning that it must map

pairs of likelihoods to the interval [0, ∞). Furthermore, if either likelihood in the combination is

zero—implying that one hypothesis has been entirely ruled out—then the joint likelihood should

also be zero. The function should be symmetric since it ought not to matter in what order we give

the likelihoods to be combined, and it should be an increasing function of both arguments. An

obvious choice satisfying all of these constraints is simply the product of the component

likelihoods. That is, given Λ1 and Λ2, the combined likelihood is given by 𝑓(Λ1,Λ2) = Λ1Λ2.

With this rule for combining likelihoods, we can use the results of the last section to derive an

expression for the overall degree to which the facts I1, I2, …, In support one hypothesis over

another, assuming they were learned in sequence:

(2) Λ(𝐼1, 𝐼2, … , 𝐼𝑛) = Λ(𝐼1)Λ(𝐼2)⋯Λ(𝐼𝑛) = 𝑃(𝐼1|𝐻1)𝑃(𝐼2|𝐻1,𝐼1)⋯𝑃(𝐼𝑛|𝐻1 ,𝐼1,…,𝐼𝑛−1)𝑃(𝐼1|𝐻2)𝑃(𝐼2|𝐻2,𝐼1)⋯𝑃(𝐼𝑛|𝐻2 ,𝐼1,…,𝐼𝑛−1)

Using nothing but the rules of probability, the right hand side of equation (2) can be written

much more compactly to give the following expression for the total support of the facts I1, I2, …,

In:

(3) Λ(𝐼1, 𝐼2, … , 𝐼𝑛) = 𝑃( 𝐼1,…,𝐼𝑛|𝐻1)𝑃( 𝐼1,…,𝐼𝑛|𝐻2)

Of course, the right-hand side of equation (3) is just the expression we would have gotten by

applying LL to the proposition I1^I2^…^In with respect to the initial joint distribution—in a

straightforward reading, it is just the total support for H1 over H2 provided by the conjunction of

all available evidence.

The form of Equation (3) suggests that it might have been derived more directly by appealing to

LL without worrying about how to determine the contextual support provided by each piece of

information or introducing a way to combine these (thus justifying my claim that we need not

modify LL). All we had to do was note that, if we let 𝐸 = 𝐼1^𝐼2^ … ^𝐼𝑛, then LL immediately

yields (3). From (3) we could then deduce (2) just from the rules of the probability calculus.

Once we identified the factors of the right-hand side of Equation (2) with individual likelihood

ratios, we could have used this fact to justify a rule for combining likelihoods. In fact, this is

what A. F. Edwards does, at least in the special case of independent evidence, in his development

of the likelihood framework.13 Viewed from this perspective, Equation (3) is implicit in LL.

13 Edwards, Likelihood.

Whichever approach we take to justifying this rule for assessing total support, we are led to the

following amplified form of LL:

Amplified Law of Likelihoods (ALL):

(i) If it is already known to be that case that I1Î2^…^ In, then learning evidence E

supports hypothesis H1 over H2 just if 𝑃(𝐸|𝐻1, 𝐼1, 𝐼2, … , 𝐼𝑛) > 𝑃(𝐸|𝐻2, 𝐼1, 𝐼2, … , 𝐼𝑛),

where 𝑃(𝐸|𝐻𝑖 , 𝐼1, 𝐼2, … , 𝐼𝑛) is the likelihood of hypothesis Hi given evidence E in the

context of I1Î2^…^ In.

(ii) The degree to which E supports H1 over H2 in the context of I1Î2^…^ In is measured

by the likelihood ratio Λ = 𝑃(𝐸|𝐻1,𝐼1,𝐼2,…,𝐼𝑛)𝑃(𝐸|𝐻2,𝐼1,𝐼2,…,𝐼𝑛).

(iii) The total evidence E^ I1Î2^…^ In supports hypothesis H1 over H2 just if

𝑃(𝐸, 𝐼1, 𝐼2, … , 𝐼𝑛|𝐻1) > 𝑃(𝐸, 𝐼1, 𝐼2, … , 𝐼𝑛|𝐻2).

(iv) The degree to which the total evidence E^ I1Î2^…^ In supports H1 over H2 is

measured by the likelihood ratio Λ = 𝑃(𝐸,𝐼1,𝐼2,…,𝐼𝑛|𝐻1)𝑃(𝐸,𝐼1,𝐼2,…,𝐼𝑛|𝐻2).

With ALL, we can answer the questions posed above concerning the midway example. The

information that Albert has won the game, acquired after learning that a police officer is present,

supports the hypothesis that the game is rigged because 𝑃(win|present, rigged) >

𝑃(win|present, fair). According to ALL (ii), this information favors the rigged hypothesis over

its rival to a degree Λ = 𝑃(win|present, rigged)𝑃(win|present, fair) =

2312

= 43. This one piece of information, in the context

of previously established information about the presence of police officers, tends to favor the

hypothesis of a rigged game. However, the aggregate information—that a police officer is

present and Albert has won the game—favors the hypothesis that the game is fair. This follows

from ALL (iii) and (iv) since 𝑃(win, present| fair)𝑃(win, present| rigged) =

1617

= 76. This looks like a contradiction until we

realize that the first piece of information obtained—that the police officer is present—strongly

favored the hypothesis that the game is fair: 𝑃(present|fair)𝑃(present|rigged) = 14

9. The upshot is that the aggregate

effect of the totality of evidence can differ from the piecewise impact of each bit of evidence.

Rather than being a contradiction, this is precisely how one would expect these two distinct

measures to relate—the total support for the fair hypothesis is simply the product of the

contextual likelihood ratios for each piece of evidence.14

IV. KICKING AWAY THE FULL DISTRIBUTION LADDER

In the preceding arguments, I made extensive use of full probability distributions. This appears

problematic since the appealing feature of the likelihood approach—and that which sets it apart

from Bayesianism—is its disregard for prior probabilities. However, I claim that the likelihoodist

who thinks that prior probabilities are often absent or unattainable might nonetheless justify LL

or ALL. To see how, let’s reconsider the case in which we start with a full prior distribution

P(α), and then obtain evidence I1. Once we acquire the evidence, we should update the

probabilities assigned to H1 and H2 by setting each new probability equal to the corresponding

conditional probability assigned by the original distribution:

14 It should be noted that, while the order in which information is learned determines the degree

to which each additional piece of information favors one hypothesis over another, order is

irrelevant when considering the overall support conferred by the totality of evidence.

( ) ( ) ( )( ) ( ) ( ) ( )

( )1 2

1 1 1 1 2 1 1 21 1

, P H P H

P H I P I H P H I P I HP I P I

= =

What can we now say about the degree to which I1 favors H1 over H2? One way we might

understand this question is in terms of a hypothetical. Suppose that either H1 or H2 is true. Then

the initial odds in favor of H1 are simply P(H1|H1∨H2)/P(~H1|H1∨H2) = P(H1)/P(H2). How does

the new information change the odds in favor of H1? In this case, the posterior odds are given by:

(4) ( )( )

( ) ( )( ) ( ) ( ) ( )

( )1 1 1 2 1 1 1 1

121 1 1 2 1 2 2

,

~ ,

P H I H H P I H P H P HI

P HP H I H H P I H P H

∨= = Λ

∨

The right-hand equality in Equation (4) indicates that all of the work to shift the posterior odds

up or down relative to our prior odds is being done by the likelihood ratio, Λ(I1). In other words,

the change in posterior odds is a function of Λ(I1). To put it still another way, the effect of I1 on

the odds is entirely determined by Λ(I1). This fact motivates adopting the likelihood function as a

measure of relative support. While the likelihood ratio cannot tell us which posterior probability

is higher, it can tell us how the odds shift in favor of one hypothesis or the other, assuming that

one or the other is right. Furthermore, it does so whether or not we know the prior probabilities.

In this sense, LL is a general guide to differential support, and in those cases in which we have

no objective basis for assigning priors, the likelihoodist claims it is our only guide.

By considering effects on posterior odds, we can motivate ALL in much the same way as LL. As

before, the full distribution (if we knew it to begin with) after learning I1 would be given by

P1(α) = P(α|I1). If we now learn that I2 is the case, then we must change our posterior odds in

favor of H1 over H2 to the following:

(5) ( )( )

( ) ( )( ) ( ) ( ) ( )

( )1 1 2 1 2 1 2 1 1 1 1 1

21 21 2 2 1 2 1 2 2 1 2

,

,

P H I H H P I H P H P HI

P HP H I H H P I H P H

∨= = Λ

∨

Once again, it is the likelihood function that increases (or decreases) the posterior over the prior

odds. This time, however, it is in the context of the new distribution P1(α), a distribution

reflecting prior knowledge of I1. If the motivation offered for LL in the first place is compelling,

then it seems we must also accept ALL (i) and (ii)—the relative support for H1 over H2 conferred

by the new piece of evidence I2 after already learning I1 is indicated by the likelihood function,

Λ(I2) = P1(I2|H1)/P1(I2|H2) = P(I2|I1, H1)/P(I2|I1,H2). But what about the overall support for H1

over H2 given our epistemic starting point? How should our posterior odds have changed relative

to our initial odds as a result of learning I1 and I2? We can rewrite the right-hand side of Equation

(5) as follows:

(6)

( ) ( )( ) ( )

( ) ( )( ) ( )( ) ( )( ) ( )

( ) ( )( )

1 2 1 1 1 2 1 1 1 1

1 2 2 1 2 2 2 1 2 1

1 2 1 1

1 2 2 2

11 2

2

,

,

,

,

,

P I H P H P I H I P H I

P I H P H P I H I P H IP I I H P H

P I I H P HP H

I IP H

=

=

= Λ

Written this way, we can see that the combined likelihood function Λ(I1, I2) determines the

change in odds relative to what they were before learning anything at all. So once again, we can

kick away the ladder of the full distribution. If we did know the distribution, then learning I1 and

I2 would change the odds in favor of H1 over H2 by an amount given by the likelihood ratio.

Since this is true irrespective of what the priors are, we can always take the likelihood alone to

indicate differential support, in this case the degree of support for H1 over H2 conferred by the

totality of evidence.

Of course, one might object to my interpretation of what it is to favor one hypothesis over

another. Instead, one might attempt to prove LL(i) from other premises15 and take the

quantitative measure of contrastive support given by LL (ii) to be a postulate that stands or falls

with how well the results coincide with our intuitions.16 ALL could then be motivated by the

second line of argument I suggested in the previous section: treat all information as evidence and

note that the resulting likelihood ratio factors into a product of likelihoods, each of which can be

consistently interpreted as corresponding to the impact of a single piece of information.

The point is that insofar as LL is well-motivated, so too is ALL. My use of full distributions

above was strictly heuristic. Once we’ve seen what role the likelihood function plays and which

likelihood function is relevant to which question, we can ignore the full distribution. Of course,

the proponent of LL or ALL can only claim to be free of worrisome priors if conditional

probabilities can be taken as primitive.17 It is not my task here to defend that claim and thus

rescue likelihoodism from the charge of subjectivity. My more modest assertion is simply that if

we have grounds to take LL seriously, then we should really embrace ALL. Once we’ve done so,

it becomes clear that whatever problems likelihoodism has, line-drawing isn’t one of them.

V. BUT WHAT IS THE LIKELIHOOD FUNCTION?

15 See Grossman, "The Likelihood Principle."

16 See Sober, Evidence and Evolution: The Logic Behind the Science.

17 For a defense of taking conditional probabilities as primitives, see ibid.

My arguments so far have concerned the problem of background information as it appears in the

literature on LL in its broadest epistemic use. As I mentioned above, the same problem arises in

the more restricted context of statistical inference. Addressing this narrower community, Bayarri,

De Groot, and Kadane famously asked, “What is the likelihood function?”18 To illustrate the

ambiguity in answering that question, the authors consider a case analogous to one in which,

with respect to each possible value of some discrete parameter θ characterizing a statistical

model, a random variable X has a conditional probability distribution P(x|θ).19 Furthermore, it is

not the random variable X that is observed, but rather some other random variable Y for which

P(y|x, θ). The authors then ask, “What is the [likelihood function] in this problem?”.20 They

claim that there are three candidates, P(y|θ), P(x, y|θ), and P(y|x, θ), and that a “subjective

judgment must be made in order to decide which of the functions…to use in a given problem.”21

The thesis I’ve been defending is that this is simply false. The question has two parts: (i) which

random variables and parameters are to be included in the likelihood function, and (ii) which side

of the conditionalization bar each belongs on. The answer to both parts, according to ALL,

depends on two things: what hypotheses we wish to consider and whether we wish to assess the

impact of a particular piece of data in context or the aggregate of all data. So, for instance,

suppose we wish to ask about hypotheses concerning the value of θ in light of the only piece of 18 Bayarri, DeGroot and Kadane, "What Is the Likelihood Function?."

19 The original example was stated in terms of probability densities since θ typically takes a

continuum of values. To keep the discussion consistent, I’ve assumed that θ is discrete, and thus

the distributions in question are discrete as well.

20 Bayarri, DeGroot and Kadane, "What Is the Likelihood Function?," at p. 6.

21 Ibid., 6.

evidence available, namely a value y of Y. Then the relevant likelihood function must have the

form P(y|θ). If on the other hand, we wanted to consider finer-grained hypotheses concerning

both the value of θ and the unobserved random variable X, then we would have functions of the

form P(y|x, θ). Under no circumstances would ALL entail the use of a likelihood function of the

form P(x|…) unless a value of the random variable X was observed (or otherwise learned) and

thus added to our store of facts. Suppose X and Y were both observed and we wish to know the

relative support given to hypotheses about θ. Then our likelihood functions would look like

P(x,y|θ). Suppose instead, we learned the value of Y and then the value of X and wish to know

what impact learning X = x has given what we already know about Y. Then the likelihood

functions would have the form P(x|y, θ). I’m belaboring the point, but I want to make it clear that

ALL unambiguously selects a set of variables and parameter values and distributes these around

the conditionalization bar. There are many further objections raised by Bayarri et al to the use of

LL as a statistical inference method, in particular problems with prediction. However, many of

these objections conflate LL (or ALL) with the method of maximum likelihood estimation

(MLE). A discussion of the relation of MLE to ALL is beyond the scope of this paper, and so too

are the remaining objections to likelihoodism. It suffices here to note that there is no ambiguity

in factoring the likelihood function as far as ALL is concerned. The principle may not be right,

but it is unambiguous.22

22 The authors might object that in my initial discussion of ALL, I used a full distribution which

dictates all the relevant quantities and so implicitly settles the question of which likelihood

function to use. But as I argued in the last section, a full distribution is unnecessary for

motivating ALL. Rather, in the likelihoodist view, specifying a question of interest specifies a

VI. FISH, FIRING SQUADS, AND FINE-TUNING

The question of how to handle background information is especially pressing in the context of

the fine-tuning argument (FTA). The FTA attempts to establish the existence of a cosmic

designer by noting that various physical constants have values within a narrow range amenable

to the occurrence of carbon-based life—the laws appear ‘fine-tuned’ for life. For instance, had

the 7.65 MeV energy level of the C12 nucleus been slightly lower or higher, then the process that

produces carbon and the other heavy elements essential to life in the interior of stars would not

have occurred.23 Denote by E the observation that many constants occurring in physical laws

take values within a comparatively narrow range that permits life to exist, and consider the

following two hypotheses:

HC: The relevant physical constants acquired their values by chance.

HD: The relevant physical constants acquired their values by design.

The FTA is usually presented as a likelihood argument. If we appeal to LP and note that

𝑃(𝐸|𝐻𝐷) > 𝑃(𝐸|𝐻𝐶), then we must conclude that the evidence favors design over chance.

A prominent objection to the fine-tuning argument notes that we have left out an important piece

of information: all knowledge concerning physical constants has been acquired by carbon-based

likelihood ratio which in turn constrains what full distributions the Bayesian (or anyone else

committed to using full distributions) may consider.

23 John D. Barrow and Frank J. Tipler, The Anthropic Cosmological Principle (New York:

Oxford University Press, 1986) pp. 252-53.

life forms.24 Call this fact I. We must account for all available background information—so the

objection goes—and so we must condition our likelihoods on I. However, since I entails E, both

hypotheses have the same likelihood given the evidence: 𝑃(𝐸|𝐻𝐷 , 𝐼) = 𝑃(𝐸|𝐻𝐶 , 𝐼) = 1. Thus,

the evidence cannot favor design over chance (or any other hypothesis for that matter). This

objection, however, conflates the two questions with which we began and emphasizes the need

for the clarification provided by ALL.

To motivate an analysis of the FTA in terms of ALL, it will help to first consider a pair of

structurally similar examples endemic in the literature. The first of these, due originally to Sir

Arthur Eddington,25 asks us to think about fishing. Suppose we are confronted with the following

observation:

Ef: All 10 of the fish caught in the lake today were longer than 10 inches.

For the sake of simplicity, suppose that we consider only two hypotheses that might account for

this evidence:

H100: All of the fish in the lake are longer than 10 inches.

H50: Half of the fish in the lake are longer 10 inches.

If this was all the information we had, LP would urge us to favor H100 since 𝑃�𝐸𝑓�𝐻100� ≫

𝑃�𝐸𝑓�𝐻50�. However, suppose we had some additional information:

I>10: The net used has holes 10 inches wide.

24 Sober, "The Design Argument."; Sober, "Absence of Evidence and Evidence of Absence:

Evidential Transitivity in Connection with Fossils, Fishing, Fine-Tuning, and Firing Squads."

25 A. Eddington, The Philosophy of Physical Science (Cambridge: Cambridge University Press,

1947).

This new information I>10 entails Ef. Thus, if we account for this new information by

conditioning on it as Sober would urge, we find that the evidence fails to distinguish between the

hypotheses at all: 𝑃�𝐸𝑓�𝐻100, 𝐼>10 � = 𝑃�𝐸𝑓�𝐻50, 𝐼>10 � = 1. According to Sober, this constitutes

an Observation Selection Effect (OSE) because the method by which the observation was

obtained biased the outcome. One is faced with an OSE whenever accounting for the process by

which an observation was made alters the likelihoods that determine the degree to which the

observation favors one hypothesis over another. In this case, the effect is extreme.

The picture changes dramatically when we analyze this scenario using ALL. It becomes

immediately obvious that the likelihoods being compared— 𝑃�𝐸𝑓�𝐻100, 𝐼>10 � and

𝑃�𝐸𝑓�𝐻50, 𝐼>10 �—represent only the degree to which learning about the day’s catch supports

either H100 or H50 in the context of information about the net used. These do not represent the

degree to which the aggregate evidence supports one or the other hypothesis. It is true that

learning E after learning what net was used fails to further discriminate between H100 and H50.

But learning I>10 may have already discriminated between the two, and thus, according to ALL,

the aggregate information might also discriminate between the two hypotheses.

To illustrate the point, consider the joint distribution in Table 3. I’ve added a proposition, I>0,

which is the claim that the net used had very tiny holes capable of catching the smallest fish.

With this additional possibility added, the probabilities given are compatible with all of the facts

above. In particular, 𝑃�𝐸𝑓�𝐻100� = 1 ≫ 𝑃�𝐸𝑓�𝐻50� = .003 and 𝑃�𝐸𝑓�𝐻100, 𝐼>10� =

𝑃�𝐸𝑓�𝐻50, 𝐼>10� = 1.

Table 3.

H100 H50

I>0 I>10 I>0 I>10

Ef .001 .002 .001 .002

¬ Ef 0 0 .994 0

However, we can see that learning I>10 at the outset strongly favored the hypothesis H100 since

𝑃(𝐼>10|𝐻100) = 0.67 ≫ 𝑃(𝐼>10|𝐻50) = 0.002. Likewise, according to ALL (iv), the aggregate

information overwhelmingly favors H100 over H50 to a degree given by

Λ = 𝑃�𝐸𝑓 , 𝐼>10�𝐻100� 𝑃�𝐸𝑓 , 𝐼>10�𝐻50� = 334� . This conclusion is not surprising given the

details of the example. The distribution given in Table 3 is plausible in that those who frequently

fish a particular lake are more likely to use nets with large holes if the lake contains mostly large

fish—they may not know the distribution of fish in the lake, but they know what works.

Whatever story one might tell to account for the particular probabilities in this case, the upshot is

that if an OSE renders a particular observation irrelevant in a particular context it is still possible

for the aggregate information to discriminate between hypotheses.

While Eddington’s fishing example illustrates the way in which previously acquired information

can deprive subsequent evidence of relevance, there is another example in the literature more

closely analogous to the fine-tuning case.26 This scenario involves firing squads. We are asked to

imagine that a firing squad staffed by twelve expert marksmen takes aim at the prisoner to be

executed. Each marksman fires twelve times when given the signal. When the smoke clears, we

discover that the prisoner is still unharmed. Call the fact of this surprising survival Es. In this

case, we are interested in what the prisoner can infer from Es concerning the following two

hypotheses:

Hcon: The marksmen conspired at time t1 to spare the prisoner’s life when they fired

at t2.

Hmiss: The marksmen decided at time t1 to shoot the prisoner when they fired at t2 but

missed by chance.

At first we might think that the prisoner has ample reason to favor Hcon over Hmiss since, given

that these are expert marksmen, 𝑃(𝐸𝑠|𝐻𝑐𝑜𝑛) ≫ 𝑃(𝐸𝑠|𝐻𝑚𝑖𝑠𝑠). However, in making his analysis

the prisoner left out some pertinent information about the manner in which the observation of Es

was made:

IO: At t3 the prisoner made the observation that he is still alive.

According to those who would single out background information, we must incorporate IO into

the likelihoods by conditioning. In this view, the prisoner suffers from an OSE and cannot

distinguish between the two hypotheses at all since 𝑃(𝐸𝑠|𝐻𝑐𝑜𝑛, 𝐼𝑂) = 𝑃(𝐸𝑠|𝐻𝑚𝑖𝑠𝑠 , 𝐼𝑂) = 1.

Because IO entails Es, so the argument goes, learning Es can tell the prisoner nothing about which

26 The scenario was introduced in John Leslie, Universes (London: Routledge, 1989). and

elaborated in Richard Swinburne, "Arguments from the Fine-Tuning of the Universe," Physical

Cosmology and Philosophy, ed. J. Leslie (New York: MacMillan, 1990) 160-79.

hypothesis to favor. Thus, the prisoner in the grip of a strong OSE cannot reasonably conclude

there was a conspiracy to save his life.

At this point, the tight analogy with the FTA should be clear. The prisoner stands in for us

carbon-based life forms. While the prisoner is attempting to assess whether design or chance is

responsible for his survival, in the FTA we are attempting to infer design in the cosmos. In both

cases, it has been objected that the observer suffers from an OSE that prevents discrimination

between hypotheses. Supporters of the FTA invoke the firing-squad scenario because they think

that our intuition strongly opposes the OSE objection—surely the prisoner can reasonably

conclude that conspiracy is the better hypothesis. By analogy, they claim that we can conclude

that an OSE is not a problem for the FTA.

In both cases, ALL tells us that the role of the OSE has been misinterpreted. It is true that, in the

context of knowing that it was himself who made the observation, the prisoner learns nothing

further by noting that he is alive. Likewise, it is the case that, knowing that all physics is done by

carbon-based life forms, we learn nothing further by discovering that the constants of physical

law are just right to sustain carbon-based life. Nonetheless, the aggregate information might still

favor one hypothesis over the other. In the firing-squad scenario, it is eminently plausible that

𝑃(𝐸𝑠, 𝐼𝑂|𝐻𝑐𝑜𝑛) ≫ 𝑃(𝐸𝑠, 𝐼𝑂|𝐻𝑚𝑖𝑠𝑠). In the case of fine-tuning, it may be that 𝑃(𝐸, 𝐼|𝐻𝐷) >

𝑃(𝐸, 𝐼|𝐻𝐶). This will be the case if 𝑃(𝐼|𝐻𝐷) > 𝑃(𝐼|𝐻𝐶). I certainly do not wish to argue that this

is in fact the case—there seem to be insurmountable difficulties in providing a well-defined

measure corresponding to 𝑃(𝐼|𝐻𝐷).27 My point is just that, when one distinguishes between

contextual and total support, the presence of an OSE does not prove fatal to design arguments in

either the firing-squad or FTA case.

VII. CONCLUSION

Insofar as one is inclined to accept LL as a framework for inference, no modification is

necessary in order to deal with background information—unpacking LL leads to ALL. The

interpretive key is the discrimination of two questions, one concerning the immediate support

provided by a piece of evidence in context and one concerning the overall support provided by

the total set of evidence. Looked at in this way, it becomes clear that objections based on

observer bias are not necessarily fatal to the FTA. It is true that we, as carbon-based life-forms,

cannot use the fact that some physical constants are just right for the existence of carbon-based

life to discriminate between design hypotheses and their rivals. However, it may be the case that

the aggregate evidence (including the fact of our existence) might permit such discrimination.

Whether this is the case must be settled on other grounds.

27 It is not clear that the question of fine-tuning is even well-posed. There is reason to reject the

strong metaphysical assumptions necessary to make the possibility of different ‘constants’ in the

laws of nature meaningful or to entertain the existence of processes—whether physical or

divine—that determined those constants in the past.

Piecewise Versus Total Support: How to Deal with...

Documents

Transcript of Piecewise Versus Total Support: How to Deal with...